Impact
For around one hour, service may have been slow, or users may have seen unexpected errors.
Timeline
- Around 11 AM PST on Sep 7, a user deleted a large mail account. Thirty minutes later, the resulting cleanup call timed out and began retrying.
- At 12:06 PM PST, staff noticed errors and began investigating.
- The problem was quickly identified, and a fix began deploying around 12:20. However, due to operator error, the fix was not fully effective.
- A second fix began deploying around 12:50, and finished around 1:30.
- At around 1:10, staff manually killed off blocking queries in the database, which began to ameliorate the problem.
- By 1:30 the problem was fixed and service returned to normal.
Technical details
Deleted mail (and other files) is soft-deleted for 30 minutes before a background job hard-deletes it. This provides a grace period in case a file is being read when it is deleted.
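As a rough sketch of that pattern (the messages table and deleted_at column here are illustrative, not our actual schema):

    -- Soft delete: mark the rows instead of removing them, so anything still
    -- reading a file gets a grace period.
    UPDATE messages
    SET deleted_at = now()
    WHERE account_id = $1;

    -- Background cleanup job: hard-delete rows whose grace period has expired.
    DELETE FROM messages
    WHERE deleted_at < now() - interval '30 minutes';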
However, the job that deletes these files was not written to operate in chunks; instead, it made a single database call to delete all eligible files. With a sufficiently large batch, this could cause the default SQL library timeout to fire. That timeout should have canceled the query, and the query should have run only once concurrently per mailserver; in practice, however, a backlog of blocked queries grew in the database, and further queries were forced to wait on locks held by queries that would never finish.
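Such pile-ups are visible in Postgres's pg_stat_activity view; a query along these lines (Postgres 9.6 or later) lists the sessions stuck waiting on a lock and how long they have been running:

    SELECT pid,
           now() - query_start AS query_age,
           wait_event_type,
           wait_event,
           query
    FROM pg_stat_activity
    WHERE wait_event_type = 'Lock';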
As this backlog grew, it eventually reached the point where the database became less responsive. Killing the backlog of queries with pg_cancel_backend() sufficed until the app could be stopped from spawning new ones.
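The manual intervention amounted to something like the following; the exact criteria for picking which sessions to cancel are a guess here, and the messages table name is again illustrative:

    -- Cancel active queries that have run for more than a few minutes and
    -- look like the hard-delete statement.
    SELECT pid, pg_cancel_backend(pid)
    FROM pg_stat_activity
    WHERE state = 'active'
      AND now() - query_start > interval '5 minutes'
      AND query ILIKE 'DELETE FROM messages%';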
Remediation
- The delete code will be rewritten to operate in smaller, lock-skipping chunks and will not be subject to normal SQL timeouts (a sketch follows this list). Most other potentially long queries are already chunked; it is a limitation of SQL that it makes it so easy to write statements that can brick the database by trying to do everything in one transaction when that behavior is not required.
- Our Postgres database does need an upgrade at some point; the slow-closing behavior we saw could be a long-since-fixed bug.
- We will add a monitoring alarm on average active sessions, not just database CPU, so that this kind of problem is identified sooner.
- We do still want a faster deployment system for new code.
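For the first remediation item, a minimal sketch of a chunked, lock-skipping delete, again assuming the illustrative messages/deleted_at schema: each pass removes a bounded batch, skips rows another session already holds, and opts out of the normal per-statement timeout, so no single statement runs long enough to build up a backlog of lock waiters.

    BEGIN;
    -- This maintenance statement is exempt from the usual per-statement timeout.
    SET LOCAL statement_timeout = 0;

    -- Delete up to 1000 expired rows, skipping any already locked by another
    -- session. The background job repeats this until no rows are deleted.
    WITH doomed AS (
        SELECT id
        FROM messages
        WHERE deleted_at < now() - interval '30 minutes'
        LIMIT 1000
        FOR UPDATE SKIP LOCKED
    )
    DELETE FROM messages
    WHERE id IN (SELECT id FROM doomed);
    COMMIT;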