Mail delivery (incoming and outgoing) was delayed during the affected period.
- The problem began at roughly 8:00 PM PST.
- At about 9:30 PM PST, staff noticed some personal emails were not being delivered.
- By 10:50 PM PST, the problem was identified and fixed, and the mail backlog was fully processed.
When mail is received, it needs to be added to a per-user search index. Multiple servers may be adding mail to a user’s mailbox simultaneously, but only one can make changes to the search index at a time. To enforce this behavior, servers that wish to write to a user’s search index must hold a lock on it.
This lock is implemented as a database row. Since the lock’s lifetime is dynamic (a server processing multiple mails for a user might write all the changes to the index at once and thus hold the index open for a bit) and transaction-based locks are expensive in Postgres, it is implemented via updating a row’s locked attribute, with the server needing to continue periodically updating the attribute as long as it holds the lock. (If it doesn’t, it’s assumed to have crashed and the lock is released.)
A recent change, made to fix some incorrect behavior in the search index, took control of lock acquisition away from Lucene (the search index library), although Lucene still controlled when the lock was released. Unfortunately, this meant that if constructing the index writer failed (because the index was corrupted), the lock was never released.
With the lock held indefinitely, any attempt to add to that user’s search index would wait indefinitely. Since only a limited number of threads process email, once the user who owned the index had more pending mails than there were threads, all of the threads became deadlocked waiting to write to the same search index, and all mail processing effectively stopped.
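The failure path can be sketched in a few lines. Everything here is a stand-in invented for illustration (`FakeLock`, `FakeWriter`, `IndexCorruptedError`, and both `open_writer_*` functions); it is not Lucene's API or the actual code, only the shape of the bug and one way to guard against it.

```python
class IndexCorruptedError(Exception):
    """Raised by the stand-in writer when the index cannot be opened."""


class FakeLock:
    """Stand-in for the database-row lock."""

    def __init__(self):
        self.held = False

    def acquire(self):
        self.held = True

    def release(self):
        self.held = False


class FakeWriter:
    """Stand-in for the index writer; closing it releases the lock."""

    def __init__(self, lock, corrupted=False):
        if corrupted:
            raise IndexCorruptedError("index is corrupted")
        self._lock = lock

    def close(self):
        self._lock.release()


def open_writer_leaky(lock, corrupted):
    # The lock is taken outside the writer, but only the writer releases it.
    lock.acquire()
    # If construction raises, no writer exists to release the lock.
    return FakeWriter(lock, corrupted)


def open_writer_guarded(lock, corrupted):
    lock.acquire()
    try:
        return FakeWriter(lock, corrupted)
    except Exception:
        # Construction failed, so ownership never passed to the writer.
        lock.release()
        raise
```

With a corrupted index, `open_writer_leaky` leaves `lock.held` set after the exception, while `open_writer_guarded` releases the lock before re-raising.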
Wait, don’t you have monitors for this sort of thing?
Yes, we have monitors whose job is to alert us to any problems with mail delivery. However, in a stroke of terrible luck, these monitors had been temporarily disabled earlier in the day after they misfired for unrelated reasons, and they were due to be reinstated at about the same time as the incident.
Disabling them in the first place was not a great decision, but poor decisions are a natural consequence of being awakened after only two hours of sleep.
- It’s always a good idea to have failsafes. In this case, that means ones that release the database lock if it goes out of scope.
- We’re going to limit the number of threads that can be concurrently processing email for a single account, to mitigate this sort of issue in the future.
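The per-account limit could look something like the following. This is only a sketch under assumed names (`PerAccountLimiter`, `try_enter`, `leave`); the actual design, including the limit and what happens to mail that is turned away, is not described here.

```python
import threading
from collections import defaultdict


class PerAccountLimiter:
    """Caps how many worker threads may process one account at a time."""

    def __init__(self, max_per_account):
        self._max = max_per_account
        self._guard = threading.Lock()  # protects the counter table
        self._active = defaultdict(int)

    def try_enter(self, account):
        """Return True if this thread may start work on the account."""
        with self._guard:
            if self._active[account] >= self._max:
                return False
            self._active[account] += 1
            return True

    def leave(self, account):
        """Call when the thread is done with the account."""
        with self._guard:
            self._active[account] -= 1
            if self._active[account] <= 0:
                del self._active[account]
```

A worker that gets `False` from `try_enter` would presumably put the message back on the queue and pick up mail for another account, so one stuck index can tie up at most `max_per_account` threads instead of the whole pool.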