Impact: Users may have experienced delays in receiving or sending mail during the affected period.
Timeline:
- The problem began at roughly 2:30 AM PST, and staff were alerted by internal alarms.
- By 3:00 AM PST, we began recycling servers to clear the mail queue. However, one user had inadvertently created an open relay that was mass-sending spam, and the resulting higher-than-normal sending load kept triggering the deadlock. We deployed an emergency fix to drop these emails.
- By 4:26 AM PST, service status had returned to normal.
Technical details: A bug in Apache James, a mail library Purelymail uses, caused a deadlock in which mail was removed from the queue but never delivered to the processing threads. The bug was introduced in a recent version of the library.
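As an illustration only, the failure mode described above (mail leaving the queue without reaching a processing thread) could be surfaced by tracking the handoff between the two steps. This is a minimal sketch, not Apache James code and not Purelymail's actual monitoring: the class `QueueStallDetector`, its method names, and the `max_handoff_seconds` threshold are all hypothetical, and it assumes the queue and the workers can each report a message id.

```python
import time


class QueueStallDetector:
    """Hypothetical helper: flags messages that were dequeued but never
    handed to a processing thread within a time limit."""

    def __init__(self, max_handoff_seconds, clock=time.monotonic):
        self.max_handoff_seconds = max_handoff_seconds
        self.clock = clock  # injectable so the detector can be tested
        self.in_flight = {}  # message id -> time it was dequeued

    def on_dequeued(self, message_id):
        # Call when a message is removed from the mail queue.
        self.in_flight[message_id] = self.clock()

    def on_delivered(self, message_id):
        # Call when a processing thread actually receives the message.
        self.in_flight.pop(message_id, None)

    def stalled_messages(self):
        # Messages dequeued longer ago than the limit and still not delivered.
        now = self.clock()
        return sorted(
            message_id
            for message_id, dequeued_at in self.in_flight.items()
            if now - dequeued_at > self.max_handoff_seconds
        )
```

A periodic check that alerts when `stalled_messages()` is non-empty would point at this kind of lost handoff directly, rather than relying on queue depth alone.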
Remediation: We will integrate fixes for the bug into our fork of the library, and more carefully audit updates to the affected library in the future.
Takeaways:
- We still need an improved ability to recycle servers.
- We may also need more comprehensive load testing to catch these issues earlier.