Mail delivery delays

Posted at 2020-04-02T02:46:00-07:00 by Scott

Impact: Users may have experienced delays in receiving or sending mail during the affected period.

Timeline:

  • The problem began at roughly 2:30 AM PDT, and staff were alerted by internal alarms.
  • By 3:00 AM PDT, we began attempting to recycle servers to clear the mail queue. However, one user had inadvertently created an open relay that was mass-sending spam, and the resulting higher-than-normal sending load was triggering the deadlock. An emergency fix was deployed to drop these emails.
  • By 4:26 AM PDT, service status had returned to normal.

Technical details: A bug in Apache James, a mail library Purelymail uses, caused a deadlock in which mail was removed from the queue but never actually delivered to the processing threads. The bug was introduced in a recent version of the library.
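
To illustrate the failure mode described above (mail leaves the queue but never reaches a worker), here is a minimal sketch of how such a stall could be detected. It is not Apache James code and not Purelymail's actual monitoring. The class name HandoffMonitor, its methods, and the stall threshold are all hypothetical, chosen only for illustration.

```python
import threading
import time


class HandoffMonitor:
    """Hypothetical helper: compares items taken off a queue with items
    that actually reached a processing thread."""

    def __init__(self, stall_after_seconds, clock=time.monotonic):
        self._lock = threading.Lock()
        self._clock = clock  # injectable so the logic can be tested
        self._stall_after = stall_after_seconds  # assumed threshold
        self._dequeued = 0
        self._delivered = 0
        self._last_progress = clock()

    def record_dequeued(self):
        """Call when an item is removed from the queue."""
        with self._lock:
            if self._dequeued == self._delivered:
                # Nothing was in flight, so start timing from this dequeue.
                self._last_progress = self._clock()
            self._dequeued += 1

    def record_delivered(self):
        """Call when a processing thread has actually received an item."""
        with self._lock:
            self._delivered += 1
            self._last_progress = self._clock()

    def in_flight(self):
        """Items dequeued but not yet delivered to a worker."""
        with self._lock:
            return self._dequeued - self._delivered

    def is_stalled(self):
        """True if items are in flight and none has been delivered
        for longer than the threshold."""
        with self._lock:
            waiting = self._dequeued - self._delivered
            idle = self._clock() - self._last_progress
            return waiting > 0 and idle > self._stall_after
```

In this sketch, the queue consumer would call record_dequeued() when it takes a message and each worker would call record_delivered() when it receives one. A periodic check of is_stalled() could then raise an alarm when messages have been dequeued but nothing has reached a worker for longer than the threshold.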

Remediation: We will integrate fixes for the bug into our fork of the library, and more carefully audit updates to the affected library in the future.

Takeaways:

  • We still need an improved ability to recycle servers.
  • We may also need more comprehensive load testing to catch these issues earlier.