This is potentially the stupidest production issue in the world. I sincerely apologize.
Impact: Users may have experienced delays in receiving or sending mail during the affected period.
Timeline:
- The problem began at 4:46 PM PST, when a new change was deployed to production.
- At approximately 8:50 PM PST, staff were alerted by internal monitors that emails were not being sent or received.
- By 9:31 PM PST, we had identified the cause and were deploying a fix.
- By 9:45 PM PST, service had begun returning to normal.
Technical details: A change was deployed which introduced, among other things, a configuration option for running the Purelymail servers locally against a production-like environment; that option disabled processing of mail from the mail queue. It was never intended to affect production, but through a coding oversight it did, and after the deployment the production servers stopped processing mail.
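For illustration only (the names and structure below are hypothetical, not our actual code), the failure mode looks roughly like this: a flag meant for local development is consulted unconditionally, so a configuration mistake quietly turns off queue processing everywhere, including production, without raising any error.

```python
# Hypothetical sketch of the failure mode; names do not reflect Purelymail's actual code.
from dataclasses import dataclass

@dataclass
class MailConfig:
    # Flag added so developers can run the servers locally against a
    # production-like environment without actually delivering mail.
    skip_queue_processing: bool = False

def process_queue(config: MailConfig, queue: list[str]) -> list[str]:
    """Return the messages that were 'delivered' from the queue."""
    if config.skip_queue_processing:
        # Intended only for local development. If this flag ends up enabled
        # (or defaulted) in the production configuration, delivery silently
        # stops and nothing in the pipeline reports a failure.
        return []
    return list(queue)

if __name__ == "__main__":
    prod_config = MailConfig(skip_queue_processing=True)  # the oversight
    print(process_queue(prod_config, ["mail-1", "mail-2"]))  # prints [] -- nothing delivered
```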
This would not have been that big of a problem if our internal alarms had functioned correctly. For reasons we are still investigating, they did not alert staff to the problem for another 3.5 hours.
Remediation: The particular change introducing the configuration option actually came from some older work that had lain dormant and was only recently integrated into the master code branch. Such code may be particularly issue-prone, and should be reviewed as if it were new.
However, we know that no amount of code QA or testing can catch every possible problem. We will need a last-ditch way for users to alert staff of issues that’s completely independent of email.
Takeaways:
- Who monitors the internal monitors?
- More seriously, these internal monitors are critical for alerting staff to any issues. We hope to find out what went wrong with them and make them more reliable.
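One common answer to "who monitors the monitors" is a dead-man's switch: the monitors emit a regular heartbeat, and a much simpler, independent watchdog escalates over a non-email channel if that heartbeat stops. A rough sketch of the idea (hypothetical, not our actual monitoring code):

```python
# Hypothetical dead-man's-switch sketch; not our actual monitoring code.
import time
from pathlib import Path

HEARTBEAT_FILE = Path("/var/run/monitor-heartbeat")  # touched by the primary monitor
MAX_SILENCE = 5 * 60   # seconds of silence before the watchdog escalates
CHECK_INTERVAL = 60    # how often the watchdog itself checks

def escalate(message: str) -> None:
    # Placeholder: in practice this would page staff over a channel that does
    # not depend on the mail pipeline (SMS, a third-party pager, etc.).
    print(f"ESCALATION: {message}")

def watchdog_loop() -> None:
    """Runs separately from the monitors and alerts if they go quiet."""
    while True:
        try:
            silence = time.time() - HEARTBEAT_FILE.stat().st_mtime
        except FileNotFoundError:
            silence = float("inf")  # heartbeat never written, or lost
        if silence > MAX_SILENCE:
            escalate(f"No monitor heartbeat for {silence:.0f} seconds")
        time.sleep(CHECK_INTERVAL)
```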