Instance death

Posted at 2021-03-31T05:16:00-07:00 by Scott

Impact

Some users may have experienced issues connecting to services during the affected period.

Timeline

  • The problem occurred from roughly 9:20 AM to 9:40 AM PST on March 30th, and again from 10:00 AM to 10:20 AM PST. It resolved on its own around 10:20 AM PST.
  • Staff noticed the problem around 3 PM PST.

Technical details

One of our two running mailservers began experiencing issues. We’re not certain what the cause was, but the likely culprit is high memory use by SpamAssassin, which caused the machine to run out of memory and, in some way we have not pinned down, made the mailserver freeze for long stretches. Eventually, the load balancer automatically rotated the instance out for unresponsiveness and replaced it.

A few user search indices were corrupted as a side effect and needed to be rebuilt.

Why the long interval between problem and staff action?

This problem both self-resolved and never actually impaired the service enough to trigger most alarms. The other mailserver instance was unaffected. The alarm configured to notice unhealthy servers did not properly alert staff.
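
As an illustration of the kind of check that would catch this case, here is a minimal sketch of a per-instance rule that flags any single server failing several health checks in a row, even while the service as a whole stays up. This is not our actual alarm configuration; the function name, instance names, and threshold are all hypothetical.

```python
def unhealthy_instances(history: dict[str, list[bool]], threshold: int) -> list[str]:
    """Return instances whose most recent `threshold` health checks all failed.

    `history` maps a (hypothetical) instance name to its check results,
    oldest first, where True means the check passed.
    """
    if threshold < 1:
        raise ValueError("threshold must be at least 1")
    flagged = []
    for name, checks in history.items():
        recent = checks[-threshold:]
        # Only flag when we have a full window and every check in it failed.
        if len(recent) == threshold and not any(recent):
            flagged.append(name)
    return sorted(flagged)


# Hypothetical data: one instance freezing while the other stays healthy.
history = {
    "mail-1": [True, False, False, False],
    "mail-2": [True, True, True, True],
}
to_page = unhealthy_instances(history, threshold=3)
```

The point of the sketch is that the decision is made per instance, so a healthy second server cannot mask a failing one.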

Remediation

We’ll keep tweaking the SpamAssassin configuration. SpamAssassin is notoriously memory-hungry and prone to leaking. We’ve added memory caps to it before, but those were apparently applied per child process rather than to the daemon as a whole, so they needed to be lowered substantially.
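
To show why a per-child cap can be far too generous, here is a rough arithmetic sketch. The numbers are made up for illustration and are not our production values; the only assumption is that the daemon runs a fixed maximum number of children, each subject to its own cap.

```python
def worst_case_total_mib(per_child_cap_mib: int, max_children: int) -> int:
    """Upper bound on daemon memory if every child reaches its own cap."""
    return per_child_cap_mib * max_children


def per_child_cap_for_budget(total_budget_mib: int, max_children: int) -> int:
    """Largest per-child cap that keeps all children within a total budget."""
    if max_children <= 0:
        raise ValueError("max_children must be positive")
    return total_budget_mib // max_children


# Hypothetical values, chosen only to make the arithmetic visible.
PER_CHILD_CAP_MIB = 1024   # cap we thought applied to the whole daemon
MAX_CHILDREN = 5           # number of scanner children
HOST_BUDGET_MIB = 2048     # memory we can actually spare for scanning

total = worst_case_total_mib(PER_CHILD_CAP_MIB, MAX_CHILDREN)
safe_cap = per_child_cap_for_budget(HOST_BUDGET_MIB, MAX_CHILDREN)
print(f"worst case with the per-child cap: {total} MiB")
print(f"per-child cap that fits the budget: {safe_cap} MiB")
```

With these example numbers, a cap intended as a 1 GiB limit on the daemon permits up to 5 GiB in total, and the per-child value has to drop to roughly 400 MiB to stay within a 2 GiB budget.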

We may in the future need to run SpamAssassin on a separate host entirely to mitigate its faults. (This is not currently done for infrastructural simplicity.)

Also, the alarms that alert staff need to be tweaked. We might even build something that lets users alert us when a problem seems apparent.