Intermittent functioning due to open file limits

Posted at 2019-11-10T21:49:00-08:00 by Scott

Impact: Users may have seen strange error messages or experienced delays in receiving mail during the affected period.

Timeline:

  • At 7:36 AM PST, one server briefly began exhibiting symptoms, but quickly recovered.
  • From 7:46 PM PST until resolution at 10:03 PM PST, one server was unable to successfully process many requests.
  • At 9:29 PM PST, our automated internal scripts detected an issue receiving email and alerted staff.
  • At 9:36 PM PST, we identified that the issue was localized to one of our servers and began recycling servers.
  • By 10:03 PM PST, service status had returned to normal.

Technical details: Possibly due to load or recent changes, servers hit the very low default limit on open files set by the operating system and began failing to open files, which prevented them from, for example, reading or writing mail. Because Purelymail runs two servers and SMTP has built-in resilience, no mail was lost: the other, still-functioning server could take over delivery.
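To illustrate the failure mode, here is a minimal Python sketch of how a process can compare its open file descriptors against the operating system's limit. It is not Purelymail's code: the function names and the 80% margin are hypothetical, and counting entries in /proc/self/fd assumes a Linux host.

```python
import os
import resource


def fd_usage():
    """Return (open_fds, soft_limit, hard_limit) for the current process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # Assumption: Linux, where /proc/self/fd has one entry per open descriptor.
    open_fds = len(os.listdir("/proc/self/fd"))
    return open_fds, soft, hard


def near_limit(open_fds, soft_limit, margin=0.8):
    """True when open descriptors have reached `margin` of the soft limit."""
    if soft_limit == resource.RLIM_INFINITY:
        return False  # no limit set, so it cannot be exhausted
    return open_fds >= soft_limit * margin
```

Once the soft limit is reached, further attempts to open a file fail with an error (EMFILE, "Too many open files"), which matches the symptom described above. A periodic check such as near_limit(*fd_usage()[:2]) could warn before that point.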

Remediation: We will substantially raise open file limits and improve our error detection to more quickly address issues like this in the future.
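As a rough sketch of what raising the limit can look like from inside a process, the Python snippet below lifts the soft open file limit toward a target using the standard resource module. The function name and the target value are hypothetical; in practice the limit is often raised in the service manager or system configuration instead, and this post does not say which approach was used.

```python
import resource


def raise_soft_nofile_limit(target):
    """Raise the soft open-file limit toward `target`; return the resulting soft limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return soft  # already unlimited
    if hard != resource.RLIM_INFINITY:
        # An unprivileged process cannot raise its soft limit above the hard limit.
        target = min(target, hard)
    if target <= soft:
        return soft  # never lower an existing limit
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return target
```

Calling raise_soft_nofile_limit(65536) early at startup (65536 is an arbitrary example value) would apply to the current process and any children it starts afterwards.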

We don’t see any cost-effective process changes that would have led to early detection of this issue. Load and stress testing done on our local machines did not detect this issue, as the open file limits on these were already set very high. The possibility of this error was not foreseen.

Takeaways:

  • Improved detection of errors: While our internal scripts were eventually able to detect an impact on email delivery, automatic monitors on error rates would have noticed the problem much sooner.
  • Improved ability to recycle servers: Once we were aware of the issue, the time to fix could have been significantly reduced if we had a better way to recycle our servers; the method we used required rebuilding the same image already in production from scratch.
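
The error-rate monitoring mentioned in the first takeaway could be sketched as a sliding-window check like the one below. This is an illustration under stated assumptions, not the monitoring we run: the class name, the window length, the threshold, and the minimum request count are all hypothetical.

```python
from collections import deque


class ErrorRateMonitor:
    """Track request outcomes per server over a sliding time window."""

    def __init__(self, window_seconds=300.0, threshold=0.05, min_requests=20):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.min_requests = min_requests
        self._events = deque()  # (timestamp, ok) pairs, oldest first

    def record(self, timestamp, ok):
        """Record one request outcome and drop events older than the window."""
        self._events.append((timestamp, ok))
        cutoff = timestamp - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            self._events.popleft()

    def error_rate(self):
        """Fraction of failed requests in the current window (0.0 if empty)."""
        if not self._events:
            return 0.0
        errors = sum(1 for _, ok in self._events if not ok)
        return errors / len(self._events)

    def should_alert(self):
        """Alert only when there is enough traffic and the error rate is too high."""
        if len(self._events) < self.min_requests:
            return False
        return self.error_rate() > self.threshold
```

Keeping one monitor per server would make a single failing server stand out even while the other continues to handle requests normally.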