Instance death (2)

Posted at 2021-05-04T04:30:00-07:00 by Scott

Impact

Some users may have experienced issues connecting to services during the affected period.

Timeline

  • The problem occurred at roughly 6:00 PM-6:45 PM PDT on May 2nd.
  • Staff were alerted by automated alarms at 6:26 PM PDT.
  • The problem was resolved by 6:45 PM PDT.

Technical details

Both of our running mailserver instances ran out of memory at the same time, in a manner similar to an earlier incident.

Remediation

While some memory-saving remediations implemented after the first incident seem to have helped, we obviously still need to work on this. At the moment, it seems that the main server process is leaking JNI memory, a type of native memory in the JVM our servers run on that is difficult to debug. Debugging is harder still when the server is out of memory: the server is difficult to access, and it must be restarted immediately to avoid production impact.
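A native leak of this kind usually shows up as the process's resident memory growing while the JVM heap stays flat. The following is only a rough sketch of that check, not the tooling we actually use. It assumes a Linux host, where resident memory can be read from the VmRSS line of /proc/<pid>/status, and it assumes heap figures are collected separately. The function names and sample numbers are hypothetical.

```python
import re


def parse_vm_rss_kib(status_text):
    """Return the VmRSS value in KiB from the text of /proc/<pid>/status.

    Returns None if the field is missing.
    """
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def non_heap_growth_kib(samples):
    """Estimate growth of memory held outside the JVM heap.

    samples is a time-ordered list of (rss_kib, heap_committed_kib) pairs.
    The result is the change in (rss - heap) between the first and last
    sample; a steadily positive value hints at native (non-heap) growth.
    """
    first_rss, first_heap = samples[0]
    last_rss, last_heap = samples[-1]
    return (last_rss - last_heap) - (first_rss - first_heap)
```

The difference between resident memory and heap also includes thread stacks, metaspace, and other legitimate native allocations, so this only points at a trend; it does not identify which allocation is leaking.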

For the moment, we’ve added swap space to our instances. Swap space lets the system use disk space in place of RAM when RAM runs out, which will hopefully allow us to better debug situations where RAM usage on servers goes slightly over the maximum due to leaked memory. It’s not a panacea (indeed, it can massively slow servers down when they have to use it heavily), but it might help.
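Since heavy swapping is itself a warning sign, it helps to watch how much swap is in use. Below is a minimal sketch of that, assuming a Linux instance where /proc/meminfo reports SwapTotal and SwapFree in kB. The function name is hypothetical and this is not our actual monitoring code.

```python
def swap_usage_kib(meminfo_text):
    """Return (swap_total_kib, swap_used_kib) from the text of /proc/meminfo."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts:
            # The first token after the colon is the numeric value.
            fields[key.strip()] = int(parts[0])
    total = fields.get("SwapTotal", 0)
    free = fields.get("SwapFree", 0)
    return total, total - free
```

On a Linux host this could be called as swap_usage_kib(open("/proc/meminfo").read()). A used value that keeps climbing toward the total would suggest the leak is outgrowing the extra headroom.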