EC2 Instance Failure

Posted at 2025-11-23T10:37:25-0500

Impact

We experienced a significant unrecoverable error on one of our EC2 instances that led to temporary IMAP connection failures (you couldn’t download mail) and undelivered mail (both incoming and outgoing). Roughly 3,900 messages (mostly incoming mail) were not delivered until Sunday, November 23.

Timeline

On November 21, 2025, one of our virtual machines experienced a significant unrecoverable error (for that instance). Around 15:33 EST that day, we logged the first error indicating that our S3 client connection pool had shut down. Errors increased until 17:00 EST, at which point a “graceful shutdown” started for the impacted EC2 instance. By 17:20 EST, the shutdown was mostly complete, and we saw a very significant increase in errors (mostly “Connection reset” IMAP errors). Around the same time, we began redeploying our servers, and these errors were resolved as the corrupted instance was replaced with a healthy one. By November 23, we had cleared out most of the mail that was delayed by the instance error (about 30 messages could not be sent or received due to corruption issues).

Technical details

We use the AWS S3 Java SDK, and we follow the recommended best practice of creating a single, thread-safe instance of the service client that we reuse for the application’s lifecycle. That instance maintains a connection pool that allows multiple concurrent S3 connections to perform actions such as storing mail, retrieving mail, etc. One particular server was unexpectedly shut down due to a “Connection pool shut down” error. Amazon suggests that this error is likely due to one of the following:

  • The SDK client was closed prematurely.
  • A java.lang.Error (such as OutOfMemoryError) was thrown.
  • DefaultCredentialsProvider#create() was used after the provider was closed.
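The single-client pattern described above looks roughly like this with the v2 SDK. This is an illustrative sketch, not our production configuration; the region and pool size are placeholders:

```java
import software.amazon.awssdk.http.apache.ApacheHttpClient;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;

public final class S3Holder {
    // One thread-safe client shared for the whole application lifecycle;
    // its internal connection pool serves all store/retrieve operations.
    public static final S3Client S3 = S3Client.builder()
            .region(Region.US_EAST_1) // illustrative region
            .httpClientBuilder(ApacheHttpClient.builder()
                    .maxConnections(64)) // illustrative pool size
            .build();

    private S3Holder() {}

    // S3.close() should be called only at instance shutdown; closing the
    // client earlier is one documented cause of "Connection pool shut down".
}
```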

We do not use the DefaultCredentialsProvider, so that cause was ruled out. And we only close the SDK client when the instance shuts down (which should happen only long after the instance has no work left to do). So it appears that some other underlying error triggered the pool shutdown. We have not identified what that error was, but we have put safeguards in place so that any future errors of a similar nature are detected sooner, allowing us to redeploy servers faster.
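A safeguard of the kind described above could be sketched as follows: wrap S3 operations, watch for the pool-shutdown error, and flag the instance as unhealthy so it gets replaced quickly. The class and method names here are hypothetical, not our actual production code:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical guard around S3 operations. When the SDK's connection
// pool has shut down, calls surface an IllegalStateException whose
// message contains "Connection pool shut down"; we use that to mark
// the instance unhealthy so alerting/redeploy can kick in sooner.
public class PoolHealthGuard {
    private final AtomicBoolean healthy = new AtomicBoolean(true);

    public boolean isHealthy() {
        return healthy.get();
    }

    public <T> T run(Callable<T> s3Operation) throws Exception {
        try {
            return s3Operation.call();
        } catch (IllegalStateException e) {
            String msg = e.getMessage();
            if (msg != null && msg.contains("Connection pool shut down")) {
                // Flag for monitoring; a health check would now fail
                // and the instance would be scheduled for replacement.
                healthy.set(false);
            }
            throw e; // still propagate so the caller can retry elsewhere
        }
    }
}
```

A health-check endpoint can then report `isHealthy()` so a dead pool is noticed within one failed request rather than hours later.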