IMAP Timeouts

Posted at 2024-03-03T03:25:00-07:00 by Scott

Impact

Sporadically, and with varying frequency, users experienced timeouts when connecting to IMAP over port 993. Retrying the connection often worked around the problem.

Timeline

  • The problem began to occur sporadically after Feb 24, a few days after a code deploy
  • Throughout the week, staff restarted affected servers whenever the problem appeared on them
  • The problem was thought to be fixed on Feb 29, but users noticed it again over the following days
  • The root cause was identified and fixed on Mar 2

Technical details

Ultimately, this was a race condition that caused a deadlock when closing channels, blocking some of the threads that handled IMAP connections. It was introduced by an innocuous-looking code change and took a while to debug.

We were still using the Netty 3 library, which has long since been superseded by Netty 4. Netty has the concept of a pipeline: a list of handlers that messages pass through. Netty 3 allows multiple messages for a single connection to be processed simultaneously (Netty 4 doesn't).
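
For readers unfamiliar with Netty 3, here is a minimal sketch of the pipeline concept (the handler is hypothetical, not our actual code). Upstream (inbound) events such as received data visit handlers head to tail; downstream (outbound) requests such as writes visit them tail to head, and nothing guarantees that two events for the same channel aren't being handled on different threads at once.

    import org.jboss.netty.channel.ChannelHandlerContext;
    import org.jboss.netty.channel.ChannelPipeline;
    import org.jboss.netty.channel.Channels;
    import org.jboss.netty.channel.MessageEvent;
    import org.jboss.netty.channel.SimpleChannelHandler;

    public class PipelineSketch {
        public static ChannelPipeline build() {
            ChannelPipeline pipeline = Channels.pipeline();
            pipeline.addLast("logger", new SimpleChannelHandler() {
                @Override
                public void messageReceived(ChannelHandlerContext ctx, MessageEvent e) {
                    // Upstream (inbound): invoked as data arrives, head to tail.
                    ctx.sendUpstream(e);
                }

                @Override
                public void writeRequested(ChannelHandlerContext ctx, MessageEvent e) {
                    // Downstream (outbound): invoked as writes leave, tail to head.
                    ctx.sendDownstream(e);
                }
            });
            return pipeline;
        }
    }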

When a connection was abruptly reset, it could send both an event for the channel being closed inbound through the pipeline (from head to tail) and an exception event for the closed-channel exception outbound (from tail to head). Our old pipeline was (to simplify):

SSL handler -> execution handler -> exception handler -> traffic shaping (bandwidth limiting) handler

All we did was move the traffic shaping handler:

SSL handler -> traffic shaping handler -> execution handler -> exception handler

We made this change because it didn't seem like the traffic shaping handler logically belonged on a separate thread, which is where it ended up by sitting after the execution handler.
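
In code, the change amounted to moving one addLast call. A sketch of the wiring, assuming Netty 3's stock SslHandler, ExecutionHandler, and ChannelTrafficShapingHandler plus a hypothetical exception handler of ours passed in from elsewhere:

    import org.jboss.netty.channel.ChannelPipeline;
    import org.jboss.netty.channel.ChannelUpstreamHandler;
    import org.jboss.netty.channel.Channels;
    import org.jboss.netty.handler.execution.ExecutionHandler;
    import org.jboss.netty.handler.ssl.SslHandler;
    import org.jboss.netty.handler.traffic.ChannelTrafficShapingHandler;

    public class ImapPipelineSketch {
        public static ChannelPipeline build(SslHandler ssl,
                                            ChannelTrafficShapingHandler traffic,
                                            ExecutionHandler execution,
                                            ChannelUpstreamHandler exceptions) {
            ChannelPipeline pipeline = Channels.pipeline();
            pipeline.addLast("ssl", ssl);
            // New order: the traffic shaper now sits before the execution handler,
            // so it runs on the same I/O thread as the SSL handler.
            pipeline.addLast("traffic", traffic);
            pipeline.addLast("execution", execution);
            pipeline.addLast("exceptions", exceptions);
            // Old order (before the change):
            //   ssl -> execution -> exceptions -> traffic
            return pipeline;
        }
    }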

What then happened was that when a connection was abruptly closed, the exception handler would try to close the connection itself (as it does for all exceptions) by writing a close request. Because the execution handler put the exception event on a different thread (it is intended for processing messages, not exceptions), the close request could run concurrently with the event reporting that the socket had closed.
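
Our actual handler isn't reproduced here, but in Netty 3 an exception handler of this kind typically looks something like the following sketch (hypothetical class); the close() call issues the outbound close request described above.

    import org.jboss.netty.channel.ChannelHandlerContext;
    import org.jboss.netty.channel.ExceptionEvent;
    import org.jboss.netty.channel.SimpleChannelUpstreamHandler;

    // Hypothetical sketch of an exception handler that closes the connection.
    public class ImapExceptionHandler extends SimpleChannelUpstreamHandler {
        @Override
        public void exceptionCaught(ChannelHandlerContext ctx, ExceptionEvent e) {
            // Because this handler sits behind the ExecutionHandler, this runs on a
            // pool thread. close() emits a downstream close request that travels
            // tail to head through the traffic shaping and SSL handlers, possibly
            // while the inbound "channel closed" event is still in flight on an
            // I/O thread.
            e.getChannel().close();
        }
    }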

The close request (traveling in the reverse direction because it's outbound) acquired a lock on the traffic shaping handler and then tried to acquire a lock on the SSL handler. Meanwhile, the closed event acquired a lock on the SSL handler and then tried to acquire a lock on the traffic shaping handler. This caused a deadlock.

Previously this couldn't happen, because the execution handler sat between the SSL handler and the traffic shaper and made them always acquire their locks on different threads, so no single thread ever held both locks at once.
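
Stripped of Netty, this is the classic lock-ordering deadlock. A self-contained illustration of just the shape of the problem (nothing here is Netty code):

    public class LockOrderDeadlock {
        static final Object sslLock = new Object();
        static final Object trafficLock = new Object();

        public static void main(String[] args) {
            // Inbound "channel closed" event: SSL handler first, then traffic shaper.
            Thread inbound = new Thread(() -> {
                synchronized (sslLock) {
                    pause();                       // widen the race window
                    synchronized (trafficLock) {   // waits forever: other thread holds it
                        System.out.println("inbound finished");
                    }
                }
            });
            // Outbound close request: traffic shaper first, then SSL handler.
            Thread outbound = new Thread(() -> {
                synchronized (trafficLock) {
                    pause();
                    synchronized (sslLock) {       // waits forever: other thread holds it
                        System.out.println("outbound finished");
                    }
                }
            });
            inbound.start();
            outbound.start();   // both threads block; neither print ever runs
        }

        static void pause() {
            try { Thread.sleep(100); } catch (InterruptedException ignored) {}
        }
    }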

The Netty 4 model is immune to this issue, as it only ever processes one event at a time for a given channel, regardless of which thread the event came from. We plan to upgrade to it.
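
For reference, a rough sketch of what the equivalent wiring looks like in Netty 4 (handler names and limits are illustrative, not our real configuration). Every handler added this way runs on the channel's single event loop, so an inbound closed event and an outbound close request can never interleave.

    import javax.net.ssl.SSLEngine;
    import java.util.function.Supplier;

    import io.netty.channel.ChannelInitializer;
    import io.netty.channel.socket.SocketChannel;
    import io.netty.handler.ssl.SslHandler;
    import io.netty.handler.traffic.ChannelTrafficShapingHandler;

    public class ImapInitializerSketch extends ChannelInitializer<SocketChannel> {
        private final Supplier<SSLEngine> engines;

        public ImapInitializerSketch(Supplier<SSLEngine> engines) {
            this.engines = engines;
        }

        @Override
        protected void initChannel(SocketChannel ch) {
            // All events for this channel, inbound and outbound, are processed one
            // at a time on the channel's event loop thread.
            ch.pipeline().addLast("ssl", new SslHandler(engines.get()));
            ch.pipeline().addLast("traffic", new ChannelTrafficShapingHandler(0, 0)); // 0 = no limit
            // ...IMAP protocol handlers would follow here.
        }
    }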

Why not just roll the code back?

Partly hubris: we thought it'd be easier to find and fix the problem. Partly morale: the feeling of never being able to deploy because something random will break sucks, and we at least wanted to be able to figure out what was going wrong. And the old code was broken in ways that the new code was not; almost all of the new code worked just fine.