Purelymail News - IMAP access issues

Impact

For around 3.5 hours, users were unable to access mail over IMAP.

Timeline

Code changes made the previous day (Feb 7th) exposed flaws that would later cause the problem.
Around 12:00 PM PST on Feb 8th, the problem quickly became evident. However, due to flaws in message monitoring, no alarms were sounded.
Users managed to alert staff of the issue at 2:20 PM.
The problem was identified and remediation began around 3:00 PM; the issue was completely cleared up around 3:30 PM.

Technical details

Purelymail used to use an older and much more CPU-expensive compression library (XZ) on its stored data. This was replaced by a much more performant format years ago, but since messages could not be immediately migrated (the compression is “underneath” the encryption, which we may not be able to decrypt), the old format was still used in older messages.

Because the old format was extremely CPU and memory intensive, message decryption was guarded by a semaphore. The semaphore logic was fatally flawed because the old format was also flawed: it had a compression-encryption-compression layering that required two separate semaphore acquisitions. This is a classic case of hidden deadlock: if the permits ran out, two concurrent attempts to decode an old-format message could each acquire their first permit and then deadlock waiting to acquire the rest. Importantly, these semaphore acquisitions had no time bound, so they could hold threads indefinitely.

This was avoided earlier by having many more semaphore permits available, but this could cause server responsiveness issues when a large number of old-format messages needed to be decrypted, something which had happened on Feb 7th. A code change was made to reduce the number of semaphore permits, which made it much easier to deadlock.
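
The hazard described above can be sketched in a few lines. This is an illustration only, not Purelymail's code: the language (Python), the function name simulate_nested_acquire, and the permit counts are all assumptions. Each worker takes one permit for the outer layer and then needs a second for the inner layer. Barriers force the unlucky interleaving in which every worker already holds its first permit, and a timeout stands in for what would otherwise be an indefinite wait.

```python
import threading


def simulate_nested_acquire(permits: int, workers: int, timeout: float = 0.2) -> list[bool]:
    """Return, per worker, whether it obtained its second (inner) permit.

    Hypothetical model of a decode path that needs two permits from the
    same semaphore. Not taken from the real service.
    """
    if permits < workers:
        raise ValueError("need at least one permit per worker for this demo")

    sem = threading.Semaphore(permits)
    barrier = threading.Barrier(workers)
    results = [False] * workers

    def decode(index: int) -> None:
        sem.acquire()  # permit for the outer layer
        try:
            # Wait until every worker holds its first permit.
            barrier.wait()
            # Without a timeout, this call would block forever
            # whenever no permits are left.
            if sem.acquire(timeout=timeout):
                results[index] = True
                sem.release()
            # Keep the outer permit until everyone has finished trying,
            # so no worker is rescued by another worker giving up.
            barrier.wait()
        finally:
            sem.release()

    threads = [threading.Thread(target=decode, args=(i,)) for i in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(simulate_nested_acquire(permits=2, workers=2))  # nobody gets an inner permit
    print(simulate_nested_acquire(permits=4, workers=2))  # enough permits for everyone
```

With two permits and two workers, neither worker can obtain its second permit, which is the deadlock condition. With four permits both succeed, which is why a larger permit pool hid the flaw and a smaller one exposed it.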

The next day, deadlock it did. IMAP-processing threads soon began getting stuck waiting for XZ permits that would never become available, effectively disabling IMAP access.

Remediation

The direct root cause of the error (poor semaphore management) was fixed, and timeouts were added to semaphore access so that threads cannot be blocked indefinitely.
We’ve also added code that will automatically start translating old messages into the newer format after they’ve been accessed. This should reduce server load and reduce issues with the old format in the future. This was planned a while ago, but wasn’t prioritized since it didn’t seem to be an issue until now.
Monitoring failed us yet again; the alarm that should have caught this issue within 20 minutes did not have IMAP timeouts, so it quietly waited instead of alarming. This will be fixed. Fortunately, an out-of-band method for users to wake and alert staff worked, although this was far more outage time than we find acceptable.
We will also look into adding concurrent processing limits for IMAP by account, to keep issues from a subset of users from affecting all users, although that may not have completely solved the problem for this case.
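
As an illustration of the first remediation item, here is a minimal sketch of time-bounded permit acquisition. It is an assumption about one way to do this, not the actual fix: the language (Python), the helper name acquire_permits, and its parameters are hypothetical. The helper takes every permit a decode needs under a single deadline and hands back whatever it already holds if the deadline passes, so a thread can neither wait forever nor sit on a partial set of permits.

```python
import threading
import time
from contextlib import contextmanager


@contextmanager
def acquire_permits(sem: threading.Semaphore, count: int, timeout: float):
    """Hold `count` permits for the duration of the block (hypothetical helper).

    Raises TimeoutError if all permits cannot be obtained within `timeout`
    seconds. Any permits taken so far are released before the error
    propagates.
    """
    deadline = time.monotonic() + timeout
    held = 0
    try:
        for _ in range(count):
            remaining = max(deadline - time.monotonic(), 0.0)
            if not sem.acquire(timeout=remaining):
                raise TimeoutError(f"got {held} of {count} permits before the deadline")
            held += 1
        yield
    finally:
        # Runs on success, on timeout, and on errors inside the block.
        for _ in range(held):
            sem.release()
```

A caller would wrap the expensive decode in `with acquire_permits(sem, 2, timeout=5.0):` and treat TimeoutError as a retryable failure rather than letting the thread hang. The timeout value here is arbitrary and would need tuning against real decode times.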