Purelymail News - Database out of IOPS

Impact

General slowness, mostly when performing IMAP queries. Users with large mailboxes could experience IMAP timeouts. Service was down for about 20 minutes during a staff-initiated database upgrade.

Timeline

At about 6 AM PST, the number of IOPS the database was using increased about 20% over its baseline.
Around 3 PM, the database’s IOPS burst bucket ran out, limiting how many disk-based queries the database could fulfill.
Staff began investigating around the same time and tried code changes to reduce IOPS use, without success.
At 9:19 PM, staff began restarting the database to upgrade its instance capacity.
At 9:38 PM, the upgrade was complete and service started to return to normal.
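
As a rough illustration of why a modest increase in IOPS can take hours to show up as slowness, a burst bucket can be modeled as a pool of credits that refills at a fixed rate and drains whenever usage exceeds that rate. The sketch below is only a toy model; the bucket size and rates are hypothetical numbers chosen for the example, not our actual figures or AWS's.

```python
def seconds_until_bucket_empty(credits: float, refill_per_sec: float,
                               usage_per_sec: float) -> float:
    """Return how long a burst bucket lasts, or infinity if it never drains."""
    net_drain = usage_per_sec - refill_per_sec
    if net_drain <= 0:
        # Usage is at or below the refill rate, so the bucket never empties.
        return float("inf")
    return credits / net_drain


# Hypothetical figures: a bucket of 3.6 million credits, a refill rate of
# 1,000 IOPS, and usage running 20% above that refill rate.
hours = seconds_until_bucket_empty(3_600_000.0, 1000.0, 1200.0) / 3600
print(f"Bucket lasts about {hours:.1f} hours")
```

In this model nothing visibly goes wrong while credits remain, and then the database is abruptly capped at the refill rate once they are gone.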

Technical details

An AWS RDS database is limited both by the IOPS (input/output operations per second, a measure of disk reads and writes) of its storage, which can be upgraded at any time, and by the maximum IOPS of its instance class. We knew our database was running close to its IOPS limit and were already working on ways to mitigate this, but these weren’t quite ready in time, so we resorted to upgrading our database.
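
Put another way, the usable ceiling is whichever of the two limits is lower, so raising only one of them does not help once the other is the bottleneck. A minimal sketch of that relationship, with hypothetical numbers and a made-up function name:

```python
def effective_iops_limit(storage_iops: int, instance_max_iops: int) -> int:
    """The database can use no more IOPS than the lower of its two limits."""
    return min(storage_iops, instance_max_iops)


# Hypothetical example: upgrading storage alone changes nothing
# when the instance class is the tighter limit.
print(effective_iops_limit(storage_iops=3000, instance_max_iops=2000))  # 2000
print(effective_iops_limit(storage_iops=6000, instance_max_iops=2000))  # 2000
```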

Basically, a lot of IMAP clients are outdated or poorly written and will request the ID of every single piece of mail in their mailbox every time they want to check for new mail. These queries can be pretty heavy on the database, but this wasn’t previously much of a problem.

The straightforward solution is to cache mailbox data on the server so it doesn’t have to talk to the database at all when a client asks (again) for every piece of mail. For this, we were working on setting up a SQLite-based cache, which was nearly ready at the time the problem occurred. Unfortunately, since its simplified implementation loads all of a mailbox’s data from the database before answering queries, it didn’t seem to help when the database was already overstressed (and that initial load could itself time out).
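
To make the idea concrete, here is a minimal sketch of this kind of cache, not our actual implementation. The class name, the table layout, and the `fetch_uids_from_db` callback are all hypothetical, and invalidation (new or deleted mail) is left out. It shows both the benefit (repeat requests never reach the main database) and the weakness described above (the first request for a mailbox still needs one full load).

```python
import sqlite3
from typing import Callable, Iterable, List


class MailboxUidCache:
    """Server-side cache of message IDs, backed by an in-memory SQLite database."""

    def __init__(self, fetch_uids_from_db: Callable[[str], Iterable[int]]) -> None:
        # Hypothetical callback standing in for the expensive main-database query.
        self._fetch = fetch_uids_from_db
        self._conn = sqlite3.connect(":memory:")
        self._conn.execute(
            "CREATE TABLE uids (mailbox TEXT NOT NULL, uid INTEGER NOT NULL, "
            "PRIMARY KEY (mailbox, uid))"
        )
        self._loaded = set()  # mailboxes whose contents are already cached

    def list_uids(self, mailbox: str) -> List[int]:
        if mailbox not in self._loaded:
            # Cold load: this is the step that still hits the main database,
            # and the one that can time out if that database is overloaded.
            rows = [(mailbox, uid) for uid in self._fetch(mailbox)]
            with self._conn:
                self._conn.executemany(
                    "INSERT OR IGNORE INTO uids VALUES (?, ?)", rows
                )
            self._loaded.add(mailbox)
        cursor = self._conn.execute(
            "SELECT uid FROM uids WHERE mailbox = ? ORDER BY uid", (mailbox,)
        )
        return [row[0] for row in cursor]


# Usage with a fake database that records how often it is queried.
calls = []

def fake_fetch(mailbox: str) -> List[int]:
    calls.append(mailbox)
    return [3, 1, 2]

cache = MailboxUidCache(fake_fetch)
cache.list_uids("INBOX")
cache.list_uids("INBOX")
print(len(calls))  # the fake database was queried only once
```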

The more expensive solution is to simply upgrade the database’s instance size, which we tried to avoid because it would both require downtime and increase cost. Ultimately we just upgraded it for the moment, doubling its IOPS capacity, though given that the database now has more RAM, its IOPS usage also seems to have halved so far.