Unplanned Database Migration

Posted at 2025-12-06T13:11:56-0500

Summary

In short: unexpected database load, combined with a maintenance task, depleted our I/O capacity faster than anticipated and forced an emergency database upgrade that resulted in 44 minutes of downtime.

Impact

This past Thursday, we had to perform an unplanned database migration starting at 12:37 EST. This required us to go into maintenance mode for 44 minutes, which prevented everyone from sending and receiving email.

Timeline

We had been seeing significant demand on our database over the past few weeks. We began implementing query optimizations to help relieve the burden on the database, and at the same time started planning a database upgrade for the near future. Unfortunately, that timeline was accelerated Thursday morning by one of the improvements we released, combined with higher-than-normal demand on the database.

Around 08:20 EST that day, we received a notification from our alarm system that our database was using more resources than it could sustain under “baseline” conditions. This happens sometimes and is usually not a cause for alarm: we can exceed baseline resources without issue, as long as we don’t exceed them for too long. In the past, though, this alarm normally fired between 10:30 EST and 11:30 EST due to increased demand from US west coast customers, so the early trigger was the initial warning sign.

We remained at 74% of our “burstable” capacity until 09:50 EST, at which point we started burning through excess capacity as west coast demand ramped up, as we had in the past. This continued until 11:35 EST, when we stabilized at around 15% of capacity. That was far lower than we were expecting or had ever seen before, but it did level off. Our hope at this point was that capacity would hold and then slowly recover as the day went on, which would have allowed us to perform a (sooner than anticipated) upgrade that evening. Instead, at around 12:05 EST, capacity started to deteriorate again, and we hit 0% at 12:24 EST.

Once that happens, customers get significantly throttled and bigger issues start to crop up, so we began putting the server into maintenance mode so that we could perform the upgrade. By 12:37 EST, the server was in maintenance mode and we began the database upgrade. By 13:25 EST, the upgrade was complete and we restarted the mail servers.

Technical details

Over the last two weeks, we noticed that our EBS I/O credits were depleting faster than normal. For those not versed in AWS, EBS stands for “Elastic Block Store,” and it’s the technology AWS uses to save data to disk. When you create a database, you choose what type of instance you’ll use (think CPU and RAM) as well as what kind of storage you’ll use (i.e., how fast you can read and write to disk). Your ability to read and write data depends on both the type of instance you have (how much data the instance can transfer between disk and memory) and the storage medium you choose (what the disk itself is capable of delivering). AWS allows you to exceed the baseline performance for a given instance, but only up to a certain point. Once you exceed that threshold, AWS throttles your reads/writes back to the baseline performance. The only way to get more performance is to perform fewer reads/writes (i.e., stay below the baseline) so that you can start building credits back up, or to shut down and upgrade to a bigger instance.
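A helpful (if simplified) way to think about this is as a credit bucket: credits refill whenever you run below the baseline, drain whenever you run above it, and once the bucket is empty you get throttled. The sketch below is an illustrative model only, not AWS’s exact accounting; the baseline and burst figures match our old instance, while the bucket size is an assumption derived from “burst for up to 30 minutes.”

# Simplified "leaky bucket" model of burstable I/O credits.
# Illustrative only; not AWS's actual accounting. The bucket size is an
# assumption: enough credits to sustain full burst for roughly 30 minutes.

BASELINE_IOPS = 4000    # IOPS the instance can sustain indefinitely
BURST_IOPS = 15700      # peak IOPS while credits remain
BUCKET_CAPACITY = (BURST_IOPS - BASELINE_IOPS) * 30 * 60   # "excess" I/Os


def simulate(workload, credits=BUCKET_CAPACITY):
    """Step through (iops, seconds) intervals and track remaining credits.

    Credits drain when actual IOPS exceed the baseline and refill
    (up to capacity) when usage drops below it.
    """
    for iops, seconds in workload:
        delta = (BASELINE_IOPS - iops) * seconds   # positive refills, negative drains
        credits = min(BUCKET_CAPACITY, max(0, credits + delta))
        pct = 100 * credits / BUCKET_CAPACITY
        print(f"{iops:>6} IOPS for {seconds // 60:>3} min -> {pct:5.1f}% of credits remaining")
        if credits == 0:
            print("Credits exhausted: throttled back to baseline IOPS")
    return credits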

Up until this past Thursday, our database was on an AWS instance that allowed us to meet demand during peak times without paying for a bigger machine than we needed during off-peak hours. We would regularly dip into burst capacity, but never to a degree that caused us much concern. The challenge with this type of setup, though, is that if we ever ran through our “burstable I/O credits,” we’d get throttled at exactly the wrong moment for our users, which is what happened Thursday.

In more specific terms, our prior instance had a 4000 IOPS (i.e., input/output operations per second) baseline and could burst to 15700 IOPS for up to 30 minutes every day. We started to exceed the limit early that morning (EST), at least partially due to a change to one of the indexes we were rebuilding to improve query performance. In hindsight, we should have made this change in the evening or over the weekend rather than on a Thursday morning. The change pushed our IOPS to 7000 for about 10 minutes while the index was rebuilt. We then saw the normal spike in activity around 10:10 EST, but it was bigger than usual: also around 7000 IOPS, for another 15 minutes or so. This was, unfortunately, followed by an even larger spike of around 8000 IOPS for a 10 minute period starting at 11:00 EST. The highest spikes we had seen over the last 4 weeks were all below 7000 IOPS, and prior to that we hadn’t had many alarms in several months thanks to a caching library Scott implemented earlier this year. So, at 11:00 EST, our credits started to deplete fast. They leveled out at around 15% of capacity by 11:35 EST, but the final blow came at 12:15 EST, when we saw another spike close to 8000 IOPS, our credit balance was fully depleted, and we started the upgrade process.
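For a rough sense of scale, here is how much of that (assumed) credit bucket each spike alone would have consumed. The durations are rounded from the timeline, the final spike’s length is an assumption, and the steady above-baseline load between spikes is ignored, which is why the real balance fell faster than these numbers alone would suggest.

# Back-of-the-envelope cost of each spike against the assumed credit bucket.
BASELINE = 4000
BUCKET = (15700 - 4000) * 30 * 60    # same assumed bucket size as the sketch above

spikes = [
    ("index rebuild (early morning)", 7000, 10 * 60),
    ("morning traffic (10:10 EST)",   7000, 15 * 60),
    ("midday spike (11:00 EST)",      8000, 10 * 60),
    ("final spike (12:15 EST)",       8000, 10 * 60),   # duration assumed
]

for label, iops, seconds in spikes:
    excess = (iops - BASELINE) * seconds            # I/Os above the baseline
    print(f"{label}: {excess:,} excess I/Os (~{100 * excess / BUCKET:.0f}% of the bucket)")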

We upgraded to an instance with a 6000 IOPS baseline and a 40000 IOPS maximum. This represents a 50% increase in baseline capacity over our previous instance and is well above our average needs. The machine also has 2 more vCPUs and twice as much RAM, which should allow us to keep more data in memory (i.e., more caching and fewer IOPS) and respond to queries faster.

Next Steps

Going forward, we will keep a closer eye on:

  • ReadIOPS / WriteIOPS (i.e., the actual IOPS we are using, knowing that we have a 6000 IOPS baseline that cannot be changed without upgrading to a bigger machine).
  • DiskQueueDepth (our canary in the coal mine - if it starts to grow, it’s a sign we are at capacity). A sample alarm setup is sketched after this list.
  • ReadThroughput / WriteThroughput (since we have a ~1250 MB/s ceiling), though this is likely less important since most of our reads and writes are small.
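As an example of what that monitoring could look like, here is a minimal sketch using boto3 and CloudWatch alarms on the standard RDS metrics. The instance identifier, SNS topic, and thresholds are illustrative placeholders rather than our actual configuration.

# Sketch: CloudWatch alarms for the metrics above, using boto3.
import boto3

cloudwatch = boto3.client("cloudwatch")

DB_INSTANCE = "primary-db"                                   # hypothetical identifier
SNS_TOPIC = "arn:aws:sns:us-east-1:123456789012:db-alerts"   # hypothetical topic

alarms = [
    # (metric, threshold, description) - thresholds are illustrative
    ("ReadIOPS", 5000, "Read IOPS approaching the 6000 IOPS baseline"),
    ("WriteIOPS", 5000, "Write IOPS approaching the 6000 IOPS baseline"),
    ("DiskQueueDepth", 10, "I/O requests are queueing; likely at capacity"),
]

for metric, threshold, description in alarms:
    cloudwatch.put_metric_alarm(
        AlarmName=f"{DB_INSTANCE}-{metric}-high",
        AlarmDescription=description,
        Namespace="AWS/RDS",
        MetricName=metric,
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": DB_INSTANCE}],
        Statistic="Average",
        Period=60,                 # one-minute granularity
        EvaluationPeriods=5,       # sustained for 5 minutes before alerting
        Threshold=threshold,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=[SNS_TOPIC],
    )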

We are also looking at additional optimization strategies such as:

  • Continuing to optimize query performance wherever possible
  • Distributing reads across the primary database and read-replicas
  • Performing additional caching to eliminate the need to hit the database in the first place (see the sketch after this list)
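To illustrate the caching idea, here is a minimal cache-aside sketch. Redis, the table and column names, and the helper function are illustrative assumptions, not a description of our actual stack or schema.

# Sketch: cache-aside pattern so hot reads skip the database entirely.
import json
import redis

cache = redis.Redis(host="localhost", port=6379)
CACHE_TTL = 300  # seconds; tune per data set


def get_mailbox_settings(user_id, db_cursor):
    """Return settings from cache if present; otherwise read from the DB and cache them."""
    key = f"mailbox-settings:{user_id}"      # hypothetical key scheme
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)            # cache hit: no database I/O at all

    # Cache miss: fall back to the database (hypothetical table/column).
    db_cursor.execute(
        "SELECT settings FROM mailbox_settings WHERE user_id = %s", (user_id,)
    )
    row = db_cursor.fetchone()
    if row is not None:
        cache.setex(key, CACHE_TTL, json.dumps(row[0]))   # populate cache for next time
        return row[0]
    return None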

Lastly, we plan to perform future upgrades using a blue/green deployment, and to do so sooner than we did in this instance. Had we upgraded sooner, we would have had 1-2 minutes of downtime rather than 44. We would also have been able to notify everyone of the timeline and plan, rather than springing it on everyone unexpectedly. We were trying to step into the upgrade cautiously, but we’ll move faster next time to ensure minimal downtime from these types of issues.
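For reference, a blue/green upgrade along these lines might look roughly like the following with boto3. The deployment name and source ARN are placeholders, and the exact parameters (engine version, parameter groups, target instance class) depend on the specific upgrade, so treat this as a sketch rather than a runbook.

# Sketch of the blue/green flow with boto3 and RDS.
import boto3

rds = boto3.client("rds")

# 1. Create a "green" copy of the production ("blue") database.
response = rds.create_blue_green_deployment(
    BlueGreenDeploymentName="db-upgrade",                        # placeholder name
    Source="arn:aws:rds:us-east-1:123456789012:db:primary-db",   # placeholder ARN
)
deployment_id = response["BlueGreenDeployment"]["BlueGreenDeploymentIdentifier"]

# 2. Apply changes on the green side (e.g., a larger instance class),
#    test against it, and wait for replication to catch up.

# 3. Switch over: traffic moves to the green database, typically within
#    a minute or two rather than the 44 minutes of downtime we saw.
rds.switchover_blue_green_deployment(
    BlueGreenDeploymentIdentifier=deployment_id,
    SwitchoverTimeout=300,   # seconds allowed for the switchover to complete
)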

We’re sorry for the unexpected downtime and for not fixing this sooner. We appreciate everyone’s patience, and we’ll try to do better next time!