Heavy database load

Posted at 2022-09-07T23:59:00-07:00 by Scott

Impact

For around two and a half hours, the service may have been slow or returned unexpected errors.

Timeline

  • At 11:51 AM PST, database queries became slow or unresponsive.
  • Staff was alerted to the problem at 2:20 PM PST and resolved it a few minutes later by stopping a maintenance job that was placing a heavy load on the database.

Technical details

The same file-cleaning process mentioned in the September incident had been rewritten to operate in small chunks, skip any rows locked by other transactions, and periodically vacuum the table to reclaim the space left by the deleted rows. While it was disabled, a backlog of deletable records had built up.
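As an illustration, here is a minimal sketch of one such chunked pass, assuming a PostgreSQL database accessed via psycopg2; the stale_files table, the deletable flag, and the batch size are invented for the example and are not the actual schema:

    # Hypothetical chunked cleanup pass; table and column names are illustrative.
    import psycopg2

    BATCH_SIZE = 500

    def delete_one_chunk(conn):
        """Delete up to BATCH_SIZE old rows, skipping rows locked by other transactions."""
        with conn.cursor() as cur:
            cur.execute(
                """
                DELETE FROM stale_files
                WHERE id IN (
                    SELECT id FROM stale_files
                    WHERE deletable
                    ORDER BY id
                    LIMIT %s
                    FOR UPDATE SKIP LOCKED
                )
                """,
                (BATCH_SIZE,),
            )
            deleted = cur.rowcount
        conn.commit()
        return deleted

Keeping each batch small means no single statement holds locks or generates I/O for very long.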

Since earlier attempts to clear this backlog had caused high database load, a periodic job was added to clean it up slowly but steadily. This worked well for many days, until the deletion query and other database queries suddenly stalled.
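A hedged sketch of what such a periodic job could look like, reusing delete_one_chunk from the sketch above; the batch limit, pause length, and per-pass VACUUM are illustrative assumptions rather than the job's exact parameters:

    # Hypothetical scheduled cleanup pass: small batches with pauses in between to
    # cap sustained I/O, plus a VACUUM at the end so dead tuples don't pile up.
    import time
    import psycopg2

    def cleanup_pass(dsn, max_batches=20, pause_seconds=5):
        conn = psycopg2.connect(dsn)
        try:
            for _ in range(max_batches):
                if delete_one_chunk(conn) == 0:   # nothing left to delete this pass
                    break
                time.sleep(pause_seconds)         # spread the load out over time
            conn.autocommit = True                # VACUUM cannot run inside a transaction
            with conn.cursor() as cur:
                cur.execute("VACUUM stale_files")
        finally:
            conn.close()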

We think this was because the database's underlying storage volume ran out of IOPS burst credits. Our database is not very big and we didn't overprovision storage, so on the gp2 volume type AWS allocated it only 750 IOPS, with bursts of up to 3,000. Heavy loads such as the one the cleanup query induced could therefore exhaust the burst credits and tank database performance.
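To make the arithmetic concrete, here is a back-of-the-envelope calculation using AWS's published gp2 figures; the 250 GiB volume size is inferred from the 750 IOPS baseline rather than something we measured:

    # gp2 baseline is 3 IOPS per GiB; bursts above it spend from a credit bucket.
    volume_gib = 250                       # inferred: 750 IOPS / 3 IOPS per GiB
    baseline_iops = 3 * volume_gib         # 750 IOPS
    burst_iops = 3000                      # gp2 burst ceiling for volumes up to 1 TiB
    credit_bucket = 5_400_000              # I/O credits in a full gp2 burst bucket

    # Load above the baseline drains the bucket at (load - baseline) credits per second.
    drain_rate = burst_iops - baseline_iops            # 2,250 credits/s
    minutes_to_empty = credit_bucket / drain_rate / 60
    print(f"Burst exhausted after ~{minutes_to_empty:.0f} minutes at {burst_iops} IOPS")
    # -> about 40 minutes, after which the volume is throttled back to 750 IOPS

So a sustained heavy load can look fine for a while and then degrade sharply once the credits run out.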

Fortunately, the gp3 volume type became available on AWS in November, which gives us a constant 3,000 IOPS baseline and the ability to provision more IOPS as needed.
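For reference, a rough sketch of switching an EBS volume to gp3 with boto3; the volume ID is a placeholder, and if the database runs on RDS the equivalent change is made through the DB instance's storage settings instead:

    # Hypothetical gp3 migration; the volume ID below is a placeholder.
    import boto3

    ec2 = boto3.client("ec2")
    ec2.modify_volume(
        VolumeId="vol-0123456789abcdef0",  # placeholder
        VolumeType="gp3",
        Iops=3000,        # gp3 baseline; can be raised later if needed
        Throughput=125,   # MiB/s, the gp3 baseline throughput
    )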

Remediation

  • Hopefully the gp3 IOPS will suffice for a while.
  • We'll need to add monitoring in case database IOPS get saturated again (a sketch of one such alarm is below).
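As an illustration of the monitoring item above, here is a sketch of a CloudWatch alarm created with boto3; the namespace, identifier, threshold, and SNS topic are placeholders, and the right metric depends on whether the volume is raw EBS or RDS-managed storage:

    # Hypothetical alarm that fires before read IOPS reach the 3,000 IOPS ceiling.
    import boto3

    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_alarm(
        AlarmName="db-read-iops-high",
        Namespace="AWS/RDS",                  # assumption: RDS-managed database
        MetricName="ReadIOPS",
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "our-db"}],  # placeholder
        Statistic="Average",
        Period=300,
        EvaluationPeriods=3,
        Threshold=2500,                       # alert with headroom below 3,000 IOPS
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:sns:us-west-2:123456789012:alerts"],        # placeholder
    )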