Some users may have experienced gateway timeouts when connecting to services during the affected period.
- The problem occurred from roughly 5:30 PM to 6:45 PM PST on May 3rd.
- Staff were alerted by users at 5:45 PM PST.
- The problem was resolved by 6:45 PM PST.
A user began rapidly adding emails across multiple sessions, pegging one instance at 100% CPU. This caused timeouts for other users: adding emails generated more load than expected, because each message had to be scanned by SpamAssassin for Bayes learning.
- For immediate remediation, we disabled an expensive (and somewhat questionable) SpamAssassin scan on all added messages, which was intended to prime the Bayes token database with ham.
- Most likely, we will move expensive IMAP-driven tasks, like updating search indices and scanning, to a jobs queue, so they can be processed when servers have capacity rather than immediately.
- In the long run we do need systems to more appropriately:
- Balance load: The AWS load balancer utterly failed in this case, directing half of all traffic to an already very busy instance. Most likely we will switch to a different solution (requiring more effort on our part), as AWS ELB is not very customizable.
- Ensure users get fair access to server resources: Ideally servers would scale seamlessly as needed, but in the interim it makes sense to prioritize requests more effectively.
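The jobs-queue idea above can be sketched roughly as follows. This is a minimal illustration, not our actual implementation: the names `store_message`, `bayes_learn`, and `handle_append` are hypothetical stubs standing in for the real IMAP append path and SpamAssassin learning scan.

```python
import queue
import threading

# Hypothetical sketch: expensive work triggered by an IMAP append
# (Bayes learning scans, search-index updates) is enqueued instead of
# being run synchronously on the request path.
jobs = queue.Queue()
learned = []  # stands in for the Bayes token database

def store_message(message):
    # Fast path: persist the message immediately (stubbed out here).
    return message

def bayes_learn(message):
    # Expensive SpamAssassin-style learning scan, stubbed as an append.
    learned.append(message)

def handle_append(message):
    store_message(message)
    jobs.put(message)  # defer the scan instead of running it inline

def worker():
    # Background worker drains jobs as capacity allows.
    while True:
        message = jobs.get()
        bayes_learn(message)
        jobs.task_done()

threading.Thread(target=worker, daemon=True).start()

for i in range(5):
    handle_append(f"msg-{i}")
jobs.join()  # in production the worker simply runs continuously
```

The point of the split is that a burst of appends only grows the queue; the request handler returns quickly, and the scan work is absorbed at whatever rate the instance can sustain.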
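One common way to give users fair access to expensive operations, in the interim before seamless scaling, is a per-user token bucket. The sketch below is an assumption about how such prioritization could look, not a description of our system; `TokenBucket` and `check_user` are hypothetical names.

```python
import time

class TokenBucket:
    """Hypothetical per-user limiter: each user may perform `rate`
    expensive operations per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill tokens for the elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller defers or rejects the request

buckets = {}

def check_user(user, rate=5, capacity=10):
    # One bucket per user, so a single user's burst of appends
    # cannot starve everyone else on the instance.
    bucket = buckets.setdefault(user, TokenBucket(rate, capacity))
    return bucket.allow()
```

Requests that fail the check could be queued at lower priority rather than rejected outright, which matches the goal of prioritizing rather than denying.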
Why so late in publishing this and the previous issue?
This is my fault. Usually when issues like these occur, I work on an immediate fix, then follow up with more comprehensive fixes and refactors. Since this and the previous issue were (unfortunately) timed so close together, I got caught up in another fix cycle instead of posting the previous one, and only got to this one now.