From 12:46 PM PDT to 9:04 PM PDT, users were intermittently unable to connect to Purelymail servers. The cause was an underlying issue at AWS (unfortunately, we could not find a permalink to the AWS incident).
Impact: As far as we could tell, users (and some of our scripts) occasionally experienced timeouts connecting to servers.
Remediation: Really, we just waited for AWS to fix it. (Had the issue been more serious, we could’ve done more, but hasty action is itself a risk.)
Takeaways: We could’ve mitigated this by spreading infrastructure across more than one AWS region, but honestly, staying in a single region keeps infrastructural complexity much lower, and this isn’t bad for the only impacting issue in five years.
AWS described the issue as:

> Sep 28 9:04 PM PDT Beginning at 11:10 AM PDT, we began experiencing increased error rates and delays for newly launched EC2 instances in a single Availability Zone (use1-az2). Between 2:00 PM and 2:50 PM PDT, we also experienced an unrelated issue that resulted in increased error rates for new instance launches in use1-az1. As of 8:37 PM PDT, we observed recovery of the underlying subsystem responsible for propagating the network mappings. At this time, customers should be in recovery for increased error rates and delays in network propagation for newly launched EC2 instances in a single Availability zone (use1-az2). Additionally, customers should see recovery for elevated connectivity errors for PrivateLink. We will gradually shift traffic back in to the previously affected Availability Zone over the next few hours, but do not expect our traffic shift back to the affected Availability Zone to result in any additional customer impact. Given that we have now resolved the issue, we recommend shifting traffic back to the previously affected Availability Zone at your convenience. We also wanted to reiterate that this event was not a repeat of the networking issue that occurred on September 18th. Although both issues affected network mapping propagation times, they involved very different subsystems within the EC2 Networking Distribution Plane. We fully understand the impact that these types of issues have on customers’ workloads, and we apologize for any inconvenience. The issue has been resolved and the service is operating normally.
(Though in our experience, it seemed to affect existing EC2 instances as well, not just newly launched ones.)