On August 26th, 2020, we experienced a major service outage lasting 21 minutes, from 12:57 to 13:18 PST (all times below are PST), due to a burst of request volume (a 9,000% increase). During the incident, we were throttled by both our primary and secondary email providers (Postmark and SendGrid), so login emails were not sent out until the provider issues were resolved. Our infrastructure became unresponsive as the excessive requests overloaded our services. At 13:15, our services started to recover, with the majority of requests being served, and by 13:18 we had fully recovered.
The 9,000% burst in user login requests required cluster scaling beyond our standard upper limits and caused our infrastructure to become unresponsive. The burst also caused our email providers (Postmark and SendGrid) to throttle our requests.
Starting at 12:45, we observed a significant increase in user login requests. The initial burst of login requests caused downtime in our authentication service from 13:00 to 13:15. The service began to recover at 13:15, after we raised our overall scaling upper limits and addressed the problem with our email provider.
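The report does not name our orchestration layer, so purely as an illustrative sketch: in a Kubernetes-style setup, the "scaling upper limit" we raised would correspond to an autoscaler ceiling like the `maxReplicas` field below (all names and values are hypothetical):

```yaml
# Hypothetical HorizontalPodAutoscaler for the authentication API.
# All names and numbers are illustrative, not our real configuration.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: auth-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: auth-api
  minReplicas: 4
  maxReplicas: 40   # the "upper limit" that had to be raised mid-incident
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

Raising a ceiling like this during an incident works, but sizing it for burst traffic ahead of time avoids the manual step.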
At 12:57, we were alerted that our primary email provider (Postmark) was throttling us. We contacted Postmark immediately to have the throttling lifted. Simultaneously, we switched to our backup email provider (SendGrid). The switchover worked for only a few moments: we soon discovered that our SendGrid dedicated sending IP had been flagged as spam by major email clients (Google, Yahoo, Outlook, etc.). This caused massive delays (1+ hours) in emails reaching their destinations (shown in the image below). Postmark uses a pool of dedicated IPs to support high-volume email traffic, so we had never observed a noticeable delivery delay while emails were processed through them. At 13:18, once Postmark had resolved the issue, we switched back, and emails were sent out through Postmark successfully afterward.
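The primary-to-backup switchover described above can be sketched as a simple priority-ordered failover. This is a minimal illustration, not our production code: the `ThrottledError` class and the `send` callables are hypothetical stand-ins, not actual Postmark or SendGrid SDK APIs.

```python
# Hypothetical sketch of primary -> backup email failover.
# ThrottledError and the sender callables are illustrative stand-ins,
# not real Postmark/SendGrid client APIs.

class ThrottledError(Exception):
    """Raised when a provider rate-limits our sending."""

def send_login_email(providers, message):
    """Try each (name, send) pair in priority order; fall back on throttling."""
    errors = []
    for name, send in providers:
        try:
            send(message)
            return name  # report which provider actually delivered
        except ThrottledError as exc:
            errors.append((name, exc))
    raise RuntimeError(f"all providers throttled: {errors}")


# Usage sketch with stub senders simulating the incident:
def postmark_send(msg):
    raise ThrottledError("rate limited")  # primary throttles us

def sendgrid_send(msg):
    pass  # backup accepts the message

used = send_login_email(
    [("postmark", postmark_send), ("sendgrid", sendgrid_send)],
    {"to": "user@example.com", "subject": "Login link"},
)
# used == "sendgrid"
```

Note that failover alone was not enough here: the backup path must also be healthy (in our case, SendGrid's dedicated IP reputation was the hidden failure mode).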
During the incident, our Nginx instances ran into out-of-memory (OOM) errors. We identified the cause: a high volume of simultaneous connections combined with long response times from the API service. Our API servers were heavily utilized in both CPU and memory. At 12:55, our autoscaler kicked in and doubled our API fleet within the first 10 minutes, but this proved insufficient. During the scaling process, our replicas ran at 100% CPU the entire time and were killed by OOM errors. This was most likely caused by new requests continuing to arrive while old requests were still in flight, waiting on results from underlying resources (i.e., the database). Prior to the incident, our average memory usage was 32MB with a 1.7GB cap. During the incident, memory spiked to 2.8GB, a ~90x increase, which caused the OOM errors despite the pre-allocated ~50x headroom.

On the database side, our RDS CPU utilization went from an average of 50% to 100%. RDS utilization is hard-capped, and no dynamic scaling was possible during the incident. We have identified slow queries and are looking into allocating more resources to our database instance.
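The OOM pattern above, where new requests keep piling up behind slow in-flight ones until memory is exhausted, is the classic case for load shedding: cap in-flight work and reject the overflow fast instead of queueing it. A minimal sketch of the idea, not our actual stack, using a non-blocking semaphore:

```python
# Illustrative load-shedding sketch (not our production code): bound the
# number of in-flight requests so a burst is rejected immediately instead
# of queueing until the process runs out of memory.
import threading

class LoadShedder:
    def __init__(self, max_in_flight):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def handle(self, work):
        # blocking=False returns immediately instead of queueing the request
        if not self._slots.acquire(blocking=False):
            return "503 shed"  # shed load under burst; client can retry
        try:
            return work()
        finally:
            self._slots.release()

shedder = LoadShedder(max_in_flight=2)

# Simulate two slow requests holding both slots (e.g., waiting on the DB):
shedder._slots.acquire(blocking=False)
shedder._slots.acquire(blocking=False)
burst = shedder.handle(lambda: "200 ok")   # rejected: "503 shed"
shedder._slots.release()
shedder._slots.release()

normal = shedder.handle(lambda: "200 ok")  # served: "200 ok"
```

Shedding at the edge keeps memory bounded during a burst and protects the hard-capped database behind the API, at the cost of fast-failing some requests.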