Magic Service Outage
Incident Report for Magic
Postmortem

Summary

On August 26th, 2020, we experienced a major service outage lasting 21 minutes, from 12:57 PST to 13:18 PST (all times below are in PST), caused by a burst of request volume (a 9000% increase). During the incident, we were throttled by both our primary and secondary email providers (Postmark and SendGrid), so login emails were not sent out until the provider issues were resolved. Our infrastructure also became unresponsive as the excessive requests overloaded our services. At 13:15, our services started to recover, with the majority of requests being served, and by 13:18 we had fully recovered.

Root Cause(s)

The 9000% burst in user login request volume required cluster scaling beyond our standard upper limits, which caused our infrastructure to become unresponsive. The burst also caused our email providers (Postmark and SendGrid) to throttle our requests.

Impact and Analysis

Starting at 12:45, we observed a significant increase in user login requests. The initial burst of login requests caused downtime of our authentication service from 13:00 to 13:15. The service started to recover at 13:15, after we increased our overall scaling upper limits and addressed the problems with our email providers.

At 12:57, we were alerted that our primary email provider (Postmark) was throttling us. We contacted Postmark immediately to have the throttling lifted. Simultaneously, we switched to our backup email service provider (SendGrid). The switchover worked only briefly: we soon discovered that our dedicated sending IP at SendGrid had been flagged as spam by email clients (Google, Yahoo, Outlook, etc.). This caused massive delays (1+ hours) in emails reaching their destination (shown in the image below). Postmark uses a pool of dedicated IPs to support high volumes of email traffic, so we never observed a noticeable delivery delay while emails were being processed through them. At 13:18, once Postmark had resolved the throttling, we switched back, and emails were sent out through Postmark successfully from then on.
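
For context on the switchover mechanics, here is a minimal sketch of primary/secondary provider failover. The helper names (`send_via_postmark`, `send_via_sendgrid`) and the simulated throttling are hypothetical stand-ins, not our production code.

```python
# Minimal sketch of primary/secondary email provider failover (illustrative only).
# The two send_via_* helpers are hypothetical stand-ins for the real provider SDK calls.

class ProviderThrottled(Exception):
    """Raised when a provider rejects a send because of rate limiting."""


def send_via_postmark(message: dict) -> None:
    # In production this would call Postmark's API; here we simulate the throttling we hit.
    raise ProviderThrottled("primary provider is rate limiting us")


def send_via_sendgrid(message: dict) -> None:
    # In production this would call SendGrid's API.
    print(f"sent via SendGrid: {message['subject']}")


def send_login_email(message: dict) -> str:
    """Try the primary provider first; fall back to the secondary when throttled."""
    try:
        send_via_postmark(message)
        return "postmark"
    except ProviderThrottled:
        # The fallback keeps mail flowing, but delivery speed now depends on the
        # secondary provider's sending-IP reputation, which is the gap we hit.
        send_via_sendgrid(message)
        return "sendgrid"


if __name__ == "__main__":
    print("delivered through:", send_login_email({"subject": "Your login link"}))
```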

During the incident, our Nginx layer ran into OOM errors. We identified the cause: a high volume of simultaneous connections combined with long response times from the API service. Our API servers were heavily utilized in both CPU and memory. At 12:55, our autoscaler kicked in and doubled our API fleet within the first 10 minutes, but this did not prove sufficient. During the scaling process, our replicas ran at 100% CPU at all times and were killed due to OOM errors, most likely because new requests kept arriving while older requests were still in flight, waiting on results from underlying resources (i.e. the database). Prior to the incident, our average memory usage was 32 MB against a 1.7 GB cap. During the incident, memory usage spiked to 2.8 GB, a roughly 90x increase over the baseline, which caused the OOM errors despite the roughly 50x headroom initially pre-allocated. On the database side, our RDS CPU utilization went from roughly 50% on average to 100%. This is a hard cap on RDS utilization, and no dynamic scaling was possible during the incident. We have identified slow queries and are looking into allocating more resources to our database instance.
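
To make the memory headroom figures above concrete, here is the back-of-the-envelope arithmetic behind the ~50x and ~90x numbers, using the averages quoted in this section:

```python
# Back-of-the-envelope check of the memory figures quoted above.
baseline_mb = 32           # average per-replica memory usage before the incident
cap_mb = 1.7 * 1024        # configured memory cap (~1741 MB)
peak_mb = 2.8 * 1024       # observed peak during the incident (~2867 MB)

preallocated_headroom = cap_mb / baseline_mb   # ~54x above the baseline
peak_multiple = peak_mb / baseline_mb          # ~90x above the baseline

print(f"pre-allocated headroom: ~{preallocated_headroom:.0f}x")
print(f"peak usage multiple:    ~{peak_multiple:.0f}x (exceeds the cap, hence the OOM kills)")
```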

Lessons Learned

  • Regular load testing and benchmarking are key to understanding how much traffic our services can sustain and support.
  • Pre-scale our services to at least 4x their normal capacity when we anticipate a similar request pattern and volume. This helps keep the existing fleet responsive while the autoscaler kicks in and adds new instances to support the load (see the sketch after this list).
  • Standardize a communication cadence with internal and external stakeholders to set expectations on the anticipated request volume that can be reasonably supported.
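
As a rough illustration of the pre-scaling rule of thumb above: the 4x factor comes from this postmortem, while the function name, example numbers, and the hard ceiling are hypothetical.

```python
# Hypothetical pre-scale helper: bump the replica count ahead of an anticipated burst.
PRESCALE_FACTOR = 4  # "at least 4x", per the lesson above

def prescaled_replicas(current: int, ceiling: int, factor: int = PRESCALE_FACTOR) -> int:
    """Replica count to set before an anticipated burst, capped at the fleet ceiling."""
    return min(current * factor, ceiling)

# Example: an API fleet currently at 6 replicas with an autoscaler ceiling of 40.
print(prescaled_replicas(current=6, ceiling=40))  # -> 24
```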

Action Items

  • [DONE] Support multiple sender IP addresses in SendGrid to mitigate email provider throttling 
  • [DONE] Investigate slow DB queries 
  • [DONE] Create template to pre-scale our services
  • [DONE] Increase auto-scaling limit for our services x3
  • [DONE] Add Nginx memory-based auto-scaling
  • [DONE] Add API memory-based auto-scaling (see the sketch after this list)
  • [DONE] Permanently increase public Nginx node count from 3 -> 5
  • [DONE] Stress test email provider deliverability with burst volume to simulate the incident
  • [DONE] Create cadence to perform regular service load testing + benchmarking
  • [IN PROGRESS] Scale RDS permanently from db.r5.large -> db.r5.xlarge
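
The two memory-based auto-scaling items reduce to a rule of roughly this shape. This is a simplified sketch of the standard utilization-ratio scaling formula, not our actual autoscaler configuration; the target and ceiling values are illustrative.

```python
import math

# Simplified memory-based scale-out rule (target and ceiling values are illustrative).
def desired_replicas(current: int, avg_memory_utilization: float,
                     target_utilization: float = 0.6, ceiling: int = 40) -> int:
    """Scale the fleet so that average memory utilization moves back toward the target."""
    if avg_memory_utilization <= target_utilization:
        return current
    scaled = math.ceil(current * avg_memory_utilization / target_utilization)
    return min(scaled, ceiling)

# Example: 10 replicas averaging 90% memory utilization against a 60% target -> 15 replicas.
print(desired_replicas(current=10, avg_memory_utilization=0.9))
```
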
Posted Sep 18, 2020 - 14:35 PDT

Resolved
The 9000% burst in user login request volume required cluster scaling beyond our standard upper limits, which caused our infrastructure to become unresponsive. The burst also caused our email providers (Postmark and SendGrid) to throttle our requests. A postmortem will be published to provide more context for this incident.
Posted Aug 26, 2020 - 01:00 PDT