A sudden spike in traffic occurred, and our API pods did not scale out quickly enough to absorb it.
The following operations were either unavailable or experienced partially degraded service:
During the outage, our engineers quickly convened a virtual war room to triage the situation and identify the fastest, highest-impact remediation. After investigating the issue, we began increasing our overall server fleet size, which stabilized service under the traffic spike.
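As a rough sketch of what this kind of manual scale-out can look like in a Kubernetes setting: raising the replica count on the API Deployment (alongside adding node capacity). The Deployment name and replica figures below are illustrative assumptions, not our actual configuration.

```yaml
# Hypothetical excerpt of an API Deployment manifest.
# The name and replica counts are assumptions for illustration only.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server      # assumed workload name
spec:
  replicas: 60          # manually raised (e.g. from 20) to absorb the spike
```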
We also found that our scale-up policies had room for improvement. Our team immediately tuned them to better accommodate sharp traffic spikes in the future, and we verified the improved behavior during subsequent spikes.
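For readers curious what a more aggressive scale-up policy can look like, below is a minimal sketch using the Kubernetes autoscaling/v2 HorizontalPodAutoscaler. All names, thresholds, and limits here are assumptions for illustration, not our production values.

```yaml
# Hypothetical HPA illustrating faster scale-up behavior.
# Names, targets, and limits are illustrative assumptions.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 10
  maxReplicas: 100
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60    # scale out earlier, before saturation
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0  # react immediately to a spike
      policies:
        - type: Percent
          value: 100                 # allow doubling the pod count...
          periodSeconds: 30
        - type: Pods
          value: 20                  # ...or adding up to 20 pods per period
          periodSeconds: 30
      selectPolicy: Max              # take the more aggressive of the two
```

The key idea is the `behavior.scaleUp` stanza: a zero stabilization window plus permissive `Percent` and `Pods` policies lets the autoscaler add capacity quickly during a spike, while `scaleDown` defaults remain conservative.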