This one’s juicy. At one point, the front-end was blocked up, so the back-end saw less traffic and scaled down. Then, when the traffic came flooding back, the back-end didn’t have enough capacity to handle it. We can all learn from this.
Cloudflare has what amounts to a sophisticated staging environment for testing new code.
Yan Zhai — Cloudflare
Sometimes rolling back doesn’t actually get you back to a good state, especially when there’s pent-up demand.
Rachel By the Bay
Here’s Google’s follow-up on a Google Meet outage earlier this month.
Those are some seriously big database servers.
Josh Aas and James Renken — Let’s Encrypt
A great general overview of all aspects of incident response, including definitions and best practices.
Check out what happens when you unleash a generalized language model AI on some log messages related to an incident.
Larry Lancaster — Zebrium
The CRE team at VMware undertook a project to find and reduce toil. Note that “with VMware CRE” does not mean “with some product named VMware CRE™”.
Gustavo Franco — VMware
This is Slack’s RCA for their outage earlier this month. It’s a great example of a complex incident with many contributing factors, and certainly no single “root cause”.