Articles
Awesome resource! In each section, they explain what to include, why to include it, and an example from their playbook.
Blake Thorne — Atlassian
I didn’t make it to Failover Conf, and it sounds like I missed a great time, so I’m especially grateful for this writeup.
Rich Burroughs — FireHydrant
And this one!
Hannah Culver — Blameless
I’m a little late with this one, sorry folks! Survey ends tomorrow, April 27.
This is an anonymous survey to look at the impact that COVID-19 has had on on-call teams in tech.
FireHydrant
Most post-incident review documents are written to be filed, not written to be read.
This slide deck is awesome and well worth the read.
John Allspaw — Adaptive Capacity Labs
A deep dive into the math behind anomaly detection.
Nikita Butakov — Ericsson
This article brings together thoughts on on-call work during the pandemic from folks at different companies.
Rich Burroughs — FireHydrant
A frontend engineer shares their key takeaways from their time shadowing.
Laura Montemayor — GitLab
Outages
- GitHub
- Datadog
- Poloniex
- DigitalOcean
- Apple Pay
- ShipStation
- Sendy
- Sharp online store and IoT devices
- Sharp retooled one of its factories to produce masks and started selling them commercially. The increased load caused problems with their online store and existing consumer IoT devices.
- Discord
- Fastly
- Also a control plane issue earlier the same day. Full disclosure: Fastly is my employer.