Articles
Their on-call started out as four 24 hour shifts per person interspersed throughout the year. Find out how they transitioned to a new approach in a process that spanned the start of the pandemic.
Mary Moore-Simmons — GitHub
A new Meet version had a higher storage usage requirement, and a backend system filled up.
This is webinar on alert fatigue, coming up on January 14.
Sarah Wells — Financial Times
Jamie Dobson — Container Solutions
The chaos experiments you do for security purposes can often expose weak points in reliability as well.
Aaron Rinehart — Verica
Kelly Shortridge — Capsul8
Here are four nifty outside-the-box ideas to use the data you may already have.
Emily Arnott — Blameless
Their custom incident management tool, DropSEV, can detect incident-worthy availability drops and file an incident automatically, obviating the need for an engineer to decide on severity level on the fly.
Joey Beyda and Ross Delinger — DropBox
This one has some additional detail on a November outage involving MySQL replication lag.
Keith Ballinger — GitHub
Outages
- Slack
- My first couple hours of work this year were oddly quiet…
- Heroku
- Google Meet
- This is different from the one above.
- Fanduel
- Twitch
- Coinbase
- Archive of Our Own