This very clearly written paper describes the Google G Suite team’s search for a meaningful availability metric: one that accurately reflected what their end users experienced, and that could be used by engineers to pinpoint issues and guide improvements.
Hauer et al. — NSDI’20 (original paper)
Adrian Colyer — The Morning Paper (summary)
Their top 5 are:
- Use Meaningful Severity Levels
- Create Detailed Runbooks
- Load Balance Through Qualitative Metrics
- Get Ahead of Incidents
- Cultivate a Culture of On-Call Empathy
Emily Arnott — Blameless
Synchronizing clocks can be critical in an HA system, and Facebook went to great lengths to ensure clock accuracy.
Zoe Talamantes and Oleg Obleukhov — Facebook
You might end up just breaking things.
Dawn Parzych — LaunchDarkly
LinkedIn’s message search system takes advantage of the fact that relatively few users actually search their messages. It only builds a user’s search index the first time that user performs a search.
Suruchi Shah and Hari Shankar — LinkedIn
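The lazy-indexing idea above can be sketched in a few lines. This is a hypothetical toy (the class and method names are mine, not LinkedIn’s): writes stay cheap for users who never search, and the first search pays a one-time cost to build that user’s inverted index, which is then kept fresh on subsequent writes.

```python
from collections import defaultdict

class LazyMessageSearch:
    """Toy sketch of per-user lazy indexing (illustrative, not LinkedIn's code)."""

    def __init__(self):
        self.messages = defaultdict(list)   # user_id -> list of message texts
        self.indexes = {}                   # user_id -> inverted index (term -> msg ids)

    def add_message(self, user_id, text):
        # Writes are cheap: no index maintenance for users who never search.
        self.messages[user_id].append(text)
        if user_id in self.indexes:
            # Keep the index fresh only for users who have already searched.
            self._index_message(user_id, text, len(self.messages[user_id]) - 1)

    def _index_message(self, user_id, text, msg_id):
        index = self.indexes[user_id]
        for term in text.lower().split():
            index.setdefault(term, set()).add(msg_id)

    def search(self, user_id, term):
        # First search pays the one-time cost of building this user's index.
        if user_id not in self.indexes:
            self.indexes[user_id] = {}
            for msg_id, text in enumerate(self.messages[user_id]):
                self._index_message(user_id, text, msg_id)
        hits = self.indexes[user_id].get(term.lower(), set())
        return [self.messages[user_id][i] for i in sorted(hits)]
```

The trade-off is a slow first search for each user in exchange for skipping index work entirely for the (majority of) users who never search.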
This followup post from Bungie covers two related incidents in February that caused loss of user data.
An interview about how one company got its developers to join the on-call rotation. It covers how the developers were trained to build confidence, and what benefits came from joining.
Ben Linders — InfoQ
- The text of this incident originally mentioned Heroku, and it lines up with the Heroku outage below.
- They also had this unrelated outage.
- Microsoft Teams and Office 365
- Discord posted this gem of a followup analysis just a few days after their outage last week.
- Google Nest