Articles
After adopting a “full ownership” philosophy, this company faced burnout, with five or more separate developers on call simultaneously. Read about their awesome solution involving a shared on-call rotation staffed entirely by volunteers, spurred by the incentive of extra compensation.
Brian Scanlan — Intercom
What exactly is SRE and how does it relate to DevOps? Earlier this year, we (Liz Fong-Jones and Seth Vargo) launched a video series to help answer some of these questions and reduce the friction between the communities. This blog post summarizes the themes and lessons of each video in the series to offer actionable steps toward better, more reliable systems.
Liz Fong-Jones and Seth Vargo — Google
After a load test uncovered a scaling issue, they dug deep, finding issues with garbage collection settings, cascading failures, and an overeager retry strategy.
Val Markovic — LinkedIn
These tips cover the basics and will be especially useful for teams onboarding engineers who have never been on call before.
This article examines a case study of an EMS company attempting to adopt a just culture policy. There’s a great discussion of why it’s not a good idea to lay blame on individuals when systemic problems may be far more important.
Larry Boxman and Paul LeSage — JEMS (Journal of Emergency Medical Services)
In this third and final article in a series, Xero lays out their process for analyzing incidents after the fact. Thanks to the Xero folks for being so open about your processes and for taking the time to write these articles!
Karthik Nilakant — Xero
I like the nifty heat maps with example distributed traces. Neat idea!
JBD — Google
Outages
- Sutter Health
- Fortnite (incident analysis)
- I really love how deep and technical Fortnite is with their incident analysis articles! Here’s one for their outage in mid-April.
The Epic Team
- Google Compute Engine (us-east4 region)
- Atlassian Statuspage
- Roku
- Hulu