I’m going on vacation, so I’m going to prepare next week’s issue in advance. It’ll look much like most issues, except there won’t be an Outages section. See you all in two weeks!
Articles
Earlier articles in this series described a process for interviewing incident responders before a full retrospective meeting. This one covers what to do if you can’t conduct those interviews: the particular challenges that brings and how to deal with them.
Emily Ruppe — Jeli
Some interesting ideas on potential downsides of circuit breakers and how we might ameliorate them.
Marc Brooker
GitHub has had a bit of a hard time lately. Here’s an update on what they’re dealing with and how they’re planning to address it.
Keith Ballinger — GitHub
All sorts of “mean time to” metrics, including 6(!) different MTTR metrics and how they might be used.
Alex Ewerlöf — InfoQ
This is a huge report (100+ pages) on the benefits of a model in which development teams own the operation of their systems. There’s a lot in here, with carefully spelled-out pros/cons and cost/benefit analyses. Need to convince someone? Send them this.
We’ve written this playbook for CxOs, product managers, delivery managers, and operations managers.
Bethan Timmins and Steve Smith — Equal Experts
It’s easy to miss MTUs, until they sneak up on you and cause really confusing problems.
Aaron Kalair — Hudl
Should you compensate for on-call? How? I really want to see more articles about this, so send them my way if you see or write any.
Chris Evans — Incident.io
Some good tips in this article, and I love the case studies.
Prathamesh Sonpatki — Last9
Outages
- PagerDuty
- Apple App Store, Apple Music and iCloud
- GitHub
  - They had several incidents this week.
- .au TLD
  - DNSSec.
- Sportsbook.ag