SRE Weekly Issue #315

I’m going on vacation, so I’m going to prepare next week’s issue in advance. It’ll look much like most issues, except there won’t be an Outages section. See you all in two weeks!

Articles

Incident Analysis 101: Facilitating a Learning Review Without Prior Interviews

The previous articles in this series described a process of interviewing incident responders before a full retrospective meeting. This one discusses what to do if you can’t conduct those interviews, the particular challenges that brings, and how to deal with them.

Emily Ruppe — Jeli

Will circuit breakers solve my problems?

Some interesting ideas on potential downsides of circuit breakers and how we might ameliorate them.

Marc Brooker

[GitHub] An update on recent service disruptions

GitHub has had a bit of a hard time lately. Here’s an update on what they’re dealing with and how they’re planning to address it.

Keith Ballinger — GitHub

How to Best Use MTT* Metrics to Optimize Your Incident Response

All sorts of “mean time to” metrics, including 6(!) different MTTR metrics, and how they might be used.

Alex Ewerlöf — InfoQ

You Build It You Run It Playbook

This is a huge 100+-page report on the benefits of a model in which development teams own the operation of their systems. There’s a lot in here, with carefully spelled-out pros/cons and cost/benefit analyses. Need to convince someone? Send them this.

We’ve written this playbook for CxOs, product managers, delivery managers, and operations managers.

Bethan Timmins and Steve Smith — Equal Experts

Operation Jumbo Drop: How sending large packets broke our AWS network

It’s easy to miss MTUs, until they sneak up on you and cause really confusing problems.

Aaron Kalair — Hudl

What’s a fair compensation for being on-call?

Should you compensate for on-call? How? I really want to see more articles about this, so send them my way if you see or write any.

Chris Evans — Incident.io

How to Improve On-Call Experience!

Some good tips in this article, and I love the case studies.

Prathamesh Sonpatki — Last9

Outages

PagerDuty
Apple App Store, Apple Music and iCloud
GitHub
- They had several incidents this week.
.au TLD
- DNSSEC.
Sportsbook.ag
