SRE Weekly Issue #210

A message from our sponsor, VictorOps:

See the tangible benefits of incident management reporting and analytics when it comes to faster incident detection, acknowledgment, response and resolution. Read on to learn about real KPIs and incident metrics that drive reliability:


Netflix open sourced their incident management system.

Put simply, Dispatch is:

All of the ad-hoc things you’re doing to manage incidents today, done for you, and a bunch of other things you should’ve been doing, but have not had the time!

Kevin Glisson, Marc Vilanova, Forest Monsen — Netflix

I wasn’t aware of this little pitfall of memory cgroups.


Your failover DB instance is cute. Try 4x+ redundancy. That’s the kind of engineering required when designing systems to operate in space.

Glenn Fleishman — Increment

This post enumerates some of the risks introduced when a single person carries 100% of the on-call duties of a team, and shows why those risks are not simply eliminated by increasing the number of people in the rotation.

Daniel Condomitti — FireHydrant

This is a pretty nifty experiment showing the importance of letting folks use their judgement to handle unexpected situations rather than relying on adherence to procedures.

Thai Wood — Resilience Roundup (summary)

Makoto Takahashi, Daisuke Karikawa, Genta Sawasato and Yoshitaka Hoshii — Tohoku University (original paper)

FYI: SRECon Americas West has been rescheduled to June 2-4.

This week, we have another summary of the Physalia paper. I especially like the bit about poison pills.

Adrian Colyer — The Morning Paper (summary)

Brooker et al. — NSDI’20 (original paper)

In this case, “proof” means “formal proof”.

It’s not that software got so reliable without proof: it’s that systems that include software got so reliable without proof.

Lorin Hochstein


Updated: March 8, 2020 — 9:11 pm
A production of Tinker Tinker Tinker, LLC Frontier Theme