SRE Weekly Issue #145

A message from our sponsor, VictorOps:

When SRE teams track incident management KPIs and benchmarks, they better optimize the way they operate, helping SREs create more resilient teams and build more reliable systems:


An article on looking past human error in investigating air sports (definition) accidents, drawing on the writing of Don Norman. Special emphasis on slips versus mistakes:

“Slips tend to occur more frequently to skilled people than to novices”

Mara Schmid — Blue Skies Magazine

A VP of NS1 explains how his company rewrote and deployed their core service without downtime.

Shannon Weyric — NS1

This guide from Hosted Graphite has a ton of great advice and reads almost as if they’ve released their internal incident response guidelines. Bonus content: check out this exemplary post-incident followup from their status site.

Fran Garcia — Hosted Graphite

Check it out, Atlassian posted their incident management documentation publicly!

On Monday I gave a talk at DOES18 called “All the World’s a Platform”, where I talked about a bunch of the lessons learned by using and abusing and running and building platforms at scale.

I promised to do a blog post with the takeaways, so here they are.

Charity Majors

[…] at a certain point, it’s too expensive to keep fixing bugs because of the high-opportunity cost of building new features. You need to decide your target for stability just like you would availability, and it should not be 100%.

Kristine Pinedo — Bugsnag

Maelstrom is Facebook’s tool to assist engineers in safely moving traffic off of impaired infrastructure.

Adrian Colyer — The Morning Paper (summary)
Veeraraghavan et al. — Facebook (original paper)

Attempting to stamp out failure entirely can have the paradoxical effect of reducing resiliency to anomalous situations. Instead, we need to handle failure constructively.

Daniel Hummerdal — Safety Differently



Updated: October 28, 2018 — 8:24 pm