SRE Weekly Issue #152

A message from our sponsor, VictorOps:

SRE teams can leverage automation in chat to improve incident response and make on-call suck less. Learn the ins and outs about using automated ChatOps for incident response:

http://try.victorops.com/sreweekly/automated-chatops-in-incident-response

Articles

It’s hard to summarize all the awesome here, but it boils down to empathy, collaboration, and asking, “How can I help?”. These pay dividends all over an organization, especially in reliability.

Note: Will Gallego is my coworker, although I came across this post on my own.

Will Gallego

This followup post for a Google Groups outage was (fittingly) hidden away in a Google Group.

Thanks to Jonathan Rudenberg for this one.

Now I can link directly to specific incidents! I miss the graphs, though.

Jamie Hannaford — GitHub

I laughed so hard I scared my cats:

COWORKER: we need to find the root cause asap
ME: takes long drag the root cause is that our processes are not robust enough to prevent a person from making this mistake
COWORKER: amy please not right now”

Great discussion in the thread!

Amy Nguyen

In Air Traffic Control parlance, if a pilot or controller can’t satisfy with a request, they should state that they are “unable” to comply. It can be difficult to decide in the moment what one is truly “unable” to do. There are a lot of great lessons here that apply equally well to IT incident response.

Tarrance Kramer — AVweb

The common theme at KubeCon is that SRE teams at many companies produce reliable, reusable patterns for their developers to build with.

Beth Pariseau — TechTarget

This is the story of a tenacious fight to find out what went wrong during an incident. If you read nothing else, the Conclusion section has a lot of great tidbits.

Tony Meehan — Endgame

Here’s a new guide on how to apply Restorative Just Culture. This made me laugh:

They also fail to address the systemic issues that gave rise to the harms caused, since they reduce an incident to an individual who needs to be ‘just cultured’.

Sidney Dekker — Safety Differently

Outages

Updated: December 16, 2018 — 9:53 pm
A production of Tinker Tinker Tinker, LLC Frontier Theme