SRE Weekly Issue #86

SPONSOR MESSAGE

More efficiently reach on-call teams and incident responders with a new way to deploy Live Call Routing using Twilio Functions and VictorOps. Check it out:
http://try.victorops.com/LiveCallRouting/SREWeekly

Articles

Charity Majors knocks one out of the park with this article on the importance of testing (safely) in production.

Why does testing in production get such a bad rap when we all do it? The key is to do it right.

And speaking of baseball metaphors, here’s a PagerDuty engineer’s first-person account of shadowing on-call during an incident and the lessons she learned.

If you have time, please consider filling out this short survey on post-incident reviews (a.k.a. “retrospectives”) as part of a master’s thesis.

Mathias Lafeldt of Gremlin Inc. gives us this tutorial on moving from hand-run chaos experiments to a fully automated chaos system.

Recently, Jason Hand’s new ebook, Post-Incident Reviews, was published. Here’s his summary of the key points in the first three chapters.

This article is published by my sponsor, VictorOps, but their sponsorship did not influence its inclusion in this issue.

This article describes metrics in three main categories and explains how (and whether) to set up alerts for each kind.

Good output metrics are a close proxy for dollars earned or saved by the system per minute.

Like the previous article, Ilan Rabinovitch of Datadog advocates for symptom-based monitoring and alerting. I like his concept of the improved “durability” of symptom-based alerting (as opposed to cause-based):

[…] you don’t have to update your alert definitions every time your underlying system architectures change.
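
As a rough illustration of that durability, here is a minimal sketch of a symptom-based check. It looks only at what users experience (the fraction of failed requests), so it needs no changes when hosts, queues, or databases behind the service are swapped out. The names `RequestWindow`, `error_ratio`, and `should_page`, and the 5% threshold, are hypothetical and are not taken from the article.

```python
from dataclasses import dataclass


@dataclass
class RequestWindow:
    """Counts of user-facing requests seen in one evaluation window (hypothetical type)."""
    total: int
    failed: int


def error_ratio(window: RequestWindow) -> float:
    """Fraction of user-facing requests that failed; 0.0 when there was no traffic."""
    if window.total == 0:
        return 0.0
    return window.failed / window.total


def should_page(window: RequestWindow, threshold: float = 0.05) -> bool:
    """Page when the user-visible symptom (error ratio) exceeds the assumed threshold.

    Nothing here refers to a specific host, process, or dependency, so the
    rule stays valid when the underlying architecture changes.
    """
    return error_ratio(window) > threshold


if __name__ == "__main__":
    print(should_page(RequestWindow(total=1000, failed=80)))
    print(should_page(RequestWindow(total=1000, failed=10)))
```

A cause-based equivalent would instead name a specific component (for example, CPU on one database host) and would need rewriting whenever that component is replaced.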

Our systems are always in flux, and this sometimes leads to failure. Mathias expands on this line of thinking, urging us to understand the many conditions that led to a failure rather than a particular root cause.

Hosted Graphite had a gnarly problem to solve: how to get information about overload conditions from the backend to the front end where throttling could be enacted.

Outages

Updated: August 27, 2017 — 11:13 pm
A production of Tinker Tinker Tinker, LLC