SRE Weekly Issue #86

Articles

Testing in production: Yes, you can (and should)

Charity Majors knocks one out of the park with this article on the importance of testing (safely) in production.

Why does testing in production get such a bad rap when we all do it? The key is to do it right.

Stepping Up to the Plate: A Story About Being On-Call – PagerDuty

And speaking of baseball metaphors, here’s a PagerDuty engineer’s first-person account of shadowing on-call during an incident and the lessons she learned.

Post-Incident Reviews Survey

If you have time, please consider filling out this short survey on post-incident reviews (a.k.a. “retrospectives”) as part of a master’s thesis.

A Primer on Automating Chaos – Gremlin, Inc.

Mathias Lafeldt of Gremlin Inc. gives us this tutorial on moving from hand-run chaos experiments to a fully automated chaos system.

Post-Incident Reviews: Learning from Failure for Improved Incident Response – VictorOps

Recently, Jason Hand’s new ebook, Post-Incident Reviews, was published. Here’s his summary of the key points in the first three chapters.

This article is published by my sponsor, VictorOps, but their sponsorship did not influence its inclusion in this issue.

Operational Metrics and Alerts for Distributed Software Systems

This article describes metrics in three main categories and explains how (and whether) to set up alerts for each kind.

Good output metrics are a close proxy for dollars earned or saved by the system per minute.

Cutting Alert Fatigue in Modern Ops – PagerDuty

Like the previous article, this one from Ilan Rabinovitch of Datadog advocates symptom-based monitoring and alerting. I like his concept of the improved “durability” of symptom-based alerting (as opposed to cause-based):

[…] you don’t have to update your alert definitions every time your underlying system architectures change.

Impermanence: The Single Root Cause – Production Ready

Our systems are always in flux, and this sometimes leads to failure. Mathias expands on this line of thinking, urging us to understand the many conditions that led to a failure rather than hunt for a single root cause.

Per-metric rate limiting: How we protect our backend

Hosted Graphite had a gnarly problem to solve: how to get information about overload conditions from the backend to the front end where throttling could be enacted.

Outages

Honeycomb
- Honeycomb suffered their first major outage this week. I’m impressed by how quickly they were able to diagnose and fix the problem, owing at least in part to their use of their own service during troubleshooting.
PagerDuty
- Here’s a followup from PagerDuty on an incident in May caused by “unanticipated side-effects of a system-wide load test”.
Botched Firmware Update Bricks Hundreds of Smart Door Locks
RCA for SYNQ dashboard login and registration outage on August 11th, 2017
DreamHost
- DreamHost suffered a couple of DDoS attacks this week. Thanks to an anonymous SRE Weekly reader for this one.
Facebook
- Facebook had a couple of outages this week.
