Articles
Charity Majors knocks one out of the park with this article on the importance of testing (safely) in production.
Why does testing in production get such a bad rap when we all do it? The key is to do it right.
And speaking of baseball metaphors, here’s a PagerDuty engineer’s first-person account of shadowing on-call during an incident and the lessons she learned.
If you have time, please consider filling out this short survey on post-incident reviews (a.k.a. “retrospectives”) as part of a master’s thesis.
Mathias Lafeldt of Gremlin Inc. gives us this tutorial on moving from hand-run chaos experiments to a fully automated chaos system.
Jason Hand’s new ebook, Post-Incident Reviews, was published recently. Here’s his summary of the key points in the first three chapters.
This article is published by my sponsor, VictorOps, but their sponsorship did not influence its inclusion in this issue.
This article describes metrics in three main categories and explains how (and whether) to set up alerts for each kind.
Good output metrics are a close proxy for dollars earned or saved by the system per minute.
As in the previous article, Ilan Rabinovitch of Datadog advocates symptom-based monitoring and alerting. I like his concept of the improved “durability” of symptom-based alerting (as opposed to cause-based):
[…] you don’t have to update your alert definitions every time your underlying system architectures change.
Our systems are always in flux, and this sometimes leads to failure. Mathias expands on this line of thinking, urging us to understand the many conditions that led to a failure rather than search for a single root cause.
Hosted Graphite had a gnarly problem to solve: how to get information about overload conditions from the backend to the front end where throttling could be enacted.
Outages
- Honeycomb
- Honeycomb suffered their first major outage this week. I’m impressed by how quickly they were able to diagnose and fix the problem, owing at least in part to their use of their own service during troubleshooting.
- PagerDuty
- Here’s a follow-up from PagerDuty on an incident in May caused by “unanticipated side-effects of a system-wide load test”.
- Botched Firmware Update Bricks Hundreds of Smart Door Locks
- RCA for SYNQ dashboard login and registration outage on August 11th, 2017
- DreamHost
- DreamHost suffered a couple of DDoS attacks this week. Thanks to an anonymous SRE Weekly reader for this one.
- Facebook
- Facebook had a couple of outages this week.