Here’s a take on incident severity levels. I enjoy learning what criteria folks use for this, so please send similar articles my way (or maybe write your own?).
Nancy Chauhan — Rootly
Counterfactuals (“should haves”) stifle incident retrospectives by tempting us to stop digging deeper. This article points out that there are unending possible counterfactuals for any incident.
Read to find out why counting incidents (or “# days since an outage”) won’t help and will cause more trouble than it’s worth. Also included: options for what to count instead.
Sloth is a tool for generating SLOs as Prometheus metrics, claiming to support “any kind of service”.
If you’re looking for a way to evaluate your SRE process, this might help.
Alex Bramley — Google
This article tries to put an actual number on the cost of adding more nines of reliability.
Jack Shirazi — Expedia
It’s time for Catchpoint’s yearly SRE report, downloadable in PDF form through this link. Note: you have to give them your email address.
- This outage impacted banks and airlines, among other Akamai customers.