Articles
This is an article version of an NPR Fresh Air interview with Dr. Danielle Ofri, author of the new book When We Do Harm. I especially loved the part about near misses.
Bridget Bentz, Molly Seavy-Nesper, Deborah Franklin, Sam Briger, and Thea Chaloner — NPR
Maintenance of the logging system had unintended downstream effects, including log loss and failure of the system that manages dynos.
In this incident, a TLS certificate was deployed without its intermediate, resulting in failures for some clients.
I wrote this after attending the Resilience Engineering Association’s webinar with panelists Dr. Richard Cook, John Allspaw, and Nora Jones, moderated by Laura Maguire. Once the recording is posted, I highly recommend watching!
Lex Neva
As SREs, we need to be laser-focused on the user’s experience. Our SLIs should reflect that.
Emily Arnott — Blameless
This two-part series is an in-depth look at how Twitter adopted SRE, before SRE was even a thing.
Blameless
Outages
- Elevated 500 errors on status pages and management portal
- Gmail
- Tinder
- Australian Taxation Office
- Discord
- This status page post is interesting and worth a read.
- GitHub
- Twitch