Last Sunday, after I finished putting SRE Weekly together, a major Internet backbone provider had an outage. There were so many outages that I’m not even going to bother listing all of them in the Outages section.
I love the way this article portrays SRE by placing less emphasis on specific skills and more on a holistic approach to reliability.
Emily Arnott — Blameless
Incident review is an important part of the organizational learning process, but in practice the focus can shift away from learning and toward fixing.
John Carroll (original paper)
Thai Wood — Resilience Roundup (summary)
My latest adventures in (negligently) running sreweekly.com. It started with a surprise AWS bill, and then it got kinda weird…
Deep technical details on a series of recent incidents involving Basecamp.
Troy Toman — Basecamp
Here’s why eyes-on-glass constant monitoring won’t help and can be actively harmful.
In August, we experienced no incidents resulting in service downtime. This month’s GitHub Availability Report will dive into updates to the GitHub Status Page and provide follow-up details on how we’ve addressed the incident mentioned in July’s report.
Keith Ballinger — GitHub
Here are Cloudflare’s thoughts on what happened with Sunday’s Internet trouble.
Matthew Prince — Cloudflare
This is ThousandEyes’s analysis of the outage, which follows similar lines to Cloudflare’s but includes a lot more detail.
Angelique Medina and Archana Kesavan — ThousandEyes