This is the first of a pair of articles this week on a major Slack outage in May. This one explores the technical side, with a lot of juicy details on what happened and how.
Laura Nolan — Slack
This is the companion article that describes Slack’s incident response process, using the same incident as a case study.
Ryan Katkov — Slack
The author saw room for improvement in the retrospective process at Indeed. The article explains the recommendations they made and why, including de-emphasizing the generation of remediation items in favor of learning.
The datacenter was intentionally switched to generator power during planned power maintenance, but unfortunately the fuel delivery system failed.
This is a good primer on the ins and outs of running a post-incident analysis.
Anusuya Kannabiran — Squadcast
This article walks through an interesting technique for setting up SLO metrics and alerts in GCP using Terraform and OpenCensus (a rough sketch of that kind of configuration follows below).
Cindy Quach — Google
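To give a feel for what this looks like, here is a minimal sketch of the general shape of such a setup using the Terraform google provider's google_monitoring_slo and google_monitoring_alert_policy resources. It is not taken from the article: the service name, metric filters, and labels are placeholders you would replace with your own OpenCensus-exported metrics.

```hcl
# A custom service to attach the SLO to (names and filters are illustrative only).
resource "google_monitoring_custom_service" "checkout" {
  service_id   = "checkout"
  display_name = "Checkout service"
}

# A request-based availability SLO: 99.9% of requests succeed over a rolling 28 days.
resource "google_monitoring_slo" "availability" {
  service             = google_monitoring_custom_service.checkout.service_id
  slo_id              = "availability"
  display_name        = "99.9% availability over 28 days"
  goal                = 0.999
  rolling_period_days = 28

  request_based_sli {
    good_total_ratio {
      # These filters assume a request count metric exported via the OpenCensus
      # Stackdriver exporter, with a hypothetical "status" label; adjust to your metrics.
      good_service_filter  = "metric.type=\"custom.googleapis.com/opencensus/request_count\" metric.labels.status=\"ok\""
      total_service_filter = "metric.type=\"custom.googleapis.com/opencensus/request_count\""
    }
  }
}

# Alert when the error budget is being burned 10x faster than sustainable.
resource "google_monitoring_alert_policy" "fast_burn" {
  display_name = "Checkout SLO fast burn"
  combiner     = "OR"

  conditions {
    display_name = "Burn rate > 10 over 1 hour"
    condition_threshold {
      filter          = "select_slo_burn_rate(\"${google_monitoring_slo.availability.name}\", \"3600s\")"
      comparison      = "COMPARISON_GT"
      threshold_value = 10
      duration        = "0s"
    }
  }
}
```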
GitHub is committing to publishing a monthly report on their availability, with details on incidents. This introductory post includes the reports for May and June, describing four incidents.
Keith Ballinger — GitHub
This is neat: Blameless transitioned from “startup mode” toward an SRE methodology, becoming customer 0 of their own product in the process.