SRE Weekly Issue #227

Articles

A Terrible, Horrible, No-Good, Very Bad Day at Slack

This is the first of a pair of articles this week on a major Slack outage in May. This one explores the technical side, with a lot of juicy details on what happened and how.

Laura Nolan — Slack

All Hands on Deck

This is the companion article that describes Slack’s incident response process, using the same incident as a case study.

Ryan Katkov — Slack

Improving Incident Retrospectives at Indeed

The author saw room for improvement in the retrospective process at Indeed. The article explains the recommendations they made and why, including de-emphasizing the generation of remediation items in favor of learning.

Alex Elman

Google Cloud Networking Incident #20005 Follow-Up

The datacenter was purposefully switched to generator power during planned power maintenance, but unfortunately the fuel delivery system failed.

Towards More Effective Incident Postmortems

This is a good primer on the ins and outs of running a post-incident analysis.

Anusuya Kannabiran — Squadcast

Setting SLOs: observability using custom metrics

This article goes through an interesting technique for setting up SLO metrics and alerts in GCP using Terraform and OpenCensus.

Cindy Quach — Google

Introducing the GitHub Availability Report

GitHub is committing to publishing a report on their availability each month, with detail on incidents. This intro includes the reports for May and June, with descriptions of four incidents.

Keith Ballinger — GitHub

Blameless’ SRE Journey

This is neat: Blameless transitioned from “startup mode” toward an SRE methodology, becoming customer 0 of their own product in the process.

Blameless

Outages

Facebook SDK
- As in May, a Facebook SDK release caused problems on iOS for apps including Spotify, Pinterest, and Tinder.
Uber Eats
Crunchyroll
TikTok
Spotify
