SRE Weekly Issue #219

Articles

Download our new on-call book [Atlassian]

Check out this new 100-page ebook on incident response from Atlassian, great for folks setting up a brand new on-call structure or improving their existing one. It even has a section on compensating teams for being on-call.

Serhat Can — Atlassian

How Many Is Too Much? Exploring Costs of Coordination During Outages

Laura Maguire discusses compelling data from her PhD dissertation indicating that the Incident Command System actually makes incident response less efficient, along with lots of other interesting findings.

Laura Maguire

“The Future of DevOps is Resilience Engineering” Incident Retrospective

A summary of a great talk by Amy Tobey at Failover Conf, amusingly framed as a “retrospective”.

Hannah Culver — Blameless

Operations in the Cloud

In this case, the “cloud” refers to actual clouds, the ones in the sky. It’s a comparison between concepts in aviation and SRE, two fields with significant overlap.

Bill Duncan

Five causes of poor availability to watch out for

My favorite:

The fact that you need to make changes to maintain availability, will itself threaten your availability.

Lee Atchison — diginomica

How a Facebook Bug Took Down Spotify, TikTok, and Other Major iOS Apps

A bug in a new release of the Facebook SDK caused some iOS apps to crash.

Brian Barrett — WIRED

Making peace with “root cause” during anomaly response

[…] I’m no longer going to try to get engineers to stop saying root cause while they are engaged in an incident. If the experts at anomaly response find it a useful phrase while they are doing their work, we should recognize this as a part of their expertise.

Lorin Hochstein

Outages

Slack
- Slack’s server infrastructure scales up every day to handle volume in North America by increasing the size of the server pool available to handle requests. Some of these servers did not successfully register with our load balancing infrastructure during this process of scaling up, and this ultimately led to a decline in the health of the server pool over time.
YouTube
Coinbase
Google Play Store
Microsoft Outlook
reddit
Zoom
