SRE Weekly Issue #219

Articles

Check out this new 100-page ebook on incident response from Atlassian, great for folks setting up a brand new on-call structure or improving their existing one. It even has a section on compensating teams for being on-call.

Serhat Can — Atlassian

Laura Maguire discusses compelling data from her PhD dissertation showing that the Incident Command System actually makes incident response less efficient, along with lots of other interesting findings.

Laura Maguire

A summary of a great talk by Amy Tobey at Failover Conf, amusingly framed as a “retrospective”.

Hannah Culver — Blameless

In this case, the “cloud” refers to actual clouds, the ones in the sky. It’s a comparison between concepts in aviation and SRE, two fields with significant overlap.

Bill Duncan

My favorite:

The fact that you need to make changes to maintain availability, will itself threaten your availability.

Lee Atchison — diginomica

A bug in a new release of the Facebook SDK caused some iOS apps to crash.

Brian Barrett — WIRED

[…] I’m no longer going to try to get engineers to stop saying root cause while they are engaged in an incident. If the experts at anomaly response find it a useful phrase while they are doing their work, we should recognize this as a part of their expertise.

Lorin Hochstein

Outages

  • Slack
    • Slack’s server infrastructure scales up every day to handle volume in North America by increasing the size of the server pool available to handle requests. Some of these servers did not successfully register with our load balancing infrastructure during this process of scaling up, and this ultimately led to a decline in the health of the server pool over time.

  • YouTube
  • Coinbase
  • Google Play Store
  • Microsoft Outlook
  • reddit
  • Zoom
Updated: May 17, 2020 — 8:44 pm
A production of Tinker Tinker Tinker, LLC