SRE Weekly Issue #320

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒. Automate manual admin tasks like creating an incident channel, Jira, and Zoom, paging and adding responders, postmortem timeline, setting up reminders, and more. Book a demo (+ get a snazzy Rootly lego set):
https://rootly.com/demo/

Articles

Slack shared this write-up of their February outage, which involved complex systems interactions and cascading failure.

  Laura Nolan — Slack

Go watch this lightning talk now! She had me hooked within the first ten seconds.

Hi, my name is Emily Ruppe, I work at Jeli.io, and I am a recovering incident commander, and I am sick of the phrase “to prevent this incident from ever happening again”.

  Emily Ruppe — DevOpsDays Rockies

This is my personal story of starting the SRE organization at Uber.

This article was written by a former Uber employee and is posted on their personal blog.

  Will Larson

This is total transparency at its finest. This write-up has all the details you could ever hope for on what went wrong, how they responded, and what comes next.

  Sri Viswanath — Atlassian

The target audience is new SREs and executive sponsors who would keep hearing these terms repeatedly but not take the time to read 1000s of words at a time.

[source: author comment on Reddit]

  Ash P. — SREPath

Dropbox wanted to be able to handle datacenter failure. To reach this goal, they moved from an active/active model to active/passive and spun up a new Disaster Readiness team to rework their failover system.

  Krishelle Hardson-Hurley, Ross Delinger, and Tong Pham — Dropbox

HelloFresh drove the implementation of SLOs in their Kubernetes-based infrastructure using Prometheus and Sloth.

  Chris Loukas — HelloFresh

A Roblox engineer outlines the way that Roblox handles reliability at scale.

  Alberto Covarrubias — Roblox

[…] let’s look at some common on call antipatterns and some simple things we can do to alleviate their common pitfalls.

  Nickolas Means — Sym

Outages

Updated: May 1, 2022 — 9:26 pm
A production of Tinker Tinker Tinker, LLC