SRE Weekly Issue #320

Articles

Slack shared this write-up of their February outage, which involved complex systems interactions and cascading failure.

Laura Nolan — Slack

Go watch this lightning talk now! She had me hooked within the first ten seconds.

Hi, my name is Emily Ruppe, I work at Jeli.io, and I am a recovering incident commander, and I am sick of the phrase “to prevent this incident from ever happening again”.

Emily Ruppe — DevOpsDays Rockies

Founding Uber SRE.

This is my personal story of starting the SRE organization at Uber.

This article was written by a former Uber employee and is posted on their personal blog.

Will Larson

Post-Incident Review on the Atlassian April 2022 outage

This is total transparency at its finest. This write-up has all the details you could ever hope for on what went wrong, how they responded, and what comes next.

Sri Viswanath — Atlassian

Site Reliability Engineering Glossary

The target audience is new SREs and executive sponsors who would keep hearing these terms repeatedly but not take the time to read 1000s of words at a time.

[source: author comment on Reddit]

Ash P. — SREPath

That time we unplugged a data center to test our disaster readiness

Dropbox wanted to be able to survive the failure of an entire data center. To reach this goal, they moved from an active/active model to active/passive and spun up a new Disaster Readiness team to rework their failover system.

Krishelle Hardson-Hurley, Ross Delinger, and Tong Pham — Dropbox

SLOs for everyone with Sloth

HelloFresh drove the implementation of SLOs in their Kubernetes-based infrastructure using Prometheus and Sloth.

Chris Loukas — HelloFresh

Delivering Large-Scale Platform Reliability

A Roblox engineer outlines the way that Roblox handles reliability at scale.

Alberto Covarrubias — Roblox

Your On Call Rotation is Harmful (And Here’s How to Make it Better)

[…] let’s look at some common on call antipatterns and some simple things we can do to alleviate their common pitfalls.

Nickolas Means — Sym
