Articles
Slack shared this write-up of their February outage, which involved complex systems interactions and cascading failure.
Laura Nolan — Slack
Go watch this lightning talk now! She had me hooked within the first ten seconds.
Hi, my name is Emily Ruppe, I work at Jeli.io, and I am a recovering incident commander, and I am sick of the phrase “to prevent this incident from ever happening again”.
Emily Ruppe — DevOpsDays Rockies
This is my personal story of starting the SRE organization at Uber.
This article was written by a former Uber employee and is posted on their personal blog.
Will Larson
This is total transparency at its finest. This write-up has all the details you could ever hope for on what went wrong, how they responded, and what comes next.
Sri Viswanath — Atlassian
The target audience is new SREs and executive sponsors who would keep hearing these terms repeatedly but not take the time to read thousands of words at a time.
[source: author comment on Reddit]
Ash P. — SREPath
Dropbox wanted to be able to handle datacenter failure. To reach this goal, they moved from an active/active model to active/passive and spun up a new Disaster Readiness team to rework their failover system.
Krishelle Hardson-Hurley, Ross Delinger, and Tong Pham — Dropbox
HelloFresh drove the implementation of SLOs in their Kubernetes-based infrastructure using Prometheus and Sloth.
Chris Loukas — HelloFresh
A Roblox engineer outlines how Roblox handles reliability at scale.
Alberto Covarrubias — Roblox
[…] let’s look at some common on call antipatterns and some simple things we can do to alleviate their common pitfalls.
Nickolas Means — Sym