SRE Weekly Issue #268

Articles

Manageable On-Call for Companies without Money Printers

The SRE book has a chapter covering on-call, but it’s best suited for huge-scale companies. What should the rest of us do?

Utsav Shah

Breaking the top five myths around chaos engineering

If you’re feeling hesitant about chaos engineering, or you’re trying to convince someone who is, this might be useful. The myths are:

Myth #1: Chaos engineering is testing in production
Myth #2: Chaos engineering is about randomly breaking things
Myth #3: Chaos engineering is only for large, modern distributed systems
Myth #4: We don’t need more chaos – we already have plenty!
Myth #5: Chaos engineering is only for very mature teams/products

Mikolaj Pawlikowski

Seeing Like an SRE: Site Reliability Engineering as High Modernism

Drawing parallels to the high modernism movement during the Cold War, this article raises interesting questions about the direction in which SRE, and system administration in general, is heading.

Laura Nolan — USENIX

Is faster actually safer? How software physics beats human psychology

Riffing on a tweet by Charity Majors, this article explores the idea that moving faster can actually be safer, despite the urge one may feel to slow down.

Bruce Johnston

NTSB Aircraft Accident Report: Eastern Air Lines, May 5, 1983

An extreme oversimplification of this incident would be: multiple engine failure on a plane subsequent to a maintenance error on all engines. This accident is cited as a reason to have separate mechanics work on each engine, in hopes of avoiding duplicated errors.

US National Transportation Safety Board (multiple authors)

How we ship code faster and safer with feature flags

[…] in order to ship new features and improvements faster while lowering the risk in our deployments, we have a simple but powerful tool: feature flags.

Alberto Gimeno — GitHub
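
For readers new to the idea, here is a minimal sketch of a percentage-based feature flag. It is an illustration only, not GitHub's implementation: the flag name `new_checkout`, the `FLAGS` table, and the `is_enabled` helper are all hypothetical, and the hashing scheme is just one common way to get a stable per-user rollout.

```python
import hashlib

# Hypothetical flag table: flag name -> percentage of users who get the feature.
FLAGS = {"new_checkout": 25}


def is_enabled(flag, user_id, flags=FLAGS):
    """Return True if `flag` is on for `user_id`.

    The user is hashed into a stable bucket from 0 to 99, so the same user
    always gets the same answer while the rollout percentage is unchanged.
    Unknown flags are treated as off.
    """
    percent = flags.get(flag, 0)
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent


# New code path is guarded by the flag; the old path stays as the fallback.
def checkout(user_id):
    if is_enabled("new_checkout", user_id):
        return "new checkout flow"
    return "old checkout flow"
```

Raising the percentage widens the rollout gradually, and setting it to 0 switches the feature off without a deploy.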

Reverse debugging at scale

This one blew my mind. By recording instruction execution traces in a ring buffer, they’re able to reconstruct enough information to step through the execution leading up to a crash — even though they weren’t running the application under a debugger!

Walter Erquinigo, David Carrillo-Cisneros, Alston Tang — Facebook
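
To give a rough feel for the ring-buffer idea, here is a toy sketch. It is not the approach from the article, which records real instruction traces; this version simply keeps the last few steps of a simulated program in a fixed-size buffer so they can be inspected after a crash. The `TraceRing` class, the `run` function, and the made-up "program" are all hypothetical.

```python
from collections import deque


class TraceRing:
    """Fixed-size ring buffer: keeps only the most recent `capacity` events."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest entry when full.
        self._events = deque(maxlen=capacity)

    def record(self, event):
        self._events.append(event)

    def snapshot(self):
        """Return the retained events, oldest first."""
        return list(self._events)


def run(program, ring):
    """Execute a made-up program, recording every step before acting on it."""
    for step, op in enumerate(program):
        ring.record((step, op))
        if op == "crash":
            raise RuntimeError("simulated crash")


ring = TraceRing(capacity=3)
try:
    run(["load", "add", "store", "jump", "crash"], ring)
except RuntimeError:
    # After the crash, the buffer still holds the steps that led up to it.
    last_steps = ring.snapshot()
```

Because the buffer has a fixed size, the recording cost stays bounded no matter how long the program runs; the trade-off is that only the most recent history survives.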

The Plane Paradox: More Automation Should Mean More Training

Automation is supposed to take some of the load off of the human operator, right? But in reality, humans need to build a mental model of what the automation is doing in order to use it safely and effectively.

Shem Malmquist — WIRED
