SRE Weekly Issue #268

A message from our sponsor, StackHawk:

Join StackHawk Tuesday May 4 at 9 am PT for a hands-on technical workshop! By the end of the session, you will have three types of security testing running in your GitHub pipeline. Register:
http://sthwk.com/technical-workshop

Articles

The SRE book has a chapter covering on-call, but it’s best suited for huge-scale companies. What should the rest of us do?

Utsav Shah

If you’re feeling hesitant about chaos engineering, or you’re trying to convince someone who is, this might be useful. The myths are:

Myth #1: Chaos engineering is testing in production
Myth #2: Chaos engineering is about randomly breaking things
Myth #3: Chaos engineering is only for large, modern distributed systems
Myth #4: We don’t need more chaos – we already have plenty!
Myth #5: Chaos engineering is only for very mature teams/products

Mikolaj Pawlikowski

Drawing parallels to the high modernism movement during the cold war, this article raises interesting questions about the direction SRE is going, and system administration in general.

Laura Nolan — USENIX

Riffing off of a tweet by Charity Majors, this article explores the idea that moving faster can actually be safer, despite an urge one may feel to slow down.

Bruce Johnston

An extreme oversimplification of this incident would be: multiple engine failure on a plane subsequent to a maintenance error on all engines. This accident is cited as a reason to have separate mechanics work on each engine, in hopes of avoiding duplicated errors.

US National Transportation Safety Board (multiple authors)

[…] in order to ship new features and improvements faster while lowering the risk in our deployments, we have a simple but powerful tool: feature flags.

Alberto Gimeno — GitHub

This one blew my mind. By recording instruction execution traces in a ring buffer, they’re able to reconstruct enough information to step through the execution leading up to a crash — even though they weren’t running the application under a debugger!

Walter Erquinigo, David Carrillo-Cisneros, Alston Tang — Facebook

Automation is supposed to take some of the load off of the human operator, right? But in reality, humans need to build a mental model of what the automation is doing in order to use it safely and effectively.

Shem Malmquist — WIRED

Outages

Updated: May 2, 2021 — 9:29 pm
SRE WEEKLY © 2015 Frontier Theme