SRE Weekly Issue #416

4 Instructive Postmortems on Data Downtime and Loss

What can we, in turn, learn from some of the most honest and blameless—and public—postmortems of the last few years?

The article covers incidents at GitLab, Tarsnap, Roblox, and Cloudflare, with great summaries and takeaways.

The Hacker News

Resilience and Incident Management with Vanessa Huerta Granda

My favorite part of this interview is when Vanessa describes parenting twin babies as constant incident response.

Shane Hastie — InfoQ

Beyond the beep and saving sleep: optimizing the On-Call experience

Here follow some lessons I’ve learned from the trenches in small start-ups and larger engineering teams, to improve your on-call shift experience and remediation time for production issues and make sure you’re spending on-call efforts on what has the most impact.

Alex Wauters

The case for Fault Injection testing in Production

Doing your chaos experiments in a non-production environment can feel safer, but what are you giving up?

Sam Rossoff — Gremlin

In Defense of Shell Scripts

Sometimes, shell is just the right tool for the job.

Amin Astaneh — Certo Modo

Tank Explosions at Midland Resource Recovery

Catherine on Mastodon summarized this incident report beautifully:

this is one of the most violently unhinged CSB reports i’ve ever read […]

while investigating an explosion at a facility, CSB staff tried to prevent another explosion of the same kind in the same facility, and being unable to convince the workers to not cause it, ended up hiding behind a shipping container

U.S. Chemical Safety and Hazard Investigation Board

Broken windows: why the ‘Single Pane of Glass’ is impossible

This one’s about why people tend to want a “SPoG” and what we should want instead. Bonus points for the Star Trek reference.

Nočnica Mellifera — Checkly

How we built our infrastructure fail-over checklist

Right in the middle of migrating from one datacenter to an HA pair of new datacenters, one of the new ones failed. They had to quickly do a partial rollback of the migration to ride out the outage.

Gauthier François — Doctolib

Announcing bpftop: Streamlining eBPF performance optimization

Today, we are thrilled to announce the release of bpftop, a command-line tool designed to streamline the performance optimization and monitoring of eBPF programs.

Jose Fernandez — Netflix
