SRE Weekly Issue #373

A message from our sponsor, Rootly:

Rootly is hiring for a Sr. Developer Relations Advocate to continue helping more world-class companies like Figma, NVIDIA, Squarespace, accelerate their incident management journey. Looking for previous on-call engineers with a passion for making the world a more reliable place. Learn more:

https://rootly.com/careers?gh_jid=4015888007

Articles

Datadog posted a report on their major outage in March, and it’s a doozy. An unattended updates system that they didn’t even want, need, or know about triggered across all hosts in multiple clouds nearly simultaneously, causing a regression.

  Alexis Lê-Quôc — Datadog

GitHub has had a string of apparently unrelated outages recently, and they’ve posted this description.

  Mike Hanley — GitHub

Oh look, another awesome-* repo relevant to our interests!

A repo of links to articles, papers, conference talks, and tooling related to load management in software services: loadshedding, circuitbreaking, quota management and throttling. PRs welcome.

  Laura Nolan and Niall Murphy — Stanza Systems

This interview covers a lot of ground including looking beyond just “up or down” when considering reliability.

  Prathamesh Sonpatki — SRE Stories

If you’re in the mood for a deep systems debugging story, you’re in for a treat. The author takes you along for the ride with a wealth of detailed code snippets.

  Tycho Andersen — Netflix

Regardless of the replication mechanism you must fsync() your data to prevent global data loss in non-Byzantine protocols.

  Denis Rystsov and Alexander Gallego — Redpanda

Emotional intelligence is a critical skill for SREs, especially when we interact with other teams in fraught situations.

  Amin Astaneh — Certo Modo

Wow! Spotify created a set of tools to perform automated refactoring of thousands of repositories at once. This includes the ability to run tests, automatically merge pull requests without human review, and roll refactorings out gradually.

  Matt Brown — Spotify

Jeli has published a one-page cheat-sheet for their highly-detailed Howie guide for running incident retrospectives.

  Jeli

Updated: May 21, 2023 — 9:21 pm
A production of Tinker Tinker Tinker, LLC Frontier Theme