SRE Weekly is now on Mastodon at @SREWeekly@social.linux.pizza! Follow to get notified of each new issue as it comes out.
This replaces the Twitter account @SREWeekly, which I am now retiring in favor of Mastodon. For those of you following @SREWeekly on Twitter, you’ll need to choose a different way to get notified of new issues. If Mastodon isn’t your jam, try RSS or a straight email subscription (by filling out the form at sreweekly.com).
Don’t beat yourself up! Think of it as another form of blamelessness.
Robert Ross — FireHydrant + The New Stack
In this article, I will share with you how setting up passive guardrails in and around developer workflows can reduce the frequency and severity of incidents and outages.
Ash Patel — SREPath
This conference talk summary outlines the three main lessons Jason Cox learned as director of SRE at Disney.
Shaaron A Alvares — InfoQ
Here’s a look at how Meta has structured its Production Engineer role, its name for SREs.
Jason Kalich — Meta
Bit-flips caused by cosmic rays seem incredibly rare, but they become more likely as we make circuits smaller and our infrastructures larger.
Chris Baraniuk — BBC
Cloudflare shares details about their 87-minute partial outage this past Tuesday.
John Graham-Cumming — Cloudflare
In reaction to a major outage, these folks revamped their alerting and incident response systems. Here’s what they changed.
Vivek Aggarwal — Razorpay
The author of this post sought to test a simple algorithm from a research paper that purported to reduce tail latency. Yay for independent verification!