SRE Weekly Issue #345

SRE Weekly is now on Mastodon! Follow to get notified of each new issue as it comes out.

This replaces the Twitter account @SREWeekly, which I am now retiring in favor of Mastodon. For those of you following @SREWeekly on Twitter, you’ll need to choose a different way to get notified of new issues. If Mastodon isn’t your jam, try RSS or a straight email subscription (by filling out the form).

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒.

Rootly automates manual tasks like creating an incident channel, Jira ticket and Zoom rooms, inviting responders, creating statuspage updates, postmortem timelines and more. Want to see why companies like Canva and Grammarly love us?


Don’t beat yourself up! This is like another form of blamelessness.

  Robert Ross — FireHydrant + The New Stack

In this article, I will share with you how setting up passive guardrails in and around developer workflows can reduce the frequency and severity of incidents and outages.

  Ash Patel — SREPath

This conference talk summary outlines the three main lessons Jason Cox learned as director of SRE at Disney.

  Shaaron A Alvares — InfoQ

Here’s a look at how Meta has structured its Production Engineer role, their name for SREs.

  Jason Kalich — Meta

Bit-flips caused by cosmic rays seem incredibly rare, but they become more likely as we make circuits smaller and our infrastructures larger.

  Chris Baraniuk — BBC

Cloudflare shares details about their 87-minute partial outage this past Tuesday.

  John Graham-Cumming — Cloudflare

In reaction to a major outage, these folks revamped their alerting and incident response systems. Here’s what they changed.

  Vivek Aggarwal — Razorpay

The author of this post sought to test a simple algorithm from a research paper that purported to reduce tail latency. Yay for independent verification!

  Marc Brooker

Updated: October 30, 2022 — 8:23 pm
A production of Tinker Tinker Tinker, LLC