SRE Weekly Issue #348

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒.

Rootly automates manual tasks like creating an incident channel, Jira ticket, and Zoom room, inviting responders, posting status page updates, building postmortem timelines, and more. Want to see why companies like Canva and Grammarly love us?

https://rootly.com/demo/

Articles

Here’s a good intro to creating SLOs including a section on best practices.

  Cortex

When they started to get complaints from customers, they knew it was time to get serious about measuring and monitoring their reliability.

  arun — Reputation

As an SRE and sysadmin with 10+ years of industry experience, I wanted to write up a few scenarios that are real threats to the integrity of the bird site over the coming weeks.

What follows is a thread with dozens of realistic failure scenarios, many of which apply well beyond Twitter.

  @MosquitoCapital on Twitter

A few amusing anecdotes reveal deeper lessons in SRE.

  David Cassel — The New Stack

A resilient system like Twitter isn’t likely to go down instantly just because of a few changes. It’s much more likely to slowly degrade, per this article.

  Christopher Carbone — Daily Mail

It’s really interesting to see where this write-up differs from a video summary of the same accident by Mentour Pilot. Given the differences, I wonder whether there are even more details that both left out.

  Admiral Cloudberg

This is a really great description of common ground breakdown, referencing Woods and Klein.

  Dan Slimmon

SRE Weekly Issue #347

Articles

Check it out, a conference from the Learning From Incidents people!

Echoing Bainbridge’s Ironies of Automation, this article discusses the potential dangers of over-automation, using an air accident as a case study. I hadn’t been aware of the term “Children of the Magenta” before.

   Katie Mingle — 99% Invisible

There’s more to it than just hacking together some slack workflows.

   Ryan McDonald — FireHydrant

Honeycomb doesn’t do its SLOs “by the book”.

The way Honeycomb defines SLOs is radically different from what I expected. Instead of the definitions I wrote about at the beginning of this post, I saw:

  Reid Savage — Honeycomb
  Full disclosure: Honeycomb is my employer.

An anonymous Twitter engineer talks about what’s going on over there and how they think it might play out.

  Chris Stokel-Walker — MIT Technology Review

They rolled out automated rollbacks across a complex infrastructure, and in this article, they share the lessons they learned in the process.

  Will Sewell and Joseph Pallamidessi — Monzo

Okay. Here’s the Important Thing:

As you approach maximum throughput, average queue size – and therefore average wait time – approaches infinity.

  Dan Slimmon
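That claim is the classic M/M/1 queueing result: average time in the system is 1/(μ − λ), which blows up as the arrival rate λ approaches the service rate μ. A minimal sketch (the rates below are made-up numbers, just to show the shape of the curve):

```python
# M/M/1 queue: average time in system W = 1 / (mu - lambda).
# As utilization (lambda / mu) approaches 1, W grows without bound.
def mm1_avg_wait(arrival_rate: float, service_rate: float) -> float:
    """Average time in an M/M/1 system, in the same time units as the rates."""
    if arrival_rate >= service_rate:
        raise ValueError("unstable queue: utilization >= 1")
    return 1.0 / (service_rate - arrival_rate)

# Illustrative: a server handling 10 requests/sec, pushed ever closer to capacity.
for lam in (5.0, 9.0, 9.9, 9.99):
    print(f"utilization {lam / 10:.2%}: avg wait {mm1_avg_wait(lam, 10.0):.2f}s")
```

Note how the last few percent of utilization cost far more latency than everything before them, which is why running hot near maximum throughput is so dangerous.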

It was not clear to the pilots that the fuel estimation system was not designed to be used in the way they were using it.

  Admiral Cloudberg

As is usually the case with air accidents, the crash of Air Florida flight 90 did not have a single cause. In fact, the accident was the result of the confluence of two proximate factors, each of which was itself the culmination of a long chain of errors.

  Admiral Cloudberg

SRE Weekly Issue #346

Articles

The theme of this article is, somebody knows. So often this is the case with lurking infrastructure issues, and it only becomes clear that somebody knew about the underlying risk after things blow up (or it never becomes clear at all). How can we find out these things that someone already knows, soon enough to act?

  Elizabeth Ayer

In this air crash investigation report, somebody knew: the maintenance supervisor had written multiple memos about a risky maintenance practice to no avail, and the practice directly contributed to the crash.

  Admiral Cloudberg

And in this one, somebody knew too: a trained pilot in a nearby village called air traffic control to warn them that a plane looked likely to crash into a mountain and needed to pull up — shortly before it hit the mountain.

  Admiral Cloudberg

A lolsob-worthy comment on laying off SREs. And here’s a totally on-point reply with the somebody knew moment.

Partly, it’s about accepting that this is hard work. The other part is choosing where your energy input can yield the most learning.

Full disclosure: Fred is my teammate at work.

  Fred Hebert

Check it out, the incident.io folks started a podcast about incidents!

  incident.io

Here’s Google’s report for a BigQuery outage that occurred on October 13.

  Google

At Last9, we auto-delete Slack messages after 2 days in all personal Direct Messages. These retention policies force teams to improve documentation, kill tribal knowledge, and drive accountability for mistakes and errors.

  Nishant Modak — Last9

There are some interesting tidbits in the pile of incidents in this report.

  Jakub Oleksy — GitHub

SRE Weekly Issue #345

SRE Weekly is now on Mastodon at @SREWeekly@social.linux.pizza! Follow to get notified of each new issue as it comes out.

This replaces the Twitter account @SREWeekly, which I am now retiring in favor of Mastodon. For those of you following @SREWeekly on Twitter, you’ll need to choose a different way to get notified of new issues. If Mastodon isn’t your jam, try RSS or a straight email subscription (by filling out the form at sreweekly.com).

Articles

Don’t beat yourself up! This is like another form of blamelessness.

  Robert Ross — FireHydrant + The New Stack

In this article, I will share with you how setting up passive guardrails in and around developer workflows can reduce the frequency and severity of incidents and outages.

  Ash Patel — SREPath

This conference talk summary outlines the three main lessons Jason Cox learned as director of SRE at Disney.

  Shaaron A Alvares — InfoQ

Here’s a look at how Meta has structured its Production Engineer role, Meta’s name for SREs.

  Jason Kalich — Meta

Bit-flips caused by cosmic rays seem incredibly rare, but they become more likely as we make circuits smaller and our infrastructures larger.

  Chris Baraniuk — BBC

Cloudflare shares details about their 87-minute partial outage this past Tuesday.

  John Graham-Cumming — Cloudflare

In reaction to a major outage, these folks revamped their alerting and incident response systems. Here’s what they changed.

  Vivek Aggarwal — Razorpay

The author of this post sought to test a simple algorithm from a research paper that purported to reduce tail latency. Yay for independent verification!

  Marc Brooker

SRE Weekly Issue #344

Articles

In this story of SLOs gone bad, error budgets and code freezes provided a perverse incentive that caused a great deal of harm.

  dobbse.net

This article seeks to apply SRE principles to security in the form of a Threat Budget.

  Jason Bloomberg — Intellyx

After talking to hundreds of engineers about their processes, we’ve identified five of the most common challenges we see across companies looking to put more structure behind how they manage their incidents.

  Mike Lacsamana — FireHydrant

The Analysis section has a lot of important lessons. What really stands out in this incident review is that Honeycomb plainly lays out what they don’t yet know about what went wrong, and why not.

  Fred Hebert — Honeycomb
  Full disclosure: Honeycomb is my employer.

Several small staging clusters—each fit for its purpose—offer a more maintainable, cheaper alternative.

  Tyler Cipriani

I’m really enjoying the Admiral Cloudberg series of aircraft accident investigation reports. How did I not know about these before??

A lot has improved in aviation safety since this crash in 1967, but there’s still a lot we can learn in SRE even now. For example: the operator’s view into the system should make the result of their inputs clear.

  Admiral Cloudberg

An unannounced (maybe inadvertent?) breaking change in an Azure API caused an outage. Here’s the story of the investigation.

  Nikko Campbell — Metrist

Another Admiral Cloudberg air accident investigation, this time showing how easily critical details can slip through the cracks.

  Admiral Cloudberg
