SRE Weekly Issue #369

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒.

Rootly automates manual tasks like creating an incident channel, Jira ticket and Zoom rooms, inviting responders, creating statuspage updates, postmortem timelines and more. Want to see why companies like Canva and Grammarly love us?:

https://rootly.com/demo/

Articles

if we focus only on issues with individual components, there’s so much we miss about the nature of failure in our systems.

  Lorin Hochstein

Will Gallego says that we need to prioritize and incentivize learning from near misses, not just actual incidents.

We’ve made headway into expending energy towards learning from incidents. We’ll be even better off when we can regularly make learning from successes our regular work as well.

  Will Gallego

This air crash in 1977 taught us many important lessons including surprising details about the behavior of jet engines in rain. The water ingestion testing apparatus shown in one of the photos is pretty impressive.

  Admiral Cloudberg

When your alerts cover systems owned by different teams, who should be on call?

  Nathan Lincoln — Honeycomb
  Full disclosure: Honeycomb is my employer.

Cloudflare does some pretty eye-opening things with the network stack and file descriptors, as described in this amusingly-named article.

  Quang Luong and Chris Branch

While ostensibly about security incident response, this article has a lot of useful ideas for improving response to any kind of incident.

  Jess Chang — Vanta (for incident.io)

An argument for monoliths over microservices, but with an important caveat: be careful about compartmentalizing your failure domains.

  Lawrence Jones — incident.io

Here’s a great summary of the key themes from last month’s SRECon Americas.

  Paige Cruz — Chronosphere

SRE Weekly Issue #368

Articles

This article uses a simulation to demonstrate the power of shuffle sharding to limit the blast radius of overload conditions.

  Eugene Retunsky — DZone
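To get a feel for why shuffle sharding shrinks the blast radius, here is a minimal back-of-the-envelope sketch. It is not the simulation from the article. It assumes customers are assigned to workers uniformly at random and that a customer is only fully impacted when every one of its workers is shared with the overloading customer; the function names and the 8-worker, 2-per-shard numbers are made up for illustration.

```python
from math import comb


def fixed_shard_full_overlap(workers: int, shard_size: int) -> float:
    """Chance another customer lands in the same fixed, non-overlapping shard."""
    shards = workers // shard_size
    return 1 / shards


def shuffle_shard_full_overlap(workers: int, shard_size: int) -> float:
    """Chance another customer draws exactly the same set of workers
    when every customer gets its own random combination."""
    return 1 / comb(workers, shard_size)


if __name__ == "__main__":
    workers, shard_size = 8, 2  # hypothetical fleet, chosen for illustration
    print(f"fixed shards:   {fixed_shard_full_overlap(workers, shard_size):.3f}")
    print(f"shuffle shards: {shuffle_shard_full_overlap(workers, shard_size):.3f}")
```

With these assumed numbers, a noisy neighbor fully overlaps with a given customer one time in 4 under fixed shards and one time in 28 under shuffle shards. Partial overlap (sharing one worker but not both) is more common with shuffle sharding, so the benefit depends on clients retrying against their other workers.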

A comprehensive look at stress testing, how it differs from load testing, how to implement it, and how to analyze the results.

  Lambdatest

Retries and high availability are great, but for critical dependencies, we can go a step further and define an alternative in case a dependency is down.

  Leart Gjoni — DoorDash
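The general idea of retrying first and then switching to an alternative can be sketched roughly as follows. This is an illustrative assumption, not DoorDash's implementation: `call_with_fallback`, `primary`, and `fallback` are hypothetical names, and a real system would likely add backoff, timeouts, and narrower exception handling.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_fallback(
    primary: Callable[[], T],
    fallback: Callable[[], T],
    retries: int = 2,
) -> T:
    """Try the primary dependency up to retries + 1 times, then use the fallback."""
    for _ in range(retries + 1):
        try:
            return primary()
        except Exception:
            # Broad catch for brevity; a real caller would only retry
            # errors it knows to be transient.
            continue
    return fallback()


if __name__ == "__main__":
    def flaky_lookup() -> str:
        raise ConnectionError("dependency is down")

    # Falls through to the alternative once the retries are exhausted.
    print(call_with_fallback(flaky_lookup, lambda: "cached value"))
```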

From the archives, here’s an incident report from a major outage at DoorDash in 2022.

  Ryan Sokol — DoorDash

Amazon’s old internal “retrospective” process sounds painful and scary. Fortunately the author took the good parts and learned some valuable lessons from the rest.

  Lee Atchison — Container Journal

Instead of asking PMs to “speak SRE,” span the communication gap by using the common language of user stories to build business-cogent SLOs.

  Kit Merker — DevOps.com

Amazon gives its own service offerings like RDS an advantage by making the (normally pricey) cross-availability-zone data transfer free.

  Corey Quinn — Last Week In AWS

It’s easy to think of reasons to run a retrospective on an incident. What about the reverse? Which incidents should we skip over?

  Lex Neva — The New Stack
  Full disclosure: Honeycomb, my employer, is mentioned.

SRE Weekly Issue #367

Articles

Reading this article will teach you the math you need to know to build alerting that has a low false positive rate and why this is trickier than it may seem.

  Dan Slimmon
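One piece of math that commonly trips up alert design is the base rate: when real problems are rare, even a very accurate check produces mostly false alarms. The sketch below applies Bayes' rule under that framing. It is my own illustration rather than code from the article, and the function name and the example numbers are assumptions.

```python
def alert_precision(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Fraction of fired alerts that correspond to a real problem.

    sensitivity: P(alert fires | real problem)
    specificity: P(alert stays quiet | no problem)
    prevalence:  P(real problem) in any given check interval
    """
    true_positives = sensitivity * prevalence
    false_positives = (1 - specificity) * (1 - prevalence)
    return true_positives / (true_positives + false_positives)


if __name__ == "__main__":
    # Hypothetical check: 99% sensitive, 99% specific, problems in 0.1% of intervals.
    print(f"{alert_precision(0.99, 0.99, 0.001):.1%} of alerts are real")
```

With these assumed numbers, only about 9% of alerts point at a real problem, even though the check is "99% accurate" in both directions.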

Cloudflare adapted a technique shared by PagerDuty to detect failed Kafka consumers and restart them.

  Chris Shepherd and Andrea Medda — Cloudflare

Gracefully shutting down is important, otherwise every deploy will result in client-facing errors.

  Srinavas — eightnoteight

There’s a wealth of lessons learned in this article. My favorite: idempotency was never part of the contract, but consumers nevertheless depended on it.

  Gergely Orosz — Pragmatic Engineer

Making our companies into High Reliability Organizations (HROs) rarely makes sense, but we can still learn useful skills and techniques from them. This article gives a good overview and analysis of HROs.

  Dominic Cooper — Safety & Health Practitioner

By “tiered”, this article means having discussions about reliability at three levels: the engineering team level, the director level, and the executive level.

  Ross Brodbeck

After explaining why deploys aren’t the right approach, this article proposes feature flags as a safer approach.

  Charity Majors — Honeycomb
  Full disclosure: Honeycomb is my employer.

SRE Weekly Issue #366

Articles

In incident management, as in so many areas, there’s the shiny work and the unglamorous but critical parts, and the latter often fall to women. This article seeks to reverse that trend by reminding us of the incredibly important glue work women have been doing since the dawn of computing.

  Emily Arnott — Blameless

I love stories about applying IT incident response processes to non-IT incidents.

  Robert Ross — FireHydrant

Dear reader, perhaps you would enjoy reading this article on the many benefits of engineering blogs… then go write more great content and send me a link. :D

  New York Times — Jordan News

Okay, this isn’t exactly an SRE story, but it sounds really familiar. It’s a story of “user error” that’s really about designing systems to help users catch errors.

  Jakub Roztocil — httpie

nginx has a pretty nifty zero-downtime restart system, but it didn’t quite fit Cloudflare’s needs.

  Maciej Lechowski — Cloudflare

This article does a great job of summarizing SRECon Americas by pulling out five major themes that ran through multiple talks.

  Gavin Cahill — Gremlin

Building buy-in is everything.

[…] the key function of SRE being to help shape engineering’s perception of reality rather than act as a gatekeeper.

  Ross Brodbeck

By “FinOps”, they mean a team in your company dedicated to reducing cloud computing costs. Does that really help?

  Lydia Leong

[…] it is also possible to create incident writeups that engineers choose to read, that clearly describe and highlight difficult and poorly-understood aspects of our systems, and that become part of the organisation’s collective understanding.

  Laura Nolan — Container Solutions

Years after we both started doing the newsletter thing, I finally sat down with Corey Quinn for an episode of his podcast. We talked about running newsletters, my other side project, and of course, reliability.

  Corey Quinn — Last Week In AWS

SRE Weekly Issue #365

Articles

They take us from the requirements analysis all the way through implementation of a high-throughput data store based on CockroachDB.

  Chuanpin Zhu and Debalin Das — DoorDash

On March 14th, Reddit engineers upgraded a Kubernetes cluster from 1.23 to 1.24, and all hell broke loose. I admire their precision in being down for 100π minutes.

  Jayme Howard — Reddit

With a huge user-base of students and teachers, these folks upped their incident response game, and they share how.

  Nadinastiti and Estu Fardani — GovTech Edu

A lurking bug in redis-py allowed users to see one another’s data, and OpenAI took ChatGPT down to limit the damage.

  OpenAI

In Linux, source port allocation can be complex. This article shows why with a ton of code and tracing examples.

  Jakub Sitnicki — Cloudflare

The gap between “paying for peak” and “earning on average” is critical to understand how the economics of large-scale cloud systems differ from traditional single-tenant systems.

  Marc Brooker

A configuration error was masked because the app automatically fell back to the original configuration. The problem only surfaced when the service was redeployed.

  Heroku

A production of Tinker Tinker Tinker, LLC