SRE Weekly Issue #370

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒.

Rootly automates manual tasks like creating an incident channel, Jira ticket and Zoom rooms, inviting responders, creating statuspage updates, postmortem timelines and more. Want to see why companies like Canva and Grammarly love us?:

https://rootly.com/demo/

Articles

[…] although “getting the system back up” should be our first priority, to do so safely, we first need to very carefully define what “up” means.

What functionality is critical? Should we sacrifice feature A to save feature B? It’s important to plan ahead.

  Boris Cherkasky

It turns out that it depends on how you define “uptime”. Does claiming “100%” actually benefit you?

  Ellen Steinke — Metrist

Skipping the retro shouldn’t be an option. Ditch the one-size-fits-all process to ensure that this important step is held at the end of every incident.

  Jouhné Scott — FireHydrant

Another good one to have in your back pocket for those “What would you say… you do here?” moments.

  Ash Patel — SREPath

Build versus buy for incident management systems: what is the true cost of rolling your own?

  Biju Chacko and Nir Sharma — Squadcast

A plugin to give ChatGPT the ability to run AWS API calls. I’m not sure how I feel about this.

  Banjo Obayomi — DZone

They solved a cardinality explosion by switching from query-based alerting to stream data processing.

  Ruchir Jha, Brian Harrington, and Yingwu Zhao — Netflix

SRE Weekly Issue #369

Articles

if we focus only on issues with individual components, there’s so much we miss about the nature of failure in our systems.

  Lorin Hochstein

Will Gallego says that we need to prioritize and incentivize learning from near misses, not just actual incidents.

We’ve made headway into expending energy towards learning from incidents. We’ll be even better off when we can regularly make learning from successes our regular work as well.

  Will Gallego

This air crash in 1977 taught us many important lessons, including surprising details about the behavior of jet engines in rain. The water ingestion testing apparatus shown in one of the photos is pretty impressive.

  Admiral Cloudberg

When your alerts cover systems owned by different teams, who should be on call?

  Nathan Lincoln — Honeycomb
  Full disclosure: Honeycomb is my employer.

Cloudflare does some pretty eye-opening things with the network stack and file descriptors, as described in this amusingly-named article.

  Quang Luong and Chris Branch

While ostensibly about security incident response, this article has a lot of useful ideas for improving response to any kind of incident.

  Jess Chang — Vanta (for incident.io)

An argument for monoliths over microservices, but with an important caveat: be careful about compartmentalizing your failure domains.

  Lawrence Jones — incident.io

Here’s a great summary of the key themes from last month’s SRECon Americas.

  Paige Cruz — Chronosphere

SRE Weekly Issue #368

Articles

This article uses a simulation to demonstrate the power of shuffle sharding to limit the blast radius of overload conditions.

  Eugene Retunsky — DZone

A comprehensive look at stress testing, how it differs from load testing, how to implement it, and how to analyze the results.

  Lambdatest

Retries and high availability are great, but for critical dependencies, we can go a step further and define an alternative in case a dependency is down.

  Leart Gjoni — DoorDash

From the archives, here’s an incident report from a major outage at DoorDash in 2022.

  Ryan Sokol — DoorDash

Amazon’s old internal “retrospective” process sounds painful and scary. Fortunately the author took the good parts and learned some valuable lessons from the rest.

  Lee Atchison — Container Journal

Instead of asking PMs to “speak SRE,” span the communication gap by using the common language of user stories to build business-cogent SLOs.

  Kit Merker — DevOps.com

Amazon gives its own service offerings like RDS an advantage by making the (normally pricey) cross-availability-zone data transfer free.

  Corey Quinn — Last Week In AWS

It’s easy to think of reasons to run a retrospective on an incident. What about the reverse? Which incidents should we skip over?

  Lex Neva — The New Stack
  Full disclosure: Honeycomb, my employer, is mentioned.

SRE Weekly Issue #367

Articles

Reading this article will teach you the math you need to know to build alerting that has a low false positive rate and why this is trickier than it may seem.

  Dan Slimmon

Cloudflare adapted a technique shared by PagerDuty to detect failed Kafka consumers and restart them.

  Chris Shepherd and Andrea Medda — Cloudflare

Gracefully shutting down is important; otherwise, every deploy will result in client-facing errors.

  Srinavas — eightnoteight

There’s a wealth of lessons learned in this article. My favorite: idempotency was never part of the contract, but consumers nevertheless depended on it.

  Gergely Orosz — Pragmatic Engineer

Making our companies into High Reliability Organizations (HROs) rarely makes sense, but we can still learn useful skills and techniques from them. This article gives a good overview and analysis of HROs.

  Dominic Cooper — Safety & Health Practitioner

By “tiered”, this article means having discussions about reliability at three levels: the engineering team level, the director level, and the executive level.

  Ross Brodbeck

After explaining why deploys aren’t the right approach, this article proposes feature flags as a safer alternative.

  Charity Majors — Honeycomb
  Full disclosure: Honeycomb is my employer.

SRE Weekly Issue #366

Articles

In incident management, as in so many areas, there’s the shiny work and the unglamorous but critical parts, and the latter often fall to women. This article seeks to reverse that trend by reminding us of the incredibly important glue work women have been doing since the dawn of computing.

  Emily Arnott — Blameless

I love stories about applying IT incident response processes to non-IT incidents.

  Robert Ross — FireHydrant

Dear reader, perhaps you would enjoy reading this article on the many benefits of engineering blogs… then go write more great content and send me a link. :D

  New York Times — Jordan News

Okay, this isn’t exactly an SRE story, but it sounds really familiar. It’s a story of “user error” that’s really about designing systems to help users catch errors.

  Jakub Roztocil — httpie

nginx has a pretty nifty zero-downtime restart system, but it didn’t quite fit Cloudflare’s needs.

  Maciej Lechowski — Cloudflare

This article does a great job of summarizing SRECon Americas by pulling out five major themes that ran through multiple talks.

  Gavin Cahill — Gremlin

Building buy-in is everything.

[…] the key function of SRE being to help shape engineering’s perception of reality rather than act as a gatekeeper.

  Ross Brodbeck

By “FinOps”, they mean a team in your company dedicated to reducing cloud computing costs. Does that really help?

  Lydia Leong

[…] it is also possible to create incident writeups that engineers choose to read, that clearly describe and highlight difficult and poorly-understood aspects of our systems, and that become part of the organisation’s collective understanding.

  Laura Nolan — Container Solutions

Years after we both started doing the newsletter thing, I finally sat down with Corey Quinn for an episode of his podcast. We talked about running newsletters, my other side project, and of course, reliability.

  Corey Quinn — Last Week In AWS

A production of Tinker Tinker Tinker, LLC