SRE Weekly Issue #372

A message from our sponsor, Rootly:

Rootly is hiring for a Sr. Developer Relations Advocate to continue helping world-class companies like Figma, NVIDIA, and Squarespace accelerate their incident management journey. We're looking for previous on-call engineers with a passion for making the world a more reliable place. Learn more:

https://rootly.com/careers?gh_jid=4015888007

Articles

At Pulumi we read every single error message that our API produces. This is the primary mechanism that led to a 17x YoY reduction in our error rate

  Evan Boyle — Pulumi

Rather than striving for a million nines, we should choose the right reliability target based on an evaluation of the effect of downtime on the business.

  Itzy Sabo — HEY

This is a presentation of a study of harm and trauma resulting from incident response work. I especially like the part about blamelessness in theory versus practice.

  Jessica DeVita — InfoQ

Perhaps a sensationalist title, but there’s a really good point here: learning from incidents is only practical if it actually improves the business.

  Chris Evans — incident.io

A highly-detailed proposal for a system to track which users are online at a huge scale.

  Nk — System Design

However, for any cache to be used for the purpose of upscaling, it must operate completely independent from the source of truth (SOT) and must not be allowed to fall back to the SOT on failures.

  Estella Pham and Guanlin Lu – LinkedIn
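The constraint quoted above can be made concrete with a small sketch. In this hypothetical Python example (the names are mine, not from the LinkedIn article), the cache is populated out-of-band from the source of truth's change stream, and a miss fails the request instead of falling back to the SOT:

```python
# Sketch of a read path where the cache is the sole serving tier and never
# falls back to the source of truth (SOT) on failure -- otherwise a cache
# outage would send the full read load crashing into the SOT.

class CacheMiss(Exception):
    """Raised instead of querying the SOT when a key is absent."""

class ScaleOutCache:
    def __init__(self):
        self._data = {}

    def populate(self, key, value):
        # Writes arrive asynchronously from the SOT's change log,
        # never on the read path.
        self._data[key] = value

    def get(self, key):
        if key not in self._data:
            # Fail the request (or serve a default) rather than
            # reading through to the SOT.
            raise CacheMiss(key)
        return self._data[key]
```

The trade-off: some reads fail while the cache warms up, but a cache outage can never redirect the full read load onto the source of truth.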

If you design your system to make lying the only viable option, then people will lie. To me, this article is all about understanding that our systems involve real, squishy humans, and designing appropriately.

  Admiral Cloudberg

SRE Weekly Issue #371

A message from our sponsor, Rootly:

Rootly is hiring for a Sr. Developer Relations Advocate to continue helping world-class companies like Figma, NVIDIA, and Squarespace accelerate their incident management journey. We're looking for previous on-call engineers with a passion for making the world a more reliable place. Learn more:

https://rootly.com/careers?gh_jid=4015888007

Articles

NASA chose to squeeze just a bit more science out of the Voyager spacecraft’s aging power supplies by sacrificing a layer of redundancy. I love this so much, because it sounds just like the kinds of decisions we make during incidents.

  Robert Barron — IBM

I really debated about including this one, because I don’t often include articles about new products, and I think especially critically when the company in question is my employer.

With all that in mind, I’m including this one anyway because Charity Majors really put a fine point on exactly why I, too, am cranky about AIOps.

  Beth Pariseau — TechTarget
  Full disclosure: Honeycomb, my employer, is mentioned.

The main reason that MTTR is a flawed metric is that the nature of each incident varies so wildly. Time to assemble, though, is much closer to being under our control.

  Robert Ross — FireHydrant

The folks at incident.io recommend being expansive in what is considered an incident and then using a defined process to find the real incidents, determine impact and priority, and assign to the right team for resolution.

  Luis Gonzalez — incident.io

GitHub had some interesting incidents this time around, in several cases stemming from changes made with the intention of improving reliability.

  Jakub Oleksy — GitHub

Netflix records and replays live traffic in a testbed environment in order to validate a migration plan before they ever impact real customers.

  Shyam Gala, Javier Fernandez-Ivern, Anup Rokkam Pratap, and Devang Shah — Netflix

The move from a distributed microservices architecture to a monolith application helped achieve higher scale and resilience, and reduced costs.

I’ve seen this sentiment more frequently recently. Are we at the cusp of a general shift away from microservices?

  Marcin Kolny — Amazon Prime Video

SRE Weekly Issue #370

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒.

Rootly automates manual tasks like creating an incident channel, Jira ticket, and Zoom room; inviting responders; and creating status page updates, postmortem timelines, and more. Want to see why companies like Canva and Grammarly love us?

https://rootly.com/demo/

Articles

[…] although “getting the system back up” should be our first priority, to do so safely, we first need to very carefully define what “up” means.

What functionality is critical? Should we sacrifice feature A to save feature B? It’s important to plan ahead.

  Boris Cherkasky

It turns out that it depends on how you define “uptime”. Does claiming “100%” actually benefit you?

  Ellen Steinke — Metrist

Skipping the retro shouldn’t be an option. Ditch the one-size-fits-all process to ensure that this important step happens at the end of every incident.

  Jouhné Scott — FireHydrant

Another good one to have in your back pocket for those “What would you say… you do here?” moments.

  Ash Patel — SREPath

Build versus buy for incident management systems: what is the true cost of rolling your own?

   Biju Chacko and Nir Sharma — Squadcast

A plugin to give ChatGPT the ability to run AWS API calls. I’m not sure how I feel about this.

   Banjo Obayomi — DZone

They solved a cardinality explosion by switching from query-based alerting to stream data processing.

  Ruchir Jha, Brian Harrington, and Yingwu Zhao — Netflix

SRE Weekly Issue #369

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒.

Rootly automates manual tasks like creating an incident channel, Jira ticket, and Zoom room; inviting responders; and creating status page updates, postmortem timelines, and more. Want to see why companies like Canva and Grammarly love us?

https://rootly.com/demo/

Articles

[…] if we focus only on issues with individual components, there’s so much we miss about the nature of failure in our systems.

  Lorin Hochstein

Will Gallego says that we need to prioritize and incentivize learning from near misses, not just actual incidents.

We’ve made headway in expending energy on learning from incidents. We’ll be even better off when we can regularly make learning from successes part of our everyday work as well.

  Will Gallego

This air crash in 1977 taught us many important lessons, including surprising details about the behavior of jet engines in rain. The water ingestion testing apparatus shown in one of the photos is pretty impressive.

  Admiral Cloudberg

When your alerts cover systems owned by different teams, who should be on call?

  Nathan Lincoln — Honeycomb
  Full disclosure: Honeycomb is my employer.

Cloudflare does some pretty eye-opening things with the network stack and file descriptors, as described in this amusingly-named article.

   Quang Luong and Chris Branch

While ostensibly about security incident response, this article has a lot of useful ideas for improving response to any kind of incident.

  Jess Chang — Vanta (for incident.io)

An argument for monoliths over microservices, but with an important caveat: be careful about compartmentalizing your failure domains.

  Lawrence Jones — incident.io

Here’s a great summary of the key themes from last month’s SRECon Americas.

  Paige Cruz — Chronosphere

SRE Weekly Issue #368

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒.

Rootly automates manual tasks like creating an incident channel, Jira ticket, and Zoom room; inviting responders; and creating status page updates, postmortem timelines, and more. Want to see why companies like Canva and Grammarly love us?

https://rootly.com/demo/

Articles

This article uses a simulation to demonstrate the power of shuffle sharding to limit the blast radius of overload conditions.

   Eugene Retunsky — DZone
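The core trick behind shuffle sharding can be sketched in a few lines of Python (the constants and function name here are hypothetical, not from the DZone article): each customer gets a small, deterministic, pseudo-random subset of the worker fleet, so a poison workload from one customer can only take down workers that few other customers fully depend on.

```python
import random

NUM_WORKERS = 8
SHARD_SIZE = 2

def shard_for(customer_id: str,
              num_workers: int = NUM_WORKERS,
              shard_size: int = SHARD_SIZE) -> frozenset:
    """Deterministically pick `shard_size` workers for a customer."""
    # Seed the RNG with the customer ID so the same customer always
    # lands on the same shard, with no assignment table to store.
    rng = random.Random(customer_id)
    return frozenset(rng.sample(range(num_workers), shard_size))
```

With 8 workers and shards of size 2 there are C(8, 2) = 28 distinct shards, so two customers share an identical shard only about 1 time in 28; scale the numbers up and full overlap becomes vanishingly rare, which is what limits the blast radius.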

A comprehensive look at stress testing, how it differs from load testing, how to implement it, and how to analyze the results.

  Lambdatest

Retries and high availability are great, but for critical dependencies, we can go a step further and define an alternative in case a dependency is down.

  Leart Gjoni — DoorDash
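As a rough illustration of that pattern (hypothetical function names, not DoorDash's actual code), the read path retries the primary dependency a bounded number of times and then serves a degraded result from a predefined alternative instead of failing outright:

```python
# Sketch of "define an alternative for critical dependencies":
# bounded retries against the primary, then a predefined fallback.

def call_with_fallback(primary, alternative, retries=1):
    """Try `primary` up to retries+1 times; on failure, use `alternative`."""
    for _ in range(retries + 1):
        try:
            return primary()
        except Exception:  # in real code, catch specific, retryable errors
            continue
    # Primary is unavailable: return a degraded but usable alternative.
    return alternative()
```

Here `primary` might fetch live data from a service while `alternative` returns a stale cached copy: degraded, but far better than an error page.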

From the archives, here’s an incident report from a major outage at DoorDash in 2022.

  Ryan Sokol — DoorDash

Amazon’s old internal “retrospective” process sounds painful and scary. Fortunately the author took the good parts and learned some valuable lessons from the rest.

  Lee Atchison — Container Journal

Instead of asking PMs to “speak SRE,” bridge the communication gap by using the common language of user stories to build business-relevant SLOs.

  Kit Merker — DevOps.com

Amazon gives its service offerings like RDS an advantage by making the (normally pricey) cross-availability-zone data transfer free.

  Corey Quinn — Last Week In AWS

It’s easy to think of reasons to run a retrospective on an incident. What about the reverse? Which incidents should we skip over?

  Lex Neva — The New Stack
  Full disclosure: Honeycomb, my employer, is mentioned.

A production of Tinker Tinker Tinker, LLC