SRE Weekly Issue #482

A message from our sponsor, PagerDuty:

Incidents move fast. But you’ll never get left behind with PagerDuty’s GenAI incident response assistant, available in all paid plans. Get instant business impact analysis, troubleshooting steps, and auto-drafted status updates—directly in Slack. Stop context-switching, start resolving faster.

https://fnf.dev/4dZ5V36

Salesforce posted an analysis of their major outage on June 10. An automated update restarted networking, and routing rules ended up in a bad state. This is remarkably similar to Datadog’s incident in March of 2023.

  Salesforce

In this article, the author likens LLMs to magic, in that they’re a black box in some ways. That has implications for how we go about building reliable systems around them.

  Lorin Hochstein

An executive learns a valuable lesson about the ways they can be useful during an incident — and ways they might inadvertently cause disruption.

  Hamed Silatani — Uptime Labs

This article is a summary of a new paper on how to figure out if your system is susceptible to metastable failure modes.

  Murat Demirbas

This article explores how modern teams can effectively implement, track, and leverage CFR [Change Failure Rate] to drive continuous improvement in their delivery pipelines.

  Saumen Biswas — DZone

A primer on the theory and practice of circuit breakers, including example code using Resilience4j.

  Narendra Lakshmana gowda — DZone

Airbnb introduces their internal load testing framework, Impulse, and shares details about how they perform load testing at scale.

  Chenhao Yang — Airbnb

In this first of a three-part series, HelloFresh introduces their effort to manage complexity. They start by showing what they stand to gain and then introduce high-level strategies.

  Konstantin Rohleder — HelloFresh

SRE Weekly Issue #481

A message from our sponsor, PagerDuty:

Need Slack-native E2E incident management? PagerDuty delivers! Automatic incident workflows that set up Slack channels? ✅ Incident roles and built-in commands? ✅ AI-powered chat that provides real-time customer impact? ✅ Now available on ALL paid PagerDuty plans.

https://fnf.dev/4dZ5V36

On Thursday, GCP had a major incident, returning 500 errors for many services worldwide. Click through for Google’s incident report.

  Google

Cloudflare’s KV service has a dependency on GCP, and Cloudflare posted this report on their incident.

  Jeremy Hartman and CJ Desai — Cloudflare

Lorin Hochstein’s perspective on an incident report often makes me see things I didn’t in my first pass.

  Lorin Hochstein

Should you escalate early or avoid pulling folks in unless absolutely necessary? This article goes into these questions and beyond, delving into the definition and purpose of escalation.

  Hamed Silatani — Uptime Labs

How do we ensure the reliability of an LLM-based system? Can we apply traditional SRE principles and techniques to AI? This article gave me a lot to think about.

  Denys Vasyliev — The New Stack

In this blog post, we’ll discuss our experiences in identifying the challenges associated with EC2 network throttling. We’ll also delve into how we developed network performance monitoring for the Pinterest EC2 fleet and discuss various techniques we implemented to manage network bursts, ensuring dependable network performance for our critical online serving workloads.

  Jia Zhan and Sachin Holla — Pinterest

High Availability keeps things stable in small failures. DR is the safety net for large-scale disasters.

After explaining why HA by itself isn’t enough, this article covers strategies, costs, and best practices for disaster recovery.

  Yakaiah Bommishetti — HackerNoon

This article explains how observability costs can ramp up quickly, especially if we’re not careful about what data we store.

There’s a lot of nuance here, and the author posted this followup the next day after receiving many responses.

  Leon Adato

SRE Weekly Issue #480

A message from our sponsor, PagerDuty:

🔍 Notable PagerDuty shift: Full incident management now spans all paid tiers. The upgraded Slack-first and Teams-first experience means fewer tools to juggle during incidents. Only leveraging PagerDuty for basic alerting? Time to check out what’s newly available in your plan!

https://fnf.dev/4dZ5V36

the idea that the highest ROI for risk reduction work is in the highest severity incidents is not a fact, it’s a hypothesis that simply isn’t supported by data.

  Lorin Hochstein

Incidents are bad, so should we try to have fewer of them? This article challenges the assumptions contained within that goal and suggests other ways to frame one’s thinking.

  Hamed Silatani — Uptime Labs

This guide goes deeply into the details of how Prometheus uses memory, and then it shows you how to get a handle on it.

  Vladimir Guryanov — Palark

This article discusses the DNS-related challenges encountered at Mercari on our Kubernetes clusters and the significant improvements achieved by implementing Node-Local DNS Cache.

  Satyadarshi Sanu — Mercari

In this post we’ll explore the fundamentals of distributed consensus, compare the dominant consensus algorithms Paxos and Raft, and examine recent implementations like Kafka Raft.

  Narendra Reddy Sanikommu — DEV

A discussion of two techniques the folks at Cash App used to improve their reliability: adopting a two-cluster topology with Kubernetes, and using Amazon’s Fault Injection Service to simulate the failure of an availability zone.

  Dustin Ellis, Deepak Garg, Ben Apprederisse, Jan Zantinge, and Rachel Sheikh — Amazon

Reading this one taught me a couple of techniques I wasn’t aware of for finding queries in need of optimization in MySQL.

  Vinicius Grippa — Readyset

Ouch — and a great learning opportunity for all of us:

When our backend circuit breakers triggered, aggressive websocket reconnect logic initiated on every connected client at once, further overwhelming an already stressed database.

  Jake Cooper — Railway
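
One common mitigation for this kind of synchronized reconnect storm is exponential backoff with random jitter, so that clients spread their retries out instead of reconnecting all at once. The sketch below is illustrative only: the function name, the base delay, and the cap are assumptions, and nothing here is taken from Railway's report or describes the fix they chose.

```python
import random
from typing import Optional


def reconnect_delay(attempt: int,
                    base: float = 1.0,
                    cap: float = 60.0,
                    rng: Optional[random.Random] = None) -> float:
    """Return how many seconds to wait before reconnect attempt `attempt`.

    Uses "full jitter": the delay is drawn uniformly from zero up to an
    exponentially growing ceiling, which is itself limited by `cap`.
    `base` and `cap` are hypothetical defaults chosen for illustration.
    """
    if attempt < 0:
        raise ValueError("attempt must be non-negative")
    source = rng if rng is not None else random
    ceiling = min(cap, base * (2 ** attempt))
    return source.uniform(0.0, ceiling)


if __name__ == "__main__":
    # Each client computes its own random delay, so retries are spread out.
    for attempt in range(6):
        print(attempt, round(reconnect_delay(attempt), 2))
```

Because every client draws its own delay, a burst of simultaneous disconnects turns into reconnects spread across the whole backoff window rather than a single spike.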

SRE Weekly Issue #479

Rollbacks don’t always return you to a previous system state. They can return you to a state you’ve never tested or operated before.

  Steve Fenton — Octopus Deploy

This article explains the math of burn rate alerting and gives well-thought-out reasoning for why burn rates are better.

  James Frullo — Datadog
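
For readers new to the term, here is a minimal sketch of the arithmetic under the commonly used definition: burn rate is the observed error rate divided by the error budget (one minus the SLO target). The function names and the 30-day window are assumptions made for illustration and are not taken from the linked article.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """Return how many times faster than "exactly on budget" errors occur.

    error_rate: fraction of requests that failed in the lookback window.
    slo_target: the SLO as a fraction, e.g. 0.999 for 99.9%.
    """
    budget = 1.0 - slo_target
    if budget <= 0.0:
        raise ValueError("slo_target must be below 1.0")
    return error_rate / budget


def hours_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Return hours until the whole budget is gone at a constant burn rate."""
    if rate <= 0.0:
        return float("inf")
    return window_days * 24 / rate


if __name__ == "__main__":
    rate = burn_rate(error_rate=0.0144, slo_target=0.999)
    print(f"burn rate: {rate:.1f}")
    print(f"hours until a 30-day budget is spent: {hours_to_exhaustion(rate):.0f}")
```

A burn rate of 1 means the budget lasts exactly the SLO window. Higher values mean it runs out proportionally sooner, which is why alert thresholds are often expressed as burn rates over a lookback window rather than as raw error rates.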

This hot take is worth thinking about: what do you want to get out of assigning incident severity levels, and is it working?

  Hamed Silatani — Uptime Labs

Less a defense of code freezes, and more about how to best cope with one and avoid the downsides when you’ve got no choice.

  Tom Elliott

MTTI in this case is Mean Time to Isolate. How long are you taking to figure out what system component is at the heart of an incident? What does MTTI say about your system, and what can you do about it?

  Old School Burke

This article doesn’t answer the question in its title concretely, but it does give one a lot to think about. It also shares some ideas for how to cope with the potential challenges identified.

  Sylvain Kalache — LeadDev

This one starts off as a review of a workbook on root cause analysis by the UK Health and Safety Executive. Then it raises concerns about RCA-based reasoning and contrasts with a different model based on resilience engineering.

  Lorin Hochstein

I wrote this article in response to Azure’s post, Introducing Azure SRE Agent. There’s a lot we can learn from the example agent interactions that Microsoft chose to share.

  Lex Neva

SRE Weekly Issue #478

Datadog has fully merged their SRE and Security teams.

In this post, we’ll look at essential elements of SRE and security, the benefits we’ve realized by combining the two disciplines, and what that approach looks like for us.

  Bianca Lankford — Datadog

I love the way this article describes three different audiences for your communication during incidents. It describes what each audience is looking for and gives both positive and negative examples of how to communicate with them.

  Hamed Silatani — Uptime Labs

My favorite part of this article is the section on where to run your load tests: production, staging, or something else?

  Tom Elliott

What is complexity? This article gives a clear definition and breaks down the qualities one can find in a complex system. Then it goes over various methods of dealing with that complexity.

  Teiva Harsanyi — The Coder Cafe

Cloudflare has a history of doing some pretty interesting things with sockets in Linux — and taking us along for the journey with highly-detailed explanations. This article is no exception, sharing the unique challenges encountered when restarting processes that handle UDP streams.

  Marek Majkowski

This article examines the standard Friday deploy prohibition and ultimately pushes back.

Ok… but why not?

  Adrien Guéret — OpenClassrooms

This article introduces the STAMP (System-Theoretic Accident Model and Processes) framework being adopted at Google, after first explaining the shortcomings in traditional SRE practices that prompted Google to adopt STAMP.

  Jorge Lainfiesta — Rootly

I really love this framing of what’s wrong with picking a single root cause.

  Lorin Hochstein

A production of Tinker Tinker Tinker, LLC