SRE Weekly Issue #508

SRE Weekly will be going on hiatus for 6 weeks, while I’m on leave caring for my partner after her kidney transplant surgery this week. It’s incredible that the National Kidney Registry’s Paired Exchange program allowed me to donate a kidney to help her even though we don’t have matching blood types!

A message from our sponsor, Costory:

Tired of manually explaining your cloud & LLM bills?
Check our live preview to see how Costory links every cost spike to deployments, infra changes, and usage patterns, and delivers a clean summary straight to Slack.

Explore the demo

What do we miss when we have LLMs write our code for us? This article explains that one thing we can miss out on is building a mental model.

  Shayon Mukherjee

I really love this explanation of the concept of compensation.

Compensation is a very interesting mechanism in software systems because it can keep complex systems alive, but also because it can be a factor in how they quickly and unexpectedly collapse.

  Fred Hebert — Resilience in Software Foundation

When you investigate an incident and tell the story about what you found, but no one believes you because there’s no smoking gun or bad actor…

  Lorin Hochstein

To build and maintain reliable systems, organizations must align responsibility with control. This is where the Ownership Trio (Mandate, Knowledge, and Accountability) comes in.

  Spiros Economakis

I love when an article goes through the designs they passed over (and why) before reaching their final design, as in this one.

  Julianne Walker — Tines

If you’re unfamiliar with Docker image lazy loading like I was, this is a great primer on two options, Estargz and SOCI.

  Huong Vuong and Joseph Sahayaraj — Grab

But don’t let MTTR become the thing you’re optimising for. The goal is to build systems and processes where you’re constantly learning and improving, not systems where you’re just really efficient at fighting the same fires over and over.

  Dave O’Connor

I watched a supposedly “resilient” Multi-Region setup completely implode recently. The architecture diagram looked great – active workloads in US-East, cold standby in US-West. But when the provider had a global IAM service degradation, the whole thing became a brick.

  u/NTCTech on Reddit

SRE Weekly Issue #507

A message from our sponsor, incident.io:

incident.io lives inside Slack and Microsoft Teams, breaking down emergencies into actionable steps to resolution. Alerts auto-create channels, ping the right people, log actions in real time, and generate postmortems with full context.
Move fast when you break things and stay organized inside the tool your team already uses every day.

https://fandf.co/4pRFm4d

There’s a lot you can get out of this one even if you don’t happen to be using one of the Helm charts they evaluated. Their evaluation criteria are useful and easy to apply to other charts, and they also make a great study guide for those new to Kubernetes.

  Prequel

This is the best explanation I’ve seen yet of exactly why SSL certificates are so difficult to get right in production.

  Lorin Hochstein

An article on the importance of incident simulation for training, drawing from external experience in using simulations.

  Stuart Rimell — Uptime Labs

I especially like the discussion of checklists, since they are often touted as a solution to the attention problem.

  Chris Siebenmann

This is a new product/feature announcement, but it also has a ton of detail on their implementation, and it’s really neat to see how they built cloud provider region failure tolerance into WarpStream.

  Dani Torramilans — WarpStream

It’s interesting to think of money spent on improving reliability as offsetting the cost of responding to incidents. It’s not one-to-one, but there’s an argument to be made here.

  Florian Hoeppner

An explanation of the Nemawashi principle for driving buy-in for your initiatives. This is not specifically SRE-targeted, but we so often find ourselves seeking buy-in for our reliability initiatives.

  Matt Hodgkins

The next time you’re flooded with alerts, ask yourself: Does this metric reflect customer pain, or is it just noise? The answer could change how you approach reliability forever.

  Spiros Economakis

SRE Weekly Issue #506

A message from our sponsor, Costory:

You didn’t sign up to do FinOps.
Costory automatically explains why your cloud costs change, and reports it straight to Slack.
Built for SREs who want to code, not wrestle with spreadsheets.
Now on AWS & GCP Marketplaces.

Start your free trial at costory.io

I didn’t know that some resolvers care about the order of some DNS records in a response, but I’m not surprised. The DNS spec, despite its age and multiple revisions, has a number of ambiguities like this.

  Sebastiaan Neuteboom — Cloudflare

Severity isn’t always the best indicator of the incidents we can learn the most from. What if we rate our incidents on their potential for learning?

  Lorin Hochstein

This one discusses three ways you can lose time in incidents and ideas for what you can do about it.

  Hrishikesh Barua — Uptime Labs

An interesting discussion of a bias: we tend to solve problems by adding things to our systems, and that increases complexity. AI can amplify this bias.

  Uwe Friedrichsen

Ever wondered how OTel auto-instrumentation works? This article explains it in detail (with code examples) for Python, Java, and Go.

  Elizabeth — Observability Real Talk

This article stands out from others about AI SRE agents because it goes into some detail on their method for evaluating whether their agent works. I’d love to see more of the actual evaluation results, and examples of it getting things right vs wrong.

  Daniel Shan and Tristan Ratchford — Datadog

I recently got an error from GitHub saying I’d exceeded a rate limit (when I definitely didn’t), and this article explains why.

  Thomas Kjær Aabo — GitHub

Poor telemetry makes us want to add more telemetry, which can decrease our telemetry quality and make us add more, yikes! How can we fix the feedback loop?

Note for blind or low-vision readers: there’s a pretty important diagram in this one without a caption or alt text.

  Ash Patel

SRE Weekly Issue #505

A message from our sponsor, Hopp:

Paging at 2am? 🚨 Make incident triage feel like you’re at the same keyboard with Hopp.

  • crisp, readable screen-sharing
  • no more “can you zoom in?”
  • click + type together
  • bring the incident bridge into one session

Start pair programming: https://www.gethopp.app/?via=sreweekly

An incident write-up from the archives, and it’s a juicy one. An update to their code caused a crash only after some time had passed, so their automated testing didn’t catch it before they deployed it worldwide.

  Xandr

This article covers an independent review of the Optus outage.

I personally find it astounding that somebody conducting an incident investigation would not delve deeper into how a decision that appears to be astounding would have made sense in the moment.

  Lorin Hochstein

Cloudflare needed a tool to look for overlapping impact across their many maintenance events in order to avoid unintentionally impairing redundancy.

  Kevin Deems and Michael Hoffmann — Cloudflare

Another great piece on expiration dates. I especially like the discussion of abrupt cliffs as a design choice.

  Chris Siebenmann — University of Toronto

It’s not always easy to see how to automate a given bit of toil, especially when cross-team interactions are involved.

  Thomas A. Limoncelli and Christian Pearce — ACM Queue

How do resilience and fault tolerance relate? Are they synonyms, do they overlap, or does one contain the other?

  Uwe Friedrichsen

After unexpectedly losing their observability vendor, these folks were able to migrate to a new solution within a couple of days.

  Karan Abrol, Yating Zhou, Pratyush Verma, Aditya Bhandari, and Sameer Agarwal — Deductive.ai

A great dive into what blameless incident analysis really means.

Blameless also doesn’t mean you stop talking about what people did.

  Busra Koken

SRE Weekly Issue #504

Salt is Cloudflare’s configuration management tool.

How do you find the root cause of a configuration management failure when you have a peak of hundreds of changes in 15 minutes on thousands of servers?

The result of this has been a reduction in the duration of software release delays, and an overall reduction in toilsome, repetitive triage for SRE.

  Opeyemi Onikute, Menno Bezema, Nick Rhodes — Cloudflare

In this post, I’ll give a high-level overview of what Temporal offers users, the problems we were experiencing operating Spinnaker that motivated its initial adoption at Netflix, and how Temporal helped us reduce the number of transient deployment failures at Netflix from 4% to 0.0001%.

  Jacob Meyers and Rob Zienert — Netflix

DrP provides an SDK that teams can use to define “analyzers” to perform investigations, plus post-processors to perform mitigations, notifications, and more.

  Shubham Somani, Vanish Talwar, Madhura Parikh, Chinmay Gandhi — Meta

This article goes into detail on the ways QA folks can reskill and map their responsibilities and skills to SRE practices.

  Nidhi Sharma — DZone

“Correction of Error” is the name used by Amazon for their incident review process, and there’s a lot to unpack there.

  Lorin Hochstein

In 2019, Charity Majors came down hard on deploy freezes with an article, Friday Deploy Freezes are Exactly Like Murdering Puppies.

This one takes a more moderate approach: maybe a deploy freeze is the right choice for your organization, but you should work to understand why rather than assuming.

  Charity Majors

A piece defining the term “resilience”, with an especially interesting discussion of the inherent trade-off between efficiency and resiliency.

  Uwe Friedrichsen

Honeycomb experienced a major, extended incident in December, and they published this (extensive!) interim report. Resolution required multiple days’ worth of engineering on new functionality and procedures related to Kafka. A theme of managing employees’ energy and resources is threaded throughout the report.

  Honeycomb

A production of Tinker Tinker Tinker, LLC