SRE Weekly Issue #511

A message from our sponsor, Depot:

CI was designed for humans who context-switch while waiting. Agents don’t. They’re just blocked. Depot CEO Kyle Galbraith on how they re-imagined Depot CI to close the loop: run against local patches, rerun a single job, SSH into the runner to check reality. Per-second billing, no minimums.

Run depot ci migrate

This one’s definitely going to be good to keep in mind during my next incident.

FYI for folks with no or low vision, there’s a screenshot of J. Paul Reed quoting Vanessa Huerta Granda: “Incidents are where engineers are made”.

  Stuart Rimell — Uptime Labs

Etsy migrated a 1,000-table DB with 1,000 shards (with their own custom ORM!) over to Vitess, and it took some care, especially in how they handled transactions.

  Ella Yarmo-Gray — Etsy

Wow, this one sure hits hard.

  Kenneth Eversole

The section on lessons learned toward the end of this debugging story is a goldmine.

  Lokesh Soni

How do you ensure reliability in a system you can’t access? How can you monitor SLIs/SLOs without metrics?

  Alex Ewerlöf

I love a good debugging story, and this one delivers, with a confluence of gnarly problems and lessons we can all learn from.

  James Sawyer — Phantom Tide

Oof, what a nasty little gotcha in the API call at the heart of this incident.

  David Tuber and Dzevad Trumic — Cloudflare

Lorin’s Law strikes again!

System intended to improve reliability contributed to incident

  Lorin Hochstein

SRE Weekly Issue #510

A message from our sponsor, ClickHouse:

AI isn’t replacing SREs. It’s changing how they work.

The near future of observability isn’t autonomous agents, it’s collaboration. ClickHouse’s ClickStack Notebooks bring SREs and AI into a shared investigative workspace, combining human intuition with structured, reliable tooling to debug faster and think more clearly.

Read more

ML systems decay gradually instead of breaking suddenly, so we need error budgets for model accuracy, data freshness, and fairness — not just uptime.

  Varun Kumar Reddy Gajjala — DZone

Enterprises rarely fail because they don’t care about reliability.
They fail because:

  • failure is loud,
  • prevention is quiet,
  • and budgeting systems are wired to respond to noise.

  Florian Hoeppner

They had hundreds of databases to migrate, so they built a tested, self-service migration workflow.

  Ram Srivasta Kannan, Wale Akintayo, Jay Bharadwaj, John Crimmins, Shengwei Wang, and Zhitao Zhu — Netflix

I love the technical description of socket juggling to achieve a graceful restart. I could swear that this technique has been around for decades though, for example in TinyMUX et al…

  Manuel Olguín Muñoz — Cloudflare

Lorin goes into what an AI incident manager might look like, since no tools of the sort exist yet.

  Lorin Hochstein

By default, Kubernetes keeps a pretty short event history. This article argues that what we really need is the ability to know the state of the system at a specific time.

  Shamsher Khan — DZone

They built a platform for safely rolling out configuration changes. I like that it has a special mode for use in incident response.

  Cosmo W. Q — Airbnb

This is a cool debugging story, and I love the emphasis on mental models. The bit about simulating different paths through the software is quite intriguing.

  Michael Victor Zink — Readyset (via Antithesis)

SRE Weekly Issue #509

SRE Weekly is back! My partner is doing well, and thanks for all the kind words and well-wishes.

A message from our sponsor, Costory:

Tracking cloud and AI costs across AWS, GCP, and Datadog shouldn’t require three dashboards and a spreadsheet.

Costory correlates cost, usage, and deployment data. Explains what changed and why. Straight to Slack. Terraform setup.

Try it free → https://www.costory.io/lp/no-time-4-finops?utm_source=sre-weekly&utm_medium=newsletter&utm_campaign=&utm_id=no-time

There’s a lot you miss out on if you get an LLM to write your incident review.

incident reviews are fundamentally a socio-technical process, and they do not provide benefit if people don’t engage with them.

  Fischer

I love this concept of reliability debt.

  Spiros Economakis

This one starts with an insightful comparison of two commercial aviation incidents and the crew’s actions. It goes on to draw broader lessons that we can use as SREs.

  Hamed Silatani — Uptime Labs

What happens now that SQL is being written by LLMs? I love the analogy to the advent of ORMs that abstracted away the generation of SQL.

  Tanmay Sinha — Readyset

What specific kind of bugs is AI more likely to generate? Do some categories of bugs show up more often? How severe are they? How is this impacting production environments?

They did a survey of 470 codebases and share the numbers on the rate of bugs generated by LLMs versus humans.

  David Loker — CodeRabbit

This post looks at ten real status page examples from teams that have dealt with outages at scale. Each example highlights what they communicate well, where they set expectations clearly, and how small details reduce confusion during incidents.

  Laura Clayton — UptimeRobot

If you don’t explicitly state your expected level of reliability, your customers will infer one and hold you to it anyway. “Disappoint” them early by telling them what to expect.

  Dave O’Connor

Humans exhibit variation in how we respond to a given situation, and this article argues that it’s one of our strengths. LLMs also exhibit variability, by design.

  Lorin Hochstein

SRE Weekly Issue #508

SRE Weekly will be going on hiatus for 6 weeks, while I’m on leave caring for my partner after her kidney transplant surgery this week. It’s incredible that the National Kidney Registry’s Paired Exchange program allowed me to donate a kidney to help her even though we don’t have matching blood types!

A message from our sponsor, Costory:

Tired of manually explaining your cloud & LLM bills?
Check our live preview to see how Costory links every cost spike to deployments, infra changes, and usage patterns, and delivers a clean summary straight to Slack.

Explore the demo

What do we miss when we have LLMs write our code for us? This article explains that one thing we can miss out on is building a mental model.

  Shayon Mukherjee

I really love this explanation of the concept of compensation.

Compensation is a very interesting mechanism in software systems because it can keep complex systems alive, but also because it can be a factor in how they quickly and unexpectedly collapse.

  Fred Hebert — Resilience in Software Foundation

When you investigate an incident and tell the story about what you found, but no one believes you because there’s no smoking gun or bad actor…

  Lorin Hochstein

To build and maintain reliable systems, organizations must align responsibility with control. This is where the Ownership Trio—Mandate, Knowledge, and Accountability—comes in.

  Spiros Economakis

I love when an article goes through the designs they passed over (and why) before reaching their final design, as in this one.

  Julianne Walker — Tines

If you’re unfamiliar with Docker image lazy loading like I was, this is a great primer on two options, eStargz and SOCI.

  Huong Vuong and Joseph Sahayaraj — Grab

But don’t let MTTR become the thing you’re optimising for. The goal is to build systems and processes where you’re constantly learning and improving, not systems where you’re just really efficient at fighting the same fires over and over.

  Dave O’Connor

I watched a supposedly “resilient” Multi-Region setup completely implode recently. The architecture diagram looked great – active workloads in US-East, cold standby in US-West. But when the provider had a global IAM service degradation, the whole thing became a brick.

  u/NTCTech on Reddit

SRE Weekly Issue #507

A message from our sponsor, incident.io:

incident.io lives inside Slack and Microsoft Teams, breaking down emergencies into actionable steps to resolution. Alerts auto-create channels, ping the right people, log actions in real time, and generate postmortems with full context.
Move fast when you break things and stay organized inside the tool your team already uses every day.

https://fandf.co/4pRFm4d

There’s a lot you can get out of this one even if you don’t happen to be using one of the Helm charts they evaluated. Their evaluation criteria are useful and easy to apply to other charts, and they’re also a great study guide for those new to Kubernetes.

  Prequel

This is the best explanation I’ve seen yet of exactly why SSL certificates are so difficult to get right in production.

  Lorin Hochstein

An article on the importance of incident simulation for training, drawing from external experience in using simulations.

  Stuart Rimell — Uptime Labs

I especially like the discussion of checklists, since they are often touted as a solution to the attention problem.

  Chris Siebenmann

This is a new product/feature announcement, but it also has a ton of detail on their implementation, and it’s really neat to see how they built cloud provider region failure tolerance into WarpStream.

  Dani Torramilans — WarpStream

It’s interesting to think of money spent on improving reliability as offsetting the cost of responding to incidents. It’s not one-to-one, but there’s an argument to be made here.

  Florian Hoeppner

An explanation of the Nemawashi principle for driving buy-in for your initiatives. This is not specifically SRE-targeted, but we so often find ourselves seeking buy-in for our reliability initiatives.

  Matt Hodgkins

The next time you’re flooded with alerts, ask yourself: Does this metric reflect customer pain, or is it just noise? The answer could change how you approach reliability forever.

  Spiros Economakis

A production of Tinker Tinker Tinker, LLC