SRE Weekly Issue #467

A message from our sponsor, incident.io:

SEV0 is back. This fall, we’re bringing together the best minds in incident management for a day of learning, sharing, and networking in San Francisco and London. RSVP now—tickets are complimentary.

https://go.incident.io/SEV0-2025

It’s been a while since we’ve seen any updates from the LFI folks, but here’s a brand new home for the community. I’ve bought my membership.

I like this article’s measured approach to anomaly detection and other AIOps features. Will it work? With your data?

  Jacek Migdal — Quesma

A structured approach to system design includes defining the problem, scope, tenets, risks, assumptions, and architecture choices.

I like how this article follows the process it lays out by writing an example design for a distributed search engine.

  Nikunj Agarwal — DZone

A mental model to detect and prevent optimizing the wrong thing, at the wrong time, or for the wrong reasons

This is the first time I’ve seen premature optimization dissected in this way, and I really like this model.

  Alex Ewerlöf

My favorite part of this podcast episode is the discussion of the unintended consequences of automation and “humans-are-better-at/machines-are-better-at” oversimplification. The transcript is great in case you’re not able to listen.

  Shane Hastie, with guest Courtney Nash — InfoQ

What role is an AI tool going to play in your sociotechnical system? This article gives you 12 insightful questions that will help guide your approach.

  Fred Hebert — Honeycomb

As long as there’s at least one HDD ‘tape’ filesystem mounted, you can count them, but once there are none, the result of counting them is not 0 but nothing.

And “nothing” doesn’t cause an alert. Oops!

  Chris Siebenmann
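To make the "nothing is not 0" trap concrete, here is a minimal Python sketch. It is a model of the behavior, not Prometheus itself: the `count_matching`, `alert_fires`, and `alert_fires_safe` helpers and the sample filesystem data are all hypothetical. The aggregation returns no result at all when nothing matches, and the naive rule can only fire on a result that exists.

```python
from typing import Optional

# Hypothetical sample data: one entry per mounted filesystem.
mounted = [
    {"mount": "/tape1", "kind": "hdd-tape"},
    {"mount": "/home", "kind": "ssd"},
]


def count_matching(series: list[dict], kind: str) -> Optional[int]:
    """Count entries of the given kind.

    Returns None (no result) when nothing matches, mimicking an
    aggregation over an empty set rather than returning 0.
    """
    n = sum(1 for s in series if s["kind"] == kind)
    return n if n > 0 else None


def alert_fires(value: Optional[int], minimum: int) -> bool:
    """Naive rule: fire when the count is below the minimum.

    A missing result never compares as "too low", so it never fires.
    """
    return value is not None and value < minimum


def alert_fires_safe(value: Optional[int], minimum: int) -> bool:
    """Safer rule: treat a missing result as zero before comparing."""
    return (value if value is not None else 0) < minimum
```

With every 'tape' filesystem gone, `count_matching` returns `None`, the naive rule stays silent, and only the safe rule fires. In PromQL, a commonly used way to get the same fallback is to append `or vector(0)` to the counting expression.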

SRE Weekly Issue #466

A bit of a short issue this week, as I spent most of my weekend at my child’s first FIRST Robotics Competition of the season. FRC truly is a microcosm of reliability engineering, balancing limited time and resources while trying to produce the most reliable bot possible.

A message from our sponsor, incident.io:

What does “good” incident management look like? MTTx metrics track speed, but speed alone doesn’t mean success. We analyzed 100,000+ incidents from companies of all sizes to identify benchmarks for every stage of the incident lifecycle. See how your team stacks up.

https://go.incident.io/good-incident-management-report

Just because Google, Amazon, or Facebook does it doesn’t mean you should. Here are four ‘best practices’ of the hyperscalers you have permission to ignore.

  Matt Asay — InfoWorld

An introduction to distributed transactions using the Saga pattern, including pros and cons and two approaches for implementing sagas.

  Sid — The Scalable Thread
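For readers new to sagas, here is a rough Python sketch of the core idea: each step is paired with a compensating action, and a failure triggers the compensations for completed steps in reverse order. This is a central-coordinator (orchestration-style) toy, not code from the article; `run_saga`, the step names, and the `ok`/`decline` helpers are hypothetical, and a real saga would also need durable state and retries.

```python
from typing import Callable

# A step is (name, action, compensation). All names are illustrative.
Step = tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: list[Step]) -> list[str]:
    """Run steps in order; on failure, undo completed steps in reverse."""
    log: list[str] = []
    completed: list[tuple[str, Callable[[], None]]] = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception:
            log.append(f"failed:{name}")
            # Roll back everything that already succeeded, newest first.
            for done_name, undo in reversed(completed):
                undo()
                log.append(f"undone:{done_name}")
            break
        log.append(f"done:{name}")
        completed.append((name, compensate))
    return log


def ok() -> None:
    """Stand-in for a local transaction that succeeds."""


def decline() -> None:
    """Stand-in for a local transaction that fails."""
    raise RuntimeError("payment declined")


demo_steps: list[Step] = [
    ("reserve_stock", ok, ok),
    ("charge_card", decline, ok),
    ("ship_order", ok, ok),
]
```

Running `run_saga(demo_steps)` reserves stock, fails on the charge, releases the reservation, and never attempts shipping.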

Here’s an argument against real-world “war rooms” for incident response, including a great incident story as an example.

I can’t imagine doing that kind of multi-window parallel investigation stuff on a teeny little laptop screen with people right next to me on either side

  rachelbythebay

This one includes a list of responsibilities a lead incident responder has and another list of things they should delegate.

Incident lead isn’t an extra job that you do “on top of” engineering. It’s the main job.

  u/devoopseng — Reddit r/sre

Scaling Elasticsearch requires balancing sharding, query performance, and memory tuning for optimal efficiency in high-traffic, real-time applications.

  Vivek Kumar — DZone

SRE Weekly Issue #465

A message from our sponsor, incident.io:

On-call shouldn’t be a constant source of stress. On Feb 26 at 1 PM EST, join us to hear from teams who’ve moved from PagerDuty to incident.io On-call—reducing noise, improving alerting, and making on-call less painful. Insights from engineers who’ve been there.

https://go.incident.io/events/migrating-from-pagerduty

An incident report from the vault, along with its accompanying blog post, involving a rare but serious kernel freeze on GCP.

  Jake Cooper — Railway

Let’s discuss logging – unstructured, structured and canonical log lines – what they are and what value they bring to your production systems.

This one includes an example of implementing a logging system in an example project.

  Obakeng Mosadi

This article aims to answer one question: How can Redis be used as a primary database for complex applications that need to store data in multiple formats?

It covers persistence and scaling options, including Redis Enterprise’s built-in CRDTs.

  Mohammed Talib

In this blog post we’re going to explore how the hung task warning works, why it happens, whether it is a bug in the Linux kernel or application itself, and whether it is worth monitoring at all.

  Oxana Kharitonova and Jesper Brouer — Cloudflare

This post discusses key preconditions for building resilience, including resources, flexibility, expertise, diversity, and coordination.

  Lorin Hochstein

So the main problem with blameful postmortems is not the blame. It’s the very idea that particular decisions can be categorically unsafe.

  u/devoopseng — Reddit r/sre

This may be the shortest article I’ve ever linked to here, but it’ll make you think.

  Dean Wilson

If you use SLOs at all levels in your system, a failure of a core part (like the DB) may page multiple teams. This article offers strategies to handle this better.

  Fred Hebert — Honeycomb

SRE Weekly Issue #464

A message from our sponsor, incident.io:

For years, on-call has felt more like a burden than a solution. But modern teams are making a change. On Feb 26 at 1 PM EST, hear why—and how—they’re moving from PagerDuty to incident.io On-call. Register now.

https://go.incident.io/events/migrating-from-pagerduty

These folks decided that Google Cloud wasn’t for them, and they built and migrated to their own datacenter in 9 months. This article goes over the physical buildout.

  Charith Amarasinghe — Railway

I remember when this incident happened in 2017. It was a huge one, and GitLab was very open with information about it. Here’s a look back at what happened.

  Byte-Sized Design

When your distributed system deals in nanosecond precision, an extra second is a big deal.

  Oleg Obleukhov and Patrick Cullen — Meta

Learn how AWS uses formal verification and other techniques.

Alongside industry-standard testing methods (such as unit and integration testing), AWS has adopted model checking, fuzzing, property-based testing, fault-injection testing, deterministic simulation, event-based simulation, and runtime validation of execution traces.

  Marc Brooker and Ankush Desai — ACM Queue

Normally, we rely on the thoughts, decisions, and actions of individuals to create resilience in our sociotechnical systems, but in some time-critical situations, it can be best for one expert to call the shots.

  Robert Poston, MD

You do not have to choose between gold-plating dressed as craftsmanship or perfectionism and corner-cutting framed as pragmatism or realism. You can have the quality of the former at the speed and focus of the latter. I call this the Best Simple System for Now.

  Dan North & Associates

This is the first I’ve heard of I-PASS, and I like it!

  u/devoopseng — Reddit r/sre

This article is a roundup of schools of thought on how systems fail, with a pretty excellent list of links to related articles at the end.

  Evan Smith

SRE Weekly Issue #463

A message from our sponsor, incident.io:

Incidents move fast—so should your response. That’s why we’re building an AI responder that thinks like your team, not a machine. See how we’re doing it, the challenges faced, and what else is on the AI roadmap.

https://www.youtube.com/watch?v=rNpwZPOUhuE

Sometimes, we can harness randomness to improve throughput and reliability.

  Teiva Harsanyi — The Coder Cafe

Not just the “how”, but also the “why”, along with the challenges they found along the way.

  Daniel Paulus and Umut Uzgur — Checkly

It’s a classic problem: how do you detect problems that badly impact a specific set of customers, when the overall percentage affected is tiny?

  Lakshmi Narayan and Joshua Delman — Stripe

This is the clearest and most concise explanation of the Byzantine Generals Problem that I’ve read.

  Sid — The Scalable Thread

Th[is] article describes some different methods and tools that engineers can use to simulate their clusters and what knowledge they can gain from it, and it presents a case study using SimKube, the Kubernetes simulator developed by Applied Computing Research Labs in 2024.

  David R. Morrison — ACM Queue

An IaaC nightmare: when a list went from having IPs to being empty, suddenly the IP block rule was interpreted as “block everything” rather than “block nothing”.

  Jake Cooper — Railway
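The empty-list trap is easy to reproduce. Below is a small Python sketch of the general pattern, under the assumption that the rule evaluator treats "no entries" as "no filter"; the blurb above doesn't say how the real rule was evaluated, and `is_blocked_naive`, `is_blocked`, and the sample addresses are hypothetical.

```python
import ipaddress


def _in_any(client_ip: str, networks: list[str]) -> bool:
    """True if client_ip falls inside any of the given CIDR networks."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in ipaddress.ip_network(net) for net in networks)


def is_blocked_naive(client_ip: str, blocked: list[str]) -> bool:
    """Buggy: an empty list is read as 'no filter', so the rule matches everyone."""
    if not blocked:
        return True
    return _in_any(client_ip, blocked)


def is_blocked(client_ip: str, blocked: list[str]) -> bool:
    """Intended: an empty list blocks nothing, since any() of nothing is False."""
    return _in_any(client_ip, blocked)
```

With `blocked = ["203.0.113.0/24"]` both functions agree. Once the list becomes `[]`, the naive version blocks every client while the intended version blocks none. A validation step that refuses to apply a rule whose list just became empty is one way to guard against this.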

The incident occurred due to human error and insufficient validation safeguards during a routine abuse remediation for a report about a phishing site hosted on R2.

  Matt Silverlock and Javier Castro — Cloudflare

Along with being blatantly illegal, DOGE’s actions are incredibly risky from a reliability perspective. Thanks, Liz, for putting into words concerns that I also share.

  Liz Fong-Jones — Bulletin of the Atomic Scientists

A production of Tinker Tinker Tinker, LLC