SRE WEEKLY – Page 4 – scalability, availability, incident response, automation

SRE Weekly Issue #468

lex

March 16, 2025

No matter how bullet-proof you build the components of your system, the only way to make nines go up is to be ready to deal with the host of surprises that take them back down.

Clint Byrum

Bloom Filters & Supercharged Product Recommendations

Here’s an example of a really great application of Bloom filters, in which speed is key and a slight risk of false positives is acceptable.

Alex Gardiner — Klaviyo
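To make the trade-off above concrete, here is a minimal sketch of a Bloom filter. It is not taken from the linked article: the class name, sizes, hashing scheme, and the `sku-*` product IDs are all assumptions chosen for illustration.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: never a false negative, occasionally a false positive."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, item: str):
        # Derive k positions by salting one hash function with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos] for pos in self._positions(item))


# Hypothetical usage: remember which products a customer has already seen.
seen = BloomFilter()
for product_id in ("sku-1", "sku-2", "sku-3"):
    seen.add(product_id)
```

A `False` from `might_contain` is always trustworthy, while a `True` may occasionally be wrong. That is acceptable when the worst outcome is skipping a recommendation the customer had not actually seen.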

Should we replace traffic controllers with raspberry pi??

This fun video gives us a small glimpse into the world of traffic light controllers, and more importantly, what makes them reliable. There’s also a longer video that goes deeper into why a Raspberry Pi isn’t up to the job.

Traffic Light Doctor

Scaling Prometheus: From Single Node to Enterprise-Grade Observability

Here’s an overview of several options to scale Prometheus beyond a single instance, including a handy table of features and functionality.

Gaurav Maheshwari

Learning from Failure, Why You Should Write Post-Mortems for Your Homelab

A nice guide for using incident analysis in your home lab setup, plus a write-up for an incident experienced by the author.

Barush Mendez

Paxos made visual in FizzBee

A highly detailed explanation of Paxos with diagrams and a model in FizzBee.

Lorin Hochstein

Why I’m No Longer Talking to Architects About Microservices

I’ve boiled my frustration down to three problems:

No one agrees on what “microservice” means.

Microservices conversations are abstract, with little tie-in to real business goals.

Adopting microservices without changing your organisation is pointless.

Ian Miell — Container Solutions

SRE Weekly Issue #467

lex

March 9, 2025

Introducing the Resilience in Software Foundation

It’s been a while since we’ve seen any updates from the LFI folks, but here’s a brand new home for the community. I’ve bought my membership.

Lessons from the pre-LLM AI in Observability: Anomaly Detection and AIOps vs. P99

I like this article’s measured approach to anomaly detection and other AIOps features. Will it work? With your data?

Jacek Migdal — Quesma

A Step-by-Step Guide to Write a System Design Document

A structured approach to system design includes defining the problem, scope, tenets, risks, assumptions, and architecture choices.

I like how this article follows the process it lays out by writing an example design for a distributed search engine.

Nikunj Agarwal — DZone

Premature optimization

A mental model to detect and prevent optimizing the wrong thing, at the wrong time, or for the wrong reasons

This is the first time I’ve seen premature optimization dissected in this way, and I really like this model.

Alex Ewerlöf

Resilience, Observability and Unintended Consequences of Automation

My favorite part of this podcast episode is the discussion of the unintended consequences of automation and “humans-are-better-at/machines-are-better-at” oversimplification. The transcript is great in case you’re not able to listen.

Shane Hastie, with guest Courtney Nash — InfoQ

AI: Where in the Loop Should Humans Go?

What role is an AI tool going to play in your sociotechnical system? This article gives you 12 insightful questions that will help guide your approach.

Fred Hebert — Honeycomb

A Prometheus gotcha with alerts based on counting things

As long as there’s at least one HDD ‘tape’ filesystem mounted, you can count them, but once there are none, the result of counting them is not 0 but nothing.

And “nothing” doesn’t cause an alert. Oops!

Chris Siebenmann
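The gotcha above can be modeled in a few lines. This is a toy simulation of the semantics, not Prometheus code: the function names and the mount paths are hypothetical. In PromQL itself, one common way to get an explicit zero is to append `or vector(0)` to the count expression.

```python
from typing import Optional


def count_series(samples: dict[str, float]) -> Optional[int]:
    """Toy count aggregation: no input series means no result at all."""
    return len(samples) if samples else None


def alert_fires(value: Optional[int], minimum: int) -> bool:
    """A 'value < minimum' condition can only match a value that exists."""
    return value is not None and value < minimum


def count_or_zero(samples: dict[str, float]) -> int:
    """Same count, with an explicit fallback to 0 when nothing comes back."""
    result = count_series(samples)
    return 0 if result is None else result


mounted = {"/tape/0": 1.0, "/tape/1": 1.0}  # hypothetical mounted filesystems
none_mounted: dict[str, float] = {}

# With nothing mounted, the plain count yields no value, so the alert stays silent.
silent = alert_fires(count_series(none_mounted), minimum=1)
# With the fallback, the count is 0 and the alert fires as intended.
fixed = alert_fires(count_or_zero(none_mounted), minimum=1)
```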

SRE Weekly Issue #466

lex

March 2, 2025

A bit of a short issue this week, as I spent most of my weekend at my child’s first FIRST Robotics Competition of the season. FRC truly is a microcosm of reliability engineering, balancing limited time and resources while trying to produce the most reliable bot possible.

No, you don’t have to run like Google

Just because Google, Amazon, or Facebook does it doesn’t mean you should. Here are four ‘best practices’ of the hyperscalers you have permission to ignore.

Matt Asay — InfoWorld

What is Saga Pattern in Distributed Systems?

An introduction to distributed transactions using the Saga pattern, including pros and cons and two approaches for implementing sagas.

Sid — Scalable Thread
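As a rough sketch of the idea, here is a saga written in the orchestration style, one of the two commonly described approaches (the other being choreography). It is not taken from the linked article; the step names and the `run_saga` helper are hypothetical.

```python
from typing import Callable, List, Tuple

# Each step is (name, action, compensation).
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[Step] = []
    for step in steps:
        _, action, _ = step
        try:
            action()
        except Exception:
            # Undo only what actually finished, most recent first.
            for _, _, compensate in reversed(completed):
                compensate()
            return False
        completed.append(step)
    return True


log: List[str] = []


def fail_payment() -> None:
    raise RuntimeError("payment declined")


order_saga: List[Step] = [
    ("reserve_stock", lambda: log.append("stock reserved"), lambda: log.append("stock released")),
    ("charge_card", fail_payment, lambda: log.append("charge refunded")),
    ("ship_order", lambda: log.append("order shipped"), lambda: log.append("shipment cancelled")),
]

succeeded = run_saga(order_saga)
```

Here the payment step fails, so the saga releases the reserved stock and never reaches shipping. The failed step itself is not compensated because it never completed.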

Answering reader feedback: war rooms vs. deep investigations

Here’s an argument against real-world “war rooms” for incident response, including a great incident story as an example.

I can’t imagine doing that kind of multi-window parallel investigation stuff on a teeny little laptop screen with people right next to me on either side

rachelbythebay

https://www.reddit.com/r/sre/comments/1j145fx/delegate_aggressively_when_leading_an_incident/

This one includes a list of responsibilities a lead incident responder has and another list of things they should delegate.

Incident lead isn’t an extra job that you do “on top of” engineering. It’s the main job.

u/devoopseng — Reddit r/sre

How to Scale Elasticsearch to Solve Your Scalability Issues

Scaling Elasticsearch requires balancing sharding, query performance, and memory tuning for optimal efficiency in high-traffic, real-time applications.

Vivek Kumar — DZone

SRE Weekly Issue #465

lex

February 23, 2025

Incident Report: Dec 1st, 2023

An incident report from the vault, along with its accompanying blog post, involving a rare but serious kernel freeze on GCP.

Jake Cooper — Railway

It’s a log eat log world!

Let’s discuss logging – unstructured, structured and canonical log lines – what they are and what value they bring to your production systems.

This one walks through implementing a logging system in a sample project.

Obakeng Mosadi
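For a feel of what a canonical log line looks like, here is a minimal sketch: fields are gathered over the life of one request and emitted once, as a single structured line. The field names and the `canonical_log_line` helper are assumptions for illustration, not the article's implementation.

```python
import json
from typing import Any, Dict


def canonical_log_line(**fields: Any) -> str:
    """Serialize every field collected during one request into a single JSON line."""
    return json.dumps(fields, sort_keys=True)


# Fields accumulate as the request moves through a (hypothetical) handler...
request_fields: Dict[str, Any] = {"http_method": "GET", "http_path": "/orders/42"}
request_fields["user_id"] = "u_123"
request_fields["db_queries"] = 3
request_fields["http_status"] = 200
request_fields["duration_ms"] = 18.4

# ...and are emitted once, at the end, as the request's canonical log line.
line = canonical_log_line(**request_fields)
print(line)
```

Having everything about a request on one line means a single query can answer questions like "which paths are slow for this user" without joining scattered log entries.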

Redis as a Primary Database for Complex Applications

This article aims to answer one question: How can Redis be used as a primary database for complex applications that need to store data in multiple formats?

It covers persistence and scaling options, including Redis Enterprise’s built-in CRDTs.

Mohammed Talib

Searching for the cause of hung tasks in the Linux kernel

In this blog post we’re going to explore how the hung task warning works, why it happens, whether it is a bug in the Linux kernel or application itself, and whether it is worth monitoring at all.

Oxana Kharitonova and Jesper Brouer — Cloudflare

Resilience: some key ingredients

This post discusses key preconditions for building resilience, including resources, flexibility, expertise, diversity, and coordination.

Lorin Hochstein

Blame is not the root cause of bad postmortems

So the main problem with blameful postmortems is not the blame. It’s the very idea that particular decisions can be categorically unsafe.

u/devoopseng — Reddit r/sre

Incident Initiation: Pinpointing the Precise Problem Point

This may be the shortest article I’ve ever linked to here, but it’ll make you think.

Dean Wilson

Slicing Up—and Iterating on—SLOs

If you use SLOs at all levels in your system, a failure of a core part (like the DB) may page multiple teams. This article offers strategies to handle this better.

Fred Hebert — Honeycomb

SRE Weekly Issue #464

lex

February 16, 2025

So You Want to Build Your Own Data Center

These folks decided that Google Cloud wasn’t for them, and they built and migrated to their own datacenter in 9 months. This article goes over the physical buildout.

Charith Amarasinghe — Railway

How GitLab Lost 300GB of Production Data and What We Can Learn

I remember when this incident happened in 2017. It was a huge one, and GitLab was very open with information about what happened. Here’s a look back at the incident.

Byte-Sized Design

How Precision Time Protocol handles leap seconds

When your distributed system deals in nanosecond precision, an extra second is a big deal.

Oleg Obleukhov and Patrick Cullen — Meta

Systems Correctness Practices at AWS

Learn how AWS uses formal verification and other techniques.

Alongside industry-standard testing methods (such as unit and integration testing), AWS has adopted model checking, fuzzing, property-based testing, fault-injection testing, deterministic simulation, event-based simulation, and runtime validation of execution traces.

Marc Brooker and Ankush Desai — ACM Queue

Surviving Cardiac Surgical Chaos

Normally, we rely on the thoughts, decisions, and actions of individuals to create resilience in our sociotechnical systems, but in some time-critical situations, it can be best for one expert to call the shots.

Robert Poston, MD

Best Simple System for Now

You do not have to choose between gold-plating dressed as craftsmanship or perfectionism and corner-cutting framed as pragmatism or realism. You can have the quality of the former at the speed and focus of the latter. I call this the Best Simple System for Now.

Dan North & Associates

How doctors handoff patients (how it applies to incidents)

This is the first I’ve heard of I-PASS, and I like it!

u/devoopseng — Reddit r/sre

The Theory Behind Understanding Failure

This article is a roundup of schools of thought on how systems fail, with a pretty excellent list of links to related articles at the end.

Evan Smith

Subscribe

RSS

Mastodon

Search Issues
