General

SRE Weekly Issue #465

lex

February 23, 2025

Incident Report: Dec 1st, 2023

An incident report from the vault, along with its accompanying blog post, involving a rare but serious kernel freeze on GCP.

Jake Cooper — Railway

It’s a log eat log world!

Let’s discuss logging – unstructured, structured and canonical log lines – what they are and what value they bring to your production systems.

This one also walks through implementing a logging system in a sample project.

Obakeng Mosadi
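
As a rough illustration of the canonical log line idea mentioned above (one structured line per request, carrying every field the request accumulated), here is a minimal Python sketch. The class name, field names, and values are hypothetical and are not taken from the linked article.

```python
import json
import time


class CanonicalLogLine:
    """Collects fields during one request and emits them as a single JSON line."""

    def __init__(self, **initial_fields):
        self.fields = dict(initial_fields)
        self._start = time.monotonic()

    def add(self, **fields):
        # Any layer (auth, database, handler) can attach its own fields.
        self.fields.update(fields)

    def render(self):
        # Add the total duration, then serialize with a stable key order.
        record = dict(self.fields)
        record["duration_ms"] = round((time.monotonic() - self._start) * 1000, 2)
        return json.dumps(record, sort_keys=True)


def handle_request(method, path):
    # Hypothetical request handler: each step adds context to the same line.
    line = CanonicalLogLine(http_method=method, http_path=path)
    line.add(user_id="u_123")           # hypothetical value set by an auth layer
    line.add(db_queries=2, status=200)  # hypothetical values set by the handler
    return line.render()


if __name__ == "__main__":
    print(handle_request("GET", "/orders"))
```

Emitting one line at the end of the request means a single query over the logs can answer most questions about that request, instead of joining several scattered lines.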

Redis as a Primary Database for Complex Applications

This article aims to answer one question: How can Redis be used as a primary database for complex applications that need to store data in multiple formats?

It covers persistence and scaling options, including Redis Enterprise’s built-in CRDTs.

Mohammed Talib

Searching for the cause of hung tasks in the Linux kernel

In this blog post we’re going to explore how the hung task warning works, why it happens, whether it is a bug in the Linux kernel or application itself, and whether it is worth monitoring at all.

Oxana Kharitonova and Jesper Brouer — Cloudflare

Resilience: some key ingredients

This post discusses key preconditions for building resilience, including resources, flexibility, expertise, diversity, and coordination.

Lorin Hochstein

Blame is not the root cause of bad postmortems

So the main problem with blameful postmortems is not the blame. It’s the very idea that particular decisions can be categorically unsafe.

u/devoopseng — Reddit r/sre

Incident Initiation: Pinpointing the Precise Problem Point

This may be the shortest article I’ve ever linked to here, but it’ll make you think.

Dean Wilson

Slicing Up—and Iterating on—SLOs

If you use SLOs at all levels in your system, a failure of a core part (like the DB) may page multiple teams. This article offers strategies to handle this better.

Fred Hebert — Honeycomb

SRE Weekly Issue #464

lex

February 16, 2025

So You Want to Build Your Own Data Center

These folks decided that Google Cloud wasn’t for them, and they built and migrated to their own datacenter in 9 months. This article goes over the physical buildout.

Charith Amarasinghe — Railway

How GitLab Lost 300GB of Production Data and What We Can Learn

I remember when this incident happened in 2017. It was a huge one, and GitLab was very open about it. Here’s a look back at what happened.

Byte-Sized Design

How Precision Time Protocol handles leap seconds

When your distributed system deals in nanosecond precision, an extra second is a big deal.

Oleg Obleukhov and Patrick Cullen — Meta

Systems Correctness Practices at AWS

Learn how AWS uses formal verification and other techniques.

Alongside industry-standard testing methods (such as unit and integration testing), AWS has adopted model checking, fuzzing, property-based testing, fault-injection testing, deterministic simulation, event-based simulation, and runtime validation of execution traces.

Marc Brooker and Ankush Desai — ACM Queue

Surviving Cardiac Surgical Chaos

Normally, we rely on the thoughts, decisions, and actions of individuals to create resilience in our sociotechnical systems, but in some time-critical situations, it can be best for one expert to call the shots.

Robert Poston, MD

Best Simple System for Now

You do not have to choose between gold-plating dressed as craftsmanship or perfectionism and corner-cutting framed as pragmatism or realism. You can have the quality of the former at the speed and focus of the latter. I call this the Best Simple System for Now.

Dan North & Associates

How doctors handoff patients (how it applies to incidents)

This is the first I’ve heard of I-PASS, and I like it!

u/devoopseng — r/sre

The Theory Behind Understanding Failure

This article is a roundup of schools of thought on how systems fail, with a pretty excellent list of links to related articles at the end.

Evan Smith

SRE Weekly Issue #463

lex

February 9, 2025

Probabilistic Increment

Sometimes, we can harness randomness to improve throughput and reliability.

Teiva Harsanyi — The Coder Cafe
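
One common way to harness randomness for a hot counter is to write to it only some of the time and scale the stored value back up when reading. The sketch below assumes that form of the idea; it is not necessarily the exact scheme in the linked article, and the class name and `sample_rate` parameter are hypothetical.

```python
import random


class SampledCounter:
    """Counts events approximately by recording only a random sample of them."""

    def __init__(self, sample_rate, rng=None):
        if not 0 < sample_rate <= 1:
            raise ValueError("sample_rate must be in (0, 1]")
        self.sample_rate = sample_rate
        self.stored = 0   # stands in for a contended shared counter
        self.writes = 0   # how many times the shared counter was touched
        self._rng = rng or random.Random()

    def increment(self):
        # Only a fraction of events pay the cost of a write.
        if self._rng.random() < self.sample_rate:
            self.stored += 1
            self.writes += 1

    def estimate(self):
        # Scale the sampled count back up to estimate the true total.
        return self.stored / self.sample_rate


if __name__ == "__main__":
    counter = SampledCounter(sample_rate=0.1, rng=random.Random(42))
    for _ in range(10_000):
        counter.increment()
    print("writes:", counter.writes, "estimate:", counter.estimate())
```

The trade-off is accuracy for throughput: the estimate is unbiased but noisy, so this fits metrics and rate tracking better than anything that needs an exact count.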

How We Migrated Checkly From Heroku to AWS

Not just the “how”, but also the “why”, along with the challenges they found along the way.

Daniel Paulus and Umut Uzgur — Checkly

Using ML to detect and respond to performance degradations in slices of Stripe payments

It’s a classic problem: how do you detect problems that badly impact a specific set of customers, when the overall percentage affected is tiny?

Lakshmi Narayan and Joshua Delman — Stripe

What is the Byzantine Generals Problem in Distributed Systems?

This is the clearest and most concise explanation of the Byzantine Generals Problem that I’ve read.

Sid — The Scalable Thread

Simulation: An Underutilized Tool in Distributed Systems

Th[is] article describes some different methods and tools that engineers can use to simulate their clusters and what knowledge they can gain from it, and it presents a case study using SimKube, the Kubernetes simulator developed by Applied Computing Research Labs in 2024.

David R. Morrison — ACM Queue

Incident Report: December 16th, 2024

An infrastructure-as-code nightmare: when a list of IPs became empty, the IP block rule was suddenly interpreted as “block everything” rather than “block nothing”.

Jake Cooper — Railway
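
The empty-list failure mode described above can be sketched in a few lines of Python. The mechanism shown here (an empty filter being read as "match any source") is an assumption for illustration, not a detail taken from the incident report, and all function names are hypothetical.

```python
import ipaddress


def rule_matches_naive(source_ip, blocked_networks):
    # Pitfall: an empty filter is treated as "no restriction", so the
    # block rule ends up matching every source address.
    if not blocked_networks:
        return True
    addr = ipaddress.ip_address(source_ip)
    return any(addr in ipaddress.ip_network(net) for net in blocked_networks)


def rule_matches_safe(source_ip, blocked_networks):
    # An empty list matches nothing: any() over no items is False.
    addr = ipaddress.ip_address(source_ip)
    return any(addr in ipaddress.ip_network(net) for net in blocked_networks)


def validate_change(old_networks, new_networks):
    # Guard at apply time: refuse a change that empties a non-empty list
    # unless someone handles that case explicitly.
    if old_networks and not new_networks:
        raise ValueError("refusing to apply: block list would become empty")
    return new_networks


if __name__ == "__main__":
    print(rule_matches_naive("203.0.113.7", []))  # blocks everything
    print(rule_matches_safe("203.0.113.7", []))   # blocks nothing
```

Either defense helps on its own: the matcher can define empty as "nothing", or the deployment pipeline can reject the transition to empty before it reaches the matcher.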

Cloudflare incident on February 6, 2025

The incident occurred due to human error and insufficient validation safeguards during a routine abuse remediation for a report about a phishing site hosted on R2.

Matt Silverlock and Javier Castro — Cloudflare

Why DOGE’s meddling at Treasury could have catastrophic consequences for the US economy

Along with being blatantly illegal, DOGE’s actions are incredibly risky from a reliability perspective. Thanks, Liz, for putting into words concerns that I also share.

Liz Fong-Jones — Bulletin of the Atomic Scientists

SRE Weekly Issue #462

lex

February 2, 2025

The non-existence of ACID consistency – Part 1

This article series asks, do you really need ACID consistency?

Well, of course ACID consistency exists – and it is a good thing that it exists. Thus, feel free to call the post title clickbait … ;)

My point here is that it should not exist as a functional requirement.

Uwe Friedrichsen

Increased errors for ChatGPT

OpenAI posted this mini report on their outage on January 30.

OpenAI

‘Too Much Security’ brought down Philippine EDU sites?

It’s never DNS, except when it’s definitely DNS, such as in the case of this probable DNSSEC misconfiguration.

Wilson Chua — Manila Bulletin

Fail Open vs. Fail Closed

Do you want to prioritize availability or control?

Teiva Harsanyi — The Coder Cafe
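
The choice between the two policies often comes down to a single branch in the error path. A minimal sketch, assuming a check that depends on some external service; the function names are hypothetical:

```python
def check_with_policy(check, fail_open):
    """Run check(); if it raises, fall back according to the chosen policy."""
    try:
        return bool(check())
    except Exception:
        # Fail open: allow the request when the dependency is down (availability).
        # Fail closed: deny the request when the dependency is down (control).
        return fail_open


def broken_dependency():
    # Hypothetical check whose backing service is unavailable.
    raise TimeoutError("dependency unavailable")


if __name__ == "__main__":
    print(check_with_policy(broken_dependency, fail_open=True))   # request allowed
    print(check_with_policy(broken_dependency, fail_open=False))  # request denied
```

Note that the policy only applies when the check cannot be completed. A check that runs and says "deny" is still a deny under either policy.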

You’re missing your near misses

The amount of attention an incident gets is proportional to the severity of the incident: the greater the impact to the organization, the more attention that post-incident activities will get.

The problem is that the severity of a near-miss incident is zero, yet it can still have significant value for learning.

Lorin Hochstein

Restructuring How We Think About Alerts

This article urges caution in creating alerts that recommend a specific course of action when they fire. It explains why this can be dangerous and suggests alternative methods.

Fred Hebert — Honeycomb

Kubernetes Best Practices I Wish I Had Known Before

In this post, I will highlight some crucial Kubernetes best practices. They are from my years of experience with Kubernetes in production. Think of this as the curated “Kubernetes cheat sheet” you wish you had from Day 1.

Engin Diri — Pulumi

Strobelight: A profiling service built on open source technology

Meta’s profiling system has helped them save thousands of servers’ worth of computing resources, through continuous profiling and centralized symbolization.

Jordan Rome — Meta

SRE Weekly Issue #461

lex

January 26, 2025

The importance of resilience engineering

Written in 2020 after an AWS outage, this article analyzes dependence on third-party services and the responsibility to understand their reliability.

Uwe Friedrichsen

How we invalidate cache for resource-heavy & long-running requests

When a cache expired, these folks found that their application stampeded the database with expensive queries, so they searched for a solution.

Punit Sethi
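
One common mitigation for this kind of stampede is request coalescing: when a key is missing, only one caller runs the expensive query and the rest wait for its result. The sketch below shows that technique under simple assumptions (in-process cache, threads); it is not necessarily the solution the linked article settled on, and all names are hypothetical.

```python
import threading
import time

_MISSING = object()


class SingleFlightCache:
    """Cache where only one caller per key runs the expensive loader."""

    def __init__(self, loader):
        self._loader = loader           # hypothetical expensive query function
        self._values = {}
        self._locks = {}
        self._guard = threading.Lock()  # protects the per-key lock table

    def _lock_for(self, key):
        with self._guard:
            return self._locks.setdefault(key, threading.Lock())

    def get(self, key):
        value = self._values.get(key, _MISSING)
        if value is not _MISSING:
            return value
        with self._lock_for(key):
            # Re-check: another caller may have loaded it while we waited.
            value = self._values.get(key, _MISSING)
            if value is _MISSING:
                value = self._loader(key)
                self._values[key] = value
            return value

    def invalidate(self, key):
        self._values.pop(key, None)


def count_loader_calls(workers=8):
    """Fire concurrent gets at a cold key and report how often the loader ran."""
    calls = []

    def slow_query(key):
        calls.append(key)
        time.sleep(0.05)  # simulate an expensive database query
        return f"result for {key}"

    cache = SingleFlightCache(slow_query)
    threads = [threading.Thread(target=cache.get, args=("report",))
               for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(calls)


if __name__ == "__main__":
    print("loader calls:", count_loader_calls())
```

Without the per-key lock, every concurrent caller that sees the miss would run the query itself, which is the stampede in miniature.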

The danger of overreaction

When a high-severity incident happens, its associated risks become salient: the incident looms large in our mind, and the fact that it just happened leads us to believe that the risk of a similar incident is very high.

Lorin Hochstein

Managing Trace Volume at monday.com

These folks landed on a hybrid approach using two vendors, allowing them to avoid sending their entire trace volume to an expensive observability vendor.

Jakub Sokół — monday

Adaptive LIFO

Under heavy load, requests are handled in LIFO order to maximize the chance of successfully completing fresh requests.

LIFO = Last In, First Out

Teiva Harsanyi
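
The switching behavior can be sketched with a double-ended queue: serve the oldest request in normal conditions, and the newest once a backlog builds up. The backlog threshold used as the load signal is an assumption for illustration, and the class name is hypothetical.

```python
from collections import deque


class AdaptiveLifoQueue:
    """FIFO under normal load, LIFO once the backlog exceeds a threshold."""

    def __init__(self, lifo_threshold):
        self._items = deque()
        self._lifo_threshold = lifo_threshold  # assumed load signal: queue depth

    def put(self, item):
        self._items.append(item)

    def get(self):
        if not self._items:
            raise IndexError("get from an empty queue")
        if len(self._items) > self._lifo_threshold:
            # Overloaded: the newest request is the least likely to have
            # already timed out on the client side, so serve it first.
            return self._items.pop()
        # Normal load: serve the oldest request first.
        return self._items.popleft()

    def __len__(self):
        return len(self._items)


if __name__ == "__main__":
    queue = AdaptiveLifoQueue(lifo_threshold=2)
    for request_id in range(1, 6):
        queue.put(request_id)
    print([queue.get() for _ in range(5)])
```

A production version would usually pair this with a timeout that drops requests that have waited too long, since under LIFO the oldest entries may never be reached.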

Kafka vs NATS: A Comparison for Message Processing

More than just a simple feature comparison, this article also presents two use cases and analyzes which tool is best in each case.

Josson Paul Kalapparambath — DZone

Go All the Way: Why Golang is Your Swiss Army Knife for Modern Development

These folks explain why they use Go for everything: application code, infrastructure as code, tooling, and even as a wrapper around Helm charts for Kubernetes.

Akhilesh Krishnan — Oodle AI
