Search Results for “outages”

Search Results for – "outages"

SRE Weekly Issue #484

lex

July 6, 2025

Exact Code Search: Find code faster across repositories

This is really neat! They’ve developed a new approach to search that uses 3-letter “trigrams” rather than tokenizing words, making it especially well-suited to code search. It converts regular expressions to trigram searches behind the scenes.

Dmitry Gruzd — GitLab

Pattern machines that we don’t understand

This article about LLMs is by a regularly featured author here in the newsletter. It’s not, strictly speaking, directly SRE-related, but I really got a lot out of it, so I’m including it anyway.

Lorin Hochstein

Soft vs. Hard Dependency: A Better Way to Think About Dependencies for More Reliable Systems

This one explains the difference between a soft and hard dependency, why it matters, and how to use this information to improve reliability. I like the section on soft dependencies evolving into hard dependencies when you’re not looking.

Teiva Harsanyi — The Coder Cafe

Breaking up a monolith: How we’re unwinding a shared database at scale

In this post, we’ll walk through how we’re splitting apart our shared database into independently owned instances. We’ll explain how we defined the right boundaries, minimized risk during migrations, and built the tooling to make the process safe and scalable.

Fabiana Scala and Tali Gutman — Datadog

Big Enough to Fail

At some point, the external dependencies which our systems rely on become so tightly coupled, large, and fundamental that should those foundations inevitably fail, that blame can actually go down in response to an incident.

This thought-provoking article explores why we’re more tolerant of outages from large tech companies like Google Cloud or Salesforce, and what this means for how we think about reliability engineering and incident response.

Will Gallego

Use AWS FIS to test the resilience of self-managed Cassandra

This practical guide shows how to use AWS Fault Injection Service (FIS) to perform chaos engineering experiments on self-managed Cassandra clusters. It walks through setting up experiments to test node failure scenarios and validate that applications can properly handle database outages through connection pooling and retry mechanisms.

Hans Nesbitt and Lwanga Phillip — AWS

Building a Billing Usage Recovery System

Klaviyo shares how they built an automated recovery system to handle billing usage tracking failures. The system uses S3 for data storage and SQS for message queuing to ensure that missed usage events are automatically recovered, eliminating manual intervention and reducing customer confusion.

Kaavya Antony — Klaviyo

Taming Complexity: HelloFresh’s Playbook for Managing Large-Scale Change (Part 3/3)

Final stretch! We’ve handled people and processes, now let’s crack the code side and stitch everything together into a four-stage framework you can reuse.

SRE Weekly Issue #469

lex

March 23, 2025

General

Comments

View on sreweekly.com

MTTR Is (Still) Lying to You

I’ve shared this article before, but it’s so critical that it’s time to give it another read. MTTR is a statistically useless metric, and by using it, we will draw faulty conclusions and potentially take harmful actions. Courtney Nash does a really great job of laying out the science in an easy-to-understand way.

Courtney Nash — Resilience in Software Foundation / The VOID

System Diagrams are Performance Caches for Cognitive Load

I like the analogy here: when we say people are components in or sociotechnical systems, system diagrams are like a form of cache.

Clint Byrum

In S3 simplicity is table stakes

From Werner Vogels’s intro to this article:

Andy takes us through S3’s evolution from simple object store to sophisticated data platform, illustrating how customer feedback has shaped every aspect of the service. It’s a fascinating look at how we maintain simplicity even as systems scale to handle hundreds of trillions of objects.

Andy Warfield — Amazon

The Serverless Trilemma: Cost, Performance, and Complexity

Instead of a traditional Cost/Performance/Reliability trade-off, this article argues that serverless presents a tradeoff of Cost, Performance, and Complexity.

Luc van Donkersgoed

STPA (System Theoretic Process Analysis) — Teaching a new way to prevent outages at Google

Google uses System Theoretic Process Analysis to identify problems in their systems. They found that the most effective way to spread adoption of STPA was to build their own training program.

Garrett Holthaus — Google

Scaling Nextdoor’s Datastores: Part 1

So far, I’m liking this new post series from Nextdoor about their efforts to scale their datastore. Here’s the first installment, about the things they’ve tried up to now.

I’ll share the rest of the series as I work my way through them.

Slava Markeyev — Nextdoor

The Real Failure Rate of EBS

Wow, I had no idea EBS volumes failed this often!

Nick Van Wiggeren — PlanetScale

SRE Weekly Issue #459

lex

January 12, 2025

General

Comments

View on sreweekly.com

How to end-to-end test microservices across bounded contexts?

In a microservices environment, testing user journeys that span across multiple bounded contexts requires collaboration and a clear delineation of responsibilities.

Yan Cui

A smooth CDN provider migration and future initiatives

These folks migrated from Fastly to Cloudflare using Terraform. They wrote a Go program to translate from their Fastly VCL configurations to an equivalent set of parameters to their Terraform module.

hatappi1225 — Mercari

Use of Time in Distributed Databases (part 1)

This 3-part series does a deep dive on how time and clocks work in distributed data stores. Part 2 is here and part 3 is here.

Murat

Seconds Since the Epoch

TIL: “Unix time” (seconds since the epoch) does not include leap seconds.

Kyle Kingsbury

Facebook Engineering Disasters Are Not Inevitable: Moving Past Casual Commentary to Real Change

This post argues that tech companies should avoid outages like Facebook’s in 2021 by using much more rigorous principles such as those used to build bridges. I’m not so sure about that, but it was an interesting read.

Davi Ottenheimer

Behind the scenes with Stream Live, Cloudflare’s live streaming service

There’s a lot going on beneath the surface in a live video streaming service. Cloudflare walks us through it, including key design decisions like on-the-fly transcoding.

Kyle Boutette and Jacob Curtis — Cloudflare

DSQL Vignette: Wait! Isn’t That Impossible?

DSQL is Amazon’s new serverless PostgreSQL-compatible datastore.

Aurora DSQL is designed to remain available, durable, and strongly consistent even in the face of infrastructure failures and network partitions.

But what about the CAP Theorem? Click through to find out how.

Marc Brooker

The long way towards resilience

This new installment introduces the next level of resilience, which involves the ability to radically change your approach if the usual adaptation strategies fall short.

Uwe Friedrichsen

SRE Weekly Issue #446

lex

October 13, 2024

General

Comments

View on sreweekly.com

Why I like discussing actions items in incident reviews

This one is a direct response to an article by Lorin Hochstein from a couple weeks back. There’s a lot here to think about, and it’s really great to see the back-and-forth discussion.

Chris Evans — incident.io

Building and operating a pretty big storage system called S3

A tour through the design of S3 by its VP. I found the discussion of managing “heat” (I/O load) especially interesting.

Andy Warfield — Amazon

A Comprehensive Guide to Database Sharding

This one introduced me to a new concept: vertical vs horizontal sharding. Vertical sharding by whole tables, and horizontal is sharding by related rows across tables, as with users or groups of users.

Suleiman Dibirov

Build a serverless ACID database with this one neat trick (atomic PutIfAbsent)

Thanks to its simplicity, in this post we’ll implement a Delta Lake-inspired serverless ACID database in 500 lines of Go code with zero dependencies.

PutIfAbsent maps nicely to API features available in S3, Azure, and Google Cloud Storage, among others.

Phil Eaton

Implicit SLOs and their dangers

If your API has been quietly delivering five nines, and you add an SLO with a target of three nines, you’re gonna have issues.

Niall Murphy

Why the Future of the .io Domain Extension is Uncertain

Those .io domains seemed super cool, but maybe not so much now. If your company depends on one, especially for a public API endpoint, it’s probably about time to get a fallback domain lined up.

Vivek Naskar

Improving platform resilience at Cloudflare through automation

Cloudflare built an automated workflow processor on Temporal to handle routine failures, reducing toil.

Opeyemi Onikute — Cloudflare

Don’t Let an Expired Certificate Cause Critical Downtime. Prevent Outages with a Smart CLM

It’s hard enough handling certificate expiry yearly, but this article introduced me to the fact that browser root programs are pushing for standardization on 3-month certificates.

Krupa Patil — Security Boulevard

SRE Weekly Issue #442

lex

September 15, 2024

General

Comments

View on sreweekly.com

SLO: Elastic vs Datadog vs Grafana

Here’s a hands-on evaluation of the SLO offerings of three big players in the space. The author includes screenshots of their tests and shares their opinions on each.

Alex Ewerlöf

“SRE” doesn’t seem to mean anything useful any more

🔥🔥🔥 Can calling yourself an SRE be a liability?

rachelbythebay

Aggregating SLIs

This article outlines some options for combining multiple SLIs together. I like the emphasis on ensuring that the result provides a useful overview without sacrificing too much.

Ali Sattari

Safety first!

Lorin Hochstein proposes a rubric for judging whether a company truly is “safety first” in terms of preventing outages.

Lorin Hochstein

Reliability recommendations when adopting Kubernetes

In this blog, we’ll present four strategies for successfully managing reliability while adopting Kubernetes.

Andre Newman — Gremlin

Two Multi-Master DBs Aligned With a Vector Clock

I haven’t seen a migration like this before. They managed a slow transition from an old system to a new one, keeping data in sync even though the two applications had entirely different database systems.

Claudio Guidi and Giovanni Cuccu — DZone

Asynchronous IO: the next billion-dollar mistake?

[…] what if instead of spending 20 years developing various approaches to dealing with asynchronous IO (e.g. async/await), we had instead spent that time making OS threads more efficient, such that one wouldn’t need asynchronous IO in the first place?

Yorick Peterse

Google Cloud Incident Report: September 7 incident in asia-northeast1

I love a multi-level complex failure.

[…] during this disruption, a secondary issue caused automated failover to not work, rendering the entire metadata storage unavailable despite two other healthy zones being available.

Google