SRE Weekly Issue #445

A message from our sponsor, FireHydrant:

FireHydrant has acquired Blameless! The addition of Blameless’ enterprise capabilities combined with FireHydrant’s platform creates the most comprehensive enterprise incident management solution in the market.

https://firehydrant.com/blog/press-release-firehydrant-acquires-blameless-to-further-solidify-enterprise/

Providing incident resolution times to customers is an unneeded stress for responders with very little gain.

  Robert Ross — FireHydrant

I can’t tell you how many times I’ve found myself lost in thought, wondering how something like EBS works. While this isn’t an architecture overview, it does contain a bunch of juicy tidbits. I especially like the bit about the value of a “full stack engineer”.

  Marc Olson — All Things Distributed

This article explains how to use eBPF to gather observability data, including an example eBPF program and instructions on how to run it.

  Kranthi Kiran Erusu — DZone

Netflix uses multiple kinds of data stores. It was difficult for developers to manage the differences between data stores, so they wrote an abstraction layer.

Our goal was to build a versatile and efficient data storage solution that could handle a wide variety of use cases, ranging from the simplest hashmaps to more complex data structures, all while ensuring high availability, tunable consistency, and low latency.

  Vidhya Arvind, Rajasekhar Ummadisetty, Joey Lynch, and Vinay Chella — Netflix

This post looks at the challenges of predicting capacity in a global CDN, including dealing with uncertainties in customer growth, traffic routing, hardware failure, and more.

  Curt Robords — Cloudflare

GitHub tells us about the tools they use to improve reliability and performance, including Scientist and Flipper.

  Nick Hengeveld — GitHub

If you’re heavily action-item-oriented like I used to be, this is a great read to get you thinking down a different path.

My coworker wrote this awesome script to update our various @team-oncall aliases in Slack automatically, following our PagerDuty on-call schedule. This one thing has already saved us so much in the way of toil, frustration, and missed notifications!

  Fred Hebert — Honeycomb

  Full disclosure: Honeycomb is my employer.

SRE Weekly Issue #444

When you’re doing something 60 million times per second, even a modest optimization makes a huge difference.

  Kevin Guthrie — Cloudflare

Meet Pushy, Netflix’s websocket-based push system with an impressive five nines of reliability in message delivery.

  Karthik Yagna, Baskar Odayarkoil, and Alex Ellis — Netflix

If your early-stage startup can’t afford an observability solution from a vendor, you could try rolling your own. This article has an overview and pointers but stops short of explicit instructions.

  Malay Hazarika — Osuite

With AI SRE “agents” cropping up everywhere, what should we think? Here’s an overview of what’s going on, with links to read more.

  Clay Smith — Monitoring Monitoring

An overview of the two kinds of RabbitMQ queues along with performance numbers from load tests.

  Josephine Eskaline Joyce and Anilkumar Mallakkanavar — DZone

In this blog post, I’ll discuss the evolution of our Chef infrastructure over the years and the challenges we encountered along the way.

  Archie Gunasekara — Slack

Using LLMs to generate test cases to test an AI agent’s ability to diagnose Kubernetes problems, with a kubectl simulator running on an LLM. Whew, that’s a lot of AI!

  Jeffrey Tsaw — Parity

I was having some major FOMO last week, so this recap of the SEV0 incident management conference is especially welcome.

  Amin Astaneh — Certo Modo

SRE Weekly Issue #443

I’m working on launching a new sibling project to SRE Weekly that will have a different format. I’m on the lookout for potential sponsors now, so if you’re interested, reply by email or drop me a note at lex at sreweekly dot com. And don’t worry! SRE Weekly itself is here to stay.

Thinking of creating a microservice architecture? Maybe think twice, says this article — backed by solid arguments.

  Thiago Caserta

Octopus describes how their cell-based architecture is built for reliability, but it comes with a couple of trade-offs.

  Pawel Pabich — Octopus Deploy

In this blog post, we’ll reveal how we leveraged eBPF to achieve continuous, low-overhead instrumentation of the Linux scheduler, enabling effective self-serve monitoring of noisy neighbor issues.

  Jose Fernandez, Sebastien Dabdoub, Jason Koch, Artem Tkachuk — Netflix

Some great insights in this one, including these gems:

Myth #1: Redundancy Equals Reliability
Myth #2: Preventing Failure is the Only Goal
Myth #3: More Responders Equals Faster Resolution

  Paula Thrasher — PagerDuty

These folks learned the hard way that Node doesn’t implement Happy Eyeballs. Definitely worth a read if you use Node or if you aren’t familiar with Happy Eyeballs.

  Umut Uzgur and Nočnica Mellifera — Checkly

In this post, we’ll cover the basics of on-call scheduling, the different types of on-call schedules you can use and when each is most appropriate, best practices for managing on-call shifts, and all the mistakes people normally make along the way.

  Chris Evans — incident.io

There’s a subtle distinction between heterogeneous and homogeneous SLIs, but it’s important to understand which kind you’re working with and the pros and cons of each.

  Alex Ewerlöf

Cloudflare inadvertently revoked their advertisement for some IPv4 addresses that were still being used for customer traffic due to a subtle bug in their automation.

SRE Weekly Issue #442

Here’s a hands-on evaluation of the SLO offerings of three big players in the space. The author includes screenshots of their tests and shares their opinions on each.

  Alex Ewerlöf

🔥🔥🔥  Can calling yourself an SRE be a liability?

  rachelbythebay

This article outlines some options for combining multiple SLIs together. I like the emphasis on ensuring that the result provides a useful overview without sacrificing too much.

  Ali Sattari

Lorin Hochstein proposes a rubric for judging whether a company truly is “safety first” in terms of preventing outages.

  Lorin Hochstein

In this blog, we’ll present four strategies for successfully managing reliability while adopting Kubernetes.

  Andre Newman — Gremlin

I haven’t seen a migration like this before. They managed a slow transition from an old system to a new one, keeping data in sync even though the two applications had entirely different database systems.

  Claudio Guidi and Giovanni Cuccu — DZone

[…] what if instead of spending 20 years developing various approaches to dealing with asynchronous IO (e.g. async/await), we had instead spent that time making OS threads more efficient, such that one wouldn’t need asynchronous IO in the first place?

  Yorick Peterse

I love a multi-level complex failure.

[…] during this disruption, a secondary issue caused automated failover to not work, rendering the entire metadata storage unavailable despite two other healthy zones being available.

  Google

SRE Weekly Issue #441

This post aims to shed some light on why we migrated to Prometheus, as well as outline some of the technical challenges we faced during the process.

  Eddie Bracho — Mixpanel

Amazon posted this thorough summary of a multi-service outage at the end of July. The impact stems from a complex distributed system failure in Kinesis.

  Amazon

This team shows what they did to ferret out and eliminate occurrences of N+1 DB queries triggered by a single request in their Django app.

  Gonzalo Lopez — Mixpanel
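
For readers unfamiliar with the N+1 pattern, here is a minimal sketch using Python's built-in sqlite3 module rather than Django. The schema, data, and function names are hypothetical and are not taken from the linked article. The first function issues one query for the list plus one query per row; the second fetches the same data with a single JOIN, which is roughly what Django's select_related does for foreign keys.

```python
import sqlite3


def make_db():
    """Build a tiny in-memory database (hypothetical schema for illustration)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE book (
            id INTEGER PRIMARY KEY,
            title TEXT,
            author_id INTEGER REFERENCES author(id)
        );
        INSERT INTO author VALUES (1, 'Ada'), (2, 'Grace');
        INSERT INTO book VALUES (1, 'Notes', 1), (2, 'Compilers', 2), (3, 'Engines', 1);
        """
    )
    return conn


def titles_with_authors_n_plus_one(conn):
    """One query for the books, then one extra query per book: N+1 in total."""
    queries = 1
    books = conn.execute("SELECT title, author_id FROM book ORDER BY id").fetchall()
    rows = []
    for title, author_id in books:
        name = conn.execute(
            "SELECT name FROM author WHERE id = ?", (author_id,)
        ).fetchone()[0]
        queries += 1
        rows.append((title, name))
    return rows, queries


def titles_with_authors_joined(conn):
    """The same result from a single query that joins the two tables."""
    rows = conn.execute(
        "SELECT book.title, author.name FROM book "
        "JOIN author ON author.id = book.author_id ORDER BY book.id"
    ).fetchall()
    return rows, 1


if __name__ == "__main__":
    db = make_db()
    print(titles_with_authors_n_plus_one(db))
    print(titles_with_authors_joined(db))
```

Both functions return the same rows; only the number of round trips to the database differs, and that difference grows with the size of the list.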

The folks at incident.io share how they baked observability into the infrastructure for their new on-call tool.

Note for folks using screen readers: there’s a picture without alt-text that contains the following important text:

  1. Overview dashboard
  2. System dashboard
  3. Logs
  4. Tracing

It’s right after this sentence:

Those pieces fit together something like this:

  Martha Lambert — incident.io

An overview of Deterministic Simulation Testing (DST), which was a new concept for me. It’s about running simulations to try to find faults in a distributed system.

  Phil Eaton

If you build software that people depend on and are not operationally responsible for it (particularly on-call): you should be. 🛑

I like the way this one draws from the author’s experience, plus the emphasis on feedback loops.

  Amin Astaneh

Retries help increase service availability. However, if not done right, they can have a devastating impact on the service and elongate recovery time.

  Rajesh Pandey
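
One common way to keep retries from piling onto a struggling service is to cap the number of attempts and spread them out with exponential backoff and jitter. The sketch below is a generic illustration, not the approach from the linked article; the function names, parameters, and default values are all assumptions chosen for the example.

```python
import random
import time


def backoff_delays(max_attempts, base=0.1, cap=5.0, rng=random):
    """Yield one delay per retry: capped exponential backoff with full jitter."""
    for attempt in range(max_attempts - 1):
        ceiling = min(cap, base * (2 ** attempt))
        # Full jitter: pick uniformly in [0, ceiling] so clients do not retry in lockstep.
        yield rng.uniform(0, ceiling)


def call_with_retries(operation, max_attempts=4, base=0.1, cap=5.0,
                      sleep=time.sleep, rng=random):
    """Call operation() until it succeeds or max_attempts calls have failed.

    The last exception is re-raised once the attempt budget is spent, so a
    persistent outage surfaces to the caller instead of being retried forever.
    """
    delays = backoff_delays(max_attempts, base, cap, rng)
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(next(delays))
```

In real use you would probably catch only errors known to be transient and retry only idempotent operations. The sleep and rng parameters exist so the behavior can be exercised in tests without actually waiting.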

Keepalive pings are critical in any system that uses TCP, since connections can hang at any point. I’ve been meaning to write this one for years!

  Lex Neva — Honeycomb

  Full disclosure: Honeycomb is my employer.

A production of Tinker Tinker Tinker, LLC