General

SRE Weekly Issue #443

lex

September 22, 2024

I’m working on launching a new sibling project to SRE Weekly that will have a different format. I’m on the lookout for potential sponsors now, so if you’re interested, reply by email or drop me a note at lex at sreweekly dot com. And don’t worry! SRE Weekly itself is here to stay.

Microservices vs. Monoliths: Why Startups Are Getting “Nano-Services” All Wrong

Thinking of creating a microservice architecture? Maybe think twice, says this article — backed by solid arguments.

Thiago Caserta

Octopus Cloud architecture

Octopus describes how their cell-based architecture is built for reliability, but it comes with a couple of trade-offs.

Pawel Pabich — Octopus Deploy

Noisy Neighbor Detection with eBPF

In this blog post, we’ll reveal how we leveraged eBPF to achieve continuous, low-overhead instrumentation of the Linux scheduler, enabling effective self-serve monitoring of noisy neighbor issues.

Jose Fernandez, Sebastien Dabdoub, Jason Koch, Artem Tkachuk — Netflix

Myth vs. Reality: Lessons in Reliability from the July 19 Outage

Some great insights in this one, including these gems:

Myth #1: Redundancy Equals Reliability
Myth #2: Preventing Failure is the Only Goal
Myth #3: More Responders Equals Faster Resolution

Paula Thrasher — PagerDuty

How a tcpdump led us to a bug in Node’s IPv6 handling

These folks learned the hard way that Node doesn’t implement Happy Eyeballs. Definitely worth a read if you use Node or if you aren’t familiar with Happy Eyeballs.

Umut Uzgur and Nočnica Mellifera — Checkly

The ultimate guide to on-call schedules

In this post, we’ll cover the basics of on-call scheduling, the different types of on-call schedules you can use and when each is most appropriate, best practices for managing on-call shifts, and all the mistakes people normally make along the way.

Chris Evans — incident.io

Heterogeneous SLI vs Homogeneous SLI

There’s a subtle distinction between heterogeneous and homogeneous SLIs, but it’s important to understand which kind you’re working with and the pros and cons of each.

Alex Ewerlöf

Cloudflare incident on September 17, 2024

Cloudflare inadvertently revoked their advertisement for some IPv4 addresses that were still being used for customer traffic due to a subtle bug in their automation.

SRE Weekly Issue #442

lex

September 15, 2024

SLO: Elastic vs Datadog vs Grafana

Here’s a hands-on evaluation of the SLO offerings of three big players in the space. The author includes screenshots of their tests and shares their opinions on each.

Alex Ewerlöf

“SRE” doesn’t seem to mean anything useful any more

🔥🔥🔥 Can calling yourself an SRE be a liability?

rachelbythebay

Aggregating SLIs

This article outlines some options for combining multiple SLIs together. I like the emphasis on ensuring that the result provides a useful overview without sacrificing too much.

Ali Sattari

Safety first!

Lorin Hochstein proposes a rubric for judging whether a company truly is “safety first” in terms of preventing outages.

Lorin Hochstein

Reliability recommendations when adopting Kubernetes

In this blog, we’ll present four strategies for successfully managing reliability while adopting Kubernetes.

Andre Newman — Gremlin

Two Multi-Master DBs Aligned With a Vector Clock

I haven’t seen a migration like this before. They managed a slow transition from an old system to a new one, keeping data in sync even though the two applications had entirely different database systems.

Claudio Guidi and Giovanni Cuccu — DZone

Asynchronous IO: the next billion-dollar mistake?

[…] what if instead of spending 20 years developing various approaches to dealing with asynchronous IO (e.g. async/await), we had instead spent that time making OS threads more efficient, such that one wouldn’t need asynchronous IO in the first place?

Yorick Peterse

Google Cloud Incident Report: September 7 incident in asia-northeast1

I love a multi-level complex failure.

[…] during this disruption, a secondary issue caused automated failover to not work, rendering the entire metadata storage unavailable despite two other healthy zones being available.

Google

SRE Weekly Issue #441

lex

September 8, 2024

How We Migrated from StatsD to Prometheus in One Month

This post aims to shed some light on why we migrated to Prometheus, as well as outline some of the technical challenges we faced during the process.

Eddie Bracho — Mixpanel

Summary of the AWS Service Event in the Northern Virginia (US-EAST-1) Region

Amazon posted this thorough summary of a multi-service outage at the end of July. The impact stems from a complex distributed system failure in Kinesis.

Amazon

Finding and optimizing N+1 queries on a relational database

This team shows what they did to ferret out and eliminate occurrences of N+1 DB queries triggered by a single request in their Django app.

Gonzalo Lopez — Mixpanel
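
For readers who have not met the N+1 pattern: a minimal sketch using Python's built-in sqlite3 rather than Django, so it is not the linked team's code. The tables, data, and function names are made up for illustration. It contrasts one query per parent row with a single JOIN and counts the queries issued.

```python
import sqlite3

# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE books (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
INSERT INTO authors VALUES (1, 'Ada'), (2, 'Grace');
INSERT INTO books VALUES (1, 1, 'Notes'), (2, 1, 'Letters'), (3, 2, 'Compilers');
""")


def titles_n_plus_one(conn):
    """One query for the authors, then one more query per author (N+1)."""
    queries = 0
    result = {}
    authors = conn.execute("SELECT id, name FROM authors ORDER BY id").fetchall()
    queries += 1
    for author_id, name in authors:
        rows = conn.execute(
            "SELECT title FROM books WHERE author_id = ? ORDER BY id",
            (author_id,),
        ).fetchall()
        queries += 1
        result[name] = [title for (title,) in rows]
    return result, queries


def titles_single_query(conn):
    """The same data fetched with a single JOIN.

    Note: an inner JOIN drops authors with no books; a LEFT JOIN would
    be needed to keep them.
    """
    rows = conn.execute(
        "SELECT a.name, b.title FROM authors a "
        "JOIN books b ON b.author_id = a.id ORDER BY a.id, b.id"
    ).fetchall()
    result = {}
    for name, title in rows:
        result.setdefault(name, []).append(title)
    return result, 1
```

With two authors, the first function issues three queries and the second issues one, and both return the same mapping. In Django, the usual tools for the same fix are `select_related` and `prefetch_related`.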

Building On-call: Our observability strategy

The folks at incident.io share how they baked observability into the infrastructure for their new on-call tool.

Note for folks using screen readers: there’s a picture without alt-text that contains the following important text:

Overview dashboard

System dashboard

Logs

Tracing

It’s right after this sentence:

Those pieces fit together something like this:

Martha Lambert — incident.io

What’s the big deal about Deterministic Simulation Testing?

An overview of DST, which was a new concept for me. It’s about running simulations to try to find faults in a distributed system.

Phil Eaton

Hot take: you build it, you run it

If you build software that people depend on and are not operationally responsible for it (particularly on-call): you should be. 🛑

I like the way this one draws from the author’s experience, plus the emphasis on feedback loops.

Amin Astaneh

Overcome the Retry Dilemma in Distributed Systems

Retries help increase service availability. However, if not done right, they can have a devastating impact on the service and elongate recovery time.

Rajesh Pandey
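
One common way to keep retries from prolonging an outage is to cap the number of attempts and space them out with exponential backoff plus jitter. The sketch below is an illustration of that general technique, not code from the linked article. The function name, parameters, and defaults are assumptions.

```python
import random
import time


def retry_with_backoff(
    operation,
    max_attempts=4,
    base_delay_s=0.1,
    max_delay_s=2.0,
    sleep=time.sleep,
    rng=random.random,
):
    """Call operation() until it succeeds or max_attempts is reached.

    Waits a random time between 0 and an exponentially growing cap
    ("full jitter") between attempts, so that many clients retrying at
    once do not hit the service in lockstep.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            # Out of attempts: surface the last error to the caller.
            if attempt == max_attempts - 1:
                raise
            cap = min(max_delay_s, base_delay_s * (2 ** attempt))
            sleep(rng() * cap)
```

The `sleep` and `rng` parameters exist only so the delays can be controlled in tests. Real code would usually also restrict which exceptions are retried, since retrying a non-transient error only adds load.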

Always. Enable. Keepalives.

Keepalive pings are critical in any system that uses TCP, since connections can hang at any point. I’ve been meaning to write this one for years!

Lex Neva — Honeycomb

Full disclosure: Honeycomb is my employer.
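
For the keepalive item above, here is a minimal Python sketch of turning on TCP keepalives for a socket. It is not taken from the linked post. The helper name and the timing values are illustrative assumptions, and the per-socket tuning options are not available on every platform, so the code applies them only where Python's socket module exposes them.

```python
import socket


def enable_keepalive(sock, idle_s=60, interval_s=10, probes=5):
    """Enable TCP keepalive probes on sock and return it.

    idle_s: seconds of idleness before the first probe.
    interval_s: seconds between probes.
    probes: unanswered probes before the connection is dropped.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # These tuning options are platform-dependent (present on Linux);
    # skip any that this platform's socket module does not define.
    tuning = (
        ("TCP_KEEPIDLE", idle_s),
        ("TCP_KEEPINTVL", interval_s),
        ("TCP_KEEPCNT", probes),
    )
    for name, value in tuning:
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock
```

Without the tuning options, the operating system's defaults apply. On Linux, the default idle time before the first probe is two hours, which is often too long to detect a hung connection in time.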

SRE Weekly Issue #440

lex

September 1, 2024

Continually testing our product with smoke tests

As part of designing their new paging product, incident.io created a set of end-to-end tests to exercise the system and alert on failures. Click through for details on how they designed the tests and lessons learned.

Rory Malcolm — incident.io

Unified Grid: How We Re-Architected Slack for Our Largest Customers

As Slack rolled out their new experience for large, multi-workspace customers, they had to re-work fundamental parts of their infrastructure, including database sharding.

Ian Hoffman and Mike Demmer — Slack

Heroku incident 2678 Followup: Issues with Essential Tier Databases in EU region

A third-party vendor’s Support Engineer […] acknowledged that the root cause for both outages was a monitoring agent consuming all available resources.

Heroku

Prepare to Be Unprepared: Investing in Capacity to Adapt to Surprises in Software-Reliant Businesses

Resilience engineering is about focusing on making your organization better able to handle the unexpected, rather than preventing repetition of the same incident. This article gives a thought-provoking overview of the difference.

John Allspaw — InfoQ

3 reasons traces are better than metrics for debugging

Metrics are great for many other things, but they can’t compete with traces for investigating problems.

Jean-Mark Wright

Good Retry, Bad Retry: An Incident Story

Through fictional storytelling, this article explains not just the benefits of retries, but how they can go wrong.

Denis Isaev — Yandex

Just use Postgres

Hot take? Sure, but they back it up with a well-reasoned argument.

Ethan McCue

Dealing with rejection (in distributed systems)

A detailed look at the importance of backpressure and how to use it to reduce load effectively, as implemented in WarpStream.

Richard Artoul — WarpStream

SRE Weekly Issue #439

lex

August 25, 2024

Client-Side Monitoring Is a Must for Mobile Apps

Read on to learn why client-side network monitoring is so important and what you are missing if your only visibility into network performance is from a backend perspective.

Fredric Newberg — The New Stack

Piloting through the Fog: A Tale of Migrating to a New Kubernetes Platform

An engineer with no Kubernetes experience migrates an app to Kubernetes — with a bit of help from StackOverflow and Copilot, of course.

Jacob Brandt — Klaviyo

How our data team handles incidents

As data teams become increasingly critical, problems in their systems become incidents. Here’s an overview of how one data team has designed their incident response process.

Navo Das — incident.io

Avoiding downtime: modern alternatives to outdated certificate pinning practices

Certificate pinning can be a useful practice, but it’s also fraught with pitfalls and outage risks, especially with the modern tendency toward shorter-lived certificates and multiple intermediates. What can we do instead?

Dina Kozlov — Cloudflare

What is an SLA?

A super-thorough overview of SLAs with a helpful section on how to choose the level for an SLA.

Diana Bocco — UptimeRobot

Optimizing global message transit latency: a journey through TCP configuration

This debugging story focuses on a Linux TCP option I wasn’t familiar with: tcp_slow_start_after_idle.

Amnon Cohen — Ably

When Publicity Gets in the Way of Scalability: Dreamport Case

This is the story of a company that got an unexpectedly huge rush of interest in their platform—and traffic too. They made a number of changes to quickly scale to meet the demand.

Jekaterina Petrova — Dyninno

[Honeycomb] UI and API unavailable

This Honeycomb incident followup seems to be related to their post that I shared last week.

Honeycomb

Full disclosure: Honeycomb is my employer.
