SRE WEEKLY – Page 14 – scalability, availability, incident response, automation

SRE Weekly Issue #447

lex

October 20, 2024

SLOs for Mobile Apps: Hard Truths to Consider

There are quite a few pitfalls waiting for you if you try to implement SLOs for your mobile app. This article explains them and offers strategies.

Virna Sekuj — The New Stack

Is your “blameless” culture really blameless?

Blamelessness in incident retrospectives can be a difficult concept to truly internalize. This article describes 3 common “failure modes”, that is, ways in which organizations struggle with blamelessness.

Tom Elliott — The Friday Deploy

Thermal design supporting Gen 12 hardware: cool, efficient and reliable

Cloudflare spends a lot of time thinking about cooling, and it’s fascinating. I didn’t realize that spinning a fan faster consumed so much more energy!

Leslye Paniagua — Cloudflare

Microservice Proliferation: Too Many Microservices

Explore the pitfalls associated with the excessive creation of microservices, insights on their causes, implications, and potential strategies for mitigation.

Sumit Kumar — DZone

Introducing Netflix’s TimeSeries Data Abstraction Layer

Netflix stores a truly obscene number of events, each of which has a timestamp and a set of key-value pairs. This article goes into a ton of detail on how they built their system.

Rajiv Shringi, Vinay Chella, Kaidan Fullerton, Oleksii Tkachuk, and Joey Lynch — Netflix

We’re All Just Looking for Connection

A fun debugging story for a confusing crash bug, in which they found 6 other related bugs along the way.

Brett Wines — Slack

Five lessons from a minor production incident

My favorite one is about the principle “You Ain’t Gonna Need It”:

The flip side of YAGNI, however, is that at some point you might actually need it.

Luc van Donkersgoed

Order matters – making a compound index 50x faster

When you create an index on multiple columns in Postgres, you’ll need to be sure that the order of the fields in the index allows it to be applied to your queries, as these folks learned.

Jean-Mark Wright
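
To make the column-order point concrete, here is a small sketch. It uses Python's built-in sqlite3 module rather than Postgres so that it runs anywhere. That is a substitution, not what the article used, though the leftmost-prefix behaviour of B-tree indexes it illustrates is the same general idea. The table, column, and index names are made up.

```python
import sqlite3

def plan_for(index_columns):
    """Return SQLite's query plan for a user_id lookup, given an index column order."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical table; names are illustrative only.
    conn.execute(
        "CREATE TABLE events ("
        "id INTEGER PRIMARY KEY, created_at TEXT, user_id INTEGER, payload TEXT)"
    )
    conn.execute(f"CREATE INDEX idx_events ON events ({index_columns})")
    rows = conn.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM events WHERE user_id = ?", (42,)
    ).fetchall()
    conn.close()
    # The last column of each plan row is the human-readable detail string.
    return " | ".join(row[-1] for row in rows)

# Filter column is not the leading index column: expect a full table scan.
bad_order = plan_for("created_at, user_id")
# Filter column leads the index: expect an index search.
good_order = plan_for("user_id, created_at")

print(bad_order)
print(good_order)
```

Running the same experiment against Postgres with EXPLAIN would be the closer match to the article; the exact plan text differs between databases and versions.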

SRE Weekly Issue #446

lex

October 13, 2024

Why I like discussing actions items in incident reviews

This one is a direct response to an article by Lorin Hochstein from a couple weeks back. There’s a lot here to think about, and it’s really great to see the back-and-forth discussion.

Chris Evans — incident.io

Building and operating a pretty big storage system called S3

A tour through the design of S3 by its VP. I found the discussion of managing “heat” (I/O load) especially interesting.

Andy Warfield — Amazon

A Comprehensive Guide to Database Sharding

This one introduced me to a new concept: vertical vs horizontal sharding. Vertical sharding splits data by whole tables, while horizontal sharding splits it by related rows across tables, as with users or groups of users.

Suleiman Dibirov
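
A rough sketch of the distinction, not taken from the guide itself: vertical sharding routes by table, horizontal sharding routes by a row key such as a user ID. All shard and table names below are hypothetical, and hash-modulo routing is just one simple choice.

```python
import hashlib

# Vertical sharding: whole tables are assigned to different databases.
# The table and shard names here are hypothetical.
TABLE_TO_SHARD = {
    "users": "db-accounts",
    "orders": "db-commerce",
    "invoices": "db-commerce",
}

# Horizontal sharding: rows belonging to the same user live on the same shard.
USER_SHARDS = ["db-users-0", "db-users-1", "db-users-2", "db-users-3"]

def vertical_shard(table: str) -> str:
    """Pick the database that holds an entire table."""
    return TABLE_TO_SHARD[table]

def horizontal_shard(user_id: int) -> str:
    """Pick the database that holds all rows for one user."""
    # Use a stable hash so routing is consistent across processes and
    # restarts (Python's built-in hash() is randomized per process for strings).
    digest = hashlib.sha256(str(user_id).encode()).digest()
    return USER_SHARDS[int.from_bytes(digest[:8], "big") % len(USER_SHARDS)]

print(vertical_shard("orders"))
print(horizontal_shard(1234))
```

One known drawback of plain modulo routing is that changing the number of shards moves most keys, which is why real systems often use a lookup table or consistent hashing instead.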

Build a serverless ACID database with this one neat trick (atomic PutIfAbsent)

Thanks to its simplicity, in this post we’ll implement a Delta Lake-inspired serverless ACID database in 500 lines of Go code with zero dependencies.

PutIfAbsent maps nicely to API features available in S3, Azure, and Google Cloud Storage, among others.

Phil Eaton
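
For a feel of why an atomic PutIfAbsent is enough to serialize writers, here is a minimal sketch in Python. It is not the author's Go code: it stands in a local directory for the object store and uses exclusive file creation as the put-if-absent primitive. The function names and log layout are assumptions for illustration.

```python
import json
import os
import tempfile

def put_if_absent(path: str, data: bytes) -> bool:
    """Create path with data only if it does not already exist."""
    try:
        # Mode "x" requests exclusive creation and raises FileExistsError
        # if the file is already there.
        with open(path, "xb") as f:
            f.write(data)
        return True
    except FileExistsError:
        return False

def commit(log_dir: str, version: int, actions: list) -> bool:
    """Try to claim log entry `version`; only one writer can win."""
    entry = os.path.join(log_dir, f"{version:020d}.json")
    return put_if_absent(entry, json.dumps(actions).encode())

log_dir = tempfile.mkdtemp()
first = commit(log_dir, 1, [{"add": "row-a"}])
second = commit(log_dir, 1, [{"add": "row-b"}])  # same version: should lose
print(first, second)
```

A writer that loses the race would re-read the log, check for conflicts, and retry at the next version. Note that, unlike a conditional put on an object store, a local file can be observed half-written by a concurrent reader, so this is only a stand-in for the real primitive.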

Implicit SLOs and their dangers

If your API has been quietly delivering five nines, and you add an SLO with a target of three nines, you’re gonna have issues.

Niall Murphy
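
The gap between those two targets is easy to underestimate. As a back-of-the-envelope sketch (the 30-day window and the function name are assumptions, not something from the article), five nines allows about 26 seconds of full downtime per 30 days, while three nines allows about 43 minutes:

```python
def allowed_downtime_seconds(nines: int, window_days: int = 30) -> float:
    """Error budget, as seconds of full downtime, for an availability of N nines."""
    window_seconds = window_days * 24 * 60 * 60
    unavailability = 10 ** -nines  # three nines -> 0.001, five nines -> 0.00001
    return window_seconds * unavailability

for n in (3, 5):
    print(f"{n} nines: {allowed_downtime_seconds(n):.1f} s per 30 days")
```

Clients that have grown used to the tighter, implicit number may break well before the looser, official budget is spent.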

Why the Future of the .io Domain Extension is Uncertain

Those .io domains seemed super cool, but maybe not so much now. If your company depends on one, especially for a public API endpoint, it’s probably about time to get a fallback domain lined up.

Vivek Naskar

Improving platform resilience at Cloudflare through automation

Cloudflare built an automated workflow processor on Temporal to handle routine failures, reducing toil.

Opeyemi Onikute — Cloudflare

Don’t Let an Expired Certificate Cause Critical Downtime. Prevent Outages with a Smart CLM

It’s hard enough handling certificate expiry yearly, but this article introduced me to the fact that browser root programs are pushing for standardization on 3-month certificates.

Krupa Patil — Security Boulevard

SRE Weekly Issue #445

lex

October 6, 2024

Hot Take: Don’t provide incident resolution estimates

Providing incident resolution times to customers is an unneeded stress for responders with very little gain.

Robert Ross — FireHydrant

Continuous reinvention: A brief history of block storage at AWS

I can’t tell you how many times I’ve found myself lost in thought, wondering how something like EBS works. While this isn’t an architecture overview, it does contain a bunch of juicy tidbits. I especially like the bit about the value of a “full stack engineer”.

Marc Olson — All Things Distributed

Observability With eBPF

This article explains how to use eBPF to gather observability data, including an example eBPF program and instructions on how to run it.

Kranthi Kiran Erusu — DZone

Introducing Netflix’s Key-Value Data Abstraction Layer

Netflix uses multiple kinds of data stores. It was difficult for developers to manage the differences between data stores, so they wrote an abstraction layer.

Our goal was to build a versatile and efficient data storage solution that could handle a wide variety of use cases, ranging from the simplest hashmaps to more complex data structures, all while ensuring high availability, tunable consistency, and low latency.

Vidhya Arvind, Rajasekhar Ummadisetty, Joey Lynch, and Vinay Chella — Netflix

Removing uncertainty through “what-if” capacity planning

This post looks at the challenges of predicting capacity in a global CDN, including dealing with uncertainties in customer growth, traffic routing, hardware failure, and more.

Curt Robords — Cloudflare

How we improved availability through iterative simplification

GitHub tells us about the tools they use to improve reliability and performance, including Scientist and Flipper.

Nick Hengeveld — GitHub

Why I don’t like discussing action items during incident reviews

If you’re heavily action-item-oriented like I used to be, this is a great read to get you thinking down a different path.

Lorin Hochstein

Syncing PagerDuty Schedules to Slack Groups

My coworker wrote this awesome script to update our various @team-oncall aliases in Slack automatically, following our PagerDuty on-call schedule. This one thing has already saved us so much in the way of toil, frustration, and missed notifications!

Fred Hebert — Honeycomb

Full disclosure: Honeycomb is my employer.

SRE Weekly Issue #444

lex

September 29, 2024

A good day to trie-hard: saving compute 1% at a time

When you’re doing something 60 million times per second, even a modest optimization makes a huge difference.

Kevin Guthrie — Cloudflare

Pushy to the Limit: Evolving Netflix’s WebSocket proxy for the future

Meet Pushy, Netflix’s websocket-based push system with an impressive five nines of reliability in message delivery.

Karthik Yagna, Baskar Odayarkoil, and Alex Ellis — Netflix

Self hosting full stack observability

If your early-stage startup can’t afford an observability solution from a vendor, you could try rolling your own. This article has an overview and pointers but stops short of explicit instructions.

Malay Hazarika — Osuite

AI agents invade observability: snake oil or the future of SRE?

With AI SRE “agents” cropping up everywhere, what should we think? Here’s an overview of what’s going on with links to read more.

Clay Smith — Monitoring Monitoring

Battle of the RabbitMQ Queues: Classic and Quorum

An overview of the two kinds of RabbitMQ queues along with performance numbers from load tests.

Josephine Eskaline Joyce and Anilkumar Mallakkanavar — DZone

Advancing Our Chef Infrastructure

In this blog post, I’ll discuss the evolution of our Chef infrastructure over the years and the challenges we encountered along the way.

Archie Gunasekara — Slack

How and Why We Made SREBench, SWEBench for Kubernetes

Using LLMs to generate test cases to test an AI agent’s ability to diagnose Kubernetes problems, with a kubectl simulator running on an LLM. Whew, that’s a lot of AI!

Jeffrey Tsaw — Parity

Thoughts From The First SEV0 Conference

I was having some major FOMO last week, so this recap of the SEV0 incident management conference is especially welcome.

Amin Astaneh — Certo Modo

SRE Weekly Issue #443

lex

September 22, 2024

I’m working on launching a new sibling project to SRE Weekly that will have a different format. I’m on the lookout for potential sponsors now, so if you’re interested, reply by email or drop me a note at lex at sreweekly dot com. And don’t worry! SRE Weekly itself is here to stay.

Microservices vs. Monoliths: Why Startups Are Getting “Nano-Services” All Wrong

Thinking of creating a microservice architecture? Maybe think twice, says this article — backed by solid arguments.

Thiago Caserta

Octopus Cloud architecture

Octopus describes how their cell-based architecture is built for reliability, but it comes with a couple of trade-offs.

Pawel Pabich — Octopus Deploy

Noisy Neighbor Detection with eBPF

In this blog post, we’ll reveal how we leveraged eBPF to achieve continuous, low-overhead instrumentation of the Linux scheduler, enabling effective self-serve monitoring of noisy neighbor issues.

Jose Fernandez, Sebastien Dabdoub, Jason Koch, Artem Tkachuk — Netflix

Myth vs. Reality: Lessons in Reliability from the July 19 Outage

Some great insights in this one, including these gems:

Myth #1: Redundancy Equals Reliability
Myth #2: Preventing Failure is the Only Goal
Myth #3: More Responders Equals Faster Resolution

Paula Thrasher — PagerDuty

How a tcpdump led us to a bug in Node’s IPv6 handling

These folks learned the hard way that Node doesn’t implement Happy Eyeballs. Definitely worth a read if you use Node or if you aren’t familiar with Happy Eyeballs.

Umut Uzgur and Nočnica Mellifera — Checkly

The ultimate guide to on-call schedules

In this post, we’ll cover the basics of on-call scheduling, the different types of on-call schedules you can use and when each is most appropriate, best practices for managing on-call shifts, and all the mistakes people normally make along the way.

Chris Evans — incident.io

Heterogeneous SLI vs Homogeneous SLI

There’s a subtle distinction between heterogeneous and homogeneous SLIs, but it’s important to understand which kind you’re working with and the pros and cons of each.

Alex Ewerlöf

Cloudflare incident on September 17, 2024

Due to a subtle bug in their automation, Cloudflare inadvertently revoked their advertisement for some IPv4 addresses that were still being used for customer traffic.
