SRE WEEKLY – Page 12 – scalability, availability, incident response, automation

SRE Weekly Issue #453

lex

December 1, 2024

Cloudflare incident on November 14, 2024, resulting in lost logs

It’s a case of cascading failure, but with an interesting twist: their system was designed to handle floods but the safety mechanism was left unconfigured.

Jamie Herre, Tom Walwyn, Christian ndres, Gabriele Viglianisi, Mik Kocikowski, and Rian van der Merwe — Cloudflare

Quick takes on the latest Cloudflare public incident write-up

Lorin takes apart the Cloudflare write-up with style, including a really insightful section on safety mechanisms in complex systems.

Lorin Hochstein

How Meta built large-scale cryptographic monitoring

Meta wanted to log details about the encrypted communications in their systems to help track key use, outdated algorithms, and the like. It’s a ton of telemetry, so they did smart sampling (which they call aggregation):

During the aggregation, a “count” is maintained for every unique event. When it comes time to flush, this count is exported along with the log to convey how often that particular event took place.

Hussain Humadi, Sasha Frolov, Rafael Misoczki, Dong Wu — Meta

Go Profiling in Production

A primer on using Golang’s profiling tools including CPU profiling, memory profiling, goroutine leak analysis, and execution tracing.

Gaurav Maheshwari — Oodle

Local Optimizations Don’t Lead to Global Optimums

A thought-provoking piece of automation, friction, and adaptive capacity. I especially enjoyed the section on decompensation.

Fred Hebert

Understanding Timings in Distributed Systems

With various tools for different kinds of telemetry, these folks needed to up their game to be able to fully understand what happened in a customer request. They also needed a custom sampling strategy to make sure they didn’t miss anything important.

Martin Fahy — Klaviyo

Ably’s four pillars: no scale ceiling

we’ll be looking at how Ably’s platform achieves scalability, and how, as a result, there’s no effective ceiling on the scale of applications that can be supported.

Paddy Byers — Ably

Building a User Signals Platform at Airbnb

Airbnb built a system for tracking and analyzing user actions to aid in personalization. Their system uses Flink and Kafka to handle over a million events per second.

Kidai Kwon — Airbnb

SRE Weekly Issue #452

lex

November 24, 2024

General

Comments

View on sreweekly.com

There’s No Such Thing as a Free Lunch!

The Lunch Exercise was my favorite part of the Blackrock3 training, and now Slack has adapted it for their ongoing training.

How Slack trains engineers in incident response by ordering lunch together.

Scott Nelson Windels — Slack

How we prevent conflicts in authoritative DNS configuration using formal verification

Cloudflare runs programs written in their custom language Topaz in the hot path. They use formal verification in production(!) to ensure that the set of Topaz programs make sense.

ames Larisch, Suleman Ahmad, and Marwan Fayed — Cloudflare

Netflix’s Distributed Counter Abstraction

Distributed counting is a challenging problem in computer science. In this blog post, we’ll explore the diverse counting requirements at Netflix, the challenges of achieving accurate counts in near real-time, and the rationale behind our chosen approach, including the necessary trade-offs.

Rajiv Shringi, Oleksii Tkachuk and Kartik Sathyanarayanan — Netflix

Designing chat architecture for reliable message ordering at scale

It’s hard, and this article explains why in excellent detail. It also includes a discussion of options to consider when designing a chat system.

Ably

Making AWS News stupid fast with smart caching

In anticipation of https://aws-news.com‘s busiest period of the year, I redesigned the API access patterns to support very effective caching. This resulted in significantly reduced backend load and a much faster frontend.

Luc van Donkersgoed — AWS News

The “R” in MTTR: Repair or Recover? What’s the difference?

Recover means that not only is everything back online, but the system is performing well and satisfying any QoS or SLAs AND a preventative approach has been implemented.

Will Searle — Causely

AWS re:Invent ‘24 – The Unofficial SRE Guide

Here’s a list of recommended talks for SREs attending re:Invent, with short descriptions explaining why they’re interesting.

Jamie Baker

Organizing ownership: How we assign errors in our monolith

In this post, I’ll share exactly how we link our code to the team that owns it, so errors and alerting are routed to the right place with minimal maintenance burden.

Martha Lambert — incident.io

SRE Weekly Issue #451

lex

November 17, 2024

General

Comments

View on sreweekly.com

DEATHTRAP! The Strange story of Air France flight 736

Most fascinating air incident report I’ve seen in awhile! The pilots deviated from the non-normal checklist, and it immediately made me think of runbooks. On the one hand, you want the runbook to be simple and easy to handle in an incident. On the other hand, it can be very useful to tell the operator why they should do something.

Mentour Pilot

Migrating billions of records: moving our active DNS database while it’s in use

With their claimed 14.5% of all websites depending on Cloudflare’s DNS, they had to be super careful with this migration. Lots of good stuff in here including:

replacing direct DB access by multiple services with an API
keeping the old and new DB in sync
ensuring both forward and reverse migration were possible in case of rollback

Alex Fattouche and Corey Horton — Cloudflare

Observability as a superpower

I didn’t get to experience the value of a good tracing tool until recently in my career, and I didn’t understand the hype. If you’re in the same boat, this article may help you understand the value of tracing.

Sam Starling — incident.io

Against Incident Severities and in Favor of Incident Types

About a year ago, Honeycomb git rid of incident severity levels in favor of incident types, which are purposefully not sortable. Here’s how their experiment has gone so far.

Fred Hebert — Honeycomb

Full disclosure: Honeycomb is my employer.

SLI vs KPI

Is Service Level Indicator (SLI) the same as Key Performance Indicator (KPI)?

There’s a really cool framing in there: KPIs are moonshots, so we aim high and rarely hit all of them, while with SLOs, we under-promise and over-deliver.

Alex Ewerlöf

Way too many ways to wait on a child process with a timeout

A fun dive into some unix/linux internals with nine different methods to run a program with timeouts and retries. If you have a soft spot in your heart for signals and system calls, this one’s for you.

Philippe Gaultier

When to Use Cosmos DB

Cosmos DB is Azure’s answer to Amazon’s DynamoDB. This article gives a nice overview and compares it to various other data stores to help you decide whether it’s right for your use case.

Adam Gordon Bell — Pulumi

Designing a Zero Downtime Migration Solution with Strong Data Consistency – Part I

An engineer at Mercari shares their plan for migrating to their new payment system in this five-part article series, all of which are published now. They created their design after reading 80(!) similar articles from folks at other companies.

resotto — Mercari

SRE Weekly Issue #450

lex

November 10, 2024

General

Comments

View on sreweekly.com

The Unofficial SRE Track for KubeCon NA ’24

If you’re heading to KubeCon this week, here are some talks to consider.

JJ Tang — Rootly

Choosing the right Postgres indexes

This article shows you how to manage Postgres indexes: when you need one, what type of index to choose, and how to set it up.

Milly Leadley — incident.io

Cloudflare’s perspective of the October 30, 2024, OVHcloud outage

It’s neat that Cloudflare can see evidence of a BGP route leak in a third party that affected OVHcloud.

Bryton Herdes, David Belson, and Tanner Ryan — Cloudflare

Making Temporal Cloud a multi-cloud platform

In this post, we’ll explore how we leveraged Temporal’s own capabilities to expand our infrastructure from AWS to Google Cloud, the challenges we faced along the way, and how we solved them using cloud-agnostic workflows.

Raphaël Beamonte — Temporal

Ivory Tower Architect

This deeply opinionated piece advocates against the “Architect” role, at least in certain forms. Among other problems, the Architect role breaks ownership models and impedes others.

Alex Ewerlöf

We’re leaving Kubernetes

These folks were using Kubernetes for their product that provides hosted developer environments (build systems, toolchains, and the like). While they directly acknowledge that their use case is not the same as common production environments, I still found it pretty interesting to learn about the problems they ran into that ultimately caused them to find another platform.

Christian Weichel and Alejandro de Brito Fontes — Gitpod

How to build an AI Agent for SRE

This extensive guide shows you how to build an LLM-based agent to assist with incident response. It includes python code snippets and shows you how to provide the LLM agent with documentation and access to external data sources like PagerDuty.

Eric Abruzzese — Aptible

Corporate governance, defence in depth and the Swiss Cheese Model of incident causation

A primer on the Swiss Cheese Model for modeling how accidents happen. I especially like the section at the end that suggests more nuanced thinking.

Jonathan Cheyne — Johnson Winter Slattery

SRE Weekly Issue #449

lex

November 3, 2024

General

Comments

View on sreweekly.com

Announcing: 52 Weeks of SRE – A Journey to Master Site Reliability Engineering

This new series seems promising! I won’t link to every article in the series here, but if you’re an early SRE, the intro-level articles published so far in this series are definitely worth a read.

Today, I’m thrilled to announce an ambitious project that’s been in the works for some time: “52 Weeks of SRE” – a comprehensive, year-long deep dive into the world of Site Reliability Engineering.

J. Pereira

The Karpenter Effect: Redefining Our Kubernetes Operations

Adevinta shifted from Kubernetes’s cluster autoscaler to AWS’s Karpenter. The change brought huge advantages that they discuss in detail, along with a few challenges and pitfalls they needed to overcome.

Tanat Lokejaroenlarb — Adevinta

Is this thing on? Using OpenBMC and ACPI power states for reliable server boot

An adventure in adopting an open source firmware for Baseboard Management Controllers, including fixing a few bugs themselves.

Nnamdi Ajah, Ryan Chow, and Giovanni Pereira Zantedeschi — Cloudflare

Accelerating Connection Handshakes

[…] an overview of methods like TCP FastOpen, TLSv1.3, 0-RTT, and HTTP/3 to reduce handshake delays and improve server response times in secure environments.

Maksim Kupriianov — DZone

How to – Choose the Right Instance Size for AWS RDS

This article includes general tips and a specific rubric you can follow to decide when to choose a larger or smaller RDS instance type.

Prabesh

Incident Management from an Astronaut

It turns out that a lot of the lessons that Mike Massimino learned as an astronaut apply very well to incident management.

Eric Silberstein — Klaviyo

How we avoided an outage caused by running out of IPs in EKS

Solving IP exhaustion in EKS: Avoiding a network outage by implementing custom networking

Fabián Sellés Rosa — Adevinta

What’s new with Robinhood, our in-house load balancing service

By leveraging proportional–integral–derivative (PID) controllers, Robinhood can now more quickly and effectively manage load imbalances.

This was my first introduction to PID controllers. Neat!

Yi-Shu Tai — Dropbox

The carefulness knob

Through an allegory about an imaginary knob to adjust between risk-avoidance and speed, Lorin Hochstein shows us that these trade-offs are being made, just implicitly.

Lorin Hochstein

SRE Weekly Issue #453

SRE Weekly Issue #452

SRE Weekly Issue #451

SRE Weekly Issue #450

SRE Weekly Issue #449

Subscribe

RSS

Mastodon

Search Issues

A message from our sponsor, FireHydrant:

A message from our sponsor, FireHydrant:

A message from our sponsor, FireHydrant:

A message from our sponsor, FireHydrant:

A message from our sponsor, FireHydrant:

Subscribe

RSS

Mastodon

Search Issues