SRE WEEKLY – Page 22 – scalability, availability, incident response, automation

SRE Weekly Issue #377

lex

June 18, 2023

Articles

Why did AWS Support fail with US-EAST-1 again?

AWS had a major Lambda outage in us-east-1, and it took out many customer systems and quite a few other AWS systems, including their support portal.

The Stack

How I went from Operations Manager to Site Reliability Engineer In 6 Months!

This person had a fascinating path to SRE, starting out their career as a generator repair technician and transitioning through DevOps to SRE.

Brian Hellinger — Towards AWS

Migrating Critical Traffic At Scale with No Downtime — Part 2

In part 1, they outlined how they replay real traffic to test a new system before deploying it. In this article, they build on that with three additional techniques: sticky canaries, A/B testing, and gradually shifting traffic to the new system in production.

Shyam Gala, Javier Fernandez-Ivern, Anup Rokkam Pratap, and Devang Shah — Netflix

The Data Behind Delayed Status Page Updates

By comparing status page posting to their independent monitoring of services, Metrist is able to produce statistics about how long companies take to post to their status pages when they have an outage.

Jeff Martens — Metrist

When there’s no plan for this scenario, you’ve got to improvise

Improvising during an incident isn’t just a one-off occurrence, and we should plan for it.

Lorin Hochstein — Surfing Complexity

Heroku Incident 2558 Followup

A foreign key column had a smaller integer data type than the key that it referenced, and it failed when the referenced key went too high.

Heroku
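To make the failure mode concrete, here is a minimal sketch of a foreign key column that is narrower than the key it references. The 32-bit and 64-bit widths and the `insert_child_row` helper are assumptions for illustration; the summary above only says the column's integer type was smaller, and this is not Heroku's schema or code.

```python
INT32_MAX = 2**31 - 1   # upper bound of a 4-byte signed integer column (assumed width)
INT64_MAX = 2**63 - 1   # upper bound of an 8-byte signed integer column (assumed width)


def insert_child_row(parent_id: int, fk_column_max: int = INT32_MAX) -> dict:
    """Pretend to insert a row whose foreign key column holds at most fk_column_max."""
    if parent_id > fk_column_max:
        raise OverflowError(
            f"parent_id {parent_id} exceeds column maximum {fk_column_max}"
        )
    return {"parent_id": parent_id}


# The referenced key is wider, so it keeps growing past the narrow column's limit.
last_ok = insert_child_row(INT32_MAX)            # the last id that still fits

try:
    insert_child_row(INT32_MAX + 1)              # the first id that no longer fits
    overflowed = False
except OverflowError:
    overflowed = True

# Widening the foreign key column to match the referenced key removes the failure.
widened = insert_child_row(INT32_MAX + 1, fk_column_max=INT64_MAX)
```

Nothing fails until the referenced key crosses the smaller type's maximum, which is why a mismatch like this can sit unnoticed for a long time.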

Scalable chat app architecture: How to get it right the first time

Here, we’ll look at the key considerations you need to make when it comes to the architecture of your chat app, the structure and components of that architecture, and some of the technology options that can help support you in building a reliable chat experience.

Ably

A Watery Surprise: The crash of National Airlines flight 193

A departure from the normal air traffic control procedure allowed the pilots to lose situational awareness. A commonly held myth about flotation equipment contributed to three deaths in a quite survivable accident.

Admiral Cloudberg

It’s not always DNS — unless it is

They kept finding what they thought was the problem, and their fixes helped, but the problem kept coming back.

Tanat Paul Lokejaroenlarb — Adevinta

SRE Weekly Issue #376

lex

June 11, 2023

Articles

2023 03 08 Incident: A Deep Dive into Our Incident Response

With 100 workstreams and over 500 engineers engaged, this was the biggest incident response I’ve read about in years.

We had to force ourselves to identify the facts on the ground instead of “what ought to be,” and overrule our instincts to look for data in the places we normally looked (since our own monitoring was impacted).

Laura de Vesine — Datadog

How the ‘3 Pillars of Observability’ Miss the Big Picture

When you unify these three “pillars” into one cohesive approach, a new ability to understand the full state of your system in several new ways also emerges.

Danyel Fisher — The New Stack
Full disclosure: Honeycomb, my employer, is mentioned.

Azure DevOps Outage in South Brazil

This report details the 10-hour incident response following the accidental deletion of live databases (rather than their snapshots, as intended).

Eric Mattingly — Azure

Show HN: Keep – Create production alerts from plain English

Neat trick: write your alerts in English and get GPT to convert them to real alert configurations.

Shahar and Tal — Keep (via HackerNews)

A potential issue with outstanding query limits in your DNS resolver

If your DNS resolver is responsible for handling queries for both internal and external domains, what happens when external DNS requests fail? Can internal ones still proceed?

Chris Siebenmann
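The question above can be explored with a toy model. This sketch assumes a resolver that caps in-flight queries with a simple counter; `QueryPool`, `simulate`, and the limit of 100 are hypothetical and do not correspond to any real resolver's settings.

```python
class QueryPool:
    """Tracks in-flight queries against a fixed cap (a toy stand-in for a resolver limit)."""

    def __init__(self, limit: int):
        self.limit = limit
        self.outstanding = 0

    def try_start(self) -> bool:
        """Claim a slot for a new query; refuse when the cap is already reached."""
        if self.outstanding >= self.limit:
            return False
        self.outstanding += 1
        return True


def simulate(external_pool: QueryPool, internal_pool: QueryPool, stuck_external: int) -> bool:
    """External upstreams are down, so external queries never finish and keep their slots.

    Returns whether one internal query can still be started afterwards.
    """
    for _ in range(stuck_external):
        external_pool.try_start()
    return internal_pool.try_start()


# One shared limit: stuck external queries consume every slot.
shared = QueryPool(limit=100)
internal_ok_shared = simulate(shared, shared, stuck_external=500)

# Separate limits: internal queries keep their own headroom.
internal_ok_split = simulate(QueryPool(limit=100), QueryPool(limit=100), stuck_external=500)
```

In this model the internal query is refused under a shared limit and accepted when internal and external traffic are counted separately.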

Delusion Soup: How Observability Got Here, and What We Can Do About It

This article explains potential pitfalls and downsides to observability tools and the ways vendors might try to get you to use them, along with tips for how to avoid the traps.

David Caudill

Treating uncertainty as a first-class concern

Too often, we dismiss the anomaly we just faced in an incident as a weird, one-off occurrence. And while that specific failure mode likely will be a one-off, we’ll be faced with new anomalies in the future.

Lorin Hochstein — Surfing Complexity

SRE Weekly Issue #375

lex

June 4, 2023

Articles

How can you land 5 kilometers above the Moon?

An in-depth analysis of the crash of a recent lunar lander. It’s really interesting that a feature designed specifically to improve robustness to failures instead made the system less reliable in unforeseen circumstances.

Robert Barron — IBM

Cloud Dependencies Need to Stop F—ing Us When They Go Down

With each external cloud service you deploy, you introduce the amount of unreliability that product has into your own product’s reliability (even if it’s incredibly small).

Jeff Martens — The New Stack
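A back-of-the-envelope sketch of the arithmetic behind that quote. It assumes every dependency is required for a request to succeed and that failures are independent; the availability figures are hypothetical, not taken from the article.

```python
import math


def composite_availability(availabilities):
    """Availability of a request path that needs every listed component to be up."""
    return math.prod(availabilities)


def expected_downtime_minutes(availability, days=30):
    """Expected unavailable minutes over a window of the given length."""
    return (1 - availability) * days * 24 * 60


own_service = 0.999                    # hypothetical: your service on its own
dependencies = [0.999, 0.999, 0.999]   # hypothetical: three external cloud services

combined = composite_availability([own_service, *dependencies])
print(f"combined availability: {combined:.4%}")
print(f"expected downtime per 30 days: {expected_downtime_minutes(combined):.0f} minutes")
```

Under these assumptions, four components at 99.9% each combine to roughly 99.6%, so each added dependency lowers the ceiling on your own availability a little further.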

How to Get an SRE Role

Are you a software engineer or an IT professional interested in transitioning to an SRE role? You’ve come to the right place! This article provides guidance on the skills and behaviors needed to apply for an SRE position at medium-to-large-sized tech companies successfully.

Amin Astaneh — Certo Modo

Incident vs. bug: How to distinguish between these two (seemingly) related concepts

While it can seem pretty insignificant, properly distinguishing between an incident and a bug is worthwhile. Why? Because it will ultimately help dictate your response to it.

Luis Gonzalez — incident.io

An educational side project

This is impressive: an engineer built an entire model of a ride-share system, complete with simulated riders and drivers, metrics, containerization, the works, all to gain a better understanding of how these kinds of systems work.

Gergely Orosz — Pragmatic Engineer

Why bother with SLI and SLO?

This article answers the most important questions:
* How is using service levels any different than “regular” alarms?
* What’s in it for the company and the teams?
* Why bother? Don’t we already have enough work to do?

Alex Ewerlöf

eBay’s Common Automation Solution for Platform Evolution

Here at eBay, we’ve crafted a brand new approach to automate platform evolution for all applications — one that provides a repeatable and reusable infrastructure to streamline evolution.

Paul Zhang and Tao Jin

How Traceloop Leverages Honeycomb and LLMs to Generate E2E Tests

Interesting idea: feeding trace data into an LLM and asking it to build an end-to-end (E2E) test for the entire system. This article is a good description of what they’re doing, but I’d be interested to hear more about the results.

Nir Gazit — Honeycomb
Full disclosure: Honeycomb is my employer.

Reflections on Amazon Prime Video’s Monolith Move

What conclusions can we draw from the recent announcement that Amazon Prime Video is moving from serverless to a monolith?

The supposed difference between the two methods is not based on the technology itself, but the context in which you’re working.

Ian Miell

SRE Weekly Issue #374

lex

May 28, 2023

Articles

More Memory, More Problems

A fascinating PostgreSQL debugging story that hinges on code comments, of all things.

Christopher White — Prefect

Redpanda’s official Jepsen report: What we fixed, and what we shouldn’t

If you’re a distributed systems nerd, this one’s a real treat. It’s a detailed breakdown of the results of a Jepsen test.

Denis Rystsov — Redpanda

Unbounded memory usage by TCP for receive buffers, and how we fixed it

An investigation into a kernel bug that caused excessive TCP memory usage in certain situations.

Mike Freemon — Cloudflare

Scaling Site Reliability Engineering (SRE) Teams the Right Way

Let’s unpack what scaling a team is all about, what are the indicators, what are steps you can take, and how you know if you’re done.

Biju Chacko — Squadcast

Running Post-Mortems

Here’s another guide on running incident retrospectives and building a repeatable retrospective process.

Amin Astaneh — Certo Modo

New playground: memory spy

Here’s a fun little tool that lets you inspect how data in a C program is represented in memory.

Julia Evans

How we learned to improve Kubernetes CronJobs at Scale

This two-part series explores some shortcomings in Kubernetes’s CronJob system and the ways that Lyft fixed and worked around them.

Kevin Yang — Lyft

Kubernetes CronJob Failed For 24 Days: a Retrospective

And here’s a case where someone ran into the Kubernetes CronJob bug described in the previous article.

Vallery Lancey

SRE Weekly Issue #373

lex

May 21, 2023

Articles

2023 03 08 Incident: Infrastructure connectivity issue affecting multiple regions

Datadog posted a report on their major outage in March, and it’s a doozy. An unattended updates system that they didn’t even want, need, or know about triggered across all hosts in multiple clouds nearly simultaneously, causing a regression.

Alexis Lê-Quôc — Datadog

Addressing GitHub’s recent availability issues

GitHub has had a string of apparently unrelated outages recently, and they’ve posted this description.

Mike Hanley — GitHub

StanzaSystems/awesome-load-management

Oh look, another awesome-* repo relevant to our interests!

A repo of links to articles, papers, conference talks, and tooling related to load management in software services: loadshedding, circuitbreaking, quota management and throttling. PRs welcome.

Laura Nolan and Niall Murphy — Stanza Systems

SRE Story with Matthew Iselin

This interview covers a lot of ground including looking beyond just “up or down” when considering reliability.

Prathamesh Sonpatki — SRE Stories

Debugging a FUSE deadlock in the Linux kernel

If you’re in the mood for a deep systems debugging story, you’re in for a treat. The author takes you along for the ride with a wealth of detailed code snippets.

Tycho Andersen — Netflix

Why `fsync()`: Losing Unsynced Data

Regardless of the replication mechanism you must fsync() your data to prevent global data loss in non-Byzantine protocols.

Denis Rystsov and Alexander Gallego — Redpanda
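For readers who have not called it directly, here is a minimal sketch of what syncing a write looks like at the application level. The `durable_write` helper and the file name are hypothetical; this shows only the local write path, not the replication protocols the article discusses or Redpanda's implementation.

```python
import os
import tempfile


def durable_write(path: str, data: bytes) -> None:
    """Write data and ask the OS to persist it before returning."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()              # move data from Python's buffer into the OS page cache
        os.fsync(f.fileno())   # ask the OS to flush the page cache to the storage device


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "record.log")
    durable_write(target, b"acknowledged write\n")
    with open(target, "rb") as f:
        stored = f.read()
```

Without the `os.fsync` call, the write can be acknowledged while the data still lives only in memory, which is the window in which a crash loses it.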

Emotional Intelligence

Emotional intelligence is a critical skill for SREs, especially when we interact with other teams in fraught situations.

Amin Astaneh — Certo Modo

Fleet Management at Spotify (Part 3): Fleet-wide Refactoring

Wow! Spotify created a set of tools to perform automated refactoring of thousands of repositories at once. This includes the ability to run tests, automatically merge pull requests without human review, and roll refactorings out gradually.

Matt Brown — Spotify

Teach me how to Howie!

Jeli has published a one-page cheat sheet for their highly detailed Howie guide for running incident retrospectives.

Jeli
