General

SRE Weekly Issue #435

lex

July 28, 2024

CrowdStrike Preliminary Post Incident Review

CrowdStrike released a lot more discussion about what happened with their bad deployment, and yet there’s still a frustrating lack of detail on the actual cause of the blue screens.

CrowdStrike

Can a good explanation really prevent a prod incident?

A story of how properly positioned rationales can be powerful enough to prevent prod incidents.

And a great place to put that rationale is in your git commit, says this article.

Jean-Mark Wright

Performance and Scalability of Redis and Memcached

Need to choose between Redis and Memcached? This one’s for you, with a qualitative comparison and relative performance numbers.

Rahul Chandel

Realtime fan experiences: Making them economically viable, at scale

How do you promote interactions between fans without exploding your system with n-squared worth of messages, where n is the number of users?

Matthew O’Riordan — Ably

I’m sorry, but the way you adopt serverless is wrong

If you want to convert to serverless, don’t switch to microservices or change your datastore at the same time, argues this article.

Yan Cui

Delivering Millions of Notifications within Seconds During the Super Bowl

It’s all about the owl’s butt (and sending 4 million push notifications in 5 seconds).

Zhen Zhou — InfoQ

Destroy on Friday: The Big Day 🧨 A Chaos Engineering Experiment – Part 2

Here’s the second part of the article I wrote on my recent project at work, taking down a full AZ in production.

Lex Neva — Honeycomb

Full disclosure: Honeycomb is my employer.

SRE Weekly Issue #434

lex

July 21, 2024

General

Technical Details: Falcon Update for Windows Hosts

The big news this week, of course, is the CrowdStrike-related series of outages in airports, banks, and many other businesses. Here’s their statement on the situation.

Rumor has it that Southwest Airlines survived because they run Windows 3.1. Well, that’s one way to do it.

CrowdStrike

Take the Annual SRE Survey

It’s time for Catchpoint’s annual SRE survey again! We get a lot of interesting information about SRE trends from this, so it’d be great if you could take a moment to fill it out.

Note, usually I try to avoid giving you “utm” stuff in links, but this link is specifically set up to track whether folks come from SRE Weekly, so I left it in this time.

Catchpoint

You don’t always need a queue

Queues have a cost, as this article explains.

Jean-Mark Wright

Deploy on Friday? How About Destroy on Friday! A Chaos Engineering Experiment – Part 1

I wrote this article about an exciting project I led recently: taking down an entire availability zone in production to test reliability. Part 2 is due out next week!

Lex Neva — Honeycomb

Full disclosure: Honeycomb is my employer.

How to prevent accidental load balancer deletions

Deletion protection: it can really save you!

Andre Newman — Gremlin

A Look Into Netflix System Architecture

A thorough overview of Netflix’s architecture, with focus on data stores, content processing, billing, and the CDN, among other topics.

Rahul Shivalkar — ClickIT

Degradation vs disruption

This article compares the terms “degradation”, “disruption”, and “service outage” through the lens of service levels.

Alex Ewerlöf

Enhancing cloud storage efficiency with s3-batch-object-store

Their workload involved writing many small objects but reading very few. By batching many writes into a single object in S3, they saved a ton of money, and now they’re open sourcing their solution.

Pablo Matias Gomez — Embrace

SRE Weekly Issue #433

lex

July 14, 2024

General

5 Non-Technical Skills Every Site Reliability Engineer Should Master

This article covers five skills:

Ability to Lead

Taking Charge in Critical Situations

Expressing Opinions in a Non-Conflicting Way

Leading Initiatives for Continuous Improvement

Building and Maintaining Relationships

Prabesh

Automating Telemetry Capture in Python using Bytecode

I was pretty dubious most of the way through this article — until I realized it was a story about why this solution didn’t work for them. Now it’s an interesting read about Python and exercising restraint in complexity.

Jean-Mark Wright

Leveraging AI for efficient incident response

Meta is training an LLM to suggest commits that may have caused a given incident, and its suggestions are right 42% of the time.

Diana Hsu, Michael Neu, Mohamed Farrag, and Rahul Kindi — Meta

Percentile

Percentiles, because when your math(s) teacher told you you’d use math all the time when you grew up, they were right! This article does a great job of explaining percentiles if you’re having trouble wrapping your mind around them.

Alex Ewerlöf
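
To make the percentile idea above concrete, here is a minimal sketch using the nearest-rank method. The latency numbers are made up for illustration, and nearest-rank is only one of several common percentile definitions; it is not necessarily the one the article uses.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of the data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(values)
    # Rank is 1-based; multiply before dividing to avoid float rounding surprises.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds, with two slow outliers.
latencies_ms = [12, 15, 11, 14, 250, 13, 16, 12, 900, 14]

for p in (50, 90, 99):
    print(f"p{p}: {percentile(latencies_ms, p)} ms")
print(f"mean: {sum(latencies_ms) / len(latencies_ms)} ms")
```

With this sample the median is 14 ms while the mean is 125.7 ms, which is why tail percentiles tend to say more about user experience than averages do.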

Enhancing Netflix Reliability with Service-Level Prioritized Load Shedding

Netflix designed their load shedding system to efficiently drop the requests that don’t matter as much and prioritize what users really care about.

Anirudh Mendiratta, Kevin Wang, Joey Lynch, Javier Fernandez-Ivern, and Benjamin Fedorka — Netflix

Cascading failures and the impossibility of scheduling team lunches

This article illustrates cascading delays in microservices and describes three techniques for dealing with them: timeouts, retries, and circuit breakers.

Jean-Mark Wright
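
Of the three techniques named above, the circuit breaker is the least obvious, so here is a rough sketch of one. The class name, thresholds, and injectable clock are all assumptions for illustration, not the article's code, and a production breaker would need thread safety and better failure classification.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker refuses a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_after = reset_after              # seconds to wait before a trial call
        self.clock = clock                          # injectable so it can be tested
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast instead of piling more load on a struggling dependency.
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: fall through and allow one trial call (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

The point of the design is that once the circuit is open, callers get an immediate error rather than waiting on a timeout, which stops delays from cascading upstream.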

Cloudflare 1.1.1.1 incident on June 27, 2024

Cloudflare’s public DNS resolver had an outage due to a (probably accidental?) BGP hijack. 1.1.1.1 is a common address used internally for testing routing, so it’s easy to understand how an accidental route leak happened.

Bryton Herdes, Mingwei Zhang, and Tanner Ryan — Cloudflare

A write-ahead log is not a universal part of durability

Here’s a new post about durability and write-ahead logs. Write-ahead logs are used almost everywhere. But to build an intuition for why, it is helpful to imagine what you would do without a WAL.

Phil Eaton

SRE Weekly Issue #432

lex

July 7, 2024

General

Investigating Mysterious Kafka Broker I/O When Using Confluent Tiered Storage

In this debugging story, an engineer wielded SystemTap to figure out why a Kafka broker was doing a ridiculous amount of reads.

Terra Field — Honeycomb

Full disclosure: Honeycomb is my employer.

The Reality of Adding Nines to Your SLOs

A concise breakdown of the math involved in getting that extra nine of reliability.

It all boils down to creating the SLOs and requirements to keep your users happy, but nothing more. Unnecessary reliability is a high cost.

Thomas Stringer
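
As a back-of-the-envelope companion to the nines math above, this sketch converts an availability target into a downtime budget. The function name and the 30-day window are my own assumptions, not taken from the article.

```python
from datetime import timedelta


def allowed_downtime(slo_percent, window):
    """Return how much downtime an availability SLO permits over the given window."""
    if not 0 <= slo_percent <= 100:
        raise ValueError("slo_percent must be between 0 and 100")
    return window * (1 - slo_percent / 100)


window = timedelta(days=30)  # assumed rolling window
for slo in (99.0, 99.9, 99.99, 99.999):
    budget = allowed_downtime(slo, window)
    print(f"{slo}% -> {budget.total_seconds() / 60:.2f} minutes of downtime per 30 days")
```

Each extra nine cuts the budget by a factor of ten: 99.9% allows about 43.2 minutes per 30 days, while 99.99% allows about 4.32 minutes.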

Becoming a Senior Site Reliability Engineer: A Guide to Upskilling

If you’re looking to advance in SRE, this article has some examples of the skills and experience you should aim for.

Prabesh

How Many Maybe’s until Empathy?

Will Gallego shows us a way of thinking that helps turn “should haves” into deeper understanding of our sociotechnical systems.

Will Gallego

Vassil Popovski on LinkedIn Re: scalability

Some words of wisdom I came across this week around startups choosing not to work on scalability too early.

Vassil Popovski

r/sre: do you think it’s become easier or harder to be an SRE in the last 5-10 years?

Some commenters in this reddit thread are saying it’s easier to be called an SRE, but what does it mean? Some say SRE has gotten easier, and some say it’s gotten harder. What do you think?

u/sreiously and others — reddit

Assessment of Rogers Networks for Resiliency and Reliability Following the 8 July 2022 Outage – Executive Summary

The full report isn’t available yet (and may not ever be?) but this executive summary has a lot of juicy bits about the major 2022 Rogers internet and emergency service outage in Canada.

Xona Partners, Inc.

Quick takes on Rogers Network outage executive summary

The Rogers report executive summary includes some blamey and blame-adjacent language, and this analysis does a good job of calling it out and suggesting ways to recast it.

Lorin Hochstein

“Out of band” network management is not trivial

The Rogers outage report executive summary indicates that truly out-of-band network management access may have made recovery easier. What exactly is involved in setting that up?

Chris Siebenmann

SRE Weekly Issue #431

lex

June 30, 2024

General

Cloudflare incident on June 20, 2024

This is a really thorny one. As individual subprocesses started infinitely looping, their system shifted load to other datacenters, masking the problem. A coinciding failure in the load shifting system made things even more interesting.

Lloyd Wallis, Julien Desgats, and Manish Arora — Cloudflare

Are dashboards dead? Not quite. They just haven’t evolved

A great discussion of where dashboards fall short and what we should look for instead.

Adam Kinniburgh — SquaredUp

How we improved push processing on GitHub

Read how we have significantly improved the ability of our monolith to correctly and fully process pushes from our users.

Will Haltom — GitHub

Can you run in a tight loop and still be well-behaved?

Timing things to happen at specific intervals is yet another way that we collectively find out that dealing with time is a hard problem.

This article illustrates the subtle but important pitfalls in trying to create a system that does something on a strict interval.

rachelbythebay
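
One pitfall in this area is drift: a loop that runs a task and then sleeps for the full interval gets later on every pass by however long the task took. Below is a sketch of a common mitigation, scheduling against fixed deadlines on a monotonic clock. This is a general technique with hypothetical function names, and may not be the approach the article describes.

```python
import time


def next_deadline(start, interval, now):
    """Return the next tick strictly after `now` on the grid start + k * interval."""
    elapsed = now - start
    ticks_done = int(elapsed // interval) + 1
    return start + ticks_done * interval


def run_every(interval, task, iterations, clock=time.monotonic, sleep=time.sleep):
    """Run `task` on a fixed grid; an overrun skips missed ticks instead of bunching up."""
    start = clock()
    for _ in range(iterations):
        task()
        deadline = next_deadline(start, interval, clock())
        # Sleep only for the time remaining, so task duration does not accumulate as drift.
        sleep(max(0.0, deadline - clock()))


if __name__ == "__main__":
    run_every(0.2, lambda: print("tick", round(time.monotonic(), 2)), iterations=3)
```

Using a monotonic clock rather than wall-clock time also keeps the loop from misbehaving when the system clock is adjusted.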

Using LLMs to Generate Terraform Code

This article reads more like a case study. The author gave a prompt to three different LLMs and actually tested the Terraform config they produced.

Mike Vanbuskirk — Terrateam

How the Pusher team built subscription counting at scale

When your pub/sub system can have a million subscribers, even something as mundane as notifying about subscriber counts requires careful thought.

Ashmeet Singh — Pusher

Quick and Dirty vs. Polished and Perfect: The Two Sides of Engineering

To me, this concept comes up over and over in SRE, and it’s a core part of SLOs.

Juraj Masar — BetterStack

Feature flag vs feature management

In this blog post, we’ll dive deep into the technical aspects of feature flags and feature management, exploring how they can be leveraged by SREs to enable progressive delivery, improve system resilience, and optimize the user experience.

Hope Lynch — CloudBees

Pilots Unable to PULL UP!! Air Transat flight 211

This week’s Mentour Pilot video covers an accident that involved an inaccurate flight simulator. I wasn’t familiar with the term “negative training” before, but now I’m going to be keeping an eye out for it in the systems I manage!

Mentour Pilot

Subscribe

RSS

Mastodon

Search Issues

General

A message from our sponsor, FireHydrant:
