SRE Weekly Issue #437

This week’s issue is entirely focused on the CrowdStrike incident: more details on what happened, analysis, and learnings. I’ll be back next week with a selection of all of the great stuff you folks have been writing while I’ve been off on vacation for the past two weeks. My RSS reader is packed with awesomeness!

A message from our sponsor, FireHydrant:

Migrate off of PagerDuty, save money, and then have all of your configuration exported as Terraform modules? We did that. We know one of the hardest parts of leaving a legacy tool is the old configuration, that’s why we dedicated time to build the Signals migrator, making it easy to switch.

https://firehydrant.com/blog/speedrun-to-signals-automated-migrations-are-here/

This week, CrowdStrike posted quite a bit more detail about what happened on July 19. The short of it seems to be an argument count mismatch, but as with any incident of this sort, there are multiple contributing factors.

The report also continues the conversation about the use of kernel mode in a product such as this, amounting to a public conversation with Microsoft that is intriguing to watch from the outside.

  CrowdStrike

This article has some interesting details about antitrust regulations(!) related to security vendors running code in kernel mode. There’s also an intriguing story of a very similar crash on Linux endpoints running CrowdStrike’s Falcon.

Note: this one is from a couple of weeks ago and some of its conjectures don’t quite line up with details that have been released in the interim.

  Gergely Orosz

While it mentions the CrowdStrike incident only in vague terms, this article discusses why slowly rolling out updates isn’t a universal solution and can bring its own problems.

  Chris Siebenmann

Some thoughts on staged rollouts and the CrowdStrike outage:

The notion we tried to get known far and wide was “nothing goes everywhere at once”.

Note that this post was published before CrowdStrike’s RCA which subsequently confirmed that their channel file updates were not deployed through staged rollouts.

  rachelbythebay

[…] there may be risks in your system that haven’t manifested as minor outages.

Jumping off from the CrowdStrike incident, this one asks us to look for reliability problems in parts of our infrastructure that we’ve grown to trust.

  Lorin Hochstein

While CrowdStrike’s RCA has quite a bit of technical detail, this post reminds us that we need a lot more context to really understand how an incident came to be.

  Lorin Hochstein

In the future, computers will not crash due to bad software updates, even those updates that involve kernel code. In the future, these updates will push eBPF code.

I didn’t realize that Microsoft is working on eBPF for Windows.

  Brendan Gregg

This post isn’t about what Crowdstrike should have done. Instead, I use the resources to provide context and takeaways we can apply to our teams and organizations.

  Bob Walker — Octopus Deploy

SRE Weekly Issue #436

As we can see from the above, any reliability problem like this invalid memory access issue can lead to widespread availability issues when not combined with safe deployment practices.

This analysis from Microsoft starts off by examining crash dumps from the incident that were voluntarily submitted by Windows users. Then they explain why security vendors like CrowdStrike might choose to operate in kernel mode, the inherent risks, and alternative options they could use instead.

  Microsoft

This is CrowdStrike’s initial technical analysis posted shortly after the incident, which I shared here previously. I’m linking to it again to highlight an apparent contradiction with the analysis from Microsoft as to whether the CrowdStrike component involved was a kernel driver:

Although Channel Files end with the SYS extension, they are not kernel drivers.

I’m guessing the technical resolution to this apparent contradiction is that the channel files are merely data files and not kernel drivers, whereas the thing that processes the channel files is in fact a kernel driver. To me this seems like a needless clarification that was highly likely to mislead readers into thinking that kernel drivers were not at play, which is exactly how I interpreted it at the time.

  CrowdStrike

Here’s a summary and opinion piece on Microsoft’s analysis article, including more on the trade-off of vendors running code in kernel mode.

  Thom Holwerda — OSNews

The challenge is, how do you formulate the right free-text representation of your system to get a useful answer out of an LLM?

  Amir Krayden — DevOps.com

Will artfully uses a refrigeration-based metaphor to discuss creating a blameless culture. Trust me, it works.

  Will Gallego

These folks wanted to allow log lines greater than 128 bytes in their observability product, but their data store made that tricky. They used bloom filters and other techniques to achieve acceptable performance.

  Nathan Ostgard and Javier Schoijet — Embrace
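
If you haven't run into bloom filters before, here's a minimal sketch of the idea in Python. To be clear, this is a generic illustration and not Embrace's implementation: the class name, the sizes, and the SHA-256-based hashing are all my own assumptions. A bloom filter can say "definitely not present" or "possibly present", which makes it a cheap way to skip data that cannot match a lookup.

```python
import hashlib


class BloomFilter:
    """Toy bloom filter (hypothetical sizes; not tuned for real workloads)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


seen = BloomFilter()
seen.add("connection timeout")
print(seen.might_contain("connection timeout"))  # always True once added
```

The trade-off is that false positives are possible (and get more likely as the filter fills up), but false negatives are not.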

It turns out sending texts and making phone calls automatically is really hard, and many assumptions you might make turn out to be wrong.

  Leo Sjöberg — incident.io

Wow, I had no idea Systemd could limit a program’s ability to access certain IPs. This one’s worth a read to save you from hair-pulling if you ever run into this.

  rachelbythebay

SRE Weekly Issue #435

A message from our sponsor, FireHydrant:

We’ve gone all out on our new integration with Microsoft Teams. If you’re a MS Teams user, FireHydrant now supports the most comprehensive integration for incident management. Run the entire IM process without ever leaving the chat.

https://firehydrant.com/blog/introducing-a-brand-new-microsoft-teams-integration/

CrowdStrike released a lot more discussion about what happened with their bad deployment, and yet there’s still a frustrating lack of detail on the actual cause of the blue screens.

  CrowdStrike

A story of how properly positioned rationales can be powerful enough to prevent prod incidents

And a great place to put that rationale is in your git commit, says this article.

  Jean-Mark Wright

Need to choose between Redis and Memcached? This one’s for you, with a qualitative comparison and relative performance numbers.

  Rahul Chandel

How do you promote interactions between fans without exploding your system with n-squared worth of messages, where n is the number of users?

  Matthew O’Riordan — Ably

If you want to convert to serverless, don’t switch to microservices or change your datastore at the same time, argues this article.

  Yan Cui

It’s all about the owl’s butt (and sending 4 million push notifications in 5 seconds).

  Zhen Zhou — InfoQ

Here’s the second part of the article I wrote on my recent project at work, taking down a full AZ in production.

  Lex Neva — Honeycomb

  Full disclosure: Honeycomb is my employer.

SRE Weekly Issue #434

The big news this week, of course, is the CrowdStrike-related series of outages in airports, banks, and many other businesses. Here’s their statement on the situation.

Rumor has it that Southwest Airlines survived because they run Windows 3.1. Well, that’s one way to do it.

  CrowdStrike

It’s time for Catchpoint’s annual SRE survey again! We get a lot of interesting information about SRE trends from this, so it’d be great if you could take a moment to fill it out.

Note, usually I try to avoid giving you “utm” stuff in links, but this link is specifically set up to track whether folks come from SRE Weekly, so I left it in this time.

  Catchpoint

Queues have a cost, as this article explains.

  Jean-Mark Wright

I wrote this article about an exciting project I led recently: taking down an entire availability zone in production to test reliability. Part 2 is due out next week!

  Lex Neva — Honeycomb

  Full disclosure: Honeycomb is my employer.

Deletion protection: it can really save you!

  Andre Newman — Gremlin

A thorough overview of Netflix’s architecture, with focus on data stores, content processing, billing, and the CDN, among other topics.

  Rahul Shivalkar — ClickIT

This article compares the terms “degradation”, “disruption”, and “service outage” through the lens of service levels.

  Alex Ewerlöf

Their workload involved writing many small objects but reading very few. By batching many writes into a single object in S3, they saved a ton of money, and now they’re open sourcing their solution.

  Pablo Matias Gomez — Embrace

SRE Weekly Issue #433

This article covers five skills:

  1. Ability to Lead
  2. Taking Charge in Critical Situations
  3. Expressing Opinions in a Non-Conflicting Way
  4. Leading Initiatives for Continuous Improvement
  5. Building and Maintaining Relationships

  Prabesh

I was pretty dubious most of the way through this article — until I realized it was a story about why this solution didn’t work for them. Now it’s an interesting read about Python and exercising restraint in complexity.

  Jean-Mark Wright

Meta is training an LLM to suggest commits that may have caused a given incident, and its suggestions are right 42% of the time.

  Diana Hsu, Michael Neu, Mohamed Farrag, and Rahul Kindi — Meta

Percentiles, because when your math(s) teacher told you you’d use math all the time when you grew up, they were right! This article does a great job of explaining percentiles if you’re having trouble wrapping your mind around them.

  Alex Ewerlöf
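
For a quick feel of why percentiles matter, here's a small Python sketch using the nearest-rank method. That's one of several common definitions and not necessarily the one the article uses, and the latency numbers are made up for illustration.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of
    all samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds, with two slow outliers.
latencies_ms = [12, 15, 11, 14, 13, 250, 16, 12, 13, 900]

print(sum(latencies_ms) / len(latencies_ms))  # mean: 125.6
print(percentile(latencies_ms, 50))           # p50: 13
print(percentile(latencies_ms, 90))           # p90: 250
print(percentile(latencies_ms, 99))           # p99: 900
```

The mean of 125.6 ms describes almost no actual request here: the typical one took 13 ms, and the tail is far worse than the mean suggests.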

Netflix designed their load shedding system to efficiently drop the requests that don’t matter as much and prioritize what users really care about.

  Anirudh Mendiratta, Kevin Wang, Joey Lynch, Javier Fernandez-Ivern, and Benjamin Fedorka — Netflix

This article illustrates cascading delays in microservices and describes three techniques for dealing with them: timeouts, retries, and circuit breakers.

  Jean-Mark Wright
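
Here's a rough Python sketch of how retries and a circuit breaker can fit together. It isn't taken from the article: the class and function names, thresholds, and delays are assumptions of mine, and the timeout itself is assumed to be enforced inside the wrapped call (for example via a socket or HTTP client timeout that raises an exception).

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down has elapsed: let one trial call through (half-open).
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if (self.opened_at is not None
                    or self.failures >= self.failure_threshold):
                self.opened_at = self.clock()
            raise
        self.failures = 0
        self.opened_at = None
        return result


def call_with_retries(breaker, fn, attempts=3, base_delay_s=0.1,
                      sleep=time.sleep):
    """Retry with exponential backoff, but never retry an open circuit."""
    for attempt in range(attempts):
        try:
            return breaker.call(fn)
        except CircuitOpenError:
            raise
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay_s * 2 ** attempt)
```

The point of combining them: retries smooth over brief blips, while the breaker stops callers from piling retries onto a dependency that is already struggling.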

Cloudflare’s public DNS resolver had an outage due to a (probably accidental?) BGP hijack. 1.1.1.1 is a common address used internally for testing routing, so it’s easy to understand how an accidental route leak happened.

  Bryton Herdes, Mingwei Zhang, and Tanner Ryan — Cloudflare

Here’s a new post about durability and write-ahead logs. Write-ahead logs are used almost everywhere. But to build an intuition for why, it is helpful to imagine what you would do without a WAL.

  Phil Eaton

A production of Tinker Tinker Tinker, LLC