SRE Weekly Issue #437

This week’s issue is entirely focused on the CrowdStrike incident: more details on what happened, analysis, and learnings. I’ll be back next week with a selection of all of the great stuff you folks have been writing while I’ve been off on vacation for the past two weeks; my RSS reader is packed with awesomeness!

A message from our sponsor, FireHydrant:

Migrate off of PagerDuty, save money, and then have all of your configuration exported as Terraform modules? We did that. We know one of the hardest parts of leaving a legacy tool is the old configuration; that’s why we dedicated time to building the Signals migrator, making it easy to switch.

https://firehydrant.com/blog/speedrun-to-signals-automated-migrations-are-here/

This week, CrowdStrike posted quite a bit more detail about what happened on July 19. The short of it seems to be an argument count mismatch (see the sketch after this item), but as with any incident of this sort, there are multiple contributing factors.

The report also continues the conversation about the use of kernel mode in a product such as this, amounting to a public back-and-forth with Microsoft that is intriguing to watch from the outside.

  CrowdStrike
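To make the reported mismatch concrete, here’s a minimal C sketch of that failure mode. The field counts and names are my own illustrative assumptions, not CrowdStrike’s actual code: a consumer built to expect 21 input fields receives content that supplies only 20, and reading the missing field runs past the end of the array.

    /* Illustrative only: a reader built for EXPECTED_FIELDS inputs
     * walks off the end when the content provides one fewer. */
    #include <stdio.h>

    #define EXPECTED_FIELDS 21  /* what the consumer was built for */
    #define PROVIDED_FIELDS 20  /* what the content update supplied */

    int main(void)
    {
        const char *fields[PROVIDED_FIELDS] = { 0 };

        /* i == 20 reads past the end of the array: undefined behavior
         * in user space, and in kernel mode an unhandled out-of-bounds
         * read like this faults and crashes the whole machine. */
        for (int i = 0; i < EXPECTED_FIELDS; i++)
            printf("field %d: %p\n", i, (const void *)fields[i]);

        return 0;
    }

In user space this prints garbage or segfaults; in a kernel driver, the same read is a bugcheck.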

This article has some interesting details about antitrust regulations(!) related to security vendors running code in kernel mode. There’s also an intriguing story of a very similar crash on Linux endpoints running CrowdStrike’s Falcon.

Note: this one is from a couple of weeks ago and some of its conjectures don’t quite line up with details that have been released in the interim.

  Gergely Orosz

While it mentions the CrowdStrike incident only in vague terms, this article discusses why slowly rolling out updates isn’t a universal solution and can bring its own problems.

  Chris Siebenmann

Some thoughts on staged rollouts and the CrowdStrike outage:

The notion we tried to get known far and wide was “nothing goes everywhere at once”.

Note that this post was published before CrowdStrike’s RCA, which subsequently confirmed that their channel file updates were not deployed through staged rollouts. A minimal sketch of the principle follows this item.

  rachelbythebay
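Rachel’s “nothing goes everywhere at once” principle above is easy to sketch in code. The hash function and percentage stages below are illustrative assumptions, not any vendor’s actual rollout machinery: each host hashes into a stable bucket, and an update reaches only the hosts whose bucket falls under the current rollout percentage.

    /* Minimal sketch of "nothing goes everywhere at once": gate an
     * update behind a gradually widening rollout percentage. */
    #include <stdint.h>
    #include <stdio.h>

    /* FNV-1a: a stable hash, so each host lands in the same bucket
     * on every evaluation. */
    static uint32_t fnv1a(const char *s)
    {
        uint32_t h = 2166136261u;
        while (*s) {
            h ^= (uint8_t)*s++;
            h *= 16777619u;
        }
        return h;
    }

    /* A host receives the update only once the rollout percentage
     * reaches its bucket; widen rollout_pct in stages (1, 5, 25, 100). */
    static int should_receive_update(const char *host_id, unsigned rollout_pct)
    {
        return fnv1a(host_id) % 100 < rollout_pct;
    }

    int main(void)
    {
        const char *hosts[] = { "host-a", "host-b", "host-c", "host-d" };
        for (size_t i = 0; i < 4; i++)
            printf("%s -> %s\n", hosts[i],
                   should_receive_update(hosts[i], 25) ? "update" : "wait");
        return 0;
    }

The important property is that each host’s bucket is deterministic, so widening the percentage only ever adds hosts, and a bad update is caught while the blast radius is still small.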

[…] there may be risks in your system that haven’t manifested as minor outages.

Jumping off from the CrowdStrike incident, this one asks us to look for reliability problems in parts of our infrastructure that we’ve grown to trust.

  Lorin Hochstein

While CrowdStrike’s RCA has quite a bit of technical detail, this post reminds us that we need a lot more context to really understand how an incident came to be.

  Lorin Hochstein

In the future, computers will not crash due to bad software updates, even those updates that involve kernel code. In the future, these updates will push eBPF code.

I didn’t realize that Microsoft is working on eBPF for Windows. For a taste of what an eBPF program looks like, see the sketch after this item.

  Brendan Gregg
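Here’s a minimal sketch of what “pushing eBPF code” buys you in safety. It’s a generic Linux/libbpf-style example of my own, not anything from the post: the kernel’s verifier must prove the program safe (checked map access, no unbounded loops) before it will load it, so a malformed update is rejected at load time instead of crashing the box.

    /* Minimal eBPF sketch: count process executions.
     * Compile with clang -target bpf; load via libbpf. */
    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    struct {
        __uint(type, BPF_MAP_TYPE_ARRAY);
        __uint(max_entries, 1);
        __type(key, __u32);
        __type(value, __u64);
    } exec_count SEC(".maps");

    SEC("tracepoint/sched/sched_process_exec")
    int count_execs(void *ctx)
    {
        __u32 key = 0;
        __u64 *val = bpf_map_lookup_elem(&exec_count, &key);

        /* The verifier rejects the program outright if this NULL
         * check is missing -- unsafe code never runs. */
        if (val)
            __sync_fetch_and_add(val, 1);
        return 0;
    }

    char LICENSE[] SEC("license") = "GPL";

If the verifier can’t prove the program safe, the load fails cleanly in user space, which is exactly the failure mode you’d want for a pushed kernel-adjacent update.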

This post isn’t about what Crowdstrike should have done. Instead, I use the resources to provide context and takeaways we can apply to our teams and organizations.

  Bob Walker — Octopus Deploy
