The big news this week, of course, is the CrowdStrike-related series of outages in airports, banks, and many other businesses. Here’s their statement on the situation.
Rumor has it that Southwest Airlines survived because they run Windows 3.1. Well, that’s one way to do it.
CrowdStrike
It’s time for Catchpoint’s annual SRE survey again! We get a lot of interesting information about SRE trends from this, so it’d be great if you could take a moment to fill it out.
Note: I usually try to avoid giving you “utm” stuff in links, but this link is specifically set up to track whether folks come from SRE Weekly, so I left it in this time.
Catchpoint
Queues have a cost, as this article explains.
Jean-Mark Wright
I wrote this article about an exciting project I led recently: taking down an entire availability zone in production to test reliability. Part 2 is due out next week!
Lex Neva — Honeycomb
Full disclosure: Honeycomb is my employer.
Deletion protection: it can really save you!
Andre Newman — Gremlin
A thorough overview of Netflix’s architecture, with a focus on data stores, content processing, billing, and the CDN, among other topics.
Rahul Shivalkar — ClickIT
This article compares the terms “degradation”, “disruption”, and “service outage” through the lens of service levels.
Alex Ewerlöf
Their workload involved writing many small objects but reading very few. By batching many writes into a single object in S3, they saved a ton of money, and now they’re open sourcing their solution.
Pablo Matias Gomez — Embrace
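To make the batching idea concrete, here is a minimal sketch in Python. It is not Embrace's implementation, and the article's actual design may differ. The `WriteBatcher` class, the `read_record` helper, the `put_object` callable, and the object layout (a 4-byte index length, a JSON index, then the concatenated payloads) are all assumptions made up for illustration. The point is only that many small writes turn into one upload per batch.

```python
import json
from typing import Callable, Dict, List, Tuple


class WriteBatcher:
    """Buffers small records and uploads them as one combined object.

    Hypothetical layout of each uploaded object:
    [4-byte big-endian index length][JSON index][concatenated payloads]
    """

    def __init__(self, put_object: Callable[[str, bytes], None],
                 max_bytes: int = 1_000_000) -> None:
        self._put_object = put_object  # assumed uploader: (key, body) -> None
        self._max_bytes = max_bytes
        self._chunks: List[bytes] = []
        self._index: Dict[str, Tuple[int, int]] = {}  # id -> (offset, length)
        self._size = 0
        self._batch_no = 0

    def add(self, record_id: str, payload: bytes) -> None:
        # Flush first if this payload would push the batch past the limit.
        if self._chunks and self._size + len(payload) > self._max_bytes:
            self.flush()
        self._index[record_id] = (self._size, len(payload))
        self._chunks.append(payload)
        self._size += len(payload)

    def flush(self) -> None:
        if not self._chunks:
            return
        index_bytes = json.dumps(self._index).encode("utf-8")
        body = (len(index_bytes).to_bytes(4, "big")
                + index_bytes + b"".join(self._chunks))
        # One upload for the whole batch instead of one per record.
        self._put_object(f"batch-{self._batch_no:06d}", body)
        self._batch_no += 1
        self._chunks, self._index, self._size = [], {}, 0


def read_record(blob: bytes, record_id: str) -> bytes:
    """Extracts a single record from a combined object."""
    index_len = int.from_bytes(blob[:4], "big")
    index = json.loads(blob[4:4 + index_len])
    offset, length = index[record_id]
    start = 4 + index_len + offset
    return blob[start:start + length]
```

In this sketch `put_object` can be any callable, such as a dict's `__setitem__` for local testing. Against real S3 it could be a thin wrapper around boto3's `put_object(Bucket=..., Key=..., Body=...)`. Reads cost more because the reader has to consult the index, which is an acceptable trade-off when very few objects are ever read back.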