What do you do when your hosts have kernel crashes at random every day? It turns out that you don’t need to be a seasoned kernel programmer to find a solution.
Pavlos Parissis — Booking.com
This is my first introduction to tcpconnect (part of BCC). Pretty nifty!
At Facebook, […] It is simply too difficult to rewrite caching/admission/eviction policies and other manually tuned heuristics by hand. We have to fundamentally change how we think about software maintenance.
Vladimir Bychkovsky, Jim Cipar, Alvin Wen, Lili Hu, and Saurav Mohapatra — Facebook
A couple weeks back, I linked to a postmortem template. Here’s a gameday report template from the same author.
I had a really hard time choosing whether to include this one. On the one hand, it’s a really interesting article about service discovery in franchises that has to work right every time. On the other hand, Chick-fil-A has a terrible track record on GLBT rights, and I can’t overlook that.
Ultimately, I’m choosing to link to this article for its educational content, but I urge you to join me as I continue to boycott Chick-fil-A.
Brian Chambers, Caleb Hurd, and Alex Crane — Chick-fil-A
At 9 years old, this may be the oldest article I’ve linked to, but it’s worth it. The analogy to a home mortgage is spot on.
Click through to read about an interesting monitoring challenge and an account of how they solved it. I appreciate the emphasis on the importance of educating engineers to spread the knowledge of how the new system works among more people.
Joy Zheng and Jeeyoung Kim — Plaid
Another chaos engineering introduction. Why should you read it? If nothing else, the architecture diagram with the skull and cobwebs on it is pretty great. It’s also well worth reading if you’re looking to create a chaos engineering game plan.
Benjamin Wilms — Codecentric
Sometimes, a reliability risk can come in the form of a bunch of angry customers.
Ben Kuchera — Ars Technica
- Full disclosure: Fastly is my employer.
- Google Cloud Console
- This is a followup post for an incident on June 27. Another great example of a complex, multi-level failure.
- Google Cloud networking in europe-west1-b and europe-west4-b
- Another outage involving the Disney app and the Fastpass system.