SRE Weekly Issue #102


My phone died this week, and I discovered the hard way that my backups hadn’t been functioning properly.  SRE Weekly is served out of a single t2.micro, too.  Sometimes it’s hard to practice what I preach outside of work. ;)  Anyway, bit of a light issue this week, but still some great stuff.

SPONSOR MESSAGE

A robust mobile app is essential for on-call. See why VictorOps updated both native iOS and Android apps. http://try.victorops.com/SREWeekly/MobileBlog

Articles

I’ve lost count of the number of incidents I’ve witnessed that were caused by TCP connections in which one end disappeared. This cautionary tale has a pretty interesting cause as well.

In many ways, what you do with an incident post-mortem can be as important as the post-mortem itself. Below, I explain why and offer tips on what to do after the post-mortem is complete.

If the title of this article doesn’t make sense to you, then you may well have been interpreting traceroute results incorrectly. Definitely worth a read.

Gremlin Inc. is live! Here’s the official “coming out” post for this chaos engineering startup.

There’s so much to delve into in this long article about Netflix’s infrastructure. It’s part of the book, Explain the Cloud Like I’m 10, but I didn’t really find the explanations watered-down or over-simplified.

A great description of Booking.com’s incident response and follow-up process.

Incidents are like presents: You love them as long as you don’t get the same present twice.

Outages

Updated: December 17, 2017 — 9:01 pm
A production of Tinker Tinker Tinker, LLC