My phone died this week, and I discovered the hard way that my backups hadn’t been functioning properly. SRE Weekly is served out of a single t2.micro, too. Sometimes it’s hard to practice what I preach outside of work. ;) Anyway, bit of a light issue this week, but still some great stuff.
Articles
I’ve lost count of the number of incidents I’ve witnessed that were caused by TCP connections in which one end disappeared. This cautionary tale has a pretty interesting cause as well.
In many ways, what you do with an incident post-mortem can be as important as the post-mortem itself. Below, I explain why and offer tips on what to do after the post-mortem is complete.
If the title of this article doesn’t make sense to you, then you may well have been interpreting traceroute results incorrectly. Definitely worth a read.
Gremlin Inc. is live! Here’s the official “coming out” post for this chaos engineering startup.
There’s so much to delve into in this long article about Netflix’s infrastructure. It’s part of the book Explain the Cloud Like I’m 10, but I didn’t find the explanations watered-down or over-simplified.
A great description of Booking.com’s incident response and followup process.
Incidents are like presents: You love them as long as you don’t get the same present twice.
Outages
- Incident review: API and Dashboard outage on 10 October 2017 — GoCardless Blog
  - This is a truly epic post-incident analysis from the folks at GoCardless. The highlights: simultaneous 3-drive failure in a RAID array, weird behavior from Pacemaker, a red herring from Postgres, and a multi-month investigation process.
- Slack Server Error, Site Down Amid Friday Outage
  - And a second incident, later the same day.
- Former Rutgers Student Pleads Guilty After Historic Internet Outage
  - The outage in question is the Dyn DDoS in October of 2016, and the student pled guilty to creating the Mirai botnet.
- Eurex and Xetra (stock exchanges)