SRE Weekly Issue #102

My phone died this week, and I discovered the hard way that my backups hadn’t been functioning properly. SRE Weekly is served out of a single t2.micro, too. Sometimes it’s hard to practice what I preach outside of work. ;) Anyway, bit of a light issue this week, but still some great stuff.

Articles

sysadvent: Day 13 – Half-Dead TCP Connections and Why Heartbeats Matter

I’ve lost count of the number of incidents I’ve witnessed that were caused by TCP connections in which one end disappeared. This cautionary tale has a pretty interesting cause as well.

Getting the Most from Your Incident Post-Mortem – PagerDuty

In many ways, what you do with an incident post-mortem can be as important as the post-mortem itself. Below, I explain why and offer tips on what to do after the post-mortem is complete.

Reminders about using traceroute in multi-path networks

If the title of this article doesn’t make sense to you, then you may well have been interpreting traceroute results incorrectly. Definitely worth a read.

Introducing Gremlin: Orchestrating Chaos

Gremlin Inc. is live! Here’s the official “coming out” post for this chaos engineering startup.

Netflix: What Happens When You Press Play?

There’s so much to delve into in this long article about Netflix’s infrastructure. It’s part of the book, Explain the Cloud Like I’m 10, but I didn’t really find the explanations watered-down or over-simplified.

Incidents, fixes, and the day after

A great description of booking.com’s incident response and follow-up process.

Incidents are like presents: You love them as long as you don’t get the same present twice.

Outages

Incident review: API and Dashboard outage on 10 October 2017 — GoCardless Blog
- This is a truly epic post-incident analysis from the folks at GoCardless. The highlights: simultaneous 3-drive failure in a RAID array, weird behavior from Pacemaker, a red herring from Postgres, and a multi-month investigation process.
Slack Server Error, Site Down Amid Friday Outage
- And a second incident, later the same day.
Former Rutgers Student Pleads Guilty After Historic Internet Outage
- The outage in question is the Dyn DDoS in October of 2016, and the student pled guilty to creating the Mirai botnet.
Eurex and Xetra (stock exchanges)
Instagram
