Email subscribers, my apologies for the double-send last week. I upgraded WordPress and subsequently further cemented my distrust of all version upgrades ever.
I carefully tested a fix in staging before rolling it out gradually in preparation for this week’s issue. Just kidding, I hacked on it live until I got it fixed. Sorry about all those testing tweets. #testinproduction #yolo #SREWeeklydoesnotpracticeSRE
This is Google’s detailed report from their outage last week. This one’s really worth a read; I promise you won’t be disappointed!
I really like this guide and template for writing incident reports. Each section comes with an explanation of what goes there, along with examples.
Booking.com developed their Reliability Collaboration Model to guide the engagement between SRE and product development teams and the responsibilities assigned to each.
Emmanuel Goossaert — Booking.com
Especially timely now, in the thick of the holiday on-call period.
James Frost — Ably
Great tips. I hope your Black Friday / Cyber Monday is going well!
Quentin Rousseau — Rootly
This article is published by my sponsor, Rootly, but their sponsorship did not influence its inclusion in this issue.
I thought it might be better to try a new approach: defining what SRE was by looking at what it’s not. Or to put it another way, what can you remove from SRE and have it still be SRE?
Instead of asking that question, this article urges us to understand what happened.
Another reason that imagining future scenarios is better than counterfactuals about past scenarios is that our system in the future is different from the one in the past.