Here’s how Algolia was affected by the SaltStack RCE vulnerability earlier this year and how they dealt with it.
Julien Lemoine — Algolia
Includes background information on SRE and example interview questions.
Marlo Vernon — Splunk
DNS, TLS certificates, and Unicode, among other issues, make for some great (and cringe-worthy) stories.
Adam LaGreca, with stories from Charity Majors, Matthew Fornaciari, Liran Haimovitch, Daniel Spoonhower, Lee Liu, and Tina Huang
In this story of a failover gone wrong, they discovered that innodb_flush_log_at_trx_commit had been set incorrectly, which explains why they lost data when they weren’t expecting to.
Rajeev Rai — Razorpay
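For readers unfamiliar with the setting, here is a minimal sketch of how it appears in a MySQL option file. The blurb doesn't say which value Razorpay was running, so the value shown is simply MySQL's documented default, and the comments summarize the documented trade-offs rather than their configuration.

```ini
[mysqld]
# 1 (MySQL default): redo log is written and flushed to disk at every commit.
#    Full durability; committed transactions survive a crash.
# 0: redo log is written and flushed about once per second.
#    A mysqld crash can lose roughly the last second of transactions.
# 2: redo log is written at every commit but flushed about once per second.
#    An OS crash or power loss can lose roughly the last second of transactions.
innodb_flush_log_at_trx_commit = 1
```

Values other than 1 trade durability for write throughput, which is how a failover can surface data loss that nobody expected.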
This is a nice little comic about the role of SRE. Engineer the bridge, don’t be the bridge.
Piyush Verma — Last9
Lots of great concepts about human/computer systems, including this gem:
log facts, not interpretations
In this troubleshooting story, an innocent-seeming dependency upgrade introduced a subtle but nasty bug.
Jordan Place — Transposit
Google released an update to their post-analysis for the December 14th outage involving Google OAuth.