Articles
This one’s from a couple of years ago and covers three main themes the author saw at SRECon Americas 2020. Fascinating topics include providing context for newbies, learning from incidents, and rethinking the incident command system.
Taylor Barnett — Transposit
On September 8, Honeycomb had a major outage in data ingestion, and they’ve posted this preliminary report, “pending an in-depth incident review in the upcoming weeks”.
BONUS CONTENT: A second report, covering a separate outage the next day.
Honeycomb
Full disclosure: Honeycomb is my employer.
This is neat! Someone posted a day in their life as an actual SRE, and a bunch of commenters followed suit.
Various commenters — Reddit
Some big names in SRE got together to talk about how to know when your system is broken. Listen to the recording or read this excellent summary that goes in depth on grey failures and more.
Emily Arnott — Blameless
To better scale our systems, our infrastructure and product teams got together and decided to make these optimizations: reduce database loads, conduct load tests and size the demand, and prioritize critical flows.
…and sharding.
Robinhood
A major incident went poorly, and that catalyzed investment in developing a new incident response system. They worked to transition from swarming to Incident Command.
Vikrant Saini — Razorpay
I love this part:
[…] if you have to deploy your microservices in a certain order, they’re not really microservices.
Cortex
This one had an interesting interplay of contributing factors.
Heroku