This one’s definitely going to be good to keep in mind during my next incident.
FYI for folks with no or low vision, there’s a screenshot of J. Paul Reed quoting Vanessa Huerta Granda: “Incidents are where engineers are made”.
Stuart Rimell — Uptime Labs
Etsy migrated a 1,000-table DB with 1,000 shards (with their own custom ORM!) over to Vitess, and it took some care, especially in how they handled transactions.
Ella Yarmo-Gray — Etsy
Wow, this one sure hits hard.
Kenneth Eversole
The section on lessons learned toward the end of this debugging story is a goldmine.
Lokesh Soni
How do you ensure reliability in a system you can’t access? How can you monitor SLIs/SLOs without metrics?
Alex Ewerlöf
I love a good debugging story, and this one delivers, with a confluence of gnarly problems and lessons we can all learn from.
James Sawyer — Phantom Tide
Oof, what a nasty little gotcha in the API call at the heart of this incident.
David Tuber and Dzevad Trumic — Cloudflare
Lorin’s Law strikes again!
System intended to improve reliability contributed to incident
Lorin Hochstein
