Yet more proof that DNS behavior varies way more than is obvious at first glance. Who the heck thought longest common prefix matching was a good idea?
Charles Li — eBay
The application may log multiple lines during the lifecycle of a request. Stripe has found it invaluable to also log one final line with a fully summary of the request.
Brandur Leach — Stripe
This is a followup with more detail on the G-Suite outage I reported here last week. A database issue caused two separate outages.
Really great advice about 3 common pitfalls in implementing SL*s.
This research paper explores the marginal boundary, a set of conditions beyond which a system enters a different operating mode and an accident is much more likely. It discusses the concept of coupling between seemingly unrelated parts of the system and shows how economic incentives can push a system toward this boundary.
Dr. Richard Cook and Jens Rasmussen (Original paper)
Thai Wood — Resilience Roundup (summary)
This is an analysis of a recent BGP leak with a discussion about how the impact from such events can be mitigated through emerging best practices.
Alessandro Improta and Luca Sani — Catchpoint
How do you hand over ownership of a system, transferring enough knowledge that the new owners can maintain its availability and reliability successfully?
Aleksandra Gavrilovska — SoundCloud
Shopify works toward Black Friday / Cyber Monday all year long, through a combination of load testing, failure mode analysis, game days, and incident analysis.
Ryan McIlmoyl — Shopify