SRE Weekly Issue #267

Yet more proof that DNS behavior varies way more than is obvious at first glance. Who the heck thought longest common prefix matching was a good idea?

Charles Li — eBay
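For readers unfamiliar with the rule being grumbled about, here is a minimal sketch. It assumes the item refers to destination address ordering by longest matching prefix, as described in RFC 6724 (Rule 9); the linked article may cover a different resolver behavior. The function names and sample addresses are hypothetical, and real resolvers apply several other rules before this one.

```python
import ipaddress


def common_prefix_len(a: str, b: str) -> int:
    """Count the leading bits two addresses of the same family share."""
    ia = ipaddress.ip_address(a)
    ib = ipaddress.ip_address(b)
    if ia.version != ib.version:
        raise ValueError("addresses must be the same IP version")
    # XOR leaves a 1 at every differing bit; the highest set bit marks
    # where the shared prefix ends.
    diff = int(ia) ^ int(ib)
    return ia.max_prefixlen - diff.bit_length()


def order_by_longest_prefix(source: str, candidates: list[str]) -> list[str]:
    """Hypothetical helper: prefer destinations sharing more leading bits
    with the source address. Ties keep their original order."""
    return sorted(
        candidates,
        key=lambda dest: common_prefix_len(source, dest),
        reverse=True,
    )


if __name__ == "__main__":
    print(order_by_longest_prefix(
        "10.0.0.5", ["10.1.0.9", "10.0.0.200", "192.0.2.1"]
    ))
```

The surprise is that an ordering like this is deterministic for a given client, so a set of DNS records meant to be used round-robin can end up with clients consistently favoring whichever server is numerically "closest" to them.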

The application may log multiple lines during the lifecycle of a request. Stripe has found it invaluable to also log one final line with a full summary of the request.

Brandur Leach — Stripe
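As a rough illustration of the idea, the sketch below collects fields over the life of a request and emits them as one summary line at the end. This is not Stripe's implementation: the `canonical-log-line` prefix, the field names, and the key=value layout are all assumptions made for the example.

```python
def format_canonical_line(fields: dict) -> str:
    """Render collected request fields as a single key=value line.

    The prefix and the quoting rule are illustrative assumptions.
    """
    parts = ["canonical-log-line"]
    for key, value in fields.items():
        text = str(value)
        # Quote values containing spaces so the line stays parseable.
        if " " in text:
            text = '"' + text.replace('"', '\\"') + '"'
        parts.append(f"{key}={text}")
    return " ".join(parts)


class RequestLog:
    """Hypothetical per-request accumulator for summary fields."""

    def __init__(self) -> None:
        self.fields: dict = {}

    def set(self, key: str, value) -> None:
        self.fields[key] = value

    def emit(self) -> str:
        # Called once, after the response has been produced.
        return format_canonical_line(self.fields)


if __name__ == "__main__":
    log = RequestLog()
    log.set("http_method", "GET")      # set when the request arrives
    log.set("http_path", "/v1/charges")
    log.set("http_status", 200)        # set once the handler returns
    log.set("duration_ms", 12)
    print(log.emit())
```

Having everything on one line means a single query over these lines can answer questions such as "which paths returned errors and how long did they take" without joining the other lines logged during the request.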

This is a follow-up with more detail on the G Suite outage I reported here last week. A database issue caused two separate outages.


Really great advice about 3 common pitfalls in implementing SL*s.


This research paper explores the marginal boundary, a set of conditions beyond which a system enters a different operating mode and an accident is much more likely. It discusses the concept of coupling between seemingly unrelated parts of the system and shows how economic incentives can push a system toward this boundary.

Dr. Richard Cook and Jens Rasmussen (Original paper)

Thai Wood — Resilience Roundup (summary)

This is an analysis of a recent BGP leak with a discussion about how the impact from such events can be mitigated through emerging best practices.

Alessandro Improta and Luca Sani — Catchpoint

How do you hand over ownership of a system, transferring enough knowledge that the new owners can maintain its availability and reliability successfully?

Aleksandra Gavrilovska — SoundCloud

Shopify works toward Black Friday / Cyber Monday all year long, through a combination of load testing, failure mode analysis, game days, and incident analysis.

Ryan McIlmoyl — Shopify


Updated: April 25, 2021 — 9:10 pm
