Articles
Spot-on advice for writing incident follow-ups, citing real write-ups that exhibit the techniques it recommends.
Hannah Culver — Blameless
“The beautiful thing about going on-call is you get to go off-call. If you aren’t on-call, I have news for you – you’re always on-call”
Jay Gordon — Page It to the Limit
This is a companion to last week’s article, Sharing SQLite databases across containers is surprisingly brilliant. This one explains the broader ctlstore system.
Rick Branson and Collin Van Dyck — Segment
Chaos Mesh is a versatile Chaos Engineering platform that features all-around fault injection methods for complex systems on Kubernetes, covering faults in Pod, network, file system, and even the kernel.
Chengwen Yin — PingCAP
Fake it ’til you make it clear what motivated the decisions of incident responders.
Lorin Hochstein
When running a platform, pay attention to the experience of specific customers, says Google. That may mean inferring their metrics from your own if they haven’t shared their SLIs with you.
Adrian Hilton — Google
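To illustrate why per-customer numbers matter, here is a minimal sketch, assuming you have request records tagged with a customer identifier and a success flag. The customer names, the record shape, and the `availability_by_customer` helper are all hypothetical and not taken from the article; the point is only that a healthy aggregate can hide one customer having a bad day.

```python
from collections import defaultdict

def availability_by_customer(requests):
    """Return {customer: fraction of successful requests}.

    `requests` is an iterable of (customer, ok) pairs. This record shape
    is an assumption for illustration, not something the article specifies.
    """
    totals = defaultdict(int)
    good = defaultdict(int)
    for customer, ok in requests:
        totals[customer] += 1
        if ok:
            good[customer] += 1
    return {customer: good[customer] / totals[customer] for customer in totals}

# Hypothetical traffic: one large customer with no errors,
# one small customer failing half the time.
requests = (
    [("big-co", True)] * 990
    + [("small-co", True)] * 5
    + [("small-co", False)] * 5
)

# Platform-wide availability looks fine...
overall = sum(1 for _, ok in requests if ok) / len(requests)

# ...but the per-customer view shows small-co is badly affected.
per_customer = availability_by_customer(requests)
```

Here the overall figure is 99.5%, while `per_customer["small-co"]` is 50%, which is the kind of gap a platform-wide SLI would not reveal.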
This article takes a stand against the “Three Pillars of Observability”.
[…] focus on what kinds of questions you’re trying to answer and let that guide your choice of telemetry.
Mads Hartmann
My favorite recommendation is to make log messages “two-way greppable”: easy to find in the logs, and easy to trace back to the exact part of the code that emitted them.
Vladimir Garvardt — HelloFresh
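One way to read “two-way greppable” is sketched below, using Python's standard `logging` module. This is an interpretation, not necessarily the approach the article recommends: keep the human-readable part of each message a fixed, unique literal, and pass variable data as separate `key=value` fields. The logger name, message text, and `log_payment_declined` function are hypothetical.

```python
import io
import logging

def build_logger(stream):
    """Create a logger that writes to `stream` (the name is hypothetical)."""
    logger = logging.getLogger("orders")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger.addHandler(handler)
    logger.propagate = False
    return logger

def log_payment_declined(logger, order_id, reason):
    # The phrase "payment declined by provider" appears verbatim in both the
    # source and the output, so grepping for it works in either direction.
    # Variable data is kept out of the phrase and appended as key=value pairs.
    logger.info(
        "payment declined by provider order_id=%s reason=%s", order_id, reason
    )

# Capture the output in memory so the result can be inspected.
buffer = io.StringIO()
log = build_logger(buffer)
log_payment_declined(log, 1234, "insufficient_funds")
output = buffer.getvalue()
```

Searching the logs for `payment declined by provider` finds every occurrence regardless of order, and searching the codebase for the same string leads to the single call site. A message assembled from fragments, such as `"payment " + status`, would break the second direction.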
Outages
- Dyn Managed DNS
- G Suite admin console
- WhatsApp Gets its First Ever Outage in 2020, Only Text Service Working
- South Africa and other African countries
  - An important undersea cable was severed.
- US Driver’s License system
  - A downstream dependency of many US states’ motor vehicle departments had an outage.
- UK National Lottery
- Spotify
- HootSuite