SRE Weekly Issue #203

A message from our sponsor, VictorOps:

Bulkhead and sidecar application design patterns can be used to create more efficient incident response workflows for DevOps and IT operations. Learn more:

https://go.victorops.com/sreweekly-bulkhead-and-sidecar-design-patterns

Articles

Spot-on advice for writing incident followups, citing examples of real write-ups that exhibit the techniques they recommend.

Hannah Culver — Blameless

“The beautiful thing about going on-call is you get to go off-call. If you aren’t on-call, I have news for you – you’re always on-call”

Jay Gordon — Page It to the Limit

This is a companion to last week’s article, “Sharing SQLite databases across containers is surprisingly brilliant”. This one explains the broader ctlstore system.

Rick Branson and Collin Van Dyck — Segment

Chaos Mesh is a versatile Chaos Engineering platform that features all-around fault injection methods for complex systems on Kubernetes, covering faults in Pod, network, file system, and even the kernel.

Chengwen Yin — PingCAP

Fake it ’til you make it clear what motivated the decisions of incident responders.

Lorin Hochstein

When running a platform, pay attention to the experience of specific customers, says Google. That may mean inferring their metrics from your own if they haven’t shared their SLIs with you.

Adrian Hilton — Google
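
To make the idea of inferring a customer's metrics from your own a little more concrete, here is a minimal sketch rather than anything taken from the article. It assumes the platform already records a customer ID and an HTTP status for each request; the customer names, the `per_customer_availability` helper, and the "status below 500 counts as good" rule are all illustrative assumptions.

```python
from collections import defaultdict


def per_customer_availability(requests):
    """Return {customer: fraction of requests that were not server errors}.

    `requests` is an iterable of (customer_id, http_status) pairs taken from
    the platform's own request records (a hypothetical data source).
    """
    totals = defaultdict(int)
    good = defaultdict(int)
    for customer, status in requests:
        totals[customer] += 1
        if status < 500:  # assumption: any non-5xx response counts as "good"
            good[customer] += 1
    return {customer: good[customer] / totals[customer] for customer in totals}


# Hypothetical traffic: one small customer is having a bad time,
# one large customer is perfectly healthy.
requests = (
    [("acme", 200), ("acme", 503), ("acme", 503), ("acme", 200)]
    + [("globex", 200)] * 96
)

by_customer = per_customer_availability(requests)
overall = sum(1 for _, status in requests if status < 500) / len(requests)

print("overall:", overall)
for customer, availability in sorted(by_customer.items()):
    print(customer, availability)
```

In this made-up data the platform-wide number looks healthy while one customer sees half of its requests fail, which is the kind of gap a per-customer view is meant to expose.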

This article takes a stand against the “Three Pillars of Observability”.

[…] focus on what kinds of questions you’re trying to answer and let that guide your choice of telemetry.

Mads Hartmann

My favorite recommendation is to make log messages “two-way greppable”: easy to find in the logs, and easy to trace back to the exact part of the code that emitted them.

Vladimir Garvardt — HelloFresh
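
One way to read “two-way greppable” (my interpretation, not code from the article) is to keep the message text a fixed string literal and pass the variable parts as separate fields. The same string then turns up whether you grep the logs or the source tree. The sketch below uses Python's standard `logging` module; the `JsonFormatter` class, the `fields` key, and the `charge_card` function are hypothetical names chosen for illustration.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON line: fixed message plus separate fields."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "msg": record.getMessage(),
            "src": f"{record.module}:{record.funcName}",
        }
        # Variable data travels next to the message, never inside it.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def build_logger(stream):
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("two_way_greppable_demo")
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


def charge_card(logger, order_id, amount_cents):
    # The message is a static literal, so grepping the code base for
    # "payment charge declined" leads straight to this line.
    logger.info(
        "payment charge declined",
        extra={"fields": {"order_id": order_id, "amount_cents": amount_cents}},
    )


stream = io.StringIO()
logger = build_logger(stream)
charge_card(logger, "order-123", 4999)
line = stream.getvalue().strip()
print(line)
```

Had the order ID been interpolated into the message itself, the exact text in the logs would no longer appear anywhere in the source, and the trip from log line back to code would take guesswork.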

Outages

Updated: January 19, 2020 — 8:44 pm
A production of Tinker Tinker Tinker, LLC