SRE Weekly Issue #203

Articles

5 Example Postmortems & Best Practices you can Start Using Today

Spot-on advice for writing incident followups, citing examples of real write-ups that exhibit the techniques they recommend.

Hannah Culver — Blameless

On-Call Nightmares With Jay Gordon

“The beautiful thing about going on-call is you get to go off-call. If you aren’t on-call, I have news for you – you’re always on-call”

Jay Gordon — Page It to the Limit

Serving 100µs reads with 100% availability

This is a companion to last week’s article, “Sharing SQLite databases across containers is surprisingly brilliant”. This one explains the broader ctlstore system.

Rick Branson and Collin Van Dyck — Segment

Chaos Mesh – Your Chaos Engineering Solution for System Resiliency on Kubernetes

Chaos Mesh is a versatile Chaos Engineering platform that features all-around fault injection methods for complex systems on Kubernetes, covering faults in Pod, network, file system, and even the kernel.

Chengwen Yin — PingCAP

Getting into people’s heads: how and why to fake it

Fake it ’til you make it clear what motivated the decisions of incident responders.

Lorin Hochstein

Deemed SLIs to put SRE into practice

When running a platform, pay attention to the experience of specific customers, says Google. That may mean inferring their metrics from your own if they haven’t shared their SLIs with you.

Adrian Hilton — Google

Journey into Observability: Telemetry

This article takes a stand against the “Three Pillars of Observability”.

[…] focus on what kinds of questions you’re trying to answer and let that guide your choice of telemetry.

Mads Hartmann

Logging: Rules of thumb

My favorite recommendation is to make log messages “two-way greppable”: easy to find in the logs, and easy to trace back to the exact part of the code that emitted them.

Vladimir Garvardt — HelloFresh
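
One way to read the “two-way greppable” advice, sketched here under assumptions rather than taken from the linked article: keep a fixed, unique literal in each log call and put the variable data in key=value fields after it. Searching the logs for the literal finds every occurrence, and searching the code for the same literal finds the one call site. The event name, function names, and log format below are hypothetical.

```python
import io
import logging


def build_logger(stream):
    """Return a logger that writes 'LEVEL message' lines to the given stream."""
    logger = logging.getLogger("two_way_greppable_demo")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()   # avoid duplicate handlers if called twice
    logger.propagate = False  # keep output out of the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


def log_refund_failure(logger, order_id, reason):
    # The literal "order_refund_failed" (a hypothetical event name) appears
    # verbatim both here and in the log output, so one grep works in both
    # directions. Variable data stays out of the literal itself.
    logger.error("order_refund_failed order_id=%s reason=%s", order_id, reason)


stream = io.StringIO()
logger = build_logger(stream)
log_refund_failure(logger, 1234, "timeout")
output = stream.getvalue()
print(output, end="")
```

Building the message by interpolating values into the middle of a sentence tends to break this property, because the text that lands in the log no longer matches any single string in the source.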

Outages

Dyn Managed DNS
G Suite admin console
WhatsApp Gets its First Ever Outage in 2020, Only Text Service Working
South Africa and other African countries
- An important undersea cable was severed.
US Driver’s License system
- A downstream dependency of many US states’ motor vehicle departments had an outage.
UK National Lottery
Spotify
HootSuite
Instagram
Reddit
LinkedIn
