SRE Weekly Issue #203

A message from our sponsor, VictorOps:

Bulkhead and sidecar application design patterns can be used to create more efficient incident response workflows for DevOps and IT operations. Learn more:


Spot-on advice for writing incident followups, with examples from real write-ups that exhibit the recommended techniques.

Hannah Culver — Blameless

“The beautiful thing about going on-call is you get to go off-call. If you aren’t on-call, I have news for you – you’re always on-call”

Jay Gordon — Page It to the Limit

This is a companion to last week’s article, Sharing SQLite databases across containers is surprisingly brilliant. This one explains the broader ctlstore system.

Rick Branson and Collin Van Dyck — Segment

Chaos Mesh is a versatile Chaos Engineering platform that features all-around fault injection methods for complex systems on Kubernetes, covering faults in Pod, network, file system, and even the kernel.

Chengwen Yin — PingCAP

Fake it ’til you make it clear what motivated the decisions of incident responders.

Lorin Hochstein

When running a platform, pay attention to the experience of specific customers, says Google. That may mean inferring their metrics from your own if they haven’t shared their SLIs with you.

Adrian Hilton — Google

This article takes a stand against the “Three Pillars of Observability”.

[…] focus on what kinds of questions you’re trying to answer and let that guide your choice of telemetry.

Mads Hartmann

My favorite recommendation is to make log messages “two-way greppable”: easy to find in the logs, and easy to trace back to the exact part of the code they come from.

Vladimir Garvardt — HelloFresh
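
One way to read the “two-way greppable” advice, as a minimal sketch: keep the message text constant and unique, and put the variable data in separate key=value fields. The logger name, message text, and field names below are hypothetical and not taken from the article.

```python
import io
import logging

def build_logger(stream):
    """Return a logger that writes 'LEVEL message' lines to the given stream."""
    logger = logging.getLogger("orders_example")  # hypothetical logger name
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger

def log_event(logger, message, **fields):
    """Log a constant message followed by sorted key=value fields."""
    suffix = " ".join(f"{key}={value}" for key, value in sorted(fields.items()))
    logger.info("%s %s", message, suffix)

stream = io.StringIO()
logger = build_logger(stream)
# The literal "failed to reserve stock" is meant to appear only once in the
# source, so grepping the code for it lands here, and grepping the logs for
# it finds every occurrence regardless of which order or SKU was involved.
log_event(logger, "failed to reserve stock", order_id=42, sku="A-17")
output = stream.getvalue()
```

Because the message is never built by string interpolation, the same search string works in both directions.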


SRE Weekly Issue #202

A message from our sponsor, VictorOps:

If moving from a SysAdmin role into a DevOps-centric role is part of your 2020 resolution then you can’t miss this walkthrough for evolving your skillset:


When writing about an incident, it’s important to skillfully show the reader how the participants’ understanding of the situation evolved.

Lorin Hochstein

This is a summary of Bainbridge’s seminal paper, and I really love where Adrian Colyer goes with it.

One example I found myself thinking about while reading through the paper does have a human precedence though: self-driving cars.

Adrian Colyer — The Morning Paper (summary)

Bainbridge — Automatica (original paper)

I have to admit, it is brilliant. Why add the risk (and latency) of a centralized configuration repository service when a local DB on each host will do?

Rick Branson — Segment

This one covers a lot. My favorite parts:

  • Permissive failure — if Netflix’s subscriber information service is down they just show videos for free, favoring reliability over correctness.
  • Human attention span — if it takes 10 minutes to see if your changes broke production, you’re likely to wander off and work on something else.

Adrian Cockcroft
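
The permissive failure idea above can be sketched as a fail-open check. This is an illustration only, not Netflix's code: `SubscriberServiceDown`, `can_play`, and the two lookup functions are hypothetical names.

```python
class SubscriberServiceDown(Exception):
    """Raised when the (hypothetical) subscriber service cannot be reached."""

def can_play(user_id, lookup):
    """Return True if playback should be allowed.

    `lookup` is any callable that returns True or False for a user id.
    If it fails because the service is unavailable, fail open.
    """
    try:
        return lookup(user_id)
    except SubscriberServiceDown:
        return True  # permissive failure: reliability over correctness

def healthy_lookup(user_id):
    # Stand-in for a working subscriber service.
    return user_id in {"alice"}

def broken_lookup(user_id):
    # Stand-in for an outage of the subscriber service.
    raise SubscriberServiceDown("subscriber service unavailable")
```

Only the outage exception is caught, so an ordinary "not subscribed" answer is still respected while the service is healthy.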

The author guides you through the moment they began to truly understand what observability is all about. Worth reading even if you’re already quite familiar with the concept.

Sanjeev Sharma

This article describes our work with NS1 to optimize our intelligent DNS-based global load balancing for corner cases that we uncovered while improving our point of presence (PoP) selection automation for our edge network.

Grab uses bulkheading to prevent localized demand spikes from affecting the service for customers elsewhere. The notable part is that they shed load they can’t satisfy anyway, due to a limited supply of available vehicles.

Corey Scott — Grab
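
A generic bulkhead with load shedding might look like the sketch below. It is not Grab's implementation: the `Bulkhead` class, the per-region limit, and the region names are assumptions made for illustration.

```python
import threading

class Bulkhead:
    """Caps in-flight requests per region so one region cannot starve others."""

    def __init__(self, limit_per_region):
        self._limit = limit_per_region
        self._in_flight = {}
        self._lock = threading.Lock()

    def try_acquire(self, region):
        """Reserve a slot for the region; return False to shed the request."""
        with self._lock:
            current = self._in_flight.get(region, 0)
            if current >= self._limit:
                return False
            self._in_flight[region] = current + 1
            return True

    def release(self, region):
        """Give a slot back once the request has finished."""
        with self._lock:
            current = self._in_flight.get(region, 0)
            if current > 0:
                self._in_flight[region] = current - 1

bulkhead = Bulkhead(limit_per_region=2)
# A spike in one (hypothetical) region fills only that region's slots.
downtown = [bulkhead.try_acquire("downtown") for _ in range(3)]
airport = bulkhead.try_acquire("airport")
```

Here the third "downtown" request is shed while "airport" is unaffected, which is the isolation property the blurb describes.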


SRE Weekly Issue #201

A message from our sponsor, VictorOps:

If moving from a SysAdmin role into a DevOps-centric role is part of your 2020 resolution then you can’t miss this walkthrough for evolving your skillset:


Looking at an incident from multiple perspectives is incredibly important for learning from it effectively. The same is true of asking what went right.

Subbu Allamaraju

Failure to anticipate and design for the new challenges that are certain to arise following periods of technology change leads to automation surprises when advocates are surprised by negative unintended consequences that offset apparent benefits

Thanks to Greg Burek for this one.

David Woods — Ohio State University

Start the year off with this refreshingly deep dive into how variable-argument functions in C work.

Jan Schaumann

Think you know how to write files safely, say with fsync() or something? Think again.

In conclusion, computers don’t work

Dan Luu
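
For reference, here is the pattern many of us reach for when we think we are writing a file safely: write to a temporary file, fsync it, rename it over the target, then fsync the directory. This is a sketch assuming a POSIX system, and `write_file_carefully` is a hypothetical name. As the blurb warns, it should not be read as a guarantee.

```python
import os
import tempfile

def write_file_carefully(path, data):
    """Write bytes to path via temp file, fsync, rename, and directory fsync."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the same directory so the rename stays on one
    # filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # ask the OS to push file contents to disk
        os.replace(tmp_path, path)  # swap the new file in over the old one
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # fsync the directory so the rename itself is persisted (POSIX only).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Whether each of these steps behaves as hoped depends on the filesystem and on how errors from fsync are handled, which is the kind of assumption the article questions.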


SRE Weekly Issue #200

A message from our sponsor, VictorOps:

Learn how to modernize your approach to incident management and slash MTTA/MTTR in the latest webinar from VictorOps + Splunk:


The logical argument goes like this: if incidents in your system each had a single root cause, that implies a level of brittleness that would preclude your company running successfully at all.

Lorin Hochstein

Once a system reaches a certain level of reliability, most major incidents will involve:

  • A manual intervention that was intended to mitigate a minor incident, or
  • Unexpected behavior of a subsystem whose primary purpose was to improve reliability

Lorin Hochstein

Confirmation bias can lead us to reinforce an incorrect mental model through spurious correlations.

Thai Wood — Resilience Roundup (summary)
Dennis Bernard, David Greathead, and Gordon Baxter — International Journal of Human Computer Studies (original paper)

In this post, I’ll recap his talk, sharing the journey that led them to build GoAlert, the problems they’ve solved, and how they use GoAlert with Sensu Go to simplify monitoring and reduce alert fatigue.

Anna MacLachlan — Sensu (recap)
Adam Westman — Target (talk)

Verbose debug logging + feature flagging = a way to investigate unknown unknowns in your system.

Will Sargent
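
A minimal sketch of that equation, assuming the flag lives in a plain dictionary (a real system would likely read it from a feature flag service). `DEBUG_FLAGS`, `FlagFilter`, and the logger name are hypothetical and not from the article.

```python
import io
import logging

DEBUG_FLAGS = {"checkout": False}  # hypothetical in-memory feature flags

class FlagFilter(logging.Filter):
    """Let DEBUG records through only while the named flag is on."""

    def __init__(self, flag_name):
        super().__init__()
        self.flag_name = flag_name

    def filter(self, record):
        if record.levelno > logging.DEBUG:
            return True  # INFO and above are never suppressed
        return DEBUG_FLAGS.get(self.flag_name, False)

stream = io.StringIO()
logger = logging.getLogger("checkout_example")
logger.handlers.clear()
logger.setLevel(logging.DEBUG)
logger.propagate = False
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
handler.addFilter(FlagFilter("checkout"))
logger.addHandler(handler)

logger.debug("cart state before pricing")  # dropped: flag is off
DEBUG_FLAGS["checkout"] = True             # flipped at runtime
logger.debug("cart state after pricing")   # emitted: flag is on
logger.info("order placed")                # always emitted
output = stream.getvalue()
```

The debug statements stay in the code permanently and cost little while the flag is off, so they are already in place when something unexpected needs investigating.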


SRE Weekly Issue #199

A message from our sponsor, VictorOps:

Ever find yourself asking, “How do I write Ansible playbooks for new Terraform servers?” Well, this new walkthrough from Splunk + VictorOps has your answer.


Domino model, Swiss Cheese model, stand aside. The Gamma Knife model is a nifty analogy for contributing factors.

Lorin Hochstein

Lots of great tips here for how to make things easier on yourself when you’re paged. Pave the way for your 3 am self to get things fixed and get back to sleep as soon as possible.

Katie McLaughlin (Sysadvent day 21)

Ooh, a new SRE podcast! PagerDuty started things up with 4 episodes right out of the gate.

Introducing “Page It To The Limit,” a new podcast by the Community team here at PagerDuty that discusses what it means to operate software in production.

Wow, I love the idea of this shadowing program. The author discusses incidents they saw and 5 things they learned while shadowing.

Tristan Read — GitLab

