SRE Weekly Issue #225

A message from our sponsor, StackHawk:

Application security is shifting to a model where the engineers who write the code also take ownership of the security. Read our docs to learn more about how StackHawk makes that happen.


This suggests an upcoming shift in our field:

50 percent of SREs believe they will be working remotely post COVID-19, as compared to only 20 percent prior to the pandemic.

Kameerath Kareem — Catchpoint

BONUS CONTENT: An outside take on the survey results is here (Mike Vizard —

No one person can (or should) know everything. How do we allocate expertise and build connections in order to maximize resilience and adaptive capacity?

Will Gallego

A new feature was accidentally rolled out to too wide an audience, causing log message loss.


[…] one slow block device can affect the performance of processes even when those processes don’t use the slow block device.

Kalyanasundaram Somasundaram — LinkedIn

Should you count scheduled maintenance against your error budget? It depends.

Jesus Climent — Google
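
To make the "it depends" concrete, here is a minimal sketch of the arithmetic, not the linked article's method. The 99.9% SLO, the 30-day window, and the downtime figures are hypothetical, as are the function names.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window."""
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, unplanned_min: float, maintenance_min: float,
                     count_maintenance: bool, window_days: int = 30) -> float:
    """Remaining budget in minutes; maintenance is charged only if the policy says so."""
    spent = unplanned_min + (maintenance_min if count_maintenance else 0.0)
    return error_budget_minutes(slo, window_days) - spent


# Hypothetical month: 10 minutes of unplanned outage, 30 minutes of scheduled maintenance.
with_maintenance = budget_remaining(0.999, 10, 30, count_maintenance=True)
without_maintenance = budget_remaining(0.999, 10, 30, count_maintenance=False)
print(f"counting maintenance: {with_maintenance:.1f} min left")
print(f"excluding maintenance: {without_maintenance:.1f} min left")
```

With a budget of roughly 43 minutes, the same month looks nearly exhausted under one policy and comfortable under the other, which is why the choice matters.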

An investigation in response to three incidents led to this stark conclusion about Cassandra’s “counter columns” feature:

In fact, they don’t appear to have any properties that make them a useful primitive for building predictable distributed systems.

Paddy Byers — Ably

This article explains why we should have cost data at our fingertips as we design cloud-based systems.

[…] a well-architected system is often a cost-efficient system.


This is a new concept to me, and I really like it:

Capacity for maneuver (CfM) is a measure of how much adaptability or room to respond to a new challenge that a given part of the system has, whether a person or autonomous agent.

Amir B. Farjadian, Benjamin Thomsen, Anuradha M. Annaswamy, and David D. Woods (original paper)

Thai Wood — Resilience Roundup (summary)


Updated: June 28, 2020 — 8:31 pm
A production of Tinker Tinker Tinker, LLC