Articles
This suggests an upcoming shift in our field:
50 percent of SREs believe they will be working remotely post COVID-19, as compared to only 20 percent prior to the pandemic.
Kameerath Kareem — Catchpoint
BONUS CONTENT: Mike Vizard of DevOps.com offers an outside take on the survey results.
No one person can (or should) know everything. How do we allocate expertise and build connections in order to maximize resilience and adaptive capacity?
Will Gallego
A new feature was accidentally rolled out to too wide an audience, causing log message loss.
Heroku
[…] one slow block device can affect the performance of processes even when those processes don’t use the slow block device.
Kalyanasundaram Somasundaram — LinkedIn
Should you count scheduled maintenance against your error budget? It depends.
Jesus Climent — Google
An investigation in response to three incidents led to this stark conclusion about Cassandra’s “counter columns” feature:
In fact, they don’t appear to have any properties that make them a useful primitive for building predictable distributed systems.
Paddy Byers — Ably
This article explains why we should have cost data at our fingertips as we design cloud-based systems.
[…] a well-architected system is often a cost-efficient system.
CloudZero
This is a new concept to me, and I really like it:
Capacity for maneuver (CfM) is a measure of how much adaptability or room to respond to a new challenge that a given part of the system has, whether a person or autonomous agent.
Amir B. Farjadian, Benjamin Thomsen, Anuradha M. Annaswamy, and David D. Woods (original paper)
Thai Wood — Resilience Roundup (summary)