SRE Weekly Issue #168

Articles

This one’s great for folks that are new to SRE, and it’s also an enlightening read for seasoned SREs. What caught me most was the Definition section, on what it means to be an SRE.

Alice Goldfuss

Chaos Engineering Traps

In this articlization of a conference talk, the author lays out 8 common pitfalls in chaos engineering, with detailed example stories related to them. It goes much deeper than mere chaos engineering into the theory of how to operate complex systems.

Nora Jones

Ghosts in the machines

Automation can have unintended effects, and it often fails to have the effect we hope it will.

Thanks to Greg Burek for this one.

Courtney Nash

What SREs can learn from Aviation industry?

Having recently binge-watched Air Emergency, I felt that SREs can learn many things from the aviation industry.

Anshul Patel

Notes on running production code

Lessons learned by a software engineer on supporting their code in production.

Kashyap Kondamudi

The CASE Method: Better Monitoring For Humans

CASE stands for Context-heavy, Actionable, Symptom-based, and Evaluated. That last one’s really key. The author proposes setting an expiration time for your alerts after which time you should evaluate them to make sure that they still make sense.

Cory Watson
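
To make the "Evaluated" idea from the CASE method concrete, here is a minimal sketch of how an alert review deadline might be tracked. It is not taken from the article: the `AlertRule` class, the `alerts_needing_review` helper, the alert names, and the 90-day default are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class AlertRule:
    """Hypothetical alert definition that carries its own review deadline."""
    name: str
    last_evaluated: date
    review_after_days: int = 90  # assumed default, not from the article

    def review_due_on(self) -> date:
        # The date on which this alert should be re-evaluated.
        return self.last_evaluated + timedelta(days=self.review_after_days)

    def needs_review(self, today: date) -> bool:
        # True once the review deadline has been reached or passed.
        return today >= self.review_due_on()


def alerts_needing_review(rules, today):
    """Return the names of alerts whose evaluation period has expired."""
    return [rule.name for rule in rules if rule.needs_review(today)]


# Illustrative data only; these alert names are made up.
rules = [
    AlertRule("checkout_error_rate_high", date(2019, 1, 2)),
    AlertRule("api_latency_p99_high", date(2019, 4, 1)),
]
print(alerts_needing_review(rules, date(2019, 4, 14)))
```

A check like this could run on a schedule and open a ticket for each expired alert, so that the review becomes routine work rather than something that depends on memory.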

Outages

Heroku: (EU) routing issues for ssl:endpoint applications
- Heroku posted this followup for an outage on April 2.
The Travis CI Blog: Incident review for slow booting Linux builds outage
- The outage happened March 27-28.
Azure VMs — North Central US
- Since deep-linking to Azure incident summaries doesn’t work and this one is especially interesting, I’ll quote it here:
  
  Azure Storage team made a configuration change on 9 April 2019 at 21:30 UTC to our back-end infrastructure in North Central US to improve performance and latency consistency for Azure Disks running inside Azure Virtual Machines. This change was designed to be transparent to customers. It was enabled following our normal deployment process, first to our test environment, and lower impact scale units before being rolled out to the North Central US region. However, this region hit bugs which impacted customer VM availability. Due to a bug, VM hosts were able to establish session with the storage scale unit but hit issues when trying to receive/send data from/to storage scale unit. This situation was designed to be handled with fallback to our existing data path, but an additional bug led to failure in the fallback path and resulted in VM reboots.
Facebook, Instagram, and WhatsApp
