Articles
This one’s great for folks that are new to SRE, and it’s also an enlightening read for seasoned SREs. What caught me most was the Definition section, on what it means to be an SRE.
Alice Goldfuss
In this articlization of a conference talk, the author lays out 8 common pitfalls in chaos engineering, with detailed example stories related to them. It goes much deeper than mere chaos engineering into the theory of how to operate complex systems.
Nora Jones
Automation can have unintended effects — and can tend to not have the effect we hope it will.
Thanks to Greg Burek for this one.
Courtney Nash
Having recently binge-watched Air Emergency, I felt that SREs can learn many things from the aviation industry.
Anshul Patel
Lessons learned by a software engineer on supporting their code in production.
Kashyap Kondamudi
CASE stands for Context-heavy, Actionable, Symptom-based, and Evaluated. That last one’s really key. The author proposes setting an expiration time for your alerts after which time you should evaluate them to make sure that they still make sense.
Cory Watson
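To make the "Evaluated" idea concrete, here is a minimal sketch of how an expiration check might look. It is not taken from the article: the `AlertRule` class, its fields, the `alerts_needing_review` helper, and the 90-day default are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class AlertRule:
    """Hypothetical alert definition carrying its own review deadline."""
    name: str
    last_evaluated: date
    # Assumed default review interval; pick whatever fits your team.
    review_after: timedelta = timedelta(days=90)

    def is_due_for_review(self, today: date) -> bool:
        # An alert "expires" once the review interval has elapsed
        # since the last time someone evaluated it.
        return today >= self.last_evaluated + self.review_after


def alerts_needing_review(rules, today):
    """Return the names of alerts whose review deadline has passed."""
    return [rule.name for rule in rules if rule.is_due_for_review(today)]


rules = [
    AlertRule("disk_full", last_evaluated=date(2019, 1, 1)),
    AlertRule("high_latency", last_evaluated=date(2019, 4, 1)),
]
due = alerts_needing_review(rules, today=date(2019, 4, 15))
print(due)
```

A check like this could run on a schedule and open a ticket for each alert that is due, so that the evaluation step does not depend on someone remembering it.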
Outages
- Heroku: (EU) routing issues for ssl:endpoint applications
- Heroku posted this followup for an outage on April 2.
- The Travis CI Blog: Incident review for slow booting Linux builds outage
- The outage happened March 27-28.
- Azure VMs — North Central US
- Since deep-linking to Azure incident summaries doesn’t work and this one is especially interesting, I’ll quote it here:
Azure Storage team made a configuration change on 9 April 2019 at 21:30 UTC to our back-end infrastructure in North Central US to improve performance and latency consistency for Azure Disks running inside Azure Virtual Machines. This change was designed to be transparent to customers. It was enabled following our normal deployment process, first to our test environment, and lower impact scale units before being rolled out to the North Central US region. However, this region hit bugs which impacted customer VM availability. Due to a bug, VM hosts were able to establish session with the storage scale unit but hit issues when trying to receive/send data from/to storage scale unit. This situation was designed to be handled with fallback to our existing data path, but an additional bug led to failure in the fallback path and resulted in VM reboots.
- Facebook, Instagram, and WhatsApp