A huge thanks to my awesome former coworker Greg Burek whose helpful link contributions make up fully half of this issue. Thanks, Greg!
Articles
This paper discusses the ways in which automation of industrial processes may expand rather than eliminate problems with the human operator.
My favorite bit of irony: presenting data to the user in the most readily understood form makes them less likely to remember it, so perhaps the most easily grasped display is not actually the best!
Lisanne Bainbridge
Like malice and incompetence, laziness should be far off our radar when we investigate an incident. I hope that reading this article opens minds about the true scope of blamelessness.
Devon Price
Whether or not you agree with this particular attempt at defining what a Systems Engineer (or SRE or anything related) is, it’s worth thinking about and discussing. Our field is evolving quickly, and titles are a moving target.
Matt Ouille
Driven by a desire to update their 737 without causing airlines to have to retrain pilots, Boeing seemingly kept pilots in the dark about what may have been an important little detail of how the new 737 Max operates, with a tragic result.
James Glanz, Julie Creswell, Thomas Kaplan and Zach Wichter — New York Times
An experienced SRE develops an instinctive skepticism of new technologies, often without realizing it. This article provides an excellent list of questions to help articulate that skepticism when evaluating a potential design.
Kellan Elliott-McCrea
Auto-scaling isn’t all roses. Like any tool, you have to understand how it works in order to avoid the pitfalls. Read this article to learn what these folks learned the hard way.
Tyson Mote — Segment
Transitioning to a blameless culture can be difficult, especially as folks might blame each other for forgetting to be blameless!
Rachael Byrne — PagerDuty
Many of the old arguments for not instrumenting code (mostly about performance) no longer apply, and a host of new arguments push toward structured events.
Charity Majors
Outages
- QuadrigaCX
- Bloomberg’s title for the above-linked article says it all:
Crypto CEO Dies Holding Only Passwords That Can Unlock Millions in Customer Coins
QuadrigaCX ceased trading and posted a note on their front page.
- Gmail
- Mailchimp Mandrill
- A PostgreSQL transaction ID wraparound in a central database caused this prolonged outage on Super Bowl Sunday.
- Wells Fargo (bank)
- Crunchyroll
- Hosted Graphite