This is the story of a fascinating incident in which a commercial airplane’s engine was ripped off during takeoff (also covered on Mentour Pilot). What really struck me is how a huge team on the ground and in the air assembled around the incident, each playing an important role in getting the plane down safely.
Mark D. Young — PoliticsWeb
Time for another Catchpoint SRE Survey! They donate $5 to the Red Cross for every completed survey, so let’s all work together and drive a huge donation!
The US Federal Trade Commission (FTC) put out a request for information about cloud providers, including reliability among other topics. Here’s Corey Quinn’s answer.
Corey Quinn — The Duckbill Group
What can you do when running an incident feels like herding cats? This article has some tips.
Robert Ross — FireHydrant
I have a confession. Despite having been hired multiple times in part due to my experience with monitoring platforms, I have come to hate monitoring.
This jaded tale also contains some good suggestions for dealing with monitoring pitfalls.
The cardinal rule of engineering: your solution shouldn’t become your next problem.
Kumar Amit — Mercari
Here’s the articlization of a talk Fred Hebert gave at QCon New York. The alternate title of the talk is:
This Is All Going To Hell Anyway
All We Can Do Is Influence How Long It’s Gonna Take
I had the pleasure of seeing a draft version of this talk at work, since (full disclosure) Fred is my coworker.
This article makes the case that elastic scaling is both harder to implement and more important for use cases involving streaming updates to users in real time.
Mittul Madaan — Ably
An intro to pdsh, my favorite of the tools that run commands on many hosts via SSH.
Amin Astaneh — Certo Modo