SRE Weekly Issue #378

Articles

This is the story of a fascinating incident in which a commercial airplane’s engine was ripped off during takeoff (also covered on Mentour Pilot). What really struck me is the way a huge team on the ground and in the air assembled around the incident, each playing an important role in getting the plane down safely.

Mark D. Young — PoliticsWeb

Catchpoint’s 2024 SRE Survey Is Here – We Need YOU!

Time for another Catchpoint SRE Survey! They donate $5 to the Red Cross for every completed survey, so let’s all work together and drive a huge donation!

Catchpoint

FTC Request, Answered: How Cloud Providers Do Business

The US Federal Trade Commission (FTC) put out a request for information about cloud providers, covering reliability among other topics. Here’s Corey Quinn’s answer.

Corey Quinn — The Duckbill Group

The “people problem” of incident management

What can you do when running an incident feels like herding cats? This article has some tips.

Robert Ross — FireHydrant

Monitoring is a Pain

I have a confession. Despite having been hired multiple times in part due to my experience with monitoring platforms, I have come to hate monitoring.

This jaded tale also contains some good suggestions for dealing with monitoring pitfalls.

Mathew Duggan

Resilient Retry and Recovery Mechanism: Enhancing Fault Tolerance and System Reliability

The cardinal rule of engineering:

your solution shouldn’t become your next problem.

Kumar Amit — Mercari

Embrace Complexity; Tighten Your Feedback Loops

Here’s the articlization of a talk Fred Hebert gave at QCon New York. The alternate title of the talk is:

This Is All Going To Hell Anyway
All We Can Do Is Influence How Long It’s Gonna Take

I had the pleasure of seeing a draft version of this talk at work, since (full disclosure) Fred is my coworker.

Fred Hebert

Why elasticity is essential for delivering realtime updates at scale

This article makes the case that elastic scaling is both harder to implement and more important for use cases involving streaming updates to users in real time.

Mittul Madaan — Ably

Parallel Distributed Shell

An intro to pdsh, my favorite of the tools that run commands on many hosts via SSH.

Amin Astaneh — Certo Modo

