SRE Weekly Issue #396

A message from our sponsor, FireHydrant:

DevOps keeps evolving, but alerting tools are stuck in the past. Any modern alerting tool should be built on these four principles: cost-efficiency, service catalog empowerment, easier scheduling and substitutions, and clear distinctions between incidents and alerts.

Using three high-profile incidents from the past year, this article explores how to define SLOs that might catch similar problems, with a special focus on keeping the SLI close to the user experience.

  Adriana Villela and Ana Margarita Medina — The New Stack

Microservices can have some great benefits, but if you want to build with them, you’re going to have to solve a whole pile of new problems.

  Roberto Vitillo

To protect your application against failures, you first need to know what can go wrong. […] the most common failures you will encounter are caused by single points of failure, the network being unreliable, slow processes, and unexpected load.

  Roberto Vitillo

I love how this article keeps things interesting by starting with a fictional (but realistic) story about the dangers of over-alerting before continuing on to give direct advice.


I especially enjoy the section on the potential pitfalls and challenges with retries and how you can avoid them.


This Reddit thread is a goldmine, including this gem:

I actively avoid getting involved with software subject matter expertise, because it robs the engineering team of self-reliance, which is itself a reliability issue.

  u/bv8z and others — reddit

There’s a pretty cool “Five Whys”-style analysis that goes past “dev pushed unreviewed code with incomplete tests to production” and into the sociotechnical challenges underlying that.

  Tobias Bieniek —

Updated: October 29, 2023 — 8:54 pm
A production of Tinker Tinker Tinker, LLC