SRE Weekly Issue #418

A message from our sponsor, FireHydrant:

FireHydrant is now AI-powered for faster, smarter incidents! Power up your incidents with auto-generated real-time summaries, retrospectives, and status page updates.

The observability waters have been muddy for a while, and this article does a great job of taking a step back and building a definition and a roadmap.

  Hazel Weakly

Fred Hebert wrote this response/follow-on to Hazel’s article:

The main points I’ll try to bring here are on the topics of the difference between insights and questions, the difference between observability and data availability, reinforcing a socio-technical definition, the mess of complex systems and mapping them, and finally, a hot take on the use of models when reasoning about systems.

  Fred Hebert

What the service providers are willing to put on the table in terms of penalties is often much less than the money you lose when your service goes down.

  Alex Ewerlöf

Fascinating legal questions come to the surface when lawyers consider the possibility for legal risk exposure from a surgical incident debriefing meeting.

  Dr. Rob Poston

If you approach on-call the right way, you can mitigate the impacts of alert fatigue or, better yet, avoid it altogether. Here, we’ll dive into the tactics teams can implement to address alert fatigue and its underlying causes.

How do you create an SLO that references multiple SLIs together, such as slow requests and errors?

  Ross Brodbeck
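
One common way to answer that question is to fold both conditions into a single "good event" definition, so that a request counts toward the SLO only if it is neither an error nor slow. The sketch below illustrates that idea and is not necessarily the approach the linked article takes; the `Request` type, the 300 ms threshold, and the 99% target are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code (hypothetical field)
    latency_ms: float  # server-side latency in milliseconds (hypothetical field)


def is_good(req: Request, latency_threshold_ms: float = 300.0) -> bool:
    """A request is good only if it passes BOTH SLIs: no server error and fast enough."""
    return req.status < 500 and req.latency_ms <= latency_threshold_ms


def composite_sli(requests: list[Request], latency_threshold_ms: float = 300.0) -> float:
    """Fraction of requests that are good under the combined definition."""
    if not requests:
        return 1.0  # no traffic: treat as meeting the objective (an assumption)
    good = sum(1 for r in requests if is_good(r, latency_threshold_ms))
    return good / len(requests)


# Tiny illustrative sample, not real data.
requests = [
    Request(200, 120.0),
    Request(200, 450.0),  # slow, so it counts as bad
    Request(503, 80.0),   # error, so it counts as bad
    Request(200, 90.0),
]

sli = composite_sli(requests)
slo_target = 0.99  # assumed objective
slo_met = sli >= slo_target
```

Because a request that is both slow and failing is counted as bad only once, this avoids the double counting that can happen when separate latency and error SLOs are simply added together.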

More than just a list of talks, this piece pulls out major themes from SRECon24.

  Will Gallego

Making your 9’s look great by cheating.

Of course, you don’t actually want to do that, but learning how can show us that availability numbers are nuanced.

  Ross Brodbeck
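
As a rough illustration of why availability numbers are nuanced, the same outage can produce very different figures depending on whether you count requests or time windows. The sketch below is not taken from the linked article; the per-minute traffic figures and the rule that a minute is "up" when at least half its requests succeed are assumptions chosen only to make the gap visible.

```python
# Each tuple is (total_requests, failed_requests) for one minute.
# Hypothetical data: the outage lands in a minute with almost no traffic.
minutes = [
    (1000, 0),
    (10, 10),   # full outage, but barely any requests
    (1000, 0),
]

# Request-based availability: successful requests over all requests.
total = sum(t for t, _ in minutes)
failed = sum(f for _, f in minutes)
request_based = (total - failed) / total

# Time-based availability: fraction of minutes considered "up".
# Assumed rule: a minute is up if at least half of its requests succeeded.
up_minutes = sum(1 for t, f in minutes if t == 0 or (t - f) / t >= 0.5)
time_based = up_minutes / len(minutes)

print(f"request-based: {request_based:.4f}")
print(f"time-based:    {time_based:.4f}")
```

With these made-up numbers, the request-based figure stays above 99% while the time-based figure drops to about 67%, even though both describe the same three minutes.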

Updated: March 31, 2024 — 11:48 pm
A production of Tinker Tinker Tinker, LLC