SRE Weekly Issue #425

A message from our sponsor, FireHydrant:

FireHydrant is now AI-powered for faster, smarter incidents! Power up your incidents with auto-generated real-time summaries, retrospectives, and status page updates.

https://firehydrant.com/blog/ai-for-incident-management-is-here/

Great practical advice on how to present reliability problems (and your proposed solutions) to e-staff.

  Ross Brodbeck

It’s when things aren’t always on fire that it can be very difficult to assess whether we need to allocate additional resources to reduce risk.

  Lorin Hochstein

The three kinds of roles covered in this article relate to Standards, Operations, and Leadership.

  Gavin Cahill — Gremlin

Nagle’s algorithm considered harmful? It’s important to be aware of it because it is enabled by default on most TCP sockets and can add unexpected latency to small writes.

  Marc Brooker

In issue #423, I linked to a story about Amazon charging for unauthenticated and failed requests to S3 buckets. Thankfully, they’re no longer charging for that.

  Amazon

A little light on details, but interesting nonetheless: Google Cloud did something weird and accidentally deleted a customer’s account out from under them.

  UniSuper

What is a “service” in the context of service levels (SLI/SLO)?

  Alex Ewerlöf

My favorite part of this one is the description of techniques for improving psychological safety at your company.

  Incident.io

Updated: May 19, 2024 — 9:13 pm
A production of Tinker Tinker Tinker, LLC