SRE Weekly Issue #426

Got any burning questions to ask an experienced SRE? I’m gathering your questions in this Google Form, and I’d love to hear from you. I’m hoping to use your questions to help inspire authors looking to write more great SRE-related content.

A message from our sponsor, FireHydrant:

FireHydrant is now AI-powered for faster, smarter incidents! Power up your incidents with auto-generated real-time summaries, retrospectives, and status page updates.

https://firehydrant.com/blog/ai-for-incident-management-is-here/

If your overall request volume is low, single errors can have a big impact on your metrics — a phenomenon I’ve experienced at work recently.

  Ross Brodbeck
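
To make the arithmetic concrete, here is a minimal sketch (not taken from the linked article) of how one failed request moves an error-rate metric at two traffic levels. The request counts and the 99.9% target are made-up numbers chosen only for illustration, and `error_rate` is a hypothetical helper.

```python
def error_rate(errors: int, requests: int) -> float:
    """Return the fraction of requests that failed (0.0 when there was no traffic)."""
    if requests == 0:
        return 0.0
    return errors / requests


# Hypothetical target: 99.9% of requests succeed, so at most 0.1% may fail.
ALLOWED_ERROR_RATE = 0.001

# The same single error at two assumed traffic levels.
low_volume = error_rate(errors=1, requests=50)
high_volume = error_rate(errors=1, requests=100_000)

print(f"low volume:  {low_volume:.3%} failed, over target: {low_volume > ALLOWED_ERROR_RATE}")
print(f"high volume: {high_volume:.3%} failed, over target: {high_volume > ALLOWED_ERROR_RATE}")
```

With only 50 requests in the window, one error is a 2% failure rate and blows through the assumed target; with 100,000 requests, the same error barely registers.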

This article outlines five facets of microservice architectures that can have implications for reliability.

  Andre Newman — Gremlin

If this title sounds familiar, that’s because I’ve linked to an article about the Children of the Magenta concept before. In this accident report, the pilots became confused about their location and course, and ultimately, their trust in the Flight Management System contributed to the disaster.

  Kyra Dempsey (Admiral Cloudberg)

A Center of Production Excellence can be a powerful means for an organization to initiate transformations which foster resilience as it matures and its environment changes.

  Nick Travaglini — Honeycomb

  Full disclosure: Honeycomb is my employer.

Last week, I shared a story about an outage at UniSuper that was caused by Google Cloud. This week, Google shared more details about what went wrong, and it’s well worth a read.

  Google

This incident is intriguing because exponential backoff made the problem harder to detect.

  Heroku
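
For readers who want a refresher on the mechanism, here is a generic sketch of an exponential backoff schedule. It is not drawn from the Heroku write-up; the function name `backoff_delays` and all parameter values are assumptions for illustration.

```python
def backoff_delays(base: float, factor: float, retries: int, cap: float) -> list[float]:
    """Return the wait before each retry: base * factor**attempt, limited to cap."""
    return [min(cap, base * factor ** attempt) for attempt in range(retries)]


# Assumed settings: start at 1 second, double each time, six retries, 60 second cap.
delays = backoff_delays(base=1.0, factor=2.0, retries=6, cap=60.0)
total_wait = sum(delays)

print(delays)
print(f"total time spent waiting: {total_wait} seconds")
```

Because each wait is longer than the last, a failing client sends its attempts further and further apart, so over a minute of failure it produces only a handful of error signals rather than a steady stream.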

A discussion of what might get in the way of an organization implementing SLIs, SLOs, and SLAs.

Note that the second half of the article (overcoming those obstacles) is behind a paywall. I don’t often recommend pay-only content, but it’s worth considering a subscription, because Alex is an excellent author whose work I’ve featured here many times.

  Alex Ewerlöf

If we look at a distribution of incidents by contributor (or cause, or component), we’re unlikely to see any one of these stand out as being the source of a large number of incidents.

  Lorin Hochstein

Updated: May 26, 2024 — 10:18 pm
A production of Tinker Tinker Tinker, LLC