Got any burning questions to ask an experienced SRE? I’m gathering them in this Google Form, and I’d love to hear from you. I’m hoping to use your questions to inspire authors looking to write more great SRE-related content.
If your overall request volume is low, single errors can have a big impact on your metrics — a phenomenon I’ve experienced at work recently.
Ross Brodbeck
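To make the arithmetic concrete, here is a minimal sketch of how one failed request affects a simple success-ratio metric at two different traffic levels. The request counts and the 99.9% target are hypothetical numbers chosen for illustration, not figures from the article.

```python
def availability(total_requests: int, failed_requests: int) -> float:
    """Return the fraction of requests that succeeded (0.0 to 1.0)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return (total_requests - failed_requests) / total_requests


# Hypothetical target and volumes, for illustration only.
SLO_TARGET = 0.999

low_volume = availability(total_requests=200, failed_requests=1)
high_volume = availability(total_requests=1_000_000, failed_requests=1)

# The same single error breaches the target only in the low-volume case.
print(f"low volume:  {low_volume:.4%} (meets target: {low_volume >= SLO_TARGET})")
print(f"high volume: {high_volume:.4%} (meets target: {high_volume >= SLO_TARGET})")
```

With only 200 requests in the window, a single failure is enough to drop below the assumed target, while the same failure is invisible at a million requests.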
This article outlines five facets of microservice architectures that can have implications for reliability.
Andre Newman — Gremlin
If this title sounds familiar, that’s because I’ve linked to an article about the Children of the Magenta concept before. In this accident report, the pilots became confused about their location and course, and ultimately, their trust in the Flight Management System contributed to the disaster.
Kyra Dempsey (Admiral Cloudberg)
A Center of Production Excellence can be a powerful means for an organization to initiate transformations which foster resilience as it matures and its environment changes.
Nick Travaglini — Honeycomb
Full disclosure: Honeycomb is my employer.
Last week, I shared a story about an outage at UniSuper that was caused by Google Cloud. This week, Google shared more details about what went wrong, and it’s well worth a read.
This incident is intriguing because exponential backoff made the problem harder to detect.
Heroku
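For readers less familiar with the pattern, below is a generic sketch of exponential backoff, showing how the wait between retries grows until it hits a cap. It does not reproduce anything from the Heroku write-up: the function name and the base, factor, and cap values are all assumptions made for illustration.

```python
from typing import List


def backoff_delays(base: float, factor: float, cap: float, attempts: int) -> List[float]:
    """Return the wait (in seconds) before each retry, capped at `cap`."""
    return [min(cap, base * factor ** n) for n in range(attempts)]


# Hypothetical settings: start at 1s, double each time, never wait more than 60s.
delays = backoff_delays(base=1.0, factor=2.0, cap=60.0, attempts=8)

for attempt, delay in enumerate(delays, start=1):
    print(f"retry {attempt}: wait {delay:.0f}s")

print(f"total time spent waiting: {sum(delays):.0f}s")
```

Because each retry waits longer than the last, a client that keeps failing produces failed attempts less and less often over time, so those failures show up more sparsely in logs and metrics.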
A discussion of what might get in the way of an organization implementing SLIs, SLOs, and SLAs.
Note that the second half of the article (overcoming those obstacles) is behind a paywall. I don’t often recommend pay-only content, but it’s worth considering a subscription, because Alex is an excellent author whose work I’ve featured here many times.
Alex Ewerlöf
If we look at a distribution of incidents by contributor (or cause, or component), we’re unlikely to see any one of these stand out as being the source of a large number of incidents.
Lorin Hochstein