SRE Weekly Issue #426

Got any burning questions to ask an experienced SRE? I’m gathering your questions in this Google Form, and I’d love to hear from you. I’m hoping to use your questions to inspire authors looking to write more great SRE-related content.

The Rule of 5 Errors

If your overall request volume is low, single errors can have a big impact on your metrics — a phenomenon I’ve experienced at work recently.

Ross Brodbeck
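
To make the low-volume effect concrete, here is a minimal sketch of the arithmetic. It is not taken from the linked article; the function name `error_rate` and the request volumes are assumptions chosen only for illustration.

```python
def error_rate(errors: int, requests: int) -> float:
    """Return the fraction of requests that failed."""
    if requests <= 0:
        raise ValueError("requests must be positive")
    return errors / requests


# Hypothetical volumes: the same single error at three different scales.
for requests in (100, 10_000, 1_000_000):
    rate = error_rate(1, requests)
    print(f"{requests:>9} requests, 1 error -> {rate:.4%} error rate")
```

At the smallest volume, one failed request is a full percentage point of the error rate, while at the largest it is barely visible, which is why a ratio-based metric can swing sharply on a quiet service.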

How reliability differs between monolithic and microservice-based architectures

This article outlines five facets of microservice architectures that can have implications for reliability.

Andre Newman — Gremlin

Children of the Magenta: The crash of American Airlines flight 965

If this title sounds familiar, it’s because I’ve linked to an article about the Children of the Magenta concept before. In this accident report, the pilots became confused about their location and course, and ultimately, their trust in the Flight Management System contributed to the disaster.

Kyra Dempsey (Admiral Cloudberg)

Establishing and Enabling a Center of Production Excellence

A Center of Production Excellence can be a powerful means for an organization to initiate transformations which foster resilience as it matures and its environment changes.

Nick Travaglini — Honeycomb

Full disclosure: Honeycomb is my employer.

Details of Google Cloud GCVE incident

Last week, I shared a story about an outage at UniSuper that was caused by Google Cloud. This week, Google shared more details about what went wrong, and it’s well worth a read.

Google

Heroku Incident #2664 Followup

This incident is intriguing because exponential backoff made the problem harder to detect.

Heroku

Service level pitfalls

A discussion of what might get in the way of an organization implementing SLIs, SLOs, and SLAs.

Note that the second half of the article (overcoming those obstacles) is behind a paywall. I don’t often recommend pay-only content, but it’s worth considering a subscription, because Alex is an excellent author whose work I’ve featured here many times.

Alex Ewerlöf

The error term isn’t Pareto distributed

If we look at a distribution of incidents by contributor (or cause, or component), we’re unlikely to see any one of these stand out as being the source of a large number of incidents.

Lorin Hochstein
