SRE Weekly Issue #372

Articles

At Pulumi we read every single error message that our API produces. This is the primary mechanism that led to a 17x YoY reduction in our error rate.

Evan Boyle — Pulumi

Uptime Guarantees — A Pragmatic Perspective

Rather than striving for a million nines, we should choose the right reliability target based on an evaluation of the effect of downtime on the business.

Itzy Sabo — HEY

Reckoning with the Harm We Do: In Search of Restorative Just Culture in Software and Web Operations

This is a presentation of a study of harm and trauma resulting from incident response work. I especially like the part about blamelessness in theory versus practice.

Jessica DeVita — InfoQ

Learning from incidents is not the goal

Perhaps a sensationalist title, but there’s a really good point here: learning from incidents is only practical if it actually improves the business.

Chris Evans — incident.io

Real-Time Presence Platform System Design

A highly detailed proposal for a system that tracks which users are online at huge scale.

Nk — System Design

Upscaling LinkedIn’s Profile Datastore While Reducing Costs

However, for any cache to be used for the purpose of upscaling, it must operate completely independent from the source of truth (SOT) and must not be allowed to fall back to the SOT on failures.

Estella Pham and Guanlin Lu – LinkedIn

The Madness in our Methods: The crash of Germanwings flight 9525 and our broken aeromedical system

If you design your system to make lying the only viable option, then people will lie. To me, this article is all about understanding that our systems involve real, squishy humans, and designing appropriately.

Admiral Cloudberg
