SRE Weekly Issue #456

MTTR: When sample means and power laws combine, trouble follows

Here’s another way to use math to show that tracking MTTR over time will lead you to draw incorrect conclusions about your incident trends.

Lorin Hochstein

How Dropbox Saved Millions of Dollars by Building a Load Balancer

Why build your own? Dropbox had a heterogeneous fleet with differently sized backends, and no load balancer available at the time could handle that.

Richard Oliver Bray

The Incident Maturity Model

There’s so much here, I need to read it again a few times — and you should too. Their model has three stages of increasing maturity, allowing you to adopt it at the right pace for your org.

Stephen Whitworth — incident.io

Break Stuff on Purpose

After accidentally losing all of their Kibana dashboards, the folks at Slack implemented chaos engineering to detect similar problems early.

Sean Madden — Slack

LLMs won’t save us

This article raises concerns about using LLMs in production operations that I haven’t seen expressed quite in this way before.

Niall Murphy

New Production Readiness Check experience in Mercari

Five years ago, Mercari adopted a checklist for production readiness, and they’ve seen reliability improve as a result. Now they’re sharing how adoption has gone, the impact it’s had on development teams, and what they’re doing about it.

mshibuya — Mercari

Google Cloud incident report: BigQuery incident on December 4, 2024

They deleted an internal project that held API keys that were still in use.

Google

Uptime, status pages, and transparency calculus

A status page can be about so much more than just informing customers of downtime. It’s a marketing artifact, evidence for SLA breach, a sales pitch, and more.

Lawrence Jones
