SRE Weekly Issue #456

A message from our sponsor, FireHydrant:

On-call during the holidays? Spend more time taking in some R&R and less getting paged. Let alerts make their rounds fairly with our new Round Robin feature for Escalation Policies.

https://firehydrant.com/blog/introducing-round-robin-for-signals-escalation-policies/

Here’s another way to use math to show that tracking MTTR over time is going to help you draw incorrect conclusions about your incident trends.

  Lorin Hochstein

Why build your own? Dropbox had a heterogeneous fleet with differently sized backends, and no load balancer available at the time could handle that.

  Richard Oliver Bray

There’s so much here, I need to read it again a few times — and you should too. Their model has three stages of increasing maturity, allowing you to adopt it at the right pace for your org.

  Stephen Whitworth — incident.io

After accidentally losing all of their Kibana dashboards, the folks at Slack implemented chaos engineering to detect similar problems early.

  Sean Madden — Slack

This article raises concerns about using LLMs in production operations that I haven’t seen expressed quite in this way before.

  Niall Murphy

Five years ago, Mercari adopted a checklist for production readiness, and they’ve seen reliability improve as a result. Now they’re sharing how adoption has gone, the impact it’s had on development teams, and what they’re doing about it.

  mshibuya — Mercari

They deleted an internal project that held API keys that were still in use.

  Google

A status page can be about so much more than just informing customers of downtime. It’s a marketing artifact, evidence of an SLA breach, a sales pitch, and more.

  Lawrence Jones

Updated: December 22, 2024 — 9:00 pm
A production of Tinker Tinker Tinker, LLC