SRE Weekly Issue #451

A message from our sponsor, FireHydrant:

Practice Makes Prepared: Why Every Minor System Hiccup Is Your Team’s Secret Training Ground.

https://firehydrant.com/blog/the-hidden-value-of-lower-severity-incidents/

Most fascinating air incident report I’ve seen in a while! The pilots deviated from the non-normal checklist, and it immediately made me think of runbooks. On the one hand, you want the runbook to be simple and easy to handle in an incident. On the other hand, it can be very useful to tell the operator why they should do something.

  Mentour Pilot

With their claimed 14.5% of all websites depending on Cloudflare’s DNS, they had to be super careful with this migration. Lots of good stuff in here, including:

  • replacing direct DB access by multiple services with an API
  • keeping the old and new DB in sync
  • ensuring both forward and reverse migration were possible in case of rollback

  Alex Fattouche and Corey Horton — Cloudflare
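
To make the pattern in the bullets above concrete, here is a minimal sketch of a single API sitting in front of an old and a new store, writing to both so that cutover and rollback stay possible. It is not Cloudflare's implementation: the class name `DualWriteStore`, its methods, and the use of plain dicts as stand-ins for the two databases are all assumptions made for illustration.

```python
class DualWriteStore:
    """Hypothetical API layer in front of an old and a new store.

    Both stores are plain dicts here; in a real migration they would be
    database clients, and the two writes would need failure handling.
    """

    def __init__(self, old, new, read_from_new=False):
        self.old = old
        self.new = new
        # Flipping this flag is the cutover; flipping it back is the rollback.
        self.read_from_new = read_from_new

    def put(self, key, value):
        # Write to both stores so either one can serve reads at any time.
        self.old[key] = value
        self.new[key] = value

    def get(self, key):
        primary = self.new if self.read_from_new else self.old
        return primary[key]

    def diverged_keys(self):
        # Keys whose values differ between the stores. A key missing from
        # one side is treated the same as a key stored with value None.
        keys = set(self.old) | set(self.new)
        return sorted(k for k in keys if self.old.get(k) != self.new.get(k))
```

Because callers only ever see `put` and `get`, the read path can be switched in either direction without touching them, and a periodic `diverged_keys()` check gives a rough signal on whether the two stores are still in sync.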

I didn’t get to experience the value of a good tracing tool until recently in my career, and I didn’t understand the hype. If you’re in the same boat, this article may help you understand the value of tracing.

  Sam Starling — incident.io

About a year ago, Honeycomb got rid of incident severity levels in favor of incident types, which are purposefully not sortable. Here’s how their experiment has gone so far.

  Fred Hebert — Honeycomb

  Full disclosure: Honeycomb is my employer.

Is Service Level Indicator (SLI) the same as Key Performance Indicator (KPI)?

There’s a really cool framing in there: KPIs are moonshots, so we aim high and rarely hit all of them, while with SLOs, we under-promise and over-deliver.

  Alex Ewerlöf

A fun dive into some Unix/Linux internals with nine different methods to run a program with timeouts and retries. If you have a soft spot in your heart for signals and system calls, this one’s for you.

  Philippe Gaultier
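
For a sense of the problem being solved, here is a minimal sketch of running a program with a timeout and retries. It is not one of the article's nine methods: it leans on Python's standard `subprocess.run` with its `timeout` argument rather than on signals or system calls directly, and the function name `run_with_retries` and its parameters are assumptions made for illustration.

```python
import subprocess
import time


def run_with_retries(cmd, timeout_s, attempts, backoff_s=0.0):
    """Run cmd up to `attempts` times.

    Returns True as soon as one run exits with status 0 within
    timeout_s seconds, and False if every attempt fails or times out.
    """
    for attempt in range(1, attempts + 1):
        try:
            # On timeout, subprocess.run kills the child, waits for it,
            # and raises TimeoutExpired.
            result = subprocess.run(cmd, timeout=timeout_s)
            if result.returncode == 0:
                return True
        except subprocess.TimeoutExpired:
            pass  # treat a timeout like any other failed attempt
        if attempt < attempts:
            # Linear backoff between attempts; no sleep after the last one.
            time.sleep(backoff_s * attempt)
    return False
```

A call such as `run_with_retries(["curl", "-fsS", "https://example.com"], timeout_s=5, attempts=3, backoff_s=1)` would try the command up to three times. Only the direct child is killed on timeout, so any grandchildren it spawned may survive.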

Cosmos DB is Azure’s answer to Amazon’s DynamoDB. This article gives a nice overview and compares it to various other data stores to help you decide whether it’s right for your use case.

  Adam Gordon Bell — Pulumi

An engineer at Mercari shares their plan for migrating to their new payment system in this five-part article series, all parts of which are now published. They created their design after reading 80(!) similar articles from folks at other companies.

  resotto — Mercari

Updated: November 17, 2024 — 9:11 pm
A production of Tinker Tinker Tinker, LLC