SRE Weekly Issue #322

Bit of a short issue this week. This morning, I stepped on my phone, crushing it mightily beneath my bootheel. Unfortunately a lot of my automation for reviewing articles is on there… thank goodness I have functioning backups.

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒. Automate manual admin tasks like creating incident channel, Jira and Zoom, paging and adding responders, postmortem timeline, setting up reminders, and more. Book a demo (+ get a snazzy Rootly lego set):
https://rootly.com/demo/

Articles

What? Actually, it’s a pretty good analogy.

  Emily Arnott - Blameless

Mercari has this update to their previous article on their embedded SRE team with more details on how their embedding model works.

  Taichi Nakashima - Mercari

Interesting things happen when you combine tail latency with a microservice architecture.

  Marc Brooker

Their starting point was paging for every single exception raised by their application. Here’s how they tempered that a bit to get a handle on their paging volume.

  Lisa Karlin Curtis - incident.io

This article draws from the “SRE Hierarchy” in Google’s SRE book (which itself is a reference to Maslow’s hierarchy of needs). It recasts the SRE hierarchy as a path to maturity.

  Ash P. - SREPath

Google posted this summary of an incident from late April. A configuration change had the unintended effect of causing livestream view requests to fail.

  Google

Outages

  • Xbox
    • I don’t normally bother with game outages, but this one caught my eye. During the 4-day outage, customers were unable to play Xbox games that they had already purchased.

  • Twitter
  • Coinbase
Updated: May 15, 2022 — 10:38 pm
A production of Tinker Tinker Tinker, LLC