SRE Weekly Issue #294

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒. Automate manual admin tasks like creating the incident channel, Jira ticket, and Zoom call, paging the right team, building the postmortem timeline, setting up reminders, and more. Book a demo:
https://rootly.com/?utm_source=sreweekly

Articles

The steps are:

  • Know How Much Time Is Spent On Toil
  • Find The Toil
  • Determine The Root Causes Of Toil
  • Find And Prioritize The Low-Hanging Fruit
  • Promote Toil Reduction

Aater Suleman — Forbes

I like how they try to strike a balance and avoid reviewing too far in depth, while still hitting everything important.

Milan Plžík — Grafana Labs

Lots of good stuff in this one about one of my favorite topics, service ownership.

Kenneth Rose — OpsLevel

This is the intro I needed to understand Conflict-Free Replicated Data Types.

Jo Stichbury — Ably

Availability, maintainability and reliability all have distinct—if related—meanings, and they each play different roles in reliability operations.

JJ Tang — DevOps.com

The five Ps come from medicine and understanding medical accidents, but they apply equally well to analyzing incidents in IT.

Lydia Leong

I really love the focus on de-emphasizing finding action items in incident retrospectives, in favor of learning.

Gergely Orosz — The Pragmatic Engineer

Outages

  • AT&T SMS in the US
    • This week, I saw several status pages point to some kind of problem in their ability to send SMS notifications to AT&T phones. I thought this was interesting because usually I don’t learn about an outage solely from other companies’ status pages.
  • Google Meet
  • Tesco
  • Coinbase
  • Zomato
  • Barclays
  • HSBC
Updated: October 31, 2021 — 9:06 pm
A production of Tinker Tinker Tinker, LLC