SRE Weekly Issue #319

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒. Automate manual admin tasks like creating an incident channel, Jira ticket, and Zoom meeting, paging and adding responders, building the postmortem timeline, setting up reminders, and more. Book a demo (+ get a snazzy Rootly LEGO set):
https://rootly.com/demo/

Articles

Be judicious when you generate remediation tasks from incidents, or you can end up investing in the wrong area.

  Marc Brooker

Zendesk SRE has a set of 8 reliability principles that guide what they do.

  Jason Smale β€” Zendesk

We’re going to talk about a few necessities that enable exceptional incident management.

  1. Service ownership
  2. Incident roles
  3. The incident declaration process
  4. Running incident drills

  Robert Ross β€” FireHydrant

I don’t think you’re supposed to use Consul that way…

Read this article to follow along on an interesting design journey.

  Thomas Ptacek β€” Fly.io

One single metric for availability probably can’t tell you the whole story.

Β Β Stephen Townshend β€” Slight Reliability

We can learn from the process another engineer takes to debug a problem. But often, a ticket or problem description is stripped of the process and just has the answer, hampering learning.

  Lorin Hochstein β€” The ReadME Project (GitHub)

We’re still not 100% there as a team, but I hope this article will serve as a reference for anyone who might create an SRE team in the future.

  @tjun β€” Mercari

This article gives 6 different ways to organize the findings from your retrospective to share with different audiences.

Β Β Vanessa Huerta Granda β€” Jeli

There’s a great reliability story in the way that the Hubble telescope and the Apollo missions used gimbals β€” and in the way that the JWST doesn’t.

  Robert Barron β€” IBM

Outages

  • Hulu
  • IRS
    • The US Internal Revenue Service’s systems went down on the due date for tax filing.

  • Instagram
Updated: April 24, 2022, 9:17 pm