SRE Weekly Issue #306

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒. Automate manual admin tasks like creating the incident channel, Jira ticket, and Zoom call, paging the right team, building the postmortem timeline, setting up reminders, and more. Book a demo (+ get a snazzy Rootly shirt):
https://rootly.com/demo/?utm_source=sreweekly

Articles

In the past, NASA has increased the likelihood of mission success by sending duplicate spacecraft. In the case of the JWST, that’s not an option.

  Robert Barron

This article makes a case that agile development practices depend on SRE.

  Ash P — Cruform Newsletter

This history covers the advent of the Incident Command System (ICS) and subsequently the National Incident Management System (NIMS).

  JJ Tang — Rootly
This article is published by my sponsor, Rootly, but their sponsorship did not influence its inclusion in this issue.

Meta migrated their Facebook Ordered Queueing Service (FOQS) to a global, highly available deployment. This article describes the original architecture, lists its shortcomings, and explains how they carried out the migration with zero downtime.

  Jasmit Kaur Saluja and Dillon George — Meta

This is the first time I’ve heard of a “Problem Manager” role, and I like it.

  Laurel Frazier — Transposit

How do you make an SLO for a service with long-running requests? One method is to report metrics on regular time intervals.

  Liz Fong-Jones — Honeycomb
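
To make the interval idea concrete, here is a minimal sketch in Python. It is not taken from the linked article: the names `IntervalReport`, `report_intervals`, and `interval_sli` are hypothetical, and it assumes that a failed request spoils only its final interval. Each long-running request is split into fixed-size intervals, and the SLI is the fraction of healthy intervals rather than the fraction of healthy requests.

```python
from dataclasses import dataclass


@dataclass
class IntervalReport:
    """One report emitted per elapsed interval of a long-running request."""
    request_id: str
    interval_index: int
    healthy: bool


def report_intervals(request_id, duration_s, interval_s, failed=False):
    """Split one request into fixed-size intervals (hypothetical helper).

    Assumption: if the request failed, only its last interval is unhealthy.
    """
    # Integer ceiling division; always report at least one interval.
    count = max(1, -(-duration_s // interval_s))
    return [
        IntervalReport(request_id, i, healthy=not (failed and i == count - 1))
        for i in range(count)
    ]


def interval_sli(reports):
    """Fraction of healthy intervals across all reports."""
    if not reports:
        return 1.0
    good = sum(1 for r in reports if r.healthy)
    return good / len(reports)


# A 10-minute request that succeeded and a 150-second request that failed,
# both reported on 60-second intervals.
ok_reports = report_intervals("req-a", 600, 60)
bad_reports = report_intervals("req-b", 150, 60, failed=True)
sli = interval_sli(ok_reports + bad_reports)
```

Counting intervals means a request that runs for an hour and then fails is not weighted the same as one that fails after a second, which is one reason to prefer this over a per-request success ratio for long-running work.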

A failure in their Software-Defined Networking (SDN) configuration system required manual recovery.

  Google

Outages

Updated: June 1, 2022 — 9:42 pm
A production of Tinker Tinker Tinker, LLC