SRE Weekly Issue #325

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒. Automate manual admin tasks like creating the incident channel, Jira, and Zoom; paging and adding responders; building the postmortem timeline; setting up reminders; and more. Book a demo (+ get a snazzy Rootly lego set):
https://rootly.com/demo/

Articles

This article really upends the concept of “human error”, in an intriguing way.

  Lorin Hochstein

A key part of building reliable systems is often overlooked: continuously learning.

In the highly dynamic CI/CD environment, engineers with stale or outdated knowledge of the system are less able to detect, diagnose or repair anomalous behavior in their systems […]

  Laura Maguire (jeli.io) — The New Stack

This is the story of how an organization transitioned from a single NOC-like on-call team to individualized alerts routed to the relevant team.

  Boris Cherkasky

This guide has a set of key factors you should consider when building a new SRE team in order to increase the likelihood of success.

  Ash P — SREPath

My favorite pitfall discussed in this article: avoid committing to every possible remediation action from every incident.

  incident.io

This article, written by one of the authors of the Google SRE book, is a critical look at the state of SRE and what the future holds.

Today, I believe we cannot successfully answer several key questions about SRE.

  Niall Murphy

This interview goes into the thorny challenges around building a reliable app based on third-party services. It delves into the lack of reliable reporting we commonly see from cloud service providers and what ideal reporting would look like.

  Jeff Martens (interviewing Jeff Smith) — Metrist

Outages

Updated: June 5, 2022 — 9:02 pm
A production of Tinker Tinker Tinker, LLC