SRE Weekly Issue #329

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒. Automate manual admin tasks like creating incident channel, Jira and Zoom, paging and adding responders, postmortem timeline, setting up reminders, and more. Book a demo (+ get a snazzy Rootly lego set):
https://rootly.com/demo/

Articles

A primer on what makes a good runbook.

Runbooks are most effective when they are readily available, easily actionable, and up-to-date and accurate.

  Cortex

In this article, we describe the architecture and implementation of our SRE infrastructure, how it is used and how it was adopted.

  Philipp Gündisch and Vladyslav Ukis — Siemens

After an explanation of tech debt, this article goes into a possible solution: having on-call folks fix lingering problems in between pages.

  Dormain Drewitz — The New Stack

I’ve read plenty of articles about service ownership, but this one has something new: a discussion of how to divvy up a monolith into separate “services” for teams to own.

  Hannah Culver — PagerDuty

The folks at Sendinblue have chronicled their journey to better incident response, and there’s a lot here to learn from.

  Tanguy Antoine — Sendinblue

Incidents will always happen, but thankfully they have plenty of upsides, as this article explains.

  Andre King — Rootly
This article is published by my sponsor, Rootly, but their sponsorship did not influence its inclusion in this issue.

You’re not getting paged. Is it because you’ve fixed all the things, or has your alerting atrophied?

  Boris Cherkasky

The folks at incident.io are here with the results of their survey of on-call practices. I like the focus on compensation for being on-call.

  incident.io

Outages

Updated: July 3, 2022 — 10:30 pm
A production of Tinker Tinker Tinker, LLC