SRE Weekly Issue #354

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒.

Rootly automates manual tasks like creating an incident channel, Jira ticket and Zoom rooms, inviting responders, creating statuspage updates, postmortem timelines and more. Want to see why companies like Canva and Grammarly love us?


This episode of DisasterCast discusses what happens when attempts to make things safer backfire.

by trying to suppress small problems, we create a reservoir of danger waiting to burst out

  Drew Rae

These images offer a glimpse into the visual patterns that appear in our variables and time-series, and the beauty that emerges from chaos. Some of the images in these galleries appeared during difficult rollouts, and some even during production incidents. All come from graphs generated by Google’s monitoring systems.


The popular slogan says “test in production”, but what if your business simply doesn’t allow it?

For any scenario where I expect to be causing client impact, I’d rather test in non-production than not test at all, since production is clearly off the table.

  Christina Yakomin - InfoQ

There’s been a trend toward narrating our engineering work on company blogs, without which this newsletter probably wouldn’t exist.

  Jordan Teicher - New York Times

My team recently moved databases from local files in the codebase to an online Database.

It didn’t go quite as planned, but they got there in the end.

  Kaustubh Hiware - Mercari

In Product Analytics we wanted to support our colleagues in SRE, so we created a model to predict the monetary costs of incidents affecting our conversion funnel.

  Enrique Hernani Ros - HelloFresh

There’s some interesting detail here about multiple failed UPSes and an accidental voltage mismatch exacerbating the situation.

  Laura Dobberstein - The Register

Updated: January 8, 2023 — 9:05 pm
A production of Tinker Tinker Tinker, LLC