SRE Weekly Issue #370

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒.

Rootly automates manual tasks like creating an incident channel, Jira ticket, and Zoom room, inviting responders, and creating statuspage updates, postmortem timelines, and more. Want to see why companies like Canva and Grammarly love us?

https://rootly.com/demo/

Articles

[…] although “getting the system back up” should be our first priority, to do so safely, we first need to very carefully define what “up” means.

What functionality is critical? Should we sacrifice feature A to save feature B? It’s important to plan ahead.

  Boris Cherkasky

It turns out that it depends on how you define “uptime”. Does claiming “100%” actually benefit you?

  Ellen Steinke — Metrist

Skipping the retro shouldn’t be an option. Ditch the one-size-fits-all process to ensure that this important step is held at the end of every incident.

  Jouhné Scott — FireHydrant

Another good one to have in your back pocket for those “What would you say… you do here?” moments.

  Ash Patel — SREPath

Build versus buy for incident management systems: what is the true cost of rolling your own?

  Biju Chacko and Nir Sharma — Squadcast

A plugin to give ChatGPT the ability to run AWS API calls. I’m not sure how I feel about this.

  Banjo Obayomi — DZone

They solved a cardinality explosion by switching from query-based alerting to stream data processing.

  Ruchir Jha, Brian Harrington, and Yingwu Zhao — Netflix

Updated: May 1, 2023 — 12:19 am
A production of Tinker Tinker Tinker, LLC