SRE Weekly Issue #303

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒. Automate manual admin tasks like creating incident channel, Jira and Zoom, paging the right team, postmortem timeline, setting up reminders, and more. Book a demo:
https://rootly.com/demo/?utm_source=sreweekly

Articles

There are way too many gorgeous, mind-blowing ways for incidents to occur without a single change to code being deployed.

That last hot take is the kicker: even if you don’t do a code freeze in December (in the US), you’ll still see a lot of the same pitfalls as you would have if you did.

  Emily Ruppe — Jeli

Ah, IaC, the tool we use to machine-gun our feet in a highly-available manner at scale. This analysis of an incident from back in August tells what happened and what they learned.

  Stuart Davidson — Skyscanner

By establishing a set of core principles (Response, Observability, Availability and Delivery) aka our “ROAD to SRE”, we now have clarity on what areas we expect our SRE team should be focusing on and avoiding a common pitfall of becoming another platform or Ops team.

  Bruce Dominguez

In this blog post, we’ll look at:

  • The advantages of an SRE team where each member is a specialist.
  • Some SRE specialist roles and how they help.

  Emily Arnott — The New Stack

I love these “predictions for $YEAR” posts. What are your predictions?

  Emily Arnott — Blameless

Deployment Decision-Making during the holidays amid the COVID19 Pandemic

A sneak peek into my forthcoming MSc. thesis in Human Factors and Systems Safety, Lund University.

  Jessica DeVita (edited by Jennifer Davis) — SysAdvent

This article covers what to do as an incident commander, how to handle long-running incidents, and how to do a post-incident review.

  Joshua Timberman — SysAdvent

So in this post I’m going to go over what makes a good metric, why data aggregation on its own loses resolution and messy details that are often critical to improvements, and that good uses of metrics are visible by their ability to assist changes and adjustments.

  Fred Hebert

Here’s a great tutorial to get started with eBPF through a (somewhat convoluted) “Hello World” exercise.

  Ania Kapuścińska (edited by Shaun Mouton) — SysAdvent

The concept of engineering work being about resolving ambiguity really resonates with me.

  Lorin Hochstein

This appears to have caused a problem with Microsoft Exchange servers. Maybe this belongs in the Outages section…

  rachelbythebay

Outages

Updated: January 2, 2022 — 9:04 pm
A production of Tinker Tinker Tinker, LLC