SRE Weekly Issue #303

Articles

There are way too many gorgeous, mind-blowing ways for incidents to occur without a single change to code being deployed.

That last hot take is the kicker: even if you don’t do a code freeze in December (in the US), you’ll still see a lot of the same pitfalls as you would have if you did.

Emily Ruppe — Jeli

How a couple of characters brought down our site

Ah, IaC, the tool we use to machine-gun our feet in a highly-available manner at scale. This analysis of an incident from back in August tells what happened and what they learned.

Stuart Davidson — Skyscanner

The ROAD to SRE

By establishing a set of core principles (Response, Observability, Availability and Delivery), aka our “ROAD to SRE”, we now have clarity on which areas our SRE team should focus on, and we avoid the common pitfall of becoming another platform or Ops team.

Bruce Dominguez

Building SRE Teams with Specialization

In this blog post, we’ll look at:

The advantages of an SRE team where each member is a specialist.

Some SRE specialist roles and how they help.

Emily Arnott — The New Stack

SRE Predictions 2022

I love these “predictions for $YEAR” posts. What are your predictions?

Emily Arnott — Blameless

Day 20 – To Deploy or Not to Deploy? That is the question.

Deployment Decision-Making during the holidays amid the COVID-19 Pandemic

A sneak peek into my forthcoming MSc thesis in Human Factors and Systems Safety, Lund University.

Jessica DeVita (edited by Jennifer Davis) — SysAdvent

Day 22 – So, You’re Incident Commander, Now What?

This article covers what to do as an incident commander, how to handle long-running incidents, and how to do a post-incident review.

Joshua Timberman — SysAdvent

Plato’s Dashboards

So in this post I’m going to go over what makes a good metric, why data aggregation on its own loses resolution and messy details that are often critical to improvements, and that good uses of metrics are visible by their ability to assist changes and adjustments.

Fred Hebert

Day 23 – What is eBPF?

Here’s a great tutorial to get started with eBPF through a (somewhat convoluted) “Hello World” exercise.

Ania Kapuścińska (edited by Shaun Mouton) — SysAdvent

The ambiguity of real work

The concept of engineering work being about resolving ambiguity really resonates with me.

Lorin Hochstein

YYMMDDHHMM just overflowed a signed 32 bit int

This appears to have caused a problem with Microsoft Exchange servers. Maybe this belongs in the Outages section…

rachelbythebay
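
To make the arithmetic behind that title concrete, here is a minimal sketch. It assumes the timestamp is stored as the plain decimal number YYMMDDHHMM in a signed 32-bit integer; the function names are hypothetical and this is not the code from any affected product.

```python
from datetime import datetime

INT32_MAX = 2**31 - 1  # 2147483647, the largest signed 32-bit value
INT32_MIN = -2**31


def yymmddhhmm(ts: datetime) -> int:
    """Encode a timestamp as the decimal integer YYMMDDHHMM (hypothetical helper)."""
    return int(ts.strftime("%y%m%d%H%M"))


def fits_in_int32(value: int) -> bool:
    """Return True if the value can be stored in a signed 32-bit integer."""
    return INT32_MIN <= value <= INT32_MAX


# The last minute of 2021 still fits; the first minute of 2022 does not.
last_2021 = yymmddhhmm(datetime(2021, 12, 31, 23, 59))
first_2022 = yymmddhhmm(datetime(2022, 1, 1, 0, 0))

print(last_2021, fits_in_int32(last_2021))
print(first_2022, fits_in_int32(first_2022))
```

Any value whose two-digit year is 22 or later starts at 2201010000, which is above 2147483647, so under this assumption the encoding stops fitting at the turn of the year regardless of the month or day.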
