SRE Weekly Issue #205

A message from our sponsor, VictorOps:

Service resilience requires both real-time incident response software and a robust incident management and IT ticketing tool. These common techniques and tools can help you enhance your VictorOps and ServiceNow integration – making incident management suck less:

https://go.victorops.com/sreweekly-victorops-and-servicenow

Articles

This article hints that blame and sanction (punishment) are two different things.

Bonus content: Dr. Richard Cook on blameless vs sanctionless retrospectives

Bob Reselman

Here we have a few lessons in operations that we all (eventually) have to learn, often the hard way.

Jan Schaumann

I especially like the emphasis on reducing pager fatigue through thoughtfully selected SLOs.

Emily Arnott — Blameless

The four concepts, drawn from a paper by Dr. David Woods, are:

  • Rebound
  • Robustness
  • Graceful extensibility
  • Sustained adaptability

Thai Wood — Resilience Roundup

Understanding the difference between work-as-imagined and work-as-done is critical to the reliability of a complex system.

Jaime Woo and Emil Stolarsky — The Morning Mind-Meld

There’s a useful survey in here if you’re trying to measure or track toil in your organization.

Eric Harvieux — Google

A nice little debugging story hinging on a bug in an upstream library.

Sanket Patel

Outages

Updated: February 2, 2020 — 8:35 pm
A production of Tinker Tinker Tinker, LLC