
SRE Weekly Issue #207

A message from our sponsor, VictorOps:

Host extraordinaire, Benton Rochester, talks with Gene Kim about DevOps and his excellent new book, The Unicorn Project. Don’t miss this highly-anticipated episode of Ship Happens, the Splunk + VictorOps podcast:

https://go.victorops.com/sreweekly-ship-happens-with-gene-kim

Articles

The scenario: a seemingly botched landing, a finding of human error, and retraining for the errant pilots. The author recasts the entire incident in a much more realistic light that shows that the pilots’ actions were perfectly reasonable.

Robyn Ironside — Safety Differently

Just exactly what would it take to (reliably) run your own git server internally?

Chris Siebenmann

In this two-part series, The Morning Paper takes on John Allspaw’s master’s thesis from Lund University. Here’s part two.

Adrian Colyer — The Morning Paper (summary)

John Allspaw — Lund University (original paper)

The section toward the end under the heading “Things need to get worse before they get better.” especially resonated with me.

Hannah Culver — Blameless

Incident response and improvisational music share a lot in common.

Matt Davis — Verica

Outages

SRE Weekly Issue #206

Articles

All the plans in the world can’t prepare us for every incident, and yet we can excel during incidents anyway. How?

Will Gallego

These pilots’ minds were almost literally sleeping. The air traffic controller gave them a command they could execute in their sleep: Descend and Maintain.

Along the same lines, an incident caused this pilot to nearly forget how to fly, and yet she landed the plane safely with some reassurance from ATC.

Today’s choice looks at what it takes for machines to participate productively in collaborations with humans.

Adrian Colyer — The Morning Paper (summary)

Klein et al. — IEEE Computer Nov/Dec 2004 (original paper)

A lot of things that are happening in your organization, your system, are largely invisible. And those things, that work, is keeping things up and running.

Lorin Hochstein

This followup covers incidents on January 22 and 24.

Tracking the number of incidents is almost never going to be useful, and is likely to be detrimental.

Rick Branson

Some good scenarios to think about, including an idea for chaos engineering with humans.

Dean Wilson

Outages

SRE Weekly Issue #205

A message from our sponsor, VictorOps:

Service resilience requires both real-time incident response software and a robust incident management and IT ticketing tool. These common techniques and tools can help you enhance your VictorOps and ServiceNow integration – making incident management suck less:

https://go.victorops.com/sreweekly-victorops-and-servicenow

Articles

This article hints at the fact that blame and sanction (punishment) are two different things.

Bonus content: Dr. Richard Cook on blameless vs sanctionless retrospectives

Bob Reselman

Here we have a few lessons in operations that we all (eventually) (have to) learn, often the hard way.

Jan Schaumann

I especially like the emphasis on reducing pager fatigue through thoughtfully selected SLOs.

Emily Arnott — Blameless

The four concepts, drawn from a paper by Dr. David Woods, are:

  • Rebound
  • Robustness
  • Graceful extensibility
  • Sustained adaptability

Thai Wood — Resilience Roundup

Understanding the difference between work-as-imagined and work-as-done is critical to the reliability of a complex system.

Jaime Woo and Emil Stolarsky — The Morning Mind-Meld

There’s a useful survey in here if you’re trying to measure or track toil in your organization.

Eric Harvieux — Google

A nice little debugging story hinging on a bug in an upstream library.

Sanket Patel

Outages

SRE Weekly Issue #204

A message from our sponsor, VictorOps:

Continuous improvement, delivery, and integration typically sit at the forefront of DevOps. But, none of this is possible without a successful system for continuous testing. See how modern teams are creating a robust continuous testing framework:

https://go.victorops.com/sreweekly-continuous-testing-in-devops

Articles

In this talk, Dr. Richard Cook presents bone as the archetype for resilient systems, and shows us what we can learn about resilience engineering from medicine.

Richard Cook, MD — Adaptive Capacity Labs

Some interesting ideas on testing in production, involving developer instances that live right inside production and take a portion of production traffic.

Will Sargent

Keep in mind, though, that you aren’t really studying an incident at all: you’re studying your system through the lens of an incident.

Lorin Hochstein

This thread has an interesting analogy between alerts and code comments.

Shelby Spees

I’m really loving this thing where Adrian Colyer is going through classic works on The Morning Paper. Here’s his take on the STELLA Report.

Adrian Colyer — The Morning Paper (summary)

Woods et al. (original report)

Outages

SRE Weekly Issue #203

A message from our sponsor, VictorOps:

Bulkhead and sidecar application design patterns can be used to create more efficient incident response workflows for DevOps and IT operations. Learn more:

https://go.victorops.com/sreweekly-bulkhead-and-sidecar-design-patterns

Articles

Spot-on advice for writing incident followups, citing examples of real write-ups that exhibit the techniques they recommend.

Hannah Culver — Blameless

“The beautiful thing about going on-call is you get to go off-call. If you aren’t on-call, I have news for you – you’re always on-call”

Jay Gordon — Page It to the Limit

This is a companion to last week’s article, Sharing SQLite databases across containers is surprisingly brilliant. This one explains the broader ctlstore system.

Rick Branson and Collin Van Dyck — Segment

Chaos Mesh is a versatile Chaos Engineering platform that features all-around fault injection methods for complex systems on Kubernetes, covering faults in Pod, network, file system, and even the kernel.

Chengwen Yin — PingCAP

Fake it ’til you make it clear what motivated the decisions of incident responders.

Lorin Hochstein

When running a platform, pay attention to the experience of specific customers, says Google. That may mean inferring their metrics from your own if they haven’t shared their SLIs with you.

Adrian Hilton — Google

This article takes a stand against the “Three Pillars of Observability”.

[…] focus on what kinds of questions you’re trying to answer and let that guide your choice of telemetry.

Mads Hartmann

My favorite recommendation is to make log messages “two-way greppable”: easy to find in the logs, and easy to trace back to exactly which part of the code they come from.

Vladimir Garvardt — HelloFresh
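
To make the "two-way greppable" idea concrete, here is a minimal sketch of one way it might look in Python. It is not taken from the HelloFresh article: the helper names (`make_logger`, `log_event`, `charge_order`), the logger name, and the event phrase are all hypothetical. The assumption is that each log call uses a fixed, unique phrase and keeps variable data in separate key=value fields, so the same phrase can be grepped in both the logs and the source.

```python
import io
import logging


def make_logger(stream):
    """Build a logger that writes plain lines to the given stream."""
    logger = logging.getLogger("orders_demo")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


def log_event(logger, event, **fields):
    """Log a fixed event phrase followed by sorted key=value fields."""
    details = " ".join(f"{key}={value}" for key, value in sorted(fields.items()))
    logger.info(f"{event} {details}".strip())


def charge_order(logger, order_id, amount_cents):
    # The phrase below is a literal used in only one place in the source,
    # so grepping the code base for it leads straight to this call.
    # Variable data stays out of the phrase, so grepping the logs for the
    # same literal finds every occurrence regardless of the values.
    log_event(logger, "order charge submitted",
              order_id=order_id, amount_cents=amount_cents)


buffer = io.StringIO()
charge_order(make_logger(buffer), order_id=42, amount_cents=1999)
output = buffer.getvalue()
print(output, end="")
```

Searching the logs for `order charge submitted` finds every such event, and searching the source for the same phrase finds the single call site. Interpolating values into the middle of the phrase would break the second direction.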

Outages
