SRE Weekly Issue #209

A message from our sponsor, VictorOps:

Efficient management of SQL schema evolutions allows DevOps professionals to deploy code quickly and reliably with little to no impact. Learn how modern teams are building out zero-impact SQL database deployment workflows here:

https://go.victorops.com/sreweekly-zero-impact-sql-database-deployments

Articles

Azure developed this tool to sniff out production problems caused by deploys and guess which deploy might have been the culprit. Its accuracy is impressive.

Adrian Colyer — The Morning Paper (summary)

Li et al. — NSDI’20 (original paper)

This one made me laugh out loud. Better check those system call return codes, people.

rachelbythebay
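
As a small illustration of the point about return codes (this sketch is not taken from the linked post), here is one way to handle a write call that may report fewer bytes written than requested. The helper name `write_all` is hypothetical; `os.write` is the standard-library wrapper, which returns the byte count and raises `OSError` on failure.

```python
import os


def write_all(fd: int, data: bytes) -> int:
    """Write all of `data` to `fd`, looping on short writes.

    Hypothetical helper for illustration only.
    """
    view = memoryview(data)
    total = 0
    while total < len(view):
        # os.write returns the number of bytes actually written, which may
        # be fewer than requested. Errors surface as OSError.
        written = os.write(fd, view[total:])
        if written == 0:
            # Treat a zero-byte write as a failure rather than looping forever.
            raise OSError("write returned 0 bytes")
        total += written
    return total


if __name__ == "__main__":
    read_fd, write_fd = os.pipe()
    count = write_all(write_fd, b"hello")
    os.close(write_fd)
    print(count, os.read(read_fd, 16))
    os.close(read_fd)
```

Ignoring the value returned by `os.write` would silently drop the unwritten tail of the buffer whenever a short write occurs.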

This caught my eye:

In addition, what is seen as the IC maintaining organizational discipline during a response can actually be undermining the sources of resilient practice that help incident responders cope with poorly matched coordination strategies and the cognitive demands of the incident.

Laura M.D. Maguire — ACM Queue Volume 17, Issue 6

A guide on salary expectations for various levels of SRE, especially useful if you’re changing jobs.

Gremlin

The flipside of microservices agility is the resiliency you can lose from service distribution. Here are some microservices resiliency patterns that can keep your services available and reliable.

Joydip Kanjilal

Several consumer devices have failed recently because of cloud service outages, and this author argues for change.

Kevin C. Tofel — Stacey on IoT

This sounds familiar

Durham Radio News

Essentially, you’re taking that risk of the Friday afternoon deployment, and spreading it thinly across many deployments throughout the week.

Ben New

Outages

SRE Weekly Issue #208

A message from our sponsor, VictorOps:

Learn about some more subtle, unknown use cases for using Splunk + VictorOps to drive a more analytical, proactive approach to incident response:

https://go.victorops.com/sreweekly-splunk-for-analytical-incident-response

Articles

There’s so much in this article:

  • how to recognize when your system may be susceptible to cascading failure
  • how to prevent it
  • how to deal with it when it happens (and how hard that can be)

Laura Nolan — Slack

It’s time for this year’s SRE Survey. Don’t forget that with each completed survey, Catchpoint donates $5 to charity.

This growing demand [for SREs] is not without growing pains as a skills gap problem has emerged due to the fact that SRE training requires a hands-on, interactive learning environment.

Peter Murray — Catchpoint

Both the summary and the original article are well worth reading. This stood out to me:

As much as we may think of incidents as taking place in all those technical parts of the system below the line, incidents actually take place above it

Thai Wood (summary)

Dr. Richard Cook (original article)

The EBS control plane data store resembles a “jellyfish” (actually a Physalia, a.k.a. Portuguese man-of-war).

Timothy Prickett Morgan — The Next Platform

Ideal: each team manages their microservice(s) in isolation.

Reality: microservices interact in unexpected ways and a broader system emerges that has remarkable similarities to running a monolith.

Ben Sigelman — LightStep

This one discusses how to handle SRE for a monolith, and some examples of what often goes wrong.

Eric Harvieux — Google

The author blocked an unexpected Sunday deploy of untested code, and it turned out to be a good thing they did.

rachelbythebay

Outages

SRE Weekly Issue #207

A message from our sponsor, VictorOps:

Host extraordinaire, Benton Rochester, talks with Gene Kim about DevOps and his excellent new book, The Unicorn Project. Don’t miss this highly-anticipated episode of Ship Happens, the Splunk + VictorOps podcast:

https://go.victorops.com/sreweekly-ship-happens-with-gene-kim

Articles

The scenario: a seemingly botched landing, a finding of human error, and retraining for the errant pilots. The author recasts the entire incident in a much more realistic light that shows that the pilots’ actions were perfectly reasonable.

Robyn Ironside — Safety Differently

Just exactly what would it take to (reliably) run your own git server internally?

Chris Siebenmann

In this two-part series, The Morning Paper takes on John Allspaw’s master’s thesis from Lund University. Here’s part two.

Adrian Colyer — The Morning Paper (summary)

John Allspaw — Lund University (original paper)

The section toward the end under the heading “Things need to get worse before they get better.” especially resonated with me.

Hannah Culver — Blameless

Incident response and improvisational music have a lot in common.

Matt Davis — Verica

Outages

SRE Weekly Issue #206

A message from our sponsor, VictorOps:

Host extraordinaire, Benton Rochester, talks with Gene Kim about DevOps and his excellent new book, The Unicorn Project. Don’t miss this highly-anticipated episode of Ship Happens, the Splunk + VictorOps podcast:

https://go.victorops.com/sreweekly-ship-happens-with-gene-kim

Articles

All the plans in the world can’t prepare us for every incident, and yet we can excel during incidents anyway. How?

Will Gallego

These pilots’ minds were almost literally sleeping. The air traffic controller gave them a command they could execute in their sleep: Descend and Maintain.

Along the same lines, an incident caused this pilot to nearly forget how to fly, and yet she safely landed the plane with some reassurance from ATC.

Today’s choice looks at what it takes for machines to participate productively in collaborations with humans.

Adrian Colyer — The Morning Paper (summary)

Klein et al. — IEEE Computer Nov/Dec 2004 (original paper)

A lot of things that are happening in your organization, your system, are largely invisible. And those things, that work, is keeping things up and running.

Lorin Hochstein

This followup covers incidents on January 22 and 24.

Tracking the number of incidents is almost never going to be useful, and is likely to be detrimental.

Rick Branson

Some good scenarios to think about, including an idea for chaos engineering with humans.

Dean Wilson

Outages

SRE Weekly Issue #205

A message from our sponsor, VictorOps:

Service resilience requires both real-time incident response software and a robust incident management and IT ticketing tool. These common techniques and tools can help you enhance your VictorOps and ServiceNow integration – making incident management suck less:

https://go.victorops.com/sreweekly-victorops-and-servicenow

Articles

This article hints at the fact that blame and sanction (punishment) are two different things.

Bonus content: Dr. Richard Cook on blameless vs sanctionless retrospectives

Bob Reselman

Here we have a few lessons in operations that we all (eventually) (have to) learn; often the hard way.

Jan Schaumann

I especially like the emphasis on reducing pager fatigue through thoughtfully selected SLOs.

Emily Arnott — Blameless

The four concepts, drawn from a paper by Dr. David Woods, are:

  • Rebound
  • Robustness
  • Graceful extensibility
  • Sustained adaptability

Thai Wood — Resilience Roundup

Understanding the difference between work-as-imagined and work-as-done is critical to the reliability of a complex system.

Jaime Woo and Emil Stolarsky — The Morning Mind-Meld

There’s a useful survey in here if you’re trying to measure or track toil in your organization.

Eric Harvieux — Google

A nice little debugging story hinging on a bug in an upstream library.

Sanket Patel

Outages

A production of Tinker Tinker Tinker, LLC