General

SRE Weekly Issue #182

A message from our sponsor, VictorOps:

Collaborate with the right teammates, find the right information and resolve system outages in minutes. Play the VictorOps on-call game to test your skillz and compete against your friends and coworkers.

http://try.victorops.com/sreweekly/on-call-game

Articles

Friday deploys are going to be necessary occasionally, even if we try to ban them. Doing so will only mean that we’re less experienced at executing Friday deploys successfully.

Will Gallego

Jet engines are Complicated. The system of jet engine maintenance (including the technicians, policies, schedules, etc) is Complex. Understanding the difference is key to managing complex systems.

Adam Johns

In this issue, we have articles from the front-line, as well as from safety, legal, leadership, human factors and psychology specialists.

Hindsight is a magazine targeted at air traffic controllers. An example article title from this issue:

Mode-Switching in Air Traffic Control

Thanks to Greg Burek for this one.

The US Federal Communications Commission released their report on an outage last December that took down 911 (emergency services) across a large swathe of the US.

This outage was caused by an equipment failure catastrophically exacerbated by a network configuration error.

They’re two separate concepts, but they’re often presented together, blurring the line between them.

Daniel Abadi

I love the idea of applying resilience engineering to child welfare services. This article quotes from Hollnagel and Dekker.

Tom Morton and Jess McDonald

Outages

SRE Weekly Issue #181

A message from our sponsor, VictorOps:

Think you’ve got what it takes to quickly resolve a system outage? Test your on-call skillz with the new VictorOps on-call adventure game.

http://try.victorops.com/sreweekly/on-call-game

Articles

Root Cause Analysis is a flawed concept, and seeking it almost inevitably results in treating people unfairly. I like the concept of “Least Effort to Remediate” introduced in this article.

Casey Rosenthal — Verica

Slack developed a load simulation tool and used it to verify a new feature, Enterprise Key Management.

Serry Park, Arka Ganguli, and Joe Smith

After reviewing the history of the term “antifragility”, this article explains why it is a flawed concept and contrasts it with Chaos Engineering.

This is where the concept of antifragility veers from a truism into bad advice.

Casey Rosenthal

Outages

SRE Weekly Issue #180

A message from our sponsor, VictorOps:

Endorsing a culture of blameless transparency around post-incident reviews can lead to continuous improvement and more resilient services. Check out an interesting technique that SRE teams are using to improve post-incident analysis and learn more from failure:

http://try.victorops.com/sreweekly/ishikawas-fishbone-diagram

Articles

This reads like a mini list of war stories from a grizzled veteran reliability engineer… because that’s exactly what it is. Don’t forget to click the link at the bottom for the followup post!

rachelbythebay

The myths:

  1. Add Redundancy
  2. Simplify
  3. Avoid Risk
  4. Enforce Procedures
  5. Defend against Prior Root Causes
  6. Document Best Practices and Runbooks
  7. Remove the People Who Cause Accidents

If that doesn’t make you want to read this, I don’t know what will.

Casey Rosenthal — Verica

The graveyard that no one dared tread in was the Terraform code. Once they got CI/CD set up, deploys became much easier — and less scary.

Liz Fong-Jones — Honeycomb

My favorite idea in this article is that the absence of “errors” is not the same thing as safety.

Thai Woods (summary)

Sidney Dekker (original paper)

High availability and resilience are key features of Kubernetes. But what do you do when your Kubernetes cluster starts to become unstable and it looks like your ship is starting to sink?

Tim Little — Kudos

Outages

SRE Weekly Issue #179

A message from our sponsor, VictorOps:

A good SRE manager can make or break your site reliability engineering team. Learn all about the duties of an SRE manager and the best practices for building a highly-effective SRE program:

http://try.victorops.com/sreweekly/duties-of-effective-sre-managers

Articles

This is an engrossing write-up of the Chernobyl incident from the perspective of complex systems and failure analysis.

Barry O’Reilly

Slack’s Disasterpiece Theater isn’t quite chaos engineering, but it’s arguably better in some ways. They carefully craft scenarios to test their system’s resiliency, verifying (or disproving!) their hypothesis that a given disruption will be handled by the system without an incident. They share three riveting stories of lessons learned from past exercises.

The process each Disasterpiece Theater exercise follows is designed to maximize learning while minimizing risk of a production incident.

Richard Crowley — Slack

The above is the title of this YouTube playlist curated by John Allspaw.

My favorite sentence:

If you think an incident is “too common” to get its own postmortem that’s a good indicator that there’s a deeper issue that we need to address, and an excellent opportunity to apply our postmortem process to it.

Fran Garcia — HostedGraphite

In this post, we’ll share the algorithms and infrastructure that we developed to build a real-time, scalable anomaly detection system for Pinterest’s key operational timeseries metrics. Read on to hear about our learnings, lessons, and plans for the future.

I sure do love a good debugging story.

Eve Harris — Ably

When an incident occurs, your company is faced with a choice: do you seek to learn as much as possible about how it happened, or do you seek to find out who messed up?

Phillip Dowland — Safety Differently

Outages

SRE Weekly Issue #178

A message from our sponsor, VictorOps:

Containers and microservices can improve development speed and service flexibility. But, more complex systems have a higher potential for incidents. Learn how SRE teams are building more reliable services and adding context to microservices and containerized environments:

http://try.victorops.com/sreweekly/container-monitoring-and-alerting-best-practices

Articles

Imagine a database that promises consistency except in the case of a network partition, in which case it favors availability. That’s conditional consistency, and it’s effectively the same as no consistency.

Daniel Abadi

This is a story about distributed coordination, the TCP API, and how we debugged and fixed a bug in Puma that only shows up at scale.

Richard Schneeman — Heroku

Here’s more on the Australian Tax Office outage earlier this month.

Max Smolaks — The Register

Ever experience a total outage while your cloud provider still reports 99.999% availability? This one’s for you.

rachelbythebay

What’s good or bad to do in production? And how do you transfer knowledge when new team members want to release production services or take the ownership of existing services?

Jaana B. Dogan (JBD)

The internet is a series of tubes — the kind that transmit light. Favorite thing I learned: fiber optic cables are sheathed in copper that powers repeaters along their length.

James Griffiths — CNN

How do you build a reliable network when faced with highly skilled and motivated adversaries?

Alex Wawro — DARKReading

Outages

A production of Tinker Tinker Tinker, LLC