General

SRE Weekly Issue #153

A message from our sponsor, VictorOps:

SRE teams can leverage chaos engineering, stress testing, and load testing tools to proactively build reliability into the services they build. This list of open source chaos tools can help you get started:

http://try.victorops.com/sreweekly/open-source-chaos-testing-tools

Articles

In this podcast episode, Courtney Eckhardt and the panel cover a lot of bases related to incident response, retrospectives, defensiveness, blamelessness, social justice, and tons more engrossing stuff. Well worth a listen.

Mandy Moore (summary); John K. Sawers, Sam Livingston-Gray, Jamey Hampton, and Coraline Ada Ehmke (panelists); Courtney Eckhardt (guest)

Do you wonder what effect partitioned versus unified consistency might have on latency? Do you want to know what those terms mean? Read on.

Daniel Abadi

Cape is Dropbox’s real-time event processing system. The design sections of this article have a ton of interesting detail, and I also love the part where they explain their motivation for not simply using an existing queuing system.

Peng Kang — Dropbox

This is a great intro to the circuit breaker pattern if you’re unfamiliar with it, and it’s also got a lot of meaty content for folks already experienced with circuit breakers.

Corey Scott — Grab
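
If you’d like a feel for the mechanics before reading, here’s a minimal sketch of my own (a hypothetical illustration, not Grab’s implementation): trip open after a run of consecutive failures, fail fast while the circuit is open, and let a probe call through after a cooldown.

import time

class CircuitBreaker:
    """Minimal circuit breaker sketch: trip after N consecutive failures,
    reject calls while open, and probe again after a cooldown."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failure_count = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            # Cooldown elapsed: half-open, allow one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        # Any success closes the circuit and resets the failure count.
        self.failure_count = 0
        self.opened_at = None
        return result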

Though it sounds counterintuitive, more dashboards often make people less informed and less aligned.

Having a few good dashboards is important, but having too many gets in the way of your ability to do dynamic analysis.

Benn Stancil — Mode

What activities count as SRE work, versus “just” Operations?

Site Reliability Engineers do Operations but are not an Operations Team.

Stephen Thorne

Outages

SRE Weekly Issue #152

A message from our sponsor, VictorOps:

SRE teams can leverage automation in chat to improve incident response and make on-call suck less. Learn the ins and outs of using automated ChatOps for incident response:

http://try.victorops.com/sreweekly/automated-chatops-in-incident-response

Articles

It’s hard to summarize all the awesome here, but it boils down to empathy, collaboration, and asking, “How can I help?”. These pay dividends all over an organization, especially in reliability.

Note: Will Gallego is my coworker, although I came across this post on my own.

Will Gallego

This followup post for a Google Groups outage was (fittingly) hidden away in a Google Group.

Thanks to Jonathan Rudenberg for this one.

Now I can link directly to specific incidents! I miss the graphs, though.

Jamie Hannaford — GitHub

I laughed so hard I scared my cats:

COWORKER: we need to find the root cause asap
ME: *takes long drag* the root cause is that our processes are not robust enough to prevent a person from making this mistake
COWORKER: amy please not right now

Great discussion in the thread!

Amy Nguyen

In Air Traffic Control parlance, if a pilot or controller can’t satisfy a request, they should state that they are “unable” to comply. It can be difficult to decide in the moment what one is truly “unable” to do. There are a lot of great lessons here that apply equally well to IT incident response.

Tarrance Kramer — AVweb

The common theme at KubeCon is that SRE teams at many companies produce reliable, reusable patterns for their developers to build with.

Beth Pariseau — TechTarget

This is the story of a tenacious fight to find out what went wrong during an incident. If you read nothing else, the Conclusion section has a lot of great tidbits.

Tony Meehan — Endgame

Here’s a new guide on how to apply Restorative Just Culture. This made me laugh:

They also fail to address the systemic issues that gave rise to the harms caused, since they reduce an incident to an individual who needs to be ‘just cultured’.

Sidney Dekker — Safety Differently

Outages

SRE Weekly Issue #151

A message from our sponsor, VictorOps:

SRE teams can use synthetic monitoring and real-user monitoring to create a holistic understanding of the way their system handles stress. See how SRE teams are already implementing synthetic and real-user monitoring tools:

http://try.victorops.com/sreweekly/synthetic-and-real-user-monitoring-for-sre

Articles

They used feature flags to safely transition from a single-host service to a horizontally-scaled distributed system.

Ciaran Egan and Cian Synnott — Hosted Graphite
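
For a flavor of the technique (a hypothetical sketch, not Hosted Graphite’s actual code), a percentage-based flag can deterministically route a small slice of traffic to the new backend while the old path stays the default:

import hashlib

def flag_enabled(flag_name: str, key: str, rollout_percent: int) -> bool:
    # Deterministic bucketing: the same key always lands in the same
    # bucket, so a given host or user stays on one code path between deploys.
    digest = hashlib.sha256(f"{flag_name}:{key}".encode()).hexdigest()
    return int(digest, 16) % 100 < rollout_percent

def route_write(source_host: str, payload: bytes) -> str:
    # Hypothetical handlers standing in for the legacy single-host path
    # and the new horizontally-scaled one.
    if flag_enabled("distributed-ingest", source_host, rollout_percent=5):
        return f"sent {len(payload)} bytes to the new distributed backend"
    return f"sent {len(payload)} bytes to the legacy single-host backend"

print(route_write("agent-042", b"my.metric 1 1546300800"))

The nice property of deterministic bucketing is that ramping the percentage up or down never flips a given source back and forth between code paths.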

Limits and quotas can really ruin your day, and it can be very difficult to predict limit exhaustion before a change reaches production, as we learn in this incident story from RealSelf.

Bakha Nurzhanov — RealSelf

The challenge: you have to defend against abuse to keep your service running, but the abuse detection also must not adversely impact the user experience.

Sahil Handa — LinkedIn

PagerDuty has developed a system for measuring on-call health, factoring in quantity of pages, time of each page, frequency, clustering of pages, etc. I love what they’re doing and I hope we see more of this in our industry.

Lisa Yang — PagerDuty
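
The post doesn’t publish a formula, but as a purely hypothetical illustration of how those factors might combine (weights, thresholds, and names invented here, not PagerDuty’s), a health score could look something like this:

from datetime import datetime, timedelta

def on_call_health_score(pages: list[datetime]) -> float:
    # Hypothetical sketch: penalize page volume, pages during sleep hours,
    # and tight clusters of pages; higher scores mean a healthier rotation.
    if not pages:
        return 100.0
    pages = sorted(pages)
    volume_penalty = 2.0 * len(pages)
    sleep_penalty = 5.0 * sum(1 for p in pages if p.hour < 7 or p.hour >= 22)
    cluster_penalty = 3.0 * sum(
        1 for earlier, later in zip(pages, pages[1:])
        if later - earlier < timedelta(minutes=30)
    )
    return max(0.0, 100.0 - volume_penalty - sleep_penalty - cluster_penalty)

print(on_call_health_score([datetime(2018, 12, 1, 3, 10),
                            datetime(2018, 12, 1, 3, 25)]))  # 83.0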

A summary of three outage stories from Honeycomb’s recent event. My favorite is the third:

While Google engineers had put in place procedures for ensuring bad code did not take down their servers, they hadn’t taken the same precautions with data pushes.

Alaina Valenzuela — Honeycomb

Looking at that title, I thought to myself, “Uh, because it’s better?” It’s worth a read though, because it so eloquently explains horizontal versus vertical scaling, why you’d do one or the other, and why horizontal scaling is hard.

Sean T. Allen — Wallaroo Labs

Netflix has some truly massive cache systems at a scale of hundreds of terabytes. Find out what they do to warm up new cache nodes before putting them in production.

Deva Jayaraman, Shashi Madappa, Sridhar Enugula, and Ioannis Papapanagiotou — Netflix

This article lays out a promising plan for reducing the number of technologies your engineering department is using while still giving engineers the freedom to choose the right tool for the job.

Charity Majors

Outages

SRE Weekly Issue #150

A message from our sponsor, VictorOps:

The golden signals of SRE are essential metrics to monitor when developing reliable systems. But the golden signals are just the starting point. See how SRE teams are going past the golden signals to proactively build reliability into their services:

http://try.victorops.com/sreweekly/sre-golden-signals

Articles

This article is a condensed version of a talk, but it stands firmly on its own. Their Production-Grade Infrastructure Checklist is well worth a read.

Yevgeniy Brikman — Gruntwork

More and more, the reliability of our infrastructure is moving into the realm of life-critical.

Thanks to Richard Cook for this one.

Linda Comins — The Intelligencer

Detailed notes on lots of talks from SRECon, with a great sum-up at the top discussing the major themes of the conference.

Max Timchenko

Drawing from an @mipsytipsy Twitter thread from back in February, this article is a great analysis of why it’s right to put developers on call and how to make it humane. I especially like the part about paying extra for on-call, a practice I’ve been hearing mentioned more often lately.

John Barton

Really? Never? I could have sworn I remembered reading about power outages…

Yevgeniy Sverdlik — DataCenter Knowledge

Lots of good stuff in this one about preventing mistakes and analyzing failures.

Rachel Bryan — Swansea University

Outages

SRE Weekly Issue #149

A message from our sponsor, VictorOps:

Runbook automation leads to nearly instant on-call incident response. SRE teams can leverage runbook automation to deepen cross-team collaboration, surface context to on-call responders, and shorten the incident lifecycle, ultimately helping overall service reliability:

http://try.victorops.com/sreweekly/runbook-automation-for-sre

Articles

But does that mean you don’t need to think about reliability issues associated with large-scale distributed systems? The answer is, not completely. While there are many things that GCP and Cloud Functions handle behind the scenes, you still need to keep a couple of best practices in mind while building a reliable serverless solution.

Slawomir Walkowski — Google

The Emotet malware gang is probably managing their server infrastructure better than most companies are running their internal or external IT systems.

Catalin Cimpanu — Zero Day

Designing a distributed data store is about juggling competing priorities. This author discusses the latency penalty you pay for synchronous replication, and why you might want it anyway.

Daniel Abadi
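
As a toy model of that latency penalty (mine, not the author’s): a synchronous write isn’t acknowledged until enough replicas confirm it, so the slowest replica in the quorum sets the floor.

def sync_write_latency_ms(local_ms: float, replica_rtts_ms: list[float],
                          quorum: int) -> float:
    # The write is only acknowledged once `quorum` replicas have confirmed,
    # so latency is gated by the quorum-th fastest replica round trip.
    acked = sorted(replica_rtts_ms)[:quorum]
    return local_ms + max(acked)

# Replicas at 2 ms, 40 ms, and 90 ms round trip; requiring two acks means
# the 40 ms replica sets the floor, whereas an asynchronous write would
# return after just the local 1 ms.
print(sync_write_latency_ms(1.0, [2.0, 40.0, 90.0], quorum=2))  # 41.0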

Learn how Etsy designed tooling and a repeatable process to forecast resource usage.

Daniel Schauenberg — Etsy

Check out how Grab implemented chaos engineering.

Roman Atachiants, Tharaka Wijebandara, and Abeesh Thomas — Grab

Neat idea: use machine learning to select which automated tests to run for a given code change. The goal is a high likelihood of finding bugs while running fewer tests than traditional methods.

Mateusz Machalica, Alex Samylkin, Meredith Porth, and Satish Chandra — Facebook

In this blog post, we are going to discuss how the Auth0 Site Reliability team, led by Hernán Meydac Jean, used a progressive approach to build a mature service architecture characterized by high availability and reliability.

The system in question is a home-grown feature flags implementation.

Dan Arias — Auth0

Outages

The usual glut of Black Friday outages. I hope you all had an uneventful Friday.

A production of Tinker Tinker Tinker, LLC