General

SRE Weekly Issue #163

lex

March 10, 2019

Articles

Three analytical traps in accident investigation (YouTube, 7:36)

Using an NTSB report on an airplane crash as a case study, this video presents three common traps we fall into in incident retrospectives:

Counterfactual reasoning

Normative language

Mechanistic reasoning

I want to make this required material for all retrospective participants.

Dr. Johan Bergström — Lund University

Recipe for building a widget: How we helped to “peak-shift” demand by helping passengers understand travel trends

Peak-shifting can save you and your customers money and make load easier to handle.

Lara PuReum Yim, Prashant Kumar, Raghav Garg, Preeti Kotamarthi, Ajmal Afif, Calvin Ng Tjioe, and Renrong Weng — Grab

npm On-Call

These folks structured their on-call and incident response process around wombats (and sound guidelines and playbooks).

Wes Mason — npm

Crafting a Resilient Culture: Or, How to Survive an Accidental Mid-Day Production Incident

Lots of great stuff in this case study on an incident involving Chef and Apache. My favorite:

Enforcing processes arbitrarily as a way of trying to increase robustness often ends up making systems more fragile, as people will find ways to work around processes that frustrate their goals

Ryn Daniels — HashiCorp

Structured Logging: The Best Friend You’ll Want When Things Go Wrong

Here’s how and why Grab rebuilt their logging pipeline around structured JSON events.

Aditya Praharaj — Grab

The Four Agreements of Incident Response

Don Miguel Ruiz’s Four Agreements as applied to incident response:

Be Impeccable With Your Word

Don’t Take Anything Personally

Don’t Make Assumptions

Always Do Your Best

Matt Stratton — PagerDuty

Outages

Duo Security
- Retrospective analysis included.
Google Cloud Platform (Cloud Routers)
Fastly
Wells Fargo
Hosted Graphite

SRE Weekly Issue #162

lex

March 3, 2019

General

Comments

View on sreweekly.com

Articles

What would a EvE online Internet look like?

Want to nerd out on BGP? Check out how this person modeled the Eve Online universe as an 8000-VM cluster running BGP.

Ben Cartwright-Cox

Three Vacation Policies

Accrued vacation time is antiquated, and “unlimited” vacation paradoxically leads employees to take less time overall. Time to enforce vacations, lest we forget that burnout is a reliability risk.

Baron Schwartz

How to Avoid Catastrophe

How to avoid catastrophe: pay attention to near misses. This article makes an incredibly compelling point that we need to make a conscious effort to pay attention to near misses, and explains how cognitive bias will tend to make us do the exact opposite.

Catherine H. Tinsley, Robin L. Dillon, and Peter M. Madsen — Harvard Business Review

How SRE Creates a Blameless Culture

An intro to how blame causes problems, why blamelessness is better, and how to adopt a blameless culture.

Ashar Rizqi

Learning From Organizational Incidents: Resilience Engineering for High-Risk Process Environments

A 100-year-old chemical company thought they had a great safety record. Turns out that folks were just considering accidents “routine” and not reporting them.

Thai Wood (reviewing a paper by Stefanie Huber, Ivette van Wijgerden, Arjan de Witt, and Sidney W.A. Dekker)

How Reliability and Product Teams Collaborate at Booking.com

Booking.com has 50,000 servers and many SRE squads. They developed tools they call the Reliability Collaboration Model and the Ownership Map to help them define which products SRE squads support and at what level.

Emmanuel Goossaert — Booking.com

Outages

New Relic
Duo Security
Amtrak (US long-distance passenger rail)
- Amtrak had an outage of its switching system this past week. Linked above is an article with the inflammatory title, “Human error? Try abject stupidity, Amtrak”. Exercise: try to think of ways in which this is not a case of abject stupidity.
  Rich Miller — Capitol Fax
YouTube

SRE Weekly Issue #161

lex

February 24, 2019

General

Comments

View on sreweekly.com

Articles

Resilience Engineering and Error Budgets

I’m not a fan of error budgets. I’ve never seen them implemented particularly well up close, though I know lots of folks who say it works for them.

I’ve started to feel a bit sour on the whole error budget thing, but I couldn’t really pin down why. This article really nails it.

Will Gallego

Will Gallego is my co-worker, although I came across this article separately.

Accident Case Study: In Too Deep (YouTube)

I’m still hooked on flight accident case studies. In this one, mission fixation and indecision lead to disaster.

Air Safety Institute

Normalization of Deviance

If I was setting up curriculum at a university I’d make an entire semester-long class on The Challenger disaster, and make it required for any remotely STEM-oriented major.

This awesome article is about getting so used to pushing the limits that you forget you’re even doing it, until disaster strikes.

Foone Turing

The On-call Compensation Survey, 2019 | oncall.wtf

A couple weeks back, I linked to a survey about compensation for on-call. Here’s an analysis of the results and some raw data in case you want to tinker with it.

Chris Evans and Spike Lindsey

Faking fires: Get better incident management with practise

Learn how this company does incident management drills. They seem to handle things much like a real incident, including doing a retrospective afterward!

Tim Little — Kudos

Outages

Hosted Graphite
Salesforce
- Salesforce experienced a single-pod outage. Heroku was affected as well.
eBay
Southwest Airlines
Facebook
Crunchyroll
Hulu
Slack
- And this one too. Both contain brief, high-level descriptions of what went wrong.
Duo Security
Gmail

SRE Weekly Issue #160

lex

February 17, 2019

General

Comments

View on sreweekly.com

Articles

Operable Software

This is a long one, but trust me, it’s worth the read. My favorite part is where the author gets into mental models, hearkening back to the Stella Report.

Fred Hebert

Multi-CDN support in Mux for improved performance and reliability

When CDN outages occur, it becomes immediately clear who is using multiple CDNs and who is not.

A multi-CDN approach can be tricky to pull off, but as these folks explain, it can be critical for reliability and performance.

Scott Kidder — mUX

Full disclosure: Fastly, my employer, is mentioned.

Towards an understanding of technical debt

This article explains five different phenomena that people mean when they say “technical debt”, and advocates understanding the full context rather than just assuming the folks that came before were fools.

/thanks Greg Burek

Kellan Elliott-McCrea

How We Prepared New York Times Engineering for the Midterm Elections

The work we did to get our teams aligned and our systems in good shape meant that we were able to scale, even with some services getting 40 times the normal traffic.

Kriton Dolias and Vinessa Wan — The New York Times

@mipsytipsy on Twitter: what to alert on

How does one resolve the emerging consensus for alerting exclusively on user-visible outages, with the undeniable need to learn about and react to things +before* users notice? Like a high cache eviction rate?

There’s a real gem in here, definitely worth a read.

Charity Majors (and Liz Fong-Jones in reply)

Notes from On-call Adjacency – Honeycomb

Being on-call will always involve getting woken up occasionally. But when that does happen, it should be for something that matters, and that the on-call person can make progress toward fixing.

Rachel Perkins — Honeycomb

How we used delayed replication for disaster recovery with PostgreSQL

Delayed replication can be used as a first resort to recover from accidental data loss and lends itself perfectly to situations where the loss-inducing event is noticed within the configured delay.

Andreas Brandl — GitLab

Outages

Azure Kubernetes Service (US East)
- There’s a pretty interesting incident description in their history page.
VFEmail
- Via Twitter:
  
  At this time, the attacker has formatted all the disks on every server. Every VM is lost. Every file server is lost, every backup server is lost. NL was 100% hosted with a vastly smaller dataset. NL backups by the provideer were intact, and service should be up there.
  
  My sympathies, folks.
Slack
- Emails into slack were failing due to an expired TLS certificate.
Squarespace
- Linked is their followup post explaining more about the incident.
JPMorgan Chase
Instagram
Strava and Garmin Connect
Microsoft Windows Update
Snapchat
Sydney, AU Train Network
Lloyds Bank

SRE Weekly Issue #159

lex

February 10, 2019

General

Comments

View on sreweekly.com

A huge thanks to my awesome former coworker Greg Burek whose helpful link contributions make up fully half of this issue. Thanks, Greg!

Articles

Ironies of Automation

This paper discusses the ways in which automation of industrial processes may expand rather than eliminate problems with the human operator.

My favorite bit of irony: presenting data to the user in the manner most readily understood results in lower likelihood of remembering the data, so perhaps the most easily grasped display is not actually the best!

Lisanne Bainbridge

Laziness Does Not Exist

Like malice and incompetence, laziness should be far off our radar when we investigate an incident. I hope that reading this article opens minds about the true scope of blamelessness.

Devon Price

The New Systems Engineer

Whether or not you agree with this particular attempt at defining what a Systems Engineer (or SRE or anything related) is, it’s worth thinking about and discussing. Our field is evolving quickly, and titles are a moving target.

Matt Ouille

Behind the Lion Air Crash, a Trail of Decisions Kept Pilots in the Dark

Driven by a desire to update their 737 without causing airlines to have to retrain pilots, Boeing seemingly kept pilots in the dark about what may have been an important little detail of how the new 737 Max operates, with a tragic result.

James Glanz, Julie Creswell, Thomas Kaplan and Zach Wichter — New York Times

Questions for a new technology.

An experienced SRE will develop an innate skepticism of new technologies, even if they don’t realize it. This article provides an excellent list of questions to help articulate that skepticism when evaluating a potential design.

Kellan Elliott-McCrea

When AWS Autoscale Doesn’t

Auto-scaling isn’t all roses. Like any tool, you have to understand how it works in order to avoid the pitfalls. Read this article to learn what these folks learned the hard way.

Tyson Mote — Segment

Postmortems Part 2: How to Adopt a Learning Culture

Transitioning to a blameless culture can be difficult, especially as folks might blame each other for forgetting to be blameless!

Rachael Byrne — PagerDuty

Logs vs Structured Events

Many of the old arguments for not instrumenting code (mostly about performance) no longer apply, and a host of new arguments push toward structured events.

Charity Majors

Outages

QuadrigaCX
- Bloomberg’s title for the above-linked article says it all:
  
  Crypto CEO Dies Holding Only Passwords That Can Unlock Millions in Customer Coins
  
  QuadrigaCX ceased trading and posted a note on their front page.
Gmail
Mailchimp Mandrill
- A PostgreSQL transaction ID wraparound in a central database caused this prolonged outage on Superbowl Sunday.
Wells Fargo (bank)
Crunchyroll
Hosted Graphite
reddit

SRE Weekly Issue #163

Articles

Outages

SRE Weekly Issue #162

Articles

Outages

SRE Weekly Issue #161

Articles

Outages

SRE Weekly Issue #160

Articles

Outages

SRE Weekly Issue #159

Articles

Outages

Subscribe

RSS

Mastodon

Search Issues

General

A message from our sponsor, VictorOps:

Articles

Outages

A message from our sponsor, VictorOps:

Articles

Outages

A message from our sponsor, VictorOps:

Articles

Outages

A message from our sponsor, VictorOps:

Articles

Outages

A message from our sponsor, VictorOps:

Articles

Outages

Subscribe

RSS

Mastodon

Search Issues