SRE Weekly Issue #162

A message from our sponsor, VictorOps:

Ever been on-call? Then you know it can suck. Check out some of our tips and tricks to see how SRE teams are maintaining composure during a critical incident and making on-call suck less:

http://try.victorops.com/sreweekly/on-call-stress-management

Articles

Want to nerd out on BGP? Check out how this person modeled the EVE Online universe as an 8000-VM cluster running BGP.

Ben Cartwright-Cox

Accrued vacation time is antiquated, and “unlimited” vacation paradoxically leads employees to take less time overall. Time to enforce vacations, lest we forget that burnout is a reliability risk.

Baron Schwartz

How to avoid catastrophe: pay attention to near misses. This article makes an incredibly compelling point that we need to make a conscious effort to pay attention to near misses, and explains how cognitive bias will tend to make us do the exact opposite.

Catherine H. Tinsley, Robin L. Dillon, and Peter M. Madsen — Harvard Business Review

An intro to how blame causes problems, why blamelessness is better, and how to adopt a blameless culture.

Ashar Rizqi

A 100-year-old chemical company thought they had a great safety record. Turns out that folks were just considering accidents “routine” and not reporting them.

Thai Wood (reviewing a paper by Stefanie Huber, Ivette van Wijgerden, Arjan de Witt, and Sidney W.A. Dekker)

Booking.com has 50,000 servers and many SRE squads. They developed tools they call the Reliability Collaboration Model and the Ownership Map to help them define which products SRE squads support and at what level.

Emmanuel Goossaert — Booking.com

Outages

  • New Relic
  • Duo Security
  • Amtrak (US long-distance passenger rail)
    • Amtrak had an outage of its switching system this past week. Linked above is an article with the inflammatory title, “Human error? Try abject stupidity, Amtrak”. Exercise: try to think of ways in which this is not a case of abject stupidity.

      Rich Miller — Capitol Fax

  • YouTube

SRE Weekly Issue #161

A message from our sponsor, VictorOps:

Being on-call can suck. Without alert context or a collaborative system for incident response, SRE teams will have trouble appropriately responding to on-call incidents. Check out The On-Call Template to become the master of on-call and improve service reliability:

http://try.victorops.com/sreweekly/the-on-call-template

Articles

I’m not a fan of error budgets. I’ve never seen them implemented particularly well up close, though I know lots of folks who say it works for them.

I’ve started to feel a bit sour on the whole error budget thing, but I couldn’t really pin down why. This article really nails it.

Will Gallego

Will Gallego is my co-worker, although I came across this article separately.

I’m still hooked on flight accident case studies. In this one, mission fixation and indecision lead to disaster.

Air Safety Institute

If I was setting up curriculum at a university I’d make an entire semester-long class on The Challenger disaster, and make it required for any remotely STEM-oriented major.

This awesome article is about getting so used to pushing the limits that you forget you’re even doing it, until disaster strikes.

Foone Turing

A couple weeks back, I linked to a survey about compensation for on-call. Here’s an analysis of the results and some raw data in case you want to tinker with it.

Chris Evans and Spike Lindsey

Learn how this company does incident management drills. They seem to handle things much like a real incident, including doing a retrospective afterward!

Tim Little — Kudos

Outages

SRE Weekly Issue #160

A message from our sponsor, VictorOps:

Establishing an effective post-incident review process and taking the time to execute on it makes a world of difference in software reliability. See this example of a post-incident review process that’s already helping SRE teams continuously improve:

http://try.victorops.com/sreweekly/post-incident-review-process

Articles

This is a long one, but trust me, it’s worth the read. My favorite part is where the author gets into mental models, hearkening back to the Stella Report.

Fred Hebert

When CDN outages occur, it becomes immediately clear who is using multiple CDNs and who is not.

A multi-CDN approach can be tricky to pull off, but as these folks explain, it can be critical for reliability and performance.

Scott Kidder — Mux

Full disclosure: Fastly, my employer, is mentioned.

This article explains five different phenomena that people mean when they say “technical debt”, and advocates understanding the full context rather than just assuming the folks that came before were fools.

/thanks Greg Burek

Kellan Elliott-McCrea

The work we did to get our teams aligned and our systems in good shape meant that we were able to scale, even with some services getting 40 times the normal traffic.

Kriton Dolias and Vinessa Wan — The New York Times

How does one resolve the emerging consensus for alerting exclusively on user-visible outages, with the undeniable need to learn about and react to things *before* users notice? Like a high cache eviction rate?

There’s a real gem in here, definitely worth a read.

Charity Majors (and Liz Fong-Jones in reply)

Being on-call will always involve getting woken up occasionally. But when that does happen, it should be for something that matters, and that the on-call person can make progress toward fixing.

Rachel Perkins — Honeycomb

Delayed replication can be used as a first resort to recover from accidental data loss and lends itself perfectly to situations where the loss-inducing event is noticed within the configured delay.

Andreas Brandl — GitLab

Outages

SRE Weekly Issue #159

A huge thanks to my awesome former coworker Greg Burek, whose helpful link contributions make up fully half of this issue. Thanks, Greg!

A message from our sponsor, VictorOps:

Are you an SRE working with Microsoft Azure? Learn more about the key services offered in Azure and how SRE teams can leverage these tools and applications to build and deploy reliable services at a consistent pace:

http://try.victorops.com/sreweekly/microsoft-azure

Articles

This paper discusses the ways in which automation of industrial processes may expand rather than eliminate problems with the human operator.

My favorite bit of irony: presenting data to the user in the manner most readily understood results in lower likelihood of remembering the data, so perhaps the most easily grasped display is not actually the best!

Lisanne Bainbridge

Like malice and incompetence, laziness should be far off our radar when we investigate an incident. I hope that reading this article opens minds about the true scope of blamelessness.

Devon Price

Whether or not you agree with this particular attempt at defining what a Systems Engineer (or SRE or anything related) is, it’s worth thinking about and discussing. Our field is evolving quickly, and titles are a moving target.

Matt Ouille

Driven by a desire to update their 737 without causing airlines to have to retrain pilots, Boeing seemingly kept pilots in the dark about what may have been an important little detail of how the new 737 Max operates, with a tragic result.

James Glanz, Julie Creswell, Thomas Kaplan and Zach Wichter — New York Times

An experienced SRE will develop an innate skepticism of new technologies, even if they don’t realize it. This article provides an excellent list of questions to help articulate that skepticism when evaluating a potential design.

Kellan Elliott-McCrea

Auto-scaling isn’t all roses. Like any tool, you have to understand how it works in order to avoid the pitfalls. Read this article to learn what these folks learned the hard way.

Tyson Mote — Segment

Transitioning to a blameless culture can be difficult, especially as folks might blame each other for forgetting to be blameless!

Rachael Byrne — PagerDuty

Many of the old arguments for not instrumenting code (mostly about performance) no longer apply, and a host of new arguments push toward structured events.

Charity Majors

Outages

SRE Weekly Issue #158

A message from our sponsor, VictorOps:

The golden signals of SRE and monitoring help identify a great starting point for teams looking to proactively build reliability into highly integrated applications and services.

http://try.victorops.com/sreweekly/sre-golden-signals

Articles

This air traffic accident analysis is chilling to listen to, and also incredibly educational. As you listen through the conversation, it becomes more and more clear that the pilot is suffering from information overload. An Incident Commander would be wise to remember the lessons learned here.

After listening to the above recording, I got hooked and kept listening to more and more case studies. Here’s another enlightening one: “Real Pilot Story: From Miscue to Rescue”

US Air Safety Institute

PagerDuty is quickly approaching Etsy’s level of awesome incident-related articles and guides.

Rachael Byrne — PagerDuty

Retiring features and products can often be harder to do safely than deploying them in the first place.

Rachana Kumar — Etsy

Do your SLIs measure what really matters to your customers? This article discusses how to find out and what to do if they don’t.

Adrian Hilton and Yaniv Aknin — Google

Outages

A production of Tinker Tinker Tinker, LLC