General

SRE Weekly Issue #154

A message from our sponsor, VictorOps:

The golden signals of SRE will help you create visibility into system health and allow you to proactively build robust services. See how you can start leveraging SRE’s golden signals today:

http://try.victorops.com/sreweekly/sre-golden-signals

Articles

Hands-down the best thing I’ve read in a while! The author draws on the work of Nancy Leveson, applying her STAMP theory to a recent incident involving a rogue NPM package that stole bitcoin wallets.

Hillel Wayne

For more on STAMP theory (Systems-Theoretic Accident Modeling and Processes), check out this academic paper by Leveson et al. It centers on a chilling case study of the E. coli poisoning of a community in Canada. While it starts off looking like a clear case of negligence, it quickly becomes apparent that an accident of this sort was nearly guaranteed to happen.

Nancy Leveson, Mirna Daouk, Nicolas Dulac, and Karen Marais

It’s pretty much as awesome as you’d expect given that title. I originally thought this was a video or audio AMA and was waiting for a recording to be posted. Instead, he answered the excellent questions in the comments, and each answer is like its own polished article.

John Allspaw (and many commenters)

My fundamental issue with being on call is that I care more about my personal life & health than I do about whether my employer’s website is operational.

I assume everyone does! So…why do we put up with on-call at all?

Required reading for anyone who’s on call or manages folks that are on call.

Sarah Mei

If you manage an SRE team or intend to start one, this article will help you understand the types of documents your team needs to write and why each type is needed, allowing you to plan for and prioritize documentation work along with other team projects.

Shylaja Nukala and Vivek Rau — ACM Queue

Outages

  • Amazon Alexa
  • Discord
  • Google Cloud
    • Followup post for an incident that occurred on December 21:

      The additional load was created by a partially-deployed new feature. A routine maintenance operation in combination with this new feature resulted in an unexpected increase in the load on the metadata store.

  • 911 emergency service in communities across the US
    • While visiting the library with my kids, my phone (and those of others around me) blew up with an emergency alert telling me that 911 service was down and that I should dial the local police directly. CenturyLink provides the infrastructure that runs emergency phone services for various areas in the US, and they had an extended outage.
  • Snapchat
  • Banner Health electronic health records

Happy new year!

Happy new year, SRE Weekly readers! No issue this week as I attempt to recover from the holidays.

Thank you all so much for reading. The past three years have been awesome, and I love all the great comments and contributions I receive from you folks.

See you next week!

SRE Weekly Issue #153

A message from our sponsor, VictorOps:

SRE teams can leverage chaos engineering, stress testing, and load testing tools to proactively build reliability into the services they build. This list of open source chaos tools can help you get started:

http://try.victorops.com/sreweekly/open-source-chaos-testing-tools

Articles

In this podcast episode, Courtney Eckhardt and the panel cover a lot of bases related to incident response, retrospectives, defensiveness, blamelessness, social justice, and tons more engrossing stuff. Well worth a listen.

Mandy Moore (summary); John K. Sawers, Sam Livingston-Gray, Jamey Hampton, and Coraline Ada Ehmke (panelists); Courtney Eckhardt (guest)

Do you wonder what effect partitioned versus unified consistency might have on latency? Do you want to know what those terms mean? Read on.

Daniel Abadi

Cape is Dropbox’s real-time event processing system. The design bits in this article have a ton of interesting detail, and I also love the part where they explain their motivation for not just using an existing queuing system.

Peng Kang — Dropbox

This is a great intro to the circuit breaker pattern if you’re unfamiliar with it, and it’s also got a lot of meaty content for folks experienced with them.

Corey Scott — Grab
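
If you haven’t met the pattern before, here’s a minimal sketch of the core idea (in Python rather than Grab’s Go, and not their implementation): after enough consecutive failures the breaker opens and fails fast, then lets a trial call through once a cooldown has elapsed.

    import time

    class CircuitOpenError(Exception):
        """Raised when the breaker is open and calls are being failed fast."""

    class CircuitBreaker:
        """Simplified, hypothetical breaker for illustration only."""

        def __init__(self, failure_threshold=5, reset_timeout=30.0):
            self.failure_threshold = failure_threshold
            self.reset_timeout = reset_timeout
            self.failures = 0
            self.opened_at = None  # None means the circuit is closed

        def call(self, fn, *args, **kwargs):
            if self.opened_at is not None:
                if time.monotonic() - self.opened_at < self.reset_timeout:
                    raise CircuitOpenError("failing fast; downstream presumed unhealthy")
                # Cooldown elapsed: half-open, allow one trial call through.
            try:
                result = fn(*args, **kwargs)
            except Exception:
                self.failures += 1
                if self.failures >= self.failure_threshold:
                    self.opened_at = time.monotonic()  # trip (or re-trip) the breaker
                raise
            self.failures = 0      # success closes the circuit again
            self.opened_at = None
            return result

Callers wrap their downstream calls with breaker.call(...) and catch CircuitOpenError to serve a fallback, which is what keeps one slow dependency from tying up every request thread.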

Though it sounds counterintuitive, more dashboards often make people less informed and less aligned.

Having a few good dashboards is important, but if you have too many, it’ll get in the way of your ability to do dynamic analysis.

Benn Stancil — Mode

What activities count as SRE work, versus “just” Operations?

Site Reliability Engineers do Operations but are not an Operations Team.

Stephen Thorne

Outages

SRE Weekly Issue #152

A message from our sponsor, VictorOps:

SRE teams can leverage automation in chat to improve incident response and make on-call suck less. Learn the ins and outs of using automated ChatOps for incident response:

http://try.victorops.com/sreweekly/automated-chatops-in-incident-response

Articles

It’s hard to summarize all the awesome here, but it boils down to empathy, collaboration, and asking, “How can I help?”. These pay dividends all over an organization, especially in reliability.

Note: Will Gallego is my coworker, although I came across this post on my own.

Will Gallego

This followup post for a Google Groups outage was (fittingly) hidden away in a Google Group.

Thanks to Jonathan Rudenberg for this one.

Now I can link directly to specific incidents! I miss the graphs, though.

Jamie Hannaford — GitHub

I laughed so hard I scared my cats:

COWORKER: we need to find the root cause asap
ME: *takes long drag* the root cause is that our processes are not robust enough to prevent a person from making this mistake
COWORKER: amy please not right now

Great discussion in the thread!

Amy Nguyen

In Air Traffic Control parlance, if a pilot or controller can’t satisfy a request, they should state that they are “unable” to comply. It can be difficult to decide in the moment what one is truly “unable” to do. There are a lot of great lessons here that apply equally well to IT incident response.

Tarrance Kramer — AVweb

The common theme at KubeCon is that SRE teams at many companies produce reliable, reusable patterns for their developers to build with.

Beth Pariseau — TechTarget

This is the story of a tenacious fight to find out what went wrong during an incident. If you read nothing else, the Conclusion section has a lot of great tidbits.

Tony Meehan — Endgame

Here’s a new guide on how to apply Restorative Just Culture. This made me laugh:

They also fail to address the systemic issues that gave rise to the harms caused, since they reduce an incident to an individual who needs to be ‘just cultured’.

Sidney Dekker — Safety Differently

Outages

SRE Weekly Issue #151

A message from our sponsor, VictorOps:

SRE teams can use synthetic monitoring and real-user monitoring to create a holistic understanding of the way their system handles stress. See how SRE teams are already implementing synthetic and real-user monitoring tools:

http://try.victorops.com/sreweekly/synthetic-and-real-user-monitoring-for-sre

Articles

They used feature flags to safely transition from a single-host service to a horizontally-scaled distributed system.

Ciaran Egan and Cian Synnott — Hosted Graphite
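
The general shape of that kind of migration is easy to sketch. Here’s a hypothetical percentage-based flag (not Hosted Graphite’s code; the flag name and write paths are made up) that routes a deterministic slice of traffic to the new distributed backend, so the rollout can be ramped up or rolled back without a deploy:

    import zlib

    # Hypothetical flag store; in a real setup this would be backed by a
    # config service so the percentage can change without a deploy.
    FLAGS = {"use_distributed_backend": 10}  # percent of traffic on the new path

    def write_to_single_host(metric, value):
        """Placeholder for the legacy single-host write path."""

    def write_to_distributed_cluster(metric, value):
        """Placeholder for the new horizontally-scaled write path."""

    def is_enabled(flag_name, routing_key):
        """Deterministically bucket a key (e.g. an account ID) into 0-99."""
        bucket = zlib.crc32(routing_key.encode("utf-8")) % 100
        return bucket < FLAGS.get(flag_name, 0)

    def handle_write(account_id, metric, value):
        if is_enabled("use_distributed_backend", account_id):
            return write_to_distributed_cluster(metric, value)  # new path
        return write_to_single_host(metric, value)               # old path

Because the bucketing is deterministic per account, a given customer always lands on the same path, which makes comparing the two backends much less confusing.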

Limits and quotas can really ruin your day, and it can be very difficult to predict limit exhaustion before a change reaches production, as we learn in this incident story from RealSelf.

Bakha Nurzhanov — RealSelf

The challenge: you have to defend against abuse to keep your service running, but the abuse detection also must not adversely impact the user experience.

Sahil Handa — LinkedIn
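
One common way to express that trade-off in code is to add friction before blocking. This toy throttle (my own illustration, not LinkedIn’s mechanism; all thresholds are arbitrary) challenges suspicious clients instead of rejecting them outright, so false positives degrade the experience rather than break it:

    import time

    class RequestThrottle:
        """Toy per-client throttle: allow, challenge, or block."""

        def __init__(self, soft_limit=50, hard_limit=200, window=60.0):
            self.soft_limit = soft_limit
            self.hard_limit = hard_limit
            self.window = window
            self.requests = {}  # client_id -> list of recent request timestamps

        def check(self, client_id):
            now = time.monotonic()
            recent = [t for t in self.requests.get(client_id, []) if now - t < self.window]
            recent.append(now)
            self.requests[client_id] = recent
            if len(recent) > self.hard_limit:
                return "block"      # almost certainly abuse
            if len(recent) > self.soft_limit:
                return "challenge"  # suspicious: add friction, don't reject
            return "allow"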

PagerDuty has developed a system for measuring on-call health, factoring in quantity of pages, time of each page, frequency, clustering of pages, etc. I love what they’re doing and I hope we see more of this in our industry.

Lisa Yang — PagerDuty
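
To make the idea concrete, here’s a toy scoring function along those lines. It is emphatically not PagerDuty’s model; the weights and thresholds are invented purely to show how page volume, off-hours pages, and clustering could feed a single number:

    from datetime import datetime

    def on_call_health(page_times, business_start=9, business_end=18):
        """Toy on-call health score: 100 is healthy, 0 is unsustainable."""
        pages = sorted(page_times)
        off_hours = sum(1 for t in pages
                        if t.hour < business_start or t.hour >= business_end)
        clustered = sum(1 for a, b in zip(pages, pages[1:])
                        if (b - a).total_seconds() < 3600)  # pages < 1h apart
        penalty = 2 * len(pages) + 5 * off_hours + 3 * clustered
        return max(0, 100 - penalty)

    # Example: three pages in a week, one at 3am and two within the same hour.
    score = on_call_health([
        datetime(2018, 12, 28, 3, 12),
        datetime(2018, 12, 28, 14, 5),
        datetime(2018, 12, 28, 14, 40),
    ])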

A summary of three outage stories from Honeycomb’s recent event. My favorite is the third:

While Google engineers had put in place procedures for ensuring bad code did not take down their servers, they hadn’t taken the same precautions with data pushes.

Alaina Valenzuela — Honeycomb

Looking at that title, I thought to myself, “Uh, because it’s better?” It’s worth a read though, because it so eloquently explains horizontal versus vertical scaling, why you’d do one or the other, and why horizontal scaling is hard.

Sean T. Allen — Wallaroo Labs

Netflix has some truly massive cache systems at a scale of hundreds of terabytes. Find out what they do to warm up new cache nodes before putting them in production.

Deva Jayaraman, Shashi Madappa, Sridhar Enugula, and Ioannis Papapanagiotou — Netflix
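
At its simplest, cache warming is just bulk-copying entries from a healthy replica into the new node before it takes traffic. The sketch below is only an illustration of that idea (plain dicts stand in for cache clients); the system Netflix describes additionally streams in parallel, throttles itself, and avoids loading replicas that are serving production requests:

    def warm_new_node(source_cache, target_cache, batch_size=1000):
        """Copy entries from a healthy replica into a new, empty cache node."""
        batch = {}
        for key, value in source_cache.items():
            batch[key] = value
            if len(batch) >= batch_size:
                target_cache.update(batch)   # one bulk write per batch
                batch = {}
        if batch:
            target_cache.update(batch)

    # Example with plain dicts standing in for cache clients:
    replica = {f"user:{i}": {"id": i} for i in range(5000)}
    new_node = {}
    warm_new_node(replica, new_node)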

This article lays out a promising plan for reducing the number of technologies your engineering department is using while still giving engineers the freedom to choose the right tool for the job.

Charity Majors

Outages

A production of Tinker Tinker Tinker, LLC