
SRE Weekly Issue #275

A message from our sponsor, StackHawk:

Join ZAP Founder & Project Lead Simon Bennetts on June 30 for a live AMA where he will be answering questions on all things open source and AppSec. Register:
http://sthwk.com/Simon-AMA

Articles

Here’s a take on incident severity levels. I enjoy learning what criteria folks use for this, so please send similar articles my way (or maybe write your own?).

Nancy Chauhan — Rootly

Counterfactuals (“should haves”) stifle incident retrospectives by tempting us to stop digging deeper. This article points out that there are unending possible counterfactuals for any incident.

Michael Nygard

Read to find out why counting incidents (or “# days since an outage”) won’t help and can cause more problems than it’s worth. Also included: options for what to count instead.

incident.io

Sloth is a tool for generating SLOs as Prometheus metrics, claiming to support “any kind of service”.

Xabier Larrakoetxea
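
If you’re curious about the arithmetic such a tool encodes as Prometheus recording rules, here’s a minimal Python sketch of error-budget burn for an availability SLO. It’s illustrative only; the function name and numbers are mine, not Sloth’s actual spec or output.

    # Minimal sketch of the math behind an availability SLO, the kind of
    # calculation an SLO generator turns into Prometheus recording rules.
    def error_budget_burn(error_requests: float, total_requests: float,
                          slo_target: float) -> float:
        """Return how fast the error budget is burning (1.0 = exactly on budget)."""
        if total_requests == 0:
            return 0.0
        error_ratio = error_requests / total_requests  # observed error rate
        budget = 1.0 - slo_target                      # allowed error rate
        return error_ratio / budget

    # Example: 99.9% availability target, 25 errors out of 10,000 requests.
    print(error_budget_burn(25, 10_000, 0.999))  # 2.5: burning budget 2.5x too fast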

If you’re looking for a way to evaluate your SRE process, this might help.

Alex Bramley — Google

This article tries to put an actual number on the cost of adding more nines of reliability.

Jack Shirazi — Expedia
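
I can’t reproduce the article’s figures here, but the basic reason nines get expensive is easy to show: the downtime budget shrinks tenfold with each one. A quick Python illustration (generic arithmetic, not the article’s numbers):

    # Allowed downtime per year for each additional nine of availability.
    MINUTES_PER_YEAR = 365.25 * 24 * 60

    for nines in range(2, 6):
        availability = 1 - 10 ** -nines  # 99%, 99.9%, 99.99%, 99.999%
        downtime_min = MINUTES_PER_YEAR * (1 - availability)
        print(f"{availability:.3%} -> {downtime_min:8.1f} minutes of downtime/year")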

It’s time for Catchpoint’s yearly SRE report, downloadable in PDF form through this link. Note: you have to give them your email address.

Catchpoint

Outages

  • Akamai
    • This outage impacted banks and airlines, among other Akamai customers.

SRE Weekly Issue #274

A message from our sponsor, StackHawk:

Join the GraphQL Security Testing Learning Lab on June 29 at 9 AM PT. Learn how to run automated security testing against your GraphQL APIs so you can find and fix vulnerabilities fast.
http://sthwk.com/graphql-learning-lab

Articles

The last section suggests selling SLOs to executives by likening them to OKRs or KPIs.

Austin Parker — Devops.com

Lowe’s is a home improvement retailer in North America. I often find it fascinating to learn that a company not typically seen as part of the tech sector has a robust SRE practice.

Vivek Balivada and Rahul Mohan Kola Kandy — Lowe’s

The hallmark of sociological storytelling is if it can encourage us to put ourselves in the place of any character, not just the main hero/heroine, and imagine ourselves making similar choices.

Lorin Hochstein

This is brilliant: they apply DevOps and SRE practices to the challenging work of raising two autistic children.

Zac Nickens — USENIX ;login:

I especially like how their bot automatically pages reinforcements after folks have been on an incident for long enough to become fatigued.

Daniella Niyonkuru
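
A hypothetical sketch of that idea in Python (the threshold and the page() hook are placeholders, not the author’s actual bot):

    from datetime import datetime, timedelta, timezone

    FATIGUE_THRESHOLD = timedelta(hours=2)  # assumed threshold; tune for your team

    def maybe_page_reinforcements(incident_started_at: datetime, page) -> bool:
        """Page a fresh responder once an incident has run long enough to fatigue people."""
        elapsed = datetime.now(timezone.utc) - incident_started_at
        if elapsed >= FATIGUE_THRESHOLD:
            page("Incident has been running over 2 hours; rotating in a fresh responder.")
            return True
        return False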

Rather than measuring Mean Time To Recovery for incidents, let’s track our Mean Time To Retrospective.

Robert Ross — FireHydrant

Outages

  • Fastly
    • Fastly had a global outage of their CDN service, with widespread 5xx errors for around 40 minutes and diminished cache hit ratios afterward. Many Fastly customers experienced degradation, notably Amazon, Reddit, and GitHub.

      Fastly posted a summary shortly after the incident, describing a latent bug that was triggered by a customer’s (valid) configuration change.

      Full disclosure: Fastly is my employer.

  • Salesforce
  • Facebook, Instagram, and WhatsApp

SRE Weekly Issue #273

A message from our sponsor, StackHawk:

StackHawk is helping One Medical equip developers with automated security testing and self-service remediations. See how:
http://sthwk.com/onemedical

Articles

What indeed? It depends on who you ask.

Quentin Rousseau — Rootly

This academic paper explains Google’s efforts toward identifying “mercurial” CPU cores — cores that make erroneous computations.

[…] we observe on the order of a few mercurial cores per several thousand machines […]

This one blew my mind:

A deterministic AES mis-computation, which was “self-inverting”: encrypting and decrypting on the same core yielded the identity function, but decryption elsewhere yielded gibberish.

Peter H. Hochschild, Paul Turner, Jeffrey C. Mogul, Rama Govindaraju, Parthasarathy Ranganathan, David E. Culler, and Amin Vahdat — Google

The decisions, non-decisions, and workarounds that we implement now can have lasting effects on the Internet as a whole.

Mark Nottingham — Fastly

Full disclosure: Fastly is my employer.

A great intro to the topic of resilience engineering. Hint: resilience != high availability.

Piet van Dongen — Luminis Arnhem

When you include people in your definition of “the system”, something that looked like a system failure where humans had to “step in” is actually a success in which the system adapted.

Lorin Hochstein

I find the way this author presented this argument especially convincing. My favorite part is the real-world story toward the end.

Rachel by the Bay

Facebook presents their method for finding and dealing with PCIe errors in their infrastructure.

Ashwin Poojary, Bill Holland, Makan Diarra, and Ray Park — Facebook

Overflow of a 32-bit integer primary key caused a security issue.

Scott Sanders — GitHub
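
For anyone who hasn’t hit this failure class before, here’s a tiny illustration of what signed 32-bit overflow looks like. This is generic Python, not GitHub’s code, and it doesn’t represent the specifics of their issue:

    INT32_MAX = 2**31 - 1  # 2,147,483,647: the ceiling for a signed 32-bit id column

    def as_int32(n: int) -> int:
        """Interpret an integer as a signed 32-bit value (two's complement wrap)."""
        return (n + 2**31) % 2**32 - 2**31

    print(as_int32(INT32_MAX))      # 2147483647: the last usable id
    print(as_int32(INT32_MAX + 1))  # -2147483648: the next id wraps negative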

This caught my eye. I’ve seldom been in an on-call rotation with shifts that were not a week or two at a time.

The optimal frequency for being on call is about three days a month.

There’s also a good discussion of paying for on-call shifts, which, in my experience, goes a long way toward making on-call more palatable.

Christine Patton — SoundCloud

Outages

SRE Weekly Issue #272

A message from our sponsor, StackHawk:

See how automated security testing can change how your teams find and fix security vulnerabilities.
http://sthwk.com/security-automation

Articles

Salesforce has posted a ton of information about their major outage two weeks ago.
It involved a DNS configuration change that, combined with a bug in the BIND daemon’s shutdown process, prevented it from starting back up.

The analysis goes into great detail on the fact that an engineer used the Emergency Break-Fix (EBF) process to rush out the DNS configuration change.

In this case, the engineer subverted the known policy and the appropriate disciplinary action has been taken to ensure this does not happen in the future.

Thanks to an anonymous reader for pointing this out to me.

Salesforce

This article calls out the heavily blame-ridden language in the above incident analysis and the briefing given by Salesforce’s Chief Availability Officer.

I’m dismayed to see such language from someone who is at the C-level for reliability.

“For whatever reason that we don’t understand, the employee decided to do a global deployment,” Dieken went on.

Richard Speed — The Register

…and the Twittersphere agrees with me.

If you want to blame someone, maybe try blaming the “chief availability officer” who oversees a system so fragile that one action by one engineer can cause this much damage. But it’s never that simple, is it.

@ReinH on Twitter

Another really great take on the Salesforce outage followup.

Lorin Hochstein

I like how this article covers the different roles that SREs play.

Emily Arnott — Blameless

The principles covered in this article are:

  • Build a hypothesis around steady-state behavior
  • Vary real-world events
  • Run experiments in production
  • Automate experiments to run continuously
  • Minimize blast radius

Casey Rosenthal — Verica
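
Here’s a minimal, hypothetical sketch of the experiment loop those principles imply. The hooks (check_steady_state, inject_fault, remove_fault) are placeholders, not any particular chaos tool’s API:

    import random

    def run_experiment(check_steady_state, inject_fault, remove_fault,
                       blast_radius: float = 0.01) -> bool:
        """Return True if the steady-state hypothesis held under the injected fault."""
        assert check_steady_state(), "System not at steady state; refusing to experiment"

        # Minimize blast radius: only a small fraction of runs actually inject a fault.
        if random.random() > blast_radius:
            return True

        inject_fault()                   # vary a real-world event (e.g., kill an instance)
        try:
            return check_steady_state()  # hypothesis: steady state survives the fault
        finally:
            remove_fault()               # always clean up, even if the check fails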

This post is full of thought-provoking questions on the nature of configuration changes and incidents.

Lorin Hochstein

Outages

  • IBM Cloud
  • Klarna
    • Klarna showed users information related to other users, as detailed in this followup post.

SRE Weekly Issue #271

A message from our sponsor, StackHawk:

Join StackHawk on Tuesday, May 25 for a hands-on authenticated security testing workshop. Follow along as we walk through three common authentication scenarios step-by-step.

Register:
http://sthwk.com/auth-workshop

Articles

Should you keep things anonymous (“an engineer”), or should you say exactly who did what? Here’s a solid argument for the latter.

Lorin Hochstein

This article explores the downsides of a design composed of independent parts, such as microservices.

Ephraim Baron

Uber designed a tool they call Blackbox to perform simulated user requests and measure availability. I was struck by the candid discussion of complexity — no one person can understand how all of Uber’s microservices go together.

Carissa Blossom — Uber

They’ve made a YAML specification and validator for expressing SLOs in a machine-readable format.

Mike Vizard — Devops.com

A new spin: this one makes the distinction between “experimental tools” that affect the state of the system, and “observability tools” that are read-only.

Brendan Gregg

“Contributing factors: moose and squirrel.”

JJ Tang — Rootly

Every once in a while, I need to pull out gdb. In times like those, it’s useful to have this kind of thing floating around in the back of my mind.

Brendon Scheinman — OkCupid

Outages
