SRE Weekly Issue #272

A message from our sponsor, StackHawk:

See how automated security testing can change how your teams find and fix security vulnerabilities.
http://sthwk.com/security-automation

Articles

Salesforce has posted a ton of information about their major outage two weeks ago.
It involved a change to their DNS system that, combined with an issue in the BIND daemon's shutdown, prevented BIND from starting back up.

The analysis goes into great detail on the fact that an engineer used the Emergency Break-Fix (EBF) process to rush out the DNS configuration change.

In this case, the engineer subverted the known policy and the appropriate disciplinary action has been taken to ensure this does not happen in the future.

Thanks to an anonymous reader for pointing this out to me.

Salesforce

This article calls out the heavily blame-ridden language in the above incident analysis and the briefing given by Salesforce’s Chief Reliability Officer.

I’m dismayed to see such language from someone who is at the C-level for reliability.

“For whatever reason that we don’t understand, the employee decided to do a global deployment,” Dieken went on.

Richard Speed — The Register

…and the Twittersphere agrees with me.

If you want to blame someone, maybe try blaming the “chief availability officer” who oversees a system so fragile that one action by one engineer can cause this much damage. But it’s never that simple, is it.

@ReinH on Twitter

Another really great take on the Salesforce outage followup.

Lorin Hochstein

I like how this article covers the different roles that SREs play.

Emily Arnott — Blameless

The principles covered in this article are:

  • Build a hypothesis around steady-state behavior
  • Vary real-world events
  • Run experiments in production
  • Automate experiments to run continuously
  • Minimize blast radius

Casey Rosenthal — Verica

This post is full of thought-provoking questions on the nature of configuration changes and incidents.

Lorin Hochstein

Outages

  • IBM Cloud
  • Klarna
    • Klarna showed users information related to other users, as detailed in this followup post.

SRE Weekly Issue #271

A message from our sponsor, StackHawk:

Join StackHawk on Tuesday, May 25 for a hands-on authenticated security testing workshop. Follow along as we walk through three common authentication scenarios step-by-step.

Register:
http://sthwk.com/auth-workshop

Articles

Should you keep things anonymous (“an engineer”), or should you say exactly who did what? Here’s a solid argument for the latter.

Lorin Hochstein

This article explores the downsides of a design composed of independent parts, as with microservices.

Ephraim Baron

Uber designed a tool they call Blackbox to perform simulated user requests and measure availability. I was struck by the candid discussion of complexity — no one person can understand how all of Uber’s microservices go together.

Carissa Blossom — Uber

They’ve made a YAML specification and validator for expressing SLOs in a machine-readable format.

Mike Vizard — Devops.com

A new spin: this one makes the distinction between “experimental tools” that affect the state of the system, and “observability tools” that are read-only.

Brendan Gregg

“Contributing factors: moose and squirrel.”

JJ Tang — Rootly

Every once in a while, I need to pull out gdb. In times like those, it’s useful to have this kind of thing floating around in the back of my mind.

Brendon Scheinman — okcupid

Outages

SRE Weekly Issue #270

A message from our sponsor, StackHawk:

APIs are not only the backbone of modern application architecture, but they are also a key part of security. Discover what API security testing is, how it works, and get started using API security tools.
http://sthwk.com/API-security

Articles

This is an in-progress document about the kinds of patterns we see or use when designing systems. The author warned me that it’s a work in progress and maybe not ready for prime-time, but I think this is exactly the time when I should get it in front of your eyes.

I’d love your help growing this list. If you know of a name that is missing from the list please send me a tweet with the name and a short description of it and I’ll include it in the list with a link to your tweet

Mads Hartmann

Whoa, a podcast dedicated to picking apart public incident postings! I love this, because there’s a lot that’s left to shorthand, and a live conversation is a great way to flesh it out.

Tom Kleinpeter and Jamie Turner

There’s a really interesting undercurrent in this story about resilience. Nurses can catch these kinds of errors, but this is just one layered protection among many. If the system is reduced to relying on that second-layer defense, the overall resilience is diminished.

Daniel Keane — ABC News

Of course, before reaching this stage, all of the pieces are tested in isolation. But until they’re all put together, it’s almost impossible to predict the behavior of the finished product during an accident.

Mikolaj Pawlikowski

The attributes discussed are:

  • Problem solving
  • Awareness building
  • Collaboration
  • Empathy

Jayne Groll

Wait, more attributes? Oh, and by the same author, too:

  • “Great SREs have a passion for high-quality automation.”
  • “A great SRE ensures SLOs (Service Level Objectives) are set at correct boundaries of service; […]”
  • Prize Communication.
  • Look for longer-term support experience.
  • Look for a person that demonstrates empathy.

Jayne Groll

This one explores the application of SRE principles to mobile app design.

Abhijith Krishnappa

This two-part series uses a narrative case study format to show how SLOs can be misleading. You might have great numbers, but what are the numbers actually measuring?

Adam Hammond — Squadcast

Outages

  • A major US oil pipeline
    • The pipeline was targeted by a ransomware attack.
  • GasBuddy
    • This app for finding gasoline prices seems to have been impacted by a flood of user traffic driven by the US oil pipeline outage. In fact, their front page seems to be very slow for me as I write this.
  • Salesforce
    • The outage was widespread and even affected their status page.
  • eBay
  • Microsoft Outlook

SRE Weekly Issue #269

A message from our sponsor, StackHawk:

Tune into ZAPCon After Hours this Tuesday at 8 am PT to learn how to include automated security testing in your builds with ZAP.
http://sthwk.com/after-hours-3

Articles

We built Edgar to ease this burden, by empowering our users to troubleshoot distributed systems efficiently with the help of a summarized presentation of request tracing, logs, analysis, and metadata.

Kevin Lew, Maulik Pandey, Narayanan Arunachalam, Dustin Haffner, Andrei Ushakov, Seth Katz, Greg Burrell, Ram Vaithilingam, Mike Smith and Elizabeth Carretto — Netflix

The PDF covers 5 main areas:

  1. Availability
  2. Performance
  3. Monitoring
  4. Incident Response
  5. Preparation

No account required or form to fill out to download the PDF.

Splunk/VictorOps

This one’s especially interesting for the section about what MTTx metrics aren’t good for, and the following section on how to improve them.

Emily Arnott — Blameless

If you’re interested in deploying Kafka in a multi-region configuration, eBay has put quite a bit of thought into this and has a lot to share.

Engin Yoeyen — eBay

Straight from someone who was there from the start. The “what chaos engineering is not” section is especially enlightening.

Casey Rosenthal — Verica

The last paragraph regarding “unknown unknowns” is noteworthy.

Heroku

There are some great questions in here on blamelessness and full service ownership.

James Thigpen — Gremlin

Outages

SRE Weekly Issue #268

A message from our sponsor, StackHawk:

Join StackHawk on Tuesday, May 4 at 9 am PT for a hands-on technical workshop! By the end of the session, you will have three types of security testing running in your GitHub pipeline. Register:
http://sthwk.com/technical-workshop

Articles

The SRE book has a chapter covering on-call, but it’s best suited for huge-scale companies. What should the rest of us do?

Utsav Shah

If you’re feeling hesitant about chaos engineering, or you’re trying to convince someone who is, this might be useful. The myths are:

Myth #1: Chaos engineering is testing in production
Myth #2: Chaos engineering is about randomly breaking things
Myth #3: Chaos engineering is only for large, modern distributed systems
Myth #4: We don’t need more chaos – we already have plenty!
Myth #5: Chaos engineering is only for very mature teams/products

Mikolaj Pawlikowski

Drawing parallels to the high modernism movement during the Cold War, this article raises interesting questions about the direction SRE is going, and system administration in general.

Laura Nolan — USENIX

Riffing off of a tweet by Charity Majors, this article explores the idea that moving faster can actually be safer, despite an urge one may feel to slow down.

Bruce Johnston

An extreme oversimplification of this incident would be: multiple engine failure on a plane subsequent to a maintenance error on all engines. This accident is cited as a reason to have separate mechanics work on each engine, in hopes of avoiding duplicated errors.

US National Transportation Safety Board (multiple authors)

[…] in order to ship new features and improvements faster while lowering the risk in our deployments, we have a simple but powerful tool: feature flags.

Alberto Gimeno — GitHub
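
For readers new to the idea, here is a minimal sketch of a percentage-based feature flag. It is not GitHub's implementation; the function name `is_enabled`, the flag name, and the hashing scheme are all assumptions made for illustration. Each user is hashed into a stable bucket from 0 to 99, so the same user always gets the same answer while the rollout percentage is raised gradually.

```python
import hashlib


def is_enabled(flag_name: str, user_id: str, rollout_percent: int) -> bool:
    """Return True if this user falls inside the flag's rollout percentage.

    Hypothetical helper: hashing flag name plus user id gives each user a
    stable bucket in [0, 100), independent for each flag.
    """
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < rollout_percent


# Choose the code path at request time instead of at deploy time.
if is_enabled("new-search-ui", "user-1234", rollout_percent=10):
    handler = "new code path"
else:
    handler = "old code path"
print(handler)
```

Setting the percentage to 0 turns the feature off for everyone without a deploy, which is where the risk reduction comes from.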

This one blew my mind. By recording instruction execution traces in a ring buffer, they’re able to reconstruct enough information to step through the execution leading up to a crash — even though they weren’t running the application under a debugger!

Walter Erquinigo, David Carrillo-Cisneros, Alston Tang — Facebook
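
To make the ring buffer idea concrete, here is a small sketch. It is only an illustration of the data structure, not Facebook's tooling; the `TraceRing` class and the "step N" events are invented for the example. The buffer keeps a fixed number of the most recent events and silently drops older ones, so the last entries before a crash are always available at a bounded memory cost.

```python
from collections import deque


class TraceRing:
    """Fixed-size ring buffer holding only the most recent trace events."""

    def __init__(self, capacity: int) -> None:
        # A deque with maxlen discards the oldest item when it is full.
        self._events = deque(maxlen=capacity)

    def record(self, event: str) -> None:
        self._events.append(event)

    def snapshot(self) -> list:
        # Oldest surviving event first, newest last.
        return list(self._events)


ring = TraceRing(capacity=4)
for step in range(10):
    ring.record(f"step {step}")

# After a "crash", only the tail of the execution is left to inspect.
last_events = ring.snapshot()
print(last_events)  # ['step 6', 'step 7', 'step 8', 'step 9']
```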

Automation is supposed to take some of the load off of the human operator, right? But in reality, humans need to build a mental model of what the automation is doing in order to use it safely and effectively.

Shem Malmquist — WIRED

Outages

A production of Tinker Tinker Tinker, LLC