SRE Weekly Issue #272

A message from our sponsor, StackHawk:

See how automated security testing can change how your teams find and fix security vulnerabilities.
http://sthwk.com/security-automation

Articles

Salesforce has posted a ton of information about their major outage two weeks ago.
It involved a change to their DNS system that combined with an issue in BIND daemon shutdown that prevented it from starting back up.

The analysis goes into great detail on the fact that an engineer used the Emergency Break-Fix (EBF) process to rush out the DNS configuration change.

In this case, the engineer subverted the known policy and the appropriate disciplinary action has been taken to ensure this does not happen in the future.

Thanks to an anonymous reader for pointing this out to me.

Salesforce

This article calls out the heavily blame-ridden language in the above incident analysis and the briefing given by Salesforce’s Chief Reliability Officer.

I’m dismayed to see such language from someone who is at the C-level for reliability.

“For whatever reason that we don’t understand, the employee decided to do a global deployment,” Dieken went on.

Richard Speed — The Register

…and the Twittersphere agrees with me.

If you want to blame someone, maybe try blaming the “chief availability officer” who oversees a system so fragile that one action by one engineer can cause this much damage. But it’s never that simple, is it.

@ReinH on Twitter

Another really great take on the Salesforce outage followup.

Lorin Hochstein

I like how this article covers the different roles that SREs play.

Emily Arnott — Blameless

The principles covered in this article are:

  • Build a hypothesis around steady-state behavior
  • Vary real-world events
  • Run experiments in production
  • Automate experiments to run continuously
  • Minimize blast radius

Casey Rosenthal — Verica

This post is full of thought-provoking questions on the nature of configuration changes and incidents.

Lorin Hochstein

Outages

  • IBM Cloud
  • Klarna
    • Klarna showed users information related to other users, as detailed in this followup post.
Updated: May 30, 2021 — 9:58 pm
A production of Tinker Tinker Tinker, LLC Frontier Theme