Salesforce has posted a ton of information about their major outage two weeks ago.
It involved a change to their DNS system that, combined with an issue in the BIND daemon's shutdown process, prevented BIND from starting back up.
The analysis goes into great detail about how an engineer used the Emergency Break-Fix (EBF) process to rush out the DNS configuration change.
In this case, the engineer subverted the known policy and the appropriate disciplinary action has been taken to ensure this does not happen in the future.
Thanks to an anonymous reader for pointing this out to me.
This article calls out the heavily blame-ridden language in the above incident analysis and the briefing given by Salesforce’s Chief Reliability Officer.
I’m dismayed to see such language from a C-level executive responsible for reliability.
“For whatever reason that we don’t understand, the employee decided to do a global deployment,” Dieken went on.
Richard Speed — The Register
…and the Twittersphere agrees with me.
If you want to blame someone, maybe try blaming the “chief availability officer” who oversees a system so fragile that one action by one engineer can cause this much damage. But it’s never that simple, is it.
@ReinH on Twitter
Another really great take on the Salesforce outage follow-up.
I like how this article covers the different roles that SREs play.
Emily Arnott — Blameless
The principles covered in this article are:
- Build a hypothesis around steady-state behavior
- Vary real-world events
- Run experiments in production
- Automate experiments to run continuously
- Minimize blast radius
Casey Rosenthal — Verica
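The five principles above could be sketched as a minimal experiment harness. This is an illustrative sketch only, not code from the article; every function name here (`steady_state_metric`, `inject_fault`, and so on) is a hypothetical stand-in for real instrumentation and fault-injection tooling.

```python
import random

def steady_state_metric():
    """Hypothetical stand-in: measure current steady-state behavior,
    e.g. the success rate of a service."""
    return 0.999

def inject_fault(fault, blast_radius):
    """Hypothetical stand-in for fault injection (latency, instance loss,
    packet drops), limited to a small fraction of traffic."""
    pass

def run_experiment(fault):
    # 1. Build a hypothesis around steady-state behavior:
    #    injecting the fault should not move the metric noticeably.
    baseline = steady_state_metric()

    # 2. Vary real-world events (3. ideally in production, where it counts),
    # 5. while minimizing blast radius by targeting only 1% of traffic.
    inject_fault(fault, blast_radius=0.01)

    observed = steady_state_metric()
    return abs(observed - baseline) < 0.001  # did the hypothesis hold?

# 4. Automate experiments to run continuously, e.g. picking a random
#    fault from a catalog on a schedule.
faults = ["kill_instance", "add_latency", "drop_packets"]
result = run_experiment(random.choice(faults))
print(result)
```

The point of the structure is that the hypothesis is stated before the fault is injected, and the blast radius is an explicit parameter rather than an afterthought.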
This post is full of thought-provoking questions on the nature of configuration changes and incidents.