SRE WEEKLY – Page 45 – scalability, availability, incident response, automation

SRE Weekly Issue #273

lex

June 6, 2021

Articles

Incident Management vs. Incident Response

What indeed? It depends on who you ask.

Quentin Rousseau — Rootly

Cores that don’t count

This academic paper explains Google’s efforts toward identifying “mercurial” CPU cores: cores that make erroneous computations.

[…] we observe on the order of a few mercurial cores per several thousand machines […]

This one blew my mind:

A deterministic AES mis-computation, which was “self-inverting”: encrypting and decrypting on the same core yielded the identity function, but decryption elsewhere yielded gibberish.

Peter H. Hochschild, Paul Turner, Jeffrey C. Mogul, Rama Govindaraju, Parthasarathy Ranganathan, David E. Culler, and Amin Vahdat — Google
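
To make the quoted failure mode concrete, here is a toy sketch in Python. It is not AES and is not taken from the paper: `toy_cipher` and `fault_mask` are hypothetical names, and the "fault" is assumed to be a fixed bit flip applied to the key on the bad core. Because the same wrong key is used in both directions, a round trip on the faulty core looks healthy, while a healthy core cannot read the result.

```python
def toy_cipher(data: bytes, key: bytes, fault_mask: int = 0) -> bytes:
    """XOR each byte with the key; XOR is its own inverse, so this encrypts and decrypts.

    fault_mask is a hypothetical stand-in for a core that corrupts the key
    the same way every time (0 means a healthy core).
    """
    return bytes(b ^ key[i % len(key)] ^ fault_mask for i, b in enumerate(data))


KEY = b"k3y"
FAULTY = 0x04  # assumed fault: bit 2 of every key byte is flipped

plaintext = b"mercurial"
ciphertext = toy_cipher(plaintext, KEY, fault_mask=FAULTY)

# Round trip on the faulty core: the identical error cancels out.
same_core = toy_cipher(ciphertext, KEY, fault_mask=FAULTY)

# Decrypting the same ciphertext on a healthy core does not recover the input.
other_core = toy_cipher(ciphertext, KEY, fault_mask=0)

print(same_core == plaintext, other_core == plaintext)
```

The point of the sketch is only that a self-check run on the same core cannot detect this class of error; the comparison has to happen on a different core.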

Minimizing ossification risk is everyone’s responsibility

The decisions, non-decisions, and workarounds that we implement now can have lasting effects on the Internet as a whole.

Mark Nottingham — Fastly

Full disclosure: Fastly is my employer.

What is resilience engineering? A lightning talk with background information

A great intro to the topic of resilience engineering. Hint: resilience != high availability.

Piet van Dongen — Luminis Arnhem

Dealing with new kinds of trouble

When you include people in your definition of “the system”, something that looked like a system failure where humans had to “step in” is actually a success in which the system adapted.

Lorin Hochstein

Please don’t count outages (or SEVs, or whatever)

I find the way this author presented this argument especially convincing. My favorite part is the real-world story toward the end.

Rachel by the Bay

How Facebook deals with PCIe faults to keep our data centers running reliably

Facebook presents their method for finding and dealing with PCIe errors in their infrastructure.

Ashwin Poojary, Bill Holland, Makan Diarra, and Ray Park — Facebook

GitHub Availability Report: May 2021

Overflow of a 32-bit integer primary key caused a security issue.

Scott Sanders — GitHub
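
For readers who have not run into this class of bug, the sketch below shows what running out of a signed 32-bit key space looks like. It is a generic illustration, not GitHub's code or schema: `next_id_wrapping`, `next_id_checked`, and `key_space_used` are hypothetical helpers, and whether a real database wraps, errors, or truncates at the limit depends on the engine and its settings.

```python
import ctypes

INT32_MAX = 2**31 - 1  # 2,147,483,647: largest value a signed 32-bit integer can hold


def next_id_wrapping(current: int) -> int:
    """Increment as a raw signed 32-bit integer would, wrapping silently on overflow."""
    return ctypes.c_int32(current + 1).value


def next_id_checked(current: int) -> int:
    """Increment, but refuse to go past the 32-bit limit."""
    if current >= INT32_MAX:
        raise OverflowError("32-bit primary key space exhausted")
    return current + 1


def key_space_used(current: int) -> float:
    """Fraction of the positive 32-bit key space already consumed."""
    return current / INT32_MAX


print(next_id_wrapping(INT32_MAX))
print(f"{key_space_used(1_900_000_000):.0%} of the key space used")
```

Tracking something like `key_space_used` on large tables is one way to notice the limit approaching before it is reached.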

Building a Healthy On-Call Culture

This caught my eye. I’ve seldom been in an on-call rotation with shifts that were not a week or two at a time.

The optimal frequency for being on call is about three days a month.

There’s also a good discussion of paying for on-call shifts, which, in my experience, goes a long way toward making on-call more palatable.

Christine Patton — SoundCloud

Outages

HBO Max
Apple Card
Sling TV
Google Meet
GitHub
Discord
- Discord had several outages this week.

SRE Weekly Issue #272

lex

May 30, 2021

Articles

[Salesforce] Multi-Instance Service Disruption on May 11-12, 2021

Salesforce has posted a ton of information about their major outage two weeks ago. It involved a change to their DNS system combined with an issue in the BIND daemon’s shutdown that prevented it from starting back up.

The analysis goes into great detail on the fact that an engineer used the Emergency Break-Fix (EBF) process to rush out the DNS configuration change.

In this case, the engineer subverted the known policy and the appropriate disciplinary action has been taken to ensure this does not happen in the future.

Thanks to an anonymous reader for pointing this out to me.

Salesforce

That Salesforce outage: Global DNS downfall started by one engineer trying a quick fix

This article calls out the heavily blame-ridden language in the above incident analysis and the briefing given by Salesforce’s Chief Reliability Officer.

I’m dismayed to see such language from someone who is at the C-level for reliability.

“For whatever reason that we don’t understand, the employee decided to do a global deployment,” Dieken went on.

Richard Speed — The Register

@ReinH on Twitter Re: Salesforce Outage

…and the Twittersphere agrees with me.

If you want to blame someone, maybe try blaming the “chief availability officer” who oversees a system so fragile that one action by one engineer can cause this much damage. But it’s never that simple, is it.

@ReinH on Twitter

Subverting the process

Another really great take on the Salesforce outage followup.

Lorin Hochstein

Building an SRE Team? How to Hire, Assess, & Manage SREs

I like how this article covers the different roles that SREs play.

Emily Arnott — Blameless

The Advanced Principles of Chaos Engineering

The principles covered in this article are:

Build a hypothesis around steady-state behavior

Vary real-world events

Run experiments in production

Automate experiments to run continuously

Minimize blast radius

Casey Rosenthal — Verica

Why do config changes keep coming up in major incidents?

This post is full of thought-provoking questions on the nature of configuration changes and incidents.

Lorin Hochstein

Outages

IBM Cloud
Klarna
- Klarna showed users information related to other users, as detailed in this followup post.

SRE Weekly Issue #271

lex

May 23, 2021

Articles

Naming names in incident writeups

Should you keep things anonymous (“an engineer”), or should you say exactly who did what? Here’s a solid argument for the latter.

Lorin Hochstein

How Systems Complexity Reduces Uptime

This article explores the downsides to a design composed of independent parts such as with microservices.

Ephraim Baron

User Simulation for Rapid Outage Mitigation

Uber designed a tool they call Blackbox to perform simulated user requests and measure availability. I was struck by the candid discussion of complexity — no one person can understand how all of Uber’s microservices go together.

Carissa Blossom — Uber
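
The general idea of measuring availability from simulated user requests can be sketched in a few lines. This is not Uber's Blackbox, whose internals are not described here; `measure_availability` and the probe callables are hypothetical, and a real probe would issue requests against an actual user flow instead of returning a canned value.

```python
from typing import Callable, Iterable


def measure_availability(probes: Iterable[Callable[[], bool]]) -> float:
    """Run each simulated user request and return the fraction that succeeded.

    A probe counts as failed if it returns a falsy value or raises.
    """
    results = []
    for probe in probes:
        try:
            results.append(bool(probe()))
        except Exception:
            results.append(False)
    if not results:
        raise ValueError("no probes were run")
    return sum(results) / len(results)


def failing_probe() -> bool:
    # Stand-in for a simulated request that hits an outage.
    raise TimeoutError("simulated request timed out")


# Three successful simulated requests and one timeout.
probes = [lambda: True, lambda: True, lambda: True, failing_probe]
print(measure_availability(probes))
```

Measuring from the user's side like this sidesteps the need to know which of many microservices sits behind a given flow.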

Nobl9 Makes SLO Specification Open Source

They’ve made a YAML specification and validator for expressing SLOs in a machine-readable format.

Mike Vizard — Devops.com

What is Observability

A new spin: this one makes the distinction between “experimental tools” that affect the state of the system, and “observability tools” that are read-only.

Brendan Gregg

The Incident Review: 4 Odd Incidents Caused by Animals

“Contributing factors: moose and squirrel.”

JJ Tang — Rootly

When Debuggers Lie

Every once in a while, I need to pull out gdb. In times like those, it’s useful to have this kind of thing floating around in the back of my mind.

Brendon Scheinman — okcupid

Outages

Slack
Colonial Pipeline
- The same major US oil pipeline mentioned here last week is still having network issues.
Binance and Coinbase
YouTube
Sabre
- Sabre is a backend service provider used by a lot of airlines.
Azure web portal
- There’s an interesting followup post about a DNS issue.

SRE Weekly Issue #270

lex

May 16, 2021

Articles

Thundering herds, noisy neighbours, and retry storms

This is an in-progress document about the kinds of patterns we see or use when designing systems. The author warned me that it’s a work in progress and maybe not ready for prime-time, but I think this is exactly the time when I should get it in front of your eyes.

I’d love your help growing this list. If you know of a name that is missing from the list please send me a tweet with the name and a short description of it and I’ll include it in the list with a link to your tweet

Mads Hartmann
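
As one example of the kind of pattern the title refers to, a common mitigation for retry storms is exponential backoff with random jitter, so that clients that failed together do not all retry together. The sketch below is an assumption about what such an entry might look like, not something taken from the linked document; `backoff_delays` and its parameters are hypothetical.

```python
import random
from typing import List, Optional


def backoff_delays(
    attempts: int,
    base: float = 0.1,
    cap: float = 10.0,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Return one sleep duration (seconds) per retry attempt.

    The upper bound doubles each attempt up to `cap`; the actual delay is
    drawn uniformly between 0 and that bound, which spreads retries out.
    """
    rng = rng or random.Random()
    return [rng.uniform(0, min(cap, base * 2**n)) for n in range(attempts)]


# A fixed seed is used here only to make the run repeatable.
for attempt, delay in enumerate(backoff_delays(5, rng=random.Random(42))):
    print(f"retry {attempt}: sleep up to {min(10.0, 0.1 * 2**attempt):.1f}s, chose {delay:.3f}s")
```

Without the random component, every client would compute the same schedule and the retries would arrive in synchronized waves.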

The Downtime Project

Whoa, a podcast dedicated to picking apart public incident postings! I love this, because there’s a lot that’s left to shorthand, and a live conversation is a great way to flesh it out.

Tom Kleinpeter and Jamie Turner

Health boss unsure how many hospital patients were overdosed due to Windows upgrade

There’s a really interesting undercurrent in this story about resilience. Nurses can catch these kinds of errors, but this is just one layered protection among many. If the system is reduced to relying on that second-layer defense, the overall resilience is diminished.

Daniel Keane — ABC News

Have you ever seen a car crash test? That’s Chaos Engineering

Of course, before reaching this stage, all of the pieces are tested in isolation. But until they’re all put together, it’s almost impossible to predict the behavior of the finished product during an accident.

Mikolaj Pawlikowski

4 attributes of a great site reliability engineer

The attributes discussed are:

Problem solving

Awareness building

Collaboration

Empathy

Jayne Groll

How to hire Site Reliability Engineers (SREs): 5 top qualities

Wait, more attributes? Oh, and by the same author, too:

“Great SREs have a passion for high-quality automation.”

“A great SRE ensures SLOs (Service Level Objectives) are set at correct boundaries of service; […]”

Prize Communication.

Look for longer-term support experience.

Look for a person that demonstrates empathy.

Jayne Groll

Site Reliability Engineering for Native Mobile Apps

This one explores the application of SRE principles to mobile app design.

Abhijith Krishnappa

Choosing SLOs that users need, not the ones you want to provide

This two-part series uses a narrative case study format to show how SLOs can be misleading. You might have great numbers, but what are the numbers actually measuring?

Adam Hammond — Squadcast

Outages

A major US oil pipeline
- The pipeline was targeted by a ransomware attack.
GasBuddy
- This app for finding gasoline prices seems to have been impacted by a flood of user traffic driven by the US oil pipeline outage. In fact, their front page seems to be very slow for me as I write this.
Salesforce
- The outage was widespread and even affected their status page.
eBay
Microsoft Outlook

SRE Weekly Issue #269

lex

May 9, 2021

Articles

Edgar: Solving Mysteries Faster with Observability

We built Edgar to ease this burden, by empowering our users to troubleshoot distributed systems efficiently with the help of a summarized presentation of request tracing, logs, analysis, and metadata.

Kevin Lew, Maulik Pandey, Narayanan Arunachalam, Dustin Haffner, Andrei Ushakov, Seth Katz, Greg Burrell, Ram Vaithilingam, Mike Smith and Elizabeth Carretto — Netflix

The Comprehensive Site Reliability Engineering (SRE) PDF

The PDF covers 5 main areas:

Availability
Performance
Monitoring
Incident Response
Preparation

No account required or form to fill out to download the PDF.

Splunk/VictorOps

What are MTTx Metrics Good For? Let’s Find Out.

This one’s especially interesting for the section about what MTTx metrics aren’t good for, and the following section on how to improve them.

Emily Arnott — Blameless

Resiliency and Disaster Recovery with Kafka

If you’re interested in deploying Kafka in a multi-region configuration, eBay has put quite a bit of thought into this and has a lot to share.

Engin Yoeyen — eBay

What Chaos Engineering Is (and Isn’t)

Straight from someone who was there from the start. The “what chaos engineering is not” section is especially enlightening.

Casey Rosenthal — Verica

Heroku incident #2226 follow-up: Private Space apps experiencing domain to SSL cert mapping errors

The last paragraph regarding “unknown unknowns” is noteworthy.

Heroku

Failover Conf follow-up: Your team and culture questions answered!

There are some great questions in here on blamelessness and full service ownership.

James Thigpen — Gremlin

Outages

Google Cloud Platform us-west2 region
- They posted a detailed follow-up at the above link.
TikTok
Network Solutions and Register.com
Singapore Exchange (SGX)
reddit
Parler
