SRE Weekly Issue #273

Articles

Incident Management vs. Incident Response

What indeed? It depends on who you ask.

Quentin Rousseau — Rootly

This academic paper explains Google’s efforts toward identifying “mercurial” CPU cores: cores that make erroneous computations.

[…] we observe on the order of a few mercurial cores per several thousand machines […]

This one blew my mind:

A deterministic AES mis-computation, which was “self-inverting”: encrypting and decrypting on the same core yielded the identity function, but decryption elsewhere yielded gibberish.

Peter H. Hochschild, Paul Turner, Jeffrey C. Mogul, Rama Govindaraju, Parthasarathy Ranganathan, David E. Culler, and Amin Vahdat — Google
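To make the "self-inverting" behavior concrete, here is a toy sketch. It is not AES and not the mechanism described in the paper: the `Core` class, the `fault_mask` parameter, and the hash-based keystream are all hypothetical, chosen only to show how a deterministic fault can cancel itself out on one core and still corrupt data everywhere else.

```python
import hashlib


def keystream(key: bytes, n: int) -> bytes:
    """Derive n bytes from key. Toy construction for illustration, not secure."""
    out = b""
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:n]


class Core:
    """Hypothetical CPU core. A non-zero fault_mask models a deterministic
    mis-computation that corrupts every keystream byte in the same way."""

    def __init__(self, fault_mask: int = 0):
        self.fault_mask = fault_mask

    def encrypt(self, key: bytes, data: bytes) -> bytes:
        stream = keystream(key, len(data))
        # XOR with the (possibly corrupted) keystream.
        return bytes(d ^ s ^ self.fault_mask for d, s in zip(data, stream))

    # XOR stream cipher: decryption is the same operation as encryption.
    decrypt = encrypt


key = b"example-key"
message = b"attack at dawn"

healthy = Core()
faulty = Core(fault_mask=0x5A)

ciphertext = faulty.encrypt(key, message)

# Round trip on the faulty core: the fault is applied twice and cancels out.
same_core_ok = faulty.decrypt(key, ciphertext) == message

# Decrypting elsewhere: the fault is applied only once, so the result is wrong.
other_core_ok = healthy.decrypt(key, ciphertext) == message

print(same_core_ok, other_core_ok)
```

In this sketch, an encrypt-then-decrypt self-test on the faulty core passes, so the problem only shows up when the result is checked on a different core.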

Minimizing ossification risk is everyone’s responsibility

The decisions, non-decisions, and workarounds that we implement now can have lasting effects on the Internet as a whole.

Mark Nottingham — Fastly

Full disclosure: Fastly is my employer.

What is resilience engineering? A lightning talk with background information

A great intro to the topic of resilience engineering. Hint: resilience != high availability.

Piet van Dongen — Luminis Arnhem

Dealing with new kinds of trouble

When you include people in your definition of “the system”, something that looked like a system failure where humans had to “step in” is actually a success in which the system adapted.

Lorin Hochstein

Please don’t count outages (or SEVs, or whatever)

I find the way this author presented this argument especially convincing. My favorite part is the real-world story toward the end.

Rachel by the Bay

How Facebook deals with PCIe faults to keep our data centers running reliably

Facebook presents their method for finding and dealing with PCIe errors in their infrastructure.

Ashwin Poojary, Bill Holland, Makan Diarra, and Ray Park — Facebook

GitHub Availability Report: May 2021

Overflow of a 32-bit integer primary key caused a security issue.

Scott Sanders — GitHub
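For readers who have not hit this class of bug, a minimal sketch of the arithmetic follows. It says nothing about GitHub's schema or the specifics of the incident; `wrap_int32` and `next_id` are hypothetical helpers that only show where a signed 32-bit key runs out and what a silent wrap-around looks like.

```python
INT32_MAX = 2**31 - 1  # largest signed 32-bit value: 2147483647


def wrap_int32(n: int) -> int:
    """Interpret n as a two's-complement signed 32-bit integer."""
    n &= 0xFFFFFFFF
    return n - 2**32 if n >= 2**31 else n


def next_id(current: int) -> int:
    """Hypothetical ID allocator that refuses to overflow silently."""
    candidate = current + 1
    if candidate > INT32_MAX:
        raise OverflowError("32-bit primary key space exhausted")
    return candidate


# Silent wrap-around: one past the maximum becomes a large negative number.
wrapped = wrap_int32(INT32_MAX + 1)
print(wrapped)

# The guarded allocator fails loudly instead.
try:
    next_id(INT32_MAX)
    exhausted = False
except OverflowError:
    exhausted = True
print(exhausted)
```

Whether a system wraps, truncates, or raises an error at this boundary depends on the database and the application layers involved, which is part of why such overflows can have surprising consequences.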

Building a Healthy On-Call Culture

This caught my eye. I’ve seldom been in an on-call rotation with shifts that were not a week or two at a time.

The optimal frequency for being on call is about three days a month.

There’s also a good discussion of paying for on-call shifts, which, in my experience, goes a long way toward making on-call more palatable.

Christine Patton — SoundCloud

Outages

HBO Max
Apple Card
Sling TV
Google Meet
GitHub
Discord
- Discord had several outages this week.
