SRE Weekly Issue #273

A message from our sponsor, StackHawk:

StackHawk is helping One Medical equip developers with automated security testing and self-service remediations. See how:
http://sthwk.com/onemedical

Articles

What indeed? It depends on who you ask.

Quentin Rousseau — Rootly

This academic paper explains Google’s efforts toward identifying “mercurial” CPU coores — cores that make erroneous computations.

[…] we observe on the order of a few mercurial cores per several thousand machines […]

This one blew my mind:

A deterministic AES mis-computation, which was “selfinverting”: encrypting and decrypting on the same core yielded the identity function, but decryption elsewhere yielded gibberish.

Peter H. Hochschild, Paul Turner, Jeffrey C. Mogul, Rama Govindaraju, Parthasarathy Ranganathan, David E. Culler, and Amin Vahdat — Google

The decisions, non-decisions, and workarounds that we implement now can have lasting effects on the Internet as a whole.

Mark Nottingham — Fastly

Full disclosure: Fastly is my employer.

A great intro to the topic of resilience engineering. Hint: resilience != high availability.

Piet van Dongen — Luminis Arnhem

When you include people in your definition of “the system”, something that looked like a system failure where humans had to “step in” is actually a success in which the system adapted.

Lorin Hochstein

I find the way this author presented this argument especially convincing. My favorite part is the real-world story toward the end.

Rachel by the Bay

Facebook presents their method for finding and dealing with PCIe errors in their infrastructure.

Ashwin Poojary, Bill Holland, Makan Diarra, and Ray Park — Facebook

Overflow of a 32-bit integer primary key caused a security issue.

Scott Sanders — GitHub

This caught my eye. I’ve seldom been in an on-call rotation with shifts that were not a week or two at a time.

The optimal frequency for being on call is about three days a month.

There’s also a good discussion of paying for on-call shifts, which, in my experience, goes a long way toward making on-call more palatable.

Christine Patton — SoundCloud

Outages

Updated: June 6, 2021 — 9:04 pm
A production of Tinker Tinker Tinker, LLC Frontier Theme