What indeed? It depends on who you ask.
Quentin Rousseau — Rootly
This academic paper explains Google’s efforts toward identifying “mercurial” CPU cores: cores that produce erroneous computations.
[…] we observe on the order of a few mercurial cores per several thousand machines […]
This one blew my mind:
A deterministic AES mis-computation, which was “self-inverting”: encrypting and decrypting on the same core yielded the identity function, but decryption elsewhere yielded gibberish.
Peter H. Hochschild, Paul Turner, Jeffrey C. Mogul, Rama Govindaraju, Parthasarathy Ranganathan, David E. Culler, and Amin Vahdat — Google
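To see why a “self-inverting” fault is so hard to catch, here is a toy sketch. It is not AES and not the paper’s method: a XOR “cipher” stands in for the real algorithm, and the hypothetical `FAULTY_FLIP` constant simulates a core that deterministically corrupts one bit. A round trip on the faulty core looks healthy, and only decrypting on a different core exposes the problem.

```python
# Toy model only: a XOR "cipher" stands in for AES, and the fault is simulated.
HEALTHY_FLIP = 0x00
FAULTY_FLIP = 0x20  # hypothetical bit that the defective core always corrupts


def xor_cipher(data: bytes, key: bytes, core_flip: int) -> bytes:
    """XOR each byte with the key; a faulty core also XORs in core_flip."""
    return bytes(b ^ key[i % len(key)] ^ core_flip for i, b in enumerate(data))


plaintext = b"hello"
key = b"k3y"

# Encrypt on the simulated faulty core.
ciphertext_bad = xor_cipher(plaintext, key, FAULTY_FLIP)

# Decrypting on the same core cancels the fault, so a self-check passes.
same_core = xor_cipher(ciphertext_bad, key, FAULTY_FLIP)

# Decrypting on a healthy core leaves the corruption in place.
other_core = xor_cipher(ciphertext_bad, key, HEALTHY_FLIP)

print(same_core == plaintext)   # True
print(other_core == plaintext)  # False
```

The takeaway from the sketch is that a round-trip test run entirely on one core cannot detect this class of fault; the check has to cross cores or machines.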
The decisions, non-decisions, and workarounds that we implement now can have lasting effects on the Internet as a whole.
Mark Nottingham — Fastly
Full disclosure: Fastly is my employer.
A great intro to the topic of resilience engineering. Hint: resilience != high availability.
Piet van Dongen — Luminis Arnhem
When you include people in your definition of “the system”, something that looked like a system failure where humans had to “step in” is actually a success in which the system adapted.
I find the way this author presented this argument especially convincing. My favorite part is the real-world story toward the end.
Rachel by the Bay
Facebook presents their method for finding and dealing with PCIe errors in their infrastructure.
Ashwin Poojary, Bill Holland, Makan Diarra, and Ray Park — Facebook
Overflow of a 32-bit integer primary key caused a security issue.
Scott Sanders — GitHub
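As a rough illustration of the general failure mode (not a reconstruction of this incident, whose details are in the linked post), the sketch below shows what happens when an ID counter exceeds the signed 32-bit range and is silently wrapped. The silent wrap is an assumption for illustration: many databases reject the insert with an error instead. The helper names `next_id_int32` and `headroom` are hypothetical.

```python
import ctypes

INT32_MAX = 2**31 - 1  # largest value a signed 32-bit integer can hold


def next_id_int32(current: int) -> int:
    """Increment an ID as a silently wrapping signed 32-bit field would."""
    return ctypes.c_int32(current + 1).value


def headroom(current: int) -> float:
    """Fraction of the signed 32-bit ID space still unused."""
    return 1 - current / INT32_MAX


wrapped = next_id_int32(INT32_MAX)
print(wrapped)  # -2147483648

# A simple guard: alert well before the ID space runs out.
if headroom(2_000_000_000) < 0.10:
    print("less than 10% of the 32-bit ID space remains")
```

Monitoring headroom on auto-increment columns, or migrating to 64-bit keys ahead of time, is one common way to avoid discovering the limit in production.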
This caught my eye. I’ve seldom been in an on-call rotation with shifts that were not a week or two at a time.
The optimal frequency for being on call is about three days a month.
There’s also a good discussion of paying for on-call shifts, which, in my experience, goes a long way toward making on-call more palatable.
Christine Patton — SoundCloud