Articles
Want to nerd out on BGP? Check out how this person modeled the EVE Online universe as an 8,000-VM cluster running BGP.
Ben Cartwright-Cox
Accrued vacation time is antiquated, and “unlimited” vacation paradoxically leads employees to take less time overall. Time to enforce vacations, lest we forget that burnout is a reliability risk.
Baron Schwartz
How to avoid catastrophe: pay attention to near misses. This article makes an incredibly compelling case that we need to make a conscious effort to notice them, and explains how cognitive bias will tend to make us do the exact opposite.
Catherine H. Tinsley, Robin L. Dillon, and Peter M. Madsen — Harvard Business Review
An intro to how blame causes problems, why blamelessness is better, and how to adopt a blameless culture.
Ashar Rizqi
A 100-year-old chemical company thought they had a great safety record. Turns out that folks were just considering accidents “routine” and not reporting them.
Thai Wood (reviewing a paper by Stefanie Huber, Ivette van Wijgerden, Arjan de Witt, and Sidney W.A. Dekker)
Booking.com has 50,000 servers and many SRE squads. They developed tools they call the Reliability Collaboration Model and the Ownership Map to help them define which products SRE squads support and at what level.
Emmanuel Goossaert — Booking.com
Outages
- New Relic
- Duo Security
- Amtrak (US long-distance passenger rail)
- Amtrak had an outage of its switching system this past week. Linked above is an article with the inflammatory title, “Human error? Try abject stupidity, Amtrak”. Exercise: try to think of ways in which this is not a case of abject stupidity.
Rich Miller — Capitol Fax
- YouTube