What might an AWS outage look like? Try this new simulation tool to find out!
It’s not something you’ll want to use for too long (the internet is better when it works, it turns out), but it’s a view that’s well worth taking in, if only to taste the sheer scope of Amazon’s server empire.
Russell Brandom — The Verge (tool by Dhruv Mehrotra)
This article walks step by step through setting up a three-server GlusterFS cluster.
Jack Wallen — TechRepublic
My favorite part of this is the concept of vacations as a “human game day”. Can we survive without you?
Matt Stratton — PagerDuty (with Alice Goldfuss)
One question I have been seeing is “if Istio provides reliability for me, do I have to worry about it in my application?”
The answer is: abso-freakin-lutely :)
This take on the theft and crashing of an airplane in Seattle is applicable to SRE in multiple ways. It includes discussion of the incident response and some thoughts on what level of risk for extremely rare events is acceptable.
James Fallows — The Atlantic
Two funny GIFs about SRE. Full disclosure: @dbaops is my boss and this stemmed from a DM conversation between us.
@dbaops on Twitter
Coarse-grained health checks might be sufficient for orchestration systems, but they prove inadequate for ensuring quality of service and preventing cascading failures in distributed systems.
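
To make the distinction concrete, here is a minimal sketch of what a finer-grained check could look like. It is not taken from the linked article: the names `CheckResult`, `run_check`, and `health_report`, and the three-state "healthy / degraded / unhealthy" scheme, are assumptions chosen for illustration. The idea is that each dependency is probed separately, so a failing non-critical dependency shows up as "degraded" instead of being hidden behind a single up/down answer.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Set


@dataclass
class CheckResult:
    ok: bool
    latency_ms: float
    detail: str = ""


def run_check(probe: Callable[[], None]) -> CheckResult:
    """Run one probe; a raised exception counts as a failed check."""
    start = time.monotonic()
    try:
        probe()
    except Exception as exc:
        return CheckResult(False, (time.monotonic() - start) * 1000, str(exc))
    return CheckResult(True, (time.monotonic() - start) * 1000)


def health_report(probes: Dict[str, Callable[[], None]], critical: Set[str]) -> dict:
    """Probe every dependency and summarize the result.

    Hypothetical policy: any failed critical dependency means "unhealthy",
    any other failure means "degraded", and no failures means "healthy".
    """
    results = {name: run_check(probe) for name, probe in probes.items()}
    failed = {name for name, result in results.items() if not result.ok}
    if failed & critical:
        status = "unhealthy"
    elif failed:
        status = "degraded"
    else:
        status = "healthy"
    return {"status": status, "checks": results}
```

An orchestrator could still restart the instance only on "unhealthy", while a load balancer or caller might shed traffic on "degraded". Whether that split suits a given system is a design decision this sketch does not settle.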