The first episode of this new podcast answers the question in three ways: what Google says SRE is, what the podcast host thinks it is, and how people seem to be practicing SRE.
Stephen Townsend — Slight Reliability
This aircraft accident report puts heavy emphasis on the deeper contributing factors rather than a seemingly obvious single root cause.
Google posted an incident report for the March 8 incident involving Traffic Director.
This one includes some neat graphs showing load and theoretical success rates for various strategies, such as no retries, N retries, token buckets, and circuit breakers.
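For readers who haven't met the token bucket approach to retries, here is a minimal sketch of the general idea, not the linked article's implementation. Retries spend tokens and successes slowly refill them, so a struggling backend sees retry traffic dry up. The class name, function name, and default values (`capacity`, `retry_cost`, `success_refill`) are all assumptions made for illustration.

```python
class RetryTokenBucket:
    """Retry budget: retries spend tokens, successes slowly refill them."""

    def __init__(self, capacity=10.0, retry_cost=1.0, success_refill=0.1):
        # All three defaults are illustrative, not taken from the article.
        self.capacity = capacity
        self.retry_cost = retry_cost
        self.success_refill = success_refill
        self.tokens = capacity

    def try_acquire_retry(self):
        """Spend tokens for one retry; return False if the budget is empty."""
        if self.tokens >= self.retry_cost:
            self.tokens -= self.retry_cost
            return True
        return False

    def record_success(self):
        """Refill a little on success, never exceeding capacity."""
        self.tokens = min(self.capacity, self.tokens + self.success_refill)


def call_with_retry_budget(operation, bucket, max_attempts=3):
    """Call operation(); retry on exception only while the bucket has tokens."""
    for attempt in range(1, max_attempts + 1):
        try:
            result = operation()
        except Exception:
            # Give up on the last attempt or when the retry budget is spent.
            if attempt == max_attempts or not bucket.try_acquire_retry():
                raise
        else:
            bucket.record_success()
            return result
```

With a plain "N retries" policy, every failing request is retried N times regardless of how the backend is doing. With the bucket, the retry volume is capped by recent successes, which is why the two strategies behave so differently under sustained failure.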
What if your alerting system goes down? These folks set up a dead man's switch to handle that situation.
Miedwar Meshbesher — Nanit
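The general pattern can be sketched in a few lines: the alerting system sends a regular heartbeat, and an independent watcher raises the alarm when the heartbeats stop. This is only an illustration of the idea, not how the linked article does it; the class name, method names, and the timeout are hypothetical.

```python
import time


class DeadMansSwitch:
    """Trips when no heartbeat has arrived within timeout_seconds."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        # clock is injectable so the switch can be tested without sleeping.
        self.timeout_seconds = timeout_seconds
        self._clock = clock
        self._last_heartbeat = clock()

    def heartbeat(self):
        """Called by the alerting system to signal it is still alive."""
        self._last_heartbeat = self._clock()

    def is_tripped(self):
        """True once the last heartbeat is older than the timeout."""
        return self._clock() - self._last_heartbeat > self.timeout_seconds
```

The watcher needs to run somewhere that doesn't share a failure domain with the alerting system itself, otherwise both can go quiet at the same time.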
Strategies for creating concise, efficient communication between teams during incidents and operational surprises.
[…] communications must be precise and descriptive to minimize confusion and accelerate a responder’s ability to assess and remedy the situation.
Steve Stevens — Transposit
I really love these articles about hardware errors. They’re more common than we tend to realize.
Harish Dattatraya Dixit — Facebook