Articles
An engineer’s observation of a really effective Incident Command pattern.
Dean Wilson
Here’s Lorin Hochstein’s take on the STAMP (Systems-Theoretic Accident Model and Processes) workshop he attended recently.
Lorin Hochstein
What’s the difference between Resilience Engineering and High Reliability Organizations? This paper (and excellent summary) explains.
Torgeir Haavik, Stian Antonsen, Ragnar Rosness, and Andrew Hale (original paper)
Thai Wood — Resilience Roundup (summary)
This one focuses on what I feel are really important parts of SRE, taken from the article’s subheadings:
- Vendor engineering
- Product engineering
- Sociotechnical systems engineering
- Managing the portfolio of technical investments
Charity Majors — Honeycomb
Now that’s a for-serious incident report. Nice one, folks! This is an interesting case of theory-meets-reality for disaster planning.
giles — PythonAnywhere
Outages
- Equinix
  - Equinix had a power failure in a London datacenter.
- Crunchyroll
- Deliveroo
- Google Cloud Platform
- Squarespace
- Spotify
  - Looks like it may have been an expired TLS certificate.
- G Suite