Lots of details about how Slack does incident response in this one.
Stephen Whitworth — incident.io
This list also gives an interesting insight into the way this company does SRE.
Mayank Gupta and Merlyn Shelley — Squadcast
Oh BGP, you rascally little routing protocol.
Alessandro Improta and Luca Sani — Catchpoint
A comprehensive definition of Site Reliability Engineering and of SREs, including what SREs do and what sets them apart from other roles.
The article covers various facets of SRE and acknowledges that SREs can perform many roles.
JJ Tang — Rootly
Another excellent air accident story, with plenty of discussion of mental models and confirmation bias. The crew saw many disparate indications, none of which pointed to anything in particular or seemed like a serious problem on its own. That, coupled with confirmation bias, led them to miss what might seem obvious in hindsight.