Find out how AutoTrader deployed TLS to 3,000 vendor websites, and what they did when things went wrong despite their careful deployment strategy.
Lee Goodman — AutoTrader
An excellent short piece about incident response, using the radio recordings from an aircraft accident as a case study.
No production operation is too big or too small for a checklist. Similarly, no situation is too strenuous for one.
[…] in this new series, we’re sharing some of our internal SRE processes. This first post looks at the guidelines our SRE team follow to communicate with customers during an incident, with some practical tips, examples, and the thinking behind it all.
Fran Garcia — Hosted Graphite
Here’s why adopting a multi-cloud strategy may not do what you want, while also making your life much harder.
Last fall, I linked to a couple of talks on research in automated bugfixing. Facebook has now deployed such a system to production.
Yue Jia, Ke Mao, Mark Harman — Facebook
Microsoft’s Visual Studio Team Services (VSTS) was one of the services impacted by the major Azure outage earlier this month. Here’s an in-depth analysis of what went wrong and what they might (or might not) be able to do to prevent a similar incident.
Buck Hodges — Microsoft