SRE Weekly Issue #139


SRE teams need to prepare for incidents. Maintain high levels of uptime, prepare for downtime, and create more reliable services by optimizing incident detection, response, and remediation workflows.


Find out how AutoTrader deployed TLS to 3000 vendor websites, and what they did when things went wrong despite their careful deployment strategy.

Lee Goodman — AutoTrader

An excellent short piece about incident response, using the radio recordings from an aircraft accident as a case study.

Sri Ray

No production operation is too big or too small for a checklist. Similarly, no situation is too strenuous for one.

Sri Ray

[…] in this new series, we’re sharing some of our internal SRE processes. This first post looks at the guidelines our SRE team follow to communicate with customers during an incident, with some practical tips, examples, and the thinking behind it all.

Fran Garcia — Hosted Graphite

Here’s why adopting a multi-cloud strategy may not do what you want, while also making your life much harder.

Tyler Treat

Last fall, I linked to a couple of talks on research into automated bug fixing. Facebook has now deployed such a system to production.

Yue Jia, Ke Mao, Mark Harman — Facebook

Microsoft’s Visual Studio Team Services (VSTS) was one of the services impacted by the major Azure outage earlier this month. Here’s an in-depth analysis of what went wrong and what they might (or might not) be able to do to prevent a similar incident.

Buck Hodges — Microsoft


Updated: September 16, 2018 — 4:16 pm