SRE Weekly Issue #139

Articles

Find out how AutoTrader deployed TLS to 3000 vendor websites, and what they did when things went wrong despite their careful deployment strategy.

Lee Goodman — AutoTrader

Incident Response – Speedbird 38

An excellent short piece about incident response, using the radio recordings from an aircraft accident as a case study.

Sri Ray

checklists: an operational gift

No production operation is too big or too small for a checklist. Similarly, no situation is too strenuous for one.

Sri Ray

How to write a status page update

[…] in this new series, we’re sharing some of our internal SRE processes. This first post looks at the guidelines our SRE team follow to communicate with customers during an incident, with some practical tips, examples, and the thinking behind it all.

Fran Garcia — Hosted Graphite

Multi-Cloud Is a Trap

Here’s why adopting a multi-cloud strategy may not do what you want, while also making your life much harder.

Tyler Treat

Finding and fixing software bugs automatically with SapFix and Sapienz

Last fall, I linked to a couple of talks on research in automated bugfixing. Facebook has now deployed such a system to production.

Yue Jia, Ke Mao, Mark Harman — Facebook

Postmortem: VSTS 4 September 2018

Microsoft’s Visual Studio Team Services (VSTS) was one of the services impacted by the major Azure outage earlier this month. Here’s an in-depth analysis of what went wrong and what they might (or might not) be able to do to prevent a similar incident.

Buck Hodges — Microsoft

Outages

GitHub
Travis CI
- Also this one.
Twitch
Xero
- Xero experienced an outage this week and posted this article explaining what went wrong.
  Tony Stewart — Xero
