SRE Weekly Issue #216

A message from our sponsor, VictorOps:

Our people and tools need to be connected now more than ever before. That’s why VictorOps is offering free, 90-day extended Enterprise trials for on-call incident response and alert management, up to 100 users, to anyone who needs them:

https://go.victorops.com/sreweekly-extended-trials-for-incident-response

Articles

Awesome resource! In each section, they explain what to include, why to include it, and an example from their playbook.

Blake Thorne — Atlassian

I didn’t make it to Failover Conf, and it sounds like I missed a great time, so I’m especially grateful for this writeup.

Rich Burroughs — FireHydrant

And this one!

Hannah Culver — Blameless

I’m a little late with this one, sorry folks! Survey ends tomorrow, April 27.

This is an anonymous survey on the impact that COVID-19 has had on on-call teams in tech.

FireHydrant

Most post-incident review documents are written to be filed, not written to be read.

This slide deck is awesome and well worth the read.

John Allspaw — Adaptive Capacity Labs

A deep dive into the math behind anomaly detection.

Nikita Butakov — Ericsson

This article brings together thoughts on on-call work during the pandemic from folks at different companies.

Rich Burroughs — FireHydrant

A frontend engineer shares their key takeaways from their time shadowing.

Laura Montemayor — GitLab

Outages

Updated: April 26, 2020 — 9:00 pm
A production of Tinker Tinker Tinker, LLC