Articles
Best article about post-incident investigations that I’ve seen in a while. My favorite part is the recommendation not to use a template for the retrospective, as it will artificially narrow the scope of the investigation.
Ryan Frantz
These folks have set up a survey to gather information on whether and how folks are compensated for on-call in IT. This topic has been gaining traction over the past couple of years, and I can’t wait to see the results of the survey. Please take a moment to fill it out.
Chris Evans and Spike Lindsey
I’ll be speaking at SRECon19 Americas this March with my former coworker, Courtney Eckhardt. The talk lineup looks incredible and I’m really excited to go!
If you’re going to be there, drop me an email (I’m terrible at Twitter) and let me know. I’ll have lots of swag available, made with 100% open source software (Ink/Stitch and inkscape-silhouette).
Especially useful for folks new to on-call.
If you only take one thing away from this post, it’s that you need to put your own well-being first. Once you do that, other aspects of on-call will become easier.
Dave Fennell — Hosted Graphite
I have to admit I wasn’t clear on two-phase commit before I read this. Now I know what it’s all about — and its drawbacks.
Daniel Abadi
This guide from Google describes the qualities and practices of SRE teams at various levels, from beginner to advanced.
Gustavo Franco — Google
A good intro if you’re new around here.
Sylvia Fronczak — Scalyr
Outages
- Slack
- Greenhouse.io
- Microsoft Office 365
- CenturyLink explains end-of-year outage
  - Here are some details on the CenturyLink outage that took down 911 emergency services across portions of the US in late December.
- Passport Canada
  - Canada was unable to process passports during the outage.