SRE Weekly Issue #57

A short one this week as I recover from a truly heinous chest cold. Thanks, 2017.

SPONSOR MESSAGE

“The How and Why of Minimum Viable Runbooks.” Get the free ebook from VictorOps.

Articles

In this issue of Production Ready, Mathias shows how his team set up semantic monitoring. They continuously run integration tests and feed the results into their monitoring system, rather than running CI only when building new code.

[…] just because the services themselves report to be healthy doesn’t necessarily mean the integration points between them are fine too.
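For a rough feel of the idea, here is a minimal sketch in Python. It is not Mathias's actual setup: the check functions, the `semantic_check.*` metric names, and the `to_metrics` helper are all hypothetical, and shipping the resulting pairs to a real monitoring system is left out. It only shows the shape of the technique: run integration checks on a schedule and turn each result into a metric you can alert on.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class CheckResult:
    name: str
    passed: bool
    duration_seconds: float
    error: str = ""


def run_check(name: str, check: Callable[[], None]) -> CheckResult:
    """Run one integration check; any raised exception counts as a failure."""
    start = time.monotonic()
    try:
        check()
        return CheckResult(name, True, time.monotonic() - start)
    except Exception as exc:
        return CheckResult(name, False, time.monotonic() - start, str(exc))


def run_all(checks: Dict[str, Callable[[], None]]) -> List[CheckResult]:
    """Run every registered check once, in registration order."""
    return [run_check(name, check) for name, check in checks.items()]


def to_metrics(results: List[CheckResult]) -> List[Tuple[str, float]]:
    """Flatten results into (metric_name, value) pairs (hypothetical naming scheme)."""
    metrics: List[Tuple[str, float]] = []
    for result in results:
        prefix = f"semantic_check.{result.name}"
        metrics.append((f"{prefix}.success", 1.0 if result.passed else 0.0))
        metrics.append((f"{prefix}.duration_seconds", result.duration_seconds))
    return metrics


def passing_check() -> None:
    """Stand-in for a real end-to-end test, e.g. 'sign up, then log in'."""


def failing_check() -> None:
    """Stand-in for a broken integration point between two healthy services."""
    raise AssertionError("order placed but never visible in order history")


if __name__ == "__main__":
    checks = {"signup": passing_check, "checkout": failing_check}
    # In a real deployment this would run on a timer (cron, a scheduler, a loop
    # with a sleep) and the pairs would be sent to the monitoring system.
    for metric_name, value in to_metrics(run_all(checks)):
        print(metric_name, value)
```

The point of emitting a success gauge per check, rather than a single pass/fail for the whole suite, is that an alert can name the integration point that broke.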

By “construction outage”, the headline means “a network outage due to a fiber cut that was caused by construction”. It will be interesting to see whether this suit is successful.

Recommendations for an on-call hand-off procedure. It’s geared toward the VictorOps platform, but the main ideas apply more broadly. I like the ideas of reviewing deploys as well as incidents and of running a monthly review of hand-offs.

This article is published by my sponsor, VictorOps, but their sponsorship did not influence its inclusion in this issue.

Outages

Updated: January 29, 2017 — 10:41 pm
A production of Tinker Tinker Tinker, LLC