A short one this week as I recover from a truly heinous chest-cold. Thanks, 2017.
Articles
In this issue of Production Ready, Mathias shows how his team set up semantic monitoring. They continuously run integration tests and feed the results into their monitoring system, rather than running CI only when building new code.
[…] just because the services themselves report to be healthy doesn’t necessarily mean the integration points between them are fine too.
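For a rough picture of the idea, here is a minimal sketch of semantic monitoring, not Mathias's actual setup: a few integration checks are run and their pass/fail status and duration are turned into metric lines. The check functions, the `semantic_check.*` metric names, and the plain-text output format are all assumptions made for illustration.

```python
import time
from typing import Callable, Dict, List


def run_checks(checks: Dict[str, Callable[[], None]]) -> List[dict]:
    """Run each integration check once; a check fails if it raises."""
    results = []
    for name, check in checks.items():
        start = time.monotonic()
        try:
            check()
            ok = True
        except Exception:
            ok = False
        results.append(
            {"name": name, "ok": ok, "seconds": time.monotonic() - start}
        )
    return results


def to_metrics(results: List[dict]) -> List[str]:
    """Render results as 'metric value' lines (hypothetical format)."""
    lines = []
    for r in results:
        lines.append(f"semantic_check.{r['name']}.success {int(r['ok'])}")
        lines.append(f"semantic_check.{r['name']}.seconds {r['seconds']:.3f}")
    return lines


def check_login_then_profile() -> None:
    # Hypothetical check: a real one would call two services in sequence
    # and assert on the combined result.
    pass


def check_checkout() -> None:
    # Hypothetical check that simulates a broken integration point.
    raise RuntimeError("simulated broken integration point")


if __name__ == "__main__":
    checks = {
        "login_then_profile": check_login_then_profile,
        "checkout": check_checkout,
    }
    for line in to_metrics(run_checks(checks)):
        print(line)
```

In practice this would run on a schedule against the live system and ship the lines to whatever monitoring system is already in place, so a failing integration point alerts like any other metric.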
By “construction outage”, the headline means “a network outage due to a fiber cut that was caused by construction”. It will be interesting to see whether this suit is successful.
Recommendations for an on-call hand-off procedure. It’s geared toward using the VictorOps platform, but the main ideas apply more broadly. I like the idea of reviewing deploys as well as incidents, and of running a monthly review of hand-offs.
This article is published by my sponsor, VictorOps, but their sponsorship did not influence its inclusion in this issue.
Outages
- Battlefield 1
- PSN
- Stack Exchange
  Stack Exchange had a 12-minute outage on January 24. Click through for their postmortem, published two days later.
- United Airlines
- DirecTV Now