Metrics like Mean Time to Detection (MTTD), Resolution (MTTR), and the like pave over all of the incredibly valuable details of the individual incidents. If you place a lot of emphasis on aggregate incident response metrics, this article may cause you to rethink your methods.
Incidents are unplanned investments. When you focus solely on shallow data you are giving up the return on those investments that you can realize by deeper and more elaborate analysis.
John Allspaw — Adaptive Capacity Labs
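To make the "paving over" point concrete, here is a minimal sketch with made-up numbers (the durations and the `mttr` helper are hypothetical and do not come from the article). Two quarters with very different incident profiles end up with an identical MTTR:

```python
from statistics import mean

# Hypothetical per-incident resolution times in minutes (invented for illustration).
quarter_a = [30, 30, 30, 30]   # four routine incidents
quarter_b = [5, 5, 5, 105]     # three quick fixes and one long, painful outage

def mttr(durations):
    """Mean time to resolution over a list of per-incident durations."""
    return mean(durations)

# Both quarters report the same aggregate, although only one of them
# contains an incident that took nearly two hours to resolve.
print(mttr(quarter_a), mttr(quarter_b))
```

The aggregate cannot tell these two quarters apart; only looking at the individual incidents can.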
Duct tape: you know, all the little shell scripts you have in your ~/bin directory that you wrote because your system’s tooling got in your way or didn’t do what you needed? Find those, according to this article, and you’ll find interesting things to work on to make the system better. I’d add that these rough edges are often also the kinds of things that contribute to incidents.
Rachel Kroll
A thoughtful and detailed incident post-analysis, including an in-depth discussion of the weeks-long investigation to determine the contributing factors. The outage involved the interaction of Pacemaker and Postgres.
Chris Sinjakli, Harry Panayiotou, Lawrence Jones, Norberto Lopes and Raúl Naveiras — GoCardless
Here’s a nice overview of chaos engineering, including a mention of a tool I wasn’t aware of for applying chaos to Docker containers.
Jennifer Riggins — The New Stack
The question in the title refers to the gathering of metrics from many systems in an infrastructure. Do they push their metrics in, or should the system pull metrics from each host instead? This Prometheus author explains why they pull and how it scales.
Julius Volz — Prometheus
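For readers who haven't seen the pull model in action, here is a rough sketch of the general idea using only the Python standard library. It is not Prometheus's implementation: the metric name `demo_requests_total`, the handler, and the `scrape` and `demo` helpers are all assumptions made for illustration. The monitored process only exposes its current values over HTTP, and the collector decides when to fetch them:

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

REQUEST_COUNT = 0  # hypothetical application counter

class MetricsHandler(BaseHTTPRequestHandler):
    """Monitored side: exposes current values at /metrics and nothing more."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = f"demo_requests_total {REQUEST_COUNT}\n".encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet

def scrape(url):
    """Collector side: fetch one target and parse simple 'name value' lines."""
    with urllib.request.urlopen(url, timeout=2) as resp:
        text = resp.read().decode()
    samples = {}
    for line in text.splitlines():
        if line and not line.startswith("#"):
            name, value = line.rsplit(" ", 1)
            samples[name] = float(value)
    return samples

def demo():
    """Start a target on a free local port, scrape it once, then shut it down."""
    server = HTTPServer(("127.0.0.1", 0), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return scrape(f"http://127.0.0.1:{server.server_port}/metrics")
    finally:
        server.shutdown()
        server.server_close()
```

Calling `demo()` returns a dictionary of the scraped samples. One consequence of this design is that a failed scrape is itself a signal: the collector learns that the target is unreachable, which a push-based setup has to infer from silence.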
A primer on achieving seamless deployments with Docker, including examples.
Jussi Nummelin — Kontena
I had some extra time for reviewing content this week, and I took the opportunity to listen to this episode of the Food Fight podcast, with a focus on observability. The discussion is excellent, and there are some really thought-provoking moments.
Nell Shamrell-Harrington, with Nathen Harvey, Charity Majors, and Jamie Osler
How? By writing runbooks. This article takes you through how, why, and what tools to use as you develop runbooks for your systems.
Francesco Negri — Buildo
As a security-focused company, it only makes sense that Threat Stack would focus on safety when giving developers access to operate their software in production.
We believe that good operations makes for good security. Reducing the scope of engineers’ access to systems reduces the noise if we ever have to investigate malicious activity.
Pete Cheslock — Threat Stack