SRE Weekly Issue #115

SPONSOR MESSAGE

SREcon addresses engineering resilience, reliability, and performance in complex distributed systems. Join us to grab Jason Hand’s new SRE book, and attend a book signing w/ Nicole Forsgren and Jez Humble. March 27-29. http://try.victorops.com/SREWeekly/SREcon

Articles

Metrics like Mean Time to Detection (MTTD), Resolution (MTTR), and the like pave over all of the incredibly valuable details of the individual incidents. If you place a lot of emphasis on aggregate incident response metrics, this article may cause you to rethink your methods.

Incidents are unplanned investments. When you focus solely on shallow data you are giving up the return on those investments that you can realize by deeper and more elaborate analysis.

John Allspaw — Adaptive Capacity Labs

Duct tape: you know, all the little shell scripts you have in your ~/bin directory that you wrote because your system’s tooling got in your way or didn’t do what you needed? Find that, according to this article, and you’ll find interesting things to work on to make the system better. I’d add that these rough edges are often also the kinds of things that contribute to incidents.

Rachel Kroll

A thoughtful and detailed incident post-analysis, including an in-depth discussion of the weeks-long investigation to determine the contributing factors. The outage involved the interaction of Pacemaker and Postgres.

Chris Sinjakli , Harry Panayiotou , Lawrence Jones , Norberto Lopes and Raúl Naveiras — GoCardless

Here’s a nice overview of chaos engineering, including a mention of a tool I wasn’t aware of for applying chaos to Docker containers.

Jennifer Riggins — The New Stack

The question in the title refers to the gathering of metrics from many systems in an infrastructure. Do they push their metrics in, or should the system pull metrics from each host instead? This Prometheus author explains why they pull and how it scales.

Julius Volz — Prometheus

A primer on achieving seamless deployments with Docker, including examples.

Jussi Nummelin — Kontena

I had some extra time for reviewing content this week, and I took the opportunity to listen to this episode of the Food Fight podcast, with a focus on observability. The discussion is really excellent, and there are some really thought-provoking moments.

Nell Shamrell-Harrington, with Nathen Harvey, Charity Majors, and Jamie Osler

How? By writing runbooks. This article takes you through how, why, and what tools to use as you develop runbooks for your systems.

Francesco Negri — Buildo

As a security-focused company, it only makes sense that Threat Stack would focus on safety when giving developers access to operate their software production.

We believe that good operations makes for good security. Reducing the scope of engineers’ access to systems reduces the noise if we ever have to investigate malicious activity.

Pete Cheslock — Threat Stack

Outages

  • Data Action
    • Data Action is a dependency of many Australian banks.
  • Travis CI
  • S3
    • Amazon S3 had a pair of outages for connections through VPC Endpoints. The Travis CI, Datadog, and New Relic outages were around the same time, but I can’t tell conclusively whether they were related.
  • Datadog
  • New Relic
Updated: March 25, 2018 — 9:08 pm
A production of Tinker Tinker Tinker, LLC Frontier Theme