SRE Weekly Issue #115

Articles

Metrics like Mean Time to Detection (MTTD), Resolution (MTTR), and the like pave over all of the incredibly valuable details of the individual incidents. If you place a lot of emphasis on aggregate incident response metrics, this article may cause you to rethink your methods.

Incidents are unplanned investments. When you focus solely on shallow data you are giving up the return on those investments that you can realize by deeper and more elaborate analysis.

John Allspaw — Adaptive Capacity Labs

Look for the duct tape

Duct tape: you know, all the little shell scripts you have in your ~/bin directory that you wrote because your system’s tooling got in your way or didn’t do what you needed? Find that, according to this article, and you’ll find interesting things to work on to make the system better. I’d add that these rough edges are often also the kinds of things that contribute to incidents.

Rachel Kroll

Incident review: API and Dashboard outage on 10 October 2017

A thoughtful and detailed incident post-analysis, including an in-depth discussion of the weeks-long investigation to determine the contributing factors. The outage involved the interaction of Pacemaker and Postgres.

Chris Sinjakli , Harry Panayiotou , Lawrence Jones , Norberto Lopes and Raúl Naveiras — GoCardless

How Chaos Engineering Can Bring Stability to Your Distributed Systems

Here’s a nice overview of chaos engineering, including a mention of a tool I wasn’t aware of for applying chaos to Docker containers.

Jennifer Riggins — The New Stack

Pull doesn’t scale – or does it?

The question in the title refers to the gathering of metrics from many systems in an infrastructure. Do they push their metrics in, or should the system pull metrics from each host instead? This Prometheus author explains why they pull and how it scales.

Julius Volz — Prometheus

Zero downtime deployments with containers

A primer on achieving seamless deployments with Docker, including examples.

Jussi Nummelin — Kontena

observability – Food Fight

I had some extra time for reviewing content this week, and I took the opportunity to listen to this episode of the Food Fight podcast, with a focus on observability. The discussion is really excellent, and there are some really thought-provoking moments.

Nell Shamrell-Harrington, with Nathen Harvey, Charity Majors, and Jamie Osler

Enable your Devs to do Ops

How? By writing runbooks. This article takes you through how, why, and what tools to use as you develop runbooks for your systems.

Francesco Negri — Buildo

How Threat Stack Does DevOps (Part IV): Making Engineers Accountable

As a security-focused company, it only makes sense that Threat Stack would focus on safety when giving developers access to operate their software production.

We believe that good operations makes for good security. Reducing the scope of engineers’ access to systems reduces the noise if we ever have to investigate malicious activity.

Pete Cheslock — Threat Stack

Outages

Data Action
- Data Action is a dependency of many Australian banks.
Travis CI
S3
- Amazon S3 had a pair of outages for connections through VPC Endpoints. The Travis CI, Datadog, and New Relic outages were around the same time, but I can’t tell conclusively whether they were related.
Datadog
New Relic

SRE Weekly Issue #115

Articles

Outages

Subscribe

RSS

Mastodon

Search Issues

SPONSOR MESSAGE

Articles

Outages

Subscribe

RSS

Mastodon

Search Issues