SRE Weekly Issue #176

A message from our sponsor, VictorOps:

[Free Guide] VictorOps partnered with Catchpoint and came up with six actionable ways to transform your monitoring and incident response practices. See how SRE teams are being more proactive toward service reliability.

http://try.victorops.com/sreweekly/transform-monitoring-and-incident-response

Articles

[…] spans are too low-level to meaningfully be able to unearth the most valuable insights from trace data.

Find out why current distributed tracing tools fall short, and read the author’s vision for the future of distributed tracing.

Cindy Sridharan

If I wanted to introduce the concept of blameless culture to execs, this article would be a great starting point.

Rui Su — Blameless

When we look closely at post-incident artifacts, we find that they can serve a number of different purposes for different audiences.

John Allspaw — Adaptive Capacity Labs

When you meant to type /127 but entered /12 instead

Oops?
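
To get a feel for the size of that slip, here is a minimal sketch using Python's standard `ipaddress` module. A /127 only exists in IPv6, so the example assumes IPv6; the `2001:db8::` documentation prefix is a stand-in of my choosing, not the address from the linked incident.

```python
import ipaddress

# Hypothetical addresses: 2001:db8:: is the IPv6 documentation prefix,
# used here only for illustration.
intended = ipaddress.ip_network("2001:db8::/127", strict=False)
typo = ipaddress.ip_network("2001:db8::/12", strict=False)

# A /127 leaves 1 host bit; a /12 leaves 116 host bits.
print(f"intended: {intended} covers {intended.num_addresses} addresses")
print(f"typo:     {typo} covers 2**{128 - typo.prefixlen} addresses")

# The mistyped prefix swallows the intended one along with everything
# else that shares its first 12 bits.
print("typo contains intended:", intended.subnet_of(typo))
ratio = typo.num_addresses // intended.num_addresses
print("size ratio is 2**115:", ratio == 2 ** 115)
```

With `strict=False`, the host bits are masked off, so the mistyped network normalizes to a much broader prefix than the one that was typed.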

The early failure injection testing mechanisms from Chaos Monkey and friends were like acts of random vandalism. Monocle is more of an intelligent probing, seeking out any weakness a service may have.

There’s a great example of Monocle discovering a mismatched timeout between client and server and targeting it for a test.

Adrian Colyer (summary)

Basiri et al., ICSE 2019 (original paper)
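
As a rough illustration of what a "mismatched timeout" can look like, the sketch below compares a client's timeout with the server's and flags the case where the server may keep working on a request the client has already abandoned. This is one plausible form of mismatch, assumed for the example; the function name and values are hypothetical, and it is not how Monocle itself is implemented.

```python
from typing import Optional


def timeout_mismatch(client_timeout_ms: int, server_timeout_ms: int) -> Optional[str]:
    """Return a description of the mismatch, or None if the pair looks consistent.

    Assumption for this sketch: the client should be willing to wait at
    least as long as the server is allowed to work on a request.
    """
    if server_timeout_ms > client_timeout_ms:
        wasted = server_timeout_ms - client_timeout_ms
        return (
            f"client gives up after {client_timeout_ms} ms, but the server may "
            f"keep working for up to {wasted} ms more"
        )
    return None


# Hypothetical configurations, for illustration only.
print(timeout_mismatch(client_timeout_ms=1000, server_timeout_ms=4000))
print(timeout_mismatch(client_timeout_ms=4000, server_timeout_ms=1000))
```

A pair flagged this way would be a natural candidate for a targeted latency-injection test, which is the kind of probing the summary describes.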

Take the axiom of “don’t hardcode values” to an extreme, and you end up right back where you started.

Mike Hadlow

Outages

Updated: July 7, 2019 — 9:49 pm
A production of Tinker Tinker Tinker, LLC