[…] spans are too low-level to meaningfully be able to unearth the most valuable insights from trace data.
Find out why current distributed tracing tools fall short, and read the author’s vision for the future of distributed tracing.
If I wanted to introduce the concept of blameless culture to execs, this article would be a great starting point.
Rui Su — Blameless
When we look closely at post-incident artifacts, we find that they can serve a number of different purposes for different audiences.
John Allspaw — Adaptive Capacity Labs
When you meant to type /127 but entered /12 instead
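To get a feel for how far apart those two prefix lengths are, here is a minimal sketch using Python's standard `ipaddress` module. A /127 only exists in IPv6, so the example assumes IPv6. The `2001:db8::` documentation prefix is a placeholder chosen for illustration, not the address from the linked incident.

```python
import ipaddress

# Hypothetical prefix for illustration only (IPv6 documentation range).
intended = ipaddress.ip_network("2001:db8::/127")

# strict=False masks off the host bits, as a router would when
# interpreting the shorter prefix length.
typo = ipaddress.ip_network("2001:db8::/12", strict=False)

print(intended, "covers", intended.num_addresses, "addresses")
print(typo, "covers 2 **", typo.num_addresses.bit_length() - 1, "addresses")
```

The intended /127 covers a two-address point-to-point link, while the /12 covers 2^116 addresses, so one dropped digit turns a single link into a large slice of the IPv6 address space.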
The early failure injection testing mechanisms from Chaos Monkey and friends were like acts of random vandalism. Monocle is more of an intelligent probing, seeking out any weakness a service may have.
There’s a great example of Monocle discovering a mismatched timeout between client and server and targeting it for a test.
Adrian Colyer (summary)
Basiri et al., ICSE 2019 (original paper)
Take the axiom of “don’t hardcode values” to an extreme, and you end up right back where you started.
- Cloudflare suffered a massive outage, returning 502 responses for over 80% of traffic for over 20 minutes. Linked above is their analysis. A tweet thread involving their CEO is also illuminating.
- Google Maps
- Azure suffered an outage in San Jose, CA, USA on July 2.