The pyramid introduced in this article has three levels of monitoring: Operational, Data Validation, and Business Assumptions. They roughly correspond to three questions: Is the system up? Is the right amount of data flowing through it? Is that data correct?
Karel Vanden Bussche — DEV
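To make the three levels concrete, here is a minimal Python sketch of how such checks might be layered. It is not taken from the linked article: the `Batch` fields, the row-count thresholds, and the "no negative amounts" rule are all hypothetical stand-ins for whatever your own system would measure.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Batch:
    """Hypothetical snapshot of one pipeline run (all fields are assumptions)."""
    service_up: bool        # did the health check succeed?
    row_count: int          # how many rows flowed through?
    negative_amounts: int   # rows that violate an assumed business rule


def check_operational(batch: Batch) -> bool:
    # Level 1: is the system up?
    return batch.service_up


def check_data_validation(batch: Batch, min_rows: int = 100, max_rows: int = 10_000) -> bool:
    # Level 2: is the right amount of data flowing through it?
    # The bounds are illustrative defaults, not recommendations.
    return min_rows <= batch.row_count <= max_rows


def check_business_assumptions(batch: Batch) -> bool:
    # Level 3: is that data correct? Here: no negative amounts (assumed rule).
    return batch.negative_amounts == 0


def first_failing_level(batch: Batch) -> Optional[str]:
    """Run the levels bottom-up and return the first one that fails, or None."""
    levels: List[Tuple[str, Callable[[Batch], bool]]] = [
        ("operational", check_operational),
        ("data_validation", check_data_validation),
        ("business_assumptions", check_business_assumptions),
    ]
    for name, check in levels:
        if not check(batch):
            return name
    return None
```

Running the levels bottom-up means a failure at a lower level is reported first, since a data-correctness alert says little when the system is not up at all.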
Extremely powerful tools, Terraform for example, can become extremely powerful footguns.
Dave Smith — GitLab
Sure, you know what latency is, but do you really know what a percentile is? A histogram? A heatmap?
If you’re using a CDN, you need to keep an eye on it. Here’s a primer on what to watch for.
Or Hillel — DZone
This article series covers 12 aspects important in the design of reliable systems. Some of the aspects, such as modularity, loose coupling, graceful degradation, and redundancy, are covered in depth.
A couple of weeks back, GitHub was hard down, at times including even its status page. This report goes into the outage in detail, and the cause is pretty interesting.
Jakub Oleksy – GitHub
An in-depth look at different kinds of failover, including each kind’s methodology and purposes.
This one is especially interesting for the controversial and baseless conclusions popularized in the media about a supposed cause rooted in Korean culture. It’s a good reminder that we need to be careful to ensure the validity of the lessons we learn from incidents.