Articles
Two of the hardest problems of testing in production are curtailing blast radius and dealing with state. In this post, I aim to explore the topic of curtailing blast radius in more detail.
Cindy Sridharan
This team was getting paged constantly to fix failed Kafka nodes, and their outlook for the future was looking even worse. They responded by developing an auto-remediation system.
Andrey Falko — Lyft
As last week’s Boeing-related article explained, Boeing and Airbus have significantly different philosophies regarding the role of pilots vs aircraft in aviation safety. This new NTSB report strikes at the heart of that dichotomy.
Alex Davies – Wired
This is an especially interesting read because the team in question was a network operations team, and the members largely had no software engineering experience. Part of the transformation involved essentially training them for a new career.
Tom Wright — Google
My favorite part is the explanation of why observability is critical in microservice architectures.
The system is no longer in one of two states but more like one of n-factorial states.
Tyler Treat
Given that Lambda et al. auto-scale, is caching still relevant? Find out why by reading this article.
Yan Cui
Outages
- GitHub
- Repository forking operations were delayed.
- Statuspage.io
- Slack
-
Some customers are seeing an error code (“1AE32E16D91F”) when connecting to Slack.
Now I really want to know what
1AE32E16D91F
is…
-