In case you, like me, weren’t familiar with the Saga pattern: it’s essentially a pseudo-transaction spanning multiple microservices, with each step paired with a compensating action to undo it. Here’s why it might not be a great idea.
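For readers new to the idea, here is a minimal sketch of how a saga works: each step carries a compensating action, and if a later step fails, the completed steps are undone in reverse order. The step names (inventory, card, shipment) are hypothetical illustrations, not from the linked article.

```python
def run_saga(steps):
    """steps: list of (action, compensation) pairs of zero-arg callables.

    Runs each action in order. If one raises, runs the compensations
    for the already-completed steps in reverse order, then returns False.
    """
    completed = []
    for action, compensation in steps:
        try:
            action()
            completed.append(compensation)
        except Exception:
            for undo in reversed(completed):  # newest first
                undo()
            return False
    return True

# Hypothetical three-step order saga where the last step fails:
log = []

def fail():
    raise RuntimeError("shipping failed")

steps = [
    (lambda: log.append("reserve inventory"), lambda: log.append("release inventory")),
    (lambda: log.append("charge card"),       lambda: log.append("refund card")),
    (fail,                                    lambda: log.append("cancel shipment")),
]
ok = run_saga(steps)
# ok is False; the first two steps were compensated in reverse order.
```

Note that the compensations are only "best effort" undo operations, not a real rollback, which is part of why the pattern draws criticism.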
During a rolling deploy, different parts of the infrastructure briefly ran a mix of old and new code, with unexpected results.
On its face, we have a simple requirement:
- Generate sequential numbers
- Ensure that there can be no gaps
- Do that in a distributed manner
It’s never simple with distributed systems.
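The tension in those requirements is easy to see in a sketch. On a single node, a gapless sequential counter is trivial: serialize every increment behind one lock. The hard part is the third requirement, because a distributed version needs an equivalent single point of serialization (a database row, a consensus log). This is an illustrative sketch, not the approach from the linked article.

```python
import threading

class GaplessCounter:
    """Gapless sequential numbers on a single node.

    The lock is the single point of serialization. Distributing this
    means replacing the lock with a coordination mechanism, which is
    where the availability and performance trade-offs begin.
    """

    def __init__(self, start=0):
        self._value = start
        self._lock = threading.Lock()

    def next(self):
        with self._lock:
            self._value += 1
            return self._value

counter = GaplessCounter()
numbers = [counter.next() for _ in range(5)]
# numbers == [1, 2, 3, 4, 5], sequential with no gaps
```

Gaps sneak in the moment you relax coordination: for example, database sequences that pre-allocate ranges per node are fast but not gapless if a node crashes holding unused numbers.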
In classic Cloudflare style, here’s an ultra-deep dive into the kernel to find the source of trouble-making packet loss.
Terin Stock — Cloudflare
Even with a “duplicate” incident, there’s always at least one thing that’s different: the fact that it’s happened before. That changes things. In practice, a lot more will be different too.
Fred Hebert — Honeycomb
Full disclosure: Honeycomb is my employer.
There are definitely pros and cons to being in the most popular (and most oft-maligned) AWS region.
Jeff Martens — Metrist
Changes frequently cause incidents, but what exactly counts as a change? This article delves into that question with examples.
This crash is a great reminder that we have to look past “human error” to the systems around the humans that set them up for failure (or don’t set them up for success).