Articles
They fully expected their deep-discount sale to drive traffic, but they didn’t expect their system to handle the increase in the way that it did.
Michał Kosmulski — Allegro
Pre-stop hooks, liveness probes, and readiness probes were key to smoothly transitioning their services from a home-grown container system to Kubernetes.
Oliver Leaver-Smith — Sky Betting & Gaming
The experience of responding to an incident can evoke emotions that run the gamut.
Mads Hartmann
Google has released course materials the first of a series of classes on NALSD (“non-abstract large systems design”). This first one is about a distributed Pub-Sub system.
Auithor: Jenny Liao and Salim Virji — Google
Usually, doing a post-analysis on an incident you were in is an anti-pattern because you’re likely to introduce bias. But sometimes, it can lead you to learn more than you would have otherwise.
Lorin Hochstein
Outages
- Datadog
- G Suite
- Google Cloud Platform
- Let’s Encrypt
- Google CT logs had an issue, impairing Let’s Encrypt’s ability to issue.
- Tesla
- Apple
- Heroku
- Connectivity Issues
- Crypto.com (cryptocurrency exchange)
- The CEO says a database issue (nearly) opened up the possibility for arbitrage.