This is a long one, but trust me, it’s worth the read. My favorite part is where the author gets into mental models, hearkening back to the Stella Report.
When CDN outages occur, it becomes immediately clear who is using multiple CDNs and who is not.
A multi-CDN approach can be tricky to pull off, but as these folks explain, it can be critical for reliability and performance.
Scott Kidder — Mux
Full disclosure: Fastly, my employer, is mentioned.
This article explains five different phenomena that people mean when they say “technical debt”, and advocates understanding the full context rather than just assuming the folks that came before were fools.
/thanks Greg Burek
The work we did to get our teams aligned and our systems in good shape meant that we were able to scale, even with some services getting 40 times the normal traffic.
Kriton Dolias and Vinessa Wan — The New York Times
How does one resolve the emerging consensus for alerting exclusively on user-visible outages, with the undeniable need to learn about and react to things *before* users notice? Like a high cache eviction rate?
There’s a real gem in here, definitely worth a read.
Charity Majors (and Liz Fong-Jones in reply)
Being on-call will always involve getting woken up occasionally. But when that does happen, it should be for something that matters, and that the on-call person can make progress toward fixing.
Rachel Perkins — Honeycomb
Delayed replication can be used as a first resort to recover from accidental data loss and lends itself perfectly to situations where the loss-inducing event is noticed within the configured delay.
Andreas Brandl — GitLab
- Azure Kubernetes Service (US East)
- There’s a pretty interesting incident description in their history page.
- Via Twitter:
At this time, the attacker has formatted all the disks on every server. Every VM is lost. Every file server is lost, every backup server is lost. NL was 100% hosted with a vastly smaller dataset. NL backups by the provider were intact, and service should be up there.
My sympathies, folks.
- Via Twitter:
- Emails into Slack were failing due to an expired TLS certificate.
- Linked is their followup post explaining more about the incident.
- JPMorgan Chase
- Strava and Garmin Connect
- Microsoft Windows Update
- Sydney, AU Train Network
- Lloyds Bank