This incident report from a September Datadog outage has an interesting tidbit about scaling external incident response in tandem with internal.
Alexis Lê-Quôc — Datadog
This is Google’s write-up for an interesting issue that involved repeated re-sending of invitations to edit a Google Drive document.
I basically want to immediately absorb any article with this title, unless it’s just clickbait spam. This one definitely isn’t.
Lots of juicy details in this one about the difficulty Slack has had in scaling their DB layer and how Vitess solved their problems.
Arka Ganguli, Guido Iaquinti, Maggie Zhou, and Rafael Chacón — Slack
Hitting file descriptor limits is such an annoying kind of outage. Some good tips here, clearly coming from hard-won experience.
They used two providers synced with OctoDNS.
Ryan Timken and Kiran Naidoo — Cloudflare
This is all about understanding the whole system (people and technology) and building learning, rather than finding a superficial “root cause”.
Piyush Verma — Last9
- New Zealand Reserve Bank
Local e-commerce site OneDayOnly is running Black Friday discount deals again today, after the shopping site was down for a few hours last Friday.
- This outage occurred on Giving Tuesday, a very important day for nonprofits to raise funds.