SRE Weekly Issue #247

A message from our sponsor, StackHawk:

The ZAP open source project is the underlying security scanner for StackHawk. Check out this 21-minute introduction to ZAP from project founder and core contributor Simon Bennetts.
https://sthwk.com/zap-intro-video

Articles

This incident report from a September Datadog outage has an interesting tidbit about scaling external incident response in tandem with internal response.

Alexis Lê-Quôc — Datadog

This is Google’s write-up for an interesting issue that involved repeated re-sending of invitations to edit a Google Drive document.

Google

I basically want to immediately absorb any article with this title, unless it’s just clickbait spam. This one definitely isn’t.

Ronak Nathani

Lots of juicy details in this one about the difficulty Slack has had in scaling their DB layer and how Vitess solved their problems.

Arka Ganguli, Guido Iaquinti, Maggie Zhou, and Rafael Chacón — Slack

Hitting file descriptor limits is such an annoying kind of outage. Some good tips here, clearly coming from hard-won experience.

Utsav Shah

They used two providers synced with OctoDNS.

Ryan Timken and Kiran Naidoo — Cloudflare

This is all about understanding the whole system (people and technology) and building learning, rather than finding a superficial “root cause”.

Piyush Verma — Last9

Outages

Updated: December 6, 2020 — 8:33 pm
A production of Tinker Tinker Tinker, LLC