Articles
Go’s HTTP client defaults to no timeout. Making HTTP requests with no timeout is rarely a good idea and has been at the heart of many incidents I’ve been involved in.
Nathan Smith
A few times now, I’ve made offhand comments about how Spanner promises a lot and I’d like to know what the catches are. Here they are! In all fairness, they’re pretty reasonable constraints to work with.
Niel Markwick and Robert Saxby — Google
I’d refer to this as more of a retrospective template, but in any case, it’s pretty nifty!
Michael Kehoe
This is a news report rather than a technical deep-dive. It’s got some pretty interesting (and amusing) stories from various MMOs.
Alex Wiltshire — PC Gamer
Here’s how Netflix does observability.
Kevin Lew and Sangeeta Narayanan — Netflix
Looks like I’ve missed a few incident follow-up posts from Heroku in the past couple of months:
#1548: Increased errors in starting dynos
#1535: Post-incident Dyno Restarts
#1459: Scheduled API Maintenance on Monday March 26 at 23:00 UTC (4:00 PM PT)
#1413: Dyno Availability
#1414: Heroku Connect Sync Delays
#1395: Heroku Connect Availability
#1393: Heroku Connect unavailable
#1379: Dyno boot issues
Outages
- Walt Disney World Website and My Disney Experience Mobile App
  - Having just been to Disney World in April, I can attest to the severity of this kind of outage and the importance of the app.
- Paytm (digital wallet service)
- ASX (Australian Stock Exchange)
  - Inadvertent release of fire suppression gas damaged some equipment and halted trading.
- LSE (London Stock Exchange)
- Today we mitigated 1.1.1.1
On May 31, 2018 we had a 17 minute outage on our 1.1.1.1 resolver service; this was our doing and not the result of an attack.
Cloudflare shares some detail on what went wrong in this comprehensive incident analysis.
Marek Majkowski — Cloudflare