Articles
There’s so much in this article:
- how to recognize when your system may be susceptible to cascading failure
- how to prevent it
- how to deal with it when it happens (and how hard that can be)
Laura Nolan — Slack
It’s time for this year’s SRE Survey. Don’t forget that with each completed survey, Catchpoint donates $5 to charity.
This growing demand [for SREs] is not without growing pains as a skills gap problem has emerged due to the fact that SRE training requires a hands-on, interactive learning environment.
Peter Murray — Catchpoint
Both the summary and the original article are well worth reading. This stood out to me:
As much as we may think of incidents as taking place in all those technical parts of the system below the line, incidents actually take place above it
Thai Wood (summary)
Dr. Richard Cook (original article)
The EBS control plane data store resembles a “jellyfish” (actually a Physalia, a.k.a. Portuguese man-of-war).
Timothy Prickett Morgan — The Next Platform
Ideal: each team manages their microservice(s) in isolation.
Reality: microservices interact in unexpected ways and a broader system emerges that has remarkable similarities to running a monolith.
Ben Sigelman — LightStep
This one discusses how to handle SRE for a monolith and gives some examples of what often goes wrong.
Eric Harvieux — Google
The author blocked an unexpected Sunday deploy of untested code, and it turned out to be a good thing they did.
rachelbythebay
Outages
- GitHub
- NPM
  - Linked is an interesting explanation from Cloudflare, posted as a comment on a GitHub issue.
- New Relic
- PagerDuty
- Fidelity
  - Fidelity customers saw a $0 balance for their 401(k) [US retirement] accounts.
- Microsoft Office 365 & Outlook
  - Users were getting a "service unavailable" error.
- Heathrow Airport (London, UK)
- Zillow
- Indeed
- Kobo
- Heroku
- Squarespace
  - Also this one.