Articles
Azure developed this tool to sniff out production problems caused by deploys and guess which deploy might have been the culprit. Its accuracy is impressive.
Adrian Colyer — The Morning Paper (summary)
Li et al. — NSDI’20 (original paper)
This one made me laugh out loud. Better check those system call return codes, people.
rachelbythebay
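Not code from the post, just a minimal Go sketch of the principle it hammers home: check the result of every call that can fail, because the error you ignore is the one that bites you in production. The file path here is only illustrative.

```go
package main

import (
	"fmt"
	"log"
	"os"
)

func main() {
	f, err := os.Create("/tmp/example.txt")
	if err != nil {
		log.Fatalf("open failed: %v", err) // e.g. permission denied, read-only filesystem
	}
	defer f.Close()

	// Writes can fail too (full disk, I/O error). Check the return values.
	n, err := f.Write([]byte("hello\n"))
	if err != nil {
		log.Fatalf("write failed after %d bytes: %v", n, err)
	}
	fmt.Printf("wrote %d bytes\n", n)
}
```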
This caught my eye:
In addition, what is seen as the IC maintaining organizational discipline during a response can actually be undermining the sources of resilient practice that help incident responders cope with poorly matched coordination strategies and the cognitive demands of the incident.
Laura M.D. Maguire — ACM Queue Volume 17, Issue 6
A guide to salary expectations for various levels of SRE, especially useful if you’re changing jobs.
Gremlin
The flip side of microservices agility is the resiliency you can lose by distributing your services. Here are some microservices resiliency patterns that can keep your services available and reliable.
Joydip Kanjilal
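As a rough illustration of one such pattern (my sketch, not taken from the article), here's a retry with exponential backoff, jitter, and a bounded request timeout in Go. The URL and attempt count are placeholders.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"net/http"
	"time"
)

// fetchWithRetry retries a GET request with exponential backoff and jitter,
// one common resiliency pattern for calls between microservices.
func fetchWithRetry(url string, maxAttempts int) (*http.Response, error) {
	client := &http.Client{Timeout: 2 * time.Second} // always bound the wait
	backoff := 100 * time.Millisecond

	for attempt := 1; attempt <= maxAttempts; attempt++ {
		resp, err := client.Get(url)
		if err == nil && resp.StatusCode < 500 {
			return resp, nil // success, or a client error not worth retrying
		}
		if resp != nil {
			resp.Body.Close()
		}
		// Sleep with jitter, then double the backoff for the next attempt.
		time.Sleep(backoff + time.Duration(rand.Int63n(int64(backoff))))
		backoff *= 2
	}
	return nil, errors.New("all attempts failed")
}

func main() {
	resp, err := fetchWithRetry("https://example.com/health", 3)
	if err != nil {
		fmt.Println("giving up:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```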
There have been several recent failures of consumer devices caused by cloud service outages, and this author argues for change.
Kevin C. Tofel — Stacey on IoT
This sounds familiar…
Durham Radio News
Essentially, you’re taking that risk of the Friday afternoon deployment, and spreading it thinly across many deployments throughout the week.
Ben New
Outages
- Fidelity
  - This one was especially problematic because it happened on Monday, a day of huge losses for the US stock market.
- GitHub
  - This one too. GitHub posted a short note on the recent outages.
- TechCrunch
  - TechCrunch was serving an expired TLS certificate. The strange thing is that the certificate had only been valid for 12 hours.
- Petnet pet feeders
- Google Nest