Azure developed this tool to sniff out production problems caused by deploys and identify which deploy was the likely culprit. Its accuracy is impressive.
Adrian Colyer — The Morning Paper (summary)
Li et al. — NSDI’20 (original paper)
This one made me laugh out loud. Better check those system call return codes, people.
This caught my eye:
In addition, what is seen as the IC maintaining organizational discipline during a response can actually be undermining the sources of resilient practice that help incident responders cope with poorly matched coordination strategies and the cognitive demands of the incident.
Laura M.D. Maguire — ACM Queue Volume 17, Issue 6
A guide on salary expectations for various levels of SRE, especially useful if you’re changing jobs.
The flip side of microservices agility is the resiliency you can lose by distributing your services. Here are some microservices resiliency patterns that can help keep your services available and reliable.
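One of the classic patterns in this space is retry with exponential backoff and jitter. This is my own minimal sketch, not code from the linked article, and the function names and parameters are illustrative:

```python
import random
import time


def retry_with_backoff(operation, max_attempts=4, base_delay=0.1):
    """Retry a flaky operation, backing off exponentially between attempts.

    Illustrative sketch of a common resiliency pattern; names and
    defaults are my assumptions, not from the linked article.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the failure
            # Double the delay each attempt, with jitter so that many
            # retrying clients don't hammer the service in lockstep.
            delay = base_delay * (2 ** attempt) * (1 + random.random())
            time.sleep(delay)


# Example: a dependency that fails twice, then recovers.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("upstream unavailable")
    return "ok"

print(retry_with_backoff(flaky, base_delay=0.01))  # → ok
```

The jitter is the easy-to-forget part: without it, a fleet of clients retrying on the same schedule can turn one blip into a synchronized retry storm.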
There have been several recent failures of consumer devices based on a cloud service outage, and this author argues for change.
Kevin C. Tofel — Stacey on IoT
This sounds familiar…
Durham Radio News
Essentially, you’re taking the risk of the Friday afternoon deployment and spreading it thinly across many deployments throughout the week.
- This one was especially problematic because it happened on Monday, a day of huge losses for the US stock market.
- TechCrunch was serving an expired TLS certificate. The strange thing is that the certificate had only been valid for 12 hours.
- Petnet pet feeders
- Google Nest