Azure developed this tool to sniff out production problems caused by deploys and identify which deploy was the likely culprit. Its accuracy is impressive.
Adrian Colyer — The Morning Paper (summary)
Li et al. — NSDI’20 (original paper)
This one made me laugh out loud. Better check those system call return codes, people.
This caught my eye:
In addition, what is seen as the IC maintaining organizational discipline during a response can actually be undermining the sources of resilient practice that help incident responders cope with poorly matched coordination strategies and the cognitive demands of the incident.
Laura M.D. Maguire — ACM Queue Volume 17, Issue 6
A guide on salary expectations for various levels of SRE, especially useful if you’re changing jobs.
The flip side of microservices agility is the resiliency you can lose by distributing your services. Here are some microservices resiliency patterns that can help keep your services available and reliable.
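One of the classic patterns in this space is retry with exponential backoff and jitter. This is my own minimal sketch, not code from the linked article, and the function names and parameters are illustrative:

```python
import random
import time


def retry_with_backoff(operation, max_attempts=4, base_delay=0.1):
    """Retry a flaky operation, backing off exponentially between attempts.

    Illustrative sketch of a common resiliency pattern; names and
    defaults are my assumptions, not from the linked article.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the failure
            # Double the delay each attempt, with jitter so that many
            # retrying clients don't hammer the service in lockstep.
            delay = base_delay * (2 ** attempt) * (1 + random.random())
            time.sleep(delay)


# Example: a dependency that fails twice, then recovers.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("upstream unavailable")
    return "ok"

print(retry_with_backoff(flaky, base_delay=0.01))  # → ok
```

The jitter is the easy-to-forget part: without it, a fleet of clients retrying on the same schedule can turn one blip into a synchronized retry storm.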
There have been several recent failures of consumer devices based on a cloud service outage, and this author argues for change.
Kevin C. Tofel — Stacey on IoT
This sounds familiar…
Durham Radio News
Essentially, you’re taking the risk of the Friday afternoon deployment and spreading it thinly across many deployments throughout the week.
- This one was especially problematic because it happened on Monday, a day of huge losses for the US stock market.
- TechCrunch was serving an expired TLS certificate. The strange thing is that the certificate had only been valid for 12 hours.
- Petnet pet feeders
- Google Nest