SRE Weekly Issue #70

Articles

Enabling DNS split authority with OctoDNS

GitHub has released OctoDNS, their tool for synchronizing DNS records across multiple providers. Shortly after the Dyn outage last fall (covered here), they still had only one DNS provider (source: direct observation). I suspected that this had to do with the complexity of keeping records synced across two providers; perhaps that’s why they created OctoDNS.

Introducing Bolt: On Instance Diagnostic and Remediation Platform

Bolt is Netflix’s “event driven diagnostic and remediation platform”, although it actually seems like a full-blown remote execution system for large fleets of servers.

Incident management at Google — adventures in SRE-land

A Google SRE takes us through their first on-call shift, including running incident command for a production incident. I like the emphasis on a blameless postmortem.

Onboarding, On-Call and Learning

Pete Shima received some questions about onboarding SREs, and lucky us, he decided to answer them publicly. My favorite section is the one about connecting a new SRE to people across the company. I find that solid connections to folks in various positions are vital to getting my job done well. Thanks to Pete for the SRE Weekly mention!

Take A Moment To Refocus – Salesforce + Open Source = ❤

Salesforce has a humongous infrastructure, and they needed a tool to help visualize data from lots of monitoring systems. They created Refocus to serve that need, and they open sourced it. They had three goals: gather data from all of the monitoring systems, on-board new services quickly, and visualize data in a way that makes sense for each service.

Full disclosure: Salesforce (parent company of my employer, Heroku) is mentioned.

New zine: let’s learn tcpdump!

Tcpdump is a critical tool for debugging thorny network issues. Julia Evans created a new zine to help you learn the basics, although if her other zines are any indication, even a pro may learn a new trick or two. The zine is $10 now and will be available for free at some point in the future.

Sharks Want to Bite Google’s Undersea Cables

Turns out that sharks are a reliability risk. And not just those WFLB.

Why Code Gets Released too Early (and how to fix It)

From their Global Developer Survey, GitLab learned that it’s common for developers to release code before it’s production-ready in response to organizational pressures.

Code released before it’s ready might be good for meeting deadlines, but that’s about all it’s good for.

Four nines & failure rates – will the cloud ever cut it for transactional banking?

Here’s a pretty excellent analysis of why adopting the cloud can be difficult for banks. Just skip past the bit with the incorrect uptime calculation, since four nines of uptime actually equates to about 53 minutes’ downtime per year, not 9 hours.
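For anyone who wants to check the arithmetic, here is a minimal sketch. It assumes a 365-day year and treats "N nines" as an availability of 1 - 10^-N; the function name is my own, not something from the article.

```python
# Assumption: a non-leap year of 365 days.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_minutes_per_year(nines: int) -> float:
    """Allowed downtime per year for an availability of N nines.

    For example, nines=4 means 99.99% availability, i.e. an
    unavailability fraction of 10**-4.
    """
    unavailability = 10 ** -nines
    return MINUTES_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in (3, 4, 5):
        minutes = downtime_minutes_per_year(n)
        print(f"{n} nines: {minutes:.2f} minutes/year ({minutes / 60:.2f} hours)")
```

By this calculation, a downtime budget of roughly 9 hours per year corresponds to three nines (99.9%), not four.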

Outages

London Marathon Donations
- eBay and Virgin Money Giving both went down under the load as many people flocked to donate before the London Marathon.
CARLI
- CARLI is the Consortium of Academic Research Libraries in Illinois. I included this outage because of the short but sweetly personal postmortem from their network engineer.
Instagram
Reddit
- Sorry for the extended outage there. We failed back the maintenance performed earlier tonight. We’ll provide a post-mortem at a later date.
