SRE Weekly Issue #70

SPONSOR MESSAGE

Resolving DevOps and IT incidents is not enough. Download the eBook: “Blameless Post Mortems (and how to do them)”, and start learning from them. http://try.victorops.com/BlamelessPostMortems/SREWeekly

Articles

GitHub has released OctoDNS, their tool for synchronizing DNS records across multiple providers. Shortly after the Dyn outage last fall (covered here), I noticed that they still had only one DNS provider (source: direct observation). I suspected that this had to do with the complexity of keeping records in sync across two providers; perhaps that’s why they created OctoDNS.

Bolt is Netflix’s “event driven diagnostic and remediation platform”, although it actually seems like a full-blown remote execution system for large fleets of servers.

A Google SRE takes us through their first on-call shift, including running incident command for a production incident. I like the emphasis on a blameless postmortem.

Pete Shima received some questions about onboarding SREs, and lucky us, he decided to answer them publicly. My favorite section is the one about connecting a new SRE to people across the company. I find that solid connections to folks in various positions are vital to getting my job done well. Thanks to Pete for the SRE Weekly mention!

Salesforce has a humongous infrastructure, and they needed a tool to help visualize data from lots of monitoring systems. They created Refocus to serve that need, and they open sourced it. They had three goals: gather data from all of the monitoring systems, on-board new services quickly, and visualize data in a way that makes sense for each service.

Full disclosure: Salesforce (parent company of my employer, Heroku) is mentioned.

Tcpdump is a critical tool for debugging thorny network issues. Julia Evans created a new zine to help you learn the basics, although if her other zines are any indication, even a pro may learn a new trick or two. The zine is $10 now and will be available for free at some point in the future.

Turns out that sharks are a reliability risk. And not just those WFLB.

From their Global Developer Survey, GitLab learned that it’s common for developers to release code before it’s production-ready in response to organizational pressures.

Code released before it’s ready might be good for meeting deadlines, but that’s about all it’s good for.

Here’s a pretty excellent analysis of why adopting the cloud can be difficult for banks. Just skip past the bit with the incorrect uptime calculation, since four nines of uptime actually equates to about 53 minutes’ downtime per year, not 9 hours.
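
For anyone who wants to check the arithmetic, here is a minimal sketch in Python. It assumes a 365-day year (no leap days), and the function name is only illustrative. The 9-hour figure is close to what three nines (99.9%) would allow, which may be where the mix-up came from.

```python
# Yearly downtime budget for a given availability target.
# Assumption: a 365-day year; leap years are ignored.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the allowed downtime per year, in minutes."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for label, pct in [("three nines", 99.9), ("four nines", 99.99)]:
    minutes = downtime_minutes_per_year(pct)
    print(f"{label} ({pct}%): {minutes:.1f} min/year, or {minutes / 60:.2f} hours")
```

Four nines works out to roughly 52.6 minutes per year, and three nines to roughly 8.76 hours.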

Outages

  • London Marathon Donations
    • eBay and Virgin Money Giving both went down under the load as many people flocked to donate before the London Marathon.
  • CARLI
    • CARLI is the Consortium of Academic Research Libraries in Illinois. I included this outage because of the short but sweetly personal postmortem from their network engineer.
  • Instagram
  • Reddit
    • Sorry for the extended outage there. We failed back the maintenance performed earlier tonight. We’ll provide a post-mortem at a later date.

Updated: April 30, 2017 — 9:24 pm
A production of Tinker Tinker Tinker, LLC