SRE Weekly Issue #74

This is the first issue sent to over 2000 email subscribers (not to mention the 500+ Twitter followers and an unknown number of RSS subscribers!).  Wow!  Thank you all so much for reading and for all the great feedback you’ve sent over the past year and a half.  You make this fun.

SPONSOR MESSAGE

Upcoming devops.com webinar: Top 10 Practices of Highly Successful DevOps Incident Management Teams. Learn more and register: http://try.victorops.com/SRE_Weekly/IncidentMgmtWebinar

Articles

The holy grail of high availability is a multi-datacenter (or cloud) active/active architecture. This article goes into why, including examples of common pitfalls of traditional disaster recovery solutions.

Neat idea: here’s a Stack Overflow question asking for critique of a proposed outline for a post-incident analysis. It’s a great start already, and the answers include some pretty top-notch suggestions.

A tutorial on setting up multi-region failover for an S3-hosted website, written in response to February’s major S3 outage in us-east-1.

Last week, I linked to an article about debugging an overloaded ELB node. This week we have the sequel, a deep dive into the intricate details behind the problem, complete with a trip into the glibc source code.

Netflix uses data science to figure out how to fill the limited space on their edge content delivery nodes with the videos that people will request, all while (hopefully) avoiding hot nodes.

Zayna Shahzad, a PagerDuty software engineer, did customer support for a day, and she learned a ton. As SREs, we have the customer experience directly in our sights, so this kind of thing sounds like a really great idea.

Charity Majors does not want to be an SRE. Find out why by watching this 5-minute video interview between her and Rob Hirschfeld. I don’t often link to videos, because who has time to watch stuff? But this one is pretty intriguing.

Server Density originated the term “humanops”, and now they share a 12-part look at how they practice it.

A Malaysian doctor writes about how to ensure that the national health system’s on-call policy is safe for doctors.

The passing of a paediatrician-to-be involved in a road traffic accident (motor-vehicle accident) recently is indeed a heart-breaking news to the whole medical fraternity. With the incident, a persistent recurring issue also resurfaced – work-related commuting accident ie road traffic accidents involving exhausted doctors after on-calls.

Do what better? Prevent and end illegal and unethical actions like discrimination, harassment, and retaliation. This article is by Susan Fowler, featured here a bunch, and while it’s not directly related to SRE, it’s so important that I urge you to read it.

Outages

  • Monitorama 2017 PDX
    • Monitorama (and a swathe of Portland) suffered a power outage last week. The organizers created a status site post (linked) and quickly organized a disaster recovery site: an entirely separate conference venue. Seriously amazing work, and oddly appropriate given the conference subject matter.

      If you didn’t make it to Monitorama, here’s a summary from LinkedIn SRE Michael Kehoe.

  • Sacramento Airport (CA, USA)
  • British Airways
Updated: May 28, 2017 — 9:44 pm
A production of Tinker Tinker Tinker, LLC