SRE Weekly Issue #74

This is the first issue sent to over 2000 email subscribers (not to mention the 500+ Twitter followers and an unknown number of RSS subscribers!). Wow! Thank you all so much for reading and for all the great feedback you’ve sent over the past year and a half. You make this fun.

Articles

The Always On Architecture – Moving Beyond Legacy Disaster Recovery

The holy grail of high availability is a multi-datacenter (or cloud) active/active architecture. This article goes into why, including examples of common pitfalls of traditional disaster recovery solutions.

Documenting an outage for a post-mortem review

Neat idea: here’s a Stack Overflow question asking for critique of a proposed outline for a post-incident analysis. It’s a great start already, and the answers include some pretty top-notch suggestions.

Multi-region S3 failover /w Route53

A tutorial on setting up multi-region failover for an S3-hosted website, written in response to February’s major S3 outage in us-east.

DNS Resolution in Go and Cgo

Last week, I linked to an article about debugging an overloaded ELB node. This week we have the sequel, a deep dive into the intricate details behind the problem, complete with a trip into the glibc source code.

How Data Science Helps Power Worldwide Delivery of Netflix Content

Netflix uses data science to figure out how to fill the limited space on their edge content delivery nodes with the videos that people will request, all while (hopefully) avoiding hot nodes.

Shadowing Customer Support for a Day

Zayna Shahzad, a PagerDuty software engineer, did customer support for a day, and she learned a ton. As SREs, we have the customer experience directly in our sights, so this kind of thing sounds like a really great idea.

Is SRE a Good Term?

Charity Majors does not want to be an SRE. Find out why by watching this 5-minute video interview between her and Rob Hirschfeld. I don’t often link to videos, because who has time to watch stuff? But this one is pretty intriguing.

How we do HumanOps at Server Density

Server Density originated the term “HumanOps”, and here they share 12 aspects of how they practice it.

Modifications to the current on-call system?

A Malaysian doctor writes about how to ensure that the national health system’s on-call policy is safe for doctors.

The passing of a paediatrician-to-be involved in a road traffic accident (motor-vehicle accident) recently is indeed a heart-breaking news to the whole medical fraternity. With the incident, a persistent recurring issue also resurfaced – work-related commuting accident ie road traffic accidents involving exhausted doctors after on-calls.

Five Things Tech Companies Can Do Better

Do what better? Prevent and end illegal and unethical actions like discrimination, harassment, and retaliation. This article is by Susan Fowler, featured here a bunch, and while it’s not directly related to SRE, it’s so important that I urge you to read it.

Outages

- Monitorama 2017 PDX
  - Monitorama (and a swathe of Portland) suffered a power outage last week. The organizers created a status site post (linked) and quickly organized a disaster recovery site: an entirely separate conference venue. Seriously amazing work, and oddly appropriate given the conference subject matter.
    If you didn’t make it to Monitorama, here’s a summary from LinkedIn SRE Michael Kehoe.
- Sacramento Airport (CA, USA)
- British Airways
