I really love that some of you are taking vacations. Preventing burnout is critical to improving reliability. That said, if you’d please exempt my address from your vacation auto-responder, that’d be super-cool 😉
Last week, I linked to a Reddit story about an engineer who was unfairly fired for a mistake on their first day. Dr. Richard Cook picked this up and wrote up a great analysis of the underlying organizational issues.
Thanks to John Allspaw for this one.
This was released the week before last, but it took me a while to digest. The ATO did a very thorough post-incident analysis of their two outages and released this polished report. I like that they took full responsibility for the outage even though it was an issue with a fully managed vendor SAN offering, and they clearly sought to learn as much as possible.
Pinterest tech lead Suman Karumuri explains how they use distributed tracing and the benefits it’s brought them.
With these new use cases, we see tracing infrastructure as the third pillar of monitoring our services in addition to metrics and log search systems.
Frustrated by the public statement from British Airways’ Willie Walsh regarding their major outage, Tripwire founder Gene Kim took it upon himself to write an open letter of apology as if he were an airline CEO. It’s pretty great.
This article explores several options for HA with Nginx: putting an ELB in front of it, using Route 53 with health checks, or switching an Elastic IP with either keepalived or a Lambda function.
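To make the last option a little more concrete, here’s a minimal sketch of what such a Lambda handler might look like, assuming it’s invoked when the primary Nginx instance fails a health check. The allocation ID, instance ID, and trigger mechanism are hypothetical placeholders, not details from the linked article:

```python
# Hypothetical sketch: move an Elastic IP to a standby Nginx instance.
# Imagine this Lambda is invoked (e.g. via a CloudWatch alarm -> SNS)
# when the primary instance fails its health check. IDs are placeholders.
import boto3

ALLOCATION_ID = "eipalloc-0123456789abcdef0"  # the Elastic IP's allocation ID
STANDBY_INSTANCE_ID = "i-0fedcba9876543210"   # the healthy standby instance

ec2 = boto3.client("ec2")

def handler(event, context):
    # AllowReassociation lets the EIP be taken away from the failed primary.
    ec2.associate_address(
        AllocationId=ALLOCATION_ID,
        InstanceId=STANDBY_INSTANCE_ID,
        AllowReassociation=True,
    )
    return {"moved_eip_to": STANDBY_INSTANCE_ID}
```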
I’ve been following GitLab’s blog since their engineer accidentally deleted their database earlier this year, and I’m glad I did. This article touches on all sorts of topics near to my heart: preventing burnout, examining incident response metrics, enforcing vacations, incident command, and having developers go on-call for what they wrote.
The costs associated with running a full-capacity redundant system in a secondary site can be numerous and subtle. Those costs can be especially hard to swallow when expected returns on infrastructure investments prove elusive.
Netflix explains in depth the careful scientific experiments they perform in production in order to improve quality of experience (QoE).
- Google Cloud Services
- A 62-minute total internet outage across multiple zones in asia-northeast1. Postmortem linked, including a description of several contributing factors.
We apologize for the impact this issue had on our customers, and especially to those customers with deployments across multiple zones in the asia-northeast1 region. We recognize we failed to deliver the regional reliability that multiple zones are meant to achieve.
This week, I had the awesome opportunity to attend a short-form training session on the Incident Management System (the broader system that includes Incident Command) given by Blackrock 3 Partners. Shout-out to Rob, Ron, and Chris – it was awesome meeting you guys, and I really enjoyed our conversations!
In case you missed it, Uber kicked off this and another investigation in response to a blog post by Susan Fowler, an SRE whose writing I’ve featured here a number of times. I’m pleased at this first step by Uber and I’m looking forward to what comes next. It might be a leave of absence for Uber’s CEO, although no decision has been made yet.
Here’s the 2013 article that started it all. If you’re unfamiliar with Jepsen, it’s an article series on testing various distributed data systems for partition tolerance, along with a companion tool set for inducing failures.
For those not completely “cloud native” (ugh) by this point, here’s a nifty primer on some of the BGP tricks you’ll need to know if you manage your own IP transit links.
Redis has a pretty big gotcha regarding deletion of expired keys, as these engineers discovered. In fact, my experience with Redis was full of operational gotchas like this.
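I won’t spoil the article’s specific discovery, but as general background, here’s a hedged sketch (key names made up) of the kind of workload that exposes Redis’s expiration behavior: Redis removes expired keys lazily on access plus via a periodic sampling cycle, so a large batch of keys that all expire at the same moment gets reclaimed in bursts.

```python
# Illustration of a common Redis expiry gotcha (not necessarily the exact
# issue from the linked article): expired keys are deleted lazily on
# access plus by a periodic sampling cycle, so a big batch expiring at
# once can mean latency spikes and lingering memory use while the server
# works through the backlog.
import redis  # redis-py

r = redis.Redis(host="localhost", port=6379)

pipe = r.pipeline(transaction=False)
for i in range(1_000_000):
    # One million keys, all expiring 60 seconds from now.
    pipe.set(f"cache:{i}", "x", ex=60)
    if i % 10_000 == 9_999:
        pipe.execute()
pipe.execute()

# A minute later, none of these keys are readable, but the server still
# has to find and delete them via its sampling cycle; watching the
# `expired_keys` counter (and memory) shows the reclamation come in bursts.
print(r.info("stats")["expired_keys"])
```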
This poor anonymous Reddit poster had a very bad day. The community rallied around them to explain that no, the anonymous poster is not to blame. One of the top commenters is Yorick Peterse, the engineer who inadvertently deleted GitLab.com’s main database earlier this year. Click through to see blamelessness in action.
PagerDuty is deeply invested in the Incident Management System, and most especially Incident Command. This article is a great overview, and if you want more, don’t forget that they also released their incident response documentation a while back, including their Incident Commander training material.
The main theme in this article by StatusPage.io is the direct relationship between increasing complexity and difficulty in attaining high reliability. I like the mention of microservices as a trade-off and not a panacea.
Automation doesn’t replace ops, it augments it. Abstraction doesn’t replace ops, it hides it. Function as a service doesn’t remove complexity, it increases it exponentially.
- Amazon product pages went down today in a rare outage
- The linked story was for an outage on June 7th. There was at least one additional similar outage on June 9th (source: personal experience).
- Dutch hosting provider Verelox is having a really rough time:
First of all, we want to offer our apologies for any inconvenience. Unfortunately, an ex administrator has deleted all customer data and wiped most servers.
Ouch. Good luck, folks.
This is the first issue sent to over 2000 email subscribers (not to mention the 500+ Twitter followers and an unknown number of RSS subscribers!). Wow! Thank you all so much for reading and for all the great feedback you’ve sent over the past year and a half. You make this fun.
The holy grail of high availability is a multi-datacenter (or cloud) active/active architecture. This article goes into why, including examples of common pitfalls of traditional disaster recovery solutions.
Neat idea: here’s a Stack Overflow question asking for critique of a proposed outline for a post-incident analysis. It’s a great start already, and the answers include some pretty top-notch suggestions.
A tutorial on setting up multi-region failover for an S3-hosted website, written in response to February’s major S3 outage in us-east.
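For flavor, here’s a rough sketch of what the DNS side of such a setup can look like (not necessarily how the linked tutorial does it): a primary/secondary failover record pair in Route 53 pointing at two buckets’ website endpoints. The zone ID, domain, health check ID, and endpoints below are all placeholders.

```python
# Hypothetical sketch: Route 53 failover records for an S3-hosted site.
# Zone ID, domain, health check ID, and bucket endpoints are placeholders.
import boto3

route53 = boto3.client("route53")

def failover_record(endpoint, role, health_check_id=None):
    record = {
        "Name": "www.example.com",
        "Type": "CNAME",
        "TTL": 60,
        "SetIdentifier": role.lower(),
        "Failover": role,  # "PRIMARY" or "SECONDARY"
        "ResourceRecords": [{"Value": endpoint}],
    }
    if health_check_id:
        # Route 53 fails over when this health check goes unhealthy.
        record["HealthCheckId"] = health_check_id
    return record

route53.change_resource_record_sets(
    HostedZoneId="Z0000000EXAMPLE",
    ChangeBatch={
        "Changes": [
            {"Action": "UPSERT",
             "ResourceRecordSet": failover_record(
                 "site.s3-website-us-east-1.amazonaws.com",
                 "PRIMARY",
                 health_check_id="hc-primary-placeholder")},
            {"Action": "UPSERT",
             "ResourceRecordSet": failover_record(
                 "site-replica.s3-website-us-west-2.amazonaws.com",
                 "SECONDARY")},
        ]
    },
)
```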
Last week, I linked to an article about debugging an overloaded ELB node. This week we have the sequel, a deep dive into the intricate details behind the problem, complete with a trip into the glibc source code.
Netflix uses data science to figure out how to fill the limited space on their edge content delivery nodes with the videos that people will request, all while (hopefully) avoiding hot nodes.
Zayna Shahzad, a PagerDuty software engineer, did customer support for a day, and she learned a ton. As SREs, we have the customer experience directly in our sights, so this kind of thing sounds like a really great idea.
Charity Majors does not want to be an SRE. Find out why by watching this 5-minute video interview between her and Rob Hirschfeld. I don’t often link to videos, because who has time to watch stuff? But this one is pretty intriguing.
Server Density originated the term “HumanOps”, and now they share 12 aspects of how they practice it.
A Malaysian doctor writes about how to ensure that the national health system’s on-call policy is safe for doctors.
The passing of a paediatrician-to-be involved in a road traffic accident (motor-vehicle accident) recently is indeed heartbreaking news for the whole medical fraternity. With the incident, a persistent recurring issue has also resurfaced: work-related commuting accidents, i.e. road traffic accidents involving exhausted doctors after on-calls.
Do what better? Prevent and end illegal and unethical actions like discrimination, harassment, and retaliation. This article is by Susan Fowler, featured here a bunch, and while it’s not directly related to SRE, it’s so important that I urge you to read it.