View on sreweekly.com
Shout-out to all the folks I met at Velocity! It was an exhilarating week filled with awesome personal conversations and some really incredible talks.
Then I came back to Earth to discover that everyone chose this week to write awesome SRE-related articles. I’m still working my way through them, but get ready for a great issue.
This is the blockbuster PDF dropped by the SNAFUcatchers during their keynote on day two of Velocity. Even just the 15-minute summary by Richard Cook and David Woods had me on the edge of my seat. In this report, they summarize the lessons gleaned from presentations of “SNAFUs” by several companies during winter storm Stella.
SNAFUs are anomalous situations that would have turned into outages were it not for the actions taken by incident responders. Woods et al. introduced a couple of concepts that are new to me: “dark debt” and “blameless versus sanctionless”. I love these ideas and can’t wait to read more.
These two articles provide a pretty good round-up of the ideas shared at Velocity this past week.
This one starts with a 6-hour 911 (emergency services) outage in 2014 and the Toyota unintended acceleration incidents, and then vaults off into really awesome territory. Research is being done into new paradigms of software development that leave the programming to computers, focusing instead on describing behavior using a declarative language. The goal: provably correct systems. Long read, but well worth it.
Drawing from Woods, Allspaw, Snowden, and others, this article explains how and why to improve the resilience of a system. There’s a great hypothetical example of graceful degradation that really clarified it for me.
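To make the graceful-degradation idea concrete, here's a minimal sketch of one common form of it: serve a stale cached value when the primary dependency fails, instead of failing the whole request. The function names and cache structure are hypothetical, not taken from the article.

```python
import time

# Hypothetical sketch of graceful degradation: fall back to a stale cached
# value when the primary source fails, rather than failing the request.
_cache = {}  # key -> (value, timestamp)

def fetch_with_degradation(key, fetch_primary, max_stale_seconds=300):
    """Try the primary source; on failure, serve a recent-enough cached value."""
    try:
        value = fetch_primary(key)
        _cache[key] = (value, time.time())
        return value, "fresh"
    except Exception:
        if key in _cache:
            value, ts = _cache[key]
            if time.time() - ts <= max_stale_seconds:
                return value, "stale"
        raise  # no acceptable fallback: surface the original error
```

The "stale" marker lets callers decide whether degraded data is acceptable for their use case.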
In a recent talk, Charity Majors made waves by saying, “Nines don’t matter when users aren’t happy.” Look, you can have that in t-shirt and mug format!
A summary of how six big-name companies test new functionality by gradually rolling it out in production.
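A common building block behind gradual rollouts is bucketing users by a stable hash, so each percentage increase only ever adds users. This is a generic sketch, not any particular company's implementation:

```python
import hashlib

def in_rollout(user_id: str, feature: str, percent: float) -> bool:
    """Stable percentage rollout: hash (feature, user) to a bucket in
    [0, 100) and enable the feature for buckets below the threshold.
    Raising `percent` only adds users; it never flaps anyone off."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10000 / 100.0  # stable bucket in [0, 100)
    return bucket < percent
```

Hashing the feature name in alongside the user ID keeps different features' rollout populations independent of each other.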
This article jumps off from Azure’s announcement of availability zones to discuss a growing trend in datacenters. We’re moving away from highly reliable “tier 4” datacenters and pushing more of the responsibility for reliability to software and networks.
Of course I do, and I don’t even know who Xero is! They use chat, chatops, and Incident Command, like a lot of other shops. I find it interesting that incident response starts off with someone filling out a form.
- PagerDuty posted a lengthy followup report on their outage on September 19-21. TL;DR: Cassandra. It was the worst kind of incident, in which they had to spin up an entirely new cluster and develop, test, and enact a novel cut-over procedure. Ouch.
- Heroku suffered a few significant outages. The one linked above includes a followup that describes a memory leak in their request routing layer. These two don’t yet have followups: #1298, #1301
Full disclosure: Heroku is my employer.
- On September 29, Azure suffered a 7-hour outage in Northern Europe. They’ve released a preliminary followup that describes an accidental release of fire suppression agent and the resulting carnage. Microsoft promises more detail by October 13.
Unfortunately can’t deep-link to this followup, so just scroll down to 9/29.
- New Relic
- Blackboard (education web platform)
View on sreweekly.com
I’m heading to New York tomorrow and will be at Velocity Tuesday and Wednesday. If you’re there, look for the weirdo in the SRE Weekly shirt and hit me up for some nifty swag! Also, maybe check out my talk on DNS, if you’re into that kind of thing.
Thanks to an eagle-eyed reader for pointing out that I totally screwed up the HTML on the link last week. Oops.
Here’s how Hosted Graphite made their job ad for an SRE-like role (Ops Automation Engineer) more inclusive. The article is filled with specific before/after language snippets, each with a detailed explanation of why they made the change.
A couple weeks after their major outage last October, Dyn published this article explaining secondary DNS. It’s a great primer and digs into what to do if you use advanced non-standard functionality like ALIAS records and traffic balancing.
SignalFx goes into deep detail on their feature for predicting future metric values. We get an explanation of why prediction is difficult and a discussion of the math involved in their solution.
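SignalFx's actual math is far more sophisticated, but the simplest version of the idea is easy to show: fit a line to recent samples and extrapolate. This sketch is illustrative only, not their method:

```python
def predict_linear(samples, steps_ahead):
    """Least-squares line fit over (index, value) pairs, extrapolated
    `steps_ahead` points past the last sample. A toy stand-in for real
    metric forecasting, which must also handle seasonality and noise."""
    n = len(samples)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(samples) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples))
    den = sum((x - mean_x) ** 2 for x in xs)
    slope = num / den if den else 0.0
    intercept = mean_y - slope * mean_x
    return intercept + slope * (n - 1 + steps_ahead)
```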
Payments: we really have to get them right. Here’s Dropbox’s Jessica Fisher with a discussion of how they reconcile failed payments.
No matter what goes wrong, our top priority is to make sure that customers receive service for which they’ve been charged, and aren’t charged for service they haven’t received.
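That invariant — charged implies served, served implies charged — is essentially a two-way set difference. This is a hypothetical sketch of the reconciliation idea, not Dropbox's actual system:

```python
def reconcile(charges, services):
    """Flag violations of the invariant in both directions.

    charges, services: sets of (customer_id, billing_period) tuples.
    Returns (charged_not_served, served_not_charged)."""
    charged_not_served = charges - services   # refund, or deliver the service
    served_not_charged = services - charges   # bill, or write it off
    return charged_not_served, served_not_charged
```

In a real system the inputs would be pulled from the billing provider and the service database, and each discrepancy would feed a repair workflow rather than a return value.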
A couple of weeks ago, I linked to a story about Resilience4j, a fault tolerance library for Java. This week is the second installment that shows you how to use it to implement circuit breakers. There’s also an interesting discussion of one of the implementation details.
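Resilience4j is Java, but the circuit-breaker pattern itself is small enough to sketch in a few lines. This is a generic, simplified state machine (closed → open → half-open), not Resilience4j's API:

```python
import time

class CircuitBreaker:
    """Minimal circuit-breaker sketch: CLOSED until `max_failures`
    consecutive errors, then OPEN (calls rejected) for `reset_timeout`
    seconds, after which one trial call is allowed (HALF-OPEN).
    A success closes the breaker again."""

    def __init__(self, max_failures=3, reset_timeout=30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means CLOSED

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open")
            # reset_timeout elapsed: half-open, let one trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.time()
            raise
        self.failures = 0
        self.opened_at = None
        return result
```

Production implementations like Resilience4j add sliding-window failure rates, slow-call detection, and metrics on top of this basic state machine.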
Here’s a cute little debugging story. Turns out ntpd has a bit of a blind spot!
Adcash CTO Arnaud Granal gives us a rare glimpse into the multiple iterations of their infrastructure. Learn what worked well and what didn’t as they scaled to handle 500k requests per second at peak.
- OpenSRS (registrar and DNS provider, among other services) had a major outage in their DNS service.
At 1AM UTC we were the target of a sophisticated DNS attack that was followed by an unrelated double failure of core network equipment at our main Canadian data center, caused by an undocumented software limitation.
- Amadeus (airline booking system) provides the technical underpinnings of many airlines around the world. They had issues this past week, taking a lot of airlines with them.
Our [data center] hosting provider has been having issues with a power distribution unit.
A couple of DNS-related links this week. I’ll be giving a talk at Velocity NYC on all of the fascinating things I learned about DNS in the wake of the Dyn DDoS and the .io TLD outage last fall. If you’re there, hit me up for some SRE Weekly swag!
We’re all becoming distributed systems engineers, and this stuff sure isn’t easy.
Isn’t distributed programming just concurrent programming where some of the threads happen to execute on different machines? Tempting, but no.
Every-second canarying is a pretty awesome concept. Not only that, but they even post the results on their status page. Impressive!
So many lessons! My favorite is to make sure you test the “sad path”, as opposed to just the “happy path”. If a customer screws up their input and then continues on correctly from there on, does everything still work?
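Sad-path testing is easy to show in miniature: submit bad input first, then correct input, and assert the flow still completes. The signup form here is invented purely for illustration:

```python
class SignupFlow:
    """Toy signup form used only to illustrate sad-path testing."""

    def __init__(self):
        self.completed = False

    def submit_email(self, email):
        if "@" not in email:
            return False  # validation failed; user stays on the form
        self.completed = True
        return True

def test_sad_path_then_recovery():
    flow = SignupFlow()
    assert not flow.submit_email("not-an-email")    # sad path first
    assert flow.submit_email("user@example.com")    # then the fix
    assert flow.completed                           # flow still works
```

The interesting bugs tend to live in that second step: state left over from the failed attempt that breaks the retry.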
Extensive notes taken during 19 talks at SRECon 17 EMEA. I’m blown away by the level of detail. Thanks, Aaron!
A cheat sheet and tool list for diagnosing CPU-related issues. There’s also one on network troubleshooting by the same author. Note: LinkedIn login required to view.
Antifragility is an interesting concept that I was previously unaware of. I’m not really sure how to apply it practically in an infrastructure design, but I’m going to keep my eye out for antifragile patterns.
It’s easy to overlook your DNS, but a failure can take your otherwise perfectly running infrastructure down — at least from the perspective of your customers.
Do you run a retrospective on near misses? The screws they tightened in this story could just as easily be databases quietly running at max capacity.
A piece of one of the venting systems fell and almost hit an employee, which would almost certainly have caused a serious injury and possibly death. The business determined that (essentially) a screw came loose, causing the part to fall. It then checked the remaining venting systems, found that other screws had started coming loose as well, and was able to resolve the issue before anyone got hurt.
Oh look, Azure has AZs now.
The transport layer in question is gRPC, and this article discusses using it to connect a microservice-based infrastructure. If you’ve been looking for an intro to gRPC, check this out.
How do you prevent human error? Remove the humans. Yeah, I’m not sure I believe it either, but this was still an interesting read just to learn about the current state of lights-out datacenters.
This is a really neat idea: generate an interaction diagram automatically using a packet capture and a UML tool.
Thanks to DevOps Weekly for this one.
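The core of the idea is just a text transformation. Assuming the packets have already been parsed into (source, destination, summary) tuples — say, extracted from a pcap with a tool like tshark — emitting PlantUML sequence-diagram text is a few lines:

```python
def packets_to_plantuml(packets):
    """Render (src, dst, summary) tuples as PlantUML sequence-diagram
    text, ready to feed to a UML renderer. Input parsing from an actual
    pcap is assumed to have happened upstream."""
    lines = ["@startuml"]
    for src, dst, summary in packets:
        lines.append(f'"{src}" -> "{dst}": {summary}')
    lines.append("@enduml")
    return "\n".join(lines)
```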
- The .io TLD went down again, in exactly the same way as last fall.
- PagerDuty suffered a major outage lasting over 12 hours this past Thursday. Customers scrambled to come up with other alerting methods.
Some really excellent discussion around this incident happened in the #incident_response channel on the hangops Slack. Others and I requested more details on the actual paging latency, and PagerDuty delivered them on their status site. Way to go, folks!
- I noticed this minor incident after getting a 500 reloading PagerDuty’s status page.
- The Travis CI Blog: this week, Travis posted a followup on their September 6–11 macOS outage, describing the SAN performance issues that impacted their system.
- Outlook and Hotmail