Wow, I have a lot of great content to share with you this week! Sometimes it seems like awesome articles come in waves… not sure what that’s about.
This is the first in a series where New York Times CTO, Nick Rockwell, talks to leaders in the technology world about their work.
There’s so incredibly much awesome in this conversation, and I’ve already seen the internet alight with people quoting it. Charity says so many insightful things that I’m going to have to reread this a couple of times to absorb it all. It’s a must-read!
Xero SRE is back, this time with an article about their incident response process and an overview of their chatbot, Multivac. The bot assists with paging and information tracking and, crucially, guides incident responders through a checklist of actions such as determining severity.
Here’s a fun little distributed system debugging story from the founder of RavenDB.
This CNN article goes into a little more detail about what happened. To my eye, there’s not enough in those details to warrant firing, so there must be more than has been shared publicly.
LinkedIn’s growth from a single datacenter to multiple “hyperscale” locations was accompanied by a cultural shift. They transitioned from “‘Site-Up’ is priority #1” to “taking intelligent risks” as their overall reliability improved.
The program is nominally aimed toward “a variety of industries, including the aerospace, automotive, maritime, manufacturing, oil, chemical, power transmission, medical device, infrastructure planning and extreme event response sectors”, though I can’t help but wonder if it might be applicable to IT.
“Well I’d cut out the pizza and beer and instead pay for Splunk.”
This author pushes us to resist the urge to write something in-house when the tool is not key to delivering customer value, and to look for external services or software instead.
Here’s a very well-articulated argument for using a third-party feature-flag service rather than writing your own. I’ve seen every pitfall they mention and more. This article is by Rollout.io, a feature-flag service, but they notably don’t mention their product even once, and they don’t need to. Nicely done, folks.
I think there’s another layer we get out of the postmortem process itself that hasn’t usually been part of the discussion: communicating about your service’s long-term stability.
We should look beyond merely preventing the same kind of incident in the future and improving our incident response process, says this article from PagerDuty.
How many times have you been paged for a server at 95% disk usage, only to find that it’s still months away from full? This article by SignalFx is about a feature on their platform, but its concepts are generally applicable to other tools.
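To make the idea concrete, here is a minimal sketch of alerting on predicted time-to-full rather than on a fixed usage threshold. It is not SignalFx's implementation; it assumes a plain least-squares line fit over recent samples, and the names `days_until_full` and `should_page` and the 7-day horizon are hypothetical choices for illustration.

```python
def days_until_full(samples, capacity=100.0):
    """Estimate days until usage reaches `capacity`.

    `samples` is a list of (day, usage_percent) pairs, oldest first.
    Fits a straight line by least squares and extrapolates it.
    Returns None when usage is flat or shrinking, or when there is
    too little data to fit a line.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None  # not growing, so no predicted fill date
    intercept = mean_u - slope * mean_t
    day_full = (capacity - intercept) / slope
    latest_day = samples[-1][0]
    return max(0.0, day_full - latest_day)


def should_page(samples, horizon_days=7.0):
    """Page only if the disk is predicted to fill within the horizon."""
    remaining = days_until_full(samples)
    return remaining is not None and remaining <= horizon_days


# A disk at ~94% that grows 0.05 points per day: high usage, but no page.
slow = [(0, 94.0), (1, 94.05), (2, 94.1), (3, 94.15)]
# A disk at 80% that grows 10 points per day: lower usage, but page now.
fast = [(0, 50.0), (1, 60.0), (2, 70.0), (3, 80.0)]

print(should_page(slow), should_page(fast))
```

A straight-line fit is the simplest possible model; workloads with bursts or periodic cleanup would need something more robust, but even this version avoids paging for a nearly full disk that is barely growing.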
A primer on testing failover in a MongoDB Atlas cluster.
Large numbers of SREs went scrambling last month when we realized that we might suddenly run out of resources on our NoSQL workloads. Here are some concrete numbers on how things actually turned out.
- PolitiFact was down for a bit during President Trump’s yearly State of the Union address.
- It seems that folks with two-factor authentication were unable to log in for multiple days.
- The Travis CI Blog: Major build outage: a postmortem report
- Linked is a highly detailed summary of their troubles with an overloaded RabbitMQ cluster.