Articles
A+ article! Susan Fowler has been a developer, an ops person, and now an SRE. That means she’s well-qualified to give an opinion on who should be on call, and she says that the answer is developers (in most cases). Bonus content includes “What does SRE become if developers are on call?”
[…]if you are going to be woken up in the middle of the night because a bug you introduced into code caused an outage, you’re going to try your hardest to write the best code you possibly can, and catch every possible bug before it causes an outage.
Thanks to Devops Weekly for this one.
I figured this new zine from Julia Evans would be mostly review for me. WRONG. I’d never heard of dstat, opensnoop, execsnoop, or perf before, but I sure will be using them now. As far as I can tell, Julia wants to learn literally everything, and better yet, she wants to teach us what she learned and how she learned it. Hats off to her.
“While we’ve got the entire system down to do X, shall we do Y also?”
This article argues that we should never do Y. If something goes wrong, we won’t know whether to roll back X or Y, and it’ll take twice as long to figure out which one is to blame.
This week, Mathias introduces “system blindness”, the flawed understanding of how a system works and the lack of knowledge of how incomplete our understanding of it is. Whether we realize it or not, we struggle to mentally model the intricate interconnections in the increasingly complex systems we’re building.
There are no side effects, just effects that result from our flawed understanding of the system.
I’ve mentioned Spokes (formerly DGit) here previously. This time, GitHub shares the details on how they designed Spokes for high durability and availability.
TIL: Ruby can suffer from Java-style stop-the-world garbage collection freezes.
Here’s a recap of a talk about Facebook’s “Project Storm”, given by VP Jay Parikh at @Scale. Project Storm involved retrofitting Facebook’s infrastructure to handle the failure of entire datacenters.
“I was having coffee with a colleague just before the first drill. He said, ‘You’re not going to go through with it; you’ve done all the prep work, so you’re done, right?’ I told him, ‘There’s only one way to find out’ if it works.”
Here’s an interview with Jason Hand of VictorOps about the importance of a blameless culture. He mentions the idea that “Why?” is an inherently blameful kind of question (hat tip to John Allspaw’s Infinite “How?”s). I have to say that I’m not sure I agree with Jason’s other point that we shouldn’t bother attempting incident prevention, though. Just look at the work the aviation industry has done toward accident prevention.
This article is published by my sponsor, VictorOps, but their sponsorship did not influence its inclusion in this issue.
SCALE has opened their CFP, and one of the chairs told me that they’d “love to get SRE focused sessions on open-source.”
Outages
- British Airways
- FLOW (Jamaica telecom)
- SSP
  - SSP provides a SaaS for insurance companies to run their business on. They’re dealing with a ten-plus-day outage initially caused by some kind of power issue that fried their SAN. As a result, they’re going to decommission the datacenter in question.
- Heroku
  - Full disclosure: Heroku is my employer.
- Azure
  - Two EU regions went down simultaneously.
- Overwatch (game)
- Asana
  - Linked is a postmortem with an interesting set of root causes. A release went out that increased CPU usage, but it didn’t cause issues until peak traffic the next day. Asana is brave for enabling comments on their postmortem. I’m not sure I’d have the stomach for that. Thanks to an anonymous contributor for this one.
- ESPN’s fantasy football
  - Unfortunate timing, being down on opening day.