I’m trying an experiment this week: I’ve included authors at the bottom of each article. I feel like it’s only fair to increase exposure for the folks who put in the significant effort necessary to write these articles. It also saves me having to mention names and companies, hopefully leaving more room for useful summaries.
If you like it, great! If not, please let me know why — reply by email or tweet @SREWeekly. I feel like this is the right thing to do from the perspective of crediting authors, but I’d like to know if a significant number of you disagree.
Hat-tip to Developer Tools Weekly for the idea.
Conversations around compensation for on-call. What has worked or not for you? $$ vs PTO. Alerts vs Scheduled vs Actual Time? 1x, 1.5x, or 2x?
The replies to her tweet are pretty interesting and varied.
Lisa Phillips, VP at Fastly
Full disclosure: Fastly is my employer.
This thread is incredibly well phrased, explaining exactly why it’s important for developers to be on call and how to make that not terrible. Bonus content: the thread also branches out into on-call compensation.
if you aren’t supporting your own services, your services are qualitatively worse **and** you are pushing the burden of your own fuckups onto other people, who also have lives and sleep schedules.
Charity Majors — Honeycomb
This week, Blackrock3 Partners posted an excerpt from their book, Incident Management for Operations, that you can read free of charge. If you enjoy it, I highly recommend you sign up for their first-ever open enrollment IMS training course. I know I keep pushing this, but I truly believe that incident response in our industry as a whole will be significantly improved if more people train with these folks.
“On-call doesn’t have to suck” has been a big theme lately, with articles and comments on both sides. Here’s a pile of great advice from my favorite ops heroine.
Charity Majors — Honeycomb
An interesting little debugging story involving unexpected SSL server-side behavior.
Ayende Rahien — RavenDB
In this post, I’m going to take a look at a sample application that uses the Couchbase Server Multi-Cluster Aware (MCA) Java client. This client goes hand-in-hand with Couchbase’s Cross-Data Center Replication (XDCR) capabilities.
Hod Greeley — Couchbase
Tips on scaling your on-call policy and procedures so that they remain fair and humane to engineers.
Emel Dogrusoz — OpsGenie
Wow, I have a lot of great content to share with you this week! Sometimes it seems like awesome articles come in waves… not sure what that’s about.
This is the first in a series where New York Times CTO, Nick Rockwell, talks to leaders in the technology world about their work.
There’s so incredibly much awesome in this conversation, and I’ve already seen the internet alight with people quoting it. Charity says so many insightful things that I’m going to have to reread this a couple of times to absorb it all. It’s a must-read!
Xero SRE is back, this time with an article about their incident response process and an overview of their chatbot, Multivac. The bot assists with paging and information tracking and, crucially, guides incident responders through a checklist of actions such as determining severity.
Here’s a fun little distributed system debugging story from the founder of RavenDB.
This CNN article goes into a little more detail about what happened. To my eye, there’s not enough in those details to warrant firing, so there must be more than has been shared publicly.
LinkedIn’s growth from a single datacenter to multiple “hyperscale” locations was accompanied by a cultural shift. They transitioned from “‘Site-Up’ is priority #1” to “taking intelligent risks” as their overall reliability improved.
The program is nominally aimed toward “a variety of industries, including the aerospace, automotive, maritime, manufacturing, oil, chemical, power transmission, medical device, infrastructure planning and extreme event response sectors”, though I can’t help but wonder if it might be applicable to IT.
“Well I’d cut out the pizza and beer and instead pay for Splunk.”
This author pushes us to resist the urge to write something in-house and instead look for external services or software when the tool is not key to delivering customer value.
Here’s a very well-articulated argument for using a third-party feature-flag service rather than writing your own. I’ve seen every pitfall they mention and more. This article is by Rollout.io, a feature-flag service, but they notably don’t mention their product even once, and they don’t need to. Nicely done, folks.
I think there’s another layer we get out of the postmortem process itself that hasn’t usually been part of the discussion: communicating about your service’s long-term stability.
We should look beyond merely preventing the same kind of incident in the future and improving our incident response process, says this article from PagerDuty.
How many times have you been paged for a server at 95% disk usage, only to find that it’s still months away from full? This article by SignalFX is about a feature on their platform, but its concepts are generally applicable to other tools.
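To make the core idea concrete, here’s a minimal sketch of forecast-based alerting in Python. The sample data, capacity, and two-week paging window are my own invented illustrations, not SignalFX’s actual algorithm: fit recent growth and page on projected time-to-full rather than on a static threshold.

```python
# Hypothetical hourly disk-usage samples: (hours elapsed, GB used).
samples = [(0, 940.0), (24, 940.6), (48, 941.1), (72, 941.7)]
capacity_gb = 1000.0  # ~94% used, but growing slowly

def hours_until_full(samples, capacity):
    """Least-squares linear fit; returns projected hours until capacity."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / \
            sum((t - mean_t) ** 2 for t, _ in samples)
    if slope <= 0:
        return float("inf")  # flat or shrinking usage: never "full"
    _, latest_u = samples[-1]
    return (capacity - latest_u) / slope

hours = hours_until_full(samples, capacity_gb)
print(f"~{hours / 24:.0f} days until full")
if hours < 14 * 24:
    print("page: disk projected to fill within two weeks")
```

With these numbers, the disk is past a naive 90% threshold but roughly 100 days from filling, so nobody gets woken up. A production version would use a longer sample window and a more robust fit, but the shape of the idea is the same.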
A primer on testing failover in a MongoDB Atlas cluster.
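The primer has the details of triggering the failover itself; as a loose illustration of the kind of probe you might run alongside it, here’s a sketch using pymongo. The connection string and collection name are placeholders, and disabling retryable writes is my own choice so that election errors surface instead of being masked.

```python
import time
from pymongo import MongoClient
from pymongo.errors import AutoReconnect, ConnectionFailure

client = MongoClient(
    "mongodb+srv://user:pass@cluster0.example.mongodb.net/test",  # placeholder URI
    retryWrites=False,  # surface election errors rather than hiding them
)
coll = client.test.failover_probe

# Run this loop, trigger the failover test, and watch how long
# writes fail while the replica set elects a new primary.
while True:
    try:
        coll.insert_one({"ts": time.time()})
        print("write ok")
    except (AutoReconnect, ConnectionFailure) as exc:
        print(f"write failed during election: {exc}")
    time.sleep(1)
```

Running the same probe with retryable writes enabled is also worthwhile: the driver should absorb most of the blip, and confirming that is half the point of the exercise.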
Large numbers of SREs went scrambling last month when we realized that we may suddenly run out of resources on our NoSQL workloads. Here are some concrete numbers on how things actually turned out.