I’m trying an experiment this week: I’ve included authors at the bottom of each article. I feel like it’s only fair to increase exposure for the folks that put in the significant effort necessary to write articles. It also saves me having to mention names and companies, hopefully leaving more room for useful summaries.
If you like it, great! If not, please let me know why — reply by email or tweet @SREWeekly. I feel like this is the right thing to do from the perspective of crediting authors, but I’d like to know if a significant number of you disagree.
Hat-tip to Developer Tools Weekly for the idea.
Conversations around compensation for on-call. What has worked or not for you? $$ vs PTO. Alerts vs Scheduled vs Actual Time? 1x, 1.5x or 2x?
The replies to her tweet are pretty interesting and varied.
Lisa Phillips, VP at Fastly
Full disclosure: Fastly is my employer.
This thread is incredibly well phrased, explaining exactly why it’s important for developers to be on call and how to make that not terrible. Bonus content: the thread also branches out into on-call compensation.
if you aren’t supporting your own services, your services are qualitatively worse **and** you are pushing the burden of your own fuckups onto other people, who also have lives and sleep schedules.
Charity Majors — Honeycomb
This week, Blackrock3 Partners posted an excerpt from their book, Incident Management for Operations, that you can read free of charge. If you enjoy it, I highly recommend you sign up for their first-ever open enrollment IMS training course. I know I keep pushing this, but I truly believe that incident response in our industry as a whole will be significantly improved if more people train with these folks.
“On-call doesn’t have to suck” has been a big theme lately, with articles and comments on both sides. Here’s a pile of great advice from my favorite ops heroine.
Charity Majors — Honeycomb
An interesting little debugging story involving unexpected SSL server-side behavior.
Ayende Rahien — RavenDB
In this post, I’m going to take a look at a sample application that uses the Couchbase Server Multi-Cluster Aware (MCA) Java client. This client goes hand-in-hand with Couchbase’s Cross-Data Center Replication (XDCR) capabilities.
Hod Greeley — Couchbase
Tips for how to go about scaling your on-call policy and procedures in order to be fair and humane to engineers.
Emel Dogrusoz — OpsGenie
- Hurricane Electric (datacenter provider)
- BB&T (Bank)
- Stack Overflow
- TD Bank
- The Things Network
  - The Things Network is an IoT infrastructure provider.
- Google Cloud Platform
  - An incident on February 18th broke autoscaling and prevented communication between new instances and instances in other zones. The linked post-analysis discusses the failure of a process and of the automated failover process.