SRE Weekly Issue #111

I’m trying an experiment this week: I’ve included authors at the bottom of each article. I feel like it’s only fair to increase exposure for the folks that put in the significant effort necessary to write articles. It also saves me having to mention names and companies, hopefully leaving more room for useful summaries.

If you like it, great! If not, please let me know why — reply by email or tweet @SREWeekly. I feel like this is the right thing to do from the perspective of crediting authors, but I’d like to know if a significant number of you disagree.

Hat-tip to Developer Tools Weekly for the idea.

Articles

Twitter: Lisa Phillips about on-call compensation

Conversations around compensation for on-call. What has worked or not for you? $$ vs PTO. Alerts vs Scheduled vs Actual Time? 1 x 1.5 or 2x?

The replies to her tweet are pretty interesting and varied.

Lisa Phillips, VP at Fastly
Full disclosure: Fastly is my employer.

Twitter: Charity Majors about making on-call suck less

This thread is incredibly well phrased, explaining exactly why it’s important for developers to be on call and how to make that not terrible. Bonus content: the thread also branches out into on-call compensation.

if you aren’t supporting your own services, your services are qualitatively worse **and** you are pushing the burden of your own fuckups onto other people, who also have lives and sleep schedules.

Charity Majors — Honeycomb

The Role of the Incident Commander

This week, Blackrock3 Partners posted an excerpt from their book, Incident Management for Operations, that you can read free of charge. If you enjoy it, I highly recommend you sign up for their first-ever open enrollment IMS training course. I know I keep pushing this, but I truly believe that incident response in our industry as a whole will be significantly improved if more people train with these folks.

Oncall and Sustainable Software Development

“On-call doesn’t have to suck” has been a big theme lately, with articles and comments on both sides. Here’s a pile of great advice from my favorite ops heroine.

Charity Majors — Honeycomb

Production postmortem: The unavailable Linux server

An interesting little debugging story involving unexpected SSL server-side behavior.

Ayende Rahien — RavenDB

Couchbase High Availability and Disaster Recovery: Java Multi-Cluster Aware Client

In this post, I’m going to take a look at a sample application that uses the Couchbase Server Multi-Cluster Aware (MCA) Java client. This client goes hand-in-hand with Couchbase’s Cross-Data Center Replication (XDCR) capabilities.

Hod Greeley — Couchbase

Advice to Management Teams While Enrolling Changes to On-Call Systems

Tips for how to go about scaling your on-call policy and procedures in order to be fair and humane to engineers.

Emel Dogrusoz — OpsGenie

Outages

Hurricane Electric (datacenter provider)
BB&T (Bank)
Facebook/Instagram
Stack Overflow
LastPass
TD Bank
The Things Network
- The Things Network is an IoT infrastructure provider.
Hulu
Yahoo
Google Cloud Platform
- An incident on February 18th broke autoscaling and prevented communication between new instances and instances in other zones. The linked post-analysis discusses the failure of a process and of the automated failover process.
