SRE Weekly Issue #122

Articles

Rapid response: how we fixed our on call process to avoid engineer burnout

After adopting a “full ownership” philosophy, this company faced burnout, with five or more separate developers on call simultaneously. Read about their awesome solution involving a shared on-call rotation staffed entirely by volunteers, spurred by the incentive of extra compensation.

Brian Scanlan — Intercom

Google Cloud Platform Blog: SRE vs. DevOps: competing standards or close friends?

What exactly is SRE and how does it relate to DevOps? Earlier this year, we (Liz Fong-Jones and Seth Vargo) launched a video series to help answer some of these questions and reduce the friction between the communities. This blog post summarizes the themes and lessons of each video in the series to offer actionable steps toward better, more reliable systems.

Liz Fong-Jones and Seth Vargo — Google

Making LinkedIn’s Organic Feed Handle Peak Traffic

After a load test uncovered a scaling issue, they dug deep, finding issues with garbage collection settings, cascading failures, and an overeager retry strategy.

Val Markovic — LinkedIn

7 Tips to Get New Engineers Ready to Be On-Call

These tips cover the basics and will be especially useful for teams onboarding engineers who have never been on call before.

Just Culture & High Reliability: The Initial Approach

This article examines a case study of an EMS company attempting to adopt a just culture policy. There’s a great discussion of why it’s not a good idea to lay blame on individuals when systemic problems may be far more important.

Larry Boxman and Paul LeSage — JEMS (Journal of Emergency Medical Services)

SRE@Xero: Managing Incidents Part III

In this third and final article in a series, Xero lays out their process for analyzing incidents after the fact. Thanks to the Xero folks for being so open about your processes and for taking the time to write these articles!

Karthik Nilakant — Xero

Want to Debug Latency?

I like the nifty heat maps with example distributed traces. Neat idea!

JBD — Google

Outages

Sutter Health
Fortnite (incident analysis)
- I really love how deep and technical Fortnite is with their incident analysis articles! Here’s one for their outage in mid-April.
  The Epic Team
Google Compute Engine (us-east4 region)
Atlassian Statuspage
Roku
Hulu
