SRE Weekly Issue #235

Articles

This isn’t just another boring article about SLOs. There’s a ton of good stuff in here about why they moved to SLO-based alerts, too.

we’re hoping that by implementing SLOs – and alerting on them – we’ll be able to improve communication during incidents, reduce the toil on on-callers, and help improve our reliability in a way that’s meaningful to our users.

Mads Hartmann

A nudge in the right direction

Often, serendipity gets us out of an incident or makes it less severe.

Unless we treat this sort of activity as first class when looking at incidents, we won’t really understand how it can be that some incidents get resolved so quickly and some take much longer.

Lorin Hochstein

Seamlessly Swapping the API backend of the Netflix Android app

It’s your classic “replace the engines on a jet while flying it” story. My favorite part is how they recorded real traffic and played it at the old and new backend API to compare the JSON responses.

Rohan Dhruva and Ed Ballot — Netflix

Using feature flags during incident management

Feature flags can help with load shedding and throttling, and feature flag activity can even be useful data that points to contributing factors.

Dawn Parzych — LaunchDarkly

Unimog – Cloudflare’s edge load balancer

Unimog uses a lot of really interesting techniques to balance layer 4 traffic, about which this article goes into in great detail.

David Wragg — Cloudflare

Production testing with dark canaries

I like this idea: it’s like a normal canary, except that you only send it a copy of traffic and discard the result, so as to avoid impacting users.

David Hoa — LinkedIn

SRE Weekly Issue #235

Articles

Outages

Subscribe

RSS

Mastodon

Search Issues

A message from our sponsor, StackHawk:

Articles

Outages

Subscribe

RSS

Mastodon

Search Issues