SRE Weekly Issue #235

A message from our sponsor, StackHawk:

Adding application security tests to your CI pipeline is simple. It typically takes less than 30 minutes to set up automated testing so you can be confident of your application’s security. Check out our onboarding guide to see how to get started.
https://www.stackhawk.com/blog/onboarding-guide?SREWeekly

Articles

This isn’t just another boring article about SLOs. There’s a ton of good stuff in here about why they moved to SLO-based alerts, too.

we’re hoping that by implementing SLOs – and alerting on them – we’ll be able to improve communication during incidents, reduce the toil on on-callers, and help improve our reliability in a way that’s meaningful to our users.

Mads Hartmann

Often, serendipity gets us out of an incident or makes it less severe.

Unless we treat this sort of activity as first class when looking at incidents, we won’t really understand how it can be that some incidents get resolved so quickly and some take much longer.

Lorin Hochstein

It’s your classic “replace the engines on a jet while flying it” story. My favorite part is how they recorded real traffic and replayed it against both the old and new backend APIs to compare the JSON responses.

Rohan Dhruva and Ed Ballot — Netflix
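
For a rough feel of the replay-and-compare idea, here is a minimal sketch. It is not taken from the Netflix article: the names `diff_json` and `replay` are hypothetical, and the two backends are assumed to be plain callables that take a recorded request and return a JSON string.

```python
import json


def diff_json(old, new, path=""):
    """Return the paths at which two decoded JSON values differ."""
    if isinstance(old, dict) and isinstance(new, dict):
        diffs = []
        for key in sorted(set(old) | set(new)):
            child = f"{path}.{key}" if path else key
            if key not in old or key not in new:
                # Key present on only one side.
                diffs.append(child)
            else:
                diffs.extend(diff_json(old[key], new[key], child))
        return diffs
    if isinstance(old, list) and isinstance(new, list):
        if len(old) != len(new):
            return [path or "$"]
        diffs = []
        for i, (a, b) in enumerate(zip(old, new)):
            diffs.extend(diff_json(a, b, f"{path}[{i}]"))
        return diffs
    # Scalars (or mismatched types): compare directly.
    return [] if old == new else [path or "$"]


def replay(recorded_requests, old_backend, new_backend):
    """Send each recorded request to both backends and collect differences.

    Assumption: each backend is a callable returning a JSON string.
    """
    mismatches = {}
    for request in recorded_requests:
        old_body = json.loads(old_backend(request))
        new_body = json.loads(new_backend(request))
        diffs = diff_json(old_body, new_body)
        if diffs:
            mismatches[request] = diffs
    return mismatches
```

Reporting the differing paths rather than a plain equal/not-equal verdict makes it easier to tell an intentional change in the new backend from a regression.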

Feature flags can help with load shedding and throttling, and feature flag activity can even be useful data that points to contributing factors.

Dawn Parzych — LaunchDarkly

Unimog uses a lot of really interesting techniques to balance layer 4 traffic, and this article goes into great detail about them.

David Wragg — Cloudflare

I like this idea: it’s like a normal canary, except that you only send it a copy of traffic and discard the result, so as to avoid impacting users.

David Hoa — LinkedIn
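
To make the idea concrete, here is a minimal sketch of a dark canary. It is not LinkedIn's implementation: `handle`, `primary`, `dark_canary`, and `errors` are hypothetical names, and both services are assumed to be plain callables. A real setup would most likely mirror traffic asynchronously so that the canary cannot add latency.

```python
def handle(request, primary, dark_canary, errors):
    """Serve the user from primary and mirror the request to the dark canary.

    The canary's response is discarded, and its failures are only recorded,
    so nothing the canary does can change what the user receives.
    """
    response = primary(request)
    try:
        dark_canary(request)  # result intentionally discarded
    except Exception as exc:
        # Record the failure for later inspection instead of propagating it.
        errors.append((request, type(exc).__name__))
    return response
```

The failures collected in `errors` are the signal the canary exists to produce; the user-facing response always comes from `primary`.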

Outages

Updated: September 13, 2020 — 8:30 pm
A production of Tinker Tinker Tinker, LLC