This isn’t just another boring article about SLOs. There’s a ton of good stuff in here about why they moved to SLO-based alerts, too.
We’re hoping that by implementing SLOs – and alerting on them – we’ll be able to improve communication during incidents, reduce the toil on on-callers, and help improve our reliability in a way that’s meaningful to our users.
Often, serendipity gets us out of an incident or makes it less severe.
Unless we treat this sort of activity as first class when looking at incidents, we won’t really understand how it can be that some incidents get resolved so quickly and some take much longer.
It’s your classic “replace the engines on a jet while flying it” story. My favorite part is how they recorded real traffic and replayed it against both the old and new backend APIs to compare the JSON responses.
Rohan Dhruva and Ed Ballot — Netflix
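To make the replay-and-compare idea concrete, here is a minimal sketch in Python. It is not Netflix's tooling: the `replay` and `diff_json` helpers are hypothetical names, and the two backends are assumed to be plain callables that take a recorded request and return a JSON string.

```python
import json


def diff_json(old, new, path=""):
    """Return a list of paths where two parsed JSON values differ."""
    if isinstance(old, dict) and isinstance(new, dict):
        diffs = []
        for key in sorted(set(old) | set(new)):
            child = f"{path}.{key}" if path else key
            if key not in old:
                diffs.append(f"{child}: missing in old")
            elif key not in new:
                diffs.append(f"{child}: missing in new")
            else:
                diffs.extend(diff_json(old[key], new[key], child))
        return diffs
    if isinstance(old, list) and isinstance(new, list):
        if len(old) != len(new):
            return [f"{path}: list length {len(old)} != {len(new)}"]
        diffs = []
        for i, (a, b) in enumerate(zip(old, new)):
            diffs.extend(diff_json(a, b, f"{path}[{i}]"))
        return diffs
    # Scalars, or values whose types do not match.
    return [] if old == new else [f"{path}: {old!r} != {new!r}"]


def replay(recorded_requests, old_backend, new_backend):
    """Send each recorded request to both backends and collect differences.

    Both backends are hypothetical callables: request -> JSON string.
    """
    report = {}
    for request in recorded_requests:
        old_body = json.loads(old_backend(request))
        new_body = json.loads(new_backend(request))
        diffs = diff_json(old_body, new_body)
        if diffs:
            report[request] = diffs
    return report
```

Comparing parsed values rather than raw strings means key order and whitespace do not show up as false differences. A real comparison would probably also need to ignore fields that legitimately vary between calls, such as timestamps.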
Feature flags can help with load shedding and throttling, and feature flag activity can even be useful data that points to contributing factors.
Dawn Parzych — LaunchDarkly
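As a rough illustration of both points, the sketch below uses a flag to shed a non-essential feature and records every flag change so it can be lined up with an incident timeline later. It is a toy in-memory store, not LaunchDarkly's SDK: `FlagStore`, `render_page`, and the `recommendations` flag are all hypothetical names.

```python
import time


class FlagStore:
    """Toy in-memory flag store that keeps an audit log of changes."""

    def __init__(self, flags, clock=time.time):
        self._flags = dict(flags)
        self._clock = clock
        self.audit_log = []  # (timestamp, flag name, new value, reason)

    def set(self, name, enabled, reason):
        self._flags[name] = enabled
        self.audit_log.append((self._clock(), name, enabled, reason))

    def is_enabled(self, name):
        # Unknown flags default to off.
        return self._flags.get(name, False)


def render_page(flags):
    """Always serve the core content; optional extras sit behind flags."""
    parts = ["core content"]
    if flags.is_enabled("recommendations"):
        parts.append("recommendations")
    return parts
```

During an overload, an on-caller could call `flags.set("recommendations", False, "shedding load")` to drop the expensive extra while the core page keeps working, and the audit log entry becomes one more data point for the incident review.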
Unimog uses a lot of really interesting techniques to balance layer 4 traffic, which this article covers in great detail.
David Wragg — Cloudflare
I like this idea: it’s like a normal canary, except that you only send it a copy of traffic and discard the result, so as to avoid impacting users.
David Hoa — LinkedIn
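Here is a minimal sketch of that pattern, assuming the primary and the canary are plain callables; `handle` and the counter names are hypothetical and not LinkedIn's implementation. The user only ever sees the primary's response, while the canary's result is thrown away and only its success or failure is counted.

```python
import logging
from collections import Counter

logger = logging.getLogger("shadow_canary")


def handle(request, primary, canary, stats):
    """Serve from the primary and mirror the request to the canary."""
    response = primary(request)
    try:
        canary(request)  # result intentionally discarded
        stats["canary_ok"] += 1
    except Exception:
        # A canary failure is recorded but never reaches the user.
        stats["canary_errors"] += 1
        logger.exception("canary failed for request %r", request)
    return response


def broken_canary(request):
    raise RuntimeError("bug in the new build")


stats = Counter()
result = handle("GET /feed", lambda r: "200 OK", broken_canary, stats)
```

This sketch calls the canary inline for simplicity; in practice the copy would more likely be sent asynchronously so that a slow canary cannot add latency to real requests.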