SRE Weekly Issue #110

Articles

How production engineers support global events on Facebook

Facebook goes in-depth on their preparations for New Year’s Day 2018 in their live streaming infrastructure. They used forecasting based on last year and various kinds of load testing to develop the right kind of scaling strategy to meet demand.

On-call doesn’t have to suck

Cindy Sridharan went and blew up the internet with an excellent and controversial tweet about on-call. She took to Medium to address all of the discussion that followed, and the result is a pretty excellent article about on-call and work/life balance.

Production Test Run: Overburdened and under provisioned

A discussion about how RavenDB handles resource exhaustion, and just how resource exhaustion can be defined and detected.

Development at Honeycomb: Crossing the Observability Bridge to Production

Honeycomb on using observability tooling to precisely analyze how a change actually affects your users. Did the new feature/bugfix have the effect you expected?

Low latency, large working set, and GHC’s garbage collector: pick two of three

Pusher is obsessed with low latency, and for good reason. When they saw high long-tail latency, they discovered that Haskell’s garbage collector is optimized for throughput, rather than latency.

Resilience Engineering at LinkedIn with Project Waterbear | LinkedIn Engineering

LinkedIn’s Project Waterbear seeks to improve resiliency across many of their services through a combination of chaos engineering, cultural changes, and improvements to Rest.li, their common REST framework.

As SREs, we measure, analyze, and provide best practices to help improve the resilience of each application for the application owners and engineering teams.

Observability: the new wave or buzzword?

The tradeoff for more resilient, soft-failing software systems is more complex debugging when things go wrong. As these problems are now more likely to reside deep in application code, which wasn’t the case not long ago, observability tooling is playing catchup.

Everything You Need to Know About DynamoDB Global Tables

OpsGenie analyzes AWS’s new DynamoDB Global Tables, a cross-region multi-master NoSQL datastore. They share the upsides and the pitfalls and include a discussion of how to transition to a global table.

Why, as a Netflix infrastructure manager, am I on call?

A Netflix manager shares his reasons for still being on-call even though he’s a manager, and they’re pretty great. A lot of it has to do with keeping in tune with what it’s like being a developer on his team, especially with regard to on-call burden and operability.

Outages

Visual Studio Team Services (Microsoft)
- Microsoft posted an incredibly detailed analysis of an incident that occurred on February 7th. The interesting bit is that they still don’t know what went wrong, and they included a lot of detail on how they’ve tried to track it down so far. Lots to learn from here.
TD Bank
Snapchat
Vocus Communications (data center provider)
Twitter
