Articles
Facebook goes in-depth on their preparations for New Year’s Day 2018 in their live streaming infrastructure. They used forecasting based on last year and various kinds of load testing to develop the right kind of scaling strategy to meet demand.
Cindy Sridharan went and blew up the internet with an excellent and controversial tweet about on-call. She took to Medium to address all of the discussion that followed, and the result is a pretty excellent article about on-call and work/life balance.
A discussion about how RavenDB handles resource exhaustion, and just how resource exhaustion can be defined and detected.
Honeycomb on using observability tooling to precisely analyze how a change actually affects your users. Did the new feature/bugfix have the effect you expected?
Pusher is obsessed with low latency, and for good reason. When they saw high long-tail latency, they discovered that Haskell’s garbage collector is optimized for throughput, rather than latency.
Facebook’s Project Waterbear seeks to improve resiliency across many of their services through a combination of chaos engineering, cultural changes, and improvements to Rest.li, their common REST framework.
As SREs, we measure, analyze, and provide best practices to help improve the resilience of each application for the application owners and engineering teams.
The tradeoff for more resilient, soft-failing software systems is more complex debugging when things go wrong. As these problems are now more likely to reside deep in application code — which wasn’t the case not along ago — observability tooling is playing catchup.
OpsGenie analyzes AWS’s new DynamoDB Global Tables, a cross-region multi-master NoSQL datastore. They share the upsides and the pitfalls and include a discussion of how to transition to a global table.
A Netflix manager shares his reasons for still being on-call even though he’s a manager, and they’re pretty great. A lot of it has to do with keeping in tune with what it’s like being a developer on his team, especially with regard to on-call burden and operability.
Outages
- Visual Studio Team Services (Microsoft)
- Microsoft posted an incredibly detailed analysis of an incident that occurred on February 7th. The interesting bit is that they still don’t know what went wrong, and they included a lot of detail on how they’ve tried to track it down so far. Lots to learn from here.
- TD Bank
- Snapchat
- Vocus Communications (data center provider)