SRE Weekly Issue #107

Articles

Google Cloud Platform Blog: An example escalation policy — CRE life lessons

Here, “escalation policy” refers to ongoing work by SRE to get a service back into its SLO, rather than an escalation policy definition in PagerDuty (for example). This article describes the tactics a hypothetical Google SRE team has at their disposal to deal with an ailing service. It’s especially striking to me how this policy comes across as almost punitive in nature.

Now You See Me, Now You Don’t: LinkedIn’s Real-Time Presence Platform

In this post, we’ll provide a technical walk-through of how we used the Play Framework and the Akka Actor Model to build the massive infrastructure that keeps track of the online status of millions of members at any given moment. We’ll describe how it distributes thousands of changes per second in the online status of these members to millions of other connected members in real time. You will also learn how to apply these techniques to your own applications.

If You’re Going to Fail, Fail Safely

This article from LaunchDarkly is about assuming failure and mitigating harm, through the lens of feature-flag-based deployment.

What Tools Do Site Reliability Engineers Use?

New Relic shares this list of the categories of tools that SREs use to standardize the systems they support.

As Liz [Fong-Jones] told Matthew Flaming, New Relic vice president of software engineering, “One SRE team is going to have a really difficult time supporting 50 different software engineering teams if they’re each doing their own separate thing, and they’re each using separate tooling.”

Building a Distributed Log from Scratch, Part 5: Sketching a New System

In the final article of this series, Tyler Treat lays out a design for a new distributed log based on NATS.

Observations on the Enterprise of Hiring

While perhaps not strictly SRE-related, hiring is still critically important for SRE teams. I really love Honeycomb’s approach to hiring as laid out in this blog post.

Why is random testing effective for partition tolerance bugs?

Why indeed? This issue of The Morning Paper discusses a paper on the effectiveness of random testing in distributed systems. More specifically, it goes over the mathematics behind why randomized testing in Jepsen is actually useful, despite classical theory suggesting that it ought not to be.

Outages

Pinterest
Google Cloud Storage
- This one’s worth a read. Google’s original status posting stated 100% impact to cloud storage in its US region, but their followup post retroactively reduced that to 2.0% average and 3.6% peak.
Netflix
- This one seemingly happened at the same time as the Google Cloud Storage outage, but the correlation may be spurious. This is the first I've heard that Netflix does have a status page of sorts: it’s an article in their help center entitled “Is Netflix Down?” and they update it live. Who knew?
Facebook/Instagram
National Health Service (UK)
