SRE Weekly Issue #53

Articles

Take It to the Limit: Considerations for Building Reliable Systems

Without explicit limits, things fail in unexpected and unpredictable ways. Remember, the limits exist, they’re just hidden.

A Case Study in Global Fault Isolation

AWS gives us this in-depth explanation of their use of shuffle sharding in the Route 53 service. This is especially interesting given the Dyn DDoS attack a couple of months ago.
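For readers new to the term, here is a minimal sketch of the general idea behind shuffle sharding. It is not AWS's implementation: the `shuffle_shard` function, the node names, and the shard size are all hypothetical, chosen only for illustration. Each customer gets a small, stable, pseudo-random subset of the fleet, so two customers rarely share every node, and a problem triggered by one customer is mostly confined to that customer's own shard.

```python
import hashlib
import random
from math import comb


def shuffle_shard(customer_id, nodes, shard_size):
    """Pick a stable pseudo-random subset of nodes for one customer (illustrative only)."""
    # Derive the seed from a stable hash so the same customer always gets the same shard.
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return sorted(rng.sample(nodes, shard_size))


# Hypothetical fleet of 8 nodes, with 2 nodes per customer.
nodes = [f"node-{i}" for i in range(8)]
shard_a = shuffle_shard("customer-a", nodes, 2)
shard_b = shuffle_shard("customer-b", nodes, 2)

# Number of distinct 2-node shards that can be drawn from 8 nodes.
possible_shards = comb(len(nodes), 2)
```

With 8 nodes and shards of 2 there are 28 possible shards, so only a small fraction of other customers would land on exactly the same pair of nodes as any given customer.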

A container networking overview

How does container networking work? Julia Evans points her curious mind toward this question and shares what she learned.

[…] it’s important to understand what’s going on behind the scenes, so that if something goes wrong I can debug it and fix it.

The Problem with Math: Why Your Monitoring Solution is Wrong

More on the subject of percentiles and incorrect math this week from Circonus. The SLA calculation stuff is especially on point.
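One widely known example of incorrect percentile math is averaging per-host percentiles. The sketch below is an assumption about the kind of error meant, not a summary of the linked article, and the host data and `percentile` helper are made up for illustration. It compares the average of two hosts' p99 latencies with the p99 of their combined requests.

```python
def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    # Integer ceiling of p * n / 100, avoiding floating-point rounding.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds.
busy_host = [10] * 1000   # 1000 fast requests
quiet_host = [500] * 10   # 10 slow requests

average_of_p99s = (percentile(busy_host, 99) + percentile(quiet_host, 99)) / 2
true_p99 = percentile(busy_host + quiet_host, 99)

print(average_of_p99s)  # 255.0
print(true_p99)         # 10
```

The average weights both hosts equally even though one of them served about 1% of the traffic, so it lands far from the p99 of the requests users actually experienced.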

sysadvent: Day 20 – How to set and monitor SLAs

And speaking of SLAs, here’s an excellent article on how to design and adopt an SLA in your product or service.

Systems We Love 2016

A summary of a few notable Systems We Love talks. I’m so jealous of all of you folks who got to go!

#OnCallSelfie – PagerDuty

PagerDuty added #OnCallSelfie support to their app. Amusingly, that first picture is of my (awesome) boss. Hi, Joy!

Summary of Windows Azure Service Disruption on Feb 29th, 2012

A post-analysis of an Azure outage from 2012. The especially interesting thing to me is the secondary outage caused by eagerness to quickly deploy a fix to the first outage. There’s a cognitive trap here: we become overconfident when we think we’ve found The Root Cause and we rush to deploy a patch.
