SRE Weekly Issue #135


SRE looks different from organization to organization, but this recent interview with members of our SRE council showcases their approach to SRE, some of their favorite parts of the practice, and how it continues to evolve.


What might an AWS outage look like? Try this new simulation tool to find out!

It’s not something you’ll want to use for too long (the internet is better when it works, it turns out), but it’s a view that’s well worth taking in, if only to taste the sheer scope of Amazon’s server empire.

Russell Brandom — The Verge (tool by Dhruv Mehrotra)

This article goes step-by-step through setting up a 3-server GlusterFS cluster.

Jack Wallen — TechRepublic

My favorite part of this is the concept of vacations as a “human game day”. Can we survive without you?

Matt Stratton — PagerDuty (with Alice Goldfuss)

One question I have been seeing is “if Istio provides reliability for me, do I have to worry about it in my application?”

The answer is: abso-freakin-lutely :)

Christian Posta

This take on the theft and crashing of an airplane in Seattle is applicable to SRE in multiple ways. It includes discussion of the incident response and some thoughts on what level of risk for extremely rare events is acceptable.

James Fallows — The Atlantic

Two funny GIFs about SRE. Full disclosure: @dbaops is my boss and this stemmed from a DM conversation between us.

@dbaops on Twitter

Coarse-grained health checks might be sufficient for orchestration systems, but prove to be inadequate to ensure quality-of-service and prevent cascading failures in distributed systems.

Cindy Sridharan


Updated: August 19, 2018 — 8:34 pm