SRE Weekly Issue #153

A message from our sponsor, VictorOps:

SRE teams can use chaos engineering, stress testing, and load testing tools to proactively build reliability into the services they run. This list of open source chaos tools can help you get started:

http://try.victorops.com/sreweekly/open-source-chaos-testing-tools

Articles

In this podcast episode, Courtney Eckhardt and the panel cover a lot of bases related to incident response, retrospectives, defensiveness, blamelessness, social justice, and tons more engrossing stuff. Well worth a listen.

Mandy Moore (summary); John K. Sawers, Sam Livingston-Gray, Jamey Hampton, and Coraline Ada Ehmke (panelists); Courtney Eckhardt (guest)

Do you wonder what effect partitioned versus unified consistency might have on latency? Do you want to know what those terms mean? Read on.

Daniel Abadi

Cape is Dropbox’s real-time event processing system. The design bits in this article have a ton of interesting detail, and I also love the part where they go into their motivations behind not just using an existing queuing system.

Peng Kang — Dropbox

This is a great intro to the circuit breaker pattern if you’re unfamiliar with it, and it’s also got a lot of meaty content for folks experienced with them.

Corey Scott — Grab

Though it sounds counterintuitive, more dashboards often make people less informed and less aligned.

Having a few good dashboards is important, but if you have too many, it’ll get in the way of your ability to do dynamic analysis.

Benn Stancil — Mode

What activities count as SRE work, versus “just” Operations?

Site Reliability Engineers do Operations but are not an Operations Team.

Stephen Thorne

Outages

Updated: December 23, 2018 — 8:41 pm
A production of Tinker Tinker Tinker, LLC