SRE Weekly Issue #184

A message from our sponsor, VictorOps:

Do you dream of reducing MTTA from four hours to two minutes? Learn how you can improve incident detection, alerting, real-time incident collaboration and cross-functional transparency to make on-call suck less and build more reliable services:

http://try.victorops.com/sreweekly/improved-incident-response

Articles

This article relates to Donella H. Meadows’s book, Thinking in Systems.

What follows is Meadows’ list of leverage points outfitted with my ideas of where or how they can be applied to software development and web operations.

Ryan Frantz

D:

I know it’s past an hour but… we got ~600 Nagios emails a day. Boss forbade us from muting any of them. In weekly status meeting, he’d often quiz on-call on a random alert. If on-call didn’t know about it, boss would often scream at us…

Jason Antman (@j_antman)

Find out how the Couchbase folks use Jepsen to test their database offering.

Korrigan Clark

A supportive on-call environment is critical to ensuring reliability and resiliency.

Deirdre Mahon — Honeycomb

This is a follow-on to an article I linked to a while back.

It’s really simpler to call it Tech Risk.

I love the idea of tracking the decisions an organization makes and the risks they entail.

Sarah Baker

Outages

Updated: September 8, 2019 — 9:05 pm
A production of Tinker Tinker Tinker, LLC