SRE Weekly Issue #260

A message from our sponsor, StackHawk:

Check out this guide to modern dynamic application security testing to learn how it works and what to look for in tooling.
http://sthwk.com/dynamic-appsec-overview

Articles

People throw around “resiliency” quite often when they mean “reliability” or “high availability”. Dr. Woods sets the record straight.

Ipsita Agarwal — Increment

A key part of their strategy is to keep their service running at 50% capacity or less, allowing them to lose a datacenter without overloading the remaining datacenter.

Mathieu Frappier, Dorothy Jung, and Qui Nguyen — Increment
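
The arithmetic behind that 50% figure can be sketched briefly. This is only an illustration, not the authors' tooling: it assumes equally sized datacenters whose load is spread evenly and shifts evenly to the survivors after a failure, and the function names are hypothetical.

```python
def max_safe_utilization(datacenters: int, tolerated_failures: int = 1) -> float:
    """Highest per-datacenter utilization that still lets the survivors
    absorb all traffic after `tolerated_failures` datacenters are lost.

    Assumes equal capacity and evenly balanced load (an assumption made
    for this sketch, not something stated in the article).
    """
    if datacenters <= tolerated_failures:
        raise ValueError("need at least one surviving datacenter")
    return (datacenters - tolerated_failures) / datacenters


def utilization_after_failure(current: float, datacenters: int, failed: int = 1) -> float:
    """Per-datacenter utilization once the failed sites' load is redistributed."""
    survivors = datacenters - failed
    if survivors <= 0:
        raise ValueError("need at least one surviving datacenter")
    # Total load is current * datacenters; it is now shared by the survivors.
    return current * datacenters / survivors


if __name__ == "__main__":
    # Two datacenters, one tolerated failure: the 50% ceiling from the blurb.
    print(max_safe_utilization(2))
    # Running at exactly 50% leaves the survivor fully loaded but not overloaded.
    print(utilization_after_failure(0.5, 2))
```

With more than two datacenters the same reasoning permits a higher ceiling, since the lost site's load is shared among several survivors.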

In issue #236, I linked to an excellent paper by Dr. Richard Cook and Beth Long about engineering resilience in incident response. Now they’re back, teaming up with John Allspaw to summarize and expand on that paper!

John Allspaw, Beth Adele Long, and Dr. Richard Cook — Increment

A quick s/security/reliability/g and this is an SRE article; the same principles apply to both fields.

Aaron Rinehart — Verica

How can we apply the tenets and principles of NASA mission controllers to our SRE work?

Geoff White — Blameless

Genius idea: we can take our lead from activists as we try to win over our organization to adopt SRE principles.

Chris Hendrix — Blameless

This insightful observation caught my eye:

It’s unnecessary overhead for a product team to plan capacity, set up good alerts and multihoming (automatically running in multiple data centers) for small, simple functionality.

Naphat Sanguansin and Utsav Shah — Dropbox

Outages

Updated: March 7, 2021 — 8:46 pm