People throw around “resiliency” quite often when they mean “reliability” or “high availability”. Dr. Woods sets the record straight.
Ipsita Agarwal — Increment
A key part of their strategy is to keep their service running at 50% capacity or less, so that losing one datacenter does not overload the remaining one.
Mathieu Frappier, Dorothy Jung, and Qui Nguyen — Increment
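The 50% figure follows from simple arithmetic, and a rough sketch may help show why. The code below is not taken from the article. It assumes equally sized datacenters with load spread evenly across them, and the function names are hypothetical, chosen only for illustration.

```python
def max_safe_utilization(sites: int, failures_tolerated: int = 1) -> float:
    """Highest per-site utilization that still lets the survivors absorb
    all traffic. Assumes equal-sized sites and evenly spread load."""
    if sites <= failures_tolerated:
        raise ValueError("need more sites than tolerated failures")
    return (sites - failures_tolerated) / sites


def utilization_after_failure(current: float, sites: int, failed: int = 1) -> float:
    """Per-site utilization once `failed` sites are gone and their load
    is redistributed evenly (same assumptions as above)."""
    survivors = sites - failed
    if survivors <= 0:
        raise ValueError("no surviving sites")
    return current * sites / survivors


if __name__ == "__main__":
    # Two datacenters, tolerate losing one: each must stay at or below 50%.
    print(max_safe_utilization(2))
    # Running at 50% in two datacenters, then losing one: survivor is at 100%.
    print(utilization_after_failure(0.5, 2))
```

Under the same assumptions, three datacenters would allow roughly 67% utilization each while still tolerating the loss of one.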
In issue #236, I linked to an excellent paper by Dr. Richard Cook and Beth Long about engineering resilience in incident response. Now they’re back, teaming up with John Allspaw to summarize and expand on that paper!
John Allspaw, Beth Adele Long, and Dr. Richard Cook — Increment
Apply s/security/reliability/g and this becomes an SRE article; the same principles apply to both fields.
Aaron Rinehart — Verica
How can we apply the tenets and principles of NASA mission controllers to our SRE work?
Geoff White — Blameless
Genius idea: we can take our lead from activists as we try to persuade our organization to adopt SRE principles.
Chris Hendrix — Blameless
This insightful observation caught my eye:
It’s unnecessary overhead for a product team to plan capacity, set up good alerts and multihoming (automatically running in multiple data centers) for small, simple functionality.
Naphat Sanguansin and Utsav Shah — Dropbox