SRE Weekly Issue #29

Articles

I can’t summarize this awesome article well enough, so I’m just going to quote Charity a bunch:

the outcomes associated with operations (reliability, scalability, operability) are the responsibility of *everyone* from support to CEO.

if you have a candidate come in and they’re a jerk to your office manager or your cleaning person, don’t fucking hire that person because having jerks on your team is an operational risk

If you try and just apply Google SRE principles to your own org according to their prescriptive model, you’re gonna be in for a really, really bad time.

Traffic spikes can be incredibly difficult to handle, foreseen or not. Packagecloud.io details its efforts to survive a daily spike of 600% of normal traffic in March.

This checklist is aimed toward deployment on Azure, but a lot of the items could be generalized and applied to infrastructures deployed elsewhere.

An in-depth look at the multiple failures of TNReady, which were mentioned earlier this year (issues #10 and #20).

A two-sided debate, both sides of which are Gareth Rushgrove (maintainer of the excellent Devops Weekly). Should we try to adopt Google’s way of doing things in our own infrastructures? For example, error budgets:

What if you’re operating an air traffic control system or a nuclear power station? Your goal is probably closer to zero outages
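To make the error-budget trade-off concrete, here is a minimal sketch of the usual arithmetic: the budget is whatever fraction of a time window the availability target leaves over. The function name `error_budget_minutes`, the 30-day window, and the sample targets are assumptions chosen for illustration; none of them come from the article.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability target over a window.

    `slo` is a fraction (0.999 means 99.9%). The 30-day default window is
    an assumption for this sketch, not a value taken from the article.
    """
    if not 0.0 <= slo <= 1.0:
        raise ValueError("slo must be a fraction between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


if __name__ == "__main__":
    # Hypothetical targets; the last one is the "zero outages" case.
    for slo in (0.99, 0.999, 0.9999, 1.0):
        print(f"{slo:.2%} target: {error_budget_minutes(slo):.1f} minutes per 30 days")
```

As the target approaches 100%, the budget shrinks toward zero, which is the point of the air traffic control example: a system like that has almost no budget to spend.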

Outages

Updated: July 3, 2016 — 11:03 pm
A production of Tinker Tinker Tinker, LLC