SRE Weekly Issue #20

Articles

Here’s a fairly negative review of the new Google SRE book. The author makes some well-articulated points against the tone of the book and its applicability outside Google. I’ve been hearing some talk of a condescending tone in the book, along with a tendency to claim “inventing” things that others also invented elsewhere. My copy arrives next week — should be an interesting read, for better or worse.

Full disclosure: Heroku, my employer, is mentioned.

The Ripple Effect Of Outages And Downtime Cannot Be Underestimated

A discussion of the impact of an outage on a company’s brand. Skip the last bit; it’s an ad. The rest is worth reading, though.

Reputation and customer loyalty suffers dramatically. The Boston Consulting Group reports that over a quarter of users (28%) never return to a company’s web site if it doesn’t perform sufficiently well.

3 Ways Ops Can Help Devs: A Developer Perspective

Conflict between “dev” and “ops” (whatever they’re called at a given company) can create reliability problems. SRE is in part an effort to relieve that tension, either through embedding or enacting process changes. This article gathers opinions and ideas from ops and dev engineers and proposes three methods for alleviating the tension.

CloudEndure’s 2016 Cloud Migration Survey Reveals 52% of Enterprise Companies Plan to Migrate to Public Clouds Over Next 2 Years

Another interesting survey-based report.

When asked what is the acceptable “downtime window” to finish migrations to minimize downtime, almost half (44%) of respondents said they cannot afford any downtime or, at most, just for under 1 hour.

I’ve done both kinds, and in my experience, migrations with planned downtime end up being the more painful ones, as one is under pressure to meet a predefined outage window, which inevitably slips.

Uptime: How Many 9s Do We Need?

In practice, there’s a point of diminishing returns after which you’re wasting money to get more availability than you need. That’s at the crux of this article, and it’s an interesting read.
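To put rough numbers on those diminishing returns, here is a minimal sketch that converts a count of nines into a yearly downtime budget. It is not taken from the linked article; the function name and the 365-day year are assumptions made for illustration.

```python
# Convert an availability target (a number of nines) into allowed downtime.
# Assumption: a 365-day year, ignoring leap years and planned maintenance.
MINUTES_PER_YEAR = 365 * 24 * 60


def allowed_downtime_minutes(nines: int) -> float:
    """Return the yearly downtime budget, in minutes, for a given number of nines."""
    unavailability = 10 ** (-nines)  # e.g. 3 nines -> 0.001
    return MINUTES_PER_YEAR * unavailability


for n in range(2, 6):
    availability = 100 * (1 - 10 ** (-n))
    print(f"{n} nines ({availability:.3f}%): {allowed_downtime_minutes(n):.1f} min/year")
```

Each extra nine shrinks the budget by a factor of ten: about 525.6 minutes a year at three nines, but only about 5.3 minutes at five. The cost of reaching the next nine usually does not shrink the same way, which is where the diminishing returns come from.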

dastergon/awesome-sre · GitHub

Haven’t gotten your fill from SRE Weekly? Here’s a long list of curated SRE-related links to peruse.

Fault Injection in Production

Here’s a classic from the venerable John Allspaw of Etsy on running gameday scenarios in production. The general process is to brainstorm possible failures, improve the system to handle them, and then test by actually inducing the failures in production.

Imagining failure scenarios and asking, “What if…?” can help combat this thinking and bring a constant sense of unease to the organization. This sense of unease is a hallmark of high-reliability organizations. Think of it as continuously deploying a BCP (business continuity plan).

(emphasis mine)
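The brainstorm, harden, then induce loop can be sketched in a few lines. This is only an illustration under stated assumptions, not Etsy's tooling: `with_fault_injection`, `fetch_price`, and `price_with_fallback` are hypothetical names, and a real gameday would inject faults at the infrastructure level rather than in a wrapper function.

```python
import random


class InjectedFault(Exception):
    """Raised in place of a real dependency failure during a gameday."""


def with_fault_injection(func, failure_rate, rng=None):
    """Wrap func so that roughly failure_rate of calls fail before reaching it."""
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        # random() returns a value in [0.0, 1.0), so a rate of 1.0 always fails
        # and a rate of 0.0 never does.
        if rng.random() < failure_rate:
            raise InjectedFault(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)

    return wrapper


def fetch_price(item):
    # Hypothetical dependency call; stands in for a database or remote service.
    return {"widget": 9.99}[item]


def price_with_fallback(fetch, item, default=None):
    """The hardened caller: degrade to a default instead of crashing."""
    try:
        return fetch(item)
    except InjectedFault:
        return default


# Induce the failure and confirm the caller survives it.
always_failing = with_fault_injection(fetch_price, failure_rate=1.0)
print(price_with_fallback(always_failing, "widget"))
```

For simplicity the fallback here only catches the injected exception type. A production caller would need to handle the real failure modes the injection stands in for, such as timeouts and connection errors.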

What to do with the "rm -rf" hoax question – Meta Server Fault

Yup, turns out it was a hoax. Still generated an interesting conversation though.

Outages

123-reg (UK web hosting)
- An error in a script resulted in mass deletion of customer sites.
Squarespace
Nucleus Market (illicit goods market)
The Pirate Bay
More US voting issues
US state school testing systems
- This week, both New Jersey and Tennessee had to cancel testing due to failures in their computerized testing systems. I’ve mentioned TNReady previously here, and this is their third failure.
Facebook
