SRE Weekly Issue #100

lex

December 3, 2017

Whoa, it’s issue #100! Thank you all so much for reading.

Articles

void *: Incidents as Untyped Pointers

Richard Cook wrote this short, incredibly insightful essay on how we can use incidents to improve our mental model of the system.

An incident is the message from the underlying system about where the people who made and who operate that system are uncalibrated.

a short example of why dimensions are suuuuuper valuable

A nifty trip through a debugging session that shows the importance of being able to dig into high-cardinality fields in your monitoring system.

How to Monitor the SRE Golden Signals

Various sources list a couple of key metrics to keep an eye on, including request rate, error rate, latency, and others. This 6-part series defines the golden signals and shows how to monitor them in several popular systems.
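As a rough illustration of what three of those signals look like when computed by hand, here is a minimal Python sketch. It is not taken from the linked series: the `Request` record, the `golden_signals` function, and the choice of "status 500 and above counts as an error" are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical record of one served request."""
    latency_ms: float
    status: int


def golden_signals(requests, window_seconds):
    """Summarize request rate, error rate, and p99 latency for one window."""
    total = len(requests)
    if total == 0:
        return {"request_rate": 0.0, "error_rate": 0.0, "p99_latency_ms": None}

    # Assumption: only 5xx responses count as errors.
    errors = sum(1 for r in requests if r.status >= 500)

    # Nearest-rank percentile: the value at position ceil(0.99 * total),
    # computed with integer arithmetic to avoid float rounding surprises.
    latencies = sorted(r.latency_ms for r in requests)
    rank = (99 * total + 99) // 100

    return {
        "request_rate": total / window_seconds,  # requests per second
        "error_rate": errors / total,            # fraction of failed requests
        "p99_latency_ms": latencies[rank - 1],
    }
```

In a real monitoring system these numbers would usually come from counters and histograms rather than from raw request lists, but the definitions are the same.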

Thrift on Steroids: A Tale of Scale and Abstraction

This article explains some downsides of Thrift and introduces the author’s solution: Frugal, a Thrift wrapper.

re:Invent 2017 | New Products & Services

re:Invent 2017 is over (whew) and now we have a raft of new products and features to play with. I’m going to leave the detailed analysis for Last Week in AWS and just point out a few bits of special interest to SREs:

Hibernation for spot instances
T2 unlimited
EC2 spread placement groups
Aurora DB multi-master support (preview)
DynamoDB global tables

How Etsy caches: hashing, Ketama, and cache smearing

Etsy details their caching setup and explains the importance of consistent hashing in cache cluster design. I haven’t heard of their practice of “cache smearing” before, and I like it.
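For readers who haven’t run into consistent hashing before, here is a minimal Python sketch of the general idea. It is not Etsy’s code and not the real Ketama library: the `HashRing` class, the `moved_fraction` helper, the use of MD5, and the 100 virtual points per node are all assumptions chosen for illustration.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a position on the ring (first 8 bytes of its MD5)."""
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")


class HashRing:
    """Toy consistent-hash ring with virtual points per node."""

    def __init__(self, nodes, vnodes=100):
        self._vnodes = vnodes
        self._points = []  # sorted ring positions
        self._owner = {}   # ring position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node):
        # Each node claims several positions so load spreads more evenly.
        for i in range(self._vnodes):
            point = _hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def lookup(self, key):
        # A key belongs to the first point clockwise from its own position,
        # wrapping around to the start of the ring if necessary.
        idx = bisect.bisect(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


def moved_fraction(keys, before, after):
    """Fraction of keys whose owning node differs between two rings."""
    return sum(before.lookup(k) != after.lookup(k) for k in keys) / len(keys)
```

The property that matters for a cache cluster: when a node is added, the only keys that change owner are the ones the new node takes over, so most of the cache stays warm. With naive `hash(key) % number_of_nodes` placement, changing the node count typically remaps most keys.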

The role of software in spacecraft accidents

[…] “Success is ironically one of the progenitors of accidents when it leads to overconfidence and cutting corners or making tradeoffs that increase risk.” […]

Won’t Get Fooled Again

Gremlin had an incident that was caused by filled disks. Because they’re Gremlin, they now purposefully fill a disk on a random server every day just to make sure their systems deal with it gracefully, a practice they call “continuous chaos”.

Fearless shared postmortems — CRE life lessons

Google’s CRE team (Customer Reliability Engineering) discusses when to post public followups and how to write them. I love their idea of investigating where they got lucky during an incident, catching cases where things could have been much worse if not for serendipity. I’m going to start using that.

Outages

SRE Weekly Issue #99

lex

November 26, 2017

Lots of outages this week, although not as many as in some previous years on Black Friday. We’ll see what Cyber Monday brings.

I’m writing this from the airport on my way to re:Invent. Perhaps I’ll see some of you there as I rush about from meeting to meeting.

Articles

Why Amazon DynamoDB isn’t for everyone

Complete with a nifty flow-chart for informed decision-making.

7 Habits of Highly Successful Site Reliability Engineers

As the title suggests, this article by New Relic is about the mindset of an SRE. I really love number 3, where they discuss the idea that gating production deploys can actually reduce reliability rather than improve it.

How To Create a High Availability Setup with Heartbeat and Floating IPs on Ubuntu 16.04

It’s what it says on the tin, and it’s targeted at DigitalOcean. One could also use this as a general primer on setting up Heartbeat failover on other cloud platforms.

Chaos Toolkit

The Chaos Toolkit is a free, open source project that enables you to create and apply Chaos Experiments to various types of infrastructure, platforms and applications.

It currently supports Kubernetes and Spring.

The AWS Cloud Goes Underground at re:Invent

Here’s a neat little overview of the temporary but massive network that joins the re:Invent venues up and down the Las Vegas strip. Half of the strip is also set up for Direct Connect to the nearest AWS region.

Diagnosing (and Avoiding) the Top 3 AWS EC2 Issues

The three pitfalls discussed are confusing EBS latency, idle EC2 instances wasting money, and memory leaks. My favorite gotcha isn’t mentioned: performance cliffs caused by running out of burst in T2 instances or GP2 volumes.

Outages

Marketo
Cloudflare
Takealot
Uniqlo
Macy’s
Lowe’s
Monzo
- This really awesome followup analysis of an outage at Monzo is a great example of the fact that there is no such thing as a single root cause.
Lowe’s

SRE Weekly Issue #98

lex

November 19, 2017

Articles

IT Incident Command and Incident Management Course

I’ve mentioned Blackrock3 Partners here before, a team of veteran firefighters that train IT incident responders in the same Incident Management System used by firefighters and other disaster responders. Before now, they’ve only done training and consulting directly with companies.

Now, for the first time, they are opening a training session to the public, so you can learn to be an Incident Commander (IC) without having to be at a company that contracts with them. Their training will significantly up your incident response game, so don’t miss out on this. Click through for information on tickets.

Blackrock3 Partners has not provided me with compensation in any form for including this link.

How Your Systems Keep Running Day After Day

This is John Allspaw’s 30-minute talk at DOES17, and it contains so much awesomeness that I really hope you’ll make time for it. Here are a couple of teasers (paraphrased):

Treat incidents as unplanned investments in your infrastructure.

Perform retrospectives not to brainstorm remediation items but to understand where your mental model of the system went wrong.

Slack’s Julia Grace on the lessons learned from downtime, and the responsibility to pay it forward

Here’s some more detail on Slack’s major outage on Halloween, in the form of a summary of an interview with their director of infrastructure, Julia Grace.

Google Cloud Platform Blog: With Multi-Region support in Cloud Spanner, have your cake and eat it too

Google claims a lot with Cloud Spanner. Does it deliver? I’d really like to see a balanced, deeply technical review, so if you know of one, please drop me a link.

With this release, we’ve extended Cloud Spanner’s transactions and synchronous replication across regions and continents. That means no matter where your users may be, apps backed by Cloud Spanner can read and write up-to-date (strongly consistent) data globally and do so with minimal latency for end users.

On-Call Horror Story Number Four: This Wins the Most Debilitating Award

Ever been on-call for work and your baby? I think a fair number of us can relate. Thankfully, it sounds like these folks realized that it’s not exactly a best practice to have a parent of a 5-day-old preemie be on call…

Fault, Error, Failure; Availability and Stability

Here’s a nice pair of articles on fault tolerance and availability. In the first post (linked above), the author defines the terms “fault”, “error”, and “failure”. The second post starts with definitions of “availability” and “stability” and covers ways of achieving them.

[John Allspaw] My Next Step

John Allspaw, former CTO of Etsy and author of a ton of awesome articles I’ve featured here, is moving on to something new.

Along with Dr. Richard Cook and Dr. David Woods, I’m launching a company we’re calling Adaptive Capacity Labs, and it’s focused on helping companies (via consulting and training) build their own capacity to go beyond the typical “template-driven” postmortem process and treat post-incident review as the powerful insight lens it can be.

I’m really hoping to have an opportunity to try out their training, because I know it’s going to be awesome.

Outages

Heroku
- Heroku suffered an outage caused by Daylight Saving Time, according to this incident report. Happens to someone every year.
  Full disclosure: Heroku is my employer.
Google Docs
Discord

SRE Weekly Issue #97

lex

November 12, 2017

Articles

SRE@Xero: Managing Incidents Part I

Last month, I linked to an article on Xero’s incident response process, and I said:

I find it interesting that incident response starts off with someone filling out a form.

This article goes into detail on how the form works, why they have it, and the actual questions on the form! Then they go on to explain their “on-call configuration as code” setup, which is really nifty. I can’t wait to see part II and beyond.

Stretching Spokes

Spokes is GitHub’s system for storing distributed replicas of git repositories. This article explains how they can do this over long distances in a reasonable amount of time (and why that’s hard). I especially love the “Spokes checksum” concept.

Fly the airplane: Three practices for effective incident response

From the CEO of NS1, a piece on the value of checklists in incident response.

The Ultimate Guide to Secondary DNS

Here’s another great guide on the hows and whys of secondary DNS, including options on dealing with nonstandard record types that aren’t compatible with AXFR.

Availability has a new meaning. And it doesn’t include planned downtime.

From a customer’s perspective, “planned downtime” and “outage” often mean the same thing.

Risks of a “serverless” future: dissolving valuable infrastructure

“serverless” != “NoOps”

Willis urges the importance of integration with existing operations processes over replacement. “Serverless is just another form of compute. … All the core principles that we’ve really learned about high-performance organizations apply differently … but the principles stay the same,” he said.

Root Cause Analysis as Storytelling – Wide Awake Developers

When we use root cause analysis, says Michael Nygard, we narrow our focus into counter-factuals that get in the way of finding out what really happened.

CW: hypothetical violent imagery

Outages

This week had a weirdly large number of outages!

Heroku
- Heroku posted a public followup for incident #1334, with a pretty interesting cause. At the end of the month, load on an internal API increased because the number of apps that ran out of monthly free quota hit a peak.
  Full disclosure: Heroku is my employer.
How a Tiny Error Shut Off the Internet for Parts of the US
- I normally don’t include ISP failures, but this one was widespread across the US and had an interesting cause. Level 3 accidentally created a route leak that broke traffic for many Comcast customers (including me).
Google App Engine Memcache Service
- Linked is Google’s followup analysis, which suggests that the outage was due to a scaling issue in a configuration database.
OVH to Disassemble Container Data Centers after Epic Outage in Europe
Snapchat
Instagram
E-Trade
Grindr
Netflix
Yahoo Mail

SRE Weekly Issue #96

lex

November 5, 2017

Articles

The Phone Book Is On Fire: Lessons From the Dyn DNS DDoS — Velocity NYC 2017

Here’s the recording of my Velocity 2017 talk, posted on YouTube with permission from O’Reilly (thanks!). Want to learn about some gnarly DNS details?

Log20: Fully automated optimal placement of log printing statements under specified overhead threshold

I fell in love with this after reading just the title, and it only got better from there. Why add debug statements haphazardly when an algorithm can automatically figure out where they’ll be most effective? I especially love the analysis of commit histories to build stats on when debug statements were added to various open source projects.

Operating a Kubernetes network

Julia Evans is back with another article about Kubernetes. Along with explaining how it all fits together, she describes a few things that can go wrong and how to fix them.

How can we apply the principles of chaos engineering to AWS Lambda?

In this introductory post of a four-part series, we learn why chaos testing a lambda-based infrastructure is especially challenging.

Google Vizier: A service for black-box optimization

I love the idea of a service that automatically optimizes things even without knowing anything about their internals. Mmm, cookies.

Lyft’s Envoy dashboards – mattklein123 – Medium

What we are releasing is unfortunately not going to be readily consumable. It is also not an OSS project that will be maintained in any way. The goal is to provide a snapshot of what Lyft does internally (what is on each dashboard, what stats do we look at, etc.). Our hope is having that as a reference will be useful in developing new dashboards for your organization.

Microsoft has built a secret network emulator it says can prevent most cloud outages

It’s not a secret since they published a paper about it. This is an intriguing idea, but I’m wondering whether it’s really more effective than staging environments tend to be in practice.

The Rise of Site Reliability Engineers

A history of the SRE profession and a description of how New Relic does SRE.

Full disclosure: Heroku, my employer, is mentioned.

Outages

Collision with buffer stops at King’s Cross station, London, 15 August 2017
- This is the Rail Accident Investigation Branch’s report on a minor accident involving a driver who suffered a “microsleep” due to fatigue.
LearnVest
Slack
