
SRE Weekly Issue #101

SPONSOR MESSAGE

Integrate VictorOps into your SRE ops to support faster recovery and improved post-incident analysis. Get your free trial started today: http://try.victorops.com/SREWeekly/FreeTrial

Articles

It’s Sysadvent season again! This article is a great introduction to the idea that there is never just one root cause in an incident.

Want to try out chaos engineering? Here are four kinds of terrible things you can do to your infrastructure, from the folks at Gremlin.

To be clear, this is about using Consul as part of load balancing another service, not load-balancing Consul itself. Several methods are discussed, along with the pros and cons of each.

This article has some interesting ideas, including automated root cause discovery or at least computer-assisted analysis. It also contains this week’s second(!) Challenger shuttle accident reference.

As job titles change, this author argues that the same basic operations skills are still applicable.

Here’s Catchpoint’s yearly round-up of how various sites fared over the recent US holiday period.

These terms mean similar things and are sometimes used interchangeably. Baron Schwartz sets the record straight, defining each term and explaining the distinctions between them.

If you have a moment, please consider filling out this survey by John Allspaw:

[…] I’m looking to understand what engineers in software-reliant companies need in learning better from post-incident reviews.

In a continuation of last week’s article, Google’s CRE team discusses sharing a postmortem with customers. “Sharing” here means not only giving it to them, but actually working on the postmortem process together with customers, including assigning them followup actions(!).

SRE Amy Tobey approached a new SRE gig with a beginner’s mind and took notes. The result is a set of lessons learned and observations that may come in handy the next time you change jobs.

Outages

SRE Weekly Issue #100


Whoa, it’s issue #100! Thank you all so much for reading.

SPONSOR MESSAGE

Integrate VictorOps into your SRE ops to support faster recovery and improved post-incident analysis. Get your free trial started today: http://try.victorops.com/SREWeekly/FreeTrial

Articles

Richard Cook wrote this short, incredibly insightful essay on how we can use incidents to improve our mental model of the system.

An incident is the message from the underlying system about where the people who made and who operate that system are uncalibrated.

A nifty trip through a debugging session that shows the importance of being able to dig into high-cardinality fields in your monitoring system.

Various sources list a handful of key metrics to keep an eye on: request rate, error rate, latency, and others. This 6-part series defines the golden signals and shows how to monitor them in several popular systems.
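
To make the definitions concrete, here’s a minimal sketch (mine, not from the series) of computing a few golden signals over a window of request records; the record fields and window size are assumptions for illustration.

    # A minimal sketch (not from the linked series) of computing golden
    # signals over a window of request records. The record fields
    # ("status", "duration_ms") and the window size are assumptions.
    def golden_signals(requests, window_seconds=60):
        """requests: iterable of dicts like {"status": 200, "duration_ms": 12.3}"""
        requests = list(requests)
        total = len(requests)
        errors = sum(1 for r in requests if r["status"] >= 500)
        latencies = sorted(r["duration_ms"] for r in requests)

        def pct(p):
            # nearest-rank percentile over the sorted latencies
            if not latencies:
                return 0.0
            return latencies[min(int(len(latencies) * p), len(latencies) - 1)]

        return {
            "request_rate": total / window_seconds,          # traffic
            "error_rate": errors / total if total else 0.0,  # errors
            "latency_p95_ms": pct(0.95),                     # latency
            "latency_p99_ms": pct(0.99),
        }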

This article explains some downsides of Thrift and introduces the author’s solution: Frugal, a Thrift wrapper.

re:Invent 2017 is over (whew) and now we have a raft of new products and features to play with. I’m going to leave the detailed analysis for Last Week in AWS and just point out a few bits of special interest to SREs:

  • Hibernation for spot instances
  • T2 unlimited
  • EC2 spread placement groups
  • Aurora DB multi-master support (preview)
  • DynamoDB global tables

Etsy details their caching setup and explains the importance of consistent hashing in cache cluster design. I hadn’t heard of their practice of “cache smearing” before, and I like it.
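
For the unfamiliar, here’s a bare-bones consistent-hash ring (my sketch, not Etsy’s code): keys map to the first node clockwise on the ring, so adding or removing a cache host only remaps the keys in that host’s arc instead of reshuffling everything.

    # A bare-bones consistent-hash ring (an illustration, not Etsy's code).
    import bisect
    import hashlib

    class HashRing:
        def __init__(self, nodes, vnodes=100):
            # Virtual nodes smooth out the key distribution across hosts.
            self._ring = sorted(
                (self._hash(f"{node}#{i}"), node)
                for node in nodes
                for i in range(vnodes)
            )
            self._hashes = [h for h, _ in self._ring]

        @staticmethod
        def _hash(key):
            return int(hashlib.md5(key.encode()).hexdigest(), 16)

        def node_for(self, key):
            # First ring position clockwise from the key's hash, wrapping around.
            idx = bisect.bisect(self._hashes, self._hash(key)) % len(self._ring)
            return self._ring[idx][1]

    ring = HashRing(["memcached-a", "memcached-b", "memcached-c"])
    print(ring.node_for("user:1234"))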

[…] “Success is ironically one of the progenitors of accidents when it leads to overconfidence and cutting corners or making tradeoffs that increase risk.” […]

Gremlin had an incident caused by full disks. Because they’re Gremlin, they now purposefully fill a disk on a random server every day just to make sure their systems deal with it gracefully, a practice they call “continuous chaos”.
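
Gremlin’s product handles this for you; purely as an illustration of the idea, a crude disk-fill experiment might look something like the sketch below. The scratch path and thresholds are placeholders, and you’d obviously only run it on hosts you expect to cope.

    # A crude sketch of a disk-fill experiment (not Gremlin's tooling).
    # Writes junk to a scratch file until the filesystem is nearly full,
    # waits so you can observe the effects, then cleans up.
    import os
    import shutil
    import time

    def fill_disk(path="/var/tmp/chaos.fill", leave_free_bytes=100 * 2**20,
                  chunk_bytes=8 * 2**20, hold_seconds=300):
        try:
            with open(path, "wb") as f:
                while shutil.disk_usage(os.path.dirname(path)).free > leave_free_bytes:
                    f.write(os.urandom(chunk_bytes))
                    f.flush()
                    os.fsync(f.fileno())
            time.sleep(hold_seconds)  # watch how alerts, logs, and services behave
        finally:
            if os.path.exists(path):
                os.remove(path)       # always reclaim the space

    if __name__ == "__main__":
        fill_disk()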

Google’s CRE team (Customer Reliability Engineering) discusses when to post public followups and how to write them. I love their idea of investigating where they got lucky during an incident, catching cases where things could have been much worse if not for serendipity. I’m going to start using that.

Outages

SRE Weekly Issue #99

Lots of outages this week, although not as many as in some previous years on Black Friday.  We’ll see what Cyber Monday brings.

I’m writing this from the airport on my way to re:Invent.  Perhaps I’ll see some of you there as I rush about from meeting to meeting.

SPONSOR MESSAGE

Attending AWS re:Invent 2017? Visit the VictorOps booth, schedule a meeting, or join us for some after hours fun. See you in Vegas! http://try.victorops.com/SREWeekly/AWS

Articles

Complete with a nifty flow-chart for informed decision-making.

As the title suggests, this article by New Relic is about the mindset of an SRE. I really love number 3, where they discuss the idea that gating production deploys can actually reduce reliability rather than improve it.

It’s what it says on the tin, and it’s targeted at DigitalOcean. One could also use this as a general primer on setting up Heartbeat failover on other cloud platforms.

The Chaos Toolkit is a free, open source project that enables you to create and apply Chaos Experiments to various types of infrastructure, platforms and applications.

It currently supports Kubernetes and Spring.

Here’s a neat little overview of the temporary but massive network that joins the re:Invent venues up and down the Las Vegas strip. Half of the strip is also set up for Direct Connect to the nearest AWS region.

The three pitfalls discussed are confusing EBS latency, idle EC2 instances wasting money, and memory leaks. My favorite gotcha isn’t mentioned: performance cliffs caused by running out of burst in T2 instances or GP2 volumes.

Outages

SRE Weekly Issue #98

SPONSOR MESSAGE

Attending AWS re:Invent 2017? Visit the VictorOps booth, schedule a meeting, or join us for some after hours fun. See you in Vegas! http://try.victorops.com/SREWeekly/AWS

Articles

I’ve mentioned Blackrock3 Partners here before: a team of veteran firefighters who train IT incident responders in the same Incident Management System used by firefighters and other disaster responders. Until now, they’ve only done training and consulting directly with companies.

Now, for the first time, they are opening a training session to the public, so you can learn to be an Incident Commander (IC) without having to be at a company that contracts with them. Their training will significantly up your incident response game, so don’t miss out on this. Click through for information on tickets.

Blackrock3 Partners has not provided me with compensation in any form for including this link.

This is John Allspaw’s 30-minute talk at DOES17, and it contains so much awesomeness that I really hope you’ll make time for it. Here are a couple of teasers (paraphrased):

Treat incidents as unplanned investments in your infrastructure.

Perform retrospectives not to brainstorm remediation items but to understand where your mental model of the system went wrong.

Here’s some more detail on Slack’s major outage on Halloween, in the form of a summary of an interview with their director of infrastructure, Julia Grace.

Google claims a lot with Cloud Spanner. Does it deliver? I’d really like to see a balanced, deeply technical review, so if you know of one, please drop me a link.

With this release, we’ve extended Cloud Spanner’s transactions and synchronous replication across regions and continents. That means no matter where your users may be, apps backed by Cloud Spanner can read and write up-to-date (strongly consistent) data globally and do so with minimal latency for end users.

Ever been on-call for work and your baby? I think a fair number of us can relate. Thankfully, it sounds like these folks realized that it’s not exactly a best practice to have the parent of a 5-day-old preemie be on call…

Here’s a nice pair of articles on fault tolerance and availability. In the first post (linked above), the author defines the terms “fault”, “error”, and “failure”. The second post starts with definitions of “availability” and “stability” and covers ways of achieving them.

John Allspaw, former CTO of Etsy and author of a ton of awesome articles I’ve featured here, is moving on to something new.

Along with Dr. Richard Cook and Dr. David Woods, I’m launching a company we’re calling Adaptive Capacity Labs, and it’s focused on helping companies (via consulting and training) build their own capacity to go beyond the typical “template-driven” postmortem process and treat post-incident review as the powerful insight lens it can be.

I’m really hoping to have an opportunity to try out their training, because I know it’s going to be awesome.

Outages

  • Heroku
    • Heroku suffered an outage caused by Daylight Saving Time, according to this incident report. Happens to someone every year. Full disclosure: Heroku is my employer.
  • Google Docs
  • Discord

SRE Weekly Issue #97

SPONSOR MESSAGE

Attending AWS re:Invent 2017? Visit the VictorOps booth, schedule a meeting, or join us for some after hours fun. See you in Vegas! http://try.victorops.com/SREWeekly/AWS

Articles

Last month, I linked to an article on Xero’s incident response process, and I said:

I find it interesting that incident response starts off with someone filling out a form.

This article goes into detail on how the form works, why they have it, and the actual questions on the form! Then they go on to explain their “on-call configuration as code” setup, which is really nifty. I can’t wait to see part II and beyond.

Spokes is GitHub’s system for storing distributed replicas of git repositories. This article explains how they can do this over long distances in a reasonable amount of time (and why that’s hard). I especially love the “Spokes checksum” concept.

From the CEO of NS1, a piece on the value of checklists in incident response.

Here’s another great guide on the hows and whys of secondary DNS, including options on dealing with nonstandard record types that aren’t compatible with AXFR.

From a customer’s perspective, “planned downtime” and “outage” often mean the same thing.

“serverless” != “NoOps”

Willis urges the importance of integration with existing operations processes over replacement. “Serverless is just another form of compute. … All the core principles that we’ve really learned about high-performance organizations apply differently … but the principles stay the same,” he said.

When we use root cause analysis, says Michael Nygard, we narrow our focus onto counterfactuals that get in the way of finding out what really happened.

CW: hypothetical violent imagery

Outages

This week had a weirdly large number of outages!

A production of Tinker Tinker Tinker, LLC