SRE Weekly Issue #103

SPONSOR MESSAGE

Looking for light reading for the new year? Dive into a VictorOps favorite: Scala Unified Logging to Full System Observability. http://try.victorops.com/SREWeekly/SystemVisibility

Articles

Gremlin Inc. helps folks simulate failure, but what happens when they turn their tools on their own infrastructure? In this article, they share all sorts of juicy details about how they set up their experiments, what they hoped to prove and thought might happen, and then what actually happened, including an unexpected failure mode.

This article series isn’t actually about writing your own distributed log from scratch — probably not a good idea. It’s about learning the fundamental principles involved in designing such systems so that we can better understand them while operating and using them.
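If you haven’t bumped into the abstraction before, here’s a minimal sketch (in Python, purely illustrative and not from the series) of the core idea: an append-only sequence of records addressed by offset.

```python
# A toy append-only log: each record is addressed by the offset at
# which it was written. Real distributed logs layer segmenting,
# replication, and durability on top of this same core idea.

class Log:
    def __init__(self):
        self._records = []

    def append(self, record: bytes) -> int:
        """Append a record and return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset: int) -> bytes:
        """Read back the record stored at the given offset."""
        return self._records[offset]


log = Log()
offset = log.append(b"user-signup")
assert log.read(offset) == b"user-signup"
```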

What do you do about the scary system that nobody touches and everyone is afraid will fall over some day? This article shows you a concrete plan for digging in and dealing with the skeleton in the closet.

It’s Julia Evans, writing at Stripe!

In this post, we’ll explain why we chose to build on top of Kubernetes. We’ll examine how we integrated Kubernetes into our existing infrastructure, our approach to building confidence in (and improving) our Kubernetes’ cluster’s reliability, and the abstractions we’ve built on top of Kubernetes.

AppOptics’s take on alerting, including this gem:

More often, our metric choices and threshold values are guided by our preexisting tools. Hence, if our tools cannot measure latency, we do not alert on latency.

How many times have you seen a migration or transition reach 90% completion and stall? This SysAdvent author urges caution in engaging a “hybrid cloud” vendor solution.

Juniper discusses the evolution of the Network Engineer role into Network Reliability Engineer (NRE).

Just like sysadmins have graduated from technicians to technologists as SREs, the NRE title is a declaration of a new culture and serves as the zenith for all that we do and have as engineers of network invincibility.

A primer on setting up load testing for WebDAV using Apache JMeter.

An interesting debugging story involving a tricky data corruption bug in RavenDB.

Outages

SRE Weekly Issue #102


My phone died this week, and I discovered the hard way that my backups hadn’t been functioning properly.  SRE Weekly is served out of a single t2.micro, too.  Sometimes it’s hard to practice what I preach outside of work. ;)  Anyway, it’s a bit of a light issue this week, but there’s still some great stuff.

SPONSOR MESSAGE

A robust mobile app is essential for on-call. See why VictorOps updated both native iOS and Android apps. http://try.victorops.com/SREWeekly/MobileBlog

Articles

I’ve lost count of the number of incidents I’ve witnessed that were caused by TCP connections in which one end disappeared. This cautionary tale has a pretty interesting cause as well.
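For reference, here’s a minimal sketch (not from the article) of one common defense in Python: enabling TCP keepalives so the kernel eventually notices a peer that vanished without a FIN or RST. The TCP_KEEP* option names shown are Linux-specific.

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Ask the kernel to probe idle connections instead of letting a
# half-open connection linger forever.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)   # seconds idle before first probe
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)  # seconds between probes
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)     # failed probes before reset

sock.connect(("example.com", 443))
```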

In many ways, what you do with an incident post-mortem can be as important as the post-mortem itself. Below, I explain why and offer tips on what to do after the post-mortem is complete.

If the title of this article doesn’t make sense to you, then you may well have been interpreting traceroute results incorrectly. Definitely worth a read.

Gremlin Inc. is live! Here’s the official “coming out” post for this chaos engineering startup.

There’s so much to delve into in this long article about Netflix’s infrastructure. It’s part of the book Explain the Cloud Like I’m 10, but I didn’t really find the explanations watered-down or over-simplified.

A great description of booking.com‘s incident response and followup process.

Incidents are like presents: You love them as long as you don’t get the same present twice.

Outages

SRE Weekly Issue #101

SPONSOR MESSAGE

Integrate VictorOps into your SRE ops to support faster recovery and improved post-incident analysis. Get your free trial started today: http://try.victorops.com/SREWeekly/FreeTrial

Articles

It’s Sysadvent season again! This article is a great introduction to the idea that there is never just one root cause in an incident.

Want to try out chaos engineering? Here are four kinds of terrible things you can do to your infrastructure, from the folks at Gremlin.

To be clear, this is about using Consul to load-balance another service, not about load-balancing Consul itself. Several methods are discussed, along with the pros and cons of each.
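As a rough illustration of one approach (client-side lookup against Consul’s health API), here’s a sketch in Python. The local agent address and the service name “web” are assumptions, not details from the article.

```python
import random

import requests

def healthy_instances(service: str):
    """Ask the local Consul agent for instances with passing health checks."""
    resp = requests.get(
        f"http://127.0.0.1:8500/v1/health/service/{service}",
        params={"passing": "1"},
        timeout=5,
    )
    resp.raise_for_status()
    return [
        # Fall back to the node address when the service registration
        # doesn't carry its own address.
        (entry["Service"]["Address"] or entry["Node"]["Address"],
         entry["Service"]["Port"])
        for entry in resp.json()
    ]

# Naive client-side balancing; the article weighs fancier options.
address, port = random.choice(healthy_instances("web"))
```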

This article has some interesting ideas, including automated root cause discovery or at least computer-assisted analysis. It also contains this week’s second(!) Challenger shuttle accident reference.

As job titles change, this author argues that the same basic operations skills are still applicable.

Here’s Catchpoint’s yearly round-up of how various sites fared over the recent US holiday period.

These terms mean similar things and are sometimes used interchangeably. Baron Schwartz sets the record straight, defining each one and explaining the distinctions between them.

If you have a moment, please consider filling out this survey by John Allspaw:

[…] I’m looking to understand what engineers in software-reliant companies need in learning better from post-incident reviews.

In a continuation of last week’s article, Google’s CRE team discusses sharing a postmortem with customers. “Sharing” here means not only giving it to them, but actually working on the postmortem process together with customers, including assigning them followup actions(!).

SRE Amy Tobey approached a new SRE gig with a beginner’s mind and took notes. The result is a useful set of lessons learned and observations that may come in handy the next time you change jobs.

Outages

SRE Weekly Issue #100


Whoa, it’s issue #100! Thank you all so much for reading.

SPONSOR MESSAGE

Integrate VictorOps into your SRE ops to support faster recovery and improved post-incident analysis. Get your free trial started today: http://try.victorops.com/SREWeekly/FreeTrial

Articles

Richard Cook wrote this short, incredibly insightful essay on how we can use incidents to improve our mental model of the system.

An incident is the message from the underlying system about where the people who made and who operate that system are uncalibrated.

A nifty trip through a debugging session that shows the importance of being able to dig into high-cardinality fields in your monitoring system.

Various sources list a handful of key metrics to keep an eye on, including request rate, error rate, latency, and others. This six-part series defines the golden signals and shows how to monitor them in several popular systems.
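As a back-of-the-envelope illustration (the record fields and window size here are made up, not taken from the series), computing three of those signals from a window of request records might look like this:

```python
from statistics import quantiles

WINDOW_SECONDS = 60
requests_log = [
    {"status": 200, "duration_ms": 45},
    {"status": 500, "duration_ms": 1200},
    {"status": 200, "duration_ms": 80},
]

request_rate = len(requests_log) / WINDOW_SECONDS
error_rate = sum(r["status"] >= 500 for r in requests_log) / len(requests_log)
p99_latency = quantiles([r["duration_ms"] for r in requests_log], n=100)[98]

print(f"rate={request_rate:.2f} req/s  errors={error_rate:.1%}  p99={p99_latency:.0f}ms")
```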

This article explains some downsides of Thrift and introduces the author’s solution: Frugal, a Thrift wrapper.

re:Invent 2017 is over (whew), and now we have a raft of new products and features to play with. I’m going to leave the detailed analysis to Last Week in AWS and just point out a few bits of special interest to SREs:

  • Hibernation for spot instances
  • T2 unlimited
  • EC2 spread placement groups
  • Aurora DB multi-master support (preview)
  • DynamoDB global tables

Etsy details their caching setup and explains the importance of consistent hashing in cache cluster design. I hadn’t heard of their practice of “cache smearing” before, and I like it.
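If consistent hashing is new to you, here’s a bare-bones ring sketch (illustrative only, not Etsy’s implementation). The payoff is that adding or removing a cache host remaps only a small slice of keys rather than reshuffling everything.

```python
import hashlib
from bisect import bisect

class Ring:
    def __init__(self, servers, vnodes=100):
        # Place several virtual nodes per server to smooth out the
        # distribution of keys around the ring.
        self._ring = sorted(
            (self._hash(f"{server}-{i}"), server)
            for server in servers
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def server_for(self, key: str) -> str:
        # A key belongs to the first server clockwise from its hash.
        idx = bisect(self._points, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = Ring(["cache1", "cache2", "cache3"])
print(ring.server_for("user:42"))
```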

[…] “Success is ironically one of the progenitors of accidents when it leads to overconfidence and cutting corners or making tradeoffs that increase risk.” […]

Gremlin had an incident that was caused by filled disks. Because they’re Gremlin, they now purposefully fill a disk on a random server every day just to make sure their systems deal with it gracefully, a practice they call “continuous chaos”.
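A crude sketch of the same exercise (not Gremlin’s actual tooling; the path and safety margin here are arbitrary) could be as simple as:

```python
import os
import shutil

TARGET = "/var/tmp/chaos_fill"   # hypothetical scratch path
MARGIN = 512 * 1024 * 1024       # leave 512 MiB free as a safety net

# Write junk until the filesystem is nearly full, then watch how
# services and alerts react before cleaning up.
with open(TARGET, "wb") as f:
    while shutil.disk_usage("/var/tmp").free > MARGIN:
        f.write(b"\0" * (4 * 1024 * 1024))  # 4 MiB chunks
        f.flush()

input("Disk nearly full -- verify alerts fired, then press Enter to clean up")
os.remove(TARGET)
```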

Google’s CRE team (Customer Reliability Engineering) discusses when to post public followups and how to write them. I love their idea of investigating where they got lucky during an incident, catching cases where things could have been much worse if not for serendipity. I’m going to start using that.

Outages

SRE Weekly Issue #99

Lots of outages this week, although not as many as in some previous years on Black Friday.  We’ll see what Cyber Monday brings.

I’m writing this from the airport on my way to re:Invent.  Perhaps I’ll see some of you there as I rush about from meeting to meeting.

SPONSOR MESSAGE

Attending AWS re:Invent 2017? Visit the VictorOps booth, schedule a meeting, or join us for some after hours fun. See you in Vegas! http://try.victorops.com/SREWeekly/AWS

Articles

Complete with a nifty flow-chart for informed decision-making.

As the title suggests, this article by New Relic is about the mindset of an SRE. I really love number 3, where they discuss the idea that gating production deploys can actually reduce reliability rather than improve it.

It’s what it says on the tin, and it’s targeted at DigitalOcean. One could also use it as a general primer on setting up Heartbeat failover on other cloud platforms.

The Chaos Toolkit is a free, open source project that enables you to create and apply Chaos Experiments to various types of infrastructure, platforms and applications.

It currently supports Kubernetes and Spring.

Here’s a neat little overview of the temporary but massive network that joins the re:Invent venues up and down the Las Vegas strip. Half of the strip is also set up for Direct Connect to the nearest AWS region.

The three pitfalls discussed are confusing EBS latency, idle EC2 instances wasting money, and memory leaks. My favorite gotcha isn’t mentioned: the performance cliffs caused by running out of burst credits on T2 instances or GP2 volumes.
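If you want to watch for that cliff yourself, here’s a sketch using boto3 that pulls an instance’s CPUCreditBalance from CloudWatch (the instance ID is a placeholder). GP2 volumes expose an analogous BurstBalance metric under AWS/EBS.

```python
from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch")
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUCreditBalance",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Average"],
)

# A balance trending toward zero means throttling (and the latency
# cliff) is imminent.
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"])
```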

Outages

A production of Tinker Tinker Tinker, LLC