General

SRE Weekly Issue #124

Today’s my birthday!  Bit of a short issue this week as a result, but lots of interesting outages.

SPONSOR MESSAGE

Support your DevOps and SRE efforts by implementing on-call tools that make people happy. With the right on-call tools, you can continuously deliver while maintaining system resiliency. Read more to learn about identifying good on-call tools: http://try.victorops.com/SREWeekly/on-call-tools

Articles

These terms are not interchangeable. Learn about the ins and outs of fault tolerance to highlight the differences between the two concepts.

Fernando Doglio

What caught my eye in this article: AIIMS, the Australasian InterService Incident Management System. It’s the equivalent of the Incident Management System (IMS) in the US.

Ian Jones

Outages

SRE Weekly Issue #123

I hope you all had a happy GDPR day!  SRE Weekly’s privacy policy has not changed.  Folks that subscribed by email would have seen a message that I only share your email address with MailChimp, and that’s the way it will stay.

You can unsubscribe at any time by following the link at the bottom of the email, but if you have any trouble at all with unsubscribing, please don’t hesitate to email me and I’ll take care of it for you.

SPONSOR MESSAGE

Maintaining reliability through cloud migration can be difficult. Learn how implementing an incident management solution can make migration faster, reduce costs, and make SRE-life easier: http://try.victorops.com/SREWeekly/Cloud-Migration-Incident-Management

Articles

The system is highly configurable, allowing fine-grained A/B testing of failures at all levels of the microservice call tree.

Ephemeral port exhaustion can really ruin your day. Read this to learn how to deal with it, how to detect it before you have problems, and why you might run out of ports sooner than you expect.

Will Sewell — Pusher
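
As a rough illustration of the "detect it before you have problems" idea, the sketch below estimates how much of the ephemeral port range is in use. It assumes Linux, where the range is exposed in /proc/sys/net/ipv4/ip_local_port_range. The helper names and the 80% warning threshold are hypothetical choices for this example and are not taken from the linked article. How you count the ports in use (for example, per destination address and port) is left to the caller.

```python
from pathlib import Path

# Linux-specific location of the ephemeral port range (assumption: Linux host).
PORT_RANGE_FILE = Path("/proc/sys/net/ipv4/ip_local_port_range")


def parse_port_range(text: str) -> tuple[int, int]:
    """Parse the whitespace-separated 'low high' pair the kernel reports."""
    low, high = (int(field) for field in text.split())
    return low, high


def read_port_range() -> tuple[int, int]:
    """Read the current range from /proc (only works on Linux)."""
    return parse_port_range(PORT_RANGE_FILE.read_text())


def ephemeral_port_count(low: int, high: int) -> int:
    """Number of ports in the inclusive range [low, high]."""
    return high - low + 1


def utilization(ports_in_use: int, low: int, high: int) -> float:
    """Fraction of the ephemeral range that is currently in use."""
    return ports_in_use / ephemeral_port_count(low, high)


def should_warn(ports_in_use: int, low: int, high: int,
                threshold: float = 0.8) -> bool:
    """True once utilization reaches the (hypothetical) warning threshold."""
    return utilization(ports_in_use, low, high) >= threshold
```

On a Linux host, `should_warn(count, *read_port_range())` could feed an alert, with `count` supplied by whatever connection accounting you already have.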

This incident report from 2013 is a great read. It’s really two incidents in one, including an analysis of why a remediation task from the first wasn’t completed in time to prevent the second.

David Poblador i Garcia — Spotify

There are a few nice tidbits in this interview, including this one:

[…] the health of the system no longer matters.  We’ve entered an era where what matters is the health of each individual event, or each individual user’s experience […]

Daniel Bryant – InfoQ

This article has an introduction to implementing canary deployments and also includes a discussion of the potential downsides.

Erik [surname not given] — Rollout.io

Lots of great detail in this announcement, including an analysis of how (and why) they designed their load balancer to function entirely in userspace without a kernel bypass mechanism.

Nikita Shirokov and Ranjeeth Dasineni — Facebook

Metrics are great, right? Except sometimes they’re not, when the metric collection itself adds enough load to impair the system.

Jonathan Brown — Wallaroo

Outages

  • Google BigQuery
    • Click through for the full incident report.

      Configuration changes being rolled out on the evening of the incident were not applied in the intended order.

  • GCP Networking in us-east4
    • Here’s some detail on the BGP issue that took down us-east4 last week.
  • Google StackDriver
    • It’s a hat trick of GCP incident followup reports. Happy reading!
  • Slack
  • Bank of New Zealand
  • Twitter
  • National Australia Bank
    • This outage is particularly notable because the bank has stated their intention to compensate customers for their losses, such as estimated lost revenues from inability to complete sales transactions.

SRE Weekly Issue #122

SPONSOR MESSAGE

Determining the right tools for your SRE team(s) can get confusing. So, VictorOps, InfluxData, and Grafana are putting on a webinar—May 16th, 1 pm ET—to help you build your SRE toolchain: http://try.victorops.com/SREWeekly/Webinar

Articles

After adopting a “full ownership” philosophy, this company faced burnout, with five or more separate developers on call simultaneously. Read about their awesome solution involving a shared on-call rotation staffed entirely by volunteers, spurred by the incentive of extra compensation.

Brian Scanlan — Intercom

What exactly is SRE and how does it relate to DevOps? Earlier this year, we (Liz Fong-Jones and Seth Vargo) launched a video series to help answer some of these questions and reduce the friction between the communities. This blog post summarizes the themes and lessons of each video in the series to offer actionable steps toward better, more reliable systems.

Liz Fong-Jones and Seth Vargo — Google

After a load test uncovered a scaling issue, they dug deep, finding issues with garbage collection settings, cascading failures, and an overeager retry strategy.

Val Markovic — LinkedIn

These tips cover the basics and will be especially useful for teams onboarding engineers that have never been on-call before.

This article examines a case study of an EMS company attempting to adopt a just culture policy. There’s a great discussion of why it’s not a good idea to lay blame on individuals when systemic problems may be far more important.

Larry Boxman and Paul LeSage — JEMS (Journal of Emergency Medical Services)

In this third and final article in a series, Xero lays out their process for analyzing incidents after the fact. Thanks to the Xero folks for being so open about your processes and for taking the time to write these articles!

Karthik Nilakant — Xero

I like the nifty heat maps with example distributed traces. Neat idea!

JBD — Google

Outages

SRE Weekly Issue #121

SPONSOR MESSAGE

Determining the right tools for your SRE team(s) can get confusing. So, VictorOps, InfluxData, and Grafana are putting on a webinar—May 16th, 1 pm ET—to help you build your SRE toolchain: http://try.victorops.com/SREWeekly/Webinar

Articles

This latest in the CRE Life Lessons series takes on dependencies and how they impact a service’s SLO in obvious and subtle ways.

Robert van Gent — Google
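
A small worked example of the "obvious" part of that impact: if a service hard-depends on several backends and their failures are independent (both are simplifying assumptions of this sketch, not claims from the article), the best availability it can offer is the product of its own availability and that of each dependency. The function names are hypothetical.

```python
import math


def composite_availability(own: float, dependencies: list[float]) -> float:
    """Upper bound on availability when every dependency is a hard,
    serial dependency and failures are independent (assumptions)."""
    return own * math.prod(dependencies)


def allowed_downtime_minutes(availability: float, days: int = 30) -> float:
    """Downtime budget, in minutes, over a window of the given length."""
    return (1.0 - availability) * days * 24 * 60
```

For instance, a service that is itself 99.9% available with three 99.9% dependencies tops out near 99.6%, which is roughly 173 minutes of downtime in a 30-day window rather than about 43.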

This company discovered that the benefits of microservices came with some significant downsides. Here’s how they turned to chaos testing to improve reliability.

Meredith Courtemanche — TechTarget

Keeping in mind that this is written by the CTO of Gremlin, it contains some good points about buying versus building your chaos engineering system. It would apply to other chaos engineering services too — if there were any.

Matt Fornaciari — Gremlin, Inc.

Even as an experienced Terraform user, I learned about some Terraform features I hadn’t been aware of.

Nic Jackson — HashiCorp

In issue #98, I linked to a recording of John Allspaw’s DOES17 talk. In case you didn’t have time to listen, here’s a transcript. If you didn’t have time to read the Stella Report, I highly recommend reading this as an intro to the major concepts therein.

John Allspaw

Outages

SRE Weekly Issue #120

SPONSOR MESSAGE

A combination of the right people and the right tools creates SRE-friendly environments. See the hundreds of tools and integrations that already work with VictorOps to make your people better and help you maintain more reliable systems: http://try.victorops.com/SREWeekly/Tools

Articles

“You can OOM a single NUMA node” thus entered my list of things to worry about when a box seems to have plenty of memory but still goes off and slaughters innocent (but big) processes.

Rachel Kroll

In this podcast episode, the panelists hold a retrospective for the snow-related delay of DevOps Days Baltimore. Toward the end they go into the idea of reliability and single points of failure with respect to conference planning. My favorite quote in the show, from Nell Shamrell-Harrington:

Incident Management is never about technology — it’s about people.

Nell Shamrell-Harrington and Nathen Harvey

I really love this Who, Me? section from The Register.

Simon Sharwood — The Register

This article has a great discussion of how to get started with chaos engineering — and how to avoid biting off more than you can chew.

Jennifer Riggins — The New Stack

Beamer is a stateless datacenter load balancer supporting both TCP and Multipath TCP (MPTCP). It manages to keep the load balancers stateless by taking advantage of connection state already held by servers.

Super-clever! The LB does keep state, but the size of the state is constant, unrelated to the number of connections flowing through it.

Adrian Colyer — summary, Olteanu et al. — original paper
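
To make the "constant-size state" point concrete, here is a minimal sketch of bucket-based dispatch: connections hash into a fixed number of buckets, and the balancer stores only a bucket-to-server table. This illustrates the general idea only and is not Beamer's actual implementation. The bucket count, the hash choice, and all names are assumptions, and the sketch leaves out how in-flight connections are handled when a bucket moves to a different server.

```python
import hashlib

NUM_BUCKETS = 64  # illustrative value; fixed regardless of connection count


class BucketBalancer:
    """Toy dispatcher whose only state is a fixed-size bucket table."""

    def __init__(self, servers: list[str], num_buckets: int = NUM_BUCKETS):
        if not servers:
            raise ValueError("at least one server is required")
        # Assign buckets to servers round-robin; the table never grows.
        self.table = [servers[i % len(servers)] for i in range(num_buckets)]

    def bucket_for(self, src_ip: str, src_port: int,
                   dst_ip: str, dst_port: int) -> int:
        """Hash the connection 4-tuple to a bucket index."""
        key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
        digest = hashlib.sha256(key).digest()
        return int.from_bytes(digest[:8], "big") % len(self.table)

    def route(self, src_ip: str, src_port: int,
              dst_ip: str, dst_port: int) -> str:
        """Pick a server without recording anything per connection."""
        return self.table[self.bucket_for(src_ip, src_port, dst_ip, dst_port)]

    def state_size(self) -> int:
        """Number of table entries held by the balancer."""
        return len(self.table)
```

Routing any number of connections leaves `state_size()` unchanged, and the same 4-tuple always lands on the same server as long as the table is not modified.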

Sometimes it’s worthwhile to lay everything out and describe just exactly what we’re up against as SREs. The analogies here are pretty awesome. Read this for a hefty dose of cynicism about the state of our increasingly computer-driven world.

Peter Welch

Outages

A production of Tinker Tinker Tinker, LLC