SRE Weekly Issue #136

SPONSOR MESSAGE

Define goals, set agendas, and build SRE like a boss. SRE team lead Jonathan Schwietert discusses how to organize effective SRE meetings and cultivate a collaborative culture of resiliency:

http://try.victorops.com/sreweekly/organized-sre

Articles

This infographic shows how Ably’s client library and backend infrastructure are designed to work around many common failure modes. My favorite: they have redundant TLS certificates from distinct issuers.

Matthew O’Riordan — Ably
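One common failure mode a client library like this has to cover is an unreachable primary endpoint. Here’s a minimal, hypothetical sketch of the fallback-host pattern — the `send` transport and hostnames are assumptions for illustration, not Ably’s actual API:

```python
# Try the primary host first, then each fallback in order.
# Re-raise the last error only if every host fails.
def request_with_fallbacks(send, hosts):
    last_err = None
    for host in hosts:  # primary first, then fallbacks
        try:
            return send(host)
        except ConnectionError as e:
            last_err = e  # remember the failure and try the next host
    raise last_err
```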

This article argues that spending a little time to fix staging can make production significantly more stable.

Michael Nygard

This is a story of a flawed development process on top of a flawed infrastructure, without the necessary data to drive decision-making. It’s also a story of waking up to these problems and charting a way out.

[…]

As it turns out, pure reasoning cannot solve the kind of problems you see in the production environment of a complex application. These problems are almost always more difficult, since they have survived all of the testing you could throw at them.

John Casey

A story of a somewhat rare failure case (a datacenter heat buildup event) and how to monitor for such a thing without contributing to metrics overload.

Pavel Trukhanov — okmeter
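As a sketch of the “monitor without metrics overload” idea: rather than exporting one time series per temperature sensor, you can export a couple of aggregates across all of them. The signal names and threshold below are illustrative, not from the article:

```python
# Collapse many per-sensor readings into a few aggregate series.
def summarize_temps(readings: dict[str, float]) -> dict[str, float]:
    vals = list(readings.values())
    return {
        "temp_max": max(vals),               # worst hot spot
        "temp_avg": sum(vals) / len(vals),   # overall trend
    }

def overheating(summary: dict[str, float], max_ok: float = 45.0) -> bool:
    # Alert on the hottest sensor; the average alone can hide a hot spot.
    return summary["temp_max"] > max_ok
```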

On Twitter this week, @srhtcn noted that “Many incidents happen during or right after release” and asked for advice on ways to fix this.

Great advice, useful for managers and individual contributors.

Charity Majors

Outages

SRE Weekly Issue #135

SPONSOR MESSAGE

SRE looks different from organization to organization. But this recent interview with members of our SRE council showcases their approach to SRE, some of their favorite parts of SRE, and how SRE continues to evolve:
http://try.victorops.com/sreweekly/what-is-sre-to-me

Articles

What might an AWS outage look like? Try this new simulation tool to find out!

It’s not something you’ll want to use for too long (the internet is better when it works, it turns out), but it’s a view that’s well worth taking in, if only to taste the sheer scope of Amazon’s server empire.

Russell Brandom — The Verge (tool by Dhruv Mehrotra)

This article goes step-by-step through setting up a 3-server GlusterFS cluster.

Jack Wallen — TechRepublic

My favorite part of this is the concept of vacations as a “human game day”. Can we survive without you?

Matt Stratton — PagerDuty (with Alice Goldfuss)

One question I have been seeing is “if Istio provides reliability for me, do I have to worry about it in my application?”

The answer is: abso-freakin-lutely :)

Christian Posta

This take on the theft and crashing of an airplane in Seattle is applicable to SRE in multiple ways. It includes discussion of the incident response and some thoughts on what level of risk for extremely rare events is acceptable.

James Fallows — The Atlantic

Two funny GIFs about SRE. Full disclosure: @dbaops is my boss and this stemmed from a DM conversation between us.

@dbaops on Twitter

Coarse-grained health checks might be sufficient for orchestration systems, but prove to be inadequate to ensure quality-of-service and prevent cascading failures in distributed systems.

Cindy Sridharan
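A minimal sketch of the distinction the article draws: a coarse liveness check answers “is the process up?”, while a finer-grained readiness check folds in dependency signals before accepting traffic. The signal names and thresholds here are my own illustrations, not from the article:

```python
from dataclasses import dataclass

@dataclass
class Signals:
    db_p99_ms: float   # recent p99 latency to the primary datastore
    queue_depth: int   # backlog in the local work queue
    in_flight: int     # requests currently being served

def liveness(_: Signals) -> bool:
    # Coarse: if we can run this code at all, report healthy.
    return True

def readiness(s: Signals, max_db_p99_ms: float = 250.0,
              max_queue: int = 1000, max_in_flight: int = 500) -> bool:
    # Fine-grained: shed traffic before a slow dependency cascades.
    return (s.db_p99_ms <= max_db_p99_ms
            and s.queue_depth <= max_queue
            and s.in_flight <= max_in_flight)
```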

Outages

SRE Weekly Issue #134

SPONSOR MESSAGE

Sr. Software Engineer Greg Frank discusses a tool using simulated chaos and validators to improve SRE. See part one of the series to learn more about this tool for supporting your own SRE efforts:

http://try.victorops.com/sreweekly/simulators-and-validators-for-sre

Articles

The big news this week is SegmentSmack, a denial of service vulnerability in the Linux kernel that allows an attacker to cause high CPU consumption. Linked is a SANS Technology Institute researcher’s summary of the attack. Other coverage:

Johannes B. Ullrich, PhD — SANS Technology Institute

It’s rare that any system we create will remain static throughout its lifetime. How can you handle retrofitting it without sacrificing reliability?

Yiwei Liu — Grubhub

We’ve previously introduced GLB, our scalable load balancing solution for bare metal datacenters […] Today we’re excited to share more details about our load balancer’s design, as well as release the GLB Director as open source.

Theo Julienne — GitHub

HostedGraphite had a load-balancing challenge: some connections carried 5 data points per second while others had 5000. Here’s how they solved it.

Ciaran Gaffney — HostedGraphite
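When individual connections differ by three orders of magnitude in throughput, balancing by connection count goes badly. One standard answer (a sketch of the general technique, not HostedGraphite’s actual code) is to place each connection on the backend with the least assigned load:

```python
# Least-load placement: weigh each connection by its datapoints/sec
# rather than counting all connections equally.
def pick_backend(load_by_backend: dict[str, float]) -> str:
    return min(load_by_backend, key=load_by_backend.get)

def assign(conns, backends):
    load = {b: 0.0 for b in backends}
    placement = {}
    for conn_id, rate in conns:  # rate = datapoints/sec
        b = pick_backend(load)
        placement[conn_id] = b
        load[b] += rate
    return placement, load
```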

Here’s how Grab designed their global rate-limiting system, ensuring nearly instant local rate-limiting decisions controlled asynchronously by a global service.

Jim Zhan and Gao Chao — Grab
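The pattern described above can be sketched roughly like this (assumed details, not Grab’s code): each node enforces a local token bucket synchronously on the hot path, while a background task periodically refreshes the node’s quota from the global service:

```python
import time

class LocalBucket:
    def __init__(self, rate: float, burst: float):
        self.rate, self.burst = rate, burst
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self) -> bool:
        # Synchronous, purely local decision: no network call per request.
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

    def update_quota(self, rate: float, burst: float):
        # Called asynchronously when the global service rebalances quotas
        # across nodes; it never blocks in-flight allow() decisions.
        self.rate, self.burst = rate, burst
```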

Find out how Lyft avoids cascading failure in their microservice-based architecture, through the use of a client- and server-side rate-limiting proxy.

Daniel Hochman and Jose Nino — Lyft

A good post-mortem process is broken down into three major parts, the first of which will usually take up the bulk of your time:

  • Writing a post-mortem.
  • Reviewing and publishing the post-mortem.
  • Tracking the post-mortem.

Let’s go through each step in more detail.

Sweta Ackerman — Increment

The FCC blamed their outage this past May on a DDoS. Turns out it was just massively distributed requests for legitimate service.

Thomas Barrabi — Fox Business

My favorite part of this interview with Charity Majors is the discussion of operations in a serverless infrastructure (toward the end).

Forrest Brazeal — A Cloud Guru

Outages

SRE Weekly Issue #133

SPONSOR MESSAGE

A big part of SRE is outage preparation and confidence. See how a DevOps culture of collaboration and accountability can better prepare your SRE team for outages:

http://try.victorops.com/sreweekly/sre-outage-collaboration

Articles

My sincerest apology to Ali Haider Zaveri, author of the article Location-Aware Distribution: Configuring servers at scale. I originally miscredited the article to two folks, claiming they were from Facebook when in fact they work at Google.

As Grubhub built out their service-oriented architecture, they first developed “base frameworks for building highly available, distributed services”.

William Blackie — Grubhub

Cloudflare discusses an optimization that improves their p99 response time in the face of occasionally slow disk access. Today I learned: Linux does not allow for non-blocking disk reads.

Ka-Hing Cheung — Cloudflare
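That “today I learned” is why event-driven servers commonly hand disk I/O to a thread pool: an ordinary buffered read() can block in the kernel even on an otherwise non-blocking socket loop. A minimal asyncio sketch of the general workaround (not Cloudflare’s implementation, which is in their nginx stack):

```python
import asyncio

async def read_file(path: str) -> bytes:
    loop = asyncio.get_running_loop()

    def _read():
        with open(path, "rb") as f:
            return f.read()

    # run_in_executor ships the blocking read to a worker thread, so a
    # slow disk stalls one pool thread instead of the whole event loop.
    return await loop.run_in_executor(None, _read)
```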

I include this article not just to warn you in case you depend on GeoTrust certificates, but also to highlight what’s involved in running a reliable and trustworthy CA.

Devon O’Brien, Ryan Sleevi, and Andrew Whalley — Google

They go over the 6 key constraints that influenced their design and describe the solution they came up with. Some of the constraints seem to involve preserving not just their own systems’ reliability, but that of their customers’ systems.

Simon Woolf — Ably

Given that we already knew in advance how to deal with each issue as it arose, it made sense to automate the work. Here’s how we did it.

James O’Keeffe — Google

In this article we will look at the various load balancing solutions available in Azure and which one should be used in which scenario.

Rahul Rajat Singh

Outages

SRE Weekly Issue #132

SPONSOR MESSAGE

Build reliability and optimize application performance for your complete infrastructure with effective monitoring. See how we used metrics to uncover issues in our own mobile application’s performance:

http://try.victorops.com/sreweekly/mobile-monitoring-sre

Articles

In this blog post I will show you what a disaster recovery exercise is, how it can diagnose weak points in your infrastructure, and how it can be a learning experience for your on-call team.

Alexandra Johnson — SigOpt

This article showcases the Chaos Toolkit experiments these folks wrote to test their system’s resiliency.

Sylvain Hellegouarc — chaosiq

With millions of servers and thousands of configuration changes per day, distribution of configuration information becomes a huge scaling challenge. Here’s some insight (and pretty architecture diagrams) explaining how Facebook does it.

Ali Haider Zaveri — Facebook [NOTE: originally miscredited, sorry!]

Liftbridge is a system for lightweight, fault-tolerant (LIFT) message streams built on NATS and gRPC. Fundamentally, it extends NATS with a Kafka-like publish-subscribe log API that is highly available and horizontally scalable.

Tyler Treat

This is pretty neat: Google Cloud Platform now exposes their SLIs directly to you, as they pertain to the requests you make of the platform. For example, if a given API call has increased latency, you’ll see it on their graph. This can be great for those “is it us or is it them?” incidents.

Jay Judkowitz — Google

What can I do to make sure that, when this system fails, it fails as effectively as possible?

Todd Conklin — Pre-Accident Podcast

Here’s a review of Google’s new SRE book. I’m a little miffed that now I have to say that, instead of just “Google’s SRE book” or just “the SRE book”. Ah well. This one appears to be more about practical use cases than theory.

Todd Hoff — High Scalability

Chaos engineering isn’t just for SREs.

everyone benefits from observing a failure. Even UI engineers, people from a UX background, product managers.

Patrick Higgins — Gremlin

Outages

  • MoviePass
    • Interestingly, the company reported in their SEC filing that the outage was the result of their running out of cash and being unable to pay vendors.
  • BBC website
A production of Tinker Tinker Tinker, LLC