SRE Weekly Issue #138

SPONSOR MESSAGE

A dedication to SRE will improve the lives of your customers and team. For our August Roundup, we’ve compiled a list of top SRE articles in order to help you keep up with the latest news, tips, and topics in SRE:

http://try.victorops.com/sreweekly/august-sre-roundup

Articles

This episode of Greater Than Code features John Allspaw, and it’s pretty much as awesome as I expected. Some highlights:

  • rather than asking how an incident happened, ask what prevented it from being worse
  • ask “how” rather than “why” an incident happened
  • humans plus technology are together a cognitive system
  • how can you make automation a team player?

Janelle Klein, John Sawers, Rein Henrichs, and Jessica Kerr, with John Allspaw

What does cold start look like on various FaaS platforms? This article has hard numbers obtained through empirical testing.

Mikhail Shilkov

Colm MacCárthaigh explains how shuffle sharding improves reliability by acting like some kind of magic lever made of math.

Colm MacCárthaigh — AWS (thanks to Thread Reader for the thread rollup)
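
For a feel of the math behind that lever, here is a minimal sketch of the combinatorics. The numbers (8 nodes, 2 nodes per customer) and all function names are illustrative assumptions, not taken from the thread. It counts how many distinct shards exist and how likely two customers with independently chosen shards are to overlap.

```python
from fractions import Fraction
from math import comb
import hashlib
import random


def shard_count(nodes: int, shard_size: int) -> int:
    """Number of distinct shards: ways to choose shard_size nodes out of nodes."""
    return comb(nodes, shard_size)


def overlap_probability(nodes: int, shard_size: int, shared: int) -> Fraction:
    """Probability that two independently, uniformly chosen shards
    have exactly `shared` nodes in common (hypergeometric distribution)."""
    return Fraction(
        comb(shard_size, shared) * comb(nodes - shard_size, shard_size - shared),
        comb(nodes, shard_size),
    )


def pick_shard(customer_id: str, nodes: int, shard_size: int) -> list[int]:
    """Hypothetical assignment scheme: seed a PRNG from a hash of the customer
    id so that the same customer always lands on the same nodes."""
    seed = int.from_bytes(hashlib.sha256(customer_id.encode()).digest(), "big")
    return sorted(random.Random(seed).sample(range(nodes), shard_size))


if __name__ == "__main__":
    print(shard_count(8, 2))
    for shared in range(3):
        print(shared, overlap_probability(8, 2, shared))
    print(pick_shard("customer-a", 8, 2))
```

With these assumed numbers there are 28 possible shards, and a second customer fully overlaps with a given one only 1 time in 28, so a single misbehaving customer can take out every node for only a small fraction of the others.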

Who cares if your CDN has an eleventeen terabaud backbone uplink? What really matters is how they can serve your traffic.

Matt Levine — CacheFly

An engineer pushes a small change and OkCupid goes up in flames.

A new, entry-level employee takes down a big site, due to a bug not in his own software but in a dependency.

Dale Markowitz — LOGIC Magazine (Issue #5)

What happens when you mix Observability and Serverless? Corey Quinn of Last Week in AWS lets you in on the hottest new operations practice.

Corey Quinn

How will climate change and rising sea levels impact the reliability of our networks?

Carol Barford — iAfrikan

I watched this Nova (PBS) episode this week, and I highly recommend it. It explores why trains crash and what governments are doing to improve safety. The link above requires membership, but you can also watch it on Netflix.

PBS

Outages

SRE Weekly Issue #137

SPONSOR MESSAGE

At first, getting internal buy-in for SRE efforts can be difficult. “Build the Resilient Future Faster: Creating a Culture of Reliability” shows you exactly how, and why, we created and implemented our own culture of DevOps and SRE:

http://try.victorops.com/sreweekly/sre-ebook

Articles

Read about their transition from multi-cloud to all AWS and how they scaled to 10x the login throughput.

Dirceu Tiegs — Auth0

This article on the emergent behavior of algorithms is well worth thinking about as an SRE. Even without machine learning, our infrastructures have complex emergent behaviors, as you can read in any incident retrospective.

Andrew Smith — The Guardian

This interesting pitfall of chaos engineering stood out to me:

[…] if you hand a team 50 vulnerabilities, they’re probably not going to fix any of them. You know what I mean? So you have to figure out a way to prioritize those …

Andrea Echstenkamper with Nora Jones (Netflix), Ted Strzalkowski (LinkedIn), and Pat Higgins (Gremlin)

Well worth a quick listen (2 minutes 30 seconds).

Todd Conklin — Pre-Accident Podcast

In this series, we’ll dig into different types of observability tools. For each type, we’ll cover what they’re used for, what specific tools are available, some use cases, and any unique characteristics that may come up during your search for a new tool.

Linked above is an introduction to the article series. The first in the series is also out, focusing on time-series metric systems.

Dan Barker

Outages

SRE Weekly Issue #136

SPONSOR MESSAGE

Define goals, set agendas, and build SRE like a boss. SRE team lead, Jonathan Schwietert, discusses how to organize effective SRE meetings and cultivate a collaborative culture of resiliency:

http://try.victorops.com/sreweekly/organized-sre

Articles

This infographic shows how Ably’s client library and backend infrastructure are designed to work around many common failure modes. My favorite: they have redundant TLS certificates from distinct issuers.

Matthew O’Riordan — Ably

This article argues that spending a little time to fix staging can make production significantly more stable.

Michael Nygard

This is a story of a flawed development process on top of a flawed infrastructure, without the necessary data to drive decision-making. It’s also a story of waking up to these problems and charting a way out.

[…]

As it turns out, pure reasoning cannot solve the kind of problems you see in the production environment of a complex application. These problems are almost always more difficult, since they have survived all of the testing you could throw at them.

John Casey

A story of a somewhat rare failure case (a datacenter heat buildup event) and how to monitor for such a thing without contributing to metrics overload.

Pavel Trukhanov — okmeter

On Twitter this week, @srhtcn noted that “Many incidents happen during or right after release” and asked for advice on ways to fix this.

Great advice, useful for managers and individual contributors.

Charity Majors

Outages

SRE Weekly Issue #135

SPONSOR MESSAGE

SRE looks different from organization to organization, but this recent interview with members of our SRE council showcases their approach to SRE, some of their favorite parts of SRE, and how SRE continues to evolve:

http://try.victorops.com/sreweekly/what-is-sre-to-me

Articles

What might an AWS outage look like? Try this new simulation tool to find out!

It’s not something you’ll want to use for too long (the internet is better when it works, it turns out), but it’s a view that’s well worth taking in, if only to taste the sheer scope of Amazon’s server empire.

Russell Brandom — The Verge (tool by Dhruv Mehrotra)

This article goes step-by-step through setting up a 3-server GlusterFS cluster.

Jack Wallen — TechRepublic

My favorite part of this is the concept of vacations as a “human game day”. Can we survive without you?

Matt Stratton — PagerDuty (with Alice Goldfuss)

One question I have been seeing is “if Istio provides reliability for me, do I have to worry about it in my application?”

The answer is: abso-freakin-lutely :)

Christian Posta

This take on the theft and crashing of an airplane in Seattle is applicable to SRE in multiple ways. It includes discussion of the incident response and some thoughts on what level of risk for extremely rare events is acceptable.

James Fallows — The Atlantic

Two funny GIFs about SRE. Full disclosure: @dbaops is my boss and this stemmed from a DM conversation between us.

@dbaops on Twitter

Coarse-grained health checks might be sufficient for orchestration systems, but prove to be inadequate to ensure quality-of-service and prevent cascading failures in distributed systems.

Cindy Sridharan

Outages

SRE Weekly Issue #134

SPONSOR MESSAGE

Sr. Software Engineer, Greg Frank, discusses a tool using simulated chaos and validators to improve SRE. See part one of the series to learn more about this tool for supporting your own SRE efforts:

http://try.victorops.com/sreweekly/simulators-and-validators-for-sre

Articles

The big news this week is SegmentSmack, a denial of service vulnerability in the Linux kernel that allows an attacker to cause high CPU consumption. Linked is a SANS Technology Institute researcher’s summary of the attack.

Johannes B. Ullrich, PhD — SANS Technology Institute

It’s rare that any system we create will remain static throughout its lifetime. How can you handle retrofitting it without sacrificing reliability?

Yiwei Liu — Grubhub

We’ve previously introduced GLB, our scalable load balancing solution for bare metal datacenters […] Today we’re excited to share more details about our load balancer’s design, as well as release the GLB Director as open source.

Theo Julienne — GitHub

HostedGraphite had a load-balancing challenge: some connections carried 5 data points per second while others had 5000. Here’s how they solved it.

Ciaran Gaffney — HostedGraphite

Here’s how Grab designed their global rate-limiting system, ensuring nearly instant local rate-limiting decisions controlled asynchronously by a global service.

Jim Zhan and Gao Chao — Grab
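
To make the local-decision, global-control split concrete, here is a minimal sketch. It is not Grab's implementation: the token bucket, the class name, and the `apply_global_update` method are all assumptions chosen for illustration. Each instance answers `allow()` from local state only, while a separate (here simulated) global service adjusts the permitted rate whenever its update happens to arrive.

```python
class LocalLimiter:
    """Token bucket that decides locally; its rate is set from elsewhere."""

    def __init__(self, rate: float, burst: float, now: float = 0.0) -> None:
        self.rate = rate      # tokens added per second
        self.burst = burst    # maximum tokens the bucket can hold
        self.tokens = burst   # start with a full bucket
        self.last = now       # timestamp of the last refill

    def _refill(self, now: float) -> None:
        # Credit tokens for the time elapsed since the last refill.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now

    def allow(self, now: float) -> bool:
        """Local decision: no network call, just arithmetic on local state."""
        self._refill(now)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

    def apply_global_update(self, rate: float, now: float) -> None:
        """Called whenever an update from the (hypothetical) global service
        arrives; requests never wait for this call."""
        self._refill(now)  # settle tokens earned under the old rate first
        self.rate = rate


if __name__ == "__main__":
    limiter = LocalLimiter(rate=1.0, burst=2.0)
    print([limiter.allow(now=0.0) for _ in range(3)])
    limiter.apply_global_update(rate=0.0, now=1.0)
    print(limiter.allow(now=1.0), limiter.allow(now=10.0))
```

The clock is passed in explicitly so the behavior is easy to reason about; a real deployment would read a monotonic clock and receive rate updates over the network.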

Find out how Lyft avoids cascading failure in their microservice-based architecture, through the use of a client- and server-side rate-limiting proxy.

Daniel Hochman and Jose Nino — Lyft

A good post-mortem process is broken down into three major parts, the first of which will usually take up the bulk of your time:

  • Writing a post-mortem.
  • Reviewing the post-mortem and publishing the post-mortem.
  • Tracking the post-mortem.

Let’s go through each step in more detail.

Sweta Ackerman — Increment

The FCC blamed their outage this past May on a DDoS. Turns out it was just massively distributed requests for legitimate service.

Thomas Barrabi — Fox Business

My favorite part of this interview with Charity Majors is the discussion of operations in a serverless infrastructure (toward the end).

Forrest Brazeal — A Cloud Guru

Outages
