SRE Weekly Issue #134

SPONSOR MESSAGE

Sr. Software Engineer Greg Frank discusses a tool that uses simulated chaos and validators to improve SRE. See part one of the series to learn more about how this tool can support your own SRE efforts:

http://try.victorops.com/sreweekly/simulators-and-validators-for-sre

Articles

The big news this week is SegmentSmack, a denial of service vulnerability in the Linux kernel that allows an attacker to cause high CPU consumption. Linked is a SANS Technology Institute researcher’s summary of the attack. Other coverage:

Johannes B. Ullrich, PhD — SANS Technology Institute

It’s rare that any system we create will remain static throughout its lifetime. How can you handle retrofitting it without sacrificing reliability?

Yiwei Liu — Grubhub

We’ve previously introduced GLB, our scalable load balancing solution for bare metal datacenters […] Today we’re excited to share more details about our load balancer’s design, as well as release the GLB Director as open source.

Theo Julienne — GitHub
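A director tier like this has to send every packet of a TCP flow to the same proxy, even though any director may receive that packet and the directors share no per-flow state. One standard technique for this is rendezvous (highest-random-weight) hashing. Every director scores each (flow, proxy) pair with the same hash and picks the top scorers, so all directors agree without coordinating. The minimal sketch below is generic and is not GLB Director's code. The function names, the SHA-256 scoring, and the choice of one primary plus one secondary proxy are all illustrative assumptions.

```python
import hashlib


def _score(flow_key: str, server: str) -> int:
    # Deterministic pseudo-random weight for a (flow, server) pair;
    # every director computes the same value with no shared state.
    digest = hashlib.sha256(f"{flow_key}|{server}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def pick_servers(flow_key: str, servers: list, n: int = 2) -> list:
    """Return the n highest-scoring servers for a flow (e.g. primary, secondary)."""
    ranked = sorted(servers, key=lambda s: _score(flow_key, s), reverse=True)
    return ranked[:n]


# Hypothetical flow key and proxy names, for illustration only.
flow = "10.0.0.1:5555->192.0.2.10:443"
proxies = ["proxy-a", "proxy-b", "proxy-c", "proxy-d"]
primary, secondary = pick_servers(flow, proxies)
```

The useful property is stability. Removing a proxy moves only the flows that ranked it first, and each of those flows lands on what was already its second choice.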

HostedGraphite had a load-balancing challenge: some connections carried 5 data points per second while others had 5000. Here’s how they solved it.

Ciaran Gaffney — HostedGraphite
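The numbers in that blurb show why counting connections is not enough. The minimal sketch below compares two policies: assigning connections to backends round-robin, and greedily sending each new connection to the backend with the lowest total data-point rate. The rates (reusing the blurb's 5 and 5000 per second) and the two-backend setup are hypothetical. This is a generic illustration, not necessarily the approach HostedGraphite took.

```python
from itertools import cycle


def round_robin(rates, n_backends):
    """Assign each connection to the next backend in turn; return per-backend load."""
    load = [0] * n_backends
    for rate, backend in zip(rates, cycle(range(n_backends))):
        load[backend] += rate
    return load


def least_loaded(rates, n_backends):
    """Assign each connection to the backend with the lowest current total rate."""
    load = [0] * n_backends
    for rate in rates:
        target = min(range(n_backends), key=load.__getitem__)
        load[target] += rate
    return load


# Hypothetical mix: heavy (5000 points/s) and light (5 points/s) connections alternate.
rates = [5000, 5, 5000, 5, 5000, 5, 5000, 5]
print(round_robin(rates, 2))   # [20000, 20]
print(least_loaded(rates, 2))  # [10010, 10010]
```

Both backends hold four connections under round-robin, yet one carries a thousand times the traffic of the other. The greedy policy assumes each connection's rate is known when it is assigned. In practice that rate is usually only observable after the fact, which is what makes this a real problem.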

Here’s how Grab designed their global rate-limiting system, ensuring nearly instant local rate-limiting decisions controlled asynchronously by a global service.

Jim Zhan and Gao Chao — Grab
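The "local decisions, global control" split can be sketched as a token bucket on each node. The bucket answers every request without a network call, while a background loop periodically fetches the node's share of the global limit and updates the refill rate. This is a minimal sketch under stated assumptions, not Grab's implementation. `GlobalQuotaService`, `LocalRateLimiter`, and `sync_loop` are hypothetical names, and the even split of the global limit across nodes is an assumption.

```python
import threading
import time


class GlobalQuotaService:
    """Hypothetical global coordinator that splits one global limit evenly across nodes."""

    def __init__(self, global_limit_per_sec: float, node_count: int):
        self.global_limit_per_sec = global_limit_per_sec
        self.node_count = node_count

    def quota_for(self, node_id: str) -> float:
        # Even split is an assumption; a real service might weight by observed traffic.
        return self.global_limit_per_sec / self.node_count


class LocalRateLimiter:
    """Token bucket on each node; its refill rate is pushed in by the global service."""

    def __init__(self, rate_per_sec: float, burst: float):
        self._lock = threading.Lock()
        self._rate = rate_per_sec
        self._burst = burst
        self._tokens = burst
        self._last = time.monotonic()

    @property
    def rate(self) -> float:
        with self._lock:
            return self._rate

    def _refill(self) -> None:
        # Caller must hold self._lock.
        now = time.monotonic()
        self._tokens = min(self._burst, self._tokens + (now - self._last) * self._rate)
        self._last = now

    def update_rate(self, rate_per_sec: float) -> None:
        # Called from the background sync loop, never on the request path.
        with self._lock:
            self._refill()  # settle tokens earned at the old rate first
            self._rate = rate_per_sec

    def allow(self) -> bool:
        # Purely local decision: no network round trip per request.
        with self._lock:
            self._refill()
            if self._tokens >= 1:
                self._tokens -= 1
                return True
            return False


def sync_loop(limiter, service, node_id, interval_sec, stop):
    """Fetch this node's quota at least once, then every interval until stop is set."""
    while True:
        limiter.update_rate(service.quota_for(node_id))
        if stop.wait(interval_sec):
            break
```

In use, `sync_loop` would run in a background thread alongside the request handlers. If the global service becomes slow or unreachable, each node keeps enforcing its last known quota instead of blocking requests.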

Find out how Lyft avoids cascading failure in their microservice-based architecture, through the use of a client- and server-side rate-limiting proxy.

Daniel Hochman and Jose Nino — Lyft
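One common building block for this kind of protection is a concurrency limit that sheds load. Once too many requests are already in flight, new ones fail immediately instead of queueing. The same wrapper can sit on the server side to protect a service, or on the client side to cap outstanding calls to a dependency. The minimal sketch below assumes an in-process limiter. `ConcurrencyLimiter` and `OverloadedError` are hypothetical names, and this is not Envoy's or Lyft's actual mechanism.

```python
import threading


class OverloadedError(Exception):
    """Raised when a request is shed because the in-flight limit is reached."""


class ConcurrencyLimiter:
    """Fail fast once max_in_flight requests are outstanding, instead of queueing."""

    def __init__(self, max_in_flight: int):
        self._sem = threading.BoundedSemaphore(max_in_flight)

    def call(self, handler, *args, **kwargs):
        # Non-blocking acquire: an overloaded service rejects work right away.
        if not self._sem.acquire(blocking=False):
            raise OverloadedError("too many requests in flight; shedding load")
        try:
            return handler(*args, **kwargs)
        finally:
            self._sem.release()
```

Failing fast keeps latency bounded, and it gives the caller a clear signal to back off or retry elsewhere. That prevents one overloaded service from tying up threads and connections all the way up the call chain.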

A good post-mortem process is broken down into three major parts, the first of which will usually take up the bulk of your time:

  • Writing a post-mortem.
  • Reviewing the post-mortem and publishing the post-mortem.
  • Tracking the post-mortem.

Let’s go through each step in more detail.

Sweta Ackerman — Increment

The FCC blamed its May 2017 outage on a DDoS attack. It turns out the cause was just massively distributed requests from legitimate users.

Thomas Barrabi — Fox Business

My favorite part of this interview with Charity Majors is the discussion of operations in a serverless infrastructure (toward the end).

Forrest Brazeal — A Cloud Guru

Outages

Updated: August 12, 2018 — 10:35 pm