Articles
The big news this week is SegmentSmack, a denial-of-service vulnerability in the Linux kernel that allows an attacker to cause high CPU consumption. Linked is a SANS Technology Institute researcher’s summary of the attack. Other coverage:
Johannes B. Ullrich, PhD — SANS Technology Institute
It’s rare that any system we create will remain static throughout its lifetime. How can you handle retrofitting it without sacrificing reliability?
Yiwei Liu — Grubhub
We’ve previously introduced GLB, our scalable load balancing solution for bare metal datacenters […] Today we’re excited to share more details about our load balancer’s design, as well as release the GLB Director as open source.
Theo Julienne — GitHub
HostedGraphite had a load-balancing challenge: some connections carried 5 data points per second while others had 5000. Here’s how they solved it.
Ciaran Gaffney — HostedGraphite
Here’s how Grab designed their global rate-limiting system, ensuring nearly instant local rate-limiting decisions controlled asynchronously by a global service.
Jim Zhan and Gao Chao — Grab
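The pattern described in that summary (decide locally, let a global service adjust the limits out of band) can be sketched roughly as below. This is a minimal illustration and not Grab's implementation: the `LocalRateLimiter` class, its `update_quota` method, and the token-bucket choice are all assumptions made for the example. The idea is that `allow()` never makes a network call, while some background task (not shown) would call `update_quota()` whenever the global service publishes new numbers.

```python
import threading
import time


class LocalRateLimiter:
    """Hypothetical token bucket: decisions are local, quotas are replaceable."""

    def __init__(self, rate_per_sec, burst, clock=time.monotonic):
        self._lock = threading.Lock()
        self._rate = float(rate_per_sec)   # tokens added per second
        self._burst = float(burst)         # maximum tokens held
        self._tokens = float(burst)        # start with a full bucket
        self._clock = clock                # injectable for testing
        self._last = clock()

    def _refill(self):
        # Credit tokens for the time elapsed since the last refill.
        now = self._clock()
        elapsed = now - self._last
        self._tokens = min(self._burst, self._tokens + elapsed * self._rate)
        self._last = now

    def allow(self):
        """Local decision: no network call on the request path."""
        with self._lock:
            self._refill()
            if self._tokens >= 1.0:
                self._tokens -= 1.0
                return True
            return False

    def update_quota(self, rate_per_sec, burst):
        """Called asynchronously when the (assumed) global service sends new limits."""
        with self._lock:
            self._refill()  # settle accrued tokens under the old rate first
            self._rate = float(rate_per_sec)
            self._burst = float(burst)
            self._tokens = min(self._tokens, self._burst)
```

Because the request path only takes a local lock, a slow or unavailable global service delays quota updates but does not block or fail individual requests; the limiter simply keeps enforcing the last quota it received.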
Find out how Lyft avoids cascading failure in their microservice-based architecture, through the use of a client- and server-side rate-limiting proxy.
Daniel Hochman and Jose Nino — Lyft
A good post-mortem process is broken down into three major parts, the first of which will usually take up the bulk of your time:
- Writing a post-mortem.
- Reviewing the post-mortem and publishing the post-mortem.
- Tracking the post-mortem.
Let’s go through each step in more detail.
Sweta Ackerman — Increment
The FCC blamed their outage this past May on a DDoS. Turns out it was just massively distributed requests for legitimate service.
Thomas Barrabi — Fox Business
My favorite part of this interview with Charity Majors is the discussion of operations in a serverless infrastructure (toward the end).
Forrest Brazeal — A Cloud Guru
Outages
- Travis CI
- Google G Suite administrator console
- Datadog
- Google Compute Engine
- This is a followup analysis of an outage that occurred on July 27.
The issue was caused by an unintended side effect of a configuration change […]