SRE WEEKLY – Page 73 – scalability, availability, incident response, automation

SRE Weekly Issue #134

lex

August 12, 2018

Articles

What Do I Need To Know about “SegmentSmack”

The big news this week is SegmentSmack, a denial of service vulnerability in the Linux kernel that allows an attacker to cause high CPU consumption. Linked is a SANS Technology Institute researcher’s summary of the attack. Other coverage:

Johannes B. Ullrich, PhD — SANS Technology Institute

How to manage changing requirements for a high availability service

It’s rare that any system we create will remain static throughout its lifetime. How can you handle retrofitting it without sacrificing reliability?

Yiwei Liu — Grubhub

GLB: GitHub’s open source load balancer

We’ve previously introduced GLB, our scalable load balancing solution for bare metal datacenters […] Today we’re excited to share more details about our load balancer’s design, as well as release the GLB Director as open source.

Theo Julienne — GitHub

The Secrets of Load-balancing Long Lived TCP Connections

HostedGraphite had a load-balancing challenge: some connections carried 5 data points per second while others had 5000. Here’s how they solved it.

Ciaran Gaffney — HostedGraphite

How we designed the Quotas microservice to prevent resource abuse

Here’s how Grab designed their global rate-limiting system, ensuring nearly instant local rate-limiting decisions controlled asynchronously by a global service.

Jim Zhan and Gao Chao — Grab
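
As a rough illustration of the pattern summarized above (and not Grab's actual implementation), a local token bucket can answer each request in-process while a background task swaps in new limits pushed by a global service. The class name `LocalTokenBucket`, its parameters, and the `update_limit` hook are all hypothetical.

```python
import threading
import time


class LocalTokenBucket:
    """Token bucket whose limits can be replaced at any time by a background sync."""

    def __init__(self, rate_per_sec, burst, clock=time.monotonic):
        self._rate = float(rate_per_sec)   # tokens added per second
        self._burst = float(burst)         # maximum tokens held
        self._tokens = float(burst)
        self._clock = clock                # injectable so tests can control time
        self._last = clock()
        self._lock = threading.Lock()

    def _refill(self):
        # Add tokens for the time elapsed since the last call, capped at burst.
        now = self._clock()
        self._tokens = min(self._burst, self._tokens + (now - self._last) * self._rate)
        self._last = now

    def allow(self):
        """Local decision: no network call on the request path."""
        with self._lock:
            self._refill()
            if self._tokens >= 1.0:
                self._tokens -= 1.0
                return True
            return False

    def update_limit(self, rate_per_sec, burst):
        """Hypothetical hook: called off the request path when the global
        service publishes a new quota for this client."""
        with self._lock:
            self._refill()
            self._rate = float(rate_per_sec)
            self._burst = float(burst)
            self._tokens = min(self._tokens, self._burst)
```

The request path only ever touches local state, so a slow or unavailable global service delays limit changes rather than blocking requests.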

Envoy Service Mesh Case Study: Mitigating Cascading Failure at Lyft

Find out how Lyft avoids cascading failure in their microservice-based architecture, through the use of a client- and server-side rate-limiting proxy.

Daniel Hochman and Jose Nino — Lyft

Post-mortems to the rescue

A good post-mortem process is broken down into three major parts, the first of which will usually take up the bulk of your time:

Writing a post-mortem.

Reviewing the post-mortem and publishing the post-mortem.

Tracking the post-mortem.

Let’s go through each step in more detail.

Sweta Ackerman — Increment

John Oliver viewers, not hackers, responsible for FCC system outage

The FCC blamed their outage this past May on a DDoS. Turns out it was just massively distributed requests for legitimate service.

Thomas Barrabi — Fox Business

You can’t debug systems with dashboards

My favorite part of this interview with Charity Majors is the discussion of operations in a serverless infrastructure (toward the end).

Forrest Brazeal — A Cloud Guru

Outages

Travis CI
Google G Suite administrator console
Datadog
Google Compute Engine
- This is a followup analysis of an outage that occurred on July 27.
  
  The issue was caused by an unintended side effect of a configuration change […]

SRE Weekly Issue #133

lex

August 5, 2018

Articles

Errata: miscredited article in last week’s issue

My sincerest apology to Ali Haider Zaveri, author of the article Location-Aware Distribution: Configuring servers at scale. I originally miscredited the article to two folks, claiming they were from Facebook when in fact they work at Google.

Cloud infrastructure at Grubhub

As Grubhub built out their service-oriented architecture, they first developed “base frameworks for building highly available, distributed services”.

William Blackie — Grubhub

How we scaled nginx and saved the world 54 years every day

Cloudflare discusses an optimization that improves their p99 response time in the face of occasionally slow disk access. Today I learned: Linux does not allow for non-blocking disk reads.

Ka-Hing Cheung — Cloudflare

Google Online Security Blog: Chrome’s Plan to Distrust Symantec Certificates

I include this article not just to warn you in case you depend on GeoTrust certificates, but also to highlight what’s involved in running a reliable and trustworthy CA.

Devon O’Brien, Ryan Sleevi, and Andrew Whalley — Google

How we built a globally distributed rate limiter that scales horizontally

They go over the 6 key constraints that influenced their design and describe the solution they came up with. Some of the constraints seem to involve preserving not just their own systems’ reliability, but that of their customers’ systems.

Simon Woolf — Ably

Repairing network hardware at scale with SRE principles

Given that we already knew in advance how to deal with each issue as it arose, it made sense to automate the work. Here’s how we did it.

James O’Keeffe — Google

Understanding Azure Load Balancing Solutions – Azure Load Balancer, Azure Application Gateway and Azure Traffic Manager – Rahul Rajat Singh’s blog

In this article we will look at the various load balancing solutions available in Azure and which one should be used in which scenario.

Rahul Rajat Singh

Outages

Google Cloud Networking europe-west2
GitHub
Facebook
FastMail
Chipotle
- It appears that Chipotle may have DoSed themselves with an offer of free guacamole to folks that order online.
MoviePass

SRE Weekly Issue #132

lex

July 29, 2018

Articles

How to Lead a Disaster Recovery Exercise For Your On-Call Team

In this blog post I will show you what a disaster recovery exercise is, how it can diagnose weak points in your infrastructure, and how it can be a learning experience for your on-call team.

Alexandra Johnson — SigOpt

Exploring Spring Boot resiliency on AWS EKS

This article showcases the Chaos Toolkit experiments these folks wrote to test their system’s resiliency.

Sylvain Hellegouarc — chaosiq

Location-Aware Distribution: Configuring servers at scale

With millions of servers and thousands of configuration changes per day, distribution of configuration information becomes a huge scaling challenge. Here’s some insight (and pretty architecture diagrams) explaining how Facebook does it.

Ali Haider Zaveri — Facebook [NOTE: originally miscredited, sorry!]

Introducing Liftbridge: Lightweight, Fault-Tolerant Message Streams

Liftbridge is a system for lightweight, fault-tolerant (LIFT) message streams built on NATS and gRPC. Fundamentally, it extends NATS with a Kafka-like publish-subscribe log API that is highly available and horizontally scalable.

Tyler Treat

Transparent SLIs: See Google Cloud the way your application experiences it

This is pretty neat: Google Cloud Platform now exposes their SLIs directly to you, as they pertain to the requests you make of the platform. For example, if a given API call has increased latency, you’ll see it on their graph. This can be great for those “is it us or is it them?” incidents.

Jay Judkowitz — Google

Safety Moment – Predicting the FUTURE!!!!

What can I do to make sure that, when this system fails, it fails as effectively as possible?

Todd Conklin — Pre-Accident Podcast

Google’s New Book: The Site Reliability Workbook

Here’s a review of Google’s new SRE book. I’m a little miffed that now I have to say that, instead of just “Google’s SRE book” or just “the SRE book”. Ah well. This one appears to be more about practical use cases than theory.

Todd Hoff — High Scalability

Great GameDays: Thinking About Failure Holistically | LaunchDarkly Blog

Chaos engineering isn’t just for SREs.

everyone benefits from observing a failure. Even UI engineers, people from a UX background, product managers.

Patrick Higgins — Gremlin

Outages

MoviePass
- Interestingly, the company reported in their SEC filing that the outage was the result of their running out of cash and being unable to pay vendors.
BBC website

SRE Weekly Issue #131

lex

July 22, 2018

Articles

Twitter: @alicegoldfuss on New Ops

I love the idea of using hobbies as a gauge for your overload level at work. Also, serious kudos to Alice for the firm stance against alcohol at work and especially in Ops.

Alice Goldfuss

Open sourcing oomd, a new approach to handling OOMs

If the Linux OOM killer gets involved, you’ve already lost. Facebook reckons they can do better.

We find that oomd can respond faster, is less rigid, and is more reliable than the traditional Linux kernel OOM killer. In practice, we have seen 30-minute livelocks completely disappear.

Daniel Xu — Facebook

Debug a Real Honeycomb Outage with Honeycomb

This is radical transparency: Honeycomb has set up a sandbox copy of their app for you to play with and loaded it with data from a real outage on their platform! Tinker away. It’s super fun.

Honeycomb

Good housekeeping for error budgets – part two – CRE life lessons

It may not actually make sense to halt feature development if your team has exhausted the error budget. What do you do instead?

Adrian Hilton, Alec Warner and Alex Bramley — Google
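
For readers new to the term, an error budget is the amount of unreliability an SLO permits over a window. Here is a minimal sketch of the arithmetic, assuming a request-based SLO. The function name and the numbers are illustrative and do not come from the article.

```python
def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Return the fraction of the error budget still unspent.

    slo_target is a fraction such as 0.999. A negative result means the
    budget is overspent.
    """
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures == 0:
        # A 100% target (or zero traffic) leaves no budget to spend.
        return 1.0 if failed_requests == 0 else float("-inf")
    return 1.0 - failed_requests / allowed_failures


# Illustrative numbers: a 99.9% SLO over 1,000,000 requests permits about
# 1,000 failures, so 250 failures leaves roughly 75% of the budget.
remaining = error_budget_remaining(0.999, 1_000_000, 250)
```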

Introducing Centrifuge

Today, we’re excited to share the architecture for Centrifuge–Segment’s system for reliably sending billions of messages per day to hundreds of public APIs. This post explores the problems Centrifuge solves, as well as the data model we use to run it in production.

The parallels to the Plaid article a few weeks ago (scaling 9000+ heterogeneous bank integrations) are intriguing.

Calvin French-Owen — Segment

SLOs & You: A Guide To Service Level Objectives

A solid definition of SLIs, SLOs, and SLAs (from someone other than Google!). Includes some interesting tidbits on defining and measuring availability, choosing a useful time quantum, etc.

Kevin Kamel — Circonus

Rolling the Heroku Redis Fleet

Read about how Heroku deployed a security fix to their fleet of customer Redis instances. This is awesome:

Our fleet roll code only schedules replacement operations during the current on-call operator’s business hours. This limits burnout by reducing the risk of the fleet roll waking them up at night.

Camille Baldock — Heroku
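
The scheduling rule in that quote could be sketched roughly as follows. This is not Heroku's code: the function names, the 09:00 to 17:00 weekday window, and the batch size are all assumptions made for illustration.

```python
from datetime import datetime, time, timedelta, timezone


def in_operator_business_hours(now, operator_tz, start=time(9, 0), end=time(17, 0)):
    """True if timezone-aware `now` falls on a weekday within [start, end)
    in the on-call operator's timezone (assumed window, not from the article)."""
    local = now.astimezone(operator_tz)
    return local.weekday() < 5 and start <= local.time() < end


def schedule_replacements(pending, now, operator_tz, batch_size=2):
    """Return the instances to replace now; nothing outside business hours."""
    if not in_operator_business_hours(now, operator_tz):
        return []
    return pending[:batch_size]


# Example: an operator at UTC-7 on a Wednesday at 17:00 UTC (10:00 local).
operator_tz = timezone(timedelta(hours=-7))
now = datetime(2018, 7, 18, 17, 0, tzinfo=timezone.utc)
batch = schedule_replacements(["redis-a", "redis-b", "redis-c"], now, operator_tz)
```

Gating at scheduling time means a roll that runs long simply pauses until the next window instead of paging someone overnight.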

Exploring Multi-level Weaknesses using Automated Chaos Experiments

In this article I’m going to explore how multi-level automated chaos experiments can be used to explore system weaknesses that cross the boundaries between the technical and people/process/practices levels.

Russ Miles — ChaosIQ

Load Testing Round Up: 8 tools you can use to strengthen your stack

A comparison of 2 free and 6 paid tools for load testing, along with advice on how to use them.

Noah Heinrich — ButterCMS

Why Having a Feature Flag Microservice Is a Bad Idea

One could even call this article, “Why having a single microservice that every other microservice depends on is a bad idea”.

Mark Henke — Rollout.io

Outages

Google Cloud Platform
- Perhaps you noticed that a ton of sites fell over this past Tuesday? Or maybe you were on the front lines dealing with it yourself. Google’s Global Load Balancer fleet suffered a major outage, and they posted this detailed analysis/apology the next day.
Amazon’s Prime Day
- Seems like a tradition at this point…
Azure
- A BGP announcement error caused global instability for VM instances trying to reach Azure endpoints.
PagerDuty
Slack
Atlassian Statuspage
British Airways
Twitter
Fortnite: Playground LTM Postmortem
- This is a really juicy incident analysis! Epic Games tried to release a new game mode for Fortnite and quickly discovered a major scaling issue in their system, which they explain in great detail.
  
  The process of getting Playground stable and in the hands of our players was tougher than we would have liked, but was a solid reminder that complex distributed systems fail in unpredictable ways. We were forced to make significant emergency upgrades to our Matchmaking Service, but these changes will serve the game well as we continue to grow and expand our player base into the future.
  
  The Fortnite Team — Epic Games
Snapchat
Facebook
reddit

SRE Weekly Issue #130

lex

July 15, 2018

Articles

Goodbye Microservices: From 100s of problem children to 1 superstar

Segment discovered the hard way that their move to a microservice architecture had brought far more problems than benefits. Here’s why they transitioned back and how they pulled it off. Awesome article!

Alexandra Noonan — Segment

Establishing Resilience: The Challenges and Opportunities of Complexity

Drawing on the work of Dr. David Woods and the rest of the SNAFU Catchers, this article discusses the concepts behind resiliency and how to measure and achieve it.

Beth Long — New Relic

Solving for serverless: How do you manage something that’s not there?

Serverless is not the magical gateway to the land of NoOps. You still have to operate your system even if you’re not directly running the servers. This article does a great job of explaining why.

Bhanu Singh — Network World

How I use Wireshark

New to me: Wireshark’s statistics view and how it can be useful.

Julia Evans

Health and availability in computer systems

How do you define whether your system is available and healthy? This article uses an analogy to medical health.

Claiming that our system is doing well means nothing if users can perceive an outage.

José Carlos Chávez — Typeform

On the AWS Application Load Balancer HTTP/2 Support

These folks are experiencing mysterious latency with HTTP/2 traffic to their ALB and published this report on their investigation. There’s no happy ending here — ultimately they disabled HTTP/2 support. I hope they update if they do discover the culprit.

Peter Forsberg — ShopGun

relp 100% cpu – rsyslog stop after start · Issue #13 · rsyslog/librelp · GitHub

I had some fun this week unearthing the cause for the chronic lockups in Rsyslog that we’ve experienced at work. I found the cause (overeager retries of socket writes) and put together a bug report and a pull request.

Full disclosure: Fastly, my employer, is mentioned.

Building Grab’s Experimentation Platform

I love science! Grab wrote a nifty tool to help them select cohorts of users and perform experiments on them.

Abeesh Thomas and Roman Atachiants — Grab

Auto Scaling Production Services on Titus – Netflix TechBlog – Medium

Titus is the container orchestration system that Netflix created and open sourced. Rather than building a new auto-scaling feature for Titus, they instead got Amazon to productize EC2’s auto-scaling engine as a generalized auto-scaling tool, which Netflix then integrated with Titus. Neat!

See Amazon’s Application Auto Scaling announcement, published this past week.

Andrew Leung, Amit Joshi, and the rest of the Titus team — Netflix

Outages

Gmail
Google Docs, Sheets, et al.
YouTube TV
- During the World Cup match.
Discord
- Discord had a couple of outages this week.
Instagram
Mastercard
Facebook Messenger
Snapchat
99acres (real estate site)
Heroku
Disney blames 4-hour tech woes on network maintenance
- Here’s an update on the Disney system outage I linked to last week.
  Gabrielle Russon — Orlando Sentinel
