SRE Weekly Issue #131

SPONSOR MESSAGE

The costs of downtime can add up quickly. Take a deep dive into the behavioral and financial costs of downtime and check out some great techniques that teams are using to mitigate it:

http://try.victorops.com/sreweekly/costs-of-downtime

Articles

I love the idea of using hobbies as a gauge for your overload level at work. Also, serious kudos to Alice for the firm stance against alcohol at work and especially in Ops.

Alice Goldfuss

If the Linux OOM killer gets involved, you’ve already lost. Facebook reckons they can do better.

We find that oomd can respond faster, is less rigid, and is more reliable than the traditional Linux kernel OOM killer. In practice, we have seen 30-minute livelocks completely disappear.

Daniel Xu — Facebook
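
For flavor, here's a tiny Python sketch of the userspace idea behind oomd: watch the kernel's PSI memory-pressure numbers and act before the kernel's last-resort OOM killer has to. This is purely illustrative and not Facebook's actual oomd; the pressure threshold and the largest-RSS victim policy are my own assumptions.

    # Minimal userspace memory-pressure watchdog, in the spirit of oomd.
    # NOT Facebook's oomd: the 40% threshold and "largest RSS wins" policy
    # are illustrative assumptions. Requires Linux with PSI (4.20+) and root.
    import os
    import signal
    import time

    PRESSURE_FILE = "/proc/pressure/memory"
    FULL_AVG10_THRESHOLD = 40.0  # percent of time fully stalled; assumed value

    def full_avg10() -> float:
        """Return the 'full avg10' memory pressure figure from PSI."""
        with open(PRESSURE_FILE) as f:
            for line in f:
                if line.startswith("full"):
                    # e.g. "full avg10=1.23 avg60=0.45 avg300=0.10 total=123456"
                    return float(line.split()[1].split("=")[1])
        return 0.0

    def largest_rss_pid() -> int:
        """Pick the process with the largest resident set size as the victim."""
        best_pid, best_rss = -1, -1
        for pid in filter(str.isdigit, os.listdir("/proc")):
            try:
                with open(f"/proc/{pid}/statm") as f:
                    rss_pages = int(f.read().split()[1])
            except (OSError, ValueError):
                continue  # process exited or was unreadable; skip it
            if rss_pages > best_rss:
                best_pid, best_rss = int(pid), rss_pages
        return best_pid

    if __name__ == "__main__":
        while True:
            if full_avg10() > FULL_AVG10_THRESHOLD:
                victim = largest_rss_pid()
                if victim > 0:
                    os.kill(victim, signal.SIGKILL)  # act before a livelock sets in
            time.sleep(1)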

This is radical transparency: Honeycomb has set up a sandbox copy of their app for you to play with and loaded it with data from a real outage on their platform! Tinker away. It’s super fun.

Honeycomb

It may not actually make sense to halt feature development if your team has exhausted the error budget. What do you do instead?

Adrian Hilton, Alec Warner and Alex Bramley — Google

Today, we’re excited to share the architecture for Centrifuge, Segment’s system for reliably sending billions of messages per day to hundreds of public APIs. This post explores the problems Centrifuge solves, as well as the data model we use to run it in production.

The parallels to the Plaid article a few weeks ago (scaling 9000+ heterogeneous bank integrations) are intriguing.

Calvin French-Owen — Segment

A solid definition of SLIs, SLOs, and SLAs (from someone other than Google!). Includes some interesting tidbits on defining and measuring availability, choosing a useful time quantum, etc.

Kevin Kamel — Circonus
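
If you want to play with the arithmetic yourself, here's a quick Python sketch (my own example numbers, not the article's) turning an availability target into an error budget for a 30-day window:

    # Convert an availability SLO into an error budget over a window.
    # Example figures are illustrative, not taken from the Circonus article.
    from datetime import timedelta

    def error_budget(slo: float, window: timedelta) -> timedelta:
        """Allowed downtime for a given SLO (e.g. 0.999) over a window."""
        return window * (1.0 - slo)

    for slo in (0.99, 0.999, 0.9999):
        budget = error_budget(slo, timedelta(days=30))
        print(f"{slo:.4%} over 30 days -> {budget.total_seconds() / 60:.1f} minutes of budget")

    # 99.0000% over 30 days -> 432.0 minutes of budget
    # 99.9000% over 30 days -> 43.2 minutes of budget
    # 99.9900% over 30 days -> 4.3 minutes of budget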

Read about how Heroku deployed a security fix to their fleet of customer Redis instances. This is awesome:

Our fleet roll code only schedules replacement operations during the current on-call operator’s business hours. This limits burnout by reducing the risk of the fleet roll waking them up at night.

Camille Baldock — Heroku
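
The pattern is worth stealing: before an automated system does anything disruptive, have it check the clock against the current on-call operator's working hours and defer otherwise. Here's a rough Python sketch of that guard, with the 09:00-17:00 window and the timezone as my own assumptions rather than anything from Heroku's implementation:

    # Gate automated maintenance to the on-call operator's business hours.
    # A sketch only: the weekday 09:00-17:00 window and timezone are
    # illustrative assumptions, not details of Heroku's fleet roll system.
    from datetime import datetime, time
    from typing import Optional
    from zoneinfo import ZoneInfo

    ONCALL_TZ = ZoneInfo("America/New_York")  # assumed on-call timezone
    BUSINESS_START = time(9, 0)
    BUSINESS_END = time(17, 0)

    def within_business_hours(now: Optional[datetime] = None) -> bool:
        """True if it's a weekday within the on-call's working hours."""
        local = (now or datetime.now(ONCALL_TZ)).astimezone(ONCALL_TZ)
        return local.weekday() < 5 and BUSINESS_START <= local.time() < BUSINESS_END

    def maybe_schedule_replacement(instance_id: str) -> bool:
        """Run the replacement now, or defer until the next window."""
        if within_business_hours():
            print(f"rolling {instance_id} now")
            return True
        print(f"deferring {instance_id}: outside on-call business hours")
        return False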

In this article I’m going to explore how multi-level automated chaos experiments can be used to explore system weaknesses that cross the boundaries between the technical and people/process/practices levels.

Russ Miles — ChaosIQ

A comparison of 2 free and 6 paid tools for load testing, along with advice on how to use them.

Noah Heinrich — ButterCMS

One could even call this article, “Why having a single microservice that every other microservice depends on is a bad idea”.

Mark Henke — Rollout.io

Outages

  • Google Cloud Platform
    • Perhaps you noticed that a ton of sites fell over this past Tuesday? Or maybe you were on the front lines dealing with it yourself. Google’s Global Load Balancer fleet suffered a major outage, and they posted this detailed analysis/apology the next day.
  • Amazon’s Prime Day
    • Seems like a tradition at this point…
  • Azure
    • A BGP announcement error caused global instability for VM instances trying to reach Azure endpoints.
  • PagerDuty
  • Slack
  • Atlassian Statuspage
  • British Airways
  • Twitter
  • Fortnite: Playground LTM Postmortem
    • This is a really juicy incident analysis! Epic Games tried to release a new game mode for Fortnite and quickly discovered a major scaling issue in their system, which they explain in great detail.

      The process of getting Playground stable and in the hands of our players was tougher than we would have liked, but was a solid reminder that complex distributed systems fail in unpredictable ways. We were forced to make significant emergency upgrades to our Matchmaking Service, but these changes will serve the game well as we continue to grow and expand our player base into the future.

      The Fortnite Team — Epic Games

  • Snapchat
  • Facebook
  • reddit