SRE Weekly Issue #131

SPONSOR MESSAGE

The costs of downtime can add up quickly. Take a deep dive into the behavioral and financial costs of downtime and check out some great techniques that teams are using to mitigate it:

http://try.victorops.com/sreweekly/costs-of-downtime

Articles

I love the idea of using hobbies as a gauge for your overload level at work. Also, serious kudos to Alice for the firm stance against alcohol at work and especially in Ops.

Alice Goldfuss

If the Linux OOM killer gets involved, you’ve already lost. Facebook reckons they can do better.

We find that oomd can respond faster, is less rigid, and is more reliable than the traditional Linux kernel OOM killer. In practice, we have seen 30-minute livelocks completely disappear.

Daniel Xu — Facebook
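
For flavor, here's a tiny Python sketch of the userspace idea behind oomd: watch the kernel's PSI memory-pressure numbers and act before the kernel's last-resort OOM killer has to. This is purely illustrative and not Facebook's actual oomd; the pressure threshold and the largest-RSS victim policy are my own assumptions.

    # Minimal userspace memory-pressure watchdog, in the spirit of oomd.
    # NOT Facebook's oomd: the 40% threshold and "largest RSS wins" policy
    # are illustrative assumptions. Requires Linux with PSI (4.20+) and root.
    import os
    import signal
    import time

    PRESSURE_FILE = "/proc/pressure/memory"
    FULL_AVG10_THRESHOLD = 40.0  # percent of time fully stalled; assumed value

    def full_avg10() -> float:
        """Return the 'full avg10' memory pressure figure from PSI."""
        with open(PRESSURE_FILE) as f:
            for line in f:
                if line.startswith("full"):
                    # e.g. "full avg10=1.23 avg60=0.45 avg300=0.10 total=123456"
                    return float(line.split()[1].split("=")[1])
        return 0.0

    def largest_rss_pid() -> int:
        """Pick the process with the largest resident set size as the victim."""
        best_pid, best_rss = -1, -1
        for pid in filter(str.isdigit, os.listdir("/proc")):
            try:
                with open(f"/proc/{pid}/statm") as f:
                    rss_pages = int(f.read().split()[1])
            except (OSError, ValueError):
                continue  # process exited or was unreadable; skip it
            if rss_pages > best_rss:
                best_pid, best_rss = int(pid), rss_pages
        return best_pid

    if __name__ == "__main__":
        while True:
            if full_avg10() > FULL_AVG10_THRESHOLD:
                victim = largest_rss_pid()
                if victim > 0:
                    os.kill(victim, signal.SIGKILL)  # act before a livelock sets in
            time.sleep(1)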

This is radical transparency: Honeycomb has set up a sandbox copy of their app for you to play with and loaded it with data from a real outage on their platform! Tinker away. It’s super fun.

Honeycomb

It may not actually make sense to halt feature development if your team has exhausted the error budget. What do you do instead?

Adrian Hilton, Alec Warner and Alex Bramley — Google

Today, we’re excited to share the architecture for Centrifuge, Segment’s system for reliably sending billions of messages per day to hundreds of public APIs. This post explores the problems Centrifuge solves, as well as the data model we use to run it in production.

The parallels to the Plaid article a few weeks ago (scaling 9000+ heterogeneous bank integrations) are intriguing.

Calvin French-Owen — Segment

A solid definition of SLIs, SLOs, and SLAs (from someone other than Google!). Includes some interesting tidbits on defining and measuring availability, choosing a useful time quantum, etc.

Kevin Kamel — Circonus
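
If you want to play with the arithmetic yourself, here's a quick Python sketch (my own example numbers, not the article's) turning an availability target into an error budget for a 30-day window:

    # Convert an availability SLO into an error budget over a window.
    # Example figures are illustrative, not taken from the Circonus article.
    from datetime import timedelta

    def error_budget(slo: float, window: timedelta) -> timedelta:
        """Allowed downtime for a given SLO (e.g. 0.999) over a window."""
        return window * (1.0 - slo)

    for slo in (0.99, 0.999, 0.9999):
        budget = error_budget(slo, timedelta(days=30))
        print(f"{slo:.4%} over 30 days -> {budget.total_seconds() / 60:.1f} minutes of budget")

    # 99.0000% over 30 days -> 432.0 minutes of budget
    # 99.9000% over 30 days -> 43.2 minutes of budget
    # 99.9900% over 30 days -> 4.3 minutes of budget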

Read about how Heroku deployed a security fix to their fleet of customer Redis instances. This is awesome:

Our fleet roll code only schedules replacement operations during the current on-call operator’s business hours. This limits burnout by reducing the risk of the fleet roll waking them up at night.

Camille Baldock — Heroku
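
The pattern is worth stealing: before an automated system does anything disruptive, have it check the clock against the current on-call operator's working hours and defer otherwise. Here's a rough Python sketch of that guard, with the 09:00-17:00 window and the timezone as my own assumptions rather than anything from Heroku's implementation:

    # Gate automated maintenance to the on-call operator's business hours.
    # A sketch only: the weekday 09:00-17:00 window and timezone are
    # illustrative assumptions, not details of Heroku's fleet roll system.
    from datetime import datetime, time
    from typing import Optional
    from zoneinfo import ZoneInfo

    ONCALL_TZ = ZoneInfo("America/New_York")  # assumed on-call timezone
    BUSINESS_START = time(9, 0)
    BUSINESS_END = time(17, 0)

    def within_business_hours(now: Optional[datetime] = None) -> bool:
        """True if it's a weekday within the on-call's working hours."""
        local = (now or datetime.now(ONCALL_TZ)).astimezone(ONCALL_TZ)
        return local.weekday() < 5 and BUSINESS_START <= local.time() < BUSINESS_END

    def maybe_schedule_replacement(instance_id: str) -> bool:
        """Run the replacement now, or defer until the next window."""
        if within_business_hours():
            print(f"rolling {instance_id} now")
            return True
        print(f"deferring {instance_id}: outside on-call business hours")
        return False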

In this article I’m going to explore how multi-level automated chaos experiments can be used to explore system weaknesses that cross the boundaries between the technical and people/process/practices levels.

Russ Miles — ChaosIQ

A comparison of 2 free and 6 paid tools for load testing, along with advice on how to use them.

Noah Heinrich — ButterCMS

One could even call this article, “Why having a single microservice that every other microservice depends on is a bad idea”.

Mark Henke — Rollout.io

Outages

  • Google Cloud Platform
    • Perhaps you noticed that a ton of sites fell over this past Tuesday? Or maybe you were on the front lines dealing with it yourself. Google’s Global Load Balancer fleet suffered a major outage, and they posted this detailed analysis/apology the next day.
  • Amazon’s Prime Day
    • Seems like a tradition at this point…
  • Azure
    • A BGP announcement error caused global instability for VM instances trying to reach Azure endpoints.
  • PagerDuty
  • Slack
  • Atlassian Statuspage
  • British Airways
  • Twitter
  • Fortnite: Playground LTM Postmortem
    • This is a really juicy incident analysis! Epic Games tried to release a new game mode for Fortnite and quickly discovered a major scaling issue in their system, which they explain in great detail.

      The process of getting Playground stable and in the hands of our players was tougher than we would have liked, but was a solid reminder that complex distributed systems fail in unpredictable ways. We were forced to make significant emergency upgrades to our Matchmaking Service, but these changes will serve the game well as we continue to grow and expand our player base into the future.

      The Fortnite Team — Epic Games

  • Snapchat
  • Facebook
  • reddit