SRE Weekly Issue #109

Articles

Pusher had a problem: their service was being bombarded by connections from rogue clients, and they needed to enforce limits. This article is highly polished, with beautiful diagrams and well-constructed explanations.

This is the story of how we quelled the biggest threat to our service uptime for several years.

Structured Logging and Your Team

Structured logging can bring a lot of uniformity to your infrastructure, as lovingly explained in this article. Snyk explains how that uniformity allows for a standardized troubleshooting methodology that helps them get to the bottom of most problems in minutes.

Instead of focusing on the individual intricacies of each part of our system, we train on the common tools to be used for almost every kind of problem.
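For readers who haven't worked with structured logs, here is a minimal sketch of the idea using Python's standard `logging` module. It is not taken from Snyk's article: the `JsonFormatter` class, the `make_logger` helper, and the `extra={"fields": ...}` convention are all assumptions made for illustration.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Hypothetical convention: callers pass extra={"fields": {...}}.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def make_logger(name, stream):
    """Build a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()
log = make_logger("checkout", buffer)
log.info("payment failed", extra={"fields": {"request_id": "abc123", "status": 502}})
line = buffer.getvalue().strip()
parsed = json.loads(line)
```

Because every service emits the same keys, one query (say, on `request_id`) works across all of them, which is roughly the uniformity the article describes.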

Your Feature Flag Management Needs to Include Retirement

Feature flags are awesome! But there’s a downside: adding lots of conditional handling to your code can significantly increase code complexity, which can in turn decrease maintainability and increase risk.
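One way to keep retirement on the radar is to record when each flag was introduced and periodically list the ones that have outlived their expected lifetime. The sketch below is an illustration only, not the method from the linked article: the `Flag` type, the 90-day default, and the flag names are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Flag:
    name: str
    created: date
    max_age_days: int = 90  # assumed default lifetime, chosen for illustration


def stale_flags(flags, today):
    """Return the names of flags older than their allowed lifetime."""
    return [
        f.name
        for f in flags
        if today - f.created > timedelta(days=f.max_age_days)
    ]


flags = [
    Flag("new_checkout", date(2017, 10, 1)),
    Flag("dark_mode", date(2018, 1, 20)),
]
overdue = stale_flags(flags, today=date(2018, 2, 11))
```

A check like this could run in CI and fail the build, or simply open a ticket, when `overdue` is non-empty.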

Charity Majors on Twitter

Following up on her appearance in the New York Times last week, Charity Majors posted this excellent Twitter thread about the importance of vendor relationship management and generating business value, as any kind of engineer. I’d argue especially as an SRE.

Google Cloud Platform Blog: Applying the Escalation Policy

Here’s the latest in Google’s CRE Life Lessons series. Previously, they explained how to build an Escalation Policy, and in this article, they analyze how it would be applied to several fictitious scenarios.

Dynamometer: Scale Testing HDFS on Minimal Hardware with Maximum Fidelity

LinkedIn needed a way to test their HDFS cluster against real-world traffic patterns. The existing solutions didn’t meet their needs (for reasons they explain toward the end), so they created Dynamometer.

Humanize Your Digital Operations

PagerDuty released a report this week entitled “The State of IT Work-Life Balance”, which contains the results of their recent survey. This article is an overview, along with some related tidbits about alert fatigue.

Schrodinger’s Outage

Through an anecdote, Baron Schwartz cautions against the use of counter-factuals (“you should have…”) in analyzing the decisions leading up to an outage.

8 Things to Monitor During a Software Deployment

What it says on the tin. This article would make for a great checklist for deploys.

Outages

Singpass (Singapore ID system)
Uber
Fortnite
- Fortnite hit a new peak of 3.4 million concurrent players last Sunday… and that didn’t come without issues!
  
  They suffered 6 different outages over two days, and they posted this highly detailed incident analysis just 5 days later. Normally I tend not to include outages for MMO games because they have so many and rarely post in-depth analyses, but this one is worth a read.
Binance (cryptocurrency exchange)
Google App Engine
US stock brokerages
- The US stock market had a rough week, and so did several brokerage websites as they dealt with the high trading volume.
Super Bowl Advertisers
- Several companies that purchased expensive commercial slots during the Super Bowl (an American sportsball thing, for you folks outside the US) were unable to handle the web traffic they brought in.
Super Bowl
- NBC had a 45-second blackout in their broadcast of the Superb Owl.
