SRE Weekly Issue #123

I hope you all had a happy GDPR day! SRE Weekly’s privacy policy has not changed. Folks that subscribed by email would have seen a message that I only share your email address with MailChimp, and that’s the way it will stay.

You can unsubscribe at any time by following the link at the bottom of the email, but if you have any trouble at all with unsubscribing, please don’t hesitate to email me and I’ll take care of it for you.

Articles

LinkedOut: A Request-Level Failure Injection Framework

The system is highly configurable, allowing fine-grained A/B testing of failures at all levels of the microservice call tree.

Ephemeral port exhaustion and how to avoid it

Ephemeral port exhaustion can really ruin your day. Read this to learn how to deal with it, how to detect it before you have problems, and why you might run out of ports sooner than you expect.

Will Sewell — Pusher

Incident Management at Spotify

This incident report from 2013 is a great read. It’s really two incidents in one, including an analysis of why a remediation task from the first wasn’t completed in time to prevent the second.

David Poblador i Garcia — Spotify

Charity Majors on Observability and Understanding the Operational Ramifications of a System

There are a few nice tidbits in this interview, including this one:

[…] the health of the system no longer matters. We’ve entered an era where what matters is the health of each individual event, or each individual user’s experience […]

Daniel Bryant – InfoQ

Canary Deployment: What Is It and How Can I Use It?

This article gives an introduction to implementing canary deployment and also discusses the potential downsides.

Erik [surname not given] — Rollout.io

Open-sourcing Katran, a scalable network load balancer

Lots of great detail in this announcement, including an analysis of how (and why) they designed their load balancer to process packets in the kernel with XDP and eBPF rather than through a kernel bypass mechanism.

Nikita Shirokov and Ranjeeth Dasineni — Facebook

Building low-overhead metrics collection for high-performance systems

Metrics are great, right? Except sometimes they’re not, when the metric collection itself adds enough load to impair the system.

Jonathan Brown — Wallaroo

Outages

Google BigQuery
- Click through for the full incident report.
  
  Configuration changes being rolled out on the evening of the incident were not applied in the intended order.
GCP Networking in us-east4
- Here’s some detail on the BGP issue that took down us-east4 last week.
Google Stackdriver
- It’s a hat trick of GCP incident followup reports. Happy reading!
Slack
Bank of New Zealand
Twitter
National Australia Bank
- This outage is particularly notable because the bank has stated their intention to compensate customers for their losses, such as estimated lost revenues from inability to complete sales transactions.
