SRE Weekly Issue #123

I hope you all had a happy GDPR day!  SRE Weekly’s privacy policy has not changed.  Folks that subscribed by email would have seen a message that I only share your email address with MailChimp, and that’s the way it will stay.

You can unsubscribe at any time by following the link at the bottom of the email, but if you have any trouble at all with unsubscribing, please don’t hesitate to email me and I’ll take care of it for you.

SPONSOR MESSAGE

Maintaining reliability through cloud migration can be difficult. Learn how implementing an incident management solution can make migration faster, reduce costs, and make SRE-life easier: http://try.victorops.com/SREWeekly/Cloud-Migration-Incident-Management

Articles

The system is highly configurable, allowing fine-grained A/B testing of failures at all levels of the microservice call tree.

Ephemeral port exhaustion can really ruin your day. Read this to learn how to deal with it, how to detect it before you have problems, and why you might run out of ports sooner than you expect.

Will Sewell — Pusher
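
To give a rough feel for why ports can run out sooner than expected, here is a back-of-the-envelope sketch. It is not taken from the linked article. It assumes the common Linux default ephemeral range of 32768 to 60999 (readable from /proc/sys/net/ipv4/ip_local_port_range) and a 60-second TIME_WAIT period, and it ignores socket reuse options. The function names are made up for illustration.

```python
# Rough estimate of how quickly a client can exhaust ephemeral ports when
# opening short-lived connections to a single destination IP and port.
# Assumptions (not from the linked article): the Linux default port range
# "32768 60999" and a 60-second TIME_WAIT, with no socket reuse options.

DEFAULT_RANGE = "32768\t60999\n"  # typical contents of ip_local_port_range


def ephemeral_port_count(port_range: str) -> int:
    """Count the ports in a 'low high' range string (both ends inclusive)."""
    low, high = (int(part) for part in port_range.split())
    return high - low + 1


def max_sustained_rate(port_count: int, time_wait_seconds: float = 60.0) -> float:
    """Connections per second that can be sustained before ports run out.

    Each closed connection holds its port for time_wait_seconds, so at most
    port_count connections can be opened within any window of that length.
    """
    return port_count / time_wait_seconds


if __name__ == "__main__":
    ports = ephemeral_port_count(DEFAULT_RANGE)
    rate = max_sustained_rate(ports)
    print(f"{ports} ephemeral ports, roughly {rate:.0f} new connections/sec")
```

Under these assumptions the ceiling is a few hundred new connections per second to any one destination, which a busy service can reach well before CPU or memory become a concern.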

This incident report from 2013 is a great read. It’s really two incidents in one, including an analysis of why a remediation task from the first wasn’t completed in time to prevent the second.

David Poblador i Garcia — Spotify

There are a few nice tidbits in this interview, including this one:

[…] the health of the system no longer matters.  We’ve entered an era where what matters is the health of each individual event, or each individual user’s experience […]

Daniel Bryant – InfoQ

This article has an introduction to implementing canary deployments and also includes a discussion of the potential downsides.

Erik [surname not given] — Rollout.io
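
As a minimal sketch of one common way to split traffic for a canary (not necessarily the approach the linked article describes), the snippet below assigns each user to a stable bucket from 0 to 99 by hashing their ID, then sends the lowest buckets to the canary. The function name and the idea of keying on a user ID are assumptions made for illustration.

```python
import hashlib


def routes_to_canary(user_id: str, canary_percent: int) -> bool:
    """Return True if this user should be served by the canary release.

    The user ID is hashed into a bucket from 0 to 99. Users whose bucket is
    below canary_percent go to the canary, so the same user always gets the
    same answer, and raising the percentage only ever adds users.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < canary_percent


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(1000)]
    for percent in (0, 10, 50, 100):
        share = sum(routes_to_canary(u, percent) for u in users)
        print(f"{percent}% canary: {share} of {len(users)} users")
```

Hashing instead of picking at random keeps each user on one version across requests, which makes it easier to compare error rates between the two groups and to roll back cleanly.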

Lots of great detail in this announcement, including an analysis of how (and why) they designed their load balancer’s packet forwarding to run in the kernel without a kernel bypass mechanism.

Nikita Shirokov and Ranjeeth Dasineni — Facebook

Metrics are great, right? Except sometimes they’re not, when the metric collection itself adds enough load to impair the system.

Jonathan Brown — Wallaroo

Outages

  • Google BigQuery
    • Click through for the full incident report.

      Configuration changes being rolled out on the evening of the incident were not applied in the intended order.

  • GCP Networking in us-east4
    • Here’s some detail on the BGP issue that took down us-east4 last week.
  • Google StackDriver
    • It’s a hat trick of three GCP incident followup reports. Happy reading!
  • Slack
  • Bank of New Zealand
  • Twitter
  • National Australia Bank
    • This outage is particularly notable because the bank has stated their intention to compensate customers for their losses, such as estimated lost revenues from inability to complete sales transactions.
Updated: May 27, 2018 — 9:14 pm
A production of Tinker Tinker Tinker, LLC