SRE Weekly Issue #128

SPONSOR MESSAGE

Looking to go serverless? Beau Christensen, VictorOps Director of Platform Engineering, and Tom McLaughlin, Founder of ServerlessOps, sat down to talk about when VictorOps decided to venture into AWS:

http://try.victorops.com/SREWeekly/going-serverless

Articles

Humor for SREs! This is the most hilarious thing I’ve read all week.

James Mickens — USENIX ;login:logout

A rundown of the various ways that Linux systems can fail to boot.

Chris Siebenmann

A (raw) transcript of a chat about Bloomberg’s adoption of SRE practices. It might be worth dropping it in a text editor and removing all occurrences of the phrase “sort of”. The real meat is in the discussion of what Bloomberg has learned (text search: “lessons learned”) and how to sell SRE as necessary in a company (text search: “ROI”).

Alan Shimel — devops.com

Channels employs three time-honored techniques to deliver these messages at low latency: fan-out, sharding, and load balancing. Let’s look inside the box!

Jim Fisher — Pusher

An in-depth explanation of how consistent hashing works. Love the hand-drawn diagrams!

Srushtika Neelakantam — Ably
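
If you'd like to poke at the idea before clicking through, here's a minimal sketch of a consistent-hash ring in Python. It isn't taken from the article: the HashRing class, the use of SHA-256, and the 100 virtual nodes per server are all my own assumptions, chosen only to show the core property that adding a server moves keys only onto that new server.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a point on the ring (a 64-bit integer)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    """Toy consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes   # assumed value; real systems tune this
        self._points = []      # sorted ring positions
        self._owner = {}       # ring position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        """Place `vnodes` points for this node on the ring."""
        for i in range(self.vnodes):
            point = _hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove(self, node: str) -> None:
        """Drop every point that belongs to this node."""
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def lookup(self, key: str) -> str:
        """Return the node owning the first point clockwise from the key."""
        if not self._points:
            raise LookupError("ring is empty")
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


if __name__ == "__main__":
    ring = HashRing(["a", "b", "c"])
    keys = [f"key-{i}" for i in range(1000)]
    before = {k: ring.lookup(k) for k in keys}
    ring.add("d")
    moved = [k for k in keys if ring.lookup(k) != before[k]]
    print(f"{len(moved)} of {len(keys)} keys moved, all of them to the new node")
```

With a plain "hash mod N" scheme, going from three servers to four would instead reshuffle most keys across all servers.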

Have you ever needed to generate a random number in code? whether it’s for rolling a dice, or shuffling a set, this tweet thread is here for you! There’s no reason that it should be easy or obvious, very experienced programmers repeat common mistakes. I did, before I learned …

Not strictly SRE-related, but then again it’s by Colm MacCárthaigh, who is SRE-related.

Colm MacCárthaigh
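
I won't try to reproduce the thread here, but two mistakes that commonly come up in this area are modulo bias (reducing a random value with % when the range doesn't divide evenly) and hand-rolled shuffles that favor some orderings. Below is a rough Python sketch of the usual remedies, rejection sampling and a Fisher-Yates shuffle. The function names and the 8-bit default are my own assumptions for illustration; in real code, secrets.randbelow and random.shuffle already handle this for you.

```python
import secrets


def unbiased_below(n: int, bits: int = 8, source=secrets.randbits) -> int:
    """Uniform integer in [0, n) via rejection sampling over `bits`-bit draws.

    `source` is a hypothetical hook so the draw can be swapped out in tests.
    """
    span = 1 << bits
    if n <= 0 or n > span:
        raise ValueError("n must be in 1..2**bits")
    # Largest multiple of n that fits in the draw range; anything at or
    # above it would make the low residues slightly more likely.
    limit = span - (span % n)
    while True:
        x = source(bits)
        if x < limit:
            return x % n


def fisher_yates_shuffle(items):
    """Return a shuffled copy in which every ordering is equally likely."""
    result = list(items)
    for i in range(len(result) - 1, 0, -1):
        j = secrets.randbelow(i + 1)  # uniform index in [0, i]
        result[i], result[j] = result[j], result[i]
    return result


if __name__ == "__main__":
    die_roll = unbiased_below(6) + 1
    print("rolled", die_roll)
    print(fisher_yates_shuffle(["a", "b", "c", "d"]))
```

To see the bias being avoided: an 8-bit value has 256 possibilities, and 256 isn't a multiple of 6, so a bare "x % 6" would hit four of the six faces 43 times out of 256 and the other two only 42 times.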

What should you do if you blow your error budget? Depends on whether you leaked it like a dripping faucet or splurged it all on big outages. Either way, you’ll need to investigate and make a plan.

Adrian Hilton, Alec Warner and Alex Bramley — Google
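
For a feel of the arithmetic, here's a small Python sketch that computes an availability error budget and makes a crude guess at whether it was leaked or splurged. None of this comes from the Google article: the function names are hypothetical, and the rule of thumb (one incident accounting for at least half of the total spend counts as "big outages") is an arbitrary threshold I made up for illustration.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over a window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction between 0 and 1")
    return (1.0 - slo) * window_days * 24 * 60


def summarize_spend(incident_minutes, slo, window_days=30, big_share=0.5):
    """Compare downtime against the budget and label the spending pattern.

    `big_share` is an assumed heuristic, not an established definition.
    """
    budget = error_budget_minutes(slo, window_days)
    spent = float(sum(incident_minutes))
    largest = max(incident_minutes, default=0.0)
    if spent <= budget:
        pattern = "within budget"
    elif largest >= big_share * spent:
        pattern = "few big outages"
    else:
        pattern = "many small leaks"
    return {"budget": budget, "spent": spent, "pattern": pattern}


if __name__ == "__main__":
    # A 99.9% SLO over 30 days allows roughly 43.2 minutes of downtime.
    print(round(error_budget_minutes(0.999), 1))
    print(summarize_spend([1.0] * 60, slo=0.999)["pattern"])
    print(summarize_spend([50.0, 2.0, 3.0], slo=0.999)["pattern"])
```

Either label still means you've blown the budget; the point of telling them apart is that the follow-up investigation looks different in each case.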

I love the two-method approach: a simple migration path for users that aren’t active all the time, and a more careful (and more complex) path for very busy users.

Xiang Li and Thomas Georgiou — Facebook

If you haven’t implemented alerts on support page views yet, do it now!! and thank me later. Here’s a view of how our dashboard looked as of a few minutes ago – a clear demonstration of user impact that supplements existing monitors and alerts.…

Click through for the graph. Monitor status and support page views… do we actually need any other monitoring? Only half-kidding.

Sri Harsha Kalavala

Outages

  • Google BigQuery
    • Google posted a followup analysis of the BigQuery outage on June 22.

      A new release of the BigQuery API introduced a software defect that caused the API component to return larger-than-normal responses to the BigQuery router server.

  • Fastly
    • Full disclosure: Fastly is my employer.
  • G Suite Status Dashboard
  • Slack
    • This week, Slack had a ~3-hour, near-total outage. Click through for their followup post.

      The network problems were caused by a bug included in an offline batch process of data, which resulted in unexpected network spikes and led all of our customers to become disconnected and unable to reconnect.

  • Google Home and Chromecast
Updated: July 1, 2018 — 9:01 pm
A production of Tinker Tinker Tinker, LLC