SRE Weekly Issue #128

Articles

The Saddest Moment

Humor for SREs! This is the most hilarious thing I’ve read all week.

James Mickens — USENIX ;login:logout

A broad overview of how modern Linux systems boot

This focuses on various ways that Linux systems can fail to boot.

Chris Siebenmann

DevOps Chat: SRE w/ Stig Sorensen of Bloomberg

A (raw) transcript of a chat about Bloomberg’s adoption of SRE practices. It might be worth dropping it in a text editor and removing all occurrences of the phrase “sort of”. The real meat is in the discussion of what Bloomberg has learned (text search: “lessons learned”) and how to sell SRE as necessary in a company (text search: “ROI”).

Alan Shimel — devops.com

How Pusher Channels has delivered 10,000,000,000,000 messages

Channels employs three time-honored techniques to deliver these messages at low latency: fan-out, sharding, and load balancing. Let’s look inside the box!

Jim Fisher — Pusher

How we implemented consistent hashing efficiently

An in-depth explanation of how consistent hashing works. Love the hand-drawn diagrams!

Srushtika Neelakantam — Ably
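
For a concrete picture before clicking through, here is a minimal Python sketch of a consistent-hash ring. It is not Ably's implementation: the `HashRing` class, the choice of MD5 as a stable hash, and the default of 100 virtual nodes per server are all assumptions made for illustration.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    # MD5 is used only as a stable, evenly spread hash (an assumption),
    # not for any security property.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    """Hypothetical minimal consistent-hash ring with virtual nodes."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._points = []  # sorted hash positions on the ring
        self._owner = {}   # hash position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node):
        # Place several virtual points per node to even out the load.
        for i in range(self.vnodes):
            point = _hash(f"{node}#{i}")
            if point in self._owner:
                continue  # extremely unlikely collision: keep the first owner
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove(self, node):
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def lookup(self, key):
        if not self._points:
            raise LookupError("ring is empty")
        # Walk clockwise to the first point after the key, wrapping around.
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]
```

The property that makes this useful: when a node is removed, only the keys it owned move elsewhere, and every other key keeps its existing owner.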

Colm MacCárthaigh on Twitter: random number generation

Have you ever needed to generate a random number in code? whether it’s for rolling a dice, or shuffling a set, this tweet thread is here for you! There’s no reason that it should be easy or obvious, very experienced programmers repeat common mistakes. I did, before I learned …

Not strictly SRE-related, but then again it’s by Colm MacCárthaigh, who is SRE-related.

Colm MacCárthaigh
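
The excerpt does not say which mistakes the thread covers. One widely known pitfall when rolling a die from random bytes, offered here as an assumption about the kind of issue meant, is modulo bias: 256 is not a multiple of 6, so `byte % 6` favors some faces. The sketch below shows the bias and one common fix (rejection sampling). The function names are hypothetical.

```python
import secrets
from collections import Counter


def biased_roll(byte: int, sides: int = 6) -> int:
    # Naive mapping of one random byte (0..255) onto a die face.
    return byte % sides + 1


def unbiased_roll(sides: int = 6) -> int:
    # Rejection sampling: discard bytes at or above the largest
    # multiple of `sides` that fits in 256, then reduce the rest.
    limit = 256 - (256 % sides)
    while True:
        byte = secrets.randbits(8)
        if byte < limit:
            return byte % sides + 1


# Feeding every possible byte value through the naive mapping exposes the skew:
# faces 1-4 each come up 43 times, faces 5-6 only 42 times.
skew = Counter(biased_roll(b) for b in range(256))
```

In practice, Python's standard library already handles this: `secrets.randbelow(6) + 1` returns an evenly distributed roll.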

Understanding error budget overspend – part one – CRE life lessons

What should you do if you blow your error budget? Depends on whether you leaked it like a dripping faucet or splurged it all on big outages. Either way, you’ll need to investigate and make a plan.

Adrian Hilton, Alec Warner and Alex Bramley — Google
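
As a rough illustration of the faucet-versus-splurge distinction, the Python sketch below computes an error budget from an SLO and labels an overspend by how concentrated the errors were. The 99.9% SLO, the request counts, and the 50% "single bad day" threshold are invented for this example and do not come from the Google article.

```python
def error_budget(slo: float, total_requests: int) -> float:
    # Allowed failures in the window: the fraction the SLO does not promise.
    return (1.0 - slo) * total_requests


def classify_overspend(daily_errors, budget, outage_share=0.5):
    # `outage_share` is a hypothetical threshold: if one day accounts for
    # at least this fraction of all errors, call it a big outage.
    spent = sum(daily_errors)
    if spent <= budget:
        return "within budget"
    if max(daily_errors) >= outage_share * spent:
        return "big outage"
    return "slow leak"


budget = error_budget(0.999, 1_000_000)  # roughly 1,000 failed requests
leak = classify_overspend([40] * 30, budget)
splurge = classify_overspend([5] * 29 + [1500], budget)
```

Here `leak` comes out as "slow leak" (1,200 errors spread evenly over 30 days) and `splurge` as "big outage" (one day contributes 1,500 of 1,645 errors).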

Migrating Messenger storage to optimize performance

I love the two-method approach: a simple migration path for users that aren’t active all the time, and a more careful (and more complex) path for very busy users.

Xiang Li and Thomas Georgiou — Facebook

Sri Harsha Kalavala on Twitter: alert on support page views

If you haven’t implemented alerts on support page views yet, do it now!! and thank me later. Here’s a view of how our dashboard looked as of a few minutes ago – a clear demonstration of user impact that supplements existing monitors and alerts.…

Click through for the graph. Monitor status and support page views… do we actually need any other monitoring? Only half-kidding.

Sri Harsha Kalavala

Outages

Google BigQuery
- Google posted a follow-up analysis of the BigQuery outage on June 22.
  
  A new release of the BigQuery API introduced a software defect that caused the API component to return larger-than-normal responses to the BigQuery router server.
Fastly
- Full disclosure: Fastly is my employer.
G Suite Status Dashboard
Slack
- This week, Slack had a ~3-hour, near-total outage. Click through for their follow-up post.
  
  The network problems were caused by a bug included in an offline batch process of data, which resulted in unexpected network spikes and led all of our customers to become disconnected and unable to reconnect.
Google Home and Chromecast
