Articles
Humor for SREs! This is the most hilarious thing I’ve read all week.
James Mickens — USENIX ;login:logout
This focuses on various ways that Linux systems can fail to boot.
Chris Siebenmann
A (raw) transcript of a chat about Bloomberg’s adoption of SRE practices. It might be worth dropping it in a text editor and removing all occurrences of the phrase “sort of”. The real meat is in the discussion of what Bloomberg has learned (text search: “lessons learned”) and how to sell SRE as necessary in a company (text search: “ROI”).
Alan Shimel — devops.com
Channels employs three time-honored techniques to deliver these messages at low latency: fan-out, sharding, and load balancing. Let’s look inside the box!
Jim Fisher — Pusher
An in-depth explanation of how consistent hashing works. Love the hand-drawn diagrams!
Srushtika Neelakantam — Ably
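If you'd like something to poke at alongside the diagrams, here's a minimal sketch of a consistent hash ring. It isn't taken from the article: the `HashRing` class, the `vnodes` parameter, and the choice of SHA-256 are all my own assumptions for illustration.

```python
import bisect
import hashlib


class HashRing:
    """Toy consistent hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes   # ring positions per node (assumed default)
        self._keys = []        # sorted ring positions
        self._owners = {}      # ring position -> node name
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(value):
        # Take the first 8 bytes of SHA-256 as a 64-bit ring position.
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, node):
        for i in range(self.vnodes):
            pos = self._hash(f"{node}#{i}")
            if pos in self._owners:
                continue  # extremely unlikely collision: keep existing owner
            self._owners[pos] = node
            bisect.insort(self._keys, pos)

    def remove(self, node):
        for i in range(self.vnodes):
            pos = self._hash(f"{node}#{i}")
            if self._owners.get(pos) == node:
                del self._owners[pos]
                del self._keys[bisect.bisect_left(self._keys, pos)]

    def lookup(self, key):
        if not self._keys:
            raise LookupError("ring is empty")
        # First position clockwise from the key's hash, wrapping around.
        idx = bisect.bisect_right(self._keys, self._hash(key)) % len(self._keys)
        return self._owners[self._keys[idx]]


ring = HashRing(["a", "b", "c"])
before = {f"user-{i}": ring.lookup(f"user-{i}") for i in range(1000)}
ring.add("d")
moved = [k for k in before if ring.lookup(k) != before[k]]
print(f"{len(moved)} of {len(before)} keys moved after adding a node")
```

The property to notice: when a node joins, the only keys that move are the ones that now belong to the new node. Everything else stays put.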
Have you ever needed to generate a random number in code? whether it’s for rolling a dice, or shuffling a set, this tweet thread is here for you! There’s no reason that it should be easy or obvious, very experienced programmers repeat common mistakes. I did, before I learned …
Not strictly SRE-related, but then again it’s by Colm MacCárthaigh, who is SRE-related.
Colm MacCárthaigh
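One widely known mistake in this area is modulo bias: reducing a random value with `%` skews the result whenever the range doesn't evenly divide the source range. I'm not claiming this is exactly what the thread covers. Here's a rough sketch of the usual fix, rejection sampling, with hypothetical helper names `unbiased_below` and `roll_die`. In real Python code you'd just call `secrets.randbelow`.

```python
import secrets


def unbiased_below(n):
    """Return a uniform integer in [0, n) using rejection sampling."""
    if n <= 0:
        raise ValueError("n must be positive")
    if n == 1:
        return 0
    bits = (n - 1).bit_length()  # fewest bits that can represent n - 1
    while True:
        candidate = secrets.randbits(bits)
        if candidate < n:        # accept only in-range values;
            return candidate     # retrying instead of wrapping avoids bias


def roll_die(sides=6):
    """Roll a fair die with the given number of sides (1..sides)."""
    return unbiased_below(sides) + 1


print(roll_die())
```

Each attempt succeeds with probability above one half, so the loop ends quickly in practice.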
What should you do if you blow your error budget? Depends on whether you leaked it like a dripping faucet or splurged it all on big outages. Either way, you’ll need to investigate and make a plan.
Adrian Hilton, Alec Warner and Alex Bramley — Google
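For a feel of the arithmetic, here's a small sketch that turns an availability SLO into a budget in minutes and makes a crude "leak or splurge" call. The function names and the 50% threshold for what counts as a big outage are my own assumptions, not anything from the article.

```python
def error_budget_minutes(slo, window_days=30):
    """Minutes of allowed unavailability for an availability SLO such as 0.999."""
    return (1.0 - slo) * window_days * 24 * 60


def classify_burn(incident_minutes, budget_minutes, big_outage_share=0.5):
    """Rough label for how the budget was spent.

    big_outage_share is an assumed threshold: if the single largest
    incident accounts for at least this share of the total, call it a splurge.
    """
    spent = sum(incident_minutes)
    if spent <= budget_minutes:
        return "within budget"
    if max(incident_minutes) >= big_outage_share * spent:
        return "splurged on big outages"
    return "leaked in many small incidents"


budget = error_budget_minutes(0.999)          # roughly 43.2 minutes in 30 days
print(classify_burn([40, 5, 3], budget))      # one 40-minute outage dominates
print(classify_burn([5] * 10, budget))        # ten 5-minute blips
```

A real policy would look at more than one number, but even this much tells you which of the two investigations you're about to start.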
I love the two-method approach: a simple migration path for users that aren’t active all the time, and a more careful (and more complex) path for very busy users.
Xiang Li and Thomas Georgiou — Facebook
If you haven’t implemented alerts on support page views yet, do it now!! and thank me later. Here’s a view of how our dashboard looked as of a few minutes ago – a clear demonstration of user impact that supplements existing monitors and alerts.…
Click through for the graph. Monitor status and support page views… do we actually need any other monitoring? Only half-kidding.
Sri Harsha Kalavala
Outages
- Google BigQuery
- Google posted a followup analysis of the BigQuery outage on June 22.
A new release of the BigQuery API introduced a software defect that caused the API component to return larger-than-normal responses to the BigQuery router server.
- Fastly
- Full disclosure: Fastly is my employer.
- G Suite Status Dashboard
- Slack
- This week, Slack had a ~3-hour, near-total outage. Click through for their followup post.
The network problems were caused by a bug included in an offline batch process of data, which resulted in unexpected network spikes and led all of our customers to become disconnected and unable to reconnect.
- Google Home and Chromecast