SRE Weekly Issue #19

Articles

I just love this story. I heard Rachel Kroll tell it during her keynote at SREcon, and here it is in article form. It’s an incredibly deep dive into a gnarly debugging session, and I can’t recommend it highly enough. NSFL (not safe for the library), because it’s pretty darned hilarious.

Christine Spang of Nylas shares a story of migrating from RDS to sharded self-run MySQL clusters using SQLProxy. I love the detail here! I’m looking to get more deeply technical articles in SRE Weekly, so if you come across any, I’d love it if you’d point them out to me.
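The article itself has the real specifics; purely as an illustration of the kind of key-based shard routing a setup like this tends to rely on, here’s a minimal sketch in Python. The shard map, host names, and hashing scheme are my own placeholder assumptions, not details from the Nylas post.

import zlib

# Hypothetical shard map: shard id -> MySQL endpoint (placeholder hosts, not real ones)
SHARDS = {
    0: "mysql-shard-0.internal:3306",
    1: "mysql-shard-1.internal:3306",
    2: "mysql-shard-2.internal:3306",
    3: "mysql-shard-3.internal:3306",
}

def shard_for(account_id: str) -> str:
    """Deterministically pin an account to one shard via a stable hash."""
    shard_id = zlib.crc32(account_id.encode("utf-8")) % len(SHARDS)
    return SHARDS[shard_id]

# All queries for this account would then be routed, directly or through a
# proxy layer, to the shard returned here.
print(shard_for("account-12345"))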

Here’s the latest in Mathias Lafeldt’s Production Ready series. He argues that too few failures can actually be a bad thing and makes the case for a chaos engineering approach.

Complacency is the enemy of resilience. The longer you wait for disaster to strike in production — merely hoping that everything will be okay — the less likely you are to handle emergencies well, both at a technical and organizational level.

Timesketch is a tool for building timelines. It could be useful for building a deeper understanding of an incident as part of a retrospective.

Anthony Caiafa shares his take on what SRE actually means. To me, SRE seems to be a field even more in flux than DevOps, and definitions have yet to settle. For example, I feel that there’s a lot that a non-programmer can add to an SRE team — you just have to really think about what it means to engineer reliability (e.g. process design).

GitHub details DGit, its new high-availability system for storing git repositories internally. Previously, GitHub used pairs of servers, each with RAID mirroring, kept in sync using DRBD.

An early review of Google’s new SRE book by Mike Doherty, a Google SRE. He was only peripherally involved in the publication and gives a fairly balanced take on the book. For an outside perspective, see danluu’s detailed chapter-by-chapter notes.

Amazon.com famously runs on AWS, so any AWS outage could potentially impact Amazon. Google, on the other hand, doesn’t currently run any of its external services on Google Cloud Platform. This article argues that doing so would create a much stronger incentive to improve and sustain GCP’s reliability.

However, when Google had its recent 12-hour outage that took Snapchat offline, it didn’t impact any of Google’s real revenue-generating services. […] What would the impact have been if Google Search was down for 12 hours?

Thanks to Charity for this one.

Oops.

Note that there’s been some question in hangops #sre about whether this is a hoax. Either way, I could totally see it happening.

I love the fact that statuspage.io is the author of this article. How many of us have agonized over the exact wording of a status site post?

Outages

  • Yahoo Mail
  • Business Wire
  • Google Compute Engine
    • GCE suffered a severe network outage. It started as increased latency and at worst became a full outage of internet connectivity. Two days after the incident, Google released the best postmortem I’ve seen in a very long time. Full transparency, a terrible juxtaposition of two nasty bugs, a heartfelt apology, fourteen(!) remediation items… it’s clear their incident response was solid and they immediately did a very thorough retrospective.

  • North Korea
    • North Korea had a series of internet outages, each of the same length at the same time on consecutive days. It’s interesting how people are trying to learn things about the reclusive country just from this pattern of outages.

  • Blizzard's Battle.net
  • Twitter
  • Misco
  • Two alt-coin exchanges (ShapeShift and Poloniex)
  • Home Depot