SRE Weekly Issue #19

Articles

I just love this story. I heard Rachel Kroll tell it during her keynote at SREcon, and here it is in article form. It’s an incredibly deep dive into a gnarly debugging session, and I can’t recommend it highly enough. NSFL (not safe for the library), because it’s pretty darned hilarious.

Christine Spang of Nylas shares a story of migrating from RDS to sharded self-run MySQL clusters using SQLProxy. I love the detail here! I’m looking to get more deeply technical articles in SRE Weekly, so if you come across any, I’d love it if you’d point them out to me.
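The article itself has the real specifics; purely as an illustration of the kind of key-based shard routing a setup like this tends to rely on, here’s a minimal sketch in Python. The shard map, host names, and hashing scheme are my own placeholder assumptions, not details from the Nylas post.

import zlib

# Hypothetical shard map: shard id -> MySQL endpoint (placeholder hosts, not real ones)
SHARDS = {
    0: "mysql-shard-0.internal:3306",
    1: "mysql-shard-1.internal:3306",
    2: "mysql-shard-2.internal:3306",
    3: "mysql-shard-3.internal:3306",
}

def shard_for(account_id: str) -> str:
    """Deterministically pin an account to one shard via a stable hash."""
    shard_id = zlib.crc32(account_id.encode("utf-8")) % len(SHARDS)
    return SHARDS[shard_id]

# All queries for this account would then be routed, directly or through a
# proxy layer, to the shard returned here.
print(shard_for("account-12345"))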

Here’s the latest in Mathias Lafeldt’s Production Ready series. He argues that too few failures can actually be a bad thing and makes the case for a chaos engineering approach.

Complacency is the enemy of resilience. The longer you wait for disaster to strike in production — merely hoping that everything will be okay — the less likely you are to handle emergencies well, both at a technical and organizational level.

Timesketch is a tool for building timelines. It could be useful for building a deeper understanding of an incident as part of a retrospective.

Anthony Caiafa shares his take on what SRE actually means. To me, SRE seems to be a field even more in flux than DevOps, and definitions have yet to settle. For example, I feel that there’s a lot that a non-programmer can add to an SRE team — you just have to really think about what it means to engineer reliability (e.g. process design).

GitHub details DGit, its new high-availability system for storing git repositories internally. Previously, GitHub used pairs of servers, each with RAID mirroring, kept in sync using DRBD.

An early review of Google’s new SRE book by Mike Doherty, a Google SRE. He was only peripherally involved in the publication and gives a fairly balanced take on the book. For an outside perspective, see danluu’s detailed chapter-by-chapter notes.

Amazon.com famously runs on AWS, so any AWS outage could potentially impact Amazon. Google, on the other hand, doesn’t currently run any of its external services on Google Cloud Platform. This article argues that doing so would create a much stronger incentive to improve and sustain GCP’s reliability.

However, when Google had its recent 12-hour outage that took Snapchat offline, it didn’t impact any of Google’s real revenue-generating services. […] What would the impact have been if Google Search was down for 12 hours?

Thanks to Charity for this one.

Oops.

Note that there’s been some question in hangops #sre about whether this is a hoax. Either way, I could totally see it happening.

I love the fact that statuspage.io is the author of this article. How many of us have agonized over the exact wording of a status site post?

Outages

  • Yahoo Mail
  • Business Wire
  • Google Compute Engine
    • GCE suffered a severe network outage. It started as increased latency and at worst became a full outage of internet connectivity. Two days after the incident, Google released the best postmortem I’ve seen in a very long time. Full transparency, a terrible juxtaposition of two nasty bugs, a heartfelt apology, fourteen(!) remediation items… it’s clear their incident response was solid and they immediately did a very thorough retrospective.

  • North Korea
    • North Korea had a series of internet outages, each of the same length at the same time on consecutive days. It’s interesting how people are trying to learn things about the reclusive country just from this pattern of outages.

  • Blizzard's Battle.net
  • Twitter
  • Misco
  • Two alt-coin exchanges (ShapeShift and Poloniex)
  • Home Depot