SRE Weekly Issue #19

Articles

I just love this story. I heard Rachel Kroll tell it during her keynote at SREcon, and here it is in article form. It’s an incredibly deep dive through a gnarly debugging session, and I can’t recommend enough that you read it. NSFL (not safe for the library), because it’s pretty darned hilarious.

Christine Spang of Nylas shares a story of migrating from RDS to sharded, self-run MySQL clusters using SQLProxy. I love the detail here! I’m looking to include more deeply technical articles in SRE Weekly, so if you come across any, I’d love it if you’d point them out to me.

Here’s the latest in Mathias Lafeldt’s Production Ready series. He makes the argument that too few failures can be a bad thing and argues for a chaos engineering approach.

Complacency is the enemy of resilience. The longer you wait for disaster to strike in production — merely hoping that everything will be okay — the less likely you are to handle emergencies well, both at a technical and organizational level.

Timesketch is a tool for building timelines. It could be useful for building a deeper understanding of an incident as part of a retrospective.

Anthony Caiafa shares his take on what SRE actually means. To me, SRE seems to be a field even more in flux than DevOps, and definitions have yet to settle. For example, I feel that there’s a lot that a non-programmer can add to an SRE team — you just have to really think about what it means to engineer reliability (e.g. process design).

GitHub details DGit, their new high-availability system for storing Git repositories internally. Previously, they used pairs of servers, each with RAID mirroring, kept in sync using DRBD.

An early review of Google’s new SRE book by Mike Doherty, a Google SRE. He was only peripherally involved in the publication and gives a fairly balanced take on the book. For an outside perspective, see danluu’s detailed chapter-by-chapter notes.

Amazon.com famously runs on AWS, so any AWS outage could potentially impact Amazon. Google, on the other hand, doesn’t currently run any of its external services on Google Cloud Platform. This article makes the argument that doing so would create a much bigger incentive to improve and sustain GCP’s reliability.

However, when Google had its recent 12-hour outage that took Snapchat offline, it didn’t impact any of Google’s real revenue-generating services. […] What would the impact have been if Google Search was down for 12 hours?

Thanks to Charity for this one.

Oops.

Note that there’s been some question on hangops #sre as to whether this is a hoax. Either way, I could totally see it happening.

I love the fact that statuspage.io is the author of this article. How many of us have agonized over the exact wording of a status site post?

Outages

  • Yahoo Mail
  • Business Wire
  • Google Compute Engine
    • GCE suffered a severe network outage. It started as increased latency and at worst became a full outage of internet connectivity. Two days after the incident, Google released the best postmortem I’ve seen in a very long time. Full transparency, a terrible juxtaposition of two nasty bugs, a heartfelt apology, fourteen(!) remediation items… it’s clear their incident response was solid and they immediately did a very thorough retrospective.

  • North Korea
    • North Korea had a series of internet outages, each of the same length at the same time on consecutive days. It’s interesting how people are trying to learn things about the reclusive country just from this pattern of outages.

  • Blizzard's Battle.net
  • Twitter
  • Misco
  • Two Alt-Coin exchanges (Shapeshift and Poloniex)
  • Home Depot

SRE Weekly Issue #18

SREcon16 was awesome! Sorry for the light issue this week — still recovering from my con-hangover. I had an incredible time, and I enjoyed meeting many of you, both old subscribers and new. Thank you all for your support! When USENIX posts their recordings, I’ll share links to some of my favorite talks.

QotW, from Charity Majors’s day 1 closing keynote (paraphrased):

There are no bad decisions. We make the best decisions we can with the information we have at the time.

Love it. The second QotW was from Rachel Kroll’s day 1 opening keynote, which included a hilarious and cringe-worthy story of investigating a very well-hidden bug with an incredibly bizarre set of symptoms. I can’t recommend enough watching the keynotes, and, well, every talk.

More content next week, after I’ve caught up on my RSS feeds. Thanks again for the huge amount of support you all have shown me — all 250+ of you (and that’s just email subscribers)!

Articles

Telstra exec Kate McKenzie detailed some findings from internal investigations into the recent spate of Telstra incidents. There’s some nice detail here, including possible remediation items and an implication that Telstra is using a blameless retrospective process.

This is a short but excellent template for incident retrospectives in the form of a series of questions. A great place to start if you’re looking to improve your retrospective process.

Etsy’s morgue, a tool for tracking information related to postmortem investigations.

A rockin’ postmortem detailing the failure and recovery of a 1.7 PB filesystem, featuring the creation of a 3 TB ramdisk(!) to speed up the operation.

Thanks to phill-atlassian on hangops #incident_response for this one.

Outages

SRE Weekly Issue #17

I’m posting this week’s issue from the airport on my way to the west coast for business and SREcon16. I’m hoping to see some of you there! I’ll have a limited number of incredibly exclusive hand-made SRE Weekly patches to give out — just ask.

Articles

I love surveys! This one is about incident response as it applies to security and operations. The study author is looking to draw parallels between these two kinds of IR. I can’t wait for the results, and I’ll definitely link them here.

Charity Majors gives us this awesomely detailed article about a Terraform nightmare. An innocent TF run in staging led to a merry bug-hunt down the rabbit-hole and ended in wiping out production — thankfully on a not-yet-customer-facing service. She continues on with an excellent HOWTO on fixing your Terraform config to avoid this kind of pitfall.

If you can’t safely test your changes in isolation away from prod, you don’t have infrastructure as code.

Uber set up an impressively rigorous test to determine which combination of serialization format and compression algorithm would hit the sweet spot between data size and compression speed. The article itself doesn’t directly touch on reliability, but of course running out of space in production is a deal-breaker, and I just love their methodology.
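
As a rough illustration of that kind of methodology (a toy sketch, not Uber’s actual harness; the payload, formats, and codecs here are stand-ins), here’s how you might measure encoded size against compression time in Python:

    import json, lzma, pickle, time, zlib

    # Hypothetical sample payload standing in for real production records.
    records = [
        {"id": i, "lat": 37.77 + i * 1e-4, "lng": -122.42, "status": "ok"}
        for i in range(10_000)
    ]

    serializers = {
        "json": lambda d: json.dumps(d).encode("utf-8"),
        "pickle": lambda d: pickle.dumps(d, protocol=pickle.HIGHEST_PROTOCOL),
    }
    compressors = {
        "zlib-6": lambda b: zlib.compress(b, 6),
        "lzma-0": lambda b: lzma.compress(b, preset=0),
    }

    for s_name, serialize in serializers.items():
        raw = serialize(records)
        for c_name, compress in compressors.items():
            start = time.perf_counter()
            packed = compress(raw)
            elapsed = time.perf_counter() - start
            print(f"{s_name}+{c_name}: {len(raw) / 1024:.0f} KiB -> "
                  f"{len(packed) / 1024:.0f} KiB in {elapsed * 1000:.1f} ms")

A real comparison would sweep many more formats and compression levels and use representative production data, but the shape of the experiment is the same: measure both axes and pick the point on the size/speed curve that fits your constraints.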

I make heavy use of Pinboard to automate my article curation for SRE Weekly. This week, IFTTT decided to axe support for Pinboard, and they did it in a kind of jerky way. The service’s owner Maciej wrote up a pretty hilarious and pointed explanation of the situation.

Thanks to Courtney for this one.

HipChat took another outage last week when they tried to push a remediation from a previous outage. Again with admirable speed, they’ve posted a detailed postmortem including some excellent lessons that we can all learn from.

This deployment was an important remediation from the previous outages and seemed like the right thing to do.
Lesson learned: No matter how much you want to remediate a problem for your users, consider the risk, complexity, and timing, and then reconsider again.

I love human error. Or rather, I love when an incident is reported as “human error”, because the story is inevitably more nuanced than that. Severe incidents are always the result of multiple things going wrong simultaneously. In this case, it was an operator mistake, insufficient radios and badges for responders, and lack of an established procedure for alerting utility customers.

A detailed exploration of latency and how it can impact online services, especially games.

Online gaming customers are twice as likely to abandon a game when they experience a network delay of 50 additional milliseconds

Say “eliminate downtime” and I’ll be instantly skeptical, but this article is a nice overview of predictive maintenance systems in datacenters.

Data centers use complex hardware that presents unforeseen problems that calendar-based maintenance checks simply cannot anticipate.

Outages

SRE Weekly Issue #16

Another packed issue this week, thanks in no small part to the folks on hangops #incident_response. You all rock!

This week, I broke 200 email subscribers. Thank you all so much! At Charity Majors’s suggestion, I’ve started a Twitter account, SREWeekly, where I’ll post a link to each week’s issue as it comes out. Feel free to unsubscribe via email and follow there instead, if you’d prefer.

Articles

I love this article! Everything it says can be readily applied to SRE. It touches on blameless culture, causes of errors and methods of capturing incident information. Then there’s this excellent tidbit about analyzing all incidents, even near misses:

The majority of organizations target their most serious incidents for immediate attention. Events that lead to severe and/or permanent injury or death are typically underscored in an effort to prevent them from ever happening again. But recurrent errors that have the potential to do harm must also be prioritized for attention and process improvement. After all, whether an incident ultimately results in a near miss or an event of harm leading to a patient’s death is frequently a matter of a provider’s thoughtful vigilance, the resilience of the human body in resisting catastrophic consequences from the event, or sheer luck.

A short postmortem by PagerDuty for an incident earlier this month. I like how precise their impact figures are.

Thanks to cheeseprocedure on hangops #incident_response for this one.

Look past the contentious title, and you’ll see that this one’s got some really good guidelines for running an effective postmortem. To be honest, I think they’re saying essentially the same thing as the “blameless postmortem” folks. You can’t really be effective at finding a root cause without mentioning operator errors along the way; it’s just a matter of how they’re discussed.

Ultimately, the secret of those mythical DevOps blameless cultures that hold the actionable postmortems we all crave is that they actively foster an environment that accepts the realities of the human brain and creates a space to acknowledge blame in a healthy way. Then they actively work to look beyond it.

Thanks to tobert on hangops #incident_response for this one.

Ithaca College has suffered a series of days-long network outages, crippling everything from coursework to radio broadcasts. Their newspaper’s editorial staff spoke out this week on the cause and impact of the outages.

iTWire interviews Matthew Kates, Australia Country Manager for Zerto, a DR software firm, about the troubles Telstra has been dealing with. Kates does an admirable job of avoiding plugging his company, instead offering an excellent analysis of Telstra’s situation. He also gives us this gem, made achingly clear this week by Gliffy’s troubles:

Backing up your data once a day is no longer enough in this 24/7 ‘always on’ economy. Continuous data replication is needed, capturing every change, every second.

I love the part about an “architecture review” before choosing to implement a design for a new component (e.g. Kafka) and an “operability review” before deployment to ensure that monitoring, runbooks, etc. are all in place.

Atlassian posted an excellent, high-detail postmortem on last week’s instability. One of the main causes was an overloaded NAT service in a VPC, compounded by aggressive retries from their client.
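
Aggressive retries compounding an already overloaded dependency is a classic failure mode. As a general mitigation (not Atlassian’s actual client code), here’s a minimal Python sketch of capped exponential backoff with full jitter:

    import random
    import time

    def call_with_backoff(do_request, max_attempts=5, base=0.5, cap=30.0):
        """Retry a flaky call with capped exponential backoff and full jitter,
        so a fleet of clients doesn't hammer a struggling dependency in lockstep."""
        for attempt in range(max_attempts):
            try:
                return do_request()
            except ConnectionError:
                if attempt == max_attempts - 1:
                    raise
                # Sleep for a random duration in [0, min(cap, base * 2**attempt)].
                time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))

Without the jitter and the cap, every client backs off and retries on the same schedule, which can turn a brief blip at a shared bottleneck such as a NAT instance into sustained overload.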

Technically speaking, I’m not sure the NPM drama this week caused any actual production outages, but I feel like I’d be remiss in not mentioning it. Suffice it to say that we can never ignore human factors.

In reading the workshop agenda, it’s interesting to see how they handle human error in drug manufacturing.

Pusher shares a detailed description of their postmortem incident analysis process. I like that they front-load a lot of the information gathering and research process before the in-person review. They also use a tool to ensure that their postmortem reports have a consistent format.
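
As a toy example of that kind of consistency check (the section names here are hypothetical, not Pusher’s actual template), a few lines of Python can flag a report that’s missing required sections:

    import sys

    # Hypothetical required headings for a postmortem report.
    REQUIRED_SECTIONS = ["Summary", "Timeline", "Impact",
                         "Contributing Factors", "Remediation Items"]

    def missing_sections(path):
        with open(path, encoding="utf-8") as f:
            text = f.read().lower()
        return [s for s in REQUIRED_SECTIONS if s.lower() not in text]

    if __name__ == "__main__":
        missing = missing_sections(sys.argv[1])
        if missing:
            sys.exit("Postmortem is missing sections: " + ", ".join(missing))
        print("All required sections present.")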

Outages

  • Telstra
    • This makes the third major outage (plus a minor one) this year. Customers are getting pretty mad.

  • Gliffy
    • Gliffy suffered a heartbreaking 48-hour outage after an administrator mistakenly deleted the production db. They do have backups, but the backups take a long time to restore.

      Thanks to gabinante on hangops #incident_response for this one.

  • The Division (game)
  • DigitalOcean
    • A day after the incident, DigitalOcean posted an excellent postmortem. I like that they clearly explained the technical details behind the incident. While they mentioned the DDoS attack, they didn’t use it to try to avoid taking responsibility for the downtime. Shortly after this was posted, it spurred a great conversation on hangops #incident_response that included the post’s author.

      Thanks to rhoml on hangops #incident_response for this one.

SRE Weekly Issue #15

A packed issue this week with a really exciting discovery/announcement up top. Thanks to all of the awesome folks on the hangops slack and especially #incident_response for tips, feedback, and general awesomeness.

Articles

I’m so excited about this! A group of folks, some of whom I know already and the rest of whom I hope to know soon, have started the Operations Incident Board. The goal is to build up a center of expertise in incident response that organizations can draw on, including peer review of postmortem drafts.

They’ve also started the Postmortem Report Reviews project, in which contributors submit “book reports” on incident postmortems (both past and current). PRs with new reports are welcome, and I hope you all will consider writing at least one. I know I will!

This is exactly the kind of development I was hoping to see in SRE and I couldn’t be happier. I look forward to supporting the OIB project however I can, and I’ll be watching them closely as they get organized. Good luck and great work, folks!

Thanks to Charity Majors for pointing OIB out to me.

Here’s a postmortem report from Gabe Abinante covering the epic EBS outage of 2011. It’s a nice summary with a few links to further reading on how Netflix and Twilio dodged impact through resilient design. Heroku, my employer (though not at the time), on the other hand, had a pretty rough time.

A nice summary of a talk on Chaos Engineering given at QCon by Rachel Reese.

One engineer’s guide to becoming comfortable with being on call, and some tips on how to get there.

Another “human error” story, about a recently released report on a 2015 train crash. Despite the article’s title, I feel like it primarily tells a story of a whole bunch of stuff that went wrong that was unrelated to the driver’s errors.

A nice little analysis of a customer’s sudden performance nosedive. It turned out that support had asked them to turn on debug logging and then forgot to tell them to turn it off.

In this case, the outages in question pertain to wireless phone operators. I wonder if Telstra was one of the companies surveyed.

Reliability risk #317: failing to invalidate credentials held by departing employees, especially when they’re fired.

Say… wouldn’t it be neat to start a Common Reliability Risks Database or something?

As the title suggests, this opinion piece calls into question DevOps as a panacea. Some organizations can’t afford the risk involved in continuous delivery, because they can’t survive even a minor outage, even one that can be rolled back or forward quickly. These same organizations probably also can’t avail themselves of chaos engineering — at least not in production.

Fail fast and roll forward simply aren’t sustainable in many of today’s most core business applications such as banking, retail, media, manufacturing or any other industry vertical.

Thanks to DevOps Weekly for this one.

Outages

  • Datadog
  • Tinder
    • Predictable hilarity ensued on Twitter.

  • HipChat Status
    • Atlassian’s HipChat has had a rocky week with several outages. They posted an initial description of the problems and a promise of a detailed postmortem soon.

      Thanks to dbsmasher on hangops #incident_response for the tip on this one.

  • Data Centre Outage Causes Drama For Theatre Ticket Seller
    • A switch failure takes out a ticket sales site. It’s interesting how many companies whose core business isn’t operations still try to be ops companies. I hope we see that kind of practice diminish in favor of increased adoption of PaaS/IaaS.

  • Telstra
    • Another major outage for Telstra, and they’re offering another free data day. Perhaps this time they’ll top 2 petabytes. This article describes the troubles people saw during the last free data day including slow speeds and signal drops.

  • Squarespace
    • Water main break in their datacenter.

      Thanks to stonith on hangops #incident_response.
