SRE Weekly Issue #15

A packed issue this week with a really exciting discovery/announcement up top. Thanks to all of the awesome folks on the hangops Slack and especially #incident_response for tips, feedback, and general awesomeness.

Articles

I’m so excited about this! A group of folks, some of whom I know already and the rest of whom I hope to know soon, have started the Operations Incident Board. The goal is to build up a center of expertise in incident response that organizations can draw on, including peer review of postmortem drafts.

They’ve also started the Postmortem Report Reviews project, in which contributors submit “book reports” on incident postmortems (both past and current). PRs with new reports are welcome, and I hope you all will consider writing at least one. I know I will!

This is exactly the kind of development I was hoping to see in SRE and I couldn’t be happier. I look forward to supporting the OIB project however I can, and I’ll be watching them closely as they get organized. Good luck and great work, folks!

Thanks to Charity Majors for pointing OIB out to me.

Here’s a postmortem report from Gabe Abinante covering the epic EBS outage of 2011. It’s a nice summary with a few links to further reading on how Netflix and Twilio dodged impact through resilient design. Heroku, my employer (though not at the time), on the other hand, had a pretty rough time.

A nice summary of a talk on Chaos Engineering given at QCon by Rachel Reese.

One engineer’s story of becoming comfortable with being on call, plus some tips on how to get there.

Another “human error” story, about a recently released report on a 2015 train crash. Despite the article’s title, I feel like it mostly tells the story of a whole lot of things that went wrong that had nothing to do with the driver’s errors.

A nice little analysis of a customer’s sudden performance nosedive. It turned out that support had had them turn on debug logging and had forgotten to tell them to turn it off.

In this case, the outages in question pertain to wireless phone operators. I wonder if Telstra was one of the companies surveyed.

Reliability risk #317: failing to invalidate credentials held by departing employees, especially when they’re fired.

Say… wouldn’t it be neat to start a Common Reliability Risks Database or something?

As the title suggests, this opinion piece calls into question DevOps as a panacea. Some organizations can’t afford the risk involved in continuous delivery, because they can’t survive even a minor outage, even one that can be rolled back or forward quickly. These same organizations probably also can’t avail themselves of chaos engineering — at least not in production.

Fail fast and roll forward simply aren’t sustainable in many of today’s most core business applications such as banking, retail, media, manufacturing or any other industry vertical.

Thanks to Devops Weekly for this one.

Outages

  • Datadog
  • Tinder
    • Predictable hilarity ensued on Twitter.

  • HipChat Status
    • Atlassian’s HipChat has had a rocky week with several outages. They posted an initial description of the problems and a promise of a detailed postmortem soon.

      Thanks to dbsmasher on hangops #incident_response for the tip on this one.

  • Data Centre Outage Causes Drama For Theatre Ticket Seller
    • A switch failure takes out a ticket sales site. It’s interesting how many companies end up running their own ops infrastructure even though it’s not their core business. I hope we see that kind of practice diminish in favor of increased adoption of PaaS/IaaS.

  • Telstra
    • Another major outage for Telstra, and they’re offering another free data day. Perhaps this time they’ll top 2 petabytes. This article describes the troubles people saw during the last free data day, including slow speeds and signal drops.

  • Squarespace
    • Water main break in their datacenter.

      Thanks to stonith on hangops #incident_response.
