SRE Weekly Issue #44

SPONSOR MESSAGE

DevOps Executive Webinar: Security for Startups in a DevOps World. http://try.victorops.com/l/44432/2016-10-12/fgh7n3

Articles

With all the “NoOps” and “Serverless” stuff floating around, do we need ops? Susan Fowler says not necessarily, but that we do need ops skills.

VictorOps is gathering data for the 2016 edition of their yearly State of On-Call Report (2015’s if you missed it). Please click the link above and take the survey if you have a moment! The report provides some pretty awesome stats that we can all use to improve the on-call experience at our organizations.

This survey is published by my sponsor, VictorOps, but their sponsorship did not influence its inclusion in this issue.

Scalyr writes about cascading failure scenarios, using the DynamoDB outage of September 20th, 2015 (no, not this year’s September DynamoDB outage) as a case study.

Capacity problems are a common type of failure, and often they’re of this “cascading” variety. A system that’s thrashing around in a failure state often uses more resources than it did when it was healthy, creating a self-reinforcing overload.
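
To make that feedback loop concrete, here is a minimal toy simulation. It is not taken from the Scalyr article; the function name, the numbers, and the "degraded capacity" model are all assumptions chosen for illustration. Unserved requests are retried on the next tick, and an overloaded system is assumed to serve fewer requests than a healthy one.

```python
def simulate(ticks, base_load, capacity, degraded_capacity, spike_tick, spike_size):
    """Return the backlog of unserved requests after each tick (toy model)."""
    backlog = 0
    history = []
    for tick in range(ticks):
        spike = spike_size if tick == spike_tick else 0
        # Unserved requests from the previous tick come back as retries.
        offered = base_load + backlog + spike
        # Assumption: once overloaded, the system thrashes and serves less.
        effective = capacity if offered <= capacity else degraded_capacity
        served = min(offered, effective)
        backlog = offered - served
        history.append(backlog)
    return history


# Same one-tick spike, with and without a capacity penalty under overload.
healthy = simulate(8, 80, 100, 100, 2, 50)
thrashing = simulate(8, 80, 100, 70, 2, 50)
print(healthy)    # [0, 0, 30, 10, 0, 0, 0, 0]
print(thrashing)  # [0, 0, 60, 70, 80, 90, 100, 110]
```

In the first run the backlog drains within two ticks of the spike. In the second, the spike is long gone but the backlog keeps growing, because the degraded capacity (70) sits below the normal load (80).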

Check it out! Apparently this newsletter started around the same time that SRE Weekly did. Content includes a lot of really nifty stuff about Linux system administration.

I previously linked to a two-part series by Mathias Lafeldt on writing postmortems. At my request, Jimdo graciously agreed to release their (previously) internal postmortem about the incident that prompted him to write the articles. Thanks so much, Mathias!

A review of what sounds like a really interesting play about just culture, blameless retrospectives, and restorative justice in aviation, based on real events.

Thanks to Mathias Lafeldt for this one.

When you’re big like Facebook, sometimes reliability means essentially building your own Internet.

If you haven’t had time to watch Matt Ranney’s talk on Scaling Uber to 1000 Microservices, check out this detailed summary. Growing your engineering force 10x over a year while still keeping the service reliable is a pretty impressive feat.

PagerDuty shares some tips for lowering your MTTR, but first they ask the important question: how are you measuring MTTR, and is lowering it meaningful?
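
One reason the measurement question matters, sketched below with made-up numbers (this is my own illustration, not something from the PagerDuty post): a mean is easily dominated by a single long incident, so the same data can tell very different stories depending on the statistic you pick.

```python
from statistics import mean, median

# Hypothetical time-to-resolve values in minutes, invented for illustration.
durations = [12, 15, 9, 20, 480]


def summarize(minutes):
    """Return the mean and median of a list of incident durations."""
    return {"mean": mean(minutes), "median": median(minutes)}


print(summarize(durations))
```

Here one eight-hour incident pulls the mean to about 107 minutes while the median stays at 15, so "lowering MTTR" could simply mean avoiding one bad day.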

David Christensen riffs on Charity Majors’s concept of “3 Types of Code”: “no code” (SaaS, PaaS, etc.), “someone else’s code”, and “your code”. Try to spend as much development time as possible writing code that supports what makes your business unique (your key differentiator).

Julia Evans is back with a write-up of the lessons she’s learned as she’s begun to gain an understanding of operations. My favorite bit:

Stage 2.5: learn to be scared
I think learning to be scared is a really important skill – you should be worried about upgrading a database safely, or about upgrading the version of Ruby you’re using in production. These are dangerous changes!

SysAdvent is happening again this year! Click the link above if you’d like to propose an article or volunteer to be an editor.

Outages

  • United Airlines
  • Yahoo mail
  • Google Cloud
  • FNB (South Africa bank)
  • GlobalSign (SSL certificate authority)
    • GlobalSign had a major problem in their PKI that resulted in all of their certificates being treated as revoked. They’ve posted a detailed postmortem that’s pretty heavy on deep SSL details, but the basic story is that their OCSP service misinterpreted a routine action as a request to revoke their intermediate CA certificate. Yikes. I love this quote and the mental image of a panicked party with streamers and ribbon-cutting that it conjures up:

      Our AlphaSSL and CloudSSL customers had to wait a few hours more while an emergency key ceremony was held to create alternatives.

Updated: October 16, 2016 — 9:22 pm
A production of Tinker Tinker Tinker, LLC