SRE Weekly Issue #44

Articles

With all the “NoOps” and “Serverless” stuff floating around, do we need ops? Susan Fowler says not necessarily, but that we do need ops skills.

VictorOps State of On-Call Survey

VictorOps is gathering data for the 2016 edition of their yearly State of On-Call Report (2015’s if you missed it). Please click the link above and take the survey if you have a moment! The report provides some pretty awesome stats that we can all use to improve the on-call experience at our organizations.

This survey is published by my sponsor, VictorOps, but their sponsorship did not influence its inclusion in this issue.

Irreversible Failures: Lessons from the DynamoDB Outage

Scalyr writes about cascading failure scenarios, using the DynamoDB outage of September 20th, 2015 (no, not this year’s September DynamoDB outage) as a case study.

Capacity problems are a common type of failure, and often they’re of this “cascading” variety. A system that’s thrashing around in a failure state often uses more resources than it did when it was healthy, creating a self-reinforcing overload.

cron.weekly newsletter

Check it out! Apparently this newsletter started around the same time that SRE Weekly did. Content includes a lot of really nifty stuff about Linux system administration.

Deployer API Outage Postmortem

I previously linked to a two–part series by Mathias Lafeldt on writing postmortems. At my request, Jimdo graciously agreed to release their (previously) internal postmortem about the incident that prompted him to write the articles. Thanks so much, Mathias!

Human Factors at The Fringe: My Eyes Went Dark

A review of what sounds like a really interesting play about just culture, blameless retrospectives, and restorative justice in aviation, based on real events.

Thanks to Mathias Lafeldt for this one.

Building one of the highest-capacity subsea cables in the Pacific

When you’re big like Facebook, sometimes reliability means essentially building your own Internet.

Lessons Learned from Scaling Uber to 2000 Engineers, 1000 Services, and 8000 Git repositories

If you haven’t had time to watch Matt Ranney’s talk on Scaling Uber to 1000 Microservices, check out this detailed summary. Growing your engineering force 10x over a year while still keeping the service reliable is a pretty impressive feat.

6 Essential Steps to Reducing Incident Resolution Time

PagerDuty shares some tips for lowering your MTTR, but first they ask the important question: how are you measuring MTTR, and is lowering it meaningful?

Put Down the Mouse, Step Away From the Keyboard

David Christensen riffs on Charity Majors’s concept of “3 Types of Code”: “no code” (SaaS, PaaS, etc), “someone else’s code”, and “your code”. Try to spend as much development time as possible writing code that supports what makes your business unique (your key differentiator).

Operations for software developers for beginners

Julia Evans is back with a write-up of the lessons she’s learned as she’s begun to gain an understanding of operations. My favorite bit:

Stage 2.5: learn to be scared
I think learning to be scared is a really important skill – you should be worried about upgrading a database safely, or about upgrading the version of Ruby you’re using in production. These are dangerous changes!

SysAdvent 2016 Author/Editor Signup

SysAdvent is happening again this year! Click the link above if you’d like to propose an article or volunteer to be an editor.

Outages

United Airlines
Yahoo mail
Google Cloud
FNB (South Africa bank)
GlobalSign (SSL certificate authority)
- GlobalSign had a major problem in their PKI that resulted in all of their certificates being treated as revoked. They’ve posted a detailed postmortem that’s pretty heavy on deep SSL details, but the basic story is that their OCSP service misinterpreted a routine action as a request to revoke their intermediate CA certificate. Yikes.I love this quote and the mental image of a panicked party with streamers and ribbon-cutting that it conjures up:
  
  Our AlphaSSL and CloudSSL customers had to wait a few hours more while an emergency key ceremony was held to create alternatives.

SRE Weekly Issue #44

Articles

Outages

Subscribe

RSS

Mastodon

Search Issues

SPONSOR MESSAGE

Articles

Outages

Subscribe

RSS

Mastodon

Search Issues