General

SRE Weekly Issue #16

lex

March 26, 2016

Another packed issue this week, thanks in no small part to the folks on hangops #incident_response. You all rock!

This week, I broke 200 email subscribers. Thank you all so much! At Charity Majors’s suggestion, I’ve started a Twitter account, SREWeekly, where I’ll post a link to each week’s issue as it comes out. Feel free to unsubscribe via email and follow there instead, if you’d prefer.

Articles

Preventing human errors in healthcare

I love this article! Everything it says can be readily applied to SRE. It touches on blameless culture, causes of errors and methods of capturing incident information. Then there’s this excellent tidbit about analyzing all incidents, even near misses:

The majority of organizations target their most serious incidents for immediate attention. Events that lead to severe and/or permanent injury or death are typically underscored in an effort to prevent them from ever happening again. But recurrent errors that have the potential to do harm must also be prioritized for attention and process improvement. After all, whether an incident ultimately results in a near miss or an event of harm leading to a patient’s death is frequently a matter of a provider’s thoughtful vigilance, the resilience of the human body in resisting catastrophic consequences from the event, or sheer luck.

PagerDuty Status – .02% Accounts Had Isolated Incidents Dropped

A short postmortem by PagerDuty for an incident earlier this month. I like how precise their impact figures are.

Thanks to cheeseprocedure on hangops #incident_response for this one.

Blameless postmortems don’t work. Be blame-aware but don’t go negative

Look past the contentious title, and you’ll see that this one’s got some really good guidelines for running an effective postmortem. To be honest, I think they’re saying essentially the same thing as the “blameless postmortem” folks. You can’t really be effective at finding a root cause without mentioning operator errors along the way; it’s just a matter of how they’re discussed.

Ultimately, the secret of those mythical DevOps blameless cultures that hold the actionable postmortems we all crave is that they actively foster an environment that accepts the realities of the human brain and creates a space to acknowledge blame in a healthy way. Then they actively work to look beyond it.

Thanks to tobert on hangops #incident_response for this one.

Editorial: Network outages create embarrassing situation

Ithaca College has suffered a series of days-long network outages, crippling everything from coursework to radio broadcasts. Their newspaper’s editorial staff spoke out this week on the cause and impact of the outages.

Zerto zeroes in on teachings from Telstra’s outage troubles

iTWire interviews Matthew Kates, Australia Country Manager for Zerto, a DR software firm, about the troubles Telstra has been dealing with. Kates does an admirable job of avoiding plugging his company, instead offering an excellent analysis of Telstra’s situation. He also gives us this gem, made achingly clear this week by Gliffy’s troubles:

Backing up your data once a day is no longer enough in this 24/7 ‘always on’ economy. Continuous data replication is needed, capturing every change, every second.

What does Etsy’s architecture look like today?

I love the part about an “architecture review” before choosing to implement a design for a new component (e.g. Kafka) and an “operability review” before deployment to ensure that monitoring, runbooks, etc. are all in place.

An update on this week – HipChat Blog

Atlassian posted an excellent, high-detail postmortem on last week’s instability. One of the main causes was an overloaded NAT service in a VPC, compounded by aggressive retries from their client.
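
Aggressive client retries are a common way to turn a small overload into a big one. Here's a minimal sketch of capped exponential backoff with jitter, one widely used mitigation. To be clear, this is my own illustration and not what Atlassian shipped; the function names, the defaults, and the choice of `ConnectionError` as the retryable failure are all assumptions.

```python
import random
import time


def backoff_delays(max_retries, base=0.5, cap=30.0, rng=random.random):
    """Yield a jittered delay (in seconds) to wait before each retry.

    The ceiling doubles on every attempt up to `cap`; the actual delay is a
    random fraction of that ceiling, so clients don't retry in lockstep.
    """
    for attempt in range(max_retries):
        ceiling = min(cap, base * 2 ** attempt)
        yield rng() * ceiling


def call_with_retries(operation, max_retries=5, sleep=time.sleep, **backoff):
    """Run `operation`, retrying on ConnectionError with jittered backoff."""
    delays = backoff_delays(max_retries, **backoff)
    while True:
        try:
            return operation()
        except ConnectionError:
            delay = next(delays, None)
            if delay is None:
                raise  # retries exhausted; surface the last error
            sleep(delay)
```

The jitter matters as much as the backoff: without it, every client that failed at the same moment retries at the same moment too.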

kik, left-pad, and npm

Technically speaking, I’m not sure the NPM drama this week caused any actual production outages, but I feel like I’d be remiss in not mentioning it. Suffice it to say that we can never ignore human factors.

FDAnews Announces — How to Reduce Human Error in the Manufacturing Floor Workshop

In reading the workshop agenda, it’s interesting to see how they handle human error in drug manufacturing.

Don't Repeat your Mistakes: Conducting Post-mortems

Pusher shares a detailed description of their postmortem incident analysis process. I like that they front-load a lot of the information gathering and research process before the in-person review. They also use a tool to ensure that their postmortem reports have a consistent format.

Outages

Telstra
- This makes the third major outage (plus a minor one) this year. Customers are getting pretty mad.
Gliffy
- Gliffy suffered a heartbreaking 48-hour outage after an administrator mistakenly deleted the production db. They do have backups, but the backups take a long time to restore.
  
  Thanks to gabinante on hangops #incident_response for this one.
The Division (game)
DigitalOcean
- A day after the incident, DigitalOcean posted an excellent postmortem. I like that they clearly explained the technical details behind the incident. While they mentioned the DDoS attack, they didn’t use it to try to avoid taking responsibility for the downtime. Shortly after this was posted, it spurred a great conversation on hangops #incident_response that included the post’s author.
  
  Thanks to rhoml on hangops #incident_response for this one.

SRE Weekly Issue #15

lex

March 20, 2016

General

A packed issue this week with a really exciting discovery/announcement up top. Thanks to all of the awesome folks on the hangops slack and especially #incident_response for tips, feedback, and general awesomeness.

Articles

Operations-Incident-Board/Postmortem-Report-Reviews · GitHub

I’m so excited about this! A group of folks, some of whom I know already and the rest of whom I hope to know soon, have started the Operations Incident Board. The goal is to build up a center of expertise in incident response that organizations can draw on, including peer review of postmortem drafts.

They’ve also started the Postmortem Report Reviews project, in which contributors submit “book reports” on incident postmortems (both past and current). PRs with new reports are welcome, and I hope you all will consider writing at least one. I know I will!

This is exactly the kind of development I was hoping to see in SRE and I couldn’t be happier. I look forward to supporting the OIB project however I can, and I’ll be watching them closely as they get organized. Good luck and great work, folks!

Thanks to Charity Majors for pointing OIB out to me.

OIB Postmortem Report: AWS EBS outage April 21st-April 24th 2011

Here’s a postmortem report from Gabe Abinante covering the epic EBS outage of 2011. It’s a nice summary with a few links to further reading on how Netflix and Twilio dodged impact through resilient design. Heroku, my employer (though not at the time), on the other hand, had a pretty rough time.

Chaos Testing of Microservices

A nice summary of a talk on Chaos Engineering given at QCon by Rachel Reese.

True Story: On-call Doesn't Have to Suck

One engineer’s guide to becoming comfortable with being on call, and some tips on how to get there.

Human error, ineffective communication led to Denver train crash

Another “human error” story, about a recently released report on a 2015 train crash. Despite the article’s title, I feel like it primarily tells a story of a whole bunch of stuff that went wrong that was unrelated to the driver’s errors.

Logging Yourself to Death

A nice little analysis of a customer’s sudden performance nosedive. It turned out that support had had them turn on debug logging and forgot to tell them to turn it off.
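
One way to guard against forgotten debug logging is to make it expire on its own. Below is a rough sketch using Python's standard `logging` module. The `ExpiringDebugFilter` class is hypothetical, something I made up for illustration, and not anything from the article or the product it describes.

```python
import logging
import time


class ExpiringDebugFilter(logging.Filter):
    """Let DEBUG records through only until a deadline passes.

    Records above DEBUG are always allowed. `clock` is injectable so the
    behavior can be tested without waiting for real time to pass.
    """

    def __init__(self, expires_at, clock=time.time):
        super().__init__()
        self.expires_at = expires_at
        self.clock = clock

    def filter(self, record):
        if record.levelno > logging.DEBUG:
            return True
        return self.clock() < self.expires_at
```

You would attach it with something like `handler.addFilter(ExpiringDebugFilter(time.time() + 3600))` when turning on debug output. The records are still created after the deadline, so this only saves the cost of writing them, but that is usually the expensive part.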

Network Outages Costing Operators $20B Annually

In this case, the outages in question pertain to wireless phone operators. I wonder if Telstra was one of the companies surveyed.

Network Admin Sabotages ISP's Network After Getting Fired, Now Faces Jail

Reliability risk #317: failing to invalidate credentials held by departing employees, especially when they’re fired.

Say… wouldn’t it be neat to start a Common Reliability Risks Database or something?

The DevOps Emperor has no Clothes – when DevOps is DeFunct

As the title suggests, this opinion piece calls into question DevOps as a panacea solution. Some organizations can’t afford the risk involved in continuous delivery, because they can’t survive even a minor outage that can be rolled back/forward quickly. These same organizations probably also can’t avail themselves of chaos engineering — at least not in production.

Fail fast and roll forward simply aren’t sustainable in many of today’s most core business applications such as banking, retail, media, manufacturing or any other industry vertical.

Thanks to Devops Weekly for this one.

Outages

Datadog
Tinder
- Predictable hilarity ensued on Twitter.
HipChat Status
- Atlassian’s HipChat has had a rocky week with several outages. They posted an initial description of the problems and a promise of a detailed postmortem soon.
  
  Thanks to dbsmasher on hangops #incident_response for the tip on this one.
Data Centre Outage Causes Drama For Theatre Ticket Seller
- A switch failure takes out a ticket sales site. It’s interesting how many companies try to become ops companies. I hope we see that kind of practice diminish in favor of increased adoption of PaaS/IaaS.
Telstra
- Another major outage for Telstra, and they’re offering another free data day. Perhaps this time they’ll top 2 petabytes. This article describes the troubles people saw during the last free data day including slow speeds and signal drops.
Squarespace
- Water main break in their datacenter.
  
  Thanks to stonith on hangops #incident_response.

SRE Weekly Issue #14

lex

March 13, 2016

General

Articles

Resilience Engineering: Part I

A classic from John Allspaw. Designing a resilient system isn’t about eliminating individual causes of downtime; it’s about continuing to operate in spite of them. Allspaw is a big proponent of looking beyond human error to the system surrounding the error.

…human error as a root cause isn’t where you should end, it’s where you should start your investigation.

7 Rules for Using Log Data Effectively in a Retrospective

This could just as well be titled, 7 Rules for Performing Effective Retrospectives. There’s some really great stuff in here and also some good references.

The rules are:

Learn, don’t blame
Know the scope of the system
Make sure you have all the relevant logs
Make sure the logs line up with the timeline
Separate the noise from the information
Make sure the biases are known
Make sure you deal in facts and not counterfacts

Spotify’s Event Delivery

Spotify shares this deeply technical look at their event delivery and processing system that handles 700k messages per second. The bulk of the article details how they tested Google’s Cloud Pub/Sub to be sure it was reliable enough for their needs.

…Cloud Pub/Sub was being advertised as beta software; we were unaware of any organisation other than Google who were using it at our scale.

Taobao’s Security Breach from a Log Perspective

Taobao.com suffered a huge security breach in which credentials harvested from previous break-ins were used to break into accounts. This short write-up on DZone urges us to use anomaly detection to catch brute-force attacks like this as they happen.

Reports say the hackers executed approximately 100 million login attempts, and almost 21 million of these turned out to be successful.
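
To make the idea concrete, here's about the crudest possible detector: count failed logins in a sliding time window and flag when the count crosses a fixed threshold. This is a sketch of my own, not the approach from the DZone write-up. The `LoginFailureMonitor` name and its default numbers are assumptions, and a real system would need per-source tracking and a smarter baseline.

```python
from collections import deque


class LoginFailureMonitor:
    """Flag when failed logins within a sliding window exceed a threshold."""

    def __init__(self, window_seconds=60, max_failures=1000):
        self.window_seconds = window_seconds
        self.max_failures = max_failures
        self.failures = deque()  # timestamps of recent failures, oldest first

    def record_failure(self, timestamp):
        """Record one failed login; return True if the window is over threshold."""
        self.failures.append(timestamp)
        cutoff = timestamp - self.window_seconds
        # Drop failures that have aged out of the window.
        while self.failures and self.failures[0] <= cutoff:
            self.failures.popleft()
        return len(self.failures) > self.max_failures
```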

Evolution of Useful Results from Anomaly Detection Systems

Speaking of anomaly detection, this article highlights the problems with existing anomaly detection systems and describes what a successful system would look like. I’ve yet to see a generalized anomaly detection system with an acceptable false positive rate that did better than specific, targeted monitoring.

2016 CloudEndure DR Survey

This survey, released last month, looks possibly interesting. I’m not 100% sure though, because their server is offline and I can’t retrieve it. Oh, the irony.

Outages

Xbox Live
Fox and ABC News
- Two large news sites suffered brief outages on Super Tuesday, an important voting day in the US. Both were apparently taken out by a failure in the analytics provider that they share in common.
DirecTV
PSN
Netflix
The Pirate Bay
EE webmail
Amazon.com
CenturyLink
- Miscommunication is cited in this construction-induced fiber cut.
The Division (game)
The KKK
- Staminus, a DDoS protection company, suffered a huge data breach including full names and credit card numbers. The attackers also took down their infrastructure causing an outage for big-name clients such as the KKK.

SRE Weekly Issue #13

lex

March 6, 2016

General

SRECon16 registration is open, and I’m excited to say that my colleague Courtney Eckhardt and I will be giving a talk together! If you come to the conference, I’d love it if you’d say hi.

Articles

The Netflix Tech Blog: Caching for a Global Netflix

A deep-dive on EVCache, Netflix’s open source sharding and replication layer on top of memcached.

EVCache is one of the critical components of Netflix’s distributed architecture, providing globally replicated data at RAM speed so any member can be served from anywhere.

Actionable Alerts: Reducing False Positives & Making On-call Suck Less – VictorOps

This is a guest post from one of our customers, Aaron, Director of Support Systems at CageData. He’s talking about making alerts actionable and why that’s important.

Are site reliability engineers the next data scientists?

TechCrunch gives us this overview of the field of SRE, including its origins, motivations, and guesses about its future.

DROWN Attack

Everyone’s favorite OpenSSL vulnerability of the year. I hope you all had a relatively easy patch day.

A “Principled”, Blameful Post-Mortem

A short but sweet analysis of an intermittent bug caused by inconsistent date formatting. The author uses the term “blameful postmortem” to mean finding reasons that explain how the client application was written with faulty date parsing logic (tl;dr: the server side truncated trailing zeroes in the fractional seconds). Really, I think this is less about blame than it is about understanding the full context in which an error was able to occur, and that’s exactly what a blameless postmortem is all about.
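
For a feel of how this class of bug bites, here's a small sketch of a parser that tolerates a variable number of fractional-second digits. The timestamp format (ISO 8601-style, UTC with a trailing `Z`) is my assumption and not necessarily the one from the article, and `parse_timestamp` is a hypothetical helper.

```python
import re
from datetime import datetime, timezone

# Whole seconds, then an optional fraction of 1 to 6 digits, then "Z".
_TIMESTAMP = re.compile(
    r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.(\d{1,6}))?Z$"
)


def parse_timestamp(text):
    """Parse a UTC timestamp whose fractional seconds may be shortened.

    "12:00:00.12" and "12:00:00.120" mean the same instant, so the fraction
    is right-padded with zeroes to microseconds before converting.
    """
    match = _TIMESTAMP.match(text)
    if match is None:
        raise ValueError(f"unrecognized timestamp: {text!r}")
    whole, fraction = match.groups()
    microseconds = int((fraction or "").ljust(6, "0"))
    parsed = datetime.strptime(whole, "%Y-%m-%dT%H:%M:%S")
    return parsed.replace(microsecond=microseconds, tzinfo=timezone.utc)
```

A client that expects exactly three fractional digits works fine until the server emits a value ending in zero, which is what makes the failure intermittent.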

Reducing Technical Debt With Incident Management

Incidents can uncover technical debt in a system. Fixing the technical debt is often necessary if a repeat incident is to be avoided, but it can be difficult to track and allocate resources to make it happen. This article from PagerDuty suggests a method for tracking technical debt uncovered by incidents.

Are You Prepared To Handle More Than The “Routine” Incident?

When multiple incidents occur simultaneously, things can get hairy and you need to have an organized incident response structure. This article is about firefighting, but we can take their lessons and apply them to SRE.

7 Benefits of Incident Management in Supporting Applications

PagerDuty advocates for a model I’ve heard referred to as “Total Service Ownership”, where dev teams handle incident response for their subsystems rather than “throwing them over the wall” for Ops to support. Courtney and I will be talking about this and more at SRECon16 next month.

Outages

Telstra
- No free data day for this one.
Gopher
- Metafilter revived their gopher server after 15 years of downtime.
Salesforce.com
- Full disclosure: Salesforce.com (parent company of my employer, Heroku) is mentioned.
Uganda

SRE Weekly Issue #12

lex

February 28, 2016

General

Articles

danluu/post-mortems · GitHub

What an excellent resource! This repo contains a pile of postmortems for our reading and learning pleasure. I’m linking to the repo now, but I don’t promise not to call out specific awesome postmortems from it in the future.

Best Practices in Outage Communication: Internal Stakeholders

When you’re in the trenches trying to get the service back up and running, it can be hard to find the time to tell everyone else in your company what’s going on. It’s critically important, though, as Statuspage.io writes in this article.

Full disclosure: Heroku, my employer, is mentioned.

What is High Availability?

DigitalOcean shares this overview of the basic concepts involved in high availability.

System Reliability and Availability Calculation

This article discusses a method of computing the availability of an overall system made up of individual components with differing availabilities. It gives general formulas and methods that are fairly simple, yet powerful.
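
As a quick illustration, here are the standard series and parallel formulas, which I'm assuming are the kind the article has in mind. Both assume component failures are independent, which real systems often violate. The function names are mine.

```python
from math import prod


def series_availability(availabilities):
    """Every component must be up, so the availabilities multiply."""
    return prod(availabilities)


def parallel_availability(availabilities):
    """Redundant components: the system is down only if all of them are down."""
    return 1 - prod(1 - a for a in availabilities)


def downtime_minutes_per_year(availability):
    """Expected unavailable minutes in a 365-day year."""
    return (1 - availability) * 365 * 24 * 60
```

Two 99% components in series come out to about 98%, while the same two in parallel reach 99.99%. That difference is the whole argument for redundancy in one line of arithmetic.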

Software: Immaculate, fetid and grimy

What do you do when you have to modify an existing production system that has less-than-wonderful code quality? This article is an impassioned plea to test the heck out of your changes and always try to release production-quality code the first time.

Google’s Anti-DDoS Project Shield Launches to Protect Free Speech

Google is launching a reverse-proxy for DDoS mitigation. Interestingly, it’s only for news and free speech sites and it’s completely free.

Outages

Xbox Live
PartyPoker
Telenor (mobile operator)
- This one’s interesting. Invalid signaling from another operator took down Telenor.
  
  The unusual data sent from an international operator was misinterpreted in software from Ericsson, halting part of the mobile traffic on Telenor’s network.
Office 365
T-Mobile
EE (UK mobile operator)
Xero
Verizon Wireless
