General

SRE Weekly Issue #24

lex

May 22, 2016

My favorite read this week was this first article. It’s long, but it’s well worth a full read.

Articles

So you wanna go on-prem do ya

Got customers begging to throw money at you if only you’d let them run your SaaS in-house? John Vincent suggests you think twice before going down that road. This isn’t just a garden-variety opinion piece. Clearly John is drawing on extensive experience as he closely examines all of the many pitfalls in trying to convert a service into a reliable, sustainable, supportable on-premises product.

Firefall — Outage Post-Mortem for Wednesday February 20th, 2013

An old but excellent postmortem for an incident stemming from accidental termination of a MySQL cluster.

Thanks to logikal on hangops #incident_response for this one.

Why the Friday Grid Roll? – Second Life

Earlier this year, Linden Lab had to do an emergency grid roll on a Friday to patch the GHOST (glibc) vulnerability. April Linden (featured here previously) shares a bit on why it was necessary and how Linden handled GHOST.

Breakdown in Medication Reconciliation Leads to Inpatient Dose 16 Times Higher Than Home Dose – BWH Safety Matters

This article may be about a medication error, but this could have come straight from a service outage post-analysis:

For example, if the system makes it time consuming and difficult to complete safety steps, it is more likely that staff will skip these steps in an effort to meet productivity goals.

Misstep led to death of two firefighters

Having a standard incident response process is crucial. When we fail to follow it, incidents can escalate rapidly. In the case of this story from South Africa, the article alleges that the Incident Commander led a team into the fire, rather than staying outside to coordinate safety.

I believe that mistakes during incident response in my job don’t lead directly to deaths now, but how soon before they do? And are my errors perhaps causing deaths indirectly even now? (Hat-tip to Courtney E. for that line of thinking.)

RCM for NA14 Disruptions of Service (Salesforce)

Salesforce published a root cause analysis for the outage last week.

Full disclosure: Salesforce (parent company of my employer, Heroku) is mentioned.

Stack Exchange Network Status — Partial Outage Postmortem – March 28th, 2016

Earlier this year, Stack Exchange suffered a short outage during a migration. The underlying issue seems to have been that the migration couldn’t be truly tested, because the production environment (CDN and all) couldn’t be replicated in development.

Outages

NBA 2K16
Westpac (AU bank)
iiNet (AU ISP)
Whatsapp
Iraq
- Iraq purportedly shut down its internet access (removed its BGP announcements) to prevent students from cheating on exams.
Virgin Mobile
- They offered users a data credit immediately.
Telstra
- Telstra had a long outage this week. They claim that the outage was caused by vandalism in Taree.
Datadog
- Thanks to acabrera on hangops #incident_response for this one.
Mailgun
Disney Ticketing
- Disney’s ticketing site suffered under an onslaught of traffic this week brought on by their free dining deal program. Reference: we had a heck of a time making our dining reservations.

SRE Weekly Issue #23

lex

May 15, 2016

Articles

SRE: It’s People All the Way Down

Here’s the talk on Heroku’s SRE model that fellow SRE Courtney Eckhardt and I gave at SRECon16 in April. Heroku uses a “Total Ownership” model for service operations, meaning that individual development teams are responsible for running and maintaining the services that they deploy. This in turn allows SRE to broaden our scope of responsibility to cover a wide range of factors that might impact reliability.

Full disclosure: Heroku, my employer, is mentioned.

RushCard to pay $19 million to users for last year’s outage

RushCard is a prepaid debit card system, and last year they had an outage that lasted for two weeks. As part of a settlement, RushCard will pay affected customers $100 – $500 for their troubles.

Many RushCard customers are low-income minority Americans who don’t have traditional bank accounts. Without access to their money stored on their RushCards, some customers told The Associated Press at the time they could not buy food for their children, pay bills, or pay for gas to get to their jobs.

Safety Reporting Leads to Safer Systems

This article in Brigham and Women’s Hospital’s Safety Matters series highlights the importance of encouraging reporting of safety incidents and a blameless culture. Two excellent case studies involving medication errors are examined.

Report ET2016: Fire on board a freight shuttle in the Channel Tunnel

In early 2015, a fire occurred in the Channel Tunnel. Click through for a summary of the recently-released post-incident analysis. It includes the multiple complicating factors that made this into a major incident plus lots of remediations — my favorite kind of report.

How We Monitor and Run Kafka at Scale

SignalFx shares their in-depth experience with Kafka in this article. This reminds me of moving around ElasticSearch indices:

Although Kafka currently can do quota-based rate limiting for producing and consuming, that’s not applicable to partition movement. Kafka doesn’t have a concept of rate limiting during partition movement. If we try to migrate many partitions, each with a lot of data, it can easily saturate our network. So trying to go as fast as possible can cause migrations to take a very long time and increase the risk of message loss.

Auto-Scaling and Self-Defensive Services in Golang

Plagued by pages requiring tedious maintenance of a Golang process, this developer sought to make the service self-healing.

Step-by-Step High Availability with Docker and Java EE

For the Java crowd, Oracle published this simple guide on writing and deploying highly available Java EE apps using Docker. Sort of. Their example uses a single Nginx container for load balancing.

Outages

Salesforce.com
- One of 37 US Salesforce pods went down for over 18 hours. Data from a five-hour period was lost.
  
  Full disclosure: Salesforce.com (parent company of my employer, Heroku) is mentioned.
Time Warner and Cox (ISPs)
- Level3 fiber cut in New York City.
Montreal emergency responders’ radio communication network
UK Post Office
Betfair US
- Betfair’s site failed during the Kentucky Derby, and the CEO publicly blamed “human error”. This article suggests another theory:
  
  While there’s no proof that TVG came under DDOS attack on Saturday, Betfair US may have decided it was more advantageous for customers to think of them as occasionally ham-fisted stooges prone to disabling their own technology rather than an operator whose systems were unable to defend themselves – and their customers’ data – against malicious third parties.
Second Life
- Central DB failure plus a coincidental status site failure. Thanks to April Linden for the detailed report.
Software Update Destroys $286 Million Japanese Satellite
AppleID site
Intuit
Telstra DNS outage
Loblaw Stores
- All stores in the Loblaw chain in Canada, including Shopper’s Drug Mart, No Frills and Real Canadian Superstore, have been forced to close this morning due to “tech issues.”
Walmart, Lowe’s online stores
Afrihost Mobile
- They’re granting 1GB of free data to customers by way of apology.
Multiple AU Banks’ ATMs

SRE Weekly Issue #22

lex

May 8, 2016

Articles

The Recent Unpleasantness – Second Life

Landon McDowell, my (incredibly awesome) former boss at Linden Lab, wrote this article in 2014 detailing a spate of bad luck and outages they’d suffered. Causes included hardware failures, DDoS, and an integer DB column hitting its maximum value.

Apiary — Multi-protocol load testing by replaying traffic

I worked on testing the new class of database hardware mentioned in the previous article. In order to be sure the new hardware could handle our specific query pattern, I captured and replayed production queries in real-time using an open source tool written years earlier at Linden Lab called Apiary. This simple but powerful concept (capture and replay) was first introduced to me by one of Apiary’s co-authors, Charity Majors. I’ve since hacked a ton on Apiary and used it at two subsequent jobs.
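
To make the capture-and-replay idea concrete, here is a minimal Python sketch. It is not Apiary's actual interface: the `replay` function, its parameters, and the `(capture_time, query)` input format are all assumptions made for illustration. The point it tries to show is that a real-time replay keeps the original gaps between queries instead of firing them as fast as possible.

```python
import time
from typing import Callable, Iterable, Tuple


def replay(
    captured: Iterable[Tuple[float, str]],
    send: Callable[[str], None],
    speed: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Replay (capture_time, query) pairs, preserving their original spacing.

    Hypothetical helper, not part of any real tool. `send` is whatever
    delivers a query to the system under test. `speed` > 1 compresses the
    timeline. `sleep` and `clock` are injectable so the timing logic can be
    exercised without waiting.
    """
    start = clock()
    first_ts = None
    sent = 0
    for ts, query in captured:
        if first_ts is None:
            first_ts = ts
        # When this query is due, measured on the replay clock.
        due = start + (ts - first_ts) / speed
        delay = due - clock()
        if delay > 0:
            sleep(delay)
        send(query)
        sent += 1
    return sent


if __name__ == "__main__":
    capture = [(100.0, "SELECT 1"), (100.2, "SELECT 2"), (100.5, "SELECT 3")]
    replay(capture, send=print)
```

Scheduling each query against the start of the replay, rather than sleeping for each gap in turn, keeps a slow `send` call from pushing every later query back.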

Empty DDoS Threats: Meet the Armada Collective

A group calling themselves the Armada Collective has been making DDoS extortion threats to many companies recently. Cloudflare called them out as entirely toothless, with no actual attacks, but apparently some companies have paid anyway.

Diagnosing performance degradation under adverse circumstances

An excellent deep dive into a performance issue (which really equals a reliability issue), including some good lessons learned.

U.S. Carriers Form “Resiliency Cooperative” to Handle Emergency Situations

This is specifically referring to disaster scenarios such as hurricanes, but the general idea of a “resiliency cooperative” intrigues me.

A video and other startling revelations from the NTSB’s investigation of the fatal Yellow Line smoke incident

A review of the Fire and Emergency Services response found flaws in the actions and procedures taken by the incident commander who was the active fire chief at the time. The NTSB said the commander had no training on the incident management system that would have prepared him to better command the response.

Chaos Monkey for Fun and Profit

Matthias Lafeldt goes deeper into chaos engineering in this latest installment of his series. He also introduces his Dockerized version of Netflix’s Chaos Monkey and shows how to automate chaos experiments to gain further confidence in your infrastructure’s reliability.

Drowning in Alerts: Blame it on Statistical Models for Anomaly Detection

A great overview of the difficulties inherent in anomaly detection and alerting. Note that this article is written by OpsClarity and the end reads a bit like an ad for their service.

Percona to Add Advanced High Availability to Enterprise and Premier Support Offerings

I’m not sure exactly what it is they’re offering now that they weren’t before, but this seems important. I think.

Outages

Telstra
- Telstra made a public commitment of $50 million to improve network resiliency, right about the time that they had a minor network outage. D’oh.
NBA 2K16 (game)
StatsCan (Canada’s Census)
- Canadians attempting to complete their mandatory surveys met with website service interruption
Telkom (South Africa telecom)
Union Bank ATMs
Etisalat (UAE ISP)
Vox (South Africa ISP)
MTN (South Africa telecom)
Elastic Cloud
- Elastic.co blogged a detailed post-incident analysis.

SRE Weekly Issue #21

lex

May 1, 2016

This week’s themes seem to be human error and network debugging. If you’re like me, you rarely have time to sit down and listen to podcasts, but if you ever get in the mood, this first link is a must-listen. I really can’t do it justice with my summary, but I’m very glad I listened to it, and I think you’ll like it too.

Articles

A Discussion on Human Error | PreAccident Investigation Podcast

We can try to train our workers to avoid error. We can design our systems to make errors less likely. This podcast argues that we should go one step further and design our systems to be resilient in the face of inevitable error. Human error is normal and expected. Where are we one error away from a serious adverse event?

Steven Shorrock: "Life After Human Error" – Velocity Europe 2014

In this Velocity keynote, Steven Shorrock discusses human error from his point of view as an ergonomist and psychologist.

Tale of the Missing ACK

My old coworker (and network wizard) at Linden Lab wrote up this fascinating episode of network debugging. Sometimes you have to get really deep into the stack to track down reliability issues.

The Discovery of Apache Zookeeper’s Poison Packet

While we’re on the topic of debugging complicated networking failures, here’s PagerDuty’s analysis of a bug in Zookeeper. It turned out that triggering this bug involved the confluence of 3 other bugs that conspired to deliver a malformed packet to Zookeeper, which caused it to blow up. Yeesh.

Sysdig | How we found a bug in Amazon ELB

If you’re in the mood to read one more really deep and detailed network debugging session, this one’s for you. It goes through the process of gathering enough information to confidently implicate ELB as the source of abrupt connection closures.

The Flaw In All Things – blog dot lusis

John Vincent, featured here last week for his review of the new SRE book, writes this week about the burnout he’s suffering. I think it could best be described as operational risk burnout. I’m not sure what the solution is, but I’m really interested in the problem, and I hope that John considers writing more if he has any useful realizations. Good luck, John.

I couldn’t see anything but the largest configuration because all I could see was places where there was a risk. There were corners I wasn’t willing to cut (not bad corners like risking availability but more like “use a smaller instance here”) because I could see and feel and taste the pain that would come from having to grow the environment under duress.

An Inside Look at How The Ops Team Collaborates

How do you collaborate remotely during an incident? Some companies use conference bridges, but my former boss (and all-around incredible engineer and manager) Landon McDowell advocates for text-based chat. I started my career as part of the Ops team he describes, so I might be biased, but I totally agree: chat is far superior to phone bridges or VoIP.

Load balancing or balancing on the edge of a cliff?

This article starts out as a basic introduction to load-balancing, but where it goes next is really interesting. The author discusses how load-balancing can go wrong (think cascading failure as each remaining backend receives increasingly more traffic) and how to combat the pitfalls. Finally the author suggests two very intriguing concepts for smart load balancing systems that really got me thinking.
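
The cascading-failure scenario can be sketched in a few lines of Python. This is a toy model under stated assumptions, not anything taken from the article: load is spread evenly, each backend has a fixed capacity, and a backend pushed past its capacity drops out entirely. The function name `surviving_backends` is hypothetical.

```python
def surviving_backends(total_load: float, capacities: list[float]) -> int:
    """Return how many backends remain once the cascade settles.

    Toy model: load is split evenly across live backends, and any backend
    whose share exceeds its capacity fails, shifting its load to the rest.
    """
    alive = list(capacities)
    while alive:
        per_backend = total_load / len(alive)
        still_ok = [c for c in alive if c >= per_backend]
        if len(still_ok) == len(alive):
            return len(alive)  # stable: every backend can carry its share
        alive = still_ok  # failures push more load onto the survivors
    return 0


if __name__ == "__main__":
    # Four backends at 250 each carry 900 comfortably (225 apiece).
    print(surviving_backends(900, [250, 250, 250, 250]))
    # Lose one, and the remaining three each see 300: all of them fail.
    print(surviving_backends(900, [250, 250, 250]))
```

In this model, a pool running at 90% of capacity has no room to absorb the loss of a single backend, which is the cliff edge in the title.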

Outages

PagerDuty
- It’s especially interesting when PagerDuty goes down, because it might impact the reliability of many companies.
SendGrid
me&you mobile (South Africa)
Bureau of Water and Light (Lansing, MI, USA)
- Ransomware.
HipChat
- Here’s another speedy and detailed postmortem from Atlassian. Nice work, folks.
Large Hadron Collider
- Root cause: weasel.
Neotel (South Africa ISP)

SRE Weekly Issue #20

lex

April 24, 2016

Articles

Review: Site Reliability Engineering

Here’s a fairly negative review of the new Google SRE book. The author makes some well-articulated points against the tone of the book and its applicability outside Google. I’ve been hearing some talk of a condescending tone in the book, along with a tendency to claim “inventing” things that others also invented elsewhere. My copy arrives next week — should be an interesting read, for better or worse.

Full disclosure: Heroku, my employer, is mentioned.

The Ripple Effect Of Outages And Downtime Cannot Be Underestimated

A discussion of the impact of an outage on a company’s brand. Skip the last bit; it’s an ad. The rest is worth reading, though.

Reputation and customer loyalty suffer dramatically. The Boston Consulting Group reports that over a quarter of users (28%) never return to a company’s web site if it doesn’t perform sufficiently well.

3 Ways Ops Can Help Devs: A Developer Perspective

Conflict between “dev” and “ops” (whatever they’re called at a given company) can create reliability problems. SRE is in part an effort to relieve that tension, either through embedding or enacting process changes. This article gathers opinions and ideas from ops and dev engineers and proposes three methods for alleviating the tension.

CloudEndure’s 2016 Cloud Migration Survey Reveals 52% of Enterprise Companies Plan to Migrate to Public Clouds Over Next 2 Years

Another interesting survey-based report.

When asked what is the acceptable “downtime window” to finish migrations to minimize downtime, almost half (44%) of respondents said they cannot afford any downtime or, at most, just for under 1 hour.

I’ve done both kinds, and in my experience, migrations with planned downtime end up being the more painful ones, as one is under pressure to meet a predefined outage window, which inevitably slips.

Uptime: How Many 9s Do We Need?

In practice, there’s a point of diminishing returns after which you’re wasting money to get more availability than you need. That’s at the crux of this article, and it’s an interesting read.
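
For a sense of scale, here is a small Python sketch of the arithmetic behind "nines". It is my own illustration rather than anything from the article, and the helper name `allowed_downtime_seconds` is hypothetical. It converts an availability target into a downtime budget over a 365-day year.

```python
def allowed_downtime_seconds(nines: int, period_seconds: float) -> float:
    """Downtime budget for an availability of `nines` nines over a period."""
    unavailability = 10 ** -nines  # e.g. 3 nines (99.9%) -> 0.001
    return period_seconds * unavailability


YEAR = 365 * 24 * 60 * 60  # seconds in a non-leap year

for n in range(2, 6):
    budget = allowed_downtime_seconds(n, YEAR)
    print(f"{n} nines: {budget / 60:.1f} minutes of downtime per year")
```

Each extra nine cuts the budget by a factor of ten: three nines allows roughly 8.76 hours a year, and five nines only a little over five minutes.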

dastergon/awesome-sre · GitHub

Haven’t gotten your fill from SRE Weekly? Here’s a long list of curated SRE-related links to peruse.

Fault Injection in Production

Here’s a classic from the venerable John Allspaw of Etsy on running gameday scenarios in production. The general process is to brainstorm possible failures, improve the system to handle them, and then test by actually inducing the failures in production.

Imagining failure scenarios and asking, “What if…?” can help combat this thinking and bring a constant sense of unease to the organization. This sense of unease is a hallmark of high-reliability organizations. Think of it as continuously deploying a BCP (business continuity plan).

(emphasis mine)

What to do with the "rm -rf" hoax question – Meta Server Fault

Yup, turns out it was a hoax. Still generated an interesting conversation though.

Outages

123-reg (UK web hosting)
- An error in a script resulted in mass deletion of customer sites.
SquareSpace
Nucleus Market (illicit goods market)
The Pirate Bay
More US voting issues
US state school testing systems
- This week, both New Jersey and Tennessee had to cancel testing due to failures in their computerized testing systems. I’ve mentioned TNReady previously here, and this is their third failure.
Facebook
