SRE WEEKLY – Page 101 – scalability, availability, incident response, automation

SRE Weekly Issue #9

lex

February 7, 2016

Articles

I spoke too soon in the last issue! GitHub has posted an extremely thorough postmortem that answers any questions one might have had about last week’s outage. I like the standard they’re holding themselves to for timely communication:

One of the biggest customer-facing effects of this delay was that status.github.com wasn’t set to status red until 00:32am UTC, eight minutes after the site became inaccessible. We consider this to be an unacceptably long delay, and will ensure faster communication to our users in the future.

Monitoring Business Metrics

Just monitoring servers isn’t enough to detect an outage. Sometimes even detailed service monitoring can miss an overall performance degradation that involves multiple services in an infrastructure. In this blog post, PagerDuty suggests also monitoring key business metrics (logins, purchase rate, etc).
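To make the idea concrete, here is a minimal sketch of a business-metric check. It is not PagerDuty's implementation; the function name, the median baseline, and the 50% threshold are all assumptions chosen for illustration.

```python
from statistics import median


def is_business_metric_degraded(baseline_counts, current_count, min_ratio=0.5):
    """Return True when the current value of a business metric (for example,
    logins per minute) falls below min_ratio of its recent median.

    baseline_counts, current_count and min_ratio are hypothetical inputs;
    a real system would pull them from its metrics store.
    """
    if not baseline_counts:
        return False  # no baseline yet, so there is nothing to compare against
    baseline = median(baseline_counts)
    return current_count < baseline * min_ratio


# Logins per minute over the last few intervals, then a sudden drop.
recent_logins = [100, 110, 90, 105]
print(is_business_metric_degraded(recent_logins, 30))
print(is_business_metric_degraded(recent_logins, 95))
```

A check like this can fire even when every individual server and service still reports healthy, which is the gap the post is pointing at.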

What happened yesterday and what we are doing about it

In this case, “yesterday” was in 2013, but this is an excellent postmortem from Mailgun that can serve as an example for all of us.

Handling an Outage

A customer’s perspective on a datacenter outage, with emphasis on the need for early, frequent, and thorough communication from service providers.

Production Postmortem: the Razor Suicide

A nicely detailed outage postmortem, including the gory details of the train of thought the engineers followed on the way to a solution. They hint at an important technique that’s not discussed nearly enough, in my opinion: judicious application of bandaid solutions to resolve the outage and allow engineers to continue their interrupted personal time. It’s not necessary to fix a problem the “right” way in the moment, and carefully-applied bandaids help reduce on-call burnout.

The Verification of a Distributed System

How can we be sure (or at least sort of confident) that distributed systems won’t fail? They can be incredibly complex, and their failures can be even more complex. Caitie McCaffrey gives us this ACM Queue article about methods for formal and informal verification.

Efficiently testing distributed systems is not a solved problem, but by combining formal verification, model checking, fault injection, unit tests, canaries, and more, you can obtain higher confidence in system correctness.

Public Accountability — Postmortems — Medium

Medium has announced a commitment to publishing postmortems for all outages. I’d love to see more companies making a commitment like this. Thanks to reader Pete Shima for this link.

Outages

Healthplanfinder (WA, US)
- The system went down right before the deadline for users to enroll in plans.
Grindr
- The tweets during this outage were hilarious.
Shaw (ISP)
PlayStation Network
- The third outage this year for PSN.
Virgin Australia
- Another airline grounded by an outage.
British Telecom (UK ISP)
Amazon.com
Google App Engine
Delta
- What is it with airlines lately?
Google Compute Engine
- This one has a nice postmortem.
IRS E-File (US tax system)

SRE Weekly Issue #8

lex

January 31, 2016

If you only read two articles this week, make it these first two. They’re excellent and exactly the kind of content I’m looking for. If you come across (or write!) anything that would go well in SRE Weekly, I’d love it if you’d toss a link my way.

Articles

Maslow’s Hierarchy of SRE Needs

Liz Fong-Jones, a Googler and co-chair of SRECon, describes a scale of activities SRE teams engage in, from the basics (keeping the service operating) to having the freedom to improve the service.

High-Availability at Massive Scale: Building Google’s Data Infrastructure for Ads

This is a really awesome paper. Two Googlers describe in detail the pitfalls of failover-based systems and explain how they design multi-homed active/active services. If Google has learned a lesson, we’d all do well to learn from it, too:

Our experience has been that bolting failover onto previously singly-homed systems has not worked well. These systems end up being complex to build, have high maintenance overhead to run, and expose complexity to users. Instead, we started building systems with multi-homing designed in from the start, and found that to be a much better solution. Multi-homed systems run with better availability and lower cost, and result in a much simpler system overall.

Cloud outage audit update: The challenges with uptime

A review of CloudHarmony’s numbers on various cloud providers’ availability in 2015 versus 2014, along with a discussion of how customers deal with outages. I’m a little puzzled by this one:

That’s also partly why most public cloud workloads aren’t used for production or mission-critical applications.

I’m pretty sure plenty of mission-critical stuff is running in EC2, for example.

Interview: Building the Latest Campaign for David Guetta — Serverless Code

The team at parall.ax chose Lambda because there are no long-lived servers, and they could offload all the work of scaling their app up and down with demand to Amazon.

Europa Water Siphon

Randall Munroe takes on an important question: is it possible to siphon water from Europa to Earth? Okay, the only relation to SRE is that a team of Google SREs submitted the question, but I really love What If.

The How and Why of Minimum Viable Runbooks

VictorOps distilled their Minimum Viable Runbooks series (featured here previously) into a polished PDF in their usual high quality and style.

Vodafone set to follow Spark in providing near live information on network faults

During an outage this week, Vodafone admitted that they forgot to update their status site. They are looking into an automated system to make updates during outages.

On Call Compensation for a Solo Sys Admin – What’s your experience? – Spiceworks

Most of the jobs I’ve worked didn’t compensate for on-call, but one did. Compensation is nice, but it was to offset a truly heinous level of pages, so it was small comfort. If you have any good articles about the merits and pitfalls of on-call compensation, please send them my way.

Outages

Lots of downtime this week, including some recurrences and some big names.

The Patriots’ (sports team) Microsoft Surface tablets
- This one’s notable because immediately after, a Surface ad played, causing a Microsoft marketing outage.
Microsoft Office 365 IMAP service
WhatsApp
Kaiser Permanente
PlayStation Network
- A new round of trouble for Sony.
Safari (browser)
- This one’s interesting: existing installations of Safari started crashing on iOS devices everywhere, due to a bug triggered by a change in some backend web service.
Africa (again)
- The backhoe: natural enemy of the network.
Vodafone internet in New Zealand
Google Drive
GitHub
- GitHub suffered a two-hour outage on Thursday that was caused by a power outage. Hats off to them for releasing a postmortem the day after, although I think there’s a lot left unsaid as to why a power outage took them down at all. It’s a shame, because those kinds of analyses can be especially educational and help us learn how to avoid similar problems.
HSBC online banking
- Another DDoS.

SRE Weekly Issue #7

lex

January 24, 2016

A big thanks to Charity Majors (@mipsytipsy) for tweeting about SRE Weekly and subsequently octupling my subscriber list!

Articles

Clients are Jerks: aka How Halo 4 DoSed the Services at Launch & How We Survived

This article is gold. CaitieM explains why clients can’t be trusted, even when they’re written in-house. She describes how her team avoided an outage during the Halo 4 launch by turning off non-essential functionality. Had she trusted the clients, she might not have built in the kill switches that let her shed the excessive load caused by a buggy client.
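For anyone who hasn't built one, a kill switch can be very small. The sketch below is an illustration only, not the Halo 4 services' code; `KillSwitches`, `handle_request`, and the feature names are all hypothetical.

```python
class KillSwitches:
    """Hypothetical in-memory registry of features that operators can turn off."""

    def __init__(self):
        self._disabled = set()

    def disable(self, feature):
        self._disabled.add(feature)

    def enable(self, feature):
        self._disabled.discard(feature)

    def is_enabled(self, feature):
        return feature not in self._disabled


def handle_request(feature, switches, handlers):
    """Serve the request unless its feature has been switched off.

    Rejected requests return immediately, so a misbehaving client
    hammering a non-essential endpoint costs almost nothing.
    """
    if not switches.is_enabled(feature):
        return (503, "feature temporarily disabled")
    return (200, handlers[feature]())


switches = KillSwitches()
handlers = {"stats": lambda: "stats-ok", "matchmaking": lambda: "match-ok"}

switches.disable("stats")  # shed the non-essential load
print(handle_request("stats", switches, handlers))
print(handle_request("matchmaking", switches, handlers))
```

In a real service the switch state would live somewhere operators can change at runtime without a deploy; the in-memory set here is just the simplest stand-in.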

Under the hood: Broadcasting live video to millions

Facebook recently released a live video streaming feature. Because they’re Facebook, they’re dealing with a scale that existing solutions can’t even come close to supporting (think millions of viewers for celebrity live video broadcasts). This article goes into detail about how they handle that level of concurrency for live streaming. I especially like the bit about request coalescing.
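Request coalescing is easy to describe and a little fiddly to write, so here is a rough sketch. It is not Facebook's implementation: `RequestCoalescer`, its `followers` helper, and the thread-based design are assumptions made for illustration. The idea is that when many callers ask for the same key at once, only the first one goes to the backend and the rest wait for its answer.

```python
import threading


class _Call:
    """State shared by every caller waiting on one in-flight fetch."""

    def __init__(self):
        self.done = threading.Event()
        self.value = None
        self.error = None
        self.followers = 0


class RequestCoalescer:
    """Collapse concurrent requests for the same key into one backend fetch."""

    def __init__(self, fetch):
        self._fetch = fetch          # hypothetical backend function: key -> value
        self._lock = threading.Lock()
        self._in_flight = {}         # key -> _Call

    def followers(self, key):
        """Number of callers currently waiting on an in-flight fetch for key."""
        with self._lock:
            call = self._in_flight.get(key)
            return call.followers if call else 0

    def get(self, key):
        with self._lock:
            call = self._in_flight.get(key)
            leader = call is None
            if leader:
                call = _Call()
                self._in_flight[key] = call
            else:
                call.followers += 1

        if leader:
            # Only the first caller talks to the backend.
            try:
                call.value = self._fetch(key)
            except Exception as exc:
                call.error = exc
            finally:
                with self._lock:
                    del self._in_flight[key]
                call.done.set()
        else:
            # Everyone else waits for the leader's result.
            call.done.wait()

        if call.error is not None:
            raise call.error
        return call.value
```

Note that this sketch only coalesces requests that overlap in time; once a fetch finishes, the next request for the same key goes to the backend again unless a cache sits in front of it.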

Uptime Funk

Best. I pretty much only like the parodies of Uptown Funk.

The Monitoring Death Spiral

This is a really great little essay comparing running a large infrastructure with flying a plane by instruments. Paying attention to just one or two instruments without understanding the big picture results in errors.

Thanks to DevOps Weekly for this one.

Diagnosing Human Fail

An awesome incident response summary for an outage caused by domain name expiration. The live Grafana charts are awesome, along with the dashboard snapshot. It’s exciting to see how far that project has come!

The Factors That Impact Availability, Visualized

Calculating availability is hard. Really hard. First, you have to define just what constitutes availability in your system. Once you’ve decided how you calculate availability, you’ve defined the goalposts for improving it. In this article, VividCortex presents a general, theoretical formula for availability and a corresponding 3D graph that shows that improving availability involves both increasing MTBF and reducing MTTR.
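The article's exact formula isn't reproduced here, so the sketch below assumes the common steady-state definition, availability = MTBF / (MTBF + MTTR). The function name and the sample numbers are made up for illustration.

```python
import math


def availability(mtbf_hours, mttr_hours):
    """Steady-state availability, assuming the common definition
    MTBF / (MTBF + MTTR). Both inputs are in the same time unit."""
    if mtbf_hours < 0 or mttr_hours < 0 or mtbf_hours + mttr_hours == 0:
        raise ValueError("MTBF and MTTR must be non-negative and not both zero")
    return mtbf_hours / (mtbf_hours + mttr_hours)


baseline = availability(999, 1)          # one hour of repair per 999 hours of uptime
double_mtbf = availability(1998, 1)      # fail half as often
halve_mttr = availability(999, 0.5)      # recover twice as fast

print(baseline, double_mtbf, halve_mttr)
# Under this definition, doubling MTBF and halving MTTR buy the same improvement.
print(math.isclose(double_mtbf, halve_mttr))
```

Under this assumed definition only the ratio of MTBF to MTTR matters, which is one way to see why both levers count.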

Cloudy, with a chance of outage

TechCentral.ie gives us this opinion piece on the frequency of outages in major cloud providers. The author argues that, though reported outages may seem major, they still rarely cause violation of SLAs, and service availability is still probably better than individual companies could manage on their own.

Full disclosure: Heroku, my employer, is mentioned.

JetBlue, Verizon data center downtime raises DR, UPS questions

An external post-hoc analysis of the recent outage at JetBlue, with speculation on the seeming lack of effective DR plans at JetBlue and Verizon. The article also mentions the massive outage at 365 Main’s San Francisco datacenter in 2007, which is definitely worth a read if you missed that one.

Why Things Were Less Than Optimal This Past Weekend

Linden Lab Systems Engineer April wrote up a detailed postmortem of the multiple failures that went into a rough weekend for Second Life users. I worked on recovery from at least a few failures in that central database in my several years at Linden, and it’s pretty tricky managing the thundering herd that floods through the gates when you reopen them. Good luck folks, and thanks for the excellent write-up!

The Netflix Tech Blog: Automated Failure Testing

Netflix has taken the Chaos Monkey to the next level. Now their automated system investigates the services a given request touches and injects artificial failures in various dependencies to see if they cause end-user errors. It takes a lot of guts to decide that purposefully introducing user-facing failures is the best way to ultimately improve reliability.

…we’re actually impacting 500 members requests in a day, some of which are further mitigated by retries. When you’re serving billions of requests each day, the impact of these experiments is very small.

Outages

Only a few this week, but they were whoppers!

Twitter
- Twitter suffered a massive outage at least 2 hours long with sporadic availability for several hours after. Hilariously, they posted status about the outage on Tumblr.
Comcast (SF Bay area)
Africa
- This is the first time I’ve had an entire continent in this section. Most of Africa’s Internet was cut off from the rest of the world due to a pair of fiber cuts. South Africa was hit especially hard.

SRE Weekly Issue #6

lex

January 17, 2016

Articles

Designed to Fail – Brave New Geek

A discussion of failing fast, degrading gracefully, and applying back-pressure to avoid cascading failure in a service-oriented architecture.

Many times, it’s our own internal services which cause the biggest DoS attacks on ourselves.
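As a small illustration of failing fast and applying back-pressure, here is a sketch built on a bounded queue from Python's standard library. It is not code from the article; `BoundedInbox` and its method names are hypothetical.

```python
import queue


class BoundedInbox:
    """A work queue that rejects new jobs when full instead of growing forever.

    Rejecting immediately (failing fast) tells the caller to slow down or
    retry later, which is a simple form of back-pressure.
    """

    def __init__(self, capacity):
        self._jobs = queue.Queue(maxsize=capacity)

    def submit(self, job):
        """Return True if the job was accepted, False if the inbox is full."""
        try:
            self._jobs.put_nowait(job)
            return True
        except queue.Full:
            return False

    def take(self):
        """Remove and return the oldest job; raises queue.Empty if none."""
        return self._jobs.get_nowait()


inbox = BoundedInbox(capacity=2)
print([inbox.submit(job) for job in ("a", "b", "c")])  # the third job is rejected
print(inbox.take())                                    # a worker drains one job
print(inbox.submit("c"))                               # now there is room again
```

The alternative, an unbounded queue, accepts everything and lets latency and memory grow until something else falls over, which is how an internal caller ends up acting like a DoS attack.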

Kernel Patching 101: How to Make Repairs Without System Downtime

A SUSE developer introduces kGraft, SUSE’s system for live kernel patching. Anyone who survived the AWS reboot-a-thon is probably a big fan of live kernel patching solutions.

Not Everything Critical is Urgent. Learn the Difference.

One thing that is critical is avoiding burnout in on-call. This article is a description of the “urgency” feature in PagerDuty, but they make a generally applicable point: don’t wake someone for something just because it’s critical; only wake them if it needs immediate action.

Fallacies of Distributed Computing Explained

This is a review/update of the 1994 article. The fallacies still hold true, and anyone designing a large-scale service should heed them. The fallacies:

The network is reliable.
Latency is zero.
Bandwidth is infinite.
The network is secure.
Topology doesn’t change.
There is one administrator.
Transport cost is zero.
The network is homogeneous.

As I get into SRE Weekly, I repeatedly run across articles that I probably should have read long since in my career. Hopefully they’re new to some of you, too.

Delivering safer cars faster through automation and continuous delivery

Every position I’ve held has involved supporting reliability in a 24/7 service, but let’s be realistic: it’s unlikely someone would have died as a result of an outage. In cars, reliability takes a whole new meaning. I first got interested in MISRA and the other standards surrounding the code running in cars when I read some technical write-ups of the investigation surrounding the “unintended acceleration” incidents a few years back. This article discusses how devops practices are being applied in the development of vehicle code.

Security experts confirm Ukraine power grid blackout a ‘coordinated intentional attack’

Evidence has come out that the recent major power outage in Ukraine was a network-based attack (I can’t make myself say “cyber-” anything).

PS4 porn viewers rocket during PSN outage

I should have seen this coming.

Verizon grounds JetBlue — how could that happen? Another plan B gone bad

One blogger’s take on the JetBlue outage.

It’s very hard to create an entirely duplicate universe where you can test plan B. And it’s even hard to keep on testing it regularly and make sure it actually works. To wit: Your snow plow often doesn’t start after the first snow because it’s been sitting idle all summer.

SRECon16 Call for Participation

The SRECon call for participation is now open!

LostPass

Sean Cassidy has discovered an easy and indistinguishable phishing method for LastPass in Chrome, with a slightly less simple and effective method for Firefox. This one’s important for availability because many organizations rely heavily on LastPass. Compromising the right employee’s vault could spell big trouble and possibly downtime.

Outages

GTA Online
EE (phone network)
Amplitude
- A truly heinous multi-day outage for Amplitude. The root cause appears to be inadvertent deletion of data in DynamoDB. Thanks to the folks at Amplitude for the extremely detailed status and analysis. Get some sleep, folks.
PlayStation Network
Xbox Live Down
JetBlue
- This one was all over the news. JetBlue points the finger at a Verizon datacenter outage.
TalkTalk
Yahoo Mail

SRE Weekly Issue #5

lex

January 10, 2016

Articles

You Own Your Availability

What does owning your availability really mean? Brave New Geek argues that it simply means owning your design decisions. I love this quote:

An SLA is not an insurance policy or a hedge against the business impact of an outage, it’s merely a refund policy.

ISIS hackers claim responsibility for BBC outage during “test attack”

Apparently last week’s BBC outage was “just a test”. Now we have to defend our networks against misdirected hacktivism?

Operations is More Than Just Systems Administration

Increased deployment automation leads to the suggestion that developers can now “do ops” (see also: “NoOps”). This author explains why operations is much more than deployment.

Full disclosure: Heroku, my employer, is briefly mentioned.

Oyster’s Underground Nightmare: When DevOps Kills Retail – DZone DevOps

Tips on how to move toward rapid releases without drastically increasing your risk of outages. They cite the Knight Capital automated trading mishap as a cautionary example, along with Starbucks and this week’s Oyster outage.

Holistic Configuration Management at Facebook | the morning paper

Facebook uses configuration for many facets of its service, and they embrace “configuration as code”. They make extensive use of automated testing and canary deployments to keep things safe.

Thousands of changes made by thousands of people is a recipe for configuration errors – a major source of site outages.

Quick Tips: How to Post Mortem Every Incident

PagerDuty shares a few ideas about how and why to do retrospective analysis of incidents.

Crossroads of Asynchrony and Graceful Degradation

Another talk from QCon. Netflix’s Nitesh Kant explains how an asynchronous microservice architecture naturally supports graceful degradation. (Thanks to DevOps Weekly for the link.)

The Network is Reliable

One of the fallacies of distributed computing. This ACM Queue article is an informal survey of all sorts of fascinating ways that networks fail.

Outages

Nintendo Network Down After Service Suffers Unexpected Outage
HSBC Online Banking
Vodafone
- Flooding in their UK datacenter.
Easyspace
Oyster (London transit system)
Verizon Wireless
Time Warner Cable
Sony PlayStation Network
- Sony has said they will compensate users by extending subscriptions.
