SRE WEEKLY – Page 97 – scalability, availability, incident response, automation

SRE Weekly Issue #39

lex

September 11, 2016

Articles

A+ article! Susan Fowler has been a developer, an ops person, and now an SRE. That means she’s well-qualified to give an opinion on who should be on call, and she says that the answer is developers (in most cases). Bonus content includes “What does SRE become if developers are on call?”

[…]if you are going to be woken up in the middle of the night because a bug you introduced into code caused an outage, you’re going to try your hardest to write the best code you possibly can, and catch every possible bug before it causes an outage.

Thanks to Devops Weekly for this one.

New zine: Linux debugging tools you’ll love

I figured this new zine from Julia Evans would be mostly review for me. WRONG. I’d never heard of dstat, opensnoop, execsnoop, or perf before, but I sure will be using them now. As far as I can tell, Julia wants to learn literally everything, and better yet, she wants to teach us what she learned and how she learned it. Hats off to her.

The ‘Change One Thing’ Rule

“While we’ve got the entire system down to do X, shall we do Y also?”

This article argues that we should never do Y. If something goes wrong, we won’t know whether to roll back X or Y, and it’ll take twice as long to figure out which one is to blame.

Systems blindness and how we deal with it

This week, Mathias introduces “system blindness”, the flawed understanding of how a system works and the lack of knowledge of how incomplete our understanding of it is. Whether we realize it or not, we struggle to mentally model the intricate interconnections in the increasingly complex systems we’re building.

There are no side effects, just effects that result from our flawed understanding of the system.

Building resilience in Spokes

I’ve mentioned Spokes (formerly DGit) here previously. This time, GitHub shares the details on how they designed Spokes for high durability and availability.

Ruby Is Dead! – You Need to Take Care of Its Memory Issues

TIL: Ruby can suffer from Java-style stop-the-world garbage collection freezes.

Facebook Engineers Crash Data Centers in Real-World Stress Test

Here’s a recap of a talk about Facebook’s “Project Storm”, given by VP Jay Parikh at @Scale. Project Storm involved retrofitting Facebook’s infrastructure to handle the failure of entire datacenters.

“I was having coffee with a colleague just before the first drill. He said, ‘You’re not going to go through with it; you’ve done all the prep work, so you’re done, right?’ I told him, ‘There’s only one way to find out’” if it works.

Failure is Always An Option: How a Blameless Culture Leads to Better Results

Here’s an interview with Jason Hand of VictorOps about the importance of a blameless culture. He mentions the idea that “Why?” is an inherently blameful kind of question (hat tip to John Allspaw’s Infinite “How?”s). I have to say that I’m not sure I agree with Jason’s other point that we shouldn’t bother attempting incident prevention, though. Just look at the work the aviation industry has done toward accident prevention.

This article is published by my sponsor, VictorOps, but their sponsorship did not influence its inclusion in this issue.

SCALE 15x Call for Proposals

SCALE has opened their CFP, and one of the chairs told me that they’d “love to get SRE focused sessions on open-source.”

Outages

British Airways
FLOW (Jamaica telecom)
SSP
- SSP provides a SaaS for insurance companies to run their business on. They’re dealing with a ten-plus-day outage initially caused by some kind of power issue that fried their SAN. As a result, they’re going to decommission the datacenter in question.
Heroku
- Full disclosure: Heroku is my employer.
Azure
- Two EU regions went down simultaneously.
Overwatch (game)
Asana
- Linked is a postmortem with an interesting set of root causes. A release went out that increased CPU usage, but it didn’t cause issues until peak traffic the next day. Asana is brave for enabling comments on their postmortem (not sure I’d have the stomach for that). Thanks to an anonymous contributor for this one.
ESPN’s fantasy football
- Unfortunate timing, being down on opening day.

SRE Weekly Issue #38

lex

September 4, 2016

General

Comments

Welcome to the many new subscribers that joined in the past week. I’m not sure who I have to thank for the sudden surge, but whoever you are, thanks!

Articles

The Fire Service and the Aviation Industry – Firefighter Safety – Crew Resource Management

What can the fire service learn about safety from the aviation industry? A 29-year veteran in the fire service answers that question in detail. We could in turn apply all of those lessons to operating complex infrastructures.

Wikipedia: High Reliability Organization

I’m surprised that I haven’t come across the term “High Reliability Organization” before reading the previous article. Here’s Wikipedia’s article on HROs.

A high reliability organization (HRO) is an organization that has succeeded in avoiding catastrophes in an environment where normal accidents can be expected due to risk factors and complexity.

Tracking Every Release

Etsy instruments their deployment system to strike a vertical line on their graphite graphs for every deploy. This helps them quickly figure out whether a deploy happened right before a key metric took a turn for the worse.
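As a rough illustration of the idea (not Etsy's actual tooling), a deploy script could emit one data point per deploy using Graphite's plaintext protocol, which accepts lines of the form "path value timestamp" on the Carbon listener (port 2003 by default). The metric path `deploys.<app>` and the host `graphite.example.com` below are assumptions made up for this sketch.

```python
import socket
import time


def deploy_marker_line(app: str, timestamp: int) -> str:
    """Build one plaintext-protocol line: metric path, value, Unix timestamp."""
    # "deploys.<app>" is a hypothetical metric path chosen for this sketch.
    return f"deploys.{app} 1 {timestamp}\n"


def record_deploy(app: str, host: str = "graphite.example.com", port: int = 2003) -> None:
    """Send a single deploy data point to a Carbon listener (hypothetical host)."""
    line = deploy_marker_line(app, int(time.time()))
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))
```

Calling `record_deploy("web")` at the end of a deploy would record the event. On the graphing side, Graphite's `drawAsInfinite()` function draws any value above zero as a full-height vertical line, which can then be overlaid on the metric you care about.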

Undersea cables keep global enterprise networks afloat

A really interesting dive into the world of subsea network cables and the impact that cuts can have on businesses worldwide.

Testing in production comes out of the shadows

How closely can you really mimic production in your testing environments? In a way we’re all testing in production, and this article talks about getting that fact out in the open.

Two Suggestions to Help SL Scale

I wrote this article on my terrible little blog back in 2008 — forgive the horrid theme and apparently broken unicode support. This was well before I worked in Linden Lab’s Ops team, back when I was making a living as a user selling content in Second Life. What’s interesting to me in reading this article 8 years later is the user perspective on the impact of the string of bad outages, and especially Linden’s poor communication during outages.

Delta says it lost $100 million in revenue due to big outage

More on the impact of Delta Air Lines’ major outage last month.

We’re learning the wrong lessons from airline IT outages

Most often a catastrophic failure is not due to a lack of standards, but a breakdown or circumvention of established procedures that compounded into a disastrous outcome. Multilayer complex systems outages signify management failure to drive change and improvement.

Outages

Salesforce
- Salesforce.com was down or impaired for several hours.
  
  Full disclosure: Salesforce is the parent company of my employer, Heroku.
DynamoDB
Telstra Mail
Google Cloud Platform
- Normally I don’t include single-zone failures in EC2 or GCP, but this one has an extremely interesting and detailed postmortem.
EA (FIFA 16 and Battlefield 1 Beta)
Vodafone (Ireland)
Interpublic Group (Hollywood PR agency)
Vesk
- The Register noted that Vesk bragged about their 100% uptime even after the outage — including for all of 2016. From Vesk’s recently-changed about page:
  
  “VESK has hit 100% uptime for all 2012, 2013, 2014, 2015 and 2016.”
PlayStation Network
PagerDuty
- PagerDuty is currently unable to process some inbound events. We are investigating the cause.
Telkom (South Africa telecom)
- The company cited suspected sabotage and offered a monetary reward.
Washington, DC 911 system
- Emergency services were knocked out for 90 minutes after a contract worker mistakenly hit the emergency shut-off button. The phrase “human error” is being tossed about.

SRE Weekly Issue #37

lex

August 28, 2016

Articles

The “network partitions are rare” fallacy

Sometimes I follow chains of references from article to article until I find a new author to follow, and this time it’s Kelly Sommers. In this gem, she debunks the rarity of network partitions by recasting them as availability partitions. If half of your nodes aren’t responding because their CPUs are pegged, you still have a network partition.

most partitions I’ve experienced have nothing to do with network infrastructure failures

How DIGIT Created High Availability on the Public Cloud to Keep Its Games Running

Two engineers from MMO company DIGIT gave this short, nicely detailed interview in which they outline how they achieve HA on AWS.

DevOps & SRE AMA Video

Here’s a recording of the DevOps/SRE AMA from a couple weeks back, in case you missed it.

No Way Out But Through

A blog post by Skyline, who is launching their new deployment-as-a-service offering. The intro is pretty great, outlining the inherent risks in changing code and releasing new code into production.

gh-ost: GitHub’s online schema migration tool for MySQL

Other online schema-change tools I’m familiar with (e.g. pt-online-schema-change) use triggers to keep a new table in sync with changes while copying old rows over. Instead, gh-ost monitors changes by hooking on as a replication slave. Very clever! This article goes into several reasons why this is a much better approach.

Outages

Google App Engine
- The outage occurred on August 11, but they posted a postmortem this week.
Buildkite
- Includes an extremely detailed postmortem starting with paging failure and running through 6 lessons learned. #ThereIsNoOneRootCause
Slack
Second Life
- Another awesome postmortem by April Linden.
Travis CI
Facebook
eBay
PlayStation Network
iiNet (ISP)
Google Compute Engine

SRE Weekly Issue #36

lex

August 21, 2016

General

Comments

Last week’s DevOps & SRE AMA was super fun! Thanks to the panelists for participating. Recordings should be posted soon.

Articles

Multi data center redundancy – sysadmin considerations

This is the second half of Server Density’s series on the lessons they learned as they transitioned to a multi-datacenter architecture. There are lots of interesting tidbits in here, such as an explanation of how they handle failover to the secondary DC and what they do if that goes wrong.

Full disclosure: Heroku, my employer, is mentioned.

How Complex Web Systems Fail

Here’s the second half of Mathias Lafeldt’s series that seeks to apply Richard Cook’s How Complex Systems Fail to web systems. The article is great, but the really awesome part is the thoughtful responses by Cook himself to both parts one and two, linked at the end of this article.

Why Reddit was down on Aug 11

Here’s a postmortem for last week’s outage that involved a migration gone awry.

Thanks to Jonathan Rudenberg for this one.

US Patent Office sued after it declared a power outage a ‘national holiday’

A patent holding firm is alleging that the USPTO overstepped its authority in declaring a system outage (reported in issue #4) to be treated as a national holiday for purposes of deadlines, and that this led to the plaintiff being sued.

Know Anyone With This High-Burnout Job?

Burnout is a crucially important consideration in a field with on-call work. VictorOps has a few tips for alleviating burnout gleaned from this year’s Monitorama.

Staging Servers Are Dead!

Edith Harbaugh says that the benefits of staging servers don’t outweigh the reliability risk they present. This article is an update to her original article, which I also recommend reading.

Context aware MySQL pools via HAProxy

GitHub uses HAProxy to balance across its read-only MySQL replicas, which is a common method. Their technique for excluding lagging nodes while avoiding entirely emptying the pool if all nodes are lagging is pretty neat.

Thanks to Devops Weekly for this one.
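The pool behavior described in the HAProxy item above can be sketched as plain selection logic. This is a minimal illustration of the general idea (prefer replicas with acceptable lag, but never end up with an empty pool just because everything is lagging), not GitHub's actual HAProxy configuration. The `Replica` type, its fields, and the 5-second threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    name: str
    healthy: bool       # replica answers its health check at all
    lag_seconds: float  # measured replication lag


def select_pool(replicas, max_lag_seconds=5.0):
    """Prefer fresh replicas; fall back to any healthy replica rather than an empty pool."""
    healthy = [r for r in replicas if r.healthy]
    fresh = [r for r in healthy if r.lag_seconds <= max_lag_seconds]
    # If every healthy replica is lagging, serving slightly stale reads
    # is assumed to be better than serving nothing.
    return fresh if fresh else healthy
```

The trade-off here is deliberate: a lagging replica is only excluded while at least one non-lagging replica exists. Replicas that are down are never returned, so the pool can still be empty if nothing is healthy.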

Serverless Architectures

A highly detailed deep-dive on Serverless — what it means, benefits, and drawbacks. I especially enjoyed the #NoOps section:

[Ops] also means at least monitoring, deployment, security, networking and often also means some amount of production debugging and system scaling. These problems all still exist with Serverless apps and you’re still going to need a strategy to deal with them. In some ways Ops is harder in a Serverless world because a lot of this is so new.

#ServerlessIsMadeOfServers

Full disclosure: Heroku, my employer, is mentioned.

Outages

Slack
- A relatively minor issue, but it impacted me, so I logged it here while awaiting resolution.
MTN (mobile telecom)
Google Cloud Status Dashboard
- Postmortem included, with an interesting cause:
  
  During mitigation of a lower impact performance issue, Google engineers made a configuration change to pipeline orchestration. An error in this configuration caused validation within the orchestration component to reject all requests.
Tesla Vehicles
Xbox Live
Sky (ISP)
Facebook
Apple’s App Store
Twitter
Cisco Jasper
Optus
AT&T
NSA

SRE Weekly Issue #35

lex

August 14, 2016

Articles

AWS Networking, Environments and You

Whoops, here’s one that got lost in my review queue. Charity Majors (one of the usual suspects here at SRE Weekly) wrote one of her characteristically detailed and experience-filled posts on how to isolate your production, staging, and development environments in AWS.

Paradigm Check Point: Prefacing Debriefings

I can’t quite tell how much of this is John Allspaw’s writing and how much is written by the US Forestry Service, but I love it all. Here’s a bulleted list of points driving home the fact that we constantly strike a balance between risk and safety.

Multi data center redundancy – application considerations

Server Density added multi-datacenter redundancy to their infrastructure in 2013, and they were kind enough to document what they learned. In this first of two articles, they outline different kinds of multi-datacenter setups and go over the kinds of things you’ll have to think about as you retrofit your application.

Making a point with SLAs

This short opinion piece raises an excellent idea: SLAs aren’t for recouping the cost you incurred due to an outage, they are for making a point to a service provider about the outage.

Cost of Southwest’s tech outage climbs to at least $54 million

Southwest has released some numbers on the impact of last month’s outage that resulted in thousands of cancelled flights.

Netflix and Fill

Netflix gives us a rundown of how they prepare a title for release by pre-filling caches in their in-house CDN. I like the part about timing pre-filling during off-peak hours to avoid impacting the service.

Delta Datacenter Crash: Do the Math on Disaster Recovery ROI

How much is your company willing to invest for a truly effective DR solution? This article asks that question and along the way digs into what an effective DR solution looks like and why it costs so much.

Outages

Syria
- The Syrian government shut internet access down to prevent cheating on school exams.
Mailgun
- Linked is a really interesting postmortem: Mailgun experienced an outage when their domain registrar abruptly placed their domain on hold. The registrar was subsequently largely uncommunicative, hampering incident resolution. Lesson learned: make sure you can trust your registrar, because they have the power to ruin your day.
Belnet
- The linked article has some intriguing detail about a network equipment failure that caused a routing loop.
Australia’s census website
- This caught my eye:
  
  Revolution IT simulated an average sustained peak of up to 350 submissions per second, but only expected up to 250 submission per second.
  
  Load testing only 40% above expected peak demand? That seems like a big red flag to me.
Reddit
Etisalat (UAE ISP)
Vodafone
Google Drive
AT&T
Delta Airline
- A datacenter power system failure resulted in cancelled flights worldwide.
