SRE Weekly Issue #19

Articles

I just love this story. I heard Rachel Kroll tell it during her keynote at SREcon, and here it is in article form. It’s an incredibly deep dive through a gnarly debugging session, and I can’t recommend enough that you read it. NSFL (not safe for the library), because it’s pretty darned hilarious.

Christine Spang of Nylas shares a story of migrating from RDS to sharded, self-run MySQL clusters using SQLProxy. I love the detail here! I’m looking to include more deeply technical articles in SRE Weekly, so if you come across any, I’d love it if you’d point them out to me.

Here’s the latest in Mathias Lafeldt’s Production Ready series. He makes the argument that too few failures can be a bad thing and argues for a chaos engineering approach.

Complacency is the enemy of resilience. The longer you wait for disaster to strike in production — merely hoping that everything will be okay — the less likely you are to handle emergencies well, both at a technical and organizational level.

Timesketch is a tool for building timelines. It could be useful for building a deeper understanding of an incident as part of a retrospective.

Anthony Caiafa shares his take on what SRE actually means. To me, SRE seems to be a field even more in flux than DevOps, and definitions have yet to settle. For example, I feel that there’s a lot that a non-programmer can add to an SRE team — you just have to really think about what it means to engineer reliability (e.g. process design).

GitHub details DGit, their new high-availability system for storing Git repositories internally. Previously, they used pairs of servers, each with RAID mirroring, kept in sync using DRBD.

An early review of Google’s new SRE book by Mike Doherty, a Google SRE. He was only peripherally involved in the publication and gives a fairly balanced take on the book. For an outside perspective, see danluu’s detailed chapter-by-chapter notes.

Amazon.com famously runs on AWS, so any AWS outage could potentially impact Amazon. Google, on the other hand, doesn’t currently run any of its external services on Google Cloud Platform. This article makes the argument that doing so would create a much bigger incentive to improve and sustain GCP’s reliability.

However, when Google had its recent 12-hour outage that took Snapchat offline, it didn’t impact any of Google’s real revenue-generating services. […] What would the impact have been if Google Search was down for 12 hours?

Thanks to Charity for this one.

Oops.

Note that there’s been some question on hangops #sre as to whether this is a hoax. Either way, I could totally see it happening.

I love the fact that statuspage.io is the author of this article. How many of us have agonized over the exact wording of a status site post?

Outages

  • Yahoo Mail
  • Business Wire
  • Google Compute Engine
    • GCE suffered a severe network outage. It started as increased latency and at worst became a full outage of internet connectivity. Two days after the incident, Google released the best postmortem I’ve seen in a very long time. Full transparency, a terrible juxtaposition of two nasty bugs, a heartfelt apology, fourteen(!) remediation items… it’s clear their incident response was solid and they immediately did a very thorough retrospective.

  • North Korea
    • North Korea had a series of internet outages, each of the same length at the same time on consecutive days. It’s interesting how people are trying to learn things about the reclusive country just from this pattern of outages.

  • Blizzard's Battle.net
  • Twitter
  • Misco
  • Two Alt-Coin exchanges (Shapeshift and Poloniex)
  • Home Depot

SRE Weekly Issue #18

SREcon16 was awesome! Sorry for the light issue this week — still recovering from my con-hangover. I had an incredible time, and I enjoyed meeting many of you, both old subscribers and new. Thank you all for your support! When USENIX posts their recordings, I’ll share links to some of my favorite talks.

QotW, from Charity Majors’s day 1 closing keynote (paraphrased):

There are no bad decisions. We make the best decisions we can with the information we have at the time.

Love it. The second QotW was from Rachel Kroll’s day 1 opening keynote, which included a hilarious and cringe-worthy story of investigating a very well-hidden bug with an incredibly bizarre set of symptoms. I can’t recommend enough watching the keynotes, and, well, every talk.

More content next week, after I’ve caught up on my RSS feeds. Thanks again for the huge amount of support you all have shown me — all 250+ of you (and that’s just email subscribers)!

Articles

Telstra exec Kate McKenzie detailed some findings from internal investigations into the recent spate of Telstra incidents. There’s some nice detail here, including possible remediation items and an implication that Telstra is using a blameless retrospective process.

This is a short but excellent template for incident retrospectives in the form of a series of questions. A great place to start if you’re looking to improve your retrospective process.

Etsy’s morgue, a tool for tracking information related to postmortem investigations.

A rockin’ postmortem detailing the failure and recovery of a 1.7 PB filesystem, featuring the creation of a 3 TB ramdisk(!) to speed up the operation.

Thanks to phill-atlassian on hangops #incident_response for this one.

Outages

SRE Weekly Issue #17

I’m posting this week’s issue from the airport on my way to the west coast for business and SREcon16. I’m hoping to see some of you there! I’ll have a limited number of incredibly exclusive hand-made SRE Weekly patches to give out — just ask.

Articles

I love surveys! This one is about incident response as it applies to security and operations. The study author is looking to draw parallels between these two kinds of IR. I can’t wait for the results, and I’ll definitely link them here.

Charity Majors gives us this awesomely detailed article about a Terraform nightmare. An innocent TF run in staging led to a merry bug-hunt down the rabbit-hole and ended in wiping out production — thankfully on a not-yet-customer-facing service. She continues on with an excellent HOWTO on fixing your Terraform config to avoid this kind of pitfall.

If you can’t safely test your changes in isolation away from prod, you don’t have infrastructure as code.

Uber set up an impressively rigorous test to determine which combination of serialization format and compression algorithm would hit the sweet spot between data size and compression speed. The article itself doesn’t directly touch on reliability, but of course running out of space in production is a deal-breaker, and I just love their methodology.
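
As a rough illustration of that kind of methodology (a toy sketch, not Uber’s actual harness; the payload, formats, and codecs here are stand-ins), here’s how you might measure encoded size against compression time in Python:

    import json, lzma, pickle, time, zlib

    # Hypothetical sample payload standing in for real production records.
    records = [
        {"id": i, "lat": 37.77 + i * 1e-4, "lng": -122.42, "status": "ok"}
        for i in range(10_000)
    ]

    serializers = {
        "json": lambda d: json.dumps(d).encode("utf-8"),
        "pickle": lambda d: pickle.dumps(d, protocol=pickle.HIGHEST_PROTOCOL),
    }
    compressors = {
        "zlib-6": lambda b: zlib.compress(b, 6),
        "lzma-0": lambda b: lzma.compress(b, preset=0),
    }

    for s_name, serialize in serializers.items():
        raw = serialize(records)
        for c_name, compress in compressors.items():
            start = time.perf_counter()
            packed = compress(raw)
            elapsed = time.perf_counter() - start
            print(f"{s_name}+{c_name}: {len(raw) / 1024:.0f} KiB -> "
                  f"{len(packed) / 1024:.0f} KiB in {elapsed * 1000:.1f} ms")

A real comparison would sweep many more formats and compression levels and use representative production data, but the shape of the experiment is the same: measure both axes and pick the point on the size/speed curve that fits your constraints.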

I make heavy use of Pinboard to automate my article curation for SRE Weekly. This week, IFTTT decided to axe support for Pinboard, and they did it in a kind of jerky way. The service’s owner Maciej wrote up a pretty hilarious and pointed explanation of the situation.

Thanks to Courtney for this one.

HipChat took another outage last week when they tried to push a remediation from a previous outage. Again with admirable speed, they’ve posted a detailed postmortem including some excellent lessons that we can all learn from.

This deployment was an important remediation from the previous outages and seemed like the right thing to do.
Lesson learned: No matter how much you want to remediate a problem for your users, consider the risk, complexity, and timing, and then reconsider again.

I love human error. Or rather, I love when an incident is reported as “human error”, because the story is inevitably more nuanced than that. Severe incidents are always the result of multiple things going wrong simultaneously. In this case, it was an operator mistake, insufficient radios and badges for responders, and lack of an established procedure for alerting utility customers.

A detailed exploration of latency and how it can impact online services, especially games.

Online gaming customers are twice as likely to abandon a game when they experience a network delay of 50 additional milliseconds

Say “eliminate downtime” and I’ll be instantly skeptical, but this article is a nice overview of predictive maintenance systems in datacenters.

Data centers use complex hardware that presents unforeseen problems that calendar-based maintenance checks simply cannot anticipate.

Outages

SRE Weekly Issue #16

Another packed issue this week, thanks in no small part to the folks on hangops #incident_response. You all rock!

This week, I broke 200 email subscribers. Thank you all so much! At Charity Majors’s suggestion, I’ve started a Twitter account, SREWeekly, where I’ll post a link to each week’s issue as it comes out. Feel free to unsubscribe via email and follow there instead, if you’d prefer.

Articles

I love this article! Everything it says can be readily applied to SRE. It touches on blameless culture, causes of errors and methods of capturing incident information. Then there’s this excellent tidbit about analyzing all incidents, even near misses:

The majority of organizations target their most serious incidents for immediate attention. Events that lead to severe and/or permanent injury or death are typically underscored in an effort to prevent them from ever happening again. But recurrent errors that have the potential to do harm must also be prioritized for attention and process improvement. After all, whether an incident ultimately results in a near miss or an event of harm leading to a patient’s death is frequently a matter of a provider’s thoughtful vigilance, the resilience of the human body in resisting catastrophic consequences from the event, or sheer luck.

A short postmortem by PagerDuty for an incident earlier this month. I like how precise their impact figures are.

Thanks to cheeseprocedure on hangops #incident_response for this one.

Look past the contentious title, and you’ll see that this one’s got some really good guidelines for running an effective postmortem. To be honest, I think they’re saying essentially the same thing as the “blameless postmortem” folks. You can’t really be effective at finding a root cause without mentioning operator errors along the way; it’s just a matter of how they’re discussed.

Ultimately, the secret of those mythical DevOps blameless cultures that hold the actionable postmortems we all crave is that they actively foster an environment that accepts the realities of the human brain and creates a space to acknowledge blame in a healthy way. Then they actively work to look beyond it.

Thanks to tobert on hangops #incident_response for this one.

Ithaca College has suffered a series of days-long network outages, crippling everything from coursework to radio broadcasts. Their newspaper’s editorial staff spoke out this week on the cause and impact of the outages.

iTWire interviews Matthew Kates, Australia Country Manager for Zerto, a DR software firm, about the troubles Telstra has been dealing with. Kates does an admirable job of avoiding plugging his company, instead offering an excellent analysis of Telstra’s situation. He also gives us this gem, made achingly clear this week by Gliffy’s troubles:

Backing up your data once a day is no longer enough in this 24/7 ‘always on’ economy. Continuous data replication is needed, capturing every change, every second.

I love the part about an “architecture review” before choosing to implement a design for a new component (e.g. Kafka) and an “operability review” before deployment to ensure that monitoring, runbooks, etc. are all in place.

Atlassian posted an excellent, high-detail postmortem on last week’s instability. One of the main causes was an overloaded NAT service in a VPC, compounded by aggressive retries from their client.
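
Aggressive retries compounding an already overloaded dependency is a classic failure mode. As a general mitigation (not Atlassian’s actual client code), here’s a minimal Python sketch of capped exponential backoff with full jitter:

    import random
    import time

    def call_with_backoff(do_request, max_attempts=5, base=0.5, cap=30.0):
        """Retry a flaky call with capped exponential backoff and full jitter,
        so a fleet of clients doesn't hammer a struggling dependency in lockstep."""
        for attempt in range(max_attempts):
            try:
                return do_request()
            except ConnectionError:
                if attempt == max_attempts - 1:
                    raise
                # Sleep for a random duration in [0, min(cap, base * 2**attempt)].
                time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))

Without the jitter and the cap, every client backs off and retries on the same schedule, which can turn a brief blip at a shared bottleneck such as a NAT instance into sustained overload.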

Technically speaking, I’m not sure the NPM drama this week caused any actual production outages, but I feel like I’d be remiss in not mentioning it. Suffice it to say that we can never ignore human factors.

In reading the workshop agenda, it’s interesting to see how they handle human error in drug manufacturing.

Pusher shares a detailed description of their postmortem incident analysis process. I like that they front-load a lot of the information gathering and research process before the in-person review. They also use a tool to ensure that their postmortem reports have a consistent format.
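
As a toy example of that kind of consistency check (the section names here are hypothetical, not Pusher’s actual template), a few lines of Python can flag a report that’s missing required sections:

    import sys

    # Hypothetical required headings for a postmortem report.
    REQUIRED_SECTIONS = ["Summary", "Timeline", "Impact",
                         "Contributing Factors", "Remediation Items"]

    def missing_sections(path):
        with open(path, encoding="utf-8") as f:
            text = f.read().lower()
        return [s for s in REQUIRED_SECTIONS if s.lower() not in text]

    if __name__ == "__main__":
        missing = missing_sections(sys.argv[1])
        if missing:
            sys.exit("Postmortem is missing sections: " + ", ".join(missing))
        print("All required sections present.")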

Outages

  • Telstra
    • This makes the third major outage (plus a minor one) this year. Customers are getting pretty mad.

  • Gliffy
    • Gliffy suffered a heartbreaking 48-hour outage after an administrator mistakenly deleted the production db. They do have backups, but the backups take a long time to restore.

      Thanks to gabinante on hangops #incident_response for this one.

  • The Division (game)
  • DigitalOcean
    • A day after the incident, DigitalOcean posted an excellent postmortem. I like that they clearly explained the technical details behind the incident. While they mentioned the DDoS attack, they didn’t use it to try to avoid taking responsibility for the downtime. Shortly after this was posted, it spurred a great conversation on hangops #incident_response that included the post’s author.

      Thanks to rhoml on hangops #incident_response for this one.

SRE Weekly Issue #15

A packed issue this week with a really exciting discovery/announcement up top. Thanks to all of the awesome folks on the hangops slack and especially #incident_response for tips, feedback, and general awesomeness.

Articles

I’m so excited about this! A group of folks, some of whom I know already and the rest of whom I hope to know soon, have started the Operations Incident Board. The goal is to build up a center of expertise in incident response that organizations can draw on, including peer review of postmortem drafts.

They’ve also started the Postmortem Report Reviews project, in which contributors submit “book reports” on incident postmortems (both past and current). PRs with new reports are welcome, and I hope you all will consider writing at least one. I know I will!

This is exactly the kind of development I was hoping to see in SRE and I couldn’t be happier. I look forward to supporting the OIB project however I can, and I’ll be watching them closely as they get organized. Good luck and great work, folks!

Thanks to Charity Majors for pointing OIB out to me.

Here’s a postmortem report from Gabe Abinante covering the epic EBS outage of 2011. It’s a nice summary with a few links to further reading on how Netflix and Twilio dodged impact through resilient design. Heroku, my employer (though not at the time), on the other hand, had a pretty rough time.

A nice summary of a talk on Chaos Engineering given at QCon by Rachel Reese.

One engineer’s guide to becoming comfortable with being on call, and some tips on how to get there.

Another “human error” story, about a recently released report on a 2015 train crash. Despite the article’s title, I feel like it primarily tells a story of a whole bunch of stuff that went wrong that was unrelated to the driver’s errors.

A nice little analysis of a customer’s sudden performance nosedive. It turned out that support had asked them to turn on debug logging and then forgot to tell them to turn it off.

In this case, the outages in question pertain to wireless phone operators. I wonder if Telstra was one of the companies surveyed.

Reliability risk #317: failing to invalidate credentials held by departing employees, especially when they’re fired.

Say… wouldn’t it be neat to start a Common Reliability Risks Database or something?

As the title suggests, this opinion piece calls into question DevOps as a panacea. Some organizations can’t afford the risk involved in continuous delivery, because they can’t survive even a minor outage, even one that can be rolled back or forward quickly. These same organizations probably also can’t avail themselves of chaos engineering — at least not in production.

Fail fast and roll forward simply aren’t sustainable in many of today’s most core business applications such as banking, retail, media, manufacturing or any other industry vertical.

Thanks to DevOps Weekly for this one.

Outages

  • Datadog
  • Tinder
    • Predictable hilarity ensued on Twitter.

  • HipChat Status
    • Atlassian’s HipChat has had a rocky week with several outages. They posted an initial description of the problems and a promise of a detailed postmortem soon.

      Thanks to dbsmasher on hangops #incident_response for the tip on this one.

  • Data Centre Outage Causes Drama For Theatre Ticket Seller
    • A switch failure takes out a ticket sales site. It’s interesting how many companies whose core business isn’t operations still try to be ops companies. I hope we see that kind of practice diminish in favor of increased adoption of PaaS/IaaS.

  • Telstra
    • Another major outage for Telstra, and they’re offering another free data day. Perhaps this time they’ll top 2 petabytes. This article describes the troubles people saw during the last free data day including slow speeds and signal drops.

  • Squarespace
    • Water main break in their datacenter.

      Thanks to stonith on hangops #incident_response.
