SRE Weekly Issue #44

SPONSOR MESSAGE

DevOps Executive Webinar: Security for Startups in a DevOps World. http://try.victorops.com/l/44432/2016-10-12/fgh7n3

Articles

With all the “NoOps” and “Serverless” stuff floating around, do we need ops? Susan Fowler says not necessarily, but that we do need ops skills.

VictorOps is gathering data for the 2016 edition of their yearly State of On-Call Report (2015’s if you missed it). Please click the link above and take the survey if you have a moment! The report provides some pretty awesome stats that we can all use to improve the on-call experience at our organizations.

This survey is published by my sponsor, VictorOps, but their sponsorship did not influence its inclusion in this issue.

Scalyr writes about cascading failure scenarios, using the DynamoDB outage of September 20th, 2015 (no, not this year’s September DynamoDB outage) as a case study.

Capacity problems are a common type of failure, and often they’re of this “cascading” variety. A system that’s thrashing around in a failure state often uses more resources than it did when it was healthy, creating a self-reinforcing overload.
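
One common way that feedback loop gets started is naive retry behavior: when a dependency slows down, clients retry immediately and multiply the load right when the system can least absorb it. As a generic illustration (mine, not from the Scalyr article), here is a minimal sketch of capped exponential backoff with full jitter, a standard way to blunt that amplification; the function names are illustrative:

    import random
    import time

    def backoff_delay(attempt, base=0.1, cap=10.0):
        """Capped exponential backoff with full jitter: sleep a random amount
        up to min(cap, base * 2**attempt) so retries spread out instead of
        hammering an already-overloaded dependency in lockstep."""
        return random.uniform(0, min(cap, base * 2 ** attempt))

    def call_with_retries(do_request, max_attempts=5):
        """Retry a flaky call a bounded number of times, backing off between tries."""
        for attempt in range(max_attempts):
            try:
                return do_request()
            except Exception:
                if attempt == max_attempts - 1:
                    raise  # give up rather than retrying forever
                time.sleep(backoff_delay(attempt))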

Check it out! Apparently this newsletter started around the same time that SRE Weekly did. Content includes a lot of really nifty stuff about Linux system administration.

I previously linked to a two-part series by Mathias Lafeldt on writing postmortems. At my request, Jimdo graciously agreed to release their (previously) internal postmortem about the incident that prompted him to write the articles. Thanks so much, Mathias!

A review of what sounds like a really interesting play about just culture, blameless retrospectives, and restorative justice in aviation, based on real events.

Thanks to Mathias Lafeldt for this one.

When you’re big like Facebook, sometimes reliability means essentially building your own Internet.

If you haven’t had time to watch Matt Ranney’s talk on Scaling Uber to 1000 Microservices, check out this detailed summary. Growing your engineering force 10x over a year while still keeping the service reliable is a pretty impressive feat.

PagerDuty shares some tips for lowering your MTTR, but first they ask the important question: how are you measuring MTTR, and is lowering it meaningful?
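
To make the measurement question concrete, here is a small sketch of my own (not from the PagerDuty post) showing how two reasonable definitions of MTTR over the same hypothetical incidents produce different numbers, depending on whether the clock starts at detection or at acknowledgement:

    from datetime import datetime as dt

    # Hypothetical incident records; all timestamps are illustrative.
    incidents = [
        {"detected": dt(2016, 10, 1, 3, 0), "acked": dt(2016, 10, 1, 3, 20), "resolved": dt(2016, 10, 1, 4, 0)},
        {"detected": dt(2016, 10, 5, 14, 0), "acked": dt(2016, 10, 5, 14, 2), "resolved": dt(2016, 10, 5, 14, 30)},
    ]

    def mean_minutes(deltas):
        deltas = list(deltas)
        return sum(d.total_seconds() for d in deltas) / len(deltas) / 60

    # MTTR measured from detection to resolution...
    mttr_from_detection = mean_minutes(i["resolved"] - i["detected"] for i in incidents)
    # ...versus from acknowledgement to resolution. Same incidents, different number.
    mttr_from_ack = mean_minutes(i["resolved"] - i["acked"] for i in incidents)

    print(mttr_from_detection, mttr_from_ack)  # 45.0 vs. 34.0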

David Christensen riffs on Charity Majors’s concept of “3 Types of Code”: “no code” (SaaS, PaaS, etc.), “someone else’s code”, and “your code”. Try to spend as much development time as possible writing code that supports what makes your business unique (your key differentiator).

Julia Evans is back with a write-up of the lessons she’s learned as she’s begun to gain an understanding of operations. My favorite bit:

Stage 2.5: learn to be scared
I think learning to be scared is a really important skill – you should be worried about upgrading a database safely, or about upgrading the version of Ruby you’re using in production. These are dangerous changes!

SysAdvent is happening again this year! Click the link above if you’d like to propose an article or volunteer to be an editor.

Outages

  • United Airlines
  • Yahoo mail
  • Google Cloud
  • FNB (South Africa bank)
  • GlobalSign (SSL certificate authority)
    • GlobalSign had a major problem in their PKI that resulted in all of their certificates being treated as revoked. They’ve posted a detailed postmortem that’s pretty heavy on deep SSL details, but the basic story is that their OCSP service misinterpreted a routine action as a request to revoke their intermediate CA certificate. Yikes. I love this quote and the mental image of a panicked party with streamers and ribbon-cutting that it conjures up:

      Our AlphaSSL and CloudSSL customers had to wait a few hours more while an emergency key ceremony was held to create alternatives.

SRE Weekly Issue #43

Dreamforce this past week was insanely busy but tons of fun. My colleague Courtney Eckhardt and I gave a shorter version of our SRECon16 talk on SRE and human factors.

SPONSOR MESSAGE

Downtime costs a lot more than you think. Learn why – and how to make the case for Real-time Incident Management. http://try.victorops.com/l/44432/2016-07-13/dpn2qw

Articles

A theme here in the past few issues has been the insane growth in complexity in our infrastructures. Honeycomb is a new tool-as-a-service to help you make sense of that complexity through event-based introspection. Think ELK or Splunk, but opinionated and way faster. The goal is to give you the ability to reach a state of flow in asking and answering questions about your infrastructure, so you can understand it more deeply, find problems you didn’t know you had, and discover new questions to ask. Here’s where I started getting really interested:

We have optimized Honeycomb for speed, for rapid iteration, for explorability. Waiting even 10-15 seconds for a view to load will cut your focus, will take you out of the moment, will break the spell of exuberant creative flow.

Mathias Lafeldt rocks it again, this time with a great essay on finding root causes for an incident. I love the idea of using the term “Contributing Conditions” instead. And the Retrospective Prime Directive is so on-point I’ve gotta re-quote it here:

Regardless of what we discover, we understand and truly believe that everyone did the best job they could, given what they knew at the time, their skills and abilities, the resources available, and the situation at hand.

This paper review by The Morning Paper reminds us of the importance of checking return codes and properly handling errors. Best part: solid statistical evidence.

A followup note on Rachel Kroll’s hilarious and awesome story about 1213486160 (a.k.a. “HTTP”). Basically, if you see a weird number showing up in your logs, it might be a good idea to try interpreting it as a string!
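
A quick way to try that (a generic snippet of mine, not code from Rachel’s post) is to reinterpret the integer’s bytes as ASCII:

    def int_as_ascii(n, byteorder="big"):
        """Reinterpret an integer's bytes as text; handy when a mystery number
        in your logs is really a string read from the wrong offset."""
        raw = n.to_bytes((n.bit_length() + 7) // 8, byteorder)
        return raw.decode("ascii", errors="replace")

    print(int_as_ascii(1213486160))            # 'HTTP'
    print(int_as_ascii(1213486160, "little"))  # 'PTTH'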

A solid basic primer on Netflix’s chaos engineering tools, with some info about the history and motivation behind them. I love the bit about how they ran into issues when Chaos Monkey terminated itself. Oops.

This article should really be titled, Make Sure Your DNS Is Reliable! It’s easy to forget that all the HA in the world won’t help your infrastructure if the traffic never reaches it due to a DNS failure. And here’s a really good corollary:

Even if your status site is on a separate subdomain, web host, etc… it will still be unavailable if your DNS goes down.

We’ve had a couple of high-profile airline computer system failures this year. Here’s an analysis of the difficulty companies are having bolting new functionality onto systems from the 90s and earlier, even as those systems try to support higher volume due to airline mergers. You may want to skip the bits toward the end that read like an ad, though.

I don’t think I’ve ever been at a company with a dedicated DBA role. It’s becoming a thing of the past, and instead ops folks (and increasingly developers) are becoming the new DBAs. Charity Majors tells us that we need to apply proper operational principles to our datastores: one change at a time, proper deploy and rollback plans, etc.

I love this idea: it’s an exercise in building your own command-line shell. It’s important to have a good grounding in the fundamentals of how processes get spawned and IO works in POSIX systems. Occasionally that’s the only way you can get to the root cause(s) of a really thorny incident.
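
If you want a taste of those fundamentals without writing a whole shell, here’s a minimal fork/exec/wait loop (my own POSIX sketch, not code from the linked exercise):

    import os
    import shlex

    def run(command_line):
        """Spawn a child the way a shell does: fork, exec in the child,
        wait in the parent. POSIX only."""
        argv = shlex.split(command_line)
        pid = os.fork()
        if pid == 0:
            # Child: replace this process image with the requested program.
            try:
                os.execvp(argv[0], argv)
            finally:
                os._exit(127)  # exec failed; never fall through into the parent's code
        # Parent: block until the child exits, then extract its exit status.
        _, status = os.waitpid(pid, 0)
        return os.WEXITSTATUS(status)

    print(run("ls -l /tmp"))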

Outages

SRE Weekly Issue #42

SPONSOR MESSAGE

[WEBINAR] The Do’s and Dont’s of Post-Incident Analysis. Join VictorOps and Datadog to get an inside look at how to conduct modern post-incident analysis. Sign up now: http://try.victorops.com/l/44432/2016-09-21/f8k6rn

Articles

Netflix’s API has an advanced circuit-breaker system including a defined automated fallback plan for every dependency.
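
For readers who haven’t met the pattern, here’s a deliberately tiny, generic sketch of a circuit breaker with a fallback (my own illustration of the concept, not Netflix’s Hystrix implementation; the remote_recommendation_service in the usage lines is a made-up dependency): after enough consecutive failures it stops calling the dependency and serves the predefined fallback until a cool-off period passes.

    import time

    class CircuitBreaker:
        """Toy circuit breaker: after max_failures consecutive errors, skip the
        real call and return the fallback until reset_after seconds have passed."""

        def __init__(self, call, fallback, max_failures=5, reset_after=30.0):
            self.call, self.fallback = call, fallback
            self.max_failures, self.reset_after = max_failures, reset_after
            self.failures, self.opened_at = 0, None

        def __call__(self, *args, **kwargs):
            if self.opened_at is not None:
                if time.time() - self.opened_at < self.reset_after:
                    return self.fallback(*args, **kwargs)  # circuit open: fail fast
                self.opened_at, self.failures = None, 0    # cool-off over: try again
            try:
                result = self.call(*args, **kwargs)
                self.failures = 0
                return result
            except Exception:
                self.failures += 1
                if self.failures >= self.max_failures:
                    self.opened_at = time.time()
                return self.fallback(*args, **kwargs)

    get_recommendations = CircuitBreaker(
        call=lambda user: remote_recommendation_service(user),  # hypothetical dependency
        fallback=lambda user: [],                               # defined fallback: empty list
    )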

This is Sidney Dekker’s course on Just Culture, including a full explanation of Restorative Just Culture. I especially like the concept of Second Victims of incidents: the practitioner (e.g. an engineer) who was directly involved in the incident.

 Your practitioners are not necessarily the cause of the incident. They themselves are the recipients of trouble deeper inside your organization.

Think you know how TCP works? There are sneaky edge-cases that can cause an outage if you don’t know about them. Example: a MySQL replicating slave will happily report “0 seconds behind master” indefinitely while waiting on a connection to the master that’s long-since silently failed.
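
One generic defense against half-open connections like that (my own illustration of the TCP-level mechanism; MySQL also has its own slave_net_timeout setting for this) is to enable keepalive probes so the kernel eventually notices that the peer is gone:

    import socket

    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Ask the kernel to probe an idle connection and tear it down if the peer is dead.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Linux-specific tuning: start probing after 60s idle, probe every 10s, give up after 5 misses.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)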

Etsy shares the operational issues they encountered as they moved toward an API/microservice architecture. I especially like the detail about limiting concurrent in-flight sub-requests per root request across the entire request tree.
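
As a rough, in-process sketch of that pattern (my own illustration, not Etsy’s implementation): carry a shared concurrency budget with the request, so every fan-out made on behalf of that request draws from the same limit.

    import asyncio

    class RequestContext:
        """Carries a per-root-request budget for concurrent sub-requests.
        In a real system the remaining budget would be propagated across
        service boundaries (e.g. in a header); this sketch stays in-process."""

        def __init__(self, max_inflight_subrequests=10):
            self.sem = asyncio.Semaphore(max_inflight_subrequests)

        async def subrequest(self, coro):
            async with self.sem:   # wait if this request's budget is exhausted
                return await coro

    async def fetch(name):
        await asyncio.sleep(0.01)  # stand-in for a network call
        return name

    async def handle_root_request():
        ctx = RequestContext(max_inflight_subrequests=4)
        # 20 sub-requests, but never more than 4 in flight for this root request.
        return await asyncio.gather(
            *(ctx.subrequest(fetch("dep-%d" % i)) for i in range(20))
        )

    asyncio.run(handle_root_request())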

My co-worker at Heroku, Stella Cotton, gave this rockin’ keynote at RailsConf 2016. She covers load testing and performance bottleneck diagnosis, and most of what she says applies not just to Rails.

Here’s a summary of a talk about Uber’s system that stores live location data of riders and drivers. They run Cassandra in containers managed by Mesos.

With an MVP, you’re just trying to get into the market and test the waters as quickly as possible, so there’s a temptation to leave considerations like scalability for later. But what if your MVP is unexpectedly successful?

Systems We Love is a new conference modeled after the popular Papers We Love. It looks really interesting, and they’re saying they already have a lot of great proposals.

Travis CI shares more about a major outage last month.

A nice incident response primer from Scalyr.

Outages

SRE Weekly Issue #41

SPONSOR MESSAGE

[WEBINAR] The Do’s and Dont’s of Post-Incident Analysis. Join VictorOps and Datadog to get an inside look at how to conduct modern post-incident analysis. Sign up now: http://try.victorops.com/l/44432/2016-09-21/f8k6rn

Articles

Trestus is a new tool to generate a status page from a Trello board. Neat idea!

Your card can include markdown like any other Trello card and that will be converted to HTML on the generated status page, and any comments to the card will show up as updates to the status (and yes, markdown works in these too).

This issue of Production Ready by Mathias Lafeldt is an excellent intro to writing post-incident analysis documents. I can’t wait for the sequel, in which he’ll address root causes.

Adrian Colyer of The Morning Paper gave a talk at Operability.IO with a round-up of his favorite write-ups of operations-related papers. I really love the fascinating trend of “I have no idea what I’m doing” — tools that help us infer interconnections, causality, and root causes in our increasingly complex infrastructures. Rather than try (and in my experience, usually fail) to document our massively complicated infrastructures in the face of increasing employee turnover rates, let’s just accept that this is impossible and write tools to help us understand our systems.

And for fun, a couple of amusing tweets I came across this week:

Me: oh sorry, I got paged
Date: are you a doctor?
Me: uh
Nagios: holy SHIT this cert expires in SIXTY DAYS
Me: …yes

— Alice Goldfuss (@alicegoldfuss) (check out her awesome talk at SRECon16 about the Incident Command System)

We just accidentally nuked all our auto-scaling stuff and everything shutdown. We’re evidently #serverless now.

— Honest Status Page (@honest_update)

@mipsytipsy @ceejbot imagine you didn’t know anything about dentistry and decided we don’t need to brush our teeth any more. That’s NoOps.

— Senior Oops Engineer (@ReinH)

Netflix documents the new version of their frontend gateway system, Zuul 2. They moved from blocking IO to async, which allows them to handle persistent connections from clients and better withstand retry storms and other spikes.

The advantages of async systems sound glorious, but the above benefits come at a cost to operations. […] It is difficult to follow a request as events and callbacks are processed, and the tools to help with debugging this are sorely lacking in this area.

In last week’s issue, I linked to a chapter from Susan Fowler’s upcoming book on microservices. Here’s an article summarizing her recent talk at Velocity about the same subject: how to make microservices operable. She should know: Uber runs over 1300 microservices. Also summarized is her fellow SRE Tom Croucher’s keynote talk about outages at Uber.

In this first of a series, GitHub lays out the design of their new load balancing solution. It’s pretty interesting due to a key constraint: git clones of huge repositories can’t resume if the connection is dropped, so they need to avoid losing connections whenever possible.

I’m embarrassed to say that I haven’t yet found the time to take my copy of the SRE book from its resting place on my shelf, but here’s another review with a good amount of detail on the highlights of the book.

Live migration of VMs while maintaining TCP connections makes sense — the guest’s kernel holds all the connection state. But how about live migrating containers? The answer is a Linux feature called TCP connection repair.
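
For the curious: TCP_REPAIR (used by CRIU) lets a privileged process put a socket into repair mode and then dump or restore its sequence numbers and queues. Here’s a minimal sketch of just that first step, assuming Linux and CAP_NET_ADMIN; Python’s socket module doesn’t export the constant, so it’s defined by hand from linux/tcp.h:

    import socket

    TCP_REPAIR = 19  # from <linux/tcp.h>; not exported by Python's socket module

    def enter_repair_mode(sock):
        """Put a TCP socket into repair mode so its state (sequence numbers,
        send/receive queues) can be dumped and later restored elsewhere.
        Requires CAP_NET_ADMIN; Linux only."""
        sock.setsockopt(socket.IPPROTO_TCP, TCP_REPAIR, 1)

    def leave_repair_mode(sock):
        sock.setsockopt(socket.IPPROTO_TCP, TCP_REPAIR, 0)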

The SSP story (linked here two issues ago) is getting even more interesting. They apparently decided not to switch to their secondary datacenter in order to avoid losing up to fifteen minutes’ worth of data, instead taking a week+ outage.

While, in SRE, we generally don’t have to worry about our deploys literally blowing up in our faces and killing us, I find it valuable to look to other fields to learn from how they manage risk. Here’s an article about a tragic accident at UCLA in which a chemistry graduate student was severely injured and later died. A PhD chemist I know mentioned to me that the culture of safety in academia is much less rigorous than in the industry, perhaps due in part to a differing regulatory environment.

Outages

SRE Weekly Issue #40

SPONSOR MESSAGE

Take a bite out of all things DevOps with video series, DevChops. Get easy to digest explanations of most-used DevOps terms and concepts in 90 seconds or less. Watch now: http://try.victorops.com/l/44432/2016-09-16/f7gpzp

Articles

Adrian Colyer summarizes James Hamilton’s 2007 paper in this edition of The Morning Paper. There’s a lot of excellent advice here — some I knew explicitly, some I mostly implement without thinking about it, and some I’d never thought about. The paper is great, but even if you don’t have time to read it, Colyer’s digest version is well worth a browse.

Susan Fowler (featured here a couple weeks ago) has a philosophy of failure in her life that I find really appealing as an SRE:

We can learn something about how to become the best versions of ourselves from how we engineer the best complex systems in the world of software engineering.

And while we’re on the subject of Susan Fowler, she’s got a book coming soon about writing reliable microservices. In the linked ebook-version of the second chapter, she goes over the requirements for a production-ready microservice: stability, reliability, scalability, fault-tolerance, catastrophe-preparedness, performance, monitoring, and documentation.

Pinterest explains how they broke their datastore up into 4096(!) shards on 4 pairs of MySQL servers (later 8192 on 8 pairs). It’s an interesting approach, although in essence it treats MySQL as a glorified key-value store for JSON documents.
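
The usual trick in this kind of scheme is to embed the shard number in the object ID itself, so any ID can be routed to its shard without a directory lookup. A sketch of the general idea follows; the bit layout is illustrative, not necessarily Pinterest’s exact format:

    SHARD_BITS = 13     # enough for 8192 logical shards
    LOCAL_ID_BITS = 50  # room for ids local to each shard

    def make_id(shard_id, local_id):
        """Pack the shard number into the high bits of a 64-bit object id."""
        return (shard_id << LOCAL_ID_BITS) | local_id

    def shard_for(object_id):
        """Recover the shard directly from the id, with no lookup table."""
        return object_id >> LOCAL_ID_BITS

    pin_id = make_id(shard_id=4093, local_id=123456)
    assert shard_for(pin_id) == 4093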

Do you use Kerberos or similar to authenticate your SSH users? What happens if there’s an incident that’s bad enough to take down your auth infrastructure? I hadn’t realized that openSSH supports CAs, but Facebook shows us that PKI support is easy and feature-rich.

Another project from Facebook: a load balancer for DHCP. Facebook found that anycast was not distributing requests evenly across DHCP servers, so they wrote a load balancer in Go.

In post-incident analysis, the fundamental attribution error is the tendency to blame the people involved when someone else caused an incident, but to blame the system when we ourselves were involved. This 4-minute segment from the Pre-Accident Podcast explains fundamental attribution error in more detail.

411 is Etsy’s new tool that runs scheduled queries against Elasticsearch and alerts on the result.
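
The core loop of a tool like that is pleasantly small. Here’s a hedged sketch (my own, not 411’s code) using the elasticsearch Python client: run a saved query on a schedule and alert whenever it returns hits; the index name, query, and cluster address are all made up.

    import time

    from elasticsearch import Elasticsearch  # third-party: pip install elasticsearch

    es = Elasticsearch(["http://localhost:9200"])  # hypothetical cluster address

    QUERY = {
        "query": {
            "bool": {
                "must": [{"match": {"message": "segfault"}}],              # illustrative saved search
                "filter": [{"range": {"@timestamp": {"gte": "now-5m"}}}],  # only the last 5 minutes
            }
        }
    }

    def alert(count):
        print("ALERT: %d matching log lines in the last 5 minutes" % count)  # stand-in for paging

    while True:
        result = es.search(index="logs-*", body=QUERY)
        total = result["hits"]["total"]  # a dict with a "value" key in newer ES versions
        count = total["value"] if isinstance(total, dict) else total
        if count:
            alert(count)
        time.sleep(300)  # re-check every 5 minutes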

Outages

  • ING Bank
    • Here’s a terribly interesting root cause: during a test, the fire response system emitted an incredibly loud sound while dumping an inert gas into the datacenter — probably loud enough to cause hearing damage. This caused failure in multiple key spinning hard drives. Remember shouting at hard drives?
  • Heroku Status
    • Heroku released a followup with details on last week’s outage.

      Full disclosure: Heroku is my employer.

  • Gmail for Work
  • Microsoft Azure
    • Major outage involving most DNS queries for Azure resources failing. Microsoft posted a report including a root cause analysis.