SRE WEEKLY – Page 85 – scalability, availability, incident response, automation

SRE Weekly Issue #70

lex

April 30, 2017

Articles

Enabling DNS split authority with OctoDNS

GitHub has released OctoDNS, their tool for synchronizing DNS across multiple providers. Shortly after the Dyn outage last fall (covered here), they still only had one DNS provider (source: direct observation). I suspected that this may have had to do with complication in keeping records synched across two providers – perhaps that’s why they created OctoDNS.

Introducing Bolt: On Instance Diagnostic and Remediation Platform

Bolt is Netflix’s “event driven diagnostic and remediation platform”, although it actually seems like a full-blown remote execution system for large fleets of servers.

Incident management at Google — adventures in SRE-land

A Google SRE takes us through their first on-call shift including running incident command for a production incident. I like the emphasis on a blameless postmortem.

Onboarding, On-Call and Learning

Pete Shima received some questions about onboarding SREs, and lucky us, he decided to answer them publicly. My favorite section is the one about connecting a new SRE to people across the company. I find that solid connections to folks in various positions are vital to getting my job done well. Thanks to Pete for the SRE Weekly mention!

Take A Moment To Refocus – Salesforce + Open Source = ❤

Salesforce has a humongous infrastructure, and they needed a tool to help visualize data from lots of monitoring systems. They created Refocus to serve that need, and they open sourced it. They had three goals: gather data from all of the monitoring systems, on-board new services quickly, and visualize data in a way that makes sense for each service.

Full disclosure: Salesforce (parent company of my employer, Heroku), is mentioned.

New zine: let’s learn tcpdump!

Tcpdump is a critical tool for debugging thorny network issues. Julia Evans created a new zine to help you learn the basics, although if her other zines are any indication, even a pro may learn a new trick or two. The zine is $10 now and will be available for free at some point in the future.

Sharks Want to Bite Google’s Undersea Cables

Turns out that sharks are a reliability risk. And not just those WFLB.

Why Code Gets Released too Early (and how to fix It)

From their Global Developer Survey, GitLab learned that it’s common for developers to release code before it’s production-ready in response to organizational pressures.

Code released before it’s ready might be good for meeting deadlines, but that’s about all it’s good for.

Four nines & failure rates – will the cloud ever cut it for transactional banking?

Here’s a pretty excellent analysis of why adopting the cloud can be difficult for banks. Just skip past the bit with the incorrect uptime calculation, since four nines of uptime actually equates to about 53 minutes’ downtime per year, not 9 hours.

Outages

London Marathon Donations
- Ebay and Virgin Money Giving both went down under the load as many flocked to place donations before the London Marathon.
CARLI
- CARLI is the Consortium of Academic Research Libraries in Illinois. I included this outage because of the short but sweetly personal postmortem from their network engineer.
Instagram
Reddit
- Sorry for the extended outage there. We failed back the maintenance performed earlier tonight. We’ll provide a post-mortem at a later date.

SRE Weekly Issue #69

lex

April 23, 2017

General

Comments

View on sreweekly.com

Articles

The tale of the flying gurney: MRI machines can turn objects into projectiles

In February of 2016, a metal hospital gurney was inadvertently wheeled* into an MRI room, resulting in a costly near-miss accident. Brigham and Women’s Hospital posted about the mishap on their Safety Matters blog and also released a Q&A with their chief quality officer about their dedication to an open and just culture.

If an employee at Brigham makes a mistake that anyone else could make, we will work on improving the system, rather than punishing the employee. We believe that in every circumstance involving “human error” there are systemic opportunities for mitigating reoccurrence.

* Yes, I used the passive voice on purpose. See what I did there?

Lies My Parents Told Me (About Logs)

Sometimes logs help us prevent outages or discover a cause. But raise your hand if you’ve seen logging cause an outage. Yeah, me too.

Syscall Auditing at Scale

Traditionally, auditd, together with Linux’s system call auditing support, has been used as part of security monitoring. Slack developed go-audit so that they could use system call auditing as a general monitoring tool. I can think of plenty of outages during which I’d have loved to be able to query system call patterns!

Introducing Stormcrow

Dropbox has some pretty complex needs around feature gating. Apparently existing tools couldn’t satisfy their use case so they wrote and released a tool with sophisticated user segmentation support.

Niffy: Perceptual Diffing to Catch Invisible Bugs

Why depend on fallible QA testing to spot regressions in a web UI? Computers are so much better at that kind of thing. Niffy spots the pixel changes between old and new code so you can see exactly what regressed before putting it in front of your users.

Online migrations at scale

In this beautifully-illustrated article, Stripe engineer Jacqueline Xu explains how Stripe safely rolled out a major database schema upgrade. Many code paths had to be refactored, and they took a methodical, incremental approach to avoid downtime. Thanks to this article, I now know about Scientist and can’t wait to use it.

Scaling your API with rate limiters

Speaking of Stripe, they have another polished post on how and why to add load shedding to your API.

GitHub – github/scientist: A Ruby library for carefully refactoring critical paths.

Scientist is such an awesome idea. The idea is to try out a new code path and see if it produces the same result as the old code path. It only returns the new code path, so you know you can safely prove to yourself whether the new code path is safe before exposing users to it.

AWS status: The complete guide to monitoring status on the web’s largest cloud provider

I’m including this article at least in part due to its mention of the February S3 outage. AWS had difficulty reporting the outage on its status site because of a dependency on S3.

CASE STUDY: Etsy, Sprouter and Conway’s Law

Conway’s Law is extremely important to us as SREs. As we can see in the example of Sprouter, a poor organizational structure can produce unreliable software. My fellow SRE, Courtney Eckhardt, loves to say, “My job is applying Conway’s Law in reverse.”

Outages

AT&T VoIP
- I received an anonymous anecdote from an SRE Weekly reader (thanks!) that this affected at least one hospital, with the result that critical phone communication was significantly hampered. What happened to the good old mostly-reliable traditional phone system? Irony: in the reader’s case, an announcement about the failure was sent out via email.
Three
- This is the second case this year of a telecom outage resulting in SMSes being delivered to the wrong people. Am I the only one that finds this an extremely surprising and concerning failure mode?
eBay
Red Hat

SRE Weekly Issue #68

lex

April 17, 2017

General

Comments

View on sreweekly.com

Articles

Increment: On-Call

The big story this week is the release of the inaugural issue of Increment, a newsletter by Stripe, edited by Susan Fowler. They bill it as “A digital magazine about how teams build and operate software systems at scale” and the first issue, dedicated to on-call, certainly delivers. Below, I’ll include my short take on each article in the issue.

What happens when the pager goes off?

Increment interviewed over thirty companies to build a picture of the common practices in incident response. I’m actually pretty surprised to hear that “it turns out that they all follow similar (if not completely identical) incident response processes”, but apparently the commonalities don’t stop at just process:

Slack and PagerDuty appear to be two points of failure across the entire tech industry

Bonus content: Julia Evans shared her notes on Twitter.

Who owns on-call?

Next up, Increment addresses the dichotomy of ops teams versus developers on call for their code. It turns out that the latter practice is more prevalent than I’d realized.

Crafting sustainable on-call rotations

After laying a solid groundwork of suggestions for avoiding burn-out in on-call, this next Increment article raises a really important point: on-call affects people differently based on privilege. Example: single parents are going to have a much harder time of it.

[…] if you set up an on-call rotation with a schedule or intensity that assumes the participants have no real responsibilities outside of the office, you are limiting the people who will be able to participate on your team.

The benefits of transparency

Remember a couple of months back when GitLab live-streamed their incident response? Increment caught up with their CEO to give us this in-depth interview about their radical transparency.

On-call at any size

Increment shares tips and key practices for setting up on-call, targeted to companies of size ranges varying from 0-10 employees all the way up to 10000+.

How should startups approach on-call and incident response?

Increment rounds out their issue with advice in the form of quotes from six of the companies they interviewed.

Introducing Honeycomb: A New Debugging Service to Address Flawed Systems

The other big news of the week is the official launch of Honeycomb.io. If you haven’t had a chance to check it out yet, here’s an introduction, and you can also sign up for a free one-month trial.

Outages

Melbourne IT
- A DDoS took out their DNS service, taking out customer domains and also sites they they host for customers. While this is a news article and not a formal post-analysis, it does include some pretty interesting technical detail from an interview with their CTO. I’m not sure that he did himself any favors by quoting the definition of their SLA:
  
  “People look at 99.9 per cent and think that’s seconds of downtime, but you work it out and it’s 45 minutes.”
Google Cloud HTTP(S) Load Balancer
- Google Cloud LB threw 502s for 25% of requests in a 22-minute period. They released this post-analysis 7 days later, and I have to say, the root cause is pretty interesting – and sadly familiar.
  
  A bug in the HTTP(S) Load Balancer configuration update process caused it to revert to a configuration that was substantially out of date.

SRE Weekly Issue #67

lex

April 9, 2017

General

Comments

View on sreweekly.com

Articles

Risky Business Requires Active Operators

This article is about the risks of automation. While automation can reduce risk by making errors less likely, it also disengages human operators from what’s actually happening, meaning that they’re less likely to catch and correct problems.

Things I Learned Managing Site Reliability for Some of the World’s Busiest Gambling Sites

The author spent seven months sifting through, categorizing, and documenting over 1700 production incidents. The result was impressive: a massive improvement in the SRE team’s incident response process and documentation. It’s got me wondering if we can do something similar at $JOB.

Thanks to Steven Farlie for this one.

Fault Tolerance in a High Volume, Distributed System

A danger of a microservice architecture is that one failing service can affect those that depend on it, even indirectly. The Netflix API handles over 10000 requests per second, and it was carefully designed to avoid the case where a slow dependency breaks unrelated requests.

Without taking steps to ensure fault tolerance, 30 dependencies each with 99.99% uptime would result in 2+ hours downtime/month (99.99% * 30 = 99.7% uptime = 2+ hours in a month).

Human Factors at The Fringe: Nuclear Family

Nuclear Family is an interactive play in which the audience is presented with critical decisions as the characters move inexorably toward a nuclear plant disaster. The goal is to demonstrate local rationality, the principal that people make the best decision they can with the information they have at hand — even if in retrospect that decision led to an adverse outcome.

Owning Your Code is Better

Last year, PagerDuty moved toward giving developers operational responsibilty for the systems they create. The really cool thing about their transition is that they have hard stats on reduction of incidents, decrease in MTTR, and increase in changes deployed to production.

Making On-Call as Painless as Possible – PagerDuty

This post is primarily a new feature announcement, but the intro section is just awesome. I love the idea of designing a system with empathy for your future self that will be on call for it.

Graceful degradation

A short but enlightening blog post on designing systems to degrade gracefully.

when weird stuff happens, make sure it doesn’t cause harm you didn’t expect or plan for.

Outages

Razer
- Notably, this outage reset the careful customizations that people had made to their peripherals.
  Thanks to Steven Farlie for this one.
Heroku
- Heroku had a 2-day long disruption that spanned 3 status site posts.
  Full disclosure: Heroku is my employer.
DigitalOcean
- DigitalOcean accidentally deleted their primary database, resulting in a ~5-hour outage.
  
  A process performing automated testing was misconfigured using production credentials.

SRE Weekly Issue #66

lex

April 2, 2017

General

Comments

View on sreweekly.com

Articles

Sockets in a Bind: Troubleshooting Port Exhaustion in Heroku’s Routing Layer

I hope you’ll enjoy reading this debug session as much as I enjoyed co-writing it. My former co-worker and I did some serious digging to get to the bottom of an unexpected EADDRINUSE that caused a production incident.

Full disclosure: Heroku is my employer.

Redundancy does not imply fault tolerance: analysis of distributed storage reactions to single errors and corruptions

Distributed filesystems provide high availability by duplicating data. In this research paper, the researchers created errorfs, a FUSE plugin that passes through a backing filesystem but introduces single-bit errors. Result: almost all major distributed filesystems missed the error, resulting in corruption.

How we implement Disaster Recovery and High Availability with Postgres on Citus Cloud

The part I like most about this article is the emphasis on the difference between DR and HA.

Full disclosure: Heroku, my employer, is mentioned.

Breaking Things on Purpose

The S3 outage a month ago is a great reminder that chaos experiments are useful not just for taking down parts of our own infrastructure, but also simulating the failure of external dependencies.

HumanOps: It’s time to make DevOps personal

There are several core HumanOps principles, but the most important one to remember is that human health impacts business health.

It’s about time that we recognised that engineers are humans who get stressed and need downtime and that there are strong business as well as social reasons why these needs should be met.

Now Available – Videos from SREcon17

Impressively quickly, USENIX has posted the videos from SRECon17 Americas! I’ve linked to a post by Woodland Hunter, whose review of SRECon I featured here two weeks ago, with links to the talks he reviewed and more.

April Foolishness

The first article is published by my sponsor, VictorOps, but their sponsorship did not influence its inclusion in this issue.

How Incident Management Boosts Employee Morale

PagerDuty theorizes that if developers don’t trust the incident response process, they’ll fear outages and thus be less productive. Proper incident management eases that fear so that they feel safer deploying code.

Reducing Alert Noise: Going from 1000 Alerts to 10 Alerts Overnight

This article could be titled, “Use these three wacky tricks to reduce your pages by 100x!” In all seriousness, the methods described are aggregation (group related alerts), routing (sort alerts by team), and classification (page-worthy alerts versus warnings).

This article is published by my sponsor, VictorOps, but their sponsorship did not influence its inclusion in this issue.

Slow down your internet with tc

A nice primer on using tc to induce latency, which is really important when testing the resiliency of systems to network instability. Thanks, Julia!

Risk Tolerance of Services

Here’s the second half of Stephen Thorne’s commentary on “Embracing Risk”, the third chapter in Google’s SRE book.

Scaling Incident Management

As your company grows in infrastructure size, number of employees, load, and other areas, how do you change your incident response to cope?

Outages

Azure status history
- While following up on an outage from a couple of weeks ago, I came upon this archive of Azure incidents, several with detailed postmortems. It’s a goldmine of interesting RCAs, but I wish they’d give each its own page for easy linking.

← Older Posts

Newer Posts →

SRE Weekly Issue #70

Articles

Outages

SRE Weekly Issue #69

Articles

Outages

SRE Weekly Issue #68

Articles

Outages

SRE Weekly Issue #67

Articles

Outages

SRE Weekly Issue #66

Articles

Outages

Subscribe

RSS

Mastodon

Search Issues

SPONSOR MESSAGE

Articles

Outages

SPONSOR MESSAGE

Articles

Outages

SPONSOR MESSAGE

Articles

Outages

SPONSOR MESSAGE

Articles

Outages

SPONSOR MESSAGE

Articles

Outages

Subscribe

RSS

Mastodon

Search Issues