SRE WEEKLY – Page 76 – scalability, availability, incident response, automation

SRE Weekly Issue #119

lex

April 29, 2018

Articles

If you missed the STELLA Report, released last fall during Velocity NYC by John Allspaw, Richard Cook, and David Woods, this podcast is a great intro. And even if you did catch it, it’s well worth a listen. The Food Fight folks interview John Allspaw and there’s some really stellar (see what I did there) back-and-forth discussion.

Alan Kraft and Nathen Harvey

Why I usually run ‘w’ first when troubleshooting unknown machines

Great idea. This reminds me of a couple jobs back where I rigged up our infrastructure to log every command entered at the shell into a Slack channel.

Rachel Kroll

Google: A Collection of Best Practices for Production Services

This excerpt from the Google SRE book is worth reading if only for this nifty idea for graceful degradation:

Other techniques include […] choosing a consistent subset of clients to receive errors, preserving a good user experience for the remainder.
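
To make the "consistent subset" idea concrete, here is a minimal sketch, not taken from the book. It assumes each request carries a stable client identifier, and the names `bucket_for` and `should_shed` are hypothetical. Hashing the identifier means the same clients are rejected on every request, so the remaining clients never see errors.

```python
import hashlib


def bucket_for(client_id: str, buckets: int = 100) -> int:
    """Map a client ID to a stable bucket in [0, buckets).

    A cryptographic digest is used instead of the built-in hash() so the
    result is identical across processes and restarts.
    """
    digest = hashlib.sha256(client_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def should_shed(client_id: str, shed_percent: int) -> bool:
    """Return True if this client belongs to the subset that receives errors.

    shed_percent is the (hypothetical) share of clients to reject, 0-100.
    Raising it only adds clients to the rejected subset; nobody who was
    already being rejected gets let back in.
    """
    return bucket_for(client_id) < shed_percent
```

A server under overload could call `should_shed(client_id, 10)` before doing any expensive work and return an error immediately for roughly a tenth of its clients.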

Walk, talk and git commit: SRE onboarding (2/2)

In part two of this story, the author causes their first incident (oops) and subsequently significantly improves the performance of the system in question (cool!).

Evan Smith — Hosted Graphite

Blue/Green Deployment: What It Is and How it Reduces Your Risk

An introduction to blue/green deployments, including the risks they help to alleviate.

Mark Henke — Rollout.io

The Mon-ifesto Part 3: Alert Response and Post-Mortem

instead of giving guidelines on how and when to do things, I am going to lay out a few ideas on how to respond to alerts and leave it up to you to decide what methods work best for your app and your organization.

Peter Christian Fraedrich — Capital One

How to get a core dump for a segfault on Linux

Especially in Ubuntu, it’s harder than it used to be to get a core dump, thanks to apport and the like.

Julia Evans

Disaster recovery sites of exchanges under focus as NCDEX volumes fall

NCDEX, a commodity exchange in Mumbai, India, has been operating out of its disaster recovery site for two weeks. Unfortunately, it looks like performance there is not on par with the primary site.

Rajesh Bhayani — Business Standard

Southwest 1380 (engine failure 4/17/2018) ENTIRE EVENT: actual multi-sector ATC audio

You may have heard that a Southwest flight suffered a catastrophic engine failure that left one passenger dead. It happened the day after my family flew Southwest to Orlando. Yikes.

The air traffic control audio recording is incredible to listen to. The pilot who was on the radio was cool and calm as she responded to the incident and arranged for landing and emergency ground crews.

Outages

IRS (US tax system)
- The IRS had to extend the deadline for Americans to file their taxes as a result of an overload and outage in their electronic tax filing system.
TSB (bank)
Heroku
- Also this one.
Google Cloud Pub/Sub
Woolworth’s (grocery store chain)
Discord
Fortnite (game)
- I normally don’t include games, but this outage is amusing because downtime on Fortnite apparently causes a surge in traffic to a popular adult site, threatening their availability.
Telegram
Twitter
TSX (Toronto stock exchange)

SRE Weekly Issue #118

lex

April 16, 2018

Sorry, a little late this week as my family and I head off to Disney World! No issue next Sunday, and I’ll see you all on April 29.

Articles

Thoughts on the role of Incident Commander

I have different thoughts than the author on a few of the points, but it’s very useful and enlightening to see their thought process.

Will Gallego

Testing Fastly services on every PR: How USA TODAY tests their CDN prior to changing it

What it says on the tin. Pretty neat CI setup!

Bridget Lane — USA Today

Full disclosure: Fastly, my employer, is mentioned.

Why “Why-Run” Mode Is Considered Harmful

“Why-run” mode is Chef’s “do nothing” or “dry run” mode. As it turns out, it may not be so useful when trying to figure out what Chef will do.

Julian Dunn — Chef

Sustainable On-Call

Lots of deep thoughts on what makes on-call hard and what we can do about it.

Cody Wilbourn

How did this machine jump into the future?

One little typo is all it took.

Rachel Kroll

Watchdogs vs. Snowflakes

Q&A about a task queuing system that freezes up if the queue is kept full at all times.

But first, on-call: SRE onboarding (skydiving for nerds) 1/2

A new hire tells us what it’s like to get up to speed as an SRE at Hosted Graphite.

Evan Smith — Hosted Graphite

Outages

Discord
Mauritania
- Another one of those “oh look an entire country lost its Internet, this is the first time that’s ever happened!!1” articles.
Twitter

SRE Weekly Issue #117

lex

April 8, 2018

Articles

No, seriously. Root Cause is a Fallacy.

Brilliant, just brilliant. This isn’t just another “there isn’t just one root cause” article to skip over. The author takes time to explain the concept with cogent examples and useful metaphors. This one really caught my eye:

What’s the root cause of success?
[…] When building a successful project, there’s never just one thing that goes right for it to succeed.

Will Gallego

Incident Management – Food Fight Podcast

This episode of Food Fight is an hour-long interview with guests Rob Schnepp, Ron Vidal, and Chris Hawley, the 3 firefighters behind Blackrock 3 Partners. It’s a great intro to the Incident Management System, and well worth a listen.

Shout-out to Maple Player, an Android audio player with a really high-quality tempo increase feature. I was able to listen at 1.5x speed and still understand everything; otherwise, I wouldn’t have had time this week.

Nell Shamrell-Harrington and Nathen Harvey

Billing Incident Post-Mortem

Here’s one from the archives, an incident report from 2013. After a temporary network partition in a Redis cluster, the replicas all tried to resynchronize at once, overloading the master. One of the results was that some customers were charged repeatedly for the same thing.

Twilio

It’s about what broke, not who broke it

You have to design a system such that the natural thing to do yields a good result and doesn’t put anyone in harm’s way.

Rachel Kroll

Consistent Hashing: Algorithmic Tradeoffs

I thought consistent hashing was largely solved. I was wrong! There are some good solutions out there, but you have to evaluate their relative trade-offs and pick the right one for your use case.

Damian Gryski

Full disclosure: Damian Gryski is my coworker at Fastly.
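
For readers new to the topic, here is a minimal sketch of the classic ring-based variant with virtual nodes. It is my own illustration under simple assumptions, not code from the article, and the class name `HashRing` and the `vnodes` parameter are hypothetical. Each node is hashed onto a ring many times, and a key belongs to the first node point found clockwise from the key's hash.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Stable 64-bit hash derived from SHA-256."""
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Ring hash with virtual nodes (illustrative sketch only)."""

    def __init__(self, nodes, vnodes: int = 100):
        if not nodes:
            raise ValueError("at least one node is required")
        # Place every node on the ring `vnodes` times to smooth the distribution.
        points = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._hashes = [h for h, _ in points]
        self._nodes = [n for _, n in points]

    def lookup(self, key: str) -> str:
        """Return the node owning `key`: the first point clockwise from its hash."""
        idx = bisect.bisect_right(self._hashes, _hash(key)) % len(self._hashes)
        return self._nodes[idx]
```

The property that matters operationally is that adding a node only moves keys onto the new node and never shuffles keys between existing ones. The price is memory for the virtual nodes and a binary search per lookup, which is one of the trade-offs you would weigh against other algorithms.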

Computer science faces an ethics crisis. The Cambridge Analytica scandal proves it.

As you read this article, consider the ethical imperative of system reliability, given that reliability can literally be a matter of life and death. That’s only going to be more common in the coming years.

Yonatan Zunger

LogicMonitor Uses Terraform, Packer & Consul for Disaster Recovery Environments

Our service needs to be available 24/7, without question. In order to ensure this happens, the LogicMonitor TechOps team uses HashiCorp Packer, Terraform, and Consul to dynamically build infrastructure for disaster recovery (DR) in a reliable and sustainable way.

Randall Thomson — LogicMonitor

The Travis CI Blog: Incident Post-Mortem and Security Advisory: Data Exposure After travis-ci.com Outage

On Tuesday, 13 March 2018 at 12:04 UTC a database query was accidentally run against our production database which truncated all tables.

Oof. Sorry, Travis folks, but a sincere thanks for sharing your experience with us.

Konstantin Haase — Travis CI

Preliminary Analysis of the Site Reliability Engineer Survey

I like these “preliminary results” better than the kinds of aggregate statistics you normally get from a survey report. There are real quotes from free-form survey answers, including a couple of real gems. There’s a link to download the actual survey report if you’re into that, too.

Dawn Parzych — Catchpoint

Outages

Statuspage.io
Mindbody Online (fitness studio booking service vendor)
Sling TV
Tinder
- The outage seemingly stemmed from privacy fixes Facebook put in place, resulting in a broken OAuth flow.
Microsoft Office 365
Twitter
Multiple Indian Government Websites
Grab
YouTube

SRE Weekly Issue #116

lex

April 1, 2018

Articles

BBC Online Outage on Saturday 19th July 2014

The BBC suffered two simultaneous major outages that broke their online streaming product and forced their website into a limited-functioning mode. This post-incident followup explains what happened and how they dealt with it.

Richard Cooper — BBC

Burst credits of t2 EC2 instances need monitoring

Bursting is a hidden reliability risk that has bitten me hard in the past. Click through for an explanation of the risk and how to mitigate it.

Michael Wittig — Cloudonaut

Observability: A Manifesto

This post has the most concise definition I’ve seen yet for observability, along with a quiz that will tell you whether you’re Doing It Right™.

the power to ask new questions of your system, without having to ship new code or gather new data in order to ask those new questions

Charity Majors — Honeycomb

Four interacting decisions break ssh access

This debugging story is an entertaining read, and it’s also got some useful stuff to watch out for in your systems.

Tick tick tick. Time is hard.

Rachel Kroll

GitHub – ahupowerdns/hello-dns: Hello and welcome to DNS!

Solid knowledge of how DNS works is critical for SREs. This repo contains an introduction to DNS written to be far more approachable than the (many!) DNS RFCs. It’s a work in progress but there’s a lot of good content already.

Bert Hubert and others

The Makeup of Successful Geographically-Distributed SRE Teams: Part 2 | LinkedIn Engineering

Within this post, we’ll discuss growth planning, the challenges associated with being part of a remote team, and some of the unexpected advantages geographically distributed SRE teams can offer.

Akhil Ahuja — LinkedIn

Twitter: mipsytipsy about alerting on metrics

Her thread starts here and continues being awesome:

Real talk, you should never have a paging alert on a system stats metric. Or a single host anything metric. (Or an aggregate host metric, or an aggregate divided by host count, or …)

Charity Majors

Outages

Telegram (messaging app)
Iomart (datacenter provider)
- Two separate network breaks cut off access to data centres run by cloud firm Iomart, affecting a wide range of customers
iTunes App Store
TD Ameritrade

SRE Weekly Issue #115

lex

March 25, 2018

Articles

Moving Past Shallow Incident Data

Metrics like Mean Time to Detection (MTTD), Resolution (MTTR), and the like pave over all of the incredibly valuable details of the individual incidents. If you place a lot of emphasis on aggregate incident response metrics, this article may cause you to rethink your methods.

Incidents are unplanned investments. When you focus solely on shallow data you are giving up the return on those investments that you can realize by deeper and more elaborate analysis.

John Allspaw — Adaptive Capacity Labs

Look for the duct tape

Duct tape: you know, all the little shell scripts you have in your ~/bin directory that you wrote because your system’s tooling got in your way or didn’t do what you needed? Find that, according to this article, and you’ll find interesting things to work on to make the system better. I’d add that these rough edges are often also the kinds of things that contribute to incidents.

Rachel Kroll

Incident review: API and Dashboard outage on 10 October 2017

A thoughtful and detailed incident post-analysis, including an in-depth discussion of the weeks-long investigation to determine the contributing factors. The outage involved the interaction of Pacemaker and Postgres.

Chris Sinjakli, Harry Panayiotou, Lawrence Jones, Norberto Lopes and Raúl Naveiras — GoCardless

How Chaos Engineering Can Bring Stability to Your Distributed Systems

Here’s a nice overview of chaos engineering, including a mention of a tool I wasn’t aware of for applying chaos to Docker containers.

Jennifer Riggins — The New Stack

Pull doesn’t scale – or does it?

The question in the title refers to the gathering of metrics from many systems in an infrastructure. Do they push their metrics in, or should the system pull metrics from each host instead? This Prometheus author explains why they pull and how it scales.

Julius Volz — Prometheus

Zero downtime deployments with containers

A primer on achieving seamless deployments with Docker, including examples.

Jussi Nummelin — Kontena

observability – Food Fight

I had some extra time for reviewing content this week, and I took the opportunity to listen to this episode of the Food Fight podcast, with a focus on observability. The discussion is really excellent, and there are some really thought-provoking moments.

Nell Shamrell-Harrington, with Nathen Harvey, Charity Majors, and Jamie Osler

Enable your Devs to do Ops

How? By writing runbooks. This article takes you through how, why, and what tools to use as you develop runbooks for your systems.

Francesco Negri — Buildo

How Threat Stack Does DevOps (Part IV): Making Engineers Accountable

As a security-focused company, it only makes sense that Threat Stack would focus on safety when giving developers access to operate their software in production.

We believe that good operations makes for good security. Reducing the scope of engineers’ access to systems reduces the noise if we ever have to investigate malicious activity.

Pete Cheslock — Threat Stack

Outages

Data Action
- Data Action is a dependency of many Australian banks.
Travis CI
S3
- Amazon S3 had a pair of outages for connections through VPC Endpoints. The Travis CI, Datadog, and New Relic outages were around the same time, but I can’t tell conclusively whether they were related.
Datadog
New Relic
