SRE WEEKLY – Page 71 – scalability, availability, incident response, automation

SRE Weekly Issue #144

lex

October 21, 2018

General

Comments

View on sreweekly.com

Articles

Incident Management at GitLab

GitLab is incredibly open with their policies, and incident management is no exception.

GitLab

Resilience Weekly

Ooh, new newsletter! This one focuses specifically on resiliency. It seems to have just a few articles each week with in-depth summaries.

Thai Wood

Mo’ developers, mo’ problems: How serverless has trouble with teams

This article starts with a fictitious(?) account of the kind of failure that can occur when teams step on each other’s toes in a serverless environment. It goes on to discuss techniques for dealing with this class of problems, including careful permission management.

Toby Fee — jaxenter

REPT: reverse debugging of failures in deployed software

Sometimes fixing a rarely-occurring bug can be especially difficult. Recording enough information all the time to debug those rare failures would be too resource-intensive. Check out this fascinating technique for working backward from a memory dump to infer the prior contents of memory in the time leading up to a failure.

Adrian Colyer — The Morning Paper (summary)
Cui et al. (original paper)

To the brave new world of reactive systems and back

An introduction to the concept of reactive systems including a description of the high-level architectural features.

Sinkevich Uladzimir — The Server Side

Why do things go right?

Initially, you can improve reliability by studying incidents to find out what went wrong. This article explains why that strategy will only get you so far.

Thanks to Thomas Depierre for this one.
Sidney Dekker — Safety Differently

Chaos Monkey Guide for Engineers – Tips, Tutorials, and Training

Gremlin released this huge guide on Chaos Monkey, covering theory, practice, further reading, and lots of other resources.

Gremlin, Inc.

Outages

YouTube
- YouTube had a major outage this past week, and a popular adult site saw a simultaneous uptick in traffic.
Fastly
- And this one too. Full disclosure: Fastly is my employer.
Amazon S3
Amazon.com
Amazon Prime Music
HSBC (Bank)
Yale (smart home products)
- Home security company Yale has denied that a server outage caused anyone to be locked out of their house, after an app used to remotely set and turn off one of its smart alarm products went down late last week.
  
  Check out that first tweet in the article.
Instacart

SRE Weekly Issue #143

lex

October 14, 2018

General

Articles

How Etsy Handles Peeking in A/B Testing

There’s some great statistics theory in here. The challenge is: how can you have accurate, useful A/B tests without having to wait a long time to get a big enough sample size? Can you bail out early if you know the test has already failed? Can you refine the new feature mid-test?

Callie McRee and Kelly Shen — Etsy

SRE: The Biggest Lie Since Kanban

Don’t just rename your Ops team to “SRE” and expect anything different, says this author.

Ernest Mueller — The Agile Admin

We can do better than percentile latencies | theburningmonk.com

Great idea:

So what if we monitor the percentage of requests that are over the threshold instead? To alert us when our SLAs are violated, we can trigger alarms when that percentage is greater than 1% over some predefined time window.

Yan Cui
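The idea quoted above can be sketched in a few lines. This is only an illustration, not code from the article: the function names, the 500 ms threshold, and the sample window are assumptions made up for the example; only the 1% alarm level comes from the quote.

```python
from typing import Sequence


def percent_over_threshold(latencies_ms: Sequence[float], threshold_ms: float) -> float:
    """Return the percentage of requests in the window slower than threshold_ms."""
    if not latencies_ms:
        # An empty window has no slow requests to report.
        return 0.0
    slow = sum(1 for latency in latencies_ms if latency > threshold_ms)
    return 100.0 * slow / len(latencies_ms)


def should_alert(
    latencies_ms: Sequence[float],
    threshold_ms: float,
    max_percent_over: float = 1.0,  # the 1% level from the quoted passage
) -> bool:
    """Alert when more than max_percent_over percent of requests exceed the threshold."""
    return percent_over_threshold(latencies_ms, threshold_ms) > max_percent_over


# Hypothetical time window: 97 fast requests and 3 slow ones.
window = [120.0] * 97 + [900.0] * 3
print(percent_over_threshold(window, 500.0))
print(should_alert(window, 500.0))
```

Unlike a percentile, this number says directly how many requests were affected, which is what an SLA violation is usually about.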

Dropbox traffic infrastructure: Edge network

There’s a ton of detail here, and it’s a great read. Lots of juicy tidbits about PoP selection, load balancing, and performance monitoring.

Oleg Guba and Alexey Ivanov — Dropbox

Full disclosure: Fastly, my employer, is mentioned.

Preliminary Report Pipeline: Over-pressure of a Columbia Gas of Massachusetts Low-pressure Natural Gas Distribution System

Even as a preliminary report, there’s a lot to digest here about what caused the series of gas explosions last month in Massachusetts (US). I feel like I’ve been involved in incidents with similar contributing factors.

US National Transportation Safety Board (NTSB)

What I learned by bringing down LinkedIn.com – VentureBeat

This isn’t just a recap of a bad day, although the outage description is worth reading by itself. Readers also gain insight into the evolution of this engineer’s career and mindset, from entry-level to Senior SRE.

Katie Shannon — LinkedIn

https://about.gitlab.com/2018/10/11/gitlab-com-stability-post-gcp-migration/

GitLab, in their trademark radically open style, goes into detail on the reasons behind the recent increase in the reliability of their service.

Andrew Newdigate — GitLab

Getting to 99.999% Availability with Twilio’s Tyler Wells

Five nines are key when you consider that Twilio’s service uptime can literally mean life and death. Click through to find out why.

Charlie Taylor — Blameless
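For a sense of scale, five nines leaves roughly 5.3 minutes of downtime per year. A minimal sketch of the arithmetic, assuming a 365-day year; the function name is hypothetical and not from the article:

```python
def downtime_budget_seconds(availability_percent: float, period_days: float = 365.0) -> float:
    """Return the downtime, in seconds, allowed by an availability target over a period."""
    period_seconds = period_days * 24 * 60 * 60
    # The unavailable fraction is whatever remains after the availability target.
    return period_seconds * (1.0 - availability_percent / 100.0)


for target in (99.9, 99.99, 99.999):
    minutes = downtime_budget_seconds(target) / 60.0
    print(f"{target}% availability allows about {minutes:.1f} minutes of downtime per year")
```

Each extra nine cuts the budget by a factor of ten, so the step from four nines to five is the difference between about 53 minutes and about 5 minutes a year.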

Outages

Travis CI
Google Compute Engine us-central1-c
- I can’t really summarize this incident report well, but I highly recommend reading it.
Azure
- Duplicated here since I can’t deep-link:
  
  Summary of impact: Between 01:22 and 05:50 UTC on 13 Oct 2018, a subset of customers using Storage in East US may have experienced intermittent difficulties connecting to resources hosted in this region. Other services leveraging Storage in the region may have also experienced impact related to this incident.
Instagram
Heroku
- This one’s notable for the duration: about 10 days of diminished routing performance due to a bad instance.
Microsoft Outlook

SRE Weekly Issue #142

lex

October 7, 2018

General

Articles

The Big Hack: How China Used a Tiny Chip to Infiltrate U.S. Companies

The big news this week is the story from Bloomberg alleging a spy chip on SuperMicro motherboards. I say “alleging” because Amazon and Apple have issued unequivocal denials.

Jordan Robertson and Michael Riley — Bloomberg

Orlando Paramedics Didn’t Go In to Save Victims of the Pulse Shooting. Here’s Why.

There was a plan in the works in the months before the Pulse nightclub mass shooting in Florida (US) in 2016, designed for getting victims out of a “hot” zone. The story about why it wasn’t implemented echoes the kind of organizational failings we see as SREs.

Abe Aboraya — ProPublica

Open-sourcing StateService: Automating recovery of third-party services after a major outage

Facebook is at it again! Here’s a new system based on a state machine driven by Chef.

Declan Ryan — Facebook

Designing and implementing your disaster recovery plan using GCP

Google has produced a new guide on designing DR in Google Cloud Platform:

We’ve put together a detailed guide to help steer you through setting up a DR plan. We heard your feedback on previous versions of these DR articles and now have an updated four-part series to help you design and implement your DR plans.

Grace Mollison — Google

Russ Miles: Ignored Architects and Chaos Engineering

[…] you must be part of the team working on the system. You cannot be someone that hurts a system and then wait for others to fix the problem.

Jan Stenberg — InfoQ

Capacity Planning in Four Parts: Telling the Future without a Crystal Ball

If you’ve ever been woken in the middle of the night just to see that an alert could be solved by adding another server or two to the loadbalancer, you need capacity plans and you need them yesterday.

Evan Smith — Hosted Graphite

Building Blameless right from the beginning

[…] our industry has finally reached the tipping point at which it has become viable to build distributed systems from scratch, at a fast pace of iteration and low cost of operation, all while still having a small team to execute

The author argues that it’s possible to avoid building tech debt while still retaining the velocity a new startup needs.

Santiago Suarez Ordoñez — Blameless, Inc.

A Brief History of High Availability

From a single host, to a bigger host, to leader/follower replication and active/active setups. The distinction between active/active and “Multi-Active” is worth reading.

Sean Loiselle — Cockroach Labs

Outages

Crowdpac (crowd-funding site)
- Crowdpac briefly went down as visitors swarmed the site to make donations to a campaign raising funds for the future opponent of US Senator Susan Collins, due to her controversial vote on the confirmation of (now-)Justice Kavanaugh.
AWS (us-west-2)
Ecobee (home automation)
German Parliament’s IT system
Instagram
Cisco Webex

SRE Weekly Issue #141

lex

September 30, 2018

General

Articles

Rethinking Netflix’s Edge Load Balancing

An outline of the design of Netflix’s new load balancer, with special emphasis on dealing with faltering backends. Great idea: servers report their utilization level in a response header. Tricky pitfall: the LB is so good at moving requests off of ailing backends that backend failure rate alerts don’t fire.

Mike Smith — Netflix

NewSQL database systems are failing to guarantee consistency, and I blame Spanner

This article begins by explaining consistency versus availability in distributed data stores and argues that the trade-off is less significant than one might think. Then it describes a pitfall seen in some new data stores. I’ve pondered aloud here in the past on how Spanner can make the claims it does, and this article explains that nicely.

Daniel Abadi

The redux of the fallacies of distributed computing

And here’s a refutation of part of the previous article by the creator of RavenDB.

Ayende Rahien

Getting The Airlines Back On Their Feet After A Disaster

It is tempting to think that ensuring the resilience or continuity of all the individual parts of a business will guarantee the resilience or continuity of the whole.

Dr. Sandra Bell

Upgrading GitHub from Rails 3.2 to 5.2

GitHub used an innovative technique to avoid holding open a long-running code branch while upgrading their application to support Rails 5.2.

Eileen Uchitelle — GitHub

Travis CI: Build VMs boot failure on the sudo-enabled infrastructure: incident postmortem

Worker node errors led to cascading failure when they hit Google Compute Engine quotas.

Bogdana Vereha — Travis CI

Secret IBM script could have prevented 11-hour US tax day outage

This week, the US Internal Revenue Service (IRS) issued a report analyzing the tax-day outage that occurred this past April. Linked is a nice summary by The Register.

Thanks to reader Michael Fischer for a tip on the report.

Chris Mellor — The Register

Outages

Facebook
Amazon Alexa
Delta Airlines
Honeywell (smart thermostat manufacturer)
Zoho
- SaaS provider Zoho’s domain registration was revoked by its registrar after a run-of-the-mill phishing complaint, affecting 30 million users.
Steemit

SRE Weekly Issue #140

lex

September 23, 2018

General

Articles

Errata: The Servers Are Burning

My sincerest apologies to Dale Markowitz, the author of this article, whom I mispronouned in last week’s issue. I’m kicking myself, because I totally didn’t need to use a pronoun at all.

Dale Markowitz — LOGIC Magazine

Linux 4.19-rc4 released, an apology, and a maintainership note

Linus Torvalds made waves this week with an email apologizing for his unprofessional behavior and committing to improving.

Linus Torvalds

Designing for Failure to Avoid Disaster

A pretty detailed article on how LaunchDarkly designed their system for reliability. The streaming vs. polling section is especially interesting.

Adam Zimman — LaunchDarkly

Full disclosure: Fastly, my employer, is mentioned.

LogDevice: a distributed data store for logs

Lots of details about how they achieve their reliability goals. I’d love to see a followup with more detail on why writing a solution in-house made sense versus adopting something like Kafka.

Mark Marchukov — Facebook

13 Reasons a Staging Environment Is Failing in Your Organization

The staging environment plays an important part. If staging isn’t working for your organization, make sure you aren’t making these common mistakes.

Harshit Paul — DZone

Mockers – overcoming testing challenges at Grab

The challenges in question involve testing a microservice’s interactions with other microservices. Read about their system for distributing and running mock servers for each microservice.

Mayank Gupta, K.Vineet Nair, Shivkumar Krishnan, Thuy Nguyen, and Vishal Prakash — Grab

BP is to blame for Deepwater Horizon, but its mistake was actually years of small mistakes.

My partner suggested I look into the Deepwater Horizon incident, and I’m glad I did. My two key takeaways were normalization of deviance and this gem:

Researchers who study disasters tell us that a long period without an accident can be a big risk factor in itself: Workers learn to expect safe operation as the norm and can’t even conceive of a devastating failure.

James B. Meigs — Slate

Subscribe

RSS

Mastodon

Search Issues
