General

SRE Weekly Issue #84

lex

August 6, 2017

General

Comments

View on sreweekly.com

Articles

Availability Calculator

How many minutes per month is 99.95% availability? What about 99.957%? Here’s a tool that’ll give you a quick answer, by the author of awesome-sre.
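
For a quick sense of the arithmetic behind a question like this, here is a minimal sketch. It assumes a 30-day month, and the function name `allowed_downtime_minutes` is made up for illustration; the linked tool may define its periods differently.

```python
def allowed_downtime_minutes(availability_percent: float, days: float = 30) -> float:
    """Return the downtime budget in minutes for a period of `days` days.

    Assumption: the period is a flat number of days (30 by default),
    not a calendar month.
    """
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    total_minutes = days * 24 * 60
    return total_minutes * (100 - availability_percent) / 100


for target in (99.95, 99.957):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% -> {minutes:.2f} minutes per 30-day month")
```

Under that assumption, 99.95% works out to roughly 21.6 minutes per month and 99.957% to roughly 18.6 minutes.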

Chaos Engineering: A Lesson From the Experts – DZone Performance

This article is a partial transcript of Catchpoint’s Chaos Engineering and DiRT AMA.

In chaos engineering, we’re saying, “Look, these systems are just too complicated at this point to reasonably expect an engineer to know all of the properties that a system has.”

When production is your test lab – what is “canarying”?

Somewhat intro-level, but I like this little gem:

[…] we have to put some requests at risk for the sake of protecting our overall availability. We want to keep that risk to a minimum. This raises the question: What is the smallest experiment we can run that still gives us confidence in the result?

Designing for Scale, Part 3: Scaling Under Stress

This article chronicles New Relic’s attempt to test a new system to prove that it was ready for production.

FIFO, Exactly-Once, and Other Costs

SQS, Kafka, and others tout features like “exactly once” and “FIFO”, but there are necessarily some pretty big caveats and edge cases to those features that really can’t be ignored.

The SRE model

Really, the title should be “The Google SRE Model”. This article discusses Google’s philosophy that the SRE team is optional for any given system — but a team should be doing what SRE would be doing if they’re not around.

Building mOps (Modern Ops Process): Transparent Status and RCAs

SYNQ pushes for transparency in incident response and commits to publishing their RCAs publicly (like this one). They also include a simple template for RCAs at the end of the article.

Outages

AWS
- us-east-1 had another one-AZ network outage.
Poloniex (altcoin exchange)
Skype
British Airways
Canada
- A large portion of Canada had a major mobile phone and internet outage due to a fiber cut.
Heroku
- Heroku has had a string of major outages, marked as red on their status page. Apologies for not linking to them individually and as they’ve happened, but here’s a link to their historical list. No public statement has been posted yet.
  Full disclosure: Heroku is my employer.

SRE Weekly Issue #83

lex

July 30, 2017

Articles

Do You Suffer From Decision Fatigue?

Decision fatigue is the diminishment of certain mental faculties after making many decisions. It can cause incidents, and just as importantly, it can make incident response more difficult. After reading this article, I’m wondering if I should be asking incident responders to stop and drink a glass of orange juice before making a tough call during an incident.

The mystery of the hanging S3 downloads

Here’s an interesting debugging session that plumbs some of the more obscure depths of TCP.

Serverless computing has landed: How IT Ops can adapt

What does DR look like if your system is serverless? How do you manage performance if you don’t control the thing that loads (and hopefully pre-caches) your code?

Incident Management for Operations (book)

The new book on incident response from the folks at Blackrock3 has arrived! They draw on their years of fire incident response experience to teach us how to resolve outages. I had the privilege of attending one of Blackrock3’s 2-day training sessions last week and I highly recommend it.

Support Driven Development: Listen now so you don’t hear it later

I like the idea of focusing on reducing customer pain points, even if they’re not directly due to bugs. After all, reliability is all about the customer experience.

ChAP: Chaos Automation Platform

Netflix’s ChAP tests a target microservice by creating experimental and control clusters and routing a small portion of traffic to them.

Starting the Avalanche

Microservice-based architecture is great, right? The problem is that the fan-out of backend requests can create an amplification vector for a DDoS attack. A small, carefully-constructed API call from an attacker can result in a massive number of requests to services in the backend, taking them down.
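
To make the amplification concrete, here is a rough sketch of the arithmetic. The tier layout and fan-out numbers are hypothetical and not taken from the article; the function simply multiplies fan-out down the call tree.

```python
def backend_requests(fanout_per_tier: list[int]) -> int:
    """Count backend requests triggered by a single edge request.

    fanout_per_tier[i] is how many calls each request at tier i makes
    to tier i + 1. The values are illustrative assumptions only.
    """
    total = 0
    in_flight = 1  # the single request arriving at the edge
    for fanout in fanout_per_tier:
        in_flight *= fanout  # every request at this tier fans out again
        total += in_flight
    return total


# One API call that fans out 10x, then 5x, then 3x:
print(backend_requests([10, 5, 3]))  # 10 + 50 + 150 = 210
```

Even modest per-tier fan-out compounds quickly, which is why one cheap request at the edge can turn into hundreds of backend calls.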

Learning From Failure and Success – Production Ready

The latest from Mathias Lafeldt is this article about post-hoc learning. He draws on Zwieback and Cook, reminding us that both success and failure are normal circumstances in complex systems.

It’s important to understand that every outcome, successful or not, is the result of a gamble.

GitHub – dastergon/awesome-chaos-engineering: A curated list of awesome Chaos Engineering resources

Remember Awesome SRE? The same author, Pavlos Ratis, has pulled together a ton of links on Chaos Engineering. Thanks, Pavlos!

GitHub – dastergon/postmortem-templates: A collection of postmortem templates

He’s also compiled this set of postmortem templates, drawn from various sources. He’s unstoppable!

Pingdom’s Live Map Shows You The State Of The Internet As It Happens

What a great idea, and I wish I’d known about it earlier! Pingdom uses their aggregate monitoring data to create a live map of the internet. Might be useful for those big events like the Dyn DDoS or the S3 outage.

Verizon points finger at Niantic for problems at Pokémon Go event

Last week, I reported on the disaster that was Niantic’s Pokémon Go live event. Verizon wants to assure us that it wasn’t a capacity issue on their part.

Outages

EC2 (us-east-1)
- Between 6:47 AM and 7:10 AM PDT we experienced increased launch failures for EC2 Instances, degraded EBS volume performance and connectivity issues for some instances in a single Availability Zone in the US-EAST-1 Region. The issue has been resolved and the service is operating normally.
  
  This one seems to have affected several companies including Heroku and Rollbar.
Marketo
- Marketo failed to renew their domain name registration, reportedly due to a failure in their automated tooling.
Instagram
Report on July 7, 2017 incident | Gandi News
- Here’s one I missed from earlier this month.
  
  In all, 751 domains were affected by this incident, which involved an unauthorized modification of the name servers [NS] assigned to the affected domains that then forwarded traffic to a malicious site exploiting security flaws in several browsers.
  
  Thanks to an anonymous reader for this one.
Threat Stack Status – Config Audit Database Maintenance
- Another one I missed. This one appears to be a maintenance that went wrong. Thanks to an anonymous reader for this one.

SRE Weekly Issue #82

lex

July 23, 2017

Articles

Case studies in cloud migration: Netflix, Pinterest, and Symantec – Increment issue 2: Cloud

Increment issue #2 is out! Want to hear what it was like for these three big companies to move to the cloud? Read on.

An engineer’s guide to cloud capacity planning – Increment issue 2: Cloud

This article covers a lot of ground, from general strategy to specific methods for estimating capacity needs. I love this:

Perhaps surprisingly for engineers who work in mission-critical business applications, occasional spikes of 90%+ of our users being entirely unable to use the sole application of our company was an entirely acceptable engineering tradeoff versus sizing our capacity against our peak loads.

The strange geography of content delivery networks

I love the insight this article gives me into the huge networks of big CDNs.

Reducing MTTR

Key point: don’t count your chickens before they’ve recovered.

The MTTR time should be stopped when there is verification that all systems are once again operating as expected and end users are no longer negatively affected
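
As a small illustration of that point, the sketch below measures recovery time up to the moment recovery was verified rather than the moment a fix was deployed. The `Incident` fields are hypothetical names chosen for this example, not something defined in the linked article.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    detected_at: datetime
    fix_deployed_at: datetime  # recorded, but deliberately not used for MTTR
    verified_at: datetime      # recovery confirmed for end users


def mean_time_to_recovery(incidents: list[Incident]) -> timedelta:
    """Average the time from detection to *verified* recovery."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((i.verified_at - i.detected_at for i in incidents), timedelta())
    return total / len(incidents)


incidents = [
    Incident(datetime(2017, 7, 1, 10, 0), datetime(2017, 7, 1, 10, 20), datetime(2017, 7, 1, 10, 30)),
    Incident(datetime(2017, 7, 2, 12, 0), datetime(2017, 7, 2, 12, 40), datetime(2017, 7, 2, 13, 30)),
]
print(mean_time_to_recovery(incidents))  # 1:00:00
```

Stopping the clock at `fix_deployed_at` instead would report 30 minutes here, understating how long users were actually affected.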

In DevOps Incident Response, Plans Are Worthless, But Planning Is Everything

Scalyr explains how to move beyond specific playbooks to create a general incident response plan.

Dead man’s switch with AWS CloudWatch: Freshness-Alerting for Backups and Co

Here’s a nice little how-to:

A recent challenge for one of the teams I am currently involved was to find a way in AWS CloudWatch:

To alert if the metric breaches a specified threshold.

To alert if a particular metric has not been sent to CloudWatch within a specified interval.
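
The two conditions above can be sketched in plain Python. This is only an illustration of the logic, not the article's CloudWatch configuration; the function name, parameters, and return strings are all assumptions made for this example.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def evaluate(
    last_value: Optional[float],
    last_seen: Optional[datetime],
    now: datetime,
    threshold: float,
    max_age: timedelta,
) -> str:
    """Return an alarm state for a metric, checking freshness first."""
    # Dead man's switch: no datapoint at all, or the newest one is too old.
    if last_seen is None or now - last_seen > max_age:
        return "ALARM: stale"
    # Ordinary threshold alarm on the most recent value.
    if last_value is not None and last_value > threshold:
        return "ALARM: threshold"
    return "OK"


now = datetime(2017, 7, 23, 12, 0, tzinfo=timezone.utc)
one_day = timedelta(hours=24)
print(evaluate(0.0, now - timedelta(hours=30), now, threshold=1.0, max_age=one_day))
print(evaluate(5.0, now - timedelta(hours=1), now, threshold=1.0, max_age=one_day))
```

Checking staleness before the threshold matters: a backup job that silently stops reporting would otherwise look healthy forever, since its last reported value never breaches anything.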

High Availability Prometheus Alerting and Notification

And another short how-to, this one on deploying Prometheus with HA.

I won’t tell you to stop working, but I can try to help you not burn out

Self-care is critical in tech, not only for us as individuals, but for the health and reliability of the entire organization. Overstretched engineers make mistakes. This article introduces a new resource: selfcare.tech, which is a curated, open-source repository of self-care resources.

Outages

Today’s Outage · GitHub
- Old but good: this post-incident report from GitHub in 2010 recounts an outage caused by inadvertently running an automated test script against a production database.
Pokémon Go Chicago event issues ticket refunds after widespread outage
- 20,000 people in one place trying to play Pokémon Go was apparently enough to overload several mobile phone networks.
YouTube

SRE Weekly Issue #81

lex

July 16, 2017

Articles

Failure Fridays: Four Years On

PagerDuty shared this timeline of their progress in adopting Chaos Engineering through their Failure Friday program. This is brilliant:

We realized that Failure Fridays were a great opportunity to exercise our Incident Response process, so we started using it as a training ground for our newest Incident Commanders before they graduated.

How Platforms and SREs Change the DevOps Contract

I’m a big proponent of having developers own their code in production. This article posits that SRE’s job is to provide a platform that enables developers to do that more easily. I like the idea that containers and serverless are ways of getting developers closer to operations.

These platforms and the CI/CD pipelines they enable make it easier than ever for teams to own their code from desktop to production.

Interview with AWS’s Werner Vogels about response to Amazon outages

This reads less like an interview and more like a description of Amazon’s incident response procedure. I started paying close attention at step 3, “Learn from it”:

Vogels places the blame not on the engineer directly responsible, but Amazon itself, for not having failsafes that could have protected its systems or prevented the incorrect input.

From Scala Unified Logging to Full System Observability — Part 1 of 3: Our Original State of Logging

Jonathan is a platform engineer at VictorOps, responsible for system scalability and performance. This is Part 1 in a 3-part series on system visibility, the detection part of incident management.

This article is published by my sponsor, VictorOps, but their sponsorship did not influence its inclusion in this issue.

Site Reliability Engineer: Don’t fall victim to the bias blind spot

This article is about a different kind of human factor than articles I often link to: cognitive bias. The author presents a case for SREs as working to limit the effects of cognitive bias in making operational decisions.

Outages

OVH
- OVH suffered a major outage in a datacenter, taking down 50,000 websites that they host. The outage was caused by a leak in their custom water-cooling system and resulted in a painfully long 24-hour recovery from an offsite backup. The Register’s report (linked) is based on OVH’s incident log and is the most interesting datacenter outage description I’ve read this year.
Google Cloud Storage
- Google posted this followup for an outage that occurred on July 6th. As usual, it’s an excellent read filled with lots of juicy details. This caught my eye:
  
  […] attempts to mitigate the problem caused the error rate to increase to 97%.
  
  Apparently this was caused by a “configuration issue” and was quickly reverted. It’s notable that they didn’t include anything about this error in the remediations section.
Melbourne, AU’s Metro rail network
- A network outage stranded travelers, and switching to the DR site “wasn’t an option”.
Somalia

SRE Weekly Issue #80

lex

July 9, 2017

Articles

Linux tracing systems & how they fit together

I had no idea there were so many tracing systems in Linux! Fortunately Julia Evans did, and she learned all about them so that she could explain them to us.

There’s strace, and ltrace, kprobes, and tracepoints, and uprobes, and ftrace, and perf, and eBPF, and how does it all fit together and what does it all MEAN?

So you want to be an SRE?

What do you get when a high school teacher switches careers, goes to boot camp, and becomes an SRE? In this case, we get Krishelle Hardson-Hurley, who wrote this really great intro to the SRE field. She also included a set of links to other SRE materials. Thanks for the link to SRE Weekly, Krishelle!

Embracing Failure in a Container World – Production Ready

This issue of Production Ready is a transcript (with slides) of Mathias’s talk at ContainerDays on doing chaos engineering in a container-based infrastructure. I really like the idea of attaching a side-car container to inject latency using tc.

Why is Redfin running its site from a single data center without a backup facility?

Here’s an interesting side-effect from an IPO: Redfin was obliged to mention the fact that its website runs out of a single datacenter.

Event Foo: Designing for Results

This article, part of a series from Honeycomb.io on structured event logging, contains some tips on structuring your events well to get the most out of your logs.

The Peculiarities of High-Availability Data Center Design on a Cruise Ship

I’d never thought about what IT systems must exist on a cruise ship before. This article left me wanting to know more, so I found this ZDNet article with pictures and descriptions of another cruise ship datacenter layout.

Outages

Chase Bank
Data glitch sets tech company stock prices at $123.47
- Here’s an interesting one. Vendors that consume and distribute price information from Nasdaq incorrectly interpreted “normal test data” from Nasdaq as if it were real. It looked like a bunch of companies’ stock prices had crashed or jumped by huge amounts.
Alphabay

Summary of the December 24, 2012 Amazon ELB Service Event in the US-East Region
- Here’s a classic postmortem from Amazon, in which a developer inadvertently deleted the production ELB state information.
Slack: This was not normal. Really.
- I’d forgotten about this superlative example of an incident followup posting from Slack after a pair of outages in 2014. What reminded me was a commit to Dan Luu’s post-mortems repo on GitHub that mentioned it.
Heroku Incident 372: HTTP Routing Errors
- Here’s another classic incident followup posting. Heroku spills the details on a major outage that cut off access to all applications for 30 minutes in 2012.
  Full disclosure: Heroku is my employer.
