General

SRE Weekly Issue #81

lex

July 16, 2017

Articles

PagerDuty shared this timeline of their progress in adopting Chaos Engineering through their Failure Friday program. This is brilliant:

We realized that Failure Fridays were a great opportunity to exercise our Incident Response process, so we started using it as a training ground for our newest Incident Commanders before they graduated.

How Platforms and SREs Change the DevOps Contract

I’m a big proponent of having developers own their code in production. This article posits that SRE’s job is to provide a platform that enables developers to do that more easily. I like the idea that containers and serverless are ways of getting developers closer to operations.

These platforms and the CI/CD pipelines they enable make it easier than ever for teams to own their code from desktop to production.

Interview with AWS’s Werner Vogels about response to Amazon outages

This reads less like an interview and more like a description of Amazon’s incident response procedure. I started paying close attention at step 3, “Learn from it”:

Vogels places the blame not on the engineer directly responsible, but Amazon itself, for not having failsafes that could have protected its systems or prevented the incorrect input.

From Scala Unified Logging to Full System Observability — Part 1 of 3: Our Original State of Logging

Jonathan is a platform engineer at VictorOps, responsible for system scalability and performance. This is Part 1 in a 3-part series on system visibility, the detection part of incident management.

This article is published by my sponsor, VictorOps, but their sponsorship did not influence its inclusion in this issue.

Site Reliability Engineer: Don’t fall victim to the bias blind spot

This article is about a different kind of human factor than articles I often link to: cognitive bias. The author presents a case for SREs as working to limit the effects of cognitive bias in making operational decisions.

Outages

OVH
- OVH suffered a major outage in a datacenter, taking down 50,000 websites that they host. The outage was caused by a leak in their custom water-cooling system and resulted in a painfully long 24-hour recovery from an offsite backup. The Register’s report (linked) is based on OVH’s incident log and is the most interesting datacenter outage description I’ve read this year.
Google Cloud Storage
- Google posted this followup for an outage that occurred on July 6th. As usual, it’s an excellent read filled with lots of juicy details. This caught my eye:
  
  […] attempts to mitigate the problem caused the error rate to increase to 97%.
  
  Apparently this was caused by a “configuration issue” and was quickly reverted. It’s notable that they didn’t include anything about this error in the remediations section.
Melbourne, AU’s Metro rail network
- A network outage stranded travelers, and switching to the DR site “wasn’t an option”.
Somalia

SRE Weekly Issue #80

lex

July 9, 2017

General

Comments

View on sreweekly.com

Articles

Linux tracing systems & how they fit together

I had no idea there were so many tracing systems in Linux! Fortunately Julia Evans did, and she learned all about them so that she could explain them to us.

There’s strace, and ltrace, kprobes, and tracepoints, and uprobes, and ftrace, and perf, and eBPF, and how does it all fit together and what does it all MEAN?

So you want to be an SRE?

What do you get when a high school teacher switches careers, goes to boot camp, and becomes an SRE? In this case, we get Krishelle Hardson-Hurley, who wrote this really great intro to the SRE field. She also included a set of links to other SRE materials. Thanks for the link to SRE Weekly, Krishelle!

Embracing Failure in a Container World – Production Ready

This issue of Production Ready is a transcript (with slides) of Mathias’s talk at ContainerDays on doing chaos engineering in a container-based infrastructure. I really like the idea of attaching a side-car container to inject latency using tc.

Why is Redfin running its site from a single data center without a backup facility?

Here’s an interesting side-effect from an IPO: Redfin was obliged to mention the fact that its website runs out of a single datacenter.

Event Foo: Designing for Results

This article, part of a series from Honeycomb.io on structured event logging, contains some tips on structuring your events well to get the most out of your logs.

The Peculiarities of High-Availability Data Center Design on a Cruise Ship

I’d never thought about what IT systems must exist on a cruise ship before. This article left me wanting to know more, so I found this ZDNet article with pictures and descriptions of another cruise ship datacenter layout.

Outages

Chase Bank
Data glitch sets tech company stock prices at $123.47
- Here’s an interesting one. Vendors that consume and distribute price information from Nasdaq incorrectly interpreted “normal test data” from Nasdaq as if it were real. It looked like a bunch of companies’s stock prices had crashed or jumped by huge amounts.
Alphabay

Summary of the December 24, 2012 Amazon ELB Service Event in the US-East Region
- Here’s a classic postmortem from Amazon, in which a developer inadvertently deleted the production ELB state information.
Slack: This was not normal. Really.
- I’d forgotten about this superlative example of an incident followup posting from Slack after a pair of outages in 2014. What reminded me was a commit to Dan Luu’s post-mortems repo in github that mentioned it.
Heroku Incident 372: HTTP Routing Errors
- Here’s another classic incident followup posting. Heroku spills the details on a major outage that cut off access to all applications for 30 minutes in 2012.Full disclosure: Heroku is my employer.

SRE Weekly Issue #79

lex

July 3, 2017

General

Comments

View on sreweekly.com

Articles

Safety Moment – Who Failed or What Failed?

Asking “what failed?” can point an investigation in an entirely different and more productive direction.

[…] the power you have is not in the answer to your question; it’s in the question […]

Async IO on Linux: select, poll, and epoll

If you’re planning to write reliable, well-performing server code in Linux, you’ll need to know how to use epoll. Here’s Julia Evans to tell you what she learned about epoll and related syscalls.

You Cannot Have Exactly-Once Delivery Redux – Brave New Geek

Tyler Treat rectifies Kafka 0.11’s exactly-once semantics with his classic article, “You Cannot Have Exactly-Once Delivery”.

Scalability & High Availability

A “refcard” from Dzone covering a wide range of SRE basics, including load balancing, caching, clustering, redundancy, and fault tolerance.

BabyDuty or: How PagerDuty Accidentally Prepared me for Fatherhood

A PagerDuty engineer applies on-the-job expertise to labor, delivery, and parenting. Lots of concepts translate pretty well. Some… not so much.

Let’s stop talking about quality

As an SRE, I want “quality” code to be shipped so that our system is reliable. But what am I really after? Sam Stokes says we should avoid using the term “quality” in favor of finding common ground and understanding the whole situation.

The reality is that doing anything in the real world involves difficult decisions in the face of constraints.

Build Observable Systems

The value of logs is in what questions you can answer with them.

A sample rate of 20 means “there were 19 other events just like this one”. A sample rate of 1 means “this event is interesting on its own”.

Why Signiant uses StatusPage instead of a DIY tool

The Signiant team previously had no dedicated solution for incident communication. As a result, any hiccup in service resulted in a flooded queue for service agents and a stuffed inbox of “what’s going on here” notes from internal team members.

How do you cut a monolith in half?

In practice, a message broker is a service that transforms network errors and machine failures into filled disks.

Queues inevitably run in two states: full, or empty.

You can use a message broker to glue systems together, but never use one to cut systems apart.

Outages

Fastly
- A big, but brief, global degradation in Fastly. [related news story]
Rackspace
Pinboard.in
- Pinboard.in experienced a bit of feature degradation as its admin replaced a disk. I’m only including this because it meant that I couldn’t post this issue on time. ;)
  Pinboard‘s really awesome, and I wouldn’t be able to put together this newsletter without it. The API is super-simple to use, and I’m able to save and classify links right on my phone. A+, would socially bookmark with again.

SRE Weekly Issue #78

lex

June 25, 2017

General

Comments

View on sreweekly.com

Articles

Language Bias in Accident Investigation

This Master’s thesis by Crista Vesel seeks to answer the question, “How does the language used in the U.S. Forest Service’s Serious Accident Investigation Guide bias accident investigation analysis?” It’s an awe-inspiring analysis, drawing on Dekker, Woods, Cook, and other authors I’ve linked here repeatedly.

The most exciting part for me was the confirmation of some vague thoughts I’ve had around the use of passive versus active voice in retrospectives. By using passive voice, we can seek to reduce the kind of blaming that is inherent in active/agentive language.

What can developers learn from being on call? – Julia Evans

It’s by Julia Evans. Just read it.

Being responsible for my programs’ operations makes me a better developer

Determining Alert Urgency

PagerDuty again draws on ITIL, this time to outline an example system for classifying incident impact and urgency in order to determine priority.

ChaosCat: Automating Fault Injection at PagerDuty

PagerDuty’s take on automating chaos includes a chat-bot that lets folks trigger one-off host failures, along with running periodically, of course.

Unfortunately, ChaosCat is significantly tied into our internal infrastructure tooling. For the moment this means we won’t be open-sourcing it.

Reduce downtime with Azure Site Recovery service

This article is an overview of Microsoft’s DRaaS offering, Azure Site Recovery. Protip: you can just scroll past the signup-gate if you don’t feel like entering your email address.

How We Scaled Our Cache and Got a Good Night’s Sleep

Grab evaluated a couple of existing solutions but went with a simple custom sharding layer as a method to scale out their Redis usage.

Outages

Rollbar
LinkedIn
Skype
- Suspected DDoS.
ATO (Australian Tax Office)
Dyn
- Dyn suffered a long outage, and they posted an amazing 28 detailed updates to their status site before all was said and done. That’s something to aspire to.
Heroku
- Heroku posted a followup for their series of incidents early this month. Sorry for missing posting those outages when they happened!Full disclosure: Heroku is my employer.

SRE Weekly Issue #77

lex

June 18, 2017

General

Comments

View on sreweekly.com

I really love that some of you are taking vacations. Preventing burnout is really critical for improving reliability. That said, if you’d please exempt my address from your vacation auto-responder, that’d be super-cool ;)

Articles

Systemic brittleness, reactions to failure, and Conroy’s Law

Last week, I linked to a reddit story of an engineer that was unfairly fired for a mistake on their first day. Dr. Richard Cook picked this up and wrote up a great analysis of the underlying organizational issues.

Thanks to John Allspaw for this one.

Australian Tax Office’s post-incident report on the SAN outages

This was released the week before last, but it took me awhile to digest it. The ATO did a very thorough post-analysis on their two outages and released this polished report. I like that they took full responsibility for the outage even though it was an issue with a fully-managed vendor SAN offering, and they clearly sought to learn as much as possible.

Applications of (pin)trace data

Pinterest tech lead Suman Karumuri explains how they use distributed tracing and the benefits it’s brought them.

With these new use cases, we see tracing infrastructure as the third pillar of monitoring our services in addition to metrics and log search systems.

An Imaginary Apology Letter From Your Airline CEO

Frustrated by British Airways’s Willie Walsh’s public statement regarding their major outage, TripWire founder Gene Kim took it upon himself to write an open letter of apology as if he were an airline CEO. It’s pretty great.

NGINX Plus High Availability on AWS

This article explores several options for HA with Nginx: put an ELB in front of it, Route 53 with health checks, or an elastic IP switched either by keepalived or a Lambda function.

On-Calliday: A Guide to Unsucking Your On-Call Experience

I’ve been following GitLab’s blog since their engineer accidentally deleted their database earlier this year, and I’m glad I did. This article touches on all sorts of topics near to my heart: preventing burnout, examining incident response metrics, enforcing vacations, incident command, and having developers go on-call for what they wrote.

The hidden cost of “Dark DR:” The economic argument for active/active operations

The costs associated with running a full-capacity redundant system in a secondary site can be numerous and subtle. Those costs can be especially hard to swallow when expected returns on infrastructure investments prove elusive.

A/B Testing and Beyond: Improving the Netflix Streaming Experience with Experimentation and Data Science

Netflix explains in depth the careful scientific experiments they perform in production in order to improve the QoE (quality of experience).

Outages

Google Cloud Services
- 62-minute multiple-zone total internet outage in asia-northeast1. Postmortem linked, including a description of several contributing factors.
  
  We apologize for the impact this issue had on our customers, and especially to those customers with deployments across multiple zones in the asia-northeast1 region. We recognize we failed to deliver the regional reliability that multiple zones are meant to achieve.
Coinbase
YouTube

SRE Weekly Issue #81

Articles

Outages

SRE Weekly Issue #80

Articles

Outages

SRE Weekly Issue #79

Articles

Outages

SRE Weekly Issue #78

Articles

Outages

SRE Weekly Issue #77

Articles

Outages

Subscribe

RSS

Mastodon

Search Issues

General

SPONSOR MESSAGE

Articles

Outages

SPONSOR MESSAGE

Articles

Outages

SPONSOR MESSAGE

Articles

Outages

SPONSOR MESSAGE

Articles

Outages

SPONSOR MESSAGE

Articles

Outages

Subscribe

RSS

Mastodon

Search Issues