
SRE Weekly Issue #91

I’m heading to New York tomorrow and will be at Velocity Tuesday and Wednesday. If you’re there, look for the weirdo in the SRE Weekly shirt and hit me up for some nifty swag! Also, maybe check out my talk on DNS, if you’re into that kind of thing.

Thanks to an eagle-eyed reader for pointing out that I totally screwed up the HTML on the link last week. Oops.

SPONSOR MESSAGE

Like DevOps? Register for All Day DevOps – a FREE online conference this October, offering 100 DevOps-focused sessions across six different tracks. Learn more & register:
http://bit.ly/2waBukw

Articles

Here’s how Hosted Graphite made their job ad for an SRE-like role (Ops Automation Engineer) more inclusive. The article is filled with specific before/after language snippets, each with a detailed explanation of why they made the change.

A couple weeks after their major outage last October, Dyn published this article explaining secondary DNS. It’s a great primer and digs into what to do if you use advanced non-standard functionality like ALIAS records and traffic balancing.
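
The article itself is prose, but as a rough illustration of the kind of consistency check you want once a secondary provider is in the mix, here’s a minimal sketch (assuming dnspython 2.x; the hostname and nameserver IPs are placeholders) that asks both providers the same question and flags any disagreement, which is exactly where ALIAS-style records tend to bite:

    # Sketch only: compare answers from a primary and a secondary DNS provider.
    # Assumes dnspython >= 2.0; the hostname and nameserver IPs are placeholders.
    import dns.resolver

    NAME = "www.example.com"
    PROVIDERS = {
        "primary": "198.51.100.1",   # hypothetical primary provider nameserver
        "secondary": "203.0.113.1",  # hypothetical secondary provider nameserver
    }

    def answers(nameserver):
        resolver = dns.resolver.Resolver(configure=False)
        resolver.nameservers = [nameserver]
        return {rr.to_text() for rr in resolver.resolve(NAME, "A")}

    results = {label: answers(ns) for label, ns in PROVIDERS.items()}
    if results["primary"] != results["secondary"]:
        # Provider-specific ALIAS records aren't part of standard zone transfers,
        # so a mismatch like this is worth an alert.
        print(f"Mismatch for {NAME}: {results}")
    else:
        print(f"{NAME} is consistent across providers: {results['primary']}")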

SignalFx goes into deep detail on their feature for predicting future metric values. We get an explanation of why prediction is difficult and a discussion of the math involved in their solution.

Payments: we really have to get them right. Here’s Dropbox’s Jessica Fisher with a discussion of how they reconcile failed payments.

No matter what goes wrong, our top priority is to make sure that customers receive service for which they’ve been charged, and aren’t charged for service they haven’t received.

A couple of weeks ago, I linked to a story about Resilience4j, a fault tolerance library for Java. This week brings the second installment, which shows you how to use it to implement circuit breakers. There’s also an interesting discussion of one of the implementation details.
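
If circuit breakers are new to you, here’s a bare-bones sketch of the pattern itself, in Python rather than Java and deliberately not Resilience4j’s API: stop calling a failing dependency once errors cross a threshold, then let a single probe call through after a cool-down.

    # Minimal circuit breaker sketch: the pattern, not Resilience4j's API.
    import time

    class CircuitBreaker:
        def __init__(self, max_failures=5, reset_timeout=30.0):
            self.max_failures = max_failures    # consecutive failures before opening
            self.reset_timeout = reset_timeout  # seconds to wait before a probe call
            self.failures = 0
            self.opened_at = None               # None means the circuit is closed

        def call(self, fn, *args, **kwargs):
            if self.opened_at is not None:
                if time.monotonic() - self.opened_at < self.reset_timeout:
                    raise RuntimeError("circuit open: failing fast")
                # Half-open: allow one probe call to see if the dependency recovered.
            try:
                result = fn(*args, **kwargs)
            except Exception:
                self.failures += 1
                if self.failures >= self.max_failures:
                    self.opened_at = time.monotonic()  # open (or re-open) the circuit
                raise
            else:
                self.failures = 0
                self.opened_at = None
                return result

Real libraries layer sliding windows, per-exception policies, and metrics on top of this core idea, which is where the implementation details get interesting.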

Here’s a cute little debugging story. Turns out ntpd has a bit of a blind spot!

Adcash CTO Arnaud Granal gives us a rare glimpse into the multiple iterations of their infrastructure. Hear what worked well and what didn’t as they scaled to handle 500k requests per second at peak.

Outages

  • OpenSRS (DNS provider)
    • OpenSRS (registrar and DNS provider, among other services) had a major outage in their DNS service.

      At 1AM UTC we were the target of a sophisticated DNS attack that was followed by an unrelated double failure of core network equipment at our main Canadian data center, caused by an undocumented software limitation.

      Yikes.

  • Amadeus (airline booking system)
    • Amadeus provides the technical underpinnings of many airlines around the world. They had issues this past week, taking a lot of airlines down with them.
  • SourceForge
    • Our [data center] hosting provider has been having issues with a power distribution unit.

  • Facebook

SRE Weekly Issue #90

A couple of DNS-related links this week.  I’ll be giving a talk at Velocity NYC on all of the fascinating things I learned about DNS in the wake of the Dyn DDoS and the .io TLD outage last fall.  If you’re there, hit me up for some SRE Weekly swag!

SPONSOR MESSAGE

Like DevOps? Register for All Day DevOps – a FREE online conference this October, offering 100 DevOps-focused sessions across six different tracks. Learn more & register:
http://bit.ly/2waBukw

Articles

We’re all becoming distributed systems engineers, and this stuff sure isn’t easy.

Isn’t distributed programming just concurrent programming where some of the threads happen to execute on different machines? Tempting, but no.

Every-second canarying is a pretty awesome concept. Not only that, but they even post the results on their status page. Impressive!
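
For a concrete sense of what that means, here’s a minimal sketch of a once-per-second canary probe (the endpoint URL and the use of the requests library are my assumptions; the real implementation surely does far more):

    # Minimal every-second canary sketch; the endpoint URL is a placeholder.
    import time
    import requests

    CANARY_URL = "https://example.com/healthz"  # hypothetical canary endpoint
    results = []                                # (timestamp, ok, latency_seconds)

    while True:
        started = time.monotonic()
        try:
            ok = requests.get(CANARY_URL, timeout=1).status_code == 200
        except requests.RequestException:
            ok = False
        results.append((time.time(), ok, time.monotonic() - started))
        # Sleep out the remainder of the one-second interval.
        time.sleep(max(0.0, 1.0 - (time.monotonic() - started)))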

So many lessons! My favorite is to make sure you test the “sad path”, as opposed to just the “happy path”. If a customer screws up their input and then continues on correctly from there, does everything still work?

Extensive notes taken during 19 talks at SREcon17 EMEA. I’m blown away by the level of detail. Thanks, Aaron!

A cheat sheet and tool list for diagnosing CPU-related issues. There’s also one on network troubleshooting by the same author. Note: LinkedIn login required to view.

Antifragility is an interesting concept that I was previously unaware of. I’m not really sure how to apply it practically in an infrastructure design, but I’m going to keep my eye out for antifragile patterns.

It’s easy to overlook your DNS, but a failure can take your otherwise perfectly running infrastructure down — at least from the perspective of your customers.

Do you run a retrospective on near misses? The screws they tightened in this story could just as easily be databases quietly running at max capacity.

A piece of one of the venting systems fell and almost hit an employee, which almost certainly would have caused a serious injury and possibly death. The business determined that (essentially) a screw came loose, causing the part to fall. It then checked the remaining venting systems, learned that other screws had started coming loose as well, and was able to resolve the issue before anyone got hurt.

Oh look, Azure has AZs now.

The transport layer in question is gRPC, and this article discusses using it to connect a microservice-based infrastructure. If you’ve been looking for an intro to gRPC, check this out.

How do you prevent human error? Remove the humans. Yeah, I’m not sure I believe it either, but this was still an interesting read just to learn about the current state of lights-out datacenters.

This is a really neat idea: generate an interaction diagram automatically using a packet capture and a UML tool.

Thanks to DevOps Weekly for this one.
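
To make the idea concrete, here’s a rough sketch of how such a script might look, assuming scapy for parsing the capture and PlantUML for rendering (the pcap filename is a placeholder):

    # Sketch: emit a PlantUML sequence diagram from a packet capture.
    # Assumes scapy is installed; "capture.pcap" is a placeholder filename.
    from scapy.all import IP, TCP, rdpcap

    lines = ["@startuml"]
    seen = set()
    for pkt in rdpcap("capture.pcap"):
        if IP in pkt and TCP in pkt:
            edge = (pkt[IP].src, pkt[IP].dst, pkt[TCP].dport)
            if edge not in seen:  # one arrow per (source, destination, port)
                seen.add(edge)
                lines.append(f'"{edge[0]}" -> "{edge[1]}": tcp/{edge[2]}')
    lines.append("@enduml")
    print("\n".join(lines))  # pipe this into plantuml to render the diagram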

Outages

  • .io
    • The .io TLD went down again, in exactly the same way as last fall.
  • PagerDuty
    • PagerDuty suffered a major outage lasting over 12 hours this past Thursday. Customers scrambled to come up with other alerting methods.
      Some really excellent discussion around this incident happened on the hangops Slack in the #incident_response channel. I and others requested more details on the actual paging latency, and PagerDuty delivered them on their status site. Way to go, folks!
  • StatusPage.io
    • I noticed this minor incident after getting a 500 reloading PagerDuty’s status page.
  • The Travis CI Blog: Sept 6 – 11 macOS outage postmortem
    • This week, Travis posted this followup describing the SAN performance issues that impacted their system.
  • Outlook and Hotmail

SRE Weekly Issue #89

SPONSOR MESSAGE

Acknowledge and resolve IT & DevOps alerts directly from Slack with the new native integration with VictorOps. Learn all about it here:
http://try.victorops.com/slack/SREWeekly

Articles

Cachet looks like a pretty good challenger to incumbents like StatusPage.

Hosted Graphite used PySyncObj to create a fault-tolerant threshold alerting feature.
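
The article covers their actual design; purely to illustrate what PySyncObj buys you, here’s a minimal sketch of replicated state that could back a “have we already alerted on this threshold?” check (class and method names are mine, not Hosted Graphite’s):

    # Minimal PySyncObj sketch; illustrative only, not Hosted Graphite's design.
    from pysyncobj import SyncObj, replicated

    class AlertState(SyncObj):
        def __init__(self, self_addr, partner_addrs):
            # e.g. AlertState("node1:4321", ["node2:4321", "node3:4321"])
            super().__init__(self_addr, partner_addrs)
            self._fired = set()  # alert IDs we have already paged on

        @replicated
        def mark_fired(self, alert_id):
            # Applied on every node once the Raft log commits it, so all
            # replicas agree on which alerts have already been sent.
            self._fired.add(alert_id)

        def already_fired(self, alert_id):
            return alert_id in self._fired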

Talk about a high-pressure incident! When a teleconferencing provider’s wires got crossed, hilarity (and embarrassment) ensued.

This article is published by my sponsor, VictorOps, but their sponsorship did not influence its inclusion in this issue.

This story is from a PagerDuty engineer. What’d you learn while shadowing on-call? I’d love to hear your story!

Here’s how SYNQ set their status page up. They’re the folks that committed to publishing all of their incident followups publicly a month or two back. Transparency FTW!

I’ll save you the math: that’s ~17k req/sec. I really like that this article takes us through their learning process and their first failed attempts.

Quid wrote up this explanation of how they set up their game day and what they learned. I really like the structure they used, and I may draw heavily on it for my own game days.

“Observability” as a term is making the rounds like “DevOps” did (and still does…). Here’s Baron Schwartz’s take on it.

Outages

  • Google Services
    • As two astute readers pointed out (thanks!), the Gmail outage I included in the last issue was from 2009(!). Oops. However, Google has been experiencing a series of outages and degradations this month, so I’m just going to pretend I knew that rather than that I forgot to check the date on the article.
  • Amazon S3
    • S3 had an outage in us-east-1 on September 14th. This one showed up as yellow on their status site, with the text below. Companies that depend on S3 probably saw impact as well, but I couldn’t find any status posts other than Heroku’s.

      11:58 AM PDT We are investigating increased error rates for Amazon S3 requests in the US-EAST-1 Region.
      12:20 PM PDT We can confirm that some customers are receiving throttling errors accessing S3. We are currently investigating the root cause.
      12:38 PM PDT We continue to work towards resolving the increased throttling errors for Amazon S3 requests in the US-EAST-1 Region. We have identified the subsystem responsible for the errors, identified root cause and are now working to resolve the issue.
      12:49 PM PDT We are now seeing recovery in the throttle error rates accessing Amazon S3. We have identified the root cause and have taken actions to prevent recurrence.
      1:05 PM PDT Between 11:40 AM and 12:56 PM PDT we experienced throttling errors accessing Amazon S3 in the US-EAST-1 Region. The issue is resolved and the service is operating normally.

      Full disclosure: Heroku is my employer.

  • IBM
    • IBM had a mishap when transferring control of some of its domains to a different registrar. Some of their services, including their Global Load Balancer, went down.

SRE Weekly Issue #88

SPONSOR MESSAGE

Acknowledge and resolve IT & DevOps alerts directly from Slack with the new native integration with VictorOps. Learn all about it here:
http://try.victorops.com/slack/SREWeekly

Articles

From Caitie McCaffrey:

I’m often asked how to get started with Distributed Systems, so this post documents my path and some of the resources I found most helpful. It is by no means meant to be an exhaustive list.

Julia Evans just blew my mind (once again). In this article, among other things, she links to a tool that tells you which function in the kernel dropped a packet. I’ve been wishing for such a tool for years!

I love that companies are starting to publish lessons learned from game days and other chaos experiments. Just like a post-incident followup, there’s so much we can learn by following along.

It’s an absolute must for any disaster recovery plan worth its name to include power supply as a crucial factor – because, without power, you simply can’t do business.

Here’s the last installment of Jason Hand’s digest version of his new eBook, Post-Incident Reviews.

If I leave you with one take-away from this guide, it should be that every incident provides an opportunity for your team to be more prepared for the next one.

This article is published by my sponsor, VictorOps, but their sponsorship did not influence its inclusion in this issue.

How can you prevent a colo failure? Obviously, colo customers can’t, but we can at least prepare. This article has advice for understanding a provider’s history, policies, and procedures related to outages.

Just click through.

In this analysis of the factors leading to a plane crash, we see another example of the critical role that human/computer interfaces play in helping (or hindering) humans as they try to recover from a system failure.

Move over, backhoes: water is the other natural enemy of the fiber optic network.

The New York Times has a Kafka installation containing everything they’ve published in their entire history, and it powers the front page, search, suggestions, and everything else.

Outages

  • AbeBooks.com
    • AbeBooks is the place to go for out-of-print books and old editions. The site going down meant that many used booksellers lost a major sales outlet.
  • Gmail
  • Apple developer portal
  • Google Drive
  • iCloud Mail
  • Heroku
    • Heroku posted a pile of public followups this past week:
      • Incidents 1251 and 1254 – In both of these incidents, applications failed due to missing Debian packages normally provided by the Heroku platform.
      • Incident 1257 – For a few minutes, 10% of requests to Heroku applications hosted in Europe failed.
      • Incident 1270 – Applications last deployed over 3 years ago spontaneously stopped working.

      Full disclosure: Heroku is my employer.

SRE Weekly Issue #87

SPONSOR MESSAGE

More efficiently reach on-call teams and incident responders with a new way to deploy Live Call Routing using Twilio Functions and VictorOps. Check it out:
http://try.victorops.com/LiveCallRouting/SREWeekly

Articles

John Allspaw describes the Architecture Review Working Group at Etsy. I like the idea of an open discussion with peers before creating a novel system that will add significant operational burden.

Here’s part two of Jason Hand’s series of posts with key takeaways from his new eBook, “Post-Incident Reviews”. In the next three chapters, he shows why a traditional RCA process misses the mark.

[…] problems stem — not from one primary cause — but from the complex interplay of our systems and the teams tasked with managing them.

Honeycomb.io eschews plain monitoring in favor of “observability”, which they define as the ability to “ask any arbitrary question” about a system.

But here’s the thing: in distributed systems, or in any mature, complex application of scale built by good engineers… the majority of your questions trend towards the unknown-unknown.

Here’s another primer on microservices. It has a nice “caveats” section, which is exactly where operations and reliability come into the picture.

Facebook shared a lot of detail about how they evolved from 3 daily pushes to quasi-continuous releases. They’ve got a well-defined canary system, reminding me of Charity’s article on testing in production last week.

AppDynamics presents their list in shiny PDF form. You’ll have to fill in your contact info (spam-bucket email address, anyone?) to download it.

PagerDuty is hosting a “breakathon”: small teams will compete to resolve a series of infrastructure issues. Sounds like a bunch of fun!

Outages

  • Japan
    • Google accidentally announced some BGP prefixes it shouldn’t have, taking Japan offline for a couple of hours. Linked above is a really neat in-depth analysis from BGPmon, for all you BGP geeks out there.

      Since Google essentially leaked a full table towards Verizon, we get to peek into what Google’s peering relationships look like and how their peers traffic engineer towards Google.

  • Heroku
  • AWS
    • EC2’s Ireland region suffered an outage in VPC peering on August 23. Their status site doesn’t allow for deep links, so here’s an excerpt:

      11:32 AM PDT We are investigating network connectivity issues for some instances in the EU-WEST-1 Region.

      11:55 AM PDT We have identified root cause of the network connectivity issues in the EU-WEST-1 Region. Connectivity between peered VPCs is affected by this issue. Connectivity between instances within a VPC or between instances and the Internet or AWS services is not affected. We continue to work towards full recovery.

      12:51 PM PDT Between 10:32 AM and 12:44 PM PDT we experienced connectivity issues when using VPC peering in the EU-WEST-1 Region. Connectivity between instances in the same VPC and from instances to the Internet or AWS services was not affected. The issue has been resolved and the service is operating normally.

  • Google Cloud
    • Google Cloud suffered a massive 30-hour worldwide outage in some cloud load-balancers. In their impressive style, they posted frequent updates during the incident and issued a followup analysis just 2 days after resolution.

      In order to prevent the issue, Google engineers are working to enhance automated canary testing that simulates live-migration events, detection of load balancing packet loss, and enforce more restrictions on new configuration changes deployment for internal representation changes.

  • WhatsApp
  • Twitch (video streaming service)