General

SRE Weekly Issue #140

lex

September 23, 2018

Articles

My sincerest apologies to Dale Markowitz, the author of this article who I mispronouned in last week’s issue. I’m kicking myself, because I totally didn’t need to use a pronoun at all.

Dale Markowitz — LOGIC Magazine

Linux 4.19-rc4 released, an apology, and a maintainership note

Linus Torvalds made waves this week with an email apologizing for his unprofessional behavior and committing to improving.

Linus Torvalds

Designing for Failure to Avoid Disaster

A pretty detailed article on how LaunchDarkly designed their system for reliability. The streaming vs. polling section is especially interesting.

Adam Zimman — LaunchDarkly

Full disclosure: Fastly, my employer, is mentioned.

LogDevice: a distributed data store for logs – Facebook Code

Lots of details about how they achieve their reliability goals. I’d love to see a followup with more detail on why writing a solution in-house made sense versus adopting something like Kafka.

Mark Marchukov — Facebook

13 Reasons a Staging Environment Is Failing in Your Organization – DZone DevOps

The staging environment plays an important part. If staging isn’t working for your organization, make sure you aren’t making these common mistakes.

Harshit Paul — DZone

Mockers – overcoming testing challenges at Grab

The challenges in question involve testing a microservice’s interactions with other microservices. Read about their system for distributing and running mock servers for each microservice.

Mayank Gupta, K.Vineet Nair, Shivkumar Krishnan, Thuy Nguyen, and Vishal Prakash — Grab

BP is to blame for Deepwater Horizon, but its mistake was actually years of small mistakes.

My partner suggested I look into the Deepwater Horizon incident, and I’m glad I did. My two key takeaways were normalization of deviance and this gem:

Researchers who study disasters tell us that a long period without an accident can be a big risk factor in itself: Workers learn to expect safe operation as the norm and can’t even conceive of a devastating failure.

James B. Meigs — Slate

Outages

SRE Weekly Issue #139

lex

September 16, 2018

General

Comments

View on sreweekly.com

Articles

Let’s Encrypt at Scale

Find out how AutoTrader deployed TLS to 3000 vendor websites, and what they did when things went wrong despite their careful deployment strategy.

Lee Goodman — AutoTrader

Incident Response – Speedbird 38

An excellent short piece about incident response, using the radio recordings from an aircraft accident as a case study.

Sri Ray

checklists: an operational gift

No production operation is too big or too small for a checklist. Similarly, no situation is too strenuous for one.

Sri Ray

How to write a status page update

[…] in this new series, we’re sharing some of our internal SRE processes. This first post looks at the guidelines our SRE team follow to communicate with customers during an incident, with some practical tips, examples, and the thinking behind it all.

Fran Garcia — Hosted Graphite

Multi-Cloud Is a Trap

Here’s why adopting a multi-cloud strategy may not do what you want, while also making your life much harder.

Tyler Treat

Finding and fixing software bugs automatically with SapFix and Sapienz

Last fall, I linked to a couple of talks on research in automated bugfixing. Facebook has now deployed such a system to production.

Yue Jia, Ke Mao, Mark Harman — Facebook

Postmortem: VSTS 4 September 2018

Microsoft’s Visual Studio Team System (VSTS) was one of the services impacted by the major Azure outage earlier this month. Here’s an in-depth analysis of what went wrong and what they might (or might not) be able to do to prevent a similar incident.

Buck Hodges — Microsoft

Outages

GitHub
Travis CI
- Also this one.
Twitch
Xero
- Xero experienced an outage this week and posted this article explaining what went wrong.
  Tony Stewart — Xero

SRE Weekly Issue #138

lex

September 9, 2018

General

Comments

View on sreweekly.com

Articles

096: Resilience Engineering with John Allspaw – Greater Than Code

This episode of Greater Than Code features John Allspaw, and it’s pretty much as awesome as I expected. Some highlights:

rather than asking how an incident happened, ask what prevented it from being worse
ask “how” rather than “why” an incident happened
humans plus technology are together a cognitive system
how can you make automation a team player?

Janelle Klein, John Sawers, Rein Henrichs, and Jessica Kerr, with John Allspaw

Serverless: Cold Start War

What does cold start look like on various FaaS platforms? This article has hard numbers obtained through empirical testing.

Mikhail Shilkov

@colmmacc on Twitter: shuffle sharding

Colm MacCárthaigh explains how shuffle sharding improves reliability by acting like some kind of magic lever made of math.

Colm MacCárthaigh — AWS (thanks to Thread Reader for the thread rollup)

Why CDN capacity numbers don’t matter

Who cares if your CDN has an eleventeen terabaud backbone uplink? What really matters is how they can serve your traffic.

Matt Levine — CacheFly

The Servers Are Burning

An engineer pushes a small change and OkCupid goes up in flames.

A new, entry-level employee takes down a big site — due not to a bug in his software, but in a dependency.

Dale Markowitz — LOGIC Magazine (Issue #5)

Observerless: The hottest new thing in monitoring you’re already doing

What happens when you mix Observability and Serverless? Corey Quinn of Last Week in AWS lets you in on the hottest new operations practice.

Corey Quinn

Undersea Internet cables are at risk

How will climate change and rising sea levels impact the reliability of our networks?

Carol Barford — iAfrikan

NOVA – Why Trains Crash

I watched this Nova (PBS) episode this week, and I highly recommend it. It explores why trains crash and what governments are doing to improve safety. The link above requires membership, but you can also watch it on Netflix.

PBS

Outages

Google Cloud Storage
- Linked is Google’s apology and followup analysis. Other Google Cloud Platform services dependent on GCS were also impacted.
Azure status
Papertrail
- August 31:
  
  Our hosting provider will be restarting a significant number of servers during this time window.
  
  September 5:
  
  Our provider has taken down more than expected capacity.
Slack
HubSpot
Amazon.com search
Azure (South Central US) and Office 365
- Lightning strikes took out the cooling systems, causing an emergency shutdown.
Facebook

SRE Weekly Issue #137

lex

September 2, 2018

General

Comments

View on sreweekly.com

Articles

Auth0 Architecture: Running In Multiple Cloud Providers And Regions

Read about their transition from multi-cloud to all AWS and how they scaled to 10x the login throughput.

Dirceu Tiegs — Auth0

Franken-algorithms: the deadly consequences of unpredictable code

This article on the emergent behavior of algorithms is well worth thinking about as an SRE. Even without machine learning, our infrastructures have complex emergent behaviors, as you can read in any incident retrospective.

Andrew Smith — The Guardian

Netflix, LinkedIn and Gremlin Engineers Talk Chaos Engineering – The New Stack

This interesting pitfall of chaos engineering stood out to me:

[…] if you hand a team 50 vulnerabilities, they’re probably not going to fix any of them. You know what I mean? So you have to figure out a way to prioritize those …

Andrea Echstenkamper with Nora Jones (Netflix), Ted Strzalkowski (LInkedIn), and Pat Higgins (Gremlin)

We want machines to be people and people to be machines. What is wrong with us?

Well worth a quick listen (2 minutes 30 seconds).

Todd Conklin — Pre-Accident Podcast

Now available: The open source guide to DevOps monitoring tools

In this series, we’ll dig into different types of observability tools. For each type, we’ll cover what they’re used for, what specific tools are available, some use cases, and any unique characteristics that may come up during your search for a new tool.

Linked above is an introduction to the article series. The first in the series is also out, focusing on time-series metric systems.

Dan Barker

Outages

Slack
GitHub
Duo
- Duo posted this followup analysis for two major outages in the past two weeks.
Tesla car network
Heroku Incident #1620
- Also #1622.
Microsoft Office 365
OCBC (bank)
Scotiabank

SRE Weekly Issue #136

lex

August 26, 2018

General

Comments

View on sreweekly.com

Articles

The 18 ghosts in your infrastructure stack that can cause failure (and how to avoid them)

This infographic shows how Ably’s client library and backend infrastructure is designed to work around many common failure modes. My favorite: they have redundant TLS certificates from distinct issuers.

Matthew O’Riordan — Ably

QA Instability Implies Production Instability

This article argues that spending a little time to fix staging can make production significantly more stable.

Michael Nygard

Through a Dashboard Darkly – Brain of Buildchimp – Curated Selections from the Voices in My Head

This is a story of a flawed development process on top of a flawed infrastructure, without the necessary data to drive decision-making. It’s also a story of waking up to these problems and charting a way out.

[…]

As it turns out, pure reasoning cannot solve the kind of problems you see in the production environment of a complex application. These problems are almost always more difficult, since they have survived all of the testing you could throw at them.

John Casey

Simple/hard metrics that help reduce MTTR when looking for a root cause

A story of a somewhat rare failure case (a datacenter heat buildup event) and how to monitor for such a thing without contributing to metrics overload.

Pavel Trukhanov — okmeter

Shipping Software Should Not Be Scary – charity.wtf

On twitter this week, @srhtcn noted that “Many incidents happen during or right after release” and asked for advice on ways to fix this.

Great advice, useful for managers and individual contributors.

Charity Majors

Outages

Apple CloudKit
- There appears to be some prolonged issues with Apple’s CloudKit service today, which Apple offers to developers as a way to store user data and sync across devices. Several developers have reported to us that they have seen data for their apps temporarily wiped in the last 24 hours as the CloudKit service experiences some form of outage.
Heroku
Commonwealth Bank (AU)
Coles (Supermarket chain)
Sydney, AU train system
reddit
- And another one.

SRE Weekly Issue #140

Articles

Outages

SRE Weekly Issue #139

Articles

Outages

SRE Weekly Issue #138

Articles

Outages

SRE Weekly Issue #137

Articles

Outages

SRE Weekly Issue #136

Articles

Outages

Subscribe

RSS

Mastodon

Search Issues

General

SPONSOR MESSAGE

Articles

Outages

SPONSOR MESSAGE

Articles

Outages

SPONSOR MESSAGE

Articles

Outages

SPONSOR MESSAGE

Articles

Outages

SPONSOR MESSAGE

Articles

Outages

Subscribe

RSS

Mastodon

Search Issues