
SRE Weekly Issue #150

A message from our sponsor, VictorOps:

The golden signals of SRE are essential metrics to monitor when developing reliable systems. But, the golden signals are just the starting point. See how SRE teams are going past the golden signals to proactively build reliability into their services:

http://try.victorops.com/sreweekly/sre-golden-signals

Articles

This article is a condensed version of a talk, but it stands firmly on its own. Their Production-Grade Infrastructure Checklist is well worth a read.

Yevgeniy Brikman — Gruntwork

More and more, the reliability of our infrastructure is moving into the realm of life-critical.

Thanks to Richard Cook for this one.

Linda Comins — The Intelligencer

Detailed notes on lots of talks from SRECon, with a great sum-up at the top discussing the major themes of the conference.

Max Timchenko

Drawing from an @mipsytipsy Twitter thread from back in February, this article is a great analysis of why it’s right to put developers on call and how to make it humane. I especially like the part about paying extra for on-call, a practice I’ve been hearing more mentions of recently.

John Barton

Really? Never? I could have sworn I remembered reading about power outages…

Yevgeniy Sverdlik — DataCenter Knowledge

Lots of good stuff in this one about preventing mistakes and analyzing failures.

Rachel Bryan — Swansea University

Outages

SRE Weekly Issue #149

A message from our sponsor, VictorOps:

Runbook automation leads to nearly instant on-call incident response. SRE teams can leverage runbook automation to deepen cross-team collaboration, surface context to on-call responders, and shorten the incident lifecycle, ultimately helping overall service reliability:

http://try.victorops.com/sreweekly/runbook-automation-for-sre

Articles

But does that mean you don’t need to think about reliability issues associated with large-scale distributed systems? The answer is, not completely. While there are many things that GCP and Cloud Functions handle behind the scenes, you still need to keep a couple of best practices in mind while building a reliable serverless solution.

Slawomir Walkowski — Google

The Emotet malware gang is probably managing their server infrastructure better than most companies are running their internal or external IT systems.

Catalin Cimpanu — Zero Day

Designing a distributed data store is about juggling competing priorities. This author discusses the latency penalty you pay for synchronous replication, and why you might want it anyway.

Daniel Abadi
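
To make Abadi's trade-off concrete, here's a back-of-the-envelope latency model; it's my own illustration with invented numbers, not figures from the article. A synchronous commit can only be acknowledged after the slowest replica confirms the write, while an asynchronous commit returns as soon as the local write lands.

LOCAL_WRITE_MS = 1.0
REPLICA_RTT_MS = [12.0, 45.0, 80.0]  # hypothetical round-trip times to three replicas

def async_commit_latency():
    # Replication happens in the background; the client only waits for the local write.
    return LOCAL_WRITE_MS

def sync_commit_latency():
    # The commit is acknowledged only after every replica confirms the write,
    # so the slowest replica sets the floor on commit latency.
    return LOCAL_WRITE_MS + max(REPLICA_RTT_MS)

print(f"async commit: {async_commit_latency():.1f} ms")  # 1.0 ms
print(f"sync commit:  {sync_commit_latency():.1f} ms")   # 81.0 ms, paying for the slowest replica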

Learn how Etsy designed tooling and a repeatable process to forecast resource usage.

Daniel Schauenberg — Etsy

Check out how Grab implemented chaos engineering.

Roman Atachiants, Tharaka Wijebandara, Abeesh Thomas — Grab

Neat idea: use machine learning to select which automated tests to run for a given code change. The goal is a high likelihood of finding bugs while running fewer tests than traditional methods.

Mateusz Machalica, Alex Samylkin, Meredith Porth, and Satish Chandra — Facebook
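
If you're curious what that looks like in practice, here's a rough sketch of the general idea, heavily simplified and not Facebook's actual system: score each candidate test with a learned probability of catching a bug in the change, then run tests in descending order until the chance of having missed every failing test is acceptably small. The test names and scores below are hypothetical.

from typing import Dict, List

def select_tests(predicted_fail_prob: Dict[str, float], target_detection: float) -> List[str]:
    # Run tests in descending order of predicted value until the chance of
    # having missed every bug-revealing test drops below 1 - target_detection.
    selected = []
    prob_miss_all = 1.0
    for test, p in sorted(predicted_fail_prob.items(), key=lambda kv: -kv[1]):
        selected.append(test)
        prob_miss_all *= (1.0 - p)  # treats tests as independent -- a big simplification
        if 1.0 - prob_miss_all >= target_detection:
            break
    return selected

# Hypothetical per-test scores that some learned model produced for one code change:
scores = {"test_checkout": 0.40, "test_cart": 0.25, "test_search": 0.05, "test_login": 0.01}
print(select_tests(scores, target_detection=0.5))  # -> ['test_checkout', 'test_cart']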

In this blog post, we are going to discuss how the Auth0 Site Reliability team, led by Hernán Meydac Jean, used a progressive approach to build a mature service architecture characterized by high availability and reliability.

The system in question is a home-grown feature flags implementation.

Dan Arias — Auth0

Outages

The usual glut of Black Friday outages.  I hope you all had an uneventful Friday.

SRE Weekly Issue #148

A message from our sponsor, VictorOps:

In case you don’t know the actual numbers, the costs of downtime can be pretty extreme. The benefits of SRE not only extend to system reliability and deployment speed, but also create a lot of business value:

http://try.victorops.com/sreweekly/costs-of-downtime

Articles

Last year, PagerDuty shared their incident response documentation. Now they’ve posted their training materials as well!

PagerDuty

Dropbox’s write-heavy read-light usage pattern makes this architecture overview worth a read.

Diwaker Gupta — Dropbox

There are two reasons to love this paper. First off, we get some insights into the backend that powers WeChat; and secondly the authors share the design of the battle hardened overload control system DAGOR that has been in production at WeChat for five years.

Adrian Colyer — The Morning Paper (review and summary)

Zhou et al. (original paper)

A tale of a nearly business-ending security incident and outage. Transparency and solid incident management helped them survive the event and prosper.

Paul Biggar

The section titled “A surprising discovery” is really thought-provoking:

It turns out that a single (bot) user was sending us a lot of traffic to a particularly slow endpoint. So while this was impacting the p99 latency, it was in fact not impacting any other users.

Igor Wiedler — Travis CI
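
Here's a toy illustration of that discovery, with invented numbers rather than Travis CI's data: a single heavy user of a slow endpoint can own the p99 while leaving everyone else's experience untouched.

import random
random.seed(0)

def p99(samples):
    ordered = sorted(samples)
    return ordered[int(0.99 * (len(ordered) - 1))]

everyone_else = [random.uniform(50, 200) for _ in range(9_800)]     # ms, normal traffic
one_bot = [random.uniform(4_000, 6_000) for _ in range(200)]        # ms, ~2% of requests hitting a slow endpoint

print(f"p99 with the bot:    {p99(everyone_else + one_bot):,.0f} ms")  # dominated by the bot
print(f"p99 without the bot: {p99(everyone_else):,.0f} ms")            # what everyone else actually sees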

An (external) analysis of the GitHub outage, with a discussion of how Orchestrator reacts to a network partition.

Ayende Rahien

I’m working on a talk for kubecon in December! One of the points I want to get across is the amount of time/investment it takes to use new software in production without causing really serious incidents, and what that’s looked like for us in our use of Kubernetes.

Julia Evans

Outages

SRE Weekly Issue #147

A message from our sponsor, VictorOps:

Alert fatigue creates confusion, causes undue stress on your team, and hurts the overall reliability of the services you build. See how you can mitigate alert fatigue and build more reliable systems while making people happier:

http://try.victorops.com/sreweekly/effects-of-incident-alert-fatigue

Articles

This is an excellent summary of a talk on testing in production last month.

“Distributed systems are incredibly hostile to being cloned or imitated, or monitored or staged,” she said. “Trying to mirror your staging environment to production is a fool’s errand. Just give up.”

Joab Jackson — The New Stack

The pros and cons of Calvin and Spanner, two data stores whose papers were published in 2012. According to the author, Calvin comes out as the favorite.

Daniel Abadi

What a cool concept!

RobinHood brings SLO violations down to 0.3%, compared to 30% SLO violations under the next best policy.

Adrian Colyer — The Morning Paper (summary)

Berger et al. (original paper)

With thousands(!) of MySQL shards, Dropbox needed a way to have transactions span multiple shards while maintaining consistency.

Daniel Tahara — Dropbox
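
For readers who haven't bumped into cross-shard atomicity before, here's a bare-bones two-phase-commit sketch. It's a generic illustration of the problem space, not Dropbox's actual design, which the article describes in detail; the shard names and writes are made up.

class Shard:
    def __init__(self, name):
        self.name = name
        self.staged = None

    def prepare(self, write):
        # Stage the write and vote yes; a real shard would persist this intent
        # (and could vote no, forcing the whole transaction to abort).
        self.staged = write
        return True

    def commit(self):
        print(f"{self.name}: committed {self.staged!r}")

    def abort(self):
        print(f"{self.name}: aborted {self.staged!r}")
        self.staged = None

def cross_shard_transaction(writes):
    # Phase 1: every shard must vote yes before anything becomes visible.
    if all(shard.prepare(write) for shard, write in writes.items()):
        for shard in writes:
            shard.commit()  # Phase 2: apply on every shard
    else:
        for shard in writes:
            shard.abort()

cross_shard_transaction({Shard("shard-1"): "debit account A",
                         Shard("shard-2"): "credit account B"})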

This is an excellent introduction to heatmaps with some hints on how to interpret a couple of common patterns.

Danyel Fisher — Honeycomb

This is a neat idea. By modelling the relationships between the components in your infrastructure, you can figure out which one might be to blame when everything starts alerting at once. Note: this article is heavily geared toward Instana.

Steve Waterworth — Instana
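
As a rough illustration of the concept, and emphatically not Instana's algorithm: given a dependency graph and the set of currently alerting components, the best suspects are the alerting components whose own dependencies are all healthy. The service graph below is hypothetical.

DEPENDS_ON = {  # hypothetical service graph: service -> what it calls
    "web": ["api"],
    "api": ["auth", "db"],
    "auth": ["db"],
    "db": [],
}

def likely_root_causes(alerting):
    def reachable(start):
        # All transitive dependencies of a component.
        seen, stack = set(), list(DEPENDS_ON[start])
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(DEPENDS_ON[node])
        return seen

    # A component is a candidate root cause if none of its own dependencies are alerting too.
    return [c for c in alerting if not (reachable(c) & set(alerting))]

print(likely_root_causes(["web", "api", "db"]))  # -> ['db']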

Automated bug fixing seems to be all the rage lately. I wonder, is it practical for companies that aren’t the size of Facebook or Google?

Johannes Bader, Satish Chandra, Eric Lippert, and Andrew Scott — Facebook

Outages

SRE Weekly Issue #146

A message from our sponsor, VictorOps:

Automation can be used to help classify incident severity and route alerts to the right person or team. Learn how SRE teams are leveraging a refined incident classification and alert routing process to improve system reliability:

http://try.victorops.com/sreweekly/classifying-incident-severity

Articles

NRE Labs is a no-strings-attached, community-centered initiative to bring the skills of automation within reach for everyone. Through short, simple exercises, all right here in the browser, you can learn the tools, skills, and processes that will put you on the path to becoming a Network Reliability Engineer.

Tips on designing your on-call to be fair to the humans involved, including gems like an automatic day off after a middle-of-the-night page.

David Mytton — StackPath

GitHub’s major outage stemmed from a brief cut in connectivity between two of their data centers.

Errata: Last week I mentioned the possibility of a network cut and cited an article about GitHub’s database architecture. I should have credited @dbaops, who made the connection.

Rumors of undocumented packet rate limits in EC2 abound, and I’ve personally run afoul of them. Backed by direct experimentation, this article unmasks the limits.

Matthew Barlocker — Blue Matador

This sounds an awful lot like those packet rate limits from the previous article…

Chris McFadden — SparkPost

Ever hear of that traffic intersection where they took out all of the signs, and suddenly everyone drove more safely? Woolworth’s tried a similar experiment with their stores, with interesting results.

Sidney Dekker — Safety Differently

Find out how they discovered the bug and what they did about it. Required reading if you use gRPC, since in some cases it fails to obey timeouts.

Ciaran Gaffney and Fran Garcia — Hosted Graphite
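
As a reminder of what setting a deadline looks like from the client side, here's a hedged grpcio sketch, not Hosted Graphite's code. PingServiceStub and PingRequest stand in for whatever your generated proto classes are called, so treat them as hypothetical; without an explicit timeout, a grpcio call can wait indefinitely.

import grpc
# from ping_pb2 import PingRequest            # hypothetical generated message class
# from ping_pb2_grpc import PingServiceStub   # hypothetical generated stub class

def ping_with_deadline(stub, request):
    try:
        # timeout= puts an upper bound (in seconds) on how long this call may take.
        return stub.Ping(request, timeout=2.0)
    except grpc.RpcError as err:
        if err.code() == grpc.StatusCode.DEADLINE_EXCEEDED:
            # Fail fast and let the caller retry or degrade gracefully.
            return None
        raise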

when we sit with a team to plan the experiment, that is when the light goes on… they start realising how many things they missed and they start cataloging what bad things could happen if something goes bad…

Russ Miles — ChaosIQ

Outages
