SRE Weekly Issue #171

A message from our sponsor, VictorOps:

[You’re Invited] Puppet, Splunk and VictorOps are teaming up for a live webinar on powering continuous improvement by combining analytics, incident response and automation. Learn best practices for releasing better applications faster, without the fire drills.

http://try.victorops.com/sreweekly/continuous-improvement-webinar

Articles

TL;DR: Prefer investing in recovery over prevention.

Make failure a non-event rather than trying to prevent it. You won’t fully succeed in preventing failures, and you’ll get out of practice at recovering from them.

Aaron Blohowiak

They had me at “normalization of deviance”. I’ll read pretty much anything with that in the title.

Tim Davies — Fast Jet Performance

Monzo’s system is directly integrated with Slack, helping you manage incidents and track what happens. Check out their video presentation for more details.

Monzo

Me too! Great thread.

Nolan Caudill and others

I love Honeycomb incident reviews, I really do.

Douglas Soo

Born from a Twitter argument thread, this article goes into depth about why Friday change freezes can do much more harm than good.

Charity Majors

Outages

SRE Weekly Issue #170

A message from our sponsor, VictorOps:

Our latest list of the top 12 server monitoring tools can help your SRE team get started in building a comprehensive monitoring strategy. Drive deeper service reliability through effective server monitoring:

http://try.victorops.com/sreweekly/top-server-monitoring-software

Articles

This myth is a misguided belief that engineers are like Laplace’s Demon; they maintain an accurate mental model of the system, foresee all the consequences of their actions, predict where the business is going, and are careful enough to avoid mistakes.

Aaron Blohowiak — Netflix

I highly recommend watching some of the talks or at least perusing slides.

The concern is that incidents have been investigated by parties that were involved in or related to the incident, raising the possibility of conflicts of interest. In a small company, avoiding this kind of thing may not be possible, but we should at least keep the risks in mind.

Patrick Kingsland — Railway Technology

An absolute treasure trove of links to many articles and papers on resilience engineering. Beyond just links, there are short profiles of 30+ important thinkers in the field. I’m going to be busy for a while.

@lorin (GitHub)

This is about project retrospectives, but it applies equally well to incident retrospectives.

Dominika Bula — Red Hat

Here’s a counterpoint to an article I linked to last week.

Karl Bode — Motherboard

Outages

SRE Weekly Issue #169

A message from our sponsor, VictorOps:

[Last Chance] Death to Downtime: How to Quantify and Mitigate the True Costs of Downtime. VictorOps and Catchpoint are teaming up for a live webinar on 5 monitoring and incident response best practices for preventing outages.

http://try.victorops.com/sreweekly/death-to-downtime-webinar

Articles

My coworker pointed me toward this article, and we had a really great conversation. I shared this article that I’d linked previously here, and it hit me: Boeing (and the FAA?) assumed MCAS was fine because a failure in it would look like a normal kind of failure with an established recovery procedure.

The problem is, we’ve seen that the recovery procedure can fail if the plane is moving so fast toward the ground that the pilots can’t physically pull it out of a dive. And it seems possible that no one knew that the recovery mechanism had this fatal vulnerability. This has all the hallmarks of a classic complex failure.

Thanks to John Goerzen for this one.

Richard McSpadden — AOPA

Pretty much any thread by Colm MacCárthaigh is a great read.

I think right around this minute is just about exactly 5 years since the Heartbleed vulnerability in OpenSSL became public. I remember the day vividly, and if you’re interested, allow me to tell you about how the day, and the subsequent months, and years unfolded …

Colm MacCárthaigh

Find out why going on call made sense for a Developer Advocate and how it went.

Liz Fong-Jones — Honeycomb

As the BGP route table grows, some devices will soon run out of space to store it all.

Catalin Cimpanu

The risk of logical damage to the data in a DB is the kind of risk that means there’s no such thing as a true rollback (You Can’t Have a Rollback Button).

Benji Weber

Our field is evolving toward adopting resilience engineering, and it’s not an easy process. This post goes into some detail on the mental struggle and points in the direction we need to go to get there.

Will Gallego [Note: Will is my coworker]

Outages

SRE Weekly Issue #168

A message from our sponsor, VictorOps:

[You’re Invited] Death to Downtime: How to Quantify and Mitigate the True Costs of Downtime. VictorOps and Catchpoint are teaming up for a live webinar on 5 monitoring and incident response best practices for preventing outages.

http://try.victorops.com/sreweekly/death-to-downtime-webinar

Articles

This one’s great for folks who are new to SRE, and it’s also an enlightening read for seasoned SREs. What caught me most was the Definition section, on what it means to be an SRE.

Alice Goldfuss

In this article adaptation of a conference talk, the author lays out 8 common pitfalls in chaos engineering, with detailed example stories for each. It goes much deeper than mere chaos engineering into the theory of how to operate complex systems.

Nora Jones

Automation can have unintended effects, and it often fails to have the effect we hope it will.

Thanks to Greg Burek for this one.

Courtney Nash

Having recently binge-watched Air Emergency, I felt that SREs can learn many things from the aviation industry.

Anshul Patel

Lessons learned by a software engineer on supporting their code in production.

Kashyap Kondamudi

CASE stands for Context-heavy, Actionable, Symptom-based, and Evaluated. That last one’s really key: the author proposes setting an expiration time for your alerts, after which you should evaluate them to make sure they still make sense.

Cory Watson
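As a toy illustration of the “Evaluated” idea, here’s a minimal Python sketch. The field names, review interval, and alert names are my own invention, not from the talk; the point is just that each alert definition carries a review-by date and anything past that date gets flagged for re-evaluation.

```python
from datetime import date

def alerts_needing_review(alerts, today):
    """Return the names of alert definitions whose review-by date has passed.

    `alerts` is a list of dicts with hypothetical keys "name" and "review_by";
    this schema is illustrative, not from the talk.
    """
    return [a["name"] for a in alerts if a["review_by"] <= today]

# Two made-up alert definitions with expiration dates attached.
alerts = [
    {"name": "disk_full", "review_by": date(2019, 1, 15)},
    {"name": "latency_slo_burn", "review_by": date(2019, 12, 1)},
]

# As of mid-April 2019, only "disk_full" is past its review date
# and should be re-evaluated before anyone trusts it again.
stale = alerts_needing_review(alerts, today=date(2019, 4, 14))
```

In practice the review date would live alongside the alert definition in your monitoring config, so stale alerts surface during routine review rather than at 3 a.m.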

Outages

  • Heroku: (EU) routing issues for ssl:endpoint applications
    • Heroku posted this followup for an outage on April 2.
  • The Travis CI Blog: Incident review for slow booting Linux builds outage
    • The outage happened March 27-28.
  • Azure VMs — North Central US
    • Since deep-linking to Azure incident summaries doesn’t work and this one is especially interesting, I’ll quote it here:

      Azure Storage team made a configuration change on 9 April 2019 at 21:30 UTC to our back-end infrastructure in North Central US to improve performance and latency consistency for Azure Disks running inside Azure Virtual Machines. This change was designed to be transparent to customers. It was enabled following our normal deployment process, first to our test environment, and lower impact scale units before being rolled out to the North Central US region. However, this region hit bugs which impacted customer VM availability. Due to a bug, VM hosts were able to establish a session with the storage scale unit but hit issues when trying to receive/send data from/to the storage scale unit. This situation was designed to be handled with fallback to our existing data path, but an additional bug led to failure in the fallback path and resulted in VM reboots.

  • Facebook, Instagram, and WhatsApp

SRE Weekly Issue #167

A message from our sponsor, VictorOps:

[You’re Invited] Death to Downtime: How to Quantify and Mitigate the True Costs of Downtime. VictorOps and Catchpoint are teaming up for a live webinar on 5 monitoring and incident response best practices for preventing outages.

http://try.victorops.com/sreweekly/death-to-downtime-webinar

Articles

This is an awesome write-up of SRECon, but the part I really love is the intro. The author gives voice to a growing tension I’ve seen in our field, as we try to adopt the tenets of Safety II which can seem to be at odds with traditional SRE practices. There’s a lot here that we SREs need to work out as our profession matures, and I’m really enjoying the process.

Tanya Reilly

Experts recommend trying to keep the concepts of blame, root cause, and hindsight bias out of our retrospective investigations. This insightful article explains that they all stem from the illusion that we are in full control of our systems.

Thanks to Will Gallego for this one.

Ryan Frantz

Here’s a top-notch followup analysis from Mailchimp on the Mandrill outage last month. Their PostgreSQL DB ran out of transaction IDs (a common failure mode), causing a painful outage. Tons of great stuff here, including a mention of rotating ICs every 3 hours to prevent exhaustion and allow them to sleep.

Mailchimp
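For context on that failure mode: PostgreSQL transaction IDs are 32-bit, which leaves roughly 2³¹ (about 2.1 billion) XIDs of headroom before wraparound, and `age(datfrozenxid)` reports how much of it a database has consumed. Here’s a minimal Python sketch of the headroom arithmetic; the alert threshold is my own illustrative choice, not from Mailchimp’s writeup.

```python
# PostgreSQL XIDs are 32-bit; roughly 2**31 transaction IDs of headroom
# exist before wraparound protection forces the database read-only.
XID_WRAPAROUND_LIMIT = 2**31  # ~2.1 billion

def xid_headroom(datfrozenxid_age: int) -> int:
    """Remaining transaction IDs before wraparound, given age(datfrozenxid)."""
    return XID_WRAPAROUND_LIMIT - datfrozenxid_age

def needs_urgent_vacuum(datfrozenxid_age: int,
                        threshold: int = 500_000_000) -> bool:
    """Alert when fewer than `threshold` XIDs remain (illustrative threshold)."""
    return xid_headroom(datfrozenxid_age) < threshold

# A database whose oldest unfrozen XID is 1.9 billion transactions old
# has only ~250 million XIDs left and needs an aggressive VACUUM now.
```

A real monitor would feed this from `SELECT datname, age(datfrozenxid) FROM pg_database`; the lesson from the outage is to alert on this long before autovacuum’s emergency behavior kicks in.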

And here’s where things get really interesting. Incidents are never as simple as they seem from the outside, and the 737 MAX situation is no exception. I anxiously await the full report, in which we’ll hear more about the confluence of contributing factors that must have been involved here.

Thom Patterson — CNN

There’s a lot in this, and I don’t feel comfortable summarizing it with a little blurb about lessons learned. Chilling though it is, I’m glad I read it.

Thanks to Sri Ray for this one.

Patrick Smith — The Telegraph

I consider a system to be production-ready when it has not just error handling inside a particular component, but actual dedicated components for failure handling (note the difference from error handling), management of failures, and their mitigation.

Ayende Rahien

Outages

A production of Tinker Tinker Tinker, LLC