
SRE Weekly Issue #188

A message from our sponsor, VictorOps:

[Free Webinar] Last chance to register for this week’s live webinar – How to Succeed in Machine Learning Without Really Trying. See how IT and engineering leaders are implementing ML to build more robust systems and improve on-call incident response.

http://try.victorops.com/sreweekly/machine-learning-webinar

Articles

Two of the hardest problems of testing in production are curtailing blast radius and dealing with state. In this post, I aim to explore the topic of curtailing blast radius in more detail.

Cindy Sridharan

This team was getting paged constantly to fix failed Kafka nodes, and their outlook for the future was looking even worse. They responded by developing an auto-remediation system.

Andrey Falko — Lyft

As last week’s Boeing-related article explained, Boeing and Airbus have significantly different philosophies regarding the role of pilots vs aircraft in aviation safety. This new NTSB report strikes at the heart of that dichotomy.

Alex Davies – Wired

This is an especially interesting read because the team in question was a network operations team, and the members largely had no software engineering experience. Part of the transformation involved essentially training them for a new career.

Tom Wright — Google

My favorite part is the explanation of why observability is critical in microservice architectures.

The system is no longer in one of two states but more like one of n-factorial states.

Tyler Treat

Given that Lambda et al. auto-scale, is caching still relevant? It is, and this article explains why.

Yan Cui

Outages

  • GitHub
    • Repository forking operations were delayed.
  • Statuspage.io
  • Slack
    • Some customers are seeing an error code (“1AE32E16D91F”) when connecting to Slack.

      Now I really want to know what 1AE32E16D91F is…

  • Twitter

SRE Weekly Issue #187

A message from our sponsor, VictorOps:

Machine learning (ML) isn’t just a buzzword anymore — it’s affecting how we communicate, shop, live and respond to critical DevOps incidents. Grab your spot for this free webinar to learn about driving success in incident management with machine learning:

http://try.victorops.com/sreweekly/machine-learning-webinar

Articles

I love it when companies publish their incident management documentation! Atlassian’s offering is high-quality — both in content and production value. The Major Incident Manager Cheatsheet at the end is worth distributing to your team.

Atlassian

Netflix shares more about their N+1 AWS region redundancy design, and it all revolves around accurately modeling demand.

Niosha Behnam — Netflix

Interactions between simple microservices can lead to unexpected emergent behaviors.

To restate: this system is not complicated. But it is complex.

Avdi Grimm

What we had in the two downed airplanes was a textbook failure of airmanship.

While I don’t necessarily agree with the blame-laden language of this article, it provides some interesting new details. It strikes me that, while MCAS may not be directly responsible for the crashes, it made it significantly harder to recover from contemporaneous pilot errors.

William Langewiesche — The New York Times

My favorite part is the role-playing scenarios of debugging a problem with observability tooling and traditional tools.

Charity Majors

Tuning your TCP stack is important on busy servers.

Ram Lakshmanan

Outages

SRE Weekly Issue #186

A message from our sponsor, VictorOps:

See why DevOps teams are more collaborative and transparent than traditional IT operations – helping them build highly efficient incident management and response systems:

http://try.victorops.com/sreweekly/devops-incident-management-guide

Articles

This article is highly technical without being overwhelmingly detailed.

It is very important that a database user is aware of the isolation level guaranteed by the database system, and what concurrency bugs may emerge as a result.

Daniel Abadi

The traps are:

  1. You don’t have enough cross-team usage or buy-in.
  2. Your difficult and dense process is slowing down incident response.
  3. Postmortems are underutilized and don’t encompass in-depth learnings.
  4. You wait for incidents to happen.
  5. You stop at incident management without SLOs.

Lyon Wong — Blameless

Need to argue the benefits of implementing distributed tracing in your organization? This article will help you get started.

dm03514

The question is: what is the proper role of alerting in the modern era of distributed systems? Have alerting best practices changed with the shift from monitoring and known-unknowns to observability and unknown-unknowns?

Charity Majors

Round-robin load balancing often isn’t good enough; it’s necessary to intelligently route requests to nodes that aren’t overloaded. How do you get information about backend health to distributed load balancer nodes efficiently? A: add a response header.

Haowei Yuan, Yi-Shu Tai, and Dmitry Kopytkov — Dropbox
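
To make the response-header idea concrete, here is a minimal sketch, not Dropbox's implementation. It assumes each backend attaches its current load to every response in a header (the name X-Backend-Load and the 0.0 to 1.0 scale are hypothetical), and that the balancer simply routes the next request to the backend with the lowest load it has seen so far.

```python
from typing import Dict, Iterable, Mapping

# Hypothetical header name; the linked article may use a different one.
LOAD_HEADER = "X-Backend-Load"


class LoadAwareBalancer:
    """Routes each request to the backend with the lowest last-reported load."""

    def __init__(self, backends: Iterable[str]) -> None:
        # Backends that have not reported yet are assumed to be idle (0.0).
        self.load: Dict[str, float] = {name: 0.0 for name in backends}

    def pick(self) -> str:
        """Return the backend with the lowest known load (ties: first added)."""
        return min(self.load, key=lambda name: self.load[name])

    def observe(self, backend: str, headers: Mapping[str, str]) -> None:
        """Update our view of a backend from the headers on its response."""
        raw = headers.get(LOAD_HEADER)
        if raw is None:
            return  # No report on this response; keep the previous value.
        try:
            self.load[backend] = float(raw)
        except ValueError:
            pass  # Ignore malformed values rather than failing the request.
```

Because the load report rides along on responses the balancer is already receiving, no separate health-reporting channel is needed. A real deployment would also have to handle stale reports and avoid every balancer node stampeding onto the same lightly loaded backend.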

By adding in-memory caching with a mere 3-second TTL, these folks achieved a ~75% cache hit rate, allowing them to withstand request spikes without an outage.

Mina Gyimah — Pusher
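
For a sense of how a very short TTL can still absorb a spike, here is a minimal sketch of an in-memory cache with a 3-second expiry. It is not Pusher's code: the class name, the loader callback, and the injectable clock are assumptions made for illustration.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLCache:
    """Tiny in-memory cache whose entries expire after a fixed TTL."""

    def __init__(
        self,
        ttl_seconds: float = 3.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.ttl = ttl_seconds
        self.clock = clock  # Injectable so expiry can be tested without sleeping.
        self._store: Dict[Hashable, Tuple[float, Any]] = {}
        self.hits = 0
        self.misses = 0

    def get_or_load(self, key: Hashable, loader: Callable[[Hashable], Any]) -> Any:
        """Return the cached value for key, calling loader only on a miss."""
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and now < entry[0]:
            self.hits += 1
            return entry[1]
        # Missing or expired: fetch from the backing service and re-cache.
        self.misses += 1
        value = loader(key)
        self._store[key] = (now + self.ttl, value)
        return value
```

With this sketch, any burst of requests for the same key inside one 3-second window reaches the backing service once. For example, four requests for a key within a window produce one miss and three hits. The hit rate in practice depends entirely on how often the same keys repeat.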

Outages

SRE Weekly Issue #185

A message from our sponsor, VictorOps:

Machine learning is already being used in many DevOps processes – driving highly efficient workflows across the entire software delivery lifecycle. See how machine learning is currently being used to improve incident management and response in production environments:

http://try.victorops.com/sreweekly/machine-learning-in-incident-management

Articles

This is a tough read, but really enlightening.

Thanks to Courtney Eckhardt for this one.

William Langewiesche — The Atlantic

Read this to find out why it’s so hard to nail down SLOs for cloud services.

Adrian Colyer — The Morning Paper (summary)

Mogul & Wilkes (original paper)

BGP: the horrifying, ugly monster lurking at the base of the Internet.

Stilgherrian — ZDNet

A different kind of monster.

Will Oremus — Slate

When you’re thinking about best practices for improving phases of the production incident cycle, there are three SRE principles that particularly matter for this task. Keep these in mind as you think about reliability.

Myk Taylor — Google

It’s important that we remember that there’s more to incident response than the technical aspect.

George Miranda — PagerDuty

Learn from this Second Officer’s account of a maritime near-miss and the five lessons they learned. My favorite:

As professionals, we always have more than one goal.

Nippin Anand — Safety Differently

Outages

SRE Weekly Issue #184

A message from our sponsor, VictorOps:

Do you dream of reducing MTTA from four hours to two minutes? Learn how you can improve incident detection, alerting, real-time incident collaboration and cross-functional transparency to make on-call suck less and build more reliable services:

http://try.victorops.com/sreweekly/improved-incident-response

Articles

This article relates to Donella H. Meadows’s book, Thinking in Systems.

What follows is Meadows’ list of leverage points outfitted with my ideas of where or how they can be applied to software development and web operations.

Ryan Frantz

D:

I know its past an hour but… we got ~600 Nagios emails a day. Boss forbade us from muting any of them. In weekly status meeting, he’d often quiz on-call on a random alert. If oncall didnt know about it, boss would often scream at us…

Jason Antman (@j_antman)

Find out how the Couchbase folks use Jepsen to test their database offering.

Korrigan Clark

A supportive on-call environment is critical to ensuring reliability and resiliency.

Deirdre Mahon — Honeycomb

This is a follow-on to an article I linked to a while back.

It’s really simpler to call it Tech Risk.

I love the idea of tracking the decisions an organization makes and the risks they entail.

Sarah Baker

Outages

A production of Tinker Tinker Tinker, LLC