SRE Weekly Issue #189

A message from our sponsor, VictorOps:

Adopt an incremental approach to machine learning to empower DevOps and IT teams and make on-call incident management suck less. Check out the open webinar recording today.

http://try.victorops.com/sreweekly/machine-learning-in-devops-and-it

Articles

…no reason. Asking for a friend.

Daniel Kolitz — Gizmodo

Multi-cloud may not be your first choice — but it may not be your choice at all.

Krishnan Subramanian — StackSense

Should you deploy on a Friday?
If you’ve got confidence in your build and deploy pipelines, go for it.
If you don’t, go build some confidence.

Mitch Pomery — DEV

This story was so good I read it twice. The little details under the hood of your automation tools can reach out and bite you.

Rachel by the Bay

D&D-themed game days!

Lukas van Driel — Q42

Some interesting details courtesy of leaked internal audio from Facebook.

Casey Newton — The Verge

How do they cheat? By making assumptions about where a read for a given datum is likely to come from.

Daniel Abadi

The incident was the result of mismatched library versions.

Outages

  • PG&E Website
    • PG&E is a power company in California, USA. They’re preemptively cutting power to reduce the risk of fires started by power lines blown around in high winds.
  • Instagram

SRE Weekly Issue #188

A message from our sponsor, VictorOps:

[Free Webinar] Last chance to register for this week’s live webinar – How to Succeed in Machine Learning Without Really Trying. See how IT and engineering leaders are implementing ML to build more robust systems and improve on-call incident response.

http://try.victorops.com/sreweekly/machine-learning-webinar

Articles

Two of the hardest problems of testing in production are curtailing blast radius and dealing with state. In this post, I aim to explore the topic of curtailing blast radius in more detail.

Cindy Sridharan

This team was getting paged constantly to fix failed Kafka nodes, and the outlook for the future looked even worse. They responded by developing an auto-remediation system.

Andrey Falko — Lyft

As last week’s Boeing-related article explained, Boeing and Airbus have significantly different philosophies regarding the role of pilots vs aircraft in aviation safety. This new NTSB report strikes at the heart of that dichotomy.

Alex Davies — Wired

This is an especially interesting read because the team in question was a network operations team, and the members largely had no software engineering experience. Part of the transformation involved essentially training them for a new career.

Tom Wright — Google

My favorite part is the explanation of why observability is critical in microservice architectures.

The system is no longer in one of two states but more like one of n-factorial states.

Tyler Treat

Given that Lambda et al. auto-scale, is caching still relevant? This article explains why it still is.

Yan Cui

Outages

  • GitHub
    • Repository forking operations were delayed.
  • Statuspage.io
  • Slack
    • Some customers are seeing an error code (“1AE32E16D91F”) when connecting to Slack.

      Now I really want to know what 1AE32E16D91F is…

  • Twitter

SRE Weekly Issue #187

A message from our sponsor, VictorOps:

Machine learning (ML) isn’t just a buzzword anymore — it’s affecting how we communicate, shop, live and respond to critical DevOps incidents. Grab your spot for this free webinar to learn about driving success in incident management with machine learning:

http://try.victorops.com/sreweekly/machine-learning-webinar

Articles

I love it when companies publish their incident management documentation! Atlassian’s offering is high-quality — both in content and production value. The Major Incident Manager Cheatsheet at the end is worth distributing to your team.

Atlassian

Netflix shares more about their N+1 AWS region redundancy design, and it all revolves around accurately modeling demand.

Niosha Behnam — Netflix
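The capacity math behind an N+1 region design can be sketched roughly as follows; the function and its uniform-traffic assumption are illustrative, not from the Netflix post:

```python
def per_region_capacity(peak_demand: float, regions: int) -> float:
    """Capacity each region must provision so the system can absorb
    the loss of any single region at peak demand.

    Assumes traffic spreads evenly across the surviving regions.
    """
    if regions < 2:
        raise ValueError("N+1 redundancy needs at least two regions")
    return peak_demand / (regions - 1)

# With three regions, each must be able to carry half of peak demand
# so that any two survivors can serve everything.
```

The hard part is getting `peak_demand` right in the first place, which is why accurate demand modeling is central to the design.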

Interactions between simple microservices can lead to unexpected emergent behaviors.

To restate: this system is not complicated. But it is complex.

Avdi Grimm

What we had in the two downed airplanes was a textbook failure of airmanship.

While I don’t necessarily agree with the blame-laden language of this article, it provides some interesting new details. It strikes me that, while MCAS may not be directly responsible for the crashes, it made it significantly harder to recover from contemporaneous pilot errors.

William Langewiesche — The New York Times

My favorite part is the role-playing scenarios of debugging a problem with observability tooling and traditional tools.

Charity Majors

Tuning your TCP stack is important on busy servers.

Ram Lakshmanan

Outages

SRE Weekly Issue #186

A message from our sponsor, VictorOps:

See why DevOps teams are more collaborative and transparent than traditional IT operations – helping them build highly efficient incident management and response systems:

http://try.victorops.com/sreweekly/devops-incident-management-guide

Articles

This article is highly technical without being overwhelmingly detailed.

It is very important that a database user is aware of the isolation level guaranteed by the database system, and what concurrency bugs may emerge as a result.

Daniel Abadi
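To make the quoted warning concrete, here is a minimal sketch of the classic lost-update anomaly that weaker isolation levels can permit, with plain Python variables standing in for two concurrent transactions (no real database involved):

```python
# Two "transactions" each read an account balance, add a deposit, and
# write the result back. Under a weak isolation level the database may
# allow this interleaving: T1 reads, T2 reads, T1 writes, T2 writes.
balance = 100

t1_read = balance        # T1 reads 100
t2_read = balance        # T2 reads 100, unaware of T1's in-flight update
balance = t1_read + 10   # T1 commits: balance is now 110
balance = t2_read + 10   # T2 commits: balance is 110 again

# Both deposits "succeeded", yet only one is reflected: a lost update.
# Under serializable isolation this interleaving would be rejected and
# the final balance would be 120.
```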

The traps are:

  1. You don’t have enough cross-team usage or buy-in.
  2. Your difficult and dense process is slowing down incident response.
  3. Postmortems are underutilized and don’t encompass in-depth learnings.
  4. You wait for incidents to happen.
  5. You stop at incident management without SLOs.

Lyon Wong — Blameless

Need to argue the benefits of implementing distributed tracing in your organization? This article will help you get started.

dm03514

The question is: what is the proper role of alerting in the modern era of distributed systems? Have alerting best practices changed with the shift from monitoring and known-unknowns to observability and unknown-unknowns?

Charity Majors

Round-robin load balancing often isn’t good enough; it’s necessary to intelligently route requests to nodes that aren’t overloaded. How do you get information about backend health to distributed load balancer nodes efficiently? A: add a response header.

Haowei Yuan, Yi-Shu Tai, and Dmitry Kopytkov — Dropbox
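The response-header idea can be sketched roughly like this; the header name `X-Backend-Load` and the class below are illustrative assumptions, not Dropbox's actual implementation:

```python
class LoadAwareBalancer:
    """Routes each request to the least-loaded known backend.

    Backends piggyback their current utilization on every response via
    a header, so the balancer learns load passively, without a separate
    health-polling channel.
    """

    def __init__(self, backends):
        # Until a backend reports, assume it is idle.
        self.load = {b: 0.0 for b in backends}

    def pick(self):
        return min(self.load, key=self.load.get)

    def observe(self, backend, response_headers):
        value = response_headers.get("X-Backend-Load")
        if value is not None:
            self.load[backend] = float(value)

# Usage: after "node-a" reports high load on a response, new requests
# flow to the less-loaded "node-b".
lb = LoadAwareBalancer(["node-a", "node-b"])
lb.observe("node-a", {"X-Backend-Load": "0.9"})
```

Because the load signal rides on responses the balancer is already receiving, it stays fresh exactly in proportion to how busy the backend is.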

By adding in-memory caching with a mere 3-second TTL, these folks achieved a ~75% cache hit rate, allowing them to withstand request spikes without an outage.

Mina Gyimah — Pusher
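A short-TTL in-memory cache like the one described is small enough to sketch; this is a generic illustration, not Pusher's code, and `compute` stands in for whatever expensive backend call is being protected:

```python
import time

class TTLCache:
    def __init__(self, ttl_seconds=3.0):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (value, expires_at)

    def get(self, key, compute):
        now = time.monotonic()
        entry = self._store.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]            # hit: absorb the request spike
        value = compute(key)           # miss: do the expensive work once
        self._store[key] = (value, now + self.ttl)
        return value
```

Even a 3-second TTL means that during a spike most requests land inside the freshness window, which is how a hit rate around 75% becomes plausible.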

Outages

SRE Weekly Issue #185

A message from our sponsor, VictorOps:

Machine learning is already being used in many DevOps processes – driving highly efficient workflows across the entire software delivery lifecycle. See how machine learning is currently being used to improve incident management and response in production environments:

http://try.victorops.com/sreweekly/machine-learning-in-incident-management

Articles

This is a tough read, but really enlightening.

Thanks to Courtney Eckhardt for this one.

William Langewiesche — The Atlantic

Read this to find out why it’s so hard to nail down SLOs for cloud services.

Adrian Colyer — The Morning Paper (summary)

Mogul & Wilkes (original paper)

BGP: the horrifying, ugly monster lurking at the base of the Internet.

Stilgherrian — ZDNet

A different kind of monster.

Will Oremus — Slate

When you’re thinking about best practices for improving phases of the production incident cycle, there are three SRE principles that particularly matter for this task. Keep these in mind as you think about reliability.

Myk Taylor — Google

It’s important that we remember that there’s more to incident response than the technical aspect.

George Miranda — PagerDuty

Learn from this Second Officer’s account of a maritime near-miss and the five lessons they learned. My favorite:

As professionals, we always have more than one goal.

Nippin Anand — Safety Differently

Outages

SRE WEEKLY © 2015