SRE Weekly Issue #190

A message from our sponsor, VictorOps:

In the latest guide, Resilience First, you’ll learn about the origin of SRE, how it’s evolved over the last few years, and the future of its impact on building highly observable, resilient applications and infrastructure.

http://try.victorops.com/sreweekly/sre-golden-signals-guide

Articles

This company had a really challenging on-call situation to fix: a monolithic codebase, and a team so large that folks were out of practice by the time their turn in the on-call rotation came around.

Molly Struve

This article includes charts, observations, and conclusions from the author’s by-hand analysis and categorization of several hundred incidents.

Subbu Allamaraju

Charity Majors replied to a suggestion to write alerts for everything with her ideas for a better way.

Charity Majors (@mipsytipsy)

Where many databases use threading to handle concurrent clients, PostgreSQL forks one child process per client. This has ramifications that an operator must take into consideration.

Kristi Anderson — High Scalability

This article is about attributes of anomaly detection systems, but it doesn’t mention a specific system. I have yet to find an anomaly detection system that doesn’t produce so many false positives that it’s useless.

Hive mind: if you’re using an anomaly detection system that actually works and doesn’t drown you with false positives, I want to hear about it. Bonus points if you want to write an article about it!

Amit Levi

Outages

SRE Weekly Issue #189

A message from our sponsor, VictorOps:

Adopt an incremental approach to machine learning to empower DevOps and IT teams and make on-call incident management suck less. Check out the open webinar recording today.

http://try.victorops.com/sreweekly/machine-learning-in-devops-and-it

Articles

…no reason. Asking for a friend.

Daniel Kolitz — Gizmodo

Multi-cloud may not be your first choice — but it may not be your choice at all.

Krishnan Subramanian – StackSense

Should you deploy on a Friday?
If you’ve got the confidence in your build and deploy pipelines, go for it.
If you don’t, go build some confidence.

Mitch Pomery — DEV

This story was so good I read it twice. The little details under the hood of your automation tools can reach out and bite you.

Rachel by the Bay

D&D-themed game days!

Lukas van Driel — Q42

Some interesting details courtesy of leaked internal audio from Facebook.

Casey Newton — The Verge

How do they cheat? By making assumptions about where a read for a given datum is likely to come from.

Daniel Abadi

The incident was the result of mismatched library versions.

Outages

  • PG&E Website
    • PG&E is a power company in California, USA. They’re cutting power to reduce the risk of fires started by power lines blown around in high winds.
  • Instagram

SRE Weekly Issue #188

A message from our sponsor, VictorOps:

[Free Webinar] Last chance to register for this week’s live webinar – How to Succeed in Machine Learning Without Really Trying. See how IT and engineering leaders are implementing ML to build more robust systems and improve on-call incident response.

http://try.victorops.com/sreweekly/machine-learning-webinar

Articles

Two of the hardest problems of testing in production are curtailing blast radius and dealing with state. In this post, I aim to explore the topic of curtailing blast radius in more detail.

Cindy Sridharan

This team was getting paged constantly to fix failed Kafka nodes, and their outlook for the future was looking even worse. They responded by developing an auto-remediation system.

Andrey Falko — Lyft

As last week’s Boeing-related article explained, Boeing and Airbus have significantly different philosophies regarding the role of pilots vs aircraft in aviation safety. This new NTSB report strikes at the heart of that dichotomy.

Alex Davies – Wired

This is an especially interesting read because the team in question was a network operations team, and the members largely had no software engineering experience. Part of the transformation involved essentially training them for a new career.

Tom Wright — Google

My favorite part is the explanation of why observability is critical in microservice architectures.

The system is no longer in one of two states but more like one of n-factorial states.

Tyler Treat

Given that Lambda et al. auto-scale, is caching still relevant? Find out why by reading this article.

Yan Cui

Outages

  • GitHub
    • Repository forking operations were delayed.
  • Statuspage.io
  • Slack
    • Some customers are seeing an error code (“1AE32E16D91F”) when connecting to Slack.

      Now I really want to know what 1AE32E16D91F is…

  • Twitter

SRE Weekly Issue #187

A message from our sponsor, VictorOps:

Machine learning (ML) isn’t just a buzzword anymore — it’s affecting how we communicate, shop, live and respond to critical DevOps incidents. Grab your spot for this free webinar to learn about driving success in incident management with machine learning:

http://try.victorops.com/sreweekly/machine-learning-webinar

Articles

I love it when companies publish their incident management documentation! Atlassian’s offering is high-quality — both in content and production value. The Major Incident Manager Cheatsheet at the end is worth distributing to your team.

Atlassian

Netflix shares more about their N+1 AWS region redundancy design, and it all revolves around accurately modeling demand.

Niosha Behnam — Netflix

Interactions between simple microservices can lead to unexpected emergent behaviors.

To restate: this system is not complicated. But it is complex.

Avdi Grimm

What we had in the two downed airplanes was a textbook failure of airmanship.

While I don’t necessarily agree with the blame-laden language of this article, it provides some interesting new details. It strikes me that, while MCAS may not be directly responsible for the crashes, it made it significantly harder to recover from contemporaneous pilot errors.

William Langewiesche — The New York Times

My favorite part is the role-playing scenarios of debugging a problem with observability tooling and traditional tools.

Charity Majors

Tuning your TCP stack is important on busy servers.

Ram Lakshmanan

Outages

SRE Weekly Issue #186

A message from our sponsor, VictorOps:

See why DevOps teams are more collaborative and transparent than traditional IT operations – helping them build highly efficient incident management and response systems:

http://try.victorops.com/sreweekly/devops-incident-management-guide

Articles

This article is highly technical without being overwhelmingly detailed.

It is very important that a database user is aware of the isolation level guaranteed by the database system, and what concurrency bugs may emerge as a result.

Daniel Abadi

The traps are:

  1. You don’t have enough cross-team usage or buy-in.
  2. Your difficult and dense process is slowing down incident response.
  3. Postmortems are underutilized and don’t encompass in-depth learnings.
  4. You wait for incidents to happen.
  5. You stop at incident management without SLOs.

Lyon Wong — Blameless

Need to argue the benefits of implementing distributed tracing in your organization? This article will help you get started.

dm03514

The question is: what is the proper role of alerting in the modern era of distributed systems? Have alerting best practices changed with the shift from monitoring and known-unknowns to observability and unknown-unknowns?

Charity Majors

Round-robin load balancing often isn’t good enough; it’s necessary to intelligently route requests to nodes that aren’t overloaded. How do you get information about backend health to distributed load balancer nodes efficiently? A: add a response header.

Haowei Yuan, Yi-Shu Tai, and Dmitry Kopytkov — Dropbox
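
To make the response-header idea concrete, here is a minimal sketch in Python. It is not Dropbox's implementation: the header name `X-Backend-Load`, the `LoadAwareBalancer` class, and the "pick the lowest reported load" rule are all assumptions made for illustration. It only shows how a load balancer node could learn about backend health from the responses it already proxies, with no separate health-reporting channel.

```python
from typing import Dict, Iterable, Mapping

# Hypothetical header name; the real system may use something different.
LOAD_HEADER = "X-Backend-Load"


class LoadAwareBalancer:
    """Routes each request to the backend with the lowest last-reported load."""

    def __init__(self, backends: Iterable[str]) -> None:
        # Every backend starts at 0.0 until it reports a load value.
        self.load: Dict[str, float] = {name: 0.0 for name in backends}

    def pick(self) -> str:
        # Ties go to the backend that was registered first.
        return min(self.load, key=lambda name: self.load[name])

    def record_response(self, backend: str, headers: Mapping[str, str]) -> None:
        # Update our view of a backend from the header on its response.
        raw = headers.get(LOAD_HEADER)
        if raw is None:
            return
        try:
            self.load[backend] = float(raw)
        except ValueError:
            # Ignore malformed values and keep the previous estimate.
            pass
```

In this sketch the balancer would call `pick()` before forwarding a request and `record_response()` when the reply comes back, so its load estimates are only as fresh as the most recent response from each backend.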

By adding in-memory caching with a mere 3-second TTL, these folks achieved a ~75% cache hit rate, allowing them to withstand request spikes without an outage.

Mina Gyimah — Pusher
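
For a feel of how little code a short-TTL cache needs, here is a minimal sketch in Python. It is not Pusher's code: the `TTLCache` class, its `get_or_load` method, and the injectable `clock` parameter are hypothetical names chosen for illustration. The only detail taken from the summary above is the 3-second TTL.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds."""

    def __init__(
        self,
        ttl_seconds: float = 3.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}
        self.hits = 0
        self.misses = 0

    def get_or_load(self, key: Hashable, loader: Callable[[Hashable], Any]) -> Any:
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            # Entry is still fresh: serve it without touching the backend.
            self.hits += 1
            return entry[1]
        # Missing or expired: load from the backend and store with a new expiry.
        self.misses += 1
        value = loader(key)
        self._entries[key] = (now + self.ttl_seconds, value)
        return value
```

With a TTL this short, each key reaches the backend at most about once every three seconds per process, however many requests arrive in between. The sketch is single-threaded and never evicts expired keys, both of which a production cache would need to handle.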

Outages

A production of Tinker Tinker Tinker, LLC