SRE Weekly Issue #192

A message from our sponsor, VictorOps:

Keeping your local repository in sync with an open-source GitHub repo can cause headaches. But it can also lead to more flexible, resilient services. See how these techniques can help you maintain consistency between both environments:

http://try.victorops.com/sreweekly/keeping-github-and-local-repos-in-sync

Articles

This is a reply/follow-on/not-rebuttal to the article I linked to last week, Deploy on Fridays, or Don’t. I really love the vigorous discussion!

Charity Majors

And this is a reply to Charity’s earlier article, Friday Deploy Freezes Are Exactly Like Murdering Puppies. Keep it coming, folks!

Marko Bjelac

In this story from the archives, a well-meaning compiler optimizes away a NULL pointer check, yielding an exploitable kernel bug. I love complex systems (kinda).

Jonathan Corbet — LWN
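
The pattern behind that bug is easy to reproduce. Here’s a minimal sketch of the bug class the article describes, not the actual kernel code (the names device and dev_poll are made up): because the dereference happens before the check, the compiler may conclude the pointer is non-NULL and delete the check.

    #include <stdio.h>

    /* Hypothetical stand-in for the kernel structure in the story. */
    struct device {
        int status;
    };

    int dev_poll(struct device *dev)
    {
        int status = dev->status; /* dereference happens before the check... */
        if (!dev)                 /* ...so the compiler may assume dev != NULL */
            return -1;            /* and silently delete this branch */
        return status;
    }

    int main(void)
    {
        struct device d = { 42 };
        printf("%d\n", dev_poll(&d)); /* fine: prints 42 */
        /* dev_poll(NULL) is undefined behavior: with optimization on, the
           NULL check above can be gone, so instead of returning -1 this
           would read through address zero. */
        return 0;
    }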

A new report has been released about a major telecommunications outage last winter. This summary paints the picture of a classic complex systems failure.

Ronald Lewis

Making engineers responsible for their code and services in production offers multiple advantages—for the engineer as well as the code.

Julie Gunderson — PagerDuty

Outages

SRE Weekly Issue #191

A message from our sponsor, VictorOps:

Need a new SRE podcast? Then check out episode one of the new VictorOps podcast, Ship Happens. Engineering Manager Benton Rochester sits down with Bethany Abbott, TechOps Manager at NS1, to discuss on-call and the gender gap in tech.

http://try.victorops.com/sreweekly/ship-happens-episode-one

Articles

Check it out! A new zine dedicated to post-incident reviews. This first issue includes reprints of four real gems from the past month, plus one original article about disseminating lessons learned from incidents.

Emil Stolarsky and Jaime Woo

I swear, it’s like they heard me talking about anomaly detection last week. Anyone used this thing? I’d love to hear your experience. Better still, perhaps you’d like to write a blog post or article?

I know this isn’t Security Weekly, but this vulnerability has the potential to cause reliability issues, and it’s dreadfully simple to understand and exploit.

Hoai Viet Nguyen and Luigi Lo Iacono

In this incident followup from the archives, read the saga of a deploy gone horribly wrong. It took them hours and several experiments to figure out how to right the ship.

CCP Goliath — EVE Online

The best practices:

  1. Create a culture of experimentation
  2. Define what success looks like as a team
  3. Statistical significance (see the sketch after this list)
  4. Proper segmentation
  5. Recognize your biases
  6. Conduct a retro
  7. Consider experiments during the planning phase
  8. Empower others
  9. Avoid technical debt

Dawn Parzych — LaunchDarkly
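
On item 3, here’s a minimal sketch of what checking statistical significance can look like for an A/B experiment: a two-proportion z-test over made-up conversion numbers. The article doesn’t prescribe a particular test, so this is just one common choice (compile with -lm).

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double conv_a = 120, n_a = 2400; /* control: conversions, visitors */
        double conv_b = 150, n_b = 2350; /* variant: conversions, visitors */

        double p_a = conv_a / n_a;
        double p_b = conv_b / n_b;
        double p_pool = (conv_a + conv_b) / (n_a + n_b); /* pooled rate */
        double se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b));
        double z = (p_b - p_a) / se;

        /* |z| > 1.96 means significant at the p < 0.05 level. */
        printf("p_a=%.4f p_b=%.4f z=%.2f\n", p_a, p_b, z);
        return 0;
    }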

Mantis uses an interesting stream processing / subscriber model for observability tooling.

From the beginning, we’ve built Mantis with this exact guiding principle in mind: Let’s make sure we minimize the costs of observing and operating our systems without compromising on required and opportunistic insights.

Cody Rioux, Daniel Jacobson, Jeff Chao, Neeraj Joshi, Nick Mahilani, Piyush Goyal, Prashanth Ramdas, and Zhenzhong Xu — Netflix
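
The idea in that quote is easy to picture with a toy sketch. This isn’t Mantis’s actual API, just an illustration of the on-demand principle: the instrumented process pays the cost of producing events only while someone downstream is subscribed.

    #include <stdio.h>

    static int subscribers = 0; /* active downstream consumers */

    static void emit(const char *event)
    {
        if (subscribers == 0)
            return; /* nobody is listening: skip the cost entirely */
        /* A real system would serialize and publish the event here. */
        printf("event: %s\n", event);
    }

    int main(void)
    {
        emit("request.latency=12ms"); /* dropped: no subscribers yet */
        subscribers++;                /* an operator attaches a query */
        emit("request.latency=9ms");  /* now flows downstream */
        subscribers--;                /* the query ends */
        emit("request.latency=31ms"); /* dropped again */
        return 0;
    }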

choosing not to deploy on Fridays is very different than having the capability to deploy on Fridays. You should have the capability to deploy at any time.

We can’t ever be sure deploy will be safe, but we can be sure that folks have plans for their weekend.

David Mangot — Mangoteque

Outages

  • Amazon Route 53
    • Route 53 had significant DNS resolution impairment.

      Their status site still doesn’t allow deep linking or browsing the archive in any kind of manageable way, so here’s the full text of their followup post:

      On October 22, 2019, we detected and then mitigated a DDoS (Distributed Denial of Service) attack against Route 53. Due to the way that DNS queries are processed, this attack was first experienced by many other DNS server operators as the queries made their way through DNS resolvers on the internet to Route 53. The attack targeted specific DNS names and paths, notably those used to access the global names for S3 buckets. Because this attack was widely distributed, a small number of ISPs operating affected DNS resolvers implemented mitigation strategies of their own in an attempt to control the traffic. This is causing DNS lookups through these resolvers for a small number of AWS names to fail. We are doing our best to identify and contact these operators, as quickly as possible, and working with them to enhance their mitigations so that they do not cause impact to valid requests. If you are experiencing issues, please contact us so we can work with your operator to help resolve.

  • Heroku
    • I’m guessing this stemmed from the Route 53 incident.

      Our infrastructure provider is currently reporting intermittent DNS resolution errors. This may result in issues resolving domains to our services.

  • Twitter
  • Yahoo Mail
  • Hosted Graphite
  • Discord
  • Google Cloud Platform

SRE Weekly Issue #190

A message from our sponsor, VictorOps:

In the latest guide, Resilience First, you’ll learn about the origin of SRE, how it’s evolved over the last few years, and the future of its impact on building highly observable, resilient applications and infrastructure.

http://try.victorops.com/sreweekly/sre-golden-signals-guide

Articles

This company had a really challenging on-call situation to fix: a monolithic codebase, and a huge team with so many people in the on-call rotation that folks were out of practice by the time their turn came around.

Molly Struve

This article includes charts, observations, and conclusions from the author’s by-hand analysis and categorization of several hundred incidents.

Subbu Allamaraju

Charity Majors replied to a suggestion to write alerts for everything with her ideas for a better way.

Charity Majors (@mipsytipsy)

Where many databases use threading to handle concurrent clients, PostgreSQL forks one child process per client. This has ramifications that an operator must take into consideration.

Kristi Anderson — High Scalability
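
A minimal fork-per-connection server makes the model concrete. This sketch is only in the same spirit, not PostgreSQL’s actual backend startup (which does far more), and error handling is omitted for brevity:

    #include <netinet/in.h>
    #include <signal.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        int lfd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr = { 0 };
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(5432); /* PostgreSQL's usual port */
        bind(lfd, (struct sockaddr *)&addr, sizeof addr);
        listen(lfd, 16);
        signal(SIGCHLD, SIG_IGN); /* let the kernel reap exited children */

        for (;;) {
            int cfd = accept(lfd, NULL, NULL);
            if (fork() == 0) {   /* one child process per connected client */
                close(lfd);
                char buf[256];
                ssize_t n;
                while ((n = read(cfd, buf, sizeof buf)) > 0)
                    write(cfd, buf, n); /* this child serves only this client */
                _exit(0);
            }
            close(cfd); /* parent returns immediately to accept() */
        }
    }

Every client costs a whole OS process, which is why connection counts matter so much to PostgreSQL operators and why connection poolers such as PgBouncer are so common.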

This article is about the attributes of a good anomaly detection system, but it doesn’t mention a specific one. I have yet to find an anomaly detection system that doesn’t produce so many false positives that it’s useless.

Hive mind: if you’re using an anomaly detection system that actually works and doesn’t drown you with false positives, I want to hear about it. Bonus points if you want to write an article about it!

Amit Levi
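
The base-rate arithmetic behind that complaint is worth spelling out. With made-up but plausible numbers, even a detector that’s wrong only 0.3% of the time buries you in alarms:

    #include <stdio.h>

    int main(void)
    {
        double metrics = 10000;        /* time series being watched */
        double checks_per_day = 1440;  /* one evaluation per minute */
        double false_pos_rate = 0.003; /* ~3-sigma threshold on normal data */

        double alarms = metrics * checks_per_day * false_pos_rate;
        printf("expected false alarms per day: %.0f\n", alarms); /* 43200 */
        return 0;
    }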

Outages

SRE Weekly Issue #189

A message from our sponsor, VictorOps:

Adopt an incremental approach to machine learning to empower DevOps and IT teams and make on-call incident management suck less. Check out the open webinar recording today.

http://try.victorops.com/sreweekly/machine-learning-in-devops-and-it

Articles

…no reason. Asking for a friend.

Daniel Kolitz — Gizmodo

Multi-cloud may not be your first choice — but it may not be your choice at all.

Krishnan Subramanian — StackSense

Should you deploy on a Friday?
If you’ve got the confidence in your build and deploy pipelines, go for it.
If you don’t, go build some confidence.

Mitch Pomery — DEV

This story was so good I read it twice. The little details under the hood of your automation tools can reach out and bite you.

Rachel by the Bay

D&D-themed game days!

Lukas van Driel — Q42

Some interesting details courtesy of leaked internal audio from Facebook.

Casey Newton — The Verge

How do they cheat? By making assumptions about where a read for a given datum is likely to come from.

Daniel Abadi

The incident was the result of mismatched library versions.

Outages

  • PG&E Website
    • PG&E is a power company in California, USA. They’re cutting power preemptively to reduce the risk of fires started by power lines blown around in high winds.
  • Instagram

SRE Weekly Issue #188

A message from our sponsor, VictorOps:

[Free Webinar] Last chance to register for this week’s live webinar – How to Succeed in Machine Learning Without Really Trying. See how IT and engineering leaders are implementing ML to build more robust systems and improve on-call incident response.

http://try.victorops.com/sreweekly/machine-learning-webinar

Articles

Two of the hardest problems of testing in production are curtailing blast radius and dealing with state. In this post, I aim to explore the topic of curtailing blast radius in more detail.

Cindy Sridharan

This team was getting paged constantly to fix failed Kafka nodes, and their outlook for the future was looking even worse. They responded by developing an auto-remediation system.

Andrey Falko — Lyft

As last week’s Boeing-related article explained, Boeing and Airbus have significantly different philosophies regarding the role of pilots vs. aircraft in aviation safety. This new NTSB report strikes at the heart of that dichotomy.

Alex Davies — Wired

This is an especially interesting read because the team in question was a network operations team, and the members largely had no software engineering experience. Part of the transformation involved essentially training them for a new career.

Tom Wright — Google

My favorite part is the explanation of why observability is critical in microservice architectures.

The system is no longer in one of two states but more like one of n-factorial states.

Tyler Treat
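
Taking the quote’s n-factorial at face value, a few lines show how fast that blows up:

    #include <stdio.h>

    int main(void)
    {
        double states = 1;
        for (int n = 1; n <= 12; n++) {
            states *= n; /* n! grows explosively compared to 2 states */
            printf("%2d services -> %.0f states\n", n, states);
        }
        return 0;
    }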

Given that Lambda et al. auto-scale, is caching still relevant? Find out by reading this article.

Yan Cui
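
One reason caching can stay relevant is container reuse: state initialized outside the handler survives across warm invocations of the same instance. A sketch of the idea (the names and the 60-second TTL here are made up, and a real Lambda function would live in its own runtime):

    #include <stdio.h>
    #include <time.h>

    static char cached[64];       /* lives outside the handler */
    static time_t fetched_at = 0;

    static const char *get_config(void)
    {
        time_t now = time(NULL);
        if (fetched_at == 0 || now - fetched_at > 60) { /* 60s TTL */
            /* Pretend this is a slow call to a parameter store. */
            snprintf(cached, sizeof cached, "config@%ld", (long)now);
            fetched_at = now;
        }
        return cached; /* warm invocations reuse the cached value */
    }

    int main(void)
    {
        /* Simulate three invocations landing on one warm instance. */
        for (int i = 0; i < 3; i++)
            printf("invocation %d sees %s\n", i, get_config());
        return 0;
    }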

Outages

  • GitHub
    • Repository forking operations were delayed.
  • Statuspage.io
  • Slack
    • Some customers are seeing an error code (“1AE32E16D91F”) when connecting to Slack.

      Now I really want to know what 1AE32E16D91F is…

  • Twitter