General

SRE Weekly Issue #86

SPONSOR MESSAGE

More efficiently reach on-call teams and incident responders with a new way to deploy Live Call Routing using Twilio Functions and VictorOps. Check it out:
http://try.victorops.com/LiveCallRouting/SREWeekly

Articles

Charity Majors knocks one out of the park with this article on the importance of testing (safely) in production.

Why does testing in production get such a bad rap when we all do it? The key is to do it right.

And speaking of baseball metaphors, here’s a PagerDuty engineer’s first-person account of shadowing on-call during an incident and the lessons she learned.

If you have time, please consider filling out this short survey on post-incident reviews (a.k.a. “retrospectives”) as part of a master’s thesis.

Mathias Lafeldt of Gremlin Inc. gives us this tutorial on moving from hand-run chaos experiments to a fully automated chaos system.

Recently, Jason Hand’s new ebook, Post-Incident Reviews, was published. Here’s his summary of the key points in the first three chapters.

This article is published by my sponsor, VictorOps, but their sponsorship did not influence its inclusion in this issue.

This article describes metrics in three main categories and explains how (and whether) to set up alerts for each kind.

Good output metrics are a close proxy for dollars earned or saved by the system per minute.

Like the author of the previous article, Ilan Rabinovitch of Datadog advocates for symptom-based monitoring and alerting. I like his concept of the improved “durability” of symptom-based alerting (as opposed to cause-based):

[…] you don’t have to update your alert definitions every time your underlying system architectures change.
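
To make that “durability” point concrete, here’s a minimal sketch (in Python, with hypothetical metric names and thresholds) of a symptom-based check: it keys only off what users experience, so it keeps working even if the services behind the endpoint are swapped out, whereas a cause-based check tied to a specific host would have to change with the architecture.

    # Hypothetical symptom-based alert check: it only looks at user-visible
    # signals (error rate and p99 latency at the edge), so it stays valid
    # no matter which backend services currently serve the endpoint.

    def should_page(edge_metrics):
        """edge_metrics is assumed to be a dict of user-facing measurements."""
        error_rate = edge_metrics["5xx_requests"] / max(edge_metrics["total_requests"], 1)
        p99_latency_ms = edge_metrics["p99_latency_ms"]

        # Thresholds are illustrative, not recommendations.
        return error_rate > 0.01 or p99_latency_ms > 1500

    # A cause-based version (e.g. "CPU > 90% on redis-03") would need to be
    # rewritten every time that host is replaced, resharded, or renamed.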

Our systems are always in flux, and that flux sometimes leads to failure. Mathias expands on this line of thinking, urging us to understand the many conditions that led to a failure rather than hunting for a single root cause.

Hosted Graphite had a gnarly problem to solve: how to get information about overload conditions from the backend to the frontend, where throttling could be enacted.
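
The post describes their actual mechanism; purely as an illustration of the general pattern (backends publish an overload signal that the frontend consults before accepting more work), here’s a rough Python sketch with made-up names and a plain dict standing in for whatever shared store they use.

    import time

    shared_store = {}  # stand-in for a real shared store (Redis, ZooKeeper, etc.)

    def report_backend_pressure(backend_id, queue_depth, max_queue_depth):
        # Backends periodically publish how loaded they are.
        shared_store[backend_id] = {
            "pressure": queue_depth / max_queue_depth,
            "updated_at": time.time(),
        }

    def frontend_should_throttle(threshold=0.8, staleness_s=30):
        # The frontend sheds or throttles incoming writes if any backend
        # has recently reported sustained overload.
        now = time.time()
        fresh = [v["pressure"] for v in shared_store.values()
                 if now - v["updated_at"] < staleness_s]
        return any(p > threshold for p in fresh)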

Outages

SRE Weekly Issue #85

SPONSOR MESSAGE

Being on-call sucks – but is it getting better? See what 800+ professionals have to say about being on-call in VictorOps’ annual “State of On-Call” report.
http://try.victorops.com/StateofOnCall/SREWeekly

Articles

Here’s Charity Majors with another gem about how ops looks in the era of distributed systems.

You simply can’t develop quality software for distributed systems without constant attention to its operability, maintainability, and debuggability.

I hope most of you have been reading up on the infamous “Googler manifesto”, and if so, maybe you’ve already seen this article. What caught my eye is the emphasis on people-oriented engineering, because these are the skills that have become increasingly important to me as an SRE.

A key metric goes through the roof and pages you. Why? Answering that can be really easy if you can quickly see the changes deployed to your system around the same time. This article is about a specific product that solves this problem and is thus a bit advertisey, but it’s still a good read.

Here’s a good argument for anomaly detection. Great, but I still have yet to see anomaly detection that I trust! That said, this was still an interesting read due to the real-world story about a glitch Wal-Mart faced.

For the Java crowd, here’s a primer on Resilience4j, a framework that makes it easier to write code that can recover from errors.
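
Resilience4j itself is a Java library; purely as a language-neutral sketch of the circuit-breaker pattern it provides (alongside retries, rate limiters, and bulkheads), here’s a toy Python version with illustrative thresholds.

    import time

    class CircuitBreaker:
        """Toy circuit breaker: open after N consecutive failures, retry after a cooldown."""

        def __init__(self, failure_threshold=5, reset_timeout_s=30):
            self.failure_threshold = failure_threshold
            self.reset_timeout_s = reset_timeout_s
            self.failures = 0
            self.opened_at = None

        def call(self, fn, *args, **kwargs):
            if self.opened_at is not None:
                if time.time() - self.opened_at < self.reset_timeout_s:
                    raise RuntimeError("circuit open; failing fast")
                self.opened_at = None  # half-open: allow one trial call
            try:
                result = fn(*args, **kwargs)
            except Exception:
                self.failures += 1
                if self.failures >= self.failure_threshold:
                    self.opened_at = time.time()
                raise
            self.failures = 0
            return result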

I like the description of their “The Watch” pager rotation in which developers periodically serve.

Grab engineers talk about migrating from Redis to ElastiCache veeeery carefully.

In a nutshell, we planned to switch the datasource for the 20k QPS system, without any user experience impact, while in a live running mode.
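
The post details Grab’s actual approach; as a generic illustration of one common way to stage that kind of cut-over (dual writes plus shadow reads, with hypothetical client objects), here’s a rough sketch.

    import logging

    class MigratingCache:
        """Illustrative dual-write wrapper: writes go to both stores, reads are
        served from the old store while the new one is verified in the shadow."""

        def __init__(self, old_store, new_store):
            self.old = old_store   # e.g. the existing Redis client
            self.new = new_store   # e.g. the ElastiCache client

        def set(self, key, value):
            self.old.set(key, value)
            try:
                self.new.set(key, value)          # best-effort during migration
            except Exception:
                logging.exception("dual-write to new store failed for %s", key)

        def get(self, key):
            value = self.old.get(key)
            try:
                if self.new.get(key) != value:    # shadow read: compare, don't serve
                    logging.warning("mismatch for key %s during migration", key)
            except Exception:
                logging.exception("shadow read failed for %s", key)
            return value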

Outages

  • Paragon (game)
    • Epic Games released version 42 of Paragon, and the new version unexpectedly overloaded their servers. To get back to a good state, they were forced into developing novel code and upgrading a DB on the fly.
  • FedEx
  • SYNQ
    • As mentioned here previously, SYNQ has committed to posting their incident RCAs publicly. In this one, they identified a need for better regression testing.

SRE Weekly Issue #84

SPONSOR MESSAGE

Being on-call sucks – but is it getting better? See what 800+ professionals have to say about being on-call in VictorOps’ annual “State of On-Call” report.
http://try.victorops.com/StateofOnCall/SREWeekly

Articles

How many minutes per month is 99.95% availability? What about 99.957%? Here’s a tool that’ll give you a quick answer, by the author of awesome-sre.
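
For a back-of-the-envelope version of the same calculation, the monthly downtime budget is just (1 - availability) times the minutes in a month, e.g. in Python:

    # Downtime budget for a 30-day month (43,200 minutes).
    MINUTES_PER_MONTH = 30 * 24 * 60

    for availability in (0.999, 0.9995, 0.99957, 0.9999):
        budget = (1 - availability) * MINUTES_PER_MONTH
        print(f"{availability:.5%} -> {budget:.1f} minutes of downtime per month")

    # 99.95% works out to about 21.6 minutes; 99.957% to roughly 18.6 minutes.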

This article is a partial transcript of Catchpoint’s Chaos Engineering and DiRT AMA.

In chaos engineering, we’re saying, “Look, these systems are just too complicated at this point to reasonably expect an engineer to know all of the properties that a system has.”

Somewhat intro-level, but I like this little gem:

[…] we have to put some requests at risk for the sake of protecting our overall availability. We want to keep that risk to a minimum. This raises the question: “What is the smallest experiment we can run that still gives us confidence in the result?”

This article chronicles New Relic’s attempt to test a new system to prove that it was ready for production.

SQS, Kafka, and others tout features like “exactly once” and “FIFO”, but there are necessarily some pretty big caveats and edge cases to those features that really can’t be ignored.
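
One of those caveats is that “exactly once” generally still depends on consumers being idempotent. A hedged sketch, with a made-up handler and an in-memory dedup set standing in for durable storage:

    processed_ids = set()  # in real life this would be durable (e.g. a DB table)

    def handle_message(message_id, payload, process):
        """Deduplicate on a message ID so redelivery doesn't double-apply effects."""
        if message_id in processed_ids:
            return  # already handled; safe to drop the redelivery
        process(payload)
        processed_ids.add(message_id)
        # Note: if we crash between process() and add(), the message can still
        # be applied twice, which is exactly the kind of edge case meant here.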

Really, the title should be “The Google SRE Model”. This article discusses Google’s philosophy that an SRE team is optional for any given system, but if SREs aren’t involved, the owning team should be doing the work SRE would otherwise do.

SYNQ pushes for transparency in incident response and commits to publishing their RCAs publicly (like this one). They also include a simple template for RCAs at the end of the article.

Outages

  • AWS
    • us-east-1 had another one-AZ network outage.
  • Poloniex (altcoin exchange)
  • Skype
  • British Airways
  • Canada
    • A large portion of Canada had a major mobile phone and internet outage due to a fiber cut.
  • Heroku
    • Heroku has had a string of major outages, marked as red on their status page. Apologies for not linking to them individually as they happened; here’s a link to their historical list. No public statement has been posted yet.

      Full disclosure: Heroku is my employer.

SRE Weekly Issue #83

SPONSOR MESSAGE

The definitive guide for DevOps Post-Incident Reviews (AKA – Postmortems). Learn why traditional methods don’t work – and why fast incident response isn’t enough. Download your free copy of the 90+ page eBook from O’Reilly Media and VictorOps.
http://try.victorops.com/post_incident_review/SREWeekly

Articles

Decision fatigue is the diminishment of certain mental faculties after making many decisions. It can cause incidents, and just as importantly, it can make incident response more difficult. After reading this article, I’m wondering if I should be asking incident responders to stop and drink a glass of orange juice before making a tough call during an incident.

Here’s an interesting debugging session that plumbs some of the more obscure depths of TCP.

What does DR look like if your system is serverless? How do you manage performance if you don’t control the thing that loads (and hopefully pre-caches) your code?

The new book on incident response from the folks at Blackrock3 has arrived! They draw on their years of fire incident response experience to teach us how to resolve outages. I had the privilege of attending one of Blackrock3’s 2-day training sessions last week and I highly recommend it.

I like the idea of focusing on reducing customer pain points, even if they’re not directly due to bugs. After all, reliability is all about the customer experience.

Netflix’s ChAP tests a target microservice by creating experimental and control clusters and routing a small portion of traffic to them.
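
ChAP’s implementation is internal to Netflix; purely as an illustration of the routing idea (a small, equal slice of traffic to a control cluster and an experiment cluster, with everything else untouched), a sketch:

    import random

    def choose_cluster(request, experiment_fraction=0.01):
        """Send a small, equal fraction of traffic to control and experiment
        clusters so their behavior can be compared; everything else goes to
        production. Fraction and cluster names are illustrative."""
        r = random.random()
        if r < experiment_fraction:
            return "experiment"   # the cluster with failure injected
        if r < 2 * experiment_fraction:
            return "control"      # identical to production, used as a baseline
        return "production"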

Microservice-based architecture is great, right? The problem is that the fan-out of backend requests can create an amplification vector for a DDoS attack. A small, carefully-constructed API call from an attacker can result in a massive number of requests to services in the backend, taking them down.

The latest from Mathias Lafeldt is this article about post-hoc learning. He draws on Zwieback and Cook, reminding us that both success and failure are normal circumstances in complex systems.

It’s important to understand that every outcome, successful or not, is the result of a gamble.

Remember Awesome SRE? The same author, Pavlos Ratis, has pulled together a ton of links on Chaos Engineering.  Thanks, Pavlos!

He’s also compiled this set of postmortem templates, drawn from various sources.  He’s unstoppable!

What a great idea, and I wish I’d known about it earlier! Pingdom uses their aggregate monitoring data to create a live map of the internet. Might be useful for those big events like the Dyn DDoS or the S3 outage.

Last week, I reported on the disaster that was Niantic’s Pokémon Go live event. Verizon wants to assure us that it wasn’t a capacity issue on their part.

Outages

  • EC2 (us-east-1)
    • Between 6:47 AM and 7:10 AM PDT we experienced increased launch failures for EC2 Instances, degraded EBS volume performance and connectivity issues for some instances in a single Availability Zone in the US-EAST-1 Region. The issue has been resolved and the service is operating normally.

      This one seems to have affected several companies including Heroku and Rollbar.

  • Marketo
    • Marketo failed to renew their domain name registration, reportedly due to a failure in their automated tooling.
  • Instagram
  • Report on July 7, 2017 incident | Gandi News
    • Here’s one I missed from earlier this month.

      In all, 751 domains were affected by this incident, which involved an unauthorized modification of the name servers [NS] assigned to the affected domains that then forwarded traffic to a malicious site exploiting security flaws in several browsers.

      Thanks to an anonymous reader for this one.

  • Threat Stack Status – Config Audit Database Maintenance
    • Another one I missed. This one appears to be a maintenance operation that went wrong. Thanks to an anonymous reader for this one.

SRE Weekly Issue #82

SPONSOR MESSAGE

The definitive guide for DevOps Post-Incident Reviews (AKA – Postmortems). Learn why traditional methods don’t work – and why fast incident response isn’t enough. Download your free copy of the 90+ page eBook from O’Reilly Media and VictorOps.
http://try.victorops.com/post_incident_review/SREWeekly

Articles

Increment issue #2 is out! Want to hear what it was like for these three big companies to move to the cloud? Read on.

This article covers a lot of ground, from general strategy to specific methods for estimating capacity needs. I love this:

Perhaps surprisingly for engineers who work in mission-critical business applications, occasional spikes of 90%+ of our users being entirely unable to use the sole application of our company was an entirely acceptable engineering tradeoff versus sizing our capacity against our peak loads.

I love the insight this article gives me into the huge networks of big CDNs.

Key point: don’t count your chickens before they’ve recovered.

The MTTR time should be stopped when there is verification that all systems are once again operating as expected and end users are no longer negatively affected

Scalyr explains how to move beyond specific playbooks to create a general incident response plan.

Here’s a nice little how-to:

A recent challenge for one of the teams I am currently involved with was to find a way in AWS CloudWatch:

  1. To alert if the metric breaches a specified threshold.
  2. To alert if a particular metric has not been sent to CloudWatch within a specified interval.
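
A minimal boto3 sketch of both alarms (the alarm names, namespace, metrics, and SNS topic ARN are placeholders); the second case hinges on telling the alarm to treat missing data as breaching, so silence trips it.

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    # 1. Alert when the metric breaches a specified threshold.
    cloudwatch.put_metric_alarm(
        AlarmName="my-metric-too-high",            # placeholder name
        Namespace="MyApp",                          # placeholder namespace
        MetricName="QueueDepth",
        Statistic="Average",
        Period=300,
        EvaluationPeriods=1,
        Threshold=1000,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:alerts"],  # placeholder
    )

    # 2. Alert when the metric hasn't been sent at all within the interval:
    #    treat missing data as breaching so the absence of data pages someone.
    cloudwatch.put_metric_alarm(
        AlarmName="my-metric-missing",
        Namespace="MyApp",
        MetricName="HeartbeatCount",
        Statistic="Sum",
        Period=300,
        EvaluationPeriods=1,
        Threshold=1,
        ComparisonOperator="LessThanThreshold",
        TreatMissingData="breaching",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:alerts"],
    )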

And another short how-to, this one on deploying Prometheus in an HA configuration.

Self-care is critical in tech, not only for us as individuals, but for the health and reliability of the entire organization. Overstretched engineers make mistakes. This article introduces a new resource: selfcare.tech, which is a curated, open-source repository of self-care resources.

Outages
