SRE Weekly Issue #88

SPONSOR MESSAGE

Acknowledge and resolve IT & DevOps alerts directly from Slack with the new native integration with VictorOps. Learn all about it here:
http://try.victorops.com/slack/SREWeekly

Articles

From Caitie McCaffrey:

I’m often asked how to get started with Distributed Systems, so this post documents my path and some of the resources I found most helpful. It is by no means meant to be an exhaustive list.

Julia Evans just blew my mind (once again). In this article, among other things, she links to a tool that tells you which function in the kernel dropped a packet. I’ve been wishing for such a tool for years!

I love that companies are starting to publish lessons learned from game days and other chaos experiments. Just like a post-incident followup, there’s so much we can learn by following along.

It’s an absolute must for any disaster recovery plan worth its name to include power supply as a crucial factor – because, without power, you simply can’t do business.

Here’s the last installment of Jason Hand’s digest version of his new eBook, Post-Incident Reviews.

If I leave you with one take-away from this guide, it should be that every incident provides an opportunity for your team to be more prepared for the next one.

This article is published by my sponsor, VictorOps, but their sponsorship did not influence its inclusion in this issue.

How can you prevent a colo failure? Obviously, colo customers can’t, but we can at least prepare. This article has advice for understanding a provider’s history, policies, and procedures related to outages.

Just click through.

In this analysis of the factors leading to a plane crash, we see another example of the critical role that human/computer interfaces play in allowing humans to recover from a system failure (or preventing them from doing so).

Move over, backhoes: water is the other natural enemy of the fiber optic network.

The New York Times has a Kafka installation containing everything they’ve published in their entire history, and it powers the front page, search, suggestions, and everything else.
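To make the “log as source of truth” idea concrete, here’s a rough sketch (not the Times’ actual pipeline; the topic and field names are made up) of rebuilding a derived view by replaying a Kafka topic from the beginning. The point is that any consumer (search, suggestions, the front page) can be rebuilt from scratch just by re-reading the log.

# Rough sketch, not the Times' actual code: rebuild a derived view (say,
# a search index) by replaying a Kafka topic from the earliest offset.
# Topic and field names are hypothetical.
import json
from kafka import KafkaConsumer  # kafka-python client

consumer = KafkaConsumer(
    "published-assets",               # hypothetical topic name
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",     # start from the first record ever written
    enable_auto_commit=False,
    consumer_timeout_ms=10000,        # stop iterating once we're caught up
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

search_index = {}
for record in consumer:
    asset = record.value
    # Later records for the same asset overwrite earlier ones, so the view
    # converges on the latest version of everything ever published.
    search_index[asset["id"]] = asset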

Outages

  • AbeBooks.com
    • AbeBooks is the place to go for out-of-print books and old editions. The site going down meant that many used booksellers lost a major sales outlet.
  • Gmail
  • Apple developer portal
  • Google Drive
  • iCloud Mail
  • Heroku
    • Heroku posted a pile of public followups this past week:
      • Incidents 1251 and 1254 – In both of these incidents, applications failed due to missing Debian packages normally provided by the Heroku platform.
      • Incident 1257 – For a few minutes, 10% of requests to Heroku applications hosted in Europe failed.
      • Incident 1270 – Applications last deployed over 3 years ago spontaneously stopped working.

      Full disclosure: Heroku is my employer.

SRE Weekly Issue #87

SPONSOR MESSAGE

More efficiently reach on-call teams and incident responders with a new way to deploy Live Call Routing using Twilio Functions and VictorOps. Check it out:
http://try.victorops.com/LiveCallRouting/SREWeekly

Articles

John Allspaw describes the Architecture Review Working Group at Etsy. I like the idea of an open discussion with peers before creating a novel system that will add significant operational burden.

Here’s part two of Jason Hand’s series of posts with key takeaways from his new eBook, “Post-Incident Reviews”. In the next three chapters, he shows why a traditional RCA process misses the mark.

[…] problems stem — not from one primary cause — but from the complex interplay of our systems and the teams tasked with managing them.

Honeycomb.io eschews plain monitoring in favor of “observability”, which they define as the ability to “ask any arbitrary question” about a system.

But here’s the thing: in distributed systems, or in any mature, complex application of scale built by good engineers… the majority of your questions trend towards the unknown-unknown.
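A toy illustration of what that looks like in practice (this is not Honeycomb’s SDK; every field name here is made up): instead of pre-aggregating a handful of metrics, you emit one wide, structured event per request and keep all the fields, so questions you didn’t anticipate can still be answered later by slicing on any combination of them.

# Toy sketch, not Honeycomb's SDK: emit one wide, structured event per
# request. All names here are hypothetical.
import json
import sys
import time

def do_work(request):
    # Hypothetical application logic, stubbed for the example.
    return {"ok": True}

def handle_request(request, customer_id, build_id):
    start = time.time()
    status = 200
    try:
        return do_work(request)
    except Exception:
        status = 500
        raise
    finally:
        event = {
            "timestamp": start,
            "duration_ms": (time.time() - start) * 1000,
            "endpoint": request.get("path"),
            "status": status,
            "customer_id": customer_id,
            "build_id": build_id,
        }
        # One line of JSON per request; any field becomes a dimension you
        # can filter or group by after the fact.
        json.dump(event, sys.stdout)
        sys.stdout.write("\n")

handle_request({"path": "/checkout"}, customer_id="c42", build_id="1523")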

Here’s another primer on microservices. It has a nice “caveats” section, which is exactly where operations and reliability come into the picture.

Facebook shared a lot of detail about how they evolved from 3 daily pushes to quasi-continuous releases. They’ve got a well-defined canary system, reminding me of Charity’s article on testing in production last week.

AppDynamics presents their list in shiny PDF form. You’ll have to hand over your contact info (a spam-bucket email address is recommended) to download it.

PagerDuty is hosting a “breakathon”: small teams will compete to resolve a series of infrastructure issues. Sounds like a bunch of fun!

Outages

  • Japan
    • Google accidentally announced some BGP prefixes it shouldn’t have, taking Japan offline for a couple of hours. Linked above is a really neat in-depth analysis from BGPmon, for all you BGP geeks out there.

      Since Google essentially leaked a full table towards Verizon, we get to peek into what Google’s peering relationships look like and how their peers traffic engineer towards Google.

  • Heroku
  • AWS
    • EC2’s Ireland region suffered an outage in VPC peering on August 23. Their status site doesn’t allow for deep links, so here’s an excerpt:

      11:32 AM PDT We are investigating network connectivity issues for some instances in the EU-WEST-1 Region.

      11:55 AM PDT We have identified root cause of the network connectivity issues in the EU-WEST-1 Region. Connectivity between peered VPCs is affected by this issue. Connectivity between instances within a VPC or between instances and the Internet or AWS services is not affected. We continue to work towards full recovery.

      12:51 PM PDT Between 10:32 AM and 12:44 PM PDT we experienced connectivity issues when using VPC peering in the EU-WEST-1 Region. Connectivity between instances in the same VPC and from instances to the Internet or AWS services was not affected. The issue has been resolved and the service is operating normally.

  • Google Cloud
    • Google Cloud suffered a massive 30-hour worldwide outage in some cloud load-balancers. In their impressive style, they posted frequent updates during the incident and issued a followup analysis just 2 days after resolution.

      In order to prevent the issue, Google engineers are working to enhance automated canary testing that simulates live-migration events, detection of load balancing packets loss, and enforce more restrictions on new configuration changes deployment for internal representation changes.

  • WhatsApp
  • Twitch (video streaming service)

SRE Weekly Issue #86

SPONSOR MESSAGE

More efficiently reach on-call teams and incident responders with a new way to deploy Live Call Routing using Twilio Functions and VictorOps. Check it out:
http://try.victorops.com/LiveCallRouting/SREWeekly

Articles

Charity Majors knocks one out of the park with this article on the importance of testing (safely) in production.

Why does testing in production get such a bad rap when we all do it? The key is to do it right.

And speaking of baseball metaphors, here’s a PagerDuty engineer’s first-person account of shadowing on-call during an incident and the lessons she learned.

If you have time, please consider filling out this short survey on post-incident reviews (a.k.a. “retrospectives”) as part of a master’s thesis.

Mathias Lafeldt of Gremlin Inc. gives us this tutorial on moving from hand-run chaos experiments to a fully automated chaos system.

Recently, Jason Hand’s new ebook, Post-Incident Reviews, was published. Here’s his summary of the key points in the first three chapters.

This article is published by my sponsor, VictorOps, but their sponsorship did not influence its inclusion in this issue.

This article describes metrics in three main categories and explains how (and whether) to set up alerts for each kind.

Good output metrics are a close proxy for dollars earned or saved by the system per minute.

Like the author of the previous article, Ilan Rabinovitch of Datadog advocates for symptom-based monitoring and alerting. I like his concept of the improved “durability” of symptom-based alerting (as opposed to cause-based):

[…] you don’t have to update your alert definitions every time your underlying system architectures change.
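To make the contrast concrete, here’s a toy sketch (the metric names and thresholds are made up, not Datadog’s): the symptom-based alert keys off what users actually experience and survives re-architecting, while the cause-based alert is married to one component of today’s architecture.

# Toy sketch of the distinction; thresholds and metric names are invented.
def symptom_alert(error_rate, p99_latency_ms):
    """Fires on user-visible pain: errors or slow responses.
    Still valid no matter which hosts or services sit behind the endpoint."""
    return error_rate > 0.01 or p99_latency_ms > 500

def cause_alert(db_primary_cpu_pct):
    """Fires on one suspected cause; must be rewritten if the database is
    replaced, sharded, or moved to a managed service."""
    return db_primary_cpu_pct > 90

# Example evaluation against hypothetical metric samples:
print(symptom_alert(error_rate=0.002, p99_latency_ms=620))  # True: users are hurting
print(cause_alert(db_primary_cpu_pct=75))                   # False: the "cause" metric looks fine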

Our systems are always in flux, and this sometimes leads to failure. Mathias expands on this line of thinking, urging us to seek out the many conditions that led to a failure rather than a single root cause.

Hosted Graphite had a gnarly problem to solve: how to get information about overload conditions from the backend to the frontend, where throttling could be enacted.
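The article covers their actual solution; purely to illustrate the general shape of the problem, here’s one naive way a backend might signal overload so the frontend can shed load at the edge. The class and names below are mine, not theirs.

# Hypothetical sketch of backend-to-frontend backpressure (not Hosted
# Graphite's implementation): the backend flags overloaded tenants in a
# shared store, and the frontend checks the flag before accepting writes.
import time

class OverloadSignals:
    """Tiny in-process stand-in for a shared store such as Redis."""
    def __init__(self, ttl_seconds=30):
        self.ttl = ttl_seconds
        self._flags = {}  # tenant -> expiry timestamp

    def mark_overloaded(self, tenant):   # called by the backend
        self._flags[tenant] = time.time() + self.ttl

    def is_overloaded(self, tenant):     # called by the frontend
        expiry = self._flags.get(tenant)
        return expiry is not None and expiry > time.time()

signals = OverloadSignals()

def enqueue(tenant, datapoint):
    # Hypothetical hand-off to the backend ingestion pipeline.
    return "accepted"

def frontend_accept(tenant, datapoint):
    if signals.is_overloaded(tenant):
        return "throttled"   # shed load at the edge instead of piling up work
    return enqueue(tenant, datapoint)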

Outages

SRE Weekly Issue #85

SPONSOR MESSAGE

Being on-call sucks – but is it getting better? See what 800+ professionals have to say about being on-call in VictorOps’ annual “State of On-Call” report.
http://try.victorops.com/StateofOnCall/SREWeekly

Articles

Here’s Charity Majors with another gem about how ops looks in the era of distributed systems.

You simply can’t develop quality software for distributed systems without constant attention to its operability, maintainability, and debuggability.

I hope most of you have been reading up on the infamous “Googler manifesto”, and if so, maybe you’ve already seen this article. What caught my eye is the emphasis on people-oriented engineering, because these are the skills that have become increasingly important to me as an SRE.

A key metric goes through the roof and pages you. Why? Answering that can be really easy if you can quickly see the changes deployed to your system around the same time. This article is about a specific product that solves this problem and is thus a bit advertisey, but it’s still a good read.

Here’s a good argument for anomaly detection. Great, but I have yet to see anomaly detection that I trust! That said, it was still an interesting read thanks to the real-world story about a glitch Wal-Mart faced.

For the Java crowd, here’s a primer on Resilience4j, a framework that makes it easier to write code that can recover from errors.

I like the description of their “The Watch” pager rotation in which developers periodically serve.

Grab engineers talk about migrating from Redis to ElastiCache veeeery carefully.

In a nutshell, we planned to switch the datasource for the 20k QPS system, without any user experience impact, while in a live running mode.

Outages

  • Paragon (game)
    • Epic Games released version 42 of Paragon, and the new version unexpectedly overloaded their servers. To get back to a good state, they were forced to develop new code and upgrade a DB on the fly.
  • FedEx
  • SYNQ
    • As mentioned here previously, SYNQ has committed to posting their incident RCAs publicly. In this one, they identified a need for better regression testing.

SRE Weekly Issue #84

SPONSOR MESSAGE

Being on-call sucks – but is it getting better? See what 800+ professionals have to say about being on-call in VictorOps’ annual “State of On-Call” report.
http://try.victorops.com/StateofOnCall/SREWeekly

Articles

How many minutes per month is 99.95% availability? What about 99.957%? Here’s a tool that’ll give you a quick answer, by the author of awesome-sre.
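The underlying arithmetic is simple enough to sanity-check by hand: allowed downtime is (1 - availability) times the length of the window. For example, assuming a 30-day month:

# Allowed downtime = (1 - availability) * window length, here a 30-day month.
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43,200 minutes

def downtime_minutes(availability_pct, window_minutes=MINUTES_PER_30_DAY_MONTH):
    return (1 - availability_pct / 100) * window_minutes

print(round(downtime_minutes(99.95), 2))   # ~21.6 minutes per month
print(round(downtime_minutes(99.957), 2))  # ~18.58 minutes per month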

This article is a partial transcript of Catchpoint’s Chaos Engineering and DiRT AMA.

In chaos engineering, we’re saying, “Look, these systems are just too complicated at this point to reasonably expect an engineer to know all of the properties that a system has.”

Somewhat intro-level, but I like this little gem:

[…] we have to put some requests at risk for the sake of protecting our overall availability. We want to keep that risk to a minimum. This raises the question: What is the smallest experiment we can run that still gives us confidence in the result?

This article chronicles New Relic’s attempt to test a new system to prove that it was ready for production.

SQS, Kafka, and others tout features like “exactly once” and “FIFO”, but there are necessarily some pretty big caveats and edge cases to those features that really can’t be ignored.
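Whatever the broker promises, the usual defensive posture is to make consumers idempotent anyway. Here’s a minimal sketch of consumer-side deduplication (the names are made up, and a real system would persist the seen-ID set durably rather than in memory):

# Minimal sketch of consumer-side idempotency; names are illustrative.
# Even with "exactly once" or FIFO features, redeliveries can happen across
# failures and retries, so the consumer deduplicates on a message ID.
processed_ids = set()  # in production: durable storage, not process memory

def apply_side_effect(message):
    # Hypothetical business logic; must be safe to retry if we crash mid-way.
    pass

def handle(message):
    msg_id = message["id"]
    if msg_id in processed_ids:
        return "skipped duplicate"
    apply_side_effect(message)
    processed_ids.add(msg_id)  # recorded only after the effect succeeds
    return "processed"

print(handle({"id": "m-1", "body": "hello"}))  # processed
print(handle({"id": "m-1", "body": "hello"}))  # skipped duplicate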

Really, the title should be "The Google SRE Model". This article discusses Google's philosophy that the SRE team is optional for any given system, but if SRE isn't around, the owning team should be doing what SRE would otherwise be doing.

SYNQ pushes for transparency in incident response and commits to publishing their RCAs publicly (like this one). They also include a simple template for RCAs at the end of the article.

Outages

  • AWS
    • us-east-1 had another one-AZ network outage.
  • Poloniex (altcoin exchange)
  • Skype
  • British Airways
  • Canada
    • A large portion of Canada had a major mobile phone and internet outage due to a fiber cut.
  • Heroku
    • Heroku has had a string of major outages, marked as red on their status page. Apologies for not linking to them individually as they happened; here’s a link to their historical list instead. No public statement has been posted yet.

      Full disclosure: Heroku is my employer.
