SRE Weekly Issue #112

SPONSOR MESSAGE

Are your monitoring and incident management tools integrated? You shouldn’t be monitoring your infrastructure and code in an old-school fashion. http://try.victorops.com/SREWeekly/Monitoring

Articles

an outage of a provider that we don’t use, directly or indirectly, resulted in our service becoming unavailable.

I don’t think I even need to add anything to that to make you want to read this article.

Fran Garcia — Hosted Graphite

The big story this week is the memcached UDP amplification DDoS method, used to send 1.3 Tbps (!) toward our friends at GitHub. Their description is linked above.

Sam Kottler — GitHub

The internet was alight with related discussions:

An excellent template that you can use as a basis for writing runbooks.

Catie McCaffrey

This author of an upcoming O’Reilly book is looking for small contributions for a crowd-sourced chapter:

In two paragraphs or less, what do you think is the relationship between DevOps and SRE? How are they similar? How are they different? Can both be implemented at every organization? Can the two exist in the same org at the same time? And so on…

David Blank-Edelman

Bandaid started as a reverse proxy that compensated for inefficiencies in our server-side services.

I’m intrigued by the way it handles its queue in last-in first-out order, on the theory that a request that’s been waiting for a long time is likely to be cancelled by its requester.

Dmitry Kopytkov and Patrick Lee — Dropbox

This is an amusing recap of five major outages of the past few years. If you’ve been subscribed for awhile, it’ll be review, but I still enjoyed the reminder.

Michael Rabinowitz

This article summarizes a new research paper on “fail-slow” hardware failures. When hardware only kind of fails, it can often have more disastrous consequences than when it stops working outright.

Robin Harris — Storage Bits

This is an awe-inspiring way to make a point about designing systems to be resilient to human error.

If it’s possible for a human to hit the wrong button and set off an entire fireworks display by accident, then maybe the problem isn’t with the human; it’s with that button.

If it’s possible to mix up minutes and fractions of a second like we’ve done deliberately, then maybe the system isn’t clear, or maybe the pre-launch checklist isn’t thorough enough.

Tom Scott

There are some really great ideas in this article around preventing and ameliorating the technical debt that can be inherent in the use of feature flags. Ostensibly this article is about using Split.io, but the ideas are broadly applicable.

Adil Aijaz — Split

Outages

SRE Weekly Issue #111

I’m trying an experiment this week: I’ve included authors at the bottom of each article.  I feel like it’s only fair to increase exposure for the folks that put in the significant effort necessary to write articles.  It also saves me having to mention names and companies, hopefully leaving more room for useful summaries.

If you like it, great!  If not, please let me know why — reply by email or tweet @SREWeekly.  I feel like this is the right thing to do from the perspective of crediting authors, but I’d like to know if a significant number of you disagree.

Hat-tip to Developer Tools Weekly for the idea.

SPONSOR MESSAGE

Gain visibility throughout your entire organization. Visualize time series metrics with VictorOps and Grafana. http://try.victorops.com/SREWeekly/Grafana

Articles

Conversations around compensation for on-call. What has worked or not for you? $$ vs PTO. Alerts vs Scheduled vs Actual Time?1 x 1.5 or 2x?

The replies to her tweet are pretty interesting and varied.

Lisa Phillips, VP at Fastly
Full disclosure: Fastly is my employer.

This thread is incredibly well phrased, explaining exactly why it’s important for developer to be on call and how to make that not terrible. Bonus content: the thread also branches out into on-call compensation.

if you aren’t supporting your own services, your services are qualitatively worse **and** you are pushing the burden of your own fuckups onto other people, who also have lives and sleep schedules.

Charity Majors — Honeycomb

This week, Blackrock3 Partners posted an excerpt from their book, Incident Management for Operations that you can read free of charge. If you enjoy it, I highly recommend you sign up for their first-ever open enrollment IMS training course. I know I keep pushing this, but I truly believe that incident response in our industry as a whole will be significantly improved if more people train with these folks.

“On-call doesn’t have to suck” has been a big theme lately, with articles and comments on both sides. Here’s a pile of great advice from my favorite ops heroine.

Charity Majors — Honeycomb

An interesting little debugging story involving unexpected SSL server-side behavior.

Ayende Rahien — RavenDB

In this post, I’m going to take a look at a sample application that uses the Couchbase Server Multi-Cluster Aware (MCA) Java client. This client goes hand-in-hand with Couchbase’s Cross-Data Center Replication (XDCR) capabilities.

Hod Greeley — Couchbase

Tips for how to go about scaling your on-call policy and procedures in order to be fair and humane to engineers.

Emel Dogrusoz — OpsGenie

Outages

SRE Weekly Issue #110

SPONSOR MESSAGE

Learn how to accelerate your path to full-stack monitoring and alerting in this webinar. Register now: http://try.victorops.com/SREWeekly/ZenossWebinar

Articles

Facebook goes in-depth on their preparations for New Year’s Day 2018 in their live streaming infrastructure. They used forecasting based on last year and various kinds of load testing to develop the right kind of scaling strategy to meet demand.

Cindy Sridharan went and blew up the internet with an excellent and controversial tweet about on-call. She took to Medium to address all of the discussion that followed, and the result is a pretty excellent article about on-call and work/life balance.

A discussion about how RavenDB handles resource exhaustion, and just how resource exhaustion can be defined and detected.

Honeycomb on using observability tooling to precisely analyze how a change actually affects your users. Did the new feature/bugfix have the effect you expected?

Pusher is obsessed with low latency, and for good reason. When they saw high long-tail latency, they discovered that Haskell’s garbage collector is optimized for throughput, rather than latency.

Facebook’s Project Waterbear seeks to improve resiliency across many of their services through a combination of chaos engineering, cultural changes, and improvements to Rest.li, their common REST framework.

As SREs, we measure, analyze, and provide best practices to help improve the resilience of each application for the application owners and engineering teams.

The tradeoff for more resilient, soft-failing software systems is more complex debugging when things go wrong. As these problems are now more likely to reside deep in application code — which wasn’t the case not along ago — observability tooling is playing catchup.

OpsGenie analyzes AWS’s new DynamoDB Global Tables, a cross-region multi-master NoSQL datastore. They share the upsides and the pitfalls and include a discussion of how to transition to a global table.

A Netflix manager shares his reasons for still being on-call even though he’s a manager, and they’re pretty great. A lot of it has to do with keeping in tune with what it’s like being a developer on his team, especially with regard to on-call burden and operability.

Outages

SRE Weekly Issue #109

SPONSOR MESSAGE

Asking five (or more) whys is outdated. So is trying to find a Root Cause Analysis. Take a look at the case against RCA. http://try.victorops.com/SREWeekly/RootCause

Articles

Pusher had a problem: their service was being bombarded by connections from rogue clients, and they needed to enforce limits. This article is highly polished, with beautiful diagrams and well-constructed explanations.

This is the story of how we quelled the biggest threat to our service uptime for several years.

Structured logging can bring a lot of uniformity to your infrastructure, as lovingly explained in this article. Snyk explains how that uniformity allows for a standardized troubleshooting methodology that helps them get to the bottom of most problems in minutes.

Instead of focusing on the individual intricacies of each part of our system, we train on the common tools to be used for almost every kind of problem.

Feature flags are awesome! But there’s a downside: adding lots of conditional handling to your code can significantly increase code complexity, which can in turn decrease maintainability and increase risk.

Following up on her appearance in the New York Times last week, Charity Majors posted this excellent Twitter thread about the importance of vendor relationship management and generating business value, as any kind of engineer. I’d argue especially as an SRE.

Here’s the latest in Google’s CRE Life Lessons series. Previously, they explained how to build an Escalation Policy, and in this article, they analyze how it would be applied to several fictitious scenarios.

LinkedIn needed a way to test their HDFS cluster against real-world traffic patterns. The existing solutions didn’t meet their needs (for reasons they explain toward the end), so they created Dynamometer.

PagerDuty released a report this week entitled, “The State of IT Work-Life Balance”, which contains the results of their recent survey. This article is an overview, along with some related tidbits about alert fatigue.

Through an anecdote, Baron Schwartz cautions against the use of counter-factuals (“you should have…”) in analyzing the decisions leading up to an outage.

What it says on the tin. This article would make for a great checklist for deploys.

Outages

  • Singpass (Singapore ID system)
  • Uber
  • Fortnite
    • Fortnite hit a new peak of 3.4 million concurrent players last Sunday… and that didn’t come without issues!

      They suffered 6 different outages over two days, and they posted this highly-detailed incident analysis just 5 days later. Normally I tend not to include outages for MMO games because they have so many and rarely post in-depth analyses, but this one is worth a read.

  • Binance (cryptocurrency exchange)
  • Google App Engine
  • US stock brokerages
    • The US stock market had a rough week, and so did several brokerage websites as they dealt with the high trading volume.
  • Super Bowl Advertisers
    • Several companies that purchased expensive commercial slots during the SuperBowl (an american sportsball thing, for you folks outside the US) were unable to handle the web traffic they brought in.
  • Super Bowl
    • NBC had a 45-second blackout in their broadcast of the Superb Owl.

SRE Weekly Issue #108

Wow, I have a lot of great content to share with you this week!  Sometimes it seems like awesome articles come in waves… not sure what that’s about.

SPONSOR MESSAGE

ChatOps continues to gain momentum in all industries. See what Jason Hand had to say about the progression. http://try.victorops.com/SREWeekly/ChatOpsUpdate

Articles

This is the first in a series where New York Times CTO, Nick Rockwell, talks to leaders in the technology world about their work.

There’s so incredibly much awesome in this conversation, and I’ve already seen the internet alight with people quoting it. Charity says so many insightful things that I’m going to have to reread this a couple of times to absorb it all. It’s a must-read!

Xero SRE is back, this time with an article about their incident response process and an overview of their chatbot, Multivac. The bot assists with paging and information tracking and, crucially, guides incident responders through a checklist of actions such as determining severity.

Here’s a fun little distributed system debugging story from the founder of RavenDB.

This CNN article goes into a little more detail about what happened. To my eye, there’s not enough in those details to warrant firing, so there must be more than has been shared publicly.

LinkedIn’s growth from a single datacenter to multiple “hyperscale” locations was accompanied by a cultural shift. They transitioned from “‘Site-Up’ is priority #1” to “taking intelligent risks” as their overall reliability improved.

The program is nominally aimed toward “a variety of industries, including the aerospace, automotive, maritime, manufacturing, oil, chemical, power transmission, medical device, infrastructure planning and extreme event response sectors”, though I can’t help but wonder if it might be applicable to IT.

“Well I’d cut out the pizza and beer and instead pay for Splunk.”

This author pushes us to resist the urge to write something in-house and instead look for external services or software, when the tool is not key to delivering customer value.

Here’s a very well-articulated argument for using a third-party feature-flag service rather than writing your own. I’ve seen every pitfall they mention and more. This article is by Rollout.io, a feature-flag service, but they notably don’t mention their product even once, and they don’t need to. Nicely done, folks.

I think there’s another layer we get out of the postmortem process itself that hasn’t usually been part of the discussion: communicating about your service’s long-term stability.

We should look beyond merely preventing the same kind of incident in the future and improving our incident response process, says this article from PagerDuty.

How many times have you been paged for a server at 95% disk usage, only to find that it’s still months away from full? This article by SignalFX is about a feature on their platform, but its concepts are generally applicable to other tools.

A primer on testing failover in a MongoDB Atlas cluster.

Large numbers of SREs went scrambling last month when we realized that we may suddenly run out of resources on our NoSQL workloads. Here are some concrete numbers on how things actually turned out.

Outages

A production of Tinker Tinker Tinker, LLC Frontier Theme