SRE Weekly Issue #142

SPONSOR MESSAGE

Becoming a reliability engineer takes a unique set of skills and a breadth of knowledge. See what it takes to become an SRE, and use this as a resource to quickly ramp up new SREs:

http://try.victorops.com/sreweekly/becoming-a-reliability-engineer

Articles

The big news this week is the story from Bloomberg alleging a spy chip on SuperMicro motherboards. I say “alleging” because Amazon and Apple have issued unequivocal denials.

Jordan Robertson and Michael Riley — Bloomberg

In the months before the 2016 Pulse nightclub mass shooting in Florida (US), a plan was in the works for getting victims out of a “hot” zone. The story of why it wasn’t implemented echoes the kind of organizational failings we see as SREs.

Abe Aboraya — ProPublica

Facebook is at it again! Here’s a new system based on a state machine driven by Chef.

Declan Ryan — Facebook

Google has produced a new guide on designing DR in Google Cloud Platform:

We’ve put together a detailed guide to help steer you through setting up a DR plan. We heard your feedback on previous versions of these DR articles and now have an updated four-part series to help you design and implement your DR plans.

Grace Mollison — Google

[…] you must be part of the team working on the system. You cannot be someone that hurts a system and then wait for others to fix the problem.

Jan Stenberg — InfoQ

If you’ve ever been woken in the middle of the night just to see that an alert could be solved by adding another server or two to the loadbalancer, you need capacity plans and you need them yesterday.

Evan Smith — Hosted Graphite

[…] our industry has finally reached the tipping point at which it has become viable to build distributed systems from scratch, at a fast pace of iteration and low cost of operation, all while still having a small team to execute

The author argues that it’s possible to avoid building tech debt while still retaining the velocity a new startup needs.

Santiago Suarez Ordoñez — Blameless, Inc.

From a single host, to a bigger host, to leader/follower replication and active/active setups. The distinction between active/active and “Multi-Active” is worth reading.

Sean Loiselle — Cockroach Labs

Outages

SRE Weekly Issue #141

SPONSOR MESSAGE

Are you exploring serverless architecture on AWS? Check out this post to get step-by-step instructions for setting up and maintaining DynamoDB to keep it from waking you up with unactionable alerts:

http://try.victorops.com/sreweekly/dynamodb-and-aws

Articles

An outline of the design of Netflix’s new load balancer, with special emphasis on dealing with faltering backends. Great idea: servers report their utilization level in a response header. Tricky pitfall: the LB is so good at moving requests off of ailing backends that backend failure rate alerts don’t fire.

Mike Smith — Netflix
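
To make the response-header idea concrete, here is a minimal sketch, not Netflix's actual implementation. The header name `X-Backend-Utilization`, the `Balancer` class, and the choice of comparing two randomly sampled backends are all assumptions made for illustration; the article itself should be consulted for the real design.

```python
import random

# Hypothetical header name; the real header is not given in this summary.
UTILIZATION_HEADER = "X-Backend-Utilization"


class Balancer:
    """Toy load balancer that tracks utilization reported by backends."""

    def __init__(self, backends, rng=None):
        # Last utilization (0.0 to 1.0) each backend reported.
        # Backends we have not heard from yet are assumed idle.
        self.utilization = {b: 0.0 for b in backends}
        self.rng = rng or random.Random()

    def record_response(self, backend, headers):
        """Update our view of a backend from the header on its response."""
        value = headers.get(UTILIZATION_HEADER)
        if value is not None:
            self.utilization[backend] = float(value)

    def choose(self):
        """Sample two distinct backends and return the less utilized one."""
        a, b = self.rng.sample(list(self.utilization), 2)
        return a if self.utilization[a] <= self.utilization[b] else b
```

A balancer like this quietly steers traffic away from a struggling backend, which is exactly why the pitfall in the blurb arises: alerts keyed only on backend failure rates may never fire, so the balancer's own view of utilization is worth alerting on as well.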

This article begins by explaining consistency versus availability in distributed data stores and argues that the trade-off is less significant than one might think. Then it describes a pitfall seen in some new data stores. I’ve pondered aloud here in the past about how Spanner can make the claims it does, and this article explains that nicely.

Daniel Abadi

And here’s a refutation of part of the previous article by the creator of RavenDB.

Ayende Rahien

It is tempting to think that ensuring the resilience or continuity of all the individual parts of a business will guarantee the resilience or continuity of the whole.

Dr. Sandra Bell

GitHub used an innovative technique to avoid holding open a long-running code branch while upgrading their application to support Rails 5.2.

Eileen Uchitelle — GitHub

Worker node errors led to cascading failure when they hit Google Compute Engine quotas.

Bogdana Vereha — Travis CI

This week, the US Internal Revenue Service (IRS) issued a report analyzing the tax-day outage that occurred this past April. Linked is a nice summary by The Register.

Thanks to reader Michael Fischer for a tip on the report.

Chris Mellor — The Register

Outages

SRE Weekly Issue #140

SPONSOR MESSAGE

Are you exploring serverless architecture on AWS? Check out this post to get step-by-step instructions for setting up and maintaining DynamoDB to keep it from waking you up with unactionable alerts:

http://try.victorops.com/sreweekly/dynamodb-and-aws

Articles

My sincerest apologies to Dale Markowitz, the author of this article, whom I mispronouned in last week’s issue. I’m kicking myself, because I totally didn’t need to use a pronoun at all.

Dale Markowitz — LOGIC Magazine

Linus Torvalds made waves this week with an email apologizing for his unprofessional behavior and committing to improving.

Linus Torvalds

A pretty detailed article on how LaunchDarkly designed their system for reliability. The streaming vs. polling section is especially interesting.

Adam Zimman — LaunchDarkly

Full disclosure: Fastly, my employer, is mentioned.

Lots of details about how they achieve their reliability goals. I’d love to see a follow-up with more detail on why writing a solution in-house made sense versus adopting something like Kafka.

Mark Marchukov — Facebook

The staging environment plays an important part. If staging isn’t working for your organization, make sure you aren’t making these common mistakes.

Harshit Paul — DZone

The challenges in question involve testing a microservice’s interactions with other microservices. Read about their system for distributing and running mock servers for each microservice.

Mayank Gupta, K.Vineet Nair, Shivkumar Krishnan, Thuy Nguyen, and Vishal Prakash — Grab

My partner suggested I look into the Deepwater Horizon incident, and I’m glad I did. My two key takeaways were normalization of deviance and this gem:

Researchers who study disasters tell us that a long period without an accident can be a big risk factor in itself: Workers learn to expect safe operation as the norm and can’t even conceive of a devastating failure.

James B. Meigs — Slate

Outages

SRE Weekly Issue #139

SPONSOR MESSAGE

SRE teams need to prepare for incidents. Maintain high levels of uptime, prepare for downtime, and create more reliable services by optimizing incident detection, response, and remediation workflows:

http://try.victorops.com/sreweekly/preparing-for-downtime

Articles

Find out how AutoTrader deployed TLS to 3000 vendor websites, and what they did when things went wrong despite their careful deployment strategy.

Lee Goodman — AutoTrader

An excellent short piece about incident response, using the radio recordings from an aircraft accident as a case study.

Sri Ray

No production operation is too big or too small for a checklist. Similarly, no situation is too strenuous for one.

Sri Ray

[…] in this new series, we’re sharing some of our internal SRE processes. This first post looks at the guidelines our SRE team follow to communicate with customers during an incident, with some practical tips, examples, and the thinking behind it all.

Fran Garcia — Hosted Graphite

Here’s why adopting a multi-cloud strategy may not do what you want, while also making your life much harder.

Tyler Treat

Last fall, I linked to a couple of talks on research in automated bugfixing. Facebook has now deployed such a system to production.

Yue Jia, Ke Mao, Mark Harman — Facebook

Microsoft’s Visual Studio Team Services (VSTS) was one of the services impacted by the major Azure outage earlier this month. Here’s an in-depth analysis of what went wrong and what they might (or might not) be able to do to prevent a similar incident.

Buck Hodges — Microsoft

Outages

SRE Weekly Issue #138

SPONSOR MESSAGE

A dedication to SRE will improve the lives of your customers and team. For our August Roundup, we’ve compiled a list of top SRE articles in order to help you keep up with the latest news, tips, and topics in SRE:

http://try.victorops.com/sreweekly/august-sre-roundup

Articles

This episode of Greater Than Code features John Allspaw, and it’s pretty much as awesome as I expected. Some highlights:

  • rather than asking how an incident happened, ask what prevented it from being worse
  • ask “how” rather than “why” an incident happened
  • humans plus technology are together a cognitive system
  • how can you make automation a team player?

Janelle Klein, John Sawers, Rein Henrichs, and Jessica Kerr, with John Allspaw

What does cold start look like on various FaaS platforms? This article has hard numbers obtained through empirical testing.

Mikhail Shilkov

Colm MacCárthaigh explains how shuffle sharding improves reliability by acting like some kind of magic lever made of math.

Colm MacCárthaigh — AWS (thanks to Thread Reader for the thread rollup)
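
The "lever made of math" can be sketched in a few lines. This is an illustrative toy rather than AWS's implementation: the function names, the SHA-256 seeding, and the node names in the usage note are all assumptions.

```python
import hashlib
import random
from math import comb


def shuffle_shard(customer_id, nodes, shard_size):
    """Deterministically pick `shard_size` nodes for a customer."""
    # Seed a private RNG from a stable hash so the same customer
    # always lands on the same set of nodes.
    digest = hashlib.sha256(customer_id.encode()).digest()
    seed = int.from_bytes(digest[:8], "big")
    return frozenset(random.Random(seed).sample(nodes, shard_size))


def full_overlap_fraction(node_count, shard_size):
    """Fraction of possible shards that exactly match one given shard."""
    return 1 / comb(node_count, shard_size)
```

With 8 nodes and shards of size 2 there are 28 possible shards, so `full_overlap_fraction(8, 2)` is 1/28: a customer whose traffic takes down both of its nodes fully knocks out only the customers that happen to share that exact pair, while everyone else keeps at least one healthy node.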

Who cares if your CDN has an eleventeen terabaud backbone uplink? What really matters is how they can serve your traffic.

Matt Levine — CacheFly

An engineer pushes a small change and OkCupid goes up in flames.

A new, entry-level employee takes down a big site, due not to a bug in the employee’s own software but to one in a dependency.

Dale Markowitz — LOGIC Magazine (Issue #5)

What happens when you mix Observability and Serverless? Corey Quinn of Last Week in AWS lets you in on the hottest new operations practice.

Corey Quinn

How will climate change and rising sea levels impact the reliability of our networks?

Carol Barford — iAfrikan

I watched this Nova (PBS) episode this week, and I highly recommend it. It explores why trains crash and what governments are doing to improve safety. The link above requires membership, but you can also watch it on Netflix.

PBS

Outages

A production of Tinker Tinker Tinker, LLC