SRE Weekly Issue #146

A message from our sponsor, VictorOps:

Automation can be used to help classify incident severity and route alerts to the right person or team. Learn how SRE teams are leveraging a refined incident classification and alert routing process to improve system reliability:

http://try.victorops.com/sreweekly/classifying-incident-severity

Articles

NRE Labs is a no-strings-attached, community-centered initiative to bring the skills of automation within reach for everyone. Through short, simple exercises, all right here in the browser, you can learn the tools, skills, and processes that will put you on the path to becoming a Network Reliability Engineer.

Tips on designing your on-call to be fair to the humans involved, including gems like an automatic day off after a middle-of-the-night page.

David Mytton — StackPath

GitHub’s major outage stemmed from a brief cut in connectivity between two of their data centers.

Errata: Last week I mentioned the possibility of a network cut and cited an article about GitHub’s database architecture. I should have credited @dbaops, who made the connection.

Rumors of undocumented packet rate limits in EC2 abound, and I’ve personally run afoul of them. Backed by direct experimentation, this article unmasks the limits.

Matthew Barlocker — Blue Matador
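
If you’re curious to poke at this yourself, here’s a rough sketch of one way to probe for a packets-per-second ceiling (not the article’s actual methodology, and the peer address is a placeholder): blast minimal UDP packets at a neighbor and watch for the send rate to plateau.

```python
import socket
import time

# A crude probe for a packet-rate ceiling: send tiny UDP packets as fast
# as possible and report the achieved rate each second. A hard plateau
# well below line rate hints at a per-instance limit.
# The target address is a placeholder for a peer in the same VPC.
TARGET = ("10.0.0.2", 9999)

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
payload = b"x"  # minimal payload: we care about packets, not bandwidth

window_start, sent = time.monotonic(), 0
while True:
    sock.sendto(payload, TARGET)
    sent += 1
    now = time.monotonic()
    if now - window_start >= 1.0:
        print(f"{sent} packets/sec")
        window_start, sent = now, 0
```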

This sounds an awful lot like those packet rate limits from the previous article…

Chris McFadden — SparkPost

Ever hear of that traffic intersection where they took out all of the signs, and suddenly everyone drove more safely? Woolworths tried a similar experiment with their stores, with interesting results.

Sidney Dekker — Safety Differently

Find out how they discovered the bug and what they did about it. Required reading if you use gRPC, since in some cases it fails to obey timeouts.

Ciaran Gaffney and Fran Garcia — Hosted Graphite
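
For context, setting a deadline on a call with the Python gRPC client looks like the sketch below (the `EchoStub`/`Echo` service is hypothetical); the article’s finding is that you can’t always count on the deadline being honored, so treat it as advisory rather than a guarantee.

```python
import grpc

# Hypothetical generated stub, for illustration only:
# from echo_pb2 import EchoRequest
# from echo_pb2_grpc import EchoStub

def echo_with_deadline(stub, request):
    try:
        # `timeout` sets the gRPC deadline for this call, in seconds.
        return stub.Echo(request, timeout=0.5)
    except grpc.RpcError as err:
        if err.code() == grpc.StatusCode.DEADLINE_EXCEEDED:
            return None  # fail fast instead of hanging the caller
        raise
```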

when we sit with a team to plan the experiment, that is when the light goes on… they start realising how many things they missed and they start cataloging what bad things could happen if something goes bad…

Russ Miles — ChaosIQ

Outages

SRE Weekly Issue #145

A message from our sponsor, VictorOps:

When SRE teams track incident management KPIs and benchmarks, they can better optimize the way they operate, helping SREs create more resilient teams and build more reliable systems:

http://try.victorops.com/sreweekly/top-incident-management-kpis

Articles

An article on looking past human error when investigating air sports accidents, drawing on the writing of Don Norman. Special emphasis on slips versus mistakes:

“Slips tend to occur more frequently to skilled people than to novices
[…]

Mara Schmid — Blue Skies Magazine

A VP of NS1 explains how his company rewrote and deployed their core service without downtime.

Shannon Weyrick — NS1

This guide from Hosted Graphite has a ton of great advice and reads almost as if they’ve released their internal incident response guidelines. Bonus content: check out this exemplary post-incident followup from their status site.

Fran Garcia — Hosted Graphite

Check it out, Atlassian posted their incident management documentation publicly!

On Monday I gave a talk at DOES18 called “All the World’s a Platform”, where I talked about a bunch of the lessons learned by using and abusing and running and building platforms at scale.

I promised to do a blog post with the takeaways, so here they are.

Charity Majors

[…] at a certain point, it’s too expensive to keep fixing bugs because of the high-opportunity cost of building new features. You need to decide your target for stability just like you would availability, and it should not be 100%.

Kristine Pinedo — Bugsnag

Maelstrom is Facebook’s tool to assist engineers in safely moving traffic off of impaired infrastructure.

Adrian Colyer — The Morning Paper (summary)
Veeraraghavan et al. — Facebook (original paper)

Attempting to stamp out failure entirely can have the paradoxical effect of reducing resiliency to anomalous situations. Instead, we need to handle failure constructively.

Daniel Hummerdal — Safety Differently


Outages

SRE Weekly Issue #144

A message from our sponsor, VictorOps:

Customers expect reliability, even in today’s era of CI/CD and Agile software development. That’s why SRE is more important than ever. Learn about the importance of getting buy-in from your entire team when taking on SRE:

http://try.victorops.com/sreweekly/organizational-sre-support

Articles

GitLab is incredibly open with their policies, and incident management is no exception.

GitLab

Ooh, new newsletter! This one focuses specifically on resiliency. It seems to have just a few articles each week with in-depth summaries.

Thai Wood

This article starts with a fictitious(?) account of the kind of failure that can occur when teams step on each other’s toes in a serverless environment. It goes on to discuss techniques for dealing with this class of problems, including careful permission management.

Toby Fee — jaxenter

Sometimes fixing a rarely-occurring bug can be especially difficult. Recording enough information all the time to debug those rare failures would be too resource-intensive. Check out this fascinating technique for working backward from a memory dump to infer the prior contents of memory in the time leading up to a failure.

Adrian Colyer — The Morning Paper (summary)
Cui et al. (original paper)

An introduction to the concept of reactive systems, including a description of their high-level architectural features.

Sinkevich Uladzimir — The Server Side

Initially, you can improve reliability by studying incidents to find out what went wrong. This article explains why that strategy will only get you so far.

Thanks to Thomas Depierre for this one.
Sidney Dekker — Safety Differently

Chaos Monkey Guide for Engineers – Tips, Tutorials, and Training

Gremlin released this huge guide on Chaos Monkey, covering theory, practice, further reading, and lots of other resources.

Gremlin, Inc.

Outages

SRE Weekly Issue #143

SPONSOR MESSAGE

Minimum viable runbooks are a way to spend less time building runbooks and more time using them. Learn more about creating actionable runbooks to support SRE and make on-call suck less:

http://try.victorops.com/sreweekly/minimum-viable-runbooks

Articles

There’s some great statistics theory in here. The challenge is: how can you have accurate, useful A/B tests without having to wait a long time to get a big enough sample size? Can you bail out early if you know the test has already failed? Can you refine the new feature mid-test?

Callie McRee and Kelly Shen — Etsy
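
If you want to play with the “bail out early” idea, here’s a sketch of one classical approach, Wald’s sequential probability ratio test, applied to a conversion-rate test. The baseline and target rates and the error bounds are made-up values, and this isn’t necessarily the method Etsy describes.

```python
import math

def sprt_decision(conversions, trials, p0=0.10, p1=0.12,
                  alpha=0.05, beta=0.20):
    """Wald's SPRT for H0: rate == p0 vs. H1: rate == p1.

    Returns 'accept' (evidence favors p1), 'reject' (favors p0),
    or 'continue' (keep collecting data).
    """
    # Log-likelihood ratio after `conversions` successes in `trials`.
    llr = (conversions * math.log(p1 / p0)
           + (trials - conversions) * math.log((1 - p1) / (1 - p0)))
    upper = math.log((1 - beta) / alpha)  # stop early: the variant wins
    lower = math.log(beta / (1 - alpha))  # stop early: the test has failed
    if llr >= upper:
        return 'accept'
    if llr <= lower:
        return 'reject'
    return 'continue'

# Check after each batch of traffic, e.g. sprt_decision(130, 1000)
```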

Don’t just rename your Ops team to “SRE” and expect anything different, says this author.

Ernest Mueller — The Agile Admin

Great idea:

So what if we monitor the percentage of requests that are over the threshold instead? To alert us when our SLAs are violated, we can trigger alarms when that percentage is greater than 1% over some predefined time window.

Yan Cui
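
As a concrete (if simplified) sketch of that idea, with an assumed 500 ms latency threshold and the 1% budget from the quote:

```python
SLA_THRESHOLD_MS = 500     # assumed per-request latency budget
MAX_VIOLATION_RATE = 0.01  # alert when >1% of requests blow the budget

def should_alert(latencies_ms):
    """latencies_ms: request latencies observed over the alerting window."""
    if not latencies_ms:
        return False
    over = sum(1 for latency in latencies_ms if latency > SLA_THRESHOLD_MS)
    return over / len(latencies_ms) > MAX_VIOLATION_RATE
```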

There’s a ton of detail here, and it’s a great read. Lots of juicy tidbits about PoP selection, load balancing, and performance monitoring.

Oleg Guba and Alexey Ivanov — Dropbox

Full disclosure: Fastly, my employer, is mentioned.

Even as a preliminary report, there’s a lot to digest here about what caused the series of gas explosions last month in Massachusetts (US). I feel like I’ve been involved in incidents with similar contributing factors.

US National Transportation Safety Board (NTSB)

This isn’t just a recap of a bad day, although the outage description is worth reading by itself. Readers also gain insight into the evolution of this engineer’s career and mindset, from entry-level to Senior SRE.

Katie Shannon — LinkedIn

GitLab, in their trademark radically open style, goes into detail on the reasons behind the recent increase in the reliability of their service.

Andrew Newdigate — GitLab

Five nines are key when you consider that Twilio’s service uptime can literally mean life and death. Click through to find out why.

Charlie Taylor — Blameless

Outages

  • Travis CI
  • Google Compute Engine us-central1-c
    • I can’t really summarize this incident report well, but I highly recommend reading it.
  • Azure
    • Duplicated here since I can’t deep-link:

      Summary of impact: Between 01:22 and 05:50 UTC on 13 Oct 2018, a subset of customers using Storage in East US may have experienced intermittent difficulties connecting to resources hosted in this region. Other services leveraging Storage in the region may have also experienced impact related to this incident.

  • Instagram
  • Heroku
    • This one’s notable for the duration: about 10 days of diminished routing performance due to a bad instance.
  • Microsoft Outlook

SRE Weekly Issue #142

SPONSOR MESSAGE

Becoming a reliability engineer takes a unique set of skills and a breadth of knowledge. See what it takes to become an SRE, and use this as a resource to quickly ramp up new SREs:

http://try.victorops.com/sreweekly/becoming-a-reliability-engineer

Articles

The big news this week is the story from Bloomberg alleging a spy chip on SuperMicro motherboards. I say “alleging” because Amazon and Apple have issued unequivocal denials.

Jordan Robertson and Michael Riley — Bloomberg

A plan was in the works in the months before the Pulse nightclub mass shooting in Florida (US) in 2016, designed to get victims out of a “hot” zone. The story of why it wasn’t implemented echoes the kind of organizational failings we see as SREs.

Abe Aboraya — ProPublica

Facebook is at it again! Here’s a new system based on a state machine driven by Chef.

Declan Ryan — Facebook

Google has produced a new guide on designing DR in Google Cloud Platform:

We’ve put together a detailed guide to help steer you through setting up a DR plan. We heard your feedback on previous versions of these DR articles and now have an updated four-part series to help you design and implement your DR plans.

Grace Mollison — Google

[…] you must be part of the team working on the system. You cannot be someone that hurts a system and then wait for others to fix the problem.

Jan Stenberg — InfoQ

If you’ve ever been woken in the middle of the night just to see that an alert could have been resolved by adding another server or two behind the load balancer, you need capacity plans, and you need them yesterday.

Evan Smith — Hosted Graphite

[…] our industry has finally reached the tipping point at which it has become viable to build distributed systems from scratch, at a fast pace of iteration and low cost of operation, all while still having a small team to execute

The author argues that it’s possible to avoid building tech debt while still retaining the velocity a new startup needs.

Santiago Suarez Ordoñez — Blameless, Inc.

From a single host, to a bigger host, to leader/follower replication and active/active setups. The distinction between active/active and “Multi-Active” is worth reading.

Sean Loiselle — Cockroach Labs

Outages

A production of Tinker Tinker Tinker, LLC