SRE Weekly Issue #76

This week, I had the awesome opportunity to attend a short-form training session on the Incident Management System (the broader system that includes Incident Command) given by Blackrock 3 Partners.  Shout-out to Rob, Ron, and Chris – it was awesome meeting you guys, and I really enjoyed our conversations!

SPONSOR MESSAGE

Upcoming devops.com webinar: Top 10 Practices of Highly Successful DevOps Incident Management Teams. Learn more and register: http://try.victorops.com/SRE_Weekly/IncidentMgmtWebinar

Articles

In case you missed it, Uber kicked off this and another investigation in response to a blog post by Susan Fowler, an SRE whose writing I’ve featured here a number of times. I’m pleased at this first step by Uber and I’m looking forward to what comes next. It might be a leave of absence for Uber’s CEO, although no decision has been made yet.

Here’s the 2013 article that started it all. If you’re unfamiliar with Jepsen, it’s an article series on testing various distributed data systems for partition tolerance, along with a companion tool set for inducing failures.

For those not completely “cloud native” (ugh) by this point, here’s a nifty primer on some of the BGP tricks you’ll need to know if you manage your own IP transit links.

Redis has a pretty big gotcha regarding deletion of expired keys, as these engineers discovered. In fact, my experience with Redis was full of operational gotchas like this.

This poor anonymous Reddit poster had a very bad day. The community rallied around them to explain that no, the anonymous poster is not to blame. One of the top commenters is Yorick Peterse, the engineer who inadvertently deleted GitLab.com’s main database earlier this year. Click through to see blamelessness in action.

PagerDuty is deeply invested in the Incident Management System, and most especially Incident Command. This article is a great overview, and if you want more, don’t forget that they also released their incident response documentation a while back, including their Incident Commander training material.

The main theme in this article by StatusPage.io is the direct relationship between increasing complexity and difficulty in attaining high reliability. I like the mention of microservices as a trade-off and not a panacea.

Automation doesn’t replace ops, it augments it. Abstraction doesn’t replace ops, it hides it. Function as a service doesn’t remove complexity, it increases it exponentially.

Outages

  • Amazon product pages went down today in a rare outage
    • The linked story was for an outage on June 7th. There was at least one additional similar outage on June 9th (source: personal experience).
  • Verelox
    • Dutch hosting provider Verelox is having a really rough time:

      First of all, we want to offer our apologies for any inconvenience. Unfortunately, an ex administrator has deleted all customer data and wiped most servers.

      Ouch. Good luck, folks.

SRE Weekly Issue #75

Articles

I’m super-excited to share that I’ll be speaking at Velocity NYC this October! My talk is about what exactly you can do to get out from under a failure of your single DNS provider, if you were so unfortunate as to have only one. It turns out that this question is much harder to answer than I ever imagined.

And while we’re on the subject of DNS, GitHub shared the design they used for their new resilient DNS infrastructure.

I really love when folks take the time to write up their experience in this kind of migration.

Don’t gloss over this one! I don’t want to spoil the punchline of this short but awesome article, but I will say that I always enjoy seeing data that makes me question my previous assumptions.

Production Ready is back! One way we can try to make our systems resilient to human errors is to build checklists. If it works for medicine, it can work for us.

Katie Ballinger, SRE at CircleCI, was part of the SRECon17 Americas panel, “Training New SREs”. I’m grateful to her for this recap for those of us who didn’t make it to the conference.

Microservices are pretty popular right now, and lots of folks have great stuff to say about them. But much like with a lot of the tips in Google’s SRE book, we shouldn’t just blindly implement them. If your company isn’t Netflix or Uber, microservices may cause more harm than good, says Adam Drake.

Not only is this a good idea if you want Ops to be able to actually run your code without pulling their hair out, it just generally means more reliable code. This article goes not only into the “how”, but the “why” too.

Outages

SRE Weekly Issue #74

This is the first issue sent to over 2000 email subscribers (not to mention the 500+ Twitter followers and an unknown number of RSS subscribers!).  Wow!  Thank you all so much for reading and for all the great feedback you’ve sent over the past year and a half.  You make this fun.

Articles

The holy grail of high availability is a multi-datacenter (or cloud) active/active architecture. This article goes into why, including examples of common pitfalls of traditional disaster recovery solutions.

Neat idea: here’s a Stack Overflow question asking for critique of a proposed outline for a post-incident analysis. It’s a great start already, and the answers include some pretty top-notch suggestions.

A tutorial on setting up multi-region failover for an S3-hosted website, written in response to February’s major S3 outage in us-east.

Last week, I linked to an article about debugging an overloaded ELB node. This week we have the sequel, a deep dive into the intricate details behind the problem, complete with a trip into the glibc source code.

Netflix uses data science to figure out how to fill the limited space on their edge content delivery nodes with the videos that people will request, all while (hopefully) avoiding hot nodes.

Zayna Shahzad, a PagerDuty software engineer, did customer support for a day, and she learned a ton. As SREs, we have the customer experience directly in our sights, so this kind of thing sounds like a really great idea.

Charity Majors does not want to be an SRE. Find out why by watching this 5-minute video interview between her and Rob Hirschfeld. I don’t often link to videos, because who has time to watch stuff? But this one is pretty intriguing.

Server Density originated the term “humanops”, and now they share 12 parts of how they practice it.

A Malaysian doctor writes about how to ensure that the national health system’s on-call policy is safe for doctors.

The passing of a paediatrician-to-be involved in a road traffic accident (motor-vehicle accident) recently is indeed a heart-breaking news to the whole medical fraternity. With the incident, a persistent recurring issue also resurfaced – work-related commuting accident ie road traffic accidents involving exhausted doctors after on-calls.

Do what better? Prevent and end illegal and unethical actions like discrimination, harassment, and retaliation. This article is by Susan Fowler, featured here a bunch, and while it’s not directly related to SRE, it’s so important that I urge you to read it.

Outages

  • Monitorama 2017 PDX
    • Monitorama (and a swathe of Portland) suffered a power outage last week. The organizers created a status site post (linked) and quickly organized a disaster recovery site: an entirely separate conference venue. Seriously amazing work, and oddly appropriate given the conference subject matter.

      If you didn’t make it to Monitorama, here’s a summary from LinkedIn SRE Michael Kehoe.

  • Sacramento Airport (CA, USA)
  • British Airways

SRE Weekly Issue #73

SPONSOR MESSAGE

Concerned about downtime? VictorOps helps you prepare, respond, and recover from IT and DevOps Incidents. Swing by our product center to learn how and start your trial. http://try.victorops.com/SREWeekly/ProductCenter

Articles

ELBs (Amazon’s Elastic Load Balancers) depend on clients properly respecting DNS round-robin record sets. This article follows a debugging session in excellent detail as they try to answer the question: why are our clients preferring (and overloading) just one ELB IP?
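To illustrate why client behavior matters here, this is a minimal sketch of the difference between a client that always takes the first record in a set and one that picks at random. It is not taken from the article: `pick_address` is a hypothetical helper, and the IPs are documentation-range placeholders rather than real ELB addresses.

```python
import random
from collections import Counter


def pick_address(addresses: list[str], rng: random.Random) -> str:
    """Pick one address uniformly at random from a DNS record set."""
    if not addresses:
        raise ValueError("empty record set")
    return rng.choice(addresses)


# Hypothetical record set (documentation-range IPs, not real ELB addresses).
records = ["192.0.2.10", "192.0.2.11", "192.0.2.12"]
rng = random.Random(42)  # seeded only to make the demo repeatable

# A client that always uses the first record sends everything to one node.
always_first = Counter(records[0] for _ in range(3000))

# A client that picks at random spreads load across the whole record set.
spread = Counter(pick_address(records, rng) for _ in range(3000))
```

In the first case one address absorbs all 3000 requests; in the second, each address ends up with roughly a third. Real resolvers and client libraries may reorder or cache records in ways this sketch ignores.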

Sarah Schieffer Riehl shares her take on ServerlessConf Austin 2017. She’s got a healthy dose of skepticism that I like, concluding that “serverful and serverless architectures don’t do the same things.” I like this bit:

For processes that require polling or any kind of server wakefulness, converting to a serverless architecture can be an exercise in “serverless for serverless’ sake”.

Wow, this dovetails so well with Todd Conklin’s “Safety Moment” from last week, on imagining all the possible things that could go wrong. I’d love to hear more thoughts along these lines: is it possible to design a reliable system without envisioning the majority of things that could go wrong?

PagerDuty outlines an incident lifecycle management policy based on ITIL.

Dropbox created Cape for “asynchronous processing of billions of events a day, powering many Dropbox features”. Example: you upload a text file, and a Cape job indexes it immediately for full-text searching. I’d love to hear more on why existing solutions didn’t fit the bill, although they do cover their requirements in depth.

When I signed on for my first SRE position, I had no idea how huge a part vendor relations would play in ensuring reliability.

Initially, LinkedIn’s SRE team hired engineers based only on technical skill. As they’ve grown, they’ve discovered the importance of collaboration skills as well.

StatusPage.io explains the reasons for having a solid incident communication policy and guides you through setting one up.

As the title suggests, this ACM Queue article goes into some depth on the kinds of calculations one might make when designing a reliable system. Specifically, they focus on service dependencies and introduce Google’s “rule of the extra 9”: a dependency should have one more nine of reliability than the thing that critically depends on it.
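To make the arithmetic behind that rule concrete, here is a minimal sketch. It assumes critical dependencies fail independently and that any dependency outage takes the service down, so availabilities simply multiply. The function names and example figures are mine, not the article's.

```python
def nines_to_availability(nines: int) -> float:
    """Convert a count of nines to an availability fraction, e.g. 3 -> 0.999."""
    return 1.0 - 10.0 ** -nines


def combined_availability(own: float, dependencies: list[float]) -> float:
    """Best-case availability of a service given its critical dependencies.

    Assumes independent failures and that any dependency outage takes
    the service down, so the availabilities multiply.
    """
    result = own
    for dep in dependencies:
        result *= dep
    return result


def downtime_minutes_per_year(availability: float) -> float:
    """Expected unavailable minutes in a 365-day year."""
    return (1.0 - availability) * 365 * 24 * 60


# A three-nines service with three critical dependencies.
same_nines = combined_availability(
    nines_to_availability(3), [nines_to_availability(3)] * 3
)
extra_nine = combined_availability(
    nines_to_availability(3), [nines_to_availability(4)] * 3
)
```

With dependencies at the same three nines, the service can do no better than roughly 99.6%. Giving each dependency an extra nine brings the ceiling back to roughly 99.87%, much closer to the service's own 99.9% target.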

At the next conference, when somebody tries to sell you a circuit breaker talk, tell them that this is only the starter and ask for the main course.

Outages

SRE Weekly Issue #72

Articles

Idempotence is a critically important tool in building a reliable system. Stripe explains the concept and shows how they wrap theoretically non-idempotent actions like charging a credit card into safely idempotent API calls.
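Here is a rough sketch of the general idea of an idempotency key, as I understand it. This is a toy in-memory model, not Stripe's implementation or API: `PaymentService`, its fields, and the `cust_123` identifier are all hypothetical.

```python
import uuid


class PaymentService:
    """Toy in-memory stand-in for a payment API (hypothetical, not Stripe's)."""

    def __init__(self) -> None:
        self._responses: dict[str, dict] = {}  # idempotency key -> saved response
        self.charges_made = 0

    def charge(self, idempotency_key: str, customer: str, amount_cents: int) -> dict:
        # If this key was seen before, replay the stored response
        # instead of charging the card a second time.
        if idempotency_key in self._responses:
            return self._responses[idempotency_key]

        self.charges_made += 1  # stands in for the non-idempotent side effect
        response = {
            "charge_id": f"ch_{self.charges_made}",
            "customer": customer,
            "amount_cents": amount_cents,
        }
        self._responses[idempotency_key] = response
        return response


service = PaymentService()
key = str(uuid.uuid4())  # the client generates one key per logical operation

first = service.charge(key, "cust_123", 500)
retry = service.charge(key, "cust_123", 500)  # e.g. a retry after a timeout
```

The retry returns the same response as the first call, and only one charge is recorded. A production version would also need durable storage, key expiry, and handling for a retry that arrives while the first request is still in flight, none of which this sketch attempts.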

Here’s an account of an effort to move from server-based paging (this server is down) to functional-based alerting (this user action isn’t working), with a resulting impressive reduction in out-of-hours paging.

It pays to study up and deeply understand what a simple metric like “cpu utilization” really means.

Why am I linking to AWS’s status site? Look closely, and you’ll see that the “green checkmark i” symbol has been replaced with a far more noticeable blue circle with a white diamond. Check out the old icon here for comparison. End of an era, or just another way of presenting the same information?

The author introduces a new Ruby gem, grpc-commons, that makes it easy to add circuit breaker and statsd support to a gRPC client.
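If you haven't run into the circuit breaker pattern before, here is a minimal sketch of the general idea. It is written in Python and is not the gem's API: the `CircuitBreaker` class, its parameters, and `CircuitOpenError` are hypothetical names chosen for illustration.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after  # cool-down in seconds
        self.clock = clock              # injectable so tests can fake time
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Still cooling down: fail fast without touching the backend.
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

After `max_failures` consecutive errors, calls are rejected immediately until `reset_after` seconds have passed, at which point a single trial call decides whether the circuit closes again. This version is not thread-safe and counts every exception as a failure, which a real client would likely want to refine.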

Along with being a tutorial on setting up Zipkin with Python, this article also explains some basic Zipkin concepts.

PagerDuty is apparently trying to position itself as more than just a paging service, with a few new features around the entire incident lifecycle. I’m especially interested in checking out the new postmortem tooling.

I included this article last week, but my link was outdated and returned a 404. Here’s the corrected link — sorry about that!

I put a call out for a review of Elastic’s new beta anomaly detection feature last week, and here one is! Thanks to an Elastic employee for forwarding this link to me.

This article cautions you to look past an obvious root cause, because a deeper systemic or policy problem may be lurking behind it.

Serverless / FaaS abstract away traditional provisioning, and they make it really easy to ignore planning for resource usage.

Wow, what a concept:

you can think of […] reliable systems […] as successfully imagining all of the potential things that could go wrong

This 2.5-minute podcast from Todd Conklin has a really great question: to achieve reliability, do we have to try to imagine in advance all of the possible ways our systems could fail?

A patient was given an incorrect syringe resulting in a 5x insulin overdose. Brigham and Women’s Hospital reports on the accident and what they’re doing to prevent mistakes of this sort in the future.

Consumers today have increasingly high expectations for digital applications and service performance, but do IT personnel feel equipped to rise to the occasion? In this survey, we uncover the extent of the digital services expectation gap between consumers and IT teams as well as top strategies teams are using to solve digital disruption challenges.

Outages

  • Our First Kubernetes Outage – Saltside Engineering
    • Kudos to the Saltside folks for sharing a public postmortem for an internal, non-customer-impacting outage!

      This is public postmortem for an a complete shutdown of our internal Kubernetes cluster. It’s shared with you all so everyone may learn.

  • “Re-experience the fun of customizing your Place Page!” A Tale of Oops from Ops
    • Ouch. Linden Lab’s ops team discovered the hard way that they didn’t have a working backup copy of some customer data. The best part of this article is the discussion of the “Shrek Ears” tradition at Linden. It’s one of the things I remember most fondly from my time there, and having worn the ears a few times in my day, I can attest to the fact that it’s a great way to handle the psychological impact of making a mistake.
  • Chase (bank)
  • Facebook
A production of Tinker Tinker Tinker, LLC