SRE Weekly Issue #58

SPONSOR MESSAGE

“The How and Why of Minimum Viable Runbooks.” Get the free ebook from VictorOps.

Articles

I’m going to break my normal format and post this outage up here in the article section. Here’s why: GitLab was extremely open about this incident, their incident response process, and even the actual incident response itself.

Linked is their blog post about the incident, with an analysis from 24 hours after the incident that runs circles around the postmortems released by many other companies days after an outage. They also linked to their raw incident response notes (a Google Doc).

Here’s what really blows me away: they live streamed their incident response on YouTube. They’re also working on their postmortem document publicly in a merge request and tracking remediations publicly in their issue tracker. Incredible.

Their openness is an inspiration to all of us. Here are a couple of snippets from the email I sent them earlier this week that is (understandably) still awaiting a response:

[…] I’m reaching out with a heartfelt thank you for your openness during and after the incident. Sharing your incident response notes and conclusions provides an unparalleled educational resource for engineers at all kinds of companies. Further, your openness encourages similar sharing at other companies. The benefit to the community is incalculable, and on behalf of my readers, I want to thank you!

[…] Incidents are difficult and painful, but it’s the way that a company conducts themselves during and after that leaves a lasting impression.

Julia Evans is back this week with a brand new zine about networking. It’ll be posted publicly in a couple weeks, but until then, you can get your own shiny copy just by donating to the ACLU (who have been doing a ton of awesome work!). Great idea, Julia!

You can now read the Google SRE book online for free! Pretty nifty. Thanks Google.

An in-depth dive into how Twitter scales. I’m somewhat surprised that they only moved off third-party hosting as recently as 2010. Huge thanks to Twitter for being so open about their scaling challenges and solutions.

Here’s a good intro to unikernels, if you’re unfamiliar with them. The part that caught my attention is under the heading, “How Do You Debug the Result?”. I’m skeptical of the offered solution, “just log everything you need to debug any problem”. If that worked, I’d never need to pull out strace and lsof, yet I find myself using them fairly often.

This article reads a whole lot more like “process problems” than “human error”. Gotta love the flashy headline, though.

Just what exactly does that “five nines” figure in that vendor’s marketing brochures mean, anyway?
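
For a rough sense of scale, here’s a quick back-of-the-envelope sketch of how much downtime each “nines” level actually allows. The arithmetic is the easy part; what the fine print counts as “down”, and over what measurement window, is where the marketing wiggle room lives.

    # Downtime budget per year and per 30-day month for common availability
    # targets. Simple arithmetic; real SLAs also define what counts as "down"
    # and over what window it's measured, which matters just as much.
    MINUTES_PER_YEAR = 365.25 * 24 * 60
    MINUTES_PER_MONTH = 30 * 24 * 60

    for label, availability in [("three nines", 0.999),
                                ("four nines", 0.9999),
                                ("five nines", 0.99999)]:
        per_year = (1 - availability) * MINUTES_PER_YEAR
        per_month = (1 - availability) * MINUTES_PER_MONTH
        print(f"{label}: {per_year:7.1f} min/year, {per_month:6.2f} min/month")

Five nines works out to roughly five minutes of downtime per year, which is exactly why the definition of “downtime” in those brochures deserves a close read.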

Your customers may well take to Twitter to tell you (and everyone else) about your outages. PagerDuty shares suggestions for how to handle it and potentially turn it to your advantage.

This operations team all agreed to work a strict 9-to-5 and avoid checking email or Slack after hours. They shared their experience every day in a “dark standup” on Slack: a text-based report of whether each engineer is falling behind and what they’ll do with the extra hours they would normally have worked. They summed up their conclusions in this article, and it’s an excellent read.

Faced with limited financing and a high burn rate, many startups focus on product development and application coding at the expense of operations engineering.

And the result is operational technical debt.

It’s really interesting to me that paying physicians extra for on-call shifts seems to be an industry standard. Of all my jobs, only one provided special compensation for on-call. It made the rather rough on-call work much more palatable. Does your company provide compensation or Time Off In Lieu (TOIL)? I’d love it if you’d write an article about the reasons behind the policy and the pros and cons!

Bringing non-traditional Ops folks, including developers, on-call can be a tricky process. Initial reactions tend to be highly polarized, either total pushback and refusal, or a meek acceptance coupled with fear of the unknown. For the former, understanding the root of the refusal is useful. For the latter, providing clarity and training is important.

This article is published by my sponsor, VictorOps, but their sponsorship did not influence its inclusion in this issue.

Kind of neat, especially in this era of infrastructures built on (mostly) commodity hardware.

The Emperor has no clothes on, NoOps isn’t a thing, and you still have to monitor your serverless applications. Sorry about that.

Outages

  • Telstra
    • A fire at an exchange resulted in an outage and somehow also caused SMSes to be mis-delivered to random recipients.
  • Heroku
    • Full disclosure: Heroku is my employer.
  • Google App Engine
  • 123-Reg

SRE Weekly Issue #57

A short one this week as I recover from a truly heinous chest cold. Thanks, 2017.

SPONSOR MESSAGE

“The How and Why of Minimum Viable Runbooks.” Get the free ebook from VictorOps.

Articles

In this issue of Production Ready, Mathias shows how his team set up semantic monitoring. They continuously run integration tests and feed the results into their monitoring system, rather than running CI only when building new code.

[…] just because the services themselves report to be healthy doesn’t necessarily mean the integration points between them are fine too.
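
As a rough illustration of the pattern (this is my own sketch, not Mathias’s setup), the loop below runs an integration test suite periodically and reports pass/fail to a StatsD-compatible endpoint so it can be alerted on like any other metric. The test command, metric name, and StatsD address are all placeholders:

    import socket
    import subprocess
    import time

    STATSD_ADDR = ("127.0.0.1", 8125)           # placeholder StatsD host/port
    TEST_CMD = ["pytest", "tests/integration"]  # placeholder integration suite

    def report_gauge(metric, value):
        """Send a StatsD gauge over UDP (fire-and-forget)."""
        payload = f"{metric}:{value}|g".encode()
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.sendto(payload, STATSD_ADDR)

    while True:
        result = subprocess.run(TEST_CMD, capture_output=True)
        # 1 = integration suite passing, 0 = failing; alert on 0 just like
        # any other health metric.
        report_gauge("semantic_check.integration_suite.passing",
                     1 if result.returncode == 0 else 0)
        time.sleep(300)  # re-run every five minutes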

By “construction outage”, the headline means “a network outage due to a fiber cut that was caused by construction”. It will be interesting to see whether this suit is successful.

Recommendations for an on-call hand-off procedure. It’s geared toward using the VictorOps platform, but the main ideas apply more broadly. I like the idea of reviewing deploys as well as incidents and of running a monthly review of handoffs.

This article is published by my sponsor, VictorOps, but their sponsorship did not influence its inclusion in this issue.

Outages

SRE Weekly Issue #56

SPONSOR MESSAGE

It’s time to fix your incident management. Built for DevOps, VictorOps helps you respond to incidents faster and more effectively. Try it out for free.

Articles

If you have a minute (it’ll only take one!), would you please fill out this survey? Gabe Abinante (featured here previously) is gathering information about the on-call experience with an eye toward presenting it at Monitorama.

Wow, what a resource! As the URL says, this is “some ops for devs info”. Tons of links to useful background for developers that are starting to learn how to do operations. Thanks to the author for the link to SRE Weekly!

AWS Lambda response time can increase sharply if your function is accessed infrequently. I love the graphs in this post.
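
If you want to see the effect for yourself, a rough sketch like the one below times back-to-back invocations with boto3. The function name is a placeholder, and you’d need to let the function sit idle for a while beforehand to catch a genuine cold start:

    import time

    import boto3

    FUNCTION_NAME = "my-test-function"  # placeholder

    client = boto3.client("lambda")

    # The first invocation after the function has sat idle typically includes
    # cold-start overhead; the second one hits a warm container and should not.
    for label in ("first (possibly cold)", "second (warm)"):
        start = time.monotonic()
        client.invoke(FunctionName=FUNCTION_NAME, Payload=b"{}")
        elapsed_ms = (time.monotonic() - start) * 1000
        print(f"{label}: {elapsed_ms:.1f} ms")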

A top-notch article on how to avoid common load-testing pitfalls. Great for SREs as well as developers!

A description of an investigation into poor performance in a service with an SLA of 100% of requests served in under 5ms.

Docker posted this article on how they designed InfraKit for high availability.

No!!

A blanket block of ICMP on your network device breaks important features such as ping, traceroute, and path MTU discovery. Path MTU discovery (the “Fragmentation Required” messages) is especially important: dropping those messages can cause connections to appear to time out for no obvious reason.

Outages

SRE Weekly Issue #55

SPONSOR MESSAGE

It’s time to fix your incident management. Built for DevOps, VictorOps helps you respond to incidents faster and more effectively. Try it out for free.

Articles

Nothing is worse than finding out the hard way that your confidence in your backup strategy was ill-founded. Facebook prevents this with what is, in retrospect, a blatantly obvious idea that I never thought of: continuously and automatically testing backups by trying to restore them.
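
The core loop is simple enough to sketch. Here’s a hypothetical nightly job for a Postgres dump; the paths, connection details, and sanity query are all made up for illustration, and a real check would be far more thorough:

    import subprocess

    import psycopg2  # assumes a Postgres backup; swap in your own stack

    BACKUP_FILE = "/backups/latest.dump"                          # placeholder
    SCRATCH_DSN = "dbname=restore_test host=scratch-db.internal"  # placeholder

    def verify_latest_backup():
        # Restore the most recent dump into a throwaway database.
        subprocess.run(
            ["pg_restore", "--clean", "--if-exists",
             "--host=scratch-db.internal", "--dbname=restore_test", BACKUP_FILE],
            check=True,
        )
        # Minimal sanity check: the restored data should exist at all.
        with psycopg2.connect(SCRATCH_DSN) as conn:
            with conn.cursor() as cur:
                cur.execute("SELECT count(*) FROM orders")  # placeholder table
                (count,) = cur.fetchone()
        if count == 0:
            raise RuntimeError("Restored backup contains no rows!")

    if __name__ == "__main__":
        verify_latest_backup()  # run nightly from cron or your scheduler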

Route 53 can do failover based on health checks, but it doesn’t know how to check if a database is healthy. This article discusses using an HTTP endpoint that checks the status of the DB and returns status 200 or 500 depending on whether the DB is up. There’s also a discussion of how to handle failure of the HTTP endpoint itself.
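
Here’s a minimal sketch of such an endpoint, assuming a Postgres database and psycopg2 (the article doesn’t prescribe a particular stack, and the connection string below is a placeholder). You’d point the Route 53 health check at /health on this service:

    from http.server import BaseHTTPRequestHandler, HTTPServer

    import psycopg2  # assumption: the database being checked is Postgres

    DB_DSN = "dbname=app host=primary-db.internal connect_timeout=2"  # placeholder

    def db_is_healthy():
        try:
            with psycopg2.connect(DB_DSN) as conn:
                with conn.cursor() as cur:
                    cur.execute("SELECT 1")
                    cur.fetchone()
            return True
        except Exception:
            return False

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # Route 53 treats 2xx/3xx responses as healthy, anything else as not.
            status = 200 if self.path == "/health" and db_is_healthy() else 500
            self.send_response(status)
            self.end_headers()

    if __name__ == "__main__":
        HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()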

Chaos Monkey was designed with the idea of having it run all the time on a schedule, but as Mathias Lafeldt shares, you can also (or even exclusively) trigger failures through an API. He even wrote a CLI for the API.
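
For the curious, triggering a termination through the API looks roughly like the sketch below. I’m reconstructing the endpoint path and payload from memory of the Simian Army REST docs, so treat both as assumptions and verify them against your own deployment; the host and ASG name are placeholders:

    import requests

    # Assumption: this endpoint path and payload shape match your Simian Army
    # deployment; double-check its REST docs. Host and ASG name are placeholders.
    CHAOS_URL = "http://chaosmonkey.internal:8080/simianarmy/api/v1/chaos"

    resp = requests.post(
        CHAOS_URL,
        json={
            "eventType": "CHAOS_TERMINATION",
            "groupType": "ASG",
            "groupName": "my-test-asg",
        },
        timeout=10,
    )
    resp.raise_for_status()
    print(resp.json())  # details of the instance chosen for termination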

Here’s a link shared with me by its author. If you write something you think other SREs will like, please don’t hesitate to send it my way! I love this article, because load testing is yet another aspect of the growing trend toward developers owning the operation of their code.

This article is short and sweet. There are four rock-bottom metrics that you really need in order to figure out whether something is wrong with your service. They had me at “Downstreamistan”.

This description of Chaos Engineering is more rigorous than casual articles, making for a pretty interesting read even if you already know all about it.

Although the term “chaos” evokes a sense of unpredictability, a fundamental assumption of chaos engineering is that complex systems exhibit behaviors regular enough to be predicted.

I haven’t had a chance to watch this yet, but the description is riveting even by itself. Click through for a link to play the documentary directly.

Outages

  • Second Life
    • One transit provider failed and automatic failover didn’t work. Once they were back up, the subsequent thundering herd of logins threatened to take them back down. Click through for a detailed post-incident analysis.
  • S3, EC2 API
    • On January 10, S3 had issues processing DELETE requests (though you wouldn’t know it from looking at the history section of their status page). Various (presumably) dependent services such as Heroku and PackageCloud.io had simultaneous outages.

      Full disclosure: Heroku is my employer.

  • Lloyds Bank
  • Mailgun
  • Battlefield 1
  • Facebook

SRE Weekly Issue #54

SPONSOR MESSAGE

The “2016/17 State of On-Call” report from VictorOps is now available to download. Learn what 800+ respondents have to say about life on-call, and steps they’re taking to make it better. Get your free copy here: https://victorops.com/state-of-on-call

Articles

Wow! PagerDuty made waves this week by releasing their internal incident response documentation. This is really exciting, and I’d love it if more companies did this. Their incident response procedures are detailed and obviously the result of hard-won experience. The hierarchical, almost militaristic command and control structure is intriguing and makes me wonder what problems they’re solving.

Lots of detail on New Relic’s load testing strategy, along with an interesting tidbit:

In addition, as we predicted, many sites deployed new deal sites specifically for Cyber Monday with less than average testing. Page load and JavaScript error data represented by far the largest percentage increase in traffic volume, with a 56% bump[…]

Last in the series, this article is an argument that metrics aren’t always enough. Sometimes you need to see the details of the actual events (requests, database operations, etc) that produced the high metric values, and traditional metrics solutions discard these in favor of just storing the numbers.

Let’s Encrypt has gone through a year of intense growth in usage. Their Incidents page has some nicely detailed postmortems, if you’re in the mood.

An eloquent post on striving toward a learning culture in your organization, as opposed to a blaming one, when discussing adverse incidents.

I like to include the occasional debugging deep-dive article, because it’s always good to keep our skills fresh. Here’s one from my coworker on finding the source of an unexpected git error message.

Full disclosure: Heroku, my employer, is mentioned.

Outages

A production of Tinker Tinker Tinker, LLC