SRE WEEKLY – Page 87 – scalability, availability, incident response, automation

SRE Weekly Issue #60

lex

February 20, 2017

Sorry I’m late this week! My family experienced a low-redundancy event as two grown-ups and one kid (so far) have been laid low by Norovirus.

That said, I’m glad that the delay provided me the opportunity to share this first article so soon after it was published.

Articles

Reflecting on one very, very strange year at Uber

Susan Fowler’s articles have been featured here several times previously, and she’s one of my all-time favorite authors. Now it seems that while she was busy writing awesome articles and a book, she was also dealing with a terribly toxic and abhorrent environment of sexual harassment and discrimination at Uber. I can only be incredibly thankful that somehow, despite their apparent best efforts, Uber did not manage to drive Susan out of engineering as happens all to often in this kind of scenario.

Your Team Could Be Just Like Uber, Especially If You’re Certain It’s Not.

Even, and perhaps especially if we think we’re doing a good job preventing the kind of abusive environment Susan described, it’s quite possible we’re just not aware of the problems. Likely, even. This kind of situation is unfortunately incredibly common.

GitLab.com / runbooks

Wow, what a cool idea! GitLab open-sourced their runbooks. Not only are their runbooks well-structured and great as examples, some of them are general enough to apply to other companies.

Ship Small Diffs

Every line of code has some probability of having an undetected flaw that will be seen in production. Process can affect that probability, but it cannot make it zero. Large diffs contain many lines, and therefore have a high probability of breaking when given real data and real traffic.

Full disclosure: Heroku, my employer, is mentioned.
Thanks to Devops Weekly for this one.

Swapping, memory limits, and cgroups

TIL: cgroup memory limits can cause a group of processes to use swap even when the system as a whole is not under memory pressure. Thanks again, Julia Evans!

Firefighting is a Team Sport

This week from VictorOps is nifty primer on structuring your team’s on-call and incident response. I love when a new concept catches my eye like this one:

While much has been said about the importance of keeping after-action analysis blameless, I think it is doubly important to keep escalations blameless. A lone wolf toiling away in solitude makes for a great comic book, but rarely leads to effective resolution of incidents in complex systems.

This article is published by my sponsor, VictorOps, but their sponsorship did not influence its inclusion in this issue.

How We Improved Our Server Performance

Open source IoT platform ThingsBoard’s authors share a detailed account of how they diagnosed and fixed reliability and throughput issues in their software so that it could handle 30k incoming events per second.

Building a Scalable Online Game with Azure – Part 1

There’s both theory and practice in this article, which opens with an architecture discussion and then continues into the steps to deploy a first verison in a testing Azure environment on your workstation.

Load Balancers: Simplifying High Availability

I don’t often link to new product announcements, but DigitalOcean’s new Load Balancer product caught my attention. It looks to be squarely aimed at improving on Amazon’s ELB product.

Cloud Spanner

Okay, apparently I do link to product announcements often. Google unveiled a new beta product this week for their Cloud Platform: Cloud Spanner. Based on their Spanner paper from 2012, they have some big claims.

Cloud Spanner is the first and only relational database service that is both strongly consistent and horizontally scalable. […] With automatic scaling, synchronous data replication, and node redundancy, Cloud Spanner delivers up to 99.999% (five 9s) of availability for your mission critical applications.

Outages

US National Weather Service
- The U.S. National Weather Service said on Tuesday it suffered its first-ever outage of its data system during Monday’s blizzard in New England, keeping the agency from sending out forecasts and warnings for more than two hours. [Reuters]
The Travis CI Blog: Postmortem for 2017-02-04 container-based Infrastructure issues
- A garden-variety bug in a newly-deployed version was exacerbated by a failed rollback, in a perfect example of a complex failure with a complex intersection of contributing factors.
Instapaper Outage Cause & Recovery
- Last week, I incorrectly stated that Instapaper’s database hit a performance cliff. In actuality, their RDS instance was, unbeknownst to them, running on an ext3 filesystem with its single-file limit of 2TB per file. Their only resolution path when they ran out of space was to mysqldump all their data and restore into a new DB running on ext4.
  
  Even if we had executed perfectly, from the moment we diagnosed the issue to the moment we had a fully rebuilt database, the total downtime would have been at least 10 hours.

SRE Weekly Issue #59

lex

February 12, 2017

General

Comments

View on sreweekly.com

Much like I did with telecoms, I’ve decided that it’s time to stop posting every MMO game outage that I see go by. They rarely share useful postmortems and they’re frequently the target of DDoS attacks. If I see an intriguing one go by though, I’ll be sure to include it.

Articles

Significant Threats to Patient Safety from Healthcare Provider Overload and Burnout Should Not Be Overlooked

Here’s a great article about burnout in the healthcare sector. There’s mention of second victims (see also Sydney Dekker) and a vicious circle: burnout leads to mistakes, which lead to adverse patient outcomes, which lead to guilt and frustration, which leads to burnout.

What Does Downtime Cost Your Business?

Every week, I find and ignore at least one bland article about the “huge cost of downtime”. They almost never have anything interesting or new to say. This article by PagerDuty takes a different approach that I find refreshing, starting off by defining “downtime” itself.

Honest status reporting and AWS service status “truth” in a post-truth world

A frustrated CEO speaks out against AWS’s infamously sanguine approach to posting on their status site.

Postmortem of database outage of January 31 | GitLab

As mentioned last week, here’s the final, published version of GitLab’s postmortem for their incident at the end of last month.

An ideal environment is one in which you can make mistakes but easily and quickly recover from them with minimal to no impact.

Jepsen: MongoDB 3.4.0-rc3

MongoDB contracted Jepsen to test their new replication protocol. Jepsen found some issues, which are fixed, and now MongoDB gets a clean bill of health. Pretty impressive! Even cooler is that the Mongo folks have integrated Jepsen’s tests into their CI.

Outages

Instapaper
- Instapaper hit a performance cliff with their database, and the only path forward was to dump all data and load it into a new, more powerful DB instance.
Google Cloud Status Dashboard
- Google released a postmortem for a network outage at the end of January.
OWASA (Orange County, FL, USA water authority)
- OWASA had to cut off the municipal water supply for 3 days after an accidental overfeed of fluoride into the drinking supply. They engaged in an impressive post-analysis and released a detailed root cause analysis document. It was a pretty interesting read, and I highly recommend clicking through to the PDF and reading it. There you’ll see that “human error” was a proximal but by no means root cause of the outage, especially since the human in question corrected their error after just 12 seconds.

SRE Weekly Issue #58

lex

February 5, 2017

General

Comments

View on sreweekly.com

Articles

GitLab.com Database Incident

I’m going to break my normal format and post this outage up here in the article section. Here’s why: GitLab was extremely open about this incident, their incident response process, and even the actual incident response itself.

Linked is their blog post about the incident, with an analysis from 24 hours after the incident that runs circles around the postmortems released by many other companies days after an outage. They also linked to their raw incident response notes (a Google Doc).

Here’s what really blows me away: they live streamed their incident response on youtube. They’re also working on their postmortem document publicly in a merge request and tracking remediations publicly in their issue tracker. Incredible.

Their openness is an inspiration to all of us. Here are a couple of snippets from the email I sent them earlier this week that is (understandably) still awaiting a response:

[…] I’m reaching out with a heartfelt thank you for your openness during and after the incident. Sharing your incident response notes and conclusions provides an unparalleled educational resource for engineers at all kinds of companies. Further, your openness encourages similar sharing at other companies. The benefit to the community is incalculable, and on behalf of my readers, I want to thank you!

[…] Incidents are difficult and painful, but it’s the way that a company conducts themselves during and after that leaves a lasting impression.

New zine: “Networking! ACK!”

Julia Evans is back this week with a brand new zine about networking. It’ll be posted publicly in a couple weeks, but until then, you can get your own shiny copy just by donating to the ACLU (who have been doing a ton of awesome work!). Great idea, Julia!

Google SRE Book is Now Free

You can now read the Google SRE book online for free! Pretty nifty. Thanks Google.

The Infrastructure Behind Twitter: Scale

An in-depth dive into how Twitter scales. I’m somewhat surprised that they only moved off third-party hosting as recently as 2010. Huge thanks to Twitter for being so open about their scaling challenges and solutions.

Understanding Unikernels

Here’s a good intro to unikernels, if you’re unfamiliar with them. The part that caught my attention is under the heading, “How Do You Debug the Result?”. I’m skeptical of the offered solution, “just log everything you need to debug any problem”. If that worked, I’d never need to pull out strace and lsof, yet I find myself using them fairly often.

Leeds Teaching pathology IT crash blamed on “human error”

This article reads a whole lot more like “process problems” than “human error”. Gotta love the flashy headline, though.

Are you being fooled when it comes to resilience?

Just what exactly does that “five nines” figure in that vendor’s marketing brochures mean, anyway?

Twitter Killed The Call Center

Your customers may well take to Twitter to tell you (and everyone else) about your outages. PagerDuty shares suggestions for how to handle it and potentially turn it to your advantage.

The Dark Standup

This operations team all agreed to work a strict 9-to-5 and avoid checking email or slack after hours. They shared their experience every day in a “dark standup” on Slack: a text-based report of whether each engineer is getting behind and what they’ll do with the extra hours they would normally have worked. They shared their conclusions in this article, and it’s an excellent read.

A very short primer on DIY, technical debt and DevOps alerting

Faced with limited financing and a high burn rate, many startups focus on product development and application coding at the expense of back of operations engineering.

And the result is operational technical debt.

No Easy Answers For Scheduling Physician On-Call Coverage

It’s really interesting to me that paying physicians extra for on-call shifts seems to be an industry standard. Of all my jobs, only one provided special compensation for on-call. It made the rather rough on-call work much more palatable. Does your company provide compensation or Time Off In Lieu (TOIL)? I’d love it if you’d write an article about the reasons behind the policy and the pros and cons!

On-Call Ways and Means: A Developer’s Guide

Bringing non-traditional Ops folks, including developers, on-call can be a tricky process. Initial reactions tend to be highly polarized, either total pushback and refusal, or a meek acceptance coupled with fear of the unknown. For the former, understanding the root of the refusal is useful. For the latter, providing clarity and training is important.

This article is published by my sponsor, VictorOps, but their sponsorship did not influence its inclusion in this issue.

Booted up in 1993, this server still runs — but not for much longer

Kind of neat, especially in this era of infrastructures built on (mostly) commodity hardware.

Don’t let serverless applications dodge performance monitoring

The Emperor has no clothes on, NoOps isn’t a thing, and you still have to monitor your serverless applications. Sorry about that.

Outages

Telstra
- A fire at an exchange resulted in an outage and somehow also caused SMSes to be mis-delivered to random recipients.
Heroku
- Full disclosure: Heroku is my employer.
Google App Engine
123-Reg

SRE Weekly Issue #57

lex

January 29, 2017

General

Comments

View on sreweekly.com

A short one this week as I recover from a truly heinous chest-cold. Thanks, 2017.

Articles

Implementing Semantic Monitoring

In this issue of Production Ready, Mathias shows how his team set up semantic monitoring. They continuously run integration tests and feed the results into their monitoring system, rather than running CI only when building new code.

[…] just because the services themselves report to be healthy doesn’t necessarily mean the integration points between them are fine too.

Gotham Perils: Clifford Chance Sues City Over Construction Outage

By “construction outage”, the headline means “a network outage due to a fiber cut that was caused by construction”. It will be interesting to see whether this suit is successful.

On-Call Handoffs: Empowering Adaptability in Incident Response

Recommendations for an on-call hand-off procedure. It’s geared toward using the VictorOps platform, but the main ideas apply more broadly. I like the idea of reviewing deploys as well as incidents and for running a monthly review of handoffs.

This article is published by my sponsor, VictorOps, but their sponsorship did not influence its inclusion in this issue.

Outages

Battlefield 1
PSN
Stack Exchange
- Stack Exchange had a 12-minute outage on January 24. Click through for their postmortem, published two days later.
United Airlines
DirecTV Now

SRE Weekly Issue #56

lex

January 22, 2017

General

Comments

View on sreweekly.com

Articles

On-Call Survey

If you have a minute (it’ll only take one!), would you please fill out this survey? Gabe Abinante (featured here previously) is gathering information about the on-call experience with an eye toward presenting it at Monitorama.

Operations For Developers – you wrote it, we all run it!

Wow, what a resource! As the URL says, this is “some ops for devs info”. Tons of links to useful background for developers that are starting to learn how to do operations. Thanks to the author for the link to SRE Weekly!

AWS Lambda Performance and Cold Starts

AWS Lambda response time can increase sharply if your function is accessed infrequently. I love the graphs in this post.

Four load testing mistakes developers love to make

A top-notch article on how to avoid common load-testing pitfalls. Great for SREs as well as developers!

Instrumentation: Worst case performance matters

A description of an investigation into poor performance in a service with a 100% < 5ms SLA.

Docker: InfraKit Under the Hood

Docker posted this article on how they designed InfraKit for high availability.

Should I block ICMP?

No!!

A blanket block of ICMP on your network device breaks some important features like ping, traceroute, MTU discovery, and the like. MTU discovery (Fragmentation Required) is especially important, and ignoring it can cause connections to appear to time out for no obvious reason.

Outages

← Older Posts

Newer Posts →

SRE Weekly Issue #60

Articles

Outages

SRE Weekly Issue #59

Articles

Outages

SRE Weekly Issue #58

Articles

Outages

SRE Weekly Issue #57

Articles

Outages

SRE Weekly Issue #56

Articles

Outages

Subscribe

RSS

Mastodon

Search Issues

SPONSOR MESSAGE

Articles

Outages

SPONSOR MESSAGE

Articles

Outages

SPONSOR MESSAGE

Articles

Outages

SPONSOR MESSAGE

Articles

Outages

SPONSOR MESSAGE

Articles

Outages

Subscribe

RSS

Mastodon

Search Issues