General

SRE Weekly Issue #13

SRECon16 registration is open, and I’m excited to say that my colleague Courtney Eckhardt and I will be giving a talk together! If you come to the conference, I’d love it if you’d say hi.

Articles

A deep-dive on EVCache, Netflix’s open source sharding and replication layer on top of memcached.

EVCache is one of the critical components of Netflix’s distributed architecture, providing globally replicated data at RAM speed so any member can be served from anywhere.
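The core pattern is simpler than it sounds: write every value to a memcached replica in each zone, then read from the nearest one. Here’s a minimal conceptual sketch of that pattern, not Netflix’s actual EVCache client (which adds async replication, zone fallback, cache warming, and much more); it assumes the third-party pymemcache library, and the replica hostnames are made up.

```python
# Conceptual sketch only -- not Netflix's EVCache client. Assumes the
# third-party pymemcache library; the replica hostnames are made up.
from pymemcache.client.base import Client

# One memcached endpoint per replica "zone" (hypothetical hosts).
REPLICAS = [
    Client(("cache-zone-a.example.com", 11211)),
    Client(("cache-zone-b.example.com", 11211)),
    Client(("cache-zone-c.example.com", 11211)),
]

def replicated_set(key, value, ttl=300):
    """Write the value to every zone's replica so any zone can serve reads."""
    for client in REPLICAS:
        client.set(key, value, expire=ttl)

def nearest_get(key, preferred=0):
    """Read from the local zone first, falling back to the others on a miss."""
    order = REPLICAS[preferred:] + REPLICAS[:preferred]
    for client in order:
        value = client.get(key)
        if value is not None:
            return value
    return None
```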

This is a guest post from one of our customers, Aaron, Director of Support Systems at CageData. He’s talking about making alerts actionable and why that’s important.

TechCrunch gives us this overview of the field of SRE, including its origins, motivations, and guesses about its future.

Everyone’s favorite OpenSSL vulnerability of the year. I hope you all had a relatively easy patch day.

A short but sweet analysis of an intermittent bug caused by inconsistent date formatting. The author uses the term “blameful postmortem” to mean finding reasons that explain how the client application was written with faulty date parsing logic (tl;dr: the server side truncated trailing zeroes in the fractional seconds). Really, I think this is less about blame than it is about understanding the full context in which an error was able to occur, and that’s exactly what a blameless postmortem is all about.
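To make the failure mode concrete, here’s a tiny sketch (not the actual client code) of how a parser that insists on exactly three fractional-second digits behaves once the server starts truncating trailing zeroes:

```python
# Sketch of the failure mode described above, not the actual client code.
# A parser that requires exactly three fractional-second digits accepts
# "...56.500Z" but rejects "...56.5Z" once trailing zeroes are truncated.
import re
from datetime import datetime

STRICT = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\.(\d{3})Z$")

def parse_strict(ts):
    m = STRICT.match(ts)
    if m is None:
        raise ValueError(f"unparseable timestamp: {ts!r}")
    base = datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S")
    return base.replace(microsecond=int(m.group(2)) * 1000)

print(parse_strict("2016-03-01T12:34:56.500Z"))  # parses fine

try:
    print(parse_strict("2016-03-01T12:34:56.5Z"))
except ValueError as err:
    # Only fails when the fractional seconds happen to end in zero:
    # an intermittent bug.
    print("intermittent failure:", err)
```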

Incidents can uncover technical debt in a system. Fixing the technical debt is often necessary if a repeat incident is to be avoided, but it can be difficult to track and allocate resources to make it happen. This article from PagerDuty suggests a method for tracking technical debt uncovered by incidents.

When multiple incidents occur simultaneously, things can get hairy and you need to have an organized incident response structure. This article is about firefighting, but we can take their lessons and apply them to SRE.

PagerDuty advocates for a model I’ve heard referred to as “Total Service Ownership”, where dev teams handle incident response for their subsystems rather than “throwing them over the wall” for Ops to support. Courtney and I will be talking about this and more at SRECon16 next month.

Outages

  • Telstra
    • No free data day for this one.

  • Gopher
    • Metafilter revived their gopher server after 15 years of downtime.

  • Salesforce.com
    • Full disclosure: Salesforce.com (parent company of my employer, Heroku) is mentioned.

  • Uganda

SRE Weekly Issue #12

Articles

What an excellent resource! This repo contains a pile of postmortems for our reading and learning pleasure. I’m linking to the repo now, but I don’t promise not to call out specific awesome postmortems from it in the future.

When you’re in the trenches trying to get the service back up and running, it can be hard to find the time to tell everyone else in your company what’s going on. It’s critically important though, as Statuspage.io writes in this article.

Full disclosure: Heroku, my employer, is mentioned.

Digital Ocean shares this overview of the basic concepts involved in high availability.

This article discusses a method of computing the availability of an overall system made up of individual components with differing availabilities. It gives general formulas and methods that are fairly simple, yet powerful.
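For reference, the standard first-order formulas look like the sketch below: components in series multiply their availabilities, while redundant replicas multiply their unavailabilities. The article’s exact method may differ, and the numbers here are only illustrative.

```python
# The usual first-order formulas for composing component availabilities;
# the linked article's exact method may differ.
from math import prod

def series_availability(components):
    """All components must be up: multiply their availabilities."""
    return prod(components)

def parallel_availability(replicas):
    """Any replica can serve: one minus the chance that all are down."""
    return 1 - prod(1 - a for a in replicas)

# Illustrative example: a load balancer (99.99%) in front of two redundant
# app servers (99.9% each) and a single database (99.95%).
app_tier = parallel_availability([0.999, 0.999])           # ~0.999999
system = series_availability([0.9999, app_tier, 0.9995])   # ~0.9994
print(f"{system:.4%}")
```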

What do you do when you have to modify an existing production system that has less-than-wonderful code quality? This article is an impassioned plea to test the heck out of your changes and always try to release production-quality code the first time.

Google is launching a reverse-proxy for DDoS mitigation. Interestingly, it’s only for news and free speech sites and it’s completely free.

Outages

SRE Weekly Issue #11

Articles

The big scary news this week was the buffer overflow vulnerability in glibc’s getaddrinfo function. Aside from the obvious impact of intrusions on availability, this bug requires us to roll virtually every production service in our fleets. If we don’t have good procedures in place, now is when we find out with downtime.
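As a rough, hedged example of the kind of fleet audit this implies: after the glibc package is upgraded, processes that haven’t been restarted still map the old, now-deleted library, and /proc shows those mappings with a “(deleted)” suffix. A sketch of that check on a single Linux host:

```python
# A rough sketch (not a vetted audit tool): find processes that still map a
# replaced libc, i.e. processes that haven't been restarted since the upgrade.
import os

def stale_libc_pids():
    stale = []
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/maps") as maps:
                for line in maps:
                    if "(deleted)" in line and ("libc.so" in line or "libc-" in line):
                        stale.append(int(pid))
                        break
        except OSError:
            continue  # the process exited, or we lack permission to read it
    return stale

if __name__ == "__main__":
    print("processes still mapping a replaced libc:", stale_libc_pids())
```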

Bryan Cantrill, CTO of Joyent, really crystallized the gross feelings that have been rolling around in my mind with regard to unikernels. I would point colleagues to this article if they suggested we should deploy unikernel applications. He makes lots of good points, especially this one:

Unikernels are entirely undebuggable. There are no processes, so of course there is no ps, no htop, no strace — but there is also no netstat, no tcpdump, no ping!

I find the implicit denial of debugging production systems to be galling, and symptomatic of a deeper malaise among unikernel proponents: total lack of operational empathy.

Atlassian dissects their response to a recent outage and in the process shares a lot of excellent detail on their incident response and SRE process. I love that they’re using the Incident Commander system (though under a different name). This could have (and probably has) come out of my mouth:

The primary goal of the incident team is to restore service. This is not the same as fixing the problem completely – remember that this is a race against the clock and we want to focus first and foremost on restoring customer experience and mitigating the problem. A quick and dirty workaround is often good enough for now – the emphasis is on “now”!

My heart goes out to those passengers hurt and killed and to their families, but also to the controller who made the error. There’s a lot to investigate here about how a single human was in such a position that a single error could cause such devastation. Hopefully there are ways in which the system can be remediated to prevent such catastrophes in the future.

Like medicine, we can learn a lot about how to prevent and deal with errors from the hard lessons learned in aviation.

You’d think technically advanced aircraft would be safer with all that information and fancy displays. Why they’re not has a lot to do with how our brains work.

When I saw Telstra offer a day of free data to its customers to make up for last week’s outage, I cringed. I’m impressed that they survived last Sunday as Australia used 1.8 petabytes of data.

In this article, the author describes discovering that a service he previously ignored and assumed saw very little traffic actually served a million requests per day.

If ignorance isn’t an operational strategy, what is? Paranoia. You should code and run your systems like a large group of Internet lunatics are out to abuse the hell out of them.

This is a great intro to Chaos Engineering, a field I didn’t know existed, born out of Netflix’s Chaos Monkey. This is the first article in what the author promises will be a biweekly series.

Thanks to Devops Weekly for this one.

Outages

SRE Weekly Issue #10


This week’s issue is packed with really meaty articles, which is a nice departure from last week’s somewhat sparse issue.

Articles

So much about what modern medicine has learned about system failures applies directly to SRE, usually without any adaptation required. In this edition of The Morning Paper, Adrian Colyer gives us his take on an excellent short paper by an MD. My favorite quotes:

Hindsight bias remains the primary obstacle to accident investigation, especially when expert human performance is involved.

When new technologies are used to eliminate well understood system failures or to gain high precision performance they often introduce new pathways to large scale, catastrophic failures.

The Software Evolution & Architecture Lab of the University of Zurich is doing a study on modern continuous deployment practices. I’m really interested to see the results, especially where CD meets reliability, so if you have a moment, please hop on over and add your answers. Thanks to Gerald Schermann at UZH for reaching out to me for this.

I’ve been debating with myself whether or not to link to Emerson Network Power’s survey of datacenter outage costs and causes. The report itself is mostly just uninteresting numbers and it’s behind a signup-wall. However, this article is a good summary of the report and links in other interesting stats.

Facebook algorithmically generated hundreds of millions of custom-tailored video montages for its birthday celebration. How they did it without dedicating specific hardware to the task and without impacting production is a pretty interesting read.

Administering ElasticSearch can be just as complicated and demanding as MySQL. This article has an interesting description of SignalFX’s method for resharding without downtime.

This is a pretty interesting report that I’d never heard of before. It’s long (60 pages), but worth the read for a few choice tidbits. For example, I’ve seen this over and over in my career:

Yet, delayed migrations jeopardize business productivity and effectiveness, as companies experience poor system performance or postpone replacement of hardware past its shelf life.

Also, I was surprised that even now, over 70% of respondents said they still use “Tape Backup / Off-site Storage”. I wonder if people are lumping S3 into that.

Never miss an ack or you’ll be in even worse trouble.

More on last week’s outage. I have to figure “voltage regular” means power supply. Everyone hates simultaneous failure.

A full seven years after they started migration, Netflix announced this week that their streaming service is now entirely run out of AWS. That may seem like a long time until you realize that Netflix took a comprehensive approach to the migration:

Arguably, the easiest way to move to the cloud is to forklift all of the systems, unchanged, out of the data center and drop them in AWS. But in doing so, you end up moving all the problems and limitations of the data center along with it. Instead, we chose the cloud-native approach, rebuilding virtually all of our technology and fundamentally changing the way we operate the company.

Outages

  • Telstra
  • Visual Studio Online
    • Caused by a memory-hogging bug in MS SQL Server’s query planner.

  • TNReady
    • Tennessee (US state) saw an outage of the new online version of its school system’s standardized tests.

  • CBS Sports App
    • During the Super Bowl is a terrible time to fail, but of course that’s exactly when failure is most likely, due to the peak in demand.

  • TPG Submarine Fiber Optic Cable
    • This one has some really interesting discussion about how the fiber industry handles failures.

  • Apple Pay

SRE Weekly Issue #9

Articles

I spoke too soon in the last issue! GitHub has posted an extremely thorough postmortem that answers any questions one might have had about last week’s outage. I like the standard they’re holding themselves to for timely communication:

One of the biggest customer-facing effects of this delay was that status.github.com wasn’t set to status red until 00:32am UTC, eight minutes after the site became inaccessible. We consider this to be an unacceptably long delay, and will ensure faster communication to our users in the future.

Just monitoring servers isn’t enough to detect an outage. Sometimes even detailed service monitoring can miss an overall performance degradation that involves multiple services in an infrastructure. In this blog post, PagerDuty suggests also monitoring key business metrics (logins, purchase rate, etc).
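A minimal sketch of the idea: compare a key business metric against its recent baseline and page if it craters, even when every host-level check is green. The function name and threshold below are hypothetical stand-ins for whatever metrics store and alerting hook you actually use.

```python
# Hedged sketch: alert when a business metric (e.g. logins per minute) drops
# well below its recent baseline, even if all host-level checks look healthy.
from statistics import median

def business_metric_alert(recent, current, floor_ratio=0.5):
    """Return True (alert) if the current rate is below half the recent median."""
    baseline = median(recent)
    return baseline > 0 and current < floor_ratio * baseline

# Example: the last few samples averaged ~120 logins/min, now we see 31.
if business_metric_alert(recent=[118, 125, 130, 122, 119], current=31):
    print("ALERT: login rate well below baseline -- page the on-call")
```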

In this case, “yesterday” was in 2013, but this is an excellent postmortem from Mailgun that can serve as an example for all of us.

A customer’s perspective on a datacenter outage, with emphasis on the need for early, frequent, and thorough communication from service providers.

A nicely detailed outage postmortem, including the gory details of the train of thought the engineers followed on the way to a solution. They hint at an important technique that’s not discussed nearly enough, in my opinion: judicious application of bandaid solutions to resolve the outage and allow engineers to continue their interrupted personal time. It’s not necessary to fix a problem the “right” way in the moment, and carefully-applied bandaids help reduce on-call burnout.

How can we be sure (or at least sort of confident) that distributed systems won’t fail? They can be incredibly complex, and their failures can be even more complex. Caitie McCaffrey gives us this ACM Queue article about methods for formal and informal verification.

Efficiently testing distributed systems is not a solved problem, but by combining formal verification, model checking, fault injection, unit tests, canaries, and more, you can obtain higher confidence in system correctness.

Medium has announced a commitment to publishing postmortems for all outages. I’d love to see more companies making a commitment like this. Thanks to reader Pete Shima for this link.

Outages

A production of Tinker Tinker Tinker, LLC