General

SRE Weekly Issue #65

lex

March 26, 2017

General

Comments

View on sreweekly.com

Articles

Monitoring Weekly Newsletter

Look, a new newsletter about monitoring! I’m really excited to see what they have to offer.

Last Week in AWS

And another new newsletter! Like Monitoring Weekly, I’m betting this one will have a lot of articles of interest to SREs.

Sandstorm or Significant: The evolving role of context in incident management

VictorOps held a webinar last Thursday to present and discuss the concept of context in incident management. Just paging in a responder isn’t enough: we need to get them up to speed on the incident as soon as possible. Ideally, the page itself would include snapshots of relevant graphs, links to playbooks, etc. But if we’re not careful and add too much information, the responder is overloaded by a “sandstorm” of irrelevant data. “time to learn” — post incident learning careful of info overload in presenting context with pages

This webinar was created by my sponsor, VictorOps, but their sponsorship did not influence its inclusion in this issue.

Embracing Risk

Here’s the next in Stephen Thorne’s series of commentary on chapters of the SRE book. I like that Google makes an effort not to be too reliable for fear of setting expectations too high, and they’re also realistic in their availability goals: no end-user will notice a 20-second outage.

Infinity Is a Bad Timeout

Writing an API, system, server or really anything people might make use of? Don’t make the default timeout be infinite.

5 Incident Management Tools You Need During a Firefight

PagerDuty really has been churning out excellent articles in the past couple of weeks. [Spoiler Alert] The five things are: internal communication, monitoring, a public status site, a support ticket system, and a defined incident response procedure.

Avoiding Incident Response Bottlenecks

Keep on rockin’ it, PagerDuty. This time they identify common problems that hinder incident response and give suggestions on how to fix them.

SREcon17: Brave new world of site reliability engineering

The author reviews their experience at SRECon17 Americas, including interesting bits on Julia Evans, training, recruiting, and diversity.

Human Error? No Problem

I love that the ideas we’re talking about regarding human error apply even to commercial cannabis growing.

Sadly, little is known about the nature of these errors, mainly because our quest for the truth ends where it should begin, once we know it was a human error or is “someone’s fault.”

Sometimes Boring is Better — Production Ready

The newer and shinier the technology, the bigger the production risk.

In other words, software that has been around for a decade is well understood and has fewer unknowns.

Outages

Kings College London storage system outage and data loss
- Kings College London’s HP storage system suffered a routine failure that, due to a firmware bug, resulted in loss of the entire array. Linked is an incredibly detailed PDF including multiple contributing factors and many remediations. Example: primary backups were to another folder on the same storage system, and secondary tape backups were purposefully incomplete.
Ryanair
- This one’s interesting to me because it seems to have been self-inflicted due to a flash sale.
Apple Store
- Another (possibly) self-inflicted outage due to a sale.
Microsoft Azure
Discord Status – Connectivity Issues
- Finally, my search alert for “thundering herd” paid off! I hadn’t heard of Discord before now, but they sure do write a great postmortem. Did you know that the thundering herd is a sports team?

SRE Weekly Issue #64

lex

March 19, 2017

General

Comments

View on sreweekly.com

Articles

SREcon17 Day 1 Highlights

I wasn’t able to make it to SRECon17 Americas this year, but it sounds like it was a great time. (day two summary)

So you want to be a wizard

My heroine, Julia Evans, gave the plenary session at SRECon17 Americas, all about how to learn how to be an excellent engineer (or really anything!). She proved herself once again not just as an excellent student, but also an inspiring teacher. The best part is that she posted the abstract, slides, and a transcript of her talk shortly after giving it! This is a really excellent resource for folks like me that weren’t there, and I hope more talk-givers will follow her example.

Counterfactual Thinking, Rules, and The Knight Capital Accident

This article is long, but I wish I’d carved out time for it long ago, because it’s really incredible and well worth the read. John Allspaw uses the SEC analysis of the Knight Capital incident as a starting point to introduce and discuss the problems with Counterfactual Thinking (“if the engineer had just done ___, this wouldn’t have happened”).

You Can’t Have a Rollback Button

Rolling back a flawed code release can have significant risk. It doesn’t always fix the problem because the erroneous code may have had effects on other parts of the system. Sometimes, as in the Knight Capital incident, a rollback exacerbates the problem.

The Production Environment at Google

This is part two of an annotation of the google SRE book by Stephen Thorne, a Google SRE. Part Three is available too.

Measuring Technical Debt With Incident Management Data

Here’s an interesting idea: using metadata about incidents as a proxy for measuring technical debt. PagerDuty goes over the definition of technical debt before diving into measuring it.

How is team-member-1 doing?

GitLab posted an update on “team-member-1”, the engineer that entered the commands that caused their production DB to be erased. I love that they posted this, because I for one was worried about “team-member-1” as a second victim.

U mad bro? Disaster planning for on-call – VictorOps

During an incident, emotions can run strong. How can we set up incident response so as to provide the best environment for our responders?

This article is published by my sponsor, VictorOps, but their sponsorship did not influence its inclusion in this issue.

Outages

AWS Route 53
- Route 53 had a control plane outage, though actual query responses were unaffected.
Square
- Square suffered a 2-hour outage, and if this postmortem is any indication, they learned a lot from it. This bit is interesting in light of the article above about rollbacks:
  
  We rolled back all software changes that happened leading up to the incident. This is a non-negotiable response to any customer-impacting event; our engineers are trained to undo any change that happened before an incident regardless of how plausible it is that the change caused the issue.
StatusPage.io
- This happened during Square’s outage and impacted their ability to communicate.
CBS
- CBS’s site was down, so people couldn’t fill out their fantasy sportsball brackets 1 hour before the game started.

SRE Weekly Issue #63

lex

March 12, 2017

General

Comments

View on sreweekly.com

Articles

ferd.ca -> Queues Don’t Fix Overload

I love the analogy: you can’t work around a slow drain with a bigger sink.

Tenets of SRE

Stephen Thorne, a Google SRE, annotates the first chapter of the Google SRE book with his personal opinions and interpretations.

Avoid An Oscar Epic Fail: 4 Practices For High Reliability

The author of this short article starts with the blooper during the Oscars and beautifully segues into a description of techniques organizations can use to halt the propagation of errors.

Sandstorm or Significant: The evolving role of context in Incident Management

This webinar looks really interesting, and I’m going to try to see it. It’s about the importance of providing context to incident responders, how much to provide, and how to provide it.

This article is published by my sponsor, VictorOps, but their sponsorship did not influence its inclusion in this issue.

Outages

AT&T 911 service
- AT&T customers were unable to make emergency calls across the US. The Federal Communications Commission (FCC) is investigating.
Bitly (link-shortening service)
- The cause here is interesting: Comcast’s automated system decided Bitly was a phishing site.
Post-mortem: Outages on 1/19/17 and 1/23/17 – Skyliner
- I really like their methodical hunt for the offending memory leak.
HSBC (Bank)
Google accidentally resets OnHub and Google Wifi routers with server error
- The routers occasionally ping Google servers for authorization, and on February 23rd the server was sending back an error message. Through some esoteric fallback mechanism in the routers, this caused them to reset to factory settings. So, a problem on Google’s servers can reset your router. Oops.
Incident 1059 | Heroku Status
- Heroku posted a followup regarding their outage on February 28th stemming from the Amazon S3 outage.
  Full disclosure: Heroku is my employer and I was involved in writing this followup.

SRE Weekly Issue #62

lex

March 5, 2017

General

Comments

View on sreweekly.com

S3 fails, and suddenly it’s SRE “go-time” at companies everywhere! I don’t know about you, but I sure am exhausted.

Articles

Think of Latency as a Pseudo-permanent Network Partition

When you do as the title suggests, you realize that network partitions go from the realm of theoretical to everyday.

Ask 5 Whys to get to the root of any problem

Asana shares their “Five Whys” process, which they use not only for outages but even for missed deadlines. This caught my eye:

Our team confidently focuses on problem mitigation while fighting a fire, knowing that there will be time for post-mortem and long-term fixes later.

Organizing Software Deployments to Match Failure Conditions

Using Route 53 as a case study, AWS engineers explain how they carefully designed their deploy process to reduce impact from failed deploys.

One method to reduce potential impact is to shape your deployment strategies around the failure conditions of your service. Thus, when a deployment fails, the service owner has more control over the blast radius as well as the scope of the impact.

Moving persistent data out of Redis

GitHub used a data-driven approach when migrating a storage load from Redis to MySQL. It’s a good thing they did, because a straight one-for-one translation would have overloaded MySQL.

Actionable Alerts

We’ve heard before that it’s important to make sure that your alerts are actionable. I like that this article goes into some detail on why we sometimes tend to create inactionable alerts before explaining how to improve your alerting.

This article is published by my sponsor, VictorOps, but their sponsorship did not influence its inclusion in this issue.

How to stop Ubuntu Xenial from randomly killing your big processes

Ubuntu backported a security fix into Xenial’s kernel last month, and unfortunately, they introduced a regression. Under certain circumstances, the kernel will give up way too easily when attempting to find memory to satisfy an allocation and will needlessly trigger the OOM killer. A fix was released on February 20th.

Beating the CAP Theorem Checklist

Need to tell someone their ~~perpetual motion machine~~ CAP-satisfying system won’t work? Low on time? Use this handy checklist to explain why their idea won’t work.

Why we are not leaving the cloud

GitLab seriously considered fleeing the cloud for a datacenter, and they asked the community for feedback. That feedback was very useful and was enough to change their minds. The common theme: “you are not an infrastructure company, so why try to be one?”

Instrumenting High Volume Services: Part 2

If you’ve got a firehose of events going into your metrics/log aggregation system, you may need to reduce load on it by only sending in a portion of your events. Do you pick one out of every N? HoneyComb’s makers suggest an interesting alternative: tag each sampled event you send as representing N events from the source — and N is allowed to very between samples.

Outages

Amazon S3
- Amazon S3 in the us-east-1 region went down, taking many sites and services down with it, including Trello, Heroku, portions of Slack and GitHub, and tons more. Amazon’s status page had a note at the top but was otherwise green across the board for hours. Meanwhile nearly 100% of S3 requests failed and many other AWS services burned as well.Their outage summary (linked above) indicated that the outage uncovered a dependency of their status site on S3. Oops. Once they got that fixed a few hours later, they posted something I’ve never seen before: actual red icons.Full disclosure: Heroku is my employer.
Joyent: Postmortem for July 27 outage of the Manta service
- Here’s a deeply technical post-analysis of a Postgresql outage that Joyent experienced in 2015. A normally benign automatic maintenance (an auto-vacuum) turned into total DB lockup due to their workload.
PagerDuty
GoDaddy
- DDoS attack on their nameservers.

SRE Weekly Issue #61

lex

February 26, 2017

General

Comments

View on sreweekly.com

A fairly large Outages section this week as I experiment with including post-analyses there even for older incidents.

Articles

7 Steps to Avoiding Downtime

Every week, there’s an article with a title like this (just like with “costs of downtime”). Almost every week, they’re total crap, but this one from PagerDuty is a bit better than the rest. The bit that interests me is the assertion that a microservice-based architecture “makes maintenance much easier” and “makes your app more resilient”. Sure it can, but it can also just mean that you trade one problem for 1300 problems.

Optimizing Your Alert Management Process

Coping with that complexity requires a different approach to monitoring and alert management. You need to do much more than treat incident management as a process of responding to alerts in the order they come in or assuming that every alert requires action.

This post explains why a flexible, nuanced approach to alert management is vital, and how to implement it.

Scaling @ HelloFresh: API Gateway

HelloFresh live-migrated their infrastructure to an API gateway to facilitate a transition to microservices. They kindly wrote up their experience, which is especially educational because their first roll-out attempt didn’t go as planned.

[…] our first attempt at going live was pretty much a disaster. Even though we had a quite nice plan in place we were definitely not ready to go live at that point.

The Pros and Cons of Eating Your Own Dog Food — Production Ready

In this issue, Mathias shows us the benefits of “dogfooding” and cases where it can break down. I like the way the feedback loop is shortened, so that developers feel a painful user experience and have incentive to quickly fix it. It reminds me a lot of the feedback loop you get when developers go on call for the services they write.

Making Sense of the Application Monitoring Landscape

A breakdown of four categories of monitoring tools using the “2×2” framework. I like the mapping of “personas” (engineering roles) to the monitoring typesa they tend to find most useful.

Outages

Cloudflare: “Cloudbleed”
- Cloudflare experienced a minor outage due to mitigating a major leak of private information. They posted this (incredibly!) detailed analysis of the bug and their response to it. Other vendors such as PagerDuty, Monzo, TechDirt, and MaxMind posted responses to the outage. There’s also this handy list of sites using cloudflare.
mailgun
- Here’s a really interesting postmortem for a Mailgun outage I linked to in January. What apparently started off as a relatively minor outage was significantly exacerbated “due to human error”. The intriguing bit: the “corrective actions” section makes no mention at all of process improvements to make the system more resilient to this kind of error.
All Circuits are Busy Now: The 1990 AT&T Long Distance Network Collapse
- In 1990, the entire AT&T phone network experienced a catastrophic failure, and 50% of all calls failed. The analysis is pretty interesting and shows us that a simple bug can break even an incredibly resilient distributed system.
  
  the Jan. 1990 incident showed the possibility for all of the modules to go “crazy” at once, how bugs in self-healing software can bring down healthy systems, and the difficulty of detecting obscure load- and time-dependent defects in software.
vzaar
- They usually fork a release branch off of master, test it, and push that out to production. This time, they accidentally pushed master to production. How do I know that? Because they published this excellent post-analysis of the incident just two days after it happened.
U.S. Dept. of Homeland Security
- This article has some vague mention of an expired certificate.
YouTube
CD Baby
Facebook

SRE Weekly Issue #65

Articles

Outages

SRE Weekly Issue #64

Articles

Outages

SRE Weekly Issue #63

Articles

Outages

SRE Weekly Issue #62

Articles

Outages

SRE Weekly Issue #61

Articles

Outages

Subscribe

RSS

Mastodon

Search Issues

General

SPONSOR MESSAGE

Articles

Outages

SPONSOR MESSAGE

Articles

Outages

SPONSOR MESSAGE

Articles

Outages

SPONSOR MESSAGE

Articles

Outages

SPONSOR MESSAGE

Articles

Outages

Subscribe

RSS

Mastodon

Search Issues