General

SRE Weekly Issue #245

A message from our sponsor, StackHawk:

Check out how we have built our microservices in Kubernetes here at StackHawk.
https://sthwk.com/kube-services

Articles

A Certificate Transparency (CT) log failed, resulting in its permanent retirement. The incident involved unintended effects from load testing being performed in a staging environment. I have a huge amount of admiration and respect for the transparency of certification authorities (CAs) when things go wrong.

Trust Asia

I like the idea that adding the ability to fail over to your system makes it much more complicated and thus more likely to fail.

Andre Newman — Gremlin

This one introduces some interesting concepts: the error kernel and property testing.

Kenneth Cross — HelloSign

[…] to be resilient, we must test everything, which consumes time that we don’t spend innovating. A good trade-off is to test in production.

Xavier Grand — Algolia

More useful tips as you develop your post-incident analysis process. I like their definition of “blameless”.

Zachary Flower — Splunk

Exactly once delivery is hard to implement and requires explicit coordination at all levels, including the client. Ably explains how their flavor works.

Paddy Byers — Ably

The most effective (if scary) way to understand how your stateless service operates under load.

Utsav Shah — Software at Scale

Some good tips here — and a reminder that we may see even more traffic than normal due to social distancing.

Outages

SRE Weekly Issue #244

A message from our sponsor, StackHawk:

Are you attending KubeCon this week? Be sure to swing by StackHawk’s virtual booth to get a t-shirt and be entered to win a Nintendo Switch.

Articles

If you’re gonna operate on 6+ figures’ worth of computers all at once, having to type that number in is a way to make you pause and think about what you’re doing.

Rachel by the bay
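
As a rough illustration of the "type the number" safeguard, here is a minimal Python sketch. The function name `confirm_bulk_operation` and the `threshold` default are hypothetical and not taken from the linked post; the only idea borrowed is that a large operation proceeds only when the operator types the exact machine count.

```python
def confirm_bulk_operation(target_count: int, typed: str, threshold: int = 100_000) -> bool:
    """Return True if the operation may proceed.

    Hypothetical helper: below `threshold` machines no confirmation is
    required; at or above it, the operator must type the exact count.
    """
    if target_count < threshold:
        return True
    # Accept "250,000" as well as "250000", but nothing that is not a plain number.
    cleaned = typed.strip().replace(",", "")
    if not (cleaned.isascii() and cleaned.isdigit()):
        return False
    return int(cleaned) == target_count
```

A generic "yes" is rejected for large targets, so the operator has to read the number and reproduce it.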

Find out why they decided to focus less on nines, and what they did instead.

Robert Sullivan

Reminds me of the classic:

It’s not DNS
There’s no way it’s DNS
It was DNS

— (ssbroski on reddit)
Mike S.

Their front-end made duplicate calls to the new API to test load and response time prior to cutting over.

Michael P. Geraci — OkCupid

This is really cool. The researchers created a role-play scenario based on a real plane crash. They tried to get participants to blame “human error”, so that they could then surprise them with all of the (many) contributing factors that were involved.

Emily S. Patterson, Richard I. Cook, David D. Woods, Marta L. Render

Tips from one Sysadmin’s journey to becoming an SRE.

Josh Duffney — Octopus Deploy

Outages

SRE Weekly Issue #243

A message from our sponsor, StackHawk:

The shift to rapid, frequent deployments over the past decade initially left application security behind. Modern AppSec belongs in the CI/CD pipeline.
http://sthwk.com/app-sec-pipeline

Articles

Sometimes I come across a simple but mind-blowingly awesome new idea. This is one of those times.

During periods of high load and errors, Netflix’s edge load balancer sends feedback to the apps running on users’ devices, adjusting their retry and backoff strategy to keep the service running as smoothly as possible but avoid a thundering herd. Brilliant.

Manuel Correa, Arthur Gonigberg, and Daniel West — Netflix
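
To make the idea concrete, here is a minimal Python sketch of a client whose retry policy is steered by server feedback. It is not Netflix's implementation: the `ServerHint` fields (`max_retries`, `backoff_multiplier`) and the defaults are assumptions made for illustration, and the delay uses ordinary exponential backoff with full jitter.

```python
import random
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ServerHint:
    """Hypothetical feedback a server could attach to responses under load."""
    max_retries: int = 3              # how many retries the client may attempt
    backoff_multiplier: float = 1.0   # >1.0 asks clients to spread retries out more


def retry_delay(
    attempt: int,
    hint: ServerHint,
    base: float = 0.5,
    cap: float = 30.0,
    rng: Callable[[], float] = random.random,
) -> Optional[float]:
    """Seconds to wait before retry number `attempt` (0-based), or None to give up."""
    if attempt >= hint.max_retries:
        return None
    # Exponential growth, scaled by the server's hint and bounded by `cap`.
    ceiling = min(cap, base * (2 ** attempt) * hint.backoff_multiplier)
    # Full jitter: pick a random point in [0, ceiling) so clients do not retry in lockstep.
    return rng() * ceiling
```

Under heavy load the server could lower `max_retries` or raise `backoff_multiplier`, which reduces retry traffic and spreads it out without shipping a new client.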

I helped to invent new approaches to correlate telemetry signals (exemplars, correlation between tracing and logging, profiler labels) that helped our engineers to navigate latency problems faster.

Facebook has two very different users for live streaming: “normal” users and broadcasters streaming sporting events and the like.

Hemal Khatri, Alex Lambert, Jordi Cenzano and Rodrigo Broilo — Facebook

This article covers the outcomes of research performed in 2019 on how engineers at Google debug production issues, including the types of tools, high-level strategies, and low-level tasks that engineers use in varying combinations to debug effectively.

Charisma Chan and Beth Cooper — Google

The three patterns discussed in this paper are:

  • decompensation
  • working at cross purposes
  • getting stuck in outdated behaviors

David Woods and Matthieu Branlat

Outages

SRE Weekly Issue #242

A message from our sponsor, StackHawk:

StackHawk just raised a $10M series A. Read the blog by CEO Joni Klippert about what we’ve built and where we are going in our mission to bring application security to developers.
http://sthwk.com/series-a

Articles

The work of SREs and the material we produce can be an excellent source of information to onboard new employees (not just SREs!).

Emily Arnott — Blameless

Having safeguards in your tools to prevent errors is wise. Allowing the user to disable those safeguards when the need arises is even wiser.

Rachel by the bay

Lots of factors contributed to the crash and destruction of this US$175 million aircraft. The pilot escaped with minor injuries.

Colonel Bryan T. Callahan et al. — USAF

Serverless isn’t going to make ops go away. NoOps is a myth.

Charity Majors — Honeycomb

In this blog post, we’ll present reliability-centric metrics and key performance indicators (KPIs) that show the positive impact that reliability has on businesses.

Andre Newman — Gremlin

“Outage of a CRL server” isn’t the first thing that would come to mind when diagnosing a database connection failure.

Oren Eini — RavenDB

Telltale combines anomaly detection, alerting, dashboarding, and incident management.

Andrei Ushakov, Seth Katz, Janak Ramachandran, Jeff Butsch, Peter Lau, Ram Vaithilingam, and Greg Burrell — Netflix

What?! I had no idea this was possible! You can transfer file descriptors (and the open files they point to) to another process, even outside of the normal parent/child process relationship.

Cindy Sridharan
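
One common mechanism for this on Unix-like systems is sending the descriptor as `SCM_RIGHTS` ancillary data over a Unix domain socket. The sketch below assumes that mechanism (the blurb above does not say which one the article covers) and uses Python 3.9+'s `socket.send_fds` and `socket.recv_fds`. To stay self-contained it runs both ends in a single process over a `socketpair`; the same calls work between unrelated processes connected through a named Unix socket.

```python
import os
import socket


def pass_fd_demo() -> bytes:
    """Send a pipe's write end over a Unix socket and write through the received copy."""
    # Both ends live in this process for the demo; in practice they would be
    # held by two different processes.
    sender, receiver = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    read_end, write_end = os.pipe()
    try:
        # The descriptor travels as ancillary data next to a one-byte payload.
        socket.send_fds(sender, [b"x"], [write_end])
        _data, fds, _flags, _addr = socket.recv_fds(receiver, 1, 1)
        received_fd = fds[0]  # a new descriptor number for the same open pipe

        # Close the original; the received descriptor keeps the pipe open.
        os.close(write_end)
        os.write(received_fd, b"hello via passed fd")
        os.close(received_fd)
        return os.read(read_end, 1024)
    finally:
        os.close(read_end)
        sender.close()
        receiver.close()
```

The receiver gets its own descriptor for the same underlying open file, so the sender can close its copy once the transfer is done.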

Outages

  • GeoComply
    • GeoComply, a geo-location service used by most online gaming sites in the US to monitor the physical location of their customers, experienced a major outage.

  • Coinbase
  • Twitter

SRE Weekly Issue #241

A message from our sponsor, StackHawk:

Want a quick glimpse of how StackHawk works? Check out this 11 minute demo from SnykCon last week and learn about modern application security testing for DevOps teams.
http://sthwk.com/snykcon-demo

Articles

A quick note on last week’s issue: Google posted an updated version of their Google Chat incident summary with the “confidential” language removed. They also updated the content at the original link.

T-Mobile, one of the main mobile phone carriers in the US, had a major outage earlier this year. This report is essentially a retrospective performed by the US FCC (Federal Communications Commission). The report details the satisfyingly complex interplay of contributing factors in the incident.

US Federal Communications Commission

How can you be sure your failover plan will actually work? Hint: it’s almost certainly not going to work properly the first time you try it.

Adrian Cockcroft

In this blog post, we’ll look at the business value of SRE through customer focus, observability, and efficiency.

Emily Arnott — Blameless

Netflix has some interesting ideas around sampling, performance, and storage for their tracing system.

Maulik Pandey — Netflix

Oh, I do love reading stories of systems failing in interesting ways. This first installment contains five of the 10.

Yoz Grahame — LaunchDarkly

Black Friday is coming. Here are some ideas on how to deal with the rush — and how to analyze how you dealt with it when it’s over.

Nelly Wilson — Google

Two of my favorite authors/speakers have conspired to create a book on one of my favorite topics. Take my money! Oh wait, they’re giving it away, too?!

Nora Jones and Casey Rosenthal

Outages

A production of Tinker Tinker Tinker, LLC