General

SRE Weekly Issue #242

A message from our sponsor, StackHawk:

StackHawk just raised a $10M series A. Read the blog by CEO Joni Klippert about what we’ve built and where are going in our mission to bring application security to developers.
http://sthwk.com/series-a

Articles

The work of SREs and the material we produce can be an excellent source of information to onboard new employees (not just SREs!).

Author Emily Arnot — Blameless

Having safeguards in your tools to prevent errors, is wise. Allowing the user to disable those safeguards when the need arises is even wiser.

Rachel by the bay

Lots of factors contributed to the crash and destruction of this $175 million USD aircraft. The pilot escaped with minor injuries.

Colonel Bryan T. Callahan et al. — USAF

Serverless isn’t going to make ops go away. NoOps is a myth.

Charity Majors — Honeycomb

In this blog post, we’ll present reliability-centric metrics and key performance indicators (KPIs) that show the positive impact that reliability has on businesses.

Andre Newman — Gremlin

“Outage of a CRL server” isn’t the first thing that would come to mind when diagnosing a database connection failure.

Oren Eini — RavenDB

Telltale combines anomaly detection, alerting, dashboarding, and incident management.

Andrei Ushakov, Seth Katz, Janak Ramachandran, Jeff Butsch, Peter Lau, Ram Vaithilingam, and Greg Burrell — Netflix

What?! I had no idea this was possible! You can transfer file descriptors (and the open files they point to) to another process, even outside of the normal parent/child process relationship.

Cindy Sridharan

Outages

  • GeoComply
    • GeoComply, a geo-location service used by most online gaming sites in the US to monitor the physical location of their customers, experienced a major outage.

  • Coinbase
  • Twitter

SRE Weekly Issue #241

A message from our sponsor, StackHawk:

Want a quick glimpse of how StackHawk works? Check out this 11 minute demo from SnykCon last week and learn about modern application security testing for DevOps teams.
http://sthwk.com/snykcon-demo

Articles

A quick note on last week’s issue: Google posted an updated version of their Google Chat incident summary with the “confidential” language removed. They also updated the content at the original link.

T-Mobile, one of the main mobile phone carriers in the US, had a major outage earlier this year. This report is essentially a retrospective performed by the US FCC (Federal Communications Commission). The report details the satisfyingly complex interplay of contributing factors in the incident.

US Federal Communications Commission

How can you be sure your failover plan will actually work? Hint: it’s almost certainly not going to work properly the first time you try it.

Adrian Cockcroft

In this blog post, we’ll look at the business value of SRE through customer focus, observability, and efficiency.

Emily Arnott — Blameless

Netflix has some interesting ideas around sampling, performance, and storage for their tracing system.

Maulik Pandey — Netflix

Oh, I do0 love reading stories of systems failing in interesting ways. This first installment contains five of the 10.

Yoz Grahame — LaunchDarkly

Black Friday is coming. Here are some ideas on how to deal with the rush — and how to analyze how you dealt with it when it’s over.

Nelly Wilson — Google

Two of my favorite authors/speakers have conspired to create a book on one of my favorite topics. Take my money! Oh wait, they’re giving it away, too?!

Nora Jones and Casey Rosenthal

Outages

SRE Weekly Issue #240

A message from our sponsor, StackHawk:

Be sure to register for SnykCon to learn about the latest DevSecOps trends. And while you are there, check out the StackHawk booth for our Nintendo Switch giveaway.
http://bit.ly/SnykConStackHawk

Articles

This interesting post-incident analysis is marked as “Google Customer Confidential – Not for publication or distribution”, but Google linked it directly from their public status page. I normally would not include a seemingly “leaked” incident report like this, but in this case I think the “confidential” label is erroneous.

Google

I keep re-learning and re-forgetting about TCP_NODELAY.

Rachel By the Bay

The distinction between the two is a lot more nuanced than it may seem. What are we really trying to say wit those words?

Michael Nygard

This incident from the week before last involved a Let’s Encrypt API rate limit.

Don’t you hate when you’re minding your own business upgrading your OS, and you run smack into a kernel bug in the ext4fs code?

…ext4 performance on kernel versions above 4.5 and below 5.6 suffers severely in the presence of concurrent sequential I/O on rotating disks.

Ryan Underwood — LinkedIn

Google discusses DDoS attacks and how they deal with them, including a 2.5Tbps attack in 2017.

Damian Menscher — Google

I love these first-hand incident stories. This one is from an engineer at Heroku who was a contributing factor in an incident last month.

Damien Mathieu — Heroku (Salesforce)

Outages

SRE Weekly Issue #239

A message from our sponsor, StackHawk:

CI/CD has changed software engineering. Application security, however, has been left behind. Why doesn’t your CI pipeline have AppSec checks?
https://www.stackhawk.com/blog/ci-pipeline-security-bug-testing?utm_source=SREWeekly

Articles

Don’t scale up farther than you need to! If you won’t ever see more than 100 RPS, don’t architect for 100,000.

Ayende Rahien

This one covers several common patterns of SRE practice and then offers insight on what to look for as you design your own SRE team.

Rob Cummings — Slalom Build

Abstractions make us more productive, and, indeed, we humans can’t build complex systems without them. But we need to be able to peel away the abstraction layers when things go wrong, so we can discover the implicit precondition that’s been violated.

Lorin Hochstein

Coordination between nodes in a distributed system can kill performance. What kinds of problems require coordination? The CALM theorem can tell us.

Joseph M. Hellerstein and Peter Alvaro — Communications of the ACM

Here’s another good post-incident analysis document template that you can use as inspiration for your own.

Hannah Culver — Blameless

As your product ages, it transitions from “cool new thing” to “tool everyone uses and expects to Just Work”. Your reliability needs will change accordingly.

Lyon Wong — Blameless

Outages

SRE Weekly Issue #238

My daughters asked earlier today what I do at work, and I explained all about SRE, reliability, and the importance of work-life balance.  They said to tell you they say hi!

A message from our sponsor, StackHawk:

CI/CD has changed software engineering. Application security, however, has been left behind. Why doesn’t your CI pipeline have AppSec checks?
https://www.stackhawk.com/blog/ci-pipeline-security-bug-testing?utm_source=SREWeekly

Articles

Lots of really great advice in here. And really, with a title like that, I couldn’t resist reading it!

Charity Majors

Last week, I mentioned a Google Cloud Platform outage that affected multiple services. Here’s the detailed post-analysis by Google.

Google

This one is along the lines of the classic Ironies of Automation paper by Bainbridge. In this blog post, we’ll look at how SRE can improve NOC functions such as system monitoring, triage and escalation, incident response procedure, and ticketing.How can automation be a team player, and what happens when it isn’t?

Nadine Sarter and David Woods (original paper)

Thai Wood — Resilience Roundup (summary)

In this blog post, we’ll look at how SRE can improve NOC functions such as system monitoring, triage and escalation, incident response procedure, and ticketing.How can you use chaos engineering when failures in the system can be critical and even life-threatening?

Carl Chesser — Infoq

In this blog post, we’ll look at how SRE can improve NOC functions such as system monitoring, triage and escalation, incident response procedure, and ticketing.

Emily Arnot — Blameless

This article suggests using chaos engineering to tell if your microservice-based architecture is secretly a monolith in disguise.

Andre Newman — Gremlin

Outages

A production of Tinker Tinker Tinker, LLC Frontier Theme