General

SRE Weekly Issue #187

A message from our sponsor, VictorOps:

Machine learning (ML) isn’t just a buzzword anymore — it’s affecting how we communicate, shop, live and respond to critical DevOps incidents. Grab your spot for this free webinar to learn about driving success in incident management with machine learning:

http://try.victorops.com/sreweekly/machine-learning-webinar

Articles

I love it when companies publish their incident management documentation! Atlassian’s offering is high-quality — both in content and production value. The Major Incident Manager Cheatsheet at the end is worth distributing to your team.

Atlassian

Netflix shares more about their N+1 AWS region redundancy design, and it all revolves around accurately modeling demand.

Niosha Behnam — Netflix

Interactions between simple microservices can lead to unexpected emergent behaviors.

To restate: this system is not complicated. But it is complex.

Avdi Grimm

What we had in the two downed airplanes was a textbook failure of airmanship.

While I don’t necessarily agree with the blame-laden language of this article, it provides some interesting new details. It strikes me that, while MCAS may not be directly responsible for the crashes, it made it significantly harder to recover from contemporaneous pilot errors.

William Langewiesche — The New York Times

My favorite part is the role-playing scenarios of debugging a problem with observability tooling and traditional tools.

Charity Majors

Tuning your TCP stack is important on busy servers.

Ram Lakshmanan

Outages

SRE Weekly Issue #186

A message from our sponsor, VictorOps:

See why DevOps teams are more collaborative and transparent than traditional IT operations – helping them build highly efficient incident management and response systems:

http://try.victorops.com/sreweekly/devops-incident-management-guide

Articles

This article is highly technical without being overwhelmingly detailed.

It is very important that a database user is aware of the isolation level guaranteed by the database system, and what concurrency bugs may emerge as a result.

Daniel Abadi
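
The stakes here are easy to make concrete. Below is a minimal Python sketch of the classic lost-update hazard, assuming a PostgreSQL database reached through psycopg2; the accounts table, DSN, and withdraw helper are hypothetical illustrations, not anything taken from the article.

    # Minimal sketch: the same read-modify-write under two isolation levels.
    # The `accounts` table and DSN are hypothetical.
    import psycopg2

    DSN = "dbname=demo user=demo"  # hypothetical connection string

    def withdraw(amount, isolation_level):
        conn = psycopg2.connect(DSN)
        try:
            conn.set_session(isolation_level=isolation_level)
            with conn.cursor() as cur:
                # Under READ COMMITTED, two concurrent withdrawals can both
                # read the same balance and overwrite each other's update
                # (a lost update).
                cur.execute("SELECT balance FROM accounts WHERE id = 1")
                (balance,) = cur.fetchone()
                cur.execute(
                    "UPDATE accounts SET balance = %s WHERE id = 1",
                    (balance - amount,),
                )
            conn.commit()
        except psycopg2.errors.SerializationFailure:
            # Under SERIALIZABLE, PostgreSQL aborts one of the conflicting
            # transactions instead; the application must retry it.
            conn.rollback()
            raise
        finally:
            conn.close()

    # withdraw(100, "READ COMMITTED")  # lost updates possible under concurrency
    # withdraw(100, "SERIALIZABLE")    # conflicts surface as errors to retry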

The traps are:

  1. You don’t have enough cross-team usage or buy-in.
  2. Your difficult and dense process is slowing down incident response.
  3. Postmortems are underutilized and don’t encompass in-depth learnings.
  4. You wait for incidents to happen.
  5. You stop at incident management without SLOs.

Lyon Wong — Blameless

Need to argue the benefits of implementing distributed tracing in your organization? This article will help you get started.

dm03514

The question is: what is the proper role of alerting in the modern era of distributed systems?  Have alerting best practices changed with the shift from monitoring and known-unknowns to observability and unknown-unknowns?

Charity Majors

Round-robin load balancing often isn’t good enough; it’s necessary to intelligently route requests to nodes that aren’t overloaded. How do you get information about backend health to distributed load balancer nodes efficiently? A: add a response header.

Haowei Yuan, Yi-Shu Tai, and Dmitry Kopytkov — Dropbox
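
The mechanics of the header trick are easy to picture. Here is a minimal, hypothetical illustration in Python, not Dropbox's implementation: the header name, the load score, and the weighted pick are all assumptions made for the example.

    # Each backend piggybacks its current load on every response; the load
    # balancer remembers the freshest report and prefers less-loaded
    # backends on the next request.
    import random

    class Backend:
        def __init__(self, name):
            self.name = name
            self.reported_load = 0.0  # last value seen in the response header

    def pick_backend(backends):
        # Weighted choice: lower reported load => higher chance of selection.
        weights = [1.0 / (1.0 + b.reported_load) for b in backends]
        return random.choices(backends, weights=weights, k=1)[0]

    def record_response(backend, response_headers):
        # No separate health-polling system is needed; load information
        # rides along with traffic that is flowing anyway.
        load = response_headers.get("X-Backend-Load")
        if load is not None:
            backend.reported_load = float(load)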

By adding in-memory caching with a mere 3-second TTL, these folks achieved a ~75% cache hit rate, allowing them to withstand request spikes without an outage.

Mina Gyimah — Pusher
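
To see why such a short TTL goes so far, here is a minimal Python sketch of the technique; the class and fetch callback are hypothetical, and only the 3-second TTL comes from the article.

    # A tiny in-memory cache with a 3-second TTL in front of an expensive fetch.
    import time

    class TTLCache:
        def __init__(self, ttl_seconds=3.0):
            self.ttl = ttl_seconds
            self._store = {}  # key -> (expiry timestamp, value)

        def get(self, key, fetch):
            now = time.monotonic()
            entry = self._store.get(key)
            if entry is not None and entry[0] > now:
                return entry[1]            # cache hit: the backend is spared
            value = fetch(key)             # cache miss: one call to the backend
            self._store[key] = (now + self.ttl, value)
            return value

Even a 3-second TTL absorbs most of a spike: if a thousand requests per second ask for the same hot key, only about one of them every three seconds reaches the backend.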

Outages

SRE Weekly Issue #185

A message from our sponsor, VictorOps:

Machine learning is already being used in many DevOps processes – driving highly efficient workflows across the entire software delivery lifecycle. See how machine learning is currently being used to improve incident management and response in production environments:

http://try.victorops.com/sreweekly/machine-learning-in-incident-management

Articles

This is a tough read, but really enlightening.

Thanks to Courtney Eckhardt for this one.

William Langewiesche — The Atlantic

Read this to find out why it’s so hard to nail down SLOs for cloud services.

Adrian Colyer — The Morning Paper (summary)

Mogul & Wilkes (original paper)

BGP: the horrifying, ugly monster lurking at the base of the Internet.

Stilgherrian — ZDNet

A different kind of monster.

Will Oremus — Slate

When you’re thinking about best practices for improving phases of the production incident cycle, there are three SRE principles that particularly matter for this task. Keep these in mind as you think about reliability.

Myk Taylor — Google

It’s important that we remember that there’s more to incident response than the technical aspect.

George Miranda — PagerDuty

Learn from this Second Officer’s account of a maritime near-miss and the five lessons they took away from it. My favorite:

As professionals, we always have more than one goal.

Nippin Anand — Safety Differently

Outages

SRE Weekly Issue #184

A message from our sponsor, VictorOps:

Do you dream of reducing MTTA from four hours to two minutes? Learn how you can improve incident detection, alerting, real-time incident collaboration and cross-functional transparency to make on-call suck less and build more reliable services:

http://try.victorops.com/sreweekly/improved-incident-response

Articles

This article relates to Donella H. Meadows’s book, Thinking in Systems.

What follows is Meadows’ list of leverage points, outfitted with my ideas of where or how they can be applied to software development and web operations.

Ryan Frantz

D:

I know it’s past an hour but… we got ~600 Nagios emails a day. Boss forbade us from muting any of them. In weekly status meeting, he’d often quiz on-call on a random alert. If on-call didn’t know about it, boss would often scream at us…

Jason Antman (@j_antman)

Find out how the Couchbase folks use Jepsen to test their database offering.

Korrigan Clark

A supportive on-call environment is critical to ensuring reliability and resiliency.

Deirdre Mahon — Honeycomb

This is a follow-on to an article I linked to a while back.

It’s really simpler to call it Tech Risk.

I love the idea of tracking the decisions an organization makes and the risks they entail.

Sarah Baker

Outages

SRE Weekly Issue #183

A message from our sponsor, VictorOps:

Incident management and response don’t need to suck. See how you can build a collaborative incident management plan with shared transparency into developer availability and on-call schedules for IT operations:

http://try.victorops.com/sreweekly/incident-management-plan

Articles

Another issue of Increment, on a topic integral to SRE: testing.

It doesn’t matter if you’ve already read everything Charity Majors has written; in this article she’s still managed to find new and even more compelling ways to argue that we should embrace testing in production.

My two other favorite articles from this issue:

Charity Majors — Honeycomb

That’s exactly what we hoped for.

They rewrote this critical service and carefully deployed it to avoid user impact, using a technique I love: run the new code alongside the old for a while to verify that it returns the same result.

Jeremy Gayed, Said Ketchman, Oleksii Khliupin, Ronny Wang and Sergey Zheznyakovskiy — New York Times
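
As a rough illustration of that dual-run idea (a sketch under assumptions, not the Times’ actual harness), the comparison layer might look something like this:

    # Serve the old implementation's result, run the new one alongside it,
    # and log any divergence so the rewrite can be verified against real
    # traffic before it takes over. Names here are hypothetical.
    import logging

    logger = logging.getLogger("rewrite-compare")

    def serve(request, old_impl, new_impl):
        old_result = old_impl(request)
        try:
            new_result = new_impl(request)
            if new_result != old_result:
                logger.warning("mismatch for %r: old=%r new=%r",
                               request, old_result, new_result)
        except Exception:
            # The experimental path must never affect what users see.
            logger.exception("new implementation raised for %r", request)
        return old_result  # users always get the old, trusted behavior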

This is aimed at Certification Authorities dealing with TLS certificate misissuance issues and the like, but it very much applies to any kind of incident.

BONUS CONTENT: An incident report from Let’s Encrypt just a few days later included this gem, exactly in line with what Ryan wrote:

After initially confirming the report we reached out to multiple other CAs that we believed would also be affected.

Ryan Sleevi

Whose? Hosted Graphite’s. Definitely worth a read.

Fran Garcia — Hosted Graphite

Which brings me to this unpopular opinion: All code is technical debt.

However, debt itself isn’t bad. It can be risky, especially if misunderstood, but debt itself is not inherently good or bad. It’s a tool.

Dormain Drewitz — Pivotal

Blameless is running a free workshop on writing post-incident reports.

In this talk we will discuss the elements of an effective postmortem and the challenges faced while defining the process. We will introduce concrete methodologies that alleviate the cognitive overhead and emotional burden of doing postmortems.

Blameless

Outages

  • Heroku Status
    • Heroku experienced 8+ hours of instability. This status page posting is really worth a read because it has:
      • meticulously detailed customer impact
      • no sugar-coating
      • detailed workarounds when they were available

      Hats off to you, folks.

  • Slack
  • Reddit
  • Sling TV
  • Disney Plus
    • Increased traffic from a sale caused instability.
  • Fastly
A production of Tinker Tinker Tinker, LLC