SRE Weekly Issue #184

A message from our sponsor, VictorOps:

Do you dream of reducing MTTA from four hours to two minutes? Learn how you can improve incident detection, alerting, real-time incident collaboration and cross-functional transparency to make on-call suck less and build more reliable services:

http://try.victorops.com/sreweekly/improved-incident-response

Articles

This article relates to Donella H. Meadows’s book, Thinking in Systems.

What follows is Meadows’ list of leverage points outfitted with my ideas of where or how they can be applied to software development and web operations.

Ryan Frantz

D:

I know it's past an hour but… we got ~600 Nagios emails a day. Boss forbade us from muting any of them. In weekly status meeting, he’d often quiz on-call on a random alert. If on-call didn't know about it, boss would often scream at us…

Jason Antman (@j_antman)

Find out how the Couchbase folks use Jepsen to test their database offering.

Korrigan Clark

A supportive on-call environment is critical to ensuring reliability and resiliency.

Deirdre Mahon — Honeycomb

This is a follow-on to an article I linked to a while back.

It’s really simpler to call it Tech Risk.

I love the idea of tracking the decisions an organization makes and the risks they entail.

Sarah Baker

Outages

SRE Weekly Issue #183

A message from our sponsor, VictorOps:

Incident management and response don’t need to suck. See how you can build a collaborative incident management plan with shared transparency into developer availability and on-call schedules for IT operations:

http://try.victorops.com/sreweekly/incident-management-plan

Articles

Another issue of Increment, on a topic integral to SRE: testing.

It doesn’t matter if you’ve already read everything Charity Majors has written; in this article she’s still managed to find new and even more compelling ways to argue that we should embrace testing in production.

My two other favorite articles from this issue:

Charity Majors — Honeycomb

That’s exactly what we hoped for.

They rewrote this critical service and carefully deployed it to avoid user impact, using a technique I love: run the new code alongside the old for a while to verify that it returns the same result.

Jeremy Gayed, Said Ketchman, Oleksii Khliupin, Ronny Wang and Sergey Zheznyakovskiy — New York Times
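
A minimal sketch of that dual-run idea (not the Times’s actual implementation; legacy_handler and rewritten_handler are hypothetical callables): always serve the legacy result, shadow-run the rewrite on the same input, and log any mismatch.

```python
import logging

log = logging.getLogger("shadow-compare")

def handle_request(request, legacy_handler, rewritten_handler):
    """Serve traffic from the legacy path while shadow-running the rewrite.

    The legacy result is always returned to the caller; the rewritten path
    is only observed, so a bug in it cannot cause user-facing impact.
    """
    legacy_result = legacy_handler(request)

    try:
        new_result = rewritten_handler(request)
        if new_result != legacy_result:
            # A mismatch is a signal to investigate before cutting over.
            log.warning("shadow mismatch for %r: legacy=%r new=%r",
                        request, legacy_result, new_result)
    except Exception:
        # Failures in the new path are logged, never surfaced to users.
        log.exception("rewritten handler raised for %r", request)

    return legacy_result
```

In practice you’d likely sample only a fraction of traffic and emit mismatch counts as a metric rather than logging every request.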

This is aimed at Certification Authorities dealing with TLS certificate misissuance issues and the like, but it very much applies to any kind of incident.

BONUS CONTENT: An incident report from Let’s Encrypt just a few days later included this gem, exactly in line with what Ryan wrote:

After initially confirming the report we reached out to multiple other CAs that we believed would also be affected.

Ryan Sleevi

Whose? Hosted Graphite’s. Definitely worth a read.

Fran Garcia — Hosted Graphite

Which brings me to this unpopular opinion: All code is technical debt.

However, debt itself isn’t bad. It can be risky, especially if misunderstood, but debt itself is not inherently good or bad. It’s a tool.

Dormain Drewitz — Pivotal

Blameless is running a free workshop on writing post-incident reports.

In this talk we will discuss the elements of an effective postmortem and the challenges faced while defining the process. We will introduce concrete methodologies that alleviate the cognitive overhead and emotional burden of doing postmortems.

Blameless

Outages

  • Heroku Status
    • Heroku experienced 8+ hours of instability. This status page posting is really worth a read because it has:
      • meticulously detailed customer impact
      • no sugar-coating
      • detailed workarounds when they were available

      Hats off to you, folks.

  • Slack
  • Reddit
  • Sling TV
  • Disney Plus
    • Increased traffic from a sale caused instability.
  • Fastly

SRE Weekly Issue #182

A message from our sponsor, VictorOps:

Collaborate with the right teammates, find the right information and resolve system outages in minutes. Play the VictorOps on-call game to test your skillz and compete against your friends and coworkers.

http://try.victorops.com/sreweekly/on-call-game

Articles

Friday deploys are going to be necessary occasionally, even if we try to ban them. Banning them will only mean that we’re less experienced at executing Friday deploys successfully.

Will Gallego

Jet engines are Complicated. The system of jet engine maintenance (including the technicians, policies, schedules, etc.) is Complex. Understanding the difference is key to managing complex systems.

Adam Johns

In this issue, we have articles from the front line, as well as from safety, legal, leadership, human factors and psychology specialists.

Hindsight is a magazine targeted at air traffic controllers. An example article title from this issue:

Mode-Switching in Air Traffic Control

Thanks to Greg Burek for this one.

The US Federal Communications Commission released its report on an outage last December that took down 911 (emergency services) across a large swathe of the US.

This outage was caused by an equipment failure catastrophically exacerbated by a network configuration error.

They’re two separate concepts, but they’re often presented together, blurring the line between them.

Daniel Abadi

I love the idea of applying the ideas of resilience engineering to child welfare services. This article quotes from Hollnagel and Dekker.

Tom Morton and Jess McDonald

Outages

SRE Weekly Issue #181

A message from our sponsor, VictorOps:

Think you’ve got what it takes to quickly resolve a system outage? Test your on-call skillz with the new VictorOps on-call adventure game.

http://try.victorops.com/sreweekly/on-call-game

Articles

Root Cause Analysis is a flawed concept, and seeking it almost inevitably results in treating people unfairly. I like the concept of “Least Effort to Remediate” introduced in this article.

Casey Rosenthal — Verica

Slack developed a load simulation tool and used it to verify a new feature, Enterprise Key Management.

Serry Park, Arka Ganguli, and Joe Smith
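
Slack’s tool is far more sophisticated, but the general shape of a load simulator boils down to spawning many synthetic clients, driving a target endpoint concurrently, and reporting error counts and latency percentiles. The URL and concurrency numbers below are illustrative placeholders, not Slack’s setup.

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Illustrative placeholders -- point this at a test environment, never at
# production without the kind of safeguards the article describes.
TARGET_URL = "https://staging.example.com/health"
WORKERS = 20              # concurrent synthetic clients
REQUESTS_PER_WORKER = 50  # requests each client sends

def synthetic_client(_):
    """One simulated client: hammer the endpoint and record latency/errors."""
    latencies, errors = [], 0
    for _ in range(REQUESTS_PER_WORKER):
        start = time.monotonic()
        try:
            with urllib.request.urlopen(TARGET_URL, timeout=5) as resp:
                resp.read()
        except Exception:
            errors += 1
        latencies.append(time.monotonic() - start)
    return latencies, errors

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=WORKERS) as pool:
        results = list(pool.map(synthetic_client, range(WORKERS)))

    all_latencies = sorted(l for lats, _ in results for l in lats)
    total_errors = sum(e for _, e in results)
    p95 = all_latencies[int(len(all_latencies) * 0.95)]
    print(f"requests={len(all_latencies)} errors={total_errors} p95={p95:.3f}s")
```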

After reviewing the history of the term “antifragility”, this article explains why it is a flawed concept and contrasts it with Chaos Engineering.

This is where the concept of antifragility veers from a truism into bad advice.

Casey Rosenthal

Outages

SRE Weekly Issue #180

A message from our sponsor, VictorOps:

Endorsing a culture of blameless transparency around post-incident reviews can lead to continuous improvement and more resilient services. Check out an interesting technique that SRE teams are using to improve post-incident analysis and learn more from failure:

http://try.victorops.com/sreweekly/ishikawas-fishbone-diagram

Articles

This reads like a mini list of war stories from a grizzled veteran reliability engineer… because that’s exactly what it is. Don’t forget to click the link at the bottom for the followup post!

rachelbythebay

The myths:

  1. Add Redundancy
  2. Simplify
  3. Avoid Risk
  4. Enforce Procedures
  5. Defend against Prior Root Causes
  6. Document Best Practices and Runbooks
  7. Remove the People Who Cause Accidents

If that doesn’t make you want to read this, I don’t know what will.

Casey Rosenthal — Verica

The graveyard that no one dared tread in was the Terraform code. Once they got CI/CD set up, deploys became much easier — and less scary.

Liz Fong-Jones — Honeycomb
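
The article covers Honeycomb’s specific journey; as a generic sketch of the underlying pattern (every change goes through an automated plan, and only a reviewed plan gets applied), a minimal CI step for Terraform might look like the following. The gating flag here is an assumption for illustration, not Honeycomb’s pipeline.

```python
import subprocess
import sys

def run(cmd):
    """Echo and run a command, returning its exit code."""
    print("+", " ".join(cmd), flush=True)
    return subprocess.call(cmd)

def main():
    # terraform plan -detailed-exitcode: 0 = no changes, 1 = error, 2 = diff
    plan_rc = run(["terraform", "plan", "-detailed-exitcode", "-out=tfplan"])
    if plan_rc == 1:
        sys.exit("terraform plan failed")
    if plan_rc == 0:
        print("no infrastructure changes")
        return
    # Apply only the exact plan file produced (and reviewed) above.
    # The --apply flag stands in for whatever gating your CI system provides.
    if "--apply" in sys.argv:
        if run(["terraform", "apply", "-input=false", "tfplan"]) != 0:
            sys.exit("terraform apply failed")

if __name__ == "__main__":
    main()
```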

My favorite idea in this article is that the absence of “errors” is not the same thing as safety.

Thai Woods (summary)

Sidney Dekker (original paper)

High availability and resilience are key features of Kubernetes. But what do you do when your Kubernetes cluster starts to become unstable and it looks like your ship is starting to sink?

Tim Little — Kudos

Outages

A production of Tinker Tinker Tinker, LLC