SRE Weekly Issue #184

Articles

This article relates to Donella H. Meadows’s book, Thinking in Systems.

What follows is Meadows’ list of leverage points outfitted with my ideas of where or how they can be applied to software development and web operations.

Ryan Frantz

@j_antman on Twitter: on-call horror story

I know its past an hour but… we got ~600 Nagios emails a day. Boss forbade us from muting any of them. In weekly status meeting, he’d often quiz on-call on a random alert. If oncall didnt know about it, boss would often scream at us…

Jason Antman (@j_antman)

Introduction To Jepsen Testing At Couchbase

Find out how the Couchbase folks use Jepsen to test their database offering.

Korrigan Clark

Never Alone On Call

A supportive on-call environment is critical to ensuring reliability and resiliency.

Deirdre Mahon — Honeycomb

A Response to ‘Towards an Understanding Tech Debt’

This is a follow-on to an article I linked to a while back.

It’s really simpler to call it Tech Risk.

I love the idea of tracking the decisions an organization makes and the risks they entail.

Sarah Baker

Outages

Google App Engine
Fastly
Wikipedia
Yahoo Mail
AOL Mail
Tesla App
- Some Tesla owners were locked out of their cars when the app stopped working.
Amazon’s Elastic Block Store (EBS) in us-east-1
- Amazon experienced an outage that resulted in the total loss of a small percentage of EBS volumes.
Heroku Incident Followups
- Incident 1891
- Incident 1892
- Both incidents involved an outage in Heroku’s upstream provider.
Heroku Incident #1896
- Also this one: #1897.
