Articles
This article relates to Donella H. Meadows’s book, Thinking in Systems.
What follows is Meadows’ list of leverage points outfitted with my ideas of where or how they can be applied to software development and web operations.
Ryan Frantz
D:
I know it’s past an hour but… we got ~600 Nagios emails a day. Boss forbade us from muting any of them. In weekly status meeting, he’d often quiz on-call on a random alert. If on-call didn’t know about it, boss would often scream at us…
Jason Antman (@j_antman)
Find out how the Couchbase folks use Jepsen to test their database offering.
Korrigan Clark
A supportive on-call environment is critical to ensuring reliability and resiliency.
Deirdre Mahon — Honeycomb
This is a follow-on to an article I linked to a while back.
It’s really simpler to call it Tech Risk.
I love the idea of tracking the decisions an organization makes and the risks they entail.
Sarah Baker
Outages
- Google App Engine
- Fastly
- Wikipedia
- Yahoo Mail
- AOL Mail
- Tesla App
- Some Tesla owners were locked out of their cars when the app stopped working.
- Amazon’s Elastic Block Store (EBS) in us-east-1
- Amazon experienced an outage that resulted in the total loss of a small percentage of EBS volumes.
- Heroku Incident Followups
- Both incidents involved an outage in Heroku’s upstream provider.
- Heroku Incident #1896
- Also this one: #1897.