SRE Weekly Issue #210

Articles

Netflix open-sourced their incident management system, Dispatch.

Put simply, Dispatch is:

All of the ad-hoc things you’re doing to manage incidents today, done for you, and a bunch of other things you should’ve been doing, but have not had the time!

Kevin Glisson, Marc Vilanova, Forest Monsen — Netflix

Reading /proc/pid/cmdline can hang forever

I wasn’t aware of this little pitfall of memory cgroups.

rachelbythebay

In space, no one can hear you kernel panic

Your failover DB instance is cute. Try 4x+ redundancy. That’s the kind of engineering required when designing systems to operate in space.

Glenn Fleishman — Increment

A single person on-call “rotation” is a critical vulnerability

This post enumerates some of the risks introduced when a single person carries 100% of the on-call duties of a team, and shows why those risks are not simply eliminated by increasing the number of people in the rotation.

Daniel Condomitti — FireHydrant

Experimental study on the effect of procedure under unexpected situations

This is a pretty nifty experiment showing the importance of letting folks use their judgement to handle unexpected situations rather than relying on adherence to procedures.

Thai Wood — Resilience Roundup (summary)

Makoto Takahashi, Daisuke Karikawa, Genta Sawasato and Yoshitaka Hoshii — Tohoku University (original paper)

Coronavirus/COVID-19 and USENIX Conferences

FYI: SREcon Americas West has been rescheduled to June 2-4.

Millions of tiny databases

This week, we have another summary of the Physalia paper. I especially like the bit about poison pills.

Adrian Colyer — The Morning Paper (summary)

Brooker et al. — NSDI’20 (original paper)

How did software get so reliable without proof?

In this case, “proof” means “formal proof”.

It’s not that software got so reliable without proof: it’s that systems that include software got so reliable without proof.

Lorin Hochstein

Outages

Let’s Encrypt Status
- Let’s Encrypt purposefully suspended certificate issuance to investigate a bug around validating CAA DNS records. See their initial report and subsequent full report for details.
  Subsequently, they decided to revoke 3 million certificates with a pretty short warning. Both actions (the revocations and taking down issuance initially) were likely warranted and mandated under the compliance guidelines that CAs are subjected to.
  
  I’ve found two third-party incidents so far that seem to stem from the revocations:
  * statuspage.io
  * Heroku
  
  Got any more? Please do send them my way.
Robinhood (stock trading platform)
- Thanks to Daniel Lucas for this one and a couple of other recent ones.
G Suite Status Dashboard
PagerDuty
Uber
Interactive Brokers (stock broker)
Binion’s and Four Queens (Las Vegas casinos)
- Slot machines stopped working, and an eerie quiet descended.
crates.io incident report for 2020-02-20
- On 2020-02-20 at 21:28 UTC we received a report from a user of crates.io that their crate was not available on the index even after 10 minutes since the upload. This was a bug in the crates.io webapp exposed by a GitHub outage.
  
  crates.io is the Rust language package registry.
  
  Pietro Albini — crates.io
Discord
