Articles
Netflix open sourced their incident management system.
Put simply, Dispatch is:
All of the ad-hoc things you’re doing to manage incidents today, done for you, and a bunch of other things you should’ve been doing, but have not had the time!
Kevin Glisson, Marc Vilanova, Forest Monsen — Netflix
I wasn’t aware of this little pitfall of memory cgroups.
rachelbythebay
Your failover DB instance is cute. Try 4x+ redundancy. That’s the kind of engineering required when designing systems to operate in space.
Glenn Fleishman — Increment
This post enumerates some of the risks introduced when a single person carries 100% of the on-call duties of a team, and shows why those risks are not simply eliminated by increasing the number of people in the rotation.
Daniel Condomitti — FireHydrant
This is a pretty nifty experiment showing the importance of letting folks use their judgement to handle unexpected situations rather than relying on adherence to procedures.
Thai Wood — Resilience Roundup (summary)
Makoto Takahashi, Daisuke Karikawa, Genta Sawasato and Yoshitaka Hoshii — Tohoku University (original paper)
FYI: SRECon Americas West has been rescheduled to June 2-4.
This week, we have another summary of the Physalia paper. I especially like the bit about poison pills.
Adrian Colyer — The Morning Paper (summary)
Brooker et al. — NSDI’20 (original paper)
In this case, “proof” means “formal proof”.
It’s not that software got so reliable without proof: it’s that systems that include software got so reliable without proof.
Lorin Hochstein
Outages
- Let’s Encrypt Status
- Let’s Encrypt purposefully suspended certificate issuance to investigate a bug around validating CAA DNS records. See their initial report and subsequent full report for details.
Subsequently, they decided to revoke 3 million certificates on pretty short notice. Both actions (the revocations and the initial suspension of issuance) were likely warranted and mandated under the compliance guidelines that CAs are subject to.
I’ve found two third-party incidents so far that seem to stem from the revocations:
* statuspage.io
* Heroku
Got any more? Please do send them my way.
- Robinhood (stock trading platform)
- Thanks to Daniel Lucas for this one and a couple of other recent ones.
- G Suite Status Dashboard
- PagerDuty
- Uber
- Interactive Brokers (Stock Broker)
- Binion’s and Four Queens (Las Vegas casinos)
- Slot machines stopped working, and an eerie quiet descended.
- crates.io incident report for 2020-02-20
On 2020-02-20 at 21:28 UTC we received a report from a user of crates.io that their crate was not available on the index even after 10 minutes since the upload. This was a bug in the crates.io webapp exposed by a GitHub outage.
crates.io is the Rust language package registry.
Pietro Albini — crates.io
- Discord