A fairly large Outages section this week as I experiment with including post-analyses there even for older incidents.
Articles
Every week, there’s an article with a title like this (just like with “costs of downtime”). Almost every week, they’re total crap, but this one from PagerDuty is a bit better than the rest. The bit that interests me is the assertion that a microservice-based architecture “makes maintenance much easier” and “makes your app more resilient”. Sure it can, but it can also just mean that you trade one problem for 1300 problems.
Coping with that complexity requires a different approach to monitoring and alert management. You need to do much more than treat incident management as a process of responding to alerts in the order they come in or assuming that every alert requires action.
This post explains why a flexible, nuanced approach to alert management is vital, and how to implement it.
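The article's core point, that handling alerts strictly in arrival order (or paging on all of them) doesn't scale to a microservice fleet, is easy to see in code. Here's a minimal, purely illustrative triage routine; the alert fields, severity levels, and rules are invented for this sketch and aren't from the PagerDuty post.

```c
#include <stdio.h>

/* Toy alert triage: decide per-alert whether to page, file a ticket,
 * or just record it, rather than paging on everything FIFO.
 * All names and rules here are invented for illustration. */

typedef enum { SEV_INFO, SEV_WARNING, SEV_CRITICAL } severity_t;
typedef enum { ACTION_SUPPRESS, ACTION_TICKET, ACTION_PAGE } action_t;

typedef struct {
    const char *service;
    severity_t  severity;
    int         customer_impacting;  /* nonzero if users can see the problem */
} alert_t;

static action_t triage(const alert_t *a)
{
    if (a->severity == SEV_CRITICAL && a->customer_impacting)
        return ACTION_PAGE;      /* wake a human up */
    if (a->severity >= SEV_WARNING)
        return ACTION_TICKET;    /* deal with it during working hours */
    return ACTION_SUPPRESS;      /* record it, notify no one */
}

int main(void)
{
    alert_t disk_warning = { "billing-db", SEV_WARNING, 0 };
    alert_t api_down     = { "public-api", SEV_CRITICAL, 1 };

    printf("%s -> %d\n", disk_warning.service, triage(&disk_warning)); /* 1: ticket */
    printf("%s -> %d\n", api_down.service, triage(&api_down));         /* 2: page */
    return 0;
}
```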
HelloFresh live-migrated their infrastructure to an API gateway to facilitate a transition to microservices. They kindly wrote up their experience, which is especially educational because their first roll-out attempt didn’t go as planned.
[…] our first attempt at going live was pretty much a disaster. Even though we had a quite nice plan in place we were definitely not ready to go live at that point.
In this issue, Mathias shows us the benefits of “dogfooding” and cases where it can break down. I like the way the feedback loop is shortened, so that developers feel a painful user experience and have incentive to quickly fix it. It reminds me a lot of the feedback loop you get when developers go on call for the services they write.
A breakdown of four categories of monitoring tools using the “2×2” framework. I like the mapping of “personas” (engineering roles) to the monitoring types they tend to find most useful.
Outages
- Cloudflare: “Cloudbleed”
- Cloudflare experienced a minor outage while mitigating a major leak of private information. They posted this (incredibly!) detailed analysis of the bug and their response to it. Other companies, including PagerDuty, Monzo, TechDirt, and MaxMind, posted responses to the outage. There’s also this handy list of sites using Cloudflare.
- Mailgun
- Here’s a really interesting postmortem for a Mailgun outage I linked to in January. What apparently started off as a relatively minor outage was significantly exacerbated “due to human error”. The intriguing bit: the “corrective actions” section makes no mention at all of process improvements to make the system more resilient to this kind of error.
- All Circuits are Busy Now: The 1990 AT&T Long Distance Network Collapse
- In 1990, the entire AT&T long-distance network experienced a catastrophic failure, and roughly 50% of calls failed. The analysis is pretty interesting and shows us that a simple bug can break even an incredibly resilient distributed system (there’s a sketch of the bug pattern after this list).
the Jan. 1990 incident showed the possibility for all of the modules to go “crazy” at once, how bugs in self-healing software can bring down healthy systems, and the difficulty of detecting obscure load- and time-dependent defects in software.
- vzaar
- They usually fork a release branch off of master, test it, and push that out to production. This time, they accidentally pushed master to production. How do I know that? Because they published this excellent post-analysis of the incident just two days after it happened.
- U.S. Dept. of Homeland Security
- This article has some vague mention of an expired certificate.
- YouTube
- CD Baby
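About that AT&T bug: accounts of the collapse commonly attribute it to a misplaced `break` in the 4ESS switches’ C recovery code, where a `break` inside an `if` within a `switch` case exits the whole switch and silently skips the processing that follows. Here’s a minimal invented sketch of that pattern; none of the names or logic are from the actual 4ESS source.

```c
#include <stdio.h>

/* Sketch of the misplaced-break pattern: inside a switch case, `break`
 * always exits the switch, even when the programmer meant it to exit
 * only the enclosing if. Everything here is invented for illustration;
 * it is not the 4ESS code. */

enum msg_type { MSG_STATUS, MSG_OTHER };

static void handle_message(enum msg_type type, int buffer_busy)
{
    int state_updated = 0;

    switch (type) {
    case MSG_STATUS:
        if (buffer_busy) {
            break;  /* intended: "skip just this step";
                       actual: exits the whole switch, skipping the
                       state update below */
        }
        /* ... normal message processing ... */
        state_updated = 1;
        break;
    default:
        break;
    }

    /* On the busy path, state_updated is still 0: the handler has
     * silently skipped work it was supposed to do, leaving data
     * inconsistent -- the seed of a crash/recovery cycle that can
     * ripple out to neighboring switches. */
    printf("type=%d busy=%d state_updated=%d\n", type, buffer_busy, state_updated);
}

int main(void)
{
    handle_message(MSG_STATUS, 0); /* state_updated=1 */
    handle_message(MSG_STATUS, 1); /* state_updated=0: work silently skipped */
    return 0;
}
```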