Articles
This is important, and well worth a read. Where’s the SRE connection? The article explains that the U.S. Surgeon General’s comment that masks are “not effective” led to a stigma against those who wear them here. That kind of unintended sociological effect is commonly uncovered in incident post-analysis.
Sui Huang
PagerDuty ran the numbers and discovered a recent increase in incidents, especially in certain companies.
Rachel Obstler — PagerDuty
Here’s the scoop on all those GitHub incidents in February.
Keith Ballinger — GitHub
No, it won’t be possible to continue operating business-as-usual. For the unforeseeable future, teams across the world will be dealing with cutbacks, infrastructure instability, and more. However, with SRE best practices, your team can embrace resilience and adapt through this difficult time.
Hannah Culver — Blameless
5 tips for incident management when you’re suddenly remote
I love the concept of “ephemeral information”, that is, discussions that happen out-of-band, making it much harder to analyze the incident after the fact.
Blake Thorne — Atlassian
Grey failure turned a seemingly reasonable auto-recovery mechanism into a DoS caused by a thundering herd.
Panagiotis Moustafellos, Uri Cohen, and Sylvain Wallez — Elastic
Outages
- G Suite
- Google Cloud Platform
- GCP had a major incident that caused the G Suite outage. GCP also had an (apparently) unrelated outage later in the day.
- BitBay (cryptocurrency exchange)
- Netflix
- Uber
- Fastly
- Also this one. Full disclosure: Fastly is my employer.
- Discord
- Brightcove
- Zoom
- DoorDash
- Nest
- Canvas (remote learning tool)