Articles
The steps are:
- Know How Much Time Is Spent On Toil
- Find The Toil
- Determine The Root Causes Of Toil
- Find And Prioritize The Low-Hanging Fruit
- Promote Toil Reduction
Aater Suleman — Forbes
I like how they try to strike a balance and avoid reviewing too far in depth, while still hitting everything important.
Milan Plžík — Grafana Labs
Lots of good stuff in this one about one of my favorite topics, service ownership.
Kenneth Rose — OpsLevel
This is the intro I needed to understand Conflict-Free Replicated Data Types.
Jo Stichbury — Ably
Availability, maintainability and reliability all have distinct—if related—meanings, and they each play different roles in reliability operations.
JJ Tang — DevOps.com
The five Ps come from medicine and understanding medical accidents, but they apply equally well to analyzing incidents in IT.
Lydia Leong
I really love the focus on de-emphasizing finding action items in incident retrospectives, in favor of learning.
Gergely Orosz — The Pragmatic Engineer
Outages
- AT&T SMS in the US
- This week, I saw several status pages point to some kind of problem in their ability to send SMS notifications to AT&T phones. I thought this was interesting because usually I don’t learn about an outage solely from other companies’ status pages.
- Google Meet
- Tesco
- Coinbase
- Zomato
- Barclays
- HSBC