SRE Weekly is a newsletter devoted to everything related to keeping a site or service available as consistently as possible. SRE (Site/Service Reliability Engineering) isn’t just about automated failover or fault-tolerant architectures — although of course those are important. It’s about a holistic view of reliability that takes into account everything from servers to human factors to processes to automation and more.
Did “human error” cause that outage? What caused the human to make the error? Can we use automation to make that kind of error impossible? Can we expose likely errors so that they can be caught early or avoided?
The Internet is growing at a fascinatingly fast pace, and we’re continually having to invent new techniques to keep ever larger services operational with increasingly stringent SLAs. SRE has evolved from its roots in Systems Engineering, DevOps, and Program Management in an attempt to fill this need, but it’s still a very new field. The more information we share, the more progress we’ll be able to make as a profession.
I started SRE Weekly out of frustration at the seeming lack of one or two main sources for articles relating to service reliability. I hope you’ll enjoy it, and I’d love to hear from you if you’ve seen any articles that I might have missed. Feel free to reply to the email newsletter, or email me directly at my first name at this domain.