SRE Weekly Issue #3

Articles

I love this article! Simplicity is especially important to us in SRE, as we try to gain a holistic understanding of a service so that we can reason about its failure modes. Complexity is the enemy: every bit of it can hide a design weakness that threatens to take down your service.

I once heard that it takes Google SRE hires 6 months to come up to speed. I get that Google is huge, but is it possible to reduce this kind of spin-up time by aggressively simplifying?

This Reddit thread highlights the importance of management’s duty to adequately staff on-call teams and make on-call not suck. My favorite quote:

If you’re the sole sysadmin ANYWHERE then your company is dropping the ball big-time with staffing. One of anything means no backup. Point to the redundant RAID you probably installed in the server for them and say “See, if I didn’t put in multiples, you’d be SOL every time one of those drives fails, same situation except if I fail, I’m not coming back and the new one you have to replace me with won’t be anywhere near as good.”

Whether or not you celebrate, I hope this holiday is a quiet and incident-free one for all of you. Being on call during the holidays can be an especially quick path to burnout and pager fatigue. As an industry, it’s important that we come up with ways to keep our services operational during the holidays with minimal interruption to folks’ family/vacation time.

Even if you think you’ve designed your infrastructure to be bulletproof, there may be weaknesses lurking.

Molly-guard is a tool to help you avoid rebooting the wrong host. Back at ${JOB[0]}, I mistakenly rebooted the host running the first production trial of a deploy tool I’d spent 6 months writing; it was 95% complete at the time. Oops.
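For the curious, here’s a minimal Python sketch of the core idea. The real molly-guard is a shell wrapper installed ahead of halt/reboot/shutdown that prompts for the hostname before proceeding; this is just an illustration of that pattern, not its actual implementation:

```python
#!/usr/bin/env python3
"""Sketch of a molly-guard-style reboot wrapper: before rebooting,
require the operator to type this machine's hostname, so a reboot
typed into the wrong SSH window fails safe."""

import socket
import subprocess
import sys


def guarded_reboot() -> None:
    hostname = socket.gethostname()
    # Ask the operator to prove they know which machine they're on.
    typed = input("Please type the hostname of the machine to reboot: ")
    if typed.strip() != hostname:
        print(f"Hostname mismatch; refusing to reboot {hostname}.",
              file=sys.stderr)
        sys.exit(1)
    # Hostname matched: hand off to the real reboot command.
    subprocess.run(["/sbin/reboot"], check=True)


if __name__ == "__main__":
    guarded_reboot()
```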

The final installment in a series on writing runbooks. The biggest takeaway for me is the importance of including a runbook link in every automated alert. Especially useful for those 3am incidents.
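As a sketch of the practice: make it impossible to fire an alert without a runbook attached, so the on-call engineer lands on remediation steps instead of searching a wiki half-asleep. The alert names and URLs below are hypothetical; in a real setup this mapping would likely live in your alerting config (e.g. as alert annotations):

```python
"""Sketch: refuse to build an alert payload that lacks a runbook link."""

# Hypothetical mapping from alert name to runbook URL.
RUNBOOKS = {
    "HighErrorRate": "https://wiki.example.com/runbooks/high-error-rate",
    "DiskAlmostFull": "https://wiki.example.com/runbooks/disk-almost-full",
}


def build_alert(name: str, summary: str) -> dict:
    """Build an alert payload, requiring a runbook link for every alert."""
    runbook = RUNBOOKS.get(name)
    if runbook is None:
        raise ValueError(f"Alert {name!r} has no runbook; write one first.")
    return {"alert": name, "summary": summary, "runbook_url": runbook}


if __name__ == "__main__":
    print(build_alert("HighErrorRate", "5xx rate above 5% for 10 minutes"))
```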

In this talk from last year (video & transcript), Blake Gentry talks about how Heroku’s incident response evolved. Full disclosure: I work for Heroku. We still largely do things the same way, although now there’s an entire team dedicated solely to the incident commander (IC) role.

Outages
