SRE Weekly Issue #29

Articles

DevOps vs SRE: delayed coverage of the dumbest war – charity.wtf

I can’t summarize this awesome article well enough, so I’m just going to quote Charity a bunch:

the outcomes associated with operations (reliability, scalability, operability) are the responsibility of *everyone* from support to CEO.

if you have a candidate come in and they’re a jerk to your office manager or your cleaning person, don’t fucking hire that person because having jerks on your team is an operational risk

If you try and just apply Google SRE principles to your own org according to their prescriptive model, you’re gonna be in for a really, really bad time.

March Outages post-mortem

Traffic spikes can be incredibly difficult to handle, foreseen or not. Packagecloud.io details its efforts to survive a daily spike of 600% of normal traffic in March.

New high availability checklist now available | Azure

This checklist is aimed at deployments on Azure, but many of the items could be generalized and applied to infrastructure deployed elsewhere.

Emails reveal months of missteps leading up to Tennessee’s disastrous online testing debut

An in-depth look at the multiple failures of TNReady mentioned earlier this year (issues #10 and #20).

The Two Sides to Google Infrastructure for Everyone Else

A two-sided debate, both sides of which are Gareth Rushgrove (maintainer of the excellent Devops Weekly). Should we try to adopt Google’s way of doing things in our own infrastructures? For example, error budgets:

What if you’re operating an air traffic control system or a nuclear power station? Your goal is probably closer to zero outages
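For anyone who hasn't run the numbers on an error budget before, here is a minimal sketch of the arithmetic, assuming a simple time-based availability target. The function name, the 30-day window, and the example targets are my own illustrative choices, not anything taken from Gareth's post or from Google.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Return the minutes of unavailability an availability target allows.

    `slo` is a fraction such as 0.999; `window_days` is an assumed
    measurement window (30 days here, purely for illustration).
    """
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in the range (0, 1]")
    total_minutes = window_days * 24 * 60
    # The budget is whatever fraction of the window the target leaves over.
    return (1.0 - slo) * total_minutes


if __name__ == "__main__":
    # Hypothetical targets, chosen only to show how fast the budget shrinks.
    for slo in (0.99, 0.999, 0.9999):
        budget = error_budget_minutes(slo)
        print(f"{slo:.2%} over 30 days leaves {budget:.1f} minutes of budget")
```

A 99.9% target over 30 days leaves roughly 43 minutes, and a target of 100% leaves nothing at all, which is the air traffic control case in the quote above.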

Outages

Telstra
- Another announcement that they’re dedicating more money to preventing outages, followed by yet another outage. Telstra’s CEO says that the number of outages has not actually increased.
Google Compute Engine
- Click through for the full postmortem.
  
  On Wednesday 29 June 2016, newly created Google Compute Engine instances and newly created network load balancers in all zones were partially unreachable for a duration of 106 minutes.
Virgin Mobile
Google Calendar
Snapchat
Idea (mobile telecom)
Microsoft Office 365
Comcast (Boston, MA, US)
