The outcomes associated with operations (reliability, scalability, operability) are the responsibility of *everyone* from support to CEO.
If you have a candidate come in and they’re a jerk to your office manager or your cleaning person, don’t fucking hire that person, because having jerks on your team is an operational risk.
If you try and just apply Google SRE principles to your own org according to their prescriptive model, you’re gonna be in for a really, really bad time.
What if you’re operating an air traffic control system or a nuclear power station? Your goal is probably closer to zero outages.
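To make "closer to zero outages" concrete, here is a rough sketch of how an availability target translates into allowed downtime. The function name `downtime_budget_minutes`, the 30-day window, and the sample targets are assumptions chosen for illustration; none of them come from the quoted articles.

```python
# Illustrative sketch: convert an availability target into a downtime budget.
# The 30-day window and the sample targets below are assumptions.

MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 minutes


def downtime_budget_minutes(availability_target: float,
                            window_minutes: int = MINUTES_PER_30_DAYS) -> float:
    """Return how many minutes of downtime the target permits in the window."""
    if not 0.0 <= availability_target <= 1.0:
        raise ValueError("availability_target must be between 0 and 1")
    return window_minutes * (1.0 - availability_target)


if __name__ == "__main__":
    for target in (0.99, 0.999, 0.9999, 0.99999):
        budget = downtime_budget_minutes(target)
        print(f"{target:.3%} availability allows {budget:.1f} min per 30 days")
```

Each extra nine cuts the budget by a factor of ten, which is why a goal near zero outages calls for a very different operating model than a typical web service.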
Another announcement that they’re dedicating more money to addressing outages, followed by yet another outage. Telstra’s CEO says that the number of outages has not actually increased.
- Google Compute Engine
Click through for the full postmortem.
On Wednesday 29 June 2016, newly created Google Compute Engine instances and newly created network load balancers in all zones were partially unreachable for a duration of 106 minutes.
- Virgin Mobile
- Google Calendar
- Idea (mobile telecom)
- Microsoft Office 365
- Comcast (Boston, MA, US)