SRE Weekly Issue #72

SPONSOR MESSAGE

Concerned about downtime? VictorOps helps you prepare, respond, and recover from IT and DevOps Incidents. Swing by our product center to learn how and start your trial. http://try.victorops.com/SREWeekly/ProductCenter

Articles

Idempotence is a critically important tool in building a reliable system. Stripe explains the concept and shows how they wrap theoretically non-idempotent actions like charging a credit card into safely idempotent API calls.
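
To make the idea concrete, here is a minimal sketch of an idempotency key in Python. It is not Stripe's implementation: `PaymentService`, `charge`, and the in-memory dictionary are hypothetical names invented for illustration, and a real service would need to store keys durably.

```python
import uuid


class PaymentService:
    """Toy in-memory service (hypothetical); a real one would persist keys durably."""

    def __init__(self):
        self._results = {}      # idempotency key -> response already returned
        self.charges_made = 0   # counts real side effects, for demonstration

    def charge(self, idempotency_key, amount_cents):
        # If this key was already processed, replay the stored response
        # instead of charging the card a second time.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges_made += 1
        response = {"charge_id": self.charges_made, "amount_cents": amount_cents}
        self._results[idempotency_key] = response
        return response


service = PaymentService()
key = str(uuid.uuid4())           # the client generates one key per logical operation
first = service.charge(key, 500)
retry = service.charge(key, 500)  # e.g. a retry after a network timeout
```

The client can retry as often as it likes with the same key, and the card is only charged once.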

Here’s an account of an effort to move from server-based paging (this server is down) to functional-based alerting (this user action isn’t working), with a resulting impressive reduction in out-of-hours paging.

It pays to study up and deeply understand what a simple metric like “cpu utilization” really means.

Why am I linking to AWS’s status site? Look closely, and you’ll see that the “green checkmark i” symbol has been replaced with a far more noticeable blue circle with a white diamond. Check out the old icon here for comparison. End of an era, or just another way of presenting the same information?

The author introduces a new Ruby gem, grpc-commons, that makes it easy to add circuit breaker and statsd support to a gRPC client.
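
If circuit breakers are new to you, the pattern can be sketched in a few lines of Python. This is a generic illustration, not the API of grpc-commons; `CircuitBreaker`, `CircuitOpenError`, and their parameters are hypothetical names.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after   # seconds to wait before a trial call
        self.clock = clock               # injectable so tests can fake time
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: allow one trial call (half-open state).
            # A single failure from here re-opens the circuit.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0   # any success closes the circuit fully
        return result
```

The point is that once a dependency looks unhealthy, callers fail fast instead of piling more requests onto it.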

Along with being a tutorial on setting up Zipkin with Python, this article also explains some basic Zipkin concepts.

PagerDuty is apparently trying to position itself as more than just a paging service, with a few new features around the entire incident lifecycle. I’m especially interested in checking out the new postmortem tooling.

I included this article last week, but my link was outdated and returned a 404. Here’s the corrected link — sorry about that!

I put a call out for a review of Elastic’s new beta anomaly detection feature last week, and here one is! Thanks to an Elastic employee for forwarding this link to me.

This article cautions one to be careful to look past an obvious root cause, because a deeper systemic or policy problem may be lurking behind it.

Serverless / FaaS abstract away traditional provisioning, and they make it really easy to ignore planning for resource usage.

Wow, what a concept:

you can think of […] reliable systems […] as successfully imagining all of the potential things that could go wrong

This 2.5-minute podcast from Todd Conklin has a really great question: to achieve reliability, do we have to try to imagine in advance all of the possible ways our systems could fail?

A patient was given an incorrect syringe resulting in a 5x insulin overdose. Brigham and Women’s Hospital reports on the accident and what they’re doing to prevent mistakes of this sort in the future.

Consumers today have increasingly high expectations for digital applications and service performance, but do IT personnel feel equipped to rise to the occasion? In this survey, we uncover the extent of the digital services expectation gap between consumers and IT teams as well as top strategies teams are using to solve digital disruption challenges.

Outages

  • Our First Kubernetes Outage – Saltside Engineering
    • Kudos to the Saltside folks for sharing a public postmortem for an internal, non-customer-impacting outage!

      This is a public postmortem for a complete shutdown of our internal Kubernetes cluster. It’s shared with you all so everyone may learn.

  • “Re-experience the fun of customizing your Place Page!” A Tale of Oops from Ops
    • Ouch. Linden Lab’s ops team discovered the hard way that they didn’t have a working backup copy of some customer data. The best part of this article is the discussion of the “Shrek Ears” tradition at Linden. It’s one of the things I remember most fondly from my time there, and having worn the ears a few times in my day, I can attest to the fact that it’s a great way to handle the psychological impact of making a mistake.
  • Chase (bank)
  • Facebook

SRE Weekly Issue #71

SPONSOR MESSAGE

Resolving DevOps and IT incidents is not enough. Download the eBook: “Blameless Post Mortems (and how to do them)”, and start learning from them. http://try.victorops.com/BlamelessPostMortems/SREWeekly

Articles

The interesting bit in this story is that upgrading to 5.7 requires a full table rewrite (ALTER TABLE) for any table that has time-related columns. Their initial test-run took months and still hadn’t finished.

AdStage made the move from Heroku to running their service directly on EC2, and in this article they explain why and how.

We were officially only getting about 2 ECUs per dyno, but the reality was that we were getting something closer to 6 since our neighbors on Heroku were not using their full share. This meant that our fleet of AWS instances was 3 times too small, […]

Language Warning: contains the word “sexy” used to describe new or interesting technology.

Full disclosure: Heroku, my employer, is mentioned.

I’ve featured many articles from Mathias Lafeldt as part of his series, Production Ready. Now that he’s moved to Gremlin Inc (a SaaS helping customers run chaos experiments), Mathias reintroduces the history and theory of Chaos Engineering.

The folks behind Mail.ru implemented their own master-master replication system on top of Tarantool, a DBMS I’d never heard of. Their implementation is based on some details of their use-case that may not apply more broadly, but the design discussion is interesting nonetheless.

Facebook rewrote their tool, OnlineSchemaChange, in Python (from the original PHP). OSC is a tool for doing DDL in MySQL without downtime.

The original open sourced OSC was more like an engine than a tool. Users needed to write PHP code wrapping to run the schema change, and, with PHP becoming less popular in the operations world, OSC.php wasn’t widely adopted by the community.

From PagerDuty, an article on the incident management data to gather, how to gather it, and how to analyze it.

A basic introduction to structured logging, including rationale on why you’d want to use it. With infrastructures growing more and more complicated, I find structured logging indispensable in keeping everything up and running and debugging difficult problems.
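
For a feel of what structured logging looks like in practice, here is a small sketch using only Python's standard library. The `JsonFormatter` class, the `fields` convention, and the logger name are my own assumptions for illustration; they are not taken from the linked article.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Fields passed via extra={"fields": {...}} are merged into the event.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def build_logger(stream):
    logger = logging.getLogger("checkout")   # hypothetical service name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


stream = io.StringIO()   # stands in for stdout or a log file
log = build_logger(stream)
log.info("payment failed", extra={"fields": {"user_id": 42, "latency_ms": 187}})
line = stream.getvalue().strip()
```

Because every event is a JSON object, you can filter on `user_id` or aggregate `latency_ms` without writing a regex against free-form text.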

For the network nerds, Facebook details their new inter-datacenter network topology.

New in the latest version of Elastic Stack (think ElasticSearch, Logstash, Kibana, etc) is built-in anomaly detection using machine learning, based on technology from Prelert (acquired by Elastic in 2016). “Machine Learning” — they might as well say it’s powered by “Lasers™”. If you try this out and have any success, please write up your results and send me a link!

Outages

SRE Weekly Issue #70

SPONSOR MESSAGE

Resolving DevOps and IT incidents is not enough. Download the eBook: “Blameless Post Mortems (and how to do them)”, and start learning from them. http://try.victorops.com/BlamelessPostMortems/SREWeekly

Articles

GitHub has released OctoDNS, their tool for synchronizing DNS across multiple providers. Shortly after the Dyn outage last fall (covered here), they still only had one DNS provider (source: direct observation). I suspected that this had to do with the complexity of keeping records in sync across two providers; perhaps that’s why they created OctoDNS.

Bolt is Netflix’s “event driven diagnostic and remediation platform”, although it actually seems like a full-blown remote execution system for large fleets of servers.

A Google SRE takes us through their first on-call shift including running incident command for a production incident. I like the emphasis on a blameless postmortem.

Pete Shima received some questions about onboarding SREs, and lucky us, he decided to answer them publicly. My favorite section is the one about connecting a new SRE to people across the company. I find that solid connections to folks in various positions are vital to getting my job done well. Thanks to Pete for the SRE Weekly mention!

Salesforce has a humongous infrastructure, and they needed a tool to help visualize data from lots of monitoring systems. They created Refocus to serve that need, and they open sourced it. They had three goals: gather data from all of the monitoring systems, on-board new services quickly, and visualize data in a way that makes sense for each service.

Full disclosure: Salesforce (parent company of my employer, Heroku), is mentioned.

Tcpdump is a critical tool for debugging thorny network issues. Julia Evans created a new zine to help you learn the basics, although if her other zines are any indication, even a pro may learn a new trick or two. The zine is $10 now and will be available for free at some point in the future.

Turns out that sharks are a reliability risk. And not just those WFLB.

From their Global Developer Survey, GitLab learned that it’s common for developers to release code before it’s production-ready in response to organizational pressures.

Code released before it’s ready might be good for meeting deadlines, but that’s about all it’s good for.

Here’s a pretty excellent analysis of why adopting the cloud can be difficult for banks. Just skip past the bit with the incorrect uptime calculation, since four nines of uptime actually equates to about 53 minutes’ downtime per year, not 9 hours.
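
The arithmetic is easy to check yourself. This short Python sketch assumes a 365-day year, and the function name is just for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; ignores leap years


def downtime_minutes_per_year(availability_percent):
    """Allowed downtime per year for a given availability percentage."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


for pct in (99.0, 99.9, 99.99):
    minutes = downtime_minutes_per_year(pct)
    print(f"{pct}% -> {minutes:.1f} minutes/year ({minutes / 60:.2f} hours)")
```

Four nines comes out to about 52.6 minutes per year. Roughly nine hours (about 8.76) is what three nines allows.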

Outages

  • London Marathon Donations
    • eBay and Virgin Money Giving both went down under the load as many flocked to place donations before the London Marathon.
  • CARLI
    • CARLI is the Consortium of Academic Research Libraries in Illinois. I included this outage because of the short but sweetly personal postmortem from their network engineer.
  • Instagram
  • Reddit
    • Sorry for the extended outage there. We failed back the maintenance performed earlier tonight. We’ll provide a post-mortem at a later date.

SRE Weekly Issue #69

SPONSOR MESSAGE

Incident management is essential to modern DevOps environments. Learn why in the eBook, “Making the Case for Real-time Incident Management” from your friends at VictorOps. http://try.victorops.com/realtime_incident_mgmt/SREweekly

Articles

In February of 2016, a metal hospital gurney was inadvertently wheeled* into an MRI room, resulting in a costly near-miss accident. Brigham and Women’s Hospital posted about the mishap on their Safety Matters blog and also released a Q&A with their chief quality officer about their dedication to an open and just culture.

If an employee at Brigham makes a mistake that anyone else could make, we will work on improving the system, rather than punishing the employee. We believe that in every circumstance involving “human error” there are systemic opportunities for mitigating reoccurrence.

* Yes, I used the passive voice on purpose. See what I did there?

Sometimes logs help us prevent outages or discover a cause. But raise your hand if you’ve seen logging cause an outage. Yeah, me too.

Traditionally, auditd, together with Linux’s system call auditing support, has been used as part of security monitoring. Slack developed go-audit so that they could use system call auditing as a general monitoring tool. I can think of plenty of outages during which I’d have loved to be able to query system call patterns!

Dropbox has some pretty complex needs around feature gating. Apparently existing tools couldn’t satisfy their use case so they wrote and released a tool with sophisticated user segmentation support.

Why depend on fallible QA testing to spot regressions in a web UI? Computers are so much better at that kind of thing. Niffy spots the pixel changes between old and new code so you can see exactly what regressed before putting it in front of your users.

In this beautifully-illustrated article, Stripe engineer Jacqueline Xu explains how Stripe safely rolled out a major database schema upgrade. Many code paths had to be refactored, and they took a methodical, incremental approach to avoid downtime. Thanks to this article, I now know about Scientist and can’t wait to use it.

Speaking of Stripe, they have another polished post on how and why to add load shedding to your API.
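
As a rough illustration of load shedding, here is a sketch of one common approach: cap the number of in-flight requests and keep a slice of capacity reserved for critical traffic. It is not necessarily how Stripe does it; `LoadShedder` and its parameters are hypothetical, and the class is not thread-safe as written.

```python
class LoadShedder:
    """Reject non-critical work first when the service is near capacity."""

    def __init__(self, capacity, reserved_for_critical):
        self.capacity = capacity                # max requests in flight
        self.reserved = reserved_for_critical   # slots only critical requests may use
        self.in_flight = 0

    def try_acquire(self, critical=False):
        # Non-critical requests see a lower ceiling than critical ones.
        limit = self.capacity if critical else self.capacity - self.reserved
        if self.in_flight >= limit:
            return False   # caller should reject, e.g. with HTTP 429 or 503
        self.in_flight += 1
        return True

    def release(self):
        # Call once the request has finished, whether it succeeded or failed.
        self.in_flight -= 1


shedder = LoadShedder(capacity=3, reserved_for_critical=1)
accepted = [shedder.try_acquire() for _ in range(3)]   # three non-critical requests
critical_ok = shedder.try_acquire(critical=True)       # uses the reserved slot
```

Under overload, the cheap rejection of low-priority requests keeps the important ones working.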

Scientist is such an awesome idea. The idea is to try out a new code path and see if it produces the same result as the old code path. It only returns the result of the old code path, so you can prove to yourself that the new code path is safe before exposing users to it.
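
Scientist itself is a Ruby library, but the pattern is simple enough to sketch in Python. This is my own illustration of the idea and not Scientist's API; `run_experiment` and its parameters are hypothetical names.

```python
import random


def run_experiment(control, candidate, mismatches, sample_rate=1.0):
    """Run both code paths, record differences, always return the control result."""
    result = control()   # the old, trusted code path
    if random.random() < sample_rate:   # optionally run the candidate on a sample
        try:
            observed = candidate()      # the new code path under test
        except Exception as exc:
            # A crash in the new path must never reach the caller.
            mismatches.append((result, repr(exc)))
        else:
            if observed != result:
                mismatches.append((result, observed))
    return result


mismatches = []
same = run_experiment(lambda: 2 + 2, lambda: 2 * 2, mismatches)
different = run_experiment(lambda: 2 + 3, lambda: 2 * 3, mismatches)
```

Callers always see the old behaviour, and the list of mismatches tells you when the new path is ready to take over.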

I’m including this article at least in part due to its mention of the February S3 outage. AWS had difficulty reporting the outage on its status site because of a dependency on S3.

Conway’s Law is extremely important to us as SREs. As we can see in the example of Sprouter, a poor organizational structure can produce unreliable software. My fellow SRE, Courtney Eckhardt, loves to say, “My job is applying Conway’s Law in reverse.”

Outages

  • AT&T VoIP
    • I received an anonymous anecdote from an SRE Weekly reader (thanks!) that this affected at least one hospital, with the result that critical phone communication was significantly hampered. What happened to the good old mostly-reliable traditional phone system? Irony: in the reader’s case, an announcement about the failure was sent out via email.
  • Three
    • This is the second case this year of a telecom outage resulting in SMSes being delivered to the wrong people. Am I the only one that finds this an extremely surprising and concerning failure mode?
  • eBay
  • Red Hat

SRE Weekly Issue #68

SPONSOR MESSAGE

Incident management is essential to modern DevOps environments. Learn why in the eBook, “Making the Case for Real-time Incident Management” from your friends at VictorOps. http://try.victorops.com/realtime_incident_mgmt/SREweekly

Articles

The big story this week is the release of the inaugural issue of Increment, a newsletter by Stripe, edited by Susan Fowler. They bill it as “A digital magazine about how teams build and operate software systems at scale” and the first issue, dedicated to on-call, certainly delivers. Below, I’ll include my short take on each article in the issue.

Increment interviewed over thirty companies to build a picture of the common practices in incident response. I’m actually pretty surprised to hear that “it turns out that they all follow similar (if not completely identical) incident response processes”, but apparently the commonalities don’t stop at just process:

Slack and PagerDuty appear to be two points of failure across the entire tech industry

Bonus content: Julia Evans shared her notes on Twitter.

Next up, Increment addresses the dichotomy of ops teams versus developers on call for their code. It turns out that the latter practice is more prevalent than I’d realized.

After laying a solid groundwork of suggestions for avoiding burn-out in on-call, this next Increment article raises a really important point: on-call affects people differently based on privilege. Example: single parents are going to have a much harder time of it.

[…] if you set up an on-call rotation with a schedule or intensity that assumes the participants have no real responsibilities outside of the office, you are limiting the people who will be able to participate on your team.

Remember a couple of months back when GitLab live-streamed their incident response? Increment caught up with their CEO to give us this in-depth interview about their radical transparency.

Increment shares tips and key practices for setting up on-call, targeted to companies ranging in size from 0-10 employees all the way up to 10000+.

Increment rounds out their issue with advice in the form of quotes from six of the companies they interviewed.

The other big news of the week is the official launch of Honeycomb.io. If you haven’t had a chance to check it out yet, here’s an introduction, and you can also sign up for a free one-month trial.

Outages

  • Melbourne IT
    • A DDoS took out their DNS service, taking out customer domains and also sites that they host for customers. While this is a news article and not a formal post-analysis, it does include some pretty interesting technical detail from an interview with their CTO. I’m not sure that he did himself any favors by quoting the definition of their SLA:

      “People look at 99.9 per cent and think that’s seconds of downtime, but you work it out and it’s 45 minutes.”

  • Google Cloud HTTP(S) Load Balancer
    • Google Cloud LB threw 502s for 25% of requests in a 22-minute period. They released this post-analysis 7 days later, and I have to say, the root cause is pretty interesting – and sadly familiar.

      A bug in the HTTP(S) Load Balancer configuration update process caused it to revert to a configuration that was substantially out of date.
