SRE Weekly Issue #242

A message from our sponsor, StackHawk:

StackHawk just raised a $10M Series A. Read the blog by CEO Joni Klippert about what we’ve built and where we are going in our mission to bring application security to developers.
http://sthwk.com/series-a

Articles

The work of SREs and the material we produce can be an excellent source of information to onboard new employees (not just SREs!).

Emily Arnot — Blameless

Having safeguards in your tools to prevent errors is wise. Allowing the user to disable those safeguards when the need arises is even wiser.

Rachel by the bay

Lots of factors contributed to the crash and destruction of this $175 million USD aircraft. The pilot escaped with minor injuries.

Colonel Bryan T. Callahan et al. — USAF

Serverless isn’t going to make ops go away. NoOps is a myth.

Charity Majors — Honeycomb

In this blog post, we’ll present reliability-centric metrics and key performance indicators (KPIs) that show the positive impact that reliability has on businesses.

Andre Newman — Gremlin

“Outage of a CRL server” isn’t the first thing that would come to mind when diagnosing a database connection failure.

Oren Eini — RavenDB

Telltale combines anomaly detection, alerting, dashboarding, and incident management.

Andrei Ushakov, Seth Katz, Janak Ramachandran, Jeff Butsch, Peter Lau, Ram Vaithilingam, and Greg Burrell — Netflix

What?! I had no idea this was possible! You can transfer file descriptors (and the open files they point to) to another process, even outside of the normal parent/child process relationship.

Cindy Sridharan
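
For the curious, here is a minimal sketch of the idea, not taken from the linked article. It assumes Python 3.9+ on a Unix-like system, where `socket.send_fds` and `socket.recv_fds` wrap the `SCM_RIGHTS` ancillary-data mechanism of Unix domain sockets. To stay self-contained it uses a `socketpair` inside one process as a stand-in for two unrelated processes; `pass_fd_demo` is a hypothetical name chosen for illustration.

```python
import os
import socket
import tempfile


def pass_fd_demo() -> bytes:
    """Send an open file descriptor over a Unix domain socket and read through the copy."""
    # A connected socket pair stands in for two separate processes (assumption for the demo).
    sender, receiver = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        with tempfile.TemporaryFile() as f:
            f.write(b"hello via fd passing")
            f.flush()
            # At least one byte of ordinary data must accompany the descriptor.
            socket.send_fds(sender, [b"x"], [f.fileno()])
            _data, fds, _flags, _addr = socket.recv_fds(receiver, 1, 1)
        # The original file object is now closed, but the received descriptor
        # still refers to the same open file, so the data remains readable.
        received_fd = fds[0]
        try:
            os.lseek(received_fd, 0, os.SEEK_SET)
            return os.read(received_fd, 1024)
        finally:
            os.close(received_fd)
    finally:
        sender.close()
        receiver.close()


if __name__ == "__main__":
    print(pass_fd_demo())
```

In a real setup the two ends would live in different processes connected through a named Unix socket, and the receiver would get its own descriptor number for the same open file.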

Outages

  • GeoComply
    • GeoComply, a geo-location service used by most online gaming sites in the US to monitor the physical location of their customers, experienced a major outage.

  • Coinbase
  • Twitter
Updated: November 1, 2020 — 8:21 pm
A production of Tinker Tinker Tinker, LLC