[Initiative]: Reference framework for the levels of Service Reliability Automation #1984

@svrnm

Name

Levels Of Service Reliability Automation

Short description

Create a reference document that provides a framework for "levels of service reliability automation". It lets end users identify where they stand today and how they can improve. Likewise, it lets open source projects and commercial products show where they can help users move from one level to the next.

Responsible group

TAG Operational Resilience

Does the initiative belong to a subproject?

Yes

Subproject name

No response

Primary contact

Severin Neumann, Causely (@svrnm)

Additional contacts

Steffen Geissinger, Causely (@ib-steffen)
Will Hegedus, Linode (@wbh1)
Diana Todea, VictoriaMetrics (@didiViking)
Vitor Vasconcellos, MercadoLibre (@vitorvasc)
Graziano Casto, Mia-Platform (@graz-dev)

Initiative description

Motivation

Reliability today is often framed narrowly as "observability", "incident response" and "troubleshooting". This is especially visible when "AI SREs" are pitched as tools that support humans in the loop or take them out of it, but essentially offer LLM-powered on-call automation focused on those three domains. These tools mostly support reactive troubleshooting.

However, true service reliability (or operational resilience) spans the entire lifecycle: building reliable software, resource management, release management, maintaining and improving ongoing operations, observability, incident response, troubleshooting and more.

With this proposal, we want to show a bigger picture: how reliability engineering tasks (done by SREs, developers, ...) can climb a ladder from manual work to full autonomy.

Goal

The goal of this initiative is to create a reference document that provides a framework for "levels of service reliability automation". It lets end users identify where they stand today and how they can improve. Likewise, it lets open source projects and commercial products show where they can help users move from one level to the next.

Examples

  • In software development, teams might analyze their code for reliability issues manually today, or they might have automation that identifies potential issues and the required changes.
  • For ongoing operations, the team might manage resources manually, or they might have automation that scales resources up and down autonomously within predefined guardrails.
  • For observability, teams might add instrumentation to their code manually, or they might leverage automation that either provides out-of-the-box instrumentation or offers LLM-based guidance on where and how improvements can be made.
  • ...

Inspiration

We borrow the framing from the SAE “Levels of Driving Automation”. Just as cars progress from manual to fully autonomous driving, reliability systems can progress from manual to autonomous reliability. This analogy creates a shared language to describe where the industry is today and where it’s heading. For example, there might be Levels 0 ("manual"), 1 ("rule book automation"), 2 ("reactive assistants"), 3 ("proactive guidance"), 4 ("human in the loop autonomy") and 5 ("full autonomy"). These levels might then be defined for each of the domains called out above.

Note 1: This is not necessarily how the whitepaper needs to be structured, or what the levels need to look like. Also, for some domains it will not be necessary to go through all the levels, and not every level will make sense. The goal of the initiative is to chart that "map" collaboratively.

Note 2: While AI (and especially LLMs) plays a big role in reaching the higher levels of this framework, it is not a necessity. A valuable outcome of this framework might be that it helps to identify where AI is the right tool and where it might be a poor fit or even harmful.
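
To make the idea of "levels per domain" more concrete, here is a minimal sketch of how an end user could record a self-assessment. The level names are taken from the examples above and are not final (see Note 1). The domain values in `assessment` and the `next_level` helper are hypothetical and only serve as illustration.

```python
from enum import IntEnum
from typing import Optional


class AutomationLevel(IntEnum):
    """Example levels from this proposal; names and count are not final."""
    MANUAL = 0
    RULE_BOOK_AUTOMATION = 1
    REACTIVE_ASSISTANTS = 2
    PROACTIVE_GUIDANCE = 3
    HUMAN_IN_THE_LOOP_AUTONOMY = 4
    FULL_AUTONOMY = 5


def next_level(current: AutomationLevel) -> Optional[AutomationLevel]:
    """Return the next level up, or None if already at the highest level."""
    if current == max(AutomationLevel):
        return None
    return AutomationLevel(current + 1)


# Hypothetical self-assessment: the values below are made up for illustration.
assessment = {
    "observability": AutomationLevel.REACTIVE_ASSISTANTS,
    "resource management": AutomationLevel.RULE_BOOK_AUTOMATION,
    "incident response": AutomationLevel.MANUAL,
}

for domain, level in assessment.items():
    target = next_level(level)
    step = target.name if target is not None else "already at the highest level"
    print(f"{domain}: level {level.value} ({level.name}) -> {step}")
```

An ordered enum keeps the levels comparable, which matches the "ladder" framing. A real framework may well need different levels per domain, which this sketch does not model.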

Scope

The scope of this project is to provide that framework and to create a common language and examples per category. It may already position certain projects and/or products to verify the framework's applicability, but it does not aim to provide a complete "landscape"; this may be a follow-up activity.

The scope also excludes building or designing any projects to "fill out" the levels in some of the domains, although the framework might inspire existing projects to take on that task or new projects to emerge.

Deliverable(s) or exit criteria

A shared reference document that contains the framework as outlined above.

Tracking document for meeting and progress

tbd

Metadata

Assignees

No one assigned

    Labels

    kind/initiative: An initiative or an item related to initiative processes
    needs-triage: Indicates an issue or PR that has not been triaged yet (has a 'triage/foo' label applied).

    Type

    No type

    Projects

    Status

    New

    Status

    status/new

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests