[Initiative]: Reference framework for the levels of Service Reliability Automation

### Name

Levels Of Service Reliability Automation

### Short description

Create a reference document that provides a framework for "levels of service reliability automation". It lets end users identify where they stand today and how they can improve. Likewise, it lets open source projects and commercial products show where they can help users move from one level to the next.

### Responsible group

TAG Operational Resilience

### Does the initiative belong to a subproject?

Yes

### Subproject name

_No response_

### Primary contact

Severin Neumann, Causely, (@svrnm)

### Additional contacts

Steffen Geissinger, Causely, (@ib-steffen)
Will Hegedus, Linode, (@wbh1)
Diana Todea, VictoriaMetrics, (@didiViking)
Vitor Vasconcellos, MercadoLibre, (@vitorvasc)
Graziano Casto, Mia-Platform, (@graz-dev)

### Initiative description

## Motivation

Reliability today is often framed narrowly as "observability", "incident response" and "troubleshooting". We see this especially when AI SREs are pitched as either supporting humans in the loop or taking them out of it, while essentially offering LLM-powered on-call automation focused on those three domains. These tools mostly support reactive troubleshooting.

However, true service reliability (or operational resilience) spans the entire lifecycle: from building reliable software, through resource management and release management, to maintaining and improving ongoing operations, observability, incident response, troubleshooting, and more.

With this proposal, we want to show a bigger picture: how reliability engineering tasks (done by SREs, developers, ...) can climb a ladder from manual work to full autonomy.

## Goal

The goal of this initiative is to create a reference document that provides a framework for "levels of service reliability automation". It lets end users identify where they stand today and how they can improve. Likewise, it lets open source projects and commercial products show where they can help users move from one level to the next.

## Examples
 
* In software development, teams might manually analyze their code for reliability issues today, or they might have automation that identifies potential issues and the required changes.
* For ongoing operations, a team might manage resources manually, or it might have automation that scales resources up and down autonomously within predefined guardrails.
* For observability, teams might add instrumentation to their code manually, or they might leverage automation that either provides out-of-the-box instrumentation or offers LLM-based guidance on where and how it can be improved.
* ...
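
As a rough illustration of the resource management example, the following minimal Python sketch shows what "scaling within predefined guardrails" could mean in practice. The `Guardrails` class, its fields, and the `apply_guardrails` function are hypothetical names made up for this illustration; they are not part of the proposed framework or of any existing tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Guardrails:
    """Hypothetical limits a team might predefine for autonomous scaling."""
    min_replicas: int
    max_replicas: int
    max_step: int  # largest change allowed in a single scaling decision


def apply_guardrails(current: int, desired: int, rails: Guardrails) -> int:
    """Return the replica count the automation is actually allowed to set."""
    # Limit how far a single decision may move away from the current state.
    step = max(-rails.max_step, min(rails.max_step, desired - current))
    # Keep the result inside the absolute bounds.
    return max(rails.min_replicas, min(rails.max_replicas, current + step))


rails = Guardrails(min_replicas=2, max_replicas=10, max_step=3)
print(apply_guardrails(current=4, desired=12, rails=rails))  # step limited to +3
print(apply_guardrails(current=9, desired=12, rails=rails))  # limited by max_replicas
print(apply_guardrails(current=3, desired=0, rails=rails))   # limited by min_replicas
```

In this sketch the automation proposes a target on its own, while the guardrails remain a human decision. That split is one possible way to think about the middle rungs of the ladder.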

## Inspiration

We borrow the framing from the [SAE “Levels of Driving Automation”](https://www.sae.org/binaries/content/assets/cm/content/blog/sae-j3016-visual-chart_5.3.21.pdf). Just as cars progress from manual to fully autonomous driving, reliability systems can progress from manual to autonomous reliability. This analogy creates a shared language to describe where the industry is today and where it’s heading. For example, there might be Levels 0 ("manual"), 1 ("rule book automation"), 2 ("reactive assistants"), 3 ("proactive guidance"), 4 ("human in the loop autonomy"), and 5 ("full autonomy"). These levels might then be defined for each of the domains called out above.

Note 1: This is **not** necessarily how this whitepaper needs to be structured, or what those levels need to look like. Also, for some domains it will not be necessary to go through all the levels, or not all levels will make sense. The goal of the initiative is to chart that "map" collaboratively.

Note 2: While AI (and especially LLMs) plays a big role in reaching the higher levels of this framework, it is not a necessity. A valuable outcome of this framework might be that it helps identify where AI is the right tool and where it might be a poor fit or even harmful.
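
To make the idea of a per-domain "map" more concrete, here is a minimal Python sketch that encodes the tentative level names from this section as an ordered enumeration. Everything in it is illustrative: the `AutomationLevel` enum, the `next_level` helper, and the sample self-assessment are assumptions made for this example, and the final levels may look different.

```python
from enum import IntEnum
from typing import Optional


class AutomationLevel(IntEnum):
    """Tentative level names from this proposal; not a final definition."""
    MANUAL = 0
    RULE_BOOK_AUTOMATION = 1
    REACTIVE_ASSISTANTS = 2
    PROACTIVE_GUIDANCE = 3
    HUMAN_IN_THE_LOOP_AUTONOMY = 4
    FULL_AUTONOMY = 5


def next_level(level: AutomationLevel) -> Optional[AutomationLevel]:
    """Return the next rung of the ladder, or None at the top."""
    if level is AutomationLevel.FULL_AUTONOMY:
        return None
    return AutomationLevel(level + 1)


# Hypothetical self-assessment of one team, one entry per domain.
assessment = {
    "resource management": AutomationLevel.RULE_BOOK_AUTOMATION,
    "observability": AutomationLevel.MANUAL,
}

for domain, level in assessment.items():
    nxt = next_level(level)
    target = nxt.name if nxt is not None else "none"
    print(f"{domain}: level {level.value} ({level.name}), next: {target}")
```

Because a team can sit at different levels in different domains, the sketch keeps one level per domain instead of a single overall score.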

## Scope

The scope of this project is to provide that framework, create a common language, and give examples per category. It may already position certain projects and/or products to verify its applicability, but it does not aim to provide a complete "landscape"; that may be a follow-up activity.

The scope also excludes building or designing any projects to "fill out" the levels in some of the domains, although the framework might inspire existing projects to take on that task or new projects to emerge.

### Deliverable(s) or exit criteria

A shared reference document that contains the framework as outlined above.

### Tracking document for meeting and progress

tbd

[Initiative]: Reference framework for the levels of Service Reliability Automation #1984
