Version 0.1.0 (Mar 10, 2026)
Error classification for services and control planes.
This module defines Error, the error type used at service boundaries.
It separates failures into three categories:
- domain errors: expected business failures that callers branch on
- transient errors: operational failures that are usually retried
- invariant violations: broken assumptions where the safest action is to abort the current operation
The goal is to make function signatures reflect what a caller can actually do: handle domain failures explicitly, decide retry policy for transient failures, and treat invariant violations as "stop and unwind".
Scope
This type is intended for:
- long-running services and control planes
- orchestration code that calls out to multiple dependencies
- code that needs consistent retry and failure handling
It is not required everywhere. Leaf libraries can expose whatever error
types are natural for them (thiserror enums, anyhow::Error, etc.).
Conversion into Error usually happens at service boundaries.
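As a rough illustration of such a boundary conversion, the sketch below sorts a leaf `std::io::Error` into categories exactly once. The `Error` and `Never` definitions here are simplified stand-ins, not the crate's actual definitions, and `ConfigError` and `classify` are hypothetical names. Treating every other I/O failure as transient is an assumption made for brevity.

```rust
use std::io;

/// Uninhabited type marking a category that cannot occur (stand-in).
#[derive(Debug)]
pub enum Never {}

/// Simplified stand-in for the three-category error type.
#[derive(Debug)]
pub enum Error<D, T, I> {
    Domain(D),
    Transient(T),
    Invariant(I),
}

/// Hypothetical domain failures that callers are expected to branch on.
#[derive(Debug, PartialEq)]
pub enum ConfigError {
    Missing,
    PermissionDenied,
}

/// Boundary conversion: the leaf error is classified here, so upper layers
/// do not need to inspect `io::ErrorKind` themselves.
pub fn classify(e: io::Error) -> Error<ConfigError, io::Error, Never> {
    match e.kind() {
        io::ErrorKind::NotFound => Error::Domain(ConfigError::Missing),
        io::ErrorKind::PermissionDenied => Error::Domain(ConfigError::PermissionDenied),
        // Assumption for this sketch: everything else is worth retrying.
        _ => Error::Transient(e),
    }
}
```

A real service would likely be more selective about which I/O failures count as transient.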
Problem
In practice we saw two unsatisfying extremes:
- fully typed error enums threaded through every layer
- a single untyped anyhow::Error everywhere
Fully typed errors are nice for pattern matching, but make it hard to share generic infrastructure (retries, middleware) and tend to accumulate variants that nobody matches on. A single untyped error is easy to work with, but it does not clearly communicate which failures a caller is expected to handle versus those that are purely operational or unrecoverable.
Design
The Error enum keeps domain failures typed and uses anyhow::Error for
categories where callers rarely pattern match:
Error<D, T = anyhow::Error, I = Never>
- D: domain error type (required)
- T: transient error type (defaults to anyhow::Error; use Never to disallow transients)
- I: invariant violation type (defaults to Never; use anyhow::Error where violations are possible)
Never represents an impossible category. It lets us state at the type
level which failure modes a function can produce.
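A minimal sketch of what such a type could look like is shown below. The variant names (`Domain`, `Transient`, `Invariant`) are assumptions, and `Opaque` is a dependency-free stand-in for anyhow::Error so the example compiles with the standard library only. `PortError`, `parse_port`, and `fetch_port` are hypothetical.

```rust
/// Uninhabited type: it has no variants, so no value of it can exist.
#[derive(Debug)]
pub enum Never {}

/// Stand-in for `anyhow::Error`, used only to keep this sketch std-only.
pub type Opaque = Box<dyn std::error::Error + Send + Sync + 'static>;

/// Assumed shape of the three-category error, with the documented defaults.
#[derive(Debug)]
pub enum Error<D, T = Opaque, I = Never> {
    Domain(D),
    Transient(T),
    Invariant(I),
}

/// Hypothetical domain error for the examples below.
#[derive(Debug, PartialEq)]
pub enum PortError {
    Empty,
    Invalid,
}

/// Domain failures only: both other categories are `Never`.
pub fn parse_port(s: &str) -> Result<u16, Error<PortError, Never, Never>> {
    if s.is_empty() {
        return Err(Error::Domain(PortError::Empty));
    }
    s.parse::<u16>().map_err(|_| Error::Domain(PortError::Invalid))
}

/// Uses the defaults: transient failures are possible, invariant violations are not.
pub fn fetch_port(reachable: bool) -> Result<u16, Error<PortError>> {
    if !reachable {
        return Err(Error::Transient("connection refused".into()));
    }
    Ok(8080)
}
```

Reading the two signatures side by side is the point: the first cannot produce a transient failure, the second can.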
Domain errors (D)
Expected failures in business logic: missing resources, validation failures, conflicts, permission checks, and similar cases. Callers are expected to match on these and take different code paths depending on the variant.
D is usually an enum defined in the calling crate. It should be concrete
and exhaustively matchable.
Transient errors (T)
Operational failures where the usual response is some form of retry or backoff: timeouts, connection failures, rate limiting, dependency overload, and similar cases. Callers typically do not care about the detailed type, only that the failure is transient.
The default choice is anyhow::Error. This gives:
- cheap boxing and downcasting
- rich context via .context(...)
- good interoperability with the rest of the Rust ecosystem
For observability, prefer structured logging and tracing rather than matching on concrete transient error types.
Invariant violations (I)
Situations where an assumed invariant is broken and continuing the current operation is unsafe: corrupted data, impossible state, violated contracts, or code paths that should be unreachable.
The caller cannot recover from these. The correct response is to unwind the
current operation, perform cleanup (rollback transactions, release locks,
close connections), and surface a failure up the stack. This is an
alternative to panic! when the process as a whole is still healthy, but
the current request cannot proceed safely.
Most functions should use Never here. Use anyhow::Error in
low-level or system code that may need to propagate rich context about a
violation.
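To make the "unwind instead of panic" idea concrete, here is a small sketch. It assumes a simplified `Error` enum and uses a plain `Violation` struct in place of anyhow::Error. `transfer`, `TransferError`, and the `expected_total` check are hypothetical and deliberately contrived.

```rust
/// Uninhabited type marking a category that cannot occur (stand-in).
#[derive(Debug)]
pub enum Never {}

/// Simplified stand-in for the three-category error type.
#[derive(Debug)]
pub enum Error<D, T, I> {
    Domain(D),
    Transient(T),
    Invariant(I),
}

/// Hypothetical domain error: an expected business failure.
#[derive(Debug, PartialEq)]
pub enum TransferError {
    InsufficientFunds,
}

/// Stand-in for a rich invariant error such as `anyhow::Error`.
#[derive(Debug, PartialEq)]
pub struct Violation(pub String);

/// Moves `amount` from account 0 to account 1.
/// If the ledger total no longer matches `expected_total`, the change is
/// rolled back and an invariant violation is returned instead of panicking.
pub fn transfer(
    balances: &mut [i64; 2],
    expected_total: i64,
    amount: i64,
) -> Result<(), Error<TransferError, Never, Violation>> {
    if balances[0] < amount {
        return Err(Error::Domain(TransferError::InsufficientFunds));
    }
    let snapshot = *balances; // kept so the operation can be undone
    balances[0] -= amount;
    balances[1] += amount;
    let total = balances[0] + balances[1];
    if total != expected_total {
        *balances = snapshot; // cleanup before surfacing the failure
        return Err(Error::Invariant(Violation(format!(
            "ledger total is {total}, expected {expected_total}"
        ))));
    }
    Ok(())
}
```

The process stays up and the caller sees a failed operation with its state restored.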
Serialization
Serialization is intentionally asymmetric:
- domain errors: serialized structurally (requires D: Serialize)
- transient/invariant errors: serialized only via their Display string
The intent is to discourage shipping internal error details across process boundaries and to encourage explicit API error types at the edges. At network boundaries, transient failures are usually network problems anyway; clients reconstruct their own transient errors based on local failures.
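The asymmetry can be sketched without any serialization library. `WireError` and `to_wire` are hypothetical names, and the `Error` enum is a simplified stand-in. In a real crate the wire type would presumably derive a serialization trait, which is left out here to keep the example std-only.

```rust
use std::fmt::Display;

/// Simplified stand-in for the three-category error type.
#[derive(Debug)]
pub enum Error<D, T, I> {
    Domain(D),
    Transient(T),
    Invariant(I),
}

/// Hypothetical wire representation: the domain error keeps its structure,
/// the other two categories are flattened to their `Display` text.
#[derive(Debug, PartialEq)]
pub enum WireError<D> {
    Domain(D),
    Transient(String),
    Invariant(String),
}

/// Hypothetical domain error with structured data worth preserving.
#[derive(Debug, PartialEq)]
pub enum QuotaError {
    Exceeded { limit: u32 },
}

/// Converts an error for transport. Only `Display` is required of `T` and `I`,
/// so internal error types never need to be serializable.
pub fn to_wire<D, T: Display, I: Display>(e: Error<D, T, I>) -> WireError<D> {
    match e {
        Error::Domain(d) => WireError::Domain(d),
        Error::Transient(t) => WireError::Transient(t.to_string()),
        Error::Invariant(i) => WireError::Invariant(i.to_string()),
    }
}
```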
Usage
Some patterns that have worked well:
// Function that can only fail with domain errors
fn validate(input: &str) -> Result<Data, Error<ValidationError, Never, Never>> {
    // ...
}

// Function that can experience transient failures
fn fetch_user(id: UserId) -> Result<User, Error<UserError, anyhow::Error, Never>> {
    // ...
}

// System-level function that may encounter invariant violations
fn process_request(req: Request) -> Result<Response, Error<ApiError, anyhow::Error, anyhow::Error>> {
    // ...
}
A rough rule of thumb:
- if callers should branch on it, put it in D
- if callers only need to know "retry or not", put it in T
- if the safest response is "stop this operation", put it in I
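Seen from the caller's side, the rule of thumb might play out as in the sketch below. The `Error` enum is a simplified stand-in, and `UserError`, `Action`, and `decide` are hypothetical names chosen for illustration.

```rust
/// Simplified stand-in for the three-category error type.
#[derive(Debug)]
pub enum Error<D, T, I> {
    Domain(D),
    Transient(T),
    Invariant(I),
}

/// Hypothetical domain error the caller branches on.
#[derive(Debug, PartialEq)]
pub enum UserError {
    NotFound,
    Suspended,
}

/// Hypothetical set of reactions available to the caller.
#[derive(Debug, PartialEq)]
pub enum Action {
    ShowSignup,
    AskToWait,
    Retry,
    Abort,
}

/// Domain variants get their own code paths; transient and invariant
/// failures are handled by category, without looking at the payload.
pub fn decide<T, I>(err: &Error<UserError, T, I>) -> Action {
    match err {
        Error::Domain(UserError::NotFound) => Action::ShowSignup,
        Error::Domain(UserError::Suspended) => Action::AskToWait,
        Error::Transient(_) => Action::Retry,
        Error::Invariant(_) => Action::Abort,
    }
}
```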
Alternatives considered
Trait-based error classification
One option was a trait implemented by error types that exposes methods like
is_transient() or is_invariant(). This keeps a single error type but
relies on implementations to be correct. It also makes it harder to express
at a function boundary that a function never returns transient errors; that
becomes a convention instead of something the compiler can check.
With Error<D, T, I> and Never, the type system enforces which
categories are possible.
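What that enforcement could look like in practice is sketched below, assuming an uninhabited `Never` and a simplified `Error` enum. `into_domain` and `widen` are hypothetical helpers, not necessarily part of the crate. The empty `match n {}` arms compile only because `Never` has no values.

```rust
/// Uninhabited type: no value of it can be constructed.
#[derive(Debug)]
pub enum Never {}

/// Simplified stand-in for the three-category error type.
#[derive(Debug)]
pub enum Error<D, T, I> {
    Domain(D),
    Transient(T),
    Invariant(I),
}

/// With `T` and `I` both `Never`, the domain error can be extracted
/// without any fallible step or `unreachable!()`.
pub fn into_domain<D>(e: Error<D, Never, Never>) -> D {
    match e {
        Error::Domain(d) => d,
        Error::Transient(n) => match n {}, // statically impossible
        Error::Invariant(n) => match n {}, // statically impossible
    }
}

/// A domain-only error can be widened into any richer error type,
/// which lets restricted functions compose with less restricted callers.
pub fn widen<D, T, I>(e: Error<D, Never, Never>) -> Error<D, T, I> {
    match e {
        Error::Domain(d) => Error::Domain(d),
        Error::Transient(n) => match n {},
        Error::Invariant(n) => match n {},
    }
}
```

A trait method such as is_transient() returning false offers no equivalent: the compiler cannot use it to rule out a match arm.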
Domain traits for retry
Another option was a trait implemented on domain error enums, used by retry helpers to decide whether to back off or fail fast. This couples domain types to infrastructure concerns and makes those traits part of the public surface area.
By keeping retry decisions on Error<_, T, _> (where "transient" is a type
parameter), domain types remain free of infrastructure logic and retry code
can be reused across services.
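A retry helper written against the category alone might look like the following sketch. `retry` is a hypothetical function and the `Error` enum is a simplified stand-in. Backoff and jitter are omitted to keep the example short.

```rust
/// Simplified stand-in for the three-category error type.
#[derive(Debug)]
pub enum Error<D, T, I> {
    Domain(D),
    Transient(T),
    Invariant(I),
}

/// Calls `op` until it returns something other than a transient error,
/// or until `max_attempts` calls have been made. `op` always runs at
/// least once and receives the 1-based attempt number.
/// Domain errors and invariant violations are returned immediately.
pub fn retry<V, D, T, I, F>(max_attempts: u32, mut op: F) -> Result<V, Error<D, T, I>>
where
    F: FnMut(u32) -> Result<V, Error<D, T, I>>,
{
    let mut attempt = 1;
    loop {
        match op(attempt) {
            // Only the category is inspected; `D` needs no retry-related trait.
            Err(Error::Transient(_)) if attempt < max_attempts => attempt += 1,
            // Success, domain error, invariant violation, or the final
            // transient error once the attempts are used up.
            other => return other,
        }
    }
}
```

Because the helper is generic over `D`, the same function works for any service's domain enum.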