
Annual Review of Control, Robotics, and Autonomous Systems

Safe Learning in Robotics: From Learning-Based Control to Safe Reinforcement Learning

Lukas Brunke,1,2,3,∗ Melissa Greeff,1,2,3,∗ Adam W. Hall,1,2,3,∗ Zhaocong Yuan,1,2,3,∗ Siqi Zhou,1,2,3,∗ Jacopo Panerati,1,2,3 and Angela P. Schoellig1,2,3

1 Institute for Aerospace Studies, University of Toronto, Toronto, Ontario, Canada
2 University of Toronto Robotics Institute, Toronto, Ontario, Canada
3 Vector Institute for Artificial Intelligence, Toronto, Ontario, Canada

Annu. Rev. Control Robot. Auton. Syst. 2022. 5:411–44

First published as a Review in Advance on January 26, 2022

The Annual Review of Control, Robotics, and Autonomous Systems is online at [Link]

Copyright © 2022 by Annual Reviews. All rights reserved

∗These authors contributed equally to this article

Keywords

safe learning, robotics, robot learning, learning-based control, safe reinforcement learning, adaptive control, robust control, model predictive control, machine learning, benchmarks

Abstract

The last half decade has seen a steep rise in the number of contributions on safe learning methods for real-world robotic deployments from both the control and reinforcement learning communities. This article provides a concise but holistic review of the recent advances made in using machine learning to achieve safe decision-making under uncertainties, with a focus on unifying the language and frameworks used in control theory and reinforcement learning research. It includes learning-based control approaches that safely improve performance by learning the uncertain dynamics, reinforcement learning approaches that encourage safety or robustness, and methods that can formally certify the safety of a learned control policy. As data- and learning-based robot control methods continue to gain traction, researchers must understand when and how to best leverage them in real-world scenarios where safety is imperative, such as when operating in close proximity to humans. We highlight some of the open challenges that will drive the field of robot learning in the coming years, and emphasize the need for realistic physics-based benchmarks to facilitate fair comparisons between control and reinforcement learning approaches.

1. INTRODUCTION
Robotics researchers strive to design systems that can operate autonomously in increasingly com-
plex scenarios, often in close proximity to humans. Examples include self-driving vehicles (1), aerial
delivery (2), and the use of mobile manipulators for service tasks (3). However, the dynamics of
these complex applications are often uncertain or only partially known—for example, the mass
distribution of a carried payload might not be given a priori. Uncertainties arise from various
sources. For example, the robot dynamics may not be perfectly modeled, sensor measurements
may be noisy, and/or the operating environment may not be well characterized or may include
other agents whose dynamics and plans are not known.


In these real-world applications, robots must make decisions despite having only partial knowl-
edge of the world. In recent years, the research community has multiplied its efforts to leverage
data-based approaches to address this problem. This was motivated in part by the success of ma-
chine learning in other areas, such as computer vision and natural language processing.
A crucial, domain-specific challenge of learning for robot control is the need to implement
and formally guarantee the safety of the robot’s behavior, not only for the optimized policy (or
controller, which is essential for the certification of systems that interact with humans) but also
during learning, to avoid costly hardware failures and improve convergence. Ultimately, these
safety guarantees can only be derived from the assumptions and structure captured by the problem
formalization.
Researchers in both control theory and machine learning—reinforcement learning (RL) in
particular—have proposed approaches to tackle this problem. Control theory has traditionally
taken a model-driven approach (see Figure 1): It leverages a given dynamics model and provides
guarantees with respect to known operating conditions. RL has traditionally taken a data-driven
approach, which makes it highly adaptable to new contexts at the expense of providing formal
guarantees. Combining model-driven and data-driven approaches, and leveraging the advantages
of each, is a promising direction for safe learning in robotics. The methods we review encourage
robustness (by accounting for the worst-case scenarios and taking conservative actions), enable
adaptation (by learning from online observations and adapting to unknown situations), and build
and leverage prediction models (based on a combination of domain knowledge, real-world data,
and high-fidelity simulators).
While control is still the bedrock of all current robot applications, the body of safe RL literature
has ballooned from tens of publications to more than a thousand in just the few years since its most
recent review (4). Physics-based simulation (5), which we leverage in our open-source benchmark
implementation (6, 7), has played an important role in the recent progress of RL; however, the
transfer to real systems remains a research area in itself (8).
Previous review works have focused on specific techniques—for example, learning-based model
predictive control (MPC) (9), iterative learning control (10, 11), model-based RL (12), data-
efficient policy search (13), imitation learning (14), or the use of RL in robotics (15, 16) and
in optimal control (17)—without emphasizing the safety aspect. Recent surveys on safe learning
control have focused on either control-theoretic (18) or RL (19) approaches and do not provide a
unifying perspective.



Figure 1
A comparison of model-driven, data-driven, and combined approaches. Model-driven approaches can accurately model only a small part of the world, with a clear boundary between what can be accurately modeled (and is safe) and what cannot (and is unsafe); their benefit is strong guarantees within specific contexts, and their challenge is generalization to new contexts. Data-driven approaches can learn a model of the world by collecting data over time, but there is no clear boundary between what can be accurately modeled (and is safe) and what cannot (and is unsafe); they are highly generalizable to new contexts, their challenge is providing hard guarantees, and a range of acceptable risk must be defined for exploration. Combined (model plus learning) approaches can learn an accurate model of the world over time by collecting data, with a safe boundary that is well defined and extensible; they are generalizable and safe within defined boundaries, and their challenge is safely and efficiently exploring the unknowns.

In this article, we provide a bird’s-eye view of the most recent work in learning-based control
and RL that implement safety and provide safety guarantees for robot control. We focus on safe
learning control approaches where the data generated by the robot system are used to learn or
modify the feedback controller (or policy). We hope to help shrink the gap between the control
and RL communities by creating a common vocabulary and introducing benchmarks for algorithm
evaluation that can be leveraged by both (20, 21). Our target audience is researchers, with either a
control or RL background, who are interested in a concise but holistic perspective on the problem
of safe learning control. While we do not cover perception, estimation, planning, or multiagent
systems, we do connect our discussion to these additional challenges and opportunities.

2. PRELIMINARIES AND BACKGROUND ON SAFE LEARNING CONTROL
In this review, we are interested in the problem of safe decision-making under uncertainties using
machine learning (i.e., safe learning control). Intuitively, in safe learning control, our goal is to
allow a robot to fulfill a task while respecting a set of safety constraints despite the uncertainties
present in the problem. In this section, we define the safe learning control problem (Section 2.1)
and provide an overview of how the problem of decision-making under uncertainties has tradition-
ally been tackled by the control (Section 2.2) and RL (Section 2.3) communities. We highlight the
main limitations of these approaches and articulate how novel data-based, safety-focused methods
can address these gaps (Section 2.4).

2.1. Problem Statement


We formulate the safe learning control problem as an optimization problem to capture the efforts
of both the control and RL communities. The optimization problem has three main components
(see Figure 2): (a) a system model that describes the dynamic behavior of the robot, (b) a cost
function that defines the control objective or task goal, and (c) a set of constraints that specify
the safety requirements. The goal is to find a controller or policy that computes commands (also called inputs) that enable the system to fulfill the task while respecting given safety constraints.

Figure 2
Block diagram representing the safe learning control approaches reviewed in this article. The safe learning controller combines a control policy (Sections 3.1 and 3.2), which computes inputs that enable the robot to fulfill the task while respecting safety constraints, with an optional safety certificate and filter (Section 3.3), which modifies inputs if they are perceived to be unsafe; both can use a prior system model, a prior cost function, and predefined or prior safety constraints. A data buffer and learning algorithm updates the control policy and/or the safety filter based on data generated by the robot operating environment (observations, cost/reward, and constraint values), whose dynamics are uncertain. The three main components of the safe learning control problem are the cost function J, the system model f, and the constraints c, all of which may be initially unknown. Data are used to update the control policy (see Sections 3.1 and 3.2) or the safety filter (see Section 3.3).
In general, any of the components could be initially unknown or only partially known. Below, we
first introduce each of the three components and then conclude by stating the overall safe learning
control problem.

2.1.1. System model. We consider a robot whose dynamics can be represented by the following discrete-time model:

x_{k+1} = f_k(x_k, u_k, w_k),   (1)

where k ∈ Z≥0 is the discrete-time index; xk ∈ X is the state, with X denoting the state space; uk ∈ U is the input, with U denoting the input space; fk is the dynamics model of the robot; and wk ∈ W is the process noise distributed according to a distribution W. Equation 1 is analogous to the transition function in RL. Throughout this review, we assume direct access to (possibly noisy) measurements of the robot state xk and neglect the problem of state estimation. Equation 1 represents many common robot platforms (e.g., quadrotors, manipulators, and ground vehicles). More complex models (e.g., partial differential equations) may be necessary for other robot designs.

Transition function: a transition probability model Tk(xk+1 | xk, uk) that is commonly used in RL as an alternative representation of the system model

2.1.2. Cost function. The robot's task is defined by a cost function. We consider a finite-horizon optimal control problem with time horizon N. Given an initial state x0, the cost is computed based on the sequence of states x0:N = {x0, x1, . . . , xN} and the sequence of inputs u0:N−1 = {u0, u1, . . . , uN−1}:

J(x_{0:N}, u_{0:N-1}) = l_N(x_N) + \sum_{k=0}^{N-1} l_k(x_k, u_k),   (2)

where lk : X × U → R is the stage cost incurred at each time step k (analogous to discounted rewards in RL), and lN : X → R is the terminal cost incurred at the end of the N-step horizon. The stage and terminal cost functions map the state and input sequences, which may be random variables, to a real number, and may, for example, include the expected value or variance operators.

Figure 3
The three safety levels: soft constraints (safety level I) allow possible minimal violations, probabilistic constraints (safety level II) permit no violations with high probability, and hard constraints (safety level III) permit no violations of the constraints along the path traversed by the robot.
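To make Equations 1 and 2 concrete, the following minimal sketch rolls out a discrete-time model under a policy and accumulates the finite-horizon cost. The double-integrator dynamics, quadratic stage cost, proportional policy, and noise level are illustrative assumptions, not taken from the article.

```python
import numpy as np

def rollout_cost(f, policy, stage_cost, terminal_cost, x0, N, rng):
    """Simulate x_{k+1} = f(x_k, u_k, w_k) for N steps and return J (Equations 1 and 2)."""
    x, J = np.array(x0, dtype=float), 0.0
    for k in range(N):
        u = policy(x, k)
        w = 0.01 * rng.standard_normal(x.shape)  # process noise w_k ~ W (assumed Gaussian here)
        J += stage_cost(x, u)
        x = f(x, u, w)
    return J + terminal_cost(x)

# Illustrative example: double integrator with a proportional controller.
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])
f = lambda x, u, w: A @ x + (B @ u).ravel() + w
policy = lambda x, k: np.array([-1.0 * x[0] - 0.5 * x[1]])
stage_cost = lambda x, u: float(x @ x + 0.1 * u @ u)
terminal_cost = lambda x: float(10.0 * x @ x)

J = rollout_cost(f, policy, stage_cost, terminal_cost, x0=[1.0, 0.0], N=50,
                 rng=np.random.default_rng(0))
print(f"finite-horizon cost J = {J:.3f}")
```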

2.1.3. Safety constraints. Safety constraints ensure, or encourage, the safe operation of the robot and include (a) state constraints Xc ⊆ X, which define the set of safe operating states (e.g., the lane in self-driving); (b) input constraints Uc ⊆ U (e.g., actuation limits); and (c) stability guarantees (e.g., the robot's motion converging to a desired path) (22). To encode the safety constraints, we define nc constraint functions ck(xk, uk, wk) ∈ R^nc, with each constraint c_k^j being a real-valued, time-varying function. Starting with the strongest guarantee, we introduce three levels of safety: hard, probabilistic, and soft constraints (illustrated in Figure 3). In practice, safety levels are often mixed. For example, input constraints are typically hard constraints, but state constraints may be soft constraints (e.g., encouraging small acceleration values). Probabilistic constraints guarantee a constraint with high probability and are often the best we can achieve with a learning approach.

Discounted reward: the typical RL stage cost expressed as a reward rk discounted by γ ∈ [0, 1]: lk = −γ^k rk(xk, uk)

Stability: the boundedness of the system output or state (e.g., asymptotic stability in the sense of Lyapunov requires the system to converge to a desired state)

2.1.3.1. Safety level III: constraint satisfaction guaranteed. The system satisfies hard constraints:

c_k^j(x_k, u_k, w_k) ≤ 0   (3)

for all times k ∈ {0, . . . , N} and constraint indexes j ∈ {1, . . . , nc}.

2.1.3.2. Safety level II: constraint satisfaction with high probability. The system satisfies probabilistic constraints:

Pr( c_k^j(x_k, u_k, w_k) ≤ 0 ) ≥ p^j,   (4)

where Pr(·) denotes the probability and p^j ∈ (0, 1) defines the likelihood of the jth constraint being satisfied, for all times k ∈ {0, . . . , N} and j ∈ {1, . . . , nc}. The chance constraint in Equation 4 is identical to the hard constraint in Equation 3 for p^j = 1.

2.1.3.3. Safety level I: constraint satisfaction encouraged. The system encourages constraint satisfaction. This can be achieved in different ways. One is to add a penalty term to the objective function that discourages the violation of constraints with a high cost. A nonnegative ϵ^j is added to the right-hand side of the inequality shown in Equation 3, for all times k ∈ {0, . . . , N} and j ∈ {1, . . . , nc},

c_k^j(x_k, u_k, w_k) ≤ ϵ^j,   (5)

and an appropriate penalty term l_ϵ(ϵ) ≥ 0, with l_ϵ(ϵ) = 0 ⟺ ϵ = 0, is added to the objective function. The vector ϵ includes all elements ϵ^j and is an additional variable of the optimization problem. Alternatively, although c_k^j(xk, uk, wk) is a stepwise quantity, some approaches aim to provide guarantees on its expected value E[·] only on a trajectory level:

J_{c^j} = E[ \sum_{k=0}^{N-1} c_k^j(x_k, u_k, w_k) ] ≤ d^j,   (6)

where J_{c^j} represents the expected total constraint cost, and d^j defines the constraint threshold. The constraint function can also be discounted as γ^k c_k^j(xk, uk, wk), similar to the stage cost. Note that Equation 6 can represent a probabilistic constraint, for example, if c_k^j takes the form of an indicator function over some state–input set. Then, bounding the expectation translates to bounding the probability of entering this set. Equation 6 can also represent a stepwise hard constraint. For example, nonnegative c_k^j and d^j = 0 effectively reduce the bounded expectation to c_k^j(xk, uk, wk) = 0, or zero violation across all steps.
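As a rough illustration of the three safety levels, the sketch below evaluates a single constraint c(x, u, w) ≤ 0 on sampled trajectories in the hard sense (Equation 3), in the probabilistic sense via a Monte Carlo estimate (Equation 4), and in the soft sense through the penalty of Equation 5 and the expected total constraint cost of Equation 6. The sampled constraint values, thresholds, and penalty weight are illustrative assumptions.

```python
import numpy as np

def hard_satisfied(c_vals):
    """Safety level III (Equation 3): c_k <= 0 at every step of every sampled trajectory."""
    return bool(np.all(c_vals <= 0.0))

def chance_satisfied(c_vals, p=0.95):
    """Safety level II (Equation 4), estimated by Monte Carlo over sampled noise realizations:
    at each time step, the fraction of samples with c_k <= 0 must be at least p."""
    return bool(np.all(np.mean(c_vals <= 0.0, axis=0) >= p))

def soft_penalty(c_vals, weight=100.0):
    """Safety level I (Equation 5): slack eps_k = max(c_k, 0) with penalty l_eps = weight * sum(eps)."""
    eps = np.maximum(c_vals, 0.0)
    return float(weight * eps.sum())

def expected_constraint_cost(c_vals, d=0.0):
    """Trajectory-level form (Equation 6): E[sum_k c_k] <= d, estimated over sampled trajectories."""
    Jc = float(np.mean(c_vals.sum(axis=1)))
    return Jc, Jc <= d

# c_vals[m, k]: constraint value at step k of sampled trajectory m (random here, for illustration).
rng = np.random.default_rng(1)
c_vals = rng.normal(loc=-0.5, scale=0.3, size=(200, 50))
print("hard:", hard_satisfied(c_vals))
print("chance (p=0.95):", chance_satisfied(c_vals))
print("soft penalty:", soft_penalty(c_vals))
print("expected total constraint cost:", expected_constraint_cost(c_vals))
```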

2.1.4. Formulation of the safe learning control problem. The functions introduced above—the system model f, the constraints c, and the cost function J—represent the true functions of the robot control problem. In practice, f, c, and J may be unknown or partially known. Without loss of generality, we assume that each of the true functions f, c, and J can be decomposed into a nominal component \bar{(·)}, reflecting our prior knowledge, and an unknown component \hat{(·)}, to be learned from data. For instance, the dynamics model f can be decomposed as

f_k(x_k, u_k, w_k) = \bar{f}_k(x_k, u_k) + \hat{f}_k(x_k, u_k, w_k),   (7)

where f̄ is the prior dynamics model and f̂ represents the uncertain dynamics.

Safe learning control leverages our prior knowledge P = {f̄, c̄, J̄} and the data collected from the system D = {x^(i), u^(i), c^(i), l^(i)}_{i=0}^{i=D} to find a policy (or controller) πk(xk) that achieves the given task while respecting all safety constraints:

safe learning control : (P, D) → π_k,   (8)

where (·)^(i) denotes a sample of a quantity (·)k and D is the data set size. More specifically, we aim to find a policy πk that best approximates the true optimal policy π_k^∗, which is the solution to the following optimization problem:

J^{π^∗}(\bar{x}_0) = \min_{π_{0:N-1}, ϵ} J(x_{0:N}, u_{0:N-1}) + l_ϵ(ϵ)   (9a)

subject to x_{k+1} = f_k(x_k, u_k, w_k), w_k ∼ W, ∀k ∈ {0, . . . , N − 1},   (9b)

u_k = π_k(x_k),   (9c)

x_0 = \bar{x}_0,   (9d)

safety constraints according to either Equations 3–5 or Equation 6, and ϵ ≥ 0,   (9e)

where x̄0 ∼ X0 is the initial state, with X0 being the initial state distribution, and ϵ and lϵ are introduced to account for the soft safety constraint case (safety level I, Equation 5) and are set to zero, for example, if only hard and probabilistic safety constraints are considered (safety levels II and III).
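The decomposition in Equation 7 is often exploited in practice by fitting only the residual between observed transitions and the prior model. The sketch below fits a linear residual model by least squares from a data set of sampled transitions; the prior model, feature choice, and data-generating system are illustrative assumptions rather than a method from the article.

```python
import numpy as np

def fit_residual(X, U, X_next, f_bar):
    """Fit f_hat in x_{k+1} ≈ f_bar(x_k, u_k) + f_hat(x_k, u_k) by linear least squares
    on the residuals r^(i) = x_next^(i) - f_bar(x^(i), u^(i))."""
    residuals = X_next - np.array([f_bar(x, u) for x, u in zip(X, U)])
    features = np.hstack([X, U, np.ones((len(X), 1))])        # simple affine features
    W, *_ = np.linalg.lstsq(features, residuals, rcond=None)  # residual model weights
    return lambda x, u: np.concatenate([x, u, [1.0]]) @ W

# Illustrative data: a "true" system that differs from the prior by an unmodeled drag term.
rng = np.random.default_rng(2)
f_bar = lambda x, u: x + 0.1 * u                   # prior model
f_true = lambda x, u: x + 0.1 * u - 0.05 * x       # true dynamics (unknown to the designer)
X = rng.uniform(-1, 1, size=(200, 2))
U = rng.uniform(-1, 1, size=(200, 2))
X_next = np.array([f_true(x, u) for x, u in zip(X, U)]) + 0.001 * rng.standard_normal((200, 2))
f_hat = fit_residual(X, U, X_next, f_bar)

x, u = np.array([0.5, -0.2]), np.array([0.1, 0.3])
print("prior:", f_bar(x, u), "corrected:", f_bar(x, u) + f_hat(x, u), "true:", f_true(x, u))
```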
2.2. A Control Theory Perspective

Parametric model: a model that depends on a finite number of parameters that typically reflect our prior knowledge about the system structure

Lyapunov function: a positive definite function L : X → R≥0 used to prove the stability of a dynamical system (with respect to an equilibrium)

Safe decision-making under uncertainty has a long history in the field of control theory. Typical assumptions are that a model of the system is available and that it either is parameterized by an unknown parameter or has bounded unknown dynamics and noise. While the control approaches are commonly formulated using continuous-time dynamic models, they are usually implemented in discrete time with sampled inputs and measurements.

Adaptive control typically considers systems modeled as parametric models with uncertain parameters and adapts the controller or model online to optimize performance. Adaptive control requires knowledge of the parametric form of the uncertainty (23) and typically considers a dynamics model that is affine in u and the uncertain parameters θ ∈ Θ:

x_{k+1} = \bar{f}_x(x_k) + \bar{f}_u(x_k) u_k + \bar{f}_\theta(x_k) \theta,   (10)

where f̄x, f̄u, and f̄θ are known functions often derived from first principles and Θ is a possibly bounded parameter set. The control input is uk = π(xk, θ̂k), which is parameterized by θ̂k. The parameter θ̂k is adapted by using either a Lyapunov function to guarantee that the closed-loop system is stable or model reference adaptive control (MRAC) to make the system behave as a pre-
defined stable reference model (23). Adaptive control is typically limited to parametric uncertain-
ties and relies on a specific model structure. Moreover, adaptive control approaches tend to overfit
to the latest observations, and convergence to the true parameters is generally not guaranteed (23,
24). These limitations motivate the learning-based adaptive control approaches in Section 3.1.1.
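To illustrate the kind of online adaptation described above, the sketch below updates the parameter estimate of the model in Equation 10 with a normalized-gradient step driven by the one-step prediction error. This is a generic certainty-equivalence sketch with assumed system functions and gains; the Lyapunov-based and MRAC adaptation laws in the literature are similarly error driven but come with formal stability guarantees.

```python
import numpy as np

def adapt_parameters(theta_hat, x, u, x_next, f_bar_x, f_bar_u, f_bar_theta, gain=0.5):
    """Normalized-gradient update of theta_hat from the one-step prediction error of the
    model in Equation 10: x_{k+1} = f_bar_x(x) + f_bar_u(x) u + f_bar_theta(x) theta."""
    phi = f_bar_theta(x)                                   # regressor of the uncertain parameters
    pred_error = x_next - (f_bar_x(x) + f_bar_u(x) @ u + phi @ theta_hat)
    return theta_hat + gain * (phi.T @ pred_error) / (0.01 + np.sum(phi ** 2))

# Illustrative 1D system with an unknown drag coefficient theta (all values assumed).
f_bar_x = lambda x: x
f_bar_u = lambda x: np.array([[0.1]])
f_bar_theta = lambda x: np.array([[-0.1 * x[0]]])
theta_true, theta_hat = np.array([0.8]), np.array([0.0])

rng = np.random.default_rng(3)
x = np.array([1.0])
for k in range(100):
    u = np.array([rng.uniform(-1.0, 1.0)])                 # exciting input for identification
    x_next = f_bar_x(x) + f_bar_u(x) @ u + f_bar_theta(x) @ theta_true \
             + 0.001 * rng.standard_normal(1)
    theta_hat = adapt_parameters(theta_hat, x, u, x_next, f_bar_x, f_bar_u, f_bar_theta)
    x = x_next
print("estimated theta:", theta_hat, "true theta:", theta_true)
```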
Robust control is a control design technique that guarantees stability for prespecified bounded
disturbances, which can include unknown dynamics and noise. In contrast to adaptive control,
which adapts to the parameters currently present, robust control finds a suitable controller for all
possible disturbances and keeps the controller unchanged after the initial design. Robust control is
limited largely to linear time-invariant systems with a linear nominal model f̄ (xk , uk ) = Āxk + B̄uk
and unknown dynamics f̂k (xk , uk , wk ) = Âxk + B̂uk + wk ∈ D, with D being known and bounded
and Ā, B̄, Â, and B̂ being static matrices of appropriate size—that is,

x_{k+1} = (\bar{A} + \hat{A}) x_k + (\bar{B} + \hat{B}) u_k + w_k.   (11)

Robust control design techniques, such as robust H∞ and H2 control design (25), yield controllers
that are robustly stable for all f̂k ∈ D. Robust control can be extended to nonlinear systems whose
dynamics can be decomposed into a linear nominal model f̄ and a nonlinear function f̂ with
known bound f̂ ∈ D (26).
Robust MPC extends classical adaptive and robust control by additionally guaranteeing state
and input constraints, xk ∈ Xc and uk ∈ Uc , for all possible bounded disturbances f̂ ∈ D. At every
time step k, MPC solves a constrained optimization problem over a control input sequence
uk:k+H−1 for a finite horizon H, and applies the first optimal control input to the system and
then resolves the optimization problem in the next time step based on the current state (27). A
common approach in robust MPC is tube-based MPC (28), which uses a nominal prediction
model f̄ (xk , uk ) in the MPC optimization and tightens the constraints to account for unmodeled
dynamics. A stabilizing controller keeps the true state inside a bounded set of states around
the nominal state, called a tube, for all possible disturbances. Since the nominal states satisfy
the tightened constraints and the true states stay inside the tube around the nominal states,
constraint satisfaction for the true states is guaranteed. Tube-based MPC typically considers a

linear nominal model f̄(x̄k, ūk) = Āx̄k + B̄ūk with nominal state x̄k and input ūk. In its simplest implementation, prior knowledge of set D is combined with a known stabilizing linear controller, uk,stab = K(xk − x̄k) with gain K, to determine the bounded tube Ω_tube from the matrices Ā, B̄, and K and the set D. The stabilizing controller uk,stab keeps all potential errors within the tube, xk − x̄k ∈ Ω_tube, for all k. For the nominal model, tube-based MPC solves the following constrained optimization problem at every time step k to obtain the optimal sequence ū∗_{0:H−1}:

J^∗_{RMPC}(\bar{x}_k) = \min_{\bar{u}_{0:H-1}} l_H(z_H) + \sum_{i=0}^{H-1} l_i(z_i, \bar{u}_i)   (12a)

subject to z_{i+1} = \bar{A} z_i + \bar{B} \bar{u}_i, ∀i ∈ {0, . . . , H − 1},   (12b)

z_i ∈ X_c ⊖ Ω_tube, \bar{u}_i ∈ U_c ⊖ K Ω_tube,   (12c)

z_0 = \bar{x}_k, z_H ∈ X_term,   (12d)

where zi is the open-loop nominal state at time step k + i, and Xc ⊖ Ω_tube = {x ∈ X : x + ω ∈ Xc, ∀ω ∈ Ω_tube} and Uc ⊖ KΩ_tube = {u ∈ U : u + Kω ∈ Uc, ∀ω ∈ Ω_tube} are, respectively, the state and input constraints tightened using the bounded tube Ω_tube, with ⊖ denoting the Pontryagin difference. Combined with the stabilizing control input uk,stab, the control input uk = uk,stab + ū∗_0 is applied to the system at every time step. Stability and the satisfaction of the tightened constraints are guaranteed by selecting the terminal cost lH in Equation 12a such that after the prediction horizon H the nominal state zH is within a terminal constraint set Xterm (see Equation 12d), at which point a known linear controller can be safely applied (27).
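A minimal sketch of the tube-based MPC problem in Equations 12a–12d for a double integrator follows, using box state and input constraints tightened by a fixed tube margin. The model, costs, bounds, and tube size are illustrative assumptions, the terminal ingredients are simplified to a terminal equality constraint, and cvxpy is used for the solve; this is not a drop-in implementation of the cited methods.

```python
import numpy as np
import cvxpy as cp

# Illustrative double-integrator nominal model and tube parameters (all values assumed).
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])
K = np.array([[-5.0, -3.0]])          # stabilizing ancillary controller u_stab = K (x - z)
H = 20                                # prediction horizon
x_max, u_max = np.array([2.0, 1.0]), 2.0
tube = np.array([0.1, 0.1])           # elementwise bound on x - z (stand-in for Omega_tube)
u_tight = (np.abs(K) @ tube).item()   # input tightening induced by K * Omega_tube

def tube_mpc_input(x_meas, z0):
    """Solve Equations 12a-12d for the nominal trajectory z and return u = u_stab + u_bar_0."""
    z = cp.Variable((2, H + 1))
    u_bar = cp.Variable((1, H))
    cost, constraints = 0, [z[:, 0] == z0]
    for i in range(H):
        cost += cp.sum_squares(z[:, i]) + 0.1 * cp.sum_squares(u_bar[:, i])
        constraints += [z[:, i + 1] == A @ z[:, i] + B @ u_bar[:, i],
                        cp.abs(z[:, i]) <= x_max - tube,          # X_c tightened by the tube
                        cp.abs(u_bar[:, i]) <= u_max - u_tight]   # U_c tightened by K * tube
    cost += 10 * cp.sum_squares(z[:, H])
    constraints += [z[:, H] == 0]                                 # simplified terminal set X_term
    cp.Problem(cp.Minimize(cost), constraints).solve()
    u_stab = K @ (x_meas - z0)                                    # keeps the true state in the tube
    return (u_stab + u_bar.value[:, 0]).item(), z.value[:, 1]

u0, z1 = tube_mpc_input(x_meas=np.array([0.5, 0.0]), z0=np.array([0.5, 0.0]))
print("applied input:", u0)
```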
Both robust control and robust MPC are conservative, as they guarantee stability and—in the
case of MPC—state and input constraints for the worst-case scenario. This usually yields poor
performance (9). For example, a conservative uncertainty set D generates a large tube Ω_tube, re-
sulting in tight hard constraints that are prioritized over cost optimization. Learning-based robust
control and robust MPC improve performance by using data to (a) learn a less conservative state-
and input-dependent uncertainty set D and/or (b) learn the unknown dynamics f̂ and, as a result,
reduce the remaining model uncertainty (see Sections 3.1.2 and 3.1.3, respectively).

2.3. A Reinforcement Learning Perspective


RL is the standard machine learning framework to address the problem of sequential decision-
making under uncertainty. Unlike traditional control, RL generally does not rely on an a pri-
ori dynamics model f̄ and can be directly applied to uncertain dynamics f . However, the lack of
explicit assumptions and constraints in many of the works limits their applicability to safe con-
trol. RL algorithms attempt to find π∗ while gathering data and knowledge of f from interaction
with the system—initially taking random actions and then improving afterward. A long-standing
challenge of RL, which hampers safety during the learning stages, is the exploration–exploitation
dilemma—that is, whether to (a) act greedily with the available data or (b) explore, which means
taking suboptimal (and possibly unsafe) actions u to learn a more accurate f̂ .
RL typically assumes that the underlying control problem is a Markov decision process (MDP).
An MDP comprises a state space X, an input (action) space U, stochastic dynamics (also called a
transition model), and a per-step reward function. When all the components of an MDP are known
(in particular, f and J from Section 2.1), then it solves the problem in Equations 9a–9c without
the constraints. Dynamic programming algorithms such as value and policy iteration can be used
to find an optimal policy π∗ . Many RL approaches, however, make no assumptions on any part of
f being known a priori.



We can distinguish model-based RL approaches, which learn an explicit model f̂ of the system dynamics f and use it to optimize a policy, from model-free RL algorithms. The latter algorithms (29) can be broadly categorized as (a) value function–based methods, learning an action-value function; (b) policy-search and policy-gradient methods, directly trying to find an optimal policy π∗; and (c) actor–critic methods, learning both a value function (critic) and the policy (actor). We also note that the convergence of these methods has been shown for simple scenarios but is still a challenge for more complex scenarios or when function approximators are used (30).

Value function: under policy πk, a function of state xk, V^{πk}(xk), equal to the expected return J of applying πk from xk

Action-value function: under πk, a function of action uk and state xk, Q^{πk}(xk, uk), equal to the expected return J when taking the action uk at xk and then following πk

There are multiple practical hurdles to the deployment of RL algorithms in real-world robotics problems (8). These challenges include (a) the continuous, possibly high-dimensional X and U in robotics (often assumed to be finite, discrete sets in RL); (b) the stability and convergence of the learning algorithm (31) (necessary, albeit not sufficient, to produce a stable policy); (c) learning robust policies from limited samples; (d) the interpretability of the learned policy, especially in deep RL when leveraging neural networks for function approximation (29); and, importantly, (e) providing provable safety guarantees.

Neural network: a computational model with interconnected layers of neurons (parameterized by weights) that can be used to approximate highly nonlinear functions

The exploration–exploitation dilemma can be mitigated using Bayesian inference. This is achieved by computing posterior distributions over the states X, possible dynamics f, or total cost J from past observed data (32). These posteriors provide explicit knowledge of the problem's uncertainty and can be used to guide exploration. In practice, however, full posterior inference is almost always prohibitively expensive, and concrete implementations must rely on approximations.

To achieve constraint satisfaction (over states or sequences of states) and robustness to different, possibly noisy dynamics f, constrained MDPs (CMDPs) and robust MDPs are extensions of traditional MDPs that more closely resemble the problem statement in Section 2.1.

CMDPs (33) extend simple MDPs with constraints and optimize the problem in Equation 9 when Equation 9e takes the discounted form of safety level I's Equation 6. We refer to the discounted constraint cost J_{c^j} under policy π as J^π_{c^j}. Traditional approaches to solve CMDPs, such as linear programming and Lagrangian methods, often assume discrete state–action spaces and cannot properly scale to complex tasks such as robot control. Deep RL promises to mitigate this problem, yet applying it to the constrained problem still suffers from the computational complexity of the off-policy evaluation (not to be confused with offline RL) of trajectory-level constraints J_{c^j} (34). In Section 3.2.3, we present some recent advances in CMDP-based work that feature (a) integration of deep learning techniques in CMDPs for more complex control tasks, (b) provable constraint satisfaction throughout the exploration or learning process, and (c) constraint transformation for the efficient evaluation of J_{c^j} from data collected off-policy.

Off-policy evaluation: in RL, improving the value function estimates with data collected operating under different policies; on-policy updates, by contrast, use only data generated by the current policy

Offline RL: RL approaches that use data previously collected by a (possibly unknown) policy but do not interact with the environment until deployed

Robust MDPs (35), inspired by robust control (see Section 2.2), extend MDPs such that the dynamics can include parametric uncertainties or disturbances, and the cost of the worst-case scenario is optimized. This is captured by the min–max optimization problem:

J^{π^∗}(\bar{x}_0) = \min_{π_{0:N-1}} \max_{\hat{f} ∈ D} J(x_{0:N}, u_{0:N-1})   (13a)

subject to Equations 9b–9d,   (13b)

where D is a given uncertainty set of f̂. To keep solutions tractable, practical implementations typically restrict D to certain classes of models. This can limit the applicability of robust MDPs beyond toy problems. Recent work (36–38) applied deep RL to robust decision-making, targeting key theoretical and practical hurdles such as how to effectively model uncertainty with deep neural networks (DNNs) and how to efficiently solve the min–max optimization (e.g., via sampling or two-player, game-theoretic formulations). These ideas, including adversarial RL and domain randomization, are presented in Section 3.2.4.
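One common practical approximation of the min–max problem in Equation 13 is to sample a finite set of dynamics from the uncertainty set, as in domain randomization, and to optimize the policy against the worst sampled case. The sketch below does this for a one-dimensional system with a randomized parameter, using a simple search over linear feedback gains; all models and numbers are illustrative assumptions, not a method from the cited works.

```python
import numpy as np

def worst_case_cost(gain, thetas, N=50, x0=1.0):
    """Evaluate a linear policy u = -gain * x on each sampled dynamics model
    x_{k+1} = theta * x + u and return the worst (largest) finite-horizon cost,
    approximating the inner max over f_hat in D in Equation 13."""
    costs = []
    for theta in thetas:
        x, J = x0, 0.0
        for _ in range(N):
            u = -gain * x
            J += x ** 2 + 0.1 * u ** 2
            x = theta * x + u
        costs.append(J)
    return max(costs)

rng = np.random.default_rng(4)
thetas = rng.uniform(0.8, 1.2, size=20)          # sampled uncertainty set D (domain randomization)
candidate_gains = np.linspace(0.0, 1.5, 61)      # outer min via simple search over policies
best_gain = min(candidate_gains, key=lambda g: worst_case_cost(g, thetas))
print("robust gain:", best_gain,
      "worst-case cost:", round(worst_case_cost(best_gain, thetas), 2))
```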


2.4. Bridging Control Theory and Reinforcement Learning for Safe Learning Control

Gaussian process (GP): a probabilistic model specifying a distribution over functions

When designing a learning-based controller, we typically have two sources of information: our prior knowledge and data generated by the robot system. Control approaches rely on prior knowledge and on assumptions such as parametric dynamics models to provide safety guarantees. RL approaches typically make use of expressive learning models to extract patterns from data that facilitate learning complex tasks, but these can impede the provision of any formal guarantees. In recent literature, we see an effort from both the control and RL communities to develop safe learning control algorithms, with the goal of systematically leveraging expressive models for closed-loop control (see Figure 4). Questions that arise from these efforts include how control-theoretic tools can be applied to expressive machine learning models and how expressive models can be incorporated into control frameworks.

Expressive learning models can be categorized as deterministic (e.g., standard DNNs) or probabilistic [e.g., Gaussian processes (GPs) and Bayesian linear regression]. Deep learning techniques such as feedforward neural networks, convolutional neural networks, and long short-term memory networks have the advantage of being able to abstract large volumes of data, enabling real-time execution in a control loop. On the other hand, their probabilistic counterparts, such as GPs and Bayesian linear regression, provide model output uncertainty estimates that can be naturally blended into traditional adaptive and robust control frameworks. We note that there are approaches aiming to combine the advantages of the two types of learning (e.g., Bayesian neural networks), and quantifying uncertainty in deep learning is still an active research direction.
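As a small illustration of a probabilistic learning model providing output uncertainty, the sketch below fits a GP to one-dimensional residual-dynamics data with scikit-learn and queries both the mean and the standard deviation; the data-generating function and kernel settings are illustrative assumptions.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Illustrative residual-dynamics data: f_hat(x) = 0.3 * sin(2x) observed with noise.
rng = np.random.default_rng(5)
X_train = rng.uniform(-2.0, 2.0, size=(30, 1))
y_train = 0.3 * np.sin(2.0 * X_train).ravel() + 0.01 * rng.standard_normal(30)

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0) + WhiteKernel(noise_level=1e-4),
                              normalize_y=True)
gp.fit(X_train, y_train)

# Mean and uncertainty at query states; the standard deviation is what the approaches in
# Section 3 feed into uncertainty sets, constraint tightening, or confidence-based blending.
X_query = np.linspace(-3.0, 3.0, 5).reshape(-1, 1)
mean, std = gp.predict(X_query, return_std=True)
for x, m, s in zip(X_query.ravel(), mean, std):
    print(f"x = {x:+.1f}: f_hat ~ {m:+.3f} +/- {2 * s:.3f} (2 std)")
```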

Figure 4
Summary of the safe learning control approaches reviewed in Section 3, organized along two axes: increasing safety guarantees (from no guarantees, through soft constraint satisfaction at safety level I and probabilistic constraint satisfaction at safety level II, to hard constraint satisfaction, stability, and constraint-set guarantees at safety level III) and increasing reliance on data under imperfect prior knowledge (from known dynamics, through prior linear, control-affine, structured nonlinear, and generic nonlinear dynamics, to unknown dynamics). Standard control approaches and standard RL anchor the two extremes; the approaches in between include safely learning uncertain dynamics (learning adaptive control, Section 3.1.1; learning robust control, Section 3.1.2; learning robust MPC, Section 3.1.3; safe model-based RL, Section 3.1.4), RL that encourages safety and robustness (safe exploration and optimization, Section 3.2.1; risk-averse and uncertainty-aware RL, Section 3.2.2; constrained MDPs and RL, Section 3.2.3; robust MDPs and RL, Section 3.2.4), and safety certification of expressive models (stability, Section 3.3.1; constraint sets, Section 3.3.2). Abbreviations: MDP, Markov decision process; MPC, model predictive control; RL, reinforcement learning.


In this review, we focus on approaches that address the problem of safe learning control at two stages: (a) online adaptation or learning, where online data are used to adjust the parameters of the controller, the robot dynamics model, the cost function, or the constraint functions during closed-loop operation, and (b) offline learning, where data collected from each trial are recorded and used to update a model in a batch manner in between trials of closed-loop operation. In safe learning control, data are generally used to address the issue of uncertainties in the problem formulation and reduce the conservatism in the system design, while the safety aspect boils down to knowing what is unknown and cautiously accounting for the incomplete knowledge via algorithm design.

Lipschitz continuity: a function h : A → B for which a bounded change in input yields a bounded change in output

3. SAFE LEARNING CONTROL APPROACHES


The ability to guarantee safe robot control is inevitably dependent on the amount of prior knowl-
edge available and the types of uncertainties present in the problem of interest. In this section, we
discuss approaches for safe learning control in robotics based on the following three categories
(see Figures 2 and 4):
■ Learning uncertain dynamics to safely improve performance: These works rely on an a priori model of the robot dynamics. The robot's performance is improved by learning the uncertain dynamics from data. Safety is typically guaranteed based on standard control-theoretic frameworks, achieving safety level II or III.
■ Encouraging safety and robustness in RL: These works encompass approaches that usually do not have knowledge of an a priori robot model or the safety constraints. Rather than providing hard safety guarantees, these approaches encourage safe robot operation (safety level I), for example, by penalizing dangerous actions.
■ Certifying learning-based control under dynamics uncertainty: These works aim to provide safety certificates for learning-based controllers that do not inherently consider safety constraints. These approaches modify the learning controller output by constraining the control policy, leveraging a known safe backup controller, or modifying the controller output directly to achieve stability and/or constraint satisfaction. They typically achieve safety level II or III.

Figure 4 categorizes the approaches reviewed in this section based on their safety level and
reliance on data. A more detailed summary of the approaches can be found in Supplemental
Table 1 in the Supplemental Material.

3.1. Learning Uncertain Dynamics to Safely Improve Performance


In this section, we consider approaches that improve the robot’s performance by learning the un-
certain dynamics and provide safety guarantees (safety level II or III) via control frameworks such
as adaptive control, robust control, and robust MPC (outlined in Section 2.2). These approaches
make assumptions about the unknown parts of the problem (e.g., Lipschitz continuity) and often
rely on a known model structure (e.g., a control-affine or linear system with bounded uncertainty)
to prove stability and/or constraint satisfaction (see Figure 4).

3.1.1. Integrating machine learning and adaptive control. There are three main ideas to
incorporating online machine learning into traditional adaptive control (see Section 2.2), each
with its own distinct benefits: (a) using black-box machine learning models to accommodate non-
parametric unknown dynamics; (b) using probabilistic learning and explicitly accounting for the
learned model uncertainties, in order to achieve cautious adaptation; and (c) augmenting adaptive control with deep learning approaches for experience memorization in order to minimize the need
for readaptation.

3.1.1.1. Learning nonparametric unknown dynamics with machine learning models. One
goal of integrating adaptive control and machine learning is to improve the performance of a
robot subject to nonparametric dynamics uncertainties. Instead of using Equation 10, we con-
sider a system with nominal dynamics f̄ (xk , uk ) = Āxk + B̄uk and unknown dynamics f̂ (xk , uk ) =
B̄ ψk (xk ), where ψk (xk ) is an unknown nonlinear function without an obvious parametric struc-
ture. Learning-based MRAC approaches (39, 40) make the uncertain system behave as the linear
nominal model f̄ by using a combination of L1 adaptation (41) and online learning to approxi-
mate ψk (xk ) by ψ̂k (xk ) = ψ̂L1 ,k + ψ̂learn,k (xk ), where ψ̂L1 ,k is the input computed by the L1 adaptive
controller and ψ̂learn,k is the input computed by the learning module, which can be a neural net-
work (39) or a GP (40). The estimated ψ̂k (xk ) is then used in the controller πk (xk ) to account
for the unknown nonlinear dynamics ψk (xk ), improving a linear nominal control policy designed
based on f̄ . In these approaches, fast adaptation and stability guarantees are provided by the L1
adaptation framework (safety level III), while the learning module provides additional flexibility
to capture the unknown dynamics. The addition of learning improves the performance of the
standard L1 adaptive controller and allows fast adaptation to be achieved at a lower sampling rate
(39).
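The division of labor in these learning-based MRAC schemes can be summarized as in the structural sketch below: a learned model captures the repeatable part of the unknown dynamics, a fast adaptive estimate tracks what the learned model misses, and their sum is canceled in the control law. The estimators and numbers are placeholders, not a reproduction of the cited L1 designs.

```python
import numpy as np

class LearningBasedMRAC:
    """Structural sketch of psi_hat(x) = psi_hat_L1 + psi_hat_learn(x) (Section 3.1.1.1).
    Both components are placeholders here: `learned_model` stands in for a trained NN or
    GP mean, and the fast term uses a simple error-driven update instead of an L1 law."""

    def __init__(self, learned_model, adaptation_gain=0.8):
        self.learned_model = learned_model      # repeatable part of the unknown dynamics
        self.psi_hat_fast = 0.0                 # fast adaptive estimate (scalar example)
        self.gain = adaptation_gain

    def update(self, prediction_error):
        # Fast adaptation driven by the error between predicted and measured state.
        self.psi_hat_fast += self.gain * prediction_error

    def control(self, x, u_nominal):
        # Cancel the estimated unknown input-channel dynamics in the commanded input.
        psi_hat = self.psi_hat_fast + self.learned_model(x)
        return u_nominal - psi_hat

# Illustrative use: the learned model knows part of the disturbance; adaptation tracks the rest.
learned = lambda x: 0.2 * np.sin(x)
ctrl = LearningBasedMRAC(learned)
ctrl.update(prediction_error=0.05)
print("commanded input:", ctrl.control(x=1.0, u_nominal=-0.5))
```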

3.1.1.2. Cautious adaptation with probabilistic model learning. Another set of adaptive con-
trol approaches leverage probabilistic models to achieve cautious adaptation by weighting the
contribution of the learned model based on the model output uncertainty. We consider a system
with nominal dynamics f̄ (xk , uk ) = f̄x (xk ) + f̄u (xk )uk and unknown dynamics of the same form,
f̂ (xk , uk ) = f̂x (xk ) + f̂u (xk )uk , where f̂x (xk ) and f̂u (xk ) are unknown nonparametric nonlinear func-
tions. In a model inversion–based MRAC framework, an approximate feedback linearization is
achieved via the nominal model to facilitate the design of MRAC, and a GP-based adaptation
approach is used to compensate for feedback linearization errors due to the unknown dynamics
(42). To account for the uncertainty in the GP model learning, the controller relies on the GP
model only if the confidence in the latter is high: πk (xk ) = πnom (xk ) − γ (xk , uk ) πlearn,k (xk ), where
πnom (xk ) is the control policy designed based on the nominal model, πlearn,k (xk ) is the adaptive com-
ponent designed based on the GP model, and γ (xk , uk ) ∈ [0, 1] is a scaling factor, with 0 indicating
low confidence in the GP. The stability of the overall system (safety level III) is guaranteed via a
stochastic stability analysis, and the efficacy of the approach has been demonstrated in quadrotor
experiments (42, 43).
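The confidence-weighted combination described above can be sketched as follows: the learned correction is scaled by a factor γ(x) ∈ [0, 1] derived from the GP's predictive standard deviation, so the controller falls back to the nominal policy where the model is uncertain. The nominal policy, correction, uncertainty model, and the linear mapping from standard deviation to γ are all illustrative assumptions.

```python
import numpy as np

def blended_policy(x, pi_nom, pi_learn, gp_std, std_low=0.05, std_high=0.5):
    """Confidence-weighted combination pi(x) = pi_nom(x) - gamma(x) * pi_learn(x),
    with gamma in [0, 1] decreasing from 1 (confident GP) to 0 (uncertain GP)."""
    std = gp_std(x)
    gamma = np.clip((std_high - std) / (std_high - std_low), 0.0, 1.0)
    return pi_nom(x) - gamma * pi_learn(x)

# Illustrative components (all assumed): nominal policy, GP-based correction, GP std model.
pi_nom = lambda x: -1.5 * x
pi_learn = lambda x: 0.3 * np.sin(2.0 * x)
gp_std = lambda x: 0.05 + 0.2 * abs(x)      # uncertainty grows away from the training data

for x in [0.1, 1.0, 3.0]:
    print(f"x = {x}: u = {blended_policy(x, pi_nom, pi_learn, gp_std):+.3f}")
```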

3.1.1.3. Memorizing experience with deep architectures. Apart from compensating for nonlin-
ear and nonparametric dynamics uncertainties, deep learning approaches have also been applied
to adaptive control for memorizing generalizable feature functions as the system adapts. In par-
ticular, References 44 and 45 proposed an asynchronous DNN adaptation approach. Similarly
to References 39 and 40, these works consider a linear nominal model f̄ (xk , uk ) = Āxk + B̄uk and
unknown nonlinear dynamics f̂ (xk , uk ) = B̄ ψk (xk ), with ψk (xk ) being an unknown nonlinear func-
tion. In the proposed approach, the last layer of the DNN is updated at a higher frequency for
fast adaptation, while the inner layers are updated at a lower frequency to memorize pertinent
features for the particular operation regimes. To provide safety guarantees, References 44 and 45
derived an upper bound on the sampling complexity of the DNN to achieve a prescribed level of
modeling error and leveraged this result to show that Lyapunov stability of the adapted system can
be guaranteed (safety level III) by ensuring that the modeling error of the DNN is lower than a given bound. In contrast to other MRAC approaches, which usually do not retain a memory of the
past experience, the inner layers of the asynchronous DNN store relevant features that facilitate
adaptation when similar scenarios arise in the future.

3.1.2. Learning-based robust control. Learning-based robust control improves the perfor-
mance of classical robust control (described in Section 2.2) by using data to improve the linear
dynamics model and reduce the uncertainty in Equation 11.

3.1.2.1. Using a Gaussian process dynamics model for linear robust control. The conservative
performance of robust control (described in Section 2.2) is improved by updating the linear dy-
namics model and uncertainty in Equation 11 with a GP (46). The unknown nonlinear dynamics
f̂ (xk , uk ) are learned as a GP, which is then linearized about an operating point. Linearizing the GP
(as opposed to directly fitting a linear model) allows data close to the operating point to be priori-
tized. The uncertain linear dynamics in Equation 11 are assumed to be modeled as Â = A0 + ΔÃ and B̂ = B0 + ΔB̃, where A0 and B0 are obtained from the linearized GP mean, Ã and B̃ are obtained from the linearized GP variance (often two standard deviations), and Δ represents a matrix with elements taking any value in the range of [−1, +1]. Further performance improvement is
achieved by modeling à and B̃ as state dependent (47). Additionally, Reference 48 achieved better
performance than Reference 46 by leveraging the GP’s distribution, while maintaining the same
level of safety. The main advantage of these approaches is that they can robustly guarantee safety
level II stability while improving performance, which is achieved by shrinking the GP uncertainty
as more data are added, thus improving the linear model A0 and B0 and reducing the uncertain
component à and B̃. This approach has been shown on a quadrotor (46). However, these methods
are limited to stabilization tasks and do not account for state and input constraints.
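One way to obtain the linearized mean and uncertainty described above is to differentiate the learned model's prediction numerically around an operating point, as sketched below. The finite-difference scheme and the crude standard-deviation-based bound are illustrative simplifications, and gp_predict is a stand-in for any model returning the mean and standard deviation of the one-step dynamics.

```python
import numpy as np

def linearize_gp(gp_predict, x_op, u_op, eps=1e-4, n_std=2.0):
    """Finite-difference linearization of a learned one-step model about (x_op, u_op).
    gp_predict(x, u) -> (mean, std), both of shape (n_x,). Returns (A0, B0, A_tilde, B_tilde):
    A0, B0 from the Jacobian of the mean; A_tilde, B_tilde here are a crude elementwise bound
    built from n_std times the local predictive std (an assumed simplification)."""
    n_x, n_u = len(x_op), len(u_op)
    mean0, std0 = gp_predict(x_op, u_op)
    A0, B0 = np.zeros((n_x, n_x)), np.zeros((n_x, n_u))
    for j in range(n_x):
        dx = np.zeros(n_x); dx[j] = eps
        A0[:, j] = (gp_predict(x_op + dx, u_op)[0] - mean0) / eps
    for j in range(n_u):
        du = np.zeros(n_u); du[j] = eps
        B0[:, j] = (gp_predict(x_op, u_op + du)[0] - mean0) / eps
    A_tilde = n_std * np.tile(std0.reshape(-1, 1), (1, n_x))
    B_tilde = n_std * np.tile(std0.reshape(-1, 1), (1, n_u))
    return A0, B0, A_tilde, B_tilde

# Illustrative stand-in for a trained GP over a 2-state, 1-input system.
def gp_predict(x, u):
    mean = np.array([x[0] + 0.1 * x[1], 0.95 * x[1] + 0.1 * u[0]])
    std = np.array([0.01, 0.02 + 0.01 * abs(x[1])])
    return mean, std

A0, B0, A_t, B_t = linearize_gp(gp_predict, np.zeros(2), np.zeros(1))
print("A0 =\n", A0, "\nB0 =\n", B0)
```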

3.1.2.2. Exploiting feedback linearization for robust learning-based tracking. Trajectory


tracking convergence, as opposed to the simpler stabilization task performed in Reference 46, is
guaranteed by exploiting the special structure of exactly feedback linearizable systems (49). This
structure assumes that the nonlinear system dynamics in Equation 1 can be described by a lin-
ear nominal model, where Ā and B̄ have an integrator chain structure. It also assumes that the
unknown dynamics are f̂ (xk , uk ) = B̄ψ (xk , uk ), where ψ (xk , uk ) is an unknown invertible function.
A probabilistic upper bound is obtained for ψ (xk , uk ) by learning this function as a GP. A robust
linear controller is designed for the uncertain system based on this learned probabilistic bound.
The performance is further improved by also updating the feedback linearization (50) through im-
provements in the estimate of the inverse ψ −1 (·). These approaches have been applied to trajectory
tracking, with safety level II, on Lagrangian mobile manipulators (49) and quadrotor models (50).
However, they hinge on this special structure and cannot account for state and input constraints.

3.1.3. Reducing conservatism in robust model predictive control with learning and
adaptation. The conservative nature of robust MPC (Section 2.2) is improved—while still satis-
fying input and state constraints—through (a) robust adaptive MPC, which adapts to parametric
uncertainty, and (b) learning-based robust MPC, which learns the unknown dynamics f̂ or, in one
case, cost Ĵ.

3.1.3.1. Robust adaptive model predictive control. Robust adaptive MPC assumes parametric
uncertainties, and either uses data to reduce the set of possible parameters over time or uses an
inner-loop adaptive controller and an outer-loop robust model predictive controller. This leads to improved performance compared with standard robust MPC (see Section 2.2) while satisfying
hard constraints (safety level III).
The first set of approaches consider stabilization tasks, where the full system dynamics
(Equation 10) are assumed to be linear and stable (51) or linear and unstable (52) with uncer-
tain parameter θ ∈ Θ0 :

x_{k+1} = (\bar{A} + \hat{A}(\theta)) x_k + (\bar{B} + \hat{B}(\theta)) u_k + w_k.   (14)

The process noise and the parameters are assumed to be bounded by known sets W and the ini-
tially conservative Θ0 , respectively. Given W and Θ0 , we can derive a conservative upper bound
on the uncertain dynamics f̂ (xk , uk , wk , θ) = Â(θ)xk + B̂(θ)uk + wk ∈ D0 , where D0 is a com-
pact set determined from W and Θ0 . To guarantee constraint satisfaction, tube-based MPC (see
Section 2.2) is applied, where the initial tube Ω_tube,0 is based on D0. To reduce the conservatism of the approach, an adaptive control method is introduced to improve the estimate of the parameter set Θk and reduce the size of the tube Ω_tube,k at each time step k. This idea has been extended to
stochastic process noise for probabilistic constraint satisfaction (safety level II) (53), time-varying
parameters (54), and linearly parameterized uncertain nonlinear systems (55, 56). Further per-
formance improvement is achieved by combining robust adaptive MPC with iterative learning
(57), which updates the terminal constraint set Xterm in Equation 12d and the terminal cost lH
in Equation 12a after every full iteration using the closed-loop state trajectory and cost (58). In
the second approach, an underlying MRAC (see Section 2.2) is used to make the closed-loop sys-
tem dynamics resemble a linear reference model with bounded disturbance set D (59). This linear
model and its bounds are then used in an outer-loop robust model predictive controller to achieve
fast stabilization in the presence of model errors.
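The parameter-set update at the heart of these robust adaptive MPC schemes can be illustrated with a simple set-membership step: each new measurement rules out parameter values inconsistent with the bounded noise, shrinking an interval Θ_k that can then be used to shrink the tube. The scalar system, interval representation, and numbers below are illustrative simplifications.

```python
import numpy as np

def set_membership_update(theta_int, x, u, x_next, w_bound):
    """Intersect the current parameter interval Theta_k with the set of theta consistent with
    x_next = theta * x + u + w, |w| <= w_bound (scalar illustration)."""
    lo, hi = theta_int
    if abs(x) < 1e-9:                        # this sample carries no parameter information
        return (lo, hi)
    bounds = sorted(((x_next - u - w_bound) / x, (x_next - u + w_bound) / x))
    return (max(lo, bounds[0]), min(hi, bounds[1]))

# Illustrative run: the interval shrinks toward the true parameter as data arrive.
rng = np.random.default_rng(6)
theta_true, w_bound = 0.9, 0.02
theta_int = (0.5, 1.5)                       # initial conservative set Theta_0
x = 1.0
for k in range(30):
    u = rng.uniform(-0.5, 0.5)
    x_next = theta_true * x + u + rng.uniform(-w_bound, w_bound)
    theta_int = set_membership_update(theta_int, x, u, x_next, w_bound)
    x = x_next
print("final parameter interval:", tuple(round(b, 3) for b in theta_int))
```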

3.1.3.2. Learning-based robust model predictive control. Learning-based robust MPC uses
data to improve the unknown dynamics estimate, reduce the uncertainty set, or update the cost
to avoid states with high uncertainty. Unlike robust adaptive control, learning-based robust MPC
considers nonparameterized systems.
Under the assumption of a linear nominal model f̄ (x, u) = Āx + B̄u and bounded unknown
dynamics f̂ (xk , uk ) ∈ D, the unknown dynamics can be safely learned from data, for example, us-
ing a neural network (60). Robust constraint satisfaction (Equation 12c) is guaranteed by us-
ing tube-based MPC for the linear nominal model, and performance improvement is achieved
by optimizing over the control inputs for the combined nominal and learned dynamics. If the
bounded unknown dynamics are assumed to be state dependent, f̂ (xk ) ∈ D(xk ), instead of using
a constant tube Ω_tube, then the state constraints in Equation 12c can be tightened based on the
state-dependent uncertainty set D(x) (61). In a numerical stabilization task, a GP is used to model
f̂ (xk ), and its covariance determines D(x). This yields less conservative, probabilistic state con-
straints (safety level II).
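The state-dependent tightening described above can be illustrated as follows: the half-width of each state constraint is reduced by a multiple of the GP's predictive standard deviation at the current state, so constraints are tighter where the model is uncertain. The scaling factor and the stand-in uncertainty model are assumptions.

```python
import numpy as np

def tightened_state_bounds(x, x_bound, gp_std, n_std=2.0):
    """Tighten symmetric state constraints |x_i| <= x_bound_i by n_std times the GP's
    predictive standard deviation at x (a state-dependent stand-in for D(x) in the text)."""
    margin = n_std * gp_std(x)
    return np.maximum(x_bound - margin, 0.0)        # never tighten past zero

# Illustrative uncertainty model: the GP is confident near the origin, uncertain far away.
gp_std = lambda x: 0.02 + 0.05 * np.abs(x)
x_bound = np.array([1.0, 0.5])

for x in (np.array([0.1, 0.0]), np.array([0.8, 0.4])):
    print("state:", x, "tightened bounds:", tightened_state_bounds(x, x_bound, gp_std))
```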
Learning-based robust MPC can be extended to nonlinear nominal models. Typically, a
GP is used to learn the unknown dynamics f̂ (x, u): The mean updates the dynamics model in
Equation 12b, and the state- and input-dependent uncertainty set D(x, u) is derived from the
GP’s covariance and contains the true uncertainty with high probability. Similarly, the state- and
input-dependent tube in Equation 12c is determined from the uncertainty set D(x, u). The main
challenge, compared with using a linear nominal model, is the uncertainty propagation over the
prediction horizon in the MPC, because Gaussian uncertainty (obtained from the GP) is no longer
Gaussian when propagated through the nominal nonlinear dynamics. Approximation schemes are
required, such as using a sigma-point transform (62), linearization (63), exact moment matching
(64), or ellipsoidal uncertainty set propagation (65). Additionally, further approximations [e.g.,

fixing the GP's covariances over the prediction horizon (63)] are usually required to achieve real-time implementation. These approximations can lead to violations of the probabilistic constraints (safety level II) in Equation 9e. An alternative approach to address the challenges of GPs uses a neural network regression model that predicts the quantile bounds of the tail of a state trajectory distribution for tube-based MPC (66). However, this approach is hindered by typically nonexhaustive training data sets, which can lead to the underestimation of the tube size.

For repetitive tasks, another approach is to adjust the cost function based on data instead of the system model. The predicted cost error is learned from data using the difference between the predicted cost at each time step and the actual closed-loop cost at execution. By adding this additional term to the cost function, the MPC penalizes states that had previously resulted in a higher closed-loop cost than expected (Equation 12a) (67), resulting in reliable performance despite model errors.

Lipschitz constant: a positive scalar ρ that bounds the change in the output of a Lipschitz continuous function to a maximum of ρ times the change in the input

Region of attraction (ROA): the set of states from which a closed-loop system converges to an equilibrium x∗

3.1.4. Safe model-based reinforcement learning with a priori dynamics. Safe model-based RL augments model-based RL (see Section 2.3) with safety guarantees (see Figure 4). Stability can be probabilistically guaranteed (safety level II) under the assumption that the known nominal model f̄(xk, uk) and the unknown part f̂(xk, uk) are Lipschitz continuous with known Lipschitz constants (68). Given a Lyapunov function, an initial safe policy π0, and a GP to learn the unknown dynamics f̂, a control policy πk is chosen so that it maximizes a conservative estimate of the region of attraction (ROA). The most uncertain states (based on the GP's covariance) inside the ROA are explored, which reduces the uncertainty over time and allows the ROA to be extended. The practical implementation resorts to discrete states for tractability and retains the stability guarantees while being suboptimal in (exploration) performance.

Markov property: the property that the probability of being in state xk at time k depends only on the state at time k − 1, xk−1, and, in MDPs, on uk−1

Ergodicity: for an MDP, the property that, by following a given policy, any state is reachable from any other state

3.2. Encouraging Safety and Robustness in Reinforcement Learning

The approaches in this section are safety-augmented variations of the traditional MDP and RL frameworks. In general, these methods do not assume knowledge of an a priori nominal model f̄, and some also learn the reward or step cost l (69) or the safety constraints c (70). Rather than providing strict safety guarantees, these approaches encourage constraint satisfaction during and after learning, or the robustness of the learned control policy π to uncertain dynamics (safety level I; see Figure 4). In plain MDP formulations, (a) states and inputs (or actions) are assumed to have known, often discrete and finite, domains but are not further constrained while searching for an optimal policy π∗, and (b) only loose assumptions are made on the dynamics f, such as the system satisfying the Markov property (71).

A previous taxonomy of safe RL (4)—covering research published up until 2015—distinguished methods that either modify the exploration process with external knowledge or modify the optimality criterion J with a safety factor. However, the number and breadth of publications in RL, including safe RL, have since greatly increased (72). Because recent works in safe RL are numerous and diverse, we provide a high-level review of the significant trends with an emphasis on robotics, including (a) safe exploration of MDPs, (b) risk-aware RL, (c) RL with CMDPs, and (d) robust RL.

3.2.1. Safe exploration and optimization. Exploration in RL poses a challenge to its safety, as
it must select inputs with unpredictable consequences in order to learn about them.

3.2.1.1. Safe exploration. Reference 73 used the notion of ergodicity to tackle the problem
of safely exploring an MDP. Policy updates that preserve the ergodicity of the MDP enable
the system to return to an arbitrary state from any state. Thus, the core idea is to restrict the



space of eligible policies to those that make the MDP ergodic (with at least a given probability).
Exactly solving this problem, however, is NP-hard. Reference 73 solved a simplified problem
using a heuristic exploration algorithm (74), which leads to suboptimal but safe exploration that
only considers a subset of the ergodic policies. In practice, this method was demonstrated in
two simulated scenarios with discrete, finite X and U. In a different approach from recoverable
exploration via ergodicity, References 70 and 75 developed safe exploration strategies with
constraint satisfaction. They used a safety layer to convert an optimal but potentially unsafe
action ulearn,k from a neural network policy into the closest safe action usafe,k with respect to
some safety state constraints. Both works involved solving a constrained least-squares problem,
usafe,k = arg min_uk ‖uk − ulearn,k‖₂², which is akin to the safety filter approaches covered in Sec-
tion 3.3.2. Concretely, Reference 75 assumed full knowledge of the (linear) constraint functions

and solved usafe,k using a differentiable quadratic programming solver. By contrast, Reference 70
assumed that constraints are unknown a priori but can be evaluated. Thus, the approach learns
the linear approximations of these constraints and then uses them in the solver. Notably,
References 70 and 75 considered single-time-step state constraints. In Section 3.2.3, we discuss

methods that deal with more general trajectory-level constraints (Equation 6).
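As a concrete illustration, the following is a minimal numpy sketch of such a single-step safety layer, assuming the per-constraint sensitivities g_i(x_k) have already been learned offline; the function and variable names are ours, not those of References 70 or 75, and only the most-violated linearized constraint is corrected, which follows the spirit of the closed-form solution in Reference 70.

```python
import numpy as np

def safety_layer(u_learn, c_now, G, margin=0.0):
    """Project a proposed action onto the linearized safe set.

    Each constraint i is approximated over one step as
        c_i(x_{k+1}) ~= c_i(x_k) + g_i(x_k)^T u_k <= 0,
    with c_now[i] = c_i(x_k) and G[i] = g_i(x_k) given by a learned model.
    Only the most-violated constraint is corrected, which admits a
    closed-form solution of the associated least-squares problem.
    """
    u = np.asarray(u_learn, dtype=float)
    # Lagrange multipliers of the single-constraint projections.
    lam = (c_now + G @ u + margin) / np.maximum(np.sum(G * G, axis=1), 1e-8)
    lam = np.maximum(lam, 0.0)          # inactive constraints get zero correction
    i = int(np.argmax(lam))             # handle the most-violated constraint
    return u - lam[i] * G[i]

# Toy usage: one input dimension, one position constraint close to its limit.
u_safe = safety_layer(u_learn=np.array([1.0]),
                      c_now=np.array([0.3]),    # c(x_k)
                      G=np.array([[0.5]]))      # learned sensitivity dc/du
```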

3.2.1.2. Safe optimization. Several works have addressed the problem of safely optimizing an
unknown function (typically the cost function), often exploiting GP models (76). Safety refers to
sampling inputs that do not violate a given safety threshold (Equations 3 and 4). These approaches
fall under the category of Bayesian optimization (77) and include SafeOpt (78); SafeOpt-MC (79),
an extension to multiple constraints; StageOpt (80), a more efficient two-stage implementation;
and GoSafe (81), used for exploration beyond the initial safe region. In particular, SafeOpt infers
two subsets of the safe set from the GP model—one with candidate inputs to extend the safe
set, and one with candidate inputs to optimize the unknown function—from which it greedily
picks the most uncertain. In SafeMDP, Reference 69 applied the ideas pioneered by SafeOpt to
MDPs, resulting in the safe exploration of MDPs with an unknown cost function l (x, u), which
the paper modeled as a GP. In SafeMDP, the single-step reward represents the safety feature that
should not violate a given threshold. Another extension of SafeOpt (78), SafeExpOpt-MDP (82),
treats the safety feature c j and the MDP’s cost l as two separate, unknown functions, allowing for
the constraint of the former and the optimization of the latter. Reference 76 provided a recent
survey of these techniques that highlighted the distinction between safe learning in regression
(i.e., minimizing the selection and evaluation of nonsafe training inputs) and safe exploration in
MDPs and stochastic dynamical systems such as Equation 9b (i.e., selecting action inputs that also
preserve ergodicity).
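To make the selection rule concrete, the sketch below performs one simplified SafeOpt-style acquisition step on a discretized one-dimensional domain. The GP posterior arrays mu and sigma, the threshold h, and the crude boundary-based proxy for the expander set are our illustrative simplifications of the construction in Reference 78, not its exact implementation.

```python
import numpy as np

def safeopt_step(mu, sigma, h, beta=2.0):
    """One simplified SafeOpt-style acquisition step on a 1-D candidate grid.

    mu, sigma : GP posterior mean and standard deviation at each candidate input.
    h         : safety threshold that sampled values must not fall below.
    beta      : confidence-interval scaling.
    Returns the index of the next candidate input to evaluate.
    """
    lower, upper = mu - beta * sigma, mu + beta * sigma
    safe = lower >= h                       # candidates certified safe by the GP
    if not np.any(safe):
        raise RuntimeError("Empty safe set: a safe initial sample is required.")
    # Potential maximizers: safe candidates that could still beat the best lower bound.
    maximizers = safe & (upper >= lower[safe].max())
    # Crude proxy for expanders: safe candidates bordering the non-safe region.
    expanders = (safe & np.roll(~safe, 1)) | (safe & np.roll(~safe, -1))
    candidates = maximizers | expanders
    width = np.where(candidates, upper - lower, -np.inf)
    return int(np.argmax(width))            # greedily pick the most uncertain one
```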

3.2.1.3. Learning a safety critic. A safety critic is a learnable action-value function Q^π_safe that
can detect whether a proposed action can lead to unsafe conditions. References 83–85 used this
critic with various fallback schemes to determine a safer alternative input. These works differ
from those in Section 3.3.2 in that the filtering criterion depends on a model-free, learned value
function, which can only grant the satisfaction of safety level I.
Reference 83 used safety Q-functions for reinforcement learning (SQRL) to (a) learn a safety
critic from only abstract, sparse safety labels (e.g., a binary indicator) and (b) transfer knowledge
of safe action inputs to new but similar tasks. SQRL trains Q^π_safe to predict the future probability
of failure in a trajectory and uses it to filter out unsafe actions from the policy π. Knowledge
transfer is achieved by pretraining Q^π_safe and π in simulations and then fine-tuning π on the new
task (with similar dynamics f and safety constraints), still in simulation, while reusing Q^π_safe to
discriminate unsafe inputs. However, the success of the final safe policy still depends on the



task- and environment-specific hyperparameters (which must be found via parameter search prior
to the actual experiment). Building on this work, recovery RL (84) additionally learns a recovery
policy πrec to produce fallback actions for Q^π_safe as an alternative to filtering out unsafe inputs and
resorting to potentially suboptimal ones in π. Reference 85 extended conservative Q-learning
(86), an approach that mitigates the value function overestimation of Q-learning, and proposed
the conservative safety critic algorithm. Similarly to SQRL, this algorithm assumes sparse safety
labels and uses Q^π_safe for action filtering, but it trains Q^π_safe to upper bound the probability of failure
and ensures provably safe policy improvement at each iteration.
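The action-filtering mechanism shared by these methods can be sketched as follows; the callables policy, q_safe, and recovery_policy are placeholders for the learned components, and both the names and the fallback logic are illustrative rather than the exact interfaces of References 83–85.

```python
import numpy as np

def filter_with_safety_critic(x, policy, q_safe, recovery_policy,
                              eps=0.1, n_candidates=32):
    """Return an action whose predicted failure probability is below eps.

    policy(x, n)       : draws n candidate actions from the learned policy
    q_safe(x, a)       : safety critic estimate of the failure probability of (x, a)
    recovery_policy(x) : fallback action if no candidate is deemed safe (recovery RL style)
    """
    candidates = policy(x, n_candidates)
    risk = np.array([q_safe(x, a) for a in candidates])
    safe_idx = np.flatnonzero(risk < eps)
    if safe_idx.size == 0:
        return recovery_policy(x)                      # no safe candidate: fall back
    # Among the safe candidates, keep the lowest-risk one as a simple tie-break.
    return candidates[safe_idx[np.argmin(risk[safe_idx])]]
```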
3.2.2. Risk-averse reinforcement learning and uncertainty-aware reinforcement learning.
Safety in RL can also be encouraged by deriving and using risk (or uncertainty) estimates during
learning. These estimates are typically computed for the system dynamics or the overall cost
function and leveraged to produce more conservative (and safer) policies.

Conditional value at risk: a risk measure that is equal to the average of the samples below the α-percentile of the total cost J

Distributional RL: RL that aims at modeling and learning the distribution of the cost J rather than its expected value

Risk can be defined as the probability of collision for a robot performing a navigation task (87,
88). A collision model, captured by a neural network ensemble trained with Monte Carlo dropout,

predicts the probability distribution of a collision, given the current state and a sequence of future
actions. The collision-averse behavior is then achieved by incorporating the collision model in an
MPC planner.
Reference 89 extended model-based probabilistic ensembles with trajectory sampling (PETS)
(90) to propose cautious adaptation for safe RL (CARL). Ensembles are collections of learned
models used to mitigate noise or capture the stochastic dynamics. CARL has two training phases:
(a) a pretraining phase that is not risk aware, where a PETS agent is trained on different system
dynamics, and (b) an adaptation phase, where the agent is fine-tuned on the target system by taking
risk-averse actions. CARL also defines two notions of risk, one to avoid low-reward trajectories
and another to avoid catastrophes (e.g., irrecoverable states or constraint violations). In safety-
augmented value estimation from demonstrations (SAVED) (91), a PETS agent is used to predict
the probability of a robot’s collision and to evaluate a chance constraint for safe exploration. Similar
to the safety critic methods described in Section 3.2.1, SAVED learns a value function from sparse
costs, which it uses as a terminal cost estimate.
To learn risk-averse policies using only offline data, Reference 92 optimized for a risk measure
of the cost, such as conditional value at risk. Instead of using model ensembles, as in References 87–
89, Reference 92 used distributional RL to explicitly model the distribution of the total cost of
the task (control of a simulated one-dimensional car) and offline learning to improve scalability.
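As a worked illustration of the risk measure, the short numpy sketch below estimates conditional value at risk from sampled total costs. Sign conventions differ across the literature; here J is treated as a cost to be minimized, so the risky tail is the high-cost one, and the sampled values are purely illustrative.

```python
import numpy as np

def cvar(costs, alpha=0.1):
    """Conditional value at risk of sampled total costs J (cost-minimization convention):
    the mean of the worst (highest-cost) alpha-fraction of samples."""
    costs = np.sort(np.asarray(costs, dtype=float))
    tail = costs[int(np.floor((1.0 - alpha) * len(costs))):]
    return tail.mean()

# Two policies with similar mean cost but different tails: the risk measure
# penalizes the heavy-tailed one, even though its mean is slightly lower.
rng = np.random.default_rng(0)
print(cvar(rng.normal(10.0, 1.0, 10_000)))   # low-variance policy
print(cvar(rng.normal(9.5, 5.0, 10_000)))    # heavier-tailed policy
```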

3.2.3. Constrained Markov decision processes and reinforcement learning. The CMDP
framework is frequently used in safe RL, as it introduces constraints that can express arbitrary
safety notions in the form of Equation 6. RL for CMDPs, however, faces two important challenges
(see Section 2.3): how to incorporate and enforce constraints in the RL algorithm and how to
efficiently solve the constrained RL problem—especially when using deep learning models, which
are the de facto standard in nonsafe RL. In this section, we cover three approaches that aim to
address these challenges: (a) Lagrangian methods for RL, (b) generalized Lyapunov functions for
constraints, and (c) backward value functions. However, most of the work in this area remains
confined to naive simulated tasks, motivating further research on their applicability in real-world
control.

3.2.3.1. Lagrangian methods in reinforcement learning optimization. In References 93 and


94, the CMDP constrained optimization problem (see Section 2.3) is first transformed into an
equivalent unconstrained optimization problem over the primal variable π and dual variable λ



using the Lagrangian function L(π, λ), and RL is used as a subroutine in the primal–dual updates
for Equation 15b:
  
L(π, λ) = J^π + Σ_j λ_j (J^π_cj − d_j),    15a.

(π∗, λ∗) = arg max_{λ≥0} min_π L(π, λ) subject to Equations 9b–9d.    15b.

In particular, Reference 93 defined a constraint on the conditional value at risk of cost Jπ and used
policy-gradient or actor–critic methods to update the policy in Equation 15b. Reference 94 subse-
quently improved on this work by incorporating off-policy updates of the dual variable (with the
on-policy primal–dual updates), and showed empirically that this method achieves better sample

efficiency and faster convergence. Reference 34 extended a standard trust-region RL algorithm


(95) to CMDPs using a novel bound that relates the expected cost of two policies to their state-
averaged divergence (where a divergence is a measure of the similarity between probability distri-
butions). The key idea is performing primal–dual updates with surrogates or approximations of the

cost J^π and constraint cost J^π_cj derived from the bound. The benefits are twofold: The surrogates
can be estimated with only state–action data, bypassing the challenge of trajectory evaluation from
off-policy data, and the updates guarantee monotonic policy improvement and near constraint sat-
isfaction at each iteration (safety level I). However, unlike in References 93 and 94, each update
involves solving the dual variables from scratch, which can be computationally expensive.
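The resulting primal–dual scheme can be summarized by the following template, where policy_update and estimate_costs stand in for the RL algorithm and the rollout machinery; these interfaces are ours, and References 93, 94, and 34 each instantiate the primal step differently.

```python
import numpy as np

def lagrangian_rl(policy_update, estimate_costs, d, n_iters=100, lam_lr=0.05):
    """Primal-dual template for the CMDP Lagrangian in Equation 15.

    policy_update(lam) : runs a few RL updates on the shaped cost J + sum_j lam_j * J_cj
                         (the primal step) and returns the updated policy.
    estimate_costs(pi) : returns (J, Jc), with Jc a vector of constraint costs J_cj.
    d                  : vector of constraint thresholds d_j.
    """
    d = np.asarray(d, dtype=float)
    lam = np.zeros_like(d)
    pi = None
    for _ in range(n_iters):
        pi = policy_update(lam)                  # primal: approximately min_pi L(pi, lam)
        _, jc = estimate_costs(pi)
        # Dual ascent on the multipliers; the projection keeps lambda >= 0.
        lam = np.maximum(0.0, lam + lam_lr * (np.asarray(jc, dtype=float) - d))
    return pi, lam
```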

3.2.3.2. A Lyapunov approach to safe reinforcement learning. Lyapunov functions are used
extensively in control to analyze system stability and are a powerful tool to translate a system’s
global properties into local conditions. References 96 and 97 used Lyapunov functions to trans-
form the trajectory-level constraints J^π_cj in Equation 6 into stepwise, state-based constraints. This
approach allows a more efficient computation of J^π_cj and mitigates the cost of off-policy evalua-
tion; however, it also requires the system to start from a baseline policy π0 that already satisfies
the constraints. In Reference 96, the authors proposed four different algorithms to solve CMDPs
by combining traditional RL methods and the Lyapunov constraints, but they are applicable only
to discrete input spaces (with continuous state spaces). In subsequent work (97), the authors ex-
tended the approach to continuous input spaces and standard policy-gradient methods, addressing
its computational tractability.

3.2.3.3. Learning backward value functions. Reference 98 proposed backward value functions
to overcome the excessive computational cost in the previous approaches (34, 97). Similar to
a (forward) value function V^π that estimates the total future cost from each state, a backward
value function V^{b,π} estimates the accumulated cost up to the current state. We can decompose a
trajectory-level constraint at any time step k as the sum of V^π_cj(xk) and V^{b,π}_cj(xk) for the constraint
cost J^π_cj. This decomposition also alleviates the problem of off-policy evaluation, as these value
functions can be learned concurrently and efficiently via temporal difference methods (71). In
practice, V^π_cj, V^{b,π}_cj, and V^π are jointly learned (98) and used for policy improvement at each time
step, allowing the implementation of safety level I. The approach is intended for discrete action
spaces but can be adapted to continuous ones (70).
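Informally (and omitting discounting), the localization used by this approach can be written as the stepwise condition below, which, when enforced at every visited state, implies the original trajectory-level constraint; this is our paraphrase of the construction in Reference 98.

```latex
% Stepwise form of the trajectory-level constraint J^{\pi}_{c_j} \le d_j
% (informal, discounting omitted): at every visited state x_k,
V^{b,\pi}_{c_j}(x_k) \;+\; V^{\pi}_{c_j}(x_k) \;\le\; d_j ,
% i.e., the accumulated constraint cost up to x_k plus the expected future
% constraint cost from x_k must not exceed the budget d_j.
```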

3.2.4. Robust Markov decision processes and reinforcement learning. Works in this sec-
tion aim to implement robustness in RL—specifically, learning policies that can operate under
disturbances and generalize across similar tasks or robotic systems. This is typically done by fram-
ing the learning problem as a robust MDP (Equation 13). Reference 99 developed robust RL,



which implements an actor–disturber–critic architecture. The authors observed that the learned
policy and value function coincide with those derived analytically from H∞ control theory for
linear systems (see Section 2.2). However, more recent robust RL literature often abstains from
assumptions about dynamics or disturbances and applies model-free RL (100) to seek empirically
robust performance at the expense of theoretical guarantees. Below, we introduce two lines of
work: (a) robust adversarial RL, which explicitly models the min–max problem in Equation 13 in
a game-theoretic fashion, and (b) domain randomization, which approximates the same problem
in Equation 13 by learning on a set of randomized perturbed dynamics.

3.2.4.1. Robustness through adversarial training. Combining RL with adversarial learn-


ing (101) results in robust adversarial RL (36), where the robust optimization problem

(Equation 13) is set up as a two-player, discounted zero-sum Markov game in which an agent
(protagonist) learns policy π to control the system and another agent (adversary) learns a separate
policy to destabilize the system. The two agents learn in an alternating fashion (each is updated
while fixing the other), attempting to progressively improve both the robustness of the protag-

onist’s policy and the strength of its adversary. Reference 37 extended this work with risk-aware
agents (see Section 3.2.2), with the protagonist being risk averse and the adversary being risk seek-
ing. This method learns an ensemble of deep Q-networks (102) and defines the risk of an action
based on the variance of its value predictions. In another extension of the same work, Reference 38
trained a population of adversaries (rather than a single one), making the resulting protagonist less
exploitable by new adversaries. Finally, Reference 103 proposed certified lower bounds for the
value predictions from a deep Q-network (102), given bounded observation perturbations. The
action selection is based on these value lower bounds, assuming adversarial perturbation.

3.2.4.2. Robustness through domain randomization. Domain randomization methods aim to


learn policies that generalize to a wide range of tasks or systems. Instead of concerning worst-case
scenarios, learning happens on systems with randomly perturbed parameters (e.g., inertial proper-
ties and friction coefficients), which often have prespecified ranges. This effectively induces a ro-
bust set that approximates the uncertainty set D in Equation 13 and allows one to use any standard
RL method. In Reference 104, a quadrotor learns vision-based flight in simulation with random-
ized scenes. Using this model-free policy in the real world results in improved collision avoidance
performance. Instead of learning a policy directly, a system described in Reference 105 uses learned
visual predictions with an MPC controller to enable efficient and scalable real-world performance.
Besides the uniform randomization in References 104 and 105, adaptive randomization strategies
such as Bayesian search (106) are also a promising direction. Reference 107 adversarially trained
and used a discriminator to guide the randomization process that generates systems that are less
explored or exploited by the current policy.
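In its simplest uniform form, the randomization amounts to resampling the physical parameters of the simulator at the start of every training episode, as in the minimal sketch below; the parameter names and ranges are illustrative placeholders rather than values used in References 104–107.

```python
import numpy as np

def sample_randomized_params(rng, nominal):
    """Draw perturbed dynamics parameters for one training episode (uniform ranges)."""
    return {
        "mass":     nominal["mass"] * rng.uniform(0.8, 1.2),
        "inertia":  nominal["inertia"] * rng.uniform(0.8, 1.2),
        "friction": nominal["friction"] * rng.uniform(0.5, 1.5),
    }

rng = np.random.default_rng(0)
nominal = {"mass": 0.03, "inertia": 1.4e-5, "friction": 0.01}
for episode in range(1000):
    params = sample_randomized_params(rng, nominal)
    # Reconfigure the simulated system with `params` and run one episode of any
    # standard RL algorithm; the resulting policy is trained on the induced
    # robust set rather than on a single nominal model.
```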

3.3. Certifying Learning-Based Control Under Dynamics Uncertainty


In this section, we review methods providing certification to learning-based control approaches
that do not inherently account for safety constraints (see Figure 2). We divide the discussion
into two parts: (a) stability certification and (b) constraint set certification. The works in this sec-
tion leverage an a priori dynamics model of the system and provide hard or probabilistic safety
guarantees (safety level II or III) under dynamics uncertainties (see Figure 4).

3.3.1. Stability certification. This section introduces certification approaches that guarantee
closed-loop stability under a learning-based control policy.



3.3.1.1. Lipschitz-based safety certification for deep neural network–based learning controllers.
These approaches exploit the expressive power of DNNs for policy parameterization and guar-
antee closed-loop stability through a Lipschitz constraint on the DNN. Let ρ be a Lipschitz con-
stant of a DNN policy; an upper bound on ρ that guarantees closed-loop stability (safety level III)
can be established by using a small-gain stability analysis (108), solving a semidefinite program
(109), or applying a sliding mode control framework (110). This bound on ρ can be used in either
(a) a passive, iterative enforcement approach, where the Lipschitz constant ρ is first estimated [e.g.,
via semidefinite programming–based estimation (111)] and then used to guide retraining until the
Lipschitz constraint is satisfied, or (b) an active enforcement approach, where the Lipschitz con-
straint is directly enforced by the training algorithm of the DNN [e.g., via spectral normalization
(110)]. While guaranteeing stability, the Lipschitz-based certification approaches often rely on the
particular structure of the system dynamics (e.g., a control-affine structure or linear structure with
additive nonlinear uncertainty) to find the certifying Lipschitz constant. It remains to be explored
how this idea can be extended to more generic robot systems.
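The active enforcement route can be illustrated with the following numpy sketch, which rescales the layers of a feedforward policy network so that the product of their spectral norms—an upper bound on the network's Lipschitz constant when the activations are 1-Lipschitz—does not exceed a certified bound ρ. This is a simplified offline variant of the idea, not the training-time procedure of Reference 110.

```python
import numpy as np

def spectrally_normalize(weights, rho):
    """Rescale feedforward DNN layers so the network's Lipschitz constant is at most rho.

    weights : list of weight matrices W_1, ..., W_L (activations assumed 1-Lipschitz,
              e.g., ReLU or tanh).
    The product of spectral norms upper bounds the Lipschitz constant, so capping each
    layer at rho**(1/L) enforces the overall bound.
    """
    L = len(weights)
    target = rho ** (1.0 / L)
    normalized = []
    for W in weights:
        sigma = np.linalg.norm(W, ord=2)            # largest singular value
        scale = min(1.0, target / max(sigma, 1e-12))
        normalized.append(W * scale)                # never inflate already-small layers
    return normalized
```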

3.3.1.2. Learning regions of attraction for safety certification. The ROA of a closed-loop sys-
tem is used in the learning-based control literature as a means to guarantee safety. For a nonlinear
system with a given state-feedback controller, the ROA is the set of states that is guaranteed to
converge to the equilibrium, which is treated as a safe state. This notion of safety provides a way
to certify a learning-based controller. It guarantees that there is a region in state space from which
the controller can drive the system back to the safe state (safety level III) (112). ROAs can, for
example, be used to guide data acquisition for model or controller learning (68, 113).
We consider deterministic closed-loop systems xk+1 = fπ (xk ) = f (xk , π(xk )), with fπ (xk ) being
Lipschitz continuous. A Lyapunov neural network can be used to iteratively learn the ROA of
a controlled nonlinear system from the system’s input–output data (112). As compared with the
typical Lyapunov functions in control (e.g., quadratic Lyapunov functions), the proposed method
uses the Lyapunov neural network as a more flexible Lyapunov function representation to provide
a less conservative estimate of the system’s ROA. The necessary properties of a Lyapunov function
are preserved via the network’s architectural design. Reference 113 presented an ROA estimation
approach for high-dimensional systems that combines a sum-of-squares programming method
for the ROA computation (114) and a dynamics model order reduction technique to curtail com-
putational complexity (115).
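A sampling-based caricature of such ROA estimation is sketched below: it searches for the largest sublevel set of a candidate Lyapunov function on which the one-step decrease condition holds at the sampled states. Unlike Reference 112, which certifies the condition between samples via Lipschitz arguments, this sketch checks it only at the samples and therefore provides no formal guarantee; the example system and candidate function are our own.

```python
import numpy as np

def estimate_roa_level(V, f_closed_loop, samples):
    """Largest sampled sublevel set {x : V(x) < c} on which V decreases along x_{k+1} = f_pi(x_k).

    V             : candidate Lyapunov function (e.g., a Lyapunov neural network)
    f_closed_loop : maps a state to the next state under the learned policy
    samples       : (N, nx) array of states covering the region of interest
    """
    v_now = np.array([V(x) for x in samples])
    v_next = np.array([V(f_closed_loop(x)) for x in samples])
    decreasing = v_next < v_now
    # The level is capped by the smallest V among states violating the decrease condition.
    c = v_now[~decreasing].min() if np.any(~decreasing) else v_now.max() + 1.0
    return c, samples[v_now < c]

# Example: stable linear closed loop x+ = 0.9 x with quadratic candidate V(x) = x^T x.
rng = np.random.default_rng(0)
pts = rng.uniform(-1.0, 1.0, size=(500, 2))
level, roa_pts = estimate_roa_level(lambda x: x @ x, lambda x: 0.9 * x, pts)
```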

3.3.2. Constraint set certification. This section summarizes approaches that provide con-
straint set certification to a learning-based controller based on the notion of robust positive
control invariant safe sets X_safe ⊆ Xc.

Robust positive control invariant safe set: a set X_safe ⊆ Xc for which there exists a feedback policy π(x) such that, when starting within the set, the state never leaves the set, for all possible model errors D

Certified learning, which can be achieved through a safety filter (9) or shielding (116), finds the
minimal modification of a learning-based control input ulearn (see Figure 2) such that the system’s
state stays inside the set X_safe:

usafe,k = arg min_{uk∈Uc} ‖uk − ulearn,k‖₂²    16a.

subject to xk+1 = f̄k(xk, uk) + f̂k(xk, uk, wk) ∈ X_safe,    16b.

∀ f̂k(xk, uk, wk) ∈ D(xk, uk),

where the range of possible disturbances D(xk, uk) is given. Since the safety filter and controller
are usually decoupled, suboptimal behavior can emerge, as the learning-based controller may try
to violate the constraints (65).
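For a linear nominal model with a bounded additive disturbance and a polytopic safe set, the filtering problem in Equation 16 reduces to a small quadratic program. The cvxpy sketch below handles the disturbance by constraint tightening and assumes the (hard-to-compute) robust positive control invariant set {x : Hx ≤ h} has been provided offline; all names and interfaces are illustrative.

```python
import numpy as np
import cvxpy as cp

def safety_filter(u_learn, x, A, B, H, h, w_max, u_min, u_max):
    """One-step safety filter (Equation 16) for x_{k+1} = A x_k + B u_k + w_k,
    ||w_k||_inf <= w_max, and a polytopic safe set {x : H x <= h}.

    The worst-case disturbance is absorbed by tightening each constraint by the
    support function of the disturbance set; the safe set itself must be robust
    positive control invariant, which is assumed to be established offline.
    """
    u = cp.Variable(u_learn.shape[0])
    tightening = w_max * np.abs(H).sum(axis=1)      # max_w H_i w over the inf-norm ball
    constraints = [H @ (A @ x + B @ u) <= h - tightening,
                   u >= u_min, u <= u_max]
    problem = cp.Problem(cp.Minimize(cp.sum_squares(u - u_learn)), constraints)
    problem.solve()
    if u.value is None:                             # infeasible: a real filter would
        return u_learn                              # switch to a backup controller here
    return u.value
```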



3.3.2.1. Control barrier functions. Control barrier functions (CBFs) are used to define safe sets.
More specifically, the safe set X_safe is defined as the superlevel set of a continuously differentiable
CBF Bc, Bc: R^nx → R, as X_safe = {x ∈ R^nx : Bc(x) ≥ 0}. The function Bc is generally considered
for continuous-time, control-affine systems of the form

ẋ = fx(x) + fu(x)u,    17.

where fx and fu are locally Lipschitz, and x and u are functions of time (117). In the simplest
form, the function Bc is a CBF if there exists a state-feedback control input u such that the time
derivative Ḃc(x) = (∂Bc/∂x) ẋ satisfies (117)

sup_{u∈U} (∂Bc/∂x) (fx(x) + fu(x)u) ≥ −Bc(x).    18.

The CBF condition in Equation 18 is a continuous-time version of the robust positive control
invariance constraint in Equation 16b. In addition to the robust positive control invariance con-
straint, a similar constraint for asymptotic stability can be added to Equation 16 in the form of a
constraint on the time derivative of a control Lyapunov function (CLF) Lc. However, uncertain
dynamics also yield uncertain time derivatives of Bc and Lc.

Control Lyapunov function (CLF): a function whose existence guarantees that there also exists a state-feedback controller π(x) that asymptotically stabilizes the system


Learning-based approaches extend CBF and CLF analyses to control-affine systems in
Equation 17 with known nominal f̄x and f̄u and unknown f̂x and f̂u . The time derivative of the
unknown dynamics for CBFs and/or CLFs [e.g., (∂Bc/∂x)(f̂x(x) + f̂u(x)u)] can be learned from itera-
tive trials (118, 119) or data collected by an RL agent (120, 121). Given a CBF and/or CLF for
the true dynamics f , improving the estimate of the CBF’s or CLF’s time derivative for the un-
known dynamics f̂ using data, collected either offline or online, yields a more precise estimate of
the constraint in Equation 18. However, any learning error in the CBF’s or CLF’s time deriva-
tive of the unknown dynamics can still lead to applying falsely certified control inputs. To this
end, ideas from robust control can be used to guarantee set invariance (safety level III) during the
learning process by mapping a bounded uncertainty in the dynamics to a bounded uncertainty in
the time derivatives of the CBF or CLF (122, 123), or by accounting for all model errors consis-
tent with the collected data (124). In addition, adaptive control approaches have been proposed
to allow safe adaptation of parametric uncertainties in the time derivatives of the CBF or CLF
(125, 126). Probabilistic learning techniques for CBFs and CLFs have been used to achieve set
invariance probabilistically (safety level II) with varying assumptions about the system dynamics:
that the function fu (x) is fully known (127, 128), that a nominal model is known (129), or that no
nominal model is available (130). A recent extension (131) introduced measurement-robust CBFs
that account for errors in the state estimation and allow for safe learning-based updates of the
measurement model.
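A standard way to use these conditions as a safety filter is the pointwise CBF quadratic program sketched below, which minimally modifies a desired input so that the nominal estimate of Ḃc satisfies a linear class-K relaxation α Bc(x) of Equation 18 (Equation 18 corresponds to α = 1); the robust and learning-based variants above effectively add margins or learned correction terms to the left-hand side of this constraint. The function names and interfaces are our illustrative assumptions.

```python
import numpy as np
import cvxpy as cp

def cbf_qp(u_des, x, f_x, f_u, B_c, dB_dx, alpha=1.0, u_min=None, u_max=None):
    """CBF quadratic program: minimally modify u_des so the nominal dB_c/dt
    satisfies the barrier condition.

    f_x(x), f_u(x) : nominal control-affine dynamics terms (Equation 17)
    B_c, dB_dx     : barrier function and its gradient
    Model errors in f_x, f_u are ignored in this sketch.
    """
    u = cp.Variable(u_des.shape[0])
    lie_f = dB_dx(x) @ f_x(x)              # drift part of dB_c/dt
    lie_g = dB_dx(x) @ f_u(x)              # input-dependent part of dB_c/dt
    constraints = [lie_f + lie_g @ u >= -alpha * B_c(x)]
    if u_min is not None:
        constraints.append(u >= u_min)
    if u_max is not None:
        constraints.append(u <= u_max)
    problem = cp.Problem(cp.Minimize(cp.sum_squares(u - u_des)), constraints)
    problem.solve()
    return u.value
```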

3.3.2.2. Hamilton–Jacobi reachability analysis. Another approach for state constraint set cer-
tification of a learning-based controller is via Hamilton–Jacobi reachability analysis. This analysis
provides a means to estimate a robust positive control invariant safe set X_safe under dynamics
uncertainties and Equation 16. Consider a nonlinear system subject to unknown but bounded dis-
turbances f̂(x) ∈ D(x), where D(x) is assumed to be known but possibly conservative. To compute
X_safe, a two-player, zero-sum differential game is formulated:
 

V(x) = max_{usig∈Usig} min_{f̂sig∈Dsig} inf_{k≥0} lc(φ(x, k; usig, f̂sig)),    19.

where V is the value function associated with a point x ∈ X, lc : X → R is a cost function that is
nonnegative for x ∈ Xc and negative otherwise, φ(x, k; usig , f̂sig ) denotes the state at k along a tra-
jectory initialized at x following input signal usig and disturbance signal f̂sig , and Usig and Dsig are



collections of input and disturbance signals such that each time instance is in U and D, respectively.
The value function V can be found as the unique viscosity solution of the Hamilton–Jacobi–Isaacs
variational inequality (132). The safe set is then X_safe = {x ∈ X | V(x) ≥ 0}. Based on this formula-
tion, we can also obtain an optimally safe policy π∗_safe that maximally steers the system toward the
safe set X_safe (i.e., in the greatest ascent direction of V). The Hamilton–Jacobi reachability analysis
allows us to define a safety filter for learning-based control approaches to guarantee constraint set
satisfaction (safety level II or III). In particular, given X_safe and π∗_safe, one can safely learn in the in-
terior of X_safe and apply the optimally safe policy π∗_safe if the system reaches the boundary of X_safe.
To reduce the conservativeness of the approach, Reference 133 proposed a GP-based learning
scheme to adapt (and shrink) the unknown dynamics set D(x) based on observed data.
The general Hamilton–Jacobi reachability analysis framework (132) has also been combined

with online dynamics model learning for a target-tracking task (134), with online planning for
safe exploration (135) and with temporal difference algorithms for safe RL (136). Reference 137
integrated Hamilton–Jacobi reachability analysis and CBFs to compute smoother control policies
while circumventing the need to hand-design appropriate CBFs. In another recent extension,

Reference 138 proposed modifications that improve the scalability of the Hamilton–Jacobi safety
analysis approach for higher-dimensional systems and demonstrated its use on a 10-dimensional
quadrotor trajectory-tracking problem.
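The least-restrictive filter implied by this construction can be stated in a few lines; value_fn and safe_policy are placeholders for the precomputed value function (e.g., interpolated from a grid) and the optimally safe policy derived from it.

```python
def hj_safety_filter(x, u_learn, value_fn, safe_policy, margin=0.0):
    """Least-restrictive Hamilton-Jacobi safety filter: pass the learning-based
    input through in the interior of the safe set (V(x) > margin), and hand
    control to the optimally safe policy on or past its boundary."""
    if value_fn(x) > margin:
        return u_learn
    return safe_policy(x)
```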

3.3.2.3. Predictive safety filters. Predictive safety filters can augment any learning-based con-
troller to enforce state constraints x ∈ Xc and input constraints u ∈ Uc . They do this by defining
the safe invariant set X_safe from Equation 16b as the set of states (at the next time step) where a
sequence of safe control inputs (e.g., from a backup controller) exists that allows the return to a
terminal safe set Xterm or to previously visited safe states.
Model predictive safety certification (MPSC) uses the theory of robust MPC in Section 2.2
and learning-based robust MPC in Section 3.1.3 to filter the output of any learning-based con-
troller, such as the controller of an RL method, to ensure robust constraint satisfaction. The
simplest implementation of MPSC (139) uses tube-based MPC and considers the constraints in
Equations 12b–12d but replaces the cost in Equation 12a with the cost in Equation 16a to find the
closest input uk to the learned input ulearn,k at the current time step that guarantees that we will
continue to satisfy state and input constraints in the future. The main difference between MPSC
and learning-based robust MPC described in Section 3.1.3 is that the terminal safe set Xterm in
Equation 12d is not coupled with the selection of the cost function in Equation 12a. Instead, the
terminal safe set is conservatively initialized with Xterm = X_tube and can grow to include state tra-
jectories from previous iterations. This approach has been extended to probabilistic constraints by
considering a probabilistic tube X_tube (140) and to nonlinear nominal models (141) (safety level II).
Backup control for safe exploration ensures hard state constraint satisfaction (safety level III)
by finding a safe backup controller for any given RL policy π (142). Under the assumptions of
a known bound D(x, u) on the dynamics f and a distance measure to the state constraints Xc ,
the backup controller is used to obtain a future state in the neighborhood of a previously visited
safe state in some prediction horizon. Before a control input uk from π is applied to the system, all
possible predicted states xk+1 must satisfy (a) xk+1 ∈ Xc and (b) the existence of a safe backup action
ucertified (xk+1 ). Otherwise, the previous backup control input ucertified (xk ) is applied. This procedure
guarantees that the system state stays inside a robust positive control invariant set X_safe ⊆ Xc.
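The certification logic of this backup scheme can be sketched as follows; predict_reachable_set, in_constraints, and backup_controller are placeholders for the reachability bound induced by D(x, u), the constraint check, and the backup policy of Reference 142, and the bookkeeping of the stored backup input is simplified.

```python
def certified_step(x, u_proposed, u_backup_prev,
                   predict_reachable_set, in_constraints, backup_controller):
    """Certify one learning-based input against a backup controller.

    predict_reachable_set(x, u) : iterable of possible next states under D(x, u)
    in_constraints(x)           : True if x lies in the constraint set X_c
    backup_controller(x)        : safe backup input at x, or None if none is known
    Returns the input to apply and the backup input to store for the next step.
    """
    next_states = list(predict_reachable_set(x, u_proposed))
    certifiable = all(in_constraints(xn) and backup_controller(xn) is not None
                      for xn in next_states)
    if certifiable:
        return u_proposed, backup_controller(x)   # apply the RL input, refresh the backup
    return u_backup_prev, u_backup_prev           # otherwise fall back to the stored input
```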

4. BENCHMARKS
The approaches discussed in Section 3 have been evaluated in vastly different ways (Figure 5).
Figure 5
Summary of the environments used for evaluation. With increasing complexity, they can be classified as abstract numerical examples and grid worlds (39, 47, 51–56, 58, 60, 61, 69, 70, 73, 78, 80, 82, 84, 88, 93, 96, 98, 109, 137, 139); robot simulations and physics-based RL environments (34, 36–38, 40, 43, 44, 64–66, 68, 83, 85, 89, 92, 103, 106, 112, 118, 121, 122, 124–127, 129–131, 136, 138, 140–142); and real-world robot experiments (42, 45, 46, 49, 62, 63, 75, 79, 87, 91, 97, 104, 105, 107, 110, 113, 119, 120, 123, 128, 133–135). The histograms show the prevalence of each category in Sections 3.1–3.3, as well as the fraction whose code is open source. Abbreviation: RL, reinforcement learning.

The trends we observe are that (a) works that learn uncertain dynamics (Section 3.1) include a
preponderance of abstract examples; (b) works that encourage safety in RL (Section 3.2) still
mostly use numerical examples and grid worlds, but robot simulations [often based on physics
engines (143) such as MuJoCo] are equally common; and (c) works that certify learning-based con-
trol (Section 3.3) are still mostly simulated but also account for the largest fraction of real-world
experiments. Although numerical examples make it difficult to gauge the practical applicability
of a method, we note that even many physics-based RL environments are not representative of
existing robotic platforms. In an ideal world, all research would be demonstrated in simulations
that closely resemble the target system—and brought to real robots whenever possible.
Furthermore, only a minority of software implementations from published research have been
open-sourced. Even in RL, where this is more common (see the red bars in Figure 5)—and stan-
dardized tools such as gym (144) exist—the reproducibility of results (which often rely on careful
hyperparameter tuning) remains limited (72). With regard to safety, simple RL environments aug-
mented with constraint evaluation (20) and disturbances (5) have been proposed but lack a unified
simulation interface for both safe RL and learning-based control approaches—that is, one that
also exposes the available a priori knowledge of a system.
We believe that a necessary stepping-stone for the advancement of safe learning control is to
create physics-based environments that (a) are simple enough to promote adoption, (b) are realistic
enough to represent meaningful robotic platforms, (c) are equipped with intuitive interfaces for
both control and RL researchers, and (d) provide useful comparison metrics (e.g., the amount of
data required by different approaches).

4.1. Cart–Pole and Quadrotor Benchmark Environments


For these reasons, we created an open-source benchmark suite (6, 7) that simulates two plat-
forms highly popular with both control and RL research: (a) a cart–pole system (64, 103, 136) and



(b) a quadrotor (44, 66, 129, 142, 145). What sets our implementation apart from previous safe
RL environments (5, 20) is the extension of the traditional API (144) with features to facilitate
(a) the integration of approaches developed by the control theory community (i.e., symbolic nom-
inal models) and (b) the evaluation of safety and robustness (i.e., state and input constraints, ran-
domized inertial properties, initial positions and velocities, and external disturbances).

4.2. Safe Learning Control Results


We focus on a constrained stabilization task in both the cart–pole and quadrotor environments.
Our results are meant not to establish the superiority of one approach over another but to show
how the methods in References 63, 70, and 139, taken from Sections 3.1 (learning uncertain dy-

namics), 3.2 (encouraging safety in RL), and 3.3 (certifying learning-based control), respectively,
can improve control performance while pursuing constraint satisfaction. In doing so, we also show
that our benchmark supports algorithms developed by both the control and RL research commu-
nities. This also allows us to better compare the data hungriness of the different safe learning

control approaches.
For Section 3.1 (learning uncertain dynamics), we implemented a learning-based robust MPC
with a GP estimate of f̂ (GP-MPC) to stabilize a quadrotor subject to a state constraint and input
constraints, as in Reference 63. A linearization about hover, with a mass and moment of inertia
at 150% of the true values, was used as the prior model. Hyperparameter optimization was per-
formed offline, using 800 randomly selected state–action pairs (equivalent to 80 s of training data).
Figure 6 compares the performance of linear MPC, using the (incorrect, heavier) prior model,
with the GP-MPC approach. We see that linear MPC predicts the trajectory of the quadrotor to
be relatively shallow when maximum thrust is applied, which results in the quadrotor quickly vi-
olating the position constraint. By contrast, GP-MPC is able to account for the inaccurate model
and satisfies the constraint by a margin proportional to the 95% confidence interval on its predic-
tions, stabilizing the quadrotor.
Figure 6
Position comparison of a two-dimensional quadrotor stabilization using linear MPC (red) and GP-MPC (blue), along with the prediction horizons at the second time step, subject to a diagonal state constraint (gray) and input constraints. Abbreviations: GP-MPC, learning-based robust model predictive control with a Gaussian process estimate of f̂; MPC, model predictive control.

For Section 3.2 (encouraging safety in RL), we combined the safe exploration approach in
Reference 70 with the popular deep RL algorithm proximal policy optimization (PPO) (146) and
applied it to cart–pole stabilization with constraints on the cart position. Notably, the task ter-
minates upon any constraint violation. We compared this approach with two baselines: standard
PPO and PPO with naive cost shaping (i.e., a penalty when close to constraint violation). Each
approach used more than 9 h of simulation time to collect training data. Figure 7 shows that,
with sufficient training, the two constraint-aware approaches (cost shaping and safe exploration)
achieve their best performance and have substantially fewer constraint violations than standard
PPO. In terms of constraint satisfaction, safe exploration outperforms cost shaping without com-
promising convergence speed. Safe exploration, however, requires careful parameter tuning of the
slack variable dictating the responsiveness to near constraint violation.

Figure 7
Total (a) cost and (b) constraint violations during learning for PPO, PPO with cost shaping (for two parameterizations), and PPO with safe exploration (for two slack variable values). Plotted are medians with upper and lower quantiles over 10 seeds. Abbreviation: PPO, proximal policy optimization.
Finally, for Section 3.3 (certifying learning-based control), we implemented an MPSC algo-
rithm based on Reference 139. This particular formulation uses an MPC framework to modify an
unsafe learning controller’s actions. Here, a suboptimal PPO controller provides the uncertified
inputs trying to stabilize the cart–pole system. The advantages of using MPSC are highlighted
in Figure 8. In Figure 8a, the inputs are modified by the MPSC early in the stabilization to keep
the cart–pole system within the constraint boundaries. Figure 8b shows that without MPSC, PPO
would violate the constraints, but with MPSC, it manages to stay within the boundaries. The plot
also shows that MPSC is most active when the system is close to the constraint boundaries. This
provides a proof of concept of how safety filters can be combined with RL control to improve
safety.

Figure 8
(a) Plot of uncertified PPO input (red) against certified MPSC + PPO input (blue). (b) Cart–pole state diagram (θ and θ̇) comparing the MPSC + PPO certified trajectory (blue) and the uncertified PPO trajectory (red). Green dots show when the MPSC modified the learning controller’s input; the MPSC is most active when the system is about to leave the constraint boundary (gray) or the set of states from which the MPSC can correct the system. Abbreviations: MPSC, model predictive safety certification; PPO, proximal policy optimization.



5. DISCUSSION AND PERSPECTIVES ON FUTURE DIRECTIONS
The problem of safe learning control is emerging as a crucial topic for next-generation robotics.
In this review, we have summarized approaches from the control and the machine learning com-
munities that allow data to be safely used to improve the closed-loop performance of robot control
systems. We showed that machine learning techniques, particularly RL, can help generalization
toward larger classes of systems (i.e., fewer prior model assumptions), while control theory offers
the insights and frameworks necessary to provide constraint satisfaction guarantees and closed-
loop stability guarantees during the learning. Despite the many advances to date, there remain
many opportunities for future research.

FUTURE ISSUES
1. Capturing a broader class of systems: Work to date has focused on nonlinear systems in
the form of Equation 1. While they can model many robotic platforms, robots can also

exhibit hybrid dynamics [e.g., legged robots or other contact dynamics with the envi-
ronment (147)], time-varying dynamics [e.g., operation in changing environments (148,
149)], time delays (e.g., in actuation, sensing, or observing the reward), or partial differ-
ential dependencies in the dynamics [e.g., in continuum robotics (150)]. Expanding safe
learning control approaches to these scenarios is essential for their broader applicability
in robotics.
2. Accounting for imperfect state measurements: The majority of safe learning control ap-
proaches assume direct access to (possibly noisy) state measurements and neglect the
problem of state estimation. In practice, obtaining accurate state information is challeng-
ing due to sensors that do not provide state measurements directly (e.g., images as mea-
surements), inaccurate process and observation models used for state estimation, and/
or improper state feature representations. Expanding existing approaches to work with
(possibly high-dimensional) sensor data is essential for a broad applicability of these
methods in robotics.
3. Considering scalability as well as sampling and computational efficiency: Many of the
approaches presented here have been demonstrated only on small toy problems, and
applying them to high-dimensional robotics problems is not trivial. Moreover, in prac-
tice, we often face issues such as data sparsity, distribution shifts, and the optimality–
complexity trade-off for real-time implementations. Efficient robot learning relies on
multiple factors, including control architecture design (151), systematic training data col-
lection (152), and appropriate function class selection (153). While current approaches
focus on providing theoretical safety guarantees, formal analysis of sampling complexity
and computational complexity is indispensable to facilitate the implementation of safe
learning control algorithms in real-world robot applications.
4. Verifying system and modeling assumptions: The safety guarantees provided often rely
on a set of assumptions (e.g., Lipschitz continuous true dynamics with a known Lipschitz
constant or bounded disturbance sets). It is difficult to verify these assumptions prior to a
robot’s operation. To facilitate algorithm implementation, we also see other approxima-
tions being made (e.g., linearization, or data assumed to be independent and identically
distributed Gaussian samples). Systematic approaches to verify or quantify the impact of
the assumptions and the approximations with minimal (online) data are crucial to allow



the safe learning approaches to be used in real-world applications. This can also include
investigations into the interpretability of trained models, especially black-box models
such as deep neural networks, for safe closed-loop operation.

DISCLOSURE STATEMENT
The authors are not aware of any affiliations, memberships, funding, or financial holdings that
might be perceived as affecting the objectivity of this review.

ACKNOWLEDGMENTS
The authors would like to acknowledge the early contributions to this work by Karime Pereida
and Sepehr Samavi, the invaluable suggestions and feedback by Hallie Siegel, and the support
from the Natural Sciences and Engineering Research Council of Canada, the Canada Research

Chairs program, and the CIFAR AI Chairs program.

LITERATURE CITED
1. Burnett K, Qian J, Du X, Liu L, Yoon DJ, et al. 2021. Zeus: a system description of the two-time winner
of the collegiate SAE autodrive competition. J. Field Robot. 38:139–66
2. Boutilier JJ, Brooks SC, Janmohamed A, Byers A, Buick JE, et al. 2017. Optimizing a drone network to
deliver automated external defibrillators. Circulation 135:2454–65
3. Dong K, Pereida K, Shkurti F, Schoellig AP. 2020. Catch the ball: accurate high-speed motions for
mobile manipulators via inverse dynamics learning. arXiv:2003.07489 [[Link]]
4. García J, Fernández F. 2015. A comprehensive survey on safe reinforcement learning. J. Mach. Learn.
Res. 16:1437–80
5. Dulac-Arnold G, Levine N, Mankowitz DJ, Li J, Paduraru C, et al. 2021. An empirical investigation of
the challenges of real-world reinforcement learning. arXiv:2003.11881 [[Link]]
6. Dyn. Syst. Lab. 2021. safe-control-gym. GitHub. [Link]
7. Yuan Z, Hall AW, Zhou S, Brunke L, Greeff M, et al. 2021. safe-control-gym: a unified benchmark suite
for safe learning-based control and reinforcement learning. arXiv:2109.06325 [[Link]]
8. Dulac-Arnold G, Mankowitz D, Hester T. 2019. Challenges of real-world reinforcement learning.
arXiv:1904.12901 [[Link]]
9. Hewing L, Wabersich KP, Menner M, Zeilinger MN. 2020. Learning-based model predictive control:
toward safe learning in control. Annu. Rev. Control Robot. Auton. Syst. 3:269–96
10. Bristow D, Tharayil M, Alleyne A. 2006. A survey of iterative learning control. IEEE Control Syst. Mag.
26(3):96–114
11. Ahn HS, Chen Y, Moore KL. 2007. Iterative learning control: brief survey and categorization. IEEE
Trans. Syst. Man Cybernet. C 37:1099–121
12. Polydoros AS, Nalpantidis L. 2017. Survey of model-based reinforcement learning: applications on
robotics. J. Intell. Robot. Syst. 86:153–73
13. Chatzilygeroudis K, Vassiliades V, Stulp F, Calinon S, Mouret JB. 2020. A survey on policy search algo-
rithms for learning robot controllers in a handful of trials. IEEE Trans. Robot. 36:328–47
14. Ravichandar H, Polydoros AS, Chernova S, Billard A. 2020. Recent advances in robot learning from
demonstration. Annu. Rev. Control Robot. Auton. Syst. 3:297–330
15. Kober J, Bagnell JA, Peters J. 2013. Reinforcement learning in robotics: a survey. Int. J. Robot. Res.
32:1238–74
16. Recht B. 2019. A tour of reinforcement learning: the view from continuous control. Annu. Rev. Control
Robot. Auton. Syst. 2:253–79



17. Kiumarsi B, Vamvoudakis KG, Modares H, Lewis FL. 2018. Optimal and autonomous control using
reinforcement learning: a survey. IEEE Trans. Neural Netw. Learn. Syst. 29:2042–62
18. Osborne M, Shin HS, Tsourdos A. 2021. A review of safe online learning for nonlinear control systems.
In 2021 International Conference on Unmanned Aircraft Systems (ICUAS), pp. 794–803. Piscataway, NJ:
IEEE
19. Tambon F, Laberge G, An L, Nikanjam A, Mindom PSN, et al. 2021. How to certify machine learning
based safety-critical systems? A systematic literature review. arXiv:2107.12045 [[Link]]
20. Ray A, Achiam J, Amodei D. 2019. Benchmarking safe exploration in deep reinforcement learning. Preprint,
OpenAI, San Francisco, CA. [Link]
21. Leike J, Martic M, Krakovna V, Ortega PA, Everitt T, et al. 2017. AI safety gridworlds. arXiv:1711.09883
[[Link]]
22. Khalil H. 2002. Nonlinear Systems. Upper Saddle River, NJ: Prentice Hall. 3rd ed.

23. Sastry S, Bodson M. 2011. Adaptive Control: Stability, Convergence and Robustness. Mineola, NY: Dover
24. Nguyen-Tuong D, Peters J. 2011. Model learning for robot control: a survey. Cogn. Process. 12:319–40
25. Zhou K, Doyle J, Glover K. 1996. Robust and Optimal Control. Upper Saddle River, NJ: Prentice Hall
26. Dullerud G, Paganini F. 2005. A Course in Robust Control Theory: A Convex Approach. New York: Springer
27. Rawlings J, Mayne D, Diehl M. 2017. Model Predictive Control: Theory, Computation, and Design. Santa

Barbara, CA: Nob Hill


28. Mayne D, Seron M, Raković S. 2005. Robust model predictive control of constrained linear systems with
bounded disturbances. Automatica 41:219–24
29. Arulkumaran K, Deisenroth MP, Brundage M, Bharath AA. 2017. Deep reinforcement learning: a brief
survey. IEEE Signal Process. Mag. 34(6):26–38
30. Dai B, Shaw A, Li L, Xiao L, He N, et al. 2018. SBEED: convergent reinforcement learning with non-
linear function approximation. In Proceedings of the 35th International Conference on Machine Learning, ed.
J Dy, A Krause, pp. 1125–34. Proc. Mach. Learn. Res. 80. N.p.: PMLR
31. Cheng R, Verma A, Orosz G, Chaudhuri S, Yue Y, Burdick J. 2019. Control regularization for reduced
variance reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning,
ed. K Chaudhuri, R Salakhutdinov, pp. 1141–50. Proc. Mach. Learn. Res. 97. N.p.: PMLR
32. Ghavamzadeh M, Mannor S, Pineau J, Tamar A. 2015. Bayesian reinforcement learning: a survey. Found.
Trends Mach. Learn. 8:359–483
33. Altman E. 1999. Constrained Markov Decision Processes. Boca Raton, FL: Chapman & Hall/CRC
34. Achiam J, Held D, Tamar A, Abbeel P. 2017. Constrained policy optimization. In Proceedings of the 34th
International Conference on Machine Learning, ed. D Precup, YW Teh, pp. 22–31. Proc. Mach. Learn. Res.
70. N.p.: PMLR
35. Nilim A, El Ghaoui L. 2005. Robust control of Markov decision processes with uncertain transition
matrices. Oper. Res. 53:780–98
36. Pinto L, Davidson J, Sukthankar R, Gupta A. 2017. Robust adversarial reinforcement learning. In Pro-
ceedings of the 34th International Conference on Machine Learning, ed. D Precup, YW Teh, pp. 2817–26.
Proc. Mach. Learn. Res. 70. N.p.: PMLR
37. Pan X, Seita D, Gao Y, Canny J. 2019. Risk averse robust adversarial reinforcement learning. In 2019
International Conference on Robotics and Automation (ICRA), pp. 8522–28. Piscataway, NJ: IEEE
38. Vinitsky E, Du Y, Parvate K, Jang K, Abbeel P, Bayen A. 2020. Robust reinforcement learning using
adversarial populations. arXiv:2008.01825 [[Link]]
39. Cooper J, Che J, Cao C. 2014. The use of learning in fast adaptation algorithms. Int. J. Adapt. Control
Signal Process. 28:325–40
40. Gahlawat A, Zhao P, Patterson A, Hovakimyan N, Theodorou E. 2020. L1-GP: L1 adaptive control
with Bayesian learning. In Proceedings of the 2nd Conference on Learning for Dynamics and Control, ed.
AM Bayen, A Jadbabaie, G Pappas, PA Parrilo, B Recht, et al., pp. 826–37. Proc. Mach. Learn. Res. 120.
N.p.: PMLR
41. Hovakimyan N, Cao C. 2010. L1 Adaptive Control Theory: Guaranteed Robustness with Fast Adaptation.
Philadelphia: Soc. Ind. Appl. Math.
42. Grande RC, Chowdhary G, How JP. 2014. Experimental validation of Bayesian nonparametric adaptive
control using Gaussian processes. J. Aerosp. Inf. Syst. 11:565–78



43. Chowdhary G, Kingravi HA, How JP, Vela PA. 2015. Bayesian nonparametric adaptive control using
Gaussian processes. IEEE Trans. Neural Netw. Learn. Syst. 26:537–50
44. Joshi G, Chowdhary G. 2019. Deep model reference adaptive control. In 2019 IEEE 58th Conference on
Decision and Control (CDC), pp. 4601–8. Piscataway, NJ: IEEE
45. Joshi G, Virdi J, Chowdhary G. 2020. Asynchronous deep model reference adaptive control.
arXiv:2011.02920 [[Link]]
46. Berkenkamp F, Schoellig AP. 2015. Safe and robust learning control with Gaussian processes. In 2015
European Control Conference (ECC), pp. 2496–501. Piscataway, NJ: IEEE
47. Holicki T, Scherer CW, Trimpe S. 2021. Controller design via experimental exploration with robustness
guarantees. IEEE Control Syst. Lett. 5:641–46
48. von Rohr A, Neumann-Brosig M, Trimpe S. 2021. Probabilistic robust linear quadratic regulators
with Gaussian processes. In Proceedings of the 3rd Conference on Learning for Dynamics and Control, ed.

A Jadbabaie, J Lygeros, GJ Pappas, PA Parrilo, B Recht, et al., pp. 324–35. Proc. Mach. Learn. Res. 144.
N.p.: PMLR
49. Helwa MK, Heins A, Schoellig AP. 2019. Provably robust learning-based approach for high-accuracy
tracking control of Lagrangian systems. IEEE Robot. Autom. Lett. 4:1587–94

50. Greeff M, Schoellig AP. 2021. Exploiting differential flatness for robust learning-based tracking control
using Gaussian processes. IEEE Control Syst. Lett. 5:1121–26
51. Tanaskovic M, Fagiano L, Smith R, Morari M. 2014. Adaptive receding horizon control for constrained
MIMO systems. Automatica 50:3019–29
52. Lorenzen M, Cannon M, Allgöwer F. 2019. Robust MPC with recursive model update. Automatica
103:461–71
53. Bujarbaruah M, Zhang X, Borrelli F. 2018. Adaptive MPC with chance constraints for FIR systems. In
2018 Annual American Control Conference (ACC), pp. 2312–17. Piscataway, NJ: IEEE
54. Bujarbaruah M, Zhang X, Tanaskovic M, Borrelli F. 2019. Adaptive MPC under time varying uncertainty:
robust and stochastic. arXiv:1909.13473 [[Link]]
55. Gonçalves GA, Guay M. 2016. Robust discrete-time set-based adaptive predictive control for nonlinear
systems. J. Process Control 39:111–22
56. Köhler J, Kötting P, Soloperto R, Allgöwer F, Müller MA. 2021. A robust adaptive model predictive
control framework for nonlinear uncertain systems. Int. J. Robust Nonlinear Control 31:8725–49
57. Rosolia U, Borrelli F. 2018. Learning model predictive control for iterative tasks. A data-driven control
framework. IEEE Trans. Autom. Control 63:1883–96
58. Bujarbaruah M, Zhang X, Rosolia U, Borrelli F. 2018. Adaptive MPC for iterative tasks. In 2018 IEEE
Conference on Decision and Control (CDC), pp. 6322–27. Piscataway, NJ: IEEE
59. Pereida K, Brunke L, Schoellig AP. 2021. Robust adaptive model predictive control for guaranteed fast
and accurate stabilization in the presence of model errors. Int. J. Robust Nonlinear Control 31:8750–84
60. Aswani A, Gonzalez H, Sastry SS, Tomlin C. 2013. Provably safe and robust learning-based model pre-
dictive control. Automatica 49:1216–26
61. Soloperto R, Müller MA, Trimpe S, Allgöwer F. 2018. Learning-based robust model predictive control
with state-dependent uncertainty. IFAC-PapersOnLine 51(20):442–47
62. Ostafew CJ, Schoellig AP, Barfoot TD. 2016. Robust constrained learning-based NMPC enabling reli-
able mobile robot path tracking. Int. J. Robot. Res. 35:1547–63
63. Hewing L, Kabzan J, Zeilinger MN. 2020. Cautious model predictive control using Gaussian process
regression. IEEE Trans. Control Syst. Technol. 28:2736–43
64. Kamthe S, Deisenroth M. 2018. Data-efficient reinforcement learning with probabilistic model predic-
tive control. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics,
ed. A Storkey, F Perez-Cruz, pp. 1701–10. Proc. Mach. Learn. Res. 84. N.p.: PMLR
65. Koller T, Berkenkamp F, Turchetta M, Boedecker J, Krause A. 2019. Learning-based model predictive
control for safe exploration and reinforcement learning. arXiv:1906.12189 [[Link]]
66. Fan D, Agha A, Theodorou E. 2020. Deep learning tubes for tube MPC. In Robotics: Science and Systems
XVI, ed. M Toussaint, A Bicchi, T Hermans, pap. 87. N.p.: Robot. Sci. Syst. Found.



67. McKinnon CD, Schoellig AP. 2020. Context-aware cost shaping to reduce the impact of model error
in receding horizon control. In 2020 IEEE International Conference on Robotics and Automation (ICRA),
pp. 2386–92. Piscataway, NJ: IEEE
68. Berkenkamp F, Turchetta M, Schoellig A, Krause A. 2017. Safe model-based reinforcement learning with
stability guarantees. In Advances in Neural Information Processing Systems 30, ed. I Guyon, UV Luxburg,
S Bengio, H Wallach, R Fergus, et al., pp. 908–19. Red Hook, NY: Curran
69. Turchetta M, Berkenkamp F, Krause A. 2016. Safe exploration in finite Markov decision processes with
Gaussian processes. In Advances in Neural Information Processing Systems 29, ed. DD Lee, M Sugiyama,
UV Luxburg, I Guyon, R Garnett, pp. 4312–20. Red Hook, NY: Curran
70. Dalal G, Dvijotham K, Vecerik M, Hester T, Paduraru C, Tassa Y. 2018. Safe exploration in continuous
action spaces. arXiv:1801.08757 [[Link]]
71. Sutton RS, Barto AG. 2018. Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press.

2nd ed.
72. Henderson P, Islam R, Bachman P, Pineau J, Precup D, Meger D. 2018. Deep reinforcement learning
that matters. In The Thirty-Second AAAI Conference on Artificial Intelligence, pp. 3207–14. Palo Alto, CA:
AAAI Press
73. Moldovan TM, Abbeel P. 2012. Safe exploration in Markov decision processes. In Proceedings of the 29th International Conference on Machine Learning (ICML), pp. 1451–58. Madison, WI: Omnipress
74. Brafman RI, Tennenholtz M. 2002. R-max – a general polynomial time algorithm for near-optimal re-
inforcement learning. J. Mach. Learn. Res. 3:213–31
75. Pham TH, De Magistris G, Tachibana R. 2018. OptLayer - practical constrained optimization for deep
reinforcement learning in the real world. In 2018 IEEE International Conference on Robotics and Automation
(ICRA), pp. 6236–43. Piscataway, NJ: IEEE
76. Kim Y, Allmendinger R, López-Ibáñez M. 2021. Safe learning and optimization techniques: towards a
survey of the state of the art. arXiv:2101.09505 [[Link]]
77. Duivenvoorden RR, Berkenkamp F, Carion N, Krause A, Schoellig AP. 2017. Constrained Bayesian
optimization with particle swarms for safe adaptive controller tuning. IFAC-PapersOnLine 50(1):11800–
7
78. Sui Y, Gotovos A, Burdick J, Krause A. 2015. Safe exploration for optimization with Gaussian processes.
In Proceedings of the 32nd International Conference on Machine Learning, ed. F Bach, D Blei, pp. 997–1005.
Proc. Mach. Learn. Res. 37. N.p.: PMLR
79. Berkenkamp F, Krause A, Schoellig AP. 2020. Bayesian optimization with safety constraints: safe and
automatic parameter tuning in robotics. arXiv:1602.04450 [[Link]]
80. Sui Y, Zhuang V, Burdick J, Yue Y. 2018. Stagewise safe Bayesian optimization with Gaussian processes.
In Proceedings of the 35th International Conference on Machine Learning, ed. J Dy, A Krause, pp. 4781–89.
Proc. Mach. Learn. Res. 80. N.p.: PMLR
81. Baumann D, Marco A, Turchetta M, Trimpe S. 2021. GoSafe: globally optimal safe robot learning.
arXiv:2105.13281 [[Link]]
82. Wachi A, Sui Y, Yue Y, Ono M. 2018. Safe exploration and optimization of constrained MDPs using
Gaussian processes. In The Thirty-Second AAAI Conference on Artificial Intelligence, pp. 6548–55. Palo
Alto, CA: AAAI Press
83. Srinivasan K, Eysenbach B, Ha S, Tan J, Finn C. 2020. Learning to be safe: deep RL with a safety critic.
arXiv:2010.14603 [[Link]]
84. Thananjeyan B, Balakrishna A, Nair S, Luo M, Srinivasan K, et al. 2021. Recovery RL: safe reinforcement
learning with learned recovery zones. IEEE Robot. Autom. Lett. 6:4915–22
85. Bharadhwaj H, Kumar A, Rhinehart N, Levine S, Shkurti F, Garg A. 2021. Conservative safety critics
for exploration. arXiv:2010.14497 [[Link]]
86. Kumar A, Zhou A, Tucker G, Levine S. 2020. Conservative Q-learning for offline reinforcement learn-
ing. arXiv:2006.04779 [[Link]]
87. Kahn G, Villaflor A, Pong V, Abbeel P, Levine S. 2017. Uncertainty-aware reinforcement learning for
collision avoidance. arXiv:1702.01182 [[Link]]
88. Lütjens B, Everett M, How JP. 2019. Safe reinforcement learning with model uncertainty estimates. In
2019 International Conference on Robotics and Automation (ICRA), pp. 8662–68. Piscataway, NJ: IEEE
89. Zhang J, Cheung B, Finn C, Levine S, Jayaraman D. 2020. Cautious adaptation for reinforcement learn-
ing in safety-critical settings. In Proceedings of the 37th International Conference on Machine Learning, ed.
HD Daumé III, A Singh, pp. 11055–65. Proc. Mach. Learn. Res. 119. N.p.: PMLR
90. Chua K, Calandra R, McAllister R, Levine S. 2018. Deep reinforcement learning in a handful of trials
using probabilistic dynamics models. In Advances in Neural Information Processing Systems 31, ed. S Bengio,
H Wallach, H Larochelle, K Grauman, N Cesa-Bianchi, R Garnett, pp. 4759–70. Red Hook, NY: Curran
91. Thananjeyan B, Balakrishna A, Rosolia U, Li F, McAllister R, et al. 2020. Safety augmented value esti-
mation from demonstrations (SAVED): safe deep model-based RL for sparse cost robotic tasks. IEEE
Robot. Autom. Lett. 5:3612–19
92. Urpí NA, Curi S, Krause A. 2021. Risk-averse offline reinforcement learning. arXiv:2102.05371 [[Link]]
93. Chow Y, Ghavamzadeh M, Janson L, Pavone M. 2017. Risk-constrained reinforcement learning with
percentile risk criteria. J. Mach. Learn. Res. 18:6070–120
94. Liang Q, Que F, Modiano E. 2018. Accelerated primal-dual policy optimization for safe reinforcement
learning. arXiv:1802.06480 [[Link]]
95. Schulman J, Levine S, Abbeel P, Jordan M, Moritz P. 2015. Trust region policy optimization. In Pro-
ceedings of the 32nd International Conference on Machine Learning, ed. F Bach, D Blei, pp. 1889–97. Proc.
Mach. Learn. Res. 37. N.p.: PMLR
96. Chow Y, Nachum O, Duenez-Guzman E, Ghavamzadeh M. 2018. A Lyapunov-based approach to safe reinforcement learning. In Advances in Neural Information Processing Systems 31, ed. S Bengio, H Wallach, H Larochelle, K Grauman, N Cesa-Bianchi, R Garnett, pp. 8103–12. Red Hook, NY: Curran
97. Chow Y, Nachum O, Faust A, Duenez-Guzman E, Ghavamzadeh M. 2019. Lyapunov-based safe policy
optimization for continuous control. arXiv:1901.10031 [[Link]]
98. Satija H, Amortila P, Pineau J. 2020. Constrained Markov decision processes via backward value func-
tions. In Proceedings of the 37th International Conference on Machine Learning, ed. HD Daumé III, A Singh,
pp. 8502–11. Proc. Mach. Learn. Res. 119. N.p.: PMLR
99. Morimoto J, Doya K. 2005. Robust reinforcement learning. Neural Comput. 17:335–59
100. Turchetta M, Krause A, Trimpe S. 2020. Robust model-free reinforcement learning with multi-
objective Bayesian optimization. In 2020 IEEE International Conference on Robotics and Automation (ICRA),
pp. 10702–708. Piscataway, NJ: IEEE
101. Goodfellow IJ, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, et al. 2014. Generative adversarial
networks. arXiv:1406.2661 [[Link]]
102. Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, et al. 2015. Human-level control through deep
reinforcement learning. Nature 518:529–33
103. Lütjens B, Everett M, How JP. 2020. Certified adversarial robustness for deep reinforcement learning.
In Proceedings of the Conference on Robot Learning, ed. LP Kaelbling, D Kragic, K Sugiura, pp. 1328–37.
Proc. Mach. Learn. Res. 100. N.p.: PMLR
104. Sadeghi F, Levine S. 2017. CAD2RL: real single-image flight without a single real image. In Robotics:
Science and Systems XIII, ed. N Amato, S Srinivasa, N Ayanian, S Kuindersma, pap. 34. N.p.: Robot. Sci.
Syst. Found.
105. Loquercio A, Kaufmann E, Ranftl R, Dosovitskiy A, Koltun V, Scaramuzza D. 2020. Deep drone racing:
from simulation to reality with domain randomization. IEEE Trans. Robot. 36:1–14
106. Rajeswaran A, Ghotra S, Ravindran B, Levine S. 2017. EPOpt: learning robust neural network policies
using model ensembles. arXiv:1610.01283 [[Link]]
107. Mehta B, Diaz M, Golemo F, Pal CJ, Paull L. 2020. Active domain randomization. In Proceedings of the
Conference on Robot Learning, ed. LP Kaelbling, D Kragic, K Sugiura, pp. 1162–76. Proc. Mach. Learn.
Res. 100. N.p.: PMLR
108. Zhou S, Helwa MK, Schoellig AP. 2020. Deep neural networks as add-on modules for enhancing robot
performance in impromptu trajectory tracking. Int. J. Robot. Res. 39:1397–418
109. Jin M, Lavaei J. 2020. Stability-certified reinforcement learning: a control-theoretic perspective. IEEE
Access 8:229086–100
110. Shi G, Shi X, O’Connell M, Yu R, Azizzadenesheli K, et al. 2019. Neural lander: stable drone land-
ing control using learned dynamics. In 2019 International Conference on Robotics and Automation (ICRA),
pp. 9784–90. Piscataway, NJ: IEEE
111. Fazlyab M, Robey A, Hassani H, Morari M, Pappas GJ. 2019. Efficient and accurate estimation of Lip-
schitz constants for deep neural networks. arXiv:1906.04893 [[Link]]
112. Richards SM, Berkenkamp F, Krause A. 2018. The Lyapunov neural network: adaptive stability certifi-
cation for safe learning of dynamical systems. In Proceedings of the 2nd Conference on Robot Learning, ed.
A Billard, A Dragan, J Peters, J Morimoto, pp. 466–76. Proc. Mach. Learn. Res. 87. N.p.: PMLR
113. Zhou Z, Oguz OS, Leibold M, Buss M. 2020. A general framework to increase safety of learning al-
gorithms for dynamical systems based on region of attraction estimation. IEEE Trans. Robot. 36:1472–
90
114. Jarvis-Wloszek Z, Feeley R, Tan W, Sun K, Packard A. 2003. Some controls applications of sum of
squares programming. In 42nd IEEE International Conference on Decision and Control (CDC), Vol. 5,
pp. 4676–81. Piscataway, NJ: IEEE
115. Schilders WH, Van der Vorst HA, Rommes J. 2008. Model Order Reduction: Theory, Research Aspects and Applications. Berlin: Springer
116. Alshiekh M, Bloem R, Ehlers R, Könighofer B, Niekum S, Topcu U. 2018. Safe reinforcement
learning via shielding. In The Thirty-Second AAAI Conference on Artificial Intelligence, pp. 2669–78. Palo
Alto, CA: AAAI Press
117. Ames AD, Coogan S, Egerstedt M, Notomista G, Sreenath K, Tabuada P. 2019. Control barrier func-
tions: theory and applications. In 2019 18th European Control Conference (ECC), pp. 3420–31. Piscataway,
NJ: IEEE
118. Taylor AJ, Dorobantu VD, Le HM, Yue Y, Ames AD. 2019. Episodic learning with control Lyapunov
functions for uncertain robotic systems. In 2019 IEEE/RSJ International Conference on Intelligent Robots
and Systems (IROS), pp. 6878–84. Piscataway, NJ: IEEE
119. Taylor A, Singletary A, Yue Y, Ames A. 2020. Learning for safety-critical control with control bar-
rier functions. In Proceedings of the 2nd Conference on Learning for Dynamics and Control, ed. AM Bayen,
A Jadbabaie, G Pappas, PA Parrilo, B Recht, et al., pp. 708–17. Proc. Mach. Learn. Res. 120. N.p.:
PMLR
120. Ohnishi M, Wang L, Notomista G, Egerstedt M. 2019. Barrier-certified adaptive reinforcement learning
with applications to brushbot navigation. IEEE Trans. Robot. 35:1186–205
121. Choi J, Castañeda F, Tomlin C, Sreenath K. 2020. Reinforcement learning for safety-critical con-
trol under model uncertainty, using control Lyapunov functions and control barrier functions. In
Robotics: Science and Systems XVI, ed. M Toussaint, A Bicchi, T Hermans, pap. 88. N.p.: Robot. Sci. Syst.
Found.
122. Taylor AJ, Dorobantu VD, Krishnamoorthy M, Le HM, Yue Y, Ames AD. 2019. A control Lyapunov
perspective on episodic learning via projection to state stability. In 2019 IEEE 58th Conference on Decision
and Control (CDC), pp. 1448–55. Piscataway, NJ: IEEE
123. Taylor AJ, Singletary A, Yue Y, Ames AD. 2020. A control barrier perspective on episodic learning via
projection-to-state safety. arXiv:2003.08028 [[Link]]
124. Taylor AJ, Dorobantu VD, Dean S, Recht B, Yue Y, Ames AD. 2020. Towards robust data-driven control
synthesis for nonlinear systems with actuation uncertainty. arXiv:2011.10730 [[Link]]
125. Taylor AJ, Ames AD. 2020. Adaptive safety with control barrier functions. In 2020 American Control
Conference (ACC), pp. 1399–405. Piscataway, NJ: IEEE
126. Lopez BT, Slotine JJE, How JP. 2021. Robust adaptive control barrier functions: an adaptive and data-
driven approach to safety. IEEE Control Syst. Lett. 5:1031–36
127. Cheng R, Orosz G, Murray RM, Burdick JW. 2019. End-to-end safe reinforcement learning through
barrier functions for safety-critical continuous control tasks. In The Thirty-Third AAAI Conference on Arti-
ficial Intelligence, pp. 3387–95. Palo Alto, CA: AAAI Press
128. Fan DD, Nguyen J, Thakker R, Alatur N, Agha-mohammadi A, Theodorou EA. 2019. Bayesian
learning-based adaptive control for safety critical systems. arXiv:1910.02325 [[Link]]
129. Wang L, Theodorou EA, Egerstedt M. 2018. Safe learning of quadrotor dynamics using barrier certifi-
cates. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 2460–65. Piscataway,
NJ: IEEE
130. Khojasteh MJ, Dhiman V, Franceschetti M, Atanasov N. 2020. Probabilistic safety constraints for
learned high relative degree system dynamics. In Proceedings of the 2nd Conference on Learning for Dy-
namics and Control, ed. AM Bayen, A Jadbabaie, G Pappas, PA Parrilo, B Recht, et al., pp. 781–92. Proc.
Mach. Learn. Res. 120. N.p.: PMLR
131. Dean S, Taylor AJ, Cosner RK, Recht B, Ames AD. 2020. Guaranteeing safety of learned perception
modules via measurement-robust control barrier functions. arXiv:2010.16001 [[Link]]
132. Mitchell I, Bayen A, Tomlin C. 2005. A time-dependent Hamilton-Jacobi formulation of reachable sets
for continuous dynamic games. IEEE Trans. Autom. Control 50:947–57
133. Fisac JF, Akametalu AK, Zeilinger MN, Kaynama S, Gillula J, Tomlin CJ. 2019. A general safety frame-
work for learning-based control in uncertain robotic systems. IEEE Trans. Autom. Control 64:2737–
52
134. Gillula JH, Tomlin CJ. 2012. Guaranteed safe online learning via reachability: tracking a ground target
using a quadrotor. In 2012 IEEE International Conference on Robotics and Automation (ICRA), pp. 2723–30.
Piscataway, NJ: IEEE
135. Bajcsy A, Bansal S, Bronstein E, Tolani V, Tomlin CJ. 2019. An efficient reachability-based framework
for provably safe autonomous navigation in unknown environments. In 2019 IEEE 58th Conference on
Decision and Control (CDC), pp. 1758–65. Piscataway, NJ: IEEE
136. Fisac JF, Lugovoy NF, Rubies-Royo V, Ghosh S, Tomlin CJ. 2019. Bridging Hamilton-Jacobi safety
analysis and reinforcement learning. In 2019 International Conference on Robotics and Automation (ICRA),
pp. 8550–56. Piscataway, NJ: IEEE
137. Choi JJ, Lee D, Sreenath K, Tomlin CJ, Herbert SL. 2021. Robust control barrier-value functions for
safety-critical control. arXiv:2104.02808 [[Link]]
138. Herbert S, Choi JJ, Sanjeev S, Gibson M, Sreenath K, Tomlin CJ. 2021. Scalable learning of safety
guarantees for autonomous systems using Hamilton-Jacobi reachability. arXiv:2101.05916 [[Link]]
139. Wabersich KP, Zeilinger MN. 2018. Linear model predictive safety certification for learning-based con-
trol. In 2018 IEEE Conference on Decision and Control (CDC), pp. 7130–35. Piscataway, NJ: IEEE
140. Wabersich KP, Hewing L, Carron A, Zeilinger MN. 2019. Probabilistic model predictive safety certifi-
cation for learning-based control. arXiv:1906.10417 [[Link]]
141. Wabersich KP, Zeilinger MN. 2021. A predictive safety filter for learning-based control of constrained
nonlinear dynamical systems. Automatica 129:109597
142. Mannucci T, van Kampen E, de Visser C, Chu Q. 2018. Safe exploration algorithms for reinforcement
learning controllers. IEEE Trans. Neural Netw. Learn. Syst. 29:1069–81
143. Liu CK, Negrut D. 2021. The role of physics-based simulators in robotics. Annu. Rev. Control Robot.
Auton. Syst. 4:35–58
144. Brockman G, Cheung V, Pettersson L, Schneider J, Schulman J, et al. 2016. OpenAI Gym.
arXiv:1606.01540 [[Link]]
145. Panerati J, Zheng H, Zhou S, Xu J, Prorok A, Schoellig AP. 2021. Learning to fly—a Gym environment
with PyBullet physics for reinforcement learning of multi-agent quadcopter control. In 2021 IEEE/RSJ
International Conference on Intelligent Robots and Systems (IROS), pp. 7512–19. Piscataway, NJ: IEEE
146. Schulman J, Wolski F, Dhariwal P, Radford A, Klimov O. 2017. Proximal policy optimization algorithms.
arXiv:1707.06347 [[Link]]
147. Wieber PB, Tedrake R, Kuindersma S. 2016. Modeling and control of legged robots. In Springer Hand-
book of Robotics, ed. B Siciliano, O Khatib, pp. 1203–34. Cham, Switz.: Springer
148. McKinnon CD, Schoellig AP. 2018. Experience-based model selection to enable long-term, safe control
for repetitive tasks under changing conditions. In 2018 IEEE/RSJ International Conference on Intelligent
Robots and Systems (IROS), pp. 2977–84. Piscataway, NJ: IEEE
149. Chandak Y, Jordan S, Theocharous G, White M, Thomas PS. 2020. Towards safe policy improvement
for non-stationary MDPs. In Advances in Neural Information Processing Systems 33, ed. H Larochelle,
M Ranzato, R Hadsell, MF Balcan, H Lin, pp. 9156–68. Red Hook, NY: Curran
150. Burgner-Kahrs J, Rucker DC, Choset H. 2015. Continuum robots for medical applications: a survey.
IEEE Trans. Robot. 31:1261–80
151. Mueller FL, Schoellig AP, D’Andrea R. 2012. Iterative learning of feed-forward corrections for high-
performance tracking. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS),
pp. 3276–81. Piscataway, NJ: IEEE
152. Dean S, Tu S, Matni N, Recht B. 2019. Safely learning to control the constrained linear quadratic regulator. In 2019 American Control Conference (ACC), pp. 5582–88. Piscataway, NJ: IEEE
153. McKinnon CD, Schoellig AP. 2019. Learning probabilistic models for safe predictive control in un-
known environments. In 2019 18th European Control Conference (ECC), pp. 2472–79. Piscataway, NJ:
IEEE