Real-World RL Challenges Survey
Abstract
Reinforcement learning (RL) has proven its worth in a series of artificial domains,
and is beginning to show some successes in real-world scenarios. However, much
of the research advances in RL are hard to leverage in real-world systems due to a
series of assumptions that are rarely satisfied in practice. In this work, we identify
and formalize a series of independent challenges that embody the difficulties that
must be addressed for RL to be commonly deployed in real-world systems. For
each challenge, we define it formally in the context of a Markov Decision Process,
analyze the effects of the challenge on state-of-the-art learning algorithms, and
present some existing attempts at tackling it. We believe that an approach that
addresses our set of proposed challenges would be readily deployable in a large
number of real world problems. Our proposed challenges are implemented in a
suite of continuous control environments called realworldrl-suite which we
propose as an open-source benchmark.
1 Introduction
Reinforcement learning (RL) [Sutton and Barto 2018] is a powerful algorithmic paradigm encom-
passing a wide array of contemporary algorithmic approaches [Mnih et al. 2015; Silver et al. 2016;
Hafner et al. 2018]. RL methods have been shown to be effective on a large set of simulated en-
vironments [Mnih et al. 2015; Silver et al. 2016; Lillicrap et al. 2015; OpenAI 2018], but uptake
in real-world problems has been much slower. We posit that this is primarily due to a large gap
between the casting of current experimental RL setups and the generally poorly defined realities of
real-world systems.
We are inspired by a large range of real-world tasks, from control systems grounded in the physical
world [Vecerik et al. 2019; Kalashnikov et al. 2018] to global-scale software systems interacting
with billions of users [Gauci et al. 2018; Covington et al. 2016; Ie et al. 2019].
Physical systems can range in size from a small drone [Abbeel et al. 2010] to a data center [Evans
and Gao 2016], in complexity from a one-dimensional thermostat [Hester et al. 2018b] to a self-
driving car, and in cost from a calculator to a spaceship. Software systems range from billion-user
recommender systems [Covington et al. 2016] to on-device controllers for individual smartphones;
they can be scheduling millions of software jobs across the globe or optimizing the battery profile
of a single device, and their codebase might range from millions of lines of code down to a simple kernel module.
In all these scenarios, there are recurring themes: the systems have inherent latencies, noise, and
non-stationarities that make them hard to predict. They may have large and complicated state &
action spaces, safety constraints with significant consequences, and large operational costs both in
terms of money and time. This is in contrast to training on a perfect simulated environment where
an agent has full visibility of the system, zero latency, no consequences for bad action choices and
often deterministic system dynamics.
We posit that these difficulties can be well summarized by a set of nine challenges that are holding
back RL from real-world use. At a high level, these challenges are:
1. Learning on the real system from limited samples.
2. Dealing with system delays in sensing, actuation, or rewards.
3. Learning and acting in high-dimensional continuous state and action spaces.
4. Reasoning about safety constraints that should never or rarely be violated.
5. Interacting with systems that are partially observable, non-stationary, or stochastic.
6. Learning from multi-objective or poorly specified reward functions.
7. Providing actions in real time, especially for systems requiring low-latency control.
8. Training off-line from fixed logs of the system's behavior.
9. Providing policies that are explainable to system operators and users.
These challenges can present themselves in a wide array of task scenarios. We choose three exam-
ples, from robotics, healthcare and software systems to illustrate how these challenges can manifest
themselves in various ways.
A common robotic challenge is autonomous manipulation, which has potential applications ranging
from manufacturing to healthcare. Such a robotic system is affected by nearly all of the proposed
challenges:
• Robot time is costly and therefore learning should be data-efficient (Challenge 1).
• Actuators and sensors introduce varying amounts of delay, and the task reward can be de-
layed relative to the system state (Challenge 2).
• Robotic systems almost always have some form of constraints either in their movement
space, or directly on their joints in terms of velocity and acceleration constraints (Challenge
4).
• As the system manipulates the space around it, things will react in unexpected, stochastic
ways, and the robot’s environment will not be fully observable (Challenge 5).
• System operators may want to optimize for a certain performance on the task, but also want
to encourage fast operation, energy efficiency, and reduce wear & tear (Challenge 6).
• A performant controller requires low latency for both smooth and safe control (Challenge
7).
• There are generally logs of the system operating either through tele-operation, or simpler
black-box controllers, both of which can be leveraged to learn offline without costing sys-
tem time (Challenge 8).
In the case of a healthcare application, we can imagine a policy for assisted diagnosis that is trained
from electronic health records (EHRs). This policy could work hand-in-hand with doctors to help
decide on treatment approaches, and would be confronted with many of our described challenges:
• EHR data is not necessarily plentiful, and therefore learning from limited samples is essen-
tial to finding good policies from the available data (Challenge 1).
• The effects of a particular treatment may be observable hours to months after it takes place.
These strong delays will likely pose a challenge to any current RL algorithms (Challenge
2).
• Certain constraints, such as dosage strength or patient-specific allergies, must be respected
to provide pertinent treatment strategies (Challenge 4).
• Biological systems are inherently complex, and both observations as well as patient reac-
tions are inherently stochastic (Challenge 5).
• Many treatment approaches balance aggressiveness towards a pathology with sensitivity to
the patient's reaction. Along with other constraints such as time and drug availability,
these problems are often multi-objective (Challenge 6).
• EHR data is naturally off-line, and therefore being able to leverage as much information as
possible from the data before interacting with patients is essential (Challenge 8).
• For successful collaboration between an algorithm and medical professionals, explainabil-
ity is essential. Understanding the policy’s long-term intended goals is essential in deciding
which strategy to take (Challenge 9).
Recommender systems are amongst the most heavily used large-scale software systems, and RL
offers an enticing framework for optimizing them [Covington et al. 2016; Chen et al. 2019a]. How-
ever, there are many difficulties to be dealt with in large user-facing software systems such as these:
• Interactions with the user can be strongly delayed, either from users reacting to recommen-
dations with high latency, or recommendations being sent to users at different points in
time (Challenge 2).
• The set of possible actions is generally very large (millions to even potentially billions),
which becomes particularly difficult when reasoning about action selection (Challenge 3).
• Many aspects of the user’s interactions with the system are unobserved: Does the user
see the recommendation? What is a user currently thinking? Does the user choose not to
engage due to poor recommendations? (Challenge 5)
• Optimization goals are often multi-objective, with recommender systems trying to increase
engagement, all while driving revenue, reducing costs, maintaining diversity and ensuring
fairness (Challenge 6).
• Many of these systems interact in real-time with a user, and need to provide recommenda-
tions within milliseconds (Challenge 7).
• Although some degree of experimentation is possible on-line, large amounts of information
are available in the form of interaction logs with the system, and need to be exploited in an
off-line manner (Challenge 8).
• Finally, as a recommender system has a potential to significantly affect the user’s experi-
ence on the platform, its choices need to be easily understandable and interpretable (Chal-
lenge 9).
This set of examples shows that the proposed challenges appear in varied types of applications, and
we believe that by identifying, replicating and solving these challenges, reinforcement learning can
be more readily used to solve many of these important real-world problems.
1.2 Contributions
• Identification and definition of real-world challenges: Our main goal is to more clearly
define the issues reinforcement learning is having when dealing with real systems. By
making these problems identifiable and well-defined, we hope they can be dealt with more
explicitly, and thus solved more rapidly. We structure the difficulties of real-world systems
in the aforementioned 9 challenges. For each of the above challenges, we provide some
intuition on where it arises and discuss potential solutions present in the literature.
• Experiment design and analysis for each challenge: For all challenges except explain-
ability, we provide a formal definition of the challenge and implement a set of environ-
ments exhibiting this challenge’s characteristics. This allows researchers to easily observe
the effects of this challenge on various algorithms, and evaluate if certain approaches seem
promising in dealing with the given challenge. To both illustrate the extent of each chal-
lenge’s difficulty, and provide some reference results, we train two state-of-the-art RL
agents on each defined environment, with varying degrees of difficulty, and analyze the
challenge’s effects on learning. With these analyses we provide insights as to which chal-
lenges are more difficult and propose calibrated parameters for each challenge implemen-
tation.
• Define and baseline RWRL Combined Challenge Benchmark tasks: After careful cal-
ibration, we combine a subset of our proposed challenges into a single environment and
baseline the performance of two state-of-the-art learning agents on this setup in Section
2.10. We show that state-of-the-art agents fail quickly, even for mild perturbations applied
along each challenge dimension. We encourage the community to work on improving upon
the combined challenges’ baseline performance. We believe that in doing so, we will take
large steps towards developing agents that are implementable on real world systems.
• Open-source realworldrl-suite codebase: We present the set of perturbed environ-
ments in a parametrizable suite, called realworldrl-suite which extends the Deep-
Mind Control Suite [Tassa et al. 2018] with various perturbations representing the afore-
mentioned challenges. The goal of the suite is to accelerate research in these areas by
enabling RL practitioners and researchers to quickly, in a principled and reproducible
fashion, test their learning algorithms on challenges that are encountered in many real-
world systems and settings. The realworldrl-suite is available for download here:
https://github.com/google-research/realworldrl_suite. A user manual, found
in Appendix C, explains how to instantiate each challenge and also provides code examples
for training an agent.
2 Challenges
In this section, for each of the challenges presented in the introduction, we discuss its importance
and present current research directions that attempt to tackle the challenge, providing starting points
for practitioners and newcomers to the domain. We then define it more formally, and analyse its
effects on state-of-the-art learning algorithms using the realworldrl-suite, to provide insights
on how these challenges manifest themselves in isolation. While not all of these challenges are
present together in every real system, for many systems they are all present together to some degree.
For this reason, in Section 2.10 we also present a set of combined reference challenges, varying in
difficulty, that emulate a complete system with all of the introduced challenges. We believe that a
learner able to tackle these combined challenges would be a good candidate for many real-world
systems.
Notation Environments are formalised as Markov Decision Processes (MDPs). An MDP can be
defined as a tuple ⟨S, A, p, r, γ⟩, where an agent is in a state s_t ∈ S and takes an action a_t ∈ A
at timestep t. When in state s_t and taking an action a_t, an agent will arrive in a new state s_{t+1}
with probability p(s_{t+1} | s_t, a_t), and receive a reward r(s_t, a_t, s_{t+1}). Our environments are episodic,
which is to say that they last a finite number of timesteps, 1 ≤ t ≤ T. The value of γ, the discount
factor, reflects the agent's planning horizon. The full state of the process, s_t, respects the Markov
property: p(s_{t+1} | s_t, a_t, ..., s_0, a_0) = p(s_{t+1} | s_t, a_t), i.e. all necessary information to predict s_{t+1}
is contained in s_t and a_t. In many of the environments in this paper the observed state does not
include the full internal state of the MuJoCo physics simulator. It has nevertheless been shown
empirically that the observed state is sufficient to control an agent, so we interchange the notions of
state and observation unless otherwise specified.
Ultimately, the goal of an RL agent is to find an optimal policy π* : S → A which maximizes its
expected return over a given MDP:
$$\pi^* = \arg\max_{\pi} \; \mathbb{E}^{\pi}\!\left[\,\sum_{t=0}^{\infty} \gamma^t \, r\big(s_t, \pi(s_t), s_{t+1}\big)\right], \qquad s_{t+1} \sim p(\cdot \mid s_t, \pi(s_t)).$$
There are many ways to find this policy [Sutton and Barto 2018], and we will use two model-free
methods described in the following section.
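To make the objective concrete, the following minimal sketch accumulates the discounted return of a fixed policy over one episode. It assumes a generic environment whose reset()/step(action) calls return (observation, reward, done); this mirrors common RL interfaces but is not necessarily the realworldrl-suite API, and all names are illustrative.

```python
def discounted_episode_return(env, policy, gamma=0.99, max_steps=1000):
    """Roll out `policy` for one episode and accumulate sum_t gamma^t * r_t.

    `env` is assumed to expose reset() -> observation and
    step(action) -> (observation, reward, done); this is a generic
    interface, not necessarily the realworldrl-suite's own API.
    """
    obs = env.reset()
    episode_return, discount = 0.0, 1.0
    for _ in range(max_steps):
        action = policy(obs)                  # a_t = pi(s_t)
        obs, reward, done = env.step(action)  # observe r(s_t, a_t, s_{t+1}) and s_{t+1}
        episode_return += discount * reward   # add gamma^t * r_t
        discount *= gamma
        if done:
            break
    return episode_return
```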
Learning algorithms: For each challenge, we present the results of two state-of-the-art (SOTA)
RL learning algorithms: Distributional Maximum a Posteriori Policy Optimization (DMPO) [Ab-
dolmaleki et al. 2018a] and Distributed Distributional Deterministic Policy Gradient (D4PG) [Barth-
Maron et al. 2018]. We chose these two algorithms for benchmarking performance as they (1) yield
SOTA performance on the dm-control suite (see e.g., Hoffman et al. [2020]; Barth-Maron et al.
[2018]); (2) they are both fundamentally different algorithms (DMPO is an EM-style policy itera-
tion algorithm with a stochastic policy and D4PG is a deterministic policy gradient algorithm). Note
that we also tested the original non-distributional algorithm MPO and found performance to be sim-
ilar to DMPO. As such we did not include the results. It was important that our algorithms were
both strong in terms of performance and diverse in terms of algorithmic implementation to show that
SOTA algorithms struggle on many of the challenges that we present in the paper. We could have
included more algorithms such as SAC and PPO. However, we felt that the environmental cost of
running thousands of additional experiments would not justify the additional insights gained. One of
our main motivations in this work is to show that SOTA algorithms do suffer from these challenges
to encourage more research on these topics.
D4PG is a modified version of Deep Deterministic Policy Gradients (DDPG) [Lillicrap et al. 2015],
an actor-critic algorithm where state-action values are estimated by a critic network, and the actor
network is updated with gradients sampled from the critic network. D4PG makes four changes to
improve the critic estimation (and thus the policy): evaluating n-step rather than 1-step returns,
performing a distributional critic update [Bellemare et al. 2017], using prioritized sampling of the
replay buffer, and performing distributed training. These improvements give D4PG state of the art
results across many DeepMind control suite [Tassa et al. 2018] tasks as well as manipulation and
parkour tasks [Heess et al. 2017]. The hyperparameters for D4PG can be found in Appendix A,
Table 9.
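To make the first of these modifications concrete, the sketch below computes the scalar n-step bootstrapped critic target; it is only a schematic and omits the distributional projection, prioritized replay and distributed training that D4PG actually uses (q_bootstrap stands in for the target critic's estimate at the n-th next state).

```python
def n_step_target(rewards, q_bootstrap, gamma=0.99):
    """Compute y_t = sum_{k=0}^{n-1} gamma^k * r_{t+k} + gamma^n * Q_target(s_{t+n}, pi(s_{t+n})).

    `rewards` holds r_t, ..., r_{t+n-1} from the replay buffer and
    `q_bootstrap` is the target critic's value at s_{t+n}. This is the
    scalar (non-distributional) form of the n-step return used by D4PG.
    """
    n = len(rewards)
    target = (gamma ** n) * q_bootstrap
    for k, r in enumerate(rewards):
        target += (gamma ** k) * r
    return target

# Example: a 5-step target with unit rewards and a bootstrap value of 10.
y = n_step_target([1.0] * 5, q_bootstrap=10.0, gamma=0.99)
```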
MPO [Abdolmaleki et al. 2018b] is an RL method that combines the sample efficiency of off-policy
methods with the scalability and hyperparameter robustness of on-policy methods. It is an EM
style method, which alternates an E-step that re-weights state-action samples with an M step that
updates a deep neural network with supervised training. MPO achieves state of the art results on
many continuous control tasks while using an order of magnitude fewer samples when compared
with PPO [Schulman et al. 2017]. Distributional MPO (DMPO) is an extension of MPO that uses
a distributional value function and achieves superior performance. The hyperparameters for DMPO
can be found in Appendix A, Table 10. The hyperparameters were found by doing a grid-search on
each algorithm, based on parameters used in the original papers. The algorithms achieved optimal
reported performance in each case using these parameters in the ‘no challenge’ setting (i.e., when
none of the challenges are present in the environment).
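To give a rough feel for the E-step re-weighting described above, here is a minimal sketch that weights sampled actions by a softmax of their critic values at a fixed temperature. The actual MPO/DMPO E-step optimizes this temperature as a dual variable under a KL constraint and is followed by a KL-regularized M-step, so this is a schematic of the idea rather than the algorithm itself.

```python
import numpy as np

def e_step_weights(q_values, temperature=1.0):
    """Re-weight sampled actions proportionally to exp(Q(s, a) / eta).

    In MPO these weights become the targets of a supervised (M-step)
    update of the policy network; here the temperature `eta` is fixed
    instead of being optimized as a dual variable.
    """
    logits = np.asarray(q_values, dtype=float) / temperature
    logits -= logits.max()          # subtract max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()
```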
Each algorithm is run for 30K episodes on 5 different seeds on cartpole:swingup, walker:walk,
quadruped:walk and humanoid:walk tasks from the realworldrl-suite. Unless stated oth-
erwise, the mean value reported in each graph is the mean performance of the last 100 episodes
of training with the corresponding standard deviation. All hyperparameters for all experiments can
be found in Table 11. To make experiments more easily reproducible we did not use distributed
training for either D4PG or DMPO. Additionally, unless otherwise noted, evaluation is performed
on the same policy as used for training, to be consistent with the notion that there is no train/eval
dichotomy. We refer to average reward and average return interchangeably in this paper.
2.1 Challenge 1: Learning on the Real System from Limited Samples
Motivation & Related Work Almost all real-world systems are slow-moving, fragile, or expensive
enough to operate that the data they produce is costly, and learning algorithms must therefore
be as data-efficient as possible. Unlike much of the research performed in RL [Mnih et al. 2015;
Espeholt et al. 2018a; Hester et al. 2018a; Tessler et al. 2016], real systems do not have separate
training and evaluation environments, therefore the agent must quickly learn to act reasonably and
safely. In the case where there are off-line logs of the system, these might not contain anywhere near
the amount of data or data coverage that current RL algorithms expect. In addition, as all training
data comes from the real system, learning agents cannot have an overly aggressive exploration policy
during training, as these exploratory actions are rarely without consequence. This results in training
data that is low-variance with very little of the state and action space being covered.
Learning iterations on a real system can take a long time, as slower systems’ control frequencies can
range from hours in industrial settings, to multiple months in cases with infrequent user interactions
such as healthcare or advertisement. Even in the case of higher-frequency control tasks, the learning
algorithm needs to learn quickly from potential mistakes without having to repeat them multiple
times. In addition, since there is often only one instance of the system, approaches that instantiate
hundreds or thousands of environments to accelerate training through distributed training [Horgan
et al. 2018; Espeholt et al. 2018b; Adamski et al. 2018] nevertheless require as much data and are
rarely compatible with real systems. For all these reasons, learning on a real system requires an
algorithm to be both sample-efficient and quickly performant.
There are a number of related works that deal with RL on real systems and, in particular, focus on
sample efficiency. One body of work is Model Agnostic Meta-Learning (MAML) [Finn et al. 2017],
which focuses on learning within a task distribution and, with few-shot learning, quickly adapt-
ing to solving a new in-distribution task that it has not seen previously. Bootstrap DQN [Osband
et al. 2016] learns an ensemble of Q-networks and uses Thompson Sampling to drive exploration
and improve sample efficiency. Another approach to improving sample efficiency is to use expert
demonstrations to bootstrap the agent, rather than learning from scratch. This approach has been
combined with DQN [Mnih et al. 2015] and demonstrated on Atari [Hester et al. 2018a], as well
as combined with DDPG [Lillicrap et al. 2015] for insertion tasks on robots [Vecerı́k et al. 2019].
Recent Model-based deep RL approaches [Hafner et al. 2018; Chua et al. 2018; Nagabandi et al.
2019], where the algorithm plans against a learned transition model of the environment, show a lot
of promise for improving sample efficiency. Haarnoja et al. [2018] introduce soft actor-critic al-
gorithms which achieve state-of-the-art performance in terms of sample efficiency and asymptotic
performance. Riedmiller et al. [2018] propose Schedule Auxiliary Control (SAC-X) that enables an
agent to learn complex behaviours from scratch using multiple sparse reward signals. This leads to
efficient exploration which is important for sparse reward RL. Levine and Koltun [2013] use trajec-
tory optimization to direct policy learning and avoid poor local optima. This leads to sample efficient
learning that significantly outperforms the state of the art. Yahya et al. [2017] build on this work to
perform distributed learning with multiple real-world robots to achieve better sample efficiency and
generalization performance on a door opening task using four robots. Another common approach
is to learn ensembles of transition models and use various sampling strategies from those models to
drive exploration and improve sample efficiency [Hester and Stone 2013; Chua et al. 2018; Buckman
et al. 2018].
Experimental Setup & Results To evaluate this challenge, we measure the global normalized
regret with respect to the performance of the best converged policy (across algorithms). Let
window_size be the size of a sliding window w_k across episodes, where k is the index of the earliest
episode contained in the window. We calculate the highest average return across all algorithms using
the final window_size steps of training and denote this value R*_mean. We also calculate the 95%
confidence interval for this window: [R*_lower, R*_upper]. We denote w_K as the sliding window for which
more than 50% of episodes have a return higher than R*_lower, and consider an agent to have
converged at episode K. If this condition is not satisfied during training, then K = M − window_size,
where M is the total number of episodes. We can then define the global normalized regret as
$$L_{\text{pre-converge}}(\pi) = \frac{1}{R^*_{\text{mean}}} \sum_{i=0}^{K} \left( R^*_{\text{mean}} - R_i \right),$$
which can be read as the sum of regrets for each episode i, i.e., the return that would have been achieved
by the best final policy minus the actual return that was achieved. The normalized regret for each of
the evaluation domains is shown in Figure 1a. The normalized regret can effectively be interpreted
as the amount of actual return lost, prior to convergence, due to poor policy performance. We can
observe that DMPO has higher normalized regret than D4PG on all tasks.
Another interesting aspect to measure upon convergence is the instability of the converged policy
during training. To do so, we define the post-convergence instability, which measures the percentage
of post-convergence episodes for which the return is below R*_lower. This can be written as
$$L_{\text{post-converge}}(\pi) = 100 \cdot \frac{\sum_{i=K}^{M} \mathbb{1}\left(R_i < R^*_{\text{lower}}\right)}{M - K},$$
where 1(·) is an indicator function.
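A minimal sketch of how these two metrics could be computed from a vector of per-episode returns is given below; the convergence episode K and the reference statistics R*_mean and R*_lower are assumed to have been determined beforehand as described above, and the function is illustrative rather than the exact evaluation code used for the figures.

```python
import numpy as np

def sample_efficiency_metrics(returns, k, r_star_mean, r_star_lower):
    """Compute the pre-convergence normalized regret and the
    post-convergence instability from episodic returns.

    `returns` holds R_0, ..., R_{M-1}, `k` is the convergence episode,
    and `r_star_mean` / `r_star_lower` are the best final mean return and
    the lower bound of its 95% confidence interval.
    """
    returns = np.asarray(returns, dtype=float)
    regret = np.sum(r_star_mean - returns[: k + 1]) / r_star_mean
    instability = 100.0 * np.mean(returns[k:] < r_star_lower)
    return regret, instability
```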
Figure 1: Sample efficiency metrics. (a) Pre-convergence global normalized regret measures how
much the total reward is lost before convergence to the level of final performance reached by the
best policy for that task. This is normalized by the average episodic return for the best policy.
(b) Post-convergence stability measures what percentage of episodes are suboptimal after conver-
gence. If an algorithm never converges this is measured using the last window_size episodes, where
window_size is the size of the sliding window used for determining convergence.
The average post-convergence instability for each of the domains is shown in Figure 1b (note that there
are no error bars for humanoid because none of the runs converge to the best performance across algorithms). As can be
seen in the figure, DMPO also has higher instability than D4PG, except for cartpole:swingup.
The regret and instability metrics together can be used to summarize the sample efficiency of differ-
ent algorithms. Note that they are both computed with respect to the best known performance for
each task. This means that, if a new algorithm is developed that has better performance, the values
of these metrics will change as a result. This is by design: when a better method comes along, it
should heighten the regret of the previous ones. Note that we could have used the best possible
performance for each task instead of the performance of the best known policy, but if we did that
we would have run the risk that no algorithm converged to that value, making the regret potentially
unbounded. We could also have normalized each algorithm by its own final performance, but that
would have made it hard to compare across algorithms.
The results not only show D4PG to be generally more sample efficient, but can also be used to
compare the difficulty of achieving sample efficient learning across domains. For instance, it is in-
teresting that while D4PG takes longer to get to a policy on humanoid:walk, the policy it eventually
converges to is more stable than the one for walker:walk. We hope that analysing algorithms in
this way will enable a practitioner to (1) develop algorithms that are sample efficient and reduce the
regret until convergence; and (2) ensure that, once converged, the algorithm is stable. These two
properties are highly desirable in many industrial systems.
2.2 Challenge 2: System Delays
Motivation & Related Work Most real systems have delays in either sensing, actuation, or reward
feedback. These might occur because of low-frequency sensing and actuation, because of safety
checks or other transformations performed on the selected action before it is actually implemented,
or because it takes time for an action’s effect to be fully manifested.
Hester and Stone [2013] focus on controlling a robot vehicle with significant delays in the control
of the braking system. They incorporate recent history into the state of the agent so that the learn-
ing algorithm can learn the delay effects itself. Mann et al. [2018] look at delays in recommender
systems, where the true reward is based on the user’s interaction with the recommended item, which
may take weeks to determine. They both present a factored learning approach that is able to take ad-
vantage of intermediate reward signals to improve learning in these delayed tasks. Hung et al. [2018]
introduce a method to better assign rewards that arrive significantly after a causative event.
Figure 2: Average performance on the four tasks (cartpole:swingup, walker:walk, quadruped:walk,
humanoid:walk) under varying action (left) and observation (middle) delays of 0 to 20 timesteps, and
reward delays (right) of 0 to 100 timesteps, for (a) D4PG and (b) DMPO.
Their approach uses a memory-based agent and leverages its memory-retrieval system to properly allocate credit to
distant past events that are useful in predicting the value function at the current timestep. They show
that this mechanism is able to solve previously unsolvable delayed-reward tasks. Arjona-Medina
et al. [2018] introduce the RUDDER algorithm, which uses a backwards-view of a task to generate
a return-equivalent MDP where the delayed rewards are re-distributed more evenly throughout time.
This return-equivalent MDP is easier to learn, is guaranteed to have the same optimal policy as the
original MDP, and the approach shows improvements in Atari tasks with long delays.
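To make the delay challenge concrete, the sketch below shows one way an action delay could be injected into an environment through a FIFO buffer; observation and reward delays can be buffered analogously. The wrapper assumes generic reset()/step() semantics and is an illustration, not the realworldrl-suite's actual implementation.

```python
from collections import deque

class ActionDelayWrapper:
    """Delays the effect of each agent action by `delay` environment steps.

    A FIFO buffer of pending actions is pre-filled with a default action,
    so the first `delay` steps execute the default while the agent's own
    choices queue up behind it.
    """

    def __init__(self, env, delay, default_action):
        self._env = env
        self._delay = delay
        self._default_action = default_action
        self._pending = deque()

    def reset(self):
        # Refill the buffer with `delay` copies of the default action.
        self._pending = deque([self._default_action] * self._delay)
        return self._env.reset()

    def step(self, action):
        self._pending.append(action)              # queue the agent's new action
        delayed_action = self._pending.popleft()  # apply the oldest pending one
        return self._env.step(delayed_action)
```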
2.3 Challenge 3: High-Dimensional Continuous State and Action Spaces
Motivation & Related Work Many practical real-world problems have large and continuous state
and action spaces. For example, consider the huge action spaces of recommender systems [Covington
et al. 2016], or the number of sensors and actuators required to control cooling in a data center [Evans and Gao
2016]. These large state and action spaces can present serious issues for traditional RL algorithms
(e.g., see [Dulac-Arnold et al. 2015; Tessler et al. 2019]).
There are a number of recent works focused on addressing this challenge. Dulac-Arnold et al. [2015]
look at situations involving a large number of discrete actions, and present an approach based on
generating a vector for a candidate action and then doing nearest neighbor search to find the closest
applicable action. For systems with action cardinality that is particularly high (|A| > 1e5), it can
be practical to decompose the action selection process into two steps: action candidate generation
and action ranking, as detailed by Covington et al. [2016]. Zahavy et al. [2018] propose an Action
Elimination Deep Q Network (AE-DQN) that uses a contextual bandit to eliminate irrelevant ac-
tions. He et al. [2015] present the Deep Reinforcement Relevance Network (DRRN) for evaluating
continuous action spaces in text-based games. Tessler et al. [2019] introduce compressed sensing as
an approach to reconstruct actions in text-based games with combinatorial action spaces.
Experimental Setup & Results Given the continuous nature of the realworldrl-suite we
chose to simulate a high-dimensional state space, although increasing the action space with dummy
dimensions could be interesting for further work. For readers interested in experiments dealing
with large discrete action spaces, please refer to [Dulac-Arnold et al. 2015] for various experimental
setups evaluating large discrete actions spaces. For this challenge, we first compared results across
all the tasks in an unperturbed manner. The state and action dimensions for each task can be found
in Table 1. Both stability of the overall system and the dimensionality affect learning progress. For
example, as seen in Figures 3a and 3b for D4PG and DMPO respectively, quadruped is higher
dimensional than walker, yet converges faster since it is a fundamentally more stable system. On
the other hand, dimensionality is also a factor as cartpole, which is significantly lower-dimensional
than humanoid, converges significantly faster.
We subsequently increased the number of state dimensions of each task with dummy state variables
sampled from a zero mean, unit variance normal distribution. We then compare the average return
for each task as we increase the state dimensionality. Figures 4a and 4b (right) show the converged
average performance of the learning algorithm on each task for D4PG and DMPO respectively.
Since the added states were effectively injecting noise into the system, the algorithm learns to deal
with the noise and converges to the optimal performance for the cases of cartpole:swingup,
quadruped:walk and walker:walk. In some cases, e.g. Figures 5a and 5b for walker:walk, the
additional dummy dimensions slightly slow down learning, indicating that while the learning algorithm
does eventually learn to ignore the noise, doing so comes at a cost in convergence speed.
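A minimal sketch of the dummy-dimension perturbation described above is shown below: each observation is padded with extra entries drawn from a standard normal distribution. The interface and names are illustrative assumptions rather than the suite's actual wrapper code.

```python
import numpy as np

class DummyObservationWrapper:
    """Appends `extra_dims` standard-normal dummy entries to each observation.

    The padded entries carry no information about the task, so the learner
    must implicitly discover that they can be ignored.
    """

    def __init__(self, env, extra_dims, seed=0):
        self._env = env
        self._extra_dims = extra_dims
        self._rng = np.random.default_rng(seed)

    def _pad(self, observation):
        noise = self._rng.standard_normal(self._extra_dims)
        return np.concatenate([np.asarray(observation, dtype=float).ravel(), noise])

    def reset(self):
        return self._pad(self._env.reset())

    def step(self, action):
        observation, reward, done = self._env.step(action)
        return self._pad(observation), reward, done
```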
Figure 3: Learning performance on all domains as a function of the number of episodes for (a) D4PG
and (b) DMPO, truncated to 10K episodes for better visualization.
Figure 4: Average performance and standard deviation on the four tasks when adding Gaussian
action noise (left), Gaussian observation noise (middle) and increasing the dimensionality of the
state space with dummy variables (right), for (a) D4PG and (b) DMPO.
Figure 5: Learning performance of D4PG (left) and DMPO (right) on walker:walk as the state
observation dimension increases (0, 10, 20, 50 and 100 additional dimensions). The graph has been
cropped to 4000 episodes to highlight the effect that increasing the observation dimensionality has
on the learning algorithm.
2.4 Challenge 4: Satisfying Safety Constraints
Motivation & Related Work Almost all physical systems can destroy or degrade themselves and
the environment around them if improperly controlled. Software systems can also significantly de-
grade their performance or crash, as well as provide improper or incorrect interactions with users.
As such, considering constraints on their operation is fundamentally necessary to controlling them.
Constraints are not only important during system operation, but also during exploratory learning
phases as well. Examples of physical constraints include limits on system temperatures or contact
forces for safe operation, maintaining minimum battery levels, avoiding dynamic obstacles, or lim-
iting end effector velocities. Software systems might have constraints around types of content to
propose to users or system load and throughput limits to respect.
Although system designers may often wrap the learnt controller in a safety watchdog controller, the
learnt controller needs to be aware of the constraints to avoid degenerate solutions which lazily rely
on the watchdog. We want to emphasize that constraints can be put in place for varying reasons,
ranging from monetary costs, to system up-time and longevity, to immediate physical safety of users
and operators. Due to the physically grounded nature of our suite, our proposed set of constraints
are physically bound and are intended to avoid self-harm, but the suite’s framework provides options
for users to define any constraints they wish.
Recent work in RL safety [Dalal et al. 2018; Achiam et al. 2017; Tessler et al. 2018; Satija et al.
2020] has cast safety in the context of Constrained MDPs (CMDPs) [Altman 1999], and we will
concentrate on pre-defined constraints on the environment in this context. Constrained MDPs define
a constrained optimization problem and can be expressed as:
$$\max_{\pi \in \Pi} \; R(\pi) \quad \text{subject to} \quad C^k(\pi) \le V_k, \quad k = 1, \ldots, K.$$
Here, R is the cumulative reward of a policy π for a given MDP, and C k (π) describes the incurred
cumulative cost of a certain policy π relative to constraint k. The CMDP framework describes
multiple ways to consider cumulative cost of a policy π: the total cost until task completion, the
discounted cost, or the average cost. Specific constraints are defined as ck (s, a).
The CMDP setup allows for arbitrary constraints on state and action to be expressed. In the context
of a physical system these can be as simple as box constraints on a specific state variable, or more
complex such as dynamic collision-avoidance constraints. One major challenge with addressing
these safety concerns in real systems is that safety violations will likely be very rare in logs of the
system. In many cases, safety constraints are assumed and are not even specified by the system
operator or product manager.
An extension to CMDPs is budgeted MDPs [Boutilier and Lu 2016; Carrara et al. 2018]. While for
a CMDP, the constraint level Vk is given, for budgeted MDPs, it is unknown. Instead, the policy
is learned as a function of the constraint level. The user can then examine the trade-off between expected
return and constraint level and choose the level that works best for their setting. This is a
good match for the common real-world scenario where constraints may not be absolute, and small
violations may be allowed in exchange for a large improvement in expected return.
Recently, there has been a lot of work focused on the problem of safety in reinforcement learning.
One focus has been the addition of a safety layer to the network [Dalal et al. 2018; Pham et al. 2017].
These approaches focus on safety during training, and have enabled an agent to learn a task with zero
safety violations during training. There are other approaches [Achiam et al. 2017; Tessler et al. 2018;
Bohez et al. 2019] that learn a policy that violates constraints during training but produce a trained
policy that respects the safety constraints. Stooke et al. [2020] introduce the concept of Lagrangian
damping, which leads to improved stability by performing PID control on the Lagrangian parameter.
Additional RL approaches include using Lyapunov functions to learn safe policies [Chow et al.
2018] and exploration strategies that predict the safety of neighboring states [Turchetta et al. 2016;
Wachi et al. 2018]. Satija et al. [2020] introduce the concept of a backward value function for a more
conservative optimization algorithm. A Probabilistic Goal MDP [Mankowitz et al. 2016c; Xu and
Mannor 2011] is another type of objective that encourages an agent to achieve a pre-defined reward
level irrespective of the time it takes to complete the task. This objective encourages risk-averse be-
haviour leading to safer and more robust policies. Thomas [2015] proposes a safe RL algorithm that
searches for new and improved policies while ensuring that the probability of selecting bad policies
is low. Calian et al. [2020] provide a meta-gradient solution to balancing the trade-off between max-
imizing rewards and minimizing constraint violations. This D4PG variant learns the learning rate of
the Lagrange multiplier in a soft-constrained optimization procedure. Thomas et al. [2017] propose a
new framework for designing machine learning algorithms that simplifies the problem of specifying
and regulating undesired behaviours. There have also been approaches to learn a policy that satisfies
constraints in the presence of perturbations to the dynamics of an environment [Mankowitz et al.
2020].
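As a concrete illustration of the Lagrangian-style soft-constrained formulations referenced above, the sketch below shows a scalar dual-ascent update: the agent optimizes the reward penalized by λ times the constraint cost, and λ grows whenever the measured cost exceeds the budget V_k. This is a generic textbook-style update under assumed names, not the specific algorithm of any of the cited works.

```python
import numpy as np

def lagrangian_update(episode_costs, budget, lmbda, dual_lr=0.01):
    """One dual-ascent step on the Lagrange multiplier of a single constraint.

    `episode_costs` are the cumulative constraint costs C^k(pi) observed for
    recent episodes under the current policy, and `budget` is the allowed
    level V_k. The multiplier grows when the constraint is violated on
    average and shrinks towards zero otherwise.
    """
    violation = np.mean(episode_costs) - budget
    return max(0.0, lmbda + dual_lr * violation)

def penalized_reward(reward, cost, lmbda):
    """Reward the agent actually optimizes: r(s, a) - lambda * c^k(s, a)."""
    return reward - lmbda * cost
```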
Experimental Setup & Results To demonstrate the complexity of system constraints, we lever-
age the CMDP formalism to include a series of binary safety-inspired constraints to our challenge
domains. These constraints can be either considered passively, as a measure of an agent’s behavior,
or they can be included in the agent’s observation so that the agent may learn to avoid them.
As an example, our cartpole environment, with variables x and θ (cart position and pole angle),
includes three boolean constraints (a sketch of how such checks might be computed follows the list):
1. slider pos, which restricts the cart’s position on the track: xl < x < xr .
2. slider accel, which limits cart acceleration: ẍ < Amax .
3. balance velocity, a slightly more complex constraint, which limits the pole’s angular
velocity when it is close to being balanced: |θ| > θL ∨ θ̇ < θ̇V .
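Below is a minimal sketch of how these three boolean checks could be evaluated from the cart position x, cart acceleration ẍ, pole angle θ and pole angular velocity θ̇. The threshold arguments are placeholders for the limits that the suite derives from safety_coeff, and the function is illustrative rather than the suite's own constraint code.

```python
def cartpole_constraints(x, x_acc, theta, theta_dot,
                         x_left, x_right, a_max, theta_limit, theta_dot_limit):
    """Evaluate the three cartpole safety constraints.

    Returns True for each constraint that is satisfied. Threshold values
    are placeholders; in the suite they are scaled by safety_coeff.
    """
    slider_pos = x_left < x < x_right    # cart stays inside the track limits
    slider_accel = x_acc < a_max         # cart acceleration stays below A_max
    # Limit the pole's angular velocity only when the pole is close to upright.
    balance_velocity = abs(theta) > theta_limit or theta_dot < theta_dot_limit
    return {
        "slider_pos": slider_pos,
        "slider_accel": slider_accel,
        "balance_velocity": balance_velocity,
    }
```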
The full set of available constraints across all tasks is described in Table 2. Each constraint can be
tuned by modifying a parameter safety_coeff ∈ [0, 1], where 0 is harder and 1 is easier to satisfy.
Table 2: Constraints available for each domain.

Cart-Pole (variables: x, θ)
  slider_pos:        x_l < x < x_r
  slider_accel:      ẍ < A_max
  balance_velocity:  |θ| > θ_L ∨ θ̇ < θ̇_V

Walker (variables: θ, u, F)
  joint_angle:       θ_L < θ < θ_U
  joint_velocity:    max_i θ̇_i < L_θ̇

Quadruped (variables: θ, u, F)
  joint_angle:       θ_{L,i} < θ_i < θ_{U,i}
  joint_velocity:    max_i θ̇_i < L_θ̇
  upright:           0 < u_z
  foot_force:        F_EE < F_max

Humanoid (variables: θ, u, F)
  joint_angle:       θ_{L,i} < θ_i < θ_{U,i}
  joint_velocity:    max_i θ̇_i < L_θ̇
To evaluate this challenge, we track the number of constraint violations by the agent, for each con-
straint, throughout training. We present the effects of safety_coeff on all four environments in
Figure 6. For each task, we illustrate both the average number of constraint violations upon
convergence as a function of safety_coeff (left) as well as the average number of violations
throughout an episode of cartpole:swingup (right). We can see that safety_coeff makes the
task more difficult as it tends towards 0, and that constraint violations are non-uniform over
time, e.g. as the cart swings back and forth, the pole-velocity, position and acceleration constraints are more
frequently violated.
Although the learner presented here ignores the constraints, we also include a multi-objective task
which combines the task’s reward function with a constraint violation penalty in Section 2.6.
[Figure 6: for each domain (cartpole:swingup, walker:walk, quadruped:walk, humanoid:walk), the
mean number of constraint violations at convergence as a function of the safety coefficient (left)
and the mean violations per timestep across an episode (right); individual curves correspond to the
domain's constraints, e.g. slider_pos, slider_accel and balance_velocity for cartpole and foot_force
for quadruped.]
2.5 Challenge 5: Partial Observability and Non-Stationarity
Motivation & Related Work Almost all real systems where we would want to deploy RL are
partially observable. For example, on a physical system, we likely do not have observations of the
wear and tear on motors or joints, or the amount of buildup in pipes or vents. We have no obser-
vations on the quality of the sensors and whether they are malfunctioning. On systems that interact
with users such as recommender systems, we have no observations of the mental state of the users.
Often, these partial observations appear as noise (e.g., sensor wear and tear or uncalibrated/broken
sensors), non-stationarity (e.g. as a pump’s efficiency degrades) or as stochasticity (e.g. as each
robot being operated behaves differently).
Partial observability. Partially observable problems are typically formulated as a partially ob-
servable Markov Decision Process (POMDP) [Cassandra 1998]. The key difference from the MDP
formulation is that the agent’s observation x ∈ X is now separate from the state, with an observation
function O(x | s) giving the probability of observing x given the environment state s. There are a
couple of common approaches to handling partial observability in the literature. One is to incorporate
history into the observation of the agent: DQN [Mnih et al. 2015] stacks four Atari frames together
as the agent’s observation to account for partial observability. Alternatively, an approach is to use
recurrent networks within the agent, enabling them to track and recover hidden state. Hausknecht
and Stone [2015] apply such an approach to DQN, and show that the recurrent version can perform
equally well in Atari games when only given a single frame as input. Nagabandi et al. [2018] pro-
pose an approach modeling the system as non-stationary with a time-varying reward function, and
use meta-learning to find policies that will adapt to this non-stationarity. Much of the recent work
on transferring learned policies from simulation to the real system also focuses on this area, as the
underlying differences between the systems are not observable [Andrychowicz et al. 2018; Peng
et al. 2018].
Experimental Setup & Results Many real-world sensor issues can be viewed as a partial observability
challenge (unobserved properties describing the functioning of the sensor) that could be helped by
recurrent models or other approaches for partial observability. A common issue we see in real-world
settings is malfunctioning sensors. On any real task, we can assume that the sensors are noisy, which
we reproduce by adding increasing levels of Gaussian noise to the actions and observations. Results
of these perturbations can be observed in Figures 4a and 4b (left and middle figures respectively)
for D4PG and DMPO. We frequently also see sensors that either get stuck at a certain value for a
period of time or drop out entirely, with some default value being sent to the agent. We simulate
both of these scenarios by setting a probability of a sensor becoming stuck or dropped and varying
the length of the malfunction. Results for these perturbations are presented in Figures 7a, 7b
and Figures 8a, 8b for stuck and dropped sensors. We see from the figures that both dropped and
stuck sensors have a significant effect on degrading the final performance.
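A minimal sketch of the stuck and dropped sensor perturbations described above is given below: with some per-step probability a sensor starts malfunctioning for a fixed number of steps, during which its reading either repeats its last healthy value (stuck) or is replaced by a default value (dropped). The names, the per-entry granularity and the fixed duration are illustrative assumptions rather than the suite's implementation.

```python
import numpy as np

class SensorMalfunctionSimulator:
    """Simulates stuck or dropped sensor readings on a flat observation vector."""

    def __init__(self, prob, duration, mode="stuck", default_value=0.0, seed=0):
        self._prob = prob          # per-step probability that a healthy sensor fails
        self._duration = duration  # number of steps a failure lasts
        self._mode = mode          # "stuck" repeats the last value, "dropped" uses the default
        self._default = default_value
        self._rng = np.random.default_rng(seed)
        self._steps_left = None
        self._last_obs = None

    def __call__(self, observation):
        obs = np.asarray(observation, dtype=float)
        if self._steps_left is None:
            self._steps_left = np.zeros_like(obs)
            self._last_obs = obs.copy()
        # Start new malfunctions on currently healthy sensors.
        failing = (self._steps_left == 0) & (self._rng.random(obs.shape) < self._prob)
        self._steps_left[failing] = self._duration
        broken = self._steps_left > 0
        corrupted = obs.copy()
        corrupted[broken] = self._last_obs[broken] if self._mode == "stuck" else self._default
        self._steps_left = np.maximum(self._steps_left - 1, 0)
        # Remember the last healthy reading of each sensor.
        self._last_obs = np.where(broken, self._last_obs, obs)
        return corrupted
```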
Non-stationarity. Real world systems are often stochastic and noisy compared to most simulated
environments. In addition, sensor and action noise as well as action delays add to the perturbations
an agent may experience in the real-world setting. There are a number of RL approaches that have
been utilized to ensure that an agent is robust to different subsets of these factors. We will focus on
Robust MDPs, domain randomization and system identification as frameworks for reasoning about
noisy, non-stationary systems.
A Robust MDP is defined by a tuple ⟨S, A, P, r, γ⟩ where S, A, r and γ are as previously defined;
P is a set of transition matrices referred to as the uncertainty set [Iyengar 2005]. The objective that
we optimize is the worst-case value function, defined as
$$J(\pi) = \inf_{p \in \mathcal{P}} \; \mathbb{E}_p\!\left[\,\sum_{t=0}^{\infty} \gamma^t r_t \;\middle|\; p, \pi \right].$$
At each step, nature chooses a transition function that the agent transitions with so as to minimize the
long term value. The agent learns a policy that maximizes this worst case value function. Recently,
a number of works have surfaced that have shown this formulation to yield robust policies that are
agnostic to a range of perturbations in the environment [Tamar et al. 2014; Mankowitz et al. 2018a;
Shashua and Mannor 2017; Derman et al. 2018; 2019; Mankowitz et al. 2019]. The solutions do
tend to be overly conservative but some work has been done to yield less conservative, ‘soft-robust’
solutions [Derman et al. 2018].
In addition to the robust MDP formalism, the practitioner may be interested in both robustness due to
domain randomization and system identification. Domain randomization [Peng et al. 2018] involves
explicitly training an agent on various perturbations of the environment and averaging these learning
errors together during training. System identification involves training a policy that, once on a new
system, can determine the characteristics of the environment it is operating in and modify its policy
accordingly [Finn et al. 2017; Nagabandi et al. 2018].
Experimental Setup & Results We perform a number of different experiments to determine the effects
of non-stationarity. We first want to determine whether perturbations to the environment can have
an effect on a converged policy that is trained without any challenges added to the environment. For
each of the domains, we perturb each of the supported parameters shown in Table 3. The effect of the
perturbations on the converged D4PG policy for each domain and supported parameter can be seen
in Figure 9. It is clear that varying the perturbations does indeed have an effect on the performance
of the converged policy; in many instances this causes the converged policy to completely fail. This
is consistent with the results in Mankowitz et al. [2019]. This hyperparameter sweep also helps
determine which parameter settings are more likely to have an effect on the learning capabilities of
the agent during training.
The second set of experiments therefore aims to determine the consequences of incorporating
non-stationarity effects during training. Every episode, new environment parameters are sampled
from an interval [perturb_min, perturb_max], where perturb_min and perturb_max indicate the minimum
and maximum perturbation values of the particular parameter that we vary. For example,
in cartpole:swingup, the perturbation parameter is pole length, with perturb_min = 0.5,
perturb_max = 3.0, and the standard deviation used for sampling is perturb_std = 0.05.
Based on the previous set of experiments, for each task, we select domain parameters that we expect
may change the optimal policy. We perform four hyperparameter training sweeps on the domain
parameters for each domain & each algorithm (D4PG and DMPO). These sweeps are in increas-
ing orders of difficulty and have thus been named diff1, diff2, diff3 and diff4, and are shown in
Table 4. We perturb the environment in two different ways: uniform and cyclic perturbations. For
uniform perturbations, we sample each episode from a uniform distribution and for the cyclic per-
turbations, a random positive change was sampled from a normal distribution, and the values were
reset to the lower limit once the upper limit had been reached. Additional sampling methods and
perturbation parameters are supported in the realworldrl-suite and can also be seen in Table 3.
Cyclic sampling simulates scenarios where equipment degrades over time until it is replaced or fixed
and returns to peak performance. The slow, consistent changes across episodes also open up the
possibility of an algorithm adapting to the changes over time.
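The sketch below illustrates the two per-episode sampling schemes described above: uniform sampling draws each episode's parameter value independently from [perturb_min, perturb_max], while cyclic sampling adds a random positive increment each episode and wraps back to the lower limit once the upper limit is exceeded. Parameter names follow the text; this is an illustration rather than the suite's exact code.

```python
import numpy as np

def sample_uniform(rng, perturb_min, perturb_max):
    """Draw a fresh parameter value for the next episode."""
    return rng.uniform(perturb_min, perturb_max)

def sample_cyclic(rng, current, perturb_min, perturb_max, perturb_std):
    """Increase the parameter by a random positive step, wrapping back to the
    lower limit once the upper limit has been reached (slow degradation
    followed by repair or replacement)."""
    value = current + abs(rng.normal(0.0, perturb_std))
    return perturb_min if value > perturb_max else value

# Example using the cartpole:swingup values quoted in the text
# (perturb_min = 0.5, perturb_max = 3.0, perturb_std = 0.05).
rng = np.random.default_rng(0)
pole_length = 0.5
for episode in range(5):
    pole_length = sample_cyclic(rng, pole_length, 0.5, 3.0, 0.05)
```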
Figures 10 and 11 show the training performance for D4PG and DMPO when applying uniform
and cyclic perturbations per episode respectively. As seen in the figures, increasing the range of the
perturbation parameter has the effect of slowing down learning. This seems to be consistent across
all of the domains we evaluated.
Table 4: Perturbed parameters chosen for each control task, with varying levels of difficulty.

Cart-Pole (parameter: pole length, default 1.0)
  diff1: perturb_min 0.9,   perturb_max 1.1,  perturb_std 0.02
  diff2: perturb_min 0.7,   perturb_max 1.7,  perturb_std 0.1
  diff3: perturb_min 0.5,   perturb_max 2.3,  perturb_std 0.15
  diff4: perturb_min 0.3,   perturb_max 3.0,  perturb_std 0.2

Walker (parameter: thigh length, default 0.225)
  diff1: perturb_min 0.225, perturb_max 0.25, perturb_std 0.002
  diff2: perturb_min 0.225, perturb_max 0.4,  perturb_std 0.015
  diff3: perturb_min 0.15,  perturb_max 0.55, perturb_std 0.04
  diff4: perturb_min 0.1,   perturb_max 0.7,  perturb_std 0.06

Quadruped (parameter: shin length, default 0.25)
  diff1: perturb_min 0.25,  perturb_max 0.3,  perturb_std 0.005
  diff2: perturb_min 0.25,  perturb_max 0.8,  perturb_std 0.05
  diff3: perturb_min 0.25,  perturb_max 1.4,  perturb_std 0.1
  diff4: perturb_min 0.25,  perturb_max 2.0,  perturb_std 0.15

Humanoid (parameter: joint damping, default 0.1)
  diff1: perturb_min 0.6,   perturb_max 0.8,  perturb_std 0.02
  diff2: perturb_min 0.5,   perturb_max 0.9,  perturb_std 0.04
  diff3: perturb_min 0.4,   perturb_max 1.0,  perturb_std 0.06
  diff4: perturb_min 0.1,   perturb_max 1.2,  perturb_std 0.1
Figure 7: Average performance and standard deviation on the four tasks (cartpole:swingup,
walker:walk, quadruped:walk, humanoid:walk) under the stuck sensors condition, for (a) D4PG and
(b) DMPO. Both the probability of a sensor becoming stuck and the number of steps it is stuck at
the last value for are varied.
Figure 8: Average performance on the four tasks under the dropped sensors condition, for (a) D4PG
and (b) DMPO. Both the probability of a sensor being dropped and the number of steps it is dropped
for are varied.
Figure 9: Perturbation effects on a converged D4PG policy due to varying specific environment
parameters: pole length, pole mass, joint damping and slider damping for cartpole:swingup; thigh
length, torso length, joint damping and contact friction for walker:walk; shin length, torso density,
joint damping and contact friction for quadruped:walk; joint damping and contact friction for
humanoid:walk.
[Figure 10 plots omitted: panels (a) D4PG and (b) DMPO.]
Figure 10: Uniform perturbations applied per episode for each of the four domains for D4PG and DMPO.
[Figure 11 plots omitted: panels (a) D4PG and (b) DMPO.]
Figure 11: Cyclic perturbations applied per episode for each of the four domains for D4PG and DMPO.
2.6 Challenge 6: Multi-Objective Reward Functions
Motivation & Related Work RL frames policy learning through the lens of optimizing a global
reward function, yet most systems have multi-dimensional costs to be minimized. In many cases,
system or product owners do not have a clear picture of what they want to optimize. When an agent
is trained to optimize one metric, other metrics are often discovered that also need to be maintained
or improved. Thus, a lot of the work on deploying RL to real systems is spent figuring out how to
trade off between different objectives.
There are many ways of dealing with multi-objective rewards: Roijers et al. [2013] provide an
overview of various approaches. Various methods exist that deal explicitly with the multi-objective
nature of the learning problems, either by predicting a value function for each objective [Van Seijen
et al. 2017], or by finding a policy that optimizes each sub-problem [Li et al. 2019], or that fits
each Pareto-dominating mixture of objectives [Moffaert and Nowé 2014]. Yang et al. [2019] learn
a general policy that can behave optimally for any desired mixture of objectives. Multiple trivial objectives have also been used to enrich the reward signal and simply improve learning of the base task [Jaderberg et al. 2016]. Abdolmaleki et al. [2020] use an expectation maximization approach to learn multiple Q-functions, one per objective.
In the specific case of balancing a task reward with negative outcomes, a possible approach is to use a Conditional Value at Risk (CVaR) objective [Tamar et al. 2015b], which looks
at a given percentile of the reward distribution, rather than expected reward. Tamar et al. show
that by optimizing reward percentiles, the agent is able to improve upon its worst-case performance.
Distributional DQN [Dabney et al. 2018; Bellemare et al. 2017] explicitly models the distribution over returns, and it would be straightforward to extend it to use a CVaR objective.
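As an illustration, the following is a minimal sketch of how an empirical CVaR could be computed from a batch of sampled episode returns (plain NumPy with hypothetical variable names; this is not the estimator used by Tamar et al. [2015b]):

```python
import numpy as np

def empirical_cvar(returns, alpha=0.1):
    """Mean of the worst alpha-fraction of sampled episode returns."""
    returns = np.sort(np.asarray(returns, dtype=float))  # ascending: worst episodes first
    k = max(1, int(np.ceil(alpha * len(returns))))       # size of the tail
    return returns[:k].mean()

# A risk-averse comparison prefers the policy with the higher CVaR,
# even if its expected return is slightly lower.
sampled_returns = np.random.default_rng(0).normal(loc=100.0, scale=30.0, size=1000)
print(empirical_cvar(sampled_returns, alpha=0.1))
```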
When rewards can’t be functionally specified, there are a number of works devoted to recovering an
underlying reward function from demonstrations, such as inverse reinforcement learning [Russell
1998; Ng et al. 2000; Abbeel and Ng 2004; Ross et al. 2011]. Hadfield-Menell et al. examine how
to infer the truly intended reward function from the given reward function and training MDPs, to
ensure that the agent performs as intended in new scenarios.
Because the global reward function is generally a balance of multiple sub-goals (e.g., reducing both
time-to-target and energy use), a proper evaluation should separate the individual components of
the reward function to better understand the policy's trade-offs. Looking at the Pareto boundaries provides some insight into the relative trade-offs between objectives, but doesn't scale well beyond 2-3 objectives. We propose a simple multi-objective analysis of return. If we consider that the global reward function is defined as a linear combination of sub-rewards, $r(s, a) = \sum_{j=1}^{K} \alpha_j r_j(s, a)$, then we can consider the vector of per-component rewards for evaluation:
$J^{\mathrm{multi}}(\pi) = \Big( \sum_{i=1}^{T_n} r_j(s_i, a_i) \Big)_{1 \leq j \leq K} \in \mathbb{R}^K.$  (1)
When dealing with multi-objective reward functions, it is important to track the different objectives individually when evaluating a policy. This allows for a clearer understanding of the trade-offs the policy is making, and lets practitioners choose which compromises they consider best.
To evaluate the performance of the algorithm across the full distribution of scenarios (e.g. users,
tasks, robots, objects, etc.), we suggest independently analyzing the performance of the algorithm
on each cohort. This is also important for ensuring fairness of an algorithm when interacting with
populations of users. Another approach is to analyze the CVaR return rather than expected returns,
or to directly determine whether rare catastrophic rewards are minimized [Tamar et al. 2015b;a].
Another evaluation procedure is to observe behavioural changes when an agent needs to be risk-
averse or risk-seeking such as in football [Mankowitz et al. 2016c].
Experimental Setup & Results We illustrate the multi-objective challenge by looking at the ef-
fects of a multi-objective reward function that encourages both task success and the satisfaction of
safety constraints specified in Section 2.4. We use a naive mixture reward:
r_m = (1 − α) r_b + α r_c, (2)
where r_b is the task's base reward, r_c is the number of constraints satisfied during that timestep and α ∈ [0, 1] is the multi-objective coefficient that balances between the objectives.
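To make Equations (1) and (2) concrete, the following minimal sketch (plain Python/NumPy with hypothetical names, not the suite's internal implementation) accumulates the per-component return vector of Equation (1) while exposing the scalar mixture of Equation (2) to the agent:

```python
import numpy as np

class MultiObjectiveMixture:
    """Bookkeeping for a reward built from K = 2 sub-rewards: [base task, constraints].

    The agent trains on the scalar mixture of Eq. (2); the per-component return
    vector of Eq. (1) is accumulated purely for evaluation.
    """

    def __init__(self, alpha):
        self.alpha = alpha                    # multi-objective coefficient in [0, 1]
        self.component_returns = np.zeros(2)  # J^multi: one entry per sub-reward

    def step_reward(self, base_reward, num_satisfied_constraints):
        components = np.array([base_reward, float(num_satisfied_constraints)])
        self.component_returns += components  # accumulates Eq. (1) over the episode
        return (1.0 - self.alpha) * components[0] + self.alpha * components[1]  # Eq. (2)

# At evaluation time, report mixer.component_returns so the trade-off between
# r_b and r_c stays visible, rather than only the scalar return.
mixer = MultiObjectiveMixture(alpha=0.3)
scalar_reward = mixer.step_reward(base_reward=0.8, num_satisfied_constraints=3)
```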
The realworldrl-suite allows multi-objective rewards to be defined, providing the multiple
objectives either as observations to the agents, as modifications to the original task’s reward, or
both. We use the suite to model the multi-objective problem by letting α correspond to the multiobj_coeff parameter in the realworldrl-suite, and changing the task's reward to correspond to Equation (2). For each task, we visualize both the per-element reward, as defined in Equation (1), and the average number of each constraint's violations upon convergence. Figure 12 shows the varying effects of this multi-objective reward on each reward component, r_b and r_c, as a function of multiobj_coeff: we set safety_coeff to 0.5 and vary multiobj_coeff. We can see the evolution in performance relative to r_b and r_c (left), as well as the resulting effects on constraint satisfaction (right) as multiobj_coeff is varied. As r_c becomes more important in the global reward, constraints are quickly taken into account. However, over-emphasis on r_c quickly degrades r_b and therefore base task performance. Although this is a naive way to deal with safety constraints,
it illustrates the often contradictory goals that a real-world task might have, and the difficulty in
satisfying all of them. We also believe it provides an interesting framework to analyze how different
algorithmic approaches better balance the need to satisfy constraints with the ability to maintain
adequate system performance.
[Figure 12 plots omitted: for each domain (cartpole:swingup, walker:walk, quadruped:walk, humanoid:walk), the base and multi-objective rewards and the average number of violations at convergence for each constraint (e.g. joint_velocity_constraint, upright_constraint, foot_force_constraint, dangerous_fall_constraint), plotted against the multi-objective coefficient.]
(a) Performance vs. constraint satisfaction trade-offs as α, the multi-objective coefficient, is varied for D4PG.
(b) Performance vs. constraint satisfaction trade-offs as α, the multi-objective coefficient, is varied for DMPO.
Figure 12: Performance vs. constraint satisfaction trade-offs as α, the multi-objective coefficient, is varied. The multi-objective coefficient is the reward-mixture coefficient that makes the agent's perceived reward lean more towards the original task reward or more towards the constraint satisfaction reward. For each task, the left plot shows the evolution of the task's original reward as the reward-mixture coefficient is altered. The right plot shows the average number of constraint violations per episode upon convergence for each individual constraint.
2.7 Challenge 7: Real-time Inference Challenge
Motivation & Related Work To deploy RL to a production system, policy inference must be
done in real-time at the control frequency of the system. This may be on the order of milliseconds
for a recommender system responding to a user request [Covington et al. 2016] or the control of a
physical robot, and up to the order of minutes for building control systems [Evans and Gao 2016].
This constraint both prevents us from running the task faster than real-time to generate massive amounts of data quickly [Silver et al. 2016; Espeholt et al. 2018b] and prevents us from running slower than real-time to perform more computationally expensive approaches (e.g. some forms of model-based
planning [Doya et al. 2002; Levine et al. 2019; Schrittwieser et al. 2019]).
One approach is to take existing algorithms and validate their feasibility to run in real-time [Adam
et al. 2011]. Another approach is to design algorithms with the explicit goal of running in real-time
[Cai et al. 2017; Wang and Yuan 2015]. Recently, Ramstedt and Pal [2019] presented a different view on real-time inference and proposed the Real-Time Markov Reward Process, in which the state evolves during action selection. Anytime inference [Vlasselaer et al. 2015; Spirtes 2001] refers to a family of algorithms that can return a valid solution whenever they are interrupted, and that are expected to produce better-performing solutions the longer they run. Travnik et al. [2018] propose
a class of reactive SARSA RL algorithms that address the problem of asynchronous environments
which occur in many real-world tasks. That is, the state is continuously changing while the agent is
computing an action to take, or executing an action.
Experimental Setup & Results The realworldrl-suite offers two ways in which one can
measure the effect of real-time inference: latency and throughput. Latency corresponds to the
amount of time it takes an agent to output an action based on an observation. Even if the agent
is replicated over multiple machines, allowing it to handle the frequency of the observations arriving
from the system, it still may have latency issues due to the time it needs in order to output an action
for a single observation. To see how a system will react in the face of latency, we use the action delay mechanism: at time step t the agent outputs an action a_t based on s_t, but the system actually responds to a_{t-n}, where n is the delay in time steps. Throughput corresponds to the frequency of input observations the agent is able to process, which depends on the amount of hardware or compute available to it as well as on the complexity of the agent itself. We modeled the effects of throughput bottlenecks as action repetition: denoting the length of the action repetition by k, at time step k·t the agent outputs an action a_{k·t} based on the observation s_{k·t}, and for the next k − 1 time steps (i.e., time steps k·t + 1, k·t + 2, ..., k·(t+1) − 1) the agent repeats the same output a_{k·t}. These two perturbations allow us to see how agents with latency and throughput issues affect their environment, and can additionally show how well an agent can learn to plan around its computational shortcomings.
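Both mechanisms can be sketched as simple environment wrappers. The code below is an illustrative Python sketch around a generic env.step(action) -> (observation, reward, done) interface; it is not the suite's own wrapper implementation:

```python
from collections import deque

class ActionDelayWrapper:
    """Latency: at step t the system responds to the action chosen n steps earlier."""

    def __init__(self, env, delay, default_action):
        self.env = env
        # Pre-fill with a default action so the first `delay` steps are well defined.
        self.pending = deque([default_action] * delay)

    def step(self, action):
        self.pending.append(action)
        delayed_action = self.pending.popleft()  # the action from `delay` steps ago
        return self.env.step(delayed_action)


class ActionRepetitionWrapper:
    """Throughput: each agent action is held for k consecutive environment steps."""

    def __init__(self, env, repeat):
        self.env = env
        self.repeat = repeat

    def step(self, action):
        total_reward = 0.0
        for _ in range(self.repeat):
            observation, reward, done = self.env.step(action)
            total_reward += reward
            if done:
                break
        return observation, total_reward, done
```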
Figures 2a and 2b show the performance of D4PG and DMPO, respectively, on the action delay
challenge. For discussion on these results we refer the reader to Section 2.2. Figures 13a and 13b
show the performance on the action repetition challenge for D4PG and DMPO, respectively. We note that, as expected, the performance of the agents generally deteriorates as the number of repeated actions increases. More interestingly, we observe that although quadruped has larger state and action spaces than cartpole and walker, it is still more robust to action repetition. We believe the reason lies in the inherent stability of the different tasks, where humanoid is the least stable and quadruped is the most stable.
2.8 Challenge 8: Offline Reinforcement Learning
Motivation & Related Work For many systems, learning from scratch through online interaction
with the environment is too expensive or time-consuming. Therefore, it is important to design
algorithms for learning good policies from offline logs of the system's behavior. In many cases these logs come from an existing rule-based, heuristic or myopic policy that we are trying to replace with an RL approach. This setting is typically referred to as Offline Reinforcement Learning.3
Offline and off-policy learning are closely related:
Off-policy learning consists of a behaviour policy that generates the data and a target policy that
learns from the data generated by the behaviour policy [Sutton and Barto 2018].
3 Offline RL is also referred to as 'batch RL' in the literature.
[Figure 13 plots omitted: average return vs. the number of time steps an action is repeated under the action repetition challenge, for (a) D4PG and (b) DMPO on cartpole:swingup, walker:walk, quadruped:walk and humanoid:walk.]
The behaviour
policy continuously collects data for the agent in the environment (typically a simulator). An
example of this is in deep RL, where data is collected using the past policies π_0, π_1, ..., π_k up to time k during training and stored in a replay buffer. This data is then used to train the policy π_{k+1}
[Levine et al. 2020]. There are numerous examples of off-policy RL such as Q-learning [Sutton and
Barto 2018], Deep Q-Networks [Mnih et al. 2015] as well as actor critic variants such as IMPALA
[Espeholt et al. 2018c]. Offline RL, however, does not have the luxury of a behaviour policy
that continuously interacts with the environment. In this setting, a dataset of trajectories is made
available to the agent from a potentially unknown behaviour policy π_B. The dataset is collected
once and is not altered during training [Levine et al. 2020].
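The distinction can be made concrete with a minimal sketch of the offline data pipeline (plain NumPy with placeholder data; real logs would of course come from the deployed system): the dataset is sampled from repeatedly but never grows or changes.

```python
import numpy as np

rng = np.random.default_rng(0)

# A fixed dataset of logged transitions, collected once by the behaviour policy pi_B.
# Random placeholders stand in for real logs here.
num_transitions, obs_dim, act_dim = 10_000, 5, 1
dataset = {
    "observation": rng.normal(size=(num_transitions, obs_dim)),
    "action": rng.normal(size=(num_transitions, act_dim)),
    "reward": rng.normal(size=(num_transitions,)),
    "next_observation": rng.normal(size=(num_transitions, obs_dim)),
}

def sample_batch(data, batch_size):
    """Uniformly sample a minibatch; the dataset itself is never modified."""
    idx = rng.integers(0, num_transitions, size=batch_size)
    return {key: value[idx] for key, value in data.items()}

for step in range(3):
    batch = sample_batch(dataset, batch_size=256)
    # An offline agent would consume `batch` here (e.g. learner.update(batch)),
    # with no further environment interaction at any point during training.
```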
Some of the early examples of offline RL include least squares temporal difference methods [Bradtke
and Barto 1996; Lagoudakis and Parr 2003] and fitted Q iteration [Ernst et al. 2005; Riedmiller
2005]. More recent works such as Agarwal et al. [2019], Fujimoto et al. [2019], or Kumar et al. [2019]
have shown that naively applying well-known deep RL methods such as DQN [Mnih et al. 2015] in
the offline setting can lead to poor performance. This has been attributed to a combination of poor
generalization outside the training data’s distribution as well as overly confident Q-function esti-
mates when performing backups with a max operator. However, distributional deep RL approaches
[Dabney et al. 2018; Bellemare et al. 2017; Barth-Maron et al. 2018] have been shown to produce
better performance in the offline setting in both Atari [Agarwal et al. 2019] and robot manipulation
[Cabi et al. 2019]. There have also been a number of recent methods explicitly addressing the issues
stemming from combining generalization outside the training data along with issues related to the
max operator; these come in two main flavors. The first family of approaches constrains the action choice to the support of the training data [Fujimoto et al. 2019; Kumar et al. 2019; Siegel et al. 2020; Jaques et al. 2019; Wu et al. 2019; Wang et al. 2020]. The second family starts with behavior cloning [BC; Pomerleau 1989], which trains a policy using the objective of predicting the
action seen in the offline logs. Works such as Wang et al. [2018], Chen et al. [2019b], or Peng et al.
[2019] then use the advantage function to select the best actions in the dataset for training behav-
ior cloning. Finally, model-based approaches also offer a solution to the offline setup, by training
a model of the system dynamics offline and then exploiting it to solve the problem. Works such
as MOPO [Yu et al. 2020] and MoREL [Kidambi et al. 2020] leverage the learnt model to learn a
model-free policy, and approaches such as MBOP [Argenson and Dulac-Arnold 2020] leverage the
model directly using an MPC-based planner.
Experimental Setup & Results The realworldrl-suite version of the offline / batch RL chal-
lenge is to learn from data logs generated from sub-optimal policies running on the no-challenge
setting, where all challenge effects are turned off, and the combined challenge setting (see Section
2.10) where data logs are generated from an environment that includes effects from combining all
the challenges (except for safety and multi-objective rewards). The policies were obtained by train-
ing three DMPO agents until convergence with different random weight initializations, and then
taking snapshots corresponding to roughly 75% of the converged performance. For the no challenge
setting, we generated three datasets of different sizes for each environment by combining the three
snapshots, with the total dataset sizes (in numbers of episodes) provided in Table 5. Further, we
repeated the procedure with the easy combination of the other challenges (see section 2.10). We
chose to use the “large data” setting for the combined challenge to ensure the task is still solvable.
The algorithms used for offline learning were an offline version of D4PG [Barth-Maron et al. 2018]
that uses the data logs as a fixed experience replay buffer, as well as Critic Regularized Regression (CRR) [Wang et al. 2020], which restricts the learned policy to mimic the behavior policy when the logged action has a positive advantage.
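As a rough sketch of the idea behind CRR (simplified to a binary advantage filter with a sampled value baseline, using hypothetical tiny networks; this is not the authors' implementation), the policy update can be written as advantage-weighted behaviour cloning on the logged transitions:

```python
import torch
import torch.nn as nn

class TinyPolicy(nn.Module):
    """Diagonal-Gaussian policy, kept tiny so the sketch is self-contained."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.mean = nn.Linear(obs_dim, act_dim)
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs):
        return torch.distributions.Normal(self.mean(obs), self.log_std.exp())

class TinyCritic(nn.Module):
    """Q(s, a) estimate."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Linear(obs_dim + act_dim, 1)

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

def crr_binary_policy_loss(policy, critic, obs, logged_actions, num_samples=4):
    """Behaviour cloning weighted by 1[Q(s, a_logged) >= V(s)], where V(s) is
    estimated by averaging Q over actions sampled from the current policy."""
    dist = policy(obs)
    with torch.no_grad():
        q_logged = critic(obs, logged_actions)
        sampled = dist.sample((num_samples,))                       # [S, B, act_dim]
        q_samples = torch.stack([critic(obs, a) for a in sampled])  # [S, B]
        weights = (q_logged >= q_samples.mean(dim=0)).float()       # keep "good" logged actions
    log_prob = dist.log_prob(logged_actions).sum(dim=-1)
    return -(weights * log_prob).mean()

# Usage on a random batch standing in for a minibatch of logged transitions.
obs, acts = torch.randn(32, 5), torch.randn(32, 2)
loss = crr_binary_policy_loss(TinyPolicy(5, 2), TinyCritic(5, 2), obs, acts)
loss.backward()
```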
The performance of the ABM algorithm trained on the small, medium and large batch datasets can
be found in Figure 14 (learning curves) for each of the domains. D4PG was also trained on each of
the tasks, but failed to learn in each case and therefore the results have been omitted. As seen in the
figures, the agent fails to learn properly in the humanoid:walk and cartpole:swingup domains,
but manages to reach a decent level of performance in walker:walk and quadruped:walk. In
addition, the size of the dataset does not seem to have a significant effect on performance. This may indicate that even the smallest datasets are still large enough not to handicap the learning capabilities of a state-of-the-art offline RL agent, while the task remains too difficult for D4PG to solve.
For the ‘Easy’ combined challenge offline task, we used DMPO behaviour policies trained on
each task. The humanoid:walk DMPO behaviour policy was too poor to generate reason-
able data (see Figure 17b) and we therefore focused on cartpole:swingup, walker:walk and
quadruped:walk for this task. This also motivates why we need to make progress on the combined
challenges online task (see Section 2.10), so that we can obtain reasonable behaviour policies with which to generate the datasets for batch RL algorithms to train on.
We subsequently trained CRR and D4PG (offline version) on the data generated from the behaviour
policies. The agents failed to achieve any reasonable level of performance on cartpole and walker,
and have thus been omitted. The learning curves of CRR trained on quadruped on the combined easy
challenge can be found in Figure 15. Although the performance is still sub-optimal, it is encouraging
to see that the batch agents can learn something reasonable. The D4PG offline agent failed to learn
in each case and the results have therefore been omitted.
If a researcher or practitioner is testing out the capabilities of an agent, it would be useful to be able
to summarize the performance of an agent across each challenge dimension. One such approach is
to do a radar plot with respect to the various challenges. We provide an example radar plot of the D4PG agent's performance on a subset of the challenges (for visualization purposes) in Figure 16 for the
domains of cartpole:swingup, walker:walk, quadruped:walk and humanoid:walk respec-
tively. The performance is measured individually for each challenge. There are four overlapping
traces on each radar plot, namely optimal performance (blue) as well as Diff1 (red), Diff2 (orange) and Diff3 (black), which correspond to the second, third and fourth parameters (in ascending order of difficulty) for each challenge from Table 11.
As can be seen in the figure, D4PG struggles with the hard setting along each of the challenge dimensions, other than increased observation dimension. In addition, it appears to be less sensitive to reward delay and to adding Gaussian action noise on all domains except for humanoid.
[Figure 14 plots omitted: learning curves (average reward vs. learner steps) for cartpole:realworld_swingup, walker:realworld_walk, quadruped:realworld_walk and humanoid:realworld_walk (panel titles: agent=Advantage-Weighted Modelling), with three curves per panel (legend labels: easy, medium, hard).]
Figure 14: Learning from offline data on small, medium and large datasets in the no challenge setting
using CRR. For the cartpole domain, the X-axis is extended to show a clearer learning curve.
[Figure 15 plot omitted: average reward vs. learner steps for the large dataset.]
Figure 15: Learning from offline data on large datasets in the easy combined challenge setting using
CRR on quadruped.
This kind of summary immediately identifies the weak points of an algorithm. We will make this plotting code available in the realworldrl-suite open-source codebase.
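A minimal sketch of such a radar plot, with made-up scores purely for illustration (the plotting code released with the suite may differ), is:

```python
import numpy as np
import matplotlib.pyplot as plt

challenges = ["Increase obs. dimension", "Gaussian action noise", "Gaussian obs. noise",
              "Reward delay", "Observation delay", "Action delay"]
# Made-up per-challenge returns for a single domain, one list entry per challenge axis.
scores = {
    "Optimal": [800, 800, 800, 800, 800, 800],
    "Easy":    [790, 700, 680, 760, 650, 640],
    "Medium":  [780, 520, 500, 720, 400, 380],
    "Hard":    [770, 200, 180, 600, 120, 100],
}

angles = np.linspace(0, 2 * np.pi, len(challenges), endpoint=False).tolist()
angles += angles[:1]  # close the polygon

fig, ax = plt.subplots(subplot_kw={"polar": True})
for label, values in scores.items():
    data = values + values[:1]
    ax.plot(angles, data, label=label)
    ax.fill(angles, data, alpha=0.1)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(challenges, fontsize=8)
ax.legend(loc="lower right")
plt.show()
```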
Experiment | Challenge 1 (Easy) | Challenge 2 (Medium) | Challenge 3 (Hard)
System delays (time steps):
  Action | 3 | 6 | 9
  Observation | 3 | 6 | 9
  Rewards | 10 | 20 | 40
Action repetition (time steps) | 1 | 2 | 3
Gaussian noise (std. deviation):
  Action | 0.1 | 0.3 | 1.0
  Observation | 0.1 | 0.3 | 1.0
Stuck/dropped sensors (probability, time steps):
  Stuck sensor | 0.01, 1 | 0.05, 5 | 0.1, 10
  Dropped sensor | 0.01, 1 | 0.05, 5 | 0.1, 10
Perturbations ([min, max], std.):
  Cartpole | [0.9, 1.1], 0.02 | [0.7, 1.7], 0.1 | [0.5, 2.3], 0.15
  Quadruped | [0.25, 0.3], 0.005 | [0.25, 0.8], 0.05 | [0.25, 1.4], 0.1
  Walker | [0.225, 0.25], 0.002 | [0.225, 0.4], 0.015 | [0.15, 0.55], 0.04
  Humanoid | [0.6, 0.8], 0.02 | [0.5, 0.9], 0.04 | [0.4, 1.0], 0.06
High dimensionality (state dimension increase) | 10 | 20 | 50
Table 6: The hyperparameter setting for each combined challenge in increasing levels of difficulty.
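For reference, the 'Easy' column of Table 6 can be collected into a single configuration structure along the following lines (an illustrative Python dictionary; the suite's own configuration language may differ):

```python
easy_combined_challenge = {
    "delays": {"action": 3, "observation": 3, "reward": 10},  # in time steps
    "action_repetition": 1,
    "gaussian_noise_std": {"action": 0.1, "observation": 0.1},
    "stuck_sensor": {"probability": 0.01, "steps": 1},
    "dropped_sensor": {"probability": 0.01, "steps": 1},
    "perturbations": {  # [min, max] range and std, as in Table 6
        "cartpole": {"range": [0.9, 1.1], "std": 0.02},
        "quadruped": {"range": [0.25, 0.3], "std": 0.005},
        "walker": {"range": [0.225, 0.25], "std": 0.002},
        "humanoid": {"range": [0.6, 0.8], "std": 0.02},
    },
    "state_dimension_increase": 10,
}
```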
While each of these challenges presents difficulties independently, many real-world domains possess all of the challenges together. To demonstrate the difficulty of learning to control a system with
multiple dimensions of real-world difficulty, we combine multiple challenges described above into a
set of benchmark tasks to evaluate real-world learning algorithms. Our combined challenges include
parameter perturbations, additional state dimensions, observation delays, action delays, reward de-
lays, action repetition, observation & action noise, and stuck & dropped sensors. Even taking the
relatively easy versions of each challenge (where the algorithm still reached close to the optimal
performance individually) and combining them together creates a surprisingly difficult task. Perfor-
mance on these challenges can be seen in Table 7 for D4PG and Table 8 for DMPO, and Figures 17a
and 17b respectively. We can see that both learners’ performance drops drastically, even when ap-
plying the smallest perturbations of each challenge.
Due both to the practical interest in these combined challenges and to their clear difficulty, we believe them to be good benchmark tasks for researchers looking to create RL algorithms for real-world systems. We provide the parameters for each challenge in Table 6 (taken from the individual hyperparameter sweeps, see Table 11 in the Appendix). The realworldrl-suite can load the
challenges directly, making it easy to replicate these benchmark environments in any experimental
setup. Although the baseline performance we provide is with a naive learner that is not designed
to answer these challenges, we believe it provides a good starting point for comparison and look
forward to followup work that provides more performant algorithms on these reference challenges.
In this paper, we have addressed 8 of the 9 challenges originally presented in [Dulac-Arnold et al.
2019]. The remaining challenge is explainability. Objectively evaluating explainability of a policy
is not trivial, but we hope this can be addressed in future iterations of this suite. We provide an
overview of this challenge and possible approaches to creating explainable RL agents.
Difficulty | cartpole:swingup | walker:walk | quadruped:walk | humanoid:walk
No Challenge | 859.63 (5.68) | 983.24 (9.7) | 998.71 (0.32) | 934.0 (27.34)
Easy | 482.32 (84.56) | 514.44 (70.21) | 787.73 (86.95) | 102.92 (22.47)
Medium | 175.47 (51.57) | 75.49 (16.94) | 268.01 (135.84) | 1.28 (0.99)
Hard | 108.2 (57.97) | 59.85 (17.7) | 280.75 (123.21) | 1.27 (0.79)
Table 7: Mean D4PG performance (± standard deviation) when incorporating all challenges into the system.
Explainability Another essential aspect of real systems is that they are owned and operated by
humans, who need to be reassured about the controllers’ intentions and require insights regarding
failure cases. For this reason, policy explainability is important for real-world policies. Especially
in cases where the policy might find an alternative and unexpected approach to controlling a system,
understanding the longer-term intent of the policy is important for obtaining stakeholder buy-in.
In the event of policy errors, being able to understand the error’s origins a posteriori is essential.
Previous work that is potentially well-suited to this challenge includes options [Sutton et al. 1999], which are well-defined hierarchical actions that can be composed together to solve a given task. Previous
research in this area includes learning the options from scratch [Mankowitz et al. 2016a;b; Bacon
et al. 2017] as well as planning, given a pre-trained set of options [Schaul et al. 2015; Mankowitz
et al. 2018b]. In addition, research has been done to develop a symbolic planning language that
could be useful for explainability [Konidaris et al. 2018; James et al. 2018].
[Figure 16 radar plots omitted: one per domain (cartpole:swingup, walker:walk, quadruped:walk, humanoid:walk), with axes for increase observation dimension, Gaussian action noise, Gaussian observation noise, reward delay, observation delay and action delay, and overlaid traces for the Optimal, Easy, Medium and Hard settings.]
Figure 16: D4PG radar plot for the domains of cartpole:swingup, walker:walk,
quadruped:walk and humanoid:walk respectively. The performance is measured individually for
each challenge. There are four overlapping traces on each radar plot, namely optimal performance (blue) as well as Diff1 (red), Diff2 (orange) and Diff3 (black), which correspond to the second, third and fourth parameters (in ascending order of difficulty) for each challenge from Table 11.
[Figure 17 plots omitted: average reward of (a) D4PG and (b) DMPO on cartpole:swingup, walker:walk, quadruped:walk and humanoid:walk at the No Challenge, Easy, Medium and Hard difficulties.]
Figure 17: D4PG and DMPO performance when incorporating all challenges into the system.
3 Additional Related Work
While we covered related work specific to each challenge in the sections above, there are a few
other works that relate to ours, either through the goal of practical reinforcement learning or more
generally by providing interesting benchmark suites.
In general, the fact that machine learning methods have a tendency to overfit to their evaluation
environments is well-recognized. Wagstaff [2012] discusses the strong lack of real-world applica-
tions in ML conferences and the subsequent impact on research directions this can have. Henderson
et al. [2018] investigate ways in which RL results can be made to be more reproducible and sug-
gest guidelines for doing so. Their paper ends by asking the question “In what setting would [a
given algorithm] be useful?”, to which we try to contribute by proposing a specific setting in which
well-adapted work should hopefully stand out.
Hester and Stone [2013] similarly present a list of challenges for real world RL, but specifically
for RL on robots. They present four challenges (sample efficiency, high-dimensional state and ac-
tion spaces, sensor/actuator delays, and real-time inference), all of which we include in our set of
challenges. They do not include our other challenges such as satisfying constraints, multi-objective rewards, non-stationarity and partial observability (e.g., noisy/stuck sensors). Their approach is to set up a
real-time architecture for model-based learning where ensembles of models are learned to improve
robustness and sample efficiency. In a spirit similar to ours, the bsuite framework [Osband et al.
2019] proposes a set of challenges grounded in fundamental problems in RL such as memory, ex-
ploration, credit assignment etc. These problems are equally important and complementary to the
more empirically founded challenges proposed in our suite. Recently, other teams have released
real-world inspired environments, such as Safety Gym [Ray et al. 2019], which extends a planar
world with location-based safety constraints. Our suite proposes a richer and more varied set of
constraints, as well as an easy ability to add custom constraints, which we believe provides a more
general and difficult challenge for RL algorithms.
The Horizon platform [Gauci et al. 2018] and Decision Service [Agarwal et al. 2016] provide soft-
ware platforms for training, evaluation and deployment of RL agents in real-world systems. In
the case of Decision Service, transition probabilities are logged to help make off-policy evaluation
easier down the line, and both systems consider different approaches to off-policy evaluation. We
believe well-structured frameworks such as these are crucial to productionizing RL systems. Ahn
et al. [2019] propose a set of simple robot designs with corresponding simulators that have been
tuned to be physically realistic, implementing safety constraints and various perturbations.
Riedmiller [2012] proposes a set of best practices for successfully solving typical real-world control
tasks using RL. This is intended as a subjective report on how they tackle problems in practice.
We emphasize in this work that the goal is to enable RL on real-world products and systems, which may include recommender systems, physical control systems such as autonomous driving/navigation, warehouse automation, etc. There are, of course, some real-world systems that have had success
using RL as the algorithmic solution, mainly in robotics. For example, Gu et al. [2017] perform off-policy training of deep Q-functions to learn 3D manipulation skills as well as a door opening skill. Mahmood et al. [2018] provide benchmarks using four off-the-shelf RL algorithms and evaluate the performance on multiple commercially available robots. Kalashnikov et al. [2018] introduce QT-Opt, a self-supervised vision-based RL algorithm that can learn a grasping skill that generalizes to unseen objects and handles perturbations. Levine et al. [2016] proposed an end-to-end
learning algorithm that can map raw image observations to torques at the robot's motors. This
algorithm is able to complete a range of manipulation tasks requiring close coordination between
vision and control, such as screwing a cap on a bottle.
• Seven real-world challenge wrappers (mentioned above) across 8 DeepMind Control Suite
tasks [Tassa et al. 2018]:
cartpole:(swingup and balance), walker:(walk and run),
quadruped:(walk and run), humanoid:(stand and walk)
• The flexibility to instantiate different variants of each challenge, as well as the ability to
easily combine challenges together using a simple configuration language. See Appendix
C for more details.
• Examples of how to run RL agents on each challenge environment.
• The ability to instantiate the “Easy”, “Medium” and “Hard” combined challenges.
• A Jupyter notebook enabling an agent to be run on any of the challenges in a browser, as
well as accompanying functions to plot the agent’s performance.
Evaluation environments. In this paper, we evaluate RL algorithms on a subset of four tasks from
our suite, namely: cartpole:swingup, walker:walk, quadruped:walk and humanoid:walk.
We chose these tasks to cover varying levels of task difficulty and dimensionality. It should be
noted that MuJoCo possesses an internal dynamics state and that only preprocessed observations are
available to the agent [Tassa et al. 2018]. We refer to state in this paper as in the typical MDP
setting: the information available to the agent at time t. Since we provide all available observations
as input to the agent, we use the term observation and state interchangeably in this paper. For
each challenge, we have implemented environment wrappers that instantiate the challenge. These
wrappers are parameterized such that the challenge can be ramped up from having no effect to being
very difficult. For example, the amount of delay added onto the actuators can be set arbitrarily,
varying the difficulty from slight to impossible. By implementing the challenges in this way, we can
easily adapt them to other tasks and ramp them up and down to measure their effects. Our goal with
this task suite is to replicate difficulties seen in complex real systems in a more simplified setup,
allowing for methodical and principled research.
Identification and definition of real-world challenges We believe that we provide a set of the
most important challenges that RL algorithms need to succeed at before being ready for real-world
application. In our own personal experience, as well as that of our collaborators, we have been confronted numerous times with the often difficult task of applying RL to various real-world systems. This set of challenges stems from these experiences, and we are convinced that finding solutions to them will likely provide promising algorithms that are readily usable in real-world systems. We are particularly interested in results in the off-line domain, as most large systems
have a large amount of logs, but little to no tolerance for exploratory actions (datacenter cooling
& robotics being good examples of this). We also believe that algorithms able to reason about
environmental constraints will allow RL to move onto systems that were previously considered too
fragile or expensive for learning-based approaches. Overall, we are excited about the directions that much of the cited research is taking, and we look forward to interesting results in the near future.
Experiment design and analysis for each challenge Additionally, the design of an experiment
for each challenge demonstrates the independent effects of each challenge on an RL agent. This
allowed us to show which aspects of real-world tasks present the biggest difficulties for RL agents
in a precise and reproducible manner. In the case of learning on live systems from limited samples,
our proposed efficiency metrics (performance regret and stability) produced interesting findings,
showing DMPO to be almost an order of magnitude worse in terms of regret, but significantly more
stable once converged. When dealing with system delays, we saw that observation and action de-
lays quickly degrade algorithm performance, but reward delays seem to be globally less impactful
except on the humanoid:walk task. For high-dimensional continuous state & action spaces, we
see that additional observation dimensions don’t affect either DMPO or D4PG significantly, and
that environments with more action dimensions are not necessarily harder to learn. When reasoning
about system constraints, we argue that explicit reasoning about constraints is preferable to sim-
ply integrating them in the reward, and show that there is no natural way to express constraints in
the standard MDP framework. We provide a mechanism in realworldrl-suite that can express
constraints in the CMDP setting, and show that constraints can be violated in interesting ways, es-
pecially in tasks that have different regimes (e.g. cartpole:swingup’s ‘swing-up’ and ‘balance’
phases). Partial observability & non-stationarity are often present in real systems, and can present
clear problems for learning algorithms. In small doses stuck sensors pose less of a problem than
outright dropped signals however, even though the underlying information is the same. When it
comes to non-stationary system dynamics, we can see that the effects depend greatly on the type of
element that is varying. Additionally, naive policies clearly degrade more quickly in the face of un-
stable system dynamics. Multi-objective rewards can be difficult to optimize for when they are not
well-aligned. By using safety-related constraints that weren’t always compatible with the base task,
we showed how naively reasoning about this trade-off can quickly degrade system performance, yet
that compromising solutions are also possible. We believe that expressing tasks beyond a single
reward function is essential in tackling more complex problems and look forward to new methods
able to do so. Real-time policies are essential for high-frequency control loops present in robotics
or low-latency responses necessary in software systems. We showed the effects of both action and
state delays on DMPO and D4PG, and showed that these approaches quickly degrade if the sys-
tem’s control frequency is higher than their response time and actions decorrelate too strongly from
observations. Many real-world systems are hard to train on directly, and therefore RL agents need
to be able to train off-line from fixed logs. It has long been known that this is not a trivial task, as
situations that aren’t represented in the data become difficult to respond to. Especially in the case of
off-policy td-learning methods, the arg max over-estimation issue quickly creates divergent value
functions. We showed that simply applying D4PG to data from a logged task is not sufficient to find
a functional policy, but that offline-specific learning algorithms can deal with even small amounts of
data. Finally, explainable policies are often desirable (as are explainable machine learning models
in general), but not easy to provide or even evaluate. We point to a couple of directions of current work in this area, and hope that future work finds a clearer approach to this problem.
Define and baseline RWRL Combined Challenge Benchmark tasks By combining a well-
tuned set of challenges into a single environment, we were able to generate 12 benchmark tasks
(3 levels of difficulty and 4 tasks) which can serve as reference tasks for further research in real-
world RL. The choice of challenge parameterizations for each level of difficulty was performed
after careful analysis of the combined effects on the learning algorithms we experimented with. We
also provided a first round of baselines on our benchmark tasks by running D4PG and DMPO on
them: we find that D4PG seems to be slightly more robust for easy perturbations but, aside from the
quadruped:walk task, quickly matches DMPO in poor performance. By providing these baseline
performance numbers for D4PG and DMPO on these tasks, we hope that followup work will have a
good starting point to understand the quality of their proposed solutions. We encourage the research
community to improve upon our current set of RWRL combined challenge baseline results.
References
P. Abbeel and A. Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings
of the twenty-first international conference on Machine learning, page 1. ACM, 2004.
A. Abdolmaleki, J. T. Springenberg, Y. Tassa, R. Munos, N. Heess, and M. A. Riedmiller. Maximum
a posteriori policy optimisation. CoRR, abs/1806.06920, 2018a.
A. Abdolmaleki, J. T. Springenberg, Y. Tassa, R. Munos, N. Heess, and M. A. Riedmiller. Maxi-
mum a posteriori policy optimisation. In International Conference on Learning Representations
(ICLR), 2018b.
A. Abdolmaleki, S. H. Huang, L. Hasenclever, M. Neunert, H. F. Song, M. Zambelli, M. F. Mar-
tins, N. Heess, R. Hadsell, and M. Riedmiller. A distributional view on multi-objective policy
optimization. arXiv preprint arXiv:2005.07513, 2020.
J. Achiam, D. Held, A. Tamar, and P. Abbeel. Constrained policy optimization. CoRR,
abs/1705.10528, 2017.
S. Adam, L. Busoniu, and R. Babuska. Experience replay for real-time reinforcement learning con-
trol. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews),
42(2):201–212, 2011.
I. Adamski, R. Adamski, T. Grel, A. Jedrych, K. Kaczmarek, and H. Michalewski. Distributed deep
reinforcement learning: Learn how to play atari games in 21 minutes. In International Conference
on High Performance Computing, pages 370–388. Springer, 2018.
A. Agarwal, S. Bird, M. Cozowicz, L. Hoang, J. Langford, S. Lee, J. Li, D. Melamed, G. Oshri,
O. Ribas, et al. Making contextual decisions with low technical debt. arXiv preprint
arXiv:1606.03966, 2016.
R. Agarwal, D. Schuurmans, and M. Norouzi. Striving for simplicity in off-policy deep reinforce-
ment learning. Preprint arXiv:1907.04543, 2019.
M. Ahn, H. Zhu, K. Hartikainen, H. Ponte, A. Gupta, S. Levine, and V. Kumar. ROBEL: RObotics
BEnchmarks for Learning with low-cost robots. In Conference on Robot Learning (CoRL), 2019.
E. Altman. Constrained Markov decision processes, volume 7. CRC Press, 1999.
M. Andrychowicz, B. Baker, M. Chociej, R. Jozefowicz, B. McGrew, J. Pachocki, A. Petron,
M. Plappert, G. Powell, A. Ray, et al. Learning dexterous in-hand manipulation. arXiv preprint
arXiv:1808.00177, 2018.
A. Argenson and G. Dulac-Arnold. Model-based offline planning. arXiv preprint arXiv:2008.05556,
2020.
J. A. Arjona-Medina, M. Gillhofer, M. Widrich, T. Unterthiner, and S. Hochreiter. Rudder: Return
decomposition for delayed rewards. arXiv preprint arXiv:1806.07857, 2018.
P.-L. Bacon, J. Harb, and D. Precup. The option-critic architecture. In Thirty-First AAAI Conference
on Artificial Intelligence, 2017.
G. Barth-Maron, M. W. Hoffman, D. Budden, W. Dabney, D. Horgan, D. TB, A. Muldal, N. Heess,
and T. P. Lillicrap. Distributed distributional deterministic policy gradients. In International
Conference on Learning Representations (ICLR), 2018.
M. G. Bellemare, W. Dabney, and R. Munos. A distributional perspective on reinforcement learning.
CoRR, abs/1707.06887, 2017.
S. Bohez, A. Abdolmaleki, M. Neunert, J. Buchli, N. Heess, and R. Hadsell. Value constrained
model-free continuous control. arXiv preprint arXiv:1902.04623, 2019.
C. Boutilier and T. Lu. Budget allocation using weakly coupled, constrained markov decision pro-
cesses. In Proceedings of the 32nd Conference on Uncertainty in Artificial Intelligence (UAI-16),
pages 52–61, New York, NY, 2016.
S. Bradtke and A. Barto. Linear least-squares algorithms for temporal difference learning. Machine
Learning, 22:33–57, 03 1996.
J. Buckman, D. Hafner, G. Tucker, E. Brevdo, and H. Lee. Sample-efficient reinforcement learning
with stochastic ensemble value expansion. CoRR, abs/1807.01675, 2018.
S. Cabi, S. G. Colmenarejo, A. Novikov, K. Konyushkova, S. Reed, R. Jeong, K. Zolna, Y. Aytar,
D. Budden, M. Vecerik, O. Sushkov, D. Barker, J. Scholz, M. Denil, N. de Freitas, and Z. Wang.
Scaling data-driven robotics with reward sketching and batch reinforcement learning. Preprint
arXiv:1909.12200, 2019.
H. Cai, K. Ren, W. Zhang, K. Malialis, J. Wang, Y. Yu, and D. Guo. Real-time bidding by reinforce-
ment learning in display advertising. In Proceedings of the Tenth ACM International Conference
on Web Search and Data Mining, pages 661–670, 2017.
D. A. Calian, D. J. Mankowitz, T. Zahavy, Z. Xu, J. Oh, N. Levine, and T. Mann. Balancing
constraints and rewards with meta-gradient d4pg, 2020.
N. Carrara, R. Laroche, J. Bouraoui, T. Urvoy, T. D. S. Olivier, and O. Pietquin. A fitted-q algorithm
for budgeted mdps. In EWRL 2018, 2018.
A. R. Cassandra. A survey of pomdp applications. In Working notes of AAAI 1998 fall symposium
on planning with partially observable Markov decision processes, volume 1724, 1998.
M. Chen, A. Beutel, P. Covington, S. Jain, F. Belletti, and E. H. Chi. Top-k off-policy correction for
a reinforce recommender system. In Proceedings of the Twelfth ACM International Conference
on Web Search and Data Mining, pages 456–464, 2019a.
X. Chen, Z. Zhou, Z. Wang, C. Wang, Y. Wu, Q. Deng, and K. Ross. BAIL: Best-action imitation
learning for batch deep reinforcement learning. Preprint arXiv:1910.12179, 2019b.
Y. Chow, O. Nachum, E. Duenez-Guzman, and M. Ghavamzadeh. A lyapunov-based approach to
safe reinforcement learning. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-
Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages
8092–8101. 2018.
K. Chua, R. Calandra, R. McAllister, and S. Levine. Deep reinforcement learning in a handful
of trials using probabilistic dynamics models. In Advances in Neural Information Processing
Systems, pages 4754–4765, 2018.
P. Covington, J. Adams, and E. Sargin. Deep neural networks for youtube recommendations. In
Proceedings of the 10th ACM conference on recommender systems, pages 191–198. ACM, 2016.
W. Dabney, G. Ostrovski, D. Silver, and R. Munos. Implicit quantile networks for distributional
reinforcement learning. In J. Dy and A. Krause, editors, Proceedings of the 35th International
Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research,
pages 1096–1105, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR.
G. Dalal, K. Dvijotham, M. Vecerik, T. Hester, C. Paduraru, and Y. Tassa. Safe exploration in
continuous action spaces. CoRR, abs/1801.08757, 2018.
E. Derman, D. J. Mankowitz, T. A. Mann, and S. Mannor. Soft-robust actor-critic policy-gradient.
arXiv preprint arXiv:1803.04848, 2018.
E. Derman, D. J. Mankowitz, T. A. Mann, and S. Mannor. A bayesian approach to robust reinforce-
ment learning. CoRR, abs/1905.08188, 2019. URL http://arxiv.org/abs/1905.08188.
K. Doya, K. Samejima, K.-i. Katagiri, and M. Kawato. Multiple model-based reinforcement learn-
ing. Neural computation, 14(6):1347–1369, 2002.
G. Dulac-Arnold, R. Evans, H. van Hasselt, P. Sunehag, T. Lillicrap, J. Hunt, T. Mann, T. Weber,
T. Degris, and B. Coppin. Deep reinforcement learning in large discrete action spaces. arXiv
preprint arXiv:1512.07679, 2015.
G. Dulac-Arnold, D. J. Mankowitz, and T. Hester. Challenges of real-world reinforcement learning.
ICML Workshop on Reinforcement Learning for Real Life, abs/1904.12901, 2019.
D. Ernst, P. Geurts, and L. Wehenkel. Tree-based batch mode reinforcement learning. Journal of
Machine Learning Research, 6:503–556, 2005.
R. Evans and J. Gao. Deepmind ai reduces google data centre cooling bill by
40%, 2016. URL https://deepmind.com/blog/deepmind-ai-reduces-google-data-
centre-cooling-bill-40/.
C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep net-
works. In Proceedings of the 34th International Conference on Machine Learning-Volume 70,
pages 1126–1135. JMLR. org, 2017.
S. Fujimoto, D. Meger, and D. Precup. Off-policy deep reinforcement learning without exploration.
In International Conference on Machine Learning, pages 2052–2062, 2019.
S. Gu, E. Holly, T. Lillicrap, and S. Levine. Deep reinforcement learning for robotic manipulation
with asynchronous off-policy updates. In 2017 IEEE international conference on robotics and
automation (ICRA), pages 3389–3396. IEEE, 2017.
D. Hafner, T. Lillicrap, I. Fischer, R. Villegas, D. Ha, H. Lee, and J. Davidson. Learning latent
dynamics for planning from pixels. arXiv preprint arXiv:1811.04551, 2018.
M. J. Hausknecht and P. Stone. Deep recurrent q-learning for partially observable mdps. CoRR,
abs/1507.06527, 2015.
J. He, J. Chen, X. He, J. Gao, L. Li, L. Deng, and M. Ostendorf. Deep reinforcement learning with
a natural language action space. arXiv preprint arXiv:1511.04636, 2015.
T. Hester and P. Stone. TEXPLORE: Real-time sample-efficient reinforcement learning for robots.
Machine Learning, 90(3), 2013. doi: 10.1007/s10994-012-5322-7.
T. Hester, M. Vecerik, O. Pietquin, M. Lanctot, T. Schaul, B. Piot, D. Horgan, J. Quan,
A. Sendonaris, I. Osband, G. Dulac-Arnold, J. Agapiou, J. Z. Leibo, and A. Gruslys. Deep q-
learning from demonstrations. In Proceedings of the Thirty-Second AAAI Conference on Artificial
Intelligence, (AAAI-18), pages 3223–3230, 2018a.
T. A. Hester, E. J. Fisher, and P. Khandelwal. Predictively controlling an environmental control
system, Jan. 16 2018b. US Patent 9,869,484.
M. Hoffman, B. Shahriari, J. Aslanides, G. Barth-Maron, F. Behbahani, T. Norman, A. Abdolmaleki,
A. Cassirer, F. Yang, K. Baumli, et al. Acme: A research framework for distributed reinforcement
learning. arXiv preprint arXiv:2006.00979, 2020.
D. Horgan, J. Quan, D. Budden, G. Barth-Maron, M. Hessel, H. van Hasselt, and D. Silver. Dis-
tributed prioritized experience replay. CoRR, abs/1803.00933, 2018.
C.-C. Hung, T. Lillicrap, J. Abramson, Y. Wu, M. Mirza, F. Carnevale, A. Ahuja, and G. Wayne.
Optimizing agent behavior over long time scales by transporting value. arXiv preprint
arXiv:1810.06721, 2018.
E. Ie, C.-w. Hsu, M. Mladenov, V. Jain, S. Narvekar, J. Wang, R. Wu, and C. Boutilier. Recsim:
A configurable simulation platform for recommender systems. arXiv preprint arXiv:1909.04847,
2019.
G. N. Iyengar. Robust dynamic programming. Mathematics of Operations Research, 30(2):257–280,
2005.
M. Jaderberg, V. Mnih, W. Czarnecki, T. Schaul, J. Z. Leibo, D. Silver, and K. Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint, 2016.
S. James, B. Rosman, and G. Konidaris. Learning to plan with portable symbols. In Workshop on
Planning and Learning (PAL@ ICML/IJCAI/AAMAS), 2018.
S. Levine, A. Kumar, G. Tucker, and J. Fu. Offline reinforcement learning: Tutorial, review, and
perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.
K. Li, T. Zhang, and R. Wang. Deep Reinforcement Learning for Multi-objective Optimization. 14
(8):1–10, 2019.
T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Contin-
uous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
A. R. Mahmood, D. Korenkevych, G. Vasan, W. Ma, and J. Bergstra. Benchmarking reinforcement
learning algorithms on real-world robots. arXiv preprint arXiv:1809.07731, 2018.
D. J. Mankowitz, T. A. Mann, and S. Mannor. Adaptive skills adaptive partitions (asap). In Advances
in Neural Information Processing Systems, pages 1588–1596, 2016a.
D. J. Mankowitz, T. A. Mann, and S. Mannor. Iterative hierarchical optimization for misspecified
problems (ihomp). arXiv preprint arXiv:1602.03348, 2016b.
D. J. Mankowitz, A. Tamar, and S. Mannor. Situational awareness by risk-conscious skills. arXiv
preprint arXiv:1610.02847, 2016c.
D. J. Mankowitz, T. A. Mann, P.-L. Bacon, D. Precup, and S. Mannor. Learning robust options. In
Thirty-Second AAAI Conference on Artificial Intelligence, 2018a.
D. J. Mankowitz, A. Žídek, A. Barreto, D. Horgan, M. Hessel, J. Quan, J. Oh, H. van Hasselt,
D. Silver, and T. Schaul. Unicorn: Continual learning with a universal, off-policy agent. arXiv
preprint arXiv:1802.08294, 2018b.
D. J. Mankowitz, N. Levine, R. Jeong, A. Abdolmaleki, J. T. Springenberg, T. A. Mann, T. Hester,
and M. A. Riedmiller. Robust reinforcement learning for continuous control with model misspec-
ification. CoRR, abs/1906.07516, 2019.
D. J. Mankowitz, D. A. Calian, R. Jeong, C. Paduraru, N. Heess, S. Dathathri, M. Riedmiller, and
T. Mann. Robust constrained reinforcement learning for continuous control with model misspec-
ification, 2020.
T. A. Mann, S. Gowal, R. Jiang, H. Hu, B. Lakshminarayanan, and A. György. Learning from
delayed outcomes with intermediate observations. CoRR, abs/1807.09387, 2018.
V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Ried-
miller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement
learning. Nature, 518(7540):529, 2015.
K. V. Moffaert and A. Nowé. Multi-objective reinforcement learning using sets of pareto dominating
policies. JMLR, 1:3663–3692, 2014.
A. Nagabandi, C. Finn, and S. Levine. Deep online learning via meta-learning: Continual adaptation
for model-based RL. CoRR, abs/1812.07671, 2018.
A. Nagabandi, K. Konoglie, S. Levine, and V. Kumar. Deep dynamics models for learning dexterous
manipulation. arXiv preprint arXiv:1909.11652, 2019.
A. Y. Ng, S. J. Russell, et al. Algorithms for inverse reinforcement learning. In Icml, volume 1,
page 2, 2000.
OpenAI. Openai five. https://blog.openai.com/openai-five/, 2018.
I. Osband, C. Blundell, A. Pritzel, and B. Van Roy. Deep exploration via bootstrapped dqn. In
D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural
Information Processing Systems 29, pages 4026–4034. Curran Associates, Inc., 2016.
I. Osband, Y. Doron, M. Hessel, J. Aslanides, E. Sezener, A. Saraiva, K. McKinney, T. Latti-
more, C. Szepezvari, S. Singh, et al. Behaviour suite for reinforcement learning. arXiv preprint
arXiv:1908.03568, 2019.
X. B. Peng, M. Andrychowicz, W. Zaremba, and P. Abbeel. Sim-to-real transfer of robotic control
with dynamics randomization. In 2018 IEEE International Conference on Robotics and Automa-
tion (ICRA), pages 1–8. IEEE, 2018.
X. B. Peng, A. Kumar, G. Zhang, and S. Levine. Advantage-weighted regression: Simple and
scalable off-policy reinforcement learning. Preprint arXiv:1910.00177, 2019.
T. Pham, G. D. Magistris, and R. Tachibana. Optlayer - practical constrained optimization for deep
reinforcement learning in the real world. CoRR, abs/1709.07643, 2017.
D. A. Pomerleau. ALVINN: An autonomous land vehicle in a neural network. In Conference on
Neural Information Processing Systems, pages 305–313, 1989.
S. Ramstedt and C. Pal. Real-time reinforcement learning. In Advances in Neural Information
Processing Systems, pages 3067–3076, 2019.
A. Ray, J. Achiam, and D. Amodei. Benchmarking safe exploration in deep reinforcement learning.
2019.
M. Riedmiller. Neural fitted Q iteration – first experiences with a data efficient neural reinforcement
learning method. In J. Gama, R. Camacho, P. B. Brazdil, A. M. Jorge, and L. Torgo, editors,
European Conference on Machine Learning, pages 317–328, 2005.
M. Riedmiller. 10 steps and some tricks to set up neural reinforcement controllers. In Neural
networks: tricks of the trade, pages 735–757. Springer, 2012.
M. Riedmiller, R. Hafner, T. Lampe, M. Neunert, J. Degrave, T. Van de Wiele, V. Mnih, N. Heess,
and J. T. Springenberg. Learning by playing-solving sparse reward tasks from scratch. arXiv
preprint arXiv:1802.10567, 2018.
D. M. Roijers, P. Vamplew, S. Whiteson, and R. Dazeley. A survey of multi-objective sequential
decision-making. Journal of Artificial Intelligence Research, 48:67–113, 2013.
S. Ross, G. Gordon, and D. Bagnell. A reduction of imitation learning and structured prediction to
no-regret online learning. In Proceedings of the fourteenth international conference on artificial
intelligence and statistics, pages 627–635, 2011.
S. J. Russell. Learning agents for uncertain environments. In COLT, volume 98, pages 101–103,
1998.
H. Satija, P. Amortila, and J. Pineau. Constrained Markov decision processes via backward value functions. arXiv preprint arXiv:2008.11811, 2020.
T. Schaul, D. Horgan, K. Gregor, and D. Silver. Universal value function approximators. In Inter-
national conference on machine learning, pages 1312–1320, 2015.
J. Schrittwieser, I. Antonoglou, T. Hubert, K. Simonyan, L. Sifre, S. Schmitt, A. Guez, E. Lockhart, D. Hassabis, T. Graepel, et al. Mastering Atari, Go, chess and shogi by planning with a learned model. arXiv preprint arXiv:1911.08265, 2019.
J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization
algorithms. CoRR, abs/1707.06347, 2017.
S. D.-C. Shashua and S. Mannor. Deep robust Kalman filter. arXiv preprint arXiv:1703.02310, 2017.
N. Siegel, J. T. Springenberg, F. Berkenkamp, A. Abdolmaleki, M. Neunert, T. Lampe, R. Hafner,
N. Heess, and M. Riedmiller. Keep doing what worked: Behavior modelling priors for offline
reinforcement learning. In International Conference on Learning Representations, 2020.
D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.
P. Spirtes. An anytime algorithm for causal inference. In AISTATS, 2001.
A. Stooke, J. Achiam, and P. Abbeel. Responsive safety in reinforcement learning by PID Lagrangian methods. arXiv preprint arXiv:2007.03964, 2020.
R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT press, 2018.
R. S. Sutton, D. Precup, and S. Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 112(1-2):181–211, 1999.
A. Tamar, S. Mannor, and H. Xu. Scaling up robust MDPs using function approximation. In International Conference on Machine Learning, pages 181–189, 2014.
A. Tamar, Y. Chow, M. Ghavamzadeh, and S. Mannor. Policy gradient for coherent risk measures.
In Advances in Neural Information Processing Systems, pages 1468–1476, 2015a.
A. Tamar, Y. Glassner, and S. Mannor. Optimizing the CVaR via sampling. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015b.
Y. Tassa, Y. Doron, A. Muldal, T. Erez, Y. Li, D. d. L. Casas, D. Budden, A. Abdolmaleki, J. Merel, A. Lefrancq, et al. DeepMind control suite. arXiv preprint arXiv:1801.00690, 2018.
C. Tessler, S. Givony, T. Zahavy, D. J. Mankowitz, and S. Mannor. A deep hierarchical approach to lifelong learning in Minecraft. CoRR, abs/1604.07255, 2016.
C. Tessler, D. J. Mankowitz, and S. Mannor. Reward constrained policy optimization. arXiv preprint
arXiv:1805.11074, 2018.
C. Tessler, T. Zahavy, D. Cohen, D. J. Mankowitz, and S. Mannor. Action assembly: Sparse im-
itation learning for text based games with combinatorial action spaces. CoRR, abs/1905.09700,
2019.
P. S. Thomas. Safe reinforcement learning. PhD thesis, University of Massachusetts Libraries, 2015.
P. S. Thomas, B. C. da Silva, A. G. Barto, and E. Brunskill. On ensuring that intelligent machines
are well-behaved. arXiv preprint arXiv:1708.05448, 2017.
J. B. Travnik, K. W. Mathewson, R. S. Sutton, and P. M. Pilarski. Reactive reinforcement learning
in asynchronous environments. Frontiers in Robotics and AI, 5:79, 2018.
M. Turchetta, F. Berkenkamp, and A. Krause. Safe exploration in finite Markov decision processes with Gaussian processes. CoRR, abs/1606.04753, 2016.
H. Van Seijen, M. Fatemi, J. Romoff, R. Laroche, T. Barnes, and J. Tsang. Hybrid reward architec-
ture for reinforcement learning. In Advances in Neural Information Processing Systems 30, pages
5392–5402. 2017.
M. Vecerik, O. Sushkov, D. Barker, T. Rothörl, T. Hester, and J. Scholz. A practical approach to insertion with variable socket position using deep reinforcement learning. In 2019 International Conference on Robotics and Automation (ICRA), pages 754–760. IEEE, 2019.
J. Vlasselaer, G. Van den Broeck, A. Kimmig, W. Meert, and L. De Raedt. Anytime inference in
probabilistic logic programs with tp-compilation. In Twenty-Fourth International Joint Confer-
ence on Artificial Intelligence, 2015.
A. Wachi, Y. Sui, Y. Yue, and M. Ono. Safe exploration and optimization of constrained MDPs using Gaussian processes. In AAAI, pages 6548–6556. AAAI Press, 2018.
K. Wagstaff. Machine learning that matters. arXiv preprint arXiv:1206.4656, 2012.
J. Wang and S. Yuan. Real-time bidding: A new frontier of computational advertising research. In
Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, pages
415–416, 2015.
Q. Wang, J. Xiong, L. Han, P. Sun, H. Liu, and T. Zhang. Exponentially weighted imitation learning for batched historical data. In Conference on Neural Information Processing Systems, pages 6288–6297, 2018.
Z. Wang, A. Novikov, K. Zolna, J. T. Springenberg, S. Reed, B. Shahriari, N. Siegel, J. Merel,
C. Gulcehre, N. Heess, et al. Critic regularized regression. arXiv preprint arXiv:2006.15134,
2020.
Y. Wu, G. Tucker, and O. Nachum. Behavior regularized offline reinforcement learning. arXiv preprint arXiv:1911.11361, 2019.
H. Xu and S. Mannor. Probabilistic goal Markov decision processes. In Twenty-Second International Joint Conference on Artificial Intelligence, 2011.
A. Yahya, A. Li, M. Kalakrishnan, Y. Chebotar, and S. Levine. Collective robot reinforcement
learning with distributed asynchronous guided policy search. In 2017 IEEE/RSJ International
Conference on Intelligent Robots and Systems (IROS), pages 79–86. IEEE, 2017.
R. Yang, X. Sun, and K. Narasimhan. A generalized algorithm for multi-objective reinforcement learning and policy adaptation. In Advances in Neural Information Processing Systems, pages 1–27, 2019.
T. Yu, G. Thomas, L. Yu, S. Ermon, J. Zou, S. Levine, C. Finn, and T. Ma. MOPO: Model-based offline policy optimization. arXiv preprint arXiv:2005.13239, 2020.
T. Zahavy, M. Haroush, N. Merlis, D. J. Mankowitz, and S. Mannor. Learn what not to learn: Action
elimination with deep reinforcement learning. In Advances in Neural Information Processing
Systems, pages 3562–3573, 2018.
Appendix
A Learning Algorithms
Parameters that were used for D4PG and DMPO can be found in Table 9 and Table 10, respectively.
Hyperparameters                          D4PG
Policy net                               300-300-200
Number of actions sampled per state      15
Q function net                           400-400-300-100
σ (exploration noise)                    0.1
vmin                                     -150
vmax                                     150
num atoms                                51
n-step                                   51
Discount factor (γ)                      0.99
Adam learning rate                       0.0001
Replay buffer size                       2000000
Target network update period             200
Batch size                               512
Activation function                      elu
Layer norm on first layer                Yes
Tanh on output of layer norm             Yes
Table 9: Hyperparameters for D4PG.
Hyperparameters                          DMPO
Policy net                               300-300-200
Number of actions sampled per state      20
Q function net                           400-400-300-100
ε                                        0.1
ε_µ                                      0.005
ε_Σ                                      0.000001
Discount factor (γ)                      0.99
vmin                                     -150
vmax                                     150
num atoms                                51
Adam learning rate                       0.0001
Replay buffer size                       1000000
Batch size                               256
Activation function                      elu
Layer norm on first layer                Yes
Tanh on output of layer norm             Yes
Tanh on Gaussian mean                    No
Min variance                             Zero
Max variance                             unbounded
Table 10: Hyperparameters for DMPO.
B Parameters
The hyperparameters used for the individual challenge sweeps can be found in Table 11.
C Codebase
C.1 Specifying Challenges
Task challenges are specified by passing arguments to the environment's load method (see the examples in Appendix C.2). Comprehensive documentation is available in the codebase itself; however, for completeness, we list the different arguments here.
• Constraints
  – Description: Adds a set of constraints to the task. Returns an additional entry in the observations ('constraints') whose length equals the number of constraints, where each entry is True if the corresponding constraint is satisfied and False otherwise. In our implementation the constraints are safety constraints; the safety constraints per domain can be found in Table 2.
  – Input argument: safety_spec, a dictionary that specifies the safety constraint specifications of the task (a usage sketch follows this item). It may contain the following fields:
    * enable, a boolean that represents whether safety specifications are enabled.
    * constraints, a list of class methods returning boolean constraint satisfactions (default ones are provided).
    * limits, a dictionary of constants used by the functions in 'constraints' (default ones are provided).
    * safety_coeff, a scalar between 0 and 1 that scales the safety constraints, with 1 producing the base constraints and 0 likely producing an unsolvable task.
    * observations, a boolean (default True) that toggles whether a vector of constraint satisfactions is added to the observations.
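As a rough sketch (the domain, task, and coefficient value are chosen for illustration), the constraints challenge can be enabled as follows:

import realworldrl_suite.environments as rwrl

# Enable the default per-domain safety constraints, slightly tightened, and
# expose the constraint-satisfaction vector in the observations.
env = rwrl.load(
    domain_name='cartpole',
    task_name='realworld_swingup',
    safety_spec=dict(enable=True, safety_coeff=0.5, observations=True))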
44
• Delays
  – Description: Adds action, observation, and reward delays. The action delay is the number of steps between when an action is passed to the environment and when it is actually applied; the observation (reward) delay is the number of steps by which the returned observation (reward) is stale after performing a step.
  – Input argument: delay_spec, a dictionary that specifies the delay specifications of the task (a usage sketch follows this item). It may contain the following fields:
    * enable, a boolean that represents whether delay specifications are enabled.
    * actions, an integer indicating the number of steps by which actions are delayed.
    * observations, an integer indicating the number of steps by which observations are delayed.
    * rewards, an integer indicating the number of steps by which rewards are delayed.
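As a sketch (the step counts are illustrative), delays can be specified as follows:

import realworldrl_suite.environments as rwrl

# Delay actions by 20 steps and observations by 10 steps; rewards are not delayed.
env = rwrl.load(
    domain_name='cartpole',
    task_name='realworld_swingup',
    delay_spec=dict(enable=True, actions=20, observations=10, rewards=0))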
• Noise
  – Description: Adds action or observation noise. The supported noise types are: additive white Gaussian noise on actions/observations, dropped action/observation values, stuck action/observation values, and repeated actions.
  – Input argument: noise_spec, a dictionary that specifies the noise specifications of the task (a usage sketch follows this item). It may contain the following fields:
    * gaussian, a dictionary that specifies additive white Gaussian noise. It may contain the following fields:
      · enable, a boolean that represents whether Gaussian noise specifications are enabled.
      · actions, a float indicating the standard deviation of the white Gaussian noise added to each action.
      · observations, similarly, the standard deviation of the white Gaussian noise added to each returned observation.
    * dropped, a dictionary that specifies the dropped-values noise. It may contain the following fields:
      · enable, a boolean that represents whether dropped-value specifications are enabled.
      · observations_prob, a float in [0,1] indicating the probability of dropping each observation component independently.
      · observations_steps, a positive integer indicating the number of time steps a dropped observation value stays dropped (set to zero).
      · actions_prob, a float in [0,1] indicating the probability of dropping each action component independently.
      · actions_steps, a positive integer indicating the number of time steps a dropped action value stays dropped (set to zero).
    * stuck, a dictionary that specifies the stuck-values noise. It may contain the following fields:
      · enable, a boolean that represents whether stuck-value specifications are enabled.
      · observations_prob, a float in [0,1] indicating the probability of each observation component becoming stuck.
      · observations_steps, a positive integer indicating the number of time steps an observation (or components thereof) stays stuck.
      · actions_prob, a float in [0,1] indicating the probability of each action component becoming stuck.
      · actions_steps, a positive integer indicating the number of time steps an action (or components thereof) stays stuck.
    * repetition, a dictionary that specifies the action-repetition statistics. It may contain the following fields:
      · enable, a boolean that represents whether repetition specifications are enabled.
      · actions_prob, a float in [0,1] indicating the probability that an action is repeated in the following steps.
      · actions_steps, a positive integer indicating the number of time steps the same action is repeated once repetition is triggered.
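As a sketch combining two of the noise types above (the standard deviations and probabilities are illustrative), the noise challenge can be configured as follows:

import realworldrl_suite.environments as rwrl

# Add white Gaussian noise to actions and observations, and occasionally drop
# observation components for a few consecutive steps.
noise_spec = dict(
    gaussian=dict(enable=True, actions=0.1, observations=0.1),
    dropped=dict(enable=True, observations_prob=0.01, observations_steps=5))
env = rwrl.load(
    domain_name='cartpole',
    task_name='realworld_swingup',
    noise_spec=noise_spec)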
• Perturbations
  – Description: Perturbs physical quantities of the environment. These perturbations are non-stationary and are governed by a scheduler.
  – Input argument: perturb_spec, a dictionary that specifies the perturbation specifications of the task (a usage sketch follows this item). It may contain the following fields:
    * enable, a boolean that represents whether perturbation specifications are enabled.
    * frequency, an integer indicating the number of episodes between perturbation updates.
    * param, a string indicating which parameter to perturb (multiple parameters are supported; they are environment-dependent, see Table 3).
    * scheduler, a string indicating the scheduler to apply to the perturbed parameter. Currently supported:
      · constant - constant value determined by the start argument.
      · random_walk - random walk governed by a white Gaussian process.
      · drift_pos - uni-directional (increasing) random walk which saturates.
      · drift_neg - uni-directional (decreasing) random walk which saturates.
      · cyclic_pos - uni-directional (increasing) random walk which resets once it reaches the maximal value.
      · cyclic_neg - uni-directional (decreasing) random walk which resets once it reaches the minimal value.
      · uniform - uniform sampling process within a bounded support.
      · saw_wave - alternating uni-directional random walks between the minimal and maximal values.
    * start, a float indicating the initial value of the perturbed parameter.
    * min, a float indicating the minimal value the perturbed parameter may take.
    * max, a float indicating the maximal value the perturbed parameter may take.
    * std, a float indicating the standard deviation of the white noise driving the scheduling process.
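As a sketch, a slowly increasing perturbation could be specified as follows; the parameter name is illustrative and environment-dependent (see Table 3), and the scheduler values are likewise illustrative:

import realworldrl_suite.environments as rwrl

# Apply an increasing random walk to a physical parameter, updated every
# 5 episodes and kept within [0.5, 2.0].
perturb_spec = dict(
    enable=True,
    frequency=5,
    param='pole_length',   # illustrative; see Table 3 for supported parameters
    scheduler='drift_pos',
    start=1.0,
    min=0.5,
    max=2.0,
    std=0.05)
env = rwrl.load(
    domain_name='cartpole',
    task_name='realworld_swingup',
    perturb_spec=perturb_spec)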
• Dimensionality
  – Description: Adds extra dummy features to the observations in order to increase the dimensionality of the state space.
  – Input argument: dimensionality_spec, a dictionary that specifies the dimensions added to the state space (a usage sketch follows this item). It may contain the following fields:
    * num_random_state_observations, an integer indicating the number of random observations to add (defaults to zero).
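As a sketch (the count is illustrative, and the enable flag is assumed to mirror the other spec dictionaries):

import realworldrl_suite.environments as rwrl

# Append 10 random dummy features to every observation.
env = rwrl.load(
    domain_name='cartpole',
    task_name='realworld_swingup',
    dimensionality_spec=dict(enable=True, num_random_state_observations=10))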
• Multi-Objective Reward
  – Description: Provides an additional reward that is merged with the base reward and re-normalized to [0,1].
  – Input argument: multiobj_spec, a dictionary that sets up the multi-objective challenge (a usage sketch follows this item). The challenge works by providing an Objective object which defines both numerical objectives and a reward-merging method, allowing the various objectives to be exposed in the observation and the returned reward to be modified in the manner defined by the Objective object.
    * objective, either a string used to load an Objective class from utils.multiobj_objectives, or an object that subclasses utils.multiobj_objectives.Objective.
    * reward, a boolean indicating whether to add the multi-objective reward to the environment's returned reward.
    * coeff, a float in [0,1] that is passed into the Objective object to control the mix between the original reward and the Objective's rewards.
    * observed, a boolean indicating whether the defined objectives should be added to the observation.
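As a sketch (the objective name and coefficient are illustrative, and the enable flag is assumed to mirror the other spec dictionaries):

import realworldrl_suite.environments as rwrl

# Mix a predefined Objective into the returned reward and expose the objective
# values in the observation.
env = rwrl.load(
    domain_name='cartpole',
    task_name='realworld_swingup',
    multiobj_spec=dict(
        enable=True,
        objective='safety',   # illustrative name resolved via utils.multiobj_objectives
        reward=True,
        coeff=0.5,
        observed=True))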
C.2 Code Snippets
Below is an example of using the OpenAI PPO baseline with our suite.
from baselines import bench
from baselines.common.vec_env import dummy_vec_env
from baselines.ppo2 import ppo2
import example_helpers as helpers
import realworldrl_suite.environments as rwrl


def _load_env():
  """Loads environment."""
  raw_env = rwrl.load(
      domain_name='cartpole',
      task_name='realworld_swingup',
      safety_spec=dict(enable=True),
      delay_spec=dict(enable=True, actions=20),
      log_output='/tmp/path/to/results.npz',
      environment_kwargs=dict(log_safety_vars=True, flat_observation=True))
  # Convert to a Gym-style interface for the baselines code.
  env = helpers.GymEnv(raw_env)
  # FLAGS.save_path: output directory for the Monitor logs (defined elsewhere
  # in the example script).
  env = bench.Monitor(env, FLAGS.save_path)
  return env


env = dummy_vec_env.DummyVecEnv([_load_env])
ppo2.learn(env=env, network='mlp', lr=1e-3, total_timesteps=1000000,
           nsteps=100, gamma=0.99)
A second example evaluates a random policy on the suite and summarizes the collected per-episode returns:

print('Random policy total reward per episode: {:.2f} +- {:.2f}'.format(
    np.mean(rewards), np.std(rewards)))
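Only the closing summary of that example appears above; as a rough sketch (not the exact script from the codebase), a complete random-policy evaluation using the suite's dm_env-style step/reset interface might look as follows, with the task and episode count chosen for illustration:

import numpy as np
import realworldrl_suite.environments as rwrl

# Roll out a uniform-random policy for a few episodes and collect returns.
env = rwrl.load(
    domain_name='cartpole',
    task_name='realworld_swingup',
    environment_kwargs=dict(flat_observation=True))
action_spec = env.action_spec()

rewards = []
for _ in range(10):  # illustrative number of episodes
  timestep = env.reset()
  episode_return = 0.0
  while not timestep.last():
    action = np.random.uniform(
        action_spec.minimum, action_spec.maximum, size=action_spec.shape)
    timestep = env.step(action)
    episode_return += timestep.reward
  rewards.append(episode_return)

print('Random policy total reward per episode: {:.2f} +- {:.2f}'.format(
    np.mean(rewards), np.std(rewards)))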