Effective Probabilistic Neural Networks Model For Model-Based Reinforcement Learning USV
Abstract—Gaussian process (GP) offers a robust solution for modeling the dynamics of unmanned surface vehicles (USV) in model-based reinforcement learning (MBRL). However, the rapidly increasing computational complexity of GP with a large sample capacity limits its application in complex scenarios that require substantial samples to cover the state space. In this article, a novel probabilistic MBRL approach, probabilistic neural networks model predictive control (PNMPC), is proposed to tackle this issue. With an iterative learning framework, PNMPC properly models the USV dynamics using neural networks from a probabilistic perspective to avoid the computational complexity associated with sample capacity. Employing this model to effectively propagate system uncertainties, a model predictive control (MPC) policy is developed to robustly control the USV against external disturbances. Evaluated by position-keeping and multiple targets-tracking scenarios on a real USV data-driven simulation, the proposed method consistently demonstrates its significant superiority in both model accuracy and control performance compared to not only GP model-based approaches but also probabilistic neural networks-based MBRL baselines, across various scales of external disturbances.

Note to Practitioners—Modeling the system dynamics while maintaining computational efficiency with a large sample set has been challenging for MBRL in the USV domain. We propose a novel neural network modeling method to capture the dynamic features of USV within an RL loop and develop a robust MPC policy based on its uncertainty propagation. Our method achieves computational complexity independent of the sample capacity and outperforms related baselines in model accuracy and control performance.

This work is supported in part by the National Natural Science Foundation of China under Grants 62103403 and 92473112, and in part by the Shenzhen R&D Foundation under Grant KCXST20221021111210023. (Corresponding author: Yunduan Cui)
W. Huang is with the University of Chinese Academy of Sciences, China, and the Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, Guangdong, 518055, China.
Y. Cui, H. Li and X. Wu are with the CAS Key Laboratory of Human-Machine Intelligence-Synergy Systems, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, Guangdong, 518055, China, and also with the Guangdong-Hong Kong-Macao Joint Laboratory of Human-Machine Intelligence-Synergy Systems, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, Guangdong, 518055, China.

I. INTRODUCTION

As one critical technology in marine science, unmanned surface vehicles (USV) have evolved rapidly over the past few decades in academia and industry. These technologies have become a key solution in not only enhancing the efficiency and safety of marine transportation but also tackling the shortage of skilled professionals in the shipping industry. Despite their growing applications in various scenarios [1]–[4], the current USV technologies highly depend on prior knowledge of specific hardware [5], manual selection of controller parameters [6], and large amounts of data that are usually costly to collect by human experts [7]. There is still a significant gap in achieving fully automatic control and adaptive learning in the USV domain [8].

Reinforcement learning (RL) aims to iteratively learn optimal or suboptimal control strategies by interacting with external environments without human intervention [9]–[11]. It offers an appealing prospect for fully autonomous USV with the capability of adaptive learning. Traditional model-free RL approaches have been extensively applied to USV in various tasks including collision avoidance, target tracking and motion control [12]–[17]. The most recent research trained a model-free RL agent in a simulation environment and successfully drove it in a real-world USV tracking task [18]. On the other hand, most traditional model-free RL approaches typically rely on a deterministic policy based on neural networks, which inherently lacks robustness against the ocean environment under various uncertain disturbances. They usually need substantial learning in simulated environments before they can be deployed in real-world scenarios with different unmanned systems [18]–[20].

This issue can be mitigated by model-based RL (MBRL), which learns a model of the environment in parallel. One of the most representative approaches, probabilistic inference for learning control (PILCO), modeled the uncertain system by Gaussian process (GP) [21], [22] and achieved robust control with superior sample efficiency [23]. Based on the superior robustness of model predictive control (MPC) in complex systems like unmanned ground vehicles and space vehicles [24], [25], GP-MPC was proposed by integrating MPC into PILCO to deal with the dynamic and unpredictable disturbances in uncertain environments, with evaluation on simulated cart-pole and double pendulum [26]. As the first attempt to implement the MBRL framework with a GP model and MPC controller on USV, sample-efficient probabilistic model predictive control (SPMPC) successfully drove a regular-size boat in a real ocean environment to learn target-reaching and position-keeping tasks without either prior human knowledge or an additional training process in simulation [27], [28]. Drawing inspiration from [29], filtered probabilistic model predictive control (FPMPC) exploited the benefits of the probabilistic model in USV control. It employed a Bayesian filter to account for the latent uncertain variables and enhanced the control stability based on SPMPC [30].

Although SPMPC and FPMPC have shown potential in USV control, their application has been strongly limited by the
non-parametric nature of GP: the computational complexity of the GP model increases at a rate of O(N^3), where N is the number of samples. Although this issue can be partially addressed by sparse GP [31], [32], balancing the control performance and the sample capacity of GP-based approaches remains challenging in engineering. With a computational complexity independent of the sample capacity, deep neural networks have been widely implemented to control unmanned vehicles [33], [34] and have become one feasible solution to fit probabilistic USV models. Deep PILCO first explored the power of neural networks in PILCO [35]. Probabilistic ensembles with trajectory sampling (PETS) integrated neural networks and probabilistic MPC into one MBRL framework. It achieved comparable performance to model-free RL approaches in several MuJoCo control benchmarks [36]. However, despite the breakthrough in computational complexity, these approaches also exhibit inherent limitations in stability compared to their original versions based on GP. Their application in environments with disturbances such as USV remains unexplored.

The motivation of this paper is to tackle the computational efficiency issue of traditional probabilistic MBRL approaches that employ the GP model and MPC controller in the USV domain. Designing a probabilistic neural network with sufficient stability and expressiveness, a novel MBRL method, probabilistic neural networks model predictive control (PNMPC), is proposed to properly model the uncertain USV dynamics and robustly control it against changing disturbances using deep neural networks. Our method emphasizes the features of USV dynamics in continuous time steps during the training of neural networks to improve the prediction accuracy, while the unpredictable noises are further filtered in MPC for more robust and prompt control behaviors. It naturally combines the characteristics of Deep PILCO and PETS to achieve superior representative capability of system uncertainty without losing stability. Evaluated by position-keeping and targets-tracking scenarios under different levels of environmental disturbances in the real USV data-driven simulation developed by [28], PNMPC not only achieved superior control stability and prediction accuracy compared to the existing neural networks-based approach PETS but also comprehensively surpassed traditional MBRL methods based on GP (SPMPC, FPMPC) in control performance and model quality, with a computational complexity independent of the sample capacity. The contributions of this work are summarized following Fig. 1:

1) Developing a practical probabilistic neural networks model for USV dynamics with superior generalization capability and model accuracy compared to both the GP model and existing neural networks-based approaches.
2) Designing an effective MPC-based policy by combining the characteristics of Deep PILCO and PETS in the uncertainty propagation of probabilistic neural networks.
3) Proposing a novel MBRL approach specific to USV control by integrating the model and policy above. Its advantages in control capability and sample-capacity-independent computational complexity compared with other baselines were evaluated in several USV scenarios.

The remainder of this paper is organized as follows. Section II provides the preliminaries of the target problem. Section III details PNMPC. The experimental results are analyzed in Section IV. The conclusion is given in Section V.
II. PRELIMINARY

A. Modeling USV

In this work, the USV dynamics was described as a Markov decision process (MDP) following [28], [30]. As shown in the left top of Fig. 1, the observed state x ∈ S was defined with
[Figure residue: two-step model-fitting diagram showing P(x_{t+1} | x_t, a_t), P(x_{t+2} | x̂_{t+1}, a_{t+1}), the observation bias Δx = x_{t+1} − x̂_{t+1}, and the second-step loss L_2.]
Algorithm 1: Learning process of PNMPC in USV.
Input: sample set D, number of network ensembles B, number of dropout unit sets Q, executing time Δt, horizon of MPC policy H, reward function R.
  # Initialize weight ensembles and random dropout sets
  w = {w_1, ..., w_B}, z = {z_1, ..., z_Q}
  # Train neural networks model by D with warmup samples following the loss function in Eq. (8)
  f̂ = Train_NNS(D, L, w, z)
  for i = 1, 2, ..., N_trial do
      # Initialize USV and control signal
      x_0 = Reset_USV_State()
      a*_0 = [0, 0]
      for t = 1, 2, ..., L_rollout do
          # Core 1
          Operate_Control(a*_{t-1}, Δt)
          x_t = Observe_USV_State()
          if t > 1 then
              # Expand D with two steps' samples
              D = {D, {x_{t-2}, a*_{t-2}, x_{t-1}, a*_{t-1}}}
          # Core 2
          # Bias compensation following Eq. (5)
          x̂_t = (1 / BQ) Σ_{i=1}^{B} Σ_{j=1}^{Q} f̂_s(x_{t-1}, a_{t-1}, w_i, z_j)
          # Search optimal actions following Eq. (9)
          a*_t = MPC_Policy(x̂_t, R, H, w, z)
      # Train neural networks model by D following the loss function in Eq. (8)
      f̂ = Train_NNS(D, L, w, z)
  Return f̂ with w, z

The loss function L fully considered randomness from multiple ensembles of networks, different dropout sets and the aleatoric uncertainties in neural networks. The updated neural networks affected the MPC policy via Eq. (9) introduced in Section III-B, where the aleatoric uncertainties of the neural networks were excluded for more stable control behaviors.
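Eq. (8) is not reproduced in this excerpt, so the following is only a minimal sketch of how a loss over two consecutive steps with an explicit aleatoric (predicted-variance) term could look; the `model(x, a, weights, dropout_mask)` call signature and the Gaussian output parameterization are assumptions, not the paper's exact interface.

```python
import torch

def gaussian_nll(mean, log_var, target):
    # Negative log-likelihood of a diagonal Gaussian (captures aleatoric uncertainty).
    return 0.5 * ((target - mean) ** 2 * torch.exp(-log_var) + log_var).sum(dim=-1).mean()

def two_step_loss(model, x_t, a_t, x_t1, a_t1, x_t2, w_i, z_j):
    """Illustrative loss over two consecutive steps for one ensemble member w_i
    and one fixed dropout mask z_j (names are hypothetical)."""
    # First step: predict x_{t+1} from (x_t, a_t).
    mean1, log_var1 = model(x_t, a_t, weights=w_i, dropout_mask=z_j)
    loss1 = gaussian_nll(mean1, log_var1, x_t1)
    # Second step: feed the model's own prediction back in, so the error
    # accumulated over two consecutive steps is also penalized.
    mean2, log_var2 = model(mean1, a_t1, weights=w_i, dropout_mask=z_j)
    loss2 = gaussian_nll(mean2, log_var2, x_t2)
    return loss1 + loss2
```

In this reading, the total loss L would average such terms over all B ensemble members and Q dropout sets before the networks are updated.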
The interaction process between PNMPC and the USV system mainly followed the two-CPU-core setting proposed in [28], as demonstrated in the top part of Fig. 3. The two cores work in parallel to reduce the delay caused by the optimization of the MPC policy. At the beginning of each trial, the probabilistic neural networks model was initially trained on the sample set D with some warmup data (e.g., USV trajectories with random control signals). The RL training process can be divided into N_trial rollouts, each rollout having L_rollout steps. At the beginning of each rollout, we first reset the state of the USV and obtained the initial state x_0; the initial action was set as a*_0 = [0, 0]. The first core was responsible for communicating with the USV system. At time step t, it sends the control signal a*_{t-1}, which was optimized by the MPC policy in the previous time step, to the USV system and receives the current state x_t from the sensors. The observed state is sent to the other core at the beginning of the next step. For step t > 1, the samples of two continuous steps are added to the sample set D = {D, {x_{t-2}, a*_{t-2}, x_{t-1}, a*_{t-1}}}.

At the same time, the second core predicts the current state x̂_t based on the previous state x_{t-1} and action a_{t-1} sent by the first core following Eq. (5). Setting it as the initial state of the MPC policy effectively alleviates the bias caused by the USV operation during the optimization process, as stated in [28]. The control signal for the next step is then decided following Eq. (9) and sent to the first core at the end of this step. After each rollout with L_rollout steps, the model is iteratively updated based on the current sample set D following the loss function L defined in Eq. (8). The whole learning process is summarized in Algorithm 1.
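As a concrete illustration of the bias-compensation step in Eq. (5) performed on the second core, the sketch below simply averages the one-step prediction over all B ensemble members and Q dropout masks; the `model` object and its call signature are placeholders rather than the released implementation.

```python
import torch

@torch.no_grad()
def bias_compensated_state(model, x_prev, a_prev, weights, dropout_masks):
    """Average the one-step prediction over B ensemble members and Q dropout
    masks, in the spirit of Eq. (5)."""
    preds = []
    for w_i in weights:            # B ensemble members
        for z_j in dropout_masks:  # Q fixed dropout masks
            mean, _log_var = model(x_prev, a_prev, weights=w_i, dropout_mask=z_j)
            preds.append(mean)
    # x̂_t is then used as the initial state of the MPC optimization.
    return torch.stack(preds).mean(dim=0)
```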
IV. EXPERIMENTS

A. Experimental Settings

1) Simulation Environment: To evaluate our approach, we utilized the simulation environment developed by Furuno Electric Co., Ltd., Japan, which is built on the driving data of a real boat (a Nissan JoyFisher with a length of seven meters) in a real ocean environment. Following the dynamic model detailed in [28], the simulated boat's steering rudder had a range δ^o ∈ [−30, 30]°, and its engine throttle was mapped to τ^o ∈ [−1, 1], corresponding to the actions of moving backward and forward, with a maximum rotation speed of 800 rpm. The executing time was set as Δt = 2.0 seconds.

There are two types of disturbances in this simulation: one is the wind with speed v^w and direction ψ^w that can be detected by sensors, and the other is the unobservable wave with speed v^c and direction ψ^c. Following the settings in related works [28], [30], three levels of random wind and current velocity were set to simulate the ocean environments under disturbances:

1) v^w = 2.0 + U(−0.1, 0.1), v^c = 0.25 + U(−0.1, 0.1) m/s,
2) v^w = 4.0 + U(−0.1, 0.1), v^c = 0.5 + U(−0.1, 0.1) m/s,
3) v^w = 6.0 + U(−0.1, 0.1), v^c = 0.5 + U(−0.1, 0.1) m/s.

U indicates a uniform distribution. The increasing wind and current speeds led to a significantly enhanced uncertainty of the USV dynamics and made the learning of a robust policy more challenging. Both wind and current directions were initialized by a certain value and affected by random noises at each step:

ψ^w = 37° + U(−30, 30)°, ψ^c = 100° + U(−30, 30)°.
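A direct reading of the ranges above can be sketched as follows; this is only an illustration of how one step of the stated disturbance levels may be sampled, not the simulator's internal code.

```python
import numpy as np

def sample_disturbance(level, rng=None):
    """Sample one step of wind/current disturbance for levels 1-3 as stated above."""
    if rng is None:
        rng = np.random.default_rng()
    wind_speed = {1: 2.0, 2: 4.0, 3: 6.0}[level] + rng.uniform(-0.1, 0.1)       # m/s
    current_speed = {1: 0.25, 2: 0.5, 3: 0.5}[level] + rng.uniform(-0.1, 0.1)   # m/s
    wind_dir = 37.0 + rng.uniform(-30.0, 30.0)       # degrees
    current_dir = 100.0 + rng.uniform(-30.0, 30.0)   # degrees
    return wind_speed, wind_dir, current_speed, current_dir
```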
2) Control Scenarios: Two control scenarios, position-keeping and multiple targets-tracking, were conducted. The first task focused on keeping the USV near its initial position. Its reward function was defined as:

R(x_{t+1}) = −||p_{t+1} − p_target||_2.  (10)

p_{t+1} indicates the USV position [X^u, Y^u] at step t + 1, and p_target = [0, 0] is the initial position. The learning process in Algorithm 1 was conducted with N_trial = 20 iterations/rollouts, each rollout containing L_rollout = 50 steps. After training, the agent was tested by 30 rollouts with L_rollout = 100 steps.

In addition to the offset from the target position, we defined two additional evaluation criteria. The final success rate is determined by the ratio of rollouts that keep the USV within a 7 m range of the target at the end of the test; the overall success rate is determined by whether the USV holds its position within this range during the entire test rollout.
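For clarity, the two success criteria above can be computed from a test trajectory as in the small helper below (an illustrative evaluation snippet, not the authors' released scripts).

```python
import numpy as np

def position_keeping_metrics(positions, target=np.zeros(2), radius=7.0):
    """`positions` has shape (L_rollout, 2) with the USV [X, Y] at each step."""
    offsets = np.linalg.norm(positions - target, axis=1)
    final_success = offsets[-1] <= radius        # within 7 m at the last step
    overall_success = np.all(offsets <= radius)  # within 7 m during the whole rollout
    return offsets.mean(), final_success, overall_success
```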
TABLE I
COMPARISON OF PROPOSED METHOD AND RELATED BASELINES

Approach     | Model type       | Uncertainty propagation            | Horizon of model fitting | Policy type            | Computational complexity related to sample size N
DroQ [39]    | N/A              | N/A                                | N/A                      | Actor-Critic networks  | Independent
TD7 [40]     | N/A              | N/A                                | N/A                      | Actor-Critic networks  | Independent
SPMPC [28]   | Gaussian process | Moment matching                    | Single step              | MPC controller         | O(N^3)
FPMPC [30]   | Gaussian process | Moment matching                    | Single step              | MPC controller         | O(N^3)
PETS [36]    | Neural networks  | Ensemble sampling                  | Single step              | MPC controller         | Independent
PNMPC (ours) | Neural networks  | Ensemble sampling + random dropout | Continuous two steps     | MPC controller         | Independent
The second task focused on sequentially passing multiple targets and staying near the last one. Two trajectories were designed for this task. The "circle" one included six targets at [15.55, 12.89], [22.42, 32.15], [9.83, 46.77], [−13.45, 45.36], [−22.31, 28.53], [−13.25, 10.35] as one circle (120 m total distance) to test the USV's control skill in local scenes. The "right" one included ten linearly spaced targets from [0, 0] to [40, 200] (200 m total distance) to evaluate the control behavior over a long distance. The reward function was defined as:

R(p_{t+1}) = −||p_{t+1} − p^i_target||_2 − 0.5 · ||p_{t+1} − p^{i+1}_target||_2,  (11)

where the agent minimizes its distances to the current and the next targets (denoted by index i and i + 1) when there is more than one target remaining. It turns into Eq. (10) with the last target. The tracking distance was defined as:

D(t) = Σ_{j=1}^{i} ||p^j_target − p^{j−1}_target||_2 − ||p_t − p^i_target||_2.  (12)

The complete ratio d was calculated by dividing this value at the last step by the total distance. A rollout was classified as a success with d ≥ 95%, and as a failure with d < 60%. The learning process was conducted with N_trial = 20 iterations/rollouts, each rollout containing L_rollout = 100 steps. After training, the agent was tested by 30 rollouts with L_rollout = 100 steps.
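The sketch below works through Eqs. (11) and (12) for a list of targets; indexing is zero-based and the starting waypoint p^0_target is assumed to be the USV's initial position at the origin, which is a reading of the notation rather than something the excerpt states explicitly.

```python
import numpy as np

def tracking_reward(p_next, targets, i):
    """Eq. (11): distance to the current target i plus a 0.5-weighted term for
    the next target; reduces to Eq. (10) at the last target."""
    r = -np.linalg.norm(p_next - targets[i])
    if i + 1 < len(targets):
        r -= 0.5 * np.linalg.norm(p_next - targets[i + 1])
    return r

def tracking_distance(p_t, targets, i, start=np.zeros(2)):
    """Eq. (12): path length covered along the target sequence up to target i,
    minus the remaining offset to target i."""
    waypoints = np.vstack([start, targets[: i + 1]])
    covered = np.sum(np.linalg.norm(np.diff(waypoints, axis=0), axis=1))
    return covered - np.linalg.norm(p_t - targets[i])
```

The complete ratio d is then the tracking distance at the final step divided by the total path length (120 m for the circle trajectory, 200 m for the right one).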
3) Baselines and Parameters: Five baselines were selected in the experiment. SPMPC, proposed by [28], is the original MBRL approach based on GP. FPMPC [30] can be treated as its state-of-the-art extension with support for partially observable MDPs. For the neural networks-based approach, PETS [36] was compared. It can be seen as the proposed method without the features of model training and uncertainty propagation introduced in Sections III-A and III-B. Two state-of-the-art model-free RL approaches, DroQ [39] and TD7 [40], were selected for a fair comparison. All compared methods were implemented in PyTorch [41]. All MPC policies were implemented based on bound optimization by quadratic approximation [42] in NLopt(1) for a fair comparison. The settings of SPMPC and FPMPC followed their original papers, where the sparse GP [31] was used to enable an H = 3 step prediction horizon in MPC. According to the discussion of computational complexity in [28], it is difficult to complete the MPC optimization within 2 seconds with H > 3 during a 20-iteration learning procedure. PETS and PNMPC benefited from a computational complexity independent of the sample capacity, and we set H = 10 for them to fully demonstrate these advantages. The neural networks were built with two hidden layers of 200 units; other related parameters were as follows: B = 5, Q = 4, λ = [0.0001, 0.00025, 0.00025, 0.0005] for the input, output and hidden layers in Eq. (8); the supervised learning of the neural networks was conducted over 40 epochs with a learning rate of 0.001. To evaluate the impact of the uncertainties in our method, we tested PNMPC without uncertainty propagation (indicated as PNMPC-N), i.e., the size of the ensemble was set to B = 1 while the dropout units were also disabled.

(1) [Link]
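As a minimal sketch of how an H-step MPC policy can be optimized with BOBYQA in NLopt under the rudder and throttle bounds given above: the `cost_fn`, the bound layout and the evaluation budget here are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np
import nlopt

def mpc_optimize(cost_fn, horizon, a_dim=2, lower=(-30.0, -1.0), upper=(30.0, 1.0)):
    """`cost_fn(actions)` is assumed to roll the probabilistic model forward
    over the horizon and return the negative accumulated reward."""
    n = horizon * a_dim
    opt = nlopt.opt(nlopt.LN_BOBYQA, n)
    opt.set_min_objective(lambda a, grad: float(cost_fn(a.reshape(horizon, a_dim))))
    opt.set_lower_bounds(np.tile(lower, horizon))   # per-step (rudder, throttle) bounds
    opt.set_upper_bounds(np.tile(upper, horizon))
    opt.set_maxeval(200)   # illustrative budget to stay within the 2 s control period
    a_opt = opt.optimize(np.zeros(n)).reshape(horizon, a_dim)
    return a_opt[0]        # apply only the first action, MPC-style
```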
Table I summarizes the characteristics of PNMPC and the other baselines. The leading model-free RL approaches, DroQ and TD7, learn control policies by updating Actor-Critic networks without constructing explicit models. Their computational complexity depends on the network structure rather than the sample size. In comparison, SPMPC and FPMPC model the target dynamics using GP and derive control policies from MPC controllers based on GP's uncertainty propagation using analytic moment matching [38], offering robust resistance to environmental disturbances. However, GP's computational complexity grows rapidly, at O(N^3), with the sample size, hampering its application to USV with large sample sets. PETS addresses this issue by substituting GP with probabilistic neural networks and estimating the propagated uncertainties through network ensembles. Our method PNMPC further enhances PETS to meet the requirements of USV: it considers the system dynamics over two continuous steps during model learning and integrates random dropout from Deep PILCO [35] for superior representation capability of uncertainties.

All experimental results were obtained on a computational server with an Intel Xeon(R) W-2245 CPU, an NVIDIA GeForce RTX 2080 Ti and 64 GB memory, over five independent trials with different random seeds.
TABLE II
TEST RESULT OF PROPOSED METHOD AND BASELINES IN THE POSITION-KEEPING TASK.

Method | Average [m] | Median [m] | Time [s] | Success rate (final) | Success rate (overall)

Lv. 1:
DroQ     47.16 ± 28.55   43.96   0.002 ± 0.0003   0.0%     0.0%
TD7      20.70 ± 15.88   29.02   0.002 ± 0.0002   10.67%   9.33%
SPMPC    3.94 ± 6.71     1.51    0.43 ± 0.27      78.67%   66.0%
FPMPC    1.85 ± 3.15     1.30    0.69 ± 0.83      97.33%   48.0%
PETS     7.69 ± 10.04    4.0     0.24 ± 0.09      90.0%    3.33%
PNMPC    1.85 ± 2.72     1.14    0.25 ± 0.08      95.33%   80.0%
PNMPC-N  18.48 ± 18.03   6.41    0.19 ± 0.01      68.0%    0.0%

Lv. 2:
DroQ     82.32 ± 51.42   79.23   0.002 ± 0.0003   0.0%     0.0%
TD7      23.85 ± 27.98   16.90   0.002 ± 0.0002   34.0%    20.0%
SPMPC    4.90 ± 9.98     2.93    0.36 ± 0.24      85.33%   25.33%
FPMPC    4.32 ± 7.17     2.67    0.47 ± 0.49      90.0%    26.0%
PETS     26.17 ± 28.24   15.65   0.25 ± 0.10      49.33%   0.67%
PNMPC    3.50 ± 4.97     1.85    0.25 ± 0.09      92.67%   51.33%
PNMPC-N  22.42 ± 35.07   11.41   0.19 ± 0.01      32.0%    0.0%

Lv. 3:
DroQ     79.26 ± 48.21   77.18   0.002 ± 0.0003   0.0%     0.0%
TD7      41.91 ± 47.49   48.98   0.002 ± 0.0003   27.33%   15.33%
SPMPC    14.0 ± 28.64    3.30    0.30 ± 0.18      72.0%    20.0%
FPMPC    5.33 ± 9.85     3.47    0.56 ± 0.50      84.67%   0.0%
PETS     24.70 ± 20.52   18.61   0.24 ± 0.09      26.67%   0.0%
PNMPC    3.42 ± 4.03     2.04    0.25 ± 0.09      98.0%    35.33%
PNMPC-N  35.21 ± 30.24   26.84   0.19 ± 0.01      29.33%   0.0%
B. Evaluation of Learning and Control Performances

1) Learning Performances: The learning curves of all compared methods within 20 iterations were compared in Fig. 4, where the Y axis indicated the average position offset in the position-keeping task and the average tracking distance in the targets-tracking task with the circle and right trajectories. The original SPMPC gradually failed to converge to proper policies as the level of disturbance increased. The neural networks-based PETS had large standard deviations in the average position offset during the training, making it difficult to converge under all levels of disturbances. As a comparison, PNMPC stably converged to the same or even superior control policies as FPMPC under all levels of disturbances and enjoyed a leading convergence velocity. In the targets-tracking scenario with both circle and right trajectories, we observed similar advantages in the learning capability of PNMPC. It usually converged to higher average tracking distances than FPMPC, while the performance of SPMPC was significantly influenced by the degree of disturbances. The neural networks-based PETS struggled to learn comparable control policies in most cases without the proposed tricks in modeling USV dynamics and propagating system uncertainties. Once the uncertainty propagation was disabled, PNMPC-N exhibited a decline in both learning ability and stability, falling significantly behind the other probabilistic MBRL approaches. In terms of model-free methods, DroQ and TD7 were unable to successfully learn the tasks within 20 iterations due to the perturbed environment and the limited number of interactions.

The results above demonstrated the feasibility of the proposed method in MBRL for USV. Compared to the traditional neural networks-based approach PETS, our method robustly learned superior control strategies in the MBRL loop. Its performance could have certain advantages over the mature MBRL approaches based on GP models.

2) Testing Performances: The control performances of PNMPC and the other baselines in the testing were compared in Tables II, III and IV to investigate the generalization capability of our method after the training procedure. The average and median terms in Table II relate to the position offset; the time term in all tables indicates the average optimization time of the MPC policy per step. Bold font indicated the best performance in the corresponding term.

In the position-keeping task, PNMPC showed significant advantages under different levels of disturbances over the other baselines in terms of average position offset, median position offset, and task success rates in all tests, apart from a 2% lower success rate (final) under the first level of disturbance compared to FPMPC. As a comparison, the neural networks-based PETS performed consistently worse than SPMPC and FPMPC in position offset and success rates. For computational efficiency, the proposed method had a lower average optimization time than the GP model-based approaches while enjoying a longer prediction horizon. In the targets-tracking task with the circle trajectory, the superiority of PNMPC was also observed. The proposed method outperformed the other baselines in all control-related metrics. Compared to the suboptimal approach FPMPC, PNMPC improved the average tracking distance by 8% under the highest disturbance level, resulting in a 30% increase in the success rate. In contrast, the control performance of PETS and SPMPC quickly deteriorated with increasing disturbance levels.
TABLE III
TEST RESULT OF PROPOSED METHOD AND BASELINES IN THE TARGETS-TRACKING TASK (CIRCLE).

Method | Ave. complete distance [m] | Ave. complete ratio | Time [s] | Success rate | Failure rate

Lv. 1:
DroQ     −170.31 ± 59.80   −141.92%   0.002 ± 0.0004   0.0%     100.0%
TD7      −150.70 ± 94.01   −125.58%   0.002 ± 0.0002   0.0%     100.0%
SPMPC    102.0 ± 18.49     85.0%      0.14 ± 0.09      50.0%    8.0%
FPMPC    102.80 ± 20.01    85.57%     0.13 ± 0.08      56.0%    9.33%
PETS     100.33 ± 26.95    83.61%     0.21 ± 0.02      53.33%   19.33%
PNMPC    114.61 ± 13.61    95.51%     0.22 ± 0.10      88.0%    4.67%
PNMPC-N  79.66 ± 47.55     66.39%     0.14 ± 0.004     40.67%   37.33%

Lv. 2:
DroQ     −172.23 ± 67.71   −143.52%   0.002 ± 0.0004   0.0%     100.0%
TD7      −49.18 ± 41.45    −40.98%    0.002 ± 0.0003   0.0%     100.0%
SPMPC    94.66 ± 28.03     79.14%     0.13 ± 0.08      56.67%   20.0%
FPMPC    93.91 ± 21.02     78.26%     0.12 ± 0.07      31.33%   19.33%
PETS     45.85 ± 51.78     38.21%     0.21 ± 0.10      10.0%    70.67%
PNMPC    104.81 ± 26.14    87.35%     0.22 ± 0.11      69.33%   14.0%
PNMPC-N  52.38 ± 39.04     43.65%     0.14 ± 0.002     10.0%    74.67%

Lv. 3:
DroQ     −130.10 ± 37.68   −108.42%   0.002 ± 0.0004   0.0%     100.0%
TD7      −194.02 ± 221.73  −161.68%   0.002 ± 0.0002   0.0%     100.0%
SPMPC    45.23 ± 82.01     37.69%     0.12 ± 0.07      13.33%   40.0%
FPMPC    82.03 ± 22.23     68.40%     0.11 ± 0.06      18.67%   40.0%
PETS     58.86 ± 48.87     49.05%     0.21 ± 0.10      16.67%   63.33%
PNMPC    91.49 ± 30.89     76.24%     0.22 ± 0.10      48.0%    29.33%
PNMPC-N  5.08 ± 82.87      4.23%      0.19 ± 0.001     5.33%    82.67%
TABLE IV
TEST RESULT OF PROPOSED METHOD AND BASELINES IN THE TARGETS-TRACKING TASK (RIGHT).

Method | Ave. complete distance [m] | Ave. complete ratio | Time [s] | Success rate | Failure rate

Lv. 1:
DroQ     −116.35 ± 90.91   −58.17%    0.002 ± 0.0004   0.0%     100.0%
TD7      −51.65 ± 42.30    −25.82%    0.002 ± 0.0003   0.0%     100.0%
SPMPC    182.62 ± 30.86    91.31%     0.18 ± 0.12      77.33%   14.0%
FPMPC    190.50 ± 17.80    95.25%     0.16 ± 0.11      86.0%    1.33%
PETS     133.89 ± 26.95    58.50%     0.23 ± 0.10      21.33%   40.67%
PNMPC    191.07 ± 19.42    95.53%     0.24 ± 0.11      84.67%   2.0%
PNMPC-N  96.55 ± 49.74     48.27%     0.14 ± 0.003     5.33%    72.67%

Lv. 2:
DroQ     −293.13 ± 85.57   −146.57%   0.002 ± 0.0004   0.0%     100.0%
TD7      −85.42 ± 127.76   −42.71%    0.002 ± 0.0002   0.0%     100.0%
SPMPC    152.84 ± 59.40    76.42%     0.15 ± 0.08      49.33%   27.33%
FPMPC    163.57 ± 44.14    81.78%     0.13 ± 0.07      39.33%   16.67%
PETS     130.79 ± 59.67    65.39%     0.22 ± 0.09      22.67%   34.0%
PNMPC    179.15 ± 35.28    89.58%     0.23 ± 0.10      66.67%   8.0%
PNMPC-N  92.07 ± 57.29     46.04%     0.14 ± 0.002     5.33%    68.67%

Lv. 3:
DroQ     −253.19 ± 42.30   −126.60%   0.002 ± 0.0004   0.0%     100.0%
TD7      −286.07 ± 42.30   −143.04%   0.002 ± 0.0002   0.0%     100.0%
SPMPC    145.29 ± 66.99    72.64%     0.15 ± 0.07      45.33%   30.0%
FPMPC    161.56 ± 50.97    80.78%     0.14 ± 0.08      43.33%   19.33%
PETS     129.03 ± 50.07    64.51%     0.22 ± 0.10      18.0%    43.33%
PNMPC    180.57 ± 36.19    90.28%     0.23 ± 0.10      74.0%    10.67%
PNMPC-N  83.08 ± 51.97     41.90%     0.14 ± 0.002     2.0%     74.0%
Fig. 4. Learning curves of PNMPC and baselines in three control tasks under different levels of disturbances. Lines and translucent regions represent the mean and standard deviation.
Fig. 5. Average prediction errors of the model learned by PNMPC and other baselines in position-keeping and targets-tracking tasks during the testing.
In terms of computational complexity, the GP-based methods SPMPC and FPMPC achieved certain advantages, which were potentially caused by the efficiency of sparse GP in a larger task area (compared to the position-keeping task), while PNMPC achieved a similar and stable optimization time as in the position-keeping task. In the task with the right trajectory, PNMPC consistently maintained its superiority and outperformed all other baseline methods. Under the third level of disturbance, the proposed method tracked 10% more distance compared to the suboptimal approach FPMPC, while improving the success rate by 30%. Matching the deteriorated learning ability shown in Fig. 4, PNMPC-N demonstrated the worst performance among all MBRL baselines in the testing. Although the model-free methods DroQ and TD7 had clear advantages in calculation speed, they all failed to guide the USV to complete the given tasks.

PNMPC demonstrated significant advantages over the other baselines in terms of success and failure rates, showcasing its robustness in controlling the USV in ocean environments with dynamic disturbances. Its comprehensive control capabilities were evident through smaller average offsets and longer driving distances. Furthermore, the consistent excellence over the three tasks highlighted its superior generalization ability.

3) Model Accuracy: In this subsection, we evaluated the quality of the probabilistic model learned by PNMPC. The average errors and their standard deviations between the first-step prediction by the learned model and the real observation in USV position, orientation and velocity for all model-based baselines are demonstrated in Fig. 5.
Fig. 6. Learning curves of PNMPC, PETS, TD7 and DroQ in three control tasks with larger sample capacity under different levels of disturbances. Lines and translucent regions represent the mean and standard deviation.
In all testing scenarios, PNMPC consistently achieved better prediction accuracy and smaller prediction standard deviation in all dimensions. The model accuracy was maintained even with the increasing disturbances. In comparison, the GP model-based approaches suffered from larger prediction errors caused by the sparse GP model. Although the neural networks-based PETS usually had a similar one-step prediction accuracy to PNMPC, it significantly lagged in control performance when using the multi-step prediction MPC policy. This was due to not only the original loss function, which did not consider the USV dynamics over two continuous steps, but also the excessive uncertainties in long-term prediction. After disabling the uncertainty propagation, the model prediction error of PNMPC-N significantly increased, which led to the deterioration of the testing performance. Compared to the GP model-based methods SPMPC and FPMPC, the proposed PNMPC significantly enhanced model accuracy. Compared to PETS, which modeled the target dynamics by neural networks, our approach substantially improved generalization ability and learning performance by fitting the model over two continuous steps without introducing additional errors in single-step prediction. These results indicated the superiority of PNMPC in the quality of the learned USV model.

C. Evaluation with Large Sample Capacity

In this subsection, the proposed method was evaluated in more challenging ocean environments with more learning iterations and larger sample capacity to fully demonstrate its potential as a neural networks-based approach. Based on the three levels of disturbance introduced in Section IV-A, we generated the initial wind direction in a wider range ψ^w = U(−180, 180)°. The initial wind velocity was set as v^w = U(0, 1) · v^w_max, where v^w_max = [2.0, 4.0, 6.0] m/s for the three levels of disturbance. At each step, the wind direction and velocity were also affected by the random noises U(−30, 30)° and U(−0.1, 0.1) m/s, respectively.
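The wider initial conditions described above can be read as the small sampler below; this is only an illustration of the stated ranges, not the simulator's implementation.

```python
import numpy as np

def sample_initial_wind(level, rng=None):
    """Wider initial wind configuration of Section IV-C: direction in
    U(-180, 180) degrees, speed scaled by a level-dependent maximum."""
    if rng is None:
        rng = np.random.default_rng()
    v_max = {1: 2.0, 2: 4.0, 3: 6.0}[level]       # m/s
    wind_dir = rng.uniform(-180.0, 180.0)         # degrees
    wind_speed = rng.uniform(0.0, 1.0) * v_max    # m/s
    return wind_speed, wind_dir

def perturb_wind(wind_speed, wind_dir, rng=None):
    # Per-step noise on velocity and direction as stated above.
    if rng is None:
        rng = np.random.default_rng()
    return wind_speed + rng.uniform(-0.1, 0.1), wind_dir + rng.uniform(-30.0, 30.0)
```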
The configuration above turned into a substantial expansion of the state space. It significantly increased the task's difficulty, while also being highly valuable for learning more general and robust control strategies in the highly random ocean environment rather than one with a limited range of disturbances as in [28]. To fully explore and learn the tasks under this configuration, we set the number of iterations/rollouts during the training to 100, using a total of 5000 to 10000 samples. It is infeasible to employ SPMPC and FPMPC to promptly control the USV with the MPC policy under such a large sample set due to the overlarge computational complexity of the GP model discussed in [28]. We compared DroQ, TD7, PETS, PNMPC-N and PNMPC to evaluate the superiority of the proposed method. All algorithmic settings remained consistent with Section IV-A.

1) Learning Performances: The learning curves of the proposed PNMPC and the baseline PETS within 100 iterations were compared in Fig. 6. Both PNMPC and PETS demonstrated good learning behaviors under the three levels of disturbance. Without uncertainty propagation, PNMPC-N had significant deficiencies in learning stability and finally converged to worse position offsets and tracking distances. The model-free method DroQ could not learn the tasks within 100 iterations. Although TD7 successfully converged in the position-keeping tasks, it achieved unsatisfactory performances in the two tracking tasks. In the position-keeping scenario, PETS frequently experienced drastic degradations in control performance during the training as the environmental disturbances increased.
TABLE V
TEST RESULT OF PROPOSED METHOD AND BASELINES WITH MORE LEARNING ITERATIONS IN THE POSITION-KEEPING TASK.

Method | Average [m] | Median [m] | Time [s] | Success rate (final) | Success rate (overall)

Lv. 1:
DroQ     13.0 ± 8.11     12.43   0.002 ± 0.0004   13.33%   4.44%
TD7      4.97 ± 2.91     4.57    0.002 ± 0.0002   84.0%    2.0%
PETS     5.71 ± 11.67    2.71    0.24 ± 0.08      73.33%   34.67%
PNMPC    4.09 ± 4.94     2.48    0.24 ± 0.08      85.33%   42.0%
PNMPC-N  5.38 ± 8.33     2.59    0.21 ± 0.01      74.67%   37.33%

Lv. 2:
DroQ     21.37 ± 17.23   25.61   0.002 ± 0.0004   20.0%    0.0%
TD7      3.79 ± 2.48     3.31    0.002 ± 0.0003   93.33%   20.0%
PETS     7.54 ± 11.09    3.75    0.24 ± 0.08      72.0%    18.67%
PNMPC    6.31 ± 8.13     3.10    0.25 ± 0.09      76.0%    26.67%
PNMPC-N  9.39 ± 13.43    4.99    0.21 ± 0.01      59.33%   12.67%

Lv. 3:
DroQ     30.96 ± 22.90   32.73   0.002 ± 0.0003   0.0%     0.0%
TD7      73.96 ± 46.26   75.48   0.002 ± 0.0002   0.0%     0.0%
PETS     8.69 ± 20.03    3.91    0.24 ± 0.09      70.0%    26.0%
PNMPC    5.73 ± 7.73     3.53    0.25 ± 0.09      74.67%   27.33%
PNMPC-N  10.32 ± 14.19   5.62    0.21 ± 0.01      54.0%    6.0%
In comparison, our method enjoyed a more robust learning procedure, with a smaller standard deviation in the learning curves, and finally converged to superior average position offsets. In the targets-tracking scenario, PNMPC and PETS had similar convergence behaviors. Their learning curves exhibited larger standard deviations with increasing levels of disturbance. Unlike the significant differences in the learning curves between PETS and PNMPC shown in Fig. 4 of Section IV-B, they had similar learning curves in the scenarios where some observed states came from highly random distributions. We believe this phenomenon was caused by the neural networks' characteristics: they are highly effective in learning data with large distributions without being negatively affected by excessive updates within a relatively local sample space.

2) Testing Performances: The control performances of TD7, DroQ, PETS, PNMPC and PNMPC-N in the testing were compared in Tables V, VI and VII to investigate the generalization capability of our method after the 100 interactions' training. Bold font indicated the best performance in the corresponding term. Despite the similar learning curves of PNMPC and PETS in Fig. 6, the proposed method consistently showed advantages in the testing, while PETS failed to demonstrate sufficient generalization capability and stability. In the position-keeping task, PNMPC outperformed PETS in all metrics related to control performance. Under the first level of disturbance, it enjoyed over a 10% improvement in success rate (final) while reducing the average position offset by 28%. Although the two methods suffered decreases in success rate under the strongest level of disturbance, the proposed method still maintained over a 4% advantage and achieved a 34% lower position offset than PETS. Without the support of uncertainty propagation, PNMPC-N achieved control performance and success rates close to PETS after a long training process. As the disturbance level increased, the performance gap between PNMPC-N and PETS also widened. For the model-free approaches, DroQ could not learn this task and achieved very low success rates under all levels of disturbances. While TD7 performed very well under the disturbances of the first two levels, its control ability declined rapidly against stronger disturbances. It was entirely incapable of completing the task under the disturbances of level three.

In the targets-tracking scenario with the circle trajectory, the proposed PNMPC also achieved control performance that surpassed the baseline PETS. On average, it completed 8% more tracking distance under the three levels of disturbance while significantly improving the quality of task completion: the success rate increased by about 9% and the failure rate decreased by about 10%. In the targets-tracking scenario with the right trajectory, which addressed long-term and high-speed driving, PNMPC demonstrated significant superiority over PETS. As the level of disturbance increased, our method improved the tracking distance by 12%, 17% and 16% and enhanced the success rate of the task by 20%, 18% and 14%. Disabling the uncertainty in the probabilistic neural networks, PNMPC-N had the worst performance on both the circle and right trajectories. Without the guidance of a model, model-free RL failed to learn this task; DroQ and TD7 could not even track the first target under all levels of disturbances. The results above demonstrated the superior control performance and generalization capability of the proposed method in learning USV control tasks with a large sample capacity.

Regarding computational complexity, the average decision-making time for all methods remained consistent with their performances in Section IV-B, irrespective of the increased sample capacity. Although PNMPC's uncertainty propagation is more complex than that of PETS due to the additional dropout units, we consider the slightly longer optimization time of approximately 0.01 s acceptable in practice. In contrast, all GP-based approaches were infeasible for these tasks, as the optimization time per step with over 10,000 samples could exceed 10 seconds.
TABLE VI
TEST RESULT OF PROPOSED METHOD AND BASELINES WITH MORE LEARNING ITERATIONS IN THE TARGETS-TRACKING TASK (CIRCLE).

Method | Average complete distance [m] | Average complete ratio | Time [s] | Success rate | Failure rate

Lv. 1:
DroQ     −38.41 ± 33.32    −32.0%     0.002 ± 0.0004   0.0%     100.0%
TD7      −103.43 ± 12.51   −86.19%    0.002 ± 0.0002   0.0%     100.0%
PETS     98.69 ± 26.59     82.24%     0.20 ± 0.07      62.0%    19.33%
PNMPC    104.16 ± 23.44    85.13%     0.22 ± 0.09      73.33%   10.0%
PNMPC-N  68.45 ± 77.31     57.04%     0.15 ± 0.004     34.0%    40.0%

Lv. 2:
DroQ     −94.23 ± 47.66    −78.52%    0.002 ± 0.0003   0.0%     100.0%
TD7      −146.56 ± 56.10   −122.13%   0.002 ± 0.0003   0.0%     100.0%
PETS     80.91 ± 41.43     67.43%     0.21 ± 0.09      45.33%   39.33%
PNMPC    95.36 ± 32.64     79.47%     0.23 ± 0.11      60.67%   22.0%
PNMPC-N  58.96 ± 44.13     49.14%     0.15 ± 0.004     19.33%   63.33%

Lv. 3:
DroQ     −37.35 ± 29.21    −31.12%    0.002 ± 0.0003   0.0%     100.0%
TD7      −40.14 ± 68.37    −33.45%    0.002 ± 0.0003   0.0%     100.0%
PETS     85.19 ± 36.89     71.0%      0.22 ± 0.10      42.67%   34.0%
PNMPC    93.98 ± 30.43     78.32%     0.22 ± 0.10      51.33%   24.67%
PNMPC-N  57.63 ± 45.72     48.02%     0.15 ± 0.004     19.33%   64.0%
TABLE VII
TEST RESULT OF PROPOSED METHOD AND BASELINES WITH MORE LEARNING ITERATIONS IN THE TARGETS-TRACKING TASK (RIGHT).

Method | Average complete distance [m] | Average complete ratio | Time [s] | Success rate | Failure rate

Lv. 1:
DroQ     −38.35 ± 103.05   −19.18%    0.002 ± 0.0004   0.0%     93.33%
TD7      −73.97 ± 161.26   −36.99%    0.002 ± 0.0002   2.0%     90.67%
PETS     135.89 ± 57.46    67.87%     0.22 ± 0.10      34.0%    40.67%
PNMPC    159.33 ± 53.27    79.67%     0.23 ± 0.10      54.67%   23.33%
PNMPC-N  110.33 ± 61.61    55.16%     0.15 ± 0.006     16.67%   52.0%

Lv. 2:
DroQ     −84.18 ± 96.05    −42.09%    0.002 ± 0.0003   0.0%     100.0%
TD7      17.91 ± 37.45     −9.0%      0.002 ± 0.0003   0.0%     100.0%
PETS     120.46 ± 66.25    60.23%     0.22 ± 0.09      27.33%   48.67%
PNMPC    155.53 ± 50.28    77.77%     0.23 ± 0.10      45.33%   23.33%
PNMPC-N  91.07 ± 65.83     45.53%     0.15 ± 0.007     12.0%    67.33%

Lv. 3:
DroQ     −55.66 ± 26.17    −27.83%    0.002 ± 0.0004   0.0%     100.0%
TD7      −39.23 ± 43.25    −19.61%    0.002 ± 0.0003   0.0%     100.0%
PETS     120.46 ± 66.25    60.23%     0.22 ± 0.09      27.33%   48.67%
PNMPC    152.43 ± 50.52    76.21%     0.23 ± 0.10      41.33%   26.0%
PNMPC-N  86.89 ± 62.03     43.45%     0.15 ± 0.007     4.67%    70.67%
While the decision-making time of the model-free approaches DroQ and TD7 was about 1% of that of our method, they demonstrated significant shortcomings in both control performance (average offset and driving distance) and robustness (success and failure rates). These disadvantages became more pronounced as the environmental disturbances intensified, rendering these methods nearly incapable of completing tasks under the highest level of disturbances. Consequently, we believe that the trade-off between computation time and performance in our method is reasonable.

3) Model Accuracy: Figure 7 compares the average error and standard deviation between the first-step prediction in the MPC policy and the observed state using the models learned by PNMPC, PNMPC-N and PETS during the testing. It can be observed that as the sample capacity increased, the large prediction errors in USV position and velocity were partially alleviated even under the more random and challenging disturbances, compared with the results in Section IV-B. Both PETS and PNMPC demonstrated reasonable prediction accuracy in these states. On the other hand, PNMPC demonstrated significant advantages over PETS in both accuracy and stability when predicting the orientation of the USV, which was strongly affected by the ocean disturbances. It consistently enjoyed far smaller prediction errors and lower standard deviations under various levels of disturbances. We believe these advantages mainly come from the proper uncertainty representation and propagation in the proposed method. Once the uncertainty was disabled, the model prediction error of PNMPC-N quickly increased and led to a control performance inferior to PETS.
Fig. 7. Average prediction errors of the model learned by PNMPC and PETS in position-keeping and targets-tracking tasks with large sample capacity during the testing.
conducted to illustrate the control behavior learned by the proposed method. After learning in 20 iterations with the same random seed, all approaches were tested in one rollout with the same random distances. The testing rollouts, including the USV state and action trajectories, are shown in Figs. 8, 9 and 10. The boat shapes indicate the states of the USV at time steps 1 to 100 (blue to red). The yellow stars indicate the tracking targets. The predicted values and the observed states are shown in red and blue, respectively.
In the position-keeping task shown in Fig. 8, all compared approaches were able to maintain the USV position within 7 meters of the initial position at the last step. However, compared to the GP model-based approaches SPMPC and FPMPC, which resulted in a wider range of USV position and orientation, the proposed PNMPC consistently kept the position of the USV as close to the initial position as possible, except for necessary adjustments of orientation. The proposed probabilistic neural networks accurately predicted the orientation of the USV, which contributed to a smooth orientation trajectory with the smallest fluctuation range, helping the USV maintain its position under disturbances. Meanwhile, the neural networks-based PETS could not achieve similar control performance. Compared to the GP model-based baselines, the neural networks model in PETS sacrificed model accuracy and deviated far from the initial position with larger position offsets than our method. In terms of the state and action trajectories, thanks to its superior prediction accuracy of USV orientation compared to the GP model-based approaches, PNMPC effectively controlled the USV heading while avoiding orientation fluctuations. Additionally, with a longer prediction horizon in the MPC policy, PNMPC always issued smaller rudder and throttle commands, resulting in more precise control behavior.
In the targets-tracking scenario with the circle trajectory shown in Fig. 9, PNMPC demonstrated superior and effective control behavior compared with the other baselines. With redundant USV trajectories, SPMPC and FPMPC remained in unstable position-keeping after tracking all targets. PETS was unable to complete the task at all: the large predicted error of velocity in the early stage caused the USV to move too fast when tracking the second target, and the subsequent violent swaying in orientation further prevented the USV from resetting, ultimately leading to the failure of the task. As a comparison, PNMPC quickly tracked all targets and stabilized at the final target within approximately 40 steps. The effective uncertainty propagation in our method resulted in prompt feedback to the changing disturbances through the MPC policy and therefore contributed to smooth and seamless control trajectories of both USV position and orientation. The proper cooperation of the rudder and engine throttle drove the USV to quickly pass through all targets at a relatively high speed and finally stop steadily at the last target point.
According to Fig. 10, PNMPC had similar advantages in the long-distance tracking task (right trajectory). Based on the traditional probabilistic neural networks model, the MPC policy of PETS failed to promptly respond to the changing disturbances, resulting in the USV stalling near the fourth target. In comparison to the GP model-based approaches, which excessively adjusted the USV velocity when sequentially tracking multiple targets due to the large predicted errors of direction and velocity, our method significantly reduced the frequency of throttle changes and contributed to superior efficiency and stability in the tracking task. Once the final target was reached, PNMPC properly controlled the rudder and throttle, stably holding the USV's position.
E. Discussion of PNMPC and its Implementation
The experimental results above demonstrated the significant advantages of PNMPC over current state-of-the-art model-free approaches in USV control scenarios. A model-free RL agent based on Actor-Critic networks tends to sacrifice the stability of the current system in pursuit of maximizing the accumulated rewards in the long term, leading to loss of control in disturbed environments. In contrast, PNMPC achieved a superior balance between accumulated rewards and system stability over a short horizon via the MPC controller, making it more suitable for unmanned ship control. However, it faces a heavier computational burden compared to DroQ and TD7, and its performance is highly dependent on the MPC horizon and the reliability of the learned model.
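To make the role of the prediction horizon concrete, the following is a minimal random-shooting sketch of receding-horizon planning with a probabilistic one-step model. The `model` and `reward_fn` interfaces, the uniform action sampling, and the candidate counts are illustrative assumptions and not the exact optimizer used by PNMPC; only the predicted mean is propagated, mirroring the mean-based propagation discussed later in this section.

```python
import numpy as np

def plan_action(model, reward_fn, state, horizon=10, n_candidates=500,
                action_dim=2, rng=None):
    """Random-shooting MPC sketch: score candidate action sequences with a
    probabilistic one-step model and return the first action of the best one.

    model(state, action) -> (mean, std) of the next state (hypothetical interface)
    reward_fn(state, action) -> scalar task reward (e.g., distance to target)
    """
    rng = np.random.default_rng() if rng is None else rng
    # Candidate sequences of normalized actions, e.g. rudder/throttle in [-1, 1].
    candidates = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon, action_dim))
    returns = np.zeros(n_candidates)
    for i, seq in enumerate(candidates):
        s = np.asarray(state, dtype=float)
        for a in seq:
            mean, _std = model(s, a)          # one-step probabilistic prediction
            returns[i] += reward_fn(mean, a)  # accumulate reward along the horizon
            s = mean                          # propagate only the predicted mean
    best = candidates[int(np.argmax(returns))]
    return best[0]                            # receding horizon: apply only the first action
```

Because this candidate evaluation is repeated at every control step, a longer horizon or a larger candidate set directly increases the computational burden noted above.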
Fig. 8. One test rollout with state and action trajectories of all compared approaches in position-keeping under the third level of disturbance.
For applications, PNMPC can be directly implemented in a real-world USV system following [28] while achieving a superior control frequency that is independent of the size of the sample set. Although the stability of probabilistic neural networks has been less discussed in related works [35], [36], the proposed method provides a practical mechanism to alleviate unknown machine faults and unexpected system behaviors in USV control scenarios: such anomalies may lead to an increase in the standard deviation of the model predictions, signaling the system to promptly halt autonomous decision-making and thereby preventing potentially dangerous situations (see the sketch below). In non-ideal scenarios, such as when observed states are missing or corrupted by noise, the related MBRL approaches GP-MPC, FPMPC, and PETS often predict states with inaccurate distributions, leading to accumulated bias in subsequent steps. As a comparison, our method leverages a two-step loss function in model updating and mean-based propagation in decision-making to tackle this issue, enhancing the stability of control behavior.
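The exact halting rule is not specified here; a minimal sketch of the mechanism described above is a simple threshold on the predictive standard deviation, where the function name, the per-dimension comparison, and the inflation factor are illustrative assumptions.

```python
import numpy as np

def should_halt(pred_std, nominal_std, factor=3.0):
    """Flag an anomaly when any predicted standard deviation grows far beyond
    its nominal level, signalling the autopilot to stop autonomous
    decision-making and fall back to a safe behavior.

    pred_std    : per-dimension std of the current model prediction
    nominal_std : typical std observed during undisturbed operation
    factor      : inflation threshold (a tuning choice, not from the paper)
    """
    return bool(np.any(np.asarray(pred_std) > factor * np.asarray(nominal_std)))

# Example: halt when the heading uncertainty grows well beyond its nominal level.
if should_halt(pred_std=[0.4, 0.5, 9.0], nominal_std=[0.5, 0.5, 2.0]):
    print("Predictive uncertainty spike detected - halting autonomous control.")
```

In practice the nominal uncertainty and the inflation factor would have to be calibrated against the prediction statistics observed during undisturbed operation.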
In terms of scalability, employing other Bayesian neural networks in the proposed method is straightforward. However, traditional Bayesian networks, mainly designed for supervised learning, usually have numerous parameters and struggle to efficiently express uncertainties [43], [44]. We believe the probabilistic neural networks in PNMPC, which focus on stably and efficiently modeling system dynamics, are more suitable for the real-time control of USV. Algorithmically, integrating the probabilistic model of PNMPC with Actor-Critic networks to balance the long-term reward and the current system stability is also a meaningful direction.
V. CONCLUSIONS
This article proposed PNMPC, a novel probabilistic MBRL approach specialized for USV control. It tackled the computational efficiency issue present in current GP model-based MBRL USV approaches. Within one MBRL framework, PNMPC utilized deep neural networks to properly model the uncertain dynamics of the USV from a probabilistic perspective and incorporated an MPC policy to robustly control the USV against various external disturbances. A novel loss function was designed to emphasize the characteristics of USV dynamics over continuous time steps and to properly train the probabilistic neural networks model. The effectiveness of the proposed method in modeling USV dynamics and learning control strategies was evaluated across various USV scenarios. Compared to recent GP model-based MBRL approaches, PNMPC offers computational complexity that is independent of sample capacity and plans more optimal USV driving trajectories. Compared with the existing neural networks-based PETS, PNMPC achieves higher model prediction accuracy and more responsive control behavior against external disturbances, breaking through the limitations of control capability traditionally encountered with probabilistic neural networks in engineering scenarios. This work expands the potential of probabilistic neural network model-based reinforcement learning towards a fully autonomous USV.
Fig. 9. One test rollout with state and action trajectories of all compared approaches in targets-tracking (circle) under the third level of disturbance. (Panels show position in X and Y [m], direction [°], velocity [m/s], rudder [°], and throttle [%] over 100 time steps.)
REFERENCES
[1] M. J. Er, C. Ma, T. Liu, and H. Gong, "Intelligent motion control of unmanned surface vehicles: A critical review," Ocean Engineering, vol. 280, p. 114562, 2023.
[2] Y. Qiao, J. Yin, W. Wang, F. Duarte, J. Yang, and C. Ratti, "Survey of deep learning for autonomous surface vehicles in marine environments," IEEE Transactions on Intelligent Transportation Systems, vol. 24, no. 4, pp. 3678–3701, 2023.
[3] J. Mcmahon and E. Plaku, "Autonomous data collection with dynamic goals and communication constraints for marine vehicles," IEEE Transactions on Automation Science and Engineering, vol. 20, no. 3, pp. 1607–1620, 2023.
[4] Y. Yu, C. Guo, and H. Yu, "Finite-time plos-based integral sliding-mode adaptive neural path following for unmanned surface vessels with unknown dynamics and disturbances," IEEE Transactions on Automation Science and Engineering, vol. 16, no. 4, pp. 1500–1511, 2019.
[5] N. Yang, D. Chang, M. Johnson-Roberson, and J. Sun, "Energy-optimal control for autonomous underwater vehicles using economic model predictive control," IEEE Transactions on Control Systems Technology, vol. 30, no. 6, pp. 2377–2390, 2022.
[6] B.-O. H. Eriksen, M. Breivik, E. F. Wilthil, A. L. Flåten, and E. F. Brekke, "The branching-course model predictive control algorithm for maritime collision avoidance," Journal of Field Robotics, vol. 36, no. 7, pp. 1222–1249, 2019.
[7] L. Liu, D. Wang, Z. Peng, and Q.-L. Han, "Distributed path following of multiple under-actuated autonomous surface vehicles based on data-driven neural predictors via integral concurrent learning," IEEE Transactions on Neural Networks and Learning Systems, vol. 32, no. 12, pp. 5334–5344, 2021.
[8] United Nations Conference on Trade and Development, Review of Maritime Transport 2018. 2018.
[9] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT press Cambridge, 2018.
[10] Z. Liu, Q. Liu, L. Tang, K. Jin, H. Wang, M. Liu, and H. Wang, "Visuomotor reinforcement learning for multirobot cooperative navigation," IEEE Transactions on Automation Science and Engineering, vol. 19, no. 4, pp. 3234–3245, 2022.
[11] Z. Yan, A. R. Kreidieh, E. Vinitsky, A. M. Bayen, and C. Wu, "Unified automatic control of vehicular systems with reinforcement learning," IEEE Transactions on Automation Science and Engineering, vol. 20, no. 2, pp. 789–804, 2023.
[12] Y. Zhao, X. Qi, Y. Ma, Z. Li, R. Malekian, and M. A. Sotelo, "Path following optimization for an underactuated usv using smoothly-convergent deep reinforcement learning," IEEE Transactions on Intelligent Transportation Systems, vol. 22, no. 10, pp. 6208–6220, 2021.
[13] N. Wang, Y. Gao, H. Zhao, and C. K. Ahn, "Reinforcement learning-based optimal tracking control of an unknown unmanned surface vehicle," IEEE Transactions on Neural Networks and Learning Systems, vol. 32, no. 7, pp. 3034–3045, 2021.
[14] N. Wang, Y. Gao, and X. Zhang, "Data-driven performance-prescribed reinforcement learning control of an unmanned surface vehicle," IEEE Transactions on Neural Networks and Learning Systems, vol. 32, no. 12, pp. 5456–5467, 2021.
[15] A. Heiberg, T. N. Larsen, E. Meyer, A. Rasheed, O. San, and D. Varagnolo, "Risk-based implementation of colregs for autonomous surface vehicles using deep reinforcement learning," Neural Networks, vol. 152, pp. 17–33, 2022.
[16] F. Huang, J. Xu, L. Yin, D. Wu, Y. Cui, Z. Yan, and T. Chen, "A general motion control architecture for an autonomous underwater vehicle with actuator faults and unknown disturbances through deep reinforcement learning," Ocean Engineering, vol. 263, p. 112424, 2022.
[17] W. Gan, X. Qu, D. Song, and P. Yao, "Multi-usv cooperative chasing strategy based on obstacles assistance and deep reinforcement learning," IEEE Transactions on Automation Science and Engineering, pp. 1–16, 2023.
Fig. 10. One test rollout with state and action trajectories of all compared approaches in targets-tracking (right) under the third level of disturbance.
[18] I. Masmitja, M. Martin, T. O’Reilly, B. Kieft, N. Palomeras, J. Navarro, and K. Katija, "Dynamic robotic tracking of underwater targets using reinforcement learning," Science Robotics, vol. 8, no. 80, p. eade7811, 2023.
[19] R. Chai, A. Tsourdos, S. Chai, Y. Xia, A. Savvaris, and C. L. P. Chen, "Multiphase overtaking maneuver planning for autonomous ground vehicles via a desensitized trajectory optimization approach," IEEE Transactions on Industrial Informatics, vol. 19, no. 1, pp. 74–87, 2023.
[20] R. Chai, H. Niu, J. Carrasco, F. Arvin, H. Yin, and B. Lennox, "Design and experimental validation of deep reinforcement learning-based fast trajectory planning and control for mobile robot in unknown environment," IEEE Transactions on Neural Networks and Learning Systems, vol. 35, no. 4, pp. 5778–5792, 2024.
[21] C. E. Rasmussen and C. K. Williams, Gaussian processes for machine learning, vol. 1. MIT press Cambridge, 2006.
[22] J. Meng, Y. Liu, R. Bucknall, W. Guo, and Z. Ji, "Anisotropic gpmp2: A fast continuous-time gaussian processes based motion planner for unmanned surface vehicles in environments with ocean currents," IEEE Transactions on Automation Science and Engineering, vol. 19, no. 4, pp. 3914–3931, 2022.
[23] M. P. Deisenroth, D. Fox, and C. E. Rasmussen, "Gaussian processes for data-efficient learning in robotics and control," IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 37, no. 2, pp. 408–423, 2013.
[24] R. Chai, A. Tsourdos, H. Gao, Y. Xia, and S. Chai, "Dual-loop tube-based robust model predictive attitude tracking control for spacecraft with system constraints and additive disturbances," IEEE Transactions on Industrial Electronics, vol. 69, no. 4, pp. 4022–4033, 2021.
[25] R. Chai, A. Tsourdos, H. Gao, S. Chai, and Y. Xia, "Attitude tracking control for reentry vehicles using centralised robust model predictive control," Automatica, vol. 145, p. 110561, 2022.
[26] S. Kamthe and M. Deisenroth, "Data-efficient reinforcement learning with probabilistic model predictive control," in International Conference on Artificial Intelligence and Statistics, pp. 1701–1710, 2018.
[27] Y. Cui, S. Osaki, and T. Matsubara, "Reinforcement learning boat autopilot: A sample-efficient and model predictive control based approach," in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 2868–2875, 2019.
[28] Y. Cui, S. Osaki, and T. Matsubara, "Autonomous boat driving system using sample-efficient model predictive control-based reinforcement learning approach," Journal of Field Robotics, vol. 38, no. 3, pp. 331–354, 2021.
[29] R. McAllister and C. E. Rasmussen, "Data-efficient reinforcement learning in continuous state-action gaussian-pomdps," in Advances in Neural Information Processing Systems, pp. 2040–2049, 2017.
[30] Y. Cui, L. Peng, and H. Li, "Filtered probabilistic model predictive control-based reinforcement learning for unmanned surface vehicles," IEEE Transactions on Industrial Informatics, vol. 18, no. 10, pp. 6950–6961, 2022.
[31] E. Snelson and Z. Ghahramani, "Sparse Gaussian processes using pseudo-inputs," in Advances in neural information processing systems, pp. 1257–1264, 2006.
[32] Y. Cui, W. Shi, H. Yang, C. Shao, L. Peng, and H. Li, "Probabilistic model-based reinforcement learning unmanned surface vehicles using local update sparse spectrum approximation," IEEE Transactions on Industrial Informatics, vol. 20, no. 2, pp. 1283–1293, 2024.
[33] R. Chai, A. Tsourdos, A. Savvaris, S. Chai, Y. Xia, and C. L. P. Chen, "Design and implementation of deep neural network-based control for automatic parking maneuver process," IEEE Transactions on Neural Networks and Learning Systems, vol. 33, no. 4, pp. 1400–1413, 2022.
[34] R. Chai, D. Liu, T. Liu, A. Tsourdos, Y. Xia, and S. Chai, "Deep learning-based trajectory planning and control for autonomous ground vehicle parking maneuver," IEEE Transactions on Automation Science and Engineering, vol. 20, no. 3, pp. 1633–1647, 2023.
[35] Y. Gal, R. McAllister, and C. E. Rasmussen, "Improving PILCO with bayesian neural network dynamics models," in Data-efficient machine learning workshop, ICML, vol. 4, p. 25, 2016.
[36] K. Chua, R. Calandra, R. McAllister, and S. Levine, "Deep reinforcement learning in a handful of trials using probabilistic dynamics models," in Advances in neural information processing systems (NIPS), 2018.
[37] D. Sarkar, M. A. Osborne, and T. A. Adcock, "Prediction of tidal currents using bayesian machine learning," Ocean Engineering, vol. 158, pp. 221–231, 2018.
[38] A. Girard, C. E. Rasmussen, J. Q. Candela, and R. Murray-Smith, "Gaussian process priors with uncertain inputs application to multiple-step ahead time series forecasting," in Advances in neural information processing systems (NIPS), pp. 545–552, 2003.
[39] T. Hiraoka, T. Imagawa, T. Hashimoto, T. Onishi, and Y. Tsuruoka, "Dropout q-functions for doubly efficient reinforcement learning," in International Conference on Learning Representations (ICLR), 2022.
[40] S. Fujimoto, W.-D. Chang, E. Smith, S. Gu, D. Precup, and D. Meger, "For sale: State-action representation learning for deep reinforcement learning," in Advances in Neural Information Processing Systems (NeurIPS), vol. 36, pp. 61573–61624, 2023.
[41] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, "Pytorch: An imperative style, high-performance deep learning library," in Advances in neural information processing systems (NIPS), pp. 8024–8035, 2019.
[42] M. J. Powell, "The bobyqa algorithm for bound constrained optimization without derivatives," Cambridge NA Report NA2009/06, University of Cambridge, vol. 26, pp. 1–39, 2009.
[43] I. Osband, "Risk versus uncertainty in deep learning: Bayes, bootstrap and the dangers of dropout," in NIPS workshop on bayesian deep learning, vol. 192, MIT Press, 2016.
[44] R. Egele, R. Maulik, K. Raghavan, B. Lusch, I. Guyon, and P. Balaprakash, "Autodeuq: Automated deep ensemble with uncertainty quantification," in 2022 26th International Conference on Pattern Recognition (ICPR), pp. 1908–1914, IEEE, 2022.
Huiyun Li (Senior Member, IEEE) received the [Link] degree in Electronic Engineering from Nanyang Technological University in 2001, and the Ph.D. degree from the University of Cambridge, UK, in 2006. She is now a professor at Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences and The Chinese University of Hong Kong. Her research interests include automobile electronics and autonomous vehicles.