0% found this document useful (0 votes)

31 views17 pages

Effective Probabilistic Neural Networks Model For Model-Based Reinforcement Learning USV

This article presents a novel approach called probabilistic neural networks model predictive control (PNMPC) for model-based reinforcement learning (MBRL) in unmanned surface vehicles (USV). The proposed method addresses the computational complexity of traditional Gaussian process models by using deep neural networks to effectively model USV dynamics and control against disturbances. Evaluations demonstrate that PNMPC outperforms existing methods in both model accuracy and control performance in various USV scenarios.

Uploaded by

admin

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

31 views17 pages

Effective Probabilistic Neural Networks Model For Model-Based Reinforcement Learning USV

Uploaded by

admin

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

This article has been accepted for publication in IEEE Transactions on Automation Science and Engineering.

This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/TASE.2025.3539317

Effective Probabilistic Neural Networks Model for

Model-based Reinforcement Learning USV
Wenjun Huang, Yunduan Cui, Member, IEEE, Huiyun Li, Senior Member, IEEE,
Xinyu Wu, Senior Member, IEEE,

Abstract—Gaussian process (GP) offers a robust solution for knowledge of specific hardwares [5], manual selection of
modeling the dynamics of unmanned surface vehicles (USV) controller parameters [6] and large amounts of data that are
in model-based reinforcement learning (MBRL). However, the usually costly to collect by human experts [7]. There is still
rapidly increasing computational complexity with a large sample
capacity of GP limits its application in complex scenarios that a significant gap in achieving fully automatic control and
require substantial samples to cover the state space. In this adaptive learning in USV domian [8].
article, a novel probabilistic MBRL approach, probabilistic neu- Reinforcement learning (RL) aims to iteratively learn op-
ral networks model predictive control (PNMPC) is proposed to timal or suboptimal control strategies by interacting with the
tackle this issue. With an iterative learning framework, PNMPC external environments without human intervention [9]–[11].
properly models the USV dynamics using neural networks from
a probabilistic perspective to avoid the computational complex- It offers an appealing prospect for the fully autonomous USV
ity associated with sample capacity. Employing this model to with the capability of adaptive learning. The traditional model-
effectively propagate system uncertainties, a model predictive free RL approaches have been extensively applied to USV in
control (MPC) policy is developed to robustly control the USV various tasks including collision avoidance, target tracking and
against external disturbances. Evaluated by position-keeping and motion control [12]–[17]. The most recent research trained a
multiple targets-tracking scenarios on a real USV data-driven
simulation, the proposed method consistently demonstrates its model-free RL agent in a simulation environment and success-
significant superiority in both model accuracy and control per- fully drove it in the real-world USV tracking task [18]. On the
formance compared to not only GP model-based approaches but other hand, most of the traditional model-free RL approaches
also the probabilistic neural networks-based MBRL baselines, typically rely on a deterministic policy based on neural net-
across various scales of external disturbances. works which inherently lacks the the robustness against the
Note to Practitioners— Modelling the system dynamics and ocean environment under various uncertain disturbances. They
maintaining computational efficiency with a large sample set has usually need substantial learning in simulated environments
been challenging for MBRL in the USV domain. We propose a before they can be deployed in real-world scenarios with
novel neural network modeling method to capture the dynamic different unmanned systems [18]–[20].
features of USV within an RL loop and develop a robust MPC This issue can be mitigated by model-based RL (MBRL)
policy based on its uncertainty propagation. Our method achieves
computational complexity independent of the sample capacity which parallelly learns a model of the environment. One of
and outperforms related baselines in model accuracy and control the most representative approaches, the probabilistic inference
performance. for learning control (PILCO) modeled the uncertain system by
Gaussian process (GP) [21], [22] and achieved robust control
I. I NTRODUCTION with superior sampe-efficiency [23]. Based on the superior
robustness of model predictive control (MPC) in complex sys-
S one critical technology in marine science, unmanned
A surface Vehicles (USV) have evolved rapidly over the
past few decades in academia and industry. These technolo-
tems like unmanned ground vehicles and space vehicles [24],
[25], GP-MPC was proposed by integrating MPC into PILCO
to deal with the dynamic and unpredictable disturbances in
gies have become a key solution in not only enhancing the
uncertain environments with evaluation on simulated cart-pole
efficiency and safety of marine transportation but also tackling
and double pendulum [26]. As the first attempt to implement
the shortage of skilled professionals in the shipping industry.
the MBRL framework with GP model and MPC controller
Despite their growing applications in various scenarios [1]–
to USV, sample-efficient probabilistic model predictive con-
[4], the current USV technologies highly depend on the prior
trol (SPMPC) successfully drove a regular size boat in a
This work is supported in part by the National Natural Science Foundation real ocean environment to learn target-reaching and position-
of China under Grants 62103403 and 92473112; in part by Shenzhen R&D keeping tasks without neither prior human knowledge nor
Foundation under Grant KCXST20221021111210023. (Corresponding author:
Yunduan Cui) additional training process in simulation [27], [28]. Drawing
W. Huang is with University of Chinese Academy of Sciences, China and inspiration from [29], filtered probabilistic model predictive
Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, control (FPMPC) exploited the benefits of the probabilistic
Shenzhen, Guangdong, 518055, China.
Y. Cui, H. Li and X. Wu are with CAS Key Laboratory of Human-Machine model in USV control. It employed the Bayesian filter to
Intelligence-Synergy Systems, Shenzhen Institutes of Advanced Technology, account for the latent uncertain variables and enhanced the
Chinese Academy of Sciences, Shenzhen, Guangdong, 518055, China, and control stability based on SPMPC [30].
also with Guangdong-Hong Kong-Macao Joint Laboratory of Human-Machine
Intelligence-Synergy Systems, Shenzhen Institutes of Advanced Technology, Although SPMPC and FPMPC have shown potential in
Chinese Academy of Sciences, Shenzhen, Guangdong, 518055, China. USV control, their application has been strongly limited by the

Authorized licensed use limited to: ShanghaiTech University. Downloaded on February 07,2025 at [Link] UTC from IEEE Xplore. Restrictions apply.
© 2025 IEEE. All rights reserved, including rights for text and data mining and training of artificial intelligence and similar technologies. Personal use is permitted,
but republication/redistribution requires IEEE permission. See [Link] for more information.
This article has been accepted for publication in IEEE Transactions on Automation Science and Engineering. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/TASE.2025.3539317

𝑌
𝜓! 𝑣!
Reward Ocean Environments

Predicted States State Action

with Uncertainty
𝒙# 𝒂∗#
𝜏! (𝑋 ! , 𝑌 ! )
𝜓" USV

𝛿! Predicted Actions
𝑣" Model Policy
𝑋 𝑡 𝑡+𝐻−1

𝑥! , 𝑎! $
𝑥!"# , 𝑎!"# Target

⋯ ⋯

• Alleviate changing disturbances

• Model uncertain USV system
⋯ • Iteratively learn the task
$ $ $
P(𝑥!"# |𝑥! , 𝑎! ) P(𝑥!"% |𝑥!"# , 𝑎!"# )
• Support large sample set

[Model] Probabilistic Neural Networks [Policy] MPC Controller based on

[Framework] MBRL Approach for
Model Specific for USV dynamics Dropout and Neural Networks
USV Control PNMPC
under Disturbances Ensembles Uncertainty Propagation

Fig. 1. Principle and contributions of the proposed method PNMPC.

non-parametric nature of GP: the computational complexity the unpredictable noises are further filtered in MPC for more
of the GP model exponentially increases at a rate of O(N 3 ) robust and prompt control behaviors. It naturally combines the
where N is the number of samples. Although this issue can characteristics of Deep Pilco and PETS to achieve superior
be partially addressed by sparse GP [31], [32], balancing representative capability of system uncertainty without losing
the control performance and the sample capacity of GP- stability. Evaluated by position-keeping and targets-tracking
based approaches remains challenging in engineering. With an scenarios under different levels of environmental disturbances
independent computational complexity of the sample capacity, in the real USV data-driven simulation developed by [28],
deep neural networks have been widely implemented to control PNMPC not only achieved superior control stability and
unmanned vehicles [33], [34] and become one feasible solution prediction accuracy than the existing neural networks-based
to fit the probabilistic USV models. Deep Pilco first explored approach PETS but also comprehensively surpassed traditional
the power of neural networks in PILCO [35]. Probabilistic MBRL methods based on GP (SPMPC, FPMPC) in control
ensembles with trajectory sampling (PETS) integrated neural performances and model quality with an independent compu-
networks and probabilistic MPC into one MBRL framework. tational complexity to the sample capacity. The contributions
It achieved comparable performance to model-free RL ap- of this work are summarized following Fig. 1:
proaches in several Mujoco control benchmarks [36]. How-
ever, despite the breakthrough in computational complexity, 1) Developing a practical probabilistic neural networks
these approaches also exhibit inherent limitations in stability model for USV dynamics with superior generalization
compared to their original versions based on GP. The applica- capability and model accuracy than both GP model and
tion of them in environments with disturbances such as USV existing neural networks-based approaches.
remains unexplored. 2) Designing an effective MPC-based policy by combining
the characteristics of Deep Pilco and PETS in the
The motivation of this paper is to tackle the computational uncertainty propagation of probabilistic neural networks.
efficiency issue of traditional probabilistic MBRL approaches 3) Proposing a novel MBRL approach specific for USV
that employ the GP model and MPC controller in the USV do- control by integrating the model and policy above. Its
main. Designing a probabilistic neural network with sufficient advantages in control capability and independent com-
stability and expressiveness, a novel MBRL method, proba- putational complexity to sample capacity compared with
bilistic neural networks model predictive control (PNMPC) other baselines were evaluated in several USV scenarios.
is proposed to properly model the uncertain USV dynamics
and robustly control it against changing disturbances using The remainder of this paper is organized as follows. Section
deep neural networks. Our method emphasizes the features of II provided the preliminary of the target problem. Section III
USV dynamics in continuous time steps during the training detailed PNMPC. The experimental results were analyzed in
of neural networks to improve the prediction accuracy while Section IV. The conclusion was given in Sections V.

II. P RELIMINARY
𝑥! , 𝑎! 𝑥!"# 𝑥!"# , 𝑎!"# 𝑥!"$ + Δ𝑥
A. Modeling USV $
$
P(𝑥!"# |𝑥! , 𝑎! ) 𝑥!"# , 𝑎!"# $
P(𝑥!"% |𝑥!"# , 𝑎!"# )
In this work, the USV dynamics was described as a Markov
decision process (MDP) following [28], [30]. As shown in left Observation bias Second step loss
$
top of Fig. 1, the observed state x ∈ S was defined with Δ𝑥 = 𝑥!"# − 𝑥!"# ℒ%

elements [X u , Y u , ψ u , v u , δ u , τ u , v w · cos(ψ w ), v w · sin(ψ w )] First step loss Total loss

where [X u , Y u ] indicate the current position of USV, ψ u ℒ# ℒ = ℒ# + ℒ %

and v u present the heading direction and velocity of USV,

δ u and τ u are the current values of USV rudder angle and First step prediction Second step prediction

engine throttle. Both wind direction ψ w and velocity v w are

observed by sensors while other disturbances will be treated Fig. 2. Training process of the neural networks model in PNMPC.
as unobservable noises. The control action a ∈ A included the
commands of rudder angle and engine throttle a = [δ o , τ o ].
following Deep Pilco [35]; 2) multiple ensembles of neural
The superscript u, w and o represent USV, wind and operation.
networks with independent weights [36]. One single model of
This work aims to model the following USV dynamics
PNMPC is defined as:
from a probabilistic perspective where the unobservable dis-
turbances can be well considered: [µ (xt , at , wi , zj ) , σ (xt , at , wi , zj )] = fˆ (xt , at , wi , zj ) .
(3)
xt+1 = fˆ(xt , at ). (1)
The networks predict both the mean and standard deviation of
The input state xt and the predicted state of the next step the state in the next step. wi indicates one independent weights
xt+1 in Eq. (1) are presented in the same form for concise matrix from B ensembles of neural networks {w1 , ..., wB }. zj
representation. In practice such as multiple-step prediction, is one set of Q sets of dropout units {z1 , ..., zQ }. According
we do not predict the unpredicted terms related to wind, but to [35], zj randomly select some dropout units in the L
instead keep them consistent with the initial input state. layers of neural networks following Bernoulli distribution
zj = {zj1 , ..., zjL }. The corresponding output at the l-th layer
B. MPC-based policy of the neural networks with weights wi becomes:
Based on the learned model fˆ(·), this work built an MPC ŷ l = ϕ ŷ l−1 ◦ zjl wil

(4)
controller as the policy to promptly respond to the frequently
challenging disturbances following existing works [35], [36]. where wil is the weights vector for the corresponding layer,
At time step t with the observation xt , one H-steps optimal ◦ is the element wise product, ϕ(·) is the activation function.
action sequence was obtained by the MPC-based policy: The one-step prediction of the probabilistic model follows:
H−1 B Q
X 1 X X ˆs
[a∗t , ..., a∗t+H−1 ]

= arg max E R(xt+s+1 , at+s ) , xt+1 = f (xt , at , wi , zj ) ,
at ,...,at+H−1 BQ i=1 j=1
s=0
xt+s+1 = fˆ(xt+s+1 , at+s ), fˆs (xt , at , wi , zj ) ∼ N µ (xt , at , wi , zj ) , σ (xt , at , wi , zj ) .

s.t.
xt+s+1 ∈ S, at+s ∈ A (5)
(2) Although it is difficult to describe the uncertainties of the
where R(·) is the reward function designed for a specific task. target system caused by external disturbances that follow un-
Only the first step action a∗t was executed. The USV moved to known distributions, we estimate fˆs (xt , at , wi , zj ) them by
step t + 1 and repeated the process above. It formed a policy Gaussian distribution according to its successful applications
a∗t = π(xt ) with the capability of handling the dynamical in ocean engineering [37].
disturbances from external environments. PNMPC proposed a novel training approach of the prob-
An MBRL framework can be formulated by integrating the abilistic neural networks model with B ensembles and Q
probabilistic model and MPC-based policy following [28]. dropout sets to address the features of USV dynamics in
The agent interacts with the ocean environment using the continuous time steps. Following the main principle shown
policy that maximizes the reward function based on its model in Fig. 2, the proposed method requires the samples of USV
knowledge. Its control performance is iteratively improved by dynamics in continuous two steps {xt , at , xt+1 , at+1 , xt+2 }.
updating the model through the newly obtained data. The loss function can be divided into two parts related to the
first and second-step predictions. The first step loss follows:
III. A PPROACH Q
B X
X
A. Accurate Neural Networks Model of USV Dynamics L1 = T
[Ew i ,zj
σ −1 (xt , at , wi , zj ) Ewi ,zj
i=1 j=1
(6)
In this section, we first detail the process of training an
+ log det σ −1 (xt , at , wi , zj )]
effective USV model by probabilistic neural networks. The
randomness of PNMPC’s model mainly comes from two where Ewi ,zj = (µ (xt , at , wi , zj ) − xt+1 ) is the prediction
aspects: 1) the random dropout units of neural networks error of the corresponding neural networks in the first step.

The second step’s loss function shares a close form of ∗ ∗ ∗

Execute 𝒂!"( Execute 𝒂!"% Execute 𝒂!"# Execute 𝒂!∗
Eq. (6) with an additional observation bias ∆x = x′t+1 −xt+1 Core 1
where the first term is the overall prediction in Eq. (5), the Observe 𝒙!"% Observe 𝒙!"# Observe 𝒙! Observe 𝒙!&#
second term is the next step state in samples:
Bias compensation Bias compensation Bias compensation Bias compensation
"!"%
𝒙 "!"#
𝒙 "!
𝒙 "!&#
𝒙
X Q
B X Core 2
L2 = T
[Êw i ,zj
σ −1 (xt+1 , at+1 , wi , zj ) Êwi ,zj MPC policy
Search 𝒂∗!"%
MPC policy
Search 𝒂∗!"#
MPC policy
Search 𝒂∗!
MPC policy
Search 𝒂∗!&#
i=1 j=1
⋯ 𝑡−2 𝑡−1 𝑡 𝑡+1 ⋯
+ log det σ −1 (xt+1 , at+1 , wi , zj )],
Update neural networks with two-steps dynamics loss
Sample set 𝒟 with two
Êwi ,zj = µ (xt+1 , at+1 , wi , zj ) − xt+2 − ∆x. Include the aleatoric uncertainty
steps continuous samples
(Section 3. A)
(7)
Uncertainty propagation based on random dropout and neural networks ensembles
Exclude the aleatoric uncertainty
The introduced bias is added to the target state as a manual (Section 3. B)

shift. It encourages the networks to focus on the prediction

of the current step while reducing the negative effect of Fig. 3. Workflow of the proposed PNMPC in USV control.
the predicted error from the previous step. The overall loss
function is defined as:
and PETS while filtering the predicted standard deviation of
each forward calculation. The MPC policy is optimized as:
B X
X L
H−1
L = L1 + L2 + λl ∥wil ∥22 (8) X
i=1 l=1
[a∗t , ..., a∗t+H−1 ] = arg max R(xt+s+1 , at+s ),
at ,...,at+H−1
s=0
B Q (9)
where the third term alleviates the over-large network weights 1 XX
s.t. xt+s+1 = µ (xt+s , at+s , wi , zj ) ,
by tunable parameter vector λ = {λ1 , ..., λL }. Although the BQ i=1 j=1
prediction errors over two steps are equally considered in this
xt+s+1 ∈ S, at+s ∈ A.
work by setting the weights of L1 and L2 as 1, it is free to
select different weights for specific tasks. Following the related application on real-world USV [28],
Compared to existing works based on GP models [28], we assume the unpredicted wind direction and velocity do
[30] which approximate the target system by Gaussian ker- not change by fixing cos(ϕw ) and v w in x from the first
nel functions, PNMPC enjoys superior uncertainty expression step of such a H steps optimization. The uncertainties prop-
capabilities while making fewer assumptions about noise and agated in Eq. (9) are determined by B ensembles of neural
state distributions. Its probabilistic neural networks do not networks with independent weights and Q sets of dropout
assume environmental noise follows Gaussian distributions units while the predicted standard deviation of each forward
and be able to approximate any non-linear function. The calculation is blocked in the multi-step prediction. This process
uncertainty propagation in PNMPC only requires Gaussian- not only reduces the computational complexity by removing
distributed sampling in Eq. (5) during the model update stage, B ×Q times random sampling but also mitigates the excessive
unlike moment-matching [38] which approximates input and amplification of standard deviation in multi-step prediction
output states as Gaussian distributions. caused by the insufficiently learned model which may lead
to a deteriorating control sequence in the MPC-based policy.
Compared to existing works based on GP model [28], [30],
it further relaxes the assumption of moment-matching [38]
that predicted states must follow Gaussian distributions at
B. Effective Uncertain Propagation in Probabilistic MPC
the stage of decision-making. From the perspective of the
source of uncertainty, the MPC-based policy above excludes
In this section, we introduce an effective propagation of un- the aleatoric uncertainty, i.e., the predicted standard deviation
certainty in the MPC process using the probabilistic neural net- σ(·) caused by data noises and external disturbances which
works above. Given the ensembles of neural networks weights has a weaker Markov property related to the USV dynamics.
{w1 , ..., wB } and the dropout units set {z1 , ..., zQ }, the one
step prediction is divided into B × Q forward calculation
following Eq. (3). During the training process introduced in C. Probabilistic Neural Networks Model Predictive Control
Section III-A, the proposed method predicts the next step state In this subsection, we detailed the workflow of PNMPC in
using Gaussian distributions and incorporates the standard controlling a USV with known dynamics. The interaction was
deviation in the loss function to guide the networks to capture abstracted in Fig. 3. The bottom part of Fig. 3 demonstrates the
the features of USV dynamics from a probabilistic perspective. training process. From right to left, the newly explored samples
In the decision-making process using the MPC-based policy, were stored in the sample set D. The samples of continuous
PNMPC proposed an effective uncertain propagation that two steps {xt , at , xt+1 , at+1 } were utilized to train the neural
naturally combines the uncertain predictions of Deep Pilco networks in a supervised learning way following Section III-A

Algorithm 1: Learning process of PNMPC in USV. At the same time, the second core predicts the current state
Input sample set D, number of network ensembles B, x̂t based on the previous state xt−1 and action at−1 sent by
number of dropout units set Q, executing time ∆t, the first core following Eq. (5). Setting it as the initial state
horizon of MPC policy H, reward function R. of the MPC policy effectively alleviates the bias caused by
# Initialize weights ensembles and random dropout set the USV operation during the optimization process, as stated
w = {w1 , ..., wB }, z = {z1 , ..., zQ } in [28]. The control signal for the next step is then decided
# Train neural networks model by D with warmup following Eq. (9) and sent to the first core at the end of this
samples following the loss function in Eq. (8) step. After each rollout with Lrollout steps, the model will be
fˆ = Train_NNS(D, L, w, z) iteratively updated based on the current sample set D following
the loss function L defined in Eq. (8). The whole learning
for i = 1, 2, ..., Ntrial do
# Initialize USV and control signal process was summarized in Algorithm 1.
x0 =Reset_USV_State()
a∗0 = [0, 0] IV. E XPERIMENTS
for t = 1, 2, ..., Lrollout do A. Experimental Settings
# Core 1
Operate_Control(a∗t−1 , ∆t) 1) Simulation Environment: To evaluate our approach,
xt = Observe_USV_State() we utilized the simulation environment developed by Furuno
if t > 1 then Electric CO. LTD., Japan which is based on a real boat (a
# Expand D with two steps’ samples Nissan JoyFisher with a length of seven meters) driving data
D = {D, {xt−2 , a∗t−2 , xt−1 , a∗t−1 }} in a real ocean environment. Following the dynamic model
detailed in [28], the simulated boat’s steering rudder had a
# Core 2
range δ o ∈ [−30, 30]◦ , its engine throttle was mapped to
# Bias compensation
PB PQ following Eq. (5)
1 ˆs (xt−1 , at−1 , wi , zj ) τ o ∈ [−1, 1], corresponding to the actions of moving backward
x̂t = BQ i=1 j=1 f
and forward, with maximum 800 rpm in rotation speed. The
# search optimal actions following Eq. (9)
executing time was set as ∆t = 2.0 second.
a∗t = MPC_Policy(x̂t , R, H, w, z)
There are two types of disturbances in this simulation: one
# Train neural networks model by D following the is the wind with speed v w and direction ψ w that can be
loss function in Eq. (8) detected by sensors, and the other is the unobservable wave
fˆ = Train_NNS(D, L, w, z) with speed v c and direction ψ c . Following the settings in
Return fˆ with w, z related works [28], [30], three levels of rand wind and current
velocity were set to simulate the ocean environments under
disturbances:
1) v w = 2.0+U (−0.1, 0.1), vc = 0.25+U (−0.1, 0.1) m/s,
while the loss function L fully considered randomness from
2) v w = 4.0 + U (−0.1, 0.1), vc = 0.5 + U (−0.1, 0.1) m/s,
multiple ensembles of networks, different dropout sets and the
3) v w = 6.0 + U (−0.1, 0.1), vc = 0.5 + U (−0.1, 0.1) m/s.
aleatoric uncertainties in neural networks. The updated neural
networks affected the MPC policy via Eq. (9) introduced in U indicates a uniform distribution. The increasing wind and
Section III-B, the aleatoric uncertainties of neural networks current speeds led to a significantly enhanced uncertainty of
were excluded for more stable control behaviors. USV dynamics and made the learning of robust policy more
challenging. Both wind and current directions were initialed
The interaction process between PNMPC and the USV by a certain value and affected by random noises at each step:
system mainly followed the two CPU cores setting proposed
in [28], as demonstrated in the top part of Fig. 3. Two ψ w = 37◦ + U (−30, 30)◦ , ψ c = 100◦ + U (−30, 30)◦ .
cores work in parallel to reduce the delay caused by the
optimization of the MPC policy. At the beginning of each trial, 2) Control Scenarios: Two control scenarios: position-
the probabilistic neural networks model was initially trained keeping and multiple targets-tracking were conducted. The
by the sample set D with some warmup date (e.g., USV first task focused on staying the USV near its initial position.
trajectories with random control signals). The RL training Its reward function was defined as:
process can be divided into Ntrial rollouts, each rollout has R(xt+1 ) = −||pt+1 − ptarget ||2 . (10)
Lrollout steps. At the beginning of each rollout, we first reset
the state of USV and obtained the initial state x0 , the initial pt+1 indicated the USV position [X u , Y u ] at step t + 1,
action is set as a∗0 = [0, 0]. The first core was responsible for ptarget = [0, 0] was the initial position. The learning process in
communicating with the USV system. At time step t, it sends Algorithm 1 was conducted with Ntrial = 20 iterations/rollouts,
the control signal a∗t−1 , which is optimized by the MPC policy each rollout contains Lrollout = 50 steps. After training, the
in the previous time step, to the USV system and receives the agent will be tested by 30 rollouts with Lrollout = 100 steps.
current state xt from sensors. The observed state will be sent In addition to the offset from the target position, we defined
to another core at the beginning of the next step. For step two additional evaluation criteria. The final success rate is
t > 1, the samples of two continuous steps will be added to determined by the ratio of staying USV near the target within
the sample set D = {D, {xt−2 , a∗t−2 , xt−1 , a∗t−1 }}. a 7 m range at the end of the test, The overall success rate is

TABLE I
C OMPARISON OF PROPOSED METHOD AND RELATED BASELINES

Computational complexity
Approach Model type Uncertainty propagation Horizon of model fitting Policy type
related to sample size N
DroQ [39] N/A N/A N/A Actor-Critic Networks Independent
TD7 [40] N/A N/A N/A Actor-Critic Networks Independent
SPMPC [28] Gaussian process Moment-Matching Single step MPC controller O(N 3 )
FPMPC [30] Gaussian process Moment-Matching Single step MPC controller O(N 3 )
PETS [36] Neural networks Ensembles sampling Single step MPC controller Independent
Ensembles sampling
PNMPC (ours) Neural networks Continuous two steps MPC controller Independent
+ Random dropout

determined by whether the USV holds its position within the where the sparse GP [31] was used to enable an H = 3 steps
range above during the entire test rollout. prediction horizon in MPC. According to the discussion of
The second task focused on sequentially passing multiple computational complexity in [28], it is difficult to complete
targets and staying near the last one. Two trajectories were MPC’s optimization within 2 seconds with H > 3 during a
designed for this task. The “circle” one included six targets 20 iterations’ learning procedure. PETS and PNMPC benefited
at [15.55, 12.89], [22.42, 32.15], [9.83, 46.77], [−13.45, 45.36], from the independent computational complexity of the sample
[−22.31, 28.53], [−13.25, 10.35] as one circle (120 m total capability. We set H = 10 for them to fully demonstrate these
distance) to test USV’s control skill in local scenes. The advantages. The neural networks were built with two hidden
“right” one included ten linearly spacing targets from [0, 0] to layers with 200 units, other related parameters were as follows:
[40, 200] (200 m total distance) to evaluate the control behavior B = 5, Q = 4, λ = [0.0001, 0.00025, 0.00025, 0.0005] for
in a long-term distance. The reward function was defined as: input, output and hidden layers in Eq. (8), the supervised
learning of neural works were conducted in 40 epoches with
R(pt+1 ) = −||pt+1 − pitarget ||2 −0.5 · ||pt+1 − pi+1
target ||
2
(11) learning rate 0.001. To evaluate the impact of the uncertainties
where the agent minimized its distances to the current and the in our method, we tested PNMPC without uncertainty prop-
next targets (denoted by index i and i + 1) when there is more agation (indicated as PNMPC-N), i.e., the size of ensembles
than one target. It turned to Eq. 10 with the last target. The was set to B = 1 while the dropout units were also disabled.
tracking distance was defined as: Table I summarized the characteristics of PNMPC and other
baselines. Leading model-free RL approaches, DroQ and TD7,
i
X learned control policies by updating Actor-Critic networks
D(t) = ||pjtarget −pj−1
target ||
2
− ||pt −pitarget ||2 . (12) without constructing explicit models. Their computational
j=1
complexity depended on the network structure rather than the
The complete ratio d was calculated by dividing it at the sample size. In comparison, SPMPC and FPMPC modeled the
last step by the total distance. The rollout was classified as target dynamics using GP and derived control policies from
a success with d ≥ 95%, and as a failure with d < 60%. MPC controllers based on GP’s uncertainty propagation using
The learning process was conducted with Ntrial = 20 it- analytic moment matching [38], offering robust resistance
erations/rollouts, each rollout contains Lrollout = 100 steps. to environmental disturbances. However, GP’s computational
After training, the agent will be tested by 30 rollouts with complexity increased exponentially with sample size, hamper-
Lrollout = 100 steps. ing its applications on USV with large sample sets. PETS ad-
3) Baselines and Parameters: Five baselines were se- dressed this issue by substituting GP with probabilistic neural
lected in the experiment. The SPMPC proposed by [28] is networks and estimated propagated uncertainties through net-
the original MBRL approach based on GP. FPMPC [30] can work ensembles. Our method PNMPC further enhanced PETS
be treated as its SOTA extension with support of partially to meet the requirement of USV: it considered the system
observable MDP. For the neural networks-based approach, dynamics in two continuous steps during the model learning
PETS [36] was compared. It can be seen as the proposed process and integrated random dropout from Deep Pilco [35]
method without the features of model training and uncertainty for superior representation capability of uncertainties.
propagation introduced in Sections. III-A and III-B. Two state- All experimental results were conducted on a computational
of-the-art model-free RL approaches DroQ [39] and TD7 [40] server with an Intel Xeon(R) W-2245 CPU, NVIDIA GeForce
were selected for a fair comparison. All compared methods RTX 2080 Ti and 64 GB memory by five independent trials
were implemented by PyTorch [41]. All MPC policies were with different random seeds.
implemented based on the bound optimization by quadratic
approximation [42] in NLopt1 for a fair comparison. The
B. Evaluation of Learning and Control Performances
settings of SPMPC and FPMPC followed their original paper,
1) Learning Performances: The learning curves of all
1 [Link] compared methods within 20 iterations were compared in

TABLE II
T EST RESULT OF PROPOSED METHOD AND BASELINES IN THE POSITION - KEEPING TASK .

Disturbance Method Average [m] Median [m] Time [s] Success rate (final) Success rate (overall)
DroQ 47.16 ± 28.55 43.96 0.002 ± 0.0003 0.0% 0.0%
TD7 20.70 ± 15.88 29.02 0.002 ± 0.0002 10.67% 9.33%
SPMPC 3.94 ± 6.71 1.51 0.43 ± 0.27 78.67% 66.0%
Lv. 1 FPMPC 1.85 ± 3.15 1.30 0.69 ± 0.83 97.33% 48.0%
PETS 7.69 ± 10.04 4.0 0.24 ± 0.09 90.0% 3.33%
PNMPC 1.85 ± 2.72 1.14 0.25 ± 0.08 95.33% 80.0%
PNMPC-N 18.48 ± 18.03 6.41 0.19 ± 0.01 68.0% 0.0%
DroQ 82.32 ± 51.42 79.23 0.002 ± 0.0003 0.0% 0.0%
TD7 23.85 ± 27.98 16.90 0.002 ± 0.0002 34.0% 20.0%
SPMPC 4.90 ± 9.98 2.93 0.36 ± 0.24 85.33% 25.33%
Lv. 2 FPMPC 4.32 ± 7.17 2.67 0.47 ± 0.49 90.0% 26.0%
PETS 26.17 ± 28.24 15.65 0.25 ± 0.10 49.33% 0.67%
PNMPC 3.50 ± 4.97 1.85 0.25 ± 0.09 92.67% 51.33%
PNMPC-N 22.42 ± 35.07 11.41 0.19 ± 0.01 32.0% 0.0%
DroQ 79.26 ± 48.21 77.18 0.002 ± 0.0003 0.0% 0.0%
TD7 41.91 ± 47.49 48.98 0.002 ± 0.0003 27.33% 15.33%
SPMPC 14.0 ± 28.64 3.30 0.30 ± 0.18 72.0% 20.0%
Lv. 3 FPMPC 5.33 ± 9.85 3.47 0.56 ± 0.50 84.67% 0.0%
PETS 24.70 ± 20.52 18.61 0.24 ± 0.09 26.67% 0.0%
PNMPC 3.42 ± 4.03 2.04 0.25 ± 0.09 98.0% 35.33%
PNMPC-N 35.21 ± 30.24 26.84 0.19 ± 0.01 29.33% 0.0%

Fig. 4 where the Y axis indicated the average position offset performances could have certain advantages over the mature
in the position-keeping task and the average tracking distance MBRL approach based on the GP models.
in the targets-tracking task with circle and right trajectories. 2) Testing Performances: The control performances of
The original SPMPC gradually failed to converge to proper PNMPC and other baselines in the testing were compared in
policies as the level of disturbance increased. The neural Tables II, III and IV to investigate the generalization capability
networks-based PETS had large standard deviations in the of our method after the training procedure. The average and
average position offset during the training, making it difficult median terms in Table II were related to the position offset,
to converge under all levels of disturbances. As a comparison, the time term in all tables indicated the average optimization
PNMPC stably converged to the same or even superior con- time of the MPC policy per step. The bold font indicated the
trol policies as FPMPC under all levels of disturbances and best performance in the corresponding term.
enjoyed a leading convergence velocity. In the targets-tracking
In the position-keeping task, PNMPC had shown significant
scenario with both circle and right trajectories, we observed
advantages under different levels of disturbances over other
similar advantages in the learning capability with PNMPC. It
baselines in terms of average position offset, median position
usually converged to higher average tracking distances than
offset, and task success rates in all tests, apart from the 2% less
FPMPC while the performances of SPMPC were significantly
success rate (final) in the first level disturbance compared to
influenced by the degree of disturbances. The neural networks-
FPMPC. As a comparison, the neural networks-based PETS
based PETS struggled to learn comparable control policies
performed consistently worse than SPMPC and FPMPC in
in most cases without the proposed tricks in modeling USV
position offset and success rates. For computational efficiency,
dynamics and propagating system uncertainties. Once the
the proposed method had a lower average optimization time
uncertainty propagation was disabled, PNMPC-N exhibited
than other GP model-based approaches while enjoying a
a decline in both learning ability and stability, falling sig-
longer prediction horizon. In the targets-tracking task with
nificantly behind other probabilistic MBRL approaches. In
the circle trajectory, the superiority of PNMPC was observed.
terms of model-free methods, DroQ and TD7 were unable
The proposed method outperformed other baselines in all
to successfully learn the tasks within 20 iterations due to the
control-related metrics. Compared to the suboptimal approach
perturbed environment and the limited number of interactions.
FPMPC, PNMPC improved the average tracking distance by
The results above demonstrated the feasibility of the pro- 8% under the highest disturbance level, resulting in a 30% in-
posed method in MBRL USV. Compared to the traditional crease in the success rate. In contrast, the control performance
neural networks-based approach PETS, our method had ro- of PETS and SPMPC quickly deteriorated with increasing
bustly learned superior control strategies in the MBRL loop. Its disturbance levels. In terms of computational complexity, the

TABLE III
T EST RESULT OF PROPOSED METHOD AND BASELINES IN THE TARGETS - TRACKING TASK ( CIRCLE ).

Disturbance Method Ave. complete distance [m] Ave. complete ratio Time [s] Success rate Failure rate
DroQ −170.31 ± 59.80 −141.92% 0.002 ± 0.0004 0.0% 100.0%
TD7 −150.70 ± 94.01 −125.58% 0.002 ± 0.0002 0.0% 100.0%
SPMPC 102.0 ± 18.49 85.0% 0.14 ± 0.09 50.0% 8.0%
Lv. 1 FPMPC 102.80 ± 20.01 85.57% 0.13 ± 0.08 56.0% 9.33%
PETS 100.33 ± 26.95 83.61% 0.21 ± 0.02 53.33% 19.33%
PNMPC 114.61 ± 13.61 95.51% 0.22 ± 0.10 88.0% 4.67%
PNMPC-N 79.66 ± 47.55 66.39% 0.14 ± 0.004 40.67% 37.33%
DroQ −172.23 ± 67.71 −143.52% 0.002 ± 0.0004 0.0% 100.0%
TD7 −49.18 ± 41.45 −40.98% 0.002 ± 0.0003 0.0% 100.0%
SPMPC 94.66 ± 28.03 79.14% 0.13 ± 0.08 56.67% 20.0%
Lv. 2 FPMPC 93.91 ± 21.02 78.26% 0.12 ± 0.07 31.33% 19.33%
PETS 45.85 ± 51.78 38.21% 0.21 ± 0.10 10.0% 70.67%
PNMPC 104.81 ± 26.14 87.35% 0.22 ± 0.11 69.33% 14.0%
PNMPC-N 52.38 ± 39.04 43.65% 0.14 ± 0.002 10.0% 74.67%
DroQ −130.10 ± 37.68 −108.42% 0.002 ± 0.0004 0.0% 100.0%
TD7 −194.02 ± 221.73 −161.68% 0.002 ± 0.0002 0.0% 100.0%
SPMPC 45.23 ± 82.01 37.69% 0.12 ± 0.07 13.33% 40.0%
Lv. 3 FPMPC 82.03 ± 22.23 68.40% 0.11 ± 0.06 18.67% 40.0%
PETS 58.86 ± 48.87 49.05% 0.21 ± 0.10 16.67% 63.33%
PNMPC 91.49 ± 30.89 76.24% 0.22 ± 0.10 48.0% 29.33%
PNMPC-N 5.08 ± 82.87 4.23% 0.19 ± 0.001 5.33% 82.67%

TABLE IV
T EST RESULT OF PROPOSED METHOD AND BASELINES IN THE TARGETS - TRACKING TASK ( RIGHT ).

Disturbance Method Ave. complete distance [m] Ave. complete ratio Time [s] Success rate Failure rate
DroQ −116.35 ± 90.91 −58.17% 0.002 ± 0.0004 0.0% 100.0%
TD7 −51.65 ± 42.30 −25.82% 0.002 ± 0.0003 0.0% 100.0%
SPMPC 182.62 ± 30.86 91.31% 0.18 ± 0.12 77.33% 14.0%
Lv. 1 FPMPC 190.50 ± 17.80 95.25% 0.16 ± 0.11 86.0% 1.33%
PETS 133.89 ± 26.95 58.50% 0.23 ± 0.10 21.33% 40.67%
PNMPC 191.07 ± 19.42 95.53% 0.24 ± 0.11 84.67% 2.0%
PNMPC-N 96.55 ± 49.74 48.27% 0.14 ± 0.003 5.33% 72.67%
DroQ −293.13 ± 85.57 −146.57% 0.002 ± 0.0004 0.0% 100.0%
TD7 −85.42 ± 127.76 −42.71% 0.002 ± 0.0002 0.0% 100.0%
SPMPC 152.84 ± 59.40 76.42% 0.15 ± 0.08 49.33% 27.33%
Lv. 2 FPMPC 163.57 ± 44.14 81.78% 0.13 ± 0.07 39.33% 16.67%
PETS 130.79 ± 59.67 65.39% 0.22 ± 0.09 22.67% 34.0%
PNMPC 179.15 ± 35.28 89.58% 0.23 ± 0.10 66.67% 8.0%
PNMPC-N 92.07 ± 57.29 46.04% 0.14 ± 0.002 5.33% 68.67%
DroQ −253.19 ± 42.30 −126.60% 0.002 ± 0.0004 0.0% 100.0%
TD7 −286.07 ± 42.30 −143.04% 0.002 ± 0.0002 0.0% 100.0%
SPMPC 145.29 ± 66.99 72.64% 0.15 ± 0.07 45.33% 30.0%
Lv. 3 FPMPC 161.56 ± 50.97 80.78% 0.14 ± 0.08 43.33% 19.33%
PETS 129.03 ± 50.07 64.51% 0.22 ± 0.10 18.0% 43.33%
PNMPC 180.57 ± 36.19 90.28% 0.23 ± 0.10 74.0% 10.67%
PNMPC-N 83.08 ± 51.97 41.90% 0.14 ± 0.002 2.0% 74.0%

Position-keeping Targets-tracking (circle) Targets-tracking (right)

Lv. 1

Lv. 2

Lv. 3

DroQ TD7 SPMPC FPMPC PETS PNMPC-N PNMPC (ours)

Fig. 4. Learning curves of PNMPC and baselines in three control tasks under different levels of disturbances. Lines and translucent regions represent the
mean and standard deviation.

Position Keeping Circle Right

0.16 0.10 0.23 0.13 0.16 0.06 0.25 0.08 0.19 0.04 0.27 0.08
0.24 0.06 0.32 0.12 0.17 0.06 0.27 0.08 0.24 0.06 0.27 0.07
Lv. 3 0.10 0.03 0.21 0.09 0.11 0.03 0.17 0.07 0.13 0.04 0.18 0.09
0.99 0.21 0.65 0.17 1.19 0.27 1.05 0.22 1.06 0.22 1.89 0.40
0.10 0.04 0.20 0.09 0.14 0.05 0.24 0.10 0.15 0.05 0.19 0.09
0.15 0.04 0.22 0.08 0.16 0.05 0.27 0.10 0.19 0.04 0.24 0.05
0.23 0.14 0.31 0.23 0.18 0.05 0.29 0.10 0.24 0.05 0.28 0.10
0.11 0.06 0.21 0.08 0.11 0.03 0.19 0.08 0.13 0.04 0.17 0.06
Lv. 2 0.70 0.39 0.46 0.18 0.85 0.40 1.13 0.20 1.12 0.21 1.86 0.37
0.10 0.05 0.20 0.08 0.13 0.04 0.22 0.08 0.17 0.04 0.20 0.06

0.13 0.04 0.18 0.08 0.18 0.03 0.25 0.05 0.17 0.03 0.22 0.04
0.13 0.05 0.18 0.08 0.22 0.04 0.26 0.07 0.24 0.05 0.22 0.04
0.10 0.07 0.15 0.07 0.11 0.02 0.16 0.05 0.14 0.04 0.14 0.04
Lv. 1 0.72 0.13 0.50 0.23 1.18 0.21 1.27 0.19 1.24 0.31 1.95 0.38
0.07 0.03 0.13 0.06 0.10 0.03 0.15 0.05 0.13 0.02 0.13 0.04

0 0.2 0.4 0.6 0.8 1 1.2 0 0.2 0.4 0.6 0.8 1 0 0.5 1 1.5 0 0.5 1 1.5 0 0.5 1 1.5 0 0.5 1 1.5 2 2.5
Average predicted error of position (X axis) [m] Average predicted error of position (Y axis) [m] Average predicted error of position (X axis) [m] Average predicted error of position (Y axis) [m] Average predicted error of position (X axis) [m] Average predicted error of position (Y axis) [m]
4.22 1.53 0.21 0.12 4.22 1.53 0.14 0.07 4.91 1.41 0.19 0.03
6.91 1.52 0.21 0.06 6.91 1.52 0.10 0.03 6.66 1.77 0.19 0.02
Lv. 3 3.13 0.66 0.11 0.02 3.13 0.66 0.10 0.07 4.59 2.09 0.09 0.03
5.65 2.98 0.79 0.17 5.65 2.98 0.89 0.19 8.17 2.48 0.96 0.17
2.58 0.58 0.07 0.02 2.58 0.58 0.08 0.03 3.75 1.75 0.10 0.03
4.48 1.00 0.17 0.06 4.48 1.00 0.12 0.03 4.49 0.80 0.18 0.03
6.88 1.88 0.17 0.05 6.88 1.88 0.12 0.03 5.83 1.26 0.18 0.05
2.54 1.43 0.18 0.08 2.54 1.43 0.12 0.07 4.62 2.64 0.11 0.04
Lv. 2 6.63 3.64 0.65 0.18 6.63 3.64 0.99 0.18 9.88 2.88 1.10 0.17
2.79 1.19 0.08 0.03 2.79 1.19 0.08 0.02 3.80 1.62 0.10 0.03
3.91 0.87 0.12 0.04 3.91 0.87 0.10 0.01 4.65 0.90 0.11 0.02
5.83 1.70 0.11 0.04 5.83 1.70 0.10 0.02 6.40 1.09 0.11 0.02
1.84 1.88 0.15 0.04 1.84 1.88 0.06 0.01 7.37 5.76 0.06 0.02
Lv. 1 6.56 4.74 0.67 0.24 6.56 4.74 1.08 0.21 11.07 3.37 1.13 0.20
2.18 1.82 0.05 0.02 2.18 1.82 0.05 0.01 5.29 3.34 0.06 0.01

0 2 4 6 8 10 0 0.2 0.4 0.6 0.8 1 0 2 4 6 8 10 0 0.5 1 1.5 0 5 10 15 0 0.5 1 1.5

Average predicted error of orientation [°] Average predicted error of of velocity [m/s] Average predicted error of orientation [°] Average predicted error of of velocity [m/s] Average predicted error of orientation [°] Average predicted error of of velocity [m/s]

SPMPC FPMPC PETS PNMPC-N PNMPC (ours)

Fig. 5. Average prediction errors of the model learned by PNMPC and other baselines in position-keeping and targets-tracking tasks during the testing.

GP-based method SPMPC and FPMPC had achieved certain PNMPC demonstrated significant advantages over other
advantages which were potentially caused by the efficiency baselines in terms of success and failure rates, showcasing
of sparse GP in a larger task area (compared to the position- its robustness in controlling USV in ocean environments with
keeping task) while PNMPC achieved a similar and stable dynamic disturbances. Its comprehensive control capabilities
optimization time as in the position-keeping task. In the task were evident through smaller average offsets and longer
with the right trajectory, PNMPC consistently maintained its driving distances. Furthermore, the consistent excellence over
superiority and outperformed all other baseline methods. Un- three tasks highlighted its superior generalization ability.
der the third level of disturbance, the proposed method tracked 3) Model Accuracy: In this subsection, we evaluated the
10% more distances compared to the suboptimal approach quality of the probabilistic model learned by PNMPC. The
FPMPC, while improving success rate by 30%. Matching with average errors and their standard deviation between the first
the deteriorated learning ability shown in Fig. 4, PNMPC-N step prediction by the learned model and the real observation
demonstrated the worst performance compared with all MBRL in USV position, orientation and velocity using all model-
baselines in the testing. Although the model-free methods based baselines were demonstrated in Fig. 5.
DroQ and TD7 had clear advantages in calculation speed, they In all testing scenarios, PNMPC consistently achieved better
all failed to guide the USV to complete the given tasks. prediction accuracy and smaller prediction standard devia-

Position-keeping Targets-tracking (circle) Targets-tracking (right)

Lv. 1

Lv. 2

Lv. 3

DroQ TD7 PETS PNMPC-N PNMPC (ours)

Fig. 6. Learning curves of PNMPC, PETS, TD7 and DroQ in three control tasks with larger sample capacity under different levels of disturbances. Lines
and translucent regions represent the mean and standard deviation.

tion in all dimensions. The model accuracy was maintained three levels of disturbance. At each step, the wind direction and
even with the increasing disturbances. In comparison, the velocity were also affected by the random noises U (−30, 30)◦
GP model-based approaches suffered from larger prediction and U (−0.1, 0.1) m/s respectively.
errors caused by the sparse GP model. Although the neural The configuration above turned into a substantial expansion
networks-based PETS usually had a similar one-step predic- of the state space. It significantly increased the task’s difficulty,
tion accuracy to PNMPC, it significantly lagged in control while also being highly valuable for learning more general
performances using multi-step prediction MPC policy. This and robust control strategies in the highly random ocean
was due to not only the original loss function that did not environment rather than the one with a limited range of
consider the USV dynamics in continuous two steps but disturbances in [28]. To fully explore and learn the tasks
also the excessive uncertainties in long-term prediction. After under this configuration, we set the number of iterations/rollout
disabling the uncertainty propagation, the model prediction during the training to 100, using a total of 5000 to 10000
error of PNMPC-N significantly increased, which led to the samples. It is infeasible to employ SPMPC and FPMPC to
deterioration of the testing performance. Compared to the promptly control USV by the MPC policy under such a large
GP model-based methods SPMPC and FPMPC, the proposed sample set due to the overlarge computational complexity of
PNMPC significantly enhanced model accuracy. Compared to the GP model discussed in [28]. We compared DroQ, TD7,
PETS which modeled the target dynamics by neural networks, PETS, PNMPC-N and PNMPC to evaluate the superiority
our approach substantially improved generalization ability of the proposed method. All algorithmic settings remained
and learning performance by fitting the model in continuous consistent with Section IV-A.
two steps without introducing additional errors in single-step
1) Learning Performances: The learning curves of the
prediction. These results indicated the superiority of PNMPC
proposed PNMPC and baseline PETS within 100 iterations
in the quality of the learned USV model.
were compared in Fig. 6. Both PNMPC and PETS demon-
strated good learning behaviors in three levels of disturbance.
C. Evaluation with Large Sample Capacity Without uncertainty propagation, PNMPC-N had significant
In this subsection, the proposed method was evaluated in deficiencies in learning stability and finally converged to the
more challenging ocean environments with more learning worse position offsets and tracking distances. The model-free
iterations and larger sample capability to fully demonstrate method DroQ could not learn the tasks within 100 iterations.
its potential as a neural networks-based approach. Based on Although TD7 successfully converged in the position-keeping
the three levels of disturbance introduced in Section IV-A, tasks, it achieved unsatisfactory performances in two tracking
we generated the initial wind direction in a wider range tasks. In the position-keeping scenario, PETS frequently ex-
ψ w = U (−180, 180)◦ . The initial wind velocity was set as perienced drastic degradations in control performance during
v w = U (0, 1) · vmax
w w
where vmax = [2.0, 4.0, 6.0] m/s for the training as the environmental disturbances increased. In

TABLE V
T EST RESULT OF PROPOSED METHOD AND PETS WITH MORE LEARNING ITERATIONS IN THE POSITION - KEEPING TASK .

Disturbance Method Average [m] Median [m] Time [s] Success rate (final) Success rate (overall)
DroQ 13.0 ± 8.11 12.43 0.002 ± 0.0004 13.33% 4.44%
TD7 4.97 ± 2.91 4.57 0.002 ± 0.0002 84.0% 2.0%
Lv. 1 PETS 5.71 ± 11.67 2.71 0.24 ± 0.08 73.33% 34.67%
PNMPC 4.09 ± 4.94 2.48 0.24 ± 0.08 85.33% 42.0%
PNMPC-N 5.38 ± 8.33 2.59 0.21 ± 0.01 74.67% 37.33%
DroQ 21.37 ± 17.23 25.61 0.002 ± 0.0004 20.0% 0.0%
TD7 3.79 ± 2.48 3.31 0.002 ± 0.0003 93.33% 20.0%
Lv. 2 PETS 7.54 ± 11.09 3.75 0.24 ± 0.08 72.0% 18.67%
PNMPC 6.31 ± 8.13 3.10 0.25 ± 0.09 76.0% 26.67%
PNMPC-N 9.39 ± 13.43 4.99 0.21 ± 0.01 59.33% 12.67%
DroQ 30.96 ± 22.90 32.73 0.002 ± 0.0003 0.0% 0.0%
TD7 73.96 ± 46.26 75.48 0.002 ± 0.0002 0.0% 0.0%
Lv. 3 PETS 8.69 ± 20.03 3.91 0.24 ± 0.09 70.0% 26.0%
PNMPC 5.73 ± 7.73 3.53 0.25 ± 0.09 74.67% 27.33%
PNMPC-N 10.32 ± 14.19 5.62 0.21 ± 0.01 54.0% 6.0%

comparison, our method enjoyed a more robust learning pro- and achieved very low success rates in all levels’ disturbances.
cedure, with a smaller standard deviation in learning curves While TD7 performed very well under the disturbances of
and finally converged to superior average position offsets. In the first two levels, its control ability declined rapidly against
the target-reaching scenario, PNMPC and PETS had similar stronger disturbances. It was entirely incapable of completing
convergence behaviors. Their learning curves exhibited larger the task under the disturbances of level three.
standard deviations with increasing levels of disturbance.
In the targets-tracking scenario with the circle trajectory,
Unlike the significant differences in learning curves between
the proposed PNMPC also achieved control performance that
PETS and PNMPC shown in Fig. 4, Section IV-B, they
surpassed the baseline PETS. On average, it completed 8%
had similar learning curves in the scenarios where some
more tracking distance under three levels of disturbance while
observed states came from highly random distributions. We
significantly improving the quality of task completion: the
believed this phenomenon was caused by the neural networks’
success rate increased by about 9% and the failure rate
characteristics: they are highly effective in learning data with
decreased by about 10%. In the targets-tracking scenario
large distributions without being negatively affected by the
with the right trajectory that addressed long-term and high-
excessive update within a relatively local sample space.
speed driving, PNMPC demonstrated significant superiority
2) Testing Performances: The control performances of
over PETS. As the level of disturbance increased, our method
TD7, DroQ, PETS, PNMPC and PNMPC-N in the testing
improved the tracking distance by 12%, 17% and 16% and
were compared in Tables V, VI and VII to investigate the
enhanced the success rate of the task by 20%, 18% and 14%.
generalization capability of our method after the 100 inter-
Disabling the uncertainty in the probabilistic neural networks,
actions’ training. The bold font indicated the best perfor-
PNMPC-N had the worst performances in both circle and right
mance in the corresponding term. Despite the similar learning
trajectories. Without the guidance of the model, model-free RL
curves of FPMPC and PETS in Fig. 6, the proposed method
failed to learn this task, DroQ and TD7 could not even track
consistently showed advantages in the testing while PETS
the first target under all levels of disturbances. The results
failed to demonstrate sufficient generalization capability and
above demonstrated the superior control performance and
stability. In the position-keeping task, PNMPC outperformed
generalization capability of the proposed method in learning
PETS in all metrics related to control performances. Under the
USV control tasks with a large sample capacity.
first level of disturbance, it enjoyed over 10% improvement
in success rate (final) while reducing the average position Regarding computational complexity, the average decision-
offset by 28%. Although the two methods suffered decreases making time for all methods remained consistent with their
in success rate under the strongest level of disturbance, the performances in Section IV-B, irrespective of increased sam-
proposed method still maintained over 4% advantage and ple capacity. Although PNMPC’s uncertainty propagation is
achieved a 34% lower position offset than PETS. Without more complex than PETS due to additional dropout units,
the support of uncertainty propagation, PNMPC-N achieved we consider the slightly longer optimization time of ap-
close control performance and success rate to PETS after a proximately 0.01 s acceptable in practice. In contrast, all
long training process. As the disturbance level increased, the GP-based approaches were infeasible for these tasks, as the
performance gap between PNMPC-N and PETS also widened. optimization time per step with over 10, 000 samples could
For the model-free approaches, DroQ could not learn this task exceed 10 seconds. While the decision-making time of model-

TABLE VI
T EST RESULT OF PROPOSED METHOD AND PETS WITH MORE LEARNING ITERATIONS IN THE TARGETS - TRACKING TASK ( CIRCLE ).

Disturbance Method Average complete distance [m] Average complete ratio Time [s] Success rate Failure rate
DroQ −38.41 ± 33.32 −32.0% 0.002 ± 0.0004 0.0% 100.0%
TD7 −103.43 ± 12.51 −86.19% 0.002 ± 0.0002 0.0% 100.0%
Lv. 1 PETS 98.69 ± 26.59 82.24% 0.20 ± 0.07 62.0% 19.33%
PNMPC 104.16 ± 23.44 85.13% 0.22 ± 0.09 73.33% 10.0%
PNMPC-N 68.45 ± 77.31 57.04% 0.15 ± 0.004 34.0% 40.0%
DroQ −94.23 ± 47.66 −78.52% 0.002 ± 0.0003 0.0% 100.0%
TD7 −146.56 ± 56.10 −122.13% 0.002 ± 0.0003 0.0% 100.0%
Lv. 2 PETS 80.91 ± 41.43 67.43% 0.21 ± 0.09 45.33% 39.33%
PNMPC 95.36 ± 32.64 79.47% 0.23 ± 0.11 60.67% 22.0%
PNMPC-N 58.96 ± 44.13 49.14% 0.15 ± 0.004 19.33% 63.33%
DroQ −37.35 ± 29.21 −31.12% 0.002 ± 0.0003 0.0% 100.0%
TD7 −40.14 ± 68.37 −33.45% 0.002 ± 0.0003 0.0% 100.0%
Lv. 3 PETS 85.19 ± 36.89 71.0% 0.22 ± 0.10 42.67% 34.0%
PNMPC 93.98 ± 30.43 78.32% 0.22 ± 0.10 51.33% 24.67%
PNMPC-N 57.63 ± 45.72 48.02% 0.15 ± 0.004 19.33% 64.0%

TABLE VII
T EST RESULT OF PROPOSED METHOD AND PETS WITH MORE LEARNING ITERATIONS IN THE TARGETS - TRACKING TASK ( RIGHT ).

Disturbance Method Average complete distance [m] Average complete ratio Time [s] Success rate Failure rate
DroQ −38.35 ± 103.05 −19.18% 0.002 ± 0.0004 0.0% 93.33%
TD7 −73.97 ± 161.26 −36.99% 0.002 ± 0.0002 2.0% 90.67%
Lv. 1 PETS 135.89 ± 57.46 67.87% 0.22 ± 0.10 34.0% 40.67%
PNMPC 159.33 ± 53.27 79.67% 0.23 ± 0.10 54.67% 23.33%
PNMPC-N 110.33 ± 61.61 55.16% 0.15 ± 0.006 16.67% 52.0%
DroQ −84.18 ± 96.05 −42.09% 0.002 ± 0.0003 0.0% 100.0%
TD7 17.91 ± 37.45 −9.0% 0.002 ± 0.0003 0.0% 100.0%
Lv. 2 PETS 120.46 ± 66.25 60.23% 0.22 ± 0.09 27.33% 48.67%
PNMPC 155.53 ± 50.28 77.77% 0.23 ± 0.10 45.33% 23.33%
PNMPC-N 91.07 ± 65.83 45.53% 0.15 ± 0.007 12.0% 67.33%
DroQ −55.66 ± 26.17 −27.83% 0.002 ± 0.0004 0.0% 100.0%
TD7 −39.23 ± 43.25 −19.61% 0.002 ± 0.0003 0.0% 100.0%
Lv. 3 PETS 120.46 ± 66.25 60.23% 0.22 ± 0.09 27.33% 48.67%
PNMPC 152.43 ± 50.52 76.21% 0.23 ± 0.10 41.33% 26.0%
PNMPC-N 86.89 ± 62.03 43.45% 0.15 ± 0.007 4.67% 70.67%

free approaches DroQ and TD7 was about 1% of our method, bances compared with the results in Section IV-B. Both PETS
they demonstrated significant shortcomings in both control and PNMPC demonstrated reasonable prediction accuracy in
performance (average offset and driving distance) and robust- these states. On the other hand, PNMPC demonstrated sig-
ness (success and failure rates). These disadvantages became nificant advantages over PETS in both accuracy and stability
more pronounced as environmental disturbances intensified, when predicting the orientation of USV which was strongly
rendering these methods nearly incapable of completing tasks affected by the ocean disturbances. It consistently enjoyed far
under the highest level of disturbances. Consequently, we smaller prediction errors and lower standard deviation under
believed that the trade-off between computation time and various levels of disturbances. We believe these advantages
performance in our method was reasonable. mainly come from the proper uncertainty representation and
propagation in the proposed method. Once the uncertainty was
3) Model Accuracy: Figure 7 compared the average error disabled, the model prediction error of PNMPC-N quickly
and standard deviation between the first step prediction in the increased and led to a control performance inferior to PETS.
MPC policy and the observed state using the models learned
by PNMPC, PNMPC-N and PETS during the testing. It can
be observed that as the sample capacity increased, the large D. Case Study
prediction errors in USV position and velocity were partially In this subsection, a case study of position-keeping and
alleviated even under more random and challenging distur- targets-tracking scenarios using all compared approaches was

Position Keeping Circle Right

0.18 0.15 0.27 0.14 0.15 0.04 0.22 0.08 0.16 0.04 0.19 0.07
Lv. 3 0.64 0.28 0.64 0.31 1.21 0.33 1.41 0.32 1.23 0.38 1.92 0.53
0.17 0.08 0.28 0.12 0.15 0.04 0.24 0.08 0.16 0.04 0.21 0.09

0.15 0.05 0.23 0.10 0.14 0.04 0.22 0.08 0.16 0.04 0.19 0.07
0.64 0.28 0.64 0.25 1.27 0.35 1.43 0.39 1.22 0.39 1.93 0.49
Lv. 2
0.17 0.06 0.29 0.12 0.15 0.04 0.25 0.08 0.16 0.04 0.21 0.09

0.11 0.05 0.16 0.07 0.11 0.03 0.15 0.04 0.13 0.03 0.12 0.03
0.50 0.25 0.48 0.25 1.27 0.35 1.44 0.43 1.29 0.44 2.09 0.48
Lv. 1 0.13 0.05 0.18 0.07 0.13 0.03 0.16 0.05 0.14 0.03 0.13 0.04

0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 0 0.5 1 1.5 2 0 0.5 1 1.5 2 0 0.5 1 1.5 2 0 0.5 1 1.5 2 2.5 3
Average predicted error of position (X axis) [m] Average predicted error of position (Y axis) [m] Average predicted error of position (X axis) [m] Average predicted error of position (Y axis) [m] Average predicted error of position (X axis) [m] Average predicted error of position (Y axis) [m]
5.34 5.15 0.10 0.03 6.99 3.80 0.08 0.02 7.95 4.30 0.09 0.02
Lv. 3 7.27 2.70 0.75 0.26 10.53 2.00 1.15 0.22 10.72 2.27 1.22 0.26
3.23 1.20 0.10 0.03 4.24 1.00 0.09 0.02 4.48 1.54 0.10 0.03

4.93 3.68 0.10 0.03 7.47 4.35 0.08 0.02 7.95 4.30 0.09 0.02
7.37 2.86 0.74 0.26 10.50 1.90 1.16 0.21 10.85 2.31 1.24 0.27
Lv. 2 3.26 1.38 4.34 0.80
0.10 0.04 0.09 0.02 4.47 1.54 0.10 0.03

5.85 4.51 0.07 0.03 7.22 4.02 0.06 0.01 9.08 4.36 0.06 0.02
7.50 3.38 0.60 0.29 11.25 1.86 1.24 0.24 11.59 2.35 1.36 0.30
Lv. 1 3.05 1.11 0.08 0.03 4.10 0.75 0.06 0.02 4.23 1.25 0.07 0.02

0 2 4 6 8 10 12 0 0.2 0.4 0.6 0.8 1 1.2 0 5 10 15 0 0.5 1 1.5 0 5 10 15 0 0.5 1 1.5 2

PETS PNMPC-N PNMPC (ours)

Fig. 7. Average prediction errors of the model learned by PNMPC and PETS in position-keeping and targets-tracking tasks with large sample capacity during
the testing.

conducted to illustrate the control behavior learned by the too fast when tracking the second target. The subsequent
proposed method. After learning in 20 iterations with the same violent swaying in orientation further prevented the USV
random seed, all approaches were tested in one rollout with the from resetting, ultimately leading to the failure of the task.
same random distances. The testing rollout including the USV As a comparison, PNMPC quickly tracked all targets and
state and action trajectories were shown in Figs. 8, 9 and 10. stabilized at the final target within approximately 40 steps.
The boat shapes indicated the states of USV at time steps 1 to The effective uncertainty propagation in our method resulted
100 (blue to red). The yellow stars indicated tracking targets. in prompt feedback to the changing disturbances using the
The predicted values and the observed states were shown in MPC policy and therefore contributed to smooth and seamless
red and blue respectively. control trajectories of both USV position and orientation. The
In the position-keeping task shown in Fig. 8, all com- proper cooperation of the rudder and engine throttle drove the
pared approaches were able to maintain the USV position USV to quickly pass through all targets at a relatively high
within a range of 7 meters to the initial position at the last speed and finally stop steadily at the last target point.
step. However, compared to the GP model-based approaches According to Fig. 10, PNMPC had similar advantages in
SPMPC and FPMPC which resulted in a wider range of USV the long-distance tracking task (right trajectory). Based on
position and orientation, the proposed PNMPC consistently the traditional probabilistic neural networks model, the MPC
kept the position of USV as close to the initial position as policy of PETS failed to promptly respond to the changing
possible, except for necessary adjustments of orientation. The disturbances, resulting in the USV stalling near the fourth
proposed probabilistic neural networks accurately predicted target. In comparison to the GP model-based approaches that
the orientation of the USV, which contributed to a smooth excessively adjusted the USV velocity to sequentially track
trajectory of orientation with the smallest fluctuation range, multiple targets due to the large predicted error of direction
helping the USV to maintain its position under disturbances. and velocity, our method significantly reduced the frequency
Meanwhile, the neural networks-based PETS could not achieve of challenging throttle and contributed to superior efficiency
similar control performances. Compared to other GP model- and stability in the tracking task. Once the final target was
based baselines, the neural networks model in PETS sacrificed reached, PNMPC properly controlled the rudder and throttle,
model accuracy. It deviated far from the initial position with stably holding the USV’s position.
larger position offsets than our method. In terms of the states
and actions trajectories, thanks to its superior prediction accu- E. Discussion of PNMPC and its Implementation
racy of USV orientation than the GP model-based approaches, The experimental results above demonstrated the significant
PNMPC effectively controlled the USV heading while avoid- advantages of PNMPC over current state-of-the-art model-free
ing the fluctuation of orientations. Additionally, with a longer approaches in USV control scenarios. The agent of model-
prediction horizon in MPC policy, PNMPC always conducted free RL based on Actor-Critic networks attempts to sacrifice
smaller rudder and throttle commands, resulting in more the stability of the current system in pursuit of maximizing
precise control behaviors. the accumulated rewards in the long term, leading to loss
In the targets-tracking scenario with the circle trajectory of control in disturbed environments. In contrast, PNMPC
as shown in Fig. 9, PNMPC demonstrated superior and ef- achieved a superior balance between accumulated rewards and
fective control behavior compared with other baselines. With system stability in a short horizon via the MPC controller,
redundant USV trajectories, SPMPC and FPMPC remained making it more suitable for unmanned ship control. However,
in unstable position-keeping after tracking all targets. PETS it faces a heavier computational burden compared to DroQ and
was unable to complete the task at all. The large predicted TD7, while its performance is highly dependent on the MPC
error of velocity in the early stage caused the USV to move horizon and the reliability of the learned model.

Position Keeping

SPMPC FPMPC PETS PNMPC (ours)

30 30 30 30

15 15 15 15
Rudder [°]

Rudder [°]
Rudder [°]

Rudder [°]

0 0 0 0

-15 -15 -15 -15

-30 -30 -30 -30

0 10 20 30 40 50 60 70 80 90 100 0 10 20 30 40 50 60 70 80 90 100 0 10 20 30 40 50 60 70 80 90 100 0 10 20 30 40 50 60 70 80 90 100
Time step Time step Time step Time step
100 100 100 100

Throttle [%]
Throttle [%]

50 50
Throttle [%]

Throttle [%]

50 50

0 0 0 0

-50 -50 -50 -50

-100 -100 -100 -100

0 10 20 30 40 50 60 70 80 90 100 0 10 20 30 40 50 60 70 80 90 100 0 10 20 30 40 50 60 70 80 90 100 0 10 20 30 40 50 60 70 80 90 100
Time step Time step Time step Time step

Fig. 8. One test rollout with state and action trajectories of all compared approaches in position-keeping under the third level of disturbance.

For applications, PNMPC can be directly implemented in the probabilistic model of PNMPC with the Actor-Critic
a real-world USV system following [28] while achieving a networks to balance the long-term reward and the current
superior control frequency independent of the size of samples. system stability is also a meaningful direction.
Although the stability of probabilistic neural networks has
been less discussed in related works [35], [36], the proposed V. C ONCLUSIONS
method provides a practical mechanism to alleviate unknown
machine faults and unexpected system behaviors in USV This article proposed PNMPC, a novel probabilistic MBRL
control scenarios: these anomalies may lead to an increase in approach specialized for USV control. It tackled the compu-
the standard deviation of the model predictions, signaling the tational efficiency issue present in current GP model-based
system to promptly halt autonomous decision-making, thereby MBRL USV approaches. Within one MBRL framework, PN-
preventing potentially dangerous situations. In non-ideal sce- MPC utilized deep neural networks to properly model the
narios, such as when observed states are missing or corrupted uncertain dynamics of the USV from a probabilistic per-
by noise, related MBRL approaches GP-MPC, FPMPC, and spective and incorporated an MPC policy to robustly control
PETS often predict states with inaccurate distributions, leading the USV against various external disturbances. A novel loss
to accumulated bias in subsequent steps. As a comparison, our function was designed to emphasize the characteristics of
method leverages a two-step loss function in model updating USV dynamics in continuous time steps to properly train the
and a mean-based propagation in decision-making to tackle probabilistic neural networks model. The effectiveness of the
this issue, enhancing the stability of control behavior. proposed in modeling USV dynamics and learning control
strategies was evaluated across various USV scenarios. Com-
In terms of scalability, employing other Bayesian neural net- pared to recent GP model-based MBRL approaches, PNMPC
works in the proposed method was straightforward. However, offers computational complexity that is independent of sample
traditional Bayesian networks mainly designed for supervised capacity and plans more optimal USV driving trajectories.
learning usually had numerous parameters and struggled to Compared with existing work PETS based on neural networks
efficiently express uncertainties [43], [44]. We believed the model, PNMPC achieves higher model prediction accuracy
probabilistic neural networks in PNMPC that focused on stably and more responsive control behavior against external dis-
and efficiently modeling system dynamics were more suitable turbances, breaking through limitations of control capability
for the real-time control of USV. Algorithmically, integrating traditionally encountered with probabilistic neural networks

Targets-tracking (circle)

SPMPC FPMPC PETS PNMPC (ours)

40 40
Position in X [m]

Position in X [m]
20 20

0 0

-20 -20

-40 -40
0 10 20 30 40 50 60 70 80 90 100 0 10 20 30 40 50 60 70 80 90 100
Time step Time step
60 60
Position in Y [m]

Position in Y [m]

40 40

20 20

0 0

0 10 20 30 40 50 60 70 80 90 100 0 10 20 30 40 50 60 70 80 90 100
Time step Time step

100 100
Direction [°]
Direction [°]

0 0

-100 -100

0 10 20 30 40 50 60 70 80 90 100 0 10 20 30 40 50 60 70 80 90 100
Time step Time step
5 5
Velocity [m/s]

Velocity [m/s]

0 0

-5 -5
0 10 20 30 40 50 60 70 80 90 100 0 10 20 30 40 50 60 70 80 90 100
Time step Time step
30 30 30 30

15 15 15 15
Rudder [°]

Rudder [°]

Rudder [°]
0 0 0 0

-15 -15 -15 -15

-30 -30 -30 -30

0 10 20 30 40 50 60 70 80 90 100 0 10 20 30 40 50 60 70 80 90 100 0 10 20 30 40 50 60 70 80 90 100 0 10 20 30 40 50 60 70 80 90 100
Time step Time step Time step Time step
100 100 100 100
Throttle [%]

Throttle [%]

Throttle [%]
Throttle [%]

50 50 50 50

0 0 0 0

-50 -50 -50 -50

-100 -100 -100 -100

0 10 20 30 40 50 60 70 80 90 100 0 10 20 30 40 50 60 70 80 90 100 0 10 20 30 40 50 60 70 80 90 100 0 10 20 30 40 50 60 70 80 90 100
Time step Time step Time step Time step

Fig. 9. One test rollout with state and action trajectories of all compared approaches in targets-tracking (circle) under the third level of disturbance.

in engineering scenarios. This work expands the potential [8] United Nations Conference on Trade and Development, Review of
of probabilistic neural network model-based reinforcement Maritime Transport 2018. 2018.
[9] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction.
learning towards a fully autonomous USV. MIT press Cambridge, 2018.
[10] Z. Liu, Q. Liu, L. Tang, K. Jin, H. Wang, M. Liu, and H. Wang, “Vi-
R EFERENCES suomotor reinforcement learning for multirobot cooperative navigation,”
IEEE Transactions on Automation Science and Engineering, vol. 19,
[1] M. J. Er, C. Ma, T. Liu, and H. Gong, “Intelligent motion control no. 4, pp. 3234–3245, 2022.
of unmanned surface vehicles: A critical review,” Ocean Engineering, [11] Z. Yan, A. R. Kreidieh, E. Vinitsky, A. M. Bayen, and C. Wu, “Unified
vol. 280, p. 114562, 2023. automatic control of vehicular systems with reinforcement learning,”
[2] Y. Qiao, J. Yin, W. Wang, F. Duarte, J. Yang, and C. Ratti, “Survey of IEEE Transactions on Automation Science and Engineering, vol. 20,
deep learning for autonomous surface vehicles in marine environments,” no. 2, pp. 789–804, 2023.
IEEE Transactions on Intelligent Transportation Systems, vol. 24, no. 4, [12] Y. Zhao, X. Qi, Y. Ma, Z. Li, R. Malekian, and M. A. Sotelo,
pp. 3678–3701, 2023. “Path following optimization for an underactuated usv using smoothly-
[3] J. Mcmahon and E. Plaku, “Autonomous data collection with dy- convergent deep reinforcement learning,” IEEE Transactions on Intelli-
namic goals and communication constraints for marine vehicles,” IEEE gent Transportation Systems, vol. 22, no. 10, pp. 6208–6220, 2021.
Transactions on Automation Science and Engineering, vol. 20, no. 3, [13] N. Wang, Y. Gao, H. Zhao, and C. K. Ahn, “Reinforcement learning-
pp. 1607–1620, 2023. based optimal tracking control of an unknown unmanned surface ve-
[4] Y. Yu, C. Guo, and H. Yu, “Finite-time plos-based integral sliding- hicle,” IEEE Transactions on Neural Networks and Learning Systems,
mode adaptive neural path following for unmanned surface vessels with vol. 32, no. 7, pp. 3034–3045, 2021.
unknown dynamics and disturbances,” IEEE Transactions on Automation
Science and Engineering, vol. 16, no. 4, pp. 1500–1511, 2019. [14] N. Wang, Y. Gao, and X. Zhang, “Data-driven performance-prescribed
reinforcement learning control of an unmanned surface vehicle,” IEEE
[5] N. Yang, D. Chang, M. Johnson-Roberson, and J. Sun, “Energy-optimal
Transactions on Neural Networks and Learning Systems, vol. 32, no. 12,
control for autonomous underwater vehicles using economic model
pp. 5456–5467, 2021.
predictive control,” IEEE Transactions on Control Systems Technology,
vol. 30, no. 6, pp. 2377–2390, 2022. [15] A. Heiberg, T. N. Larsen, E. Meyer, A. Rasheed, O. San, and D. Varag-
[6] B.-O. H. Eriksen, M. Breivik, E. F. Wilthil, A. L. Flåten, and E. F. nolo, “Risk-based implementation of colregs for autonomous surface
Brekke, “The branching-course model predictive control algorithm for vehicles using deep reinforcement learning,” Neural Networks, vol. 152,
maritime collision avoidance,” Journal of Field Robotics, vol. 36, no. 7, pp. 17–33, 2022.
pp. 1222–1249, 2019. [16] F. Huang, J. Xu, L. Yin, D. Wu, Y. Cui, Z. Yan, and T. Chen, “A general
[7] L. Liu, D. Wang, Z. Peng, and Q.-L. Han, “Distributed path following motion control architecture for an autonomous underwater vehicle with
of multiple under-actuated autonomous surface vehicles based on data- actuator faults and unknown disturbances through deep reinforcement
driven neural predictors via integral concurrent learning,” IEEE Trans- learning,” Ocean Engineering, vol. 263, p. 112424, 2022.
actions on Neural Networks and Learning Systems, vol. 32, no. 12, [17] W. Gan, X. Qu, D. Song, and P. Yao, “Multi-usv cooperative chasing
pp. 5334–5344, 2021. strategy based on obstacles assistance and deep reinforcement learning,”

Targets-tracking (right)

SPMPC FPMPC PETS PNMPC (ours)

30 30 30 30

15 15 15 15
Rudder [°]

Rudder [°]
Rudder [°]
Rudder [°]

0 0 0 0

-15 -15 -15 -15

-30 -30 -30 -30

Throttle [%]
Throttle [%]

Throttle [%]

50 50 50 50

0 0 0 0

-50 -50 -50 -50

-100 -100 -100 -100

0 10 20 30 40 50 60 70 80 90 100 0 10 20 30 40 50 60 70 80 90 100 0 10 20 30 40 50 60 70 80 90 100 0 10 20 30 40 50 60 70 80 90 100
Time step Time step Time step Time step

Fig. 10. One test rollout with state and action trajectories of all compared approaches in targets-tracking (right) under the third level of disturbance.

IEEE Transactions on Automation Science and Engineering, pp. 1–16, with probabilistic model predictive control,” in International Conference
2023. on Artificial Intelligence and Statistics, pp. 1701–1710, 2018.
[18] I. Masmitja, M. Martin, T. O’Reilly, B. Kieft, N. Palomeras, J. Navarro, [27] Y. Cui, S. Osaki, and T. Matsubara, “Reinforcement learning boat au-
and K. Katija, “Dynamic robotic tracking of underwater targets using topilot: A sample-efficient and model predictive control based approach,”
reinforcement learning,” Science Robotics, vol. 8, no. 80, p. eade7811, in 2019 IEEE/RSJ International Conference on Intelligent Robots and
2023. Systems (IROS), pp. 2868–2875, 2019.
[19] R. Chai, A. Tsourdos, S. Chai, Y. Xia, A. Savvaris, and C. L. P. Chen, [28] Y. Cui, S. Osaki, and T. Matsubara, “Autonomous boat driving system
“Multiphase overtaking maneuver planning for autonomous ground using sample-efficient model predictive control-based reinforcement
vehicles via a desensitized trajectory optimization approach,” IEEE learning approach,” Journal of Field Robotics, vol. 38, no. 3, pp. 331–
Transactions on Industrial Informatics, vol. 19, no. 1, pp. 74–87, 2023. 354, 2021.
[20] R. Chai, H. Niu, J. Carrasco, F. Arvin, H. Yin, and B. Lennox, [29] R. McAllister and C. E. Rasmussen, “Data-efficient reinforcement
“Design and experimental validation of deep reinforcement learning- learning in continuous state-action gaussian-pomdps,” in Advances in
based fast trajectory planning and control for mobile robot in unknown Neural Information Processing Systems, pp. 2040–2049, 2017.
environment,” IEEE Transactions on Neural Networks and Learning [30] Y. Cui, L. Peng, and H. Li, “Filtered probabilistic model predictive
Systems, vol. 35, no. 4, pp. 5778–5792, 2024. control-based reinforcement learning for unmanned surface vehicles,”
[21] C. E. Rasmussen and C. K. Williams, Gaussian processes for machine IEEE Transactions on Industrial Informatics, vol. 18, no. 10, pp. 6950–
learning, vol. 1. MIT press Cambridge, 2006. 6961, 2022.
[31] E. Snelson and Z. Ghahramani, “Sparse Gaussian processes using
[22] J. Meng, Y. Liu, R. Bucknall, W. Guo, and Z. Ji, “Anisotropic gpmp2: pseudo-inputs,” in Advances in neural information processing systems,
A fast continuous-time gaussian processes based motion planner for pp. 1257–1264, 2006.
unmanned surface vehicles in environments with ocean currents,” IEEE
[32] Y. Cui, W. Shi, H. Yang, C. Shao, L. Peng, and H. Li, “Probabilistic
Transactions on Automation Science and Engineering, vol. 19, no. 4,
model-based reinforcement learning unmanned surface vehicles using
pp. 3914–3931, 2022.
local update sparse spectrum approximation,” IEEE Transactions on
[23] M. P. Deisenroth, D. Fox, and C. E. Rasmussen, “Gaussian processes Industrial Informatics, vol. 20, no. 2, pp. 1283–1293, 2024.
for data-efficient learning in robotics and control,” IEEE Transactions [33] R. Chai, A. Tsourdos, A. Savvaris, S. Chai, Y. Xia, and C. L. P. Chen,
on Pattern Analysis & Machine Intelligence, vol. 37, no. 2, pp. 408–423, “Design and implementation of deep neural network-based control for
2013. automatic parking maneuver process,” IEEE Transactions on Neural
[24] R. Chai, A. Tsourdos, H. Gao, Y. Xia, and S. Chai, “Dual-loop tube- Networks and Learning Systems, vol. 33, no. 4, pp. 1400–1413, 2022.
based robust model predictive attitude tracking control for spacecraft [34] R. Chai, D. Liu, T. Liu, A. Tsourdos, Y. Xia, and S. Chai, “Deep
with system constraints and additive disturbances,” IEEE Transactions learning-based trajectory planning and control for autonomous ground
on Industrial Electronics, vol. 69, no. 4, pp. 4022–4033, 2021. vehicle parking maneuver,” IEEE Transactions on Automation Science
[25] R. Chai, A. Tsourdos, H. Gao, S. Chai, and Y. Xia, “Attitude tracking and Engineering, vol. 20, no. 3, pp. 1633–1647, 2023.
control for reentry vehicles using centralised robust model predictive [35] Y. Gal, R. McAllister, and C. E. Rasmussen, “Improving PILCO with
control,” Automatica, vol. 145, p. 110561, 2022. bayesian neural network dynamics models,” in Data-efficient machine
[26] S. Kamthe and M. Deisenroth, “Data-efficient reinforcement learning learning workshop, ICML, vol. 4, p. 25, 2016.

[36] K. Chua, R. Calandra, R. McAllister, and S. Levine, “Deep reinforce- Huiyun Li (Senior Member, IEEE) received the
ment learning in a handful of trials using probabilistic dynamics models,” [Link] degree in Electronic Engineering from
in Advances in neural information processing systems (NIPS), 2018. Nanyang Technological University in 2001, and the
[37] D. Sarkar, M. A. Osborne, and T. A. Adcock, “Prediction of tidal Ph.D degree from University of Cambridge, UK,
currents using bayesian machine learning,” Ocean Engineering, vol. 158, in 2006. She is now a professor at Shenzhen In-
pp. 221–231, 2018. stitutes of Advanced Technology, Chinese Academy
[38] A. Girard, C. E. Rasmussen, J. Q. Candela, and R. Murray-Smith, of Sciences and The Chinese University of Hong
“Gaussian process priors with uncertain inputs application to multiple- Kong. Her research interests automobile electronics
step ahead time series forecasting,” in Advances in neural information and autonomous vehicles.
processing systems (NIPS), pp. 545–552, 2003.
[39] T. Hiraoka, T. Imagawa, T. Hashimoto, T. Onishi, and Y. Tsuruoka,
“Dropout q-functions for doubly efficient reinforcement learning,” in
International Conference on Learning Representations (ICLR), 2022.
[40] S. Fujimoto, W.-D. Chang, E. Smith, S. Gu, D. Precup, and D. Meger,
“For sale: State-action representation learning for deep reinforcement
learning,” in Advances in Neural Information Processing Systems
(NeurIPS), vol. 36, pp. 61573–61624, 2023.
[41] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan,
T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf,
E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner,
L. Fang, J. Bai, and S. Chintala, “Pytorch: An imperative style, high-
performance deep learning library,” in Advances in neural information
processing systems (NIPS), pp. 8024–8035, 2019.
[42] M. J. Powell, “The bobyqa algorithm for bound constrained optimization
without derivatives,” Cambridge NA Report NA2009/06, University of
Cambridge, vol. 26, pp. 1–39, 2009.
[43] I. Osband, “Risk versus uncertainty in deep learning: Bayes, bootstrap
and the dangers of dropout,” in NIPS workshop on bayesian deep
learning, vol. 192, MIT Press, 2016.
[44] R. Egele, R. Maulik, K. Raghavan, B. Lusch, I. Guyon, and P. Bal-
aprakash, “Autodeuq: Automated deep ensemble with uncertainty quan-
tification,” in 2022 26th International Conference on Pattern Recognition
(ICPR), pp. 1908–1914, IEEE, 2022.

Xinyu Wu (Senior Member, IEEE) received the

B.E. and M.E. degrees from the Department of Au-
tomation, University of Science and Technology of
Wenjun Huang received the B.E. degree in in- China, Hefei, China, in 2001 and 2004, respectively.
dustrial engineering from the Harbin University of He is currently a Professor with Shenzhen Institutes
Commerce, Harbin, China, in 2022. He is currently of Advanced Technology, Shenzhen, China. and the
working toward his [Link]. degree in computer tech- Director of Center for Intelligent Bionic. His Ph.D.
nology at Shenzhen Institute of Advanced Tech- degree was awarded at the Chinese University of
nology, Chinese Academy of Science, Shenzhen, Hong Kong in 2008. He has authored or co-authored
China. His current research interests include deep more than 180 papers and two monographs. His
reinforcement learning, model-based reinforcement research interests include computer vision, robotics,
learning and robot control. and intelligent systems. Prof. Wu is an Associate Editor for IEEE Transactions
on Systems Man Cybernetics-Systems, IEEE Transactions on Automation
Science and Engineering, and IEEE Robotics and Automation Letters.

Yunduan Cui (Member, IEEE) received Ph.D. in

computer science from Nara Institute of Science
and Technology, Japan in 2017, M.E in computer
science from Doshisha University, Japan in 2014,
and the B.E. in Electronic Engineering from Xidian
University, China in 2012. From 2017 to 2020, he
was a post-doctor researcher in Nara Institute of
Science and Technology, Japan. He is currently an
Associate Professor at Shenzhen Institute of Ad-
vanced Technology, Chinese Academy of Sciences,
China. His research focuses on machine learning,
especially reinforcement learning-driven unmanned systems.

Cui 等 - 2022 - Filtered Probabilistic Model Predictive Control-Ba
No ratings yet
Cui 等 - 2022 - Filtered Probabilistic Model Predictive Control-Ba
11 pages
A Safe Reinforcement Learning Driven Weights-Varying Model Predictive Control For Autonomous Vehicle Motion Control
No ratings yet
A Safe Reinforcement Learning Driven Weights-Varying Model Predictive Control For Autonomous Vehicle Motion Control
8 pages
Jmse 12 00676
No ratings yet
Jmse 12 00676
26 pages
Incorporating Recurrent Reinforcement Learning Into Model Predictive
No ratings yet
Incorporating Recurrent Reinforcement Learning Into Model Predictive
7 pages
Path Following Control For Unmanned Surface Vehicles A Reinforcement Learning-Based Method With Experimental Validation
No ratings yet
Path Following Control For Unmanned Surface Vehicles A Reinforcement Learning-Based Method With Experimental Validation
14 pages
Finite-Time Control for USVs
No ratings yet
Finite-Time Control for USVs
12 pages
2772 Blending MPC Value Function AP
No ratings yet
2772 Blending MPC Value Function AP
16 pages
Autonomous Navigation of AGVs in Unknown Cluttered Environments Log MPPI Control Strategy
No ratings yet
Autonomous Navigation of AGVs in Unknown Cluttered Environments Log MPPI Control Strategy
8 pages
Learning Model Predictive Control For Quadrotors
No ratings yet
Learning Model Predictive Control For Quadrotors
7 pages
Auv RL
No ratings yet
Auv RL
11 pages
Adaptive Stochastic Nonlinear Model Predictive Control With Look-Ahead Dee
No ratings yet
Adaptive Stochastic Nonlinear Model Predictive Control With Look-Ahead Dee
8 pages
Pan 2022 J. Phys. Conf. Ser. 2203 012058
No ratings yet
Pan 2022 J. Phys. Conf. Ser. 2203 012058
7 pages
2021 3e Improvement of PMSM Control Using Reinforcement Learning Deep Deterministic Policy Gradient Agent
No ratings yet
2021 3e Improvement of PMSM Control Using Reinforcement Learning Deep Deterministic Policy Gradient Agent
6 pages
1 s2.0 S0925231225018053 Main
No ratings yet
1 s2.0 S0925231225018053 Main
10 pages
The Actor-Dueling-Critic Method
No ratings yet
The Actor-Dueling-Critic Method
20 pages
Survey of Model-Based Reinforcement Learning: Applications On Robotics
No ratings yet
Survey of Model-Based Reinforcement Learning: Applications On Robotics
21 pages
Algorithms 17 00014 v2
No ratings yet
Algorithms 17 00014 v2
15 pages
Safety-Critical NMPC for Mobile Robots
No ratings yet
Safety-Critical NMPC for Mobile Robots
12 pages
Improvement of PMSM Control Using Reinforcement Learning Deep Deterministic Policy Gradient Agent
No ratings yet
Improvement of PMSM Control Using Reinforcement Learning Deep Deterministic Policy Gradient Agent
6 pages
Controller Design For Electrical Drives by Deep Reinforcement Learning A Proof of Concept
No ratings yet
Controller Design For Electrical Drives by Deep Reinforcement Learning A Proof of Concept
8 pages
Good
No ratings yet
Good
10 pages
Reinforcement Learning-Based Event-Triggered Model Predictive Control For Autonomous Vehicle Path Following
No ratings yet
Reinforcement Learning-Based Event-Triggered Model Predictive Control For Autonomous Vehicle Path Following
6 pages
Real-Time Neural MPC Deep Learning Model Predictive Control For Quadrotors and Agile Robotic Platforms
No ratings yet
Real-Time Neural MPC Deep Learning Model Predictive Control For Quadrotors and Agile Robotic Platforms
8 pages
RLC Project
No ratings yet
RLC Project
13 pages
4001 Where To Go Next Learning A Subgoal Recommendation Policy For Navigation Among Pedestrians
No ratings yet
4001 Where To Go Next Learning A Subgoal Recommendation Policy For Navigation Among Pedestrians
8 pages
RL and MPC Techniques for Biped Walking
No ratings yet
RL and MPC Techniques for Biped Walking
2 pages
Controlling An Autonomous Vehicle With Deep Reinforcement Learning
No ratings yet
Controlling An Autonomous Vehicle With Deep Reinforcement Learning
7 pages
Design Development Paper (Theoretical)
No ratings yet
Design Development Paper (Theoretical)
16 pages
Multi-Thruster Unmanned Surface Vehicle Dynamic Positioning Control Based On Optimal Active Disturbance Rejection Control
No ratings yet
Multi-Thruster Unmanned Surface Vehicle Dynamic Positioning Control Based On Optimal Active Disturbance Rejection Control
6 pages
AUV Navigation with Deep Learning
No ratings yet
AUV Navigation with Deep Learning
17 pages
A Reinforcement Learning Agent For UAV Control Mathematical Foundations, Implementation, and Human Vs AI Benchmarking
No ratings yet
A Reinforcement Learning Agent For UAV Control Mathematical Foundations, Implementation, and Human Vs AI Benchmarking
13 pages
Low-Level Control of A Quadrotor With Deep
No ratings yet
Low-Level Control of A Quadrotor With Deep
7 pages
Robotics Control with MuJoCo
No ratings yet
Robotics Control with MuJoCo
14 pages
Theoratical Research Paper
No ratings yet
Theoratical Research Paper
7 pages
Using Q-Learning To Automatically Tune Quadcopter PID Controller Online For Fast Altitude Stabilization
No ratings yet
Using Q-Learning To Automatically Tune Quadcopter PID Controller Online For Fast Altitude Stabilization
6 pages
MPC-Based Motion Planning for USVs
No ratings yet
MPC-Based Motion Planning for USVs
11 pages
Model-Based Reinforcement Learning Overview
No ratings yet
Model-Based Reinforcement Learning Overview
56 pages
Cutler16 ICRA Final Submission
No ratings yet
Cutler16 ICRA Final Submission
7 pages
Benchmarking Deep Reinforcement Learning For Continuous Control
No ratings yet
Benchmarking Deep Reinforcement Learning For Continuous Control
14 pages
Learning Adaptive Control of A UUV Using A Bio-Inspired Experience Replay Mechanism
No ratings yet
Learning Adaptive Control of A UUV Using A Bio-Inspired Experience Replay Mechanism
14 pages
Computationally Efficient Gauss-Newton Reinforcement Learning For Model Predictive Control
No ratings yet
Computationally Efficient Gauss-Newton Reinforcement Learning For Model Predictive Control
14 pages
自适应神经网络迭代滑膜航向控制sci
No ratings yet
自适应神经网络迭代滑膜航向控制sci
15 pages
Reinforcement Learning For Robust Missile Autopilot Design
No ratings yet
Reinforcement Learning For Robust Missile Autopilot Design
10 pages
Real-Time Nonlinear MPC with GPUs
No ratings yet
Real-Time Nonlinear MPC with GPUs
8 pages
Neural Network MPC
No ratings yet
Neural Network MPC
8 pages
Reinforcement Learning Control of An Aerial Robot Based On A Tuned Proximal Policy Optimization in Takeoff and Hover Phases
No ratings yet
Reinforcement Learning Control of An Aerial Robot Based On A Tuned Proximal Policy Optimization in Takeoff and Hover Phases
7 pages
Reinforcement Learning in Robotics
No ratings yet
Reinforcement Learning in Robotics
3 pages
UAV Control with Reinforcement Learning
No ratings yet
UAV Control with Reinforcement Learning
16 pages
Reinforcement Learning For Robotics Advance
No ratings yet
Reinforcement Learning For Robotics Advance
2 pages
Saturated-command-Deviation Based Finite-Time Adaptive Control For DP of Usv
No ratings yet
Saturated-command-Deviation Based Finite-Time Adaptive Control For DP of Usv
11 pages
Guzman 22 A
No ratings yet
Guzman 22 A
12 pages
Action Guidance-Based Deep Interactive Reinforcement Learning For AUV Path Planning
No ratings yet
Action Guidance-Based Deep Interactive Reinforcement Learning For AUV Path Planning
8 pages
Safe Learning in Robotics - From Learning-Based Control To Safe Reinforcement Learning
No ratings yet
Safe Learning in Robotics - From Learning-Based Control To Safe Reinforcement Learning
36 pages
Shield Model Predictive Path Integral: A Computationally Efficient Robust MPC Approach Using Control Barrier Functions
No ratings yet
Shield Model Predictive Path Integral: A Computationally Efficient Robust MPC Approach Using Control Barrier Functions
8 pages
Autonomous Decision-Making Generation of UAV Based On Soft Actor-Critic Algorithm-1
No ratings yet
Autonomous Decision-Making Generation of UAV Based On Soft Actor-Critic Algorithm-1
6 pages
Sensors 24 07313 With Cover
No ratings yet
Sensors 24 07313 With Cover
51 pages
A3 Problem Solving Questions 1 1
No ratings yet
A3 Problem Solving Questions 1 1
2 pages
EREM MTech Brochure Aug 2025
No ratings yet
EREM MTech Brochure Aug 2025
14 pages
Sociological Imagination
100% (2)
Sociological Imagination
3 pages
Inclusive Schools in Chennai List
50% (4)
Inclusive Schools in Chennai List
4 pages
The Importance of Technology To 21st Century Learners Essay
100% (4)
The Importance of Technology To 21st Century Learners Essay
2 pages
Making Wise Decisions for Jehovah
No ratings yet
Making Wise Decisions for Jehovah
1 page
CSC 325 AI Lecture08 Supervised Learning
No ratings yet
CSC 325 AI Lecture08 Supervised Learning
32 pages
OUP Readers 2016
No ratings yet
OUP Readers 2016
7 pages
Woolley Malone HBR2011
No ratings yet
Woolley Malone HBR2011
4 pages
9 Sinif Noyabr 2025-2026
No ratings yet
9 Sinif Noyabr 2025-2026
3 pages
PPT - Consumer Learning Process
No ratings yet
PPT - Consumer Learning Process
32 pages
Jake S Resume
No ratings yet
Jake S Resume
1 page
Information For Parents About The Staar Test
No ratings yet
Information For Parents About The Staar Test
2 pages
IIT Bombay Guide For Visitors: Formalities&Essentials
No ratings yet
IIT Bombay Guide For Visitors: Formalities&Essentials
4 pages
(Ebook PDF) Financial and Management Accounting 8th Edition Download
100% (6)
(Ebook PDF) Financial and Management Accounting 8th Edition Download
53 pages
Burke's Metaphor For The Unending Conversation 2
No ratings yet
Burke's Metaphor For The Unending Conversation 2
3 pages
Quiz 1 Research in Daily Life 2
No ratings yet
Quiz 1 Research in Daily Life 2
3 pages
BTech MNC Resume Template
No ratings yet
BTech MNC Resume Template
2 pages
Weekly Plan For 3 Years Old
No ratings yet
Weekly Plan For 3 Years Old
40 pages
Delegate Angel Salazar Jr. Memorial School
No ratings yet
Delegate Angel Salazar Jr. Memorial School
4 pages
Social Studies Lesson For Portfolio
No ratings yet
Social Studies Lesson For Portfolio
4 pages
Samiha Ahmed CV
No ratings yet
Samiha Ahmed CV
1 page
Bài tập 1_ Viết lại câu sử dụng let’s sao cho nghĩa không thay đổi
No ratings yet
Bài tập 1_ Viết lại câu sử dụng let’s sao cho nghĩa không thay đổi
4 pages
Cognitive Therapy Scale Rating Manual
No ratings yet
Cognitive Therapy Scale Rating Manual
24 pages
IELTS Listening
No ratings yet
IELTS Listening
7 pages
Dementia Beyond Drugs: Changing The Culture of Care, Second Edition (Excerpt)
No ratings yet
Dementia Beyond Drugs: Changing The Culture of Care, Second Edition (Excerpt)
8 pages
DB Report Dharma (01) Paper
No ratings yet
DB Report Dharma (01) Paper
15 pages
Jmar I. Almazan: School Grade Level Teacher Learning Area Date Quarter Second Sections
No ratings yet
Jmar I. Almazan: School Grade Level Teacher Learning Area Date Quarter Second Sections
2 pages
TUCL Online Resources Guide
100% (1)
TUCL Online Resources Guide
31 pages
Cultural Captain Responsibilities in School
No ratings yet
Cultural Captain Responsibilities in School
6 pages

Effective Probabilistic Neural Networks Model For Model-Based Reinforcement Learning USV

Uploaded by

Effective Probabilistic Neural Networks Model For Model-Based Reinforcement Learning USV

Uploaded by

This article has been accepted for publication in IEEE Transactions on Automation Science and Engineering.

Effective Probabilistic Neural Networks Model for

Predicted States State Action

• Alleviate changing disturbances

[Model] Probabilistic Neural Networks [Policy] MPC Controller based on

Fig. 1. Principle and contributions of the proposed method PNMPC.

elements [X u , Y u , ψ u , v u , δ u , τ u , v w · cos(ψ w ), v w · sin(ψ w )] First step loss Total loss

and v u present the heading direction and velocity of USV,

engine throttle. Both wind direction ψ w and velocity v w are

The second step’s loss function shares a close form of ∗ ∗ ∗

shift. It encourages the networks to focus on the prediction

Position-keeping Targets-tracking (circle) Targets-tracking (right)

DroQ TD7 SPMPC FPMPC PETS PNMPC-N PNMPC (ours)

Position Keeping Circle Right

0 2 4 6 8 10 0 0.2 0.4 0.6 0.8 1 0 2 4 6 8 10 0 0.5 1 1.5 0 5 10 15 0 0.5 1 1.5

SPMPC FPMPC PETS PNMPC-N PNMPC (ours)

Position-keeping Targets-tracking (circle) Targets-tracking (right)

DroQ TD7 PETS PNMPC-N PNMPC (ours)

Position Keeping Circle Right

0 2 4 6 8 10 12 0 0.2 0.4 0.6 0.8 1 1.2 0 5 10 15 0 0.5 1 1.5 0 5 10 15 0 0.5 1 1.5 2

PETS PNMPC-N PNMPC (ours)

SPMPC FPMPC PETS PNMPC (ours)

-15 -15 -15 -15

-30 -30 -30 -30

-50 -50 -50 -50

-100 -100 -100 -100

SPMPC FPMPC PETS PNMPC (ours)

-15 -15 -15 -15

-30 -30 -30 -30

-50 -50 -50 -50

-100 -100 -100 -100

SPMPC FPMPC PETS PNMPC (ours)

-15 -15 -15 -15

-30 -30 -30 -30

-50 -50 -50 -50

-100 -100 -100 -100

Xinyu Wu (Senior Member, IEEE) received the

Yunduan Cui (Member, IEEE) received Ph.D. in

You might also like