2017 International Symposium on Nonlinear Theory and Its Applications,

NOLTA2017, Cancun, Mexico, December 4-7, 2017

Control of Nonholonomic Vehicle System Using Hierarchical Deep Reinforcement Learning
Naoyuki Masuda and Toshimitsu Ushio

Graduate School of Engineering Science, Osaka University


Toyonaka, Osaka, 560-8531, Japan
Email: n [email protected], [email protected]

Abstract—In this paper, we apply an approach that integrates two reinforcement learning algorithms to a parking problem of a 4-wheeled vehicle and obtain a controller that generates an optimal trajectory. One algorithm makes exploration more efficient by hierarchizing the learning agent, and the other enables learning from an unshaped reward. By simulation, we show that, by hierarchizing the policy of the agent, a parking operation including cutting of the wheel, which requires a long exploration before the movement is acquired, can be learned efficiently.

1. Introduction

It is known that, in general, mechanical systems with nonholonomic constraints cannot be stabilized by a time-invariant feedback controller [1]. Hence, discontinuous control methods have been proposed [2]. We consider a parking problem of a 4-wheeled vehicle, which is a nonholonomic system. Because of the nonholonomic constraint, if there is no trajectory that reaches the target point smoothly, it is necessary to cut the wheel, so a discontinuous controller is needed. The design of such a controller requires a model of the vehicle, but the model may include uncertainty or be time-varying.

The design of controllers using reinforcement learning does not require knowledge about models of plants; a control law is obtained through interactions between the plant and the environment [3]. Reinforcement learning has been applied to the model-free design of an adaptive optimal output controller [4] and of a decentralized supervisor [5]. Furthermore, initiated by the deep Q-network (DQN) [6], there has been growing interest in the use of deep neural networks in reinforcement learning. This approach makes it possible to apply large nonlinear function approximators to learning a control policy: a policy modeled by a deep neural network can express the characteristics of a complex plant and output an appropriate action. However, when reinforcement learning is applied to learning the parking motion of a vehicle, there are two issues. One is how to design the reward function; we need a reward function such that optimizing the policy for this reward leads to parking completion. The other is how to learn a parking motion that includes cutting of the wheel. For the agent to learn this motion, it needs a long exploration before it receives a reward. Such tasks, in which it takes a long time before a reward is assigned, remain a major challenge for reinforcement learning.

In this paper, to address these two issues, we propose a learning algorithm that combines hindsight experience replay (HER) [8] and the hierarchical deep Q-network (h-DQN) [7]. HER is a technique that allows reinforcement learning algorithms to perform sample-efficient learning from an unshaped reward function. The h-DQN is a framework that hierarchizes the value function of the DQN and improves the exploration efficiency in an environment with sparse feedback. We apply the proposed algorithm to a parking problem with cutting of the wheel of a 4-wheeled vehicle and show its effectiveness by simulation.

2. Preliminaries

2.1. Reinforcement Learning

We review the standard reinforcement learning setting, where an agent interacts with an environment over a number of discrete time steps. The environment is described by a Markov decision process (MDP) defined as a tuple (S, A, T, γ, R), where S is the state space, A is the action space, T : S × A → S is the transition probability, γ ∈ [0, 1] is the discount factor, and R : S × A → ℝ is the reward function. A deterministic policy is a mapping from states to actions. At each time step t, the agent receives a state st and selects an action at from the set of possible actions in st according to a policy π. Then the agent moves to the next state st+1 and receives a reward rt. This process continues until the agent reaches a terminal state, after which the process restarts. The purpose of the agent is to maximize the expected return Rt = Σk γ^k rt+k from each state st.

The action value function Qπ(st, at) = E[Rt | st, at] is the expected return for selecting an action at in the state st and following the policy π thereafter. The optimal action value function Q*(s, a) = maxπ Qπ(s, a) gives the maximum action value for the state s and the action a. An optimal policy is derived by selecting the highest-valued action in each state. Q-learning, a value-based model-free reinforcement learning method, estimates the optimal action value function.
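To make the Q-learning update concrete, the following is a minimal tabular sketch (our illustration, not part of the paper); the environment interface env.reset()/env.step() and all hyperparameter values are assumptions.

```python
# Minimal tabular Q-learning sketch (illustration only, not the paper's algorithm).
# `env` is a hypothetical environment with reset() -> state and step(a) -> (state, reward, done).
import random
from collections import defaultdict

def q_learning(env, num_actions, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    Q = defaultdict(lambda: [0.0] * num_actions)      # Q[s][a]; states must be hashable
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection
            if random.random() < eps:
                a = random.randrange(num_actions)
            else:
                a = max(range(num_actions), key=lambda i: Q[s][i])
            s_next, r, done = env.step(a)
            # target: r_t + gamma * max_a' Q(s_{t+1}, a')
            target = r + (0.0 if done else gamma * max(Q[s_next]))
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next
    return Q
```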
2.2. Deep Q-Network (DQN)

Recent advances in function approximation with deep neural networks have made it possible to handle high-dimensional state spaces. A deep Q-network (DQN) is a multi-layered neural network that outputs a vector of action values Q(s, ·; θ) for each state s, where θ is the set of parameters of the network. In order to make the learning process more stable, two important ingredients were proposed by Mnih et al. [6]: a target network and experience replay. The target network is the same as the main network except that its set of parameters θ− changes at a slower rate than that of the main network; the parameters are periodically copied from the main network and kept fixed on all other steps. In experience replay, the transitions encountered during training are stored in a replay buffer. The network is trained using mini-batch gradient descent on the loss L = E[(Yt^target − Q(st, at; θt))^2], where Yt^target = rt + γ max_{a′∈A} Q(st+1, a′; θt−) and the tuples (st, at, rt, st+1) are sampled from the replay buffer.
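The target-network loss above can be sketched as a single optimization step as follows (a PyTorch illustration, not the authors' implementation; the replay-batch layout, network shapes, and hyperparameters are assumptions).

```python
# One DQN optimization step with a target network and replay samples
# (PyTorch sketch; the batch layout and shapes are assumptions).
import torch
import torch.nn as nn

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    # batch = (states, actions, rewards, next_states, dones) sampled from the replay buffer
    s, a, r, s_next, done = [torch.as_tensor(x, dtype=torch.float32) for x in batch]
    # Q(s_t, a_t; theta) for the actions actually taken
    q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Y_t = r_t + gamma * max_a' Q(s_{t+1}, a'; theta^-), computed with the slow target network
        y = r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, y)            # L = E[(Y_t - Q(s_t, a_t; theta))^2]
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```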
2.3. h-DQN

Figure 1: Architecture of the h-DQN.

The h-DQN is an extension of the DQN framework that can treat tasks with long-term credit assignment by utilizing a notion of goals, which provide intrinsic motivation for the agent. In this framework, as shown in Fig. 1, the agent uses a two-level hierarchy consisting of a controller and a meta-controller. The meta-controller receives the current state of the plant and chooses a goal. The goal is fixed for the next few time steps, either until it is achieved or until a terminal state is reached. For the controller to achieve the goals, the meta-controller provides an intrinsic reward to it based on whether the agent is able to achieve them. That is, to maximize the cumulative extrinsic reward provided by the environment, the meta-controller focuses on setting the sequence of goals and the controller focuses on achieving them.
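The controller/meta-controller interaction can be sketched as the following loop (our illustration of the h-DQN scheme described above; env, meta_policy, ctrl_policy, and goal_reached are hypothetical placeholders).

```python
# Two-level h-DQN interaction loop (sketch of the scheme described above).
# `env`, `meta_policy`, `ctrl_policy`, and `goal_reached` are hypothetical placeholders;
# the DQN updates of both levels are omitted.
def run_episode(env, meta_policy, ctrl_policy, goal_reached, max_steps=200):
    s, done, t = env.reset(), False, 0
    while not done and t < max_steps:
        g = meta_policy(s)                     # meta-controller picks a goal from the current state
        s_meta, extrinsic_return = s, 0.0
        # the goal stays fixed until it is achieved or a terminal state is reached
        while not done and not goal_reached(s, g) and t < max_steps:
            a = ctrl_policy(s, g)              # controller selects an atomic action for this goal
            s_next, r_ext, done = env.step(a)
            r_int = 1.0 if goal_reached(s_next, g) else 0.0   # intrinsic reward for the controller
            # store (s, g, a, r_int, s_next) for the controller; accumulate the extrinsic reward
            extrinsic_return += r_ext
            s, t = s_next, t + 1
        # store (s_meta, g, extrinsic_return, s) for the meta-controller
```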
3. Optimal Path Planning for Car Parking

3.1. Problem Formulation

Figure 2: 4-wheeled vehicle.

We consider the 4-wheeled vehicle shown in Fig. 2, where (r_x, r_y) is the center position of the rear axle of the vehicle, θ is the body angle, and ϕ is the steering angle. When the inputs are the forward velocity u_v and the angular velocity of the steering u_w, the 4-wheeled vehicle is modeled by the following equation:

    \frac{d}{dt}
    \begin{bmatrix} r_x \\ r_y \\ \theta \\ \phi \end{bmatrix}
    =
    \begin{bmatrix} \cos\theta & 0 \\ \sin\theta & 0 \\ \frac{1}{l}\tan\phi & 0 \\ 0 & 1 \end{bmatrix}
    \begin{bmatrix} u_v \\ u_w \end{bmatrix},    (1)

where l denotes the distance between the front and rear axles.
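For illustration, equation (1) can be integrated numerically with a simple Euler step as sketched below; the wheelbase l and the step size dt are placeholder values, not taken from the paper.

```python
# Euler-integration sketch of the kinematic model (1).
# The wheelbase l and the step size dt are placeholder values, not taken from the paper.
import math

def vehicle_step(state, u_v, u_w, l=0.5, dt=0.05):
    """One Euler step of (r_x, r_y, theta, phi) under inputs (u_v, u_w)."""
    r_x, r_y, theta, phi = state
    r_x   += dt * math.cos(theta) * u_v        # dr_x/dt = cos(theta) * u_v
    r_y   += dt * math.sin(theta) * u_v        # dr_y/dt = sin(theta) * u_v
    theta += dt * (math.tan(phi) / l) * u_v    # dtheta/dt = (1/l) tan(phi) * u_v
    phi   += dt * u_w                          # dphi/dt = u_w
    return (r_x, r_y, theta, phi)

# e.g. a short forward arc while steering left, starting from the state [1, 1, 0, 0]^T
s = (1.0, 1.0, 0.0, 0.0)
for _ in range(20):
    s = vehicle_step(s, u_v=0.5, u_w=0.3)
```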
3.2. Reward Shaping

Due to the nonholonomic constraint, in general, the vehicle cannot reach the terminal state directly by a continuous control law, but only by a discontinuous one with cutting of the wheel [1]. Such behavior is, however, difficult to learn by a straightforward application of the DQN, since the agent needs intrinsic motivation to explore meaningful regions of the state space before it can discover the advantage of cutting the wheel by itself. We therefore apply the h-DQN framework: a meta-controller learns a policy over switching points, and a controller learns a policy over a series of atomic actions to reach the switching points. By using the h-DQN, a policy can be learned efficiently even if a long exploration is required before parking is completed and a reward is received.

However, a common challenge in reinforcement learning is how to determine a reward function that reflects the task and guides the policy optimization. For instance, we considered a reward setting where the agent receives a negative reward at every step according to the distance between the parking point and the vehicle and, when close enough to the parking point, receives a high reward as the body angle approaches 0.
Furthermore, when the agent goes outside of the field, the episode is terminated with a negative reward. With such a reward setting, however, the agent does not learn an optimal trajectory. If the cumulative negative reward that the agent receives according to the distance over an episode is not larger than the penalty for going outside of the field, the agent learns to leave the field; if the penalty is too large, the agent becomes too passive to explore. Moreover, it is necessary to choose the maximum number of steps of each episode according to the difficulty of the parking task.

A shaped reward function like this example is not useful because it cannot be applied to cases where the problem setting changes even slightly. Although for every reinforcement learning problem there exists some reward setting that makes it easy, the design of such shaped rewards requires a lot of domain knowledge and may not be much easier than designing the policy from the mathematical model. It is, therefore, important to develop algorithms that can learn from an unshaped reward. In our work, we use a binary value indicating successful task completion for both the intrinsic and the extrinsic reward. That is, the agent receives 1 when it reaches a sub-goal or completes the parking, −1 when it goes outside of the field, and 0 otherwise. In such a sparse and binary reward setting, however, standard reinforcement learning algorithms are bound to fail because they may not receive a positive reward.
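A minimal sketch of such a sparse binary reward is given below (our illustration; the field boundary and the goal test are simplified placeholders, and the requirement that the goal condition hold over several consecutive steps is omitted).

```python
# Sketch of the sparse binary reward described above: +1 for reaching the (sub-)goal,
# -1 for leaving the field, 0 otherwise. The field boundary `field_size` is an assumption.
import math

def sparse_reward(state, goal, eps_d=0.08, eps_a=math.pi / 30, field_size=5.0):
    r_x, r_y, theta = state[0], state[1], state[2]
    g_x, g_y, g_theta = goal
    if abs(r_x) > field_size or abs(r_y) > field_size:
        return -1.0                            # the vehicle leaves the field
    close_in_position = math.hypot(r_x - g_x, r_y - g_y) < eps_d
    close_in_angle = abs(theta - g_theta) < eps_a
    return 1.0 if (close_in_position and close_in_angle) else 0.0
```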
To overcome this problem, we could use techniques for improving exploration such as count-based exploration [9], but it is impractical to explore such a large state space. Instead of using such techniques, we use hindsight experience replay (HER) [8]. This is an improved version of experience replay based on the idea of making it possible to learn from failure. Consider an episode with a state sequence s1, ..., sT and a goal g ≠ s1, ..., sT. While this episode may not help us learn how to achieve the goal g, it tells us something about how to achieve the state sT. This information is utilized by creating an episode in which the goal g of the original episode is replaced by sT, and replaying it together with the original episodes in the replay buffer. With this modification, at least half of the replayed trajectories contain a positive reward and learning becomes much faster.
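The HER relabelling step can be sketched as follows (our illustration; the transition layout and the way a goal is extracted from the final state are assumptions).

```python
# Sketch of HER relabelling: replay a failed episode as if the finally reached state
# had been the goal. The transition layout and `goal_of` (which extracts the goal
# coordinates from a state) are assumptions.
def her_relabel(episode, reward_fn, goal_of):
    """episode: list of (s, a, r, s_next, goal); returns an additional relabelled episode."""
    g_new = goal_of(episode[-1][3])            # treat the final state s_T as the achieved goal
    relabelled = []
    for (s, a, _r, s_next, _g) in episode:
        r_new = reward_fn(s_next, g_new)       # recompute the binary reward w.r.t. s_T
        relabelled.append((s, a, r_new, s_next, g_new))
    return relabelled

# both the original episode and the relabelled one are stored in the replay buffer
```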
3.3. Model Architecture and Training

In this paper, the meta-controller network consists of one hidden layer with 64 units, and the controller network consists of three hidden layers with 64 units each. All layers use a rectified linear activation function. The inputs to the controller are the relative position and relative angle of the vehicle to the current sub-goal gt, the steering angle, and the derivatives of the state of the vehicle; that is, the input is the vector [gt^x − r_x, gt^y − r_y, gt^θ − θ, ϕ, ṙ_x, ṙ_y, θ̇, ϕ̇]^T, where gt^x, gt^y, and gt^θ are the x and y coordinates and the angle of the current sub-goal. The inputs to the meta-controller are the state of the vehicle and its derivative. The meta-controller outputs a vector of action values Qm(s, g) for each sub-goal g ∈ G, and the controller outputs a vector of action values Qc(s, a) for each action a ∈ A. There are nine actions: "move forward", "move backward", "steer right", "steer left", "do nothing", "move forward and steer right", "move forward and steer left", "move backward and steer right", and "move backward and steer left". There are four sub-goals. The goal and the sub-goals describe the desired position and body angle with some fixed tolerance. We consider that a goal is achieved when the distance between the vehicle and the goal position remains less than ϵd and the absolute value of the difference between the body angle and the goal angle remains less than ϵa over 10 steps. The position and angle of the extrinsic goal are (0, 0) and 0 [rad], respectively, and those of the sub-goals are (2.25, 1.25) and arctan(1.25/2.25) [rad], (2.5, 0.9) and arctan(0.9/2.5) [rad], (2.9, 0.7) and arctan(0.7/2.9) [rad], and (0, 0) and 0 [rad], respectively. The tolerance of the distance is ϵd = 0.08 and that of the body angle is ϵa = π/30 for all goals. Training runs for 80000 episodes. Each episode consists of 200 steps, and we perform 20 optimization steps on minibatches of size 64 sampled uniformly from the replay buffer. The sizes of the replay buffers for the controller and the meta-controller are 5.0 × 10^5 and 5.0 × 10^4, respectively. We update the target networks at every optimization step using a decay coefficient of 0.99. We use the Adam optimizer with a learning rate of 0.001. The discount factor is 0.99 for all transitions. For exploration, we use ϵ-greedy exploration, and ϵ is annealed over time from 1.0 to 0.1.

We follow a two-phase training procedure. First, we train the controller to reach randomly sampled sub-goals, and then we train the controller and the meta-controller simultaneously. This procedure prevents the networks from specializing and converging to a local optimum that always heads to one specific goal in the course of learning.
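A sketch of the two value networks described above is given below (PyTorch; the input dimensions are inferred from the text, and the code is our illustration, not the authors' implementation).

```python
# Value networks of the controller and the meta-controller as described above
# (PyTorch sketch; the input dimensions are inferred from the text).
import torch.nn as nn

# controller input: [g_t^x - r_x, g_t^y - r_y, g_t^theta - theta, phi, dr_x, dr_y, dtheta, dphi] (8-dim);
# 3 hidden layers of 64 ReLU units; one Q-value per atomic action (9 actions)
controller = nn.Sequential(
    nn.Linear(8, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 9),
)

# meta-controller input: vehicle state and its derivative (assumed 8-dim);
# 1 hidden layer of 64 ReLU units; one Q-value per sub-goal (4 sub-goals)
meta_controller = nn.Sequential(
    nn.Linear(8, 64), nn.ReLU(),
    nn.Linear(64, 4),
)
```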
4. Computer Simulation

We consider the following optimal car parking problem: the parking motion from the initial state [1, 1, 0, 0]^T to the origin with a final body angle of less than 6 degrees. The trajectory obtained by learning is shown in Fig. 3. As learning progressed, we first observed behaviors such as learning to obtain a reward by driving the relative position and angle to the given sub-goals to zero. However, due to the nonholonomic constraint, even if the relative position and angle to a sub-goal are close to 0, there may not exist an immediate action that achieves the sub-goal from the current position. For this reason, in the middle of the first stage of learning, a lot of back-and-forth motions were observed near a sub-goal. At the end of the first phase, the agent had learned that it is not enough to bring the relative position and angle close to 0; something like path planning that takes the constraint into account is needed to achieve the sub-goal.
Finally, the agent learned a policy to reach all of the sub-goals. Learning of the meta-controller started once the number of episodes in which the controller arrived at the sub-goals increased. The meta-controller learned which switching point tends to receive the highest return. In our computer simulation, since the binary reward multiplied by the discount rate is maximized, the agent learned to select the switching point that completes the parking motion earliest. As a result, we obtain an optimal policy that selects the optimal switching point (2.25, 1.25).

Figure 3: Learned trajectory with cutting of the wheel, where the green and red points are the initial and final (target) points, respectively.

Fig. 4 compares the learning curves of the DQN, the DQN with HER, and the h-DQN with HER. The DQN cannot learn the parking behavior at all, because the exploration is completely random and there is no motivation to cut the wheel. The DQN with HER can learn the behavior, but learning is slow compared with the h-DQN with HER. By replaying the episodes generated by HER, the agent learns that it can get the reward if the relative position is close to zero; but since the exploration is random, it takes time to notice the necessity of cutting the wheel. On the other hand, the h-DQN with HER succeeds in learning the task, because HER accelerates learning and the hierarchization makes exploration more efficient.

Figure 4: Learning curves.

5. Conclusions

In this paper, we consider a parking problem of a 4-wheeled car where the car may go back and forth. To obtain an optimal trajectory, we propose a reinforcement learning based method using the h-DQN with HER. Future work includes learning the parking motion from any initial state and generating the sub-goals for the meta-controller automatically.

Acknowledgement: This work was supported by JST ERATO Grant Number JPMJER1603, Japan.

References

[1] R. W. Brockett, "Asymptotic stability and feedback stabilization," Differential Geometric Control Theory, vol. 27, no. 1, pp. 181-191, 1983.

[2] A. Astolfi, "Discontinuous control of nonholonomic systems," Systems and Control Letters, vol. 27, no. 1, pp. 37-45, 1996.

[3] F. L. Lewis and D. Liu, Reinforcement Learning and Approximate Dynamic Programming for Feedback Control, IEEE Press, 2013.

[4] T. Fujita and T. Ushio, "Optimal digital control with uncertain network delay of linear systems using reinforcement learning," IEICE Trans. Fundamentals, vol. E99-A, no. 2, pp. 454-461, 2016.

[5] T. Yamasaki and T. Ushio, "Decentralized supervisory control of discrete event systems based on reinforcement learning," IEICE Trans. Fundamentals, vol. E88-A, no. 11, pp. 3045-3050, 2005.

[6] V. Mnih et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529-533, 2015.

[7] T. D. Kulkarni et al., "Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation," 30th Conference on Neural Information Processing Systems (NIPS), 2016.

[8] M. Andrychowicz et al., "Hindsight Experience Replay," arXiv preprint arXiv:1707.01495, 2017.

[9] G. Ostrovski et al., "Count-Based Exploration with Neural Density Models," arXiv preprint arXiv:1703.01310, 2017.