2017 International Symposium on Nonlinear Theory and Its Applications,

NOLTA2017, Cancun, Mexico, December 4-7, 2017

Control of Nonholonomic Vehicle System Using Hierarchical Deep Reinforcement Learning
Naoyuki Masuda and Toshimitsu Ushio

Graduate School of Engineering Science, Osaka University


Toyonaka, Osaka, 560-8531, Japan
Email: n [email protected], [email protected]

Abstract—In this paper, we apply an approach that integrates two reinforcement learning algorithms to a parking problem of a 4-wheeled vehicle and obtain a controller that generates an optimal trajectory. One algorithm makes exploration more efficient by hierarchizing the learning agent, and the other enables learning from an unshaped reward. By simulation, we show that, by hierarchizing the policy of the agent, a parking operation including cutting of the wheel, which requires a long exploration before the movement is acquired, can be learned efficiently.

1. Introduction

It is known that, in general, mechanical systems with nonholonomic constraints cannot be stabilized by a time-invariant feedback controller [1]. Hence, discontinuous control methods have been proposed [2]. We consider a parking problem of a 4-wheeled vehicle, which is a nonholonomic system. Because of the nonholonomic constraint, if there is no trajectory that reaches the target point smoothly, it is necessary to cut the wheel, so a discontinuous controller is needed. The design of such a controller requires a model of the vehicle, but the model may include uncertainty or be time-varying.

The design of controllers using reinforcement learning does not require knowledge about models of plants; a control law is obtained through interactions between the plant and the environment [3]. Reinforcement learning has been applied to the model-free design of an adaptive optimal output controller [4] and of a decentralized supervisor [5]. Furthermore, initiated by the deep Q-network (DQN) [6], there has been growing interest in the use of deep neural networks in reinforcement learning. This approach makes it possible to apply large nonlinear function approximators to learning a control policy: a policy modeled by a deep neural network can express the characteristics of a complex plant and output an appropriate action. However, when reinforcement learning is applied to learning the parking motion of a vehicle, there are two issues. One is how to design the reward function; we need a reward function such that optimizing the policy for this reward leads to parking completion. The other is how to learn a parking motion that includes cutting of the wheel. For the agent to learn this motion, it needs a long exploration before it receives a reward. Such tasks, in which it takes a long time before a reward is assigned, remain a major challenge for reinforcement learning.

In this paper, to address these two issues, we propose a learning algorithm that combines hindsight experience replay (HER) [8] and the hierarchical deep Q-network (h-DQN) [7]. HER is a technique that allows reinforcement learning algorithms to perform sample-efficient learning from an unshaped reward function. The h-DQN is a framework that hierarchizes the value function of the DQN and improves the exploration efficiency in an environment with sparse feedback. We apply the proposed algorithm to a parking problem with cutting of the wheel of a 4-wheeled vehicle and show its effectiveness by simulation.

2. Preliminaries

2.1. Reinforcement Learning

We review the standard reinforcement learning setting, where an agent interacts with an environment over a number of discrete time steps. The environment is described by a Markov decision process (MDP) defined as a tuple (S, A, T, γ, R), where S is the state space, A is the action space, T : S × A → S is the transition probability, γ ∈ [0, 1] is the discount factor, and R : S × A → ℝ is the reward function. A deterministic policy is a mapping from states to actions. At each time step t, the agent receives a state st and selects an action at from the set of possible actions in st according to a policy π. Then the agent moves to the next state st+1 and receives a reward rt. This process continues until the agent reaches a terminal state, after which the process restarts. The purpose of the agent is to maximize the expected return Rt = Σk γ^k rt+k from each state st.

The action value function Qπ(st, at) = E[Rt | st, at] is the expected return for selecting an action at in the state st and following the policy π thereafter. The optimal action value function Q*(s, a) = maxπ Qπ(s, a) gives the maximum action value for the state s and the action a. An optimal policy is derived by selecting the highest-valued action in each state. Q-learning, a value-based model-free reinforcement learning method, estimates the optimal action value function.
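To make the Q-learning update concrete, the following is a minimal tabular sketch (our illustration, not part of the paper); the environment interface env.reset()/env.step() and all hyperparameter values are assumptions.

```python
# Minimal tabular Q-learning sketch (illustration only, not the paper's algorithm).
# `env` is a hypothetical environment with reset() -> state and step(a) -> (state, reward, done).
import random
from collections import defaultdict

def q_learning(env, num_actions, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    Q = defaultdict(lambda: [0.0] * num_actions)      # Q[s][a]; states must be hashable
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection
            if random.random() < eps:
                a = random.randrange(num_actions)
            else:
                a = max(range(num_actions), key=lambda i: Q[s][i])
            s_next, r, done = env.step(a)
            # target: r_t + gamma * max_a' Q(s_{t+1}, a')
            target = r + (0.0 if done else gamma * max(Q[s_next]))
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next
    return Q
```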
2.2. Deep Q-Network (DQN)

Recent advances in function approximation with deep neural networks have made it possible to handle high-dimensional state spaces. A deep Q-network (DQN) is a multi-layered neural network that outputs a vector of action values Q(s, ·; θ) for each state s, where θ is the set of parameters of the network. In order to make the learning process more stable, two important ingredients were proposed by Mnih et al. [6]: a target network and experience replay. The target network is the same as the main network except that its set of parameters θ− changes at a slower rate than that of the main network; the parameters are periodically copied from the main network and kept fixed on all other steps. In experience replay, the transitions encountered during training are stored in a replay buffer. The network is trained using mini-batch gradient descent on the loss L = E[(Yt^target − Q(st, at; θt))^2], where Yt^target = rt + γ max_{a′∈A} Q(st+1, a′; θt−) and the tuples (st, at, rt, st+1) are sampled from the replay buffer.
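The target-network loss above can be sketched as a single optimization step as follows (a PyTorch illustration, not the authors' implementation; the replay-batch layout, network shapes, and hyperparameters are assumptions).

```python
# One DQN optimization step with a target network and replay samples
# (PyTorch sketch; the batch layout and shapes are assumptions).
import torch
import torch.nn as nn

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    # batch = (states, actions, rewards, next_states, dones) sampled from the replay buffer
    s, a, r, s_next, done = [torch.as_tensor(x, dtype=torch.float32) for x in batch]
    # Q(s_t, a_t; theta) for the actions actually taken
    q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Y_t = r_t + gamma * max_a' Q(s_{t+1}, a'; theta^-), computed with the slow target network
        y = r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, y)            # L = E[(Y_t - Q(s_t, a_t; theta))^2]
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```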
2.3. h-DQN

Figure 1: Architecture of the h-DQN.

The h-DQN is an extension of the DQN framework that can treat tasks with long-term credit assignment by utilizing a notion of goals, which provide intrinsic motivation for the agent. In this framework, as shown in Fig. 1, the agent uses a two-level hierarchy consisting of a controller and a meta-controller. The meta-controller receives the current state of the plant and chooses a goal. The goal is fixed for the next few time steps, either until it is achieved or until a terminal state is reached. For the controller to achieve the goals, the meta-controller provides an intrinsic reward to it based on whether the agent is able to achieve them. That is, to maximize the cumulative extrinsic reward provided by the environment, the meta-controller focuses on setting the sequence of goals and the controller focuses on achieving them.
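The controller/meta-controller interaction can be sketched as the following loop (our illustration of the h-DQN scheme described above; env, meta_policy, ctrl_policy, and goal_reached are hypothetical placeholders).

```python
# Two-level h-DQN interaction loop (sketch of the scheme described above).
# `env`, `meta_policy`, `ctrl_policy`, and `goal_reached` are hypothetical placeholders;
# the DQN updates of both levels are omitted.
def run_episode(env, meta_policy, ctrl_policy, goal_reached, max_steps=200):
    s, done, t = env.reset(), False, 0
    while not done and t < max_steps:
        g = meta_policy(s)                     # meta-controller picks a goal from the current state
        s_meta, extrinsic_return = s, 0.0
        # the goal stays fixed until it is achieved or a terminal state is reached
        while not done and not goal_reached(s, g) and t < max_steps:
            a = ctrl_policy(s, g)              # controller selects an atomic action for this goal
            s_next, r_ext, done = env.step(a)
            r_int = 1.0 if goal_reached(s_next, g) else 0.0   # intrinsic reward for the controller
            # store (s, g, a, r_int, s_next) for the controller; accumulate the extrinsic reward
            extrinsic_return += r_ext
            s, t = s_next, t + 1
        # store (s_meta, g, extrinsic_return, s) for the meta-controller
```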
3. Optimal Path Planning for Car Parking

3.1. Problem Formulation

Figure 2: 4-wheeled vehicle.

We consider the 4-wheeled vehicle shown in Fig. 2, where (r_x, r_y) is the center position of the rear axle of the vehicle, θ is the body angle, and ϕ is the steering angle. When the inputs are the forward velocity u_v and the angular velocity of the steering u_w, the 4-wheeled vehicle is modeled by the following equation:

    \frac{d}{dt}
    \begin{bmatrix} r_x \\ r_y \\ \theta \\ \phi \end{bmatrix}
    =
    \begin{bmatrix} \cos\theta & 0 \\ \sin\theta & 0 \\ \frac{1}{l}\tan\phi & 0 \\ 0 & 1 \end{bmatrix}
    \begin{bmatrix} u_v \\ u_w \end{bmatrix},    (1)

where l denotes the distance between the front and rear axles.
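For illustration, equation (1) can be integrated numerically with a simple Euler step as sketched below; the wheelbase l and the step size dt are placeholder values, not taken from the paper.

```python
# Euler-integration sketch of the kinematic model (1).
# The wheelbase l and the step size dt are placeholder values, not taken from the paper.
import math

def vehicle_step(state, u_v, u_w, l=0.5, dt=0.05):
    """One Euler step of (r_x, r_y, theta, phi) under inputs (u_v, u_w)."""
    r_x, r_y, theta, phi = state
    r_x   += dt * math.cos(theta) * u_v        # dr_x/dt = cos(theta) * u_v
    r_y   += dt * math.sin(theta) * u_v        # dr_y/dt = sin(theta) * u_v
    theta += dt * (math.tan(phi) / l) * u_v    # dtheta/dt = (1/l) tan(phi) * u_v
    phi   += dt * u_w                          # dphi/dt = u_w
    return (r_x, r_y, theta, phi)

# e.g. a short forward arc while steering left, starting from the state [1, 1, 0, 0]^T
s = (1.0, 1.0, 0.0, 0.0)
for _ in range(20):
    s = vehicle_step(s, u_v=0.5, u_w=0.3)
```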
3.2. Reward Shaping

Due to the nonholonomic constraint, in general, the vehicle cannot reach the terminal state directly by a continuous control law, but only by a discontinuous one with cutting of the wheel [1]. Such behavior is, however, difficult to learn by a straightforward application of the DQN, since the agent needs intrinsic motivation to explore meaningful regions of the state space before it can discover the advantage of cutting the wheel by itself. We therefore apply the h-DQN framework: a meta-controller learns a policy over switching points, and a controller learns a policy over a series of atomic actions to reach the switching points. By using the h-DQN, a policy can be learned efficiently even if a long exploration is required before parking is completed and a reward is received.

However, a common challenge in reinforcement learning is how to determine a reward function that reflects the task and guides the policy optimization. For instance, we considered a reward setting where the agent receives a negative reward at every step according to the distance between the parking point and the vehicle and, when close enough to the parking point, receives a high reward as the body angle approaches 0.
Furthermore, when the agent goes outside of the field, the episode is terminated with a negative reward. With such a reward setting, however, the agent does not learn an optimal trajectory. If the cumulative negative reward that the agent receives according to the distance over an episode is not larger than the penalty for going outside of the field, the agent learns to leave the field; if the penalty is too large, the agent becomes too passive to explore. Moreover, it is necessary to choose the maximum number of steps of each episode according to the difficulty of the parking task.

A shaped reward function like this example is not useful because it cannot be applied to cases where the problem setting changes even slightly. Although for every reinforcement learning problem there exists some reward setting that makes it easy, the design of such shaped rewards requires a lot of domain knowledge and may not be much easier than designing the policy from the mathematical model. It is, therefore, important to develop algorithms that can learn from an unshaped reward. In our work, we use a binary value indicating successful task completion for both the intrinsic and the extrinsic reward. That is, the agent receives 1 when it reaches a sub-goal or completes the parking, −1 when it goes outside of the field, and 0 otherwise. In such a sparse and binary reward setting, however, standard reinforcement learning algorithms are bound to fail because they may not receive a positive reward.
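A minimal sketch of such a sparse binary reward is given below (our illustration; the field boundary and the goal test are simplified placeholders, and the requirement that the goal condition hold over several consecutive steps is omitted).

```python
# Sketch of the sparse binary reward described above: +1 for reaching the (sub-)goal,
# -1 for leaving the field, 0 otherwise. The field boundary `field_size` is an assumption.
import math

def sparse_reward(state, goal, eps_d=0.08, eps_a=math.pi / 30, field_size=5.0):
    r_x, r_y, theta = state[0], state[1], state[2]
    g_x, g_y, g_theta = goal
    if abs(r_x) > field_size or abs(r_y) > field_size:
        return -1.0                            # the vehicle leaves the field
    close_in_position = math.hypot(r_x - g_x, r_y - g_y) < eps_d
    close_in_angle = abs(theta - g_theta) < eps_a
    return 1.0 if (close_in_position and close_in_angle) else 0.0
```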
To overcome this problem, we could use techniques for improving exploration such as count-based exploration [9], but it is impractical to explore such a large state space. Instead of using such techniques, we use hindsight experience replay (HER) [8]. This is an improved version of experience replay based on the idea of making it possible to learn from failure. Consider an episode with a state sequence s1, ..., sT and a goal g ≠ s1, ..., sT. While this episode may not help us learn how to achieve the goal g, it tells us something about how to achieve the state sT. This information is utilized by creating an episode in which the goal g of the original episode is replaced by sT, and replaying it together with the original episodes in the replay buffer. With this modification, at least half of the replayed trajectories contain a positive reward and learning becomes much faster.
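The HER relabelling step can be sketched as follows (our illustration; the transition layout and the way a goal is extracted from the final state are assumptions).

```python
# Sketch of HER relabelling: replay a failed episode as if the finally reached state
# had been the goal. The transition layout and `goal_of` (which extracts the goal
# coordinates from a state) are assumptions.
def her_relabel(episode, reward_fn, goal_of):
    """episode: list of (s, a, r, s_next, goal); returns an additional relabelled episode."""
    g_new = goal_of(episode[-1][3])            # treat the final state s_T as the achieved goal
    relabelled = []
    for (s, a, _r, s_next, _g) in episode:
        r_new = reward_fn(s_next, g_new)       # recompute the binary reward w.r.t. s_T
        relabelled.append((s, a, r_new, s_next, g_new))
    return relabelled

# both the original episode and the relabelled one are stored in the replay buffer
```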
3.3. Model Architecture and Training

In this paper, the meta-controller network consists of one hidden layer with 64 units, and the controller network consists of three hidden layers with 64 units each. All layers use a rectified linear activation function. The inputs to the controller are the relative position and relative angle of the vehicle to the current sub-goal gt, the steering angle, and the derivatives of the state of the vehicle; that is, the input is the vector [gt^x − r_x, gt^y − r_y, gt^θ − θ, ϕ, ṙ_x, ṙ_y, θ̇, ϕ̇]^T, where gt^x, gt^y, and gt^θ are the x and y coordinates and the angle of the current sub-goal. The inputs to the meta-controller are the state of the vehicle and its derivative. The meta-controller outputs a vector of action values Qm(s, g) for each sub-goal g ∈ G, and the controller outputs a vector of action values Qc(s, a) for each action a ∈ A. There are nine actions: "move forward", "move backward", "steer right", "steer left", "do nothing", "move forward and steer right", "move forward and steer left", "move backward and steer right", and "move backward and steer left". There are four sub-goals. The goal and the sub-goals describe the desired position and body angle with some fixed tolerance. We consider that a goal is achieved when the distance between the vehicle and the goal position remains less than ϵd and the absolute value of the difference between the body angle and the goal angle remains less than ϵa over 10 steps. The position and angle of the extrinsic goal are (0, 0) and 0 [rad], respectively, and those of the sub-goals are (2.25, 1.25) and arctan(1.25/2.25) [rad], (2.5, 0.9) and arctan(0.9/2.5) [rad], (2.9, 0.7) and arctan(0.7/2.9) [rad], and (0, 0) and 0 [rad], respectively. The tolerance of the distance is ϵd = 0.08 and that of the body angle is ϵa = π/30 for all goals. Training runs for 80000 episodes. Each episode consists of 200 steps, and we perform 20 optimization steps on minibatches of size 64 sampled uniformly from the replay buffer. The sizes of the replay buffers for the controller and the meta-controller are 5.0 × 10^5 and 5.0 × 10^4, respectively. We update the target networks at every optimization step using a decay coefficient of 0.99. We use the Adam optimizer with a learning rate of 0.001. The discount factor is 0.99 for all transitions. For exploration, we use ϵ-greedy exploration, and ϵ is annealed over time from 1.0 to 0.1.

We follow a two-phase training procedure. First, we train the controller to reach randomly sampled sub-goals, and then we train the controller and the meta-controller simultaneously. This procedure prevents the networks from specializing and converging to a local optimum that always heads to one specific goal in the course of learning.
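A sketch of the two value networks described above is given below (PyTorch; the input dimensions are inferred from the text, and the code is our illustration, not the authors' implementation).

```python
# Value networks of the controller and the meta-controller as described above
# (PyTorch sketch; the input dimensions are inferred from the text).
import torch.nn as nn

# controller input: [g_t^x - r_x, g_t^y - r_y, g_t^theta - theta, phi, dr_x, dr_y, dtheta, dphi] (8-dim);
# 3 hidden layers of 64 ReLU units; one Q-value per atomic action (9 actions)
controller = nn.Sequential(
    nn.Linear(8, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 9),
)

# meta-controller input: vehicle state and its derivative (assumed 8-dim);
# 1 hidden layer of 64 ReLU units; one Q-value per sub-goal (4 sub-goals)
meta_controller = nn.Sequential(
    nn.Linear(8, 64), nn.ReLU(),
    nn.Linear(64, 4),
)
```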
4. Computer Simulation

We consider the following optimal car parking problem: the parking motion from the initial state [1, 1, 0, 0]^T to the origin with a final body angle of less than 6 degrees. The trajectory obtained by learning is shown in Fig. 3. As learning progressed, we first observed behaviors such as learning to obtain a reward by driving the relative position and angle to the given sub-goals to zero. However, due to the nonholonomic constraint, even if the relative position and angle to a sub-goal are close to 0, there may not exist an immediate action that achieves the sub-goal from the current position. For this reason, in the middle of the first stage of learning, a lot of back-and-forth motions were observed near a sub-goal. At the end of the first phase, the agent had learned that it is not enough to bring the relative position and angle close to 0; something like path planning that takes the constraint into account is needed to achieve the sub-goal.
Finally, the agent learned a policy to reach all of the sub-goals. Learning of the meta-controller started once the number of episodes in which the controller arrived at the sub-goals increased. The meta-controller learned which switching point tends to receive the highest return. In our computer simulation, since the binary reward multiplied by the discount rate is maximized, the agent learned to select the switching point that completes the parking motion earliest. As a result, we obtain an optimal policy that selects the optimal switching point (2.25, 1.25).

Figure 3: Learned trajectory with cutting of the wheel, where the green and red points are the initial and final (target) points, respectively.

Fig. 4 compares the learning curves of the DQN, the DQN with HER, and the h-DQN with HER. The DQN cannot learn the parking behavior at all, because the exploration is completely random and there is no motivation to cut the wheel. The DQN with HER can learn the behavior, but learning is slow compared with the h-DQN with HER. By replaying the episodes generated by HER, the agent learns that it can get the reward if the relative position is close to zero; but since the exploration is random, it takes time to notice the necessity of cutting the wheel. On the other hand, the h-DQN with HER succeeds in learning the task, because HER accelerates learning and the hierarchization makes exploration more efficient.

Figure 4: Learning curves.

5. Conclusions

In this paper, we consider a parking problem of a 4-wheeled car where the car may go back and forth. To obtain an optimal trajectory, we propose a reinforcement learning based method using the h-DQN with HER. Future work includes learning the parking motion from any initial state and generating the sub-goals for the meta-controller automatically.

Acknowledgement: This work was supported by JST ERATO Grant Number JPMJER1603, Japan.

References

[1] R. W. Brockett, "Asymptotic stability and feedback stabilization," Differential Geometric Control Theory, vol. 27, no. 1, pp. 181-191, 1983.

[2] A. Astolfi, "Discontinuous control of nonholonomic systems," Systems and Control Letters, vol. 27, no. 1, pp. 37-45, 1996.

[3] F. L. Lewis and D. Liu, Reinforcement Learning and Approximate Dynamic Programming for Feedback Control, IEEE Press, 2013.

[4] T. Fujita and T. Ushio, "Optimal digital control with uncertain network delay of linear systems using reinforcement learning," IEICE Trans. Fundamentals, vol. E99-A, no. 2, pp. 454-461, 2016.

[5] T. Yamasaki and T. Ushio, "Decentralized supervisory control of discrete event systems based on reinforcement learning," IEICE Trans. Fundamentals, vol. E88-A, no. 11, pp. 3045-3050, 2005.

[6] V. Mnih et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529-533, 2015.

[7] T. D. Kulkarni et al., "Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation," 30th Conference on Neural Information Processing Systems (NIPS), 2016.

[8] M. Andrychowicz et al., "Hindsight Experience Replay," arXiv preprint arXiv:1707.01495, 2017.

[9] G. Ostrovski et al., "Count-Based Exploration with Neural Density Models," arXiv preprint arXiv:1703.01310, 2017.