
Adaptive Dynamic Programming for Discrete-time LQR Optimal Tracking Control Problems with Unknown Dynamics

Yang Liu, Yanhong Luo, Huaguang Zhang
School of Information Science and Engineering, Northeastern University, Shenyang, Liaoning, China 110819
Email: kersly961@gmail.com, neuluo@gmail.com, hgzhang@ieee.org

Abstract—In this paper, an optimal tracking control approach based on the adaptive dynamic programming (ADP) algorithm is proposed to solve linear quadratic regulation (LQR) problems for unknown discrete-time systems in an online fashion. First, we convert the optimal tracking problem into the design of an infinite-horizon optimal regulator for the tracking error dynamics, based on a system transformation. Then we expand the error state equation using the history data of control and state. The iterative ADP algorithms of policy iteration (PI) and value iteration (VI) are introduced to solve the value function of the controlled system. It is shown that the proposed ADP algorithm solves the LQR problem without requiring any knowledge of the system dynamics. The simulation results show the convergence and effectiveness of the proposed control scheme.

I. INTRODUCTION

Optimal control is generally considered to be an off-line control strategy, which tries to make a performance index maximal or minimal under certain constraints. When designing the optimal controller, we need to know the full dynamics of the system. To achieve optimal control of the linear system in this paper, we need to solve the Riccati equation [1]. As is well known, dynamic programming is a mathematical method for solving optimization decision processes based on Bellman's principle of optimality. In practice it is often impossible to apply dynamic programming to obtain the optimal solution because of the "curse of dimensionality" [2].

In order to overcome this problem, adaptive dynamic programming and its related research have achieved numerous exciting results [3]–[10]. There are two main iterative ADP algorithms, known as value iteration [11] and policy iteration [12]. Both iteration algorithms have a two-step structure: policy evaluation and policy improvement. The main distinction between PI and VI is that PI is usually implemented online with an admissible control policy, while VI is usually implemented in an off-line manner.

Adaptive control [2] is an online control method, which can modify its parameters to adapt to changes in dynamic characteristics and disturbances using measured data along the system trajectories.

Under certain conditions [1], the tracking problem for linear systems can be converted into a regulation problem. Generally, the optimal regulation of linear systems with quadratic cost functions can be obtained by solving the Riccati equation [2]. Online ADP algorithms have been widely used in tracking controller design [13]–[17]. An online approximator for optimal tracking control with partially unknown internal dynamics was designed in [13], an ADP algorithm for discrete-time LQR optimal tracking control problems was designed in [17], etc. Both of the methods mentioned above require knowledge of part or all of the system dynamics. A method based on the combination of reinforcement learning and Q-learning techniques was designed for optimal tracking control problems with unknown dynamics in [18]. Direct adaptive controllers based on measured output data that converge to optimal controls for unknown systems have been designed using policy iteration and value iteration in [19].

In this paper, an error system is derived and the system's information is represented by an expanded state equation for the linear tracking problem. We then design a history-data-based ADP algorithm for linear system tracking problems that does not require knowledge of the system dynamics. The remainder of the paper is organized as follows. The standard solution of the LQR problem and the Bellman temporal difference (TD) error are reviewed in Section II. The history-data-based iterative ADP algorithm for linear system state tracking problems is derived in Section III. Simulation results of the designed algorithm are presented in Section IV. Finally, conclusions are drawn in Section V.

II. PROBLEM FORMULATION

In this section, we review the solution of a linear discrete-time (DT) system with a quadratic cost function under a state feedback structure, which is well known as the DT LQR problem.



A. LQR

Consider the linear time-invariant discrete-time system
$$x(k+1) = Ax(k) + Bu_x(k) \quad (1)$$
where $x(k) \in \mathbb{R}^n$ is the state and $u_x(k) \in \mathbb{R}^m$ is the control input.

The objective of the optimal tracking problem is to determine the optimal control sequence $u^*$ iteratively, in order to make the linear system (1) track a desired trajectory $\gamma(k) \in \mathbb{R}^n$ in an optimal manner.

Assumption 1: The system (1) is controllable under the control variable $u$.

Assumption 2: The mapping between the state $x(k)$ and the reference trajectory $\gamma(k)$ is one-to-one.

According to Assumption 2 and system (1), we define the dynamics of the desired trajectory as
$$\gamma(k+1) = A\gamma(k) + Bu_\gamma(k) \quad (2)$$

Then, the error system is obtained by subtracting (2) from (1):
$$e(k+1) = Ae(k) + Bu(k) \quad (3)$$
where $e(k) = x(k) - \gamma(k) \in \mathbb{R}^n$ is the error variable and $u(k) = u_x(k) - u_\gamma(k) \in \mathbb{R}^m$ is the control input of the error system.

Given a stabilizing control policy $u(k) = \mu(e(k))$, the infinite-horizon performance index for the tracking problem is
$$V^\mu(e(k)) = \sum_{i=k}^{\infty}\left(e^T(i)Qe(i) + u^T(i)Ru(i)\right) \equiv \sum_{i=k}^{\infty} r(i) \quad (4)$$
where $Q$ and $R$ are symmetric and positive-definite weight matrices. The utility is
$$r(k) = e^T(k)Qe(k) + u^T(k)Ru(k) \quad (5)$$

The objective of the optimal tracking control problem is to determine the optimal policy $u(k) = \mu(e(k))$ that minimizes the performance index (4) along the trajectories of the error system (3). Because of the quadratic structure of the cost and the linear dynamics of the error system, this is an LQR problem.

According to the performance index (4), the corresponding Bellman equation can be written as
$$V^\mu(e(k)) = r(k) + V^\mu(e(k+1)) = e^T(k)Qe(k) + u^T(k)Ru(k) + V^\mu(e(k+1)) \quad (6)$$

The optimal cost of the error system is
$$V^*(e(k)) = \min_{\mu} \sum_{i=k}^{\infty}\left(e^T(i)Qe(i) + u^T(i)Ru(i)\right) \quad (7)$$

Then, based on Bellman's optimality principle, the optimal cost can be determined by solving the Hamilton-Jacobi-Bellman (HJB) equation
$$V^*(e(k)) = \min_{u(k)}\left(e^T(k)Qe(k) + u^T(k)Ru(k) + V^*(e(k+1))\right) \quad (8)$$
where the optimal control can be written as
$$\mu^*(e(k)) = \arg\min_{u(k)}\left(e^T(k)Qe(k) + u^T(k)Ru(k) + V^*(e(k+1))\right) \quad (9)$$

Since in the LQR problem every value is quadratic in the system state, the performance index of the policy $u(k) = \mu(e(k))$ is given by
$$V^\mu(e(k)) = e^T(k)Pe(k) \quad (10)$$
where $P$ is an $n \times n$ symmetric matrix. According to (6), the error system LQR Bellman equation is
$$e^T(k)Pe(k) = e^T(k)Qe(k) + u^T(k)Ru(k) + e^T(k+1)Pe(k+1) \quad (11)$$

If the control policy has the structure of a linear feedback of the state variable, we have
$$u(k) = \mu(e(k)) = -Ke(k) \quad (12)$$
and
$$e(k+1) = (A - BK)e(k) \quad (13)$$

From [19], we know that $P$ satisfies the algebraic Riccati equation (ARE)
$$P = A^TPA + Q - A^TPB(R + B^TPB)^{-1}B^TPA \quad (14)$$

The optimal control input of the error system with state feedback is given by
$$u(k) = -Ke(k) = -(R + B^TPB)^{-1}B^TPAe(k) \quad (15)$$
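To make the model-based baseline of (14) and (15) concrete, the following sketch iterates the Riccati recursion to a fixed point and then forms the feedback gain. It is a minimal illustration in Python/NumPy that assumes $A$ and $B$ are known, which is precisely the assumption removed by the data-based algorithm of Section III; the function name and tolerance are illustrative.

```python
import numpy as np

def dlqr_gain(A, B, Q, R, tol=1e-10, max_iter=10000):
    """Iterate the DT Riccati recursion (14) to a fixed point; return (K, P).

    Model-based baseline only: A and B are assumed known here.
    """
    n = A.shape[0]
    P = np.zeros((n, n))
    for _ in range(max_iter):
        # K = (R + B'PB)^{-1} B'PA, then P_next = A'PA + Q - A'PB K
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P_next = A.T @ P @ A + Q - A.T @ P @ B @ K
        if np.max(np.abs(P_next - P)) < tol:
            P = P_next
            break
        P = P_next
    return K, P

# The resulting error feedback is u(k) = -K e(k), as in (15).
```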
B. Bellman Temporal Difference (TD) Error

To determine the optimal control and optimal value in the LQR case, online ADP methods are used based on the Bellman temporal difference error [7]
$$\begin{aligned}\varepsilon(k) &= -V^\mu(e(k)) + e^T(k)Qe(k) + u^T(k)Ru(k) + V^\mu(e(k+1)) \\ &= -e^T(k)Pe(k) + e^T(k)Qe(k) + u^T(k)Ru(k) + e^T(k+1)Pe(k+1)\end{aligned} \quad (16)$$

And the HJB equation (8) can be written as
$$0 = \min_{u(k)}\left(-e^T(k)Pe(k) + e^T(k)Qe(k) + u^T(k)Ru(k) + e^T(k+1)Pe(k+1)\right) \quad (17)$$
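For a given value matrix $P$, the TD error (16) is a scalar residual that can be evaluated from a single measured transition $(e(k), u(k), e(k+1))$; online ADP methods drive this residual toward zero. A minimal sketch, with illustrative names and all vectors taken as 1-D arrays:

```python
import numpy as np

def td_error(P, Q, R, e_k, u_k, e_k1):
    """Bellman TD error (16): -e'Pe + e'Qe + u'Ru + e(k+1)'P e(k+1)."""
    return float(-e_k @ P @ e_k + e_k @ Q @ e_k
                 + u_k @ R @ u_k + e_k1 @ P @ e_k1)
```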


III. ADP BASED ON HISTORY DATA

In this section, we design the history-data-based adaptive dynamic programming method for the optimal tracking control problem, in which knowledge of the system dynamics $(A, B)$ is not required.

Consider the linear time-invariant tracking error system
$$e(k+1) = Ae(k) + Bu(k) \quad (18)$$
where $e(k) \in \mathbb{R}^n$ is the error variable and $u(k) \in \mathbb{R}^m$ is the control input of the error system.

Based on Assumption 1, the state of the error system can be written in terms of history data. Expanding the error state equation from $k-N$ to $k$, we have
$$e(k) = A^N e(k-N) + \begin{bmatrix} B & AB & \cdots & A^{N-1}B \end{bmatrix}\begin{bmatrix} u(k-1) \\ u(k-2) \\ \vdots \\ u(k-N) \end{bmatrix} \quad (19)$$

$$\begin{bmatrix} e(k-1) \\ e(k-2) \\ \vdots \\ e(k-N) \end{bmatrix} = \begin{bmatrix} A^{N-1} \\ \vdots \\ A \\ I \end{bmatrix} e(k-N) + \begin{bmatrix} 0 & B & AB & \cdots & A^{N-2}B \\ 0 & 0 & B & \cdots & A^{N-3}B \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & \cdots & \cdots & 0 & B \\ 0 & 0 & \cdots & 0 & 0 \end{bmatrix}\begin{bmatrix} u(k-1) \\ u(k-2) \\ \vdots \\ u(k-N+1) \\ u(k-N) \end{bmatrix} \quad (20)$$

or, written simply,
$$e(k) = A^N e(k-N) + O_N \bar{u}(k-1, k-N) \quad (21)$$
$$\bar{e}(k-1, k-N) = H_N e(k-N) + D_N \bar{u}(k-1, k-N) \quad (22)$$
where $\bar{e}(k-1, k-N) \in \mathbb{R}^{nN}$ and $\bar{u}(k-1, k-N) \in \mathbb{R}^{mN}$ are the data sequences over the time interval $[k-N, k-1]$, $O_N \in \mathbb{R}^{n \times mN}$, and $H_N \in \mathbb{R}^{nN \times n}$.

Then, $A^N$ can be written as
$$A^N = L H_N \quad (23)$$
where $L \in \mathbb{R}^{n \times nN}$. The left inverse of $H_N$ exists:
$$H_N^{\dagger} = (H_N^T H_N)^{-1} H_N^T, \qquad H_N^{\dagger} H_N = I_n \quad (24)$$

Substituting this into (23), we can obtain
$$L = A^N H_N^{\dagger} + G(I - H_N H_N^{\dagger}) \equiv L_1 + L_2 \quad (25)$$

The state error of the tracking system is given uniquely in terms of the history data sequences by
$$e(k) = L_1 \bar{e}(k-1, k-N) + (O_N - L_1 D_N)\bar{u}(k-1, k-N) \equiv L_e \bar{e}(k-1, k-N) + L_u \bar{u}(k-1, k-N) \quad (26)$$

$$e(k) = \begin{bmatrix} L_u & L_e \end{bmatrix}\begin{bmatrix} \bar{u}(k-1, k-N) \\ \bar{e}(k-1, k-N) \end{bmatrix} \equiv \begin{bmatrix} L_u & L_e \end{bmatrix}\bar{z}(k-1, k-N) \quad (27)$$

The error system value function is given by
$$V^\mu(e(k)) = e^T(k)Pe(k) = \bar{z}^T(k-1, k-N)\begin{bmatrix} L_u^T \\ L_e^T \end{bmatrix} P \begin{bmatrix} L_u & L_e \end{bmatrix}\bar{z}(k-1, k-N) \quad (28)$$

Define
$$\bar{P} = \begin{bmatrix} L_u^T P L_u & L_u^T P L_e \\ L_e^T P L_u & L_e^T P L_e \end{bmatrix} \quad (29)$$

Then
$$V^\mu(e(k)) = \bar{z}^T(k-1, k-N)\bar{P}\bar{z}(k-1, k-N) \quad (30)$$

The optimal policy of the error system is written as
$$\mu^*(e(k)) = \arg\min_{u(k)}\left(e^T(k)Qe(k) + u^T(k)Ru(k) + \bar{z}^T(k, k-N+1)\bar{P}\bar{z}(k, k-N+1)\right) \quad (31)$$

The corresponding performance index of the policy is given by
$$\bar{z}^T(k, k-N+1)\bar{P}\bar{z}(k, k-N+1) = \begin{bmatrix} u(k) \\ \bar{u}(k-1, k-N+1) \\ \bar{e}(k, k-N+1) \end{bmatrix}^T \begin{bmatrix} p_0 & p_1 & p_2 \\ p_1^T & P_{22} & P_{23} \\ p_2^T & P_{32} & P_{33} \end{bmatrix}\begin{bmatrix} u(k) \\ \bar{u}(k-1, k-N+1) \\ \bar{e}(k, k-N+1) \end{bmatrix} \quad (32)$$
where $p_1 = p_u$ and $p_2 = p_e$.

The control policy of the error system can be obtained by differentiating (31) with respect to $u(k)$:
$$u(k) = -(p_0 + R)^{-1}\left(p_u\bar{u}(k-1, k-N+1) + p_e\bar{e}(k, k-N+1)\right) \quad (33)$$
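Equations (27) through (33) express the value function and the control purely in terms of the stored input and error sequences, so an implementation only needs to stack the history vector $\bar{z}$ and evaluate the quadratic form (30). A minimal sketch of this bookkeeping, under the assumption (for illustration only) that the data are stored newest-first in Python lists:

```python
import numpy as np

def zbar(u_hist, e_hist):
    """Stack z̄ = [ū; ē] from the last N inputs and errors.

    u_hist: [u(k-1), ..., u(k-N)], e_hist: [e(k-1), ..., e(k-N)],
    each entry a 1-D array, newest first.
    """
    return np.concatenate([np.concatenate(u_hist), np.concatenate(e_hist)])

def value_from_Pbar(Pbar, u_hist, e_hist):
    """Quadratic value (30): V = z̄' P̄ z̄, computed from measured data only."""
    z = zbar(u_hist, e_hist)
    return float(z @ Pbar @ z)
```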
A. PI Algorithm based on history data

Select an admissible initial control policy $u^0(k) = \mu^0(k)$. For $j = 0, 1, \ldots$, perform the following two-step iteration until the convergence condition is achieved.

Policy Evaluation: Determine the value of $\bar{P}^{j+1}$ using the Bellman equation
$$\bar{z}^T(k-1, k-N)\bar{P}^{j+1}\bar{z}(k-1, k-N) = e^T(k)Qe(k) + (u^j(k))^T R u^j(k) + \bar{z}^T(k, k-N+1)\bar{P}^{j+1}\bar{z}(k, k-N+1) \quad (34)$$

Policy Improvement: Determine an improved policy using
$$u^{j+1}(k) = \mu^{j+1}(e(k)) = -(p_0^{j+1} + R)^{-1}\left(p_u^{j+1}\bar{u}(k-1, k-N+1) + p_e^{j+1}\bar{e}(k, k-N+1)\right) \quad (35)$$
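One way to implement the policy evaluation step (34) is to collect transitions along the trajectory under the current policy (plus an excitation signal), parameterize the symmetric matrix $\bar{P}^{j+1}$ by its upper-triangular entries, and solve the resulting linear equations by least squares; the policy improvement (35) then only reads off the blocks $p_0$, $p_u$, $p_e$ from the partition (32). The sketch below follows this route; the helper names, data layout, and least-squares formulation are implementation assumptions, not the paper's exact code.

```python
import numpy as np

def svec_basis(d):
    """Index pairs for the upper triangle of a d x d symmetric matrix."""
    return [(i, j) for i in range(d) for j in range(i, d)]

def quad_features(z, pairs):
    """Features such that z' P z = phi(z) . p for symmetric P with upper-triangle entries p."""
    return np.array([z[i] * z[j] * (1.0 if i == j else 2.0) for i, j in pairs])

def policy_evaluation(Z1, Z2, rewards):
    """Fit P̄^{j+1} from the Bellman equation (34) by least squares.

    Z1[k] = z̄(k-1, k-N), Z2[k] = z̄(k, k-N+1),
    rewards[k] = e(k)'Q e(k) + u(k)'R u(k), recorded under the current policy.
    """
    d = Z1.shape[1]
    pairs = svec_basis(d)
    Phi = np.array([quad_features(z1, pairs) - quad_features(z2, pairs)
                    for z1, z2 in zip(Z1, Z2)])
    p, *_ = np.linalg.lstsq(Phi, rewards, rcond=None)
    Pbar = np.zeros((d, d))
    for (i, j), v in zip(pairs, p):
        Pbar[i, j] = Pbar[j, i] = v
    return Pbar

def policy_improvement(Pbar, R, m, N):
    """Extract (p0, pu, pe) from the partition (32) and return the gains of (35)."""
    p0 = Pbar[:m, :m]
    pu = Pbar[:m, m:m * N]
    pe = Pbar[:m, m * N:]
    inv = np.linalg.inv(p0 + R)
    # u(k) = -(p0+R)^{-1} (pu ū(k-1,k-N+1) + pe ē(k,k-N+1))
    return -inv @ pu, -inv @ pe
```

Here the first $m$ entries of $\bar{z}(k, k-N+1)$ correspond to $u(k)$, the next $m(N-1)$ to $\bar{u}(k-1, k-N+1)$, and the remaining $nN$ to $\bar{e}(k, k-N+1)$, matching the partition in (32).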
B. VI algorithm based on history data

Select an arbitrary initial control policy $u^0(k) = \mu^0(k)$. For $j = 0, 1, \ldots$, perform the following two-step iteration until the convergence condition is achieved.

Policy Evaluation: Determine the value of $\bar{P}^{j+1}$ using the Bellman equation
$$\bar{z}^T(k-1, k-N)\bar{P}^{j+1}\bar{z}(k-1, k-N) = e^T(k)Qe(k) + (u^j(k))^T R u^j(k) + \bar{z}^T(k, k-N+1)\bar{P}^{j}\bar{z}(k, k-N+1) \quad (36)$$

Policy Improvement: Determine an improved policy using
$$u^{j+1}(k) = \mu^{j+1}(e(k)) = -(p_0^{j+1} + R)^{-1}\left(p_u^{j+1}\bar{u}(k-1, k-N+1) + p_e^{j+1}\bar{e}(k, k-N+1)\right) \quad (37)$$
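Compared with (34), the value iteration evaluation step (36) keeps the previous estimate $\bar{P}^{j}$ on the right-hand side, so each iteration reduces to an ordinary least-squares fit with a known target. A minimal sketch, reusing the svec_basis and quad_features helpers from the PI sketch above; again this is an illustrative formulation rather than the paper's exact implementation.

```python
import numpy as np

def value_iteration_step(Z1, Z2, rewards, Pbar_j):
    """One VI evaluation step (36): fit P̄^{j+1} with P̄^j fixed on the right-hand side."""
    d = Z1.shape[1]
    pairs = svec_basis(d)
    targets = np.array([r + z2 @ Pbar_j @ z2 for r, z2 in zip(rewards, Z2)])
    Phi = np.array([quad_features(z1, pairs) for z1 in Z1])
    p, *_ = np.linalg.lstsq(Phi, targets, rcond=None)
    Pbar = np.zeros((d, d))
    for (i, j), v in zip(pairs, p):
        Pbar[i, j] = Pbar[j, i] = v
    return Pbar

# The improvement step (37) is identical to the PI case: extract p0, pu, pe from P̄^{j+1}.
```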
Lemma 1 [17]: If Assumption 1 holds, then the error system (3) is controllable. If the persistent excitation (PE) condition is satisfied, then the solution of the iterative ADP algorithm converges to the solution of the algebraic Riccati equation (14) when starting with $P_0 \geq 0$.

Remark 1: A persistent excitation signal is needed in the iteration step of the ADP algorithm in order to ensure sufficient training of the iterative algorithm.

Theorem 1: If Lemma 1 holds, then the reference control $u_\gamma$ in (2) has the same form as (37), and the optimal tracking control policy at step $k$ of the system is
$$u_x(k) = u(k) + u_\gamma(k) \quad (38)$$
where
$$u_\gamma(k) = -(p_0 + R)^{-1}\left(p_1\bar{u}_\gamma(k-1, k-N+1) + p_2\bar{\gamma}(k, k-N+1)\right) \quad (39)$$

Proof: If Lemma 1 holds, then the error system (3) is controllable. Since the state equation of the reference trajectory $\gamma$ is given by
$$\gamma(k+1) = A\gamma(k) + Bu_\gamma(k)$$
and the state equation of the error $e$ is given by
$$e(k+1) = Ae(k) + Bu(k)$$
they share the same system dynamics $A$ and $B$. Since for the error system the solution $P$ of the algebraic Riccati equation is given by
$$P = A^TPA + Q - A^TPB(R + B^TPB)^{-1}B^TPA$$
the symmetric matrix $P$ is also the solution of the same equation for the reference system. So, similar to the error system policy improvement equation (37), the reference control policy has the form
$$u_\gamma(k) = -(p_0 + R)^{-1}\left(p_1\bar{u}_\gamma(k-1, k-N+1) + p_2\bar{\gamma}(k, k-N+1)\right)$$
and the optimal tracking control policy at step $k$ of the system is
$$u_x(k) = u(k) + u_\gamma(k)$$
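In an implementation, Theorem 1 means the plant input is assembled from the learned error-feedback policy plus a reference feedforward term computed from the same blocks $(p_0, p_u, p_e)$ applied to the stored reference data, as in (38) and (39). A minimal sketch, with illustrative names and data layout:

```python
import numpy as np

def reference_control(p0, pu, pe, R, u_gamma_hist, gamma_hist):
    """Reference input (39): u_γ(k) = -(p0+R)^{-1}(p1 ūγ(k-1,k-N+1) + p2 γ̄(k,k-N+1)).

    u_gamma_hist: [u_γ(k-1), ..., u_γ(k-N+1)]; gamma_hist: [γ(k), ..., γ(k-N+1)].
    """
    ubar = np.concatenate(u_gamma_hist)
    gbar = np.concatenate(gamma_hist)
    return -np.linalg.solve(p0 + R, pu @ ubar + pe @ gbar)

def plant_input(u_error, u_gamma):
    """Tracking control (38): u_x(k) = u(k) + u_γ(k)."""
    return u_error + u_gamma
```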
Remark 2: The policy iteration algorithm requires an initial admissible control policy to ensure the convergence of the algorithm. There is no such requirement for the value iteration algorithm. Hence, value iteration can be applied to open-loop unstable systems.

Remark 3: The control policy in (37) only depends on $L_1$ and is independent of $L_2$. Interested readers can refer to [19] for more details.

IV. SIMULATION

In this section, a mathematical example is studied to demonstrate the effectiveness of the proposed history-data-based ADP algorithms. Consider the DT linear system with quadratic cost function
$$x(k+1) = \begin{bmatrix} 0.9841 & 0.0929 \\ -0.3093 & 0.8559 \end{bmatrix} x(k) + \begin{bmatrix} 0.3443 \\ 6.7122 \end{bmatrix} u(k) \quad (40)$$

The reference trajectory is defined as
$$\begin{bmatrix} \gamma_1 \\ \gamma_2 \end{bmatrix} = \begin{bmatrix} \sin(t) \\ \cos(t) \end{bmatrix} \quad (41)$$

The initial state of the system is chosen as $x_0 = [1\ 0]^T$ and $N = 2$. A disturbance signal is added to the control input of the error system to ensure the persistent excitation condition. $Q$ is the second-order identity matrix and $R = 1$. The sampling time of the system is 0.1 s and the maximal iteration step of the ADP algorithm is set to 300.

The dashed lines in Fig. 1 and Fig. 3 are the reference trajectories, and the solid lines are the state trajectories of the optimal tracking problem. It is shown that the state trajectories can accurately track the reference trajectories. The iteration error trajectories of the error system are shown in Fig. 2 and Fig. 4, respectively.

[Fig. 1. System state trajectory of x1]
[Fig. 2. Iteration error trajectory of e1]
[Fig. 3. System state trajectory of x2]
[Fig. 4. Iteration error trajectory of e2]
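A sketch of how the data for this example could be generated is given below: the matrices come from (40) and the reference from (41), with $x_0 = [1\ 0]^T$, $N = 2$, $Q = I$, $R = 1$, a 0.1 s sampling time, and probing noise added for persistent excitation. The placeholder feedback gain and noise magnitude are illustrative assumptions; in the actual algorithm the applied input would come from the PI or VI policy (35)/(37), and $A$, $B$ are used here only to simulate the plant.

```python
import numpy as np

A = np.array([[0.9841, 0.0929],
              [-0.3093, 0.8559]])          # system matrix from (40)
B = np.array([[0.3443],
              [6.7122]])                   # input matrix from (40)
Q, R = np.eye(2), np.array([[1.0]])        # weights: Q = I, R = 1
dt, N, max_iter = 0.1, 2, 300

x = np.array([1.0, 0.0])                   # x0 = [1 0]'
history = []                               # (e, u, e_next) transitions for the ADP iteration
for k in range(max_iter):
    gamma = np.array([np.sin(k * dt), np.cos(k * dt)])        # reference (41)
    e = x - gamma                                             # tracking error
    u = -0.1 * e[0] + 0.05 * np.random.randn()                # placeholder policy + PE noise
    x = A @ x + B[:, 0] * u                                   # simulate the true plant
    gamma_next = np.array([np.sin((k + 1) * dt), np.cos((k + 1) * dt)])
    history.append((e, u, x - gamma_next))
```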
V. CONCLUSION

In this paper, an iterative ADP algorithm is proposed for optimal tracking control problems, using only the history data of the error dynamical system. For the tracking problem, history-data-based policy iteration and value iteration algorithms have been designed, respectively. A persistent excitation signal is added in the iteration step of the ADP algorithm to ensure sufficient training of the iterative algorithm. Finally, a simulation example is given to demonstrate the validity of the proposed ADP algorithms.

ACKNOWLEDGMENT

This work was supported by the National Natural Science Foundation of China under Grant 61104010 and the China Postdoctoral Science Foundation (2012M510825, 2014T70260).
REFERENCES

[1] F. L. Lewis, D. Vrabie, and V. L. Syrmos, Optimal Control. John Wiley and Sons, 2012.
[2] H. Zhang, D. Liu, Y. Luo, and D. Wang, Adaptive Dynamic Programming for Control: Algorithms and Stability. Springer Verlag, 2013.
[3] J. Si, A. G. Barto, W. B. Powell, and D. C. Wunsch, Handbook of Learning and Approximate Dynamic Programming. IEEE Press, New York, 2004.
[4] T. Dierks and S. Jagannathan, "Online optimal control of nonlinear discrete-time systems using approximate dynamic programming," Journal of Control Theory and Applications, vol. 9, no. 3, pp. 361–369, 2011.
[5] H. Zhang, L. Cui, X. Zhang, and Y. Luo, "Data-driven robust approximate optimal tracking control for unknown general nonlinear systems using adaptive dynamic programming method," IEEE Transactions on Neural Networks, vol. 22, no. 12, pp. 2226–2236, 2011.
[6] D. Liu, D. Wang, D. Zhao, Q. Wei, and N. Jin, "Neural-network-based optimal control for a class of unknown discrete-time nonlinear systems using globalized dual heuristic programming," IEEE Transactions on Automation Science and Engineering, vol. 9, no. 3, pp. 628–634, 2012.
[7] F. L. Lewis and D. Vrabie, "Reinforcement learning and adaptive dynamic programming for feedback control," IEEE Circuits and Systems Magazine, vol. 9, no. 3, pp. 32–50, 2009.
[8] H. Zhang, L. Cui, and Y. Luo, "Near-optimal control for nonzero-sum differential games of continuous-time nonlinear systems using single-network ADP," IEEE Transactions on Cybernetics, vol. 43, no. 1, pp. 206–216, 2013.
[9] J. Fu, H. He, and X. Zhou, "Adaptive learning and control for MIMO system based on adaptive dynamic programming," IEEE Transactions on Neural Networks, vol. 22, no. 7, pp. 1133–1148, 2011.
[10] D. Liu and Q. Wei, "Policy iteration adaptive dynamic programming algorithm for discrete-time nonlinear systems," IEEE Transactions on Neural Networks and Learning Systems, vol. 25, no. 3, pp. 621–634, 2014.
[11] F.-Y. Wang, N. Jin, D. Liu, and Q. Wei, "Adaptive dynamic programming for finite-horizon optimal control of discrete-time nonlinear systems with ε-error bound," IEEE Transactions on Neural Networks, vol. 22, no. 1, pp. 24–36, 2011.
[12] D. Liu and Q. Wei, "Finite-approximation-error-based optimal control approach for discrete-time nonlinear systems," IEEE Transactions on Cybernetics, vol. 43, no. 2, pp. 779–789, 2013.
[13] T. Dierks and S. Jagannathan, "Optimal tracking control of affine nonlinear discrete-time systems with unknown internal dynamics," in Proceedings of the 48th IEEE Conference on Decision and Control, held jointly with the 2009 28th Chinese Control Conference (CDC/CCC 2009). IEEE, 2009, pp. 6750–6755.
[14] H. Zhang, R. Song, Q. Wei, and T. Zhang, "Optimal tracking control for a class of nonlinear discrete-time systems with time delays based on heuristic dynamic programming," IEEE Transactions on Neural Networks, vol. 22, no. 12, pp. 1851–1862, 2011.
[15] H. Zhang, Q. Wei, and Y. Luo, "A novel infinite-time optimal tracking control scheme for a class of discrete-time nonlinear systems via the greedy HDP iteration algorithm," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 38, no. 4, pp. 937–942, 2008.
[16] Z. Ni and H. He, "Adaptive learning in tracking control based on the dual critic network design," IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 6, pp. 913–928, 2013.
[17] Q. Xie, B. Luo, and F. Tan, "Discrete-time LQR optimal tracking control problems using approximate dynamic programming algorithm with disturbance," in 2013 Fourth International Conference on Intelligent Control and Information Processing (ICICIP). IEEE, 2013, pp. 716–721.
[18] B. Kiumarsi, F. L. Lewis, H. Modares, A. Karimpour, and M.-B. Naghibi-Sistani, "Reinforcement Q-learning for optimal tracking control of linear discrete-time systems with unknown dynamics," Automatica, vol. 50, no. 4, pp. 1167–1175, 2014.
[19] F. L. Lewis and K. G. Vamvoudakis, "Reinforcement learning for partially observable dynamic processes: Adaptive dynamic programming using measured output data," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 41, no. 1, pp. 14–25, 2011.
