Adaptive DP For Discrete Time LQR Optimal Tracking Control Problems With Unknown Dynamics
Abstract—In this paper, an optimal tracking control approach based on the adaptive dynamic programming (ADP) algorithm is proposed to solve linear quadratic regulation (LQR) problems for unknown discrete-time systems in an online fashion. First, we convert the optimal tracking problem into the design of an infinite-horizon optimal regulator for the tracking error dynamics based on a system transformation. Then we expand the error state equation by the history data of control and state. The iterative ADP algorithms of policy iteration (PI) and value iteration (VI) are introduced to solve the value function of the controlled system. It is shown that the proposed ADP algorithm solves the LQR problem without requiring any knowledge of the system dynamics. The simulation results show the convergence and effectiveness of the proposed control scheme.

I. INTRODUCTION

Optimal control is generally considered to be an off-line control strategy, which tries to make the performance indicators maximum or minimum under certain constraints. When designing the optimal controller, we need to know the full dynamics of the system. In order to achieve optimal control of the linear system in this paper, we need to solve the Riccati equation [1]. As is well known, dynamic programming is a mathematical method for solving optimization decision processes based on Bellman's principle of optimality. In practice, it is often impossible to apply dynamic programming to obtain the optimal solution because of the "curse of dimensionality" [2]. In order to overcome this problem, adaptive dynamic programming and its related research have achieved numerous exciting results [3]–[10]. There are two main iterative ADP algorithms, known as value iteration [11] and policy iteration [12]. Both of these iteration algorithms have a two-step iteration structure: policy evaluation and policy improvement. The main distinction between PI and VI is that PI is usually implemented online with an admissible control policy, whereas VI is usually implemented in an off-line manner. Adaptive control [2] is an online control method, which can modify its parameters to adapt to changes in dynamic characteristics and disturbances using measured data along the system trajectories.

Under certain conditions [1], the tracking problem for linear systems can be converted into a regulation problem. Generally, the optimal regulation problem of linear systems with quadratic cost functions can be solved via the Riccati equation [2]. Online ADP algorithms have been widely used in tracking controller design for tracking problems [13]–[17]. An online approximator for optimal tracking control with partially unknown internal dynamics was designed in [13], an ADP algorithm for discrete-time LQR optimal tracking control problems was designed in [17], etc. Both of the methods mentioned above require knowledge of part or all of the system dynamics. A method based on the combination of reinforcement learning and the Q-learning technique was designed for optimal tracking control problems with unknown dynamics in [18]. Direct adaptive controllers based on measured output data that converge to optimal controls for unknown systems were designed using policy iteration and value iteration in [19].

In this paper, an error system is derived and the system's information is represented by the expanded state equation for linear system tracking problems. We then design a history-data-based ADP algorithm for linear system tracking problems that does not require knowledge of the system dynamics. The remainder of the paper is organized as follows. A review of the standard solution of the LQR problem and the Bellman temporal difference (TD) error is given in Section II. An iterative ADP algorithm based only on history data for linear system state tracking problems is derived in Section III. Simulation results of the designed algorithm are presented in Section IV. Finally, conclusions are drawn in Section V.

II. PROBLEM FORMULATION

In this section, we review the solution of a linear discrete-time (DT) system with a quadratic cost function under a state feedback structure, which is well known as the DT LQR problem.
γ(k + 1) = Aγ(k) + Bu_γ(k)    (2)

Then, the error system is obtained by subtracting (2) from (1):

e(k + 1) = Ae(k) + Bu(k)    (3)

where e(k) = (x(k) − γ(k)) ∈ R^n is the error variable and u(k) = (u_x(k) − u_γ(k)) ∈ R^m is the control input of the error system.

Given a stabilizing control policy u(k) = μ(e(k)), the infinite-horizon performance index for the tracking problem is derived as

V^μ(e(k)) = \sum_{i=k}^{∞} ( e^T(i)Q e(i) + u^T(i)R u(i) ) ≡ \sum_{i=k}^{∞} r(i)    (4)

where Q and R are symmetric and positive-definite weight matrices. The utility is

r(k) = e^T(k)Q e(k) + u^T(k)R u(k)    (5)

The objective of the optimal tracking control problem is to determine the optimal policy u(k) = μ(e(k)) that minimizes the performance index (4) along the error system trajectories (3). Because of the quadratic structure of the cost and the dynamics of the error system, this is an LQR problem.

According to the performance index (4), the corresponding Bellman equation can be written as

V^μ(e(k)) = r(k) + V^μ(e(k + 1)) = e^T(k)Q e(k) + u^T(k)R u(k) + V^μ(e(k + 1))    (6)

The optimal cost of the error system is

V^*(e(k)) = \min_{μ} \sum_{i=k}^{∞} ( e^T(i)Q e(i) + u^T(i)R u(i) )    (7)

For the LQR case the value function is quadratic, V^μ(e(k)) = e^T(k)P e(k), where P is an n × n symmetric matrix. According to (6), the error system LQR Bellman equation is

e^T(k)P e(k) = e^T(k)Q e(k) + u^T(k)R u(k) + e^T(k + 1)P e(k + 1)    (11)

If the control policy has the structure of a linear feedback of the state variable, we have

u(k) = μ(e(k)) = −Ke(k)    (12)

and

e(k + 1) = (A − BK)e(k)    (13)

From [19], we know that P satisfies the algebraic Riccati equation (ARE)

P = A^T P A + Q − A^T P B (R + B^T P B)^{−1} B^T P A    (14)

The optimal control input of the error system with state feedback is given by

u(k) = −Ke(k) = −(R + B^T P B)^{−1} B^T P A e(k)    (15)

B. Bellman Temporal Difference (TD) Error

To determine the optimal control and optimal value in the LQR case, online ADP methods are used based on the Bellman temporal difference [7] error

ε(k) = −V^μ(e(k)) + e^T(k)Q e(k) + u^T(k)R u(k) + V^μ(e(k + 1)) = −e^T(k)P e(k) + e^T(k)Q e(k) + u^T(k)R u(k) + e^T(k + 1)P e(k + 1)    (16)

And the HJB equation (8) can be written as

0 = \min_{u(k)} ( −e^T(k)P e(k) + e^T(k)Q e(k) + u^T(k)R u(k) + e^T(k + 1)P e(k + 1) )    (17)
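To make (11)–(15) concrete, the following sketch solves the ARE (14) by simple fixed-point iteration and checks that the resulting P and K satisfy the Bellman equation (11) along one closed-loop step (13). The system matrices A, B and the weights Q, R below are hypothetical placeholders for illustration, not the paper's simulation example.

import numpy as np

# Hypothetical second-order error system (illustrative values only)
A = np.array([[0.9, 0.2],
              [0.0, 0.8]])
B = np.array([[0.0],
              [1.0]])
Q = np.eye(2)            # symmetric positive-definite state weight
R = np.array([[1.0]])    # symmetric positive-definite control weight

# Fixed-point iteration on the ARE (14)
P = np.zeros((2, 2))
for _ in range(500):
    P = A.T @ P @ A + Q - A.T @ P @ B @ np.linalg.inv(R + B.T @ P @ B) @ B.T @ P @ A

# Optimal feedback gain of (15): u(k) = -K e(k)
K = np.linalg.inv(R + B.T @ P @ B) @ B.T @ P @ A

# Check the Bellman equation (11) along one closed-loop step (13)
e = np.array([[1.0], [-0.5]])
u = -K @ e
e_next = A @ e + B @ u
residual = e.T @ Q @ e + u.T @ R @ u + e_next.T @ P @ e_next - e.T @ P @ e
print(K, float(residual))    # residual should be near zero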
III. ADP BASED ON HISTORY DATA

In this section, we design the history-data-based adaptive dynamic programming method for the optimal tracking control problem, in which knowledge of the system dynamics (A, B) is not required.

Consider the linear time-invariant tracking error system. The reference trajectory is generated by

γ(k + 1) = Aγ(k) + Bu_γ(k)

and the state equation of the error e is given by

e(k + 1) = Ae(k) + Bu(k)

Then they have the same system dynamics A and B. Since for the error system the solution P of the algebraic Riccati equation is given by

P = A^T P A + Q − A^T P B (R + B^T P B)^{−1} B^T P A

then for the reference system the symmetric matrix P is the same.

The error state can be expressed in terms of the history data of control and error over the last N steps,

e(k) = [L_u  L_e] [ū^T(k − 1, k − N)  ē^T(k − 1, k − N)]^T ≡ [L_u  L_e] z̄(k − 1, k − N)    (27)

The error system value function is given by

V^μ(e(k)) = e^T(k)P e(k) = z̄^T(k − 1, k − N) [L_u  L_e]^T P [L_u  L_e] z̄(k − 1, k − N)
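Since the paper's own history-data recursion (equations (18)–(26)) is not reproduced above, the following is only a minimal data-driven sketch in the spirit of the Q-learning approach of [18]: a Q-function value iteration for the error-system LQR that uses measured one-step transitions (e(k), u(k), e(k+1)) and never uses A or B in its updates. All numerical values are illustrative assumptions.

import numpy as np

# A and B are used ONLY to generate the measured transitions; the learning
# update below never touches them (model-free).
np.random.seed(0)
A = np.array([[0.9, 0.2], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.array([[1.0]])
n, m = 2, 1

# Measured one-step data (e_k, u_k, e_{k+1}) under exploratory inputs
N = 60
E = np.random.randn(N, n)
U = np.random.randn(N, m)
En = E @ A.T + U @ B.T

K = np.zeros((m, n))             # initial feedback gain
H = np.zeros((n + m, n + m))     # Q-function kernel: Q(e, u) = [e; u]' H [e; u]
for _ in range(60):              # value-iteration sweeps
    W = np.vstack([np.eye(n), -K])
    P_j = W.T @ H @ W            # value kernel induced by the current policy
    y = (np.einsum('ki,ij,kj->k', E, Q, E)
         + np.einsum('ki,ij,kj->k', U, R, U)
         + np.einsum('ki,ij,kj->k', En, P_j, En))
    Z = np.hstack([E, U])
    Phi = np.einsum('ki,kj->kij', Z, Z).reshape(N, -1)   # features for z' H z
    h, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    H = h.reshape(n + m, n + m)
    H = 0.5 * (H + H.T)                                  # enforce symmetry
    K = np.linalg.solve(H[n:, n:], H[n:, :n])            # improvement without A, B

print(K)   # approaches (R + B'PB)^{-1} B'PA of (15)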
Fig. 2. Iteration error trajectory of e1.

Fig. 3. System state trajectory of x2.

Fig. 4. Iteration error trajectory of e2.

ACKNOWLEDGMENT

This work was supported by the National Natural Science Foundation of China under Grant 61104010 and the China Postdoctoral Science Foundation (2012M510825, 2014T70260).

REFERENCES

[1] F. L. Lewis, D. Vrabie, and V. L. Syrmos, Optimal Control. John Wiley and Sons, 2012.
[2] H. Zhang, D. Liu, Y. Luo, and D. Wang, Adaptive Dynamic Programming for Control: Algorithms and Stability. Springer Verlag, 2013.
[3] J. Si, A. G. Barto, W. B. Powell, and D. C. Wunsch, Handbook of Learning and Approximate Dynamic Programming. IEEE Press, New York, 2004.
[4] T. Dierks and S. Jagannathan, "Online optimal control of nonlinear discrete-time systems using approximate dynamic programming," Journal of Control Theory and Applications, vol. 9, no. 3, pp. 361–369, 2011.
[5] H. Zhang, L. Cui, X. Zhang, and Y. Luo, "Data-driven robust approximate optimal tracking control for unknown general nonlinear systems using adaptive dynamic programming method," IEEE Transactions on Neural Networks, vol. 22, no. 12, pp. 2226–2236, 2011.
[6] D. Liu, D. Wang, D. Zhao, Q. Wei, and N. Jin, "Neural-network-based optimal control for a class of unknown discrete-time nonlinear systems using globalized dual heuristic programming," IEEE Transactions on Automation Science and Engineering, vol. 9, no. 3, pp. 628–634, 2012.
[7] F. L. Lewis and D. Vrabie, "Reinforcement learning and adaptive dynamic programming for feedback control," IEEE Circuits and Systems Magazine, vol. 9, no. 3, pp. 32–50, 2009.
[8] H. Zhang, L. Cui, and Y. Luo, "Near-optimal control for nonzero-sum differential games of continuous-time nonlinear systems using single-network ADP," IEEE Transactions on Cybernetics, vol. 43, no. 1, pp. 206–216, 2013.
[9] J. Fu, H. He, and X. Zhou, "Adaptive learning and control for MIMO system based on adaptive dynamic programming," IEEE Transactions on Neural Networks, vol. 22, no. 7, pp. 1133–1148, 2011.
[10] D. Liu and Q. Wei, "Policy iteration adaptive dynamic programming algorithm for discrete-time nonlinear systems," IEEE Transactions on Neural Networks and Learning Systems, vol. 25, no. 3, pp. 621–634, 2014.
[11] F.-Y. Wang, N. Jin, D. Liu, and Q. Wei, "Adaptive dynamic programming for finite-horizon optimal control of discrete-time nonlinear systems with ε-error bound," IEEE Transactions on Neural Networks, vol. 22, no. 1, pp. 24–36, 2011.
[12] D. Liu and Q. Wei, "Finite-approximation-error-based optimal control approach for discrete-time nonlinear systems," IEEE Transactions on Cybernetics, vol. 43, no. 2, pp. 779–789, 2013.
[13] T. Dierks and S. Jagannathan, "Optimal tracking control of affine nonlinear discrete-time systems with unknown internal dynamics," in Proceedings of the 48th IEEE Conference on Decision and Control held jointly with the 2009 28th Chinese Control Conference (CDC/CCC 2009). IEEE, 2009, pp. 6750–6755.
[14] H. Zhang, R. Song, Q. Wei, and T. Zhang, "Optimal tracking control for a class of nonlinear discrete-time systems with time delays based on heuristic dynamic programming," IEEE Transactions on Neural Networks, vol. 22, no. 12, pp. 1851–1862, 2011.
[15] H. Zhang, Q. Wei, and Y. Luo, "A novel infinite-time optimal tracking control scheme for a class of discrete-time nonlinear systems via the greedy HDP iteration algorithm," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 38, no. 4, pp. 937–942, 2008.
[16] Z. Ni and H. He, "Adaptive learning in tracking control based on the dual critic network design," IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 6, pp. 913–928, 2013.
[17] Q. Xie, B. Luo, and F. Tan, "Discrete-time LQR optimal tracking control problems using approximate dynamic programming algorithm with disturbance," in Intelligent Control and Information Processing (ICICIP), 2013 Fourth International Conference on. IEEE, 2013, pp. 716–721.
[18] B. Kiumarsi, F. L. Lewis, H. Modares, A. Karimpour, and M.-B. Naghibi-Sistani, "Reinforcement Q-learning for optimal tracking control of linear discrete-time systems with unknown dynamics," Automatica, vol. 50, no. 4, pp. 1167–1175, 2014.
[19] F. L. Lewis and K. G. Vamvoudakis, "Reinforcement learning for partially observable dynamic processes: Adaptive dynamic programming using measured output data," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 41, no. 1, pp. 14–25, 2011.