Proceedings of Machine Learning Research vol 120:1–4, 2020
2nd Annual Conference on Learning for Dynamics and Control

A Theoretical Analysis of Deep Q-Learning

Jianqing Fan  jqfan@princeton.edu
Princeton University

Zhaoran Wang  zhaoran.wang@northwestern.edu
Northwestern University

Yuchen Xie  ycxie@u.northwestern.edu
Northwestern University

Zhuoran Yang  zy6@princeton.edu
Princeton University

Editors: A. Bayen, A. Jadbabaie, G. J. Pappas, P. Parrilo, B. Recht, C. Tomlin

Abstract
Despite the great empirical success of deep reinforcement learning, its theoretical foundation is less
well understood. In this work, we make the first attempt to theoretically understand the deep Q-network
(DQN) algorithm (Mnih et al., 2015) from both algorithmic and statistical perspectives. Specifically,
we focus on a slight simplification of DQN that fully captures its key features. Under mild
assumptions, we establish the algorithmic and statistical rates of convergence for the action-value
functions of the iterative policy sequence obtained by DQN. In particular, the statistical error
characterizes the bias and variance that arise from approximating the action-value function with a
deep neural network, while the algorithmic error converges to zero at a geometric rate. As a
byproduct, our analysis provides justification for the techniques of experience replay and target
networks, which are crucial to the empirical success of DQN. Furthermore, as a simple extension of
DQN, we propose the Minimax-DQN algorithm for two-player zero-sum Markov games. Borrowing the
analysis of DQN, we also quantify the difference between the policies obtained by Minimax-DQN
and the Nash equilibrium of the Markov game in terms of both the algorithmic and statistical rates
of convergence.
Keywords: Deep Q-Learning, Markov Decision Process, Zero-Sum Markov Game

Note: This is an extended abstract; the full version appears as arXiv:1901.00137.

Introduction. In this work, we aim to provide theoretical guarantees for DQN (Mnih et al., 2015),
which can be cast as an extension of the classical Q-learning algorithm (Watkins and Dayan, 1992)
that uses a deep neural network to approximate the action-value function. Although the algorithmic
and statistical properties of the classical Q-learning algorithm are well studied, theoretical analysis
of DQN is highly challenging due to the following two aspects in which it differs from classical
Q-learning.
First, in online gradient-based temporal-difference reinforcement learning algorithms, approximating
the action-value function often leads to instability. Baird (1995) proves that this is the
case even with linear function approximation. The key technique that achieves stability in DQN is
experience replay (Lin, 1992; Mnih et al., 2015). Specifically, a replay memory is used to store the
trajectory of the Markov decision process (MDP). At each iteration of DQN, a mini-batch of states,
actions, rewards, and next states is sampled from the replay memory and used as observations to train
the Q-network, which approximates the action-value function. The intuition behind experience replay
is that it achieves stability by breaking the temporal dependency among the observations used to train
the deep neural network.
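To make this mechanism concrete, the following is a minimal sketch of such a replay memory. It is not
the implementation used in DQN; the buffer capacity, batch size, and transition format are illustrative
assumptions.

```python
import random
from collections import deque

class ReplayMemory:
    """Stores (state, action, reward, next_state) transitions from the MDP trajectory."""

    def __init__(self, capacity=100_000):
        # Bounded buffer: the oldest transitions are evicted once capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size=32):
        # Uniform sampling over stored transitions breaks the temporal dependency
        # among consecutive observations, which is the stabilizing effect described above.
        return random.sample(list(self.buffer), batch_size)
```

Each mini-batch returned by sample is then treated as a batch of regression observations for training
the Q-network.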
Second, in addition to the aforementioned Q-network, DQN uses another neural network, named the
target network, to obtain an unbiased estimator of the mean-squared Bellman error used in training
the Q-network. The target network is synchronized with the Q-network periodically, after a fixed
number of iterations, which leads to a coupling between the two networks. Moreover, even if we fix
the target network and focus on updating the Q-network, the subproblem of training a neural network
remains less well understood in theory.
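As a hedged illustration of this coupling, the sketch below uses PyTorch; the network architecture,
optimizer, learning rate, and synchronization period are arbitrary choices for exposition and are not
taken from the paper.

```python
import copy
import torch
import torch.nn as nn

# Illustrative Q-network for a toy problem with 4-dimensional states and 2 actions.
q_network = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_network = copy.deepcopy(q_network)
optimizer = torch.optim.Adam(q_network.parameters(), lr=1e-3)
gamma, sync_period = 0.99, 1000

def train_step(step, states, actions, rewards, next_states):
    # states, next_states: float tensors of shape (batch, 4); actions: long tensor (batch,).
    # Regression target y = r + gamma * max_a' Q_target(s', a'), computed with the
    # frozen target network so that the target does not move while Q is updated.
    with torch.no_grad():
        targets = rewards + gamma * target_network(next_states).max(dim=1).values
    q_sa = q_network(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q_sa, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Periodic synchronization: the target network is replaced by a copy of the
    # current Q-network every sync_period gradient steps.
    if step % sync_period == 0:
        target_network.load_state_dict(q_network.state_dict())
```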
In this paper, we focus on a slight simplification of DQN that is amenable to theoretical analysis
while fully capturing the above two aspects. Specifically, we replace the technique of experience
replay with an independence assumption, and we focus on deep neural networks with rectified linear
units (ReLU) (Nair and Hinton, 2010) and a large batch size. Under this setting, DQN reduces to the
neural fitted Q-iteration (FQI) algorithm (Riedmiller, 2005), and the technique of target networks can
be cast as value iteration. More importantly, by adapting approximation results for ReLU networks to
the analysis of the Bellman operator, we establish the algorithmic and statistical rates of convergence
for the iterative policy sequence obtained by DQN. As we show in the main results, the statistical
error characterizes the bias and variance that arise from approximating the action-value function
with a neural network, while the algorithmic error decays geometrically to zero as the number of
iterations goes to infinity.
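A schematic rendering of this fitted Q-iteration view is given below. The names sample_batch and
fit_relu_network are hypothetical placeholders for i.i.d. sampling and for an idealized least-squares
fit with a ReLU network; the sketch is meant only to show where the value-iteration structure enters,
not to reproduce the analyzed algorithm.

```python
def neural_fqi(sample_batch, fit_relu_network, num_iterations=100, gamma=0.99):
    """Fitted Q-iteration with a function approximator, written schematically.

    sample_batch() -> list of (s, a, r, s_next, next_actions) tuples, assumed i.i.d.
                      (the independence assumption standing in for experience replay).
    fit_relu_network(inputs, targets) -> fitted function q(s, a)
                      (an idealized least-squares fit with a ReLU network).
    """
    q = lambda s, a: 0.0                         # Q_0: arbitrary initialization
    for _ in range(num_iterations):
        batch = sample_batch()
        # One application of the sample Bellman optimality operator:
        #   y_i = r_i + gamma * max_a q(s'_i, a).
        inputs = [(s, a) for (s, a, r, s_next, next_actions) in batch]
        targets = [r + gamma * max(q(s_next, b) for b in next_actions)
                   for (s, a, r, s_next, next_actions) in batch]
        # Keeping q fixed while fitting the new network is the role played by the
        # target network; each outer iteration is one value-iteration step.
        q = fit_relu_network(inputs, targets)
    return q
```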
Furthermore, we extend DQN to two-player zero-sum Markov games (Shapley, 1953). The proposed
algorithm, named Minimax-DQN, can be viewed as a combination of the Minimax-Q learning algorithm
for tabular zero-sum Markov games (Littman, 1994) and deep neural networks for function
approximation. Compared with DQN, the main difference lies in how the target values are computed.
In DQN, the target is computed via maximization over the action space. In contrast, in Minimax-DQN
the target is computed by solving for the Nash equilibrium of a zero-sum matrix game, which can be
obtained efficiently via linear programming. Despite this difference, both methods can be viewed as
approximately applying the Bellman operator to the Q-network. Thus, borrowing the analysis of DQN,
we also establish theoretical results for Minimax-DQN. Specifically, we quantify the suboptimality of
the policy returned by the algorithm by the difference between the action-value functions associated
with this policy and with the Nash equilibrium policy of the Markov game. For this notion of
suboptimality, we establish both the algorithmic and statistical rates of convergence, which imply
that the action-value function converges to its optimal counterpart, up to an unimprovable statistical
error, at a geometric rate.
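To illustrate the inner step that distinguishes Minimax-DQN from DQN, here is a sketch of solving a
zero-sum matrix game by linear programming with scipy. The function name and interface are ours; in
the algorithm, the payoff matrix would be filled with target-network values at the next state rather
than arbitrary numbers.

```python
import numpy as np
from scipy.optimize import linprog

def matrix_game_value(payoff):
    """Value and maximin strategy of a zero-sum matrix game via linear programming.

    payoff[i, j] is the reward to the row (maximizing) player when the row player
    plays i and the column player plays j.
    """
    n_rows, n_cols = payoff.shape
    # Decision variables z = (x_1, ..., x_n, v): a mixed strategy x over row actions
    # and the guaranteed value v.  Maximizing v means minimizing -v.
    c = np.zeros(n_rows + 1)
    c[-1] = -1.0
    # For every column action j:  sum_i payoff[i, j] * x_i >= v,
    # rewritten as  -payoff^T x + v <= 0.
    A_ub = np.hstack([-payoff.T, np.ones((n_cols, 1))])
    b_ub = np.zeros(n_cols)
    # x must be a probability distribution over row actions.
    A_eq = np.append(np.ones(n_rows), 0.0).reshape(1, -1)
    b_eq = np.array([1.0])
    bounds = [(0, None)] * n_rows + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1], res.x[:-1]   # game value and maximin strategy
```

In the target computation, the returned game value plays the role that the maximum over actions
plays in the DQN target.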
Our contribution is three-fold. First, we establish the algorithmic and statistical errors of the
neural FQI algorithm, which can be viewed as a slight simplification of DQN. Under mild assumptions,
our results show that the proposed algorithm obtains a sequence of Q-networks that geometrically
converges to the optimal action-value function up to an intrinsic statistical error induced by the
approximation bias of the ReLU network and the finite sample size. Second, as a byproduct, our analysis
justifies the techniques of experience replay and target network used in DQN, where the latter can
be viewed as a single step of the value iteration. Third, we propose the Minimax-DQN algorithm
that extends DQN to two-player zero-sum Markov games. Borrowing the analysis for DQN, we
establish the algorithmic and statistical convergence rates of the action-value functions associated
with the sequence of policies returned by the Minimax-DQN algorithm.

References
Leemon Baird. Residual algorithms: Reinforcement learning with function approximation. In
Machine Learning Proceedings 1995, pages 30–37. Elsevier, 1995.

Hongshen Chen, Xiaorui Liu, Dawei Yin, and Jiliang Tang. A survey on dialogue systems: Recent
advances and new frontiers. arXiv preprint arXiv:1711.01731, 2017.

Jens Kober and Jan Peters. Reinforcement learning in robotics: A survey. In Reinforcement Learn-
ing, pages 579–610. Springer, 2012.

Long-Ji Lin. Self-improving reactive agents based on reinforcement learning, planning and teach-
ing. Machine Learning, 8(3-4):293–321, 1992.

Michael L Littman. Markov games as a framework for multi-agent reinforcement learning. In
Machine Learning Proceedings 1994, pages 157–163. Elsevier, 1994.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Belle-
mare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level
control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim
Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement
learning. In International Conference on Machine Learning, pages 1928–1937, 2016.

Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted Boltzmann machines.
In International Conference on Machine Learning, pages 807–814, 2010.

Martin Riedmiller. Neural fitted Q iteration–first experiences with a data efficient neural reinforce-
ment learning method. In European Conference on Machine Learning, pages 317–328. Springer,
2005.

John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region
policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.

Lloyd S Shapley. Stochastic games. Proceedings of the National Academy of Sciences, 39(10):
1095–1100, 1953.

David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche,
Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering
the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez,
Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of Go
without human knowledge. Nature, 550(7676):354, 2017.

Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient
methods for reinforcement learning with function approximation. In Advances in Neural Infor-
mation Processing Systems, pages 1057–1063, 2000.

Christopher JCH Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.
