CSIE 5439 — Deep Reinforcement Learning
Deep Reinforcement Learning
Lecture 6 — Deep Q-Network and Its Variants
National Taiwan University
Department of Computer Science and Information Engineering
Prof. Chun-Yi Lee
1
Outline
• Announcement
• Deep Q-Learning
• The Variants of DQN
2
Announcement
• Assignment 1 deadline has been extended (due on 3/24 (Mon) 23:59)
• Assignment 2 has been released on NTU Cool (due on 4/7 (Mon)
23:59)
• Individual assignment
• Check the discussion forum on NTU Cool before posting your
question, as your question might have already been addressed there
3
Outline
• Announcement
• Deep Q-Learning
• The Variants of DQN
4
Game Playing: The Playground of Reinforcement Learning
Atari-2600: From Pixels to Performance
• Atari-2600: raw image inputs (frames of size 210 × 160 × 3)
• High-dimensional state space
• Leverages the success of deep neural networks (DNNs)
Why are Atari games ideal test environments?
• Discrete, manageable action space (4-18 actions) simplifies the learning
problem
• Clear, immediate reward signals through game scores provide
unambiguous feedback
• Standardized environments enable reliable comparison of different algorithms
5
Recall: Q-Learning
The Foundation of Value-Based RL
• Q-learning with a function approximator Qw parameterized by w
• Choose the next action using the greedy policy for the next state
• Algorithm:
1. Collect data samples {st, at, rt, st+1} by a policy π
2. Calculate the update target: yt = rt + γ max_a Qw(st+1, a)
3. Update parameters: w ← w + α (yt − Qw(st, at)) ∇w Qw(st, at)
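A minimal sketch of this update for a single transition, assuming a small PyTorch network as the approximator Qw (the network shape, learning rate, and function name below are illustrative, not from the lecture):

```python
import torch
import torch.nn as nn

# Illustrative approximator Qw: any differentiable network with one output per action works.
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.SGD(q_net.parameters(), lr=1e-3)   # learning rate alpha
gamma = 0.99

def q_learning_step(s_t, a_t, r_t, s_t1):
    # Step 2: y_t = r_t + gamma * max_a Qw(s_{t+1}, a); the target is treated as a constant.
    with torch.no_grad():
        y_t = r_t + gamma * q_net(s_t1).max()
    # Step 3: minimizing 0.5 * (Qw(s_t, a_t) - y_t)^2 performs
    #         w <- w + alpha * (y_t - Qw(s_t, a_t)) * grad_w Qw(s_t, a_t)
    q_sa = q_net(s_t)[a_t]
    loss = 0.5 * (q_sa - y_t) ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Here s_t and s_t1 are float feature tensors and a_t is the integer index of the action taken by the behavior policy π.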
6
The Data Challenge in Q-Learning
Addressing Correlation in Sequential Decision Making
• First step: Collect data samples {st, at, rt, st+1} by a policy π
{st, at, rt, st+1} → {st+1, at+1, rt+1, st+2} → ... (strongly correlated data)
• Issue: correlated data may not be good for training DNNs
7
The Hidden Challenge of Q-Learning
Breaking Sequential Dependencies in Reinforcement Learning
How Does RL Data Differ from Supervised Learning?
• Sequential dependency: RL samples are temporally correlated
while supervised learning assumes i.i.d. data
• Policy-dependent distribution: The agent's improving policy
continuously shifts the data distribution
• Delayed rewards: Value of actions may not be apparent until
many steps later, unlike immediate labels in supervised learning
8
Deep Q-Network (DQN)
The Pioneering Algorithm Using DNNs in RL
• The breakthrough that changed the field
• First introduced by DeepMind in the paper “Human-level control through
deep reinforcement learning” (Link)
9
The DQN Innovation Toolkit
Key Components That Enabled Success
• Parameterize the Q-function with a DNN
• Enhance data efficiency with an experience replay buffer
• Enhance the performance by introducing two Q-functions
• Qθ is the learning Q-function
• Qθ− is the target Q-function
• Modification of the input state representations
10
Deep Q-Network (DQN)
The Overall Framework of DQN
[Diagram: stacked grayscale images → DQN → action outputs of the game]
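The DNN in this diagram can be sketched roughly as the convolutional network described in the DQN paper; the class name and the input size (a stack of four 84 × 84 grayscale frames) below follow that paper, but treat this as an illustrative sketch rather than the exact course implementation:

```python
import torch.nn as nn

class DQNNetwork(nn.Module):
    """Maps a stack of 4 grayscale 84x84 frames to one Q-value per action."""
    def __init__(self, num_actions):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, num_actions),   # one Q-value per discrete action
        )

    def forward(self, x):                  # x: (batch, 4, 84, 84), values scaled to [0, 1]
        return self.head(self.features(x))
```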
11
The Concept of the Experience Replay Buffer
The Memory of Reinforcement Learning
• How do agents remember and learn from past experiences?
• Regularly storing data into a buffer
Replay Buffer
{st, at, rt, st+1}
12
Store, Replace, and Repeat
Update the Contents of the Experience Replay Buffer
• Managing memory for efficient learning (typically FIFO)
• Regularly storing data into a buffer
[Diagram: when the replay buffer is full, new samples {st, at, rt, st+1} replace the oldest entries]
13
Why Experience Replay Matters
Breaking the Correlation Curse ⚡
Three Key Advantages That Revolutionized Deep RL
• For sampling: For each timestep, sample multiple experience
transitions from the replay buffer (see the sketch below)
• For reducing correlation: Reduce the chances of
correlated data (i.e., the data samples are not sequential)
• For efficiency: Data samples can be used multiple times
(high data-efficiency)
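A minimal replay buffer sketch that matches these slides, using a fixed-capacity FIFO deque; the class and method names are illustrative:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity FIFO store of transitions {st, at, rt, st+1}."""
    def __init__(self, capacity=100_000):
        # deque(maxlen=...) drops the oldest transition once the buffer is full
        self.buffer = deque(maxlen=capacity)

    def store(self, s_t, a_t, r_t, s_t1):
        # In practice a terminal flag is usually stored too; omitted here to match the slides.
        self.buffer.append((s_t, a_t, r_t, s_t1))

    def sample(self, batch_size):
        # Uniform sampling breaks the temporal ordering of the collected data
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

Each timestep appends one transition and each update samples a minibatch, so consecutive, strongly correlated samples are no longer fed to the network back to back, and every transition can be reused many times.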
14
The Concept of the Target Network
Stabilizing the Learning Process
• Training with two Q-functions
• Modify the training target to yi = rt + γ max_a Qθ−(st+1, a)
[Diagram: the agent's learning network Qθ is updated toward the target yi, which is computed with the separate target network Qθ−]
15
Why Two Networks Provide Stability
Update Mechanism of the Target Network
• Periodically update the target network by copying the parameters from the training
network (see the sketch after this slide)
• Hard approach: copy the parameters directly, θ− ← θ
• Soft approach: copy gradually, θ− ← τ θ− + (1 − τ) θ
• Benefit: Stabilize the training target (i.e., the target does not change at every step)
• Objective: Stabilize the training process
• Other key enhancements:
• Frame skipping
• Stacked input frames
16
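A minimal sketch of the two update schemes in PyTorch, assuming q_net holds the learning network Qθ and target_net the target network Qθ− (both names illustrative); note that with the slide's convention τ is close to 1:

```python
import torch

def hard_update(target_net, q_net):
    # Hard approach: theta_minus <- theta, copied all at once every few training steps
    target_net.load_state_dict(q_net.state_dict())

@torch.no_grad()
def soft_update(target_net, q_net, tau=0.995):
    # Soft approach: theta_minus <- tau * theta_minus + (1 - tau) * theta, applied every step
    for p_target, p in zip(target_net.parameters(), q_net.parameters()):
        p_target.mul_(tau).add_((1.0 - tau) * p)
```

With the hard scheme the target yi is frozen between copies; with the soft scheme θ− becomes a slowly moving average of θ. Either way the training target changes slowly, which is the stabilization the slide describes.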
Deep Q-Network (DQN)
The Entire Workflow of the DQN Framework
[Diagram: the agent observes st and rt from the game, selects at, stores the transition {st, at, rt, st+1} in the replay buffer, and samples N transitions {si, ai, ri, si+1} from the buffer for each update]
17
Deep Q-Network (DQN)
The Pseudocode of DQN
• Periodically update the target network: θ− ← θ
• For each timestep, store the data sample {st, at, rt, st+1} collected by the policy π into the
experience replay buffer Z
• Sample N data entries {si, ai, ri, si+1} from the experience replay buffer Z
• Derive the update target and update the parameters:
yi = ri + γ max_a Qθ−(si+1, a)
θ ← θ + α (1/N) Σ_i (yi − Qθ(si, ai)) ∇θ Qθ(si, ai)
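Putting the pieces together, a hedged sketch of one such update step, reusing the illustrative ReplayBuffer, DQNNetwork, and hard_update from the earlier sketches (batch size, discount factor, and the handling of terminal states are simplified to match the slides):

```python
import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, buffer, optimizer, batch_size=32, gamma=0.99):
    """One DQN parameter update from N transitions sampled out of the replay buffer."""
    batch = buffer.sample(batch_size)
    s, a, r, s_next = zip(*batch)                     # unpack N transitions {si, ai, ri, si+1}
    s      = torch.stack(s)                           # (N, 4, 84, 84) state tensors
    a      = torch.tensor(a, dtype=torch.int64)       # (N,) action indices
    r      = torch.tensor(r, dtype=torch.float32)     # (N,) rewards
    s_next = torch.stack(s_next)

    # yi = ri + gamma * max_a Q_theta_minus(si+1, a), computed with the frozen target network.
    # (Terminal transitions would normally zero out the bootstrap term; omitted as in the slides.)
    with torch.no_grad():
        y = r + gamma * target_net(s_next).max(dim=1).values

    # Q_theta(si, ai) for the actions that were actually taken
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)

    # Up to a constant factor, this gradient step realizes
    # theta <- theta + alpha * (1/N) * sum_i (yi - Q_theta(si, ai)) * grad Q_theta(si, ai)
    loss = F.mse_loss(q_sa, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Calling hard_update(target_net, q_net) periodically between such updates completes the loop: θ− ← θ.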
18
Outline
• Announcement
• Deep Q-Learning
• The Variants of DQN
19
DQN Variants
Multiple Variants for Improving DQN
• Double DQN
• Dueling DQN
• Prioritized Experience Replay for DQN
• Deep Recurrent Q-Network (DRQN)
• Rainbow DQN
20