DRL Final Notes
Session #1:
Introduction to the Course
Instructors :
1. Prof. S. P. Vimal (vimalsp@[Link]),
2. Prof. Sangeetha Viswanathan ([Link]@[Link])
Course Objectives:
1. Understand
a. the conceptual, mathematical foundations of deep reinforcement learning
b. various classic & state of the art Deep Reinforcement Learning algorithms
2. Implement and evaluate deep reinforcement learning solutions to problems such as
planning, control and decision making in various domains
3. Provide conceptual, mathematical and practical exposure on DRL
a. to understand the recent developments in deep reinforcement learning and
b. to enable modelling new problems as DRL problems.
Learning Outcomes
4
Course Operation
• Instructors
Prof. [Link], Prof. Chandra Sekar
Prof. Bharatesh, Prof. SK. Karthika
• Textbooks
1. Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G.
Barto, Second Ed. , MIT Press
2. Foundations of Deep Reinforcement Learning: Theory and Practice in Python
(Addison-Wesley Data & Analytics Series) 1st Edition by Laura Graesser and
Wah Loon Keng
5
Course Operation
• Evaluation
Two quizzes, 5% each; the best score will be taken for 5% in the final grading.
Whatever the points set for a quiz, the score will be scaled to 5%.
No makeup quizzes, for any reason. Make sure to attend at least one of the quizzes.
Two Assignments - Tensorflow/ Pytorch / OpenAI Gym Toolkit → 25 %
Assignment 1: Partially Numerical +Implementation of Classic Algorithms –
10%
Assignment 2: Deep Learning based RL
Mid Term Exam - 30% [Only to be written in A4 sheets, scanned and uploaded]
Comprehensive Exam - 40% [Only to be written in A4 sheets, scanned and uploaded]
• Webinars/Tutorials
4 tutorials : 2 before mid-sem & 2 after mid-sem
• Teaching Assistants – Will be introduced in the upcoming classes.
6
Course Operation
• How to reach us? (for any question on lab aspects, availability of slides on the portal, quiz
availability, assignment operations)
Prof. [Link] - vimalsp@[Link]
Prof. SK. Karthika – Karthika@[Link]
• Plagiarism
All submissions for graded components must be the result of your original effort. It is strictly
prohibited to copy and paste verbatim from any sources, whether online or from your peers.
The use of unauthorized sources or materials, as well as collusion or unauthorized
collaboration to gain an unfair advantage, is also strictly prohibited. Please note that we will
not distinguish between the person sharing their resources and the one receiving them for
plagiarism, and the consequences will apply to both parties equally.
Suspicious circumstances, such as identical verbatim answers or a significant overlap of
unreasonable similarities across a set of submissions, will be investigated,
and severe punishments will be imposed on all those found guilty of plagiarism.
7
Reinforcement Learning
1. A model of the environment is known, but an analytic solution is not available;
2. Only a simulation model of the environment is given (the subject of simulation-based
optimization);
3. The only way to collect information about the environment is to interact with it.
8
(Deep) Reinforcement Learning
11
12
Types of Reinforcement Learning
Negative Reinforcement - strengthening of behavior because a negative
condition is stopped or avoided. Advantages of negative reinforcement
learning:
•Increases behavior
•Provides defiance of a minimum standard of performance
•It only provides enough to meet the minimum behavior
13
14
Elements of Reinforcement Learning
Beyond the agent and the environment, one can identify four main sub-
elements of a reinforcement learning system: a policy, a reward signal, a value
function, and, optionally, a model of the environment.
15
•Agent
- An entity that tries to learn the best way to perform a specific task.
- In our example, the child is the agent who learns to ride a bicycle.
•Action (A) -
- What the agent does at each time step.
- In the example of a child learning to walk, the action would be “walking”.
- A is the set of all possible moves.
- In video games, the list might include running right or left, jumping high or
low, crouching or standing still.
16
Elements of Reinforcement Learning
•State (S)
- Current situation of the agent.
- After performing an action, the agent can move to different states.
- In the example of a child learning to walk, the child can take the action of
taking a step and move to the next state (position).
•Rewards (R)
- Feedback that is given to the agent based on the action of the agent.
- If the action of the agent is good and leads to a win or a positive outcome,
then a positive reward is given, and vice versa.
17
•Environment
- Outside world of an agent or physical world in which the agent operates.
•Discount factor
- The discount factor is multiplied by future rewards as discovered by the
agent in order to dampen these rewards’ effect on the agent’s choice of
action.
- Why? It is designed to make future rewards worth less than immediate
rewards.
18
Reinforcement Learning - Definition
19
21
•Policy (π)
- The policy decides which action the agent will take in the current state.
- It gives the probability that the agent will select a specific action from a specific state.
- A policy is a function that maps a given state to the probabilities of selecting each
possible action from that state.
•If at time t an agent follows policy π, then π(a|s) is the probability that the action at
time step t is At = a given that the state at time step t is St = s. That is, under policy π,
the probability that the agent takes action a in state s at time t is π(a|s).
22
Elements of Reinforcement Learning
•Value Functions
- A simple measure of how good it is for an agent to be in a given state, or
how good it is for the agent to perform a given action in a given state.
•Two types
- State- Value function
- Action-Value function
23
•State-value function
- The state-value function for policy π denoted as vπ determines the goodness
of any given state for an agent who is following policy π.
- This function gives us the value which is the expected return starting from
state s at time step t and following policy π afterward.
24
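In symbols (following Sutton & Barto's notation), the two value functions described here and on the next slides can be written as:

v_\pi(s) = \mathbb{E}_\pi\left[ G_t \mid S_t = s \right] = \mathbb{E}_\pi\Big[ \textstyle\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \;\Big|\; S_t = s \Big]
q_\pi(s, a) = \mathbb{E}_\pi\left[ G_t \mid S_t = s, A_t = a \right]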
Elements of Reinforcement Learning
25
26
(Deep) Reinforcement Learning
27
From OpenAI
28
An example scenario - Tic-Tac-Toe
Two players take turns playing on a three-by-three board. One player plays Xs and
the other Os until one player wins by placing three marks in a row, horizontally,
vertically, or diagonally
Assumptions
- playing against an imperfect player, one whose play is sometimes incorrect and
allows you to win
Aim
How might we construct a player that will find the imperfections in its
opponent’s play and learn to maximize its chances of winning?
29
•We set up a table of numbers, one for each possible state of the game. Each
number will be the latest estimate of the probability of our winning from that
state.
•We treat this estimate as the state’s value, and the whole table is the learned
value function.
•State A has higher value than state B, or is considered better than state B, if the
current estimate of the probability of our winning from A is higher than it is from B.
30
Reinforcement Learning for Tic-Tac-Toe
31
32
Reinforcement Learning for Tic-Tac-Toe
● Once a game is started, our agent computes all possible actions it can take in
the current state and the new states which would result from each action.
● The values of these states are collected from a state_value vector, which
contains values for all possible states in the game.
● The agent can then choose the action which leads to the state with the highest
value (exploitation), or choose a random action (exploration), depending on the
value of epsilon.
33
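A minimal sketch of this move-selection and value-update loop in Python. The names state_value, epsilon and alpha follow the slides; next_state_of is an assumed helper that returns the board reached by a move, and the initial value 0.5 follows the table on the next slides.

import random

def choose_move(state, possible_moves, next_state_of, state_value, epsilon=0.1):
    """Pick a move epsilon-greedily with respect to the current value estimates."""
    if random.random() < epsilon:
        return random.choice(possible_moves)              # explore
    return max(possible_moves,
               key=lambda m: state_value.get(next_state_of(state, m), 0.5))  # exploit

def td_update(state_value, state, next_state, alpha=0.1):
    """Move the value of the earlier state a fraction alpha towards the later state's value."""
    v_s = state_value.get(state, 0.5)
    v_next = state_value.get(next_state, 0.5)
    state_value[state] = v_s + alpha * (v_next - v_s)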
Thank you
34
Deep Reinforcement Learning
2022-23 Second Semester, [Link] (AIML)
Session #2-3:
Multi-armed Bandits
Instructors :
1. Prof. S. P. Vimal (vimalsp@[Link]),
2. Prof. Sangeetha Viswanathan ([Link]@[Link])
1
• Recap
• k-armed Bandit Problem & its significance
• Action-Value Methods
Sample Average Method & Incremental Implementation
• Non-stationary Problem
• Initial Values & Action Selection
• Gradient Bandit Algorithms [ Class #3 ]
• Associative Search [ Class #3 ]
Acknowledgements: Some of the slides were adopted with permission from the course CSCE-689 (Texas A&M University) by Prof. Guni Sharon 2
An example scenario - Tic-Tac-Toe
Two players take turns playing on a three-by-three board. One player plays Xs and
the other Os until one player wins by placing three marks in a row, horizontally,
vertically, or diagonally
Assumptions
- playing against an imperfect player, one whose play is sometimes incorrect and
allows you to win
Aim
How might we construct a player that will find the imperfections in its
opponent’s play and learn to maximize its chances of winning?
•We set up a table of numbers, one for each possible state of the game. Each
number will be the latest estimate of the probability of our winning from that
state.
•We treat this estimate as the state’s value, and the whole table is the learned
value function.
•State A has higher value than state B, or is considered better than state B, if the
current estimate of the probability of our winning from A is higher than it is from B.
4
Reinforcement Learning for Tic-Tac-Toe
6
Reinforcement Learning for Tic-Tac-Toe
● Once a game is started, our agent computes all possible actions it can take in
the current state and the new states which would result from each action.
● The values of these states are collected from a state_value vector, which
contains values for all possible states in the game.
● The agent can then choose the action which leads to the state with the highest
value (exploitation), or choose a random action (exploration), depending on the
value of epsilon.
Tic-Tac-Toe
8
Tic-Tac-Toe
[Table: example states with their initial value estimates - 0.5 for ordinary states, 1.0 for states where we have already won, 0 for states where we have lost or drawn]
Tic-Tac-Toe (prev. class)
𝜶 - Step Size Parameter
Questions:
(1) What happens if 𝜶 is gradually made to 0 over many
games with the opponent?
(2) What happens if 𝜶 is gradually reduced over many
games, but never made 0?
(3) What happens if 𝜶 is kept constant throughout its
life time?
Tic-Tac-Toe (prev. class)
Key Takeaways:
(1) Learning while interacting with the
environment (opponent).
(2) We have a clear goal
(3) Our policy is to make moves that maximize our
chances of reaching the goal
○ Use the values of states most of the time
(exploitation) and explore the rest of the time.
Reading Assigned:
Identify how this reinforcement learning solution is
different from solutions using minimax algorithm
and genetic algorithms.
Post your answers in the discussion forum;
K-armed Bandit Problem
Problem
• You are faced repeatedly with a choice among k different options, or actions
• After each choice of actions you receive a numerical reward
Reward is chosen from a stationary probability distribution that depends on the selected
action
• Objective : to maximize the expected total reward over some time period
15
Strategy:
● Identify the best lever(s)
● Keep pulling the identified ones
Questions:
● How do we define the best ones?
● What are the best levers?
Note: If you knew the value of each action, then it would be trivial to solve the k -armed bandit
problem: you would always select the action with highest value :-)
K-armed Bandit Problem
Actions that appear inferior according to the value estimates up to time t could in fact be better than
the greedy action at time t!
K-armed Bandit Problem
Greedy Action
K-armed Bandit Problem
ε-Greedy Action Selection / near-greedy action selection
● In the limit as the number of steps increases, every action will be sampled by ε-greedy action
selection an infinite number of times. This ensures that all the Qt (a) converge to q* (a).
● Easy to implement / optimize for epsilon / yields good results
29
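A minimal sketch of ε-greedy selection over the current estimates (variable names are illustrative):

import random

def epsilon_greedy(Q, epsilon):
    """Q: list of current action-value estimates. Returns an action index."""
    if random.random() < epsilon:
        return random.randrange(len(Q))                 # explore: pick a random action
    return max(range(len(Q)), key=lambda a: Q[a])       # exploit: pick the greedy action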
Ex-1: In ε-greedy action selection, for the case of two actions and ε
= 0.5, what is the probability that the greedy action is selected?
30
Ex-1: In ε-greedy action selection, for the case of two actions and ε
= 0.5, what is the probability that the greedy action is selected?
p (greedy action)
= p (greedy action AND greedy selection ) + p (greedy action AND random selection )
= p (greedy action | greedy selection ) p ( greedy selection )
+ p (greedy action | random selection ) p (random selection )
= p (greedy action | greedy selection ) (1-𝛆) + p (greedy action | random selection ) (𝛆)
= p (greedy action | greedy selection ) (0.5) + p (greedy action | random selection ) (0.5)
= (1) (0.5) + (0.5) (0.5)
= 0.5 + 0.25
= 0.75
31
10-armed Testbed
Example:
• A set of 2000 randomly generated k -armed
bandit problems with k = 10
• Action values were selected according to a
normal (Gaussian) distribution with mean 0
and variance 1.
• While selecting action At at time step t, the
actual reward, Rt , was selected from a
normal distribution with mean q*(At ) and
variance 1
• One Run : Apply a method for 1000 time
steps to one of the bandit problems
• Perform 2000 independent runs, each with a different bandit problem, to obtain a
measure of the learning algorithm's average behaviour.
[Figure: An example bandit problem from the 10-armed testbed]
32
Average performance of 𝛆-greedy action-value methods on
the 10-armed testbed
33
34
Discussion on Exploration vs. Exploitation
35
Ex-2:
Consider applying to this problem a bandit algorithm using 𝛆-greedy action selection, sample-
average action-value estimates, and initial estimates of Q1 (a) = 0, for all a.
On some of these time steps the 𝛆 case may have occurred, causing an action to be selected at
random.
On which time steps did this definitely occur? On which time steps could this possibly have
occurred?
Incremental Implementation
• Efficient approach to
compute the estimate of
action-value;
40
Incremental Implementation
Note:
● StepSize decreases with each update
● We use 𝛂 or 𝛂t(a) to denote step size (constant /
varies with each step)
Discussion:
Const vs. Variable step size?
41
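The update on these slides has the general form NewEstimate ← OldEstimate + StepSize · (Target − OldEstimate). A minimal sketch of both variants - the sample-average step size 1/N(a), and a constant α (an exponential recency-weighted average, useful for the non-stationary case discussed next); names are illustrative:

def update_sample_average(Q, N, action, reward):
    """Incremental sample-average: step size 1/N(a), no need to store past rewards."""
    N[action] += 1
    Q[action] += (reward - Q[action]) / N[action]

def update_constant_alpha(Q, action, reward, alpha=0.1):
    """Constant step size: recent rewards weigh more, which suits non-stationary problems."""
    Q[action] += alpha * (reward - Q[action])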
42
Non-stationary Problem
43
Non-stationary Problem
44
45
46
Optimistic Initial Values
• All the above discussed methods are biased by their initial estimates
• For sample average method the bias disappears once all actions have been selected at
least once
• For methods with constant α, the bias is permanent, though decreasing over time
• Initial action values can also be used as a simple way of encouraging exploration.
• In the 10-armed testbed, set the initial estimates to +5 rather than 0.
This can encourage action-value methods to explore.
Whichever actions are initially selected, the reward is less than the starting estimates;
the learner switches to other actions, being disappointed with the rewards it is receiving.
The result is that all actions are tried several times before the value estimates converge.
47
Caution:
Optimistic Initial Values can
only be considered as a simple
trick that can be quite effective
on stationary problems, but it is
far from being a generally
useful approach to encouraging
exploration.
Question:
Explain how in the non-
stationary scenario the
optimistic initial values will fail
(to explore adequately).
49
Measure of Uncertainty
50
UCB often performs well, as shown here, but is more difficult than ε-greedy to extend beyond bandits to the more
general reinforcement learning settings
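As a reference, the UCB rule from Sutton & Barto selects A_t = argmax_a [ Q_t(a) + c · sqrt(ln t / N_t(a)) ]. A minimal sketch, with untried actions selected first (names are illustrative):

import math

def ucb_select(Q, N, t, c=2.0):
    """Q: value estimates, N: selection counts, t: current time step (t >= 1)."""
    for a, n in enumerate(N):
        if n == 0:
            return a                      # an untried action is maximally uncertain
    return max(range(len(Q)),
               key=lambda a: Q[a] + c * math.sqrt(math.log(t) / N[a]))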
Policy-based algorithms
55
Softmax function
56
Softmax function
57
Gradient ascent
• ?
58
Update Rule
59
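The update rule itself is on the slide image and is not reproduced above. For reference, the gradient-bandit update from Sutton & Barto keeps a numerical preference H(a) per action, converts preferences into a softmax policy, and nudges them using the reward relative to a running average baseline. A minimal sketch; reward_fn is an assumed callable returning the sampled reward for an action.

import math, random

def softmax(H):
    exps = [math.exp(h - max(H)) for h in H]    # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def gradient_bandit_step(H, baseline, t, reward_fn, alpha=0.1):
    """One step: sample an action from the softmax policy, then update the preferences H.
    t is the 1-based time step, used to keep the baseline an incremental average reward."""
    pi = softmax(H)
    a = random.choices(range(len(H)), weights=pi)[0]
    r = reward_fn(a)
    baseline += (r - baseline) / t
    for b in range(len(H)):
        indicator = 1.0 if b == a else 0.0
        H[b] += alpha * (r - baseline) * (indicator - pi[b])
    return baseline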
60
What did we learn?
61
A different scenario
62
Required Readings
63
Thank you !
64
Deep Reinforcement Learning
2023-24 Second Semester, [Link] (AIML)
Session #4:
Markov Decision Processes
Policy-based algorithms
2
Softmax function
Softmax function
4
Gradient ascent
• ?
Update Rule
A different scenario
Announcement !!!
We have our Teaching Assistants now !!! You will see their names in the course home
page.
Acknowledgements: Some of the slides were adopted with permission from the course CSCE-689 (Texas A&M University) by Prof. Guni Sharon 10
Agent-Environment Interface
11
● A maze-like problem
○ The agent lives in a grid
○ Walls block the agent’s path
● Noisy movement: actions do not always go as planned
○ 80% of the time, the action North takes the agent North
(if there is no wall there)
○ 10% of the time, North takes the agent West; 10% East
○ If there is a wall in the direction the agent would have
been taken, the agent stays put
● The agent receives rewards each time step
○ -0.1 per step (battery loss)
○ +1 if arriving at (4,3) ; -1 for arriving at (4,2) ;-1 for
arriving at (2,2)
● Goal: maximize accumulated rewards
12
Markov Decision Processes
● An MDP is defined by
○ A set of states
○ A set of actions
○ State-transition probabilities
■ The probability of arriving at s' after performing a in s
■ Also called the model dynamics
○ A reward function R(s, a, s')
■ The utility gained from arriving at s' after performing a in s
■ Sometimes just R(s) or even R(s')
○ A start state
○ Maybe a terminal state
13
State-transition probabilities
14
Markov Decision Processes - Discussion
15
16
MDP Formalization : Video Games
● State:
○ raw pixels
● Actions:
○ game controls
● Reward:
○ change in score
● State-transition probabilities:
○ defined by stochasticity in game evolution
Ref: Playing Atari with deep reinforcement learning”, Mnih et al., 2013 17
● State:
○ Current signal assignment (green, yellow,
and red assignment for each phase)
○ For each lane: number of approaching
vehicles, accumulated waiting time,
number of stopped vehicles, and average
speed of approaching vehicles
● Actions:
○ signal assignment
● Reward:
○ Reduction in traffic delay
● State-transition probabilities:
○ defined by stochasticity in approaching
demand
Ref: “Learning an Interpretable Traffic Signal Control Policy”, Ault et al., 2020 18
MDP Formalization : Recycling Robot (Detailed Ex.)
● Robot has
○ sensors for detecting cans
○ arm and gripper that can pick the cans and place in an
onboard bin;
● Runs on a rechargeable battery
● Its control system has components for interpreting sensory
information, for navigating, and for controlling the arm and
gripper
● Task for the RL Agent: Make high-level decisions about how
to search for cans based on the current charge level of the
battery
19
● State:
○ Assume that only two charge levels can be distinguished
○ S = {high, low}
● Actions:
○ A(high) = {search, wait}
○ A(low) = {search, wait, recharge}
● Reward:
○ Zero most of the time, except when securing a can
○ Cans are secured by searching and waiting, but rsearch > rwait
● State-transition probabilities:
○ [Next Slide]
20
MDP Formalization : Recycling Robot (Detailed Ex.)
21
22
Note on Goals & Rewards
● Reward Hypothesis:
All of what we mean by goals and purposes can be well thought of as
the maximization of the expected value of the cumulative sum of a
received scalar signal (called reward).
● The rewards we set up truly indicate what we want accomplished,
○ not the place to impart prior knowledge on how we want it to do
● Ex: Chess Playing Agent
○ If the agent is rewarded for taking the opponent's pieces, the agent might fall
for the opponent's trap.
● Ex: Vacuum Cleaner Agent
○ If the agent is rewarded for each unit of dirt it sucks, it can repeatedly
deposit and suck the dirt for larger reward
23
24
Returns & Episodes
● Generally T = ∞
○ What if the agent receives a reward
of +1 for each timestep?
○ Discounted Return:
Note:
● What if 𝛾 is 0?
● What if 𝛾 is 1?
● Computing discounted rewards incrementally
26
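A minimal sketch of computing discounted returns incrementally (backwards), using G_t = R_{t+1} + γ · G_{t+1}:

def discounted_returns(rewards, gamma=0.9):
    """rewards: [R_1, ..., R_T] from one episode. Returns [G_0, ..., G_{T-1}]."""
    G, returns = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G          # G_t = R_{t+1} + gamma * G_{t+1}
        returns.append(G)
    return list(reversed(returns))

# e.g. discounted_returns([1, 1, 1], gamma=0.5) -> [1.75, 1.5, 1.0]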
Returns & Episodes
27
Policy
28
Defining Value Functions
29
May skip to the next slide !
Bellman Equation for V𝝅
● Dynamic programming equation associated with discrete-time optimization
problems
○ Expressing Vℼ recursively i.e. relating V𝝅(s) to V𝝅(s’) for all s’ ∈ succ(s)
33
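Written out in Sutton & Barto's four-argument dynamics notation, the recursion is:

v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[ r + \gamma\, v_\pi(s') \bigr] \quad \text{for all } s \in \mathcal{S}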
Backup Diagram
34
Understanding V𝝅(s) with Gridworld
Reward:
○ -1 if an action takes agent off the grid
○ Exceptional reward from A and B for all actions taking agent to A’ and B’ resp.
○ 0, everywhere else
36
Understanding V𝝅(s) with Gridworld
37
Ex-1
Recollect the reward function used for Gridworld as below:
○ -1 if an action takes agent off the grid
○ Exceptional reward from A and B for all actions taking agent to A’ and B’ resp.
○ 0, everywhere else
Let us add a constant c ( say 10) to the rewards of all the actions. Will it change
anything?
38
39
40
Optimal Policies and Optimal Value Functions
41
42
Optimal Policies and Optimal Value Functions
45
Notation
•
46
Race car example
•
47
48
Race car example
•
49
Value iteration
50
Value Iteration
[Table: value iteration on the race car example - V0 = (0, 0, 0), V1 = (2, 1, 0), V2 = (3.35, 2.35, 0)]
52
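A minimal sketch of value iteration over a generic finite MDP. The interface is assumed, not taken from the slides: actions(s) returns the actions available in s, and transitions(s, a) returns a list of (probability, next_state, reward) triples.

def value_iteration(states, actions, transitions, gamma=0.9, theta=1e-6):
    """Returns the converged state values V and a greedy policy derived from them."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            q_values = [sum(p * (r + gamma * V[s2]) for p, s2, r in transitions(s, a))
                        for a in actions(s)]
            best = max(q_values) if q_values else 0.0
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break
    policy = {s: max(actions(s), key=lambda a: sum(p * (r + gamma * V[s2])
                                                   for p, s2, r in transitions(s, a)))
              for s in states if actions(s)}
    return V, policy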
53
Infinite Utilities?!
• Problem: What if the game lasts forever? Do we get infinite
rewards?
• Solutions:
• Finite horizon: (similar to depth-limited search)
• Terminate episodes after a fixed T steps (e.g. life)
• Gives nonstationary policies (π depends on time left)
54
Discount factor
•
Geometric series
55
Discounting
• It’s reasonable to maximize the sum of rewards
• It’s also reasonable to prefer rewards now to rewards later
• Discount factor: values of rewards decay exponentially
57
Quiz: Discounting
• Given grid world:
• Actions: East, West, and Exit (‘Exit’ only available in terminal states: a, e)
• Rewards are given only after an exit action
• Transitions: deterministic
• Quiz 1: For γ = 1, what is the optimal policy?
• Quiz 3: For which γ are West and East equally good when in state d?
58
59
60
k = 0, 1, 2, ..., 12, and k = 100
[Figures: Gridworld value estimates after k rounds of value iteration; Noise = 0.2, Discount = 0.9, Living reward = 0]
Problems with Value Iteration
•
75
76
• Asynchronous value iteration
• In value iteration, we update every state in each iteration
• Actually, any sequences of Bellman updates will converge if every state is visited
infinitely often regardless of the visitation order
• Idea: prioritize states whose value we expect to change significantly
77
A single
update per
iteration
78
Double the work?
•
79
80
Q-learning
•
81
82
Issue 3: The policy often converges long
before the values
•
83
Policy Iteration
•
84
Policy Evaluation
•
85
86
Comparison
• Both value iteration and policy iteration compute the same thing (optimal
state values)
• In value iteration:
• Every iteration updates both the values and (implicitly) the policy
• We don’t track the policy, but taking the max over actions implicitly defines it
• In policy iteration:
• We do several passes that update utilities with fixed policies (each pass is fast
because we consider only one action, not all of them)
• After the policy is evaluated, a new policy is chosen (slow like a value iteration
pass)
• The new policy will be better (or we’re done)
87
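A minimal sketch of policy iteration with iterative policy evaluation, using the same assumed actions(s) / transitions(s, a) interface as the value-iteration sketch above:

def policy_iteration(states, actions, transitions, gamma=0.9, theta=1e-6):
    """Alternates policy evaluation (fixed policy) and greedy policy improvement."""
    policy = {s: actions(s)[0] for s in states if actions(s)}
    V = {s: 0.0 for s in states}
    while True:
        # Policy evaluation: only the single action chosen by the current policy is backed up
        while True:
            delta = 0.0
            for s, a in policy.items():
                v_new = sum(p * (r + gamma * V[s2]) for p, s2, r in transitions(s, a))
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                break
        # Policy improvement: greedy one-step lookahead
        stable = True
        for s in policy:
            best = max(actions(s), key=lambda a: sum(p * (r + gamma * V[s2])
                                                     for p, s2, r in transitions(s, a)))
            if best != policy[s]:
                policy[s], stable = best, False
        if stable:
            return V, policy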
89
90
Notation
•
91
Required Readings
92
Thank you
93
Session #6-7:
Monte Carlo Methods
DRL Instructors
1
Agenda for the class
• Introduction
• On-Policy Monte Carlo Methods
• Off-Policy Monte Carlo Methods
Acknowledgements: Some of the slides were adopted with permission from the course CSCE-689 (Texas A&M University) by Prof. Guni Sharon 2
Introduction
3
Introduction
•
Introduction
•
5
(Aside) Offline vs. Online (RL)
9
Acknowledgements : This example is taken from the tutorial by Peter Bodík, RAD Lab, UC Berkeley
Ex-2: First-visit Monte-Carlo Policy Evaluation
[estimate V𝛑(s)]
Input Policy π | Observed Episodes (Training) | Output Values
Episode 1: B, east, C, -1 ; C, east, D, -1 ; D, exit, , +10
Episode 2: B, east, C, -1 ; C, east, D, -1 ; D, exit, , +10
Episode 3: E, north, C, -1 ; C, east, D, -1 ; D, exit, , +10
Episode 4: E, north, C, -1 ; C, east, A, -1 ; A, exit, , -10
Assume: γ = 1
Output values (grid): A = -10, B = +8, C = +4, D = +10, E = -2
10
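A minimal sketch of first-visit Monte-Carlo prediction, consistent with the example above (each episode is given as a list of (state, action, reward) steps; γ = 1 here):

from collections import defaultdict

def first_visit_mc(episodes, gamma=1.0):
    """episodes: list of [(state, action, reward), ...]. Returns averaged first-visit returns."""
    returns = defaultdict(list)
    for episode in episodes:
        first_visit = {}                        # state -> index of its first occurrence
        for i, (s, _, _) in enumerate(episode):
            first_visit.setdefault(s, i)
        # accumulate returns backwards, then record G only for first visits
        G_from = [0.0] * (len(episode) + 1)
        for i in reversed(range(len(episode))):
            G_from[i] = episode[i][2] + gamma * G_from[i + 1]
        for s, i in first_visit.items():
            returns[s].append(G_from[i])
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}

# Episode 1 above: first_visit_mc([[('B','east',-1), ('C','east',-1), ('D','exit',10)]])
# -> {'B': 8.0, 'C': 9.0, 'D': 10.0}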
[Grid figures: state-value estimates over the observed episodes]
What about exploration?
[Grid figure: acting greedily with respect to these estimates - we converged on a local optimum!]
Must explore!
•
15
MC control - example
[Figure sequence: a four-state corridor w, x, y, z with exit rewards -100 and +10 at the two ends; successive slides roll out episodes and update the table of returns and Q-value estimates for each state-action pair.]
On-policy learning
[Figure: true state values vs. the on-policy Monte-Carlo estimates on the grid example]
Quick Recap !
On-policy vs. Off-policy Learning
28
Off-policy learning
•
29
Off-policy learning conditions
•
30
Trajectory probability
•
31
Importance sampling
•
YES!
35
Importance sampling
•
39
Importance sampling: proof
•
Importance
ratio
44
+1 +10
45
Weighted importance sampling
• Trick: normalize by the
sum of importance ratios
-1
+1 +10
47
MC control + importance sampling
48
49
MC control + importance sampling
50
Discount future
rewards and add
immediate reward
51
MC control + importance sampling
52
Incremental update of Q
values (weighted moving
average)
53
MC control + importance sampling
54
55
MC control + importance sampling
56
MC control + IS example
[Figure sequence: the same four-state corridor w, x, y, z with exit rewards -100 and +10; successive slides accumulate importance-sampling ratios and update the Q-value estimates for each state-action pair.]
What did we learn?
•
86
88
Thank you
89
Deep Reinforcement Learning
2022-23 Second Semester, [Link] (AIML)
Session #9:
Temporal Difference Learning
Instructors :
1. Prof. S. P. Vimal (vimalsp@[Link]),
2. Prof. Sangeetha Viswanathan ([Link]@[Link])
1
Acknowledgements: Some of the slides were adopted with permission from the course CSCE-689 (Texas A&M University) by Prof. 2
Guni Sharon
Solving MDPs so far
[Figure: the four-state corridor example w, x, y, z with exit rewards -100 and +10]
Fuse DP and MC
Dynamic programming:
✅ Off-policy
✅ Local learning, propagating values from neighbors (bootstrapping)
❌ Model based
Monte-Carlo:
❌ On-policy (though importance sampling can be used)
❌ Requires a full episode to train on
✅ Model free, online learning
TD Learning:
✅ Off-policy
✅ Local learning, propagating values from neighbors (bootstrapping)
✅ Model free, online learning
Temporal difference learning
5
Temporal difference learning
10
SARSA: On-policy TD Control
11
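A minimal sketch of the SARSA update applied at each (S, A, R, S', A') step shown in the figures below (dictionary-based Q-table; names and the step-size value are illustrative):

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.5, gamma=0.9):
    """Q maps (state, action) -> value. Backs up the value of the action actually taken next (on-policy)."""
    td_target = r + gamma * Q.get((s_next, a_next), 0.0)
    td_error = td_target - Q.get((s, a), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * td_error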
[Figure sequence: SARSA on the four-state corridor w, x, y, z with exit rewards -100 and +10. Each slide records the current (S, A, R, S', A') tuple and the resulting update to the Q-table, e.g. the value of the action leading to the -100 exit becomes -100, and the neighbouring state's action value is then backed up towards -90. And so on...]
Q-learning: Off-policy TD Control
25
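For comparison, a minimal sketch of the off-policy Q-learning update, which backs up the best available next action rather than the action actually taken (names are illustrative):

def q_learning_update(Q, s, a, r, s_next, next_actions, alpha=0.5, gamma=0.9):
    """Q maps (state, action) -> value. next_actions: the actions available in s_next."""
    best_next = max((Q.get((s_next, a2), 0.0) for a2 in next_actions), default=0.0)
    td_error = r + gamma * best_next - Q.get((s, a), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * td_error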
[Figure sequence: Q-learning on the four-state corridor w, x, y, z with exit rewards +10 and -100. Each slide records the (S, A, R, S') tuple and the off-policy max-backup update to the Q-table, e.g. the value of the action leading to the -100 exit becomes -100, and the neighbouring state's action value is then backed up towards -90. And so on...]
Required Readings
42
Thank you
43
Session #10-11-12:
On Policy Prediction with Approximation
Instructors :
1. Prof. S. P. Vimal (vimalsp@[Link]),
2. Prof. Sangeetha Viswanathan ([Link]@[Link])
1
Agenda for the classes
➔ Introduction
➔ Value Function Approximation
➔ Stochastic Gradient, Semi-Gradient Methods
➔ Role of Deep Learning for Function Approximation;
➔ Feature Construction Methods
Acknowledgements: Some of the slides were adopted with permission from the course CSCE-689 (Texas A&M University) by Prof. 2
Guni Sharon
Demo
• Naïve Q-learning
• After 50 training
episodes
5
Learn an approximation function
Learn Predict
Generalize
7
Parameterized function approximator
●
Gradient Descent
●
9
Gradient Descent
●
10
Gradient Descent
●
Chain rule
11
Gradient Descent
● Idea:
○ Start somewhere
○ Repeat: Take a step in the gradient direction
12
●
13
Stochastic Gradient Descent (SGD)
14
Mini-Batch Gradient Descent
16
18
10
19
Semi-gradient methods
● They do converge reliably in important cases such as the linear
approximation case
● They offer important advantages that make them often clearly
preferred
● They typically enable significantly faster learning, as we have
seen in Chapters 6 and 7
● They enable learning to be continual and online, without
waiting for the end of an episode
● This enables them to be used on continuing problems and
provides computational advantages 20
Semi-gradient TD(0)
22
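A minimal sketch of one semi-gradient TD(0) step with a linear approximator v̂(s, w) = w·x(s). Only the estimate at S is differentiated (the bootstrapped target is treated as a constant), which is what makes it "semi-gradient". The feature function x is an assumed input.

import numpy as np

def semi_gradient_td0_step(w, x, s, r, s_next, done, alpha=0.01, gamma=0.99):
    """w: weight vector; x(s): feature vector for state s. Returns the updated weights."""
    v_s = np.dot(w, x(s))
    v_next = 0.0 if done else np.dot(w, x(s_next))
    td_error = r + gamma * v_next - v_s
    return w + alpha * td_error * x(s)   # gradient of v_hat(s, w) w.r.t. w is just x(s)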
Two-step return:
n-step return:
● Again, only a simple modification over the tabular setting
● Weight update instead of tabular entry update
29
●
For a linear
approximator
30
TD-error = 0
Least squares TD
●
31
Least squares TD
32
Least squares TD
33
Incremental updates
(no need to store all
previous transitions)
Least squares TD
●
34
Feature selection
●
35
36
37
Automatic features extraction for linear approximator -
Coarse Coding
● Natural representation of the state set is
continuous
● In 2-d, features corresponding to circles in
state space
● Coding of a state:
○ If the state is inside a circle, then the
corresponding feature has the value 1
○ otherwise the feature is 0
● Corresponding to each circle is a single
weight (a component of w) that is learned
○ Training a state affects the weights of all the
intersecting circles.
44
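A minimal sketch of the coarse-coding feature vector for a 2-d state: one binary feature per circle, set to 1 when the state falls inside that circle (circle centres and radius are illustrative assumptions):

import numpy as np

def coarse_code(state, centers, radius):
    """state: (x, y); centers: array of circle centres, shape (n, 2). Returns a binary feature vector."""
    state = np.asarray(state, dtype=float)
    inside = np.linalg.norm(np.asarray(centers, dtype=float) - state, axis=1) <= radius
    return inside.astype(float)

# A linear approximator over these features: v_hat(s, w) = np.dot(w, coarse_code(s, centers, radius))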
45
Deep Q Network
Mnih et al. 2015
• First deep learning model to successfully learn control policies
directly from high-dimensional sensory input using reinforcement
learning
• The model is a convolutional neural network, trained with a variant of
Q-learning
• Input is raw pixels and output is an action-value function estimating
future rewards
• Surpassed a human expert on various Atari video games
46
47
Example: Pacman
Let’s say we discover through experience that this state is bad.
In naïve Q-learning, we know nothing about this state.
Or even this one!
• Naïve Q-learning
• After 50 training
episodes
49
• Generalize Q-learning
with function
approximator
50
• Generalizing
knowledge results in
efficient learning
• E.g., learn to avoid the
ghosts
51
Generalizing with Deep learning
• Supervised: Require large amounts of hand-labelled training data
• RL on the other hand, learns from a scalar reward signal that is frequently
sparse, noisy, and delayed
• Supervised: Assume the data samples are independent
• In RL one typically encounters sequences of highly correlated states
• Supervised: Assume a fixed underlying distribution
• In RL the data distribution changes as the algorithm learns new behaviors
• DQN was first to demonstrate that a convolutional neural network
can overcome these challenges to learn successful control policies
from raw video data in complex RL environments
52
53
Original Q-learning
54
55
Deep Q learning [Mnih et al. 2015]
•
56
57
Q-learning with experience replay
58
59
Q-learning with experience replay
60
61
Q-learning with experience replay
62
63
Q-learning with experience replay
64
65
Q-learning with experience replay
66
Deep Q learning
•
67
Experience replay
• Utilizing experience replay has several advantages
• Each step of experience is potentially used in many weight updates, which allows
for greater data efficiency
• Learning directly from consecutive samples is inefficient, due to the strong
correlations between the samples; randomizing the samples breaks these
correlations and therefore reduces the variance of the updates
• The behavior distribution is averaged over many of its previous states, smoothing
out learning and avoiding oscillations or divergence in the parameters
• Note that when learning by experience replay, it is necessary to learn off-
policy (because our current parameters are different to those used to
generate the sample), which motivates the choice of Q-learning
68
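A minimal sketch of the replay memory described above: keep only the last N transitions (old ones are overwritten) and sample minibatches uniformly at random. Capacity and batch size are illustrative.

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)      # old transitions are discarded automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        """Uniform random sampling breaks the correlation between consecutive transitions."""
        return random.sample(self.buffer, batch_size)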
Experience replay
• DQN only stores the last N experience tuples in the replay memory
• Old transitions are overwritten
• Samples uniformly at random from the buffer when performing
updates
• Is there room for improvement?
• Important transitions?
• Prioritized sweeping
• Prioritize deletions from the replay memory
• see prioritized experience replay, [Link]
69
Results: DQN
•
70
Results: DQN
• Average predicted action-value on a held-out set of states on Space
Invaders (c) and Seaquest (d)
71
• Normalized between a
professional human games
tester (100%) and random
play (0%)
72
Maximization bias
73
Double Deep Q networks (DDQN)
We have two available Q
networks. Let's use them for
double learning
74
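A minimal sketch of the Double DQN target: the online network selects the next action and the target network evaluates it, which reduces the maximization bias. q_online and q_target are assumed callables mapping a state to a vector of action values.

import numpy as np

def double_dqn_target(q_online, q_target, reward, next_state, done, gamma=0.99):
    """Decouples action selection (online net) from action evaluation (target net)."""
    if done:
        return reward
    a_star = int(np.argmax(q_online(next_state)))          # online network picks the action
    return reward + gamma * q_target(next_state)[a_star]   # target network evaluates it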
75
• Hasselt et al.
2015
• The straight horizontal
orange (for DQN) and
blue (for Double DQN)
lines in the top row are
computed by running the
corresponding agents
after learning concluded,
and averaging the actual
discounted return
obtained from each
visited state. These
straight lines would
match the learning
curves at the right side of
the plots if there is no
bias.
76
77
Required Readings
78
Thank you
79
Deep Reinforcement Learning
2022-23 Second Semester, [Link] (AIML)
Session #13:
Policy Gradients - REINFORCE, Actor-
Critic algorithms
Instructors :
1. Prof. S. P. Vimal (vimalsp@[Link]),
2. Prof. Sangeetha Viswanathan ([Link]@[Link])
1
➔ Introduction
➔ Policy gradients
➔ REINFORCE algorithm
➔ Actor-critic methods
➔ REINFORCE - example
Acknowledgements: Some of the slides were adopted with permission from the course CSCE-689 (Texas A&M University) by Prof. 2
Guni Sharon
Notation
•
Stochastic policy
•
Evaluate the gradient in performance
12
REINFORCE [Williams, 1992]
Calibrating REINFORCE
14
PG with baseline
15
PG with baseline
●
PG with baseline
18
Policy Gradient
20
Add a critic
●
Critic’s duties
22
Benefits from a critic
Actor+critic
One-step actor-critic
26
Keep track of
accumulated discount
27
28
Compute the TD error
29
30
Update the actor with
discounting. Early actions
matter more.
31
Update accumulated
discount and progress to
the next state
32
In practice: training the
network at every step on a
single observation is
inefficient (slow and
correlated)
33
values = [Link](values)
Qvals = [Link](Qvals)
log_probs = [Link](log_probs)
advantage = Qvals - values
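The fragment above converts the collected values, returns and log-probabilities into tensors to form the advantage (the [Link] placeholders stand for conversion calls that are not shown). As a reference, a minimal PyTorch-style sketch of the one-step actor-critic update described on these slides; actor, critic, optimizer and the tensor types of its arguments are assumed, not taken from the slides.

import torch

def actor_critic_step(actor, critic, optimizer, s, a, r, s_next, done, I=1.0, gamma=0.99):
    """One-step actor-critic. actor(s) -> action logits, critic(s) -> V(s) (both assumed)."""
    v_s = critic(s)
    v_next = torch.zeros_like(v_s) if done else critic(s_next).detach()
    td_error = r + gamma * v_next - v_s                      # delta

    log_prob = torch.distributions.Categorical(logits=actor(s)).log_prob(a)
    actor_loss = -I * td_error.detach() * log_prob           # I = gamma^t: early actions matter more
    critic_loss = td_error.pow(2)

    optimizer.zero_grad()
    (actor_loss + critic_loss).sum().backward()
    optimizer.step()
    return I * gamma                                          # updated accumulated discount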
●
Add eligibility traces
37
●
What did we learn?
•
39
Required Readings
40
Thank you
41
Session #14:
Model Based Algorithms
Instructors :
1. Prof. S. P. Vimal (vimalsp@[Link]),
2. Prof. Sangeetha Viswanathan ([Link]@[Link])
1
Agenda for the classes
➔ Introduction
➔ Upper-Confidence-bound Action Selection
➔ Monte-Carlo Tree Search
➔ AlphaGo Zero
➔ MuZero, PlaNet
Acknowledgements: Some of the slides were adopted with permission from the course CSCE-689 (Texas A&M University) by Prof. 2
Guni Sharon
The Upper Confidence Bound follows the principle of optimism in the face of uncertainty
which implies that if we are uncertain about an action, we should optimistically assume that
it is the correct action.
Initially, UCB explores more to systematically reduce uncertainty but its exploration reduces
over time. Thus we can say that UCB obtains greater reward on average than other
algorithms such as Epsilon-greedy, Optimistic Initial Values, etc.
UCB Algorithm
● Select: Select a single node in the tree that is not fully expanded. By this, we mean at
least one of its children is not yet explored.
● Expand: Expand this node by applying one available action (as defined by the MDP)
from the node.
● Simulation: From one of the outcomes of the expanded node, perform a complete random
simulation of the MDP to a terminating state. This therefore assumes that the simulation
is finite, but versions of MCTS exist in which we just execute for some time and then
estimate the outcome.
● Backpropagate: Finally, the value of the node is backpropagated to the root node,
updating the value of each ancestor node on the way using expected value
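In the Select step, children are typically compared with the UCB1/UCT score, which is where the "optimism in the face of uncertainty" idea above enters MCTS. A minimal sketch; node objects with value, visits and children attributes are assumed.

import math

def uct_score(child, parent_visits, c=1.41):
    """UCB1 applied to tree search: average value plus an exploration bonus."""
    if child.visits == 0:
        return float("inf")               # always try unvisited children first
    exploit = child.value / child.visits
    explore = c * math.sqrt(math.log(parent_visits) / child.visits)
    return exploit + explore

def select_child(node):
    return max(node.children, key=lambda ch: uct_score(ch, node.visits))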
MCTS Framework
In a basic MCTS algorithm we incrementally build the search tree. Each node in the
tree stores:
Expansion
Select an action a to apply in state s, either
randomly or using a heuristic. Get an
outcome state s’ from applying action a in
state s according to the probability
distribution p(s’|s). Expand a new
environment node and a new state node for
that outcome.
Simulation: Perform a randomised
simulation of the MDP until we
reach a terminating state. That is,
at each choice point, randomly
select an possible action from the
MDP, and use transition
probabilities P(s’|s) to choose an
outcome for each action.
Heuristics can be used to improve the random simulation by guiding it towards more promising states.
G is the cumulative discounted reward received from the simulation starting at s’ until the simulation
terminates.
To avoid memory explosion, we discard all nodes generated from the simulation. In any non-trivial
search, we are unlikely to ever need them again.
AlphaZero
AlphaZero
A DNN initialised to return 2 randomly generated outputs: v(s), and p(s). The network will later be
trained to take in position s and predict v(s) and p(s) from it.
The AlphaZero framework. Mastering the
Game of Go without Human Knowledge.
D. Silver, et al. Nature volume 550, pages
354–359 (2017)
Implications of AlphaZero
- AlphaZero (2017) - learned to play on its own, without any human data or
domain knowledge, mastering three different games (Go, Chess, Shogi) with a
single algorithm that was provided only with the known set of rules.
- MuZero (2020) - learned to master these games (Go, Chess, Shogi, Atari) without
being provided the known rules
- Big step forward towards general-purpose algorithms
- Big step forward towards general-purpose algorithms
MuZero
PlaNet
- Deep Planning Network - Google AI & DeepMind Initiative
- PlaNet agent was tasked with ‘planning’ a sequence of actions to achieve a
goal like pole balancing, teaching a virtual entity (human or cheetah) to
walk, or keeping a box rotating by hitting it in a specific location.
- Common goals between these tasks that the PlaNet needed to achieve:
1. The Agent needs to predict a variety of possible futures (for robust
planning)
2. The Agent needs to update the plan based on the outcomes/rewards of a
recent action
3. The Agent needs to retain information over many time steps
1. Learning with a latent dynamics model — PlaNet learns from a series of hidden or
latent states instead of images to predict the latent state moving forward.
2. Model-based planning — PlaNet works without a policy network and instead makes
decisions based on continuous planning.
3. Transfer learning — The Google AI team trained a single PlaNet agent to solve all
six different tasks.
1) Latent Dynamics Model - Key benefits to using compact latent state spaces are that it allows
the agent to learn more abstract representations like the objects’ positions and velocities and
also avoid having to generate images.
3. Transfer Learning
- one agent for all tasks
The agent is randomly placed into different environments without knowing the task,
so it needs to infer the task from its image observations. Without changes to the hyper
parameters, the multi-task agent achieves the same mean performance as individual
agents.
41
Thank you
42
Session #15:
Imitation Learning
Instructors :
1. Prof. S. P. Vimal (vimalsp@[Link]),
2. Prof. Sangeetha Viswanathan ([Link]@[Link])
1
Agenda for the classes
➔ Introduction
➔ Upper-Confidence-bound Action Selection
➔ Monte-Carlo Tree Search
➔ AlphaGo Zero
➔ MuZero, PlaNet
Acknowledgements: Some of the slides were adopted with permission from the course CSCE-689 (Texas A&M University) by Prof. Guni Sharon
2
vs
• ALVINN [Link]
neural-network/
Dean Pomerleau et al., 1989-1999 [Link]
• Helicopter Acrobatics
Learning for Control from Multiple Demonstrations - Adam Coates, Pieter Abbeel, Andrew Ng, ICML 2008
An Application of Reinforcement Learning to Aerobatic Helicopter Flight - Pieter Abbeel, Adam Coates, Morgan
Quigley, Andrew Y. Ng, NIPS 2006
[Link]
s = game screen
a = turning angle
Compounding Errors
Inverse RL
• What if we don’t have an online demonstrator?
• We only have access to an offline set of demonstrated trajectories
• Behavioral cloning is not robust
• Suffers from overfitting
• We know what to do in observed states but can’t generalize well to other states
• How can we learn to mimic the demonstrator in a general way?
• Learn the demonstrator’s objective (reward) function
• Apply RL
Inverse RL
GAN
Inverse RL as GAN
Inverse RL as GAN
Model based algorithms - Advantages
33
Thank you
34
Session #16:
Multi-Agent Reinforcement Learning
Instructors :
1. Prof. S. P. Vimal (vimalsp@[Link]),
2. Prof. Sangeetha Viswanathan ([Link]@[Link])
1
Agenda for the classes
➔ Introduction
➔ Cooperative vs Competitive agents,
➔ centralized vs. decentralized RL ;
➔ Proximal Policy Optimization (Surrogate Objective
Function, Clipping)
Acknowledgements: Some of the slides were adopted with permission from the course CSCE-689 (Texas A&M University) by Prof. Guni Sharon
2
Multi-Agent RL (MARL)
MARL
Traditional (Single-Agent) RL
Multiagent RL
Motivations: Research in Multiagent RL
Centralized:
Decentralized:
Decentralized systems
• In decentralized learning, each agent is trained independently from the others.
• The benefit is that since no information is shared between agents, each agent
can be designed and trained like we train single agents.
• The idea here is that our training agent will consider other agents as part of the
environment dynamics. Not as agents.
• However, the big drawback of this technique is that it will make the environment
non-stationary since the underlying Markov decision process changes over time
as other agents are also interacting in the environment. And this is problematic
for many Reinforcement Learning algorithms that can’t reach a global optimum
with a non-stationary environment.
Centralized systems
Dimensions of MARL
Prescriptive:
Descriptive:
Dimensions of MARL
Numbers of agents
- One (single-agent)
- Two (very common)
- Finite
- Infinite
Foundations of MARL
Markov Game
MARL Formulation
Nash Q-learning
Nash Q-Learning
Chanel Settings
Multi-agent centralized training
The algorithm, introduced by OpenAI in 2017, seems to strike the right balance
between performance and comprehension.
PPO aims to strike a balance between important factors like ease of implementation,
ease of tuning, sample complexity and sample efficiency, while trying to compute an update
at each step that minimizes the cost function and ensures the deviation from the
previous policy is relatively small.
PPO is in fact, a policy gradient method that learns from online data as well. It merely
ensures that the updated policy isn’t too much different from the old policy to ensure
low variance in training.
PPO - Surrogate Objective
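A minimal sketch of the clipped surrogate objective that implements the "deviation from the previous policy is relatively small" idea: with probability ratio r_t(θ) = π_θ(a|s) / π_θ_old(a|s), PPO maximizes E[min(r_t · Â_t, clip(r_t, 1−ε, 1+ε) · Â_t)]. PyTorch-style code; tensor names are illustrative.

import torch

def ppo_clip_loss(log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Negative clipped surrogate objective (to be minimized with gradient descent)."""
    ratio = torch.exp(log_probs - old_log_probs)                  # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()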
PPO Algorithm
42
Thank you
43