
Deep Reinforcement Learning

2022-23 Second Semester, [Link] (AIML)

Session #1:
Introduction to the Course

Instructors :
1. Prof. S. P. Vimal (vimalsp@[Link]),
2. Prof. Sangeetha Viswanathan ([Link]@[Link])
1

What is Reinforcement Learning ?

- Reward based learning / Feedback based learning


- Not a type of neural network, nor an alternative to one; rather, it is an approach
to learning
- Example applications: autonomous driving, gaming

Why Reinforcement Learning ?


- A goal-oriented learning based on interaction with environment

2
Course Objectives

Course Objectives:
1. Understand
a. the conceptual, mathematical foundations of deep reinforcement learning
b. various classic & state of the art Deep Reinforcement Learning algorithms
2. Implement and Evaluate the deep reinforcement learning solutions to various
problems like planning, control and decision making in various domains
3. Provide conceptual, mathematical and practical exposure on DRL
a. to understand the recent developments in deep reinforcement learning and
b. to enable modelling new problems as DRL problems.

Learning Outcomes

1. Understand the fundamental concepts of reinforcement learning (RL),


algorithms and apply them for solving problems including control, decision-
making, and planning.
2. Implement DRL algorithms, handle challenges in training due to stability and
convergence
3. Evaluate the performance of DRL algorithms, including metrics such as sample
efficiency, robustness and generalization.
4. Understand the challenges and opportunities of applying DRL to real-world
problems & model real life problems

4
Course Operation

• Instructors
Prof. [Link], Prof. Chandra Sekar
Prof. Bharatesh, Prof. SK. Karthika

• Textbooks
1. Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G.
Barto, Second Ed. , MIT Press
2. Foundations of Deep Reinforcement Learning: Theory and Practice in Python
(Addison-Wesley Data & Analytics Series) 1st Edition by Laura Graesser and
Wah Loon Keng
5

Course Operation
• Evaluation
Two quizzes worth 5% each; the better of the two will be taken for 5% (in final grading)
Whatever the quizzes are scored out of, the score will be scaled to 5%.
No MAKEUP, for any reason. Ensure you attend at least one of the quizzes
Two Assignments - TensorFlow / PyTorch / OpenAI Gym Toolkit → 25%
Assignment 1: Partly numerical + implementation of classic algorithms – 10%
Assignment 2: Deep Learning based RL
Mid Term Exam - 30% [Only to be written in A4 sheets, scanned and uploaded]
Comprehensive Exam - 40% [Only to be written in A4 sheets, scanned and uploaded]
• Webinars/Tutorials
4 tutorials : 2 before mid-sem & 2 after mid-sem
• Teaching Assistants – Will be introduced in the upcoming classes.

6
Course Operation
• How to reach us? (for any question on lab aspects, availability of slides on the portal,
quiz availability, assignment operations)
Prof. [Link] - vimalsp@[Link]
Prof. SK. Karthika – Karthika@[Link]
• Plagiarism
All submissions for graded components must be the result of your original effort. It is strictly
prohibited to copy and paste verbatim from any sources, whether online or from your peers.
The use of unauthorized sources or materials, as well as collusion or unauthorized
collaboration to gain an unfair advantage, is also strictly prohibited. Please note that we will
not distinguish between the person sharing their resources and the one receiving them for
plagiarism, and the consequences will apply to both parties equally.
Suspicious circumstances, such as identical verbatim answers or a significant, unreasonable
overlap across a set of submissions, will be investigated, and severe penalties will be
imposed on all those found guilty of plagiarism.
7

Reinforcement Learning

Reinforcement learning (RL) is based on rewarding desired behaviors or punishing


undesired ones. Instead of one input producing one output, the algorithm produces
a variety of outputs and is trained to select the right one based on certain variables
– Gartner

When to use RL?


RL can be used in large environments in the following situations:

1. A model of the environment is known, but an analytic solution is not available;
2. Only a simulation model of the environment is given (the subject of simulation-based
optimization);
3. The only way to collect information about the environment is to interact with it.

8
(Deep) Reinforcement Learning

Types of Learning

Criteria | Supervised ML | Unsupervised ML | Reinforcement ML
Definition | Learns by using labelled data | Trained using unlabelled data without any guidance | Works by interacting with the environment
Type of data | Labelled data | Unlabelled data | No predefined data
Type of problems | Regression and classification | Association and clustering | Exploitation or exploration
Supervision | Extra supervision | No supervision | No supervision
Algorithms | Linear Regression, Logistic Regression, SVM, KNN etc. | K-Means, C-Means, Apriori | Q-Learning, SARSA
Aim | Calculate outcomes | Discover underlying patterns | Learn a series of actions
Application | Risk Evaluation, Forecast Sales | Recommendation System, Anomaly Detection | Self Driving Cars, Gaming, Healthcare
10
Characteristics of RL

•No supervision, only a real value or reward signal


•Decision making is sequential
•Time plays a major role in reinforcement problems
•Feedback isn’t prompt but delayed
•The data it subsequently receives is determined by the agent’s actions

11

Types of Reinforcement learning

•Positive Reinforcement - an event that occurs because of a particular behavior and
increases the strength and frequency of that behavior. In other words, it has
a positive effect on behavior.
•Maximizes performance
•Sustains change for a long period of time
•Too much reinforcement can lead to an overload of states, which can
diminish the results

12
Types of Reinforcement Learning
Negative Reinforcement - strengthening of behavior because a negative
condition is stopped or avoided. Advantages of negative reinforcement
learning:
•Increases behavior
•Helps enforce a minimum standard of performance
•It only provides enough to meet the minimum behavior

13

Elements of Reinforcement Learning

14
Elements of Reinforcement Learning

Beyond the agent and the environment, one can identify four main sub-elements
of a reinforcement learning system: a policy, a reward signal, a value
function, and, optionally, a model of the environment.
15

Elements of Reinforcement Learning

•Agent
- An entity that tries to learn the best way to perform a specific task.
- In our example, the child is the agent who learns to ride a bicycle.

•Action (A) -
- What the agent does at each time step.
- In the example of a child learning to walk, the action would be “walking”.
- A is the set of all possible moves.
- In video games, the list might include running right or left, jumping high or
low, crouching or standing still.

16
Elements of Reinforcement Learning

•State (S)
- Current situation of the agent.
- After performing an action, the agent can move to different states.
- In the example of a child learning to walk, the child can take the action of
taking a step and move to the next state (position).
•Rewards (R)
- Feedback that is given to the agent based on the action of the agent.
- If the action of the agent is good and can lead to winning or a positive side
then a positive reward is given and vice versa.

17

Elements of Reinforcement Learning

•Environment
- Outside world of an agent or physical world in which the agent operates.
•Discount factor
- The discount factor is multiplied by future rewards as discovered by the
agent in order to dampen these rewards’ effect on the agent’s choice of
action.
- Why? It is designed to make future rewards worth less than immediate
rewards.

18
Reinforcement Learning - Definition

Formal Definition - Reinforcement learning (RL) is an area of


machine learning concerned with how intelligent agents ought to
take actions in an environment in order to maximize the notion of
cumulative reward.

19

Elements of Reinforcement Learning


•Goal of RL - maximize the total amount of rewards or cumulative rewards that
are received by taking actions in given states.
•Notations –
a set of states as S,
a set of actions as A,
a set of rewards as R.
At each time step t = 0, 1, 2, …, some representation of the environment’s state St
∈ S is received by the agent. According to this state, the agent selects an action At
∈ A which gives us the state-action pair (St , At). In the next time step t+1, the
transition of the environment happens and the new state St+1 ∈ S is achieved. At
this time step t+1, a reward Rt+1 ∈ R is received by the agent for the action At taken
from state St. 20
Elements of Reinforcement Learning

•Maximize cumulative rewards, Expected Return Gt

•Discount factor γ is introduced here which forces the agent to focus on


immediate rewards instead of future rewards. The value of γ remains between
0 and 1.
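The expected return referred to above (a standard reconstruction of the missing formula, following Sutton & Barto):

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$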

21

Elements of Reinforcement Learning

•Policy (π)

- A policy in RL decides which action the agent will take in the current state.
- It tells the probability that an agent will select a specific action from a specific state.
- Policy is a function that maps a given state to probabilities of selecting each
possible action from the given state.

•If an agent follows policy π at time t, then π(a|s) is the probability that At = a
given that St = s. In other words, under policy π, the probability that the agent
takes action a in state s at time t is π(a|s).

22
Elements of Reinforcement Learning

•Value Functions
- A simple measure of how good it is for an agent to be in a given state, or
how good it is for the agent to perform a given action in a given state.
•Two types
- State- Value function
- Action-Value function

23

Elements of Reinforcement Learning

•State-value function
- The state-value function for policy π denoted as vπ determines the goodness
of any given state for an agent who is following policy π.
- This function gives us the value which is the expected return starting from
state s at time step t and following policy π afterward.
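In symbols (a reconstruction of the standard definition, in Sutton & Barto's notation):

$$v_\pi(s) = \mathbb{E}_\pi\left[\, G_t \mid S_t = s \,\right]$$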

24
Elements of Reinforcement Learning

•Action value function


- Determines the goodness of the action taken by the agent from a given state
for policy π.
- This function gives the value which is the expected return starting from
state s at time step t, with action a, and following policy π afterward.
- The output of this function is also called the Q-value, where Q stands for
Quality. Note that in the state-value function, we did not consider the action
taken by the agent.
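In symbols (again following the standard definition):

$$q_\pi(s, a) = \mathbb{E}_\pi\left[\, G_t \mid S_t = s, A_t = a \,\right]$$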

25

Elements of Reinforcement Learning

•Model of the environment


- Something that mimics the behavior of the environment or, more generally, allows
inferences to be made about how the environment will behave.
- For example, given a state and action, the model might predict the resultant
next state and next reward.
- Models are used for planning

26
(Deep) Reinforcement Learning

27
From OpenAI

Advantages of Reinforcement Learning

- Solve very complex problems that cannot be solved by conventional


techniques
- Achieve long-term results
- Model can correct the errors that occurred during the training process.
- In the absence of a training dataset, it is bound to learn from its experience
- Can be useful when the only way to collect information about the
environment is to interact with it
- Reinforcement learning algorithms maintain a balance between
exploration and exploitation. Exploration is the process of trying different
things to see if they are better than what has been tried before. Exploitation
is the process of trying the things that have worked best in the past. Other
learning algorithms do not perform this balance

28
An example scenario - Tic-Tac-Toe

Two players take turns playing on a three-by-three board. One player plays Xs and
the other Os until one player wins by placing three marks in a row, horizontally,
vertically, or diagonally
Assumptions
- playing against an imperfect player, one whose play is sometimes incorrect and
allows you to win
Aim
How might we construct a player that will find the imperfections in its
opponent’s play and learn to maximize its chances of winning?

29

Reinforcement Learning for Tic-Tac-Toe

•We set up a table of numbers, one for each possible state of the game. Each
number will be the latest estimate of the probability of our winning from that
state.
•We treat this estimate as the state’s value, and the whole table is the learned
value function.
•State A has higher value than state B, or is considered better than state B, if the
current estimate of the probability of our winning from A is higher than it is from B.

30
Reinforcement Learning for Tic-Tac-Toe

● Assuming we always play Xs: a state with three Xs in a row has value (winning probability) 1,
a state with three Os in a row has value 0, and all other states are initialised to 0.5.
● Play many games against the opponent. To select our moves we examine the
states that would result from each of our possible moves and look up their current
values in the table.
● Most of the time we move greedily, selecting the move that leads to the state with
greatest value, i.e, with the highest estimated probability of winning.
● Occasionally, however, we select randomly from among the other moves instead.
These are called exploratory moves because they cause us to experience states
that we might otherwise never see.

31

Reinforcement Learning for Tic-Tac-Toe


•The solid lines represent the moves
taken during a game; the dashed lines
represent moves that we (our
reinforcement learning player)
considered but did not make.
•Our second move was an exploratory
move, meaning that it was taken even
though another sibling move, the one
leading to e∗, was ranked higher.

32
Reinforcement Learning for Tic-Tac-Toe

● Once a game is started, our agent computes all possible actions it can take in
the current state and the new states which would result from each action.
● The values of these states are collected from a state_value vector, which
contains values for all possible states in the game.
● The agent can then choose the action which leads to the state with the highest
value (exploitation), or choose a random action (exploration), depending on the
value of epsilon.

33

Thank you

34
Deep Reinforcement Learning
2022-23 Second Semester, [Link] (AIML)

Session #2-3:
Multi-armed Bandits

Instructors :
1. Prof. S. P. Vimal (vimalsp@[Link]),
2. Prof. Sangeetha Viswanathan ([Link]@[Link])
1

Agenda for the class

• Recap
• k-armed Bandit Problem & its significance
• Action-Value Methods
Sample Average Method & Incremental Implementation
• Non-stationary Problem
• Initial Values & Action Selection
• Gradient Bandit Algorithms [ Class #3 ]
• Associative Search [ Class #3 ]

Acknowledgements: Some of the slides were adopted with permission from the course CSCE-689 (Texas A&M University) by Prof. Guni Sharon 2
An example scenario - Tic-Tac-Toe

Two players take turns playing on a three-by-three board. One player plays Xs and
the other Os until one player wins by placing three marks in a row, horizontally,
vertically, or diagonally
Assumptions
- playing against an imperfect player, one whose play is sometimes incorrect and
allows you to win
Aim
How might we construct a player that will find the imperfections in its
opponent’s play and learn to maximize its chances of winning?

Reinforcement Learning for Tic-Tac-Toe

•We set up a table of numbers, one for each possible state of the game. Each
number will be the latest estimate of the probability of our winning from that
state.
•We treat this estimate as the state’s value, and the whole table is the learned
value function.
•State A has higher value than state B, or is considered better than state B, if the
current estimate of the probability of our winning from A is higher than it is from B.

4
Reinforcement Learning for Tic-Tac-Toe

● Assuming we always play Xs: a state with three Xs in a row has value (winning probability) 1,
a state with three Os in a row has value 0, and all other states are initialised to 0.5.
● Play many games against the opponent. To select our moves we examine the
states that would result from each of our possible moves and look up their current
values in the table.
● Most of the time we move greedily, selecting the move that leads to the state with
greatest value, i.e, with the highest estimated probability of winning.
● Occasionally, however, we select randomly from among the other moves instead.
These are called exploratory moves because they cause us to experience states
that we might otherwise never see.

Reinforcement Learning for Tic-Tac-Toe


•The solid lines represent the moves
taken during a game; the dashed lines
represent moves that we (our
reinforcement learning player)
considered but did not make.
•Our second move was an exploratory
move, meaning that it was taken even
though another sibling move, the one
leading to e∗, was ranked higher.

6
Reinforcement Learning for Tic-Tac-Toe

● Once a game is started, our agent computes all possible actions it can take in
the current state and the new states which would result from each action.
● The values of these states are collected from a state_value vector, which
contains values for all possible states in the game.
● The agent can then choose the action which leads to the state with the highest
value (exploitation), or choose a random action (exploration), depending on the
value of epsilon.

Tic-Tac-Toe

8
Tic-Tac-Toe
States and Initial Values
[Table of board states and their initial values: most states start at 0.5, states with three Xs in a row at 1.0, states with three Os in a row at 0.]
Set up a table of states and initial values.
Learning task: play many games against the opponent and learn the values.
9

Tic-Tac-Toe (prev. class)

States and Initial Values
[Table of board states and their values, as before.]
St - state before the greedy move
St+1 - state after the greedy move
10
Tic-Tac-Toe (prev. class)

Temporal Difference Learning Rule:
V(St) ← V(St) + 𝜶 [ V(St+1) − V(St) ]   (𝜶 - step-size parameter)

Tic-Tac-Toe (prev. class)

Questions:
(1) What happens if 𝜶 is gradually made to 0 over many
games with the opponent?
(2) What happens if 𝜶 is gradually reduced over many
games, but never made 0?
(3) What happens if 𝜶 is kept constant throughout its
life time?

Temporal Difference Learning Rule (as above):
V(St) ← V(St) + 𝜶 [ V(St+1) − V(St) ]   (𝜶 - step-size parameter)
Tic-Tac-Toe (prev. class)

Key Takeaways:
(1) Learning while interacting with the
environment (opponent).
(2) We have a clear goal
(3) Our policy is to make moves that maximize our
chances of reaching the goal
○ Use the values of states most of the time
(exploitation) and explore the rest of the time.

Temporal Difference Learning Rule (as above):
V(St) ← V(St) + 𝜶 [ V(St+1) − V(St) ]   (𝜶 - step-size parameter)

Tic-Tac-Toe (prev. class)

Reading Assigned:
Identify how this reinforcement learning solution is
different from solutions using minimax algorithm
and genetic algorithms.
Post your answers in the discussion forum;

Temporal Difference Learning Rule (as above):
V(St) ← V(St) + 𝜶 [ V(St+1) − V(St) ]   (𝜶 - step-size parameter)
K-armed Bandit Problem
Problem
• You are faced repeatedly with a choice among k different options, or actions
• After each choice of actions you receive a numerical reward
Reward is chosen from a stationary probability distribution that depends on the selected
action
• Objective : to maximize the expected total reward over some time period

15

K-armed Bandit Problem


Problem
• You are faced repeatedly with a choice among k different options, or actions
• After each choice of actions you receive a numerical reward
Reward is chosen from a stationary probability distribution that depends on the selected
action
• Objective : to maximize the expected total reward over some time period

16
K-armed Bandit Problem
Problem
• You are faced repeatedly with a choice among k different options, or actions
• After each choice of actions you receive a numerical reward
Reward is chosen from a stationary probability distribution that depends on the selected
action
• Objective : to maximize the expected total reward over some time period

Strategy:
● Identify the best lever(s)
● Keep pulling the identified ones
Questions:
● How do we define the best ones?
● What are the best levers?
17

K-armed Bandit Problem


Problem
• You are faced repeatedly with a choice among k different options, or actions
• After each choice of actions you receive a numerical reward
Reward is chosen from a stationary probability distribution that depends on the selected
action
• Objective : to maximize the expected total reward over some time period

Strategy:
● Identify the best lever(s)
● Keep pulling the identified ones
Questions:
● How do we define the best ones?
● What are the best levers?
18
K-armed Bandit Problem
Problem
• You are faced repeatedly with a choice among k different options, or actions
• After each choice of actions you receive a numerical reward
Reward is chosen from a stationary probability distribution that depends on the selected
action
• Objective : to maximize the expected total reward over some time period

● Expected Mean Reward for each


action selected
→ call it Value of the action

19

K-armed Bandit Problem

• At - action selected on time step t


• Qt (a) - estimated value of action a at time step t
• q*(a) - true (expected) value of an arbitrary action a

Note: If you knew the value of each action, then it would be trivial to solve the k -armed bandit
problem: you would always select the action with highest value :-)
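In symbols, the true value of an action is its expected reward (standard notation, not shown on the slide):

$$q_*(a) = \mathbb{E}\left[\, R_t \mid A_t = a \,\right], \qquad Q_t(a) \approx q_*(a)$$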
20
K-armed Bandit Problem

21

K-armed Bandit Problem

Keep pulling the levers; update the estimate of action values;


22
K-armed Bandit Problem

23

K-armed Bandit Problem


1. How to maintain the estimate of expected rewards for each action?
Average the rewards actually received !!!

2. How to use the estimate in selecting the right action?


Greedy Action Selection
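Written out (a standard reconstruction): the sample-average estimate for question 1 and the greedy selection for question 2 are

$$Q_t(a) = \frac{\sum_{i=1}^{t-1} R_i \cdot \mathbb{1}_{A_i = a}}{\sum_{i=1}^{t-1} \mathbb{1}_{A_i = a}}, \qquad A_t = \arg\max_a Q_t(a)$$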
24
K-armed Bandit Problem
2. How to use the estimate in selecting the right action?
Greedy Action Selection

Actions which are inferior by the value estimate up to time t could indeed be better than
the greedy action at t !!!

3. Exploration vs. Exploitation?


ε-Greedy Action Selection / near-greedy action selection
Behave greedily most of the time; Once in a while, with small probability 𝛆 select randomly
from among all the actions with equal probability, independently of the action-value estimates.

25

K-armed Bandit Problem

26
K-armed Bandit Problem
Greedy Action

27

K-armed Bandit Problem


Action to Explore

28
K-armed Bandit Problem
ε-Greedy Action Selection / near-greedy action selection

import numpy as np

epsilon = 0.05  # small value to control exploration

def get_action(Q, actions):
    # Q: mapping from action -> current value estimate (assumed structure)
    # exploit: with probability (1 - epsilon) take the greedy action
    if np.random.random() > epsilon:
        return max(actions, key=lambda a: Q[a])
    # explore: otherwise pick a uniformly random action
    return np.random.choice(actions)

● In the limit as the number of steps increases, every action will be sampled by ε-greedy action
selection an infinite number of times. This ensures that all the Qt (a) converge to q* (a).
● Easy to implement / optimize for epsilon / yields good results

29

Ex-1: In ε-greedy action selection, for the case of two actions and ε
= 0.5, what is the probability that the greedy action is selected?

30
Ex-1: In ε-greedy action selection, for the case of two actions and ε
= 0.5, what is the probability that the greedy action is selected?

p (greedy action)
= p (greedy action AND greedy selection ) + p (greedy action AND random selection )
= p (greedy action | greedy selection ) p ( greedy selection )
+ p (greedy action | random selection ) p (random selection )
= p (greedy action | greedy selection ) (1-𝛆) + p (greedy action | random selection ) (𝛆)
= p (greedy action | greedy selection ) (0.5) + p (greedy action | random selection ) (0.5)
= (1) (0.5) + (0.5) (0.5)
= 0.5 + 0.25
= 0.75

31

10-armed Testbed
Example:
• A set of 2000 randomly generated k -armed
bandit problems with k = 10
• Action values were selected according to a
normal (Gaussian) distribution with mean 0
and variance 1.
• While selecting action At at time step t, the
actual reward, Rt , was selected from a
normal distribution with mean q*(At ) and
variance 1
• One Run : Apply a method for 1000 time
steps to one of the bandit problems
• Perform 2000 runs, each run with a different bandit problem, to get the
algorithm's average behavior
[Figure: An example bandit problem from the 10-armed testbed]

32
Average performance of 𝛆-greedy action-value methods on
the 10-armed testbed

33

Average performance of 𝛆-greedy action-value methods on


the 10-armed testbed

34
Discussion on Exploration vs. Exploitation

1) What if the reward variance is


a. larger, say 10 instead of 1?
b. zero ? [ deterministic ]
2) What if the bandit task is non-stationary? [ that is, the true values of the
actions changed over time]

35

Ex-2:

Consider a k -armed bandit problem with k = 4 actions, denoted 1, 2, 3, and 4.

Consider applying to this problem a bandit algorithm using 𝛆-greedy action selection, sample-
average action-value estimates, and initial estimates of Q1 (a) = 0, for all a.

Suppose the initial sequence of actions and rewards is A1 = 1, R1 = 1, A2 = 2, R2 = 1, A3 = 2, R3 = 2,


A4 = 2, R4 = 2, A5 = 3, R5 = 0.

On some of these time steps the 𝛆 case may have occurred, causing an action to be selected at
random.

On which time steps did this definitely occur? On which time steps could this possibly have
occurred?

36
37

38
39

Incremental Implementation

• Efficient approach to
compute the estimate of
action-value;

• Given Qn and the nth


reward, Rn , the new
average of all n rewards can
be computed as follows
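The update referred to above (a reconstruction of the standard incremental formula):

$$Q_{n+1} = Q_n + \frac{1}{n}\left[\, R_n - Q_n \,\right]$$

i.e. NewEstimate ← OldEstimate + StepSize [ Target − OldEstimate ].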

40
Incremental Implementation

Note:
● StepSize decreases with each update
● We use 𝛂 or 𝛂t(a) to denote step size (constant /
varies with each step)

Discussion:
Const vs. Variable step size?

41

Bandit Algorithm with Incremental Update / 𝛆-greedy selection
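A minimal Python sketch of the 𝛆-greedy bandit loop with the incremental sample-average update (a reconstruction of the standard algorithm, not code from the slides; bandit(a) is an assumed function returning a sampled reward for arm a):

import numpy as np

def run_bandit(bandit, k=10, epsilon=0.1, steps=1000):
    Q = np.zeros(k)   # action-value estimates
    N = np.zeros(k)   # number of times each action has been selected
    for _ in range(steps):
        if np.random.random() < epsilon:
            a = np.random.randint(k)        # explore: random action
        else:
            a = int(np.argmax(Q))           # exploit: greedy action
        r = bandit(a)                       # observe reward for the chosen arm
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]           # incremental sample-average update
    return Q, N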

42
Non-stationary Problem

• Most RL problems are non-stationary !


• Give more weight to recent rewards than to long-past rewards !!!

43

Non-stationary Problem

• Most RL problems are non-stationary !


• Give more weight to recent rewards than to long-past rewards !!!

Exponential recency-weighted average
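With a constant step size 𝛂, the update and the weighting it induces are (standard reconstruction):

$$Q_{n+1} = Q_n + \alpha\left[\, R_n - Q_n \,\right] = (1-\alpha)^n Q_1 + \sum_{i=1}^{n} \alpha (1-\alpha)^{n-i} R_i$$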

44
45

46
Optimistic Initial Values

• All the above discussed methods are biased by their initial estimates
• For sample average method the bias disappears once all actions have been selected at
least once
• For methods with constant α, the bias is permanent, though decreasing over time
• Initial action values can also be used as a simple way of encouraging exploration.
• In 10 armed testbed, set initial estimate to +5 rather than 0.
This can encourage action-value methods to explore.
Whichever actions are initially selected, the reward is less than the starting estimates;
the learner switches to other actions, being disappointed with the rewards it is receiving.
The result is that all actions are tried several times before the value estimates converge.
47

Optimistic Initial Values

Caution:
Optimistic Initial Values can
only be considered as a simple
trick that can be quite effective
on stationary problems, but it is
far from being a generally
useful approach to encouraging
exploration.

Question:
Explain how in the non-
stationary scenario the
optimistic initial values will fail
(to explore adequately).

The effect of optimistic initial action-value estimates on the 10-armed testbed.


Both methods used a constant step-size parameter, 𝛂 = 0.1
48
Upper-Confidence-Bound Action Selection

• 𝛆-greedy action selection forces the non-greedy actions to be tried, but indiscriminately,
with no preference for those that are nearly greedy or particularly uncertain
• It would be better to select among the non-greedy actions
according to their potential for actually being optimal
Take into account both how close their estimates are to being maximal and the uncertainties
in those estimates.

49

Upper-Confidence-Bound Action Selection


● Each time a is selected the uncertainty is presumably reduced
● Each time an action other than a is selected, t increases but Nt(a) does not; because t appears in the numerator, the
uncertainty estimate increases.
● Actions with lower value estimates, or that have already been selected frequently, will be selected with decreasing
frequency over time
At = argmax_a [ Qt(a) + c √( ln t / Nt(a) ) ]
(c - confidence level; Qt(a) - action value at time t for a; the square-root term - a measure of uncertainty)

50
51

52
53

Upper-Confidence-Bound Action Selection

UCB often performs well, as shown here, but is more difficult than 𝛆-greedy to extend beyond bandits to the more
general reinforcement learning settings
Policy-based algorithms

55

Softmax function
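A reconstruction of the softmax (Boltzmann) distribution over action preferences Ht(a), as used by the gradient bandit algorithm (standard notation, not shown on the slide):

$$\pi_t(a) = \Pr\{A_t = a\} = \frac{e^{H_t(a)}}{\sum_{b=1}^{k} e^{H_t(b)}}$$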

56
Softmax function

57

Gradient ascent

• ?

58
Update Rule

On each step, after selecting action At and receiving the reward Rt,


Update the action preferences :
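A reconstruction of the standard gradient-bandit preference update (Sutton & Barto notation; R̄t denotes the average of rewards up to time t):

$$H_{t+1}(A_t) = H_t(A_t) + \alpha\,(R_t - \bar{R}_t)\,(1 - \pi_t(A_t))$$
$$H_{t+1}(a) = H_t(a) - \alpha\,(R_t - \bar{R}_t)\,\pi_t(a) \quad \text{for all } a \neq A_t$$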

59

60
What did we learn?

• Problem: choose the action that results in highest expected reward


• Assumptions: 1. actions’ expected reward is unknown, 2. we are
confronted with the same problem over and over, 3. we are able to
observe an action’s outcome once chosen
• Approach: learn the actions’ expected reward through exploration
(value based) or learn a policy directly (policy based), exploit learnt
knowledge to choose best action
• Methods: 1. greedy + initializing estimates optimistically, 2. epsilon-greedy,
3. Upper-Confidence-Bounds, 4. gradient ascent + softmax

61

A different scenario

● Associative vs. Non-associative tasks ?


● Policy: A mapping from situations to the
actions that are best in those situations
● (discuss) How do we extend the solution for
non-associative task to an associative task?
○ Approach: Extend the solutions to non-stationary task to
non-associative tasks
■ Works, if the true action values changes slowly
○ What if the context switching between the situations are
made explicit?
■ How?
■ Need Special approaches !!!

62
Required Readings

1. Chapter-2 of Introduction to Reinforcement Learning,2nd Ed., Sutton &


Barto
2. A Survey on Practical Applications of Multi-Armed and Contextual Bandits,
Djallel Bouneffouf , Irina Rish [[Link]

63

Thank you !

64
Deep Reinforcement Learning
2023-24 Second Semester, [Link] (AIML)

Session #4:
Markov Decision Processes

DRL Course Instructors

Policy-based algorithms

2
Softmax function

Softmax function

4
Gradient ascent

• ?

Update Rule

On each step, after selecting action At and receiving the reward Rt,


Update the action preferences :

6
7

What did we learn?

• Problem: choose the action that results in highest expected reward


• Assumptions: 1. actions’ expected reward is unknown, 2. we are
confronted with the same problem over and over, 3. we are able to
observe an action’s outcome once chosen
• Approach: learn the actions’ expected reward through exploration
(value based) or learn a policy directly (policy based), exploit learnt
knowledge to choose best action
• Methods: 1. greedy + initializing estimates optimistically, 2. epsilon-greedy,
3. Upper-Confidence-Bounds, 4. gradient ascent + softmax

8
A different scenario

● Associative vs. Non-associative tasks ?


● Policy: A mapping from situations to the
actions that are best in those situations
● (discuss) How do we extend the solution for
non-associative task to an associative task?
○ Approach: Extend the solutions to non-stationary task to
non-associative tasks
■ Works, if the true action values changes slowly
○ What if the context switching between the situations are
made explicit?
■ How?
■ Need Special approaches !!!

Agenda for the class

• Agent-Environment Interface (Sequential Decision Problem)


• MDP
Defining MDP,
Rewards,
Returns, Policy & Value Function,
Optimal Policy and Value Functions
• Approaches to solve MDP

Announcement !!!
We have our Teaching Assistants now !!! You will see their names in the course home
page.

Acknowledgements: Some of the slides were adopted with permission from the course CSCE-689 (Texas A&M University) by Prof. Guni Sharon 10
Agent-Environment Interface

● Agent - Learner & the decision


maker
● Environment - Everything
outside the agent
● Interaction:
○ Agent performs an action
○ Environment responds by
■ presenting a new situation (change in state)
■ presenting a numerical reward
Note:
● Interaction occurs in discrete time steps
● Objective (of the interaction):
○ Maximize the return (cumulative rewards) over time

11

Grid World Example

● A maze-like problem
○ The agent lives in a grid
○ Walls block the agent’s path
● Noisy movement: actions do not always go as planned
○ 80% of the time, the action North takes the agent North
(if there is no wall there)
○ 10% of the time, North takes the agent West; 10% East
○ If there is a wall in the direction the agent would have
been taken, the agent stays put
● The agent receives rewards each time step
○ -0.1 per step (battery loss)
○ +1 if arriving at (4,3) ; -1 for arriving at (4,2) ;-1 for
arriving at (2,2)
● Goal: maximize accumulated rewards

12
Markov Decision Processes

● An MDP is defined by
○ A set of states
○ A set of actions
○ State-transition probabilities P(s' | s, a)
■ Probability of arriving at s' after performing a at s
■ Also called the model dynamics
○ A reward function R(s, a, s')
■ The utility gained from arriving at s' after performing a at s
■ Sometimes just R(s, a) or even R(s)
○ A start state
○ Maybe a terminal state

13

Markov Decision Processes


Model Dynamics

State-transition probabilities

Expected rewards for state–action–next-state triples
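The standard formulas behind these labels (a reconstruction, following Sutton & Barto):

$$p(s', r \mid s, a) = \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\}$$
$$r(s, a, s') = \mathbb{E}\left[\, R_t \mid S_{t-1} = s, A_{t-1} = a, S_t = s' \,\right]$$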

14
Markov Decision Processes - Discussion

● MDP framework is abstract and flexible


○ Time steps need not refer to fixed intervals of real time
○ The actions can be
■ at low-level controls or high-level decisions
■ totally mental or computational
○ States can take a wide variety of forms
■ Determined by low-level sensations or high-level and abstract (ex.
symbolic descriptions of objects in a room)
● The agent–environment boundary represents the limit of the agent’s absolute
control, not of its knowledge.
○ The boundary can be located at different places for different purposes

15

Markov Decision Processes - Discussion

● MDP framework is a considerable abstraction of the problem of goal-directed


learning from interaction.
● It proposes that whatever the details of the sensory, memory, and control
apparatus, and whatever objective one is trying to achieve, any problem of
learning goal-directed behavior can be reduced to three signals passing back
and forth between an agent and its environment:
○ one signal to represent the choices made by the agent (the actions)
○ one signal to represent the basis on which the choices are made (the
states),
○ and one signal to define the agent’s goal (the rewards).

16
MDP Formalization : Video Games

● State:
○ raw pixels
● Actions:
○ game controls
● Reward:
○ change in score
● State-transition probabilities:
○ defined by stochasticity in game evolution

Ref: Playing Atari with deep reinforcement learning”, Mnih et al., 2013 17

MDP Formalization : Traffic Signal Control

● State:
○ Current signal assignment (green, yellow,
and red assignment for each phase)
○ For each lane: number of approaching
vehicles, accumulated waiting time,
number of stopped vehicles, and average
speed of approaching vehicles
● Actions:
○ signal assignment
● Reward:
○ Reduction in traffic delay
● State-transition probabilities:
○ defined by stochasticity in approaching
demand
Ref: “Learning an Interpretable Traffic Signal Control Policy”, Ault et al., 2020 18
MDP Formalization : Recycling Robot (Detailed Ex.)

● Robot has
○ sensors for detecting cans
○ arm and gripper that can pick the cans and place in an
onboard bin;
● Runs on a rechargeable battery
● Its control system has components for interpreting sensory
information, for navigating, and for controlling the arm and
gripper
● Task for the RL Agent: Make high-level decisions about how
to search for cans based on the current charge level of the
battery

19

MDP Formalization : Recycling Robot (Detailed Ex.)

● State:
○ Assume that only two charge levels can be distinguished
○ S = {high, low}
● Actions:
○ A(high) = {search, wait}
○ A(low) = {search, wait, recharge}
● Reward:
○ Zero most of the time, except when securing a can
○ Cans are secured by searching and waiting, but rsearch > rwait
● State-transition probabilities:
○ [Next Slide]

20
MDP Formalization : Recycling Robot (Detailed Ex.)

● State-transition probabilities (contd…):

21

MDP Formalization : Recycling Robot (Detailed Ex.)

● State-transition probabilities (contd…):

22
Note on Goals & Rewards

● Reward Hypothesis:
All of what we mean by goals and purposes can be well thought of as
the maximization of the expected value of the cumulative sum of a
received scalar signal (called reward).
● The rewards we set up truly indicate what we want accomplished,
○ not the place to impart prior knowledge on how we want it to do
● Ex: Chess Playing Agent
○ If the agent is rewarded for taking the opponent's pieces, the agent might fall
for the opponent's trap.
● Ex: Vacuum Cleaner Agent
○ If the agent is rewarded for each unit of dirt it sucks, it can repeatedly
deposit and suck the dirt for larger reward
23

Returns & Episodes

● Goal is to maximize the expected return


● Return (Gt) is defined as some specific function of the reward
sequence
● Episodic tasks vs. Continuing tasks
● When there is a notion of final time step, say T, return can be

○ Applicable when agent-environment interaction breaks into


episodes
○ Ex: Playing Game, Trips through maze etc. [ called episodic tasks]
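The return referred to above for a final time step T (a reconstruction of the missing formula):

$$G_t = R_{t+1} + R_{t+2} + \dots + R_T$$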

24
Returns & Episodes

● Generally T = ∞
○ What if the agent receives a reward
of +1 for each timestep?
○ Discounted Return:
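A reconstruction of the discounted return:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$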

Note:

○ Discount rate determines the present


value of future rewards
25

Returns & Episodes

● What if 𝛾 is 0?
● What if 𝛾 is 1?
● Computing discounted rewards incrementally

• Sum of an infinite number of terms, it is still finite if the reward is nonzero


and constant and if 𝛾 < 1.
• Ex: reward is +1 constant
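The two facts referred to above, written out (standard reconstruction): returns can be computed incrementally via $G_t = R_{t+1} + \gamma G_{t+1}$, and for a constant reward of +1 with $\gamma < 1$,

$$G_t = \sum_{k=0}^{\infty} \gamma^k = \frac{1}{1-\gamma}$$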

26
Returns & Episodes

➔ Objective: To apply forces to a cart


moving along a track so as to keep a
pole hinged to the cart from falling over
➔ Discuss:
➔ Consider the task as episodic, that is
try/maintain balance until failure.
What could be the reward function?
➔ Repeat prev. assuming task is
continuous.

27

Policy

● A mapping from states to


probabilities of selecting each
possible action.
○ 𝛑 (a|s) is the probability that At
= a if St = s
● The purpose of learning is to
improve the agent's policy with its
experience

28
Defining Value Functions

State-value function for policy 𝝿

Action-value function for policy 𝝿

29

30
31

Defining Value Functions

State Value function in terms of Action-value function for policy 𝝿

Action Value function in terms of State value function for policy 𝝿
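A reconstruction of the two relationships named above (standard Bellman-style expansions):

$$v_\pi(s) = \sum_{a} \pi(a \mid s)\, q_\pi(s, a)$$
$$q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a)\left[\, r + \gamma\, v_\pi(s') \,\right]$$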

32
May skip to the next slide !
Bellman Equation for V𝝅
● Dynamic programming equation associated with discrete-time optimization
problems
○ Expressing V𝝅 recursively, i.e. relating V𝝅(s) to V𝝅(s’) for all s’ ∈ succ(s)

33

Bellman Equation for V𝝅

The value of the start state must equal
(1) the (discounted) value of the expected next state, plus
(2) the reward expected along the way
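Written out (a reconstruction of the standard Bellman equation for $v_\pi$):

$$v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[\, r + \gamma\, v_\pi(s') \,\right]$$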

Backup Diagram
34
Understanding V𝝅(s) with Gridworld
Reward:
○ -1 if an action takes agent off the grid
○ Exceptional reward from A and B for all actions taking agent to A’ and B’ resp.
○ 0, everywhere else

[Left: exceptional reward dynamics. Right: state-value function for the equiprobable random policy with 𝛾 = 0.9]
35

Understanding V𝝅(s) with Gridworld

Verify V𝝅(s) using Bellman equation for this state


with 𝛾 = 0.9, and equiprobable random policy

36
Understanding V𝝅(s) with Gridworld

37

Ex-1
Recollect the reward function used for Gridworld as below:
○ -1 if an action takes agent off the grid
○ Exceptional reward from A and B for all actions taking agent to A’ and B’ resp.
○ 0, everywhere else
Let us add a constant c ( say 10) to the rewards of all the actions. Will it change
anything?

38
39

Optimal Policies and Optimal Value Functions

● 𝝿 ≥ 𝝿’ if and only if v𝝿(s) ≥ v𝝿’(s) for all s ∊ S


● There is always at least one policy that is better than or
equal to all other policies → optimal policy (denoted as 𝝿*)
○ There could be more than one optimal policy !!!
Optimal state-value function

Optimal action-value function

40
Optimal Policies and Optimal Value Functions

Bellman optimality equation - expresses that the value of a state under


an optimal policy must equal the expected return for the best action
from that state
Bellman optimality equation for V*
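A reconstruction of the equation (standard form):

$$v_*(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a)\left[\, r + \gamma\, v_*(s') \,\right]$$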

41

Optimal Policies and Optimal Value Functions

Bellman optimality equation - expresses that the value of a state under


an optimal policy must equal the expected return for the best action
from that state

Bellman optimality equation for q*
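A reconstruction of the equation (standard form):

$$q_*(s, a) = \sum_{s', r} p(s', r \mid s, a)\left[\, r + \gamma \max_{a'} q_*(s', a') \,\right]$$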

42
Optimal Policies and Optimal Value Functions

Bellman optimality equation - expresses that the value of a state under


an optimal policy must equal the expected return for the best action
from that state

Backup diagrams for v* and q* 43

Optimal solutions to the gridworld example

Backup diagrams for v* and q* 44


MDP - Objective

45

Notation

46
Race car example

47

48
Race car example

49

Value iteration

50
Value Iteration

0 0 0

2 1 0

3.35 2.35 0

Check this computation on paper.
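A minimal Python sketch of the value-iteration backup used in this example (assumptions: the MDP is given as a dictionary P where P[s][a] is a list of (prob, next_state, reward) triples and every state has at least one action; names are illustrative, not from the slides):

def value_iteration(P, gamma=0.9, theta=1e-6):
    # P[s][a]: list of (prob, next_state, reward) triples
    V = {s: 0.0 for s in P}                 # initialise all state values to 0
    while True:
        delta = 0.0
        for s in P:
            # Bellman optimality backup: V(s) <- max_a sum_s' p * (r + gamma * V(s'))
            best = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:                   # stop once values change very little
            return V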


51

52
53

Infinite Utilities?!
• Problem: What if the game lasts forever? Do we get infinite
rewards?

• Solutions:
• Finite horizon: (similar to depth-limited search)
• Terminate episodes after a fixed T steps (e.g. life)
• Gives nonstationary policies (π depends on time left)

• Discounting: use 0 < γ < 1

• Smaller γ means smaller “horizon” – shorter term focus

54
Discount factor

Geometric series

55

Discounting
• It’s reasonable to maximize the sum of rewards
• It’s also reasonable to prefer rewards now to rewards later
• Discount factor: values of rewards decay exponentially

Worth Now Worth Next Step Worth In Two Steps


56
Discounting
• How to discount?
• Each time we descend a level, we
multiply in the discount once
• Why discount?
• Sooner rewards probably do have
higher utility than later rewards
• Also helps our algorithms converge
• Example: discount of 0.5
• G(r=[1,2,3]) = 1*1 + 0.5*2 + 0.25*3
• G([1,2,3]) < G([3,2,1])

57

Quiz: Discounting
• Given grid world:

• Actions: East, West, and Exit (‘Exit’ only available in terminal states: a, e)
• Rewards are given only after an exit action
• Transitions: deterministic
• Quiz 1: For γ = 1, what is the optimal policy?

• Quiz 2: For γ = 0.1, what is the optimal policy?

• Quiz 3: For which γ are West and East equally good when in state d?
58
59

Example: Grid World


A maze-like problem
The agent lives in a grid
Walls block the agent’s path
Noisy movement: actions do not always go as planned
80% of the time, the action North takes the agent North
10% of the time, North takes the agent West; 10% East
If there is a wall in the direction the agent would have
been taken, the agent stays put

The agent receives rewards each time step


Small negative reward each step (battery drain)
Big rewards come at the end (good or bad)
Goal: maximize sum of (discounted) rewards

60
k=0

Noise = 0.2
Discount = 0.9
Living reward = 0

61

k=1

Noise = 0.2
Discount = 0.9
Living reward = 0

62
k=2

Noise = 0.2
Discount = 0.9
Living reward = 0

63

k=3

Noise = 0.2
Discount = 0.9
Living reward = 0

64
k=4

Noise = 0.2
Discount = 0.9
Living reward = 0

65

k=5

Noise = 0.2
Discount = 0.9
Living reward = 0

66
k=6

Noise = 0.2
Discount = 0.9
Living reward = 0

67

k=7

Noise = 0.2
Discount = 0.9
Living reward = 0

68
k=8

Noise = 0.2
Discount = 0.9
Living reward = 0

69

k=9

Noise = 0.2
Discount = 0.9
Living reward = 0

70
k=10

Noise = 0.2
Discount = 0.9
Living reward = 0

71

k=11

Noise = 0.2
Discount = 0.9
Living reward = 0

72
k=12

Noise = 0.2
Discount = 0.9
Living reward = 0

73

k=100

Noise = 0.2
Discount = 0.9
Living reward = 0

74
Problems with Value Iteration

75

Solutions (briefly, more later…)


76
• Asynchronous value iteration
• In value iteration, we update every state in each iteration
• Actually, any sequence of Bellman updates will converge if every state is visited
infinitely often, regardless of the visitation order
• Idea: prioritize states whose value we expect to change significantly

77

Asynchronous Value Iteration


• Which states should be prioritized for an update?

A single
update per
iteration

78
Double the work?

79

Issue 2: A policy cannot be easily extracted


80
Q-learning

81

Q-learning as value iteration


82
Issue 3: The policy often converges long
before the values

83

Policy Iteration

84
Policy Evaluation

85

Policy value as a Linear program

86
Comparison
• Both value iteration and policy iteration compute the same thing (optimal
state values)
• In value iteration:
• Every iteration updates both the values and (implicitly) the policy
• We don’t track the policy, but taking the max over actions implicitly define it
• In policy iteration:
• We do several passes that update utilities with fixed policies (each pass is fast
because we consider only one action, not all of them)
• After the policy is evaluated, a new policy is chosen (slow like a value iteration
pass)
• The new policy will be better (or we’re done)
87

Issue 4: requires knowing the model and the


reward function

Offline optimization Online Learning 88


Issue 5: requires discrete (finite) set of
actions
• We will explore policy gradient approaches that are suitable for
continuous actions, e.g., throttle and steering for a vehicle
• Can such approaches be relevant for discrete action spaces?
• Yes! We can always define a continuous action space as a distribution over the
discrete actions (e.g., using the softmax function)
• Can we combine value-based approaches and policy gradient
approaches and get the best of both?
• Yes! Actor-critic methods

89

Issue 6: infeasible in large (or continuous) state spaces
• Most real-life problems contain very large state spaces (practically
infinite)
• It is infeasible to learn and store a value for every state
• Moreover, doing so is not useful as the chance of encountering a
state more than once is very small
• We will learn to generalize our learning to apply to unseen states
• We will use value function approximators that can generalize the
acquired knowledge and provide a value to any state (even if it was
not previously seen)

90
Notation

91

Required Readings

1. Chapter-3,4 of Introduction to Reinforcement Learning,2nd Ed., Sutton


& Barto

92
Thank you

93

Deep Reinforcement Learning


2023-24 Second Semester, [Link] (AIML)

Session #6-7:
Monte Carlo Methods

DRL Instructors

1
Agenda for the class

• Introduction
• On-Policy Monte Carlo Methods
• Off-Policy Monte Carlo Methods

Acknowledgements: Some of the slides were adopted with permission from the course CSCE-689 (Texas A&M University) by Prof. Guni Sharon 2

Introduction

● Recollect the problem


○ We need to learn a policy that
takes us as far and as fast as
possible;

3
Introduction

Introduction

5
(Aside) Offline vs. Online (RL)

Offline Optimization Online Learning


6

Monte Carlo Methods


• Monte Carlo methods are a broad class of computational algorithms that
rely on repeated random sampling to obtain numerical results
• The underlying concept is to obtain unbiased samples from a
complex/unknown distribution through a random process
• They are often used in physical and mathematical problems and are most
useful when it is difficult or impossible to compute a solution analytically
• Weather prediction
• Computational biology
• Computer graphics
• Finance and business
• Sport game prediction
7
First-visit Monte-Carlo Policy Evaluation
[estimate V𝛑(s)]
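A minimal Python sketch of first-visit Monte Carlo prediction (assumptions: each episode is a list of (state, reward) pairs generated by following 𝛑, where the reward is the one received on leaving that state; names are illustrative, not from the slides):

from collections import defaultdict

def first_visit_mc(episodes, gamma=1.0):
    returns_sum = defaultdict(float)   # total return observed from each state
    returns_cnt = defaultdict(int)     # number of first visits to each state
    V = defaultdict(float)             # value estimates
    for episode in episodes:
        # index of the first visit of each state in this episode
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)
        # sweep backwards, accumulating the return G
        G = 0.0
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            G = r + gamma * G
            if first_visit[s] == t:        # only the first visit contributes
                returns_sum[s] += G
                returns_cnt[s] += 1
                V[s] = returns_sum[s] / returns_cnt[s]
    return V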

Ex-1:First-visit Monte-Carlo Policy Evaluation


[estimate V𝛑(s)]

9
Acknowledgements : This example is taken from the tutorial by Peter Bodík, RAD Lab, UC Berkeley
Ex-2: First-visit Monte-Carlo Policy Evaluation
[estimate V𝛑(s)]
Input Policy π (shown on the grid) → Observed Episodes (Training) → Output Values; Assume: γ = 1

Episode 1: B, east, C, -1; C, east, D, -1; D, exit, , +10
Episode 2: B, east, C, -1; C, east, D, -1; D, exit, , +10
Episode 3: E, north, C, -1; C, east, D, -1; D, exit, , +10
Episode 4: E, north, C, -1; C, east, A, -1; A, exit, , -10

Output values: V(A) = -10, V(B) = +8, V(C) = +4, V(D) = +10, V(E) = -2
10

Problems with MC Evaluation


• What’s good about direct evaluation?
• It’s easy to understand
• It doesn’t require any knowledge of the underlying model
• It converges to the true expected values
• What’s bad about it?
• It wastes information about transition probabilities
• Each state must be learned separately
• So, it takes a long time to learn
[Output values, as before: A = -10, B = +8, C = +4, D = +10, E = -2]
Think: If B and E both go to C with the same probability, how can their values be different?
11
What about exploration?

[Gridworld with exits +1 and +10 and a -1 state; table of initial state–action value estimates, all 0 (NA where not applicable)]
12

What about exploration?



[Table of state–action value estimates after some greedy episodes: a few entries have become -1, the rest remain 0]
13
What about exploration?

[Table of state–action value estimates after purely greedy learning]
We converged on a local optimum!
14

Must explore!

15
16

MC control - example -100 +10

5 4,3 2,1 0
• w x y z
w x y z

- -,- -,- -

w x y z

exit exit

w x y z

17
MC control - example -100 +10

5 4,3 2,1 0
• w x y z
w x y z

- -,- -,- -

w x y z

exit exit

w x y z

18

MC control - example -100 +10

5 4,3 2,1 0
• w x y z
w x y z

-100 -90,0 -,- -

w x y z

exit exit

w x y z

19
MC control - example -100 +10

-100 -90,3 2,1 0


• w x y z
w x y z

-100 -90,0 -,- -

w x y z

exit exit

w x y z

20

MC control - example -100 +10

-100 -90,3 2,1 0


• w x y z
w x y z

-100 -90,0 -,- -

w x y z

exit exit

w x y z

21
MC control - example -100 +10

-100 -90,3 2,1 0


• w x y z
w x y z

-100 -90,0 -,- -

w x y z

exit exit

w x y z

22

MC control - example -100 +10

-100 -90,3 2,1 0


• w x y z
w x y z

-100 -90,0 -,- -

w x y z

exit exit

w x y z

23
MC control - example -100 +10

-100 -90,-72.9 -81,1 0


• w x y z
w x y z

-100 -90,-72.9 -81,- -

w x y z

exit exit

w x y z

24

MC control - example -100 +10

-100 -90,-72.9 -81,1 0


• w x y z
w x y z

-100 -90,-72.9 -81,- -

w x y z

exit exit

w x y z

25
MC control - example -100 +10

-100 -90,-72.9 -81,1 0


• w x y z
w x y z

-100 -90,-72.9 -81,- -

w x y z

exit exit

w x y z

26

On-policy learning
[Comparison of estimated values vs. true values for the gridworld states]
27
Quick Recap !
On-policy vs. Off-policy Learning

28

Off-policy learning

29
Off-policy learning conditions

30

Trajectory probability

31
32

33
34

Importance sampling

YES!

35
36

37
38

Importance sampling

39
40

41
42

Importance sampling: proof


43
Importance sampling: proof

Importance ratio
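A reconstruction of the importance-sampling ratio for a trajectory from time t to T−1, with target policy 𝛑 and behavior policy b:

$$\rho_{t:T-1} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}$$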

44

(ordinary) Importance sampling - example



-1

+1 +10

45
Weighted importance sampling
• Trick: normalize by the
sum of importance ratios
-1

+1 +10

Ordinary Importance sampling is unbiased


while the weighted version is biased
(initially). Ordinary Importance sampling
results in high variance while the weighted
version has a bounded variance 46
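The two estimators being compared, written out (standard reconstruction; $\mathcal{T}(s)$ is the set of time steps at which s was visited):

$$V(s) = \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}\, G_t}{|\mathcal{T}(s)|} \;\; \text{(ordinary)}, \qquad V(s) = \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}\, G_t}{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}} \;\; \text{(weighted)}$$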

Ordinary vs weighted importance sampling


• Estimating a black-jack state
• Target policy: hit on 19 or
below
• Behavior policy: random
(uniform)
• Both approaches converge
to the true value
• weighted importance
sampling is much better
initially

47
MC control + importance sampling

48

MC control + importance sampling

49
MC control + importance sampling

Going back in time

50

MC control + importance sampling

Discount future
rewards and add
immediate reward

51
MC control + importance sampling

52

MC control + importance sampling

Incremental update of Q
values (weighted moving
average)

53
MC control + importance sampling

Update target policy


(greedy)

54

MC control + importance sampling

55
MC control + importance sampling

56

MC control + IS example -100 +10

- 4,3 2,1 -

w x y z
w x y z
- 0,0 0,0 -

w x y z

exit exit

w x y z

57
MC control + IS example -100 +10

- 4,3 2,1 -

w x y z
w x y z
- 0,0 0,0 -

w x y z

exit exit

w x y z

58

MC control + IS example -100 +10

5 4,3 2,1 5

w x y z
w x y z
0 0,0 0,0 0

w x y z

exit exit

w x y z

59
MC control + IS example -100 +10

5 4,3 2,1 5

w x y z
w x y z
0 0,0 0,0 0

w x y z

exit exit

w x y z

60

MC control + IS example -100 +10

5 4,3 2,1 5

w x y z
w x y z
1 0,0 0,0 0

w x y z

exit exit

w x y z

61
MC control + IS example -100 +10

-100 4,3 2,1 5



w x y z
w x y z
1 0,0 0,0 0

w x y z

exit exit

w x y z

62

MC control + IS example -100 +10

-100 4,3 2,1 5



w x y z
w x y z
1 0,0 0,0 0

w x y z

exit exit

w x y z

63
MC control + IS example -100 +10

-100 4,3 2,1 5



w x y z
w x y z
1 0,0 0,0 0

w x y z

exit exit

w x y z

64

MC control + IS example -100 +10

-100 4,3 2,1 5



w x y z
w x y z
1 0,0 0,0 0

w x y z

exit exit

w x y z

65
MC control + IS example -100 +10

-100 4,3 2,1 5



w x y z
w x y z
1 2,0 0,0 0

w x y z

exit exit

w x y z

66

MC control + IS example -100 +10

-100 -90,3 2,1 5



w x y z
w x y z
1 2,0 0,0 0

w x y z

exit exit

w x y z

67
MC control + IS example -100 +10

-100 -90,3 2,1 5



w x y z
w x y z
1 2,0 0,0 0

w x y z

exit exit

w x y z

68

MC control + IS example -100 +10

-100 -90,3 2,1 5



w x y z
w x y z
1 2,0 0,0 0

w x y z

exit exit

w x y z

69
MC control + IS example -100 +10

-100 -90,3 2,1 5



w x y z
w x y z
1 2,0 0,0 0

w x y z

exit exit

w x y z

70

MC control + IS example -100 +10

-100 -90,3 2,1 5



w x y z
w x y z
1 2,0 0,0 0

w x y z

exit exit

w x y z

71
MC control + IS example -100 +10

-100 -90,3 2,1 5



w x y z
w x y z
1 2,0 0,0 1

w x y z

exit exit

w x y z

72

MC control + IS example -100 +10

-100 -90,3 2,1 10



w x y z
w x y z
1 2,0 0,0 1

w x y z

exit exit

w x y z

73
MC control + IS example -100 +10

-100 -90,3 2,1 10



w x y z
w x y z
1 2,0 0,0 1

w x y z

exit exit

w x y z

74

MC control + IS example -100 +10

-100 -90,3 2,1 10



w x y z
w x y z
1 2,0 0,0 1

w x y z

exit exit

w x y z

75
MC control + IS example -100 +10

-100 -90,3 2,1 10



w x y z
w x y z
1 2,0 0,0 1

w x y z

exit exit

w x y z

76

MC control + IS example -100 +10

-100 -90,3 2,1 10



w x y z
w x y z
1 2,0 0,2 1

w x y z

exit exit

w x y z

77
MC control + IS example -100 +10

-100 -90,3 2,9 10



w x y z
w x y z
1 2,0 0,2 1

w x y z

exit exit

w x y z

78

MC control + IS example -100 +10

-100 -90,3 2,9 10



w x y z
w x y z
1 2,0 0,2 1

w x y z

exit exit

w x y z

79
MC control + IS example -100 +10

-100 -90,3 2,9 10



w x y z
w x y z
1 2,0 0,2 1

w x y z

exit exit

w x y z

80

MC control + IS example -100 +10

-100 -90,3 2,9 10



w x y z
w x y z
1 2,0 0,2 1

w x y z

exit exit

w x y z

81
MC control + IS example -100 +10

-100 -90,3 2,9 10



w x y z
w x y z
1 2,4 0,2 1

w x y z

exit exit

w x y z

82

MC control + IS example -100 +10

-90,8.1
-100 2,9 10

w x y z
w x y z
1 2,4 0,2 1

w x y z

exit exit

w x y z

83
MC control + IS example -100 +10

-90,8.1
-100 2,9 10

w x y z
w x y z
1 2,4 0,2 1

w x y z

exit exit

w x y z

84

MC control + IS example -100 +10

-90,8.1
-100 2,9 10

w x y z
w x y z
1 2,4 0,2 1

w x y z

exit exit

w x y z

85
What did we learn?

86

Generalized Policy Iteration - DP


Required Readings

1. Chapter-3,4 of Introduction to Reinforcement Learning,2nd Ed., Sutton


& Barto

88

Thank you

89
Deep Reinforcement Learning
2022-23 Second Semester, [Link] (AIML)

Session #9:
Temporal Difference Learning

Instructors :
1. Prof. S. P. Vimal (vimalsp@[Link]),
2. Prof. Sangeetha Viswanathan ([Link]@[Link])
1

Agenda for the class

• Temporal Difference Learning


- TD(0)
- SARSA
- Q-Learning

Acknowledgements: Some of the slides were adopted with permission from the course CSCE-689 (Texas A&M University) by Prof. 2
Guni Sharon
Solving MDPs so far

Dynamic programming:
✅ Off policy
✅ Local learning, propagating values from neighbors (bootstrapping)
❌ Model based

Monte-Carlo:
❌ On-policy (though importance sampling can be used)
❌ Requires a full episode to train on
✅ Model free, online learning

-100 +10

w x y z 3

Fuse DP and MC
Dynamic programming:
✅ Off policy
✅ Local learning, propagating values from neighbors (bootstrapping)
❌ Model based

Monte-Carlo:
❌ On-policy (though importance sampling can be used)
❌ Requires a full episode to train on
✅ Model free, online learning

TD Learning:
✅ Off policy
✅ Local learning, propagating values from neighbors (bootstrapping)
✅ Model free, online learning
✅ Model free, online learning
Temporal difference learning

● Temporal Difference Learning Error
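A reconstruction of the TD(0) update and the TD error it is built on (standard form):

$$V(S_t) \leftarrow V(S_t) + \alpha\left[\, R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \,\right], \qquad \delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$$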

5
Temporal difference learning

10
SARSA: On-policy TD Control
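A reconstruction of the SARSA update (standard form; it uses the quintuple S, A, R, S', A'):

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\left[\, R_{t+1} + \gamma\, Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \,\right]$$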

11
SARSA: On-policy TD Control

[Worked example, slides 13-24: the four-state grid world w-x-y-z with exit rewards -100 and +10. Successive slides trace (S, A, R, S′, A′) tuples, starting from Q-values of 0; exiting at the -100 end sets that Q-value to -100, and the -100 then backs up as a discounted -90 estimate to the neighboring state, and so on.]
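Each step of the walkthrough above applies the standard SARSA update (Sutton & Barto, Ch. 6) to the tuple (S, A, R, S′, A′):

Q(S, A) ← Q(S, A) + α·[R + γ·Q(S′, A′) - Q(S, A)]

Because A′ is the action actually chosen by the behaviour policy, SARSA is on-policy: the learned Q-values reflect the exploration the agent really performs.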
Q-learning: Off-policy TD Control

25

Q-learning: Off-policy TD Control

[Worked example, slides 26-41: the same four-state grid world, now annotated with (S, A, R, S′) tuples only, since Q-learning does not need the next action A′. Starting from Q-values of 0, exiting at the -100 end sets that Q-value to -100, after which a discounted -90 estimate backs up to the neighboring state, and so on.]
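The corresponding Q-learning update applied at each step of this walkthrough (Sutton & Barto, Ch. 6) is:

Q(S, A) ← Q(S, A) + α·[R + γ·max_a Q(S′, a) - Q(S, A)]

The max over next actions makes Q-learning off-policy: it learns the value of the greedy policy even while the agent behaves ε-greedily.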

Required Readings

1. Chapter 6 of Reinforcement Learning: An Introduction, 2nd Ed., Sutton & Barto

42
Thank you

43

Deep Reinforcement Learning


2022-23 Second Semester, [Link] (AIML)

Session #10-11-12:
On Policy Prediction with Approximation

Instructors :
1. Prof. S. P. Vimal (vimalsp@[Link]),
2. Prof. Sangeetha Viswanathan ([Link]@[Link])
1
Agenda for the classes

➔ Introduction
➔ Value Function Approximation
➔ Stochastic Gradient, Semi-Gradient Methods
➔ Role of Deep Learning for Function Approximation;
➔ Feature Construction Methods

Acknowledgements: Some of the slides were adopted with permission from the course CSCE-689 (Texas A&M University) by Prof. Guni Sharon
2

Generalizing Across States

● Tabular Learning keeps a table of all state values


● In realistic situations, we cannot possibly learn
about every single state!
○ Too many states to visit them all during training
○ Too many states to hold a value table in memory
● Instead, we want to generalize:
○ Learn about some small number of training states from
experience
○ Generalize that experience to new, similar situations
○ This is a fundamental idea in machine learning
Example: Pacman

Let's say we discover through experience that this state is bad. In naïve tabular learning, we know nothing about this state, or even this one!
Demo

• Naïve Q-learning
• After 50 training
episodes

5
Learn an approximation function

Learn Predict

Generalize

● Q-learning with function


approximator

7
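With a parameterized approximator Q̂(s, a; w), the tabular entry update becomes a weight update (a semi-gradient form of Q-learning):

w ← w + α·[r + γ·max_a Q̂(s′, a; w) - Q̂(s, a; w)]·∇_w Q̂(s, a; w)

Instead of changing one table cell, each experience nudges the weights, so the change generalizes to all states with similar features.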
Parameterized function approximator

Gradient Descent

9
Gradient Descent

10

Gradient Descent

Chain rule

11
Gradient Descent

● Idea:
○ Start somewhere
○ Repeat: Take a step in the gradient direction

12

Figure source: Mathworks

Batch Gradient Descent


13
Stochastic Gradient Descent (SGD)

Observation: once gradient on one training example has been


computed, might as well incorporate before computing next one

14
Mini-Batch Gradient Descent

Observation: gradient over small set of training examples (=mini-batch)


can be computed in parallel, might as well do that instead of a single one

16

SGD for Monte Carlo estimation

w are the tunable


parameters of the value
approximation function
17
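Written out, the SGD update for Monte-Carlo value estimation uses the full return G_t as the target:

w ← w + α·[G_t - v̂(S_t, w)]·∇_w v̂(S_t, w)

This is true gradient descent on the squared error [G_t - v̂(S_t, w)]², because the target G_t does not depend on w.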
Example

18

10

Learning approximation with bootstrapping


19
Semi-gradient methods
● They do converge reliably in important cases such as the linear
approximation case
● They offer important advantages that make them often clearly
preferred
● They typically enable significantly faster learning, as we have
seen in Chapters 6 and 7
● They enable learning to be continual and online, without
waiting for the end of an episode
● This enables them to be used on continuing problems and
provides computational advantages 20

Semi-gradient TD(0)

What’s the difference


from the tabular case?
21
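The semi-gradient TD(0) weight update replaces the Monte-Carlo target with the bootstrapped TD target:

w ← w + α·[R + γ·v̂(S′, w) - v̂(S, w)]·∇_w v̂(S, w)

It is called "semi-gradient" because the target R + γ·v̂(S′, w) also depends on w, but that dependence is ignored when taking the gradient.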
Example

22

[Review] n-step TD Prediction


One-step return:

Two-step return:

Ref Section 7.1 of TB


[Review] n-step TD Prediction
One-step return:

Two-step return:

n-step return:

Ref Section 7.1 of TB

[Review] n-step TD Prediction


One-step return:

Two-step return:

n-step return:

State-value learning algorithm for using n-step returns:

Ref Section 7.1 of TB
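For reference (the equations on these review slides were images), the n-step return and the corresponding state-value update from Section 7.1 of the textbook are:

G_{t:t+n} = R_{t+1} + γ·R_{t+2} + … + γ^{n-1}·R_{t+n} + γ^n·V(S_{t+n})
V(S_t) ← V(S_t) + α·[G_{t:t+n} - V(S_t)]

Setting n = 1 recovers TD(0); letting n run to the end of the episode recovers the Monte-Carlo return.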


[Review] n-step TD Prediction

● Again, only a
simple
modification over
the tabular setting

27
● Again, only a
simple
modification over
the tabular setting
● Weight update
instead of tabular
entry update

28

Another optimization approach

29

For a linear
approximator

30

TD-error = 0

Least squares TD

31
Least squares TD

Store the inverse of A


instead of A

32

Least squares TD

33
Incremental updates
(no need to store all
previous transitions)
Least squares TD

34
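For a linear approximator v̂(s, w) = wᵀx(s), the LSTD solution can be written in closed form (Sutton & Barto, Section 9.8):

Â = Σ_t x_t·(x_t - γ·x_{t+1})ᵀ + εI,   b̂ = Σ_t R_{t+1}·x_t,   w = Â⁻¹·b̂

The εI term keeps Â invertible, and maintaining Â⁻¹ incrementally (Sherman-Morrison) gives O(d²) cost per step instead of storing all past transitions.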

Feature selection

35

Features are domain dependent,
requiring expert knowledge
Automatic features extraction

36

Automatic features extraction for linear approximator

37
Automatic features extraction for linear approximator -
Coarse Coding
● Natural representation of the state set is
continuous
● In 2-d, features corresponding to circles in
state space
● Coding of a state:
○ If the state is inside a circle, then the
corresponding feature has the value 1
○ otherwise the feature is 0
● Corresponding to each circle is a single
weight (a component of w) that is learned
○ Training a state affects the weights of all the
intersecting circles.

Automatic features extraction for linear approximator -


Coarse Coding
Automatic features extraction for linear approximator -
Coarse Coding
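To make coarse coding concrete, here is a small illustrative sketch (my own, not from the slides), assuming 2-D states in the unit square with randomly placed circular receptive fields of a hand-picked radius:

import numpy as np

# Each circle contributes one binary feature; the value estimate is a
# linear function of those features.

def coarse_code(state, centers, radius):
    """Return the binary feature vector: 1 if `state` lies inside a circle."""
    dists = np.linalg.norm(centers - np.asarray(state), axis=1)
    return (dists <= radius).astype(float)

rng = np.random.default_rng(0)
centers = rng.uniform(0.0, 1.0, size=(50, 2))   # 50 random circle centers
radius = 0.2
w = np.zeros(len(centers))                       # one weight per circle

def v_hat(state):
    return w @ coarse_code(state, centers, radius)

def td0_update(s, r, s_next, alpha=0.1, gamma=0.9):
    # Semi-gradient TD(0) on the coarse-coded features: training one state
    # updates the weights of all circles that contain it.
    global w
    x = coarse_code(s, centers, radius)
    delta = r + gamma * v_hat(s_next) - v_hat(s)
    w += alpha * delta * x                       # gradient of w·x w.r.t. w is x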

Automatic features extraction for linear approximator - Tile


Coding
Automatic features extraction for linear approximator - Tile
Coding

Automatic features extraction for linear approximator - Tile


Coding
Automatic features extraction for linear
approximator
● Other approaches include: Coarse Coding, Tile Coding,
Radial Basis Functions (See chapter 9.5 in textbook)
● Each of these approaches defines a set of features, some
useful yet most are not
○ E.g., is there a polynomial/Fourier function that translates pixels to pan
location?
○ Probably but it’s a needle in a (combinatorial) haystack
● Can we do better (generically)
○ Yes, using deep neural networks…

44

What did we learn?


● Reinforcement learning must generalize on observed
experience if it is to be applicable to real world domains
● We can use parameterized function approximation to represent
our knowledge about the domain state/action values
● Use stochastic gradient descent to update the tunable
parameters such that the observed (TD, rollout) error is
reduced
● When using a linear approximator, the Least squares TD
method provides the most sample efficient approximation

45
Deep Q Network
Mnih et al. 2015
• First deep learning model to successfully learn control policies
directly from high-dimensional sensory input using reinforcement
learning
• The model is a convolutional neural network, trained with a variant of
Q-learning
• Input is raw pixels and output is an action-value function estimating
future rewards
• Surpassed a human expert on various Atari video games

46

The age of deep learning


• Previous models relied on hand-crafted features combined with
linear value functions
• The performance of such systems heavily relies on the quality of the feature
representation
• Advances in deep learning have made it possible to automatically
extract high-level features from raw sensory data

47
Example: Pacman

Let's say we discover through experience that this state is bad. In naïve Q-learning, we know nothing about this state, or even this one!

We must generalize our knowledge!


48

• Naïve Q-learning
• After 50 training
episodes

49
• Generalize Q-learning
with function
approximator

50

• Generalizing
knowledge results in
efficient learning
• E.g., learn to avoid the
ghosts

51
Generalizing with Deep learning
• Supervised: Require large amounts of hand-labelled training data
• RL on the other hand, learns from a scalar reward signal that is frequently
sparse, noisy, and delayed
• Supervised: Assume the data samples are independent
• In RL one typically encounters sequences of highly correlated states
• Supervised: Assume a fixed underlying distribution
• In RL the data distribution changes as the algorithm learns new behaviors
• DQN was first to demonstrate that a convolutional neural network
can overcome these challenges to learn successful control policies
from raw video data in complex RL environments
52

Deep Q learning [Mnih et al. 2015]


• Trains a generic neural network-based agent that successfully learns
to operate in as many domains as possible
• The network is not provided with any domain-specific information or
hand-designed features
• Must learn from nothing but the raw input (pixels), the reward,
terminal signals, and the set of possible actions

53
Original Q-learning

54

Deep Q learning [Mnih et al. 2015]


• DQN addresses problems of correlated data and non-stationary
distributions
• Use an experience replay mechanism
• Randomly samples and trains on previous transitions
• Results in a smoother training distribution over many past behaviors

55
Deep Q learning [Mnih et al. 2015]

56

Deep Q learning [Mnih et al. 2015]


57
Q-learning with experience replay

58

Q-learning with experience replay

Play m episodes (full games)

59
Q-learning with experience replay

60

Q-learning with experience replay

For each time step during the episode

61
Q-learning with experience replay

With small probability select a random


action (explore), otherwise select the,
currently known, best action (exploit).

62

Q-learning with experience replay

Execute the chosen action and store the


(processed) observed transition in the
replay memory

63
Q-learning with experience replay

64

Q-learning with experience replay

65
Q-learning with experience replay

66
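To make the structure of the annotated loop concrete, here is a minimal sketch (a simplification of my own: it uses a linear Q-function rather than a convolutional network, and the feature map phi and the environment loop that fills the buffer are assumed rather than defined):

import random
from collections import deque
import numpy as np

class ReplayBuffer:
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)   # old transitions are overwritten

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

def train_step(W, batch, phi, alpha=0.01, gamma=0.99):
    """Semi-gradient Q-learning update on a sampled minibatch.

    W   : weight matrix of shape (n_actions, n_features)
    phi : feature map, state -> array of shape (n_features,)
    """
    for s, a, r, s_next, done in batch:
        x = phi(s)
        q_sa = W[a] @ x
        target = r if done else r + gamma * np.max(W @ phi(s_next))
        W[a] += alpha * (target - q_sa) * x    # gradient of W[a]@x w.r.t. W[a] is x
    return W

A full DQN additionally maintains a separate target network whose weights are copied from the online network only periodically, which further stabilizes the bootstrapped targets.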

Deep Q learning

67
Experience replay
• Utilizing experience replay has several advantages
• Each step of experience is potentially used in many weight updates, which allows
for greater data efficiency
• Learning directly from consecutive samples is inefficient, due to the strong
correlations between the samples; randomizing the samples breaks these
correlations and therefore reduces the variance of the updates
• The behavior distribution is averaged over many of its previous states, smoothing
out learning and avoiding oscillations or divergence in the parameters
• Note that when learning by experience replay, it is necessary to learn off-
policy (because our current parameters are different to those used to
generate the sample), which motivates the choice of Q-learning

68

Experience replay
• DQN only stores the last N experience tuples in the replay memory
• Old transitions are overwritten
• Samples uniformly at random from the buffer when performing
updates
• Is there room for improvement?
• Important transitions?
• Prioritized sweeping
• Prioritize deletions from the replay memory
• see prioritized experience replay, [Link]

69
Results: DQN

70

Results: DQN
• Average predicted action-value on a held-out set of states on Space
Invaders (c) and Seaquest (d)

71
• Normalized between a
professional human games
tester (100%) and random
play (0%)

• E.g., in Pong, DQN achieved a


factor of 1.32 higher score on
average when compared to a
professional human player

72

Maximization bias

73
Double Deep Q networks (DDQN)
We have two available Q
networks. Let's use them for
double learning

74

Double Deep Q networks (DDQN)


We have two available Q
networks. Let's use them for
double learning

75
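Concretely, the Double DQN target decouples action selection from action evaluation across the two networks:

y = r + γ·Q(s′, argmax_a Q(s′, a; θ_online); θ_target)

The online network chooses the action and the target network evaluates it, which reduces the overestimation that comes from taking a max over noisy value estimates.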
• Hasselt et al.
2015
• The straight horizontal
orange (for DQN) and
blue (for Double DQN)
lines in the top row are
computed by running the
corresponding agents
after learning concluded,
and averaging the actual
discounted return
obtained from each
visited state. These
straight lines would
match the learning
curves at the right side of
the plots if there is no
bias.
76

What did we learn?


• Using deep neural networks as function approximators in RL is tricky
• Sparse samples
• Correlated samples
• Evolving policy (nonstationary sample distribution)
• DQN attempts to address these issues
• Reuse previous transitions at each training (SGD) step
• Randomly sample previous transitions to break correlation
• Use off-policy, TD(0) learning to allow convergence to the true target values
(Q*)
• No guarantees for non-linear (DNN) approximators

77
Required Readings

1. Chapter 9 of Reinforcement Learning: An Introduction, 2nd Ed., Sutton & Barto

78

Thank you

79
Deep Reinforcement Learning
2022-23 Second Semester, [Link] (AIML)

Session #13:
Policy Gradients - REINFORCE, Actor-
Critic algorithms

Instructors :
1. Prof. S. P. Vimal (vimalsp@[Link]),
2. Prof. Sangeetha Viswanathan ([Link]@[Link])
1

Agenda for the classes

➔ Introduction
➔ Policy gradients
➔ REINFORCE algorithm
➔ Actor-critic methods
➔ REINFORCE - example

Acknowledgements: Some of the slides were adopted with permission from the course CSCE-689 (Texas A&M University) by Prof. Guni Sharon
2
Notation

Improving the policy



Advantages of PG
• Policy convergence over time, as opposed to an epsilon-greedy value-based
approach
• Naturally applies to continuous action spaces, as opposed to a Q-learning
approach
• In many domains the policy is a simpler function to approximate
Though this is not always the case
• Choice of policy parameterization is sometimes a good way of injecting prior
knowledge
E.g., in phase assignment by a traffic controller
• Can converge on stochastic optimal policies, as opposed to value-based
approaches
Useful in games with imperfect information where the optimal play is often to do two different
things with specific probabilities, e.g., bluffing in Poker

Stochastic policy


Evaluate the gradient in performance

Monte-Carlo Policy Gradient



REINFORCE [Williams, 1992]

REINFORCE [Williams, 1992]



REINFORCE [Williams, 1992]

REINFORCE [Williams, 1992]

12
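The equations on these slides were images in the original deck; the core REINFORCE update (Sutton & Barto, Ch. 13), applied for each step t of a sampled episode, is:

θ ← θ + α·γ^t·G_t·∇_θ ln π(A_t | S_t, θ)

where G_t is the return from time t onwards; actions followed by large returns have their log-probability pushed up proportionally.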
REINFORCE [Williams, 1992]

● The crazy corridor


domain

● With the right step size,


the total reward per
episode approaches the
optimal value of the start
state
13

Guaranteed to converge to a local optimum under standard


stochastic approximation conditions for decreasing α

Calibrating REINFORCE

14
PG with baseline

15

PG with baseline

PG with baseline

REINFORCE with baseline

18

Each approximator has its


unique learning rate
REINFORCE with baseline
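With a learned state-value baseline v̂(S_t, w), each update step becomes:

δ = G_t - v̂(S_t, w)
w ← w + α_w·δ·∇_w v̂(S_t, w)
θ ← θ + α_θ·γ^t·δ·∇_θ ln π(A_t | S_t, θ)

Subtracting the baseline leaves the gradient unbiased but can greatly reduce its variance; note that the two approximators keep their own step sizes α_w and α_θ.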

Policy Gradient

20
Add a critic

Critic’s duties

22
Benefits from a critic

● REINFORCE with baseline is unbiased* and will converge asymptotically


to a local optimum
* With a linear state value approximator, and when b is not a function of a
○ Like all Monte-Carlo methods it tends to learn slowly (produce estimates of high variance)
○ Not suitable for online or for continuing problems
● Temporal-difference methods can eliminate these inconveniences
● In order to gain the TD advantages in the case of policy gradient methods
we use actor–critic methods

Actor+critic

● Actor-critic algorithms are a derivative of policy iteration, which alternates


between policy evaluation—computing the value function for a policy—and
policy improvement—using the value function to obtain a better policy
● In large-scale reinforcement learning problems, it is typically impractical to
run either of these steps to convergence, and instead the value function
and policy are optimized jointly
● The policy is referred to as the actor, and the value function as the critic
Advantage function

One-step actor-critic

26
Keep track of
accumulated discount

27

Follow the current


policy

28
Compute the TD error

29

Update the critic without


the accumulated discount.
(The discount factor is
included in the TD error)

30
Update the actor with
discounting. Early actions
matter more.

31

Update accumulated
discount and progress to
the next state

32
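Putting the annotated steps together, one-step actor-critic (Sutton & Barto, Ch. 13) performs, at every step of the episode:

δ = R + γ·v̂(S′, w) - v̂(S, w)            (TD error from the critic)
w ← w + α_w·δ·∇_w v̂(S, w)               (critic update, no extra discount)
θ ← θ + α_θ·I·δ·∇_θ ln π(A | S, θ)       (actor update, I = accumulated discount)
I ← γ·I,  S ← S′

with I initialized to 1 at the start of each episode.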
In practice: training the
network at every step on a
single observation is
inefficient (slow and
correlated)

33

Instead: store all state approx


values, log probabilities, and
rewards along the episode.
Train once at the end of the
episode.
34
Instead: store all state approx.
values, log probabilities, and
rewards along the episode.
Train once at the end of the
episode.
35

values = [Link](values)
Qvals = [Link](Qvals)
log_probs = [Link](log_probs)
advantage = Qvals - values

Advantage Actor-Critic (A2C)


Add eligibility traces

Gradient eligibility per tunable


parameter for both the actor
and the critic approximators

37

What did we learn?


What did we learn?

39

Required Readings

1. Chapter 13 of Reinforcement Learning: An Introduction, 2nd Ed., Sutton & Barto

40
Thank you

41

Deep Reinforcement Learning


2022-23 Second Semester, [Link] (AIML)

Session #14:
Model Based Algorithms

Instructors :
1. Prof. S. P. Vimal (vimalsp@[Link]),
2. Prof. Sangeetha Viswanathan ([Link]@[Link])
1
Agenda for the classes

➔ Introduction
➔ Upper-Confidence-bound Action Selection
➔ Monte-Carlo Tree Search
➔ AlphaGo Zero
➔ MuZero, PlaNet

Acknowledgements: Some of the slides were adopted with permission from the course CSCE-689 (Texas A&M University) by Prof. Guni Sharon
2

Model Based Algorithms

- Learn the model of an environment’s transition dynamics or make use


of a known dynamic model
- Once an agent has a model of the environment, P(s’|s,a), it can
“imagine” what will happen in the future by predicting the trajectory for a
few time steps.
- If the environment is in state s, an agent can estimate how the state will
change if it makes a sequence of actions a1, a2, . . . , an by repeatedly
applying P(s’|s,a) all without actually producing an action to change the
environment.
- Hence, the predicted trajectory occurs in the agent’s “head” using a
model.
Model Based Algorithms

- Most commonly applied to games with a target state, such as


winning or losing in a game of chess, or navigation tasks with
a goal state s*.
- Do not model any rewards

Model based algorithms - Advantages

- Very appealing : it can play out scenarios and understand the


consequences of its actions without having to actually act in
an environment.
- Require many fewer samples of data to learn good policies
since having a model enables an agent to supplement its
actual experiences with imagined ones.
Model based algorithms - challenges

- Models are hard to come by.


- An environment with a large state space and action space
can be very difficult to model; doing so may even be
intractable, especially if the transitions are extremely complex
- Models are only useful when they can accurately predict the
transitions of an environment many steps into the future -
prediction errors

Upper Confidence bound action selection

UCB is a deterministic algorithm for reinforcement learning that balances
exploration and exploitation using a confidence bound that the algorithm
assigns to each machine on each round of exploration. This bound decreases
as a machine is selected more often relative to the other machines.

The Upper Confidence Bound follows the principle of optimism in the face of uncertainty
which implies that if we are uncertain about an action, we should optimistically assume that
it is the correct action.

Initially, UCB explores more to systematically reduce uncertainty but its exploration reduces
over time. Thus we can say that UCB obtains greater reward on average than other
algorithms such as Epsilon-greedy, Optimistic Initial Values, etc.
UCB Algorithm

Upper Confidence bound action selection

Steps followed in Upper Confidence Bound

1. At each round n, we consider two numbers for machine m.


-> Nₘ(n) = number of times the machine m was selected up to round n.
-> Rₘ(n) = sum of rewards obtained from machine m up to round n.
2. From these two numbers we have to calculate,
a. The average reward of machine m up to round n, rₘ(n) = Rₘ(n) / Nₘ(n).
b. The confidence interval [ rₘ(n) — Δₘ(n), rₘ(n)+Δₘ(n) ] at round n with, Δₘ(n)=
sqrt( 1.5 * log(n) / Nₘ(n) )
3. We select the machine m that has the maximum UCB, ( rₘ(n)+Δₘ(n) )
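A small sketch of this selection rule (illustrative; it assumes the per-machine reward sums R and selection counts N are tracked by the caller):

import math

def ucb_select(R, N, n):
    """Pick the machine with the highest upper confidence bound at round n.

    R[m] : sum of rewards obtained from machine m so far
    N[m] : number of times machine m has been selected so far
    """
    best_m, best_ucb = None, -float("inf")
    for m in range(len(N)):
        if N[m] == 0:
            return m                      # play every machine at least once
        avg = R[m] / N[m]
        delta = math.sqrt(1.5 * math.log(n) / N[m])
        if avg + delta > best_ucb:
            best_m, best_ucb = m, avg + delta
    return best_m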
Monte Carlo Tree Search
- Monte Carlo Tree Search (MCTS) is a heuristic search algorithm that has gained significant attention
in artificial intelligence, especially in the areas of decision-making and game playing.
- MCTS combines the principles of Monte Carlo methods, which rely on random sampling
and statistical evaluation, with tree-based search techniques.
- Idea :
- Build a search tree incrementally by simulating multiple random plays
(often called rollouts or playouts) from the current game state.
- These simulations are carried out until a terminal state or a predefined depth is
reached.
- The results of these simulations are then backpropagated up the tree, updating the
statistics of the nodes visited during the play, such as the number of visits
and the win ratios.

Monte Carlo Tree Search

- It is a probabilistic and heuristic driven search algorithm that combines the


classic tree search implementations alongside machine learning principles of
reinforcement learning.
- exploration-exploitation trade-off : exploits the actions and strategies that is
found to be the best till now but also must continue to explore the local space
of alternative decisions and find out if they could replace the current best.
- Exploration expands the tree's breadth more than its depth, but it quickly
becomes inefficient in situations with a large number of steps or repetitions.
- Exploitation sticks to a single path that has the greatest estimated value. This is
a greedy approach and this will extend the tree’s depth more than its breadth.
Monte Carlo Tree Search

Why use Monte Carlo Tree Search (MCTS) ?

1. Handling Complex and Strategic Games


2. Unknown or Imperfect Information
3. Learning from Simulations
4. Optimizing Exploration and Exploitation
5. Scalability and Parallelization
6. Applicability Beyond Games
7. Domain Independence

Monte Carlo Tree Search


Nodes are the building blocks of the search tree and are formed based on the outcome of
a number of simulations.
MDPs can be represented as trees (or graphs), called ExpectiMax trees:

The letters a-e represent actions, and


letters s-x represent states. White nodes
are state nodes, and the small black
nodes represent the probabilistic
uncertainty: the ‘environment’
choosing which outcome from an
action happens, based on the transition
function.
Monte Carlo Tree Search – Overview
The algorithm is online, which means the action selection is interleaved with action
execution. Thus, MCTS is invoked every time an agent visits a new state.
Fundamental features:
1. The Q-value Q(s,a) for each state-action pair is approximated using random simulation.
2. For a single-agent problem, an ExpectiMax search tree is built incrementally
3. The search terminates when some pre-defined computational budget is used up,
such as a time limit or a number of expanded nodes. Therefore, it is an anytime
algorithm, as it can be terminated at any time and still give an answer.
4. The best performing action is returned.
○ This is complete if there are no dead–ends.
○ This is optimal if an entire search can be performed (which is unusual – if
the problem is that small we should just use a dynamic programming
technique such as value iteration).

The Framework of MCTS


The basic framework is to build up a tree using simulation. The states that have been evaluated
are stored in a search tree. The set of evaluated states is incrementally built be iterating over
the following four steps:

● Select: Select a single node in the tree that is not fully expanded. By this, we mean at
least one of its children is not yet explored.
● Expand: Expand this node by applying one available action (as defined by the MDP)
from the node.
● Simulation: From one of the outcomes of the expanded node, perform a complete random
simulation of the MDP to a terminating state. This therefore assumes that the simulation
is finite, but versions of MCTS exist in which we just execute for some time and then
estimate the outcome.
● Backpropagate: Finally, the value of the node is backpropagated to the root node,
updating the value of each ancestor node on the way using expected value
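As an illustration of the four phases, here is a compact UCT-style sketch (my own; the simulator object sim with methods legal_actions(state), step(state, action) and rollout(state), as well as the exploration constant c, are assumptions, not part of the slides):

import math

class Node:
    def __init__(self, state, parent=None, action=None):
        self.state, self.parent, self.action = state, parent, action
        self.children = []        # expanded child nodes
        self.untried = None       # actions not yet expanded from this node
        self.visits, self.value = 0, 0.0

def uct_child(node, c=1.41):
    # Select the child maximizing the UCT score (exploitation + exploration).
    return max(node.children,
               key=lambda ch: ch.value / ch.visits
                              + c * math.sqrt(math.log(node.visits) / ch.visits))

def mcts(sim, root_state, n_iter=1000, gamma=1.0):
    root = Node(root_state)
    root.untried = list(sim.legal_actions(root_state))
    for _ in range(n_iter):
        node = root
        # 1. Select: walk down fully expanded nodes via UCT.
        while not node.untried and node.children:
            node = uct_child(node)
        # 2. Expand: try one untried action, if any.
        if node.untried:
            a = node.untried.pop()
            s_next = sim.step(node.state, a)
            child = Node(s_next, parent=node, action=a)
            child.untried = list(sim.legal_actions(s_next))
            node.children.append(child)
            node = child
        # 3. Simulate: random rollout from the new node to a terminal state.
        G = sim.rollout(node.state)
        # 4. Backpropagate: update visit counts and values up to the root,
        #    discounting as we move one step up the tree.
        while node is not None:
            node.visits += 1
            node.value += G
            G *= gamma
            node = node.parent
    # Return the action of the most visited child of the root.
    return max(root.children, key=lambda ch: ch.visits).action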
MCTS Framework

In a basic MCTS algorithm we incrementally build of the search tree. Each node in the
tree stores:

1. a set of children nodes;


2. pointers to its parent node and parent action; and
3. the number of times it has been visited.
Selection: The first loop
progressively selects a branch in the
tree using a multi-armed bandit
algorithm using Q(s|a). The
outcome that occurs from an action
is chosen according to P(s’|s,a)
defined in the MDP.

Expansion
Select an action a to apply in state s, either
randomly or using an heuristic. Get an
outcome state s’ from applying action a in
state s according to the probability
distribution p(s’|s). Expand a new
environment node and a new state node for
that outcome.
Simulation: Perform a randomised
simulation of the MDP until we
reach a terminating state. That is,
at each choice point, randomly
select a possible action from the
MDP, and use transition
probabilities P(s’|s) to choose an
outcome for each action.

Heuristics can be used to improve the random simulation by guiding it towards more promising states.
G is the cumulative discounted reward received from the simulation starting at s’ until the simulation
terminates.
To avoid memory explosion, we discard all nodes generated from the simulation. In any non-trivial
search, we are unlikely to ever need them again.

Backpropagation: The reward


from the simulation is
backpropagated from the
selected node to its ancestors
recursively. We must not
forget the discount factor! For
each state s and action a
selected in the Select step,
update the cumulative reward
of that state.
Example :
Backpropagation
Issues in Monte Carlo Tree Search:
1. Exploration-Exploitation Trade-off
2. Sample Efficiency
3. High Variance
4. Heuristic Design
5. Computation and Memory Requirements
6. Overfitting
7. Domain-specific Challenges

AlphaZero

- Combining MCTS and TD learning: Alpha Zero


- Alpha Zero (or more accurately its predecessor AlphaGo) made headlines when it beat
Go world champion Lee Sedol in 2016.
- It uses a combination of MCTS and (deep) reinforcement learning to learn a policy.
AlphaZero - Overview
1. AlphaZero uses a deep neural network to estimate the Q-function. More accurately, it
gives an estimate of the probability of selecting action a in state s, (P(a|s) and the value
of the state (V(s)), which represents the probability of the player winning from s.
2. It is trained via self-play. Self-play is when the same policy is used to generate the moves
of both the learning agent and any of its opponents. In AlphaZero, this means that
initially, both players make random moves, but both also learn the same policy and use it
to select subsequent moves.
3. In short, AlphaZero is a game-playing program that, through a combination of self-play
and neural network reinforcement learning (more on that later), is able to learn to play
games such as chess and Go from scratch ─ that is, after being fed nothing more than the
rules of said games.

AlphaZero

A DNN initialised to return 2 randomly generated outputs: v(s), and p(s). The network will later be
trained to take in position s and predict v(s) and p(s) from it.
The AlphaZero framework. Mastering the
Game of Go without Human Knowledge.
D. Silver, et al. Nature volume 550, pages
354–359 (2017)

Implications of AlphaZero

- Increase the probability of a win come the end of the game


- AlphaZero’s evaluation function is the product of a highly sophisticated neural
network’s experience accumulated over millions of games
- AlphaZero does not make use of any human knowledge, we can expect it to come up
with brand-new ideas previously unknown to mankind.
MuZero

- AlphaZero (2017) - learnt to play on its own, without any human data or
domain knowledge, even mastering three different games (Go, Chess,
Shogi) with a single algorithm that was only provided with the known set of
rules.
- MuZero (2020) - learnt to master these games (Go, Chess, Shogi, Atari) without
being provided with the known rules
- Big step forward towards general-purpose algorithms

MuZero

MuZero models three elements that are key to the planner:

● Value: how good is the current state?

● Policy: which action should be the next one?

● Reward: how good was the last action taken?


MuZero - overview
• The MuZero algorithm aims to accurately predict characteristics and details of
the future which it deems important for planning.
• The algorithm initially receives an input, for instance an image of a chess
board, which is translated into a hidden state.
• The hidden state then undergoes iterations based on the previous hidden state
and a proposed subsequent plan of action.
• Each time the hidden state is updated the model predicts three variables:
policy, value function and immediate reward.
• The policy is the next move to be played, the value function is the predicted
winner, and the immediate reward is the strength of the move (if it improves
the player’s position).
• The model is then trained to accurately predict the values of the three
aforementioned variables.

- The MuZero algorithm does not


receive any rules of the game, for
instance in a chess game, the algorithm
is unaware of the legal moves or what
constitutes a win, a draw or a loss.
- MuZero only receives the rewards of
its actions when a game is terminated,
either by winning, drawing, or losing.
However, during its learning, MuZero
receives rewards periodically based on
how well it is doing in its own mini-
games.
Training and Loss function
- function train_network repeatedly trains the neural network by making use of the
replayBuffer.
- The train_network function works by looping the total number of training steps, which
is set to one million by default.
- The function then samples a batch at every step and uses that data to update the
neural network.
- The tuples within a batch are: the current state of the game, a list of actions taken from
the current position and lastly the targets used to train the neural networks.
- The targets used to train the neural networks are calculated by using Temporal
Difference (TD) Learning .
- The loss function of MuZero is responsible for how the weights of the neural network
are updated.

PlaNet
- Deep Planning Network - Google AI & DeepMind Initiative
- PlaNet agent was tasked with ‘planning’ a sequence of actions to achieve a
goal like pole balancing, teaching a virtual entity (human or cheetah) to
walk, or keeping a box rotating by hitting it in a specific location.
- Common goals between these tasks that the PlaNet needed to achieve:
1. The Agent needs to predict a variety of possible futures (for robust
planning)
2. The Agent needs to update the plan based on the outcomes/rewards of a
recent action
3. The Agent needs to retain information over many time steps

So how did the Google AI team achieve these goals?


PlaNet Overview

1. Learning with a latent dynamics model — PlaNet learns from a series of hidden or
latent states instead of images to predict the latent state moving forward.
2. Model-based planning — PlaNet works without a policy network and instead makes
decisions based on continuous planning.
3. Transfer learning — The Google AI team trained a single PlaNet agent to solve all
six different tasks.

1) Latent Dynamics Model - Key benefits to using compact latent state spaces are that it allows
the agent to learn more abstract representations like the objects’ positions and velocities and
also avoid having to generate images.

Learned Latent Dynamics


Model — Instead of using
the input images directly,
the encoder networks (gray
trapezoids) compress the
images’ information into
hidden states (green circles).
These hidden states are then
used to predict future
images (blue trapezoids)
and rewards (blue
rectangle).
2. Model based planning

3. Transfer Learning
- one agent for all tasks

The agent is randomly placed into different environments without knowing the task,
so it needs to infer the task from its image observations. Without changes to the hyper
parameters, the multi-task agent achieves the same mean performance as individual
agents.

Required Readings and references

1. [Link]
2. [Link]
confidence-bound-%28ucb%29
3. [Link]
reinforcement-learning-b97d3e743d0f
4. [Link]
5. [Link]
what-sets-it-apart-and-what-it-can-tell-us-4ab3d2d08867
6. [Link]
7. [Link]
about-googles-new-planet-reinforcement-learning-network-
144c2ca3f284
8. [Link]
[Link]?m=1
41
Thank you

42

Deep Reinforcement Learning


2022-23 Second Semester, [Link] (AIML)

Session #15:
Imitation Learning

Instructors :
1. Prof. S. P. Vimal (vimalsp@[Link]),
2. Prof. Sangeetha Viswanathan ([Link]@[Link])
1
Agenda for the classes

➔ Introduction
➔ Behavioral Cloning
➔ DAGGER: Dataset Aggregation
➔ Inverse RL
➔ Inverse RL as GAN

Acknowledgements: Some of the slides were adopted with permission from the course CSCE-689 (Texas A&M University) by Prof. Guni Sharon
2

Imitation Learning in a Nutshell

Given: demonstrations or demonstrator


Goal: train a policy to mimic demonstrations

Expert Demonstrations State/Action Pairs Learning

Images from Stephane Ross


Ingredients of Imitation Learning

Demonstrations or Demonstrator Environment / Simulator Policy Class

vs

Loss Function Learning Algorithm

Some Interesting Examples

• ALVINN [Link]
neural-network/
Dean Pomerleau et al., 1989-1999 [Link]

• Helicopter Acrobatics
Learning for Control from Multiple Demonstrations - Adam Coates, Pieter Abbeel, Andrew Ng, ICML 2008
An Application of Reinforcement Learning to Aerobatic Helicopter Flight - Pieter Abbeel, Adam Coates, Morgan
Quigley, Andrew Y. Ng, NIPS 2006

[Link]

• Ghosting ( Sports Analytics) - Next Slide.


Ghosting

Data Driven Ghosting using


Deep Imitation Learning
Hoang M. Le et al., SSAC 2017
[Link]

Notation & Set-up


Notation & Set-up

Example #1: Racing Game


(Super Tux Kart)

s = game screen
a = turning angle

Training set: D={𝑟≔(s,a)} from 𝜋*


● s = sequence of s
● a = sequence of a

Goal: learn 𝜋𝜃(s)→a

Images from Stephane Ross


Example #2: Basketball
Trajectories

s = location of players & ball


a = next location of player

Training set: D={𝑟≔(s,a)} from 𝜋*


● s = sequence of s
● a = sequence of a

Goal: learn 𝜋𝜃(s)→a

Behavioral Cloning = Reduction to Supervised Learning (Ignoring


regularization for brevity.)
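A minimal sketch of this reduction (illustrative only; it assumes continuous actions and fits a linear policy by least squares, whereas practical behavioral cloning usually trains a neural network with SGD):

import numpy as np

def behavioral_cloning(states, actions):
    """Fit a linear policy a ≈ W @ s to expert (state, action) pairs.

    states  : array of shape (N, state_dim)
    actions : array of shape (N, action_dim)
    """
    # Ordinary least squares: minimize sum_i ||W @ s_i - a_i||^2.
    W, *_ = np.linalg.lstsq(states, actions, rcond=None)
    return W.T                      # shape (action_dim, state_dim)

def policy(W, s):
    return W @ s                    # imitate the expert's action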
Behavioral Cloning vs. Imitation Learning

Limitations of Behavioral Cloning


Limitations of Behavioral Cloning

Compounding Errors

When to use Behavioral Cloning?


Types of Imitation Learning

Types of Imitation Learning


DAGGER: Dataset Aggregation

Inverse RL
• What if we don’t have an online demonstrator?
• We only have access to an offline set of demonstrated trajectories
• Behavioral cloning is not robust
• Suffers from overfitting
• We know what to do in observed states but can’t generalize well to other states
• How can we learn to mimic the demonstrator in a general way?
• Learn the demonstrator’s objective (reward) function
• Apply RL
Inverse RL
• What if we don’t have an online demonstrator?
• We only have access to an offline set of demonstrated trajectories
• Behavioral cloning is not robust
• Suffers from overfitting
• We know what to do in observed states but can’t generalize well to other states
• How can we learn to mimic the demonstrator in a general way?
• Learn the demonstrator’s objective (reward) function
• Apply RL

Inverse RL
Inverse RL

Inverse RL
Inverse RL

Inverse RL
Inverse RL

GAN
Inverse RL as GAN

Inverse RL as GAN

Required Readings and references

33
Thank you

34

Deep Reinforcement Learning


2022-23 Second Semester, [Link] (AIML)

Session #16:
Multi-Agent Reinforcement Learning

Instructors :
1. Prof. S. P. Vimal (vimalsp@[Link]),
2. Prof. Sangeetha Viswanathan ([Link]@[Link])
1
Agenda for the classes

➔ Introduction
➔ Cooperative vs Competitive agents,
➔ centralized vs. decentralized RL ;
➔ Proximal Policy Optimization (Surrogate Objective
Function, Clipping)

Acknowledgements: Some of the slides were adopted with permission from the course CSCE-689 (Texas A&M University) by Prof. Guni Sharon
2

Multi-Agent RL (MARL)

Vanilla reinforcement learning is concerned with a single agent, in an environment,


seeking to maximize the total reward in that environment.
A robot learning to walk, where its overall goal is to walk without falling.
It receives rewards for taking steps without falling over, and through trial and error, and
maximizing these rewards, the robot eventually learns to walk.
In this context, we have a single agent seeking to accomplish a goal through maximizing
total rewards.
MARL

Multi-agent reinforcement learning studies how multiple agents interact in a common


environment.
That is, when these agents interact with the environment and one another, can we observe
them collaborate, coordinate, compete, or collectively learn to accomplish a particular task.
In general it’s the same as single agent reinforcement learning, where each agent is trying to
learn its own policy to optimize its own reward.
Using a central policy for all agents is possible, but multiple agents would have to
communicate with a central server to compute their actions (which is problematic in most
real world scenarios), so in practice decentralized multi-agent reinforcement learning is
used.

MARL
Traditional (Single-Agent) RL

Multiagent RL
Motivations: Research in Multiagent RL

Motivations: Research in Multiagent RL


Motivations: Research in Multiagent RL

Motivations: Research in Multiagent RL


Dimensions of MARL

Centralized:

●One brain / algorithm deployed across many agents

Decentralized:

●All agents learn individually


●Communication limitations defined by environment

Decentralized systems
Decentralized systems
• In decentralized learning, each agent is trained independently from the others.
• The benefit is that since no information is shared between agents, these
agents can be designed and trained like we train single agents.
• The idea here is that our training agent will consider other agents as part of the
environment dynamics. Not as agents.
• However, the big drawback of this technique is that it will make the environment
non-stationary, since the underlying Markov decision process changes over time
as other agents are also interacting in the environment. And this is problematic
for many reinforcement learning algorithms that can't reach a global optimum
with a non-stationary environment.

Centralized systems
Centralized systems

• In this architecture, we have a high-level process that collects agents’


experiences: the experience buffer. And we’ll use these experiences to
learn a common policy.
• We use that collective experience to train a policy that will move all three
robots in the most beneficial way as a whole. So each robot is learning
from their common experience. We now have a stationary environment
since all the agents are treated as a larger entity, and they know the
change of other agents’ policies (since it’s the same as theirs).

Dimensions of MARL

Prescriptive:

●Suggests how agents should behave

Descriptive:

●Forecast how agent will behave


Dimensions of MARL
Cooperative: Agents cooperate to achieve a goal
- Shared team reward
- For instance, in a warehouse, robots must collaborate to load and unload the
packages efficiently (as fast as possible).
Competitive: Agents compete against each other
- Zero-sum games
- Individual opposing rewards
- For example, in a game of tennis, each agent wants to beat the other agent.
Neither: Agents maximize their utility which may require cooperating and/or
competing
- General-sum games
- like in our SoccerTwos environment, two agents are part of a team (blue or purple):
they need to cooperate with each other and beat the opponent team.

Dimensions of MARL

Numbers of agents
- One (single-agent)
- Two (very common)
- Finite
- Infinite
Foundations of MARL

Benefits of Multi-agent Learning systems


Challenges of Multi-agent learning systems

Challenges of Multi-agent learning systems


Theoretical framework of MARL

- Markov/Stochastic games - perfect information


- Extensive Form games - imperfect information

Markov game Extensive-form game

Markov Game
MARL Formulation

Nash Q-learning
Nash Q-Learning

Multi-agent Deep Q network - Eg: Pursuit


Evasion
Problem representation

Channel Settings
Multi-agent centralized training

Multi-agent centralized training


Dealing with agent ambiguity

MADQN Architecture - Residual network type


MADQN Architecture

Proximal Policy Optimization (PPO)

The algorithm, introduced by OpenAI in 2017, seems to strike the right balance
between performance and comprehension.
PPO aims to strike a balance between important factors like ease of implementation,
ease of tuning, sample complexity, sample efficiency, and trying to compute an update
at each step that minimizes the cost function while ensuring the deviation from the
previous policy is relatively small.
PPO is in fact, a policy gradient method that learns from online data as well. It merely
ensures that the updated policy isn’t too much different from the old policy to ensure
low variance in training.
PPO - Surrogate Objective

- surrogate objective - avoids performance collapse by


guaranteeing monotonic policy improvement.

Modifying the objective

- relative policy performance identity

- The relative policy performance identity J(𝝅’) - J(𝝅) serves as a metric


to measure policy improvements. If the difference is positive, the newer
policy 𝝅’ is better than 𝝅.
- During a policy iteration, we should ideally choose a new policy 𝝅’ such
that this difference is maximized. Therefore, maximizing objective J(𝝅’)
is equivalent to maximizing this identity, and they can both be done by
gradient ascent.
Surrogate objective
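The clipped surrogate objective (Schulman et al., 2017) makes this concrete. With the probability ratio r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t) and advantage estimate Â_t:

L^CLIP(θ) = Ê_t [ min( r_t(θ)·Â_t , clip(r_t(θ), 1-ε, 1+ε)·Â_t ) ]

Clipping the ratio to [1-ε, 1+ε] removes the incentive to move the new policy far from the old one in a single update, which is what keeps training stable.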

PPO Algorithm

- Page 174 - Foundations of Deep Reinforcement Learning: Theory


and Practice in Python (Addison-Wesley Data & Analytics Series) 1st
Edition by Laura Graesser and Wah Loon Keng
Required Readings and references

1. Foundations of Deep Reinforcement Learning: Theory and Practice


in Python (Addison-Wesley Data & Analytics Series) 1st Edition by
Laura Graesser and Wah Loon Keng

42

Thank you

43
