
Deep Reinforcement Learning

2022-23 Second Semester, [Link] (AIML)

Session #1:
Introduction to the Course

Instructors :
1. Prof. S. P. Vimal (vimalsp@[Link]),
2. Prof. Sangeetha Viswanathan ([Link]@[Link])
1

What is Reinforcement Learning ?

- Reward based learning / Feedback based learning


- Not a type of neural network, nor an alternative to one; rather, it is an approach
to learning
- Example applications: autonomous driving, gaming

Why Reinforcement Learning ?


- A goal-oriented learning based on interaction with environment

2
Course Objectives

Course Objectives:
1. Understand
a. the conceptual, mathematical foundations of deep reinforcement learning
b. various classic & state of the art Deep Reinforcement Learning algorithms
2. Implement and Evaluate the deep reinforcement learning solutions to various
problems like planning, control and decision making in various domains
3. Provide conceptual, mathematical and practical exposure on DRL
a. to understand the recent developments in deep reinforcement learning and
b. to enable modelling new problems as DRL problems.

Learning Outcomes

1. Understand the fundamental concepts of reinforcement learning (RL),


algorithms and apply them for solving problems including control, decision-
making, and planning.
2. Implement DRL algorithms, handle challenges in training due to stability and
convergence
3. Evaluate the performance of DRL algorithms, including metrics such as sample
efficiency, robustness and generalization.
4. Understand the challenges and opportunities of applying DRL to real-world
problems & model real life problems

4
Course Operation

• Instructors
Prof. [Link], Prof. Chandra Sekar
Prof. Bharatesh, Prof. SK. Karthika

• Textbooks
1. Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G.
Barto, Second Ed. , MIT Press
2. Foundations of Deep Reinforcement Learning: Theory and Practice in Python
(Addison-Wesley Data & Analytics Series) 1st Edition by Laura Graesser and
Wah Loon Keng
5

Course Operation
• Evaluation
Two quizzes worth 5% each; the better of the two will be taken for 5% (in final grading)
Whatever the quizzes are scored out of, the score will be scaled to 5%.
No MAKEUP, for any reason. Ensure you attend at least one of the quizzes
Two Assignments - TensorFlow / PyTorch / OpenAI Gym Toolkit → 25%
Assignment 1: Partly numerical + implementation of classic algorithms – 10%
Assignment 2: Deep Learning based RL
Mid Term Exam - 30% [Only to be written in A4 sheets, scanned and uploaded]
Comprehensive Exam - 40% [Only to be written in A4 sheets, scanned and uploaded]
• Webinars/Tutorials
4 tutorials : 2 before mid-sem & 2 after mid-sem
• Teaching Assistants – Will be introduced in the upcoming classes.

6
Course Operation
• How to reach us? (for any question on lab aspects, availability of slides on the portal,
quiz availability, assignment operations)
Prof. [Link] - vimalsp@[Link]
Prof. SK. Karthika – Karthika@[Link]
• Plagiarism
All submissions for graded components must be the result of your original effort. It is strictly
prohibited to copy and paste verbatim from any sources, whether online or from your peers.
The use of unauthorized sources or materials, as well as collusion or unauthorized
collaboration to gain an unfair advantage, is also strictly prohibited. Please note that we will
not distinguish between the person sharing their resources and the one receiving them for
plagiarism, and the consequences will apply to both parties equally.
Suspicious circumstances, such as identical verbatim answers or a significant, unreasonable
overlap across a set of submissions, will be investigated, and severe penalties will be
imposed on all those found guilty of plagiarism.
7

Reinforcement Learning

Reinforcement learning (RL) is based on rewarding desired behaviors or punishing


undesired ones. Instead of one input producing one output, the algorithm produces
a variety of outputs and is trained to select the right one based on certain variables
– Gartner

When to use RL?


RL can be used in large environments in the following situations:

1. A model of the environment is known, but an analytic solution is not available;
2. Only a simulation model of the environment is given (the subject of simulation-based
optimization);
3. The only way to collect information about the environment is to interact with it.

8
(Deep) Reinforcement Learning

Types of Learning

Criteria | Supervised ML | Unsupervised ML | Reinforcement ML
Definition | Learns by using labelled data | Trained using unlabelled data without any guidance | Works by interacting with the environment
Type of data | Labelled data | Unlabelled data | No predefined data
Type of problems | Regression and classification | Association and clustering | Exploitation or exploration
Supervision | Extra supervision | No supervision | No supervision
Algorithms | Linear Regression, Logistic Regression, SVM, KNN etc. | K-Means, C-Means, Apriori | Q-Learning, SARSA
Aim | Calculate outcomes | Discover underlying patterns | Learn a series of actions
Application | Risk Evaluation, Forecast Sales | Recommendation System, Anomaly Detection | Self Driving Cars, Gaming, Healthcare
10
Characteristics of RL

•No supervision, only a real value or reward signal


•Decision making is sequential
•Time plays a major role in reinforcement problems
•Feedback isn’t prompt but delayed
•The data it subsequently receives is determined by the agent’s actions

11

Types of Reinforcement learning

•Positive Reinforcement - an event that occurs because of a particular behavior and
increases the strength and frequency of that behavior. In other words, it has
a positive effect on behavior.
•Maximizes performance
•Sustains change for a long period of time
•Too much reinforcement can lead to an overload of states, which can
diminish the results

12
Types of Reinforcement Learning
Negative Reinforcement - strengthening of behavior because a negative
condition is stopped or avoided. Advantages of negative reinforcement
learning:
•Increases behavior
•Helps enforce a minimum standard of performance
•It only provides enough to meet the minimum behavior

13

Elements of Reinforcement Learning

14
Elements of Reinforcement Learning

Beyond the agent and the environment, one can identify four main sub-elements
of a reinforcement learning system: a policy, a reward signal, a value
function, and, optionally, a model of the environment.
15

Elements of Reinforcement Learning

•Agent
- An entity that tries to learn the best way to perform a specific task.
- In our example, the child is the agent who learns to ride a bicycle.

•Action (A) -
- What the agent does at each time step.
- In the example of a child learning to walk, the action would be “walking”.
- A is the set of all possible moves.
- In video games, the list might include running right or left, jumping high or
low, crouching or standing still.

16
Elements of Reinforcement Learning

•State (S)
- Current situation of the agent.
- After performing an action, the agent can move to different states.
- In the example of a child learning to walk, the child can take the action of
taking a step and move to the next state (position).
•Rewards (R)
- Feedback that is given to the agent based on the action of the agent.
- If the action of the agent is good and can lead to winning or a positive side
then a positive reward is given and vice versa.

17

Elements of Reinforcement Learning

•Environment
- Outside world of an agent or physical world in which the agent operates.
•Discount factor
- The discount factor is multiplied by future rewards as discovered by the
agent in order to dampen these rewards’ effect on the agent’s choice of
action.
- Why? It is designed to make future rewards worth less than immediate
rewards.

18
Reinforcement Learning - Definition

Formal Definition - Reinforcement learning (RL) is an area of


machine learning concerned with how intelligent agents ought to
take actions in an environment in order to maximize the notion of
cumulative reward.

19

Elements of Reinforcement Learning


•Goal of RL - maximize the total amount of rewards or cumulative rewards that
are received by taking actions in given states.
•Notations –
a set of states as S,
a set of actions as A,
a set of rewards as R.
At each time step t = 0, 1, 2, …, some representation of the environment’s state St
∈ S is received by the agent. According to this state, the agent selects an action At
∈ A which gives us the state-action pair (St , At). In the next time step t+1, the
transition of the environment happens and the new state St+1 ∈ S is achieved. At
this time step t+1, a reward Rt+1 ∈ R is received by the agent for the action At taken
from state St. 20
Elements of Reinforcement Learning

•Maximize cumulative rewards, Expected Return Gt

•Discount factor γ is introduced here which forces the agent to focus on


immediate rewards instead of future rewards. The value of γ remains between
0 and 1.
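The expected return referred to above (a standard reconstruction of the missing formula, following Sutton & Barto):

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$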

21

Elements of Reinforcement Learning

•Policy (π)

- A policy in RL decides which action the agent will take in the current state.
- It tells the probability that an agent will select a specific action from a specific state.
- Policy is a function that maps a given state to probabilities of selecting each
possible action from the given state.

•If an agent follows policy π at time t, then π(a|s) is the probability that At = a
given that St = s. In other words, under policy π, the probability that the agent
takes action a in state s at time t is π(a|s).

22
Elements of Reinforcement Learning

•Value Functions
- A simple measure of how good it is for an agent to be in a given state, or
how good it is for the agent to perform a given action in a given state.
•Two types
- State- Value function
- Action-Value function

23

Elements of Reinforcement Learning

•State-value function
- The state-value function for policy π denoted as vπ determines the goodness
of any given state for an agent who is following policy π.
- This function gives us the value which is the expected return starting from
state s at time step t and following policy π afterward.
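In symbols (a reconstruction of the standard definition, in Sutton & Barto's notation):

$$v_\pi(s) = \mathbb{E}_\pi\left[\, G_t \mid S_t = s \,\right]$$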

24
Elements of Reinforcement Learning

•Action value function


- Determines the goodness of the action taken by the agent from a given state
for policy π.
- This function gives the value which is the expected return starting from
state s at time step t, with action a, and following policy π afterward.
- The output of this function is also called the Q-value, where Q stands for
Quality. Note that in the state-value function, we did not consider the action
taken by the agent.
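In symbols (again following the standard definition):

$$q_\pi(s, a) = \mathbb{E}_\pi\left[\, G_t \mid S_t = s, A_t = a \,\right]$$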

25

Elements of Reinforcement Learning

•Model of the environment


- Something that mimics the behavior of the environment or, more generally, allows
inferences to be made about how the environment will behave.
- For example, given a state and action, the model might predict the resultant
next state and next reward.
- Models are used for planning

26
(Deep) Reinforcement Learning

27
From OpenAI

Advantages of Reinforcement Learning

- Solve very complex problems that cannot be solved by conventional


techniques
- Achieve long-term results
- Model can correct the errors that occurred during the training process.
- In the absence of a training dataset, it is bound to learn from its experience
- Can be useful when the only way to collect information about the
environment is to interact with it
- Reinforcement learning algorithms maintain a balance between
exploration and exploitation. Exploration is the process of trying different
things to see if they are better than what has been tried before. Exploitation
is the process of trying the things that have worked best in the past. Other
learning algorithms do not perform this balance

28
An example scenario - Tic-Tac-Toe

Two players take turns playing on a three-by-three board. One player plays Xs and
the other Os until one player wins by placing three marks in a row, horizontally,
vertically, or diagonally
Assumptions
- playing against an imperfect player, one whose play is sometimes incorrect and
allows you to win
Aim
How might we construct a player that will find the imperfections in its
opponent’s play and learn to maximize its chances of winning?

29

Reinforcement Learning for Tic-Tac-Toe

•We set up a table of numbers, one for each possible state of the game. Each
number will be the latest estimate of the probability of our winning from that
state.
•We treat this estimate as the state’s value, and the whole table is the learned
value function.
•State A has higher value than state B, or is considered better than state B, if the
current estimate of the probability of our winning from A is higher than it is from B.

30
Reinforcement Learning for Tic-Tac-Toe

● Assuming we always play Xs: a state with three Xs in a row has value (winning probability) 1,
a state with three Os in a row has value 0, and all other states are initialised to 0.5.
● Play many games against the opponent. To select our moves we examine the
states that would result from each of our possible moves and look up their current
values in the table.
● Most of the time we move greedily, selecting the move that leads to the state with
greatest value, i.e, with the highest estimated probability of winning.
● Occasionally, however, we select randomly from among the other moves instead.
These are called exploratory moves because they cause us to experience states
that we might otherwise never see.

31

Reinforcement Learning for Tic-Tac-Toe


•The solid lines represent the moves
taken during a game; the dashed lines
represent moves that we (our
reinforcement learning player)
considered but did not make.
•Our second move was an exploratory
move, meaning that it was taken even
though another sibling move, the one
leading to e∗, was ranked higher.

32
Reinforcement Learning for Tic-Tac-Toe

● Once a game is started, our agent computes all possible actions it can take in
the current state and the new states which would result from each action.
● The values of these states are collected from a state_value vector, which
contains values for all possible states in the game.
● The agent can then choose the action which leads to the state with the highest
value (exploitation), or choose a random action (exploration), depending on the
value of epsilon.

33

Thank you

34
Deep Reinforcement Learning
2022-23 Second Semester, [Link] (AIML)

Session #2-3:
Multi-armed Bandits

Instructors :
1. Prof. S. P. Vimal (vimalsp@[Link]),
2. Prof. Sangeetha Viswanathan ([Link]@[Link])
1

Agenda for the class

• Recap
• k-armed Bandit Problem & its significance
• Action-Value Methods
Sample Average Method & Incremental Implementation
• Non-stationary Problem
• Initial Values & Action Selection
• Gradient Bandit Algorithms [ Class #3 ]
• Associative Search [ Class #3 ]

Acknowledgements: Some of the slides were adopted with permission from the course CSCE-689 (Texas A&M University) by Prof. Guni Sharon 2
An example scenario - Tic-Tac-Toe

Two players take turns playing on a three-by-three board. One player plays Xs and
the other Os until one player wins by placing three marks in a row, horizontally,
vertically, or diagonally
Assumptions
- playing against an imperfect player, one whose play is sometimes incorrect and
allows you to win
Aim
How might we construct a player that will find the imperfections in its
opponent’s play and learn to maximize its chances of winning?

Reinforcement Learning for Tic-Tac-Toe

•We set up a table of numbers, one for each possible state of the game. Each
number will be the latest estimate of the probability of our winning from that
state.
•We treat this estimate as the state’s value, and the whole table is the learned
value function.
•State A has higher value than state B, or is considered better than state B, if the
current estimate of the probability of our winning from A is higher than it is from B.

4
Reinforcement Learning for Tic-Tac-Toe

● Assuming we always play Xs: a state with three Xs in a row has value (winning probability) 1,
a state with three Os in a row has value 0, and all other states are initialised to 0.5.
● Play many games against the opponent. To select our moves we examine the
states that would result from each of our possible moves and look up their current
values in the table.
● Most of the time we move greedily, selecting the move that leads to the state with
greatest value, i.e, with the highest estimated probability of winning.
● Occasionally, however, we select randomly from among the other moves instead.
These are called exploratory moves because they cause us to experience states
that we might otherwise never see.

Reinforcement Learning for Tic-Tac-Toe


•The solid lines represent the moves
taken during a game; the dashed lines
represent moves that we (our
reinforcement learning player)
considered but did not make.
•Our second move was an exploratory
move, meaning that it was taken even
though another sibling move, the one
leading to e∗, was ranked higher.

6
Reinforcement Learning for Tic-Tac-Toe

● Once a game is started, our agent computes all possible actions it can take in
the current state and the new states which would result from each action.
● The values of these states are collected from a state_value vector, which
contains values for all possible states in the game.
● The agent can then choose the action which leads to the state with the highest
value (exploitation), or choose a random action (exploration), depending on the
value of epsilon.

Tic-Tac-Toe

8
Tic-Tac-Toe
States and Initial Values
[Table of board states and their initial values: most states start at 0.5, states with three Xs in a row at 1.0, states with three Os in a row at 0.]
Set up a table of states and initial values.
Learning task: play many games against the opponent and learn the values.
9

Tic-Tac-Toe (prev. class)

States and Initial Values
[Table of board states and their values, as before.]
St - state before the greedy move
St+1 - state after the greedy move
10
Tic-Tac-Toe (prev. class)

Temporal Difference Learning Rule:
V(St) ← V(St) + 𝜶 [ V(St+1) − V(St) ]   (𝜶 - step-size parameter)

Tic-Tac-Toe (prev. class)

Questions:
(1) What happens if 𝜶 is gradually made to 0 over many
games with the opponent?
(2) What happens if 𝜶 is gradually reduced over many
games, but never made 0?
(3) What happens if 𝜶 is kept constant throughout its
life time?

Temporal Difference Learning Rule (as above):
V(St) ← V(St) + 𝜶 [ V(St+1) − V(St) ]   (𝜶 - step-size parameter)
Tic-Tac-Toe (prev. class)

Key Takeaways:
(1) Learning while interacting with the
environment (opponent).
(2) We have a clear goal
(3) Our policy is to make moves that maximize our
chances of reaching the goal
○ Use the values of states most of the time
(exploitation) and explore the rest of the time.

Temporal Difference Learning Rule (as above):
V(St) ← V(St) + 𝜶 [ V(St+1) − V(St) ]   (𝜶 - step-size parameter)

Tic-Tac-Toe (prev. class)

Reading Assigned:
Identify how this reinforcement learning solution is
different from solutions using minimax algorithm
and genetic algorithms.
Post your answers in the discussion forum;

Temporal Difference Learning Rule (as above):
V(St) ← V(St) + 𝜶 [ V(St+1) − V(St) ]   (𝜶 - step-size parameter)
K-armed Bandit Problem
Problem
• You are faced repeatedly with a choice among k different options, or actions
• After each choice of actions you receive a numerical reward
Reward is chosen from a stationary probability distribution that depends on the selected
action
• Objective : to maximize the expected total reward over some time period

15

K-armed Bandit Problem


Problem
• You are faced repeatedly with a choice among k different options, or actions
• After each choice of actions you receive a numerical reward
Reward is chosen from a stationary probability distribution that depends on the selected
action
• Objective : to maximize the expected total reward over some time period

16
K-armed Bandit Problem
Problem
• You are faced repeatedly with a choice among k different options, or actions
• After each choice of actions you receive a numerical reward
Reward is chosen from a stationary probability distribution that depends on the selected
action
• Objective : to maximize the expected total reward over some time period

Strategy:
● Identify the best lever(s)
● Keep pulling the identified ones
Questions:
● How do we define the best ones?
● What are the best levers?
17

K-armed Bandit Problem


Problem
• You are faced repeatedly with a choice among k different options, or actions
• After each choice of actions you receive a numerical reward
Reward is chosen from a stationary probability distribution that depends on the selected
action
• Objective : to maximize the expected total reward over some time period

Strategy:
● Identify the best lever(s)
● Keep pulling the identified ones
Questions:
● How do we define the best ones?
● What are the best levers?
18
K-armed Bandit Problem
Problem
• You are faced repeatedly with a choice among k different options, or actions
• After each choice of actions you receive a numerical reward
Reward is chosen from a stationary probability distribution that depends on the selected
action
• Objective : to maximize the expected total reward over some time period

● Expected Mean Reward for each


action selected
→ call it Value of the action

19

K-armed Bandit Problem

• At - action selected on time step t


• Qt (a) - estimated value of action a at time step t
• q*(a) - true (expected) value of an arbitrary action a

Note: If you knew the value of each action, then it would be trivial to solve the k -armed bandit
problem: you would always select the action with highest value :-)
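In symbols, the true value of an action is its expected reward (standard notation, not shown on the slide):

$$q_*(a) = \mathbb{E}\left[\, R_t \mid A_t = a \,\right], \qquad Q_t(a) \approx q_*(a)$$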
20
K-armed Bandit Problem

21

K-armed Bandit Problem

Keep pulling the levers; update the estimate of action values;


22
K-armed Bandit Problem

23

K-armed Bandit Problem


1. How to maintain the estimate of expected rewards for each action?
Average the rewards actually received !!!

2. How to use the estimate in selecting the right action?


Greedy Action Selection
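Written out (a standard reconstruction): the sample-average estimate for question 1 and the greedy selection for question 2 are

$$Q_t(a) = \frac{\sum_{i=1}^{t-1} R_i \cdot \mathbb{1}_{A_i = a}}{\sum_{i=1}^{t-1} \mathbb{1}_{A_i = a}}, \qquad A_t = \arg\max_a Q_t(a)$$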
24
K-armed Bandit Problem
2. How to use the estimate in selecting the right action?
Greedy Action Selection

Actions which are inferior by the value estimate up to time t could indeed be better than
the greedy action at t !!!

3. Exploration vs. Exploitation?


ε-Greedy Action Selection / near-greedy action selection
Behave greedily most of the time; Once in a while, with small probability 𝛆 select randomly
from among all the actions with equal probability, independently of the action-value estimates.

25

K-armed Bandit Problem

26
K-armed Bandit Problem
Greedy Action

27

K-armed Bandit Problem


Action to Explore

28
K-armed Bandit Problem
ε-Greedy Action Selection / near-greedy action selection

import numpy as np

epsilon = 0.05  # small value to control exploration

def get_action(Q, actions):
    # Q: mapping from action -> current value estimate (assumed structure)
    # exploit: with probability (1 - epsilon) take the greedy action
    if np.random.random() > epsilon:
        return max(actions, key=lambda a: Q[a])
    # explore: otherwise pick a uniformly random action
    return np.random.choice(actions)

● In the limit as the number of steps increases, every action will be sampled by ε-greedy action
selection an infinite number of times. This ensures that all the Qt (a) converge to q* (a).
● Easy to implement / optimize for epsilon / yields good results

29

Ex-1: In ε-greedy action selection, for the case of two actions and ε
= 0.5, what is the probability that the greedy action is selected?

30
Ex-1: In ε-greedy action selection, for the case of two actions and ε
= 0.5, what is the probability that the greedy action is selected?

p (greedy action)
= p (greedy action AND greedy selection ) + p (greedy action AND random selection )
= p (greedy action | greedy selection ) p ( greedy selection )
+ p (greedy action | random selection ) p (random selection )
= p (greedy action | greedy selection ) (1-𝛆) + p (greedy action | random selection ) (𝛆)
= p (greedy action | greedy selection ) (0.5) + p (greedy action | random selection ) (0.5)
= (1) (0.5) + (0.5) (0.5)
= 0.5 + 0.25
= 0.75

31

10-armed Testbed
Example:
• A set of 2000 randomly generated k -armed
bandit problems with k = 10
• Action values were selected according to a
normal (Gaussian) distribution with mean 0
and variance 1.
• While selecting action At at time step t, the
actual reward, Rt , was selected from a
normal distribution with mean q*(At ) and
variance 1
• One Run : Apply a method for 1000 time
steps to one of the bandit problems
• Perform 2000 runs, each run with a different bandit problem, to get the
algorithm's average behavior
[Figure: An example bandit problem from the 10-armed testbed]

32
Average performance of 𝛆-greedy action-value methods on
the 10-armed testbed

33

Average performance of 𝛆-greedy action-value methods on


the 10-armed testbed

34
Discussion on Exploration vs. Exploitation

1) What if the reward variance is


a. larger, say 10 instead of 1?
b. zero ? [ deterministic ]
2) What if the bandit task is non-stationary? [ that is, the true values of the
actions changed over time]

35

Ex-2:

Consider a k -armed bandit problem with k = 4 actions, denoted 1, 2, 3, and 4.

Consider applying to this problem a bandit algorithm using 𝛆-greedy action selection, sample-
average action-value estimates, and initial estimates of Q1 (a) = 0, for all a.

Suppose the initial sequence of actions and rewards is A1 = 1, R1 = 1, A2 = 2, R2 = 1, A3 = 2, R3 = 2,


A4 = 2, R4 = 2, A5 = 3, R5 = 0.

On some of these time steps the 𝛆 case may have occurred, causing an action to be selected at
random.

On which time steps did this definitely occur? On which time steps could this possibly have
occurred?

36
37

38
39

Incremental Implementation

• Efficient approach to
compute the estimate of
action-value;

• Given Qn and the nth


reward, Rn , the new
average of all n rewards can
be computed as follows
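The update referred to above (a reconstruction of the standard incremental formula):

$$Q_{n+1} = Q_n + \frac{1}{n}\left[\, R_n - Q_n \,\right]$$

i.e. NewEstimate ← OldEstimate + StepSize [ Target − OldEstimate ].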

40
Incremental Implementation

Note:
● StepSize decreases with each update
● We use 𝛂 or 𝛂t(a) to denote step size (constant /
varies with each step)

Discussion:
Const vs. Variable step size?

41

Bandit Algorithm with Incremental Update / 𝛆-greedy selection
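A minimal Python sketch of the 𝛆-greedy bandit loop with the incremental sample-average update (a reconstruction of the standard algorithm, not code from the slides; bandit(a) is an assumed function returning a sampled reward for arm a):

import numpy as np

def run_bandit(bandit, k=10, epsilon=0.1, steps=1000):
    Q = np.zeros(k)   # action-value estimates
    N = np.zeros(k)   # number of times each action has been selected
    for _ in range(steps):
        if np.random.random() < epsilon:
            a = np.random.randint(k)        # explore: random action
        else:
            a = int(np.argmax(Q))           # exploit: greedy action
        r = bandit(a)                       # observe reward for the chosen arm
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]           # incremental sample-average update
    return Q, N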

42
Non-stationary Problem

• Most RL problems are non-stationary !


• Give more weight to recent rewards than to long-past rewards !!!

43

Non-stationary Problem

• Most RL problems are non-stationary !


• Give more weight to recent rewards than to long-past rewards !!!

Exponential recency-weighted average
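With a constant step size 𝛂, the update and the weighting it induces are (standard reconstruction):

$$Q_{n+1} = Q_n + \alpha\left[\, R_n - Q_n \,\right] = (1-\alpha)^n Q_1 + \sum_{i=1}^{n} \alpha (1-\alpha)^{n-i} R_i$$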

44
45

46
Optimistic Initial Values

• All the above discussed methods are biased by their initial estimates
• For sample average method the bias disappears once all actions have been selected at
least once
• For methods with constant α, the bias is permanent, though decreasing over time
• Initial action values can also be used as a simple way of encouraging exploration.
• In 10 armed testbed, set initial estimate to +5 rather than 0.
This can encourage action-value methods to explore.
Whichever actions are initially selected, the reward is less than the starting estimates;
the learner switches to other actions, being disappointed with the rewards it is receiving.
The result is that all actions are tried several times before the value estimates converge.
47

Optimistic Initial Values

Caution:
Optimistic Initial Values can
only be considered as a simple
trick that can be quite effective
on stationary problems, but it is
far from being a generally
useful approach to encouraging
exploration.

Question:
Explain how in the non-
stationary scenario the
optimistic initial values will fail
(to explore adequately).

The effect of optimistic initial action-value estimates on the 10-armed testbed.


Both methods used a constant step-size parameter, 𝛂 = 0.1
48
Upper-Confidence-Bound Action Selection

• 𝛆-greedy action selection forces the non-greedy actions to be tried, but indiscriminately,
with no preference for those that are nearly greedy or particularly uncertain
• It would be better to select among the non-greedy actions
according to their potential for actually being optimal
Take into account both how close their estimates are to being maximal and the uncertainties
in those estimates.

49

Upper-Confidence-Bound Action Selection


● Each time a is selected the uncertainty is presumably reduced
● Each time an action other than a is selected, t increases but Nt(a) does not; because t appears in the numerator, the
uncertainty estimate increases.
● Actions with lower value estimates, or that have already been selected frequently, will be selected with decreasing
frequency over time
At = argmax_a [ Qt(a) + c √( ln t / Nt(a) ) ]
(c - confidence level; Qt(a) - action value at time t for a; the square-root term - a measure of uncertainty)

50
51

52
53

Upper-Confidence-Bound Action Selection

UCB often performs well, as shown here, but is more difficult than 𝛆-greedy to extend beyond bandits to the more
general reinforcement learning settings
Policy-based algorithms

55

Softmax function
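A reconstruction of the softmax (Boltzmann) distribution over action preferences Ht(a), as used by the gradient bandit algorithm (standard notation, not shown on the slide):

$$\pi_t(a) = \Pr\{A_t = a\} = \frac{e^{H_t(a)}}{\sum_{b=1}^{k} e^{H_t(b)}}$$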

56
Softmax function

57

Gradient ascent

• ?

58
Update Rule

On each step, after selecting action At and receiving the reward Rt,


Update the action preferences :
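A reconstruction of the standard gradient-bandit preference update (Sutton & Barto notation; R̄t denotes the average of rewards up to time t):

$$H_{t+1}(A_t) = H_t(A_t) + \alpha\,(R_t - \bar{R}_t)\,(1 - \pi_t(A_t))$$
$$H_{t+1}(a) = H_t(a) - \alpha\,(R_t - \bar{R}_t)\,\pi_t(a) \quad \text{for all } a \neq A_t$$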

59

60
What did we learn?

• Problem: choose the action that results in highest expected reward


• Assumptions: 1. actions’ expected reward is unknown, 2. we are
confronted with the same problem over and over, 3. we are able to
observe an action’s outcome once chosen
• Approach: learn the actions’ expected reward through exploration
(value based) or learn a policy directly (policy based), exploit learnt
knowledge to choose best action
• Methods: 1. greedy + initializing estimates optimistically, 2. epsilon-greedy,
3. Upper-Confidence-Bounds, 4. gradient ascent + softmax

61

A different scenario

● Associative vs. Non-associative tasks ?


● Policy: A mapping from situations to the
actions that are best in those situations
● (discuss) How do we extend the solution for
non-associative task to an associative task?
○ Approach: Extend the solutions to non-stationary task to
non-associative tasks
■ Works, if the true action values changes slowly
○ What if the context switching between the situations are
made explicit?
■ How?
■ Need Special approaches !!!

62
Required Readings

1. Chapter-2 of Introduction to Reinforcement Learning,2nd Ed., Sutton &


Barto
2. A Survey on Practical Applications of Multi-Armed and Contextual Bandits,
Djallel Bouneffouf , Irina Rish [[Link]

63

Thank you !

64
Deep Reinforcement Learning
2023-24 Second Semester, [Link] (AIML)

Session #4:
Markov Decision Processes

DRL Course Instructors

Policy-based algorithms

2
Softmax function

Softmax function

4
Gradient ascent

• ?

Update Rule

On each step, after selecting action At and receiving the reward Rt,


Update the action preferences :

6
7

What did we learn?

• Problem: choose the action that results in highest expected reward


• Assumptions: 1. actions’ expected reward is unknown, 2. we are
confronted with the same problem over and over, 3. we are able to
observe an action’s outcome once chosen
• Approach: learn the actions’ expected reward through exploration
(value based) or learn a policy directly (policy based), exploit learnt
knowledge to choose best action
• Methods: 1. greedy + initializing estimates optimistically, 2. epsilon-greedy,
3. Upper-Confidence-Bounds, 4. gradient ascent + softmax

8
A different scenario

● Associative vs. Non-associative tasks ?


● Policy: A mapping from situations to the
actions that are best in those situations
● (discuss) How do we extend the solution for
non-associative task to an associative task?
○ Approach: Extend the solutions to non-stationary task to
non-associative tasks
■ Works, if the true action values changes slowly
○ What if the context switching between the situations are
made explicit?
■ How?
■ Need Special approaches !!!

Agenda for the class

• Agent-Environment Interface (Sequential Decision Problem)


• MDP
Defining MDP,
Rewards,
Returns, Policy & Value Function,
Optimal Policy and Value Functions
• Approaches to solve MDP

Announcement !!!
We have our Teaching Assistants now !!! You will see their names in the course home
page.

Acknowledgements: Some of the slides were adopted with permission from the course CSCE-689 (Texas A&M University) by Prof. Guni Sharon 10
Agent-Environment Interface

● Agent - Learner & the decision


maker
● Environment - Everything
outside the agent
● Interaction:
○ Agent performs an action
○ Environment responds by
■ presenting a new situation (change in state)
■ presenting a numerical reward
Note:
● Interaction occurs in discrete time steps
● Objective (of the interaction):
○ Maximize the return (cumulative rewards) over time

11

Grid World Example

● A maze-like problem
○ The agent lives in a grid
○ Walls block the agent’s path
● Noisy movement: actions do not always go as planned
○ 80% of the time, the action North takes the agent North
(if there is no wall there)
○ 10% of the time, North takes the agent West; 10% East
○ If there is a wall in the direction the agent would have
been taken, the agent stays put
● The agent receives rewards each time step
○ -0.1 per step (battery loss)
○ +1 if arriving at (4,3) ; -1 for arriving at (4,2) ;-1 for
arriving at (2,2)
● Goal: maximize accumulated rewards

12
Markov Decision Processes

● An MDP is defined by
○ A set of states
○ A set of actions
○ State-transition probabilities P(s' | s, a)
■ Probability of arriving at s' after performing a at s
■ Also called the model dynamics
○ A reward function R(s, a, s')
■ The utility gained from arriving at s' after performing a at s
■ Sometimes just R(s, a) or even R(s)
○ A start state
○ Maybe a terminal state

13

Markov Decision Processes


Model Dynamics

State-transition probabilities

Expected rewards for state–action–next-state triples
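The standard formulas behind these labels (a reconstruction, following Sutton & Barto):

$$p(s', r \mid s, a) = \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\}$$
$$r(s, a, s') = \mathbb{E}\left[\, R_t \mid S_{t-1} = s, A_{t-1} = a, S_t = s' \,\right]$$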

14
Markov Decision Processes - Discussion

● MDP framework is abstract and flexible


○ Time steps need not refer to fixed intervals of real time
○ The actions can be
■ at low-level controls or high-level decisions
■ totally mental or computational
○ States can take a wide variety of forms
■ Determined by low-level sensations or high-level and abstract (ex.
symbolic descriptions of objects in a room)
● The agent–environment boundary represents the limit of the agent’s absolute
control, not of its knowledge.
○ The boundary can be located at different places for different purposes

15

Markov Decision Processes - Discussion

● MDP framework is a considerable abstraction of the problem of goal-directed


learning from interaction.
● It proposes that whatever the details of the sensory, memory, and control
apparatus, and whatever objective one is trying to achieve, any problem of
learning goal-directed behavior can be reduced to three signals passing back
and forth between an agent and its environment:
○ one signal to represent the choices made by the agent (the actions)
○ one signal to represent the basis on which the choices are made (the
states),
○ and one signal to define the agent’s goal (the rewards).

16
MDP Formalization : Video Games

● State:
○ raw pixels
● Actions:
○ game controls
● Reward:
○ change in score
● State-transition probabilities:
○ defined by stochasticity in game evolution

Ref: Playing Atari with deep reinforcement learning”, Mnih et al., 2013 17

MDP Formalization : Traffic Signal Control

● State:
○ Current signal assignment (green, yellow,
and red assignment for each phase)
○ For each lane: number of approaching
vehicles, accumulated waiting time,
number of stopped vehicles, and average
speed of approaching vehicles
● Actions:
○ signal assignment
● Reward:
○ Reduction in traffic delay
● State-transition probabilities:
○ defined by stochasticity in approaching
demand
Ref: “Learning an Interpretable Traffic Signal Control Policy”, Ault et al., 2020 18
MDP Formalization : Recycling Robot (Detailed Ex.)

● Robot has
○ sensors for detecting cans
○ arm and gripper that can pick the cans and place in an
onboard bin;
● Runs on a rechargeable battery
● Its control system has components for interpreting sensory
information, for navigating, and for controlling the arm and
gripper
● Task for the RL Agent: Make high-level decisions about how
to search for cans based on the current charge level of the
battery

19

MDP Formalization : Recycling Robot (Detailed Ex.)

● State:
○ Assume that only two charge levels can be distinguished
○ S = {high, low}
● Actions:
○ A(high) = {search, wait}
○ A(low) = {search, wait, recharge}
● Reward:
○ Zero most of the time, except when securing a can
○ Cans are secured by searching and waiting, but rsearch > rwait
● State-transition probabilities:
○ [Next Slide]

20
MDP Formalization : Recycling Robot (Detailed Ex.)

● State-transition probabilities (contd…):

21

MDP Formalization : Recycling Robot (Detailed Ex.)

● State-transition probabilities (contd…):

22
Note on Goals & Rewards

● Reward Hypothesis:
All of what we mean by goals and purposes can be well thought of as
the maximization of the expected value of the cumulative sum of a
received scalar signal (called reward).
● The rewards we set up truly indicate what we want accomplished,
○ not the place to impart prior knowledge on how we want it to do
● Ex: Chess Playing Agent
○ If the agent is rewarded for taking the opponent's pieces, the agent might fall
for the opponent's trap.
● Ex: Vacuum Cleaner Agent
○ If the agent is rewarded for each unit of dirt it sucks, it can repeatedly
deposit and suck the dirt for larger reward
23

Returns & Episodes

● Goal is to maximize the expected return


● Return (Gt) is defined as some specific function of the reward
sequence
● Episodic tasks vs. Continuing tasks
● When there is a notion of final time step, say T, return can be

○ Applicable when agent-environment interaction breaks into


episodes
○ Ex: Playing Game, Trips through maze etc. [ called episodic tasks]
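The return referred to above for a final time step T (a reconstruction of the missing formula):

$$G_t = R_{t+1} + R_{t+2} + \dots + R_T$$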

24
Returns & Episodes

● Generally T = ∞
○ What if the agent receives a reward
of +1 for each timestep?
○ Discounted Return:
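A reconstruction of the discounted return:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$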

Note:

○ Discount rate determines the present


value of future rewards
25

Returns & Episodes

● What if 𝛾 is 0?
● What if 𝛾 is 1?
● Computing discounted rewards incrementally

• Sum of an infinite number of terms, it is still finite if the reward is nonzero


and constant and if 𝛾 < 1.
• Ex: reward is +1 constant
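The two facts referred to above, written out (standard reconstruction): returns can be computed incrementally via $G_t = R_{t+1} + \gamma G_{t+1}$, and for a constant reward of +1 with $\gamma < 1$,

$$G_t = \sum_{k=0}^{\infty} \gamma^k = \frac{1}{1-\gamma}$$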

26
Returns & Episodes

➔ Objective: To apply forces to a cart


moving along a track so as to keep a
pole hinged to the cart from falling over
➔ Discuss:
➔ Consider the task as episodic, that is
try/maintain balance until failure.
What could be the reward function?
➔ Repeat prev. assuming task is
continuous.

27

Policy

● A mapping from states to


probabilities of selecting each
possible action.
○ 𝛑 (a|s) is the probability that At
= a if St = s
● The purpose of learning is to
improve the agent's policy with its
experience

28
Defining Value Functions

State-value function for policy 𝝿

Action-value function for policy 𝝿

29

30
31

Defining Value Functions

State Value function in terms of Action-value function for policy 𝝿

Action Value function in terms of State value function for policy 𝝿
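A reconstruction of the two relationships named above (standard Bellman-style expansions):

$$v_\pi(s) = \sum_{a} \pi(a \mid s)\, q_\pi(s, a)$$
$$q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a)\left[\, r + \gamma\, v_\pi(s') \,\right]$$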

32
May skip to the next slide !
Bellman Equation for V𝝅
● Dynamic programming equation associated with discrete-time optimization
problems
○ Expressing V𝝅 recursively, i.e. relating V𝝅(s) to V𝝅(s’) for all s’ ∈ succ(s)

33

Bellman Equation for V𝝅

The value of the start state must equal
(1) the (discounted) value of the expected next state, plus
(2) the reward expected along the way
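Written out (a reconstruction of the standard Bellman equation for $v_\pi$):

$$v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[\, r + \gamma\, v_\pi(s') \,\right]$$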

Backup Diagram
34
Understanding V𝝅(s) with Gridworld
Reward:
○ -1 if an action takes agent off the grid
○ Exceptional reward from A and B for all actions taking agent to A’ and B’ resp.
○ 0, everywhere else

[Left: exceptional reward dynamics. Right: state-value function for the equiprobable random policy with 𝛾 = 0.9]
35

Understanding V𝝅(s) with Gridworld

Verify V𝝅(s) using Bellman equation for this state


with 𝛾 = 0.9, and equiprobable random policy

36
Understanding V𝝅(s) with Gridworld

37

Ex-1
Recollect the reward function used for Gridworld as below:
○ -1 if an action takes agent off the grid
○ Exceptional reward from A and B for all actions taking agent to A’ and B’ resp.
○ 0, everywhere else
Let us add a constant c ( say 10) to the rewards of all the actions. Will it change
anything?

38
39

Optimal Policies and Optimal Value Functions

● 𝝿 ≥ 𝝿’ if and only if v𝝿(s) ≥ v𝝿’(s) for all s ∊ S


● There is always at least one policy that is better than or
equal to all other policies → optimal policy (denoted as 𝝿*)
○ There could be more than one optimal policy !!!
Optimal state-value function

Optimal action-value function

40
Optimal Policies and Optimal Value Functions

Bellman optimality equation - expresses that the value of a state under


an optimal policy must equal the expected return for the best action
from that state
Bellman optimality equation for V*
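A reconstruction of the equation (standard form):

$$v_*(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a)\left[\, r + \gamma\, v_*(s') \,\right]$$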

41

Optimal Policies and Optimal Value Functions

Bellman optimality equation - expresses that the value of a state under


an optimal policy must equal the expected return for the best action
from that state

Bellman optimality equation for q*
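A reconstruction of the equation (standard form):

$$q_*(s, a) = \sum_{s', r} p(s', r \mid s, a)\left[\, r + \gamma \max_{a'} q_*(s', a') \,\right]$$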

42
Optimal Policies and Optimal Value Functions

Bellman optimality equation - expresses that the value of a state under


an optimal policy must equal the expected return for the best action
from that state

Backup diagrams for v* and q* 43

Optimal solutions to the gridworld example

Backup diagrams for v* and q* 44


MDP - Objective

45

Notation

46
Race car example

47

48
Race car example

49

Value iteration

50
Value Iteration

0 0 0

2 1 0

3.35 2.35 0

Check this computation on paper.
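A minimal Python sketch of the value-iteration backup used in this example (assumptions: the MDP is given as a dictionary P where P[s][a] is a list of (prob, next_state, reward) triples and every state has at least one action; names are illustrative, not from the slides):

def value_iteration(P, gamma=0.9, theta=1e-6):
    # P[s][a]: list of (prob, next_state, reward) triples
    V = {s: 0.0 for s in P}                 # initialise all state values to 0
    while True:
        delta = 0.0
        for s in P:
            # Bellman optimality backup: V(s) <- max_a sum_s' p * (r + gamma * V(s'))
            best = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:                   # stop once values change very little
            return V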


51

52
53

Infinite Utilities?!
• Problem: What if the game lasts forever? Do we get infinite
rewards?

• Solutions:
• Finite horizon: (similar to depth-limited search)
• Terminate episodes after a fixed T steps (e.g. life)
• Gives nonstationary policies (π depends on time left)

• Discounting: use 0 < γ < 1

• Smaller γ means smaller “horizon” – shorter term focus

54
Discount factor

Geometric series

55

Discounting
• It’s reasonable to maximize the sum of rewards
• It’s also reasonable to prefer rewards now to rewards later
• Discount factor: values of rewards decay exponentially

Worth Now Worth Next Step Worth In Two Steps


56
Discounting
• How to discount?
• Each time we descend a level, we
multiply in the discount once
• Why discount?
• Sooner rewards probably do have
higher utility than later rewards
• Also helps our algorithms converge
• Example: discount of 0.5
• G(r=[1,2,3]) = 1*1 + 0.5*2 + 0.25*3
• G([1,2,3]) < G([3,2,1])

57

Quiz: Discounting
• Given grid world:

• Actions: East, West, and Exit (‘Exit’ only available in terminal states: a, e)
• Rewards are given only after an exit action
• Transitions: deterministic
• Quiz 1: For γ = 1, what is the optimal policy?

• Quiz 2: For γ = 0.1, what is the optimal policy?

• Quiz 3: For which γ are West and East equally good when in state d?
58
59

Example: Grid World


A maze-like problem
The agent lives in a grid
Walls block the agent’s path
Noisy movement: actions do not always go as planned
80% of the time, the action North takes the agent North
10% of the time, North takes the agent West; 10% East
If there is a wall in the direction the agent would have
been taken, the agent stays put

The agent receives rewards each time step


Small negative reward each step (battery drain)
Big rewards come at the end (good or bad)
Goal: maximize sum of (discounted) rewards

60
k=0

Noise = 0.2
Discount = 0.9
Living reward = 0

61

k=1

Noise = 0.2
Discount = 0.9
Living reward = 0

62
k=2

Noise = 0.2
Discount = 0.9
Living reward = 0

63

k=3

Noise = 0.2
Discount = 0.9
Living reward = 0

64
k=4

Noise = 0.2
Discount = 0.9
Living reward = 0

65

k=5

Noise = 0.2
Discount = 0.9
Living reward = 0

66
k=6

Noise = 0.2
Discount = 0.9
Living reward = 0

67

k=7

Noise = 0.2
Discount = 0.9
Living reward = 0

68
k=8

Noise = 0.2
Discount = 0.9
Living reward = 0

69

k=9

Noise = 0.2
Discount = 0.9
Living reward = 0

70
k=10

Noise = 0.2
Discount = 0.9
Living reward = 0

71

k=11

Noise = 0.2
Discount = 0.9
Living reward = 0

72
k=12

Noise = 0.2
Discount = 0.9
Living reward = 0

73

k=100

Noise = 0.2
Discount = 0.9
Living reward = 0

74
Problems with Value Iteration

75

Solutions (briefly, more later…)


76
• Asynchronous value iteration
• In value iteration, we update every state in each iteration
• Actually, any sequence of Bellman updates will converge if every state is visited
infinitely often, regardless of the visitation order
• Idea: prioritize states whose value we expect to change significantly

77

Asynchronous Value Iteration


• Which states should be prioritized for an update?

A single
update per
iteration

78
Double the work?

79

Issue 2: A policy cannot be easily extracted


80
Q-learning

81

Q-learning as value iteration


82
Issue 3: The policy often converges long
before the values

83

Policy Iteration

84
Policy Evaluation

85

Policy value as a Linear program

86
Comparison
• Both value iteration and policy iteration compute the same thing (optimal
state values)
• In value iteration:
• Every iteration updates both the values and (implicitly) the policy
• We don’t track the policy, but taking the max over actions implicitly define it
• In policy iteration:
• We do several passes that update utilities with fixed policies (each pass is fast
because we consider only one action, not all of them)
• After the policy is evaluated, a new policy is chosen (slow like a value iteration
pass)
• The new policy will be better (or we’re done)
87

Issue 4: requires knowing the model and the


reward function

Offline optimization Online Learning 88


Issue 5: requires discrete (finite) set of
actions
• We will explore policy gradient approaches that are suitable for
continuous actions, e.g., throttle and steering for a vehicle
• Can such approaches be relevant for discrete action spaces?
• Yes! We can always define a continuous action space as a distribution over the
discrete actions (e.g., using the softmax function)
• Can we combine value-based approaches and policy gradient
approaches and get the best of both?
• Yes! Actor-critic methods

89

Issue 6: infeasible in large (or continuous) state spaces
• Most real-life problems contain very large state spaces (practically
infinite)
• It is infeasible to learn and store a value for every state
• Moreover, doing so is not useful as the chance of encountering a
state more than once is very small
• We will learn to generalize our learning to apply to unseen states
• We will use value function approximators that can generalize the
acquired knowledge and provide a value to any state (even if it was
not previously seen)

90
Notation

91

Required Readings

1. Chapter-3,4 of Introduction to Reinforcement Learning,2nd Ed., Sutton


& Barto

92
Thank you

93

Deep Reinforcement Learning


2023-24 Second Semester, [Link] (AIML)

Session #6-7:
Monte Carlo Methods

DRL Instructors

1
Agenda for the class

• Introduction
• On-Policy Monte Carlo Methods
• Off-Policy Monte Carlo Methods

Acknowledgements: Some of the slides were adopted with permission from the course CSCE-689 (Texas A&M University) by Prof. Guni Sharon 2

Introduction

● Recollect the problem


○ We need to learn a policy that
takes us as far and as fast as
possible;

3
Introduction

Introduction

5
(Aside) Offline vs. Online (RL)

Offline Optimization Online Learning


6

Monte Carlo Methods


• Monte Carlo methods are a broad class of computational algorithms that
rely on repeated random sampling to obtain numerical results
• The underlying concept is to obtain unbiased samples from a
complex/unknown distribution through a random process
• They are often used in physical and mathematical problems and are most
useful when it is difficult or impossible to compute a solution analytically
• Weather prediction
• Computational biology
• Computer graphics
• Finance and business
• Sport game prediction
7
First-visit Monte-Carlo Policy Evaluation
[estimate V𝛑(s)]
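A minimal Python sketch of first-visit Monte Carlo prediction (assumptions: each episode is a list of (state, reward) pairs generated by following 𝛑, where the reward is the one received on leaving that state; names are illustrative, not from the slides):

from collections import defaultdict

def first_visit_mc(episodes, gamma=1.0):
    returns_sum = defaultdict(float)   # total return observed from each state
    returns_cnt = defaultdict(int)     # number of first visits to each state
    V = defaultdict(float)             # value estimates
    for episode in episodes:
        # index of the first visit of each state in this episode
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)
        # sweep backwards, accumulating the return G
        G = 0.0
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            G = r + gamma * G
            if first_visit[s] == t:        # only the first visit contributes
                returns_sum[s] += G
                returns_cnt[s] += 1
                V[s] = returns_sum[s] / returns_cnt[s]
    return V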

Ex-1:First-visit Monte-Carlo Policy Evaluation


[estimate V𝛑(s)]

9
Acknowledgements : This example is taken from the tutorial by Peter Bodík, RAD Lab, UC Berkeley
Ex-2: First-visit Monte-Carlo Policy Evaluation
[estimate V𝛑(s)]
Input Policy π (shown on the grid) → Observed Episodes (Training) → Output Values; Assume: γ = 1

Episode 1: B, east, C, -1; C, east, D, -1; D, exit, , +10
Episode 2: B, east, C, -1; C, east, D, -1; D, exit, , +10
Episode 3: E, north, C, -1; C, east, D, -1; D, exit, , +10
Episode 4: E, north, C, -1; C, east, A, -1; A, exit, , -10

Output values: V(A) = -10, V(B) = +8, V(C) = +4, V(D) = +10, V(E) = -2
10

Problems with MC Evaluation


• What’s good about direct evaluation?
• It’s easy to understand
• It doesn’t require any knowledge of the underlying model
• It converges to the true expected values
• What’s bad about it?
• It wastes information about transition probabilities
• Each state must be learned separately
• So, it takes a long time to learn
[Output values, as before: A = -10, B = +8, C = +4, D = +10, E = -2]
Think: If B and E both go to C with the same probability, how can their values be different?
11
What about exploration?

[Gridworld with exits +1 and +10 and a -1 state; table of initial state–action value estimates, all 0 (NA where not applicable)]
12

What about exploration?



[Table of state–action value estimates after some greedy episodes: a few entries have become -1, the rest remain 0]
13
What about exploration?

[Table of state–action value estimates after purely greedy learning]
We converged on a local optimum!
14

Must explore!

15
16

MC control - example -100 +10

5 4,3 2,1 0
• w x y z
w x y z

- -,- -,- -

w x y z

exit exit

w x y z

17
MC control - example -100 +10

5 4,3 2,1 0
• w x y z
w x y z

- -,- -,- -

w x y z

exit exit

w x y z

18

MC control - example -100 +10

5 4,3 2,1 0
• w x y z
w x y z

-100 -90,0 -,- -

w x y z

exit exit

w x y z

19
MC control - example -100 +10

-100 -90,3 2,1 0


• w x y z
w x y z

-100 -90,0 -,- -

w x y z

exit exit

w x y z

20

MC control - example -100 +10

-100 -90,3 2,1 0


• w x y z
w x y z

-100 -90,0 -,- -

w x y z

exit exit

w x y z

21
MC control - example -100 +10

-100 -90,3 2,1 0


• w x y z
w x y z

-100 -90,0 -,- -

w x y z

exit exit

w x y z

22

MC control - example -100 +10

-100 -90,3 2,1 0


• w x y z
w x y z

-100 -90,0 -,- -

w x y z

exit exit

w x y z

23
MC control - example -100 +10

-100 -90,-72.9 -81,1 0


• w x y z
w x y z

-100 -90,-72.9 -81,- -

w x y z

exit exit

w x y z

24

MC control - example -100 +10

-100 -90,-72.9 -81,1 0


• w x y z
w x y z

-100 -90,-72.9 -81,- -

w x y z

exit exit

w x y z

25
MC control - example -100 +10

-100 -90,-72.9 -81,1 0


• w x y z
w x y z

-100 -90,-72.9 -81,- -

w x y z

exit exit

w x y z

26

On-policy learning
[Comparison of estimated values vs. true values for the gridworld states]
27
Quick Recap !
On-policy vs. Off-policy Learning

28

Off-policy learning

29
Off-policy learning conditions

30

Trajectory probability

31
32

33
34

Importance sampling

YES!

35
36

37
38

Importance sampling

39
40

41
42

Importance sampling: proof


43
Importance sampling: proof

Importance ratio
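A reconstruction of the importance-sampling ratio for a trajectory from time t to T−1, with target policy 𝛑 and behavior policy b:

$$\rho_{t:T-1} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}$$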

44

(ordinary) Importance sampling - example



-1

+1 +10

45
Weighted importance sampling
• Trick: normalize by the
sum of importance ratios
-1

+1 +10

Ordinary Importance sampling is unbiased


while the weighted version is biased
(initially). Ordinary Importance sampling
results in high variance while the weighted
version has a bounded variance 46
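The two estimators being compared, written out (standard reconstruction; $\mathcal{T}(s)$ is the set of time steps at which s was visited):

$$V(s) = \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}\, G_t}{|\mathcal{T}(s)|} \;\; \text{(ordinary)}, \qquad V(s) = \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}\, G_t}{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}} \;\; \text{(weighted)}$$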

Ordinary vs weighted importance sampling


• Estimating a black-jack state
• Target policy: hit on 19 or
below
• Behavior policy: random
(uniform)
• Both approaches converge
to the true value
• weighted importance
sampling is much better
initially

47
MC control + importance sampling

48

MC control + importance sampling

49
MC control + importance sampling

Going back in time

50

MC control + importance sampling

Discount future
rewards and add
immediate reward

51
MC control + importance sampling

52

MC control + importance sampling

Incremental update of Q
values (weighted moving
average)

53
MC control + importance sampling

Update target policy


(greedy)

54

MC control + importance sampling

55
MC control + importance sampling

56

MC control + IS example -100 +10

- 4,3 2,1 -

w x y z
w x y z
- 0,0 0,0 -

w x y z

exit exit

w x y z

57
MC control + IS example -100 +10

- 4,3 2,1 -

w x y z
w x y z
- 0,0 0,0 -

w x y z

exit exit

w x y z

58

MC control + IS example -100 +10

5 4,3 2,1 5

w x y z
w x y z
0 0,0 0,0 0

w x y z

exit exit

w x y z

59
MC control + IS example -100 +10

5 4,3 2,1 5

w x y z
w x y z
0 0,0 0,0 0

w x y z

exit exit

w x y z

60

MC control + IS example -100 +10

5 4,3 2,1 5

w x y z
w x y z
1 0,0 0,0 0

w x y z

exit exit

w x y z

61
MC control + IS example -100 +10

-100 4,3 2,1 5



w x y z
w x y z
1 0,0 0,0 0

w x y z

exit exit

w x y z

62

MC control + IS example -100 +10

-100 4,3 2,1 5



w x y z
w x y z
1 0,0 0,0 0

w x y z

exit exit

w x y z

63
MC control + IS example -100 +10

-100 4,3 2,1 5



w x y z
w x y z
1 0,0 0,0 0

w x y z

exit exit

w x y z

64

MC control + IS example -100 +10

-100 4,3 2,1 5



w x y z
w x y z
1 0,0 0,0 0

w x y z

exit exit

w x y z

65
MC control + IS example -100 +10

-100 4,3 2,1 5



w x y z
w x y z
1 2,0 0,0 0

w x y z

exit exit

w x y z

66

MC control + IS example -100 +10

-100 -90,3 2,1 5



w x y z
w x y z
1 2,0 0,0 0

w x y z

exit exit

w x y z

67
MC control + IS example -100 +10

-100 -90,3 2,1 5



w x y z
w x y z
1 2,0 0,0 0

w x y z

exit exit

w x y z

68

MC control + IS example -100 +10

-100 -90,3 2,1 5



w x y z
w x y z
1 2,0 0,0 0

w x y z

exit exit

w x y z

69
MC control + IS example -100 +10

-100 -90,3 2,1 5



w x y z
w x y z
1 2,0 0,0 0

w x y z

exit exit

w x y z

70

MC control + IS example -100 +10

-100 -90,3 2,1 5



w x y z
w x y z
1 2,0 0,0 0

w x y z

exit exit

w x y z

71
MC control + IS example -100 +10

-100 -90,3 2,1 5



w x y z
w x y z
1 2,0 0,0 1

w x y z

exit exit

w x y z

72

MC control + IS example -100 +10

-100 -90,3 2,1 10



w x y z
w x y z
1 2,0 0,0 1

w x y z

exit exit

w x y z

73
MC control + IS example -100 +10

-100 -90,3 2,1 10



w x y z
w x y z
1 2,0 0,0 1

w x y z

exit exit

w x y z

74

MC control + IS example -100 +10

-100 -90,3 2,1 10



w x y z
w x y z
1 2,0 0,0 1

w x y z

exit exit

w x y z

75
MC control + IS example -100 +10

-100 -90,3 2,1 10



w x y z
w x y z
1 2,0 0,0 1

w x y z

exit exit

w x y z

76

MC control + IS example -100 +10

-100 -90,3 2,1 10



w x y z
w x y z
1 2,0 0,2 1

w x y z

exit exit

w x y z

77
MC control + IS example -100 +10

-100 -90,3 2,9 10



w x y z
w x y z
1 2,0 0,2 1

w x y z

exit exit

w x y z

78

MC control + IS example -100 +10

-100 -90,3 2,9 10



w x y z
w x y z
1 2,0 0,2 1

w x y z

exit exit

w x y z

79
MC control + IS example -100 +10

-100 -90,3 2,9 10



w x y z
w x y z
1 2,0 0,2 1

w x y z

exit exit

w x y z

80

MC control + IS example -100 +10

-100 -90,3 2,9 10



w x y z
w x y z
1 2,0 0,2 1

w x y z

exit exit

w x y z

81
MC control + IS example -100 +10

-100 -90,3 2,9 10



w x y z
w x y z
1 2,4 0,2 1

w x y z

exit exit

w x y z

82

MC control + IS example -100 +10

-90,8.1
-100 2,9 10

w x y z
w x y z
1 2,4 0,2 1

w x y z

exit exit

w x y z

83
MC control + IS example -100 +10

-90,8.1
-100 2,9 10

w x y z
w x y z
1 2,4 0,2 1

w x y z

exit exit

w x y z

84

MC control + IS example -100 +10

-90,8.1
-100 2,9 10

w x y z
w x y z
1 2,4 0,2 1

w x y z

exit exit

w x y z

85
What did we learn?

86

Generalized Policy Iteration - DP


Required Readings

1. Chapter-3,4 of Introduction to Reinforcement Learning,2nd Ed., Sutton


& Barto

88

Thank you

89
Deep Reinforcement Learning
2022-23 Second Semester, [Link] (AIML)

Session #9:
Temporal Difference Learning

Instructors :
1. Prof. S. P. Vimal (vimalsp@[Link]),
2. Prof. Sangeetha Viswanathan ([Link]@[Link])
1

Agenda for the class

• Temporal Difference Learning


- TD(0)
- SARSA
- Q-Learning

Acknowledgements: Some of the slides were adopted with permission from the course CSCE-689 (Texas A&M University) by Prof. 2
Guni Sharon
Solving MDPs so far

Dynamic programming:
✅ Off policy
✅ Local learning, propagating values from neighbors (bootstrapping)
❌ Model based

Monte-Carlo:
❌ On-policy (though importance sampling can be used)
❌ Requires a full episode to train on
✅ Model free, online learning

-100 +10

w x y z 3

Fuse DP and MC
Dynamic programming:
✅ Off policy
✅ Local learning, propagating values from neighbors (bootstrapping)
❌ Model based

Monte-Carlo:
❌ On-policy (though importance sampling can be used)
❌ Requires a full episode to train on
✅ Model free, online learning

TD Learning:
✅ Off policy
✅ Local learning, propagating values from neighbors (bootstrapping)
✅ Model free, online learning
✅ Model free, online learning
Temporal difference learning

● Temporal Difference Learning Error
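A reconstruction of the TD(0) update and the TD error it is built on (standard form):

$$V(S_t) \leftarrow V(S_t) + \alpha\left[\, R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \,\right], \qquad \delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$$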

5
Temporal difference learning

10
SARSA: On-policy TD Control
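A reconstruction of the SARSA update (standard form; it uses the quintuple S, A, R, S', A'):

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\left[\, R_{t+1} + \gamma\, Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \,\right]$$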

11
SARSA: On-policy TD Control

[Worked example, slides 13-24: the four-state grid world w-x-y-z with exit rewards -100 and +10. Successive slides trace (S, A, R, S′, A′) tuples, starting from Q-values of 0; exiting at the -100 end sets that Q-value to -100, and the -100 then backs up as a discounted -90 estimate to the neighboring state, and so on.]
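Each step of the walkthrough above applies the standard SARSA update (Sutton & Barto, Ch. 6) to the tuple (S, A, R, S′, A′):

Q(S, A) ← Q(S, A) + α·[R + γ·Q(S′, A′) - Q(S, A)]

Because A′ is the action actually chosen by the behaviour policy, SARSA is on-policy: the learned Q-values reflect the exploration the agent really performs.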
Q-learning: Off-policy TD Control

25

Q-learning: Off-policy TD Control

[Worked example, slides 26-41: the same four-state grid world, now annotated with (S, A, R, S′) tuples only, since Q-learning does not need the next action A′. Starting from Q-values of 0, exiting at the -100 end sets that Q-value to -100, after which a discounted -90 estimate backs up to the neighboring state, and so on.]
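The corresponding Q-learning update applied at each step of this walkthrough (Sutton & Barto, Ch. 6) is:

Q(S, A) ← Q(S, A) + α·[R + γ·max_a Q(S′, a) - Q(S, A)]

The max over next actions makes Q-learning off-policy: it learns the value of the greedy policy even while the agent behaves ε-greedily.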

Required Readings

1. Chapter 6 of Reinforcement Learning: An Introduction, 2nd Ed., Sutton & Barto

42
Thank you

43

Deep Reinforcement Learning


2022-23 Second Semester, [Link] (AIML)

Session #10-11-12:
On Policy Prediction with Approximation

Instructors :
1. Prof. S. P. Vimal (vimalsp@[Link]),
2. Prof. Sangeetha Viswanathan ([Link]@[Link])
1
Agenda for the classes

➔ Introduction
➔ Value Function Approximation
➔ Stochastic Gradient, Semi-Gradient Methods
➔ Role of Deep Learning for Function Approximation;
➔ Feature Construction Methods

Acknowledgements: Some of the slides were adopted with permission from the course CSCE-689 (Texas A&M University) by Prof. Guni Sharon
2

Generalizing Across States

● Tabular Learning keeps a table of all state values


● In realistic situations, we cannot possibly learn
about every single state!
○ Too many states to visit them all during training
○ Too many states to hold a value table in memory
● Instead, we want to generalize:
○ Learn about some small number of training states from
experience
○ Generalize that experience to new, similar situations
○ This is a fundamental idea in machine learning
Example: Pacman

Let's say we discover through experience that this state is bad. In naïve tabular learning, we know nothing about this state, or even this one!
Demo

• Naïve Q-learning
• After 50 training
episodes

5
Learn an approximation function

Learn Predict

Generalize

● Q-learning with function


approximator

7
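With a parameterized approximator Q̂(s, a; w), the tabular entry update becomes a weight update (a semi-gradient form of Q-learning):

w ← w + α·[r + γ·max_a Q̂(s′, a; w) - Q̂(s, a; w)]·∇_w Q̂(s, a; w)

Instead of changing one table cell, each experience nudges the weights, so the change generalizes to all states with similar features.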
Parameterized function approximator

Gradient Descent

9
Gradient Descent

10

Gradient Descent

Chain rule

11
Gradient Descent

● Idea:
○ Start somewhere
○ Repeat: Take a step in the gradient direction

12

Figure source: Mathworks

Batch Gradient Descent


13
Stochastic Gradient Descent (SGD)

Observation: once gradient on one training example has been


computed, might as well incorporate before computing next one

14
Mini-Batch Gradient Descent

Observation: gradient over small set of training examples (=mini-batch)


can be computed in parallel, might as well do that instead of a single one

16

SGD for Monte Carlo estimation

w are the tunable


parameters of the value
approximation function
17
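Written out, the SGD update for Monte-Carlo value estimation uses the full return G_t as the target:

w ← w + α·[G_t - v̂(S_t, w)]·∇_w v̂(S_t, w)

This is true gradient descent on the squared error [G_t - v̂(S_t, w)]², because the target G_t does not depend on w.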
Example

18

10

Learning approximation with bootstrapping


19
Semi-gradient methods
● They do converge reliably in important cases such as the linear
approximation case
● They offer important advantages that make them often clearly
preferred
● They typically enable significantly faster learning, as we have
seen in Chapters 6 and 7
● They enable learning to be continual and online, without
waiting for the end of an episode
● This enables them to be used on continuing problems and
provides computational advantages 20

Semi-gradient TD(0)

What’s the difference


from the tabular case?
21
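The semi-gradient TD(0) weight update replaces the Monte-Carlo target with the bootstrapped TD target:

w ← w + α·[R + γ·v̂(S′, w) - v̂(S, w)]·∇_w v̂(S, w)

It is called "semi-gradient" because the target R + γ·v̂(S′, w) also depends on w, but that dependence is ignored when taking the gradient.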
Example

22

[Review] n-step TD Prediction


One-step return:

Two-step return:

Ref Section 7.1 of TB


[Review] n-step TD Prediction
One-step return:

Two-step return:

n-step return:

Ref Section 7.1 of TB

[Review] n-step TD Prediction


One-step return:

Two-step return:

n-step return:

State-value learning algorithm for using n-step returns:

Ref Section 7.1 of TB
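For reference (the equations on these review slides were images), the n-step return and the corresponding state-value update from Section 7.1 of the textbook are:

G_{t:t+n} = R_{t+1} + γ·R_{t+2} + … + γ^{n-1}·R_{t+n} + γ^n·V(S_{t+n})
V(S_t) ← V(S_t) + α·[G_{t:t+n} - V(S_t)]

Setting n = 1 recovers TD(0); letting n run to the end of the episode recovers the Monte-Carlo return.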


[Review] n-step TD Prediction

● Again, only a
simple
modification over
the tabular setting

27
● Again, only a
simple
modification over
the tabular setting
● Weight update
instead of tabular
entry update

28

Another optimization approach

29

For a linear
approximator

30

TD-error = 0

Least squares TD

31
Least squares TD

Store the inverse of A


instead of A

32

Least squares TD

33
Incremental updates
(no need to store all
previous transitions)
Least squares TD

34
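For a linear approximator v̂(s, w) = wᵀx(s), the LSTD solution can be written in closed form (Sutton & Barto, Section 9.8):

Â = Σ_t x_t·(x_t - γ·x_{t+1})ᵀ + εI,   b̂ = Σ_t R_{t+1}·x_t,   w = Â⁻¹·b̂

The εI term keeps Â invertible, and maintaining Â⁻¹ incrementally (Sherman-Morrison) gives O(d²) cost per step instead of storing all past transitions.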

Feature selection

35

Features are domain dependent,
requiring expert knowledge
Automatic features extraction

36

Automatic features extraction for linear approximator

37
Automatic features extraction for linear approximator -
Coarse Coding
● Natural representation of the state set is
continuous
● In 2-d, features corresponding to circles in
state space
● Coding of a state:
○ If the state is inside a circle, then the
corresponding feature has the value 1
○ otherwise the feature is 0
● Corresponding to each circle is a single
weight (a component of w) that is learned
○ Training a state affects the weights of all the
intersecting circles.

Automatic features extraction for linear approximator -


Coarse Coding
Automatic features extraction for linear approximator -
Coarse Coding
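To make coarse coding concrete, here is a small illustrative sketch (my own, not from the slides), assuming 2-D states in the unit square with randomly placed circular receptive fields of a hand-picked radius:

import numpy as np

# Each circle contributes one binary feature; the value estimate is a
# linear function of those features.

def coarse_code(state, centers, radius):
    """Return the binary feature vector: 1 if `state` lies inside a circle."""
    dists = np.linalg.norm(centers - np.asarray(state), axis=1)
    return (dists <= radius).astype(float)

rng = np.random.default_rng(0)
centers = rng.uniform(0.0, 1.0, size=(50, 2))   # 50 random circle centers
radius = 0.2
w = np.zeros(len(centers))                       # one weight per circle

def v_hat(state):
    return w @ coarse_code(state, centers, radius)

def td0_update(s, r, s_next, alpha=0.1, gamma=0.9):
    # Semi-gradient TD(0) on the coarse-coded features: training one state
    # updates the weights of all circles that contain it.
    global w
    x = coarse_code(s, centers, radius)
    delta = r + gamma * v_hat(s_next) - v_hat(s)
    w += alpha * delta * x                       # gradient of w·x w.r.t. w is x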

Automatic features extraction for linear approximator - Tile


Coding
Automatic features extraction for linear approximator - Tile
Coding

Automatic features extraction for linear approximator - Tile


Coding
Automatic features extraction for linear
approximator
● Other approaches include: Coarse Coding, Tile Coding,
Radial Basis Functions (See chapter 9.5 in textbook)
● Each of these approaches defines a set of features, some
useful yet most are not
○ E.g., is there a polynomial/Fourier function that translates pixels to pan
location?
○ Probably but it’s a needle in a (combinatorial) haystack
● Can we do better (generically)
○ Yes, using deep neural networks…

44

What did we learn?


● Reinforcement learning must generalize on observed
experience if it is to be applicable to real world domains
● We can use parameterized function approximation to represent
our knowledge about the domain state/action values
● Use stochastic gradient descent to update the tunable
parameters such that the observed (TD, rollout) error is
reduced
● When using a linear approximator, the Least squares TD
method provides the most sample efficient approximation

45
Deep Q Network
Mnih et al. 2015
• First deep learning model to successfully learn control policies
directly from high-dimensional sensory input using reinforcement
learning
• The model is a convolutional neural network, trained with a variant of
Q-learning
• Input is raw pixels and output is an action-value function estimating
future rewards
• Surpassed a human expert on various Atari video games

46

The age of deep learning


• Previous models relied on hand-crafted features combined with
linear value functions
• The performance of such systems heavily relies on the quality of the feature
representation
• Advances in deep learning have made it possible to automatically
extract high-level features from raw sensory data

47
Example: Pacman

Let's say we discover through experience that this state is bad. In naïve Q-learning, we know nothing about this state, or even this one!

We must generalize our knowledge!


48

• Naïve Q-learning
• After 50 training
episodes

49
• Generalize Q-learning
with function
approximator

50

• Generalizing
knowledge results in
efficient learning
• E.g., learn to avoid the
ghosts

51
Generalizing with Deep learning
• Supervised: Require large amounts of hand-labelled training data
• RL on the other hand, learns from a scalar reward signal that is frequently
sparse, noisy, and delayed
• Supervised: Assume the data samples are independent
• In RL one typically encounters sequences of highly correlated states
• Supervised: Assume a fixed underlying distribution
• In RL the data distribution changes as the algorithm learns new behaviors
• DQN was first to demonstrate that a convolutional neural network
can overcome these challenges to learn successful control policies
from raw video data in complex RL environments
52

Deep Q learning [Mnih et al. 2015]


• Trains a generic neural network-based agent that successfully learns
to operate in as many domains as possible
• The network is not provided with any domain-specific information or
hand-designed features
• Must learn from nothing but the raw input (pixels), the reward,
terminal signals, and the set of possible actions

53
Original Q-learning

54

Deep Q learning [Mnih et al. 2015]


• DQN addresses problems of correlated data and non-stationary
distributions
• Use an experience replay mechanism
• Randomly samples and trains on previous transitions
• Results in a smoother training distribution over many past behaviors

55
Deep Q learning [Mnih et al. 2015]

56

Deep Q learning [Mnih et al. 2015]


57
Q-learning with experience replay

58

Q-learning with experience replay

Play m episodes (full games)

59
Q-learning with experience replay

60

Q-learning with experience replay

For each time step during the episode

61
Q-learning with experience replay

With small probability select a random


action (explore), otherwise select the,
currently known, best action (exploit).

62

Q-learning with experience replay

Execute the chosen action and store the


(processed) observed transition in the
replay memory

63
Q-learning with experience replay

64

Q-learning with experience replay

65
Q-learning with experience replay

66
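To make the structure of the annotated loop concrete, here is a minimal sketch (a simplification of my own: it uses a linear Q-function rather than a convolutional network, and the feature map phi and the environment loop that fills the buffer are assumed rather than defined):

import random
from collections import deque
import numpy as np

class ReplayBuffer:
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)   # old transitions are overwritten

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

def train_step(W, batch, phi, alpha=0.01, gamma=0.99):
    """Semi-gradient Q-learning update on a sampled minibatch.

    W   : weight matrix of shape (n_actions, n_features)
    phi : feature map, state -> array of shape (n_features,)
    """
    for s, a, r, s_next, done in batch:
        x = phi(s)
        q_sa = W[a] @ x
        target = r if done else r + gamma * np.max(W @ phi(s_next))
        W[a] += alpha * (target - q_sa) * x    # gradient of W[a]@x w.r.t. W[a] is x
    return W

A full DQN additionally maintains a separate target network whose weights are copied from the online network only periodically, which further stabilizes the bootstrapped targets.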

Deep Q learning

67
Experience replay
• Utilizing experience replay has several advantages
• Each step of experience is potentially used in many weight updates, which allows
for greater data efficiency
• Learning directly from consecutive samples is inefficient, due to the strong
correlations between the samples; randomizing the samples breaks these
correlations and therefore reduces the variance of the updates
• The behavior distribution is averaged over many of its previous states, smoothing
out learning and avoiding oscillations or divergence in the parameters
• Note that when learning by experience replay, it is necessary to learn off-
policy (because our current parameters are different to those used to
generate the sample), which motivates the choice of Q-learning

68

Experience replay
• DQN only stores the last N experience tuples in the replay memory
• Old transitions are overwritten
• Samples uniformly at random from the buffer when performing
updates
• Is there room for improvement?
• Important transitions?
• Prioritized sweeping
• Prioritize deletions from the replay memory
• see prioritized experience replay, [Link]

69
Results: DQN

70

Results: DQN
• Average predicted action-value on a held-out set of states on Space
Invaders (c) and Seaquest (d)

71
• Normalized between a
professional human games
tester (100%) and random
play (0%)

• E.g., in Pong, DQN achieved a


factor of 1.32 higher score on
average when compared to a
professional human player

72

Maximization bias

73
Double Deep Q networks (DDQN)
We have two available Q
networks. Let's use them for
double learning

74

Double Deep Q networks (DDQN)


We have two available Q
networks. Let's use them for
double learning

75
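Concretely, the Double DQN target decouples action selection from action evaluation across the two networks:

y = r + γ·Q(s′, argmax_a Q(s′, a; θ_online); θ_target)

The online network chooses the action and the target network evaluates it, which reduces the overestimation that comes from taking a max over noisy value estimates.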
• Hasselt et al.
2015
• The straight horizontal
orange (for DQN) and
blue (for Double DQN)
lines in the top row are
computed by running the
corresponding agents
after learning concluded,
and averaging the actual
discounted return
obtained from each
visited state. These
straight lines would
match the learning
curves at the right side of
the plots if there is no
bias.
76

What did we learn?


• Using deep neural networks as function approximators in RL is tricky
• Sparse samples
• Correlated samples
• Evolving policy (nonstationary sample distribution)
• DQN attempts to address these issues
• Reuse previous transitions at each training (SGD) step
• Randomly sample previous transitions to break correlation
• Use off-policy, TD(0) learning to allow convergence to the true target values
(Q*)
• No guarantees for non-linear (DNN) approximators

77
Required Readings

1. Chapter 9 of Reinforcement Learning: An Introduction, 2nd Ed., Sutton & Barto

78

Thank you

79
Deep Reinforcement Learning
2022-23 Second Semester, [Link] (AIML)

Session #13:
Policy Gradients - REINFORCE, Actor-
Critic algorithms

Instructors :
1. Prof. S. P. Vimal (vimalsp@[Link]),
2. Prof. Sangeetha Viswanathan ([Link]@[Link])
1

Agenda for the classes

➔ Introduction
➔ Policy gradients
➔ REINFORCE algorithm
➔ Actor-critic methods
➔ REINFORCE - example

Acknowledgements: Some of the slides were adopted with permission from the course CSCE-689 (Texas A&M University) by Prof. Guni Sharon
2
Notation

Improving the policy



Advantages of PG
• Policy convergence over time, as opposed to an epsilon-greedy value-based
approach
• Naturally applies to continuous action spaces, as opposed to a Q-learning
approach
• In many domains the policy is a simpler function to approximate
Though this is not always the case
• Choice of policy parameterization is sometimes a good way of injecting prior
knowledge
E.g., in phase assignment by a traffic controller
• Can converge on stochastic optimal policies, as opposed to value-based
approaches
Useful in games with imperfect information where the optimal play is often to do two different
things with specific probabilities, e.g., bluffing in Poker

Stochastic policy


Evaluate the gradient in performance

Monte-Carlo Policy Gradient



REINFORCE [Williams, 1992]

REINFORCE [Williams, 1992]



REINFORCE [Williams, 1992]

REINFORCE [Williams, 1992]

12
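The equations on these slides were images in the original deck; the core REINFORCE update (Sutton & Barto, Ch. 13), applied for each step t of a sampled episode, is:

θ ← θ + α·γ^t·G_t·∇_θ ln π(A_t | S_t, θ)

where G_t is the return from time t onwards; actions followed by large returns have their log-probability pushed up proportionally.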
REINFORCE [Williams, 1992]

● The crazy corridor


domain

● With the right step size,


the total reward per
episode approaches the
optimal value of the start
state
13

Guaranteed to converge to a local optimum under standard


stochastic approximation conditions for decreasing α

Calibrating REINFORCE

14
PG with baseline

15

PG with baseline

PG with baseline

REINFORCE with baseline

18

Each approximator has its


unique learning rate
REINFORCE with baseline
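With a learned state-value baseline v̂(S_t, w), each update step becomes:

δ = G_t - v̂(S_t, w)
w ← w + α_w·δ·∇_w v̂(S_t, w)
θ ← θ + α_θ·γ^t·δ·∇_θ ln π(A_t | S_t, θ)

Subtracting the baseline leaves the gradient unbiased but can greatly reduce its variance; note that the two approximators keep their own step sizes α_w and α_θ.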

Policy Gradient

20
Add a critic

Critic’s duties

22
Benefits from a critic

● REINFORCE with baseline is unbiased* and will converge asymptotically


to a local optimum
* With a linear state value approximator, and when b is not a function of a
○ Like all Monte-Carlo methods it tends to learn slowly (produce estimates of high variance)
○ Not suitable for online or for continuing problems
● Temporal-difference methods can eliminate these inconveniences
● In order to gain the TD advantages in the case of policy gradient methods
we use actor–critic methods

Actor+critic

● Actor-critic algorithms are a derivative of policy iteration, which alternates


between policy evaluation—computing the value function for a policy—and
policy improvement—using the value function to obtain a better policy
● In large-scale reinforcement learning problems, it is typically impractical to
run either of these steps to convergence, and instead the value function
and policy are optimized jointly
● The policy is referred to as the actor, and the value function as the critic
Advantage function

One-step actor-critic

26
Keep track of
accumulated discount

27

Follow the current


policy

28
Compute the TD error

29

Update the critic without


the accumulated discount.
(The discount factor is
included in the TD error)

30
Update the actor with
discounting. Early actions
matter more.

31

Update accumulated
discount and progress to
the next state

32
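Putting the annotated steps together, one-step actor-critic (Sutton & Barto, Ch. 13) performs, at every step of the episode:

δ = R + γ·v̂(S′, w) - v̂(S, w)            (TD error from the critic)
w ← w + α_w·δ·∇_w v̂(S, w)               (critic update, no extra discount)
θ ← θ + α_θ·I·δ·∇_θ ln π(A | S, θ)       (actor update, I = accumulated discount)
I ← γ·I,  S ← S′

with I initialized to 1 at the start of each episode.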
In practice: training the
network at every step on a
single observation is
inefficient (slow and
correlated)

33

Instead: store all state approx


values, log probabilities, and
rewards along the episode.
Train once at the end of the
episode.
34
Instead: store all state approx.
values, log probabilities, and
rewards along the episode.
Train once at the end of the
episode.
35

values = [Link](values)
Qvals = [Link](Qvals)
log_probs = [Link](log_probs)
advantage = Qvals - values

Advantage Actor-Critic (A2C)


Add eligibility traces

Gradient eligibility per tunable


parameter for both the actor
and the critic approximators

37

What did we learn?


What did we learn?

39

Required Readings

1. Chapter 13 of Reinforcement Learning: An Introduction, 2nd Ed., Sutton & Barto

40
Thank you

41

Deep Reinforcement Learning


2022-23 Second Semester, [Link] (AIML)

Session #14:
Model Based Algorithms

Instructors :
1. Prof. S. P. Vimal (vimalsp@[Link]),
2. Prof. Sangeetha Viswanathan ([Link]@[Link])
1
Agenda for the classes

➔ Introduction
➔ Upper-Confidence-bound Action Selection
➔ Monte-Carlo Tree Search
➔ AlphaGo Zero
➔ MuZero, PlaNet

Acknowledgements: Some of the slides were adopted with permission from the course CSCE-689 (Texas A&M University) by Prof. Guni Sharon
2

Model Based Algorithms

- Learn the model of an environment’s transition dynamics or make use


of a known dynamic model
- Once an agent has a model of the environment, P(s’|s,a), it can
“imagine” what will happen in the future by predicting the trajectory for a
few time steps.
- If the environment is in state s, an agent can estimate how the state will
change if it makes a sequence of actions a1, a2, . . . , an by repeatedly
applying P(s’|s,a) all without actually producing an action to change the
environment.
- Hence, the predicted trajectory occurs in the agent’s “head” using a
model.
Model Based Algorithms

- Most commonly applied to games with a target state, such as


winning or losing in a game of chess, or navigation tasks with
a goal state s*.
- Do not model any rewards

Model based algorithms - Advantages

- Very appealing : it can play out scenarios and understand the


consequences of its actions without having to actually act in
an environment.
- Require many fewer samples of data to learn good policies
since having a model enables an agent to supplement its
actual experiences with imagined ones.
Model based algorithms - challenges

- Models are hard to come by.


- An environment with a large state space and action space
can be very difficult to model; doing so may even be
intractable, especially if the transitions are extremely complex
- Models are only useful when they can accurately predict the
transitions of an environment many steps into the future -
prediction errors

Upper Confidence bound action selection

UCB is a deterministic algorithm for reinforcement learning that balances
exploration and exploitation using a confidence bound that the algorithm
assigns to each machine on each round of exploration. This bound decreases
as a machine is selected more often relative to the other machines.

The Upper Confidence Bound follows the principle of optimism in the face of uncertainty
which implies that if we are uncertain about an action, we should optimistically assume that
it is the correct action.

Initially, UCB explores more to systematically reduce uncertainty but its exploration reduces
over time. Thus we can say that UCB obtains greater reward on average than other
algorithms such as Epsilon-greedy, Optimistic Initial Values, etc.
UCB Algorithm

Upper Confidence bound action selection

Steps followed in Upper Confidence Bound

1. At each round n, we consider two numbers for machine m.


-> Nₘ(n) = number of times the machine m was selected up to round n.
-> Rₘ(n) = sum of rewards obtained from machine m up to round n.
2. From these two numbers we have to calculate,
a. The average reward of machine m up to round n, rₘ(n) = Rₘ(n) / Nₘ(n).
b. The confidence interval [ rₘ(n) — Δₘ(n), rₘ(n)+Δₘ(n) ] at round n with, Δₘ(n)=
sqrt( 1.5 * log(n) / Nₘ(n) )
3. We select the machine m that has the maximum UCB, ( rₘ(n)+Δₘ(n) )
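A small sketch of this selection rule (illustrative; it assumes the per-machine reward sums R and selection counts N are tracked by the caller):

import math

def ucb_select(R, N, n):
    """Pick the machine with the highest upper confidence bound at round n.

    R[m] : sum of rewards obtained from machine m so far
    N[m] : number of times machine m has been selected so far
    """
    best_m, best_ucb = None, -float("inf")
    for m in range(len(N)):
        if N[m] == 0:
            return m                      # play every machine at least once
        avg = R[m] / N[m]
        delta = math.sqrt(1.5 * math.log(n) / N[m])
        if avg + delta > best_ucb:
            best_m, best_ucb = m, avg + delta
    return best_m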
Monte Carlo Tree Search
- Monte Carlo Tree Search (MCTS) is a heuristic search algorithm that has gained significant attention
in artificial intelligence, especially in the areas of decision-making and game playing.
- MCTS combines the principles of Monte Carlo methods, which rely on random sampling
and statistical evaluation, with tree-based search techniques.
- Idea :
- Build a search tree incrementally by simulating multiple random plays
(often called rollouts or playouts) from the current game state.
- These simulations are carried out until a terminal state or a predefined depth is
reached.
- The results of these simulations are then backpropagated up the tree, updating the
statistics of the nodes visited during the play, such as the number of visits
and the win ratios.

Monte Carlo Tree Search

- It is a probabilistic and heuristic driven search algorithm that combines the


classic tree search implementations alongside machine learning principles of
reinforcement learning.
- exploration-exploitation trade-off : exploits the actions and strategies that is
found to be the best till now but also must continue to explore the local space
of alternative decisions and find out if they could replace the current best.
- Exploration expands the tree's breadth more than its depth, but it quickly
becomes inefficient in situations with a large number of steps or repetitions.
- Exploitation sticks to a single path that has the greatest estimated value. This is
a greedy approach and this will extend the tree’s depth more than its breadth.
Monte Carlo Tree Search

Why use Monte Carlo Tree Search (MCTS) ?

1. Handling Complex and Strategic Games


2. Unknown or Imperfect Information
3. Learning from Simulations
4. Optimizing Exploration and Exploitation
5. Scalability and Parallelization
6. Applicability Beyond Games
7. Domain Independence

Monte Carlo Tree Search


Nodes are the building blocks of the search tree and are formed based on the outcome of
a number of simulations.
MDPs can be represented as trees (or graphs), called ExpectiMax trees:

The letters a-e represent actions, and


letters s-x represent states. White nodes
are state nodes, and the small black
nodes represent the probabilistic
uncertainty: the ‘environment’
choosing which outcome from an
action happens, based on the transition
function.
Monte Carlo Tree Search – Overview
The algorithm is online, which means the action selection is interleaved with action
execution. Thus, MCTS is invoked every time an agent visits a new state.
Fundamental features:
1. The Q-value Q(s,a) for each state-action pair is approximated using random simulation.
2. For a single-agent problem, an ExpectiMax search tree is built incrementally
3. The search terminates when some pre-defined computational budget is used up,
such as a time limit or a number of expanded nodes. Therefore, it is an anytime
algorithm, as it can be terminated at any time and still give an answer.
4. The best performing action is returned.
○ This is complete if there are no dead–ends.
○ This is optimal if an entire search can be performed (which is unusual – if
the problem is that small we should just use a dynamic programming
technique such as value iteration).

The Framework of MCTS


The basic framework is to build up a tree using simulation. The states that have been evaluated
are stored in a search tree. The set of evaluated states is incrementally built be iterating over
the following four steps:

● Select: Select a single node in the tree that is not fully expanded. By this, we mean at
least one of its children is not yet explored.
● Expand: Expand this node by applying one available action (as defined by the MDP)
from the node.
● Simulation: From one of the outcomes of the expanded node, perform a complete random
simulation of the MDP to a terminating state. This therefore assumes that the simulation
is finite, but versions of MCTS exist in which we just execute for some time and then
estimate the outcome.
● Backpropagate: Finally, the value of the node is backpropagated to the root node,
updating the value of each ancestor node on the way using expected value
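As an illustration of the four phases, here is a compact UCT-style sketch (my own; the simulator object sim with methods legal_actions(state), step(state, action) and rollout(state), as well as the exploration constant c, are assumptions, not part of the slides):

import math

class Node:
    def __init__(self, state, parent=None, action=None):
        self.state, self.parent, self.action = state, parent, action
        self.children = []        # expanded child nodes
        self.untried = None       # actions not yet expanded from this node
        self.visits, self.value = 0, 0.0

def uct_child(node, c=1.41):
    # Select the child maximizing the UCT score (exploitation + exploration).
    return max(node.children,
               key=lambda ch: ch.value / ch.visits
                              + c * math.sqrt(math.log(node.visits) / ch.visits))

def mcts(sim, root_state, n_iter=1000, gamma=1.0):
    root = Node(root_state)
    root.untried = list(sim.legal_actions(root_state))
    for _ in range(n_iter):
        node = root
        # 1. Select: walk down fully expanded nodes via UCT.
        while not node.untried and node.children:
            node = uct_child(node)
        # 2. Expand: try one untried action, if any.
        if node.untried:
            a = node.untried.pop()
            s_next = sim.step(node.state, a)
            child = Node(s_next, parent=node, action=a)
            child.untried = list(sim.legal_actions(s_next))
            node.children.append(child)
            node = child
        # 3. Simulate: random rollout from the new node to a terminal state.
        G = sim.rollout(node.state)
        # 4. Backpropagate: update visit counts and values up to the root,
        #    discounting as we move one step up the tree.
        while node is not None:
            node.visits += 1
            node.value += G
            G *= gamma
            node = node.parent
    # Return the action of the most visited child of the root.
    return max(root.children, key=lambda ch: ch.visits).action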
MCTS Framework

In a basic MCTS algorithm we incrementally build of the search tree. Each node in the
tree stores:

1. a set of children nodes;


2. pointers to its parent node and parent action; and
3. the number of times it has been visited.
Selection: The first loop
progressively selects a branch in the
tree using a multi-armed bandit
algorithm using Q(s|a). The
outcome that occurs from an action
is chosen according to P(s’|s,a)
defined in the MDP.

Expansion
Select an action a to apply in state s, either
randomly or using an heuristic. Get an
outcome state s’ from applying action a in
state s according to the probability
distribution p(s’|s). Expand a new
environment node and a new state node for
that outcome.
Simulation: Perform a randomised
simulation of the MDP until we
reach a terminating state. That is,
at each choice point, randomly
select a possible action from the
MDP, and use transition
probabilities P(s’|s) to choose an
outcome for each action.

Heuristics can be used to improve the random simulation by guiding it towards more promising states.
G is the cumulative discounted reward received from the simulation starting at s’ until the simulation
terminates.
To avoid memory explosion, we discard all nodes generated from the simulation. In any non-trivial
search, we are unlikely to ever need them again.

Backpropagation: The reward


from the simulation is
backpropagated from the
selected node to its ancestors
recursively. We must not
forget the discount factor! For
each state s and action a
selected in the Select step,
update the cumulative reward
of that state.
Example :
Backpropagation
Issues in Monte Carlo Tree Search:
1. Exploration-Exploitation Trade-off
2. Sample Efficiency
3. High Variance
4. Heuristic Design
5. Computation and Memory Requirements
6. Overfitting
7. Domain-specific Challenges

AlphaZero

- Combining MCTS and TD learning: Alpha Zero


- Alpha Zero (or more accurately its predecessor AlphaGo) made headlines when it beat
Go world champion Lee Sedol in 2016.
- It uses a combination of MCTS and (deep) reinforcement learning to learn a policy.
AlphaZero - Overview
1. AlphaZero uses a deep neural network to estimate the Q-function. More accurately, it
gives an estimate of the probability of selecting action a in state s, (P(a|s) and the value
of the state (V(s)), which represents the probability of the player winning from s.
2. It is trained via self-play. Self-play is when the same policy is used to generate the moves
of both the learning agent and any of its opponents. In AlphaZero, this means that
initially, both players make random moves, but both also learn the same policy and use it
to select subsequent moves.
3. In short, AlphaZero is a game-playing program that, through a combination of self-play
and neural network reinforcement learning (more on that later), is able to learn to play
games such as chess and Go from scratch ─ that is, after being fed nothing more than the
rules of said games.

AlphaZero

A DNN initialised to return 2 randomly generated outputs: v(s), and p(s). The network will later be
trained to take in position s and predict v(s) and p(s) from it.
The AlphaZero framework. Mastering the
Game of Go without Human Knowledge.
D. Silver, et al. Nature volume 550, pages
354–359 (2017)

Implications of AlphaZero

- Increase the probability of a win come the end of the game


- AlphaZero’s evaluation function is the product of a highly sophisticated neural
network’s experience accumulated over millions of games
- AlphaZero does not make use of any human knowledge, we can expect it to come up
with brand-new ideas previously unknown to mankind.
MuZero

- AlphaZero (2017) - learnt to play on its own, without any human data or
domain knowledge, even mastering three different games (Go, Chess,
Shogi) with a single algorithm that was only provided with the known set of
rules.
- MuZero (2020) - learnt to master these games (Go, Chess, Shogi, Atari) without
being provided with the known rules
- Big step forward towards general-purpose algorithms

MuZero

MuZero models three elements that are key to the planner:

● Value: how good is the current state?

● Policy: which action should be the next one?

● Reward: how good was the last action taken?


MuZero - overview
• The MuZero algorithm aims to accurately predict characteristics and details of
the future which it deems important for planning.
• The algorithm initially receives an input, for instance an image of a chess
board, which is translated into a hidden state.
• The hidden state then undergoes iterations based on the previous hidden state
and a proposed subsequent plan of action.
• Each time the hidden state is updated the model predicts three variables:
policy, value function and immediate reward.
• The policy is the next move to be played, the value function is the predicted
winner, and the immediate reward is the strength of the move (if it improves
the player’s position).
• The model is then trained to accurately predict the values of the three
aforementioned variables.

- The MuZero algorithm does not


receive any rules of the game, for
instance in a chess game, the algorithm
is unaware of the legal moves or what
constitutes a win, a draw or a loss.
- MuZero only receives the rewards of
its actions when a game is terminated,
either by winning, drawing, or losing.
However, during its learning, MuZero
receives rewards periodically based on
how well it is doing in its own mini-
games.
Training and Loss function
- function train_network repeatedly trains the neural network by making use of the
replayBuffer.
- The train_network function works by looping the total number of training steps, which
is set to one million by default.
- The function then samples a batch at every step and uses that data to update the
neural network.
- The tuples within a batch are: the current state of the game, a list of actions taken from
the current position and lastly the targets used to train the neural networks.
- The targets used to train the neural networks are calculated by using Temporal
Difference (TD) Learning .
- The loss function of MuZero is responsible for how the weights of the neural network
are updated.

PlaNet
- Deep Planning Network - Google AI & DeepMind Initiative
- PlaNet agent was tasked with ‘planning’ a sequence of actions to achieve a
goal like pole balancing, teaching a virtual entity (human or cheetah) to
walk, or keeping a box rotating by hitting it in a specific location.
- Common goals between these tasks that the PlaNet needed to achieve:
1. The Agent needs to predict a variety of possible futures (for robust
planning)
2. The Agent needs to update the plan based on the outcomes/rewards of a
recent action
3. The Agent needs to retain information over many time steps

So how did the Google AI team achieve these goals?


PlaNet Overview

1. Learning with a latent dynamics model — PlaNet learns from a series of hidden or
latent states instead of images to predict the latent state moving forward.
2. Model-based planning — PlaNet works without a policy network and instead makes
decisions based on continuous planning.
3. Transfer learning — The Google AI team trained a single PlaNet agent to solve all
six different tasks.

1) Latent Dynamics Model - Key benefits to using compact latent state spaces are that it allows
the agent to learn more abstract representations like the objects’ positions and velocities and
also avoid having to generate images.

Learned Latent Dynamics


Model — Instead of using
the input images directly,
the encoder networks (gray
trapezoids) compress the
images’ information into
hidden states (green circles).
These hidden states are then
used to predict future
images (blue trapezoids)
and rewards (blue
rectangle).
2. Model based planning

3. Transfer Learning
- one agent for all tasks

The agent is randomly placed into different environments without knowing the task,
so it needs to infer the task from its image observations. Without changes to the hyper
parameters, the multi-task agent achieves the same mean performance as individual
agents.

Required Readings and references

1. [Link]
2. [Link]
confidence-bound-%28ucb%29
3. [Link]
reinforcement-learning-b97d3e743d0f
4. [Link]
5. [Link]
what-sets-it-apart-and-what-it-can-tell-us-4ab3d2d08867
6. [Link]
7. [Link]
about-googles-new-planet-reinforcement-learning-network-
144c2ca3f284
8. [Link]
[Link]?m=1
41
Thank you

42

Deep Reinforcement Learning


2022-23 Second Semester, [Link] (AIML)

Session #15:
Imitation Learning

Instructors :
1. Prof. S. P. Vimal (vimalsp@[Link]),
2. Prof. Sangeetha Viswanathan ([Link]@[Link])
1
Agenda for the classes

➔ Introduction
➔ Behavioral Cloning
➔ DAGGER: Dataset Aggregation
➔ Inverse RL
➔ Inverse RL as GAN

Acknowledgements: Some of the slides were adopted with permission from the course CSCE-689 (Texas A&M University) by Prof. Guni Sharon
2

Imitation Learning in a Nutshell

Given: demonstrations or demonstrator


Goal: train a policy to mimic demonstrations

Expert Demonstrations State/Action Pairs Learning

Images from Stephane Ross


Ingredients of Imitation Learning

Demonstrations or Demonstrator Environment / Simulator Policy Class

vs

Loss Function Learning Algorithm

Some Interesting Examples

• ALVINN [Link]
neural-network/
Dean Pomerleau et al., 1989-1999 [Link]

• Helicopter Acrobatics
Learning for Control from Multiple Demonstrations - Adam Coates, Pieter Abbeel, Andrew Ng, ICML 2008
An Application of Reinforcement Learning to Aerobatic Helicopter Flight - Pieter Abbeel, Adam Coates, Morgan
Quigley, Andrew Y. Ng, NIPS 2006

[Link]

• Ghosting ( Sports Analytics) - Next Slide.


Ghosting

Data Driven Ghosting using


Deep Imitation Learning
Hoang M. Le et al., SSAC 2017
[Link]

Notation & Set-up


Notation & Set-up

Example #1: Racing Game


(Super Tux Kart)

s = game screen
a = turning angle

Training set: D={𝑟≔(s,a)} from 𝜋*


● s = sequence of s
● a = sequence of a

Goal: learn 𝜋𝜃(s)→a

Images from Stephane Ross


Example #2: Basketball
Trajectories

s = location of players & ball


a = next location of player

Training set: D={𝑟≔(s,a)} from 𝜋*


● s = sequence of s
● a = sequence of a

Goal: learn 𝜋𝜃(s)→a

Behavioral Cloning = Reduction to Supervised Learning (Ignoring


regularization for brevity.)
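A minimal sketch of this reduction (illustrative only; it assumes continuous actions and fits a linear policy by least squares, whereas practical behavioral cloning usually trains a neural network with SGD):

import numpy as np

def behavioral_cloning(states, actions):
    """Fit a linear policy a ≈ W @ s to expert (state, action) pairs.

    states  : array of shape (N, state_dim)
    actions : array of shape (N, action_dim)
    """
    # Ordinary least squares: minimize sum_i ||W @ s_i - a_i||^2.
    W, *_ = np.linalg.lstsq(states, actions, rcond=None)
    return W.T                      # shape (action_dim, state_dim)

def policy(W, s):
    return W @ s                    # imitate the expert's action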
Behavioral Cloning vs. Imitation Learning

Limitations of Behavioral Cloning


Limitations of Behavioral Cloning

Compounding Errors

When to use Behavioral Cloning?


Types of Imitation Learning

Types of Imitation Learning


DAGGER: Dataset Aggregation

Inverse RL
• What if we don’t have an online demonstrator?
• We only have access to an offline set of demonstrated trajectories
• Behavioral cloning is not robust
• Suffers from overfitting
• We know what to do in observed states but can’t generalize well to other states
• How can we learn to mimic the demonstrator in a general way?
• Learn the demonstrator’s objective (reward) function
• Apply RL
Inverse RL
• What if we don’t have an online demonstrator?
• We only have access to an offline set of demonstrated trajectories
• Behavioral cloning is not robust
• Suffers from overfitting
• We know what to do in observed states but can’t generalize well to other states
• How can we learn to mimic the demonstrator in a general way?
• Learn the demonstrator’s objective (reward) function
• Apply RL

Inverse RL
Inverse RL

Inverse RL
Inverse RL

Inverse RL
Inverse RL

GAN
Inverse RL as GAN

Inverse RL as GAN

Required Readings and references

33
Thank you

34

Deep Reinforcement Learning


2022-23 Second Semester, [Link] (AIML)

Session #16:
Multi-Agent Reinforcement Learning

Instructors :
1. Prof. S. P. Vimal (vimalsp@[Link]),
2. Prof. Sangeetha Viswanathan ([Link]@[Link])
1
Agenda for the classes

➔ Introduction
➔ Cooperative vs Competitive agents,
➔ centralized vs. decentralized RL ;
➔ Proximal Policy Optimization (Surrogate Objective
Function, Clipping)

Acknowledgements: Some of the slides were adopted with permission from the course CSCE-689 (Texas A&M University) by Prof. Guni Sharon
2

Multi-Agent RL (MARL)

Vanilla reinforcement learning is concerned with a single agent, in an environment,


seeking to maximize the total reward in that environment.
A robot learning to walk, where its overall goal is to walk without falling.
It receives rewards for taking steps without falling over, and through trial and error, and
maximizing these rewards, the robot eventually learns to walk.
In this context, we have a single agent seeking to accomplish a goal through maximizing
total rewards.
MARL

Multi-agent reinforcement learning studies how multiple agents interact in a common


environment.
That is, when these agents interact with the environment and one another, can we observe
them collaborate, coordinate, compete, or collectively learn to accomplish a particular task.
In general it’s the same as single agent reinforcement learning, where each agent is trying to
learn its own policy to optimize its own reward.
Using a central policy for all agents is possible, but multiple agents would have to
communicate with a central server to compute their actions (which is problematic in most
real world scenarios), so in practice decentralized multi-agent reinforcement learning is
used.

MARL
Traditional (Single-Agent) RL

Multiagent RL
Motivations: Research in Multiagent RL

Motivations: Research in Multiagent RL


Motivations: Research in Multiagent RL

Motivations: Research in Multiagent RL


Dimensions of MARL

Centralized:

●One brain / algorithm deployed across many agents

Decentralized:

●All agents learn individually


●Communication limitations defined by environment

Decentralized systems
Decentralized systems
• In decentralized learning, each agent is trained independently from the others.
• The benefit is that since no information is shared between agents, these
agents can be designed and trained like we train single agents.
• The idea here is that our training agent will consider other agents as part of the
environment dynamics. Not as agents.
• However, the big drawback of this technique is that it will make the environment
non-stationary, since the underlying Markov decision process changes over time
as other agents are also interacting in the environment. And this is problematic
for many reinforcement learning algorithms that can't reach a global optimum
with a non-stationary environment.

Centralized systems
Centralized systems

• In this architecture, we have a high-level process that collects agents’


experiences: the experience buffer. And we’ll use these experiences to
learn a common policy.
• We use that collective experience to train a policy that will move all three
robots in the most beneficial way as a whole. So each robot is learning
from their common experience. We now have a stationary environment
since all the agents are treated as a larger entity, and they know the
change of other agents’ policies (since it’s the same as theirs).

Dimensions of MARL

Prescriptive:

●Suggests how agents should behave

Descriptive:

●Forecast how agent will behave


Dimensions of MARL
Cooperative: Agents cooperate to achieve a goal
- Shared team reward
- For instance, in a warehouse, robots must collaborate to load and unload the
packages efficiently (as fast as possible).
Competitive: Agents compete against each other
- Zero-sum games
- Individual opposing rewards
- For example, in a game of tennis, each agent wants to beat the other agent.
Neither: Agents maximize their utility which may require cooperating and/or
competing
- General-sum games
- like in our SoccerTwos environment, two agents are part of a team (blue or purple):
they need to cooperate with each other and beat the opponent team.

Dimensions of MARL

Numbers of agents
- One (single-agent)
- Two (very common)
- Finite
- Infinite
Foundations of MARL

Benefits of Multi-agent Learning systems


Challenges of Multi-agent learning systems

Challenges of Multi-agent learning systems


Theoretical framework of MARL

- Markov/Stochastic games - perfect information


- Extensive Form games - imperfect information

Markov game Extensive-form game

Markov Game
MARL Formulation

Nash Q-learning
Nash Q-Learning

Multi-agent Deep Q network - Eg: Pursuit


Evasion
Problem representation

Channel Settings
Multi-agent centralized training

Multi-agent centralized training


Dealing with agent ambiguity

MADQN Architecture - Residual network type


MADQN Architecture

Proximal Policy Optimization (PPO)

The algorithm, introduced by OpenAI in 2017, seems to strike the right balance
between performance and comprehension.
PPO aims to strike a balance between important factors like ease of implementation,
ease of tuning, sample complexity, sample efficiency, and trying to compute an update
at each step that minimizes the cost function while ensuring the deviation from the
previous policy is relatively small.
PPO is in fact, a policy gradient method that learns from online data as well. It merely
ensures that the updated policy isn’t too much different from the old policy to ensure
low variance in training.
PPO - Surrogate Objective

- surrogate objective - avoids performance collapse by


guaranteeing monotonic policy improvement.

Modifying the objective

- relative policy performance identity

- The relative policy performance identity J(𝝅’) - J(𝝅) serves as a metric


to measure policy improvements. If the difference is positive, the newer
policy 𝝅’ is better than 𝝅.
- During a policy iteration, we should ideally choose a new policy 𝝅’ such
that this difference is maximized. Therefore, maximizing objective J(𝝅’)
is equivalent to maximizing this identity, and they can both be done by
gradient ascent.
Surrogate objective
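The clipped surrogate objective (Schulman et al., 2017) makes this concrete. With the probability ratio r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t) and advantage estimate Â_t:

L^CLIP(θ) = Ê_t [ min( r_t(θ)·Â_t , clip(r_t(θ), 1-ε, 1+ε)·Â_t ) ]

Clipping the ratio to [1-ε, 1+ε] removes the incentive to move the new policy far from the old one in a single update, which is what keeps training stable.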

PPO Algorithm

- Page 174 - Foundations of Deep Reinforcement Learning: Theory


and Practice in Python (Addison-Wesley Data & Analytics Series) 1st
Edition by Laura Graesser and Wah Loon Keng
Required Readings and references

1. Foundations of Deep Reinforcement Learning: Theory and Practice


in Python (Addison-Wesley Data & Analytics Series) 1st Edition by
Laura Graesser and Wah Loon Keng

42

Thank you

43
