Lecture 15 : Deep Reinforcement Learning
Semester 1 2019/20
Xavier Bresson
School of Computer Science and Engineering
Data Science and AI Research Centre
Nanyang Technological University (NTU), Singapore
Material :
David Silver :
Tutorial on Deep Reinforcement Learning, ICML’16
UCL Course on Reinforcement Learning, 2015
Pieter Abbeel :
Talk at Simons Institute for the Theory of Computing, Berkeley, 2017
Mario Martin :
Slides on Policy Search: Actor-Critic and Gradient Policy search, 2019
Outline
Deep Learning
DL defines a framework to learn the best possible representation of data that can solve tasks
like classification, regression, recommender systems, etc.
DL is a supervised learning technique :
Data are all labeled.
Research and industrial breakthrough applications in CV (autonomous cars, etc), Speech Recognition (speech-to-text and text-to-speech), NLP (machine translation, etc).
Deep Learning
Potential revolutionary applications in other fields like physics (simulation acceleration, particle detection with complex geometry and 10^8 sensors, generative models for machine calibration) and chemistry (drug and material design).
Learn deep compositional parametric functions by backpropagation on the loss function.
Loss functions are continuous and differentiable.
Reinforcement Learning
RL defines a framework to learn the best possible sequence of decisions in a given environment that can solve tasks like playing Go/Chess, making autonomous robots, recommending ads on the internet, etc.
RL is a causal framework.
Time is important – RL is a sequential decision-making technique.
Reinforcement Learning
RL is different from supervised learning, but also has similarities (discussed later).
Applications :
Playing games (super-human performances): Go, Chess, Poker, Atari, StarCraft II
Exploring virtual worlds : Maze
Robots (still preliminary) : Simulation and real autonomous robots
Learning to learn (meta-learning): Learn architectures and weights simultaneously.
End of humans designing deep learning architectures ?
Brain Inspiration
Deep RL
Reinforcement learning has developed powerful algorithms to solve control problems :
Dynamic programming, 1953 (Richard Bellman) :
Solve a difficult problem by breaking it into simpler sub-problems in a recursive manner.
DP can optimally solve low-dimensional control problems by exploring the whole search space.
DP becomes intractable in high-dimensional search spaces.
Solution :
Combine DL + RL = Deep RL
Outline
RL framework :
Splitting agent and environment :
The environment/world can be anything, from a simple world like a Go board to a complex world like the Earth.
The agent can make actions/decisions.
Each action influences/changes the state of the environment (where the
agent lives).
Each action produces a scalar reward from the environment.
Goal of RL :
Design an agent that acts optimally to maximize reward.
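As a rough illustration of this agent/environment split (not from the lecture), here is a minimal sketch of the interaction loop in Python; the ToyEnv environment and random_agent are invented for the example.

import random

class ToyEnv:
    """Hypothetical 1-D world : the agent walks left/right on positions 0..4, reward +1 at position 4."""
    def reset(self):
        self.pos = 0
        return self.pos                      # initial state s1

    def step(self, action):                  # action in {-1, +1}
        self.pos = max(0, min(4, self.pos + action))
        reward = 1.0 if self.pos == 4 else 0.0
        done = (self.pos == 4)
        return self.pos, reward, done        # next state, scalar reward, end of episode

def random_agent(state):
    return random.choice([-1, +1])           # the agent picks an action given the current state

env = ToyEnv()
state = env.reset()
total_reward = 0.0
for t in range(100):                         # interaction loop : action -> new state + reward
    action = random_agent(state)
    state, reward, done = env.step(action)
    total_reward += reward
    if done:
        break
print("total reward :", total_reward)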
An episode collects the sequence of states, actions and rewards up to time t :
et = s1, a1, s2, r1, a2, s3, r2, … , st, at, st+1, rt
At time step t, the transition is : st, at, st+1, rt
MDP
A Markov Decision Process (MDP) is defined by the states st, the actions at, the state transition probabilities P(st+1|st,at) and the rewards rt.
Example
Pong can be cast as an MDP :
An MDP can be visualized as a graph where :
Each node is either a game state st or an action at.
Each edge is a probabilistic transition given either by the state transition probability
matrix P(st+1|st,at), or the policy function π(at|st) (later discussed).
(Figure : the MDP of PONG as a graph; state nodes s and action nodes a are connected by edges carrying the probabilities π(a|s) and P(s’|s,a).)
PONG : A game state st is the image of the game at time step t. An action at is a move, either Up or Down.
Outline
Properties of RL
The goal of RL is to act optimally to maximize a value function that represents the total
reward of a sequence of actions.
Policy
Deterministic policy :
π(s) = the single action a taken in state s.
Policy
Stochastic policy :
π(a|s) = Probability of action a in state s
Given stochastic π, a new episode can be drawn as follows :
et = s1, a1 ~ π(a|s1), s2 ~ P(s|s1,a1), r1, a2 ~ π(a|s2), s3 ~ P(s|s2,a2), r2,
… st ~ P(s|st-1,at-1), at ~ π(a|st), st+1 ~ P(s|st,at), rt
From a given state st, there are multiple possible actions, unlike for the
deterministic policy.
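A minimal sketch (not the demo code) of drawing an action from a stochastic policy stored as a table of probabilities; the states, actions and probabilities below are invented for the example.

import numpy as np

# Hypothetical stochastic policy pi(a|s) : one probability vector over actions per state.
pi = {
    "s1": np.array([0.7, 0.3]),   # pi(a1|s1), pi(a2|s1)
    "s2": np.array([0.5, 0.5]),
}
actions = ["Up", "Down"]

def sample_action(state):
    # Draw a ~ pi(a|state) : repeated calls from the same state can return different actions.
    return np.random.choice(actions, p=pi[state])

print([sample_action("s1") for _ in range(5)])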
Policy
(Figure : a discrete probability distribution Pr(X) over values x1, x2, x3, …, xK; a stochastic policy π(a|s) is such a distribution over the actions.)
Policy
Given the probability functions π(a|s) and P(s’|s,a),
A new episode et can be drawn by rolling out/sampling π and P :
et = s1, a1 ~ π(a|s1), s2 ~ P(s|s1,a1), r1, a2 ~ π(a|s2), s3 ~ P(s|s2,a2), r2,
… st ~ P(s|st-1,at-1), at ~ π(a|st), st+1 ~ P(s|st,at), rt
(Figure : the tree of all possible rollouts s1 → a1 → s2 → a2 → … → st+1, where each action is sampled from π(at|st) and each next state from P(st+1|st,at).)
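A minimal rollout sketch for a hypothetical tabular MDP; the tables pi, P and the reward r below are invented, not the lecture's.

import numpy as np

# Hypothetical 3-state MDP : pi(a|s), P(s'|s,a) and r(s,a,s') are given as tables (all made up).
states, actions = ["s1", "s2", "s3"], ["a1", "a2"]
pi = {s: np.array([0.5, 0.5]) for s in states}                             # pi(a|s)
P = {(s, a): np.array([0.2, 0.3, 0.5]) for s in states for a in actions}   # P(s'|s,a)

def r(s, a, s_next):                                                       # immediate reward r(s,a,s')
    return 1.0 if s_next == "s3" else 0.0

def rollout(s, T=5):
    """Sample an episode e = s1, a1, s2, r1, a2, s3, r2, ... by alternating draws from pi and P."""
    episode = []
    for t in range(T):
        a = np.random.choice(actions, p=pi[s])           # a_t ~ pi(a|s_t)
        s_next = np.random.choice(states, p=P[(s, a)])   # s_{t+1} ~ P(s|s_t,a_t)
        episode.append((s, a, s_next, r(s, a, s_next)))
        s = s_next
    return episode

print(rollout("s1"))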
Policy
Maze example :
Actions are : Up, Down, Left, Right
States are : Agent’s locations
Model
Maze example :
Actions are : Up, Down, Left, Right
States are : Agent’s locations
Model is :
Dynamics of the environment modeled by P(s|st,at)
Immediate reward rt(st,at,st+1) given by the environment
State-Value Function
The state value function Q(s) is the expected future reward for being in state s.
It is used to evaluate the goodness/badness of each state s.
It acts as the label information in supervised learning (later discussed).
Let us compute the value function Q in state st considering only one episode :
et = st, at ~ π(a|st), st+1 ~ P(s|st,at), rt, at+1 ~ π(a|st+1), st+2 ~ P(s|st+1,at+1), rt+1, …
Q(st) = rt + γ.rt+1 + γ^2.rt+2 + … = Σ_{k=0}^{∞} γ^k.rt+k
The value function Q(st) is the total discounted reward (a.k.a. return) we will receive
for being in state st.
State-Value Function
Discounted reward using hyper-parameter γ :
Q(st) = rt + γ.rt+1 + γ^2.rt+2 + … = Σ_{k=0}^{∞} γ^k.rt+k
Exponential decay factor γ^k :
0 < γ < 1
γ = 0.9 ⇒ γ^k ≈ 0 for k ≥ 50 : short-term consequences
γ = 0.99 ⇒ γ^k ≈ 0 for k ≥ 500 : long-term consequences (harder to learn)
Justifications of γ :
Math :
Avoid infinite total reward with value γ<1.
Model uncertainty about the policy network :
The action taken is not absolutely certain to be optimal, so it is better to
discount its future reward.
Short-term return is usually preferred over long-term return (e.g. in finance).
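A short helper (not from the lecture) that computes the discounted return from a list of rewards with a single backward pass.

def discounted_return(rewards, gamma=0.9):
    """Q = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ..., computed backwards in one pass."""
    Q = 0.0
    for r in reversed(rewards):
        Q = r + gamma * Q
    return Q

print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))   # 1 + 0.9 + 0.81 = 2.71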
State-Value Function
Maze example :
Actions are : Up, Down, Left, Right
States are : Agent’s locations
Model is :
Dynamics of the environment modeled by P(s|st,at)
Immediate reward rt(st,at,st+1) = -1 given by the environment
State-Value Function
Let us compute the value function Q in state st considering all possible episodes drawn
from π and P :
et = st, at ~ π(a|st), st+1 ~ P(s|st,at), rt, at+1 ~ π(a|st+1), st+2 ~ P(s|st+1,at+1), rt+1, …
Q(st) = 𝔼π,P ( rt + γ.rt+1 + γ^2.rt+2 + … )
Q(st) is the expected total discounted reward for being in state st.
State-Value Function
The Bellman equation of the value function is stationary, i.e. independent of the time step t :
Let st, at, st+1, rt be re-written as
s, a, s’, r
Let Q(st) = 𝔼π,P ( rt + γ.Q(st+1) ) be re-written as
Q(s) = 𝔼π,P ( r(s,a,s’) + γ.Q(s’) )
= Σa π(a|s) Σs’ P(s’|a,s) ( r(s,a,s’) + γ.Q(s’) )
(Figure : one-step look-ahead from state s; an action a is drawn from π(a|s) and a next state s’ from P(s’|s,a), yielding the reward r(s,a,s’).)
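The Bellman equation above can be turned into a fixed-point iteration (iterative policy evaluation). A minimal sketch on a randomly generated tabular MDP, with P, R and pi stored as numpy arrays; all the numbers are invented for the example, this is not the lecture's code.

import numpy as np

# Toy tabular MDP (made up) : P[s,a,s'], R[s,a,s'] and a uniform policy pi[s,a].
nS, nA, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.random((nS, nA, nS)); P /= P.sum(axis=2, keepdims=True)   # P(s'|s,a)
R = rng.random((nS, nA, nS))                                      # r(s,a,s')
pi = np.full((nS, nA), 1.0 / nA)                                  # pi(a|s)

Q = np.zeros(nS)                       # state-value function, written Q(s) in the lecture's notation
for _ in range(200):                   # fixed-point iteration of the Bellman equation
    Q = np.einsum("sa,sat,sat->s", pi, P, R + gamma * Q[None, None, :])
print(Q)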
Action-Value Function
Let us compute the value function Q of taking action at in state st considering only one episode :
et = st, at selected, st+1 ~ P(s|st,at), rt, at+1 ~ π(a|st+1), st+2 ~ P(s|st+1,at+1), rt+1, …
Q(at|st) = rt + γ.rt+1 + γ^2.rt+2 + … = Σ_{k=0}^{∞} γ^k.rt+k
Same equation as state-value function Q(st) except the action at is not a random
variable but an arbitrary selection.
Action-Value Function
Let us compute the value function Q of taking action at in state st considering all possible
episodes drawn from π and P :
et = st, at selected, st+1 ~ P(s|st,at), rt, at+1 ~ π(a|st+1), st+2 ~ P(s|st+1,at+1), rt+1, …
Q(at|st) = 𝔼π,P ( rt + γ.rt+1 + γ^2.rt+2 + … )
Action-Value Function
The Bellman equation of the action-value function is stationary, i.e. independent of the time step t :
Let st, at, st+1, rt , at+1 be re-written as
s, a, s’, r, a’
Let Q(at|st) = 𝔼π,P ( rt + γ.Q(at+1|st+1) ) be re-written as
Q(a|s) = 𝔼π,P ( r(s,a,s’) + γ.Q(a’|s’) )
= Σs’ P(s’|a,s) ( r(s,a,s’) + γ.Σa’ π(a’|s’).Q(a’|s’) )
= 𝔼P ( r(s,a,s’) + γ.Σa’ π(a’|s’).Q(a’|s’) )
(Figure : the action a is selected in state s; a next state s’ is drawn from P(s’|s,a), then a next action a’ from π(a’|s’).)
Outline
Goals of RL :
Find the optimal value function Q*(a|s) :
We search for the maximum reward given by action a in state s.
Find the optimal policy π*(a|s) :
We search for the policy that provides the best action a in state s.
The optimal policy is deterministic, π*(a|s) := π*(s).
Both objectives are equivalent :
Q*(a|s) = maxπ Qπ(a|s) = Qπ*(a|s)
Once the optimal action-value function Q* is found, the problem is solved as it is possible to act optimally :
π*(s) = argmaxa Q*(a|s)
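A minimal sketch of acting greedily from a hypothetical optimal action-value table Q* (the numbers are invented for the example).

import numpy as np

# Hypothetical optimal action-value table Q*(a|s) of shape (number of states, number of actions).
Q_star = np.array([[1.0, 2.0],
                   [0.5, 0.1],
                   [3.0, 3.5]])

def greedy_policy(s):
    # pi*(s) = argmax_a Q*(a|s) : acting optimally once Q* is known.
    return int(np.argmax(Q_star[s]))

print([greedy_policy(s) for s in range(3)])   # [1, 0, 1]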
(Figure : one-step look-ahead tree from state s; the optimal policy π* selects the action a with the largest Q*(a|s), and in each next state s’ the action a’ with the largest Q*(a’|s’).)
The Bellman equation of the optimal action-value function (the maximum Q value for taking
action a in state s) :
Q*(a|s) = maxπ Qπ(a|s)
= maxπ 𝔼π,P ( r(s,a,s’) + γ.Qπ(a’|s’) )
= maxπ Σs’ P(s’|a,s) ( r(s,a,s’) + γ.Qπ(a’|s’) )
= Σs’ P(s’|a,s) ( r(s,a,s’) + γ.maxπ Qπ(a’|s’) )
= Σs’ P(s’|a,s) ( r(s,a,s’) + γ.maxa’ Q*(a’|s’) )
= 𝔼s’ ( r + γ.maxa’ Q*(a’|s’) )
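The Bellman optimality equation also yields a fixed-point iteration, the classical tabular Q-value iteration. A minimal sketch on a random toy MDP (the arrays are invented for the example; this is not the lecture's code).

import numpy as np

# Toy tabular MDP (made up) : P[s,a,s'] and R[s,a,s'] as before.
nS, nA, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.random((nS, nA, nS)); P /= P.sum(axis=2, keepdims=True)
R = rng.random((nS, nA, nS))

Q = np.zeros((nS, nA))                                   # estimate of Q*(a|s)
for _ in range(200):                                     # Bellman optimality backup
    target = R + gamma * Q.max(axis=1)[None, None, :]    # r(s,a,s') + gamma * max_a' Q(a'|s')
    Q = np.einsum("sat,sat->sa", P, target)              # expectation over s' ~ P(s'|s,a)
print(Q.max(axis=1))                                     # resulting optimal state values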
Outline
Deep Q-Learning
(Figure : a deep Q-network (DQN) takes the state s as input and outputs Qw*(a1|s), …, Qw*(aK|s), one value per action; each output regresses the target Q*(a|s), as in supervised learning.)
Deep Q-Learning
We need to estimate :
Q*(a|s) = 𝔼s’ ( r + γ.maxa’ Q(a’|s’) )
= Σs’ P(s’|a,s) ( r(s,a,s’) + γ.maxa’ Q(a’|s’) )
We approximate the analytical optimal value Q* by sampling next states s’ :
Q*(a|s) ≈ 1/|{s’}| Σs’ ( r(s,a,s’) + γ.maxa’ Q(a’|s’) ) (average over the collected next states s’)
≈ r(s,a,s’) + γ.maxa’ Q(a’|s’) (if only one next state s’ is collected)
≈ r(s,a,s’) + γ.maxa’ Q*(a’|s’) (at the optimum, maxa’ Q(a’|s’) = maxa’ Q*(a’|s’))
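A one-line helper (my own, not the lecture's) for the one-sample target above; Q is any mapping from a state to its list of action values, and the done flag for terminal states is a standard practical detail that the slide does not spell out.

def td_target(r, s_next, Q, gamma=0.99, done=False):
    """One-sample estimate of Q*(a|s) ~ r + gamma * max_a' Q(a'|s') from a single transition."""
    return r if done else r + gamma * max(Q[s_next])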
Deep Q-Learning
(Figure : transitions (s, a, s’, r) are collected in a replay memory.)
DQN Algorithm
DQN algorithm :
Initialize w randomly and wBL = w
Repeat :
Roll-out an episode : st, at ~ πw, rt, st+1 and store the data in the replay memory.
Update the action-value function Q*w by mini-batch gradient descent (batch of transitions randomly sampled from the replay memory) :
wk+1 = wk – lr . ∇w ( 1/N Σ(s,a,s’,r) || r(s,a,s’) + γ.maxa’ Q*w_BL(a’|s’) - Q*w(a|s) ||^2 )
Update the baseline network wBL = w if greedy evaluation of network w is better than wBL.
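A minimal PyTorch sketch of one DQN update, assuming a replay memory of (s, a, s’, r, done) tuples, a Q-network q_w and a frozen baseline/target copy q_bl; these names are mine, not the demo's, and the sketch omits the rollout and baseline-update steps.

import random
import torch
import torch.nn as nn

def dqn_update(q_w, q_bl, memory, optimizer, batch_size=32, gamma=0.99):
    """One DQN gradient step on a random mini-batch of transitions (s, a, s', r, done)."""
    batch = random.sample(memory, batch_size)
    s, a, s2, r, done = map(torch.as_tensor, zip(*batch))
    s, s2, r = s.float(), s2.float(), r.float()
    q_sa = q_w(s).gather(1, a.long().unsqueeze(1)).squeeze(1)     # Q_w(a|s) for the taken actions
    with torch.no_grad():                                         # target uses the frozen baseline network
        target = r + gamma * (1 - done.float()) * q_bl(s2).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, target)                   # || r + gamma max_a' Q_w_BL(a'|s') - Q_w(a|s) ||^2
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()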
DQN Applications
(Figure : Atari games; the agent receives an observation ot and takes an action at; the reward is the game score for this action and state.)
DQN Applications
https://github.com/aleju/mario-ai
Demo 01
Finding optimal strategy to stabilize a pole attached by an un-actuated joint to a cart,
which moves along a frictionless track.
The pendulum starts upright, and the goal is to prevent it from falling over by
increasing and reducing the cart's velocity.
The cart-pole environment is the “MNIST” experiment for RL.
Use it to develop new RL architectures !
Demo 01
Replay Memory
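A simplified sketch of such a replay memory (my own version, not the demo's code).

import random
from collections import deque

class ReplayMemory:
    """Minimal replay memory : store (s, a, s', r, done) transitions and sample random mini-batches."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)       # oldest transitions are dropped automatically

    def push(self, s, a, s_next, r, done):
        self.buffer.append((s, a, s_next, r, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)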
Demo 01 : Q network
Compute Q scores given a state.
Select an action given a state.
Read (s,a,s’,r) from memory.
Standard MSE loss.
Demo 01 : Rollout class
Rollout one episode.
Write (s,a,s’,r) in memory.
Demo 01
Rollout one episode.
Compute loss, gradient, update.
Demo 01
Train one epoch by rolling out a few episodes.
Train multiple epochs.
Update the baseline network if greedy evaluation of the current Q network is better.
Stopping condition (high reward).
DQN Algorithm
Advantages :
Memory Replay module memorizes the history of (state, action, reward) transitions.
Examples : All professional games of Go, Chess, etc can be memorized.
Memory Replay is an attractive feature on the path toward AI.
Challenges :
Hard to compute a good baseline.
Hard to control the exploration of the space (state,action).
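A common remedy for the exploration issue (not covered in the slides) is epsilon-greedy action selection; a minimal sketch.

import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon pick a random action (exploration), otherwise the greedy one (exploitation)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])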
Outline
Policy Networks
Find the optimal policy that maximizes an expected reward :
Local discounted rewards :
Optimize the expected reward of all actions and states :
maxπ 𝔼a,s ( Q(a|s) )
Q(a|s) : Local discounted rewards for taking action a in state s, e.g. in the cartpole simulation.
Global/sparse rewards :
Optimize the expected reward of all episodes/trajectories :
maxπ 𝔼e~π,P ( Q(e) )
Q(e) : Global reward such as win/loss in Go or the length of a cartpole episode.
Policy Gradient
Policy gradient :
∇w 𝔼a,s ( Q(a|s) ) = ∇w ( Σa,s πw(a|s) . Q(a|s) )
= Σa,s ( ∇w πw(a|s) ) . Q(a|s)
= Σa,s πw(a|s) ( ∇w πw(a|s) / πw(a|s) ) . Q(a|s)
= Σa,s πw(a|s) ∇w log πw(a|s) . Q(a|s)
= 𝔼a,s ( ∇w log πw(a|s) . Q(a|s) )
= ∇w ( 𝔼a,s ( log πw(a|s) . Q(a|s) ) )
Policy loss for local discounted rewards :
LL(w) = 𝔼a,s ( - log πw(a|s) . Q(a|s) )
minw LL(w)
Policy gradient :
∇w LL(w) = 𝔼a,s ( - ∇w log πw(a|s) . Q(a|s) )
Loss Approximation
Loss for one episode :
LL(w) = 𝔼a,s ( - log πw(a|s) . Q(a|s) )
≈ 1/T Σt - log πw(at|st) . Q(at|st)
REINFORCE Algorithm
Repeat :
Collect a mini-batch of M episodes em by rolling out the current policy network πw.
Update w by a gradient step on the mini-batch estimate of ∇w LL(w).
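A minimal PyTorch sketch of one REINFORCE update with local discounted rewards; policy is assumed to be a module mapping a state tensor to action logits, and the z-score normalization mirrors Demo 02 (the naming is mine, not the demo's).

import torch
from torch.distributions import Categorical

def reinforce_update(policy, optimizer, episodes, gamma=0.99):
    """One REINFORCE step on a mini-batch of episodes, each given as (states, actions, rewards)."""
    loss = 0.0
    for states, actions, rewards in episodes:
        returns, Q = [], 0.0
        for r in reversed(rewards):                     # discounted returns Q(a_t|s_t)
            Q = r + gamma * Q
            returns.insert(0, Q)
        returns = torch.tensor(returns)
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)   # z-score, as in Demo 02
        for s, a, R in zip(states, actions, returns):
            logp = Categorical(logits=policy(s)).log_prob(torch.as_tensor(a))   # log pi_w(a_t|s_t)
            loss = loss - logp * R                      # - log pi_w(a_t|s_t) . Q(a_t|s_t)
    loss = loss / len(episodes)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return float(loss)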
Demo 02
Policy network
Select an action a given a state s.
Select the action by Bernoulli random sampling to explore the space of (state, action).
Compute the discounted rewards.
Z-score the discounted rewards.
Loss.
Demo 02 : Rollout class
Rollout one episode.
Demo 02
Rollout one episode.
Compute loss, gradient, update.
Demo 02
Train one epoch by rolling out a few episodes.
Train multiple epochs.
Stopping condition (high reward).
The expected global reward sums over all possible episodes : Σe Pe(e|π,P) . Q(e), where Pe(e|π,P) is the probability of episode e given π and P.
(Figure : the tree of all possible episodes s1 → a1 → s2 → a2 → … → st+1, each branch weighted by π(at|st) and P(st+1|st,at).)
Policy Gradient
Policy gradient :
∇w 𝔼e~π,P ( Q(e) ) = ∇w ( Σe Pe(e|πw,P) . Q(e) )
= Σe ∇w Pe(e|πw,P) . Q(e)
= Σe Pe(e|πw,P) ( ∇w Pe(e|πw,P) / Pe(e|πw,P) ) . Q(e)
= Σe Pe(e|πw,P) ∇w log Pe(e|πw,P) . Q(e)
= 𝔼e ( ∇w log Pe(e|πw,P) . Q(e) )
≈ 1/M Σm ∇w log Pe(em|πw,P) . Q(em) (approximation of the expected value with a mini-batch of M episodes)
where the probability of the mth episode is Pe(em|πw,P) = ∏t πw(atm|stm) . P(st+1m|stm,atm)
Policy Gradient
Policy gradient :
∇w 𝔼e~π,P ( Q(e) ) ≈ 1/M Σm ∇w log Pe(em|πw,P) . Q(em)
≈ 1/M Σm ∇w log ( ∏t πw(atm|stm) . P(st+1m|stm,atm) ) . Q(em)
≈ 1/M Σm ∇w ( Σt log πw(atm|stm) + Σt log P(st+1m|stm,atm) ) . Q(em)
The transition term Σt log P(st+1m|stm,atm) does not depend on w, so its gradient vanishes.
Policy loss for global reward :
LG(w) = 1/M Σm Σt - log πw(atm|stm) . Q(em)
minw LG(w)
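A minimal PyTorch sketch of the global-reward policy loss LG(w), where each episode carries a single scalar reward Q(em); the naming is mine, not the demo's.

import torch
from torch.distributions import Categorical

def global_reward_loss(policy, episodes):
    """L_G(w) = 1/M sum_m sum_t - log pi_w(a_t|s_t) . Q(e_m), with one scalar reward Q(e_m) per episode."""
    loss = 0.0
    for states, actions, episode_reward in episodes:
        for s, a in zip(states, actions):
            loss = loss - Categorical(logits=policy(s)).log_prob(torch.as_tensor(a)) * episode_reward
    return loss / len(episodes)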
Demo 03
Demo 03 : Policy network
Compute the action probabilities given a state s.
Select an action a given a state s.
Loss : compare the current policy with the baseline.
Demo 03 : Rollout class
Rollout one episode.
Demo 03
Rollout one episode for the current and baseline policies.
Compute loss, gradient, update.
Stopping condition (high reward).
Policy Algorithms
Outline
Actor-Critic Algorithm
Monte-Carlo policy gradient (i.e. policy gradient with local discounted rewards), a.k.a. REINFORCE, has a high variance (because of Q(at|st)) :
∇w LL(w) = Σt - ∇w log πw(at|st) . Q(at|st)
where Q(at|st) = rt + γ.rt+1 + γ^2.rt+2 + … + γ^{T-t}.rT = Σ_{k=t}^{T} γ^{k-t}.rk
The variance of the policy (actor) can be reduced by constraining Q(at|st) (critic) to satisfy the Bellman equation :
Q(at|st) = 𝔼π,P ( rt + γ.Q(at+1|st+1) )
≈ rt + γ.Q(at+1|st+1) (during one episode)
Q(a|s) will be estimated with a neural network.
One-Step Actor-Critic (QAC) loss :
LQAC(w,v) = Σt - log πw(at|st) . Qv(at|st) + Σt || Qv(at|st) – ( rt + γ.Qv(at+1|st+1) ) ||^2
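A minimal PyTorch sketch of the QAC loss over one episode; policy and q_net are assumed torch modules mapping a state to action logits / Q values (my naming), and detaching the critic target is a common practical choice that the slide does not spell out.

import torch
from torch.distributions import Categorical

def qac_loss(policy, q_net, episode, gamma=0.99):
    """One-step actor-critic loss over one episode (states, actions, rewards) : actor term + Bellman residual."""
    states, actions, rewards = episode
    actor_loss, critic_loss = 0.0, 0.0
    for t in range(len(rewards) - 1):
        q_t = q_net(states[t])[actions[t]]                      # Q_v(a_t|s_t)
        q_t1 = q_net(states[t + 1])[actions[t + 1]]             # Q_v(a_{t+1}|s_{t+1})
        logp = Categorical(logits=policy(states[t])).log_prob(torch.as_tensor(actions[t]))
        actor_loss = actor_loss - logp * q_t.detach()           # - log pi_w(a_t|s_t) . Q_v(a_t|s_t)
        critic_loss = critic_loss + (q_t - (rewards[t] + gamma * q_t1.detach())) ** 2
    return actor_loss + critic_loss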
QAC Algorithm
Demo 04
AAC Algorithm
Policy (actor).
Evaluate the action-value Q function (critic).
Evaluate the state-value function (critic).
Action selected by the policy.
Demo 05 : Loss
Reward must be negative.
State-value function loss for one episode.
Action-value Q function loss for one episode.
Demo 05 : Rollout class
Rollout one episode.
Write (s,a,s’,r) in memory.
Demo 05
Rollout one episode for the current and baseline policies.
Compute loss, gradient, update.
Demo 05
Train one epoch by rolling out a few episodes.
Train multiple epochs.
Stopping condition (high reward).
Continuous control :
Deterministic policy and continuous action space :
a = π(s), with a ∈ R (continuous variable)
Find the deterministic policy π that maximizes the expected total reward over all possible actions and states :
maxπ 𝔼a=π(s),s ( Q(a|s) )
= maxπ Σa=π(s),s Q(a|s)
= maxπ Σs Q(π(s)|s)
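A minimal PyTorch sketch of the corresponding actor objective for continuous actions (a DDPG-style update; policy and q_net are hypothetical modules, not the lecture's code).

import torch

def deterministic_policy_loss(policy, q_net, states):
    """Continuous-control actor objective : maximize sum_s Q(pi(s)|s), i.e. minimize its negative."""
    actions = policy(states)                   # a = pi(s), differentiable w.r.t. the policy weights
    return -q_net(states, actions).mean()      # gradient ascent on Q via gradient descent on -Q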
Further Improvements
Applications
Outline
Supervised Learning
SL is learning with a teacher who tells the best action to take given any state of the environment.
Training set for SL :
(st, at), t = 1, …, T, where at is the label/target action that optimizes the return.
By analogy to computer vision :
s = image
a = class
Likelihood probability of class a given image s :
π(a|s)
It corresponds to the policy function π in RL.
Supervised Learning
Loss function :
minw Lπ(w) = Σt cross_entropy ( πw(at|st) , π*(at|st) )
= - Σt log πw(at|st) . π*(at|st)
In SL, we know the best action a* to take to optimize the reward function (i.e. the cross-entropy loss).
(Figure : the target probability π*(a|s) puts all its mass (value 1) on the best action a* among a1, a2, …, aK; πw(a|s) is trained to match it.)
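A minimal PyTorch sketch of this cross-entropy policy loss, assuming the network outputs logits and the teacher provides target probabilities (the names are mine, not the lecture's).

import torch

def imitation_loss(policy_logits, target_probs):
    """Cross-entropy between pi_w(a|s) (from logits) and the teacher's target pi*(a|s), averaged over states."""
    log_pi = torch.log_softmax(policy_logits, dim=-1)
    return -(target_probs * log_pi).sum(dim=-1).mean()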
RL and SL Similarities
Supervised learning :
∇w ( - Lπ(w) ) = ∇w ( Σt log πw(at|st) . π*(at|st) )
= Σt ∇w log πw(at|st) . π*(at|st)
Reinforcement learning :
∇w ( - LL(w) ) = Σt ∇w log πw(at|st) . Q(at|st)
The value Q(at|st) in RL plays the role of the target probability π*(at|st) in SL.
Outline
RL :
Agent interacts with the environment to understand it and improve its policy to get the
best possible reward from this environment.
Agent learns implicitly the environment.
Planning :
We assume the environment is known (or it was learned).
The agent can query the simulated environment multiple times (no real action is taken in the real environment), and decides what to do from these queries.
The agent is “thinking” what to do before acting.
Examples :
Game of Go :
Play the move that is known to be the best.
Try a new strategy.
Online advertisement :
Show the consumer the most successful ads.
Try a new ad.
Model-based RL :
Learn a model that acts as a proxy of the environment.
This model can be used to interact virtually with the environment, just as we would interact with the real environment.
A model is essential to do a look-ahead search by querying the environment with some actions and observing the consequences/rewards of these actions.
This allows the agent to think ahead and make better actions in high-dimensional spaces.
Questions?