Lecture 15 : Deep Reinforcement Learning
Semester 1 2019/20
Xavier Bresson
School of Computer Science and Engineering
Data Science and AI Research Centre
Nanyang Technological University (NTU), Singapore
Material :
David Silver :
Tutorial on Deep Reinforcement Learning, ICML’16
UCL Course on Reinforcement Learning, 2015
Pieter Abbeel :
Talk at Simons Institute for the Theory of Computing, Berkeley, 2017
Mario Martin :
Slides on Policy Search: Actor-Critic and Gradient Policy search, 2019
Outline
Deep Learning
DL defines a framework to learn the best possible representation of data that can solve tasks
like classification, regression, recommender systems, etc.
DL is a supervised learning technique :
Data are all labeled.
Research and industrial breakthrough applications in CV (autonomous cars, etc), Speech Recognition (speech-to-text and text-to-speech), NLP (machine translation, etc).
Deep Learning
Potential revolutionary applications in other fields like physics (simulation acceleration, particle detection with complex geometry and 10^8 sensors, generative models for machine calibration) and chemistry (drug and material design).
Learn deep compositional parametric functions by backpropagation on the loss function.
Loss functions are continuous and differentiable.
Reinforcement Learning
RL defines a framework to learn the best possible sequence of decisions in a given environment that can solve tasks like playing Go/Chess, making autonomous robots, recommending ads on the internet, etc.
RL is a causal framework.
Time is important – RL is a sequential decision-making technique.
Reinforcement Learning
RL is different from supervised learning, but also has similarities (discussed later).
Applications :
Playing games (super-human performances): Go, Chess, Poker, Atari, StarCraft II
Exploring virtual worlds : Maze
Robots (still preliminary) : Simulation and real autonomous robots
Learning to learn (meta-learning): Learn architectures and weights simultaneously.
End of humans designing deep learning architectures ?
Brain Inspiration
Deep RL
Reinforcement learning has developed powerful algorithms to solve control problems :
Dynamic programming, 1953 (Richard Bellman) :
Solve a difficult problem by breaking it into simpler sub-problems in a recursive manner.
DP can optimally solve low-dimensional control problems by exploring the whole search space.
DP becomes intractable in high-dimensional search spaces.
Solution :
Combine DL + RL = Deep RL
Outline
RL framework :
Splitting agent and environment :
The environment/world can be anything, from a simple world like a Go board to a complex world like the Earth.
The agent can make actions/decisions.
Each action influences/changes the state of the environment (where the
agent lives).
Each action produces a scalar reward from the environment.
Goal of RL :
Design an agent that acts optimally to maximize reward.
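As a rough illustration of this agent/environment split (not from the lecture), here is a minimal sketch of the interaction loop in Python; the ToyEnv environment and random_agent are invented for the example.

import random

class ToyEnv:
    """Hypothetical 1-D world : the agent walks left/right on positions 0..4, reward +1 at position 4."""
    def reset(self):
        self.pos = 0
        return self.pos                      # initial state s1

    def step(self, action):                  # action in {-1, +1}
        self.pos = max(0, min(4, self.pos + action))
        reward = 1.0 if self.pos == 4 else 0.0
        done = (self.pos == 4)
        return self.pos, reward, done        # next state, scalar reward, end of episode

def random_agent(state):
    return random.choice([-1, +1])           # the agent picks an action given the current state

env = ToyEnv()
state = env.reset()
total_reward = 0.0
for t in range(100):                         # interaction loop : action -> new state + reward
    action = random_agent(state)
    state, reward, done = env.step(action)
    total_reward += reward
    if done:
        break
print("total reward :", total_reward)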
An episode collects the sequence of states, actions and rewards up to time t :
et = s1, a1, s2, r1, a2, s3, r2, … , st, at, st+1, rt
At time step t, the transition is : st, at, st+1, rt
MDP
A Markov Decision Process (MDP) is defined by the states st, the actions at, the state transition probabilities P(st+1|st,at) and the rewards rt.
Example
Pong can be cast as an MDP :
An MDP can be visualized as a graph where :
Each node is either a game state st or an action at.
Each edge is a probabilistic transition given either by the state transition probability
matrix P(st+1|st,at), or the policy function π(at|st) (later discussed).
(Figure : the MDP of PONG as a graph; state nodes s and action nodes a are connected by edges carrying the probabilities π(a|s) and P(s’|s,a).)
PONG : A game state st is the image of the game at time step t. An action at is a move, either Up or Down.
Outline
Properties of RL
The goal of RL is to act optimally to maximize a value function that represents the total
reward of a sequence of actions.
Policy
Deterministic policy :
π(s) = the single action a taken in state s.
Policy
Stochastic policy :
π(a|s) = Probability of action a in state s
Given stochastic π, a new episode can be drawn as follows :
et = s1, a1 ~ π(a|s1), s2 ~ P(s|s1,a1), r1, a2 ~ π(a|s2), s3 ~ P(s|s2,a2), r2,
… st ~ P(s|st-1,at-1), at ~ π(a|st), st+1 ~ P(s|st,at), rt
From a given state st, there are multiple possible actions, unlike for the
deterministic policy.
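A minimal sketch (not the demo code) of drawing an action from a stochastic policy stored as a table of probabilities; the states, actions and probabilities below are invented for the example.

import numpy as np

# Hypothetical stochastic policy pi(a|s) : one probability vector over actions per state.
pi = {
    "s1": np.array([0.7, 0.3]),   # pi(a1|s1), pi(a2|s1)
    "s2": np.array([0.5, 0.5]),
}
actions = ["Up", "Down"]

def sample_action(state):
    # Draw a ~ pi(a|state) : repeated calls from the same state can return different actions.
    return np.random.choice(actions, p=pi[state])

print([sample_action("s1") for _ in range(5)])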
Policy
(Figure : a discrete probability distribution Pr(X) over values x1, x2, x3, …, xK; a stochastic policy π(a|s) is such a distribution over the actions.)
Policy
Given the probability functions π(a|s) and P(s’|s,a),
A new episode et can be drawn by rolling out/sampling π and P :
et = s1, a1 ~ π(a|s1), s2 ~ P(s|s1,a1), r1, a2 ~ π(a|s2), s3 ~ P(s|s2,a2), r2,
… st ~ P(s|st-1,at-1), at ~ π(a|st), st+1 ~ P(s|st,at), rt
(Figure : the tree of all possible rollouts s1 → a1 → s2 → a2 → … → st+1, where each action is sampled from π(at|st) and each next state from P(st+1|st,at).)
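A minimal rollout sketch for a hypothetical tabular MDP; the tables pi, P and the reward r below are invented, not the lecture's.

import numpy as np

# Hypothetical 3-state MDP : pi(a|s), P(s'|s,a) and r(s,a,s') are given as tables (all made up).
states, actions = ["s1", "s2", "s3"], ["a1", "a2"]
pi = {s: np.array([0.5, 0.5]) for s in states}                             # pi(a|s)
P = {(s, a): np.array([0.2, 0.3, 0.5]) for s in states for a in actions}   # P(s'|s,a)

def r(s, a, s_next):                                                       # immediate reward r(s,a,s')
    return 1.0 if s_next == "s3" else 0.0

def rollout(s, T=5):
    """Sample an episode e = s1, a1, s2, r1, a2, s3, r2, ... by alternating draws from pi and P."""
    episode = []
    for t in range(T):
        a = np.random.choice(actions, p=pi[s])           # a_t ~ pi(a|s_t)
        s_next = np.random.choice(states, p=P[(s, a)])   # s_{t+1} ~ P(s|s_t,a_t)
        episode.append((s, a, s_next, r(s, a, s_next)))
        s = s_next
    return episode

print(rollout("s1"))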
Policy
Maze example :
Actions are : Up, Down, Left, Right
States are : Agent’s locations
Model
Maze example :
Actions are : Up, Down, Left, Right
States are : Agent’s locations
Model is :
Dynamics of the environment modeled by P(s|st,at)
Immediate reward rt(st,at,st+1) given by the environment
State-Value Function
The state value function Q(s) is the expected future reward for being in state s.
It is used to evaluate the goodness/badness of each state s.
It acts as the label information in supervised learning (later discussed).
Let us compute the value function Q in state st considering only one episode :
et = st, at ~ π(a|st), st+1 ~ P(s|st,at), rt, at+1 ~ π(a|st+1), st+2 ~ P(s|st+1,at+1), rt+1, …
Q(st) = rt + γ.rt+1 + γ^2.rt+2 + … = Σ_{k=0}^{∞} γ^k.rt+k
The value function Q(st) is the total discounted reward (a.k.a. return) we will receive
for being in state st.
State-Value Function
Discounted reward using hyper-parameter γ :
Q(st) = rt + γ.rt+1 + γ^2.rt+2 + … = Σ_{k=0}^{∞} γ^k.rt+k
Exponential decay factor γ^k :
0 < γ < 1
γ = 0.9 ⇒ γ^k ≈ 0 for k ≥ 50 : short-term consequences
γ = 0.99 ⇒ γ^k ≈ 0 for k ≥ 500 : long-term consequences (harder to learn)
Justifications of γ :
Math :
Avoid infinite total reward with value γ<1.
Model uncertainty about the policy network :
The action taken is not absolutely certain to be optimal, so it is better to
discount its future reward.
Short-term return is usually preferred over long-term return (e.g. in finance).
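A short helper (not from the lecture) that computes the discounted return from a list of rewards with a single backward pass.

def discounted_return(rewards, gamma=0.9):
    """Q = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ..., computed backwards in one pass."""
    Q = 0.0
    for r in reversed(rewards):
        Q = r + gamma * Q
    return Q

print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))   # 1 + 0.9 + 0.81 = 2.71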
State-Value Function
Maze example :
Actions are : Up, Down, Left, Right
States are : Agent’s locations
Model is :
Dynamics of the environment modeled by P(s|st,at)
Immediate reward rt(st,at,st+1) = -1 given by the environment
State-Value Function
Let us compute the value function Q in state st considering all possible episodes drawn
from π and P :
et = st, at ~ π(a|st), st+1 ~ P(s|st,at), rt, at+1 ~ π(a|st+1), st+2 ~ P(s|st+1,at+1), rt+1, …
Q(st) = 𝔼π,P ( rt + γ.rt+1 + γ^2.rt+2 + … )
Q(st) is the expected total discounted reward for being in state st.
State-Value Function
The Bellman equation of the value function is stationary, i.e. independent of the time step t :
Let st, at, st+1, rt be re-written as
s, a, s’, r
Let Q(st) = 𝔼π,P ( rt + γ.Q(st+1) ) be re-written as
Q(s) = 𝔼π,P ( r(s,a,s’) + γ.Q(s’) )
= Σa π(a|s) Σs’ P(s’|a,s) ( r(s,a,s’) + γ.Q(s’) )
(Figure : one-step look-ahead from state s; an action a is drawn from π(a|s) and a next state s’ from P(s’|s,a), yielding the reward r(s,a,s’).)
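The Bellman equation above can be turned into a fixed-point iteration (iterative policy evaluation). A minimal sketch on a randomly generated tabular MDP, with P, R and pi stored as numpy arrays; all the numbers are invented for the example, this is not the lecture's code.

import numpy as np

# Toy tabular MDP (made up) : P[s,a,s'], R[s,a,s'] and a uniform policy pi[s,a].
nS, nA, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.random((nS, nA, nS)); P /= P.sum(axis=2, keepdims=True)   # P(s'|s,a)
R = rng.random((nS, nA, nS))                                      # r(s,a,s')
pi = np.full((nS, nA), 1.0 / nA)                                  # pi(a|s)

Q = np.zeros(nS)                       # state-value function, written Q(s) in the lecture's notation
for _ in range(200):                   # fixed-point iteration of the Bellman equation
    Q = np.einsum("sa,sat,sat->s", pi, P, R + gamma * Q[None, None, :])
print(Q)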
Action-Value Function
Let us compute the value function Q of taking action at in state st considering only one episode :
et = st, at selected, st+1 ~ P(s|st,at), rt, at+1 ~ π(a|st+1), st+2 ~ P(s|st+1,at+1), rt+1, …
Q(at|st) = rt + γ.rt+1 + γ^2.rt+2 + … = Σ_{k=0}^{∞} γ^k.rt+k
Same equation as state-value function Q(st) except the action at is not a random
variable but an arbitrary selection.
Action-Value Function
Let us compute the value function Q of taking action at in state st considering all possible
episodes drawn from π and P :
et = st, at selected, st+1 ~ P(s|st,at), rt, at+1 ~ π(a|st+1), st+2 ~ P(s|st+1,at+1), rt+1, …
Q(at|st) = 𝔼π,P ( rt + γ.rt+1 + γ^2.rt+2 + … )
Action-Value Function
The Bellman equation of the action-value function is stationary, i.e. independent of the time step t :
Let st, at, st+1, rt , at+1 be re-written as
s, a, s’, r, a’
Let Q(at|st) = 𝔼π,P ( rt + γ.Q(at+1|st+1) ) be re-written as
Q(a|s) = 𝔼π,P ( r(s,a,s’) + γ.Q(a’|s’) )
= Σs’ P(s’|a,s) ( r(s,a,s’) + γ.Σa’ π(a’|s’).Q(a’|s’) )
= 𝔼P ( r(s,a,s’) + γ.Σa’ π(a’|s’).Q(a’|s’) )
(Figure : the action a is selected in state s; a next state s’ is drawn from P(s’|s,a), then a next action a’ from π(a’|s’).)
Outline
Goals of RL :
Find the optimal value function Q*(a|s) :
We search for the maximum reward given by action a in state s.
Find the optimal policy π*(a|s) :
We search for the policy that provides the best action a in state s.
The optimal policy is deterministic, π*(a|s) := π*(s).
Both objectives are equivalent :
Q*(a|s) = maxπ Qπ(a|s) = Qπ*(a|s)
Once the optimal action-value function Q* is found, the problem is solved as it is possible to act optimally :
π*(s) = argmaxa Q*(a|s)
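A minimal sketch of acting greedily from a hypothetical optimal action-value table Q* (the numbers are invented for the example).

import numpy as np

# Hypothetical optimal action-value table Q*(a|s) of shape (number of states, number of actions).
Q_star = np.array([[1.0, 2.0],
                   [0.5, 0.1],
                   [3.0, 3.5]])

def greedy_policy(s):
    # pi*(s) = argmax_a Q*(a|s) : acting optimally once Q* is known.
    return int(np.argmax(Q_star[s]))

print([greedy_policy(s) for s in range(3)])   # [1, 0, 1]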
(Figure : one-step look-ahead tree from state s; the optimal policy π* selects the action a with the largest Q*(a|s), and in each next state s’ the action a’ with the largest Q*(a’|s’).)
The Bellman equation of the optimal action-value function (the maximum Q value for taking
action a in state s) :
Q*(a|s) = maxπ Qπ(a|s)
= maxπ 𝔼π,P ( r(s,a,s’) + γ.Qπ(a’|s’) )
= maxπ Σs’ P(s’|a,s) ( r(s,a,s’) + γ.Qπ(a’|s’) )
= Σs’ P(s’|a,s) ( r(s,a,s’) + γ.maxπ Qπ(a’|s’) )
= Σs’ P(s’|a,s) ( r(s,a,s’) + γ.maxa’ Q*(a’|s’) )
= 𝔼s’ ( r + γ.maxa’ Q*(a’|s’) )
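The Bellman optimality equation also yields a fixed-point iteration, the classical tabular Q-value iteration. A minimal sketch on a random toy MDP (the arrays are invented for the example; this is not the lecture's code).

import numpy as np

# Toy tabular MDP (made up) : P[s,a,s'] and R[s,a,s'] as before.
nS, nA, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.random((nS, nA, nS)); P /= P.sum(axis=2, keepdims=True)
R = rng.random((nS, nA, nS))

Q = np.zeros((nS, nA))                                   # estimate of Q*(a|s)
for _ in range(200):                                     # Bellman optimality backup
    target = R + gamma * Q.max(axis=1)[None, None, :]    # r(s,a,s') + gamma * max_a' Q(a'|s')
    Q = np.einsum("sat,sat->sa", P, target)              # expectation over s' ~ P(s'|s,a)
print(Q.max(axis=1))                                     # resulting optimal state values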
Outline
Deep Q-Learning
(Figure : a deep Q-network (DQN) takes the state s as input and outputs Qw*(a1|s), …, Qw*(aK|s), one value per action; each output regresses the target Q*(a|s), as in supervised learning.)
Deep Q-Learning
We need to estimate :
Q*(a|s) = 𝔼s’ ( r + γ.maxa’ Q(a’|s’) )
= Σs’ P(s’|a,s) ( r(s,a,s’) + γ.maxa’ Q(a’|s’) )
We approximate the analytical optimal value Q* by sampling next states s’ :
Q*(a|s) ≈ 1/|{s’}| Σs’ ( r(s,a,s’) + γ.maxa’ Q(a’|s’) ) (average over the collected next states s’)
≈ r(s,a,s’) + γ.maxa’ Q(a’|s’) (if only one next state s’ is collected)
≈ r(s,a,s’) + γ.maxa’ Q*(a’|s’) (at the optimum, maxa’ Q(a’|s’) = maxa’ Q*(a’|s’))
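A one-line helper (my own, not the lecture's) for the one-sample target above; Q is any mapping from a state to its list of action values, and the done flag for terminal states is a standard practical detail that the slide does not spell out.

def td_target(r, s_next, Q, gamma=0.99, done=False):
    """One-sample estimate of Q*(a|s) ~ r + gamma * max_a' Q(a'|s') from a single transition."""
    return r if done else r + gamma * max(Q[s_next])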
Deep Q-Learning
(Figure : transitions (s, a, s’, r) are collected in a replay memory.)
DQN Algorithm
DQN algorithm :
Initialize w randomly and wBL = w
Repeat :
Roll-out an episode : st, at ~ πw, rt, st+1 and store the data in the replay memory.
Update the action-value function Q*w by mini-batch gradient descent (batch of transitions randomly sampled from the replay memory) :
wk+1 = wk – lr . ∇w ( 1/N Σ(s,a,s’,r) || r(s,a,s’) + γ.maxa’ Q*w_BL(a’|s’) - Q*w(a|s) ||^2 )
Update the baseline network wBL = w if greedy evaluation of network w is better than wBL.
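A minimal PyTorch sketch of one DQN update, assuming a replay memory of (s, a, s’, r, done) tuples, a Q-network q_w and a frozen baseline/target copy q_bl; these names are mine, not the demo's, and the sketch omits the rollout and baseline-update steps.

import random
import torch
import torch.nn as nn

def dqn_update(q_w, q_bl, memory, optimizer, batch_size=32, gamma=0.99):
    """One DQN gradient step on a random mini-batch of transitions (s, a, s', r, done)."""
    batch = random.sample(memory, batch_size)
    s, a, s2, r, done = map(torch.as_tensor, zip(*batch))
    s, s2, r = s.float(), s2.float(), r.float()
    q_sa = q_w(s).gather(1, a.long().unsqueeze(1)).squeeze(1)     # Q_w(a|s) for the taken actions
    with torch.no_grad():                                         # target uses the frozen baseline network
        target = r + gamma * (1 - done.float()) * q_bl(s2).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, target)                   # || r + gamma max_a' Q_w_BL(a'|s') - Q_w(a|s) ||^2
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()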
DQN Applications
(Figure : Atari games; the agent receives an observation ot and takes an action at; the reward is the game score for this action and state.)
DQN Applications
https://github.com/aleju/mario-ai
Demo 01
Finding optimal strategy to stabilize a pole attached by an un-actuated joint to a cart,
which moves along a frictionless track.
The pendulum starts upright, and the goal is to prevent it from falling over by
increasing and reducing the cart's velocity.
The cart-pole environment is the “MNIST” experiment for RL.
Use it to develop new RL architectures !
Demo 01
Replay Memory
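A simplified sketch of such a replay memory (my own version, not the demo's code).

import random
from collections import deque

class ReplayMemory:
    """Minimal replay memory : store (s, a, s', r, done) transitions and sample random mini-batches."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)       # oldest transitions are dropped automatically

    def push(self, s, a, s_next, r, done):
        self.buffer.append((s, a, s_next, r, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)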
Demo 01 : Q network
Compute Q scores given a state.
Select an action given a state.
Read (s,a,s’,r) from memory.
Standard MSE loss.
Demo 01 : Rollout class
Rollout one episode.
Write (s,a,s’,r) in memory.
Demo 01
Rollout one episode.
Compute loss, gradient, update.
Demo 01
Train one epoch by rolling out a few episodes.
Train multiple epochs.
Update the baseline network if greedy evaluation of the current Q network is better.
Stopping condition (high reward).
DQN Algorithm
Advantages :
Memory Replay module memorizes the history of (state, action, reward) transitions.
Examples : All professional games of Go, Chess, etc can be memorized.
Memory Replay is an attractive feature on the path toward AI.
Challenges :
Hard to compute a good baseline.
Hard to control the exploration of the space (state,action).
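A common remedy for the exploration issue (not covered in the slides) is epsilon-greedy action selection; a minimal sketch.

import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon pick a random action (exploration), otherwise the greedy one (exploitation)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])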
Outline
Policy Networks
Find the optimal policy that maximizes an expected reward :
Local discounted rewards :
Optimize the expected reward of all actions and states :
maxπ 𝔼a,s ( Q(a|s) )
Q(a|s) : Local discounted rewards for taking action a in state s, e.g. in the cartpole simulation.
Global/sparse rewards :
Optimize the expected reward of all episodes/trajectories :
maxπ 𝔼e~π,P ( Q(e) )
Q(e) : Global reward such as win/loss in Go or the length of a cartpole episode.
Policy Gradient
Policy gradient :
∇w 𝔼a,s ( Q(a|s) ) = ∇w ( Σa,s πw(a|s) . Q(a|s) )
= Σa,s ( ∇w πw(a|s) ) . Q(a|s)
= Σa,s πw(a|s) ( ∇w πw(a|s) / πw(a|s) ) . Q(a|s)
= Σa,s πw(a|s) ∇w log πw(a|s) . Q(a|s)
= 𝔼a,s ( ∇w log πw(a|s) . Q(a|s) )
= ∇w ( 𝔼a,s ( log πw(a|s) . Q(a|s) ) )
Policy loss for local discounted rewards :
LL(w) = 𝔼a,s ( - log πw(a|s) . Q(a|s) )
minw LL(w)
Policy gradient :
∇w LL(w) = 𝔼a,s ( - ∇w log πw(a|s) . Q(a|s) )
Loss Approximation
Loss for one episode :
LL(w) = 𝔼a,s ( - log πw(a|s) . Q(a|s) )
≈ 1/T Σt - log πw(at|st) . Q(at|st)
REINFORCE Algorithm
Repeat :
Collect a mini-batch of M episodes em by rolling out the current policy network πw.
Update w by a gradient step on the mini-batch estimate of ∇w LL(w).
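A minimal PyTorch sketch of one REINFORCE update with local discounted rewards; policy is assumed to be a module mapping a state tensor to action logits, and the z-score normalization mirrors Demo 02 (the naming is mine, not the demo's).

import torch
from torch.distributions import Categorical

def reinforce_update(policy, optimizer, episodes, gamma=0.99):
    """One REINFORCE step on a mini-batch of episodes, each given as (states, actions, rewards)."""
    loss = 0.0
    for states, actions, rewards in episodes:
        returns, Q = [], 0.0
        for r in reversed(rewards):                     # discounted returns Q(a_t|s_t)
            Q = r + gamma * Q
            returns.insert(0, Q)
        returns = torch.tensor(returns)
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)   # z-score, as in Demo 02
        for s, a, R in zip(states, actions, returns):
            logp = Categorical(logits=policy(s)).log_prob(torch.as_tensor(a))   # log pi_w(a_t|s_t)
            loss = loss - logp * R                      # - log pi_w(a_t|s_t) . Q(a_t|s_t)
    loss = loss / len(episodes)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return float(loss)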
Demo 02
Policy network
Select an action a given a state s.
Select the action by Bernoulli random sampling to explore the space of (state, action).
Compute the discounted rewards.
Z-score the discounted rewards.
Loss.
Demo 02 : Rollout class
Rollout one episode.
Demo 02
Rollout one episode.
Compute loss, gradient, update.
Demo 02
Train one epoch by rolling out a few episodes.
Train multiple epochs.
Stopping condition (high reward).
The expected global reward sums over all possible episodes : Σe Pe(e|π,P) . Q(e), where Pe(e|π,P) is the probability of episode e given π and P.
(Figure : the tree of all possible episodes s1 → a1 → s2 → a2 → … → st+1, each branch weighted by π(at|st) and P(st+1|st,at).)
Policy Gradient
Policy gradient :
∇w 𝔼e~π,P ( Q(e) ) = ∇w ( Σe Pe(e|πw,P) . Q(e) )
= Σe ∇w Pe(e|πw,P) . Q(e)
= Σe Pe(e|πw,P) ( ∇w Pe(e|πw,P) / Pe(e|πw,P) ) . Q(e)
= Σe Pe(e|πw,P) ∇w log Pe(e|πw,P) . Q(e)
= 𝔼e ( ∇w log Pe(e|πw,P) . Q(e) )
≈ 1/M Σm ∇w log Pe(em|πw,P) . Q(em) (approximation of the expected value with a mini-batch of M episodes)
where the probability of the mth episode is Pe(em|πw,P) = ∏t πw(atm|stm) . P(st+1m|stm,atm)
Policy Gradient
Policy gradient :
∇w 𝔼e~π,P ( Q(e) ) ≈ 1/M Σm ∇w log Pe(em|πw,P) . Q(em)
≈ 1/M Σm ∇w log ( ∏t πw(atm|stm) . P(st+1m|stm,atm) ) . Q(em)
≈ 1/M Σm ∇w ( Σt log πw(atm|stm) + Σt log P(st+1m|stm,atm) ) . Q(em)
The transition term Σt log P(st+1m|stm,atm) does not depend on w, so its gradient vanishes.
Policy loss for global reward :
LG(w) = 1/M Σm Σt - log πw(atm|stm) . Q(em)
minw LG(w)
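A minimal PyTorch sketch of the global-reward policy loss LG(w), where each episode carries a single scalar reward Q(em); the naming is mine, not the demo's.

import torch
from torch.distributions import Categorical

def global_reward_loss(policy, episodes):
    """L_G(w) = 1/M sum_m sum_t - log pi_w(a_t|s_t) . Q(e_m), with one scalar reward Q(e_m) per episode."""
    loss = 0.0
    for states, actions, episode_reward in episodes:
        for s, a in zip(states, actions):
            loss = loss - Categorical(logits=policy(s)).log_prob(torch.as_tensor(a)) * episode_reward
    return loss / len(episodes)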
Demo 03
Demo 03 : Policy network
Compute the action probabilities given a state s.
Select an action a given a state s.
Loss : compare the current policy with the baseline.
Demo 03 : Rollout class
Rollout one episode.
Demo 03
Rollout one episode for the current and baseline policies.
Compute loss, gradient, update.
Stopping condition (high reward).
Policy Algorithms
Outline
Actor-Critic Algorithm
Monte-Carlo policy gradient (i.e. policy gradient with local discounted rewards), a.k.a. REINFORCE, has a high variance (because of Q(at|st)) :
∇w LL(w) = Σt - ∇w log πw(at|st) . Q(at|st)
where Q(at|st) = rt + γ.rt+1 + γ^2.rt+2 + … + γ^{T-t}.rT = Σ_{k=t}^{T} γ^{k-t}.rk
The variance of the policy (actor) can be reduced by constraining Q(at|st) (critic) to satisfy the Bellman equation :
Q(at|st) = 𝔼π,P ( rt + γ.Q(at+1|st+1) )
≈ rt + γ.Q(at+1|st+1) (during one episode)
Q(a|s) will be estimated with a neural network.
One-Step Actor-Critic (QAC) loss :
LQAC(w,v) = Σt - log πw(at|st) . Qv(at|st) + Σt || Qv(at|st) – ( rt + γ.Qv(at+1|st+1) ) ||^2
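A minimal PyTorch sketch of the QAC loss over one episode; policy and q_net are assumed torch modules mapping a state to action logits / Q values (my naming), and detaching the critic target is a common practical choice that the slide does not spell out.

import torch
from torch.distributions import Categorical

def qac_loss(policy, q_net, episode, gamma=0.99):
    """One-step actor-critic loss over one episode (states, actions, rewards) : actor term + Bellman residual."""
    states, actions, rewards = episode
    actor_loss, critic_loss = 0.0, 0.0
    for t in range(len(rewards) - 1):
        q_t = q_net(states[t])[actions[t]]                      # Q_v(a_t|s_t)
        q_t1 = q_net(states[t + 1])[actions[t + 1]]             # Q_v(a_{t+1}|s_{t+1})
        logp = Categorical(logits=policy(states[t])).log_prob(torch.as_tensor(actions[t]))
        actor_loss = actor_loss - logp * q_t.detach()           # - log pi_w(a_t|s_t) . Q_v(a_t|s_t)
        critic_loss = critic_loss + (q_t - (rewards[t] + gamma * q_t1.detach())) ** 2
    return actor_loss + critic_loss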
QAC Algorithm
Demo 04
AAC Algorithm
Policy (actor).
Evaluate the action-value Q function (critic).
Evaluate the state-value function (critic).
Action selected by the policy.
Demo 05 : Loss
Reward must be negative.
State-value function loss for one episode.
Action-value Q function loss for one episode.
Demo 05 : Rollout class
Rollout one episode.
Write (s,a,s’,r) in memory.
Demo 05
Rollout one episode for the current and baseline policies.
Compute loss, gradient, update.
Demo 05
Train one epoch by rolling out a few episodes.
Train multiple epochs.
Stopping condition (high reward).
Continuous control :
Deterministic policy and continuous action space :
a = π(s), with a ∈ R (continuous variable)
Find the deterministic policy π that maximizes the expected total reward over all possible actions and states :
maxπ 𝔼a=π(s),s ( Q(a|s) )
= maxπ Σa=π(s),s Q(a|s)
= maxπ Σs Q(π(s)|s)
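A minimal PyTorch sketch of the corresponding actor objective for continuous actions (a DDPG-style update; policy and q_net are hypothetical modules, not the lecture's code).

import torch

def deterministic_policy_loss(policy, q_net, states):
    """Continuous-control actor objective : maximize sum_s Q(pi(s)|s), i.e. minimize its negative."""
    actions = policy(states)                   # a = pi(s), differentiable w.r.t. the policy weights
    return -q_net(states, actions).mean()      # gradient ascent on Q via gradient descent on -Q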
Further Improvements
Applications
Outline
Supervised Learning
SL is learning with a teacher who tells the best action to take given any state of the environment.
Training set for SL :
(st, at), t = 1, …, T, where at is the label/target action that optimizes the return.
By analogy to computer vision :
s = image
a = class
Likelihood probability of class a given image s :
π(a|s)
It corresponds to the policy function π in RL.
Supervised Learning
Loss function :
minw Lπ(w) = Σt cross_entropy ( πw(at|st) , π*(at|st) )
= - Σt log πw(at|st) . π*(at|st)
In SL, we know the best action a* to take to optimize the reward function (i.e. the cross-entropy loss).
(Figure : the target probability π*(a|s) puts all its mass (value 1) on the best action a* among a1, a2, …, aK; πw(a|s) is trained to match it.)
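A minimal PyTorch sketch of this cross-entropy policy loss, assuming the network outputs logits and the teacher provides target probabilities (the names are mine, not the lecture's).

import torch

def imitation_loss(policy_logits, target_probs):
    """Cross-entropy between pi_w(a|s) (from logits) and the teacher's target pi*(a|s), averaged over states."""
    log_pi = torch.log_softmax(policy_logits, dim=-1)
    return -(target_probs * log_pi).sum(dim=-1).mean()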
RL and SL Similarities
Supervised learning :
∇w ( - Lπ(w) ) = ∇w ( Σt log πw(at|st) . π*(at|st) )
= Σt ∇w log πw(at|st) . π*(at|st)
Reinforcement learning :
∇w ( - LL(w) ) = Σt ∇w log πw(at|st) . Q(at|st)
The value Q(at|st) in RL plays the role of the target probability π*(at|st) in SL.
Outline
RL :
Agent interacts with the environment to understand it and improve its policy to get the
best possible reward from this environment.
Agent learns implicitly the environment.
Planning :
We assume the environment is known (or it was learned).
The agent can query the simulated environment multiple times (no real action is taken in the real environment), and decides what to do from these queries.
The agent is “thinking” what to do before acting.
Examples :
Game of Go :
Play the move that is known to be the best.
Try a new strategy.
Online advertisement :
Show the consumer the most successful ads.
Try a new ad.
Model-based RL :
Learn a model that acts as a proxy of the environment.
This model can be used to interact virtually with the environment, just as we would interact with the real environment.
A model is essential to do a look-ahead search by querying the environment with some actions and observing the consequences/rewards of these actions.
This allows the agent to think ahead and make better actions in high-dimensional spaces.
Questions?