Deep reinforcement learning
for time series decision making
Ruxandra Stoean
Further bibliography
R. S. Sutton, A. G. Barto, Reinforcement Learning, second edition: An Introduction
(Adaptive Computation and Machine Learning series), 2018
M. Lapan, Deep Reinforcement Learning Hands-On: Apply modern RL methods to
practical problems of chatbots, robotics, discrete optimization, web automation, and
more, 2nd Edition, 2020
L. Graesser, W. L. Keng, Foundations of Deep Reinforcement Learning: Theory and
Practice in Python, 2019
M. Morales, Grokking Deep Reinforcement Learning, 2020
W. B. Powell, Reinforcement Learning and Stochastic Optimization: A Unified
Framework for Sequential Decisions, 2022
Reinforcement learning
A learning paradigm different from
Supervised learning
Associate input to output in labeled data
Unsupervised learning
Find patterns in unlabeled data
Reinforcement learning
An agent starts in an initial state in an environment
Loop until the target is reached
Experience: take actions -> move to the next state
Get a reward from the environment
Maximize the cumulative reward
Balance exploiting and exploring information
Perform several such episodes (similar to epochs in neural networks)
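A minimal sketch of this agent-environment loop, assuming a Gymnasium-style environment ("CartPole-v1" is only an illustrative choice) and a purely random agent:
```python
import gymnasium as gym

env = gym.make("CartPole-v1")
n_episodes = 5                                   # episodes play the role of epochs

for episode in range(n_episodes):
    state, info = env.reset()                    # agent starts in an initial state
    done, total_reward = False, 0.0
    while not done:                              # loop until a terminal state
        action = env.action_space.sample()       # here: purely random (exploring) agent
        state, reward, terminated, truncated, info = env.step(action)
        total_reward += reward                   # cumulative reward to be maximized
        done = terminated or truncated
    print(f"Episode {episode}: cumulative reward = {total_reward}")
```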
Concepts 1/3
Action taken by the agent in the environment
Environment response to the agent
Reward (Value): feedback to reinforce behavior
State: changes for the agent as a consequence of its action
Loop until a terminal state is reached
Reach destination
Obtain a maximal reward
A number of time steps is reached
Game over
Environment
Deterministic
State transition and reward are deterministic functions
The reward for the same action in a given state is always the same
The specific action in the particular state determines the same next state every time
Stochastic
The reward and the transition to a new state after the same action may differ from
a previous encounter
Concepts 2/3
Policy (π): the strategy followed by the agent in its quest
Optimal, when it maximizes the value
Value function
The expected cumulative reward (return) of a state s if the agent follows the policy π
The state-value function for a policy
Q-value (quality-value) function
The value of the long-term gain if the agent, in state s, takes action a and then follows the
policy π
The action-value function for a policy
Temporal difference (TD)
Computes the estimated value of a state for the policy π, based on the reward received by
the agent and the value of the next state
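A minimal sketch of the TD(0) update for the state-value function under a policy; V is assumed to be a dictionary or array of value estimates, and alpha/gamma are the learning rate and discount factor introduced later:
```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    # move V(s) toward the TD target: the reward plus the discounted value of the next state
    td_target = r + gamma * V[s_next]
    V[s] = V[s] + alpha * (td_target - V[s])
    return V
```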
Exploration-Exploitation Dilemma
Exploitation
Take the best learned action, with the maximum expected reward at a given state
Exploration
Take a random action, without taking rewards into account
Trade-off between exploitation and exploration
Exploitation only: the agent gets stuck in a local optimum
Exploration only: it takes a long time to discover all the information
The ε-greedy policy
A random action is selected with probability ε
The greedy (current best) action is selected with probability 1 − ε
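A sketch of ε-greedy action selection over a tabular Q function, assuming Q is a 2D NumPy array indexed by [state, action]:
```python
import numpy as np

def epsilon_greedy(Q, state, n_actions, epsilon=0.1, rng=np.random.default_rng()):
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # explore: random action with probability epsilon
    return int(np.argmax(Q[state]))           # exploit: best learned action with probability 1 - epsilon
```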
On- and off-policy approaches
On- versus off-policy
On - SARSA (State-Action-Reward-State-Action)
Employs the ε-greedy policy
To estimate the Q-value, it takes the next action a' in the next state s' using the same strategy
target(s') = R(s, a, s') + γ Q_k(s', a')
Q_{k+1}(s, a) = (1 − α) Q_k(s, a) + α [target(s')]
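A sketch of one SARSA update, assuming Q is a 2D NumPy array and a_next is the action actually chosen in s' by the same ε-greedy behavior policy:
```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    target = r + gamma * Q[s_next, a_next]             # target(s') = R(s, a, s') + gamma * Q_k(s', a')
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target   # Q_{k+1}(s, a)
    return Q
```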
Off – Q-learning
ε-greedy behavior policy
To estimate the Q-value, it uses a greedy (max) target policy that picks the best action (with
the maximum value) in the next state s'
target(s') = R(s, a, s') + γ max_{a'} Q_k(s', a') (Bellman equation)
Q_{k+1}(s, a) = (1 − α) Q_k(s, a) + α [target(s')]
Alternative formulation: Q_{k+1}(s, a) = Q_k(s, a) + α [R(s, a, s') + γ max_{a'} Q_k(s', a') − Q_k(s, a)]
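The corresponding Q-learning update, again assuming Q is a 2D NumPy array; note that the target uses the greedy max over a', not the action the behavior policy will actually take:
```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    target = r + gamma * np.max(Q[s_next])             # Bellman target with the greedy next action
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
    # equivalent alternative formulation: Q[s, a] += alpha * (target - Q[s, a])
    return Q
```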
Concepts 3/3
Learning rate α
Values in [0,1]
A value of 0 means no learning (estimates never change)
A value close to 1 (e.g. 0.9) means new information largely overrides the old estimate, i.e. very fast learning
The discount factor γ
Also in [0, 1]
Makes rewards further in the future count less than immediate ones
ε-decay
Start with a high ε, then decrease its value to allow fewer random actions
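A sketch of a common multiplicative ε-decay schedule; the starting value, decay rate and lower bound are illustrative choices, not values prescribed by the slides:
```python
epsilon, epsilon_min, epsilon_decay = 1.0, 0.01, 0.995

for episode in range(1000):
    # ... run one episode, selecting actions epsilon-greedily with the current epsilon ...
    epsilon = max(epsilon_min, epsilon * epsilon_decay)   # fewer random actions over time
```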
Q-learning
Model-free RL approach
A trial-and-error algorithm that learns from action outcomes as it moves through the
environment
It does not construct an internal model of the environment
https://www.baeldung.com/cs/reinforcement-learning-neural-network
Tabular (exact) Q-learning Algorithm
Initialize Q_0(s, a) for all states and actions (to 0)
Repeat
Initialize state s
For k = 1, 2, …
Sample an action a according to policy
Execute a and get next state s’
If s’ is terminal
target(s’) = R(s, a, s’) (reward of transition)
Else
target(s') = R(s, a, s') + γ max_{a'} Q_k(s', a')
Update Q_{k+1}(s, a) = (1 − α) Q_k(s, a) + α [target(s')] to be closer to the target
s = s’
Until number of episodes reached
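A runnable sketch of the tabular algorithm above on a Gymnasium environment with discrete states and actions ("FrozenLake-v1" and the hyperparameters are illustrative assumptions):
```python
import gymnasium as gym
import numpy as np

env = gym.make("FrozenLake-v1")
Q = np.zeros((env.observation_space.n, env.action_space.n))   # Q_0(s, a) = 0
alpha, gamma, epsilon, n_episodes = 0.1, 0.99, 0.1, 5000

for episode in range(n_episodes):
    s, _ = env.reset()                                         # initialize state s
    done = False
    while not done:
        # sample an action according to the epsilon-greedy policy
        if np.random.rand() < epsilon:
            a = env.action_space.sample()
        else:
            a = int(np.argmax(Q[s]))
        s_next, r, terminated, truncated, _ = env.step(a)      # execute a, get next state s'
        done = terminated or truncated
        # target(s') is just the reward for a terminal s', otherwise it is bootstrapped
        target = r if terminated else r + gamma * np.max(Q[s_next])
        Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target       # move Q closer to the target
        s = s_next
```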
Example
Q function parameterized by a function approximator
Q values computed by e.g. a neural network (deep learning) -> get the parameters θ of
the Q function; initially random weights
Iterative regression -> fit the Q-values to the computed targets
Optimizing a squared loss function
Problem: non-stationary targets, catastrophic forgetting
1. Q values for a state and action no longer remain stationary as before, because the neural
network generalizes between states
2. Large swings in the state distribution
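A minimal sketch of a Q function parameterized by a small neural network with a squared loss, using PyTorch as an assumed framework; state_dim and n_actions are placeholders for the problem at hand:
```python
import torch
import torch.nn as nn

state_dim, n_actions = 4, 2                      # placeholders, problem dependent
q_net = nn.Sequential(                           # Q_theta: state -> one Q value per action
    nn.Linear(state_dim, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, n_actions),
)                                                # theta starts as random weights
loss_fn = nn.MSELoss()                           # squared loss between Q_theta(s, a) and the target
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
```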
Approximate Q-learning Algorithm
Initialize Q_0(s, a) for all states and actions (to 0)
Repeat
Initialize state s
For k = 1, 2, …
Sample an action a according to policy
Execute a and get next state s’
If s’ is terminal
target(s’) = R(s, a, s’)
Else
target(s') = R(s, a, s') + γ max_{a'} Q_k(s', a')
Gradient update on the function approximator: θ_{k+1} = θ_k − α ∇_θ E_{s'}[(Q_θ(s, a) − target(s'))²] |_{θ=θ_k}
s = s’
Until number of episodes reached (complete passes of the data)
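A sketch of the gradient update above for a single transition; q_net and optimizer are assumed to be a PyTorch Q network and its optimizer, e.g. the ones defined in the previous sketch:
```python
import torch
import torch.nn.functional as F

def q_gradient_step(q_net, optimizer, s, a, r, s_next, terminal, gamma=0.99):
    s = torch.as_tensor(s, dtype=torch.float32)
    s_next = torch.as_tensor(s_next, dtype=torch.float32)
    with torch.no_grad():                          # the target is treated as a constant
        target = torch.tensor(float(r)) if terminal \
                 else r + gamma * q_net(s_next).max()
    q_sa = q_net(s)[a]                             # Q_theta(s, a)
    loss = F.mse_loss(q_sa, target)                # (Q_theta(s, a) - target(s'))^2
    optimizer.zero_grad()
    loss.backward()                                # theta_{k+1} = theta_k - alpha * gradient
    optimizer.step()
    return loss.item()
```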
DQN Algorithm
Transform Q-learning to a supervised learning task
1. Experience replay buffer
Take an action, get the reward, go to the next state, and store each transition in the buffer
The single online learning update is replaced with a batch update: sample a mini-batch of
past transitions from the buffer -> a more stable update
The data distribution is more stationary
Steadier learning
2. Save a copy of the weights, fixed for some time, to compute the target function
(target network), instead of using the current weights: γ max_{a'} Q_k(s', a', θ⁻)
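A sketch of the two DQN ingredients: a replay buffer storing transitions and a target network kept as a frozen copy of the online network. q_net and optimizer are assumed PyTorch objects as in the earlier sketches, states are assumed to be fixed-length float vectors, and the buffer/batch sizes are illustrative:
```python
import copy
import random
from collections import deque
import torch
import torch.nn.functional as F

replay_buffer = deque(maxlen=10_000)             # stores (s, a, r, s_next, terminal) transitions

def make_target_net(q_net):
    return copy.deepcopy(q_net)                  # weights theta^- kept fixed for some time

def dqn_batch_update(q_net, target_net, optimizer, batch_size=32, gamma=0.99):
    if len(replay_buffer) < batch_size:          # wait until enough transitions are stored
        return
    batch = random.sample(replay_buffer, batch_size)       # mini-batch of past transitions
    s, a, r, s_next, terminal = (torch.as_tensor(x, dtype=torch.float32)
                                 for x in zip(*batch))
    with torch.no_grad():                        # target computed with the fixed weights theta^-
        max_next = target_net(s_next).max(dim=1).values
        target = r + gamma * max_next * (1.0 - terminal)
    q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # periodically: target_net.load_state_dict(q_net.state_dict())
```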
Example: Trading actions in stock time series
Problem
Given a historical stock price time series, decide on the best trading action
BUY
SELL
HOLD
Could also be solved with a recurrent architecture (LSTM, GRU) that estimates the
stock price evolution
Take the estimations and formulate a separate optimization problem to determine
the best trading actions per time step, solved e.g. with evolutionary algorithms
State representation definition
State representation
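A hedged sketch of one common state representation for a trading agent (an assumption for illustration, not necessarily the exact definition used in the lecture code): the state is the sigmoid of consecutive price differences in a sliding window ending at time t:
```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def get_state(prices, t, window_size):
    start = t - window_size + 1
    if start >= 0:
        block = prices[start:t + 1]
    else:                                            # pad the beginning of the series
        block = np.concatenate([np.repeat(prices[0], -start), prices[:t + 1]])
    return sigmoid(np.diff(block)).reshape(1, -1)    # shape (1, window_size - 1)
```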
Portfolio performance
Final plots: transaction history
Final plots: returns across RL episodes
Agent definition
Deep model architecture
Reset, remember transition, take action
Experience replay
Initialize Agent, import data, define actions
Hold, Buy, Sell actions
Logs
RL loop
Predict action from state and execute it
Compute reward
Call experience buffer
In practice, a deque structure is used for the experience memory, which is
larger than the batch on which the model is trained
Updates happen only when the replay buffer length exceeds the batch_size threshold
New memories are pushed in and the oldest ones fall out of the deque (see the sketch below)
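A sketch of this deque-based experience memory and the replay call; the memory and batch sizes are illustrative, and the model fit on the sampled mini-batch is only indicated by a comment:
```python
import random
from collections import deque

memory = deque(maxlen=1000)            # larger than the training batch; oldest items fall out

def remember(state, action, reward, next_state, done):
    memory.append((state, action, reward, next_state, done))

def replay(batch_size=32):
    if len(memory) <= batch_size:      # update only once enough transitions are stored
        return
    minibatch = random.sample(memory, batch_size)
    # ... compute targets and fit the deep Q model on the sampled transitions ...
```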
Save the model at each episode and plot returns across episodes
Evaluation stage on test data
Take the model at episode 10 and try to obtain a portfolio different from 0
Trading actions and their plot
Return by episode
Trading decisions on test data
Further deep RL architectures to avoid
overestimation
DDPG (Deep Deterministic Policy Gradient)
Combines DQN with DPG (Deterministic Policy Gradient)
An actor-critic method (two neural networks)
The actor is a deterministic policy network to determine the action
The critic estimates the Q-value
DDQN (Double DQN)
Two networks: a DQN and a Target Network
The DQN selects the best action (maximum Q-value) for the next state
The target network calculates the estimated Q-value for the selected action (see the sketch below)
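A sketch of the Double DQN target computation, assuming q_net (online DQN) and target_net are PyTorch networks as in the earlier sketches; the online network selects the best next action and the target network evaluates it, which reduces overestimation:
```python
import torch

def ddqn_target(q_net, target_net, r, s_next, terminal, gamma=0.99):
    s_next = torch.as_tensor(s_next, dtype=torch.float32)
    with torch.no_grad():
        best_a = q_net(s_next).argmax(dim=1, keepdim=True)         # selection: online DQN
        q_eval = target_net(s_next).gather(1, best_a).squeeze(1)   # evaluation: target network
    return r + gamma * q_eval * (1.0 - terminal)
```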