On The Hardness of Reinforcement Learning With Value-Function Approximation
Building Reproducible, Reusable and Robust Deep RL Systems - Joelle Pineau
Reinforcement Learning on Hundreds of Thousands of Cores - Henrique Ponde de Oliveira Pinto - OpenAI - scaling the OpenAI DOTA agents
DOTA
- co-ordination
- imperfect info
180 years of games per day
~100,000 CPU cores playing the game
~100 GPUs doing the learning
These need to be connected via a controller (Redis)
- this holds the configs & parameters
- single source of truth
- can easily be backed up to disk
Use Lua scripts inside Redis
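A minimal sketch of the idea (assuming redis-py; the key names, version-counter scheme, and polling helper are my own illustration, not OpenAI's code): a Lua script makes the parameter write and its version bump a single atomic step inside Redis.

```python
import redis

r = redis.Redis(host="localhost", port=6379)

# Lua executes atomically inside Redis: store the new parameter blob and
# bump the version counter in one step, returning the new version.
# (Key names 'params:latest' / 'params:version' are hypothetical.)
PUBLISH_PARAMS = r.register_script("""
redis.call('SET', KEYS[1], ARGV[1])
return redis.call('INCR', KEYS[2])
""")

def push_params(param_bytes: bytes) -> int:
    """Publish a new parameter snapshot; returns its version number."""
    return PUBLISH_PARAMS(keys=["params:latest", "params:version"],
                          args=[param_bytes])

def fetch_params(last_seen: int):
    """Worker-side poll: return (version, params) if newer, else None."""
    version = int(r.get("params:version") or 0)
    if version > last_seen:
        return version, r.get("params:latest")
    return None
```

With the controller as the single source of truth, rollout workers only ever poll it; no worker-to-worker coordination is needed.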
Tutorial: Introduction to Reinforcement Learning with Function Approximation
1:21:30
What causes instability?
Not learning / sampling
- even DP (exact expected updates, no sampling) diverges with function approx
Not exploration
- even policy evaluation of a fixed policy (no exploration needed) can diverge
Not non-linear functions
- even linear function approximation can diverge
Risk of divergence occurs when combining:
- function approximation
- bootstrapping
- off-policy learning
Any two are OK; all three together can diverge (the "deadly triad"; see the sketch below)
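A minimal sketch of the triad in action (the classic one-parameter "w -> 2w" counterexample; the specific numbers are my own illustration): linear function approximation + bootstrapping + off-policy updates on a single transition make the weight blow up.

```python
# Two states share a single weight w via features 1 and 2, so
# v(s1) = w and v(s2) = 2w. An off-policy behaviour distribution keeps
# sampling only the transition s1 -> s2 (reward 0); semi-gradient TD(0)
# then diverges.
gamma = 0.99  # discount factor
alpha = 0.1   # step size
w = 1.0

for _ in range(50):
    # Semi-gradient TD(0): w += alpha * (r + gamma*v(s2) - v(s1)) * grad_w v(s1)
    td_error = 0.0 + gamma * (2.0 * w) - (1.0 * w)
    w += alpha * td_error * 1.0  # feature of s1 is 1

print(w)  # each update multiplies w by 1 + alpha*(2*gamma - 1) > 1, so w explodes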
Can we remove bootstrapping?
- bootstrapping is key to computational/data efficiency
- but it introduces bias (the target depends on the current estimate; see the sketch below)
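For contrast, a minimal sketch (my own notation) of the two targets: removing bootstrapping means waiting for the full Monte Carlo return.

```python
gamma = 0.99

def td_target(reward, v_next):
    # Bootstrapped one-step target: available immediately and low variance,
    # but biased while the estimate v_next is wrong.
    return reward + gamma * v_next

def mc_target(rewards):
    # Monte Carlo return: unbiased, but needs the whole episode
    # and has higher variance.
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```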
1:28:25