- Starcraft 2 Multiple Agents Results with PPO (https://github.com/oxwhirl/smac)
- Every agent was controlled independently and has restricted information
- All the environments were trained with the default difficulty level 7
- No curriculum, just baseline PPO
- Full state information wasn't used for the critic; actor and critic received the same agent observations
- Most results are significantly better by win rate and were trained on a single PC much faster than QMIX (https://arxiv.org/pdf/1902.04043.pdf), MAVEN (https://arxiv.org/pdf/1910.07483.pdf) or QTRAN
- No hyperparameter search
- 4 frames + conv1d actor-critic network
- Miniepoch num was set to 1, higher numbers didn't work
- Simple MLP networks did not work well on hard envs
- python runner.py --train --file rl_games/configs/smac/3m_torch.yaml
- python runner.py --play --file rl_games/configs/smac/3m_torch.yaml --checkpoint 'nn/3m_cnn'
- python runner.py --tf --train --file rl_games/configs/smac/3m_torch.yaml
- python runner.py --tf --play --file rl_games/configs/smac/3m_torch.yaml --checkpoint 'nn/3m_cnn'
- tensorboard --logdir runs
- 2m_vs_1z took about 2 minutes to achieve 100% WR
- corridor took about 2 hours to reach 95+% WR
- MMM2 took 4 hours to reach 90+% WR
- 6h_vs_8z got 82% WR after 8 hours of training
- 5m_vs_6m got 72% WR after 8 hours of training
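
The "4 frames" input mentioned in the notes above can be illustrated with a small frame-stacking helper. This is a minimal sketch, not the rl_games implementation: the class name `FrameStack` and its methods are hypothetical, and it assumes each agent observation is a flat list of features. How the stacked frames are laid out for the conv1d layers is not specified here.

```python
from collections import deque


class FrameStack:
    """Keeps the last `num_frames` observations of one agent (hypothetical helper)."""

    def __init__(self, num_frames=4):
        self.num_frames = num_frames
        self.frames = deque(maxlen=num_frames)

    def reset(self, first_obs):
        # At episode start, fill the whole stack with copies of the first observation.
        self.frames.clear()
        for _ in range(self.num_frames):
            self.frames.append(list(first_obs))
        return self.get()

    def push(self, obs):
        # Appending to a full deque drops the oldest frame automatically.
        self.frames.append(list(obs))
        return self.get()

    def get(self):
        # Oldest frame first, newest frame last.
        return [list(f) for f in self.frames]
```

The network would then receive the stacked result (shape `num_frames x obs_size`) instead of a single observation.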
FPS in these plots is calculated on a per-environment basis, except for MMM2, where it was scaled by the number of agents (10). To get the win rate per number of environment steps, as used in the plots of the QMIX, MAVEN, QTRAN and Deep Coordination Graphs (https://arxiv.org/pdf/1910.00091.pdf) papers, the FPS numbers under the horizontal axis should be divided by the number of agents in the player's team.
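
As a small illustration of the conversion described above, the following sketch divides an agent-scaled step count by the team size. The function name `to_env_steps` is hypothetical and not part of the repository; the only figure taken from the text is the 10 agents of MMM2.

```python
def to_env_steps(scaled_steps, num_agents):
    """Convert a step count scaled by team size into environment steps."""
    if num_agents <= 0:
        raise ValueError("num_agents must be positive")
    return scaled_steps / num_agents


# Example: a plot value of 10 million on the MMM2 axis (10 agents)
# corresponds to 1 million environment steps.
mmm2_env_steps = to_env_steps(10_000_000, 10)
```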
Link to the continuous results
Currently Implemented:
- DQN
- Double DQN
- Dueling DQN
- Noisy DQN
- N-Step DQN
- Categorical
- Rainbow DQN
- A2C
- PPO
TensorFlow implementations of DQN for Atari.
- Double dueling DQN vs DQN with the same parameters
About 90 minutes to learn with this setup.
- Different DQN Configurations tests
Light grey is noisy 1-step dddqn. Noisy 3-step dddqn was even faster. The best network (configuration 5) needs about 20 minutes to learn on an NVIDIA 1080. Currently the best setup for Pong is a noisy 3-step double dueling network. The different experiments can be found in pong_runs.py. It takes less than 200k frames to reach a score > 18. DQN has more optimistic Q value estimations.
These results are not stable; they are just the best games. For good average results you need to train the network for more than 10 million steps. Some games need 50m steps.
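
The difference between the plain DQN and double DQN bootstrap values, combined with an n-step return, can be sketched as follows. This is a generic textbook-style illustration under simplifying assumptions (Q values given as plain Python lists for a single transition), not the code used in this repository. All function names are hypothetical.

```python
def n_step_return(rewards, gamma):
    """Discounted sum r_0 + gamma * r_1 + ... over the n collected rewards."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))


def dqn_bootstrap(q_target_next):
    # Plain DQN: the target network both selects and evaluates the action,
    # which tends to give more optimistic estimates.
    return max(q_target_next)


def double_dqn_bootstrap(q_online_next, q_target_next):
    # Double DQN: the online network selects the action,
    # the target network evaluates it.
    best_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return q_target_next[best_action]


def n_step_target(rewards, gamma, bootstrap, done):
    """n-step target: discounted rewards plus a discounted bootstrap value."""
    target = n_step_return(rewards, gamma)
    if not done:
        target += (gamma ** len(rewards)) * bootstrap
    return target
```

With the same Q values, the double DQN bootstrap is never larger than the plain DQN one, since the plain version always takes the maximum of the target network's estimates.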
- 5 million frames two step noisy double dueling dqn:
- Random lucky game in Space Invaders after less than one hour of learning:
- More than 2 hours for Pong to achieve a score of 20 with one actor playing.
- 8 Hours for Supermario lvl1
- PPO with LSTM layers