Implementation of Safe Policy Improvement with Baseline Bootstrapping and Safe Policy Improvement with Soft Baseline Bootstrapping
This project can be used to reproduce the finite-MDP experiments presented in:
- the ICML 2019 paper: Safe Policy Improvement with Baseline Bootstrapping, by Romain Laroche, Paul Trichelair, and Rémi Tachet des Combes. (SPIBB)
- the ECML-PKDD 2019 paper: Safe Policy Improvement with Soft Baseline Bootstrapping, by Kimia Nadjahi, Romain Laroche, and Rémi Tachet des Combes. (Soft-SPIBB)
- the AAMAS 2020 paper: Safe Policy Improvement with an Estimated Baseline Policy, by Thiago D. Simão, Romain Laroche, and Rémi Tachet des Combes.
For the DQN implementation of SPIBB and Soft-SPIBB, please refer to the git repository at this address.
The project is implemented in Python 3.5 and requires numpy, scipy, pandas and openpyxl.
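To check that the listed prerequisites are available before running an experiment, a small script along these lines may help. It is only a sketch: the helper name `missing_packages` is hypothetical and is not part of this project.

```python
import importlib.util

# Packages listed in the prerequisites above.
REQUIRED = ("numpy", "scipy", "pandas", "openpyxl")


def missing_packages(names=REQUIRED):
    """Return the top-level package names that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are available.")
```

Any package reported as missing can then be installed with your usual package manager.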
We include the following:
Libraries of the following algorithms:
- basic RL,
- Pi_b-SPIBB,
- Pi_{\leq b}-SPIBB,
- Soft-SPIBB:
  - Exact-Soft-SPIBB (1-step or not),
  - Approx-Soft-SPIBB (1-step or not),
- doubly-robust,
- importance_sampling,
- weighted_importance_sampling,
- weighted_per_decision_IS,
- per_decision_IS,
- Robust MDP,
- and Reward-adjusted MDP.
Environments:
- Gridworld environment,
- Random MDPs environment.
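To give an idea of what the Pi_b-SPIBB entry in the list above does, here is a minimal sketch of its greedy policy-improvement step as described in the SPIBB paper: on state-action pairs observed fewer than a threshold number of times, the baseline policy is copied; the remaining probability mass goes to the best of the other actions. This is an illustration only, not the code of this project, and the names `pi_b_spibb_greedy_step`, `q`, `pi_b`, `counts` and `n_wedge` are assumptions chosen for the example.

```python
import numpy as np


def pi_b_spibb_greedy_step(q, pi_b, counts, n_wedge):
    """One greedy improvement step in the style of Pi_b-SPIBB.

    q, pi_b, counts: arrays of shape (n_states, n_actions).
    Each row of pi_b is assumed to sum to 1.
    """
    q = np.asarray(q, dtype=float)
    pi_b = np.asarray(pi_b, dtype=float)
    # Pairs with too few observations are "bootstrapped" on the baseline.
    bootstrapped = np.asarray(counts) < n_wedge
    # Copy the baseline on bootstrapped pairs, zero elsewhere.
    pi = np.where(bootstrapped, pi_b, 0.0)
    # Probability mass still to be assigned in each state.
    free_mass = 1.0 - pi.sum(axis=1)
    # Best action among the non-bootstrapped ones.
    masked_q = np.where(bootstrapped, -np.inf, q)
    best = masked_q.argmax(axis=1)
    # States where every action is bootstrapped keep the baseline unchanged.
    rows = np.flatnonzero((~bootstrapped).any(axis=1))
    pi[rows, best[rows]] += free_mass[rows]
    return pi


# Two states, three actions; the second state has too little data everywhere.
q = [[1.0, 2.0, 3.0], [0.0, 1.0, 2.0]]
pi_b = [[0.2, 0.5, 0.3], [0.1, 0.6, 0.3]]
counts = [[10, 10, 1], [0, 2, 1]]
pi = pi_b_spibb_greedy_step(q, pi_b, counts, n_wedge=5)
```

In the first state the rarely observed third action keeps its baseline probability and the rest of the mass moves to the second action, which has the highest value among the well-observed ones. In the second state the baseline is returned as is.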
SPIBB experiments:
Gridworld experiment of Section 3.1. Run:
python #name_of_experiment# #random_seed#
Gridworld experiment with random behavioural policy of Section 3.2. Run:
python #name_of_experiment# #random_seed#
Random MDPs experiment of Section 3.3. Run:
python #name_of_experiment# #random_seed#
Soft-SPIBB Random MDPs experiment of Section 4.1. Run:
python #name_of_experiment# #random_seed#
Random MDPs experiment with unknown behaviour policy of Section 4.1 (Figure 1). Run:
python #name_of_experiment# #random_seed#
We DO NOT include the following:
- The hyper-parameter search (Appendix C.2 in the SPIBB paper): it should be easy to re-implement.
- The figure generator: it is currently too tied to our own setup to be understandable to other users, and visualization tools are not hard to re-implement.
- The multi-CPU implementation: its structure depends too heavily on the cluster tools.
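Since the multi-CPU implementation is not included, one simple way to run several random seeds on a single machine is to launch one process per seed. The sketch below is an assumption-laden illustration, not the authors' cluster code: `build_command` and `run_seeds` are hypothetical helpers, and `experiment_args` stands for whatever follows `python` in the run commands above, up to but excluding the seed.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def build_command(experiment_args, seed):
    """Return the argv for one run: interpreter, experiment arguments, then the seed."""
    return [sys.executable, *experiment_args, str(seed)]


def run_seeds(experiment_args, seeds, max_workers=4):
    """Run one process per seed and return a dict mapping seed to return code."""
    seeds = list(seeds)

    def run(seed):
        # Each worker thread only waits on its own child process.
        return subprocess.run(build_command(experiment_args, seed)).returncode

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(seeds, pool.map(run, seeds)))
```

Threads are enough here because the heavy work happens in the child processes; `max_workers` bounds how many experiments run at the same time.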
This project is BSD-licensed.
Please use the following bibtex entry if you use this code for SPIBB:
@inproceedings{Laroche2019,
    title={Safe Policy Improvement with Baseline Bootstrapping},
    author={Laroche, Romain and Trichelair, Paul and Tachet des Combes, R\'emi},
    booktitle={Proceedings of the 36th International Conference on Machine Learning (ICML)},
    year={2019}
}
Please use the following bibtex entry if you use this code for Soft-SPIBB:
@inproceedings{Nadjahi2019,
    title={Safe Policy Improvement with Soft Baseline Bootstrapping},
    author={Nadjahi, Kimia and Laroche, Romain and Tachet des Combes, R\'emi},
    booktitle={Proceedings of the 2019 European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD)},
    year={2019}
}
Please use the following bibtex entry if you use this code for algorithms using an estimate of the behaviour policy:
@inproceedings{Simao2020,
    author = {Sim\~ao, Thiago D. and Laroche, Romain and Tachet des Combes, R\'emi},
    title = {Safe Policy Improvement with an Estimated Baseline Policy},
    booktitle = {Proceedings of the 19th International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS)},
    year = {2020}
}