This is the code for the paper [Penalizing side effects using stepwise relative reachability](https://arxiv.org/abs/1806.01186) by Krakovna et al. (2019). It implements a tabular Q-learning agent with different penalties for side effects. Each side effects penalty consists of a deviation measure (none, unreachability, relative reachability, or attainable utility) and a baseline (starting state, inaction, or stepwise inaction).
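For intuition, here is a minimal sketch (illustrative names only, not the repository's API) of how a relative reachability penalty combines the two ingredients: a deviation measure comparing the current state to a baseline state, scaled by a weight beta:

```python
import numpy as np

def rel_reach_penalty(reach, baseline_state, current_state, beta=0.1):
  """Illustrative relative reachability penalty (not the repo's API).

  Args:
    reach: |S| x |S| array; reach[s, t] estimates the (discounted)
      reachability of state t from state s.
    baseline_state, current_state: integer state indices.
    beta: weight of the penalty relative to the task reward.
  """
  # Deviation: total decrease in reachability relative to the baseline,
  # truncated at zero so only lost options are penalized.
  decrease = reach[baseline_state] - reach[current_state]
  deviation = np.maximum(decrease, 0.0).sum()
  return beta * deviation
```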
Clone the parent repository (side_effects_penalties is a subdirectory of deepmind-research):

```
git clone https://github.com/deepmind/deepmind-research.git
```
Run the agent with a given penalty on an AI Safety Gridworlds environment:

```
python -m side_effects_penalties.run_experiment -baseline <X> -dev_measure <Y> -env_name <Z> -suffix <S>
```
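For example, to train with the stepwise inaction baseline and the relative reachability deviation measure (flag values are taken from the list below; the environment name `box` and the suffix are assumptions, so substitute any valid environment name and suffix):

```
python -m side_effects_penalties.run_experiment -baseline stepwise -dev_measure rel_reach -env_name box -suffix demo
```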
The following parameters can be specified for the side effects penalty:
- Baseline state (`-baseline`): starting state (`start`), inaction (`inaction`), stepwise inaction with rollouts (`stepwise`), or stepwise inaction without rollouts (`step_noroll`)
- Deviation measure (`-dev_measure`): none (`none`), unreachability (`reach`), relative reachability (`rel_reach`), or attainable utility (`att_util`)
- Discount factor for the deviation measure value function (`-value_discount`)
- Summary function to apply to the relative reachability or attainable utility deviation measure (`-dev_fun`): max(0, x) (`truncation`) or |x| (`absolute`)
- Weight of the side effects penalty relative to the reward (`-beta`)
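To make the two summary functions concrete, here is a hedged sketch (illustrative names, not the repository's code) of how `truncation` and `absolute` aggregate per-state deviations:

```python
import numpy as np

def summarize_deviation(diff, dev_fun='truncation'):
  """Aggregate per-state deviations (illustrative, not the repo's code).

  Args:
    diff: array of per-state differences, e.g. baseline reachability
      minus current reachability for every state.
    dev_fun: 'truncation' penalizes only decreases, max(0, x);
      'absolute' penalizes both decreases and increases, |x|.
  """
  if dev_fun == 'truncation':
    return np.maximum(diff, 0.0).sum()
  if dev_fun == 'absolute':
    return np.abs(diff).sum()
  raise ValueError('Unknown dev_fun: {}'.format(dev_fun))
```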
Other arguments:
- AI Safety Gridworlds environment name (`-env_name`)
- Number of episodes (`-num_episodes`)
- Filename suffix for saving result files (`-suffix`)
Make a summary data frame from the result files generated by `run_experiment`:

```
python -m side_effects_penalties.results_summary -compare_penalties -input_suffix <S>
```
Arguments:
- `-bar_plot`: make a data frame for a bar plot (True) or a learning curve plot (False)
- `-compare_penalties`: compare different penalties using the best beta value for each penalty (True), or compare different beta values for a given penalty (False)
- If `compare_penalties=False`, specify the penalty parameters (`-dev_measure`, `-dev_fun` and `-value_discount`)
- Environment name (`-env_name`)
- Filename suffix for loading result files (`-input_suffix`)
- Filename suffix for the summary data frame (`-output_suffix`)
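For example, to build a bar-plot data frame that compares the different penalties (the suffix values are illustrative, and passing `-bar_plot` as a bare boolean flag is an assumption about the Abseil flag parsing):

```
python -m side_effects_penalties.results_summary -compare_penalties -bar_plot -input_suffix demo -output_suffix demo
```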
Import the summary data frame into `plot_results.ipynb` and make a bar plot or learning curve plot.
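If you prefer to plot outside the notebook, a minimal sketch along the same lines (the filename and column names here are hypothetical; use whatever `results_summary` actually wrote for your suffix):

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical path and columns; adjust to the files results_summary produced.
df = pd.read_csv('summary_demo.csv')
sns.barplot(data=df, x='baseline', y='performance')  # column names assumed
plt.show()
```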
Dependencies:

- Python 2.7 or 3 (tested with Python 2.7.15 and 3.6.7)
- AI Safety Gridworlds suite of safety environments
- Abseil Python common libraries
- NumPy
- pandas
- Six
- Matplotlib
- Seaborn
If you use this code in your work, please cite the accompanying paper:
```
@article{srr2019,
  title = {Penalizing Side Effects using Stepwise Relative Reachability},
  author = {Victoria Krakovna and Laurent Orseau and Ramana Kumar and Miljan Martic and Shane Legg},
  journal = {CoRR},
  volume = {abs/1806.01186},
  year = {2019}
}
```