This repository contains a variety of Multi-Agent Reinforcement Learning (MARL) algorithms. Its purpose is to develop new algorithms and it is not intended to be a stable library.
marl is strongly typed and has high code quality standards. Any contribution to this repository is expected to exhibit a similar quality. marl comes with a web interface to visualise the results of your experiments (more info down below).
To install all the dependencies, run uv sync. If you are using a GPU whose support has ended, use the legacy-gpu extra.
$ uv sync # Standard install
$ uv sync --extra legacy-gpu # Install for older GPUs

Set up your experiment according to the examples in create_experiments.py and run it directly with the --run option. The results of the experiment are stored in the logs folder.
$ python src/create_experiments.py --run

When creating your experiment, you can decide which logging method to use (CSV, TensorBoard, Weights & Biases, or Neptune). All log files are stored in the logs folder.
For instance, to check your tensorboard logs, run
$ tensorboard --logdir logs

With the Brave browser, you have to deactivate the Brave shield.
You can also inspect your results with a dedicated web UI. You first have to build the sources, and then serve the files with the serve.py script.
$ cd src/ui
$ npm install # or deno install or bun install
$ npm run build # Build the sources to src/ui/dist.
$ cd ../.. # Go back to the root of the project
$ python src/serve.py

To serve the files in development mode, you need two terminals.
$ cd src/ui && npm run dev # In one terminal
$ python src/serve.py # In another terminal

This repository is aimed at prototyping but tries to follow good software engineering practices as much as possible.
The models module exposes:
- abstract classes that algorithms can work with (e.g. `Actor`, `Critic`, or `QNetwork`);
- implementations of utility objects such as `Experiment`, `Run`, `Batch`, or `ReplayMemory`.
The models module should absolutely not contain implementations of neural networks or algorithms.
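As an illustration of the kind of utility object that belongs in models, here is a minimal, dependency-free sketch of a replay memory with uniform random sampling. It is a simplified stand-in for illustration only, not the repository's actual ReplayMemory API.

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity buffer of transitions with uniform random sampling."""

    def __init__(self, capacity: int):
        # deque with maxlen evicts the oldest transition once full.
        self.buffer = deque(maxlen=capacity)

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size: int):
        # Sample without replacement from the stored transitions.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

memory = ReplayMemory(capacity=3)
for t in range(5):
    memory.add((f"obs{t}", f"action{t}", 0.0))
print(len(memory))  # 3: the two oldest transitions were evicted
```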
- Agent: abstract class that encapsulates the decision-making logic. It exposes the `choose_action()` method and is agnostic to the learning algorithm.
- Trainer: abstract base class for learning algorithms that train agents. Trainers implement the `update_step()` and `update_episode()` methods, expose trainable neural networks, and implement `make_agent()` to produce their corresponding agent.
- Experiment and Run: an `Experiment` is defined by a specific training algorithm, a specific environment, and their related set of parameters. Each `Experiment` is stored in its dedicated folder. An `Experiment` can be run multiple times with different seeds, hence the `Run` class. Every `Run` has its own results stored in its dedicated folder.
- Runner: the runner orchestrates the training/testing loop. It manages the lifecycle of training runs with proper seeding and checkpointing such that test episodes can be replayed.
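To make the division of responsibilities concrete, the following sketch shows how these abstractions can fit together. The class and method names mirror those above, but the signatures and the toy implementations (a random agent, a no-op trainer) are illustrative assumptions, not the repository's actual API.

```python
import random
from abc import ABC, abstractmethod

class Agent(ABC):
    """Encapsulates decision-making; agnostic to the learning algorithm."""

    @abstractmethod
    def choose_action(self, observation):
        ...

class Trainer(ABC):
    """Learning algorithm that trains agents."""

    @abstractmethod
    def update_step(self, transition, step: int):
        ...

    @abstractmethod
    def make_agent(self) -> Agent:
        """Produce the agent corresponding to this trainer."""

class RandomAgent(Agent):
    def __init__(self, n_actions: int):
        self.n_actions = n_actions

    def choose_action(self, observation):
        return random.randrange(self.n_actions)

class DummyTrainer(Trainer):
    def __init__(self, n_actions: int):
        self.n_actions = n_actions

    def update_step(self, transition, step: int):
        pass  # a real trainer would update its networks here

    def make_agent(self) -> Agent:
        return RandomAgent(self.n_actions)

trainer = DummyTrainer(n_actions=4)
agent = trainer.make_agent()
action = agent.choose_action(observation=None)
```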
This module contains neural-network-related classes and functions as well as a model bank. The model bank contains a series of models that each serve a specific purpose (e.g. a CNN Q-network, an MLP Q-network, etc.). Mixing networks such as VDN, QMIX, or QPLEX have their own src/marl/nn/mixers module.
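As an illustration of what a mixer computes, VDN simply sums the per-agent utilities, Q_tot = Σᵢ Qᵢ, whereas QMIX instead learns a state-conditioned monotonic mixing of them. A framework-free sketch of the VDN case (the real mixers operate on batched tensors):

```python
def vdn_mix(per_agent_qs: list[float]) -> float:
    """VDN: the joint action-value is the sum of the individual utilities."""
    return sum(per_agent_qs)

# Chosen-action utilities of three agents for one time step.
q_tot = vdn_mix([1.5, -0.5, 2.0])
print(q_tot)  # 3.0
```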
All classes inherit from the NN abstract class, which provides device management, randomization, and saving/loading.
The web UI is implemented with Vue on the frontend and FastAPI on the backend. The frontend sources are located in the src/ui folder, while the backend is served by src/serve.py.
Each training algorithm has its own dedicated file in the src/marl/training module. This module also contains components that provide intrinsic rewards such as RandomNetworkDistillation.
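The idea behind Random Network Distillation is to reward novelty: a predictor network is trained to imitate a fixed, randomly initialized target network, and the prediction error serves as the intrinsic reward, so the reward shrinks for frequently visited states. The toy sketch below replaces both networks with linear functions to show the mechanism; it is not the repository's implementation.

```python
import random

random.seed(0)
DIM = 4

# Fixed random "target" weights: never trained.
target_w = [random.uniform(-1, 1) for _ in range(DIM)]
# Predictor weights: trained to imitate the target.
pred_w = [0.0] * DIM

def intrinsic_reward(state):
    """Squared prediction error: large for novel states, small for familiar ones."""
    target = sum(w * s for w, s in zip(target_w, state))
    prediction = sum(w * s for w, s in zip(pred_w, state))
    return (target - prediction) ** 2

def train_predictor(state, lr=0.1):
    """One SGD step pulling the predictor toward the fixed target."""
    target = sum(w * s for w, s in zip(target_w, state))
    prediction = sum(w * s for w, s in zip(pred_w, state))
    error = prediction - target
    for i in range(DIM):
        pred_w[i] -= lr * 2 * error * state[i]  # gradient of the squared error

state = [1.0, 0.5, -0.3, 0.8]
before = intrinsic_reward(state)
for _ in range(50):
    train_predictor(state)
after = intrinsic_reward(state)
# The intrinsic reward for a repeatedly visited state vanishes.
```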
| Algorithm | Multi-Objective | Status | Notes |
|---|---|---|---|
| Q-Learning (Tabular) | ✗ | ✓ | Classic tabular approach |
| DQN/IQL | ✓ | ✓ | Independent Q-learning (DQN with mixer=None) |
| VDN | ✓ | ✓ | Value Decomposition Network |
| QMIX | ✓ | ✓ | |
| QPLEX | ? | Almost | Factorization architecture |
| QTRAN | ? | Not tested | Transitivity-aware factorization |
| QATTEN | ? | Not tested | Attention-based mixing |
| IPPO | ? | ✓ | MAPPO with mixer=None |
| MAPPO | ? | ✓ | Multi-Agent PPO with centralized critic |
| DDPG | ✗ | ✗ | Continuous control |
| Option-Critic | ✗ | ? | Hierarchical RL |
| RND | ✓ | ✓ | Random Network Distillation |
| ICM | ✓ | ? | Intrinsic Curiosity Module |
| HAVEN | ✗ | ✗ | Hierarchical MARL with intrinsic motivation |
| REINFORCE | ✗ | ✓ | Policy gradient method |
| AlphaZero/MCTS | ✗ | ? | Tree search-based |
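For reference, the tabular Q-Learning entry boils down to the classic update Q(s,a) ← Q(s,a) + α·(r + γ·maxₐ′ Q(s′,a′) − Q(s,a)). A self-contained sketch of that update, independent of this repository's API:

```python
from collections import defaultdict

ALPHA, GAMMA = 0.5, 0.9
q_table = defaultdict(lambda: [0.0, 0.0])  # two actions per state

def q_update(state, action, reward, next_state):
    """Standard tabular Q-learning (temporal-difference) update."""
    td_target = reward + GAMMA * max(q_table[next_state])
    q_table[state][action] += ALPHA * (td_target - q_table[state][action])

q_update("s0", 0, reward=1.0, next_state="s1")
print(q_table["s0"][0])  # 0.5 = 0.5 * (1.0 + 0.9 * 0 - 0)
```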