This repository contains reinforcement learning implementations for creating agents to control a robot in the Harvard PacBot competition.
I implemented three RL algorithms from scratch: DQN, PPO, and AlphaZero. The main training/evaluation scripts are in src/algorithms/.
The rest of the src/ directory contains supporting Python code.
The game environment, as well as some other performance-sensitive code such as the vectorized Monte Carlo tree search implementation for AlphaZero, is implemented in Rust as a Python extension module (using pyo3 and maturin). This code is located in `pacbot_rs/`.
Make sure you have Poetry and Rust installed.
DQN has generally worked better than PPO and AlphaZero for us, so the main script you should run is src/algorithms/train_dqn.py.
Make sure the Poetry environment is activated (run `poetry shell` within the repo) and that you have installed dependencies with `poetry install`.
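Concretely, from the repository root:

```
poetry install
poetry shell
```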
Initially, and after making changes to any of the Rust code in `pacbot_rs/`, you'll need to build the Rust extension module and install it into the Python environment:

```
pacbot_rs/build_and_install.sh
```

First, make sure your working directory is `src/`.
To start a training run, invoke `train_dqn.py`:

```
python3 -m algorithms.train_dqn
```

- To fine-tune (start from an existing checkpoint instead of initializing the model randomly), pass `--finetune path/to/checkpoint`.
- By default, model checkpoints will be saved to `./checkpoints/`. You can change this by passing `--checkpoint-dir path/to/dir`.
- By default, this will use Weights & Biases to log checkpoints and various metrics to aid debugging and tracking training progress. To disable this, pass `--no-wandb`.
- By default, the `cuda` device is used (if available), falling back to `cpu`. To manually set the PyTorch device, pass `--device <device>`.
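For example, a hypothetical invocation that fine-tunes from an existing checkpoint, writes checkpoints to a custom directory, disables Weights & Biases logging, and forces the CPU might look like this (the paths are placeholders):

```
python3 -m algorithms.train_dqn --finetune path/to/checkpoint --checkpoint-dir path/to/dir --no-wandb --device cpu
```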
You can also change hyperparameters with command-line arguments. To see the full list, pass `--help` or take a look at the `hyperparam_defaults` dictionary near the top of `train_dqn.py`. In particular:
- If you're running out of memory, consider reducing `batch_size`.
- `num_iters` controls the total runtime of the script, and also indirectly the value of $\epsilon$ (epsilon), which controls the amount of random exploration ($\epsilon$ starts at `initial_epsilon` and decreases linearly to `final_epsilon` over `num_iters` iterations; see the sketch after this list). Of course, you can always set `num_iters` to a large value (like the default) and just kill the script once it reaches a satisfactory level of performance.
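To make the decay concrete, here is a minimal sketch of a linear epsilon schedule using the hyperparameter names above; it is only an illustration, and the actual logic and default values in `train_dqn.py` may differ.

```python
def epsilon_at(iteration: int, initial_epsilon: float, final_epsilon: float, num_iters: int) -> float:
    """Linearly interpolate epsilon from initial_epsilon down to final_epsilon over num_iters iterations."""
    frac = min(iteration / num_iters, 1.0)  # clamp so epsilon stays at final_epsilon past num_iters
    return initial_epsilon + frac * (final_epsilon - initial_epsilon)

# Illustrative values (not the script's actual defaults):
# epsilon_at(0,       1.0, 0.05, 100_000)  -> 1.0
# epsilon_at(50_000,  1.0, 0.05, 100_000)  -> 0.525
# epsilon_at(100_000, 1.0, 0.05, 100_000)  -> 0.05 (up to floating-point rounding)
```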
To show an animated visualization in the terminal of the agent being controlled by a particular checkpoint, use `--eval`:

```
python3 -m algorithms.train_dqn --eval path/to/checkpoint.pt
```