Reinforcement learning (RL) algorithms now rival or exceed human-level play on tasks such as chess and Go, but this progress has not carried over to embodied settings, due both to the massive amounts of data these agents require during training and to their narrow skill sets. In recent years, pioneering work has explored what is needed to build sample-efficient RL models.
We develop a model that builds on previous work (notably the 2023 Bigger, Better, Faster model) and adds the capacity for multi-modal input. We train our model on Atari 2600 games, using raw video frames and audio as observations.
