RLHF re-implementation for UIUC's CS 443: Reinforcement Learning
We use a preprocessed version of Anthropic's RLHF dataset to train our reward model and then fine-tune our LLM. The dataset consists of prompts paired with responses that are human-labeled as chosen or rejected. Chosen responses are both helpful and harmless, while rejected responses contain explicit or offensive material that fine-tuning should suppress.
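A reward model trained on chosen/rejected pairs is typically fit with a Bradley-Terry style pairwise loss, which pushes the score of the chosen response above that of the rejected one. A minimal sketch in plain Python (the function name and example scores are illustrative, not taken from our notebooks):

```python
import math

def pairwise_reward_loss(chosen_rewards, rejected_rewards):
    """Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected), averaged.

    Uses log(1 + exp(-x)) == -log(sigmoid(x)) for numerical clarity.
    """
    losses = [
        math.log(1.0 + math.exp(-(c - r)))
        for c, r in zip(chosen_rewards, rejected_rewards)
    ]
    return sum(losses) / len(losses)

# Illustrative scores: a larger margin between chosen and rejected gives a lower loss.
loss = pairwise_reward_loss([1.5, 0.2], [0.3, -0.1])
```

In training, the two scores come from running the reward model on the same prompt with the chosen and rejected completions.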
Code for reward model training can be found in reward_model.ipynb. We use GPT2 with a text classification head as our reward model.
Code for PPO training is in rlhf_script.py, and evaluation code is in eval.ipynb. The models used are listed below.
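The core of RLHF fine-tuning with PPO is the clipped policy objective, applied to a reward that penalizes divergence from the reference model. A minimal per-token sketch (function names, the clip range of 0.2, and the KL coefficient beta are illustrative defaults, not values from rlhf_script.py):

```python
import math

def kl_penalized_reward(reward, logprob_policy, logprob_ref, beta=0.1):
    # RLHF shaping: reward model score minus a per-token KL penalty
    # that keeps the policy close to the reference (pre-fine-tuning) model.
    return reward - beta * (logprob_policy - logprob_ref)

def ppo_clip_objective(logprob_new, logprob_old, advantage, clip_eps=0.2):
    # Probability ratio between the updated and the rollout policy.
    ratio = math.exp(logprob_new - logprob_old)
    # Clip the ratio so a single update cannot move the policy too far.
    clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
    # PPO takes the pessimistic (smaller) of the two surrogate objectives.
    return min(ratio * advantage, clipped * advantage)

# Illustrative values: a ratio above 1 + clip_eps is capped for positive advantages.
obj = ppo_clip_objective(logprob_new=1.0, logprob_old=0.0, advantage=1.0)
```

In the actual training loop these quantities are computed per generated token, and the objective is maximized (i.e., its negation is minimized) by gradient ascent on the policy.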