
CS443-RLHF

RLHF re-implementation for UIUC's CS 443: Reinforcement Learning

Dataset

We use a preprocessed version of Anthropic's RLHF dataset to train our reward model and then fine-tune our LLM. The dataset consists of prompts followed by pairs of responses that are human-labeled as chosen or rejected. Chosen responses are both helpful and harmless, while rejected responses contain explicit or offensive material that fine-tuning should suppress.
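As a minimal sketch of this pairwise structure (field names `chosen`/`rejected` and the `\n\nHuman:`/`\n\nAssistant:` turn markers are assumptions based on the public hh-rlhf release, not necessarily the exact preprocessed format used here), each record holds two full conversations sharing a prompt:

```python
# Hypothetical example record in the hh-rlhf pairwise format.
example = {
    "chosen": "\n\nHuman: How do I bake bread?\n\nAssistant: Mix flour, water, yeast, and salt, knead, proof, and bake.",
    "rejected": "\n\nHuman: How do I bake bread?\n\nAssistant: Figure it out yourself.",
}

def split_prompt_response(text: str) -> tuple[str, str]:
    """Split a conversation into the shared prompt and the final assistant response."""
    prompt, _, response = text.rpartition("\n\nAssistant:")
    return prompt, response.strip()

prompt, chosen_resp = split_prompt_response(example["chosen"])
_, rejected_resp = split_prompt_response(example["rejected"])
```

The reward model only needs to score the two responses conditioned on the same prompt, which is why the pair shares its `Human:` turn.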

Reward Model Training

Code for reward model training can be found in reward_model.ipynb. We use GPT2 with a text classification head as our reward model.
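A reward model with a classification head is typically trained with the Bradley-Terry pairwise loss, maximizing the margin between scores for chosen and rejected responses. The sketch below shows that loss in plain Python under the standard formulation; it is an illustration, not code taken from `reward_model.ipynb`:

```python
import math

def pairwise_reward_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry loss: -log sigmoid(r_chosen - r_rejected).

    Written as log1p(exp(-diff)) for numerical stability. The loss is
    log(2) when the scores tie and shrinks as the chosen response is
    scored increasingly above the rejected one.
    """
    diff = r_chosen - r_rejected
    return math.log1p(math.exp(-diff))
```

In practice the two scalars come from the same GPT-2 classifier run over the chosen and rejected sequences, and the loss is averaged over a batch of pairs.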

Proximal Policy Optimization

Code for PPO training is in rlhf_script.py, and evaluation is in eval.ipynb. The models compared are listed below.

  1. GPT2-small
  2. GPT2-PPO
  3. Third Party GPT2 RLHF
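For reference, the objective PPO maximizes per token is the clipped surrogate, with the reward shaped by a KL penalty against the reference (pre-fine-tuning) policy. This is a generic sketch of the standard formulation (the coefficient names `eps` and `beta` are illustrative), not code from `rlhf_script.py`:

```python
def ppo_clip_objective(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Clipped PPO surrogate: min(r*A, clip(r, 1-eps, 1+eps)*A).

    The clip keeps the policy update from moving the probability ratio
    too far from 1 in a single step.
    """
    clipped_ratio = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return min(ratio * advantage, clipped_ratio * advantage)

def kl_shaped_reward(reward_model_score: float,
                     logp_policy: float,
                     logp_ref: float,
                     beta: float = 0.1) -> float:
    """Reward-model score minus a per-token KL penalty toward the reference model."""
    return reward_model_score - beta * (logp_policy - logp_ref)
```

The KL term is what keeps GPT2-PPO from drifting into degenerate text that games the reward model while still scoring well on it.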
