Description
Student Name
Huy Dang
Model Length
256
Accuracy
54.82
Improvement Description
Reward Function, Advantage + Log ratio clamping
Detailed Write-up
- Advantage clamping prevents NaN/Inf values caused by extreme advantages.
- Log-ratio clamping prevents the probability ratio from exploding. With both clamps in place, RL runs became much more stable.
- Distance-based and negative rewards discourage bad behaviors (missing answer tags, wrong numbers).
- Accuracy plateaued at around 51%, so I did a final push by changing the learning rate to encourage more exploration.
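For readers who want to see the shape of these changes, here is a minimal sketch in plain Python. It is not the code from this run. The clamp bounds, the PPO-style clip range, the `<answer>` tag format, the reward magnitudes, and the function names `clipped_policy_loss` and `shaped_reward` are all assumptions made for illustration.

```python
import math
import re

# All constants are illustrative assumptions, not the values used in the run.
ADV_CLAMP = 5.0          # hypothetical bound on |advantage|
LOG_RATIO_CLAMP = 10.0   # hypothetical bound on |log pi_new - log pi_old|
CLIP_EPS = 0.2           # assumed PPO-style clip range

# Assumed answer format: a single number wrapped in <answer>...</answer>.
ANSWER_RE = re.compile(r"<answer>\s*(-?\d+(?:\.\d+)?)\s*</answer>")


def clamp(x, lo, hi):
    """Restrict x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def clipped_policy_loss(new_logps, old_logps, advantages):
    """Mean clipped surrogate loss with advantage and log-ratio clamping."""
    losses = []
    for new_lp, old_lp, adv in zip(new_logps, old_logps, advantages):
        # Clamp the advantage so one extreme sample cannot dominate the loss.
        adv = clamp(adv, -ADV_CLAMP, ADV_CLAMP)
        # Clamp the log ratio before exponentiating so exp() cannot overflow.
        log_ratio = clamp(new_lp - old_lp, -LOG_RATIO_CLAMP, LOG_RATIO_CLAMP)
        ratio = math.exp(log_ratio)
        clipped_ratio = clamp(ratio, 1.0 - CLIP_EPS, 1.0 + CLIP_EPS)
        losses.append(-min(ratio * adv, clipped_ratio * adv))
    return sum(losses) / len(losses)


def shaped_reward(completion, target):
    """Reward with penalties for missing tags and distance-scaled wrong answers."""
    match = ANSWER_RE.search(completion)
    if match is None:
        return -1.0  # no parseable answer tags: strongest penalty
    predicted = float(match.group(1))
    if predicted == target:
        return 1.0
    # Wrong number: the penalty grows with relative distance, capped at -0.5.
    rel_err = abs(predicted - target) / max(abs(target), 1.0)
    return -0.5 * min(rel_err, 1.0)
```

Without the log-ratio clamp, a log-probability gap of 1000 would make `math.exp` raise an overflow error in this sketch. With the clamp, the loss stays finite. In a tensor framework the same idea would apply elementwise before the ratio is computed.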
Unfortunately, I don't have enough free GPU credits to test for 512 tokens.
GPU Hours
1.5 H1000