Description
Student Name
Swan Hwee
Model Length
256
Accuracy
95.90
Improvement Description
Increased learning rate + format-focused reward function
Detailed Write-up
- Increased LR from 7e-6 to 1e-2 (see the config sketch after this list)
- Reward function optimized for format accuracy (tags + using the right numbers), especially rewarding the presence of the closing answer tag
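For reference, a minimal sketch of how the hyperparameter side of this might look, assuming a TRL-style GRPOTrainer setup. The write-up does not name the training framework; only the learning rate values and the 256-token completion budget come from it, everything else here is a placeholder.

```python
from trl import GRPOConfig

# Hypothetical config sketch: only learning_rate (raised from 7e-6 to 1e-2)
# and the 256-token completion budget come from the write-up; the rest is assumed.
config = GRPOConfig(
    output_dir="grpo-format-run",   # placeholder path
    learning_rate=1e-2,             # raised from the 7e-6 baseline
    max_completion_length=256,      # matches the "Model Length" field above
)
```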
My original plan was to split training into two phases: (1) RL for correct format, then (2) RL for reasoning. To my surprise, doing (1) for a small number of steps worked on its own.
Earlier experiments showed a high correlation between emitting the closing answer tag and getting the right answer, and that the model was often being cut off by the token limit rather than failing to be "on the right track".
In the reward function, I optimized for the answer tags and completely disregarded answer accuracy, since the goal was to make the model learn the correct format (a sketch of this reward is included at the end of this write-up). Surprisingly, the model learnt to use fewer tokens (average completion length dropped into the 50s) so that it had enough budget left to emit the closing tag.
The best result was achieved at step 70, which was coincidentally also the step with the best tag rate - the model emitted both the opening and closing answer tags. However, it quickly collapsed to 0% accuracy by step 100.
I haven't tried 512 tokens, but I expect the result to be the same, since the model didn't fully utilize the 256 tokens it was given.
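Below is a minimal sketch of the format-only reward described above, written against a `completions -> list[float]` reward interface of the kind TRL's GRPOTrainer accepts. The exact tag literals, weights, and interface are assumptions not given in the write-up; the sketch only illustrates rewarding the presence of the answer tags, especially the closing one, while ignoring whether the final number is correct.

```python
import re

# Assumed tag literals and weights; the write-up only says the reward
# targets formatting (especially the closing tag) and ignores correctness.
ANSWER_PAIR_RE = re.compile(r"<answer>.*?</answer>", re.DOTALL)

def format_reward(completions, **kwargs):
    """Score each completion on formatting only, never on answer accuracy."""
    rewards = []
    for text in completions:
        score = 0.0
        if "<answer>" in text:
            score += 0.25   # opening tag present
        if "</answer>" in text:
            score += 0.5    # closing tag present - the part truncation cuts off
        if ANSWER_PAIR_RE.search(text):
            score += 0.25   # well-formed open/close pair
        rewards.append(score)
    return rewards
```

The closing tag gets the largest weight here because the earlier experiments pointed to truncation before the closing tag, rather than bad reasoning, as the main failure mode.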
GPU Hours
30 minutes on an H100
Submission Agreement
- I confirm that these results are from my own work