[Submission] #3

@huyxdang

Student Name

Swan Hwee

Model Length

256

Accuracy

95.90

Improvement Description

learning rate + reward function

Detailed Write-up

  • Increased the learning rate from 7e-6 to 1e-2.
  • Optimized the reward function for format accuracy (correct tags and using the right numbers), especially producing the closing answer tag (sketched in code further down).

My original plan was to split training into two phases: (1) RL for correct formatting, then (2) RL for reasoning. To my surprise, doing (1) alone for a small number of steps was enough.

In earlier experiments I found a high correlation between producing the closing answer tag and getting the right answer, and noticed that the model was being cut off by the token limit rather than failing to be on the right track.
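For reference, a rough version of that check can be reproduced from logged rollouts. This is only a sketch: the rollouts.jsonl path, the "completion"/"correct" field names, and the `</answer>` tag literal are all assumptions (the rendered issue appears to have stripped the tag names), not the exact setup used here.

```python
# Sketch of the tag-vs-correctness check, under the assumptions above.
import json

with open("rollouts.jsonl") as f:
    rollouts = [json.loads(line) for line in f]

# Whether each completion managed to emit the closing tag before truncation,
# and whether its final answer was correct.
has_tag = ["</answer>" in r["completion"] for r in rollouts]
correct = [bool(r["correct"]) for r in rollouts]

def accuracy(mask):
    picked = [c for c, m in zip(correct, mask) if m]
    return sum(picked) / max(len(picked), 1)

print("accuracy when closing tag present:", accuracy(has_tag))
print("accuracy when closing tag missing:", accuracy([not m for m in has_tag]))
```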

In the reward function, I rewarded only the tags and completely disregarded answer accuracy, since the goal was to make the model learn the correct format. Surprisingly, the model learned to use fewer tokens (average completion length dropped into the 50s) so that it would have enough budget left to emit the closing tag.
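A minimal sketch of such a format-only reward is below. It assumes the task expects `<think>...</think><answer>...</answer>` output and a countdown-style "use the given numbers" constraint; the tag names, the allowed-numbers check, and the reward weights are illustrative assumptions, not the exact reward used in this submission.

```python
import re

def format_reward(completion: str, allowed_numbers: list[int]) -> float:
    """Format-only reward: answer accuracy is deliberately ignored."""
    reward = 0.0

    # Weight the closing answer tag most heavily, since truncation before
    # it was the main failure mode observed.
    if "<answer>" in completion:
        reward += 0.25
    if "</answer>" in completion:
        reward += 0.5

    # Small bonus if the expression inside the answer tags only uses the
    # numbers provided in the prompt ("using the right numbers").
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match:
        used = [int(n) for n in re.findall(r"\d+", match.group(1))]
        if used and all(n in allowed_numbers for n in used):
            reward += 0.25

    return reward
```

Used as the only reward signal, a scheme like this pushes the policy toward shorter completions that leave room for the closing tag, which matches the behaviour described above.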

The best result was achieved at step 70, which was coincidentally also the step with the best tag rate: the model emitted both the opening and closing tags. However, the model quickly collapsed to 0% accuracy by step 100.

I haven't tried 512 tokens, but I expect the result to be the same, since the model didn't fully utilize the 256 tokens it was given.

GPU Hours

30 minutes on an H100

Submission Agreement

  • I confirm that these results are from my own work
