[Submission] #3

@huyxdang

Student Name

Swan Hwee

Model Length

256

Accuracy

95.90

Improvement Description

learning rate + reward function

Detailed Write-up

  • Increased the learning rate from 7e-6 to 1e-2.
  • Optimized the reward function for format accuracy (correct tags and using the right numbers), especially producing the closing answer tag (sketched in code further down).

My original plan was to split training into two phases: (1) RL for correct formatting, then (2) RL for reasoning. To my surprise, doing (1) alone for a small number of steps was enough.

In earlier experiments I found a high correlation between producing the closing answer tag and getting the right answer, and noticed that the model was being cut off by the token limit rather than failing to be on the right track.
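For reference, a rough version of that check can be reproduced from logged rollouts. This is only a sketch: the rollouts.jsonl path, the "completion"/"correct" field names, and the `</answer>` tag literal are all assumptions (the rendered issue appears to have stripped the tag names), not the exact setup used here.

```python
# Sketch of the tag-vs-correctness check, under the assumptions above.
import json

with open("rollouts.jsonl") as f:
    rollouts = [json.loads(line) for line in f]

# Whether each completion managed to emit the closing tag before truncation,
# and whether its final answer was correct.
has_tag = ["</answer>" in r["completion"] for r in rollouts]
correct = [bool(r["correct"]) for r in rollouts]

def accuracy(mask):
    picked = [c for c, m in zip(correct, mask) if m]
    return sum(picked) / max(len(picked), 1)

print("accuracy when closing tag present:", accuracy(has_tag))
print("accuracy when closing tag missing:", accuracy([not m for m in has_tag]))
```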

In the reward function, I rewarded only the tags and completely disregarded answer accuracy, since the goal was to make the model learn the correct format. Surprisingly, the model learned to use fewer tokens (average completion length dropped into the 50s) so that it would have enough budget left to emit the closing tag.
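A minimal sketch of such a format-only reward is below. It assumes the task expects `<think>...</think><answer>...</answer>` output and a countdown-style "use the given numbers" constraint; the tag names, the allowed-numbers check, and the reward weights are illustrative assumptions, not the exact reward used in this submission.

```python
import re

def format_reward(completion: str, allowed_numbers: list[int]) -> float:
    """Format-only reward: answer accuracy is deliberately ignored."""
    reward = 0.0

    # Weight the closing answer tag most heavily, since truncation before
    # it was the main failure mode observed.
    if "<answer>" in completion:
        reward += 0.25
    if "</answer>" in completion:
        reward += 0.5

    # Small bonus if the expression inside the answer tags only uses the
    # numbers provided in the prompt ("using the right numbers").
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match:
        used = [int(n) for n in re.findall(r"\d+", match.group(1))]
        if used and all(n in allowed_numbers for n in used):
            reward += 0.25

    return reward
```

Used as the only reward signal, a scheme like this pushes the policy toward shorter completions that leave room for the closing tag, which matches the behaviour described above.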

The best result was achieved at step 70, which was coincidentally also the step with the best tag rate: the model emitted both the opening and closing tags. However, the model quickly collapsed to 0% accuracy by step 100.

I haven't tried 512 tokens, but I expect the result to be the same, since the model didn't fully utilize the 256 tokens it was given.

GPU Hours

30 minutes on an H100

Submission Agreement

  • I confirm that these results are from my own work
