[Submission] #4

@huyxdang

Description

Student Name

Huy Dang

Model Length

256

Accuracy

54.82

Improvement Description

Reward Function, Advantage + Log ratio clamping

Detailed Write-up

  • Advantage clamping prevents NaN/Inf values caused by extreme advantages.

  • Log-ratio clamping prevents the probability ratio from exploding.
    --> RL runs became much more stable

  • Distance-based and negative rewards discourage bad behaviors (missing answer tags, wrong numbers).

  • Accuracy plateaued at around 51%, so I made a final push by changing the learning rate to encourage more exploration.
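
The first three points above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the submitted training code: the clamp bounds, the `<answer>` tag format, the reward magnitudes, and the function names `clamped_surrogate` and `distance_reward` are all assumptions made for the example. A real training loop would presumably apply the same operations to tensors.

```python
import math
import re

# All constants are illustrative assumptions, not values from the write-up.
ADV_CLAMP = 5.0        # maximum absolute advantage
LOG_RATIO_CLAMP = 2.0  # maximum absolute log(pi_new / pi_old)


def clamp(x, lo, hi):
    """Restrict x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def clamped_surrogate(logp_new, logp_old, advantage):
    """Per-token objective term with advantage and log-ratio clamping."""
    if math.isnan(advantage):
        advantage = 0.0  # ignore NaN advantages instead of propagating them
    adv = clamp(advantage, -ADV_CLAMP, ADV_CLAMP)
    # Clamp in log space so exp() cannot overflow or blow up the ratio.
    log_ratio = clamp(logp_new - logp_old, -LOG_RATIO_CLAMP, LOG_RATIO_CLAMP)
    ratio = math.exp(log_ratio)  # bounded to [e^-2, e^2]
    return ratio * adv


# Hypothetical answer format: a single number wrapped in <answer> tags.
ANSWER_RE = re.compile(r"<answer>\s*(-?\d+(?:\.\d+)?)\s*</answer>")


def distance_reward(completion, target):
    """Distance-based reward with negative values for bad outputs."""
    match = ANSWER_RE.search(completion)
    if match is None:
        return -1.0  # no answer tags: strongest penalty
    predicted = float(match.group(1))
    if predicted == target:
        return 1.0
    # Wrong number: penalty in (-0.5, 0) that shrinks as the guess gets closer.
    rel_error = abs(predicted - target) / (abs(target) + 1.0)
    return -0.5 * rel_error / (1.0 + rel_error)
```

Clamping the log ratio before exponentiating, rather than clamping the ratio afterwards, means a very large log-probability gap never reaches `exp()`, which is one plausible reason runs would become more stable. The graded penalty gives a wrong but close answer a better reward than a distant one, so the policy still receives a learning signal when it misses.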

Unfortunately, I don't have enough free GPU credits to test with a model length of 512 tokens.

GPU Hours

1.5 H1000

Submission Agreement
