Training Hours and Reward trend #37

@mandyyyyii

Hi, thank you very much for your contribution to the community. I have a few questions regarding the training setup and observed behaviors.

In the paper, how many training hours are required to reach 1,000 steps? In my setup, I enable oversample=2, which results in 47 steps per epoch, so reaching 1,000 steps would take roughly 21 epochs. Currently, each epoch takes about 15 hours to train on 8× H200 GPUs with the Qwen3-4B backbone. Does this align with the training configuration reported in the paper?
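As a quick sanity check on the step arithmetic, here is a minimal sketch (the function name is mine, not from the repo) that converts a target step count into epochs and wall-clock hours under the numbers stated above:

```python
import math

def training_budget(target_steps, steps_per_epoch, hours_per_epoch):
    """Return (epochs needed, total wall-clock hours) to reach target_steps."""
    epochs = math.ceil(target_steps / steps_per_epoch)
    return epochs, epochs * hours_per_epoch

# Numbers from this issue: 47 steps/epoch, ~15 h/epoch on 8x H200.
epochs, hours = training_budget(1000, 47, 15)
print(epochs, hours)  # -> 22 330
```

So at 47 optimizer steps per epoch, 1,000 steps works out to roughly 21–22 epochs (about 330 GPU-node hours at 15 h/epoch), which is why I am asking whether this matches the paper's configuration.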

I have trained the model for 4 epochs (approximately 200 steps). The training reward curves across epochs are highly similar, while the gradient norm remains stable and the entropy trends slowly upward, which seems healthy. However, the model's generations also look very similar from epoch to epoch, and the reward within each epoch exhibits high variance. Is this behavior expected in this training regime?
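To make the "flat across epochs but noisy within an epoch" observation concrete, here is an illustrative sketch. The reward values are synthetic (in practice you would load them from your training logs; the array layout below is hypothetical): it compares the spread of per-epoch mean rewards against the average noise inside each epoch.

```python
import numpy as np

# Synthetic stand-in for logged per-step training rewards:
# 4 epochs x 47 steps, flat mean with substantial step-to-step noise.
rng = np.random.default_rng(0)
rewards = rng.normal(loc=0.5, scale=0.2, size=(4, 47))

between_epoch_spread = rewards.mean(axis=1).std()  # how much epoch means differ
within_epoch_std = rewards.std(axis=1).mean()      # average noise inside an epoch
print(f"between-epoch spread: {between_epoch_spread:.3f}")
print(f"mean within-epoch std: {within_epoch_std:.3f}")
# A small between-epoch spread combined with a large within-epoch std is
# exactly the "similar curves, high per-step variance" pattern described above.
```

If the real logs show the same pattern (epoch means nearly identical, per-step variance dominating), that would suggest the reward signal is noisy rather than the policy actively changing across epochs.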

Thank you very much for your time and help.
