Hi, thank you very much for your contribution to the community. I have a few questions regarding the training setup and observed behaviors.
In the paper, how many training hours are required to reach 1,000 steps? In my setup, I enable oversample=2, which results in 47 steps per epoch, so reaching 1,000 steps would take roughly 21 epochs. Currently, each epoch takes about 15 hours to train on 8× H200 GPUs with the Qwen3-4B backbone. Does this align with the training configuration reported in the paper?
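For concreteness, here is the wall-clock arithmetic from my measured numbers (47 steps/epoch and 15 h/epoch are my own observations, not figures from the paper):

```python
# Back-of-the-envelope training-time estimate from my measured setup.
steps_per_epoch = 47     # observed with oversample=2
hours_per_epoch = 15     # observed on 8x H200, Qwen3-4B backbone
target_steps = 1000      # step count reported in the paper

epochs_needed = target_steps / steps_per_epoch   # ~21.3 epochs
total_hours = epochs_needed * hours_per_epoch    # ~319 hours of wall clock

print(f"{epochs_needed:.1f} epochs, ~{total_hours:.0f} hours")
```

That comes out to roughly 13 days of continuous training, which is why I want to confirm the configuration.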
I have trained the model for 4 epochs (approximately 200 steps). The training reward curves across epochs are highly similar, while the gradient norm remains stable and the entropy trends slowly upward, which seems healthy. However, the model's generations also look very similar across epochs, and the reward within each epoch has high variance. Is this behavior expected in this training regime?
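To make the per-epoch comparison less noisy, I smooth the logged rewards before overlaying the epochs. A minimal sketch of what I do (assuming rewards are logged as a plain list of floats; the `alpha` value is an arbitrary choice of mine, not from the paper):

```python
def ema(values, alpha=0.1):
    """Exponential moving average: smooths a noisy per-step reward curve
    so that slow trends across epochs are easier to see."""
    smoothed = []
    prev = values[0]  # initialize with the first reward
    for v in values:
        prev = alpha * v + (1 - alpha) * prev
        smoothed.append(prev)
    return smoothed
```

Even after this smoothing, the epoch-over-epoch curves look nearly identical, which is what prompted the question above.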
Thank you very much for your time and help.