Support top-k training: instead of training on all rollouts in a group, train only on the k rollouts with the highest reward.
This technique is used in GRPO-style training to focus gradient signal on the highest-quality completions within each group, with the aim of improving sample efficiency and training stability.
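The selection step can be sketched as follows. This is a minimal illustration, not the implementation in this change: the function names `top_k_indices` and `select_top_k_per_group`, the plain-list reward format, and the tie-breaking rule (earlier rollout wins) are all assumptions made for the example.

```python
from typing import List, Sequence


def top_k_indices(rewards: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k highest-reward rollouts in one group.

    Hypothetical helper for illustration. Ties go to the earlier rollout,
    and the result is sorted so kept rollouts stay in their original order.
    If k exceeds the group size, every index is returned.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    # sorted() is stable, so equal rewards keep their original relative order.
    ranked = sorted(range(len(rewards)), key=lambda i: -rewards[i])
    return sorted(ranked[:k])


def select_top_k_per_group(
    group_rewards: Sequence[Sequence[float]], k: int
) -> List[List[int]]:
    """Apply top_k_indices to each group independently (assumed data layout:
    one inner sequence of rewards per prompt group)."""
    return [top_k_indices(rewards, k) for rewards in group_rewards]


if __name__ == "__main__":
    # Two groups of rollouts; keep the best two from each.
    groups = [[0.1, 0.9, 0.4, 0.9, 0.0], [1.0, 0.0, 0.5]]
    print(select_top_k_per_group(groups, k=2))
```

The sketch only selects indices. Whether group-relative advantages are computed before or after filtering is not stated here, so that choice is left to the caller.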