feat: support top-k training

Instead of training on all rollouts in a group, train only on the top-k rollouts ranked by reward.

This technique is used in GRPO-style training to focus the gradient signal on the highest-reward completions within each group, with the aim of improving sample efficiency and training stability.
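
For reviewers, the per-group selection described above can be sketched roughly as follows. This is an illustration only, not the code in this change: the names `top_k_indices` and `select_top_k` are hypothetical, and the tie-breaking rule (lower index wins) and the clamping of `k` to the group size are assumptions.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def top_k_indices(rewards: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k highest-reward rollouts in one group.

    Hypothetical helper: ties go to the lower index, and the result is
    returned in the group's original order.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    k = min(k, len(rewards))  # assumption: clamp k to the group size
    # sorted() is stable even with reverse=True, so equal rewards keep
    # their original relative order (lower index first).
    ranked = sorted(range(len(rewards)), key=lambda i: rewards[i], reverse=True)
    return sorted(ranked[:k])


def select_top_k(
    rollouts: Sequence[T], rewards: Sequence[float], k: int
) -> Tuple[List[T], List[float]]:
    """Keep only the top-k rollouts of a group, with their rewards."""
    if len(rollouts) != len(rewards):
        raise ValueError("rollouts and rewards must have the same length")
    keep = top_k_indices(rewards, k)
    return [rollouts[i] for i in keep], [rewards[i] for i in keep]


# One group of four rollouts; keep the two with the highest reward.
group = ["a", "b", "c", "d"]
group_rewards = [0.1, 0.9, 0.4, 0.7]
kept, kept_rewards = select_top_k(group, group_rewards, k=2)
```

The sketch only covers which rollouts are kept. Whether advantages are normalized over the full group or over the kept subset is a separate design choice that it does not address.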