Hi,
I'm confused about why the KL divergence is calculated in train.py at line 294 and then stored in the replay buffer.
Is it because, in OpenRLHF's GRPO, the KL divergence is used not only in the loss but also as a penalty on the reward, weighted by a very small KL coefficient?
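To make my question concrete, here is a minimal sketch of the pattern I'm imagining (not OpenRLHF's actual code; the function names, the k1 estimator choice, and the `kl_coef` default are my own assumptions):

```python
import torch

def compute_token_kl(policy_logprobs: torch.Tensor,
                     ref_logprobs: torch.Tensor) -> torch.Tensor:
    """Per-token KL estimate between policy and reference model.

    Assumes the simple k1 estimator log pi(a|s) - log pi_ref(a|s);
    the actual estimator in OpenRLHF may differ (e.g. k3).
    """
    return policy_logprobs - ref_logprobs


def apply_kl_penalty(rewards: torch.Tensor,
                     kl: torch.Tensor,
                     kl_coef: float = 0.01) -> torch.Tensor:
    """Subtract a scaled KL penalty from the per-token reward.

    kl_coef stands in for the "very small KL weight" I'm asking
    about; the real default in OpenRLHF may be different.
    """
    return rewards - kl_coef * kl


# Usage: (batch, seq_len) token-level tensors.
policy_lp = torch.randn(2, 8)
ref_lp = torch.randn(2, 8)
rewards = torch.zeros(2, 8)

kl = compute_token_kl(policy_lp, ref_lp)  # the quantity stored in the replay buffer?
shaped = apply_kl_penalty(rewards, kl)    # reward shaped by the KL penalty?
```

Is this roughly what the stored KL is for, i.e. it gets reused to shape the reward rather than only appearing in the loss?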