Description
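Lines starting with # are the original code, and the lines after them are newly added. The main goal is to implement multi_turns through a loop: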
with marked_timer("step", timing_raw):
    # generate a batch
    with marked_timer("gen", timing_raw, "red"):
        # gen_batch_output = self.actor_rollout_wg.generate_sequences(gen_batch)
        # timing_raw.update(gen_batch_output.meta_info["timing"])
        # gen_batch_output.meta_info.pop("timing", None)
        if self.config.actor_rollout_ref.rollout.max_turns is None:
            gen_batch_output = self.actor_rollout_wg.generate_sequences(gen_batch)
        else:
            gen_batch_output = self.actor_rollout_wg.generate_sequences_loop(gen_batch)

Newly added length filter and reward normalization:
if not self.config.algorithm.filter_groups.enable:
    batch = new_batch
else:
    filter_id = []
    for idx, (uid, input_ids) in enumerate(zip(new_batch.non_tensor_batch['uid'], new_batch.batch['input_ids'])):
        if len(input_ids[input_ids != 151643]) < self.config.data.max_prompt_length:
            filter_id.append(idx)
    new_batch = new_batch[filter_id]
    metric_name = self.config.algorithm.filter_groups.metric
    if metric_name == "seq_final_reward":
        # Turn to numpy for easier filtering
        new_batch.non_tensor_batch["seq_final_reward"] = new_batch.batch["token_level_rewards"].sum(dim=-1).numpy()
    elif metric_name == "seq_reward":
        # ===============================add======================================
        seq_reward = new_batch.batch["token_level_scores"].sum(dim=-1).numpy()
        seq_reward = (seq_reward - seq_reward.mean()) / (seq_reward.std() + 1e-8)
        new_batch.non_tensor_batch["seq_reward"] = seq_reward
        # ===============================add======================================

Error locations:
1. batch = new_batch if batch is None else DataProto.concat([batch, new_batch])
2. # Align the batch
   traj_bsz = self.config.data.train_batch_size * self.config.actor_rollout_ref.rollout.n
   batch = batch[:traj_bsz]
3. RL-Factory/verl/workers/fsdp_workers.py", line 773, in generate_sequences_loop:
   work.wait()
   RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 1800000ms for recv operation to complete
4.
Environment: 8x A800, qwen3-8B, torch==2.6.0
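
For anyone trying to reproduce the two additions outside the trainer, the following is a minimal standalone sketch of the length filter and the reward normalization shown above. It is an illustration under assumptions, not the trainer code: plain NumPy arrays stand in for `DataProto`, the helper names `keep_indices` and `normalize_rewards` are hypothetical, and `MAX_PROMPT_LENGTH = 6` is a toy value chosen so the example is easy to follow.

```python
import numpy as np

PAD_ID = 151643          # token id excluded by the filter in the snippet above
MAX_PROMPT_LENGTH = 6    # hypothetical toy value, not the real config setting


def keep_indices(input_ids, pad_id, max_len):
    """Return indices of rows whose count of non-pad tokens is below max_len."""
    non_pad_counts = (input_ids != pad_id).sum(axis=1)
    return [i for i, n in enumerate(non_pad_counts) if n < max_len]


def normalize_rewards(token_level_scores, eps=1e-8):
    """Sum token-level scores per sequence, then z-score across the batch."""
    seq_reward = token_level_scores.sum(axis=-1)
    return (seq_reward - seq_reward.mean()) / (seq_reward.std() + eps)


# Toy batch: 4 sequences of length 8.
P = PAD_ID
input_ids = np.array([
    [1, 2, 3, P, P, P, P, P],   # 3 non-pad tokens, kept
    [1, 2, 3, 4, 5, 6, 7, P],   # 7 non-pad tokens, dropped
    [1, 2, 3, 4, 5, P, P, P],   # 5 non-pad tokens, kept
    [1, 2, 3, 4, 5, 6, P, P],   # 6 non-pad tokens, dropped (6 is not < 6)
])
token_level_scores = np.zeros((4, 8))
token_level_scores[0, 2] = 1.0  # only the first sequence receives a score

kept = keep_indices(input_ids, PAD_ID, MAX_PROMPT_LENGTH)
filtered_scores = token_level_scores[kept]
seq_reward = normalize_rewards(filtered_scores)

print("kept indices:", kept)
print("rows before/after filter:", len(input_ids), len(filtered_scores))
print("normalized seq_reward:", seq_reward)
```

In this toy batch the filter keeps two of four rows, so the number of rows after filtering depends on the data rather than on the configured batch size. The normalization is computed over whatever rows remain, using the population standard deviation that NumPy's `std()` returns by default.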