Description
==================================================================================
Traceback (most recent call last):
  File "/root/paddlejob/workspace/env_run/RL-Factory/verl/trainer/main_ppo.py", line 30, in main
    run_ppo(config)
  File "/root/paddlejob/workspace/env_run/RL-Factory/verl/trainer/main_ppo.py", line 49, in run_ppo
    ray.get(runner.run.remote(config))
  File "/root/miniconda3/envs/rlagent/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/root/miniconda3/envs/rlagent/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/root/miniconda3/envs/rlagent/lib/python3.10/site-packages/ray/_private/worker.py", line 2822, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/root/miniconda3/envs/rlagent/lib/python3.10/site-packages/ray/_private/worker.py", line 930, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ActorDiedError): ray::TaskRunner.run() (pid=16209, ip=10.96.200.252, actor_id=af67a3a6fa70baaa8f2d6ff301000000, repr=<main_ppo.TaskRunner object at 0x7ecf611e7670>)
  File "/root/paddlejob/workspace/env_run/RL-Factory/verl/trainer/main_ppo.py", line 248, in run
    trainer.fit()
  File "/root/paddlejob/workspace/env_run/RL-Factory/verl/trainer/ppo/ray_trainer.py", line 1081, in fit
    gen_batch_output = self.actor_rollout_wg.generate_sequences_loop(gen_batch)
  File "/root/paddlejob/workspace/env_run/RL-Factory/verl/single_controller/ray/base.py", line 51, in __call__
    output = ray.get(output)
ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
    class_name: create_colocated_worker_cls.<locals>.WorkerDict
    actor_id: 6aaacbae2cbaf2e3e311550101000000
    pid: 20659
    name: jDIkihWorkerDict_0:7
    namespace: 6951bd58-d6ee-4ccf-9ff4-e7c4e2650134
    ip: 10.96.200.252
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
==================================================================================
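Since the worker exit detail above lists the OOM killer as potential root cause (1), a quick check on the node can confirm or rule that out. This is only a rough sketch: it assumes shell access to the node (10.96.200.252) and Ray's default log directory under /tmp/ray; the PID 20659 comes from the error message above.

# Check whether the kernel OOM killer terminated the worker process (pid 20659)
dmesg -T | grep -iE "out of memory|killed process" | tail -n 20

# Inspect the dead worker's Ray logs; /tmp/ray/session_latest is Ray's default location
ls /tmp/ray/session_latest/logs/ | grep 20659
grep -ril "20659" /tmp/ray/session_latest/logs/ | head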
I have installed nvidia-cublas-cu12==12.4.5.8, but this error still occurs. I am using the judge mode.
This error is similar to verl-project/verl#2833 (comment).
My versions:
torch==2.6.0
vllm==0.8.5
nvidia-cublas-cu12==12.4.5.8
CUDA 12.0
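To double-check that these are the versions the training environment actually resolves, and that the pinned nvidia-cublas-cu12 is the libcublas that gets loaded at runtime, something like the following can be run inside the rlagent environment (a minimal sketch; the /proc/self/maps check is Linux-specific):

# Report the resolved versions inside the environment
python3 -c "import torch; print('torch', torch.__version__, '| built for CUDA', torch.version.cuda)"
python3 -c "import vllm; print('vllm', vllm.__version__)"
pip show nvidia-cublas-cu12 | grep -i '^version'
# Show which libcublas shared object is actually mapped after importing torch
python3 - <<'EOF'
import torch  # importing torch should pull in the bundled CUDA libs for the pip CUDA wheels
libs = {line.split()[-1] for line in open('/proc/self/maps') if 'cublas' in line}
print('\n'.join(sorted(libs)) or 'no libcublas mapped')
EOF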
My script:
python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.train_files=/root/paddlejob/workspace/env_run/data/nq_search/train.parquet \
    data.val_files=/root/paddlejob/workspace/env_run/data/nq_search/test.parquet \
    trainer.rollout_data_dir=/root/paddlejob/workspace/env_run/rag_data/logs1 \
    data.train_batch_size=128 \
    data.max_prompt_length=4096 \
    data.max_response_length=512 \
    actor_rollout_ref.model.path=$MODEL_PATH \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.actor.ppo_mini_batch_size=32 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=8 \
    actor_rollout_ref.actor.use_kl_loss=True \
    actor_rollout_ref.actor.kl_loss_coef=0.001 \
    actor_rollout_ref.actor.kl_loss_type=low_var_kl \
    actor_rollout_ref.actor.fsdp_config.param_offload=True \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
    actor_rollout_ref.actor.state_masking=True \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.75 \
    actor_rollout_ref.rollout.n=4 \
    actor_rollout_ref.rollout.max_turns=4 \
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=8 \
    actor_rollout_ref.ref.fsdp_config.param_offload=False \
    actor_rollout_ref.rollout.enforce_eager=False \
    actor_rollout_ref.rollout.free_cache_engine=False \
    actor_rollout_ref.env.name=reward_rollout \
    actor_rollout_ref.env.mcp_mode=stdio \
    actor_rollout_ref.env.tool_manager=qwen3 \
    actor_rollout_ref.env.enable_thinking=True \
    actor_rollout_ref.env.config_path=envs/configs/mcp_tools.pydata \
    actor_rollout_ref.env.use_process_reward=False \
    reward_rollout.if_use_reward_rollout=True \
    reward_rollout.rollout.tensor_model_parallel_size=4 \
    reward_rollout.rollout.gpu_memory_utilization=0.5 \
    reward_rollout.rollout.model_name=$REWARD_MODEL_PATH \
    reward_rollout.rollout.free_cache_engine=False \
    reward_rollout.rollout.response_length=2048 \
    reward_model.reward_manager=parallel \
    algorithm.kl_ctrl.kl_coef=0.001 \
    trainer.critic_warmup=0 \
    trainer.logger=['tensorboard'] \
    trainer.project_name='GRPO_search' \
    trainer.experiment_name='search_with_thinking' \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes=1 \
    trainer.val_before_train=False \
    trainer.default_local_dir=$RESULT_DIR \
    trainer.default_hdfs_dir=null \
    trainer.save_freq=5 \
    trainer.test_freq=100 \
    trainer.total_epochs=1 $@ 2>&1 | tee grpo.log
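
As the error message suggests, rerunning with HYDRA_FULL_ERROR=1 should produce the complete stack trace. A short sketch of the rerun (run_grpo.sh is just a placeholder name for the script above):

# Re-run the same launch with full Hydra error reporting, as hinted in the traceback
export HYDRA_FULL_ERROR=1
bash run_grpo.sh 2>&1 | tee grpo_full_error.log  # run_grpo.sh = the script above (placeholder name)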