Description
==================================================================================
Traceback (most recent call last):
  File "/root/paddlejob/workspace/env_run/RL-Factory/verl/trainer/main_ppo.py", line 30, in main
    run_ppo(config)
  File "/root/paddlejob/workspace/env_run/RL-Factory/verl/trainer/main_ppo.py", line 49, in run_ppo
    ray.get(runner.run.remote(config))
  File "/root/miniconda3/envs/rlagent/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/root/miniconda3/envs/rlagent/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/root/miniconda3/envs/rlagent/lib/python3.10/site-packages/ray/_private/worker.py", line 2822, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/root/miniconda3/envs/rlagent/lib/python3.10/site-packages/ray/_private/worker.py", line 930, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ActorDiedError): ray::TaskRunner.run() (pid=16209, ip=10.96.200.252, actor_id=af67a3a6fa70baaa8f2d6ff301000000, repr=<main_ppo.TaskRunner object at 0x7ecf611e7670>)
  File "/root/paddlejob/workspace/env_run/RL-Factory/verl/trainer/main_ppo.py", line 248, in run
    trainer.fit()
  File "/root/paddlejob/workspace/env_run/RL-Factory/verl/trainer/ppo/ray_trainer.py", line 1081, in fit
    gen_batch_output = self.actor_rollout_wg.generate_sequences_loop(gen_batch)
  File "/root/paddlejob/workspace/env_run/RL-Factory/verl/single_controller/ray/base.py", line 51, in __call__
    output = ray.get(output)
ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
    class_name: create_colocated_worker_cls.<locals>.WorkerDict
    actor_id: 6aaacbae2cbaf2e3e311550101000000
    pid: 20659
    name: jDIkihWorkerDict_0:7
    namespace: 6951bd58-d6ee-4ccf-9ff4-e7c4e2650134
    ip: 10.96.200.252
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
==================================================================================
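Since the worker exit detail above lists the OOM killer as potential root cause (1), a quick check on the node can confirm or rule that out. This is only a rough sketch: it assumes shell access to the node (10.96.200.252) and Ray's default log directory under /tmp/ray; the PID 20659 comes from the error message above.

# Check whether the kernel OOM killer terminated the worker process (pid 20659)
dmesg -T | grep -iE "out of memory|killed process" | tail -n 20

# Inspect the dead worker's Ray logs; /tmp/ray/session_latest is Ray's default location
ls /tmp/ray/session_latest/logs/ | grep 20659
grep -ril "20659" /tmp/ray/session_latest/logs/ | head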
I have installed nvidia-cublas-cu12==12.4.5.8, but this error still occurs. I am using the judge mode.
This error is similar to verl-project/verl#2833 (comment).
My versions:
torch==2.6.0
vllm==0.8.5
nvidia-cublas-cu12==12.4.5.8
CUDA 12.0
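To double-check that these are the versions the training environment actually resolves, and that the pinned nvidia-cublas-cu12 is the libcublas that gets loaded at runtime, something like the following can be run inside the rlagent environment (a minimal sketch; the /proc/self/maps check is Linux-specific):

# Report the resolved versions inside the environment
python3 -c "import torch; print('torch', torch.__version__, '| built for CUDA', torch.version.cuda)"
python3 -c "import vllm; print('vllm', vllm.__version__)"
pip show nvidia-cublas-cu12 | grep -i '^version'
# Show which libcublas shared object is actually mapped after importing torch
python3 - <<'EOF'
import torch  # importing torch should pull in the bundled CUDA libs for the pip CUDA wheels
libs = {line.split()[-1] for line in open('/proc/self/maps') if 'cublas' in line}
print('\n'.join(sorted(libs)) or 'no libcublas mapped')
EOF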
My script:
python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.train_files=/root/paddlejob/workspace/env_run/data/nq_search/train.parquet \
    data.val_files=/root/paddlejob/workspace/env_run/data/nq_search/test.parquet \
    trainer.rollout_data_dir=/root/paddlejob/workspace/env_run/rag_data/logs1 \
    data.train_batch_size=128 \
    data.max_prompt_length=4096 \
    data.max_response_length=512 \
    actor_rollout_ref.model.path=$MODEL_PATH \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.actor.ppo_mini_batch_size=32 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=8 \
    actor_rollout_ref.actor.use_kl_loss=True \
    actor_rollout_ref.actor.kl_loss_coef=0.001 \
    actor_rollout_ref.actor.kl_loss_type=low_var_kl \
    actor_rollout_ref.actor.fsdp_config.param_offload=True \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
    actor_rollout_ref.actor.state_masking=True \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.75 \
    actor_rollout_ref.rollout.n=4 \
    actor_rollout_ref.rollout.max_turns=4 \
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=8 \
    actor_rollout_ref.ref.fsdp_config.param_offload=False \
    actor_rollout_ref.rollout.enforce_eager=False \
    actor_rollout_ref.rollout.free_cache_engine=False \
    actor_rollout_ref.env.name=reward_rollout \
    actor_rollout_ref.env.mcp_mode=stdio \
    actor_rollout_ref.env.tool_manager=qwen3 \
    actor_rollout_ref.env.enable_thinking=True \
    actor_rollout_ref.env.config_path=envs/configs/mcp_tools.pydata \
    actor_rollout_ref.env.use_process_reward=False \
    reward_rollout.if_use_reward_rollout=True \
    reward_rollout.rollout.tensor_model_parallel_size=4 \
    reward_rollout.rollout.gpu_memory_utilization=0.5 \
    reward_rollout.rollout.model_name=$REWARD_MODEL_PATH \
    reward_rollout.rollout.free_cache_engine=False \
    reward_rollout.rollout.response_length=2048 \
    reward_model.reward_manager=parallel \
    algorithm.kl_ctrl.kl_coef=0.001 \
    trainer.critic_warmup=0 \
    trainer.logger=['tensorboard'] \
    trainer.project_name='GRPO_search' \
    trainer.experiment_name='search_with_thinking' \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes=1 \
    trainer.val_before_train=False \
    trainer.default_local_dir=$RESULT_DIR \
    trainer.default_hdfs_dir=null \
    trainer.save_freq=5 \
    trainer.test_freq=100 \
    trainer.total_epochs=1 $@ 2>&1 | tee grpo.log
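
As the error message suggests, rerunning with HYDRA_FULL_ERROR=1 should produce the complete stack trace. A short sketch of the rerun (run_grpo.sh is just a placeholder name for the script above):

# Re-run the same launch with full Hydra error reporting, as hinted in the traceback
export HYDRA_FULL_ERROR=1
bash run_grpo.sh 2>&1 | tee grpo_full_error.log  # run_grpo.sh = the script above (placeholder name)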