NCCL error when launching training with vLLM #339

@aryagxr

I'm seeing NCCL errors while trying to run verifiers on 2× A100 80GB. My setup is vLLM 0.10.2 and CUDA 12.8.

The same error was noted in issue #78, but the suggested fixes don't work for me.

What I tried so far:

  1. Setting NCCL_P2P_DISABLE to 1 and to 0, as suggested here:

     Funny enough, adding back in NCCL_P2P_DISABLE=1 fixed it for me. I'll update the readme docs to recommend toggling this on NCCL bugs. Seems to be a common workaround for driver issues, which are highly dependent on your machine's setup and are somewhat beyond scope of repo-level features.

     Yeah, relaunching the server is generally recommended if the crash occurs during the communication channel initialization (which will be the case for these NCCL errors).

     More info: https://docs.vllm.ai/en/v0.6.6/getting_started/debugging.html#incorrect-hardware-driver

     Originally posted by @willccbb in #78

  2. Downgrading vLLM to 0.9.1 and to 0.8.5, as suggested here:

     I had the same problem with VLLM 0.9.1. Downgraded to 0.8.5 in a fresh uv venv and things worked again.
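
The NCCL_P2P_DISABLE toggle from item 1 can be made repeatable with a small helper. This is only a sketch: `nccl_env_variants` is a hypothetical name, and adding NCCL_DEBUG=WARN is an assumption on my part (it is the standard NCCL variable for printing the reason behind an error), not something the thread prescribes.

```python
import os


def nccl_env_variants(base_env=None):
    """Return one environment dict per NCCL_P2P_DISABLE value ("0" and "1").

    Hypothetical helper: it only builds the dicts and launches nothing.
    """
    base = dict(os.environ if base_env is None else base_env)
    variants = {}
    for value in ("0", "1"):
        env = dict(base)
        env["NCCL_P2P_DISABLE"] = value
        # Assumption: WARN-level NCCL logging is wanted for every attempt.
        env["NCCL_DEBUG"] = "WARN"
        variants[value] = env
    return variants


variants = nccl_env_variants({"CUDA_VISIBLE_DEVICES": "1"})
for value, env in variants.items():
    print(value, env["NCCL_P2P_DISABLE"], env["NCCL_DEBUG"])
```

Each dict could then be passed as `env=` to `subprocess.run(["python", "train.py"], env=...)`, relaunching the server between attempts as recommended in the quoted comment.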

The vLLM server starts without errors; the NCCL error appears only after I launch the training script.
Here are my outputs:

Command to launch train script:

CUDA_VISIBLE_DEVICES=1 python train.py

Error in the train script launching terminal window:

Using Liger kernel
torch_dtype is deprecated! Use dtype instead!
Applied Liger kernels to Qwen2
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 94.62it/s]
[2025-09-17 05:28:42,083] [INFO] [real_accelerator.py:260:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-09-17 05:28:43,637] [INFO] [logging.py:107:log_dist] [Rank -1] [TorchCheckpointEngine] Initialized with serialization = False
2025-09-17 05:28:43 - verifiers.trainers.grpo_trainer - INFO - Filtering dataset for prompts with length <= 512
Filter (num_proc=8): 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 49950/49950 [00:03<00:00, 14780.32 examples/s]
2025-09-17 05:28:50 - verifiers.trainers.grpo_trainer - INFO - Dataset size: 49950, global batch size: 0.5
2025-09-17 05:28:50 - verifiers.trainers.grpo_trainer - INFO - Unique prompts per device batch: 0.125, unique prompts per gradient step: 0.5
2025-09-17 05:28:50 - verifiers.trainers.grpo_trainer - INFO - Batches per epoch: 99900.0
2025-09-17 05:28:50 - verifiers.trainers.grpo_trainer - INFO - Steps per epoch: 99900.0 (num_iterations=1)
2025-09-17 05:28:50 - verifiers.trainers.grpo_trainer - INFO - Number of epochs:
INFO 09-17 05:28:50 [__init__.py:216] Automatically detected platform cuda.
2025-09-17 05:28:50 - verifiers.inference.vllm_client - INFO - Server is up!
2025-09-17 05:28:50 - verifiers.inference.vllm_client - INFO - vLLM world size: 2
2025-09-17 05:28:50 - verifiers.inference.vllm_client - INFO - Client rank: 2, total world size: 3
2025-09-17 05:28:50 - verifiers.inference.vllm_client - INFO - Initializing PyNcclCommunicator on device 0, rank 2, world_size 3
INFO 09-17 05:28:50 [__init__.py:1433] Found nccl from library libnccl.so.2
INFO 09-17 05:28:50 [pynccl.py:70] vLLM is using nccl==2.27.3
Traceback (most recent call last):
  File "/home/ubuntu/pienvrl/train.py", line 37, in <module>
    main()
  File "/home/ubuntu/pienvrl/train.py", line 26, in main
    trainer = vf.GRPOTrainer(
              ^^^^^^^^^^^^^^^
  File "/home/ubuntu/pienvrl/pienv/lib/python3.12/site-packages/verifiers/trainers/grpo_trainer.py", line 541, in __init__
    self.vllm_client.init_communicator()
  File "/home/ubuntu/pienvrl/pienv/lib/python3.12/site-packages/verifiers/inference/vllm_client.py", line 186, in init_communicator
    self.pynccl_comm = PyNcclCommunicator(pg, device=device)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/pienvrl/pienv/lib/python3.12/site-packages/vllm/distributed/device_communicators/pynccl.py", line 100, in __init__
    self.comm: ncclComm_t = self.nccl.ncclCommInitRank(
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/pienvrl/pienv/lib/python3.12/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 301, in ncclCommInitRank
    self.NCCL_CHECK(self._funcs["ncclCommInitRank"](ctypes.byref(comm),
  File "/home/ubuntu/pienvrl/pienv/lib/python3.12/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 272, in NCCL_CHECK
    raise RuntimeError(f"NCCL error: {error_str}")
RuntimeError: NCCL error: invalid usage (run with NCCL_DEBUG=WARN for details)
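
The log above reports "vLLM world size: 2" followed by "Client rank: 2, total world size: 3", which is consistent with the trainer joining the communicator as one extra rank after the server's workers. A minimal sketch of that arithmetic, inferred only from these log lines and not taken from the verifiers source (`client_rank_and_world_size` is a hypothetical name):

```python
def client_rank_and_world_size(vllm_world_size: int) -> tuple[int, int]:
    """Return (client_rank, total_world_size) for one trainer client.

    Assumption: server workers hold ranks 0..vllm_world_size-1 and the
    client takes the next rank, as the logged values suggest.
    """
    client_rank = vllm_world_size
    total_world_size = vllm_world_size + 1
    return client_rank, total_world_size


print(client_rank_and_world_size(2))  # (2, 3), matching the log lines above
```

If the server were actually running with a different number of workers than the client was told, the ranks on the two sides would no longer agree, which is one thing worth ruling out for an error raised inside ncclCommInitRank.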

Command to launch the vllm server:

MPLBACKEND="agg" CUDA_VISIBLE_DEVICES=0,1 NCCL_DEBUG=INFO NCCL_P2P_DISABLE=1 NCCL_IB_DISABLE=1 \
vf-vllm --model 'Qwen/Qwen2.5-7B-Instruct' \
--tensor-parallel-size 2 \
--gpu-memory-utilization 0.5 \
--host 0.0.0.0 \
--port 8000 \
--disable-log-requests
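
The server command above exposes GPUs 0 and 1, while the training command uses CUDA_VISIBLE_DEVICES=1. NCCL generally refuses to build a communicator in which two ranks sit on the same physical GPU and reports that as invalid usage; whether that is the cause here is an assumption, not something these logs confirm. A small sketch for spotting such an overlap (both function names are hypothetical):

```python
def parse_visible_devices(value: str) -> list[str]:
    """Split a CUDA_VISIBLE_DEVICES string such as "0,1" into device ids."""
    return [d.strip() for d in value.split(",") if d.strip()]


def shared_devices(server_devices: str, trainer_devices: str) -> list[str]:
    """Return the device ids visible to both the server and the trainer."""
    server = set(parse_visible_devices(server_devices))
    return [d for d in parse_visible_devices(trainer_devices) if d in server]


# Values taken from the two launch commands in this issue.
print(shared_devices("0,1", "1"))  # ['1']
```

An empty result would mean the server and the trainer use disjoint GPUs.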

Error seen in the vllm server window:

INFO 09-17 05:00:06 [__init__.py:216] Automatically detected platform cuda.
WARNING 09-17 05:00:07 [__init__.py:1758] argument '--disable-log-requests' is deprecated and replaced with '--enable-log-requests'. This will be removed in v0.12.0.
INFO 09-17 05:00:07 [api_server.py:122] vLLM API server version 0.10.2
INFO 09-17 05:00:07 [api_server.py:123] args: Namespace(host=None, port=8000, ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, log_level='debug', model='Qwen/Qwen2.5-7B-Instruct', runner='auto', convert='auto', task=None, tokenizer=None, tokenizer_mode='auto', trust_remote_code=False, dtype='auto', seed=None, hf_config_path=None, allowed_local_media_path='', revision=None, code_revision=None, rope_scaling={}, rope_theta=None, tokenizer_revision=None, max_model_len=None, quantization=None, enforce_eager=True, max_seq_len_to_capture=8192, max_logprobs=20, logprobs_mode=<LogprobsMode.RAW_LOGPROBS: 'raw_logprobs'>, disable_sliding_window=False, disable_cascade_attn=False, skip_tokenizer_init=False, enable_prompt_embeds=False, served_model_name=None, disable_async_output_proc=False, config_format='auto', hf_token=None, hf_overrides={}, override_pooler_config=None, logits_processor_pattern=None, generation_config='auto', override_generation_config={}, enable_sleep_mode=False, model_impl='auto', override_attention_dtype=None, logits_processors=None, io_processor_plugin=None, load_format='auto', download_dir=None, safetensors_load_strategy='lazy', model_loader_extra_config={}, ignore_patterns=None, use_tqdm_on_load=True, pt_load_map_location='cpu', guided_decoding_backend='auto', guided_decoding_disable_fallback=False, guided_decoding_disable_any_whitespace=False, guided_decoding_disable_additional_properties=False, reasoning_parser='', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, decode_context_parallel_size=1, data_parallel_size=1, data_parallel_rank=None, data_parallel_start_rank=None, data_parallel_size_local=None, data_parallel_address=None, data_parallel_rpc_port=None, data_parallel_backend='mp', data_parallel_hybrid_lb=False, enable_expert_parallel=False, enable_eplb=False, eplb_config=EPLBConfig(window_size=1000, step_interval=3000, num_redundant_experts=0, 
log_balancedness=False), num_redundant_experts=None, eplb_window_size=None, eplb_step_interval=None, eplb_log_balancedness=None, max_parallel_loading_workers=None, ray_workers_use_nsight=False, disable_custom_all_reduce=False, worker_cls='auto', worker_extension_cls='', enable_multimodal_encoder_data_parallel=False, block_size=None, gpu_memory_utilization=0.9, kv_cache_memory_bytes=None, swap_space=4, kv_cache_dtype='auto', num_gpu_blocks_override=None, enable_prefix_caching=None, prefix_caching_hash_algo='sha256', cpu_offload_gb=0, calculate_kv_scales=False, kv_sharing_fast_prefill=False, mamba_cache_dtype='auto', mamba_ssm_cache_dtype='auto', limit_mm_per_prompt={}, media_io_kwargs={}, mm_processor_kwargs=None, mm_processor_cache_gb=4, disable_mm_preprocessor_cache=False, mm_encoder_tp_mode='weights', interleave_mm_strings=False, skip_mm_profiling=False, enable_lora=None, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, fully_sharded_loras=False, default_mm_loras=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, max_num_batched_tokens=None, max_num_seqs=None, max_num_partial_prefills=1, max_long_partial_prefills=1, cuda_graph_sizes=[], long_prefill_token_threshold=0, num_lookahead_slots=0, scheduler_delay_factor=0.0, preemption_mode=None, scheduling_policy='fcfs', enable_chunked_prefill=None, disable_chunked_mm_input=False, scheduler_cls='vllm.core.scheduler.Scheduler', disable_hybrid_kv_cache_manager=False, async_scheduling=False, speculative_config=None, kv_transfer_config=None, kv_events_config=None, 
compilation_config={"level":null,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":null,"use_inductor":true,"compile_sizes":null,"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":null,"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":null,"cudagraph_copy_inputs":false,"full_cuda_graph":false,"pass_config":{},"max_capture_size":null,"local_cache_dir":null}, additional_config={}, disable_log_stats=False, enable_log_requests=False, disable_log_requests=True)
INFO 09-17 05:00:13 [__init__.py:742] Resolved architecture: Qwen2ForCausalLM
torch_dtype is deprecated! Use dtype instead!
INFO 09-17 05:00:13 [__init__.py:1815] Using max model len 32768
INFO 09-17 05:00:13 [scheduler.py:222] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 09-17 05:00:13 [__init__.py:3400] Cudagraph is disabled under eager mode
(EngineCore_DP0 pid=16357) INFO 09-17 05:00:14 [core.py:654] Waiting for init message from front-end.
(EngineCore_DP0 pid=16357) INFO 09-17 05:00:14 [core.py:76] Initializing a V1 LLM engine (v0.10.2) with config: model='Qwen/Qwen2.5-7B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2.5-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=Qwen/Qwen2.5-7B-Instruct, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":null,"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":0,"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":[],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"pass_config":{},"max_capture_size":0,"local_cache_dir":null}
[W917 05:00:16.488911002 ProcessGroupNCCL.cpp:981] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore_DP0 pid=16357) INFO 09-17 05:00:17 [parallel_state.py:1165] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=16357) WARNING 09-17 05:00:17 [topk_topp_sampler.py:69] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(EngineCore_DP0 pid=16357) INFO 09-17 05:00:17 [gpu_model_runner.py:2338] Starting to load model Qwen/Qwen2.5-7B-Instruct...
(EngineCore_DP0 pid=16357) INFO 09-17 05:00:17 [gpu_model_runner.py:2370] Loading model from scratch...
(EngineCore_DP0 pid=16357) INFO 09-17 05:00:17 [cuda.py:362] Using Flash Attention backend on V1 engine.
(EngineCore_DP0 pid=16357) INFO 09-17 05:00:17 [weight_utils.py:348] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:02,  1.46it/s]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:01<00:01,  1.36it/s]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:02<00:00,  1.30it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.29it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.31it/s]
(EngineCore_DP0 pid=16357)
(EngineCore_DP0 pid=16357) INFO 09-17 05:00:20 [default_loader.py:268] Loading weights took 3.19 seconds
(EngineCore_DP0 pid=16357) INFO 09-17 05:00:21 [gpu_model_runner.py:2392] Model loading took 14.2488 GiB and 3.514301 seconds
(EngineCore_DP0 pid=16357) INFO 09-17 05:00:22 [gpu_worker.py:298] Available KV cache memory: 56.26 GiB
(EngineCore_DP0 pid=16357) INFO 09-17 05:00:22 [kv_cache_utils.py:864] GPU KV cache size: 1,053,392 tokens
(EngineCore_DP0 pid=16357) INFO 09-17 05:00:22 [kv_cache_utils.py:868] Maximum concurrency for 32,768 tokens per request: 32.15x
(EngineCore_DP0 pid=16357) INFO 09-17 05:00:22 [gpu_worker.py:391] Free memory on device (78.66/79.15 GiB) on startup. Desired GPU memory utilization is (0.9, 71.24 GiB). Actual usage is 14.25 GiB for weight, 0.71 GiB for peak activation, 0.02 GiB for non-torch memory, and 0.0 GiB for CUDAGraph memory. Replace gpu_memory_utilization config with --kv-cache-memory=60248635392 to fit into requested memory, or --kv-cache-memory=68224169984 to fully utilize gpu memory. Current kv cache memory in use is 60405921792 bytes.
(EngineCore_DP0 pid=16357) INFO 09-17 05:00:22 [core.py:218] init engine (profile, create kv cache, warmup model) took 1.38 seconds
(EngineCore_DP0 pid=16357) INFO 09-17 05:00:23 [__init__.py:3400] Cudagraph is disabled under eager mode
INFO 09-17 05:00:23 [loggers.py:142] Engine 000: vllm cache_config_info with initialization after num_gpu_blocks is: 65837
INFO 09-17 05:00:23 [async_llm.py:180] Torch profiler disabled. AsyncLLM CPU traces will not be collected.
INFO 09-17 05:00:23 [launcher.py:36] Available routes are:
INFO 09-17 05:00:23 [launcher.py:44] Route: /openapi.json, Methods: HEAD, GET
INFO 09-17 05:00:23 [launcher.py:44] Route: /docs, Methods: HEAD, GET
INFO 09-17 05:00:23 [launcher.py:44] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 09-17 05:00:23 [launcher.py:44] Route: /redoc, Methods: HEAD, GET
INFO 09-17 05:00:23 [launcher.py:44] Route: /health, Methods: GET
INFO 09-17 05:00:23 [launcher.py:44] Route: /generate, Methods: POST
INFO:     Started server process [16088]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000/ (Press CTRL+C to quit)
INFO:     127.0.0.1:50114 - "GET /health HTTP/1.1" 200 OK
INFO:     127.0.0.1:50118 - "GET /get_world_size HTTP/1.1" 404 Not Found
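
The server log lists its available routes and then answers 404 for /get_world_size. A minimal sketch that compares the two sets, using only values copied from this log (`missing_routes` is a hypothetical helper, and why the route is absent is not established here):

```python
# Routes the server reported as available at startup.
available_routes = {
    "/openapi.json",
    "/docs",
    "/docs/oauth2-redirect",
    "/redoc",
    "/health",
    "/generate",
}

# Routes that were requested according to the access log lines.
requested_routes = ["/health", "/get_world_size"]


def missing_routes(requested, available):
    """Return the requested routes that the server did not register."""
    return [route for route in requested if route not in available]


print(missing_routes(requested_routes, available_routes))  # ['/get_world_size']
```

A non-empty result suggests the running server and the client expect different sets of endpoints, so it may be worth checking that this server log comes from the same launch as the command shown above.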

Here is the full code: https://github.com/aryagxr/anagram
