
MoE model inference error with B7 version #256

@jessie-zhao

Description

Test command:
vllm bench serve --model /llm/models/Qwen3-Coder-30B-A3B-Instruct --served-model-name Qwen3-Coder-30B-A3B-Instruct --dataset-name sharegpt --num-prompts 200 --max-concurrency 200 --request-rate $bs --backend vllm --dataset-path /llm/ShareGPT_V3_unfiltered_cleaned_split.json --trust_remote_code --ignore-eos --host 127.0.0.1 --port 8001 --save-detailed --append-result
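
For reference, the engine config dumped in the error log below suggests the server side was launched roughly like this (a reconstruction read off the config dump, not the reporter's actual launch command; the port is taken from the bench command above):

# Hypothetical launch reconstructed from the engine-config dump below
# (tensor_parallel_size=4, quantization=fp8, dtype=float16, max_seq_len=10000,
# enforce_eager=True, disable_custom_all_reduce=True, device_config=xpu).
vllm serve /llm/models/Qwen3-Coder-30B-A3B-Instruct \
    --served-model-name Qwen3-Coder-30B-A3B-Instruct \
    --tensor-parallel-size 4 \
    --quantization fp8 \
    --dtype float16 \
    --max-model-len 10000 \
    --enforce-eager \
    --disable-custom-all-reduce \
    --trust-remote-code \
    --port 8001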

Error:
(APIServer pid=30490) INFO 01-20 03:01:52 [loggers.py:236] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 48 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.9%, Prefix cache hit rate: 87.7%
(EngineCore_DP0 pid=30630) INFO 01-20 03:02:37 [shm_broadcast.py:501] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore_DP0 pid=30630) INFO 01-20 03:03:37 [shm_broadcast.py:501] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore_DP0 pid=30630) INFO 01-20 03:04:37 [shm_broadcast.py:501] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore_DP0 pid=30630) INFO 01-20 03:05:37 [shm_broadcast.py:501] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore_DP0 pid=30630) ERROR 01-20 03:06:37 [dump_input.py:72] Dumping input data for V1 LLM engine (v0.11.2.dev0+g439368496.d20260115) with config: model='/llm/models/Qwen3-Coder-30B-A3B-Instruct', speculative_config=None, tokenizer='/llm/models/Qwen3-Coder-30B-A3B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=10000, download_dir=None, load_format=auto, tensor_parallel_size=4, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=True, quantization=fp8, enforce_eager=True, kv_cache_dtype=auto, device_config=xpu, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=Qwen3-Coder-30B-A3B-Instruct, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'splitting_ops': None, 'compile_mm_encoder': False, 'use_inductor': None, 'compile_sizes': [], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': None, 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {}, 'max_cudagraph_capture_size': None, 'local_cache_dir': None},
(EngineCore_DP0 pid=30630) ERROR 01-20 03:06:37 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[], scheduled_cached_reqs=CachedRequestData(req_ids=['cmpl-bench-be60f4bb-1-0', 'cmpl-bench-be60f4bb-2-0', 'cmpl-bench-be60f4bb-3-0', 'cmpl-bench-be60f4bb-5-0', 'cmpl-bench-be60f4bb-9-0', 'cmpl-bench-be60f4bb-11-0', 'cmpl-bench-be60f4bb-13-0', 'cmpl-bench-be60f4bb-15-0', 'cmpl-bench-be60f4bb-16-0', 'cmpl-bench-be60f4bb-17-0', 'cmpl-bench-be60f4bb-19-0', 'cmpl-bench-be60f4bb-20-0', 'cmpl-bench-be60f4bb-21-0', 'cmpl-bench-be60f4bb-22-0', 'cmpl-bench-be60f4bb-25-0', 'cmpl-bench-be60f4bb-26-0', 'cmpl-bench-be60f4bb-27-0', 'cmpl-bench-be60f4bb-29-0', 'cmpl-bench-be60f4bb-31-0', 'cmpl-bench-be60f4bb-32-0', 'cmpl-bench-be60f4bb-33-0', 'cmpl-bench-be60f4bb-34-0', 'cmpl-bench-be60f4bb-36-0', 'cmpl-bench-be60f4bb-37-0', 'cmpl-bench-be60f4bb-40-0', 'cmpl-bench-be60f4bb-41-0', 'cmpl-bench-be60f4bb-42-0', 'cmpl-bench-be60f4bb-43-0', 'cmpl-bench-be60f4bb-44-0', 'cmpl-bench-be60f4bb-46-0', 'cmpl-bench-be60f4bb-48-0', 'cmpl-bench-be60f4bb-50-0', 'cmpl-bench-be60f4bb-51-0', 'cmpl-bench-be60f4bb-52-0', 'cmpl-bench-be60f4bb-53-0', 'cmpl-bench-be60f4bb-55-0', 'cmpl-bench-be60f4bb-57-0', 'cmpl-bench-be60f4bb-59-0', 'cmpl-bench-be60f4bb-60-0', 'cmpl-bench-be60f4bb-61-0', 'cmpl-bench-be60f4bb-62-0', 'cmpl-bench-be60f4bb-63-0', 'cmpl-bench-be60f4bb-64-0', 'cmpl-bench-be60f4bb-65-0', 'cmpl-bench-be60f4bb-66-0', 'cmpl-bench-be60f4bb-67-0', 'cmpl-bench-be60f4bb-68-0', 'cmpl-bench-be60f4bb-69-0'], resumed_req_ids=[], new_token_ids=[], all_token_ids={}, new_block_ids=[null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null], num_computed_tokens=[145, 145, 126, 142, 116, 509, 101, 318, 96, 99, 729, 130, 144, 829, 406, 70, 66, 133, 367, 89, 666, 55, 60, 59, 45, 40, 311, 42, 175, 818, 31, 31, 66, 271, 831, 106, 32, 25, 85, 40, 756, 186, 15, 633, 482, 44, 335, 23], num_output_tokens=[122, 120, 118, 115, 102, 98, 91, 90, 90, 87, 80, 73, 70, 68, 63, 63, 58, 55, 52, 51, 49, 49, 46, 44, 36, 35, 32, 32, 30, 28, 27, 24, 23, 15, 15, 14, 12, 11, 10, 10, 8, 8, 8, 7, 4, 4, 1, 1]), num_scheduled_tokens={cmpl-bench-be60f4bb-36-0: 1, cmpl-bench-be60f4bb-16-0: 1, cmpl-bench-be60f4bb-50-0: 1, cmpl-bench-be60f4bb-67-0: 1, cmpl-bench-be60f4bb-21-0: 1, cmpl-bench-be60f4bb-1-0: 1, cmpl-bench-be60f4bb-2-0: 1, cmpl-bench-be60f4bb-41-0: 1, cmpl-bench-be60f4bb-51-0: 1, cmpl-bench-be60f4bb-59-0: 1, cmpl-bench-be60f4bb-52-0: 1, cmpl-bench-be60f4bb-40-0: 1, cmpl-bench-be60f4bb-66-0: 1, cmpl-bench-be60f4bb-11-0: 1, cmpl-bench-be60f4bb-55-0: 1, cmpl-bench-be60f4bb-34-0: 1, cmpl-bench-be60f4bb-64-0: 1, cmpl-bench-be60f4bb-69-0: 1, cmpl-bench-be60f4bb-65-0: 1, cmpl-bench-be60f4bb-33-0: 1, cmpl-bench-be60f4bb-43-0: 1, cmpl-bench-be60f4bb-53-0: 1, cmpl-bench-be60f4bb-46-0: 1, cmpl-bench-be60f4bb-19-0: 1, cmpl-bench-be60f4bb-20-0: 1, cmpl-bench-be60f4bb-27-0: 1, cmpl-bench-be60f4bb-31-0: 1, cmpl-bench-be60f4bb-9-0: 1, cmpl-bench-be60f4bb-29-0: 1, cmpl-bench-be60f4bb-63-0: 1, cmpl-bench-be60f4bb-60-0: 1, cmpl-bench-be60f4bb-13-0: 1, cmpl-bench-be60f4bb-15-0: 1, cmpl-bench-be60f4bb-32-0: 1, cmpl-bench-be60f4bb-22-0: 1, cmpl-bench-be60f4bb-25-0: 1, cmpl-bench-be60f4bb-62-0: 1, cmpl-bench-be60f4bb-26-0: 1, cmpl-bench-be60f4bb-37-0: 1, cmpl-bench-be60f4bb-17-0: 1, cmpl-bench-be60f4bb-61-0: 1, cmpl-bench-be60f4bb-68-0: 1, cmpl-bench-be60f4bb-44-0: 1, cmpl-bench-be60f4bb-42-0: 1, cmpl-bench-be60f4bb-48-0: 1, cmpl-bench-be60f4bb-57-0: 1, cmpl-bench-be60f4bb-3-0: 1, cmpl-bench-be60f4bb-5-0: 1}, total_num_scheduled_tokens=48, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[0], finished_req_ids=[], free_encoder_mm_hashes=[], pending_structured_output_tokens=false, kv_connector_metadata=null, ec_connector_metadata=null)
(EngineCore_DP0 pid=30630) ERROR 01-20 03:06:37 [dump_input.py:81] Dumping scheduler stats: SchedulerStats(num_running_reqs=48, num_waiting_reqs=0, step_counter=0, current_wave=0, kv_cache_usage=0.02938503483792787, prefix_cache_stats=PrefixCacheStats(reset=False, requests=0, queries=0, hits=0, preempted_requests=0, preempted_queries=0, preempted_hits=0), connector_prefix_cache_stats=None, spec_decoding_stats=None, kv_connector_stats=None, waiting_lora_adapters={}, running_lora_adapters={})
(EngineCore_DP0 pid=30630) ERROR 01-20 03:06:37 [core.py:844] EngineCore encountered a fatal error.
(EngineCore_DP0 pid=30630) ERROR 01-20 03:06:37 [core.py:844] Traceback (most recent call last):
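
One way to narrow down the hang (a suggestion, not part of the original report): while the engine is looping on the "No available shared memory broadcast block" warning, dump the Python stacks of the EngineCore process and each tensor-parallel worker to see which collective or kernel the stuck rank is blocked in, for example with py-spy:

# Hypothetical triage step: dump stacks of the hung EngineCore process
# (pid 30630 from the log above); repeat for each TP worker pid.
py-spy dump --pid 30630 --native

# vLLM's own hang-debugging switches can also help on a rerun:
export VLLM_LOGGING_LEVEL=DEBUG
export VLLM_TRACE_FUNCTION=1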
