Description
Your current environment
The output of `python collect_env.py` from vllm-ascend
Collecting environment information...
PyTorch version: 2.7.1+cpu
Is debug build: False
OS: openEuler 24.03 (LTS-SP2) (aarch64)
GCC version: (GCC) 10.3.1
Clang version: Could not collect
CMake version: version 4.1.2
Libc version: glibc-2.38
Python version: 3.11.13 (main, Nov 2 2025, 08:49:25) [GCC 12.3.1 (openEuler 12.3.1-98.oe2403sp2)] (64-bit runtime)
Python platform: Linux-4.19.90-vhulk2211.3.0.h1912.eulerosv2r10.aarch64-aarch64-with-glibc2.38
CPU:
Architecture: aarch64
CPU op-mode(s): 64-bit
Byte Order: Little Endian
CPU(s): 192
On-line CPU(s) list: 0-191
Vendor ID: HiSilicon
BIOS Vendor ID: HiSilicon
Model name: Kunpeng-920
BIOS Model name: HUAWEI Kunpeng 920 5250 To be filled by O.E.M. CPU @ 2.6GHz
BIOS CPU family: 280
Model: 0
Thread(s) per core: 1
Core(s) per socket: 48
Socket(s): 4
Stepping: 0x1
BogoMIPS: 200.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm ssbs
L1d cache: 12 MiB (192 instances)
L1i cache: 12 MiB (192 instances)
L2 cache: 96 MiB (192 instances)
L3 cache: 192 MiB (8 instances)
NUMA node(s): 8
NUMA node0 CPU(s): 0-23
NUMA node1 CPU(s): 24-47
NUMA node2 CPU(s): 48-71
NUMA node3 CPU(s): 72-95
NUMA node4 CPU(s): 96-119
NUMA node5 CPU(s): 120-143
NUMA node6 CPU(s): 144-167
NUMA node7 CPU(s): 168-191
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; __user pointer sanitization
Vulnerability Spectre v2: Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] pyzmq==27.1.0
[pip3] torch==2.7.1
[pip3] torch_npu==2.7.1
[pip3] torchaudio==2.8.0
[pip3] torchvision==0.22.1
[pip3] transformers==4.57.1
[conda] Could not collect
vLLM Version: 0.11.0
vLLM Ascend Version: 0.11.0
ENV Variables:
ATB_OPSRUNNER_KERNEL_CACHE_LOCAL_COUNT=1
ATB_STREAM_SYNC_EVERY_RUNNER_ENABLE=0
ATB_OPSRUNNER_SETUP_CACHE_ENABLE=1
ATB_WORKSPACE_MEM_ALLOC_GLOBAL=1
ATB_DEVICE_TILING_BUFFER_BLOCK_NUM=32
ATB_STREAM_SYNC_EVERY_KERNEL_ENABLE=0
ATB_OPSRUNNER_KERNEL_CACHE_GLOABL_COUNT=5
ATB_HOME_PATH=/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1
ASCEND_TOOLKIT_HOME=/usr/local/Ascend/ascend-toolkit/latest
ATB_COMPARE_TILING_EVERY_KERNEL=0
ASCEND_OPP_PATH=/usr/local/Ascend/ascend-toolkit/latest/opp
LD_LIBRARY_PATH=/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/lib:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/examples:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/tests/atbopstest:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64/plugin:/usr/local/Ascend/ascend-toolkit/latest/lib64:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/opskernel:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/nnengine:/usr/local/Ascend/ascend-toolkit/latest/opp/built-in/op_impl/ai_core/tbe/op_tiling/lib/linux/aarch64:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/lib:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/examples:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/tests/atbopstest:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64/plugin:/usr/local/Ascend/ascend-toolkit/latest/lib64:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/opskernel:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/nnengine:/usr/local/Ascend/ascend-toolkit/latest/opp/built-in/op_impl/ai_core/tbe/op_tiling:/usr/local/Ascend/driver/lib64/common/:/usr/local/Ascend/driver/lib64/driver/:
ASCEND_AICPU_PATH=/usr/local/Ascend/ascend-toolkit/latest
ATB_STREAM_SYNC_EVERY_OPERATION_ENABLE=0
ASCEND_HOME_PATH=/usr/local/Ascend/ascend-toolkit/latest
ATB_MATMUL_SHUFFLE_K_ENABLE=1
ATB_WORKSPACE_MEM_ALLOC_ALG_TYPE=1
ATB_HOST_TILING_BUFFER_BLOCK_NUM=128
ATB_SHARE_MEMORY_NAME_SUFFIX=
TORCH_DEVICE_BACKEND_AUTOLOAD=1
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
NPU:
+------------------------------------------------------------------------------------------------+
| npu-smi 24.1.rc2 Version: 24.1.rc2 |
+---------------------------+---------------+----------------------------------------------------+
CANN:
package_name=Ascend-cann-toolkit
version=8.3.RC1
innerversion=V100R001C23SPC001B235
compatible_version=[V100R001C15],[V100R001C18],[V100R001C19],[V100R001C20],[V100R001C21],[V100R001C23]
arch=aarch64
os=linux
path=/usr/local/Ascend/ascend-toolkit/8.3.RC1/aarch64-linux
The output of `python collect_env.py` from vllm
Collecting environment information...
==============================
System Info
==============================
OS : openEuler 24.03 (LTS-SP2) (aarch64)
GCC version : (GCC) 10.3.1
Clang version : Could not collect
CMake version : version 4.1.2
Libc version : glibc-2.38
==============================
PyTorch Info
==============================
PyTorch version : 2.7.1+cpu
Is debug build : False
CUDA used to build PyTorch : None
ROCM used to build PyTorch : N/A
==============================
Python Environment
==============================
Python version : 3.11.13 (main, Nov 2 2025, 08:49:25) [GCC 12.3.1 (openEuler 12.3.1-98.oe2403sp2)] (64-bit runtime)
Python platform : Linux-4.19.90-vhulk2211.3.0.h1912.eulerosv2r10.aarch64-aarch64-with-glibc2.38
==============================
CUDA / GPU Info
==============================
Is CUDA available : False
CUDA runtime version : No CUDA
CUDA_MODULE_LOADING set to : N/A
GPU models and configuration : No CUDA
Nvidia driver version : No CUDA
cuDNN version : No CUDA
HIP runtime version : N/A
MIOpen runtime version : N/A
Is XNNPACK available : True
==============================
CPU Info
==============================
Architecture: aarch64
CPU op-mode(s): 64-bit
Byte Order: Little Endian
CPU(s): 192
On-line CPU(s) list: 0-191
Vendor ID: HiSilicon
BIOS Vendor ID: HiSilicon
Model name: Kunpeng-920
BIOS Model name: HUAWEI Kunpeng 920 5250 To be filled by O.E.M. CPU @ 2.6GHz
BIOS CPU family: 280
Model: 0
Thread(s) per core: 1
Core(s) per socket: 48
Socket(s): 4
Stepping: 0x1
BogoMIPS: 200.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm ssbs
L1d cache: 12 MiB (192 instances)
L1i cache: 12 MiB (192 instances)
L2 cache: 96 MiB (192 instances)
L3 cache: 192 MiB (8 instances)
NUMA node(s): 8
NUMA node0 CPU(s): 0-23
NUMA node1 CPU(s): 24-47
NUMA node2 CPU(s): 48-71
NUMA node3 CPU(s): 72-95
NUMA node4 CPU(s): 96-119
NUMA node5 CPU(s): 120-143
NUMA node6 CPU(s): 144-167
NUMA node7 CPU(s): 168-191
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; __user pointer sanitization
Vulnerability Spectre v2: Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
==============================
Versions of relevant libraries
==============================
[pip3] numpy==1.26.4
[pip3] pyzmq==27.1.0
[pip3] torch==2.7.1
[pip3] torch_npu==2.7.1
[pip3] torchaudio==2.8.0
[pip3] torchvision==0.22.1
[pip3] transformers==4.57.1
[conda] Could not collect
==============================
vLLM Info
==============================
ROCM Version : Could not collect
vLLM Version : 0.11.0
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled
GPU Topology:
Could not collect
==============================
Environment Variables
==============================
LD_LIBRARY_PATH=/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/lib:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/examples:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/tests/atbopstest:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64/plugin:/usr/local/Ascend/ascend-toolkit/latest/lib64:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/opskernel:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/nnengine:/usr/local/Ascend/ascend-toolkit/latest/opp/built-in/op_impl/ai_core/tbe/op_tiling/lib/linux/aarch64:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/lib:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/examples:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/tests/atbopstest:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64/plugin:/usr/local/Ascend/ascend-toolkit/latest/lib64:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/opskernel:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/nnengine:/usr/local/Ascend/ascend-toolkit/latest/opp/built-in/op_impl/ai_core/tbe/op_tiling:/usr/local/Ascend/driver/lib64/common/:/usr/local/Ascend/driver/lib64/driver/:
TORCH_DEVICE_BACKEND_AUTOLOAD=1
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
🐛 Describe the bug
Run the following commands to reproduce the error:
1E2PD

ASCEND_RT_VISIBLE_DEVICES="$GPU_E" vllm serve "$MODEL" \
  --gpu-memory-utilization 0.0 \
  --port "$ENCODE_PORT" \
  --enforce-eager \
  --enable-request-id-headers \
  --no-enable-prefix-caching \
  --max-num-batched-tokens 10000 \
  --max-num-seqs 128 \
  --max-model-len 10000 \
  --ec-transfer-config '{
    "ec_connector": "ECMooncakeStorageConnector",
    "ec_role": "ec_producer",
    "ec_connector_extra_config": {
      "ec_mooncake_config_file_path": "'${SCRIPT_DIR}'/producer.json",
      "ec_max_num_scheduled_tokens": "1000000000000000000"
    }
  }' \
  >"${ENC_LOG}" 2>&1 &

ASCEND_RT_VISIBLE_DEVICES="$GPU_PD" VLLM_NIXL_SIDE_CHANNEL_PORT=6000 vllm serve "$MODEL" \
  --gpu-memory-utilization 0.98 \
  --port "$PREFILL_DECODE_PORT" \
  --enforce-eager \
  --enable-request-id-headers \
  --max-num-seqs 128 \
  --max-num-batched-tokens 10000 \
  --max-model-len 10000 \
  --no-enable-prefix-caching \
  --ec-transfer-config '{
    "ec_connector": "ECMooncakeStorageConnector",
    "ec_role": "ec_consumer",
    "ec_connector_extra_config": {
      "ec_mooncake_config_file_path": "'${SCRIPT_DIR}'/consumer.json"
    }
  }' \
  >"${PD_LOG}" 2>&1 &

benchmark
acc_cases = [{
    "case_type": "accuracy",
    "dataset_path": os.path.join(DATASET_PATH, "textvqa_subset"),
    "request_conf": "vllm_api_general_chat",
    "dataset_conf": "textvqa/textvqa_gen_base64",
    "max_out_len": 2048,
    "batch_size": 32,
    "temperature": 0,
    "top_k": -1,
    "top_p": 1,
    "repetition_penalty": 1,
    "request_rate": 0,
    "seed": 77,
    "baseline": 81,
    "threshold": 1,
}]
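For context, a minimal sketch of how a `baseline`/`threshold` accuracy gate like the one above is typically evaluated; `within_threshold` is a hypothetical helper for illustration, not the actual benchmark harness used here.

```python
# Hypothetical sketch of a "baseline"/"threshold" accuracy gate.
# The real benchmark harness is not shown in this report.
def within_threshold(measured: float, baseline: float, threshold: float) -> bool:
    """Pass if the measured score is no more than `threshold` below baseline."""
    return measured >= baseline - threshold

# With the case above (baseline=81, threshold=1):
print(within_threshold(80.0, 81, 1))  # True  (within 1 point of baseline)
print(within_threshold(79.9, 81, 1))  # False (regression beyond threshold)
```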
Error output:
(APIServer pid=33795) INFO: 127.0.0.1:44520 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=33795) INFO: 127.0.0.1:46526 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(EngineCore_DP0 pid=34250) WARNING 11-14 16:46:40 [mooncake_storage_connector.py:69] ('In connector.start_load_caches, ', 'but the connector metadata has no mm_datas')
(APIServer pid=33795) INFO: 127.0.0.1:45346 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=33795) INFO: 127.0.0.1:46530 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=33795) INFO: 127.0.0.1:44510 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[rank0]:[E1114 16:46:40.652064787 compiler_depend.ts:444] NPU out of memory. NPUWorkspaceAllocator tried to allocate 166.91 MiB(NPU 0; 29.50 GiB total capacity; 248.07 MiB free). If you want to reduce memory usage, take a try to set the environment variable TASK_QUEUE_ENABLE=1.
[ERROR] 2025-11-14-16:46:40 (PID:34250, Device:0, RankID:-1) ERR00006 PTA memory error
Exception raised from malloc at build/CMakeFiles/torch_npu.dir/compiler_depend.ts:426 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xd4 (0xffff98e03ea4 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xe4 (0xffff98da3e44 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: + 0x980670 (0xfffded8e0670 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
frame #3: + 0x980e04 (0xfffded8e0e04 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
frame #4: + 0x97af2c (0xfffded8daf2c in /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
frame #5: + 0x2735aa0 (0xfffdef695aa0 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
frame #6: at_npu::native::allocate_workspace(unsigned long, void*) + 0x28 (0xfffded8d84d8 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
frame #7: + 0x9ecf8 (0xfffdd9b5ecf8 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libop_plugin_atb.so)
frame #8: + 0x26e6c10 (0xfffdef646c10 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
frame #9: + 0x961a94 (0xfffded8c1a94 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
frame #10: + 0x9644c0 (0xfffded8c44c0 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
frame #11: + 0x96072c (0xfffded8c072c in /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
frame #12: + 0xcf25c (0xffff98c3f25c in /usr/lib64/libstdc++.so.6)
frame #13: + 0x7fbb4 (0xffffa4f9fbb4 in /usr/lib64/libc.so.6)
frame #14: + 0xe79dc (0xffffa50079dc in /usr/lib64/libc.so.6)
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [dump_input.py:69] Dumping input data for V1 LLM engine (v0.11.0) with config: model='/data/models/Qwen2.5-VL-7B-Instruct', speculative_config=None, tokenizer='/data/models/Qwen2.5-VL-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=10000, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=npu, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/data/models/Qwen2.5-VL-7B-Instruct, enable_prefix_caching=False, chunked_prefill_enabled=True, pooler_config=None, compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":["all"],"splitting_ops":null,"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":0,"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"use_inductor_graph_partition":false,"pass_config":{},"max_capture_size":0,"local_cache_dir":null},
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [dump_input.py:76] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[], scheduled_cached_reqs=CachedRequestData(req_ids=['chatcmpl-db4f9bfe-0728-48a2-9ed1-fb868fb56643', 'chatcmpl-5b562f28-360b-44a1-8e82-0a2edc6d1d32', 'chatcmpl-1b16ef2b-f385-4505-bb0a-eb17b7b8007f', 'chatcmpl-b00a1f46-828b-4756-bcff-de522f60f68f', 'chatcmpl-61f63aec-da89-48b1-8f29-8693704b0866', 'chatcmpl-b2151b82-2ade-473c-b6eb-bef0c422163d', 'chatcmpl-6c4acd84-2946-4029-96c3-2d934b05e544', 'chatcmpl-45ce2d2e-e94b-4955-80dc-b92203f46160', 'chatcmpl-b32d86f4-2baa-4015-bfc0-a91687bcde78', 'chatcmpl-134428e4-7c44-4ed7-bbb0-3b9d3231a463', 'chatcmpl-a6c08e31-653f-432a-9641-4868f15d7155', 'chatcmpl-b8c9445b-4a53-443b-b4ed-e259086d629d', 'chatcmpl-5ba69b99-71cc-421f-8dfb-e6ffebd8229d', 'chatcmpl-84f98c5c-c529-410f-bd3d-22ae7ae18f4b', 'chatcmpl-8f2896a4-4ccc-4cfe-aa61-66c6fe779457', 'chatcmpl-e03d398e-70e8-4089-85c2-a564977d2e8a', 'chatcmpl-6aabbf98-18e3-42dc-af14-7e9db78e97b1'], resumed_from_preemption=[false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false], new_token_ids=[], new_block_ids=[null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null], num_computed_tokens=[931, 966, 1041, 1041, 1041, 1043, 930, 928, 1037, 1037, 1039, 925, 1042, 926, 1042, 1371, 1002]), num_scheduled_tokens={chatcmpl-5ba69b99-71cc-421f-8dfb-e6ffebd8229d: 1, chatcmpl-db4f9bfe-0728-48a2-9ed1-fb868fb56643: 1, chatcmpl-b8c9445b-4a53-443b-b4ed-e259086d629d: 1, chatcmpl-b00a1f46-828b-4756-bcff-de522f60f68f: 1, chatcmpl-b32d86f4-2baa-4015-bfc0-a91687bcde78: 1, chatcmpl-8f2896a4-4ccc-4cfe-aa61-66c6fe779457: 1, chatcmpl-e03d398e-70e8-4089-85c2-a564977d2e8a: 1, chatcmpl-1b16ef2b-f385-4505-bb0a-eb17b7b8007f: 1, chatcmpl-a6c08e31-653f-432a-9641-4868f15d7155: 1, chatcmpl-134428e4-7c44-4ed7-bbb0-3b9d3231a463: 1, 
chatcmpl-b2151b82-2ade-473c-b6eb-bef0c422163d: 1, chatcmpl-6aabbf98-18e3-42dc-af14-7e9db78e97b1: 1, chatcmpl-45ce2d2e-e94b-4955-80dc-b92203f46160: 1, chatcmpl-5b562f28-360b-44a1-8e82-0a2edc6d1d32: 1, chatcmpl-61f63aec-da89-48b1-8f29-8693704b0866: 1, chatcmpl-6c4acd84-2946-4029-96c3-2d934b05e544: 1, chatcmpl-84f98c5c-c529-410f-bd3d-22ae7ae18f4b: 1}, total_num_scheduled_tokens=17, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[0], finished_req_ids=['chatcmpl-d890cf90-df89-4e75-ab69-9924960f62fa', 'chatcmpl-b8d32886-1b60-4407-88db-591862a0716c', 'chatcmpl-903bb957-bbe3-4e29-b07b-f0c5cce4ca0f', 'chatcmpl-9e934981-edef-4fcb-a2fe-02350992b8cb', 'chatcmpl-df40c129-60f9-4979-81b7-d28c66b21e43', 'chatcmpl-ce3771e9-0811-4cd1-9027-d8958b4da14e', 'chatcmpl-a201131f-d362-475a-8a80-a4bd6b031655', 'chatcmpl-de73026a-f887-4f70-8290-07c0dd10168a'], free_encoder_mm_hashes=[], structured_output_request_ids={}, grammar_bitmask=null, kv_connector_metadata=null, ec_connector_metadata=ECMooncakeStorageConnectorMetadata(mm_datas=[]))
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [dump_input.py:79] Dumping scheduler stats: SchedulerStats(num_running_reqs=17, num_waiting_reqs=0, step_counter=0, current_wave=0, kv_cache_usage=0.11708860759493667, prefix_cache_stats=PrefixCacheStats(reset=False, requests=0, queries=0, hits=0), spec_decoding_stats=None, kv_connector_stats=None, num_corrupted_reqs=0)
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] EngineCore encountered a fatal error.
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] Traceback (most recent call last):
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 701, in run_engine_core
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] engine_core.run_busy_loop()
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 728, in run_busy_loop
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] self._process_engine_step()
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 754, in _process_engine_step
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] outputs, model_executed = self.step_fn()
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 284, in step
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] model_output = self.execute_model_with_error_logging(
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 270, in execute_model_with_error_logging
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] raise err
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 261, in execute_model_with_error_logging
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] return model_fn(scheduler_output)
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] ^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] File "/vllm-workspace/vllm/vllm/v1/executor/abstract.py", line 103, in execute_model
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] output = self.collective_rpc("execute_model",
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] File "/vllm-workspace/vllm/vllm/executor/uniproc_executor.py", line 83, in collective_rpc
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] return [run_method(self.driver_worker, method, args, kwargs)]
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] File "/vllm-workspace/vllm/vllm/utils/__init__.py", line 3122, in run_method
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] return func(*args, **kwargs)
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/worker_v1.py", line 257, in execute_model
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] output = self.model_runner.execute_model(scheduler_output,
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] return func(*args, **kwargs)
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 2133, in execute_model
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] valid_sampled_token_ids = sampled_token_ids.tolist()
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] ^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] RuntimeError: The Inner error is reported as above. The process exits for this inner error, and the current working operator name is PagedAttentionOperation.
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] Since the operator is called asynchronously, the stacktrace may be inaccurate. If you want to get the accurate stacktrace, please set the environment variable ASCEND_LAUNCH_BLOCKING=1.
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] Note: ASCEND_LAUNCH_BLOCKING=1 will force ops to run in synchronous mode, resulting in performance degradation. Please unset ASCEND_LAUNCH_BLOCKING in time after debugging.
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] [ERROR] 2025-11-14-16:46:40 (PID:34250, Device:0, RankID:-1) ERR00100 PTA call acl api failed.
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] [PID: 34250] 2025-11-14-16:46:40.305.301 Memory_Allocation_Failure(EL0004): Failed to allocate memory.
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] Possible Cause: Available memory is insufficient.
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] Solution: Close applications not in use.
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] TraceBack (most recent call last):
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] alloc device memory failed, runtime result = 207001[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:162]
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710]
(EngineCore_DP0 pid=34250) Process EngineCore_DP0:
(APIServer pid=33795) ERROR 11-14 16:46:40 [async_llm.py:480] AsyncLLM output_handler failed.
(APIServer pid=33795) ERROR 11-14 16:46:40 [async_llm.py:480] Traceback (most recent call last):
(APIServer pid=33795) ERROR 11-14 16:46:40 [async_llm.py:480] File "/vllm-workspace/vllm/vllm/v1/engine/async_llm.py", line 439, in output_handler
(APIServer pid=33795) ERROR 11-14 16:46:40 [async_llm.py:480] outputs = await engine_core.get_output_async()
(APIServer pid=33795) ERROR 11-14 16:46:40 [async_llm.py:480] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=33795) ERROR 11-14 16:46:40 [async_llm.py:480] File "/vllm-workspace/vllm/vllm/v1/engine/core_client.py", line 846, in get_output_async
(APIServer pid=33795) ERROR 11-14 16:46:40 [async_llm.py:480] raise self._format_exception(outputs) from None
(APIServer pid=33795) ERROR 11-14 16:46:40 [async_llm.py:480] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(EngineCore_DP0 pid=34250) Traceback (most recent call last):
(EngineCore_DP0 pid=34250) File "/usr/local/python3.11.13/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=34250) self.run()
(EngineCore_DP0 pid=34250) File "/usr/local/python3.11.13/lib/python3.11/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=34250) self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=34250) File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 712, in run_engine_core
(EngineCore_DP0 pid=34250) raise e
(EngineCore_DP0 pid=34250) File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 701, in run_engine_core
(EngineCore_DP0 pid=34250) engine_core.run_busy_loop()
(EngineCore_DP0 pid=34250) File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 728, in run_busy_loop
(EngineCore_DP0 pid=34250) self._process_engine_step()
(EngineCore_DP0 pid=34250) File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 754, in _process_engine_step
(EngineCore_DP0 pid=34250) outputs, model_executed = self.step_fn()
(EngineCore_DP0 pid=34250) ^^^^^^^^^^^^^^
(APIServer pid=33795) INFO: 127.0.0.1:46534 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
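A rough back-of-the-envelope check of the OOM numbers above, assuming `--gpu-memory-utilization` is treated as a fraction of total device memory: with 0.98 on a 29.50 GiB device, only about 600 MiB is left outside vLLM's budget, so a transient PagedAttention workspace allocation can exhaust it. Note the failed 166.91 MiB request is nominally smaller than the 248.07 MiB reported free, which also suggests fragmentation (no single contiguous block of that size).

```python
# Back-of-the-envelope check of the numbers in the error log above.
# Assumption: --gpu-memory-utilization is a fraction of total device memory.
total_mib = 29.50 * 1024                 # 29.50 GiB total capacity, in MiB
headroom_mib = (1 - 0.98) * total_mib    # memory left outside vLLM's budget

print(f"headroom outside vLLM budget: {headroom_mib:.1f} MiB")  # ~604.2 MiB

# The allocator reported 248.07 MiB free when a 166.91 MiB workspace request
# failed; since the request fits within the reported free total, the failure
# also points at fragmentation rather than pure exhaustion.
requested_mib, free_mib = 166.91, 248.07
print(requested_mib < free_mib)  # True
```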
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.