[Bug]: With vllm gpu-memory-utilization set to 0.98, after launching Qwen2.5-VL-7B-Instruct in a 1E2PD setup and sending requests at concurrency 32 against the textvqa_subset dataset, the PD instance OOMs and exits after a while #143

@yenuo26

Description

Your current environment

The output of vllm-ascend python collect_env.py
Collecting environment information...
PyTorch version: 2.7.1+cpu
Is debug build: False

OS: openEuler 24.03 (LTS-SP2) (aarch64)
GCC version: (GCC) 10.3.1
Clang version: Could not collect
CMake version: version 4.1.2
Libc version: glibc-2.38

Python version: 3.11.13 (main, Nov  2 2025, 08:49:25) [GCC 12.3.1 (openEuler 12.3.1-98.oe2403sp2)] (64-bit runtime)
Python platform: Linux-4.19.90-vhulk2211.3.0.h1912.eulerosv2r10.aarch64-aarch64-with-glibc2.38

CPU:
Architecture:                       aarch64
CPU op-mode(s):                     64-bit
Byte Order:                         Little Endian
CPU(s):                             192
On-line CPU(s) list:                0-191
Vendor ID:                          HiSilicon
BIOS Vendor ID:                     HiSilicon
Model name:                         Kunpeng-920
BIOS Model name:                    HUAWEI Kunpeng 920 5250 To be filled by O.E.M. CPU @ 2.6GHz
BIOS CPU family:                    280
Model:                              0
Thread(s) per core:                 1
Core(s) per socket:                 48
Socket(s):                          4
Stepping:                           0x1
BogoMIPS:                           200.00
Flags:                              fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm ssbs
L1d cache:                          12 MiB (192 instances)
L1i cache:                          12 MiB (192 instances)
L2 cache:                           96 MiB (192 instances)
L3 cache:                           192 MiB (8 instances)
NUMA node(s):                       8
NUMA node0 CPU(s):                  0-23
NUMA node1 CPU(s):                  24-47
NUMA node2 CPU(s):                  48-71
NUMA node3 CPU(s):                  72-95
NUMA node4 CPU(s):                  96-119
NUMA node5 CPU(s):                  120-143
NUMA node6 CPU(s):                  144-167
NUMA node7 CPU(s):                  168-191
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:           Mitigation; __user pointer sanitization
Vulnerability Spectre v2:           Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] pyzmq==27.1.0
[pip3] torch==2.7.1
[pip3] torch_npu==2.7.1
[pip3] torchaudio==2.8.0
[pip3] torchvision==0.22.1
[pip3] transformers==4.57.1
[conda] Could not collect
vLLM Version: 0.11.0
vLLM Ascend Version: 0.11.0

ENV Variables:
ATB_OPSRUNNER_KERNEL_CACHE_LOCAL_COUNT=1
ATB_STREAM_SYNC_EVERY_RUNNER_ENABLE=0
ATB_OPSRUNNER_SETUP_CACHE_ENABLE=1
ATB_WORKSPACE_MEM_ALLOC_GLOBAL=1
ATB_DEVICE_TILING_BUFFER_BLOCK_NUM=32
ATB_STREAM_SYNC_EVERY_KERNEL_ENABLE=0
ATB_OPSRUNNER_KERNEL_CACHE_GLOABL_COUNT=5
ATB_HOME_PATH=/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1
ASCEND_TOOLKIT_HOME=/usr/local/Ascend/ascend-toolkit/latest
ATB_COMPARE_TILING_EVERY_KERNEL=0
ASCEND_OPP_PATH=/usr/local/Ascend/ascend-toolkit/latest/opp
LD_LIBRARY_PATH=/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/lib:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/examples:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/tests/atbopstest:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64/plugin:/usr/local/Ascend/ascend-toolkit/latest/lib64:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/opskernel:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/nnengine:/usr/local/Ascend/ascend-toolkit/latest/opp/built-in/op_impl/ai_core/tbe/op_tiling/lib/linux/aarch64:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/lib:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/examples:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/tests/atbopstest:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64/plugin:/usr/local/Ascend/ascend-toolkit/latest/lib64:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/opskernel:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/nnengine:/usr/local/Ascend/ascend-toolkit/latest/opp/built-in/op_impl/ai_core/tbe/op_tiling:/usr/local/Ascend/driver/lib64/common/:/usr/local/Ascend/driver/lib64/driver/:
ASCEND_AICPU_PATH=/usr/local/Ascend/ascend-toolkit/latest
ATB_STREAM_SYNC_EVERY_OPERATION_ENABLE=0
ASCEND_HOME_PATH=/usr/local/Ascend/ascend-toolkit/latest
ATB_MATMUL_SHUFFLE_K_ENABLE=1
ATB_WORKSPACE_MEM_ALLOC_ALG_TYPE=1
ATB_HOST_TILING_BUFFER_BLOCK_NUM=128
ATB_SHARE_MEMORY_NAME_SUFFIX=
TORCH_DEVICE_BACKEND_AUTOLOAD=1
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1


NPU:
+------------------------------------------------------------------------------------------------+
| npu-smi 24.1.rc2                 Version: 24.1.rc2                                             |
+---------------------------+---------------+----------------------------------------------------+


CANN:
package_name=Ascend-cann-toolkit
version=8.3.RC1
innerversion=V100R001C23SPC001B235
compatible_version=[V100R001C15],[V100R001C18],[V100R001C19],[V100R001C20],[V100R001C21],[V100R001C23]
arch=aarch64
os=linux
path=/usr/local/Ascend/ascend-toolkit/8.3.RC1/aarch64-linux

The output of vllm python collect_env.py
Collecting environment information...
==============================
        System Info
==============================
OS                           : openEuler 24.03 (LTS-SP2) (aarch64)
GCC version                  : (GCC) 10.3.1
Clang version                : Could not collect
CMake version                : version 4.1.2
Libc version                 : glibc-2.38

==============================
       PyTorch Info
==============================
PyTorch version              : 2.7.1+cpu
Is debug build               : False
CUDA used to build PyTorch   : None
ROCM used to build PyTorch   : N/A

==============================
      Python Environment
==============================
Python version               : 3.11.13 (main, Nov  2 2025, 08:49:25) [GCC 12.3.1 (openEuler 12.3.1-98.oe2403sp2)] (64-bit runtime)
Python platform              : Linux-4.19.90-vhulk2211.3.0.h1912.eulerosv2r10.aarch64-aarch64-with-glibc2.38

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : False
CUDA runtime version         : No CUDA
CUDA_MODULE_LOADING set to   : N/A
GPU models and configuration : No CUDA
Nvidia driver version        : No CUDA
cuDNN version                : No CUDA
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                       aarch64
CPU op-mode(s):                     64-bit
Byte Order:                         Little Endian
CPU(s):                             192
On-line CPU(s) list:                0-191
Vendor ID:                          HiSilicon
BIOS Vendor ID:                     HiSilicon
Model name:                         Kunpeng-920
BIOS Model name:                    HUAWEI Kunpeng 920 5250 To be filled by O.E.M. CPU @ 2.6GHz
BIOS CPU family:                    280
Model:                              0
Thread(s) per core:                 1
Core(s) per socket:                 48
Socket(s):                          4
Stepping:                           0x1
BogoMIPS:                           200.00
Flags:                              fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm ssbs
L1d cache:                          12 MiB (192 instances)
L1i cache:                          12 MiB (192 instances)
L2 cache:                           96 MiB (192 instances)
L3 cache:                           192 MiB (8 instances)
NUMA node(s):                       8
NUMA node0 CPU(s):                  0-23
NUMA node1 CPU(s):                  24-47
NUMA node2 CPU(s):                  48-71
NUMA node3 CPU(s):                  72-95
NUMA node4 CPU(s):                  96-119
NUMA node5 CPU(s):                  120-143
NUMA node6 CPU(s):                  144-167
NUMA node7 CPU(s):                  168-191
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:           Mitigation; __user pointer sanitization
Vulnerability Spectre v2:           Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

==============================
Versions of relevant libraries
==============================
[pip3] numpy==1.26.4
[pip3] pyzmq==27.1.0
[pip3] torch==2.7.1
[pip3] torch_npu==2.7.1
[pip3] torchaudio==2.8.0
[pip3] torchvision==0.22.1
[pip3] transformers==4.57.1
[conda] Could not collect

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
vLLM Version                 : 0.11.0
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled
GPU Topology:
  Could not collect

==============================
     Environment Variables
==============================
LD_LIBRARY_PATH=/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/lib:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/examples:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/tests/atbopstest:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64/plugin:/usr/local/Ascend/ascend-toolkit/latest/lib64:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/opskernel:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/nnengine:/usr/local/Ascend/ascend-toolkit/latest/opp/built-in/op_impl/ai_core/tbe/op_tiling/lib/linux/aarch64:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/lib:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/examples:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/tests/atbopstest:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64/plugin:/usr/local/Ascend/ascend-toolkit/latest/lib64:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/opskernel:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/nnengine:/usr/local/Ascend/ascend-toolkit/latest/opp/built-in/op_impl/ai_core/tbe/op_tiling:/usr/local/Ascend/driver/lib64/common/:/usr/local/Ascend/driver/lib64/driver/:
TORCH_DEVICE_BACKEND_AUTOLOAD=1
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1


🐛 Describe the bug

Run the following commands to reproduce the error (1E2PD: one encoder instance plus prefill/decode):

# Encoder instance (ec_producer)
ASCEND_RT_VISIBLE_DEVICES="$GPU_E" vllm serve "$MODEL" \
  --gpu-memory-utilization 0.0 \
  --port "$ENCODE_PORT" \
  --enforce-eager \
  --enable-request-id-headers \
  --no-enable-prefix-caching \
  --max-num-batched-tokens 10000 \
  --max-num-seqs 128 \
  --max-model-len 10000 \
  --ec-transfer-config '{
    "ec_connector":"ECMooncakeStorageConnector",
    "ec_role":"ec_producer",
    "ec_connector_extra_config": {
      "ec_mooncake_config_file_path":"'${SCRIPT_DIR}'/producer.json",
      "ec_max_num_scheduled_tokens": "1000000000000000000"
    }
  }' \
  >"${ENC_LOG}" 2>&1 &

# Prefill/decode instance (ec_consumer)
ASCEND_RT_VISIBLE_DEVICES="$GPU_PD" VLLM_NIXL_SIDE_CHANNEL_PORT=6000 vllm serve "$MODEL" \
  --gpu-memory-utilization 0.98 \
  --port "$PREFILL_DECODE_PORT" \
  --enforce-eager \
  --enable-request-id-headers \
  --max-num-seqs 128 \
  --max-num-batched-tokens 10000 \
  --max-model-len 10000 \
  --no-enable-prefix-caching \
  --ec-transfer-config '{
    "ec_connector":"ECMooncakeStorageConnector",
    "ec_role":"ec_consumer",
    "ec_connector_extra_config": {
      "ec_mooncake_config_file_path":"'${SCRIPT_DIR}'/consumer.json"
    }
  }' \
  >"${PD_LOG}" 2>&1 &

Benchmark configuration:

acc_cases = [{
    "case_type": "accuracy",
    "dataset_path": os.path.join(DATASET_PATH, "textvqa_subset"),
    "request_conf": "vllm_api_general_chat",
    "dataset_conf": "textvqa/textvqa_gen_base64",
    "max_out_len": 2048,
    "batch_size": 32,
    "temperature": 0,
    "top_k": -1,
    "top_p": 1,
    "repetition_penalty": 1,
    "request_rate": 0,
    "seed": 77,
    "baseline": 81,
    "threshold": 1,
}]
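For reference, the 32-way concurrent load the benchmark applies can be approximated with a minimal client sketch. The endpoint URL and the helper names below are illustrative (not part of the benchmark harness); only the request shape (OpenAI-compatible chat completion with a base64-encoded image, temperature 0, max 2048 output tokens, 32 requests in flight) mirrors the configuration above:

```python
import base64
import json
from concurrent.futures import ThreadPoolExecutor

# Hypothetical values; point these at your own PD instance and model path.
PD_URL = "http://127.0.0.1:8000/v1/chat/completions"
MODEL = "/data/models/Qwen2.5-VL-7B-Instruct"

def build_request(image_bytes: bytes, question: str) -> dict:
    """Build an OpenAI-compatible chat request carrying a base64 image."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": MODEL,
        "temperature": 0,
        "max_tokens": 2048,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
                {"type": "text", "text": question},
            ],
        }],
    }

def send(payload: dict) -> bytes:
    """POST one request to the PD instance and return the raw response body."""
    import urllib.request
    req = urllib.request.Request(
        PD_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()

def run_concurrent(payloads, concurrency=32):
    """Keep up to `concurrency` requests in flight (batch_size=32 above)."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(send, payloads))
```

Driving the PD instance this way should reproduce the same sustained 32-deep request stream under which the OOM appears.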

Error output:

(APIServer pid=33795) INFO: 127.0.0.1:44520 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=33795) INFO: 127.0.0.1:46526 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(EngineCore_DP0 pid=34250) WARNING 11-14 16:46:40 [mooncake_storage_connector.py:69] ('In connector.start_load_caches, ', 'but the connector metadata has no mm_datas')
(APIServer pid=33795) INFO: 127.0.0.1:45346 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=33795) INFO: 127.0.0.1:46530 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=33795) INFO: 127.0.0.1:44510 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[rank0]:[E1114 16:46:40.652064787 compiler_depend.ts:444] NPU out of memory. NPUWorkspaceAllocator tried to allocate 166.91 MiB(NPU 0; 29.50 GiB total capacity; 248.07 MiB free). If you want to reduce memory usage, take a try to set the environment variable TASK_QUEUE_ENABLE=1.

[ERROR] 2025-11-14-16:46:40 (PID:34250, Device:0, RankID:-1) ERR00006 PTA memory error
Exception raised from malloc at build/CMakeFiles/torch_npu.dir/compiler_depend.ts:426 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0xd4 (0xffff98e03ea4 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&) + 0xe4 (0xffff98da3e44 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: + 0x980670 (0xfffded8e0670 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
frame #3: + 0x980e04 (0xfffded8e0e04 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
frame #4: + 0x97af2c (0xfffded8daf2c in /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
frame #5: + 0x2735aa0 (0xfffdef695aa0 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
frame #6: at_npu::native::allocate_workspace(unsigned long, void*) + 0x28 (0xfffded8d84d8 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
frame #7: + 0x9ecf8 (0xfffdd9b5ecf8 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libop_plugin_atb.so)
frame #8: + 0x26e6c10 (0xfffdef646c10 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
frame #9: + 0x961a94 (0xfffded8c1a94 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
frame #10: + 0x9644c0 (0xfffded8c44c0 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
frame #11: + 0x96072c (0xfffded8c072c in /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
frame #12: + 0xcf25c (0xffff98c3f25c in /usr/lib64/libstdc++.so.6)
frame #13: + 0x7fbb4 (0xffffa4f9fbb4 in /usr/lib64/libc.so.6)
frame #14: + 0xe79dc (0xffffa50079dc in /usr/lib64/libc.so.6)

(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [dump_input.py:69] Dumping input data for V1 LLM engine (v0.11.0) with config: model='/data/models/Qwen2.5-VL-7B-Instruct', speculative_config=None, tokenizer='/data/models/Qwen2.5-VL-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=10000, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=npu, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/data/models/Qwen2.5-VL-7B-Instruct, enable_prefix_caching=False, chunked_prefill_enabled=True, pooler_config=None, compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":["all"],"splitting_ops":null,"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":0,"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"use_inductor_graph_partition":false,"pass_config":{},"max_capture_size":0,"local_cache_dir":null},
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [dump_input.py:76] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[], scheduled_cached_reqs=CachedRequestData(req_ids=['chatcmpl-db4f9bfe-0728-48a2-9ed1-fb868fb56643', 'chatcmpl-5b562f28-360b-44a1-8e82-0a2edc6d1d32', 'chatcmpl-1b16ef2b-f385-4505-bb0a-eb17b7b8007f', 'chatcmpl-b00a1f46-828b-4756-bcff-de522f60f68f', 'chatcmpl-61f63aec-da89-48b1-8f29-8693704b0866', 'chatcmpl-b2151b82-2ade-473c-b6eb-bef0c422163d', 'chatcmpl-6c4acd84-2946-4029-96c3-2d934b05e544', 'chatcmpl-45ce2d2e-e94b-4955-80dc-b92203f46160', 'chatcmpl-b32d86f4-2baa-4015-bfc0-a91687bcde78', 'chatcmpl-134428e4-7c44-4ed7-bbb0-3b9d3231a463', 'chatcmpl-a6c08e31-653f-432a-9641-4868f15d7155', 'chatcmpl-b8c9445b-4a53-443b-b4ed-e259086d629d', 'chatcmpl-5ba69b99-71cc-421f-8dfb-e6ffebd8229d', 'chatcmpl-84f98c5c-c529-410f-bd3d-22ae7ae18f4b', 'chatcmpl-8f2896a4-4ccc-4cfe-aa61-66c6fe779457', 'chatcmpl-e03d398e-70e8-4089-85c2-a564977d2e8a', 'chatcmpl-6aabbf98-18e3-42dc-af14-7e9db78e97b1'], resumed_from_preemption=[false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false], new_token_ids=[], new_block_ids=[null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null], num_computed_tokens=[931, 966, 1041, 1041, 1041, 1043, 930, 928, 1037, 1037, 1039, 925, 1042, 926, 1042, 1371, 1002]), num_scheduled_tokens={chatcmpl-5ba69b99-71cc-421f-8dfb-e6ffebd8229d: 1, chatcmpl-db4f9bfe-0728-48a2-9ed1-fb868fb56643: 1, chatcmpl-b8c9445b-4a53-443b-b4ed-e259086d629d: 1, chatcmpl-b00a1f46-828b-4756-bcff-de522f60f68f: 1, chatcmpl-b32d86f4-2baa-4015-bfc0-a91687bcde78: 1, chatcmpl-8f2896a4-4ccc-4cfe-aa61-66c6fe779457: 1, chatcmpl-e03d398e-70e8-4089-85c2-a564977d2e8a: 1, chatcmpl-1b16ef2b-f385-4505-bb0a-eb17b7b8007f: 1, chatcmpl-a6c08e31-653f-432a-9641-4868f15d7155: 1, chatcmpl-134428e4-7c44-4ed7-bbb0-3b9d3231a463: 1, 
chatcmpl-b2151b82-2ade-473c-b6eb-bef0c422163d: 1, chatcmpl-6aabbf98-18e3-42dc-af14-7e9db78e97b1: 1, chatcmpl-45ce2d2e-e94b-4955-80dc-b92203f46160: 1, chatcmpl-5b562f28-360b-44a1-8e82-0a2edc6d1d32: 1, chatcmpl-61f63aec-da89-48b1-8f29-8693704b0866: 1, chatcmpl-6c4acd84-2946-4029-96c3-2d934b05e544: 1, chatcmpl-84f98c5c-c529-410f-bd3d-22ae7ae18f4b: 1}, total_num_scheduled_tokens=17, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[0], finished_req_ids=['chatcmpl-d890cf90-df89-4e75-ab69-9924960f62fa', 'chatcmpl-b8d32886-1b60-4407-88db-591862a0716c', 'chatcmpl-903bb957-bbe3-4e29-b07b-f0c5cce4ca0f', 'chatcmpl-9e934981-edef-4fcb-a2fe-02350992b8cb', 'chatcmpl-df40c129-60f9-4979-81b7-d28c66b21e43', 'chatcmpl-ce3771e9-0811-4cd1-9027-d8958b4da14e', 'chatcmpl-a201131f-d362-475a-8a80-a4bd6b031655', 'chatcmpl-de73026a-f887-4f70-8290-07c0dd10168a'], free_encoder_mm_hashes=[], structured_output_request_ids={}, grammar_bitmask=null, kv_connector_metadata=null, ec_connector_metadata=ECMooncakeStorageConnectorMetadata(mm_datas=[]))
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [dump_input.py:79] Dumping scheduler stats: SchedulerStats(num_running_reqs=17, num_waiting_reqs=0, step_counter=0, current_wave=0, kv_cache_usage=0.11708860759493667, prefix_cache_stats=PrefixCacheStats(reset=False, requests=0, queries=0, hits=0), spec_decoding_stats=None, kv_connector_stats=None, num_corrupted_reqs=0)
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] EngineCore encountered a fatal error.
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] Traceback (most recent call last):
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 701, in run_engine_core
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] engine_core.run_busy_loop()
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 728, in run_busy_loop
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] self._process_engine_step()
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 754, in _process_engine_step
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] outputs, model_executed = self.step_fn()
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 284, in step
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] model_output = self.execute_model_with_error_logging(
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 270, in execute_model_with_error_logging
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] raise err
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 261, in execute_model_with_error_logging
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] return model_fn(scheduler_output)
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] ^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] File "/vllm-workspace/vllm/vllm/v1/executor/abstract.py", line 103, in execute_model
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] output = self.collective_rpc("execute_model",
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] File "/vllm-workspace/vllm/vllm/executor/uniproc_executor.py", line 83, in collective_rpc
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] return [run_method(self.driver_worker, method, args, kwargs)]
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] File "/vllm-workspace/vllm/vllm/utils/init.py", line 3122, in run_method
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] return func(*args, **kwargs)
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/worker_v1.py", line 257, in execute_model
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] output = self.model_runner.execute_model(scheduler_output,
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] return func(*args, **kwargs)
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 2133, in execute_model
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] valid_sampled_token_ids = sampled_token_ids.tolist()
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] ^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] RuntimeError: The Inner error is reported as above. The process exits for this inner error, and the current working operator name is PagedAttentionOperation.
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] Since the operator is called asynchronously, the stacktrace may be inaccurate. If you want to get the accurate stacktrace, please set the environment variable ASCEND_LAUNCH_BLOCKING=1.
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] Note: ASCEND_LAUNCH_BLOCKING=1 will force ops to run in synchronous mode, resulting in performance degradation. Please unset ASCEND_LAUNCH_BLOCKING in time after debugging.
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] [ERROR] 2025-11-14-16:46:40 (PID:34250, Device:0, RankID:-1) ERR00100 PTA call acl api failed.
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] [PID: 34250] 2025-11-14-16:46:40.305.301 Memory_Allocation_Failure(EL0004): Failed to allocate memory.
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] Possible Cause: Available memory is insufficient.
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] Solution: Close applications not in use.
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] TraceBack (most recent call last):
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] alloc device memory failed, runtime result = 207001[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:162]
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710]
(EngineCore_DP0 pid=34250) Process EngineCore_DP0:
(APIServer pid=33795) ERROR 11-14 16:46:40 [async_llm.py:480] AsyncLLM output_handler failed.
(APIServer pid=33795) ERROR 11-14 16:46:40 [async_llm.py:480] Traceback (most recent call last):
(APIServer pid=33795) ERROR 11-14 16:46:40 [async_llm.py:480] File "/vllm-workspace/vllm/vllm/v1/engine/async_llm.py", line 439, in output_handler
(APIServer pid=33795) ERROR 11-14 16:46:40 [async_llm.py:480] outputs = await engine_core.get_output_async()
(APIServer pid=33795) ERROR 11-14 16:46:40 [async_llm.py:480] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=33795) ERROR 11-14 16:46:40 [async_llm.py:480] File "/vllm-workspace/vllm/vllm/v1/engine/core_client.py", line 846, in get_output_async
(APIServer pid=33795) ERROR 11-14 16:46:40 [async_llm.py:480] raise self._format_exception(outputs) from None
(APIServer pid=33795) ERROR 11-14 16:46:40 [async_llm.py:480] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(EngineCore_DP0 pid=34250) Traceback (most recent call last):
(EngineCore_DP0 pid=34250) File "/usr/local/python3.11.13/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=34250) self.run()
(EngineCore_DP0 pid=34250) File "/usr/local/python3.11.13/lib/python3.11/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=34250) self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=34250) File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 712, in run_engine_core
(EngineCore_DP0 pid=34250) raise e
(EngineCore_DP0 pid=34250) File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 701, in run_engine_core
(EngineCore_DP0 pid=34250) engine_core.run_busy_loop()
(EngineCore_DP0 pid=34250) File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 728, in run_busy_loop
(EngineCore_DP0 pid=34250) self._process_engine_step()
(EngineCore_DP0 pid=34250) File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 754, in _process_engine_step
(EngineCore_DP0 pid=34250) outputs, model_executed = self.step_fn()
(EngineCore_DP0 pid=34250) ^^^^^^^^^^^^^^
(APIServer pid=33795) INFO: 127.0.0.1:46534 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
