Description
Your current environment
The output of `python collect_env.py` from vllm-ascend
Collecting environment information...
PyTorch version: 2.7.1+cpu
Is debug build: False
OS: openEuler 24.03 (LTS-SP2) (aarch64)
GCC version: (GCC) 10.3.1
Clang version: Could not collect
CMake version: version 4.1.2
Libc version: glibc-2.38
Python version: 3.11.13 (main, Nov 2 2025, 08:49:25) [GCC 12.3.1 (openEuler 12.3.1-98.oe2403sp2)] (64-bit runtime)
Python platform: Linux-4.19.90-vhulk2211.3.0.h1912.eulerosv2r10.aarch64-aarch64-with-glibc2.38
CPU:
Architecture: aarch64
CPU op-mode(s): 64-bit
Byte Order: Little Endian
CPU(s): 192
On-line CPU(s) list: 0-191
Vendor ID: HiSilicon
BIOS Vendor ID: HiSilicon
Model name: Kunpeng-920
BIOS Model name: HUAWEI Kunpeng 920 5250 To be filled by O.E.M. CPU @ 2.6GHz
BIOS CPU family: 280
Model: 0
Thread(s) per core: 1
Core(s) per socket: 48
Socket(s): 4
Stepping: 0x1
BogoMIPS: 200.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm ssbs
L1d cache: 12 MiB (192 instances)
L1i cache: 12 MiB (192 instances)
L2 cache: 96 MiB (192 instances)
L3 cache: 192 MiB (8 instances)
NUMA node(s): 8
NUMA node0 CPU(s): 0-23
NUMA node1 CPU(s): 24-47
NUMA node2 CPU(s): 48-71
NUMA node3 CPU(s): 72-95
NUMA node4 CPU(s): 96-119
NUMA node5 CPU(s): 120-143
NUMA node6 CPU(s): 144-167
NUMA node7 CPU(s): 168-191
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; __user pointer sanitization
Vulnerability Spectre v2: Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] pyzmq==27.1.0
[pip3] torch==2.7.1
[pip3] torch_npu==2.7.1
[pip3] torchaudio==2.8.0
[pip3] torchvision==0.22.1
[pip3] transformers==4.57.1
[conda] Could not collect
vLLM Version: 0.11.0
vLLM Ascend Version: 0.11.0
ENV Variables:
ATB_OPSRUNNER_KERNEL_CACHE_LOCAL_COUNT=1
ATB_STREAM_SYNC_EVERY_RUNNER_ENABLE=0
ATB_OPSRUNNER_SETUP_CACHE_ENABLE=1
ATB_WORKSPACE_MEM_ALLOC_GLOBAL=1
ATB_DEVICE_TILING_BUFFER_BLOCK_NUM=32
ATB_STREAM_SYNC_EVERY_KERNEL_ENABLE=0
ATB_OPSRUNNER_KERNEL_CACHE_GLOABL_COUNT=5
ATB_HOME_PATH=/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1
ASCEND_TOOLKIT_HOME=/usr/local/Ascend/ascend-toolkit/latest
ATB_COMPARE_TILING_EVERY_KERNEL=0
ASCEND_OPP_PATH=/usr/local/Ascend/ascend-toolkit/latest/opp
LD_LIBRARY_PATH=/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/lib:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/examples:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/tests/atbopstest:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64/plugin:/usr/local/Ascend/ascend-toolkit/latest/lib64:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/opskernel:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/nnengine:/usr/local/Ascend/ascend-toolkit/latest/opp/built-in/op_impl/ai_core/tbe/op_tiling/lib/linux/aarch64:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/lib:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/examples:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/tests/atbopstest:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64/plugin:/usr/local/Ascend/ascend-toolkit/latest/lib64:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/opskernel:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/nnengine:/usr/local/Ascend/ascend-toolkit/latest/opp/built-in/op_impl/ai_core/tbe/op_tiling:/usr/local/Ascend/driver/lib64/common/:/usr/local/Ascend/driver/lib64/driver/:
ASCEND_AICPU_PATH=/usr/local/Ascend/ascend-toolkit/latest
ATB_STREAM_SYNC_EVERY_OPERATION_ENABLE=0
ASCEND_HOME_PATH=/usr/local/Ascend/ascend-toolkit/latest
ATB_MATMUL_SHUFFLE_K_ENABLE=1
ATB_WORKSPACE_MEM_ALLOC_ALG_TYPE=1
ATB_HOST_TILING_BUFFER_BLOCK_NUM=128
ATB_SHARE_MEMORY_NAME_SUFFIX=
TORCH_DEVICE_BACKEND_AUTOLOAD=1
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
NPU:
+------------------------------------------------------------------------------------------------+
| npu-smi 24.1.rc2 Version: 24.1.rc2 |
+---------------------------+---------------+----------------------------------------------------+
CANN:
package_name=Ascend-cann-toolkit
version=8.3.RC1
innerversion=V100R001C23SPC001B235
compatible_version=[V100R001C15],[V100R001C18],[V100R001C19],[V100R001C20],[V100R001C21],[V100R001C23]
arch=aarch64
os=linux
path=/usr/local/Ascend/ascend-toolkit/8.3.RC1/aarch64-linux
The output of `python collect_env.py` from vllm
Collecting environment information...
==============================
System Info
==============================
OS : openEuler 24.03 (LTS-SP2) (aarch64)
GCC version : (GCC) 10.3.1
Clang version : Could not collect
CMake version : version 4.1.2
Libc version : glibc-2.38
==============================
PyTorch Info
==============================
PyTorch version : 2.7.1+cpu
Is debug build : False
CUDA used to build PyTorch : None
ROCM used to build PyTorch : N/A
==============================
Python Environment
==============================
Python version : 3.11.13 (main, Nov 2 2025, 08:49:25) [GCC 12.3.1 (openEuler 12.3.1-98.oe2403sp2)] (64-bit runtime)
Python platform : Linux-4.19.90-vhulk2211.3.0.h1912.eulerosv2r10.aarch64-aarch64-with-glibc2.38
==============================
CUDA / GPU Info
==============================
Is CUDA available : False
CUDA runtime version : No CUDA
CUDA_MODULE_LOADING set to : N/A
GPU models and configuration : No CUDA
Nvidia driver version : No CUDA
cuDNN version : No CUDA
HIP runtime version : N/A
MIOpen runtime version : N/A
Is XNNPACK available : True
==============================
CPU Info
==============================
Architecture: aarch64
CPU op-mode(s): 64-bit
Byte Order: Little Endian
CPU(s): 192
On-line CPU(s) list: 0-191
Vendor ID: HiSilicon
BIOS Vendor ID: HiSilicon
Model name: Kunpeng-920
BIOS Model name: HUAWEI Kunpeng 920 5250 To be filled by O.E.M. CPU @ 2.6GHz
BIOS CPU family: 280
Model: 0
Thread(s) per core: 1
Core(s) per socket: 48
Socket(s): 4
Stepping: 0x1
BogoMIPS: 200.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm ssbs
L1d cache: 12 MiB (192 instances)
L1i cache: 12 MiB (192 instances)
L2 cache: 96 MiB (192 instances)
L3 cache: 192 MiB (8 instances)
NUMA node(s): 8
NUMA node0 CPU(s): 0-23
NUMA node1 CPU(s): 24-47
NUMA node2 CPU(s): 48-71
NUMA node3 CPU(s): 72-95
NUMA node4 CPU(s): 96-119
NUMA node5 CPU(s): 120-143
NUMA node6 CPU(s): 144-167
NUMA node7 CPU(s): 168-191
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; __user pointer sanitization
Vulnerability Spectre v2: Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
==============================
Versions of relevant libraries
==============================
[pip3] numpy==1.26.4
[pip3] pyzmq==27.1.0
[pip3] torch==2.7.1
[pip3] torch_npu==2.7.1
[pip3] torchaudio==2.8.0
[pip3] torchvision==0.22.1
[pip3] transformers==4.57.1
[conda] Could not collect
==============================
vLLM Info
==============================
ROCM Version : Could not collect
vLLM Version : 0.11.0
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled
GPU Topology:
Could not collect
==============================
Environment Variables
==============================
LD_LIBRARY_PATH=/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/lib:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/examples:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/tests/atbopstest:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64/plugin:/usr/local/Ascend/ascend-toolkit/latest/lib64:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/opskernel:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/nnengine:/usr/local/Ascend/ascend-toolkit/latest/opp/built-in/op_impl/ai_core/tbe/op_tiling/lib/linux/aarch64:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/lib:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/examples:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/tests/atbopstest:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64/plugin:/usr/local/Ascend/ascend-toolkit/latest/lib64:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/opskernel:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/nnengine:/usr/local/Ascend/ascend-toolkit/latest/opp/built-in/op_impl/ai_core/tbe/op_tiling:/usr/local/Ascend/driver/lib64/common/:/usr/local/Ascend/driver/lib64/driver/:
TORCH_DEVICE_BACKEND_AUTOLOAD=1
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
🐛 Describe the bug
Run the following commands to reproduce the error:
1E2PD

ASCEND_RT_VISIBLE_DEVICES="$GPU_E" vllm serve "$MODEL" \
  --gpu-memory-utilization 0.0 \
  --port "$ENCODE_PORT" \
  --enforce-eager \
  --enable-request-id-headers \
  --no-enable-prefix-caching \
  --max-num-batched-tokens 10000 \
  --max-num-seqs 128 \
  --max-model-len 10000 \
  --ec-transfer-config '{
    "ec_connector": "ECMooncakeStorageConnector",
    "ec_role": "ec_producer",
    "ec_connector_extra_config": {
      "ec_mooncake_config_file_path": "'${SCRIPT_DIR}'/producer.json",
      "ec_max_num_scheduled_tokens": "1000000000000000000"
    }
  }' \
  >"${ENC_LOG}" 2>&1 &

ASCEND_RT_VISIBLE_DEVICES="$GPU_PD" VLLM_NIXL_SIDE_CHANNEL_PORT=6000 vllm serve "$MODEL" \
  --gpu-memory-utilization 0.98 \
  --port "$PREFILL_DECODE_PORT" \
  --enforce-eager \
  --enable-request-id-headers \
  --max-num-seqs 128 \
  --max-num-batched-tokens 10000 \
  --max-model-len 10000 \
  --no-enable-prefix-caching \
  --ec-transfer-config '{
    "ec_connector": "ECMooncakeStorageConnector",
    "ec_role": "ec_consumer",
    "ec_connector_extra_config": {
      "ec_mooncake_config_file_path": "'${SCRIPT_DIR}'/consumer.json"
    }
  }' \
  >"${PD_LOG}" 2>&1 &

benchmark
acc_cases = [{
    "case_type": "accuracy",
    "dataset_path": os.path.join(DATASET_PATH, "textvqa_subset"),
    "request_conf": "vllm_api_general_chat",
    "dataset_conf": "textvqa/textvqa_gen_base64",
    "max_out_len": 2048,
    "batch_size": 32,
    "temperature": 0,
    "top_k": -1,
    "top_p": 1,
    "repetition_penalty": 1,
    "request_rate": 0,
    "seed": 77,
    "baseline": 81,
    "threshold": 1,
}]
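For context, a minimal sketch of how a `baseline`/`threshold` accuracy gate like the one above is typically evaluated; `within_threshold` is a hypothetical helper for illustration, not the actual benchmark harness used here.

```python
# Hypothetical sketch of a "baseline"/"threshold" accuracy gate.
# The real benchmark harness is not shown in this report.
def within_threshold(measured: float, baseline: float, threshold: float) -> bool:
    """Pass if the measured score is no more than `threshold` below baseline."""
    return measured >= baseline - threshold

# With the case above (baseline=81, threshold=1):
print(within_threshold(80.0, 81, 1))  # True  (within 1 point of baseline)
print(within_threshold(79.9, 81, 1))  # False (regression beyond threshold)
```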
Error output:
(APIServer pid=33795) INFO: 127.0.0.1:44520 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=33795) INFO: 127.0.0.1:46526 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(EngineCore_DP0 pid=34250) WARNING 11-14 16:46:40 [mooncake_storage_connector.py:69] ('In connector.start_load_caches, ', 'but the connector metadata has no mm_datas')
(APIServer pid=33795) INFO: 127.0.0.1:45346 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=33795) INFO: 127.0.0.1:46530 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=33795) INFO: 127.0.0.1:44510 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[rank0]:[E1114 16:46:40.652064787 compiler_depend.ts:444] NPU out of memory. NPUWorkspaceAllocator tried to allocate 166.91 MiB(NPU 0; 29.50 GiB total capacity; 248.07 MiB free). If you want to reduce memory usage, take a try to set the environment variable TASK_QUEUE_ENABLE=1.
[ERROR] 2025-11-14-16:46:40 (PID:34250, Device:0, RankID:-1) ERR00006 PTA memory error
Exception raised from malloc at build/CMakeFiles/torch_npu.dir/compiler_depend.ts:426 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xd4 (0xffff98e03ea4 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xe4 (0xffff98da3e44 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: + 0x980670 (0xfffded8e0670 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
frame #3: + 0x980e04 (0xfffded8e0e04 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
frame #4: + 0x97af2c (0xfffded8daf2c in /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
frame #5: + 0x2735aa0 (0xfffdef695aa0 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
frame #6: at_npu::native::allocate_workspace(unsigned long, void*) + 0x28 (0xfffded8d84d8 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
frame #7: + 0x9ecf8 (0xfffdd9b5ecf8 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libop_plugin_atb.so)
frame #8: + 0x26e6c10 (0xfffdef646c10 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
frame #9: + 0x961a94 (0xfffded8c1a94 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
frame #10: + 0x9644c0 (0xfffded8c44c0 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
frame #11: + 0x96072c (0xfffded8c072c in /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
frame #12: + 0xcf25c (0xffff98c3f25c in /usr/lib64/libstdc++.so.6)
frame #13: + 0x7fbb4 (0xffffa4f9fbb4 in /usr/lib64/libc.so.6)
frame #14: + 0xe79dc (0xffffa50079dc in /usr/lib64/libc.so.6)
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [dump_input.py:69] Dumping input data for V1 LLM engine (v0.11.0) with config: model='/data/models/Qwen2.5-VL-7B-Instruct', speculative_config=None, tokenizer='/data/models/Qwen2.5-VL-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=10000, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=npu, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/data/models/Qwen2.5-VL-7B-Instruct, enable_prefix_caching=False, chunked_prefill_enabled=True, pooler_config=None, compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":["all"],"splitting_ops":null,"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":0,"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"use_inductor_graph_partition":false,"pass_config":{},"max_capture_size":0,"local_cache_dir":null},
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [dump_input.py:76] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[], scheduled_cached_reqs=CachedRequestData(req_ids=['chatcmpl-db4f9bfe-0728-48a2-9ed1-fb868fb56643', 'chatcmpl-5b562f28-360b-44a1-8e82-0a2edc6d1d32', 'chatcmpl-1b16ef2b-f385-4505-bb0a-eb17b7b8007f', 'chatcmpl-b00a1f46-828b-4756-bcff-de522f60f68f', 'chatcmpl-61f63aec-da89-48b1-8f29-8693704b0866', 'chatcmpl-b2151b82-2ade-473c-b6eb-bef0c422163d', 'chatcmpl-6c4acd84-2946-4029-96c3-2d934b05e544', 'chatcmpl-45ce2d2e-e94b-4955-80dc-b92203f46160', 'chatcmpl-b32d86f4-2baa-4015-bfc0-a91687bcde78', 'chatcmpl-134428e4-7c44-4ed7-bbb0-3b9d3231a463', 'chatcmpl-a6c08e31-653f-432a-9641-4868f15d7155', 'chatcmpl-b8c9445b-4a53-443b-b4ed-e259086d629d', 'chatcmpl-5ba69b99-71cc-421f-8dfb-e6ffebd8229d', 'chatcmpl-84f98c5c-c529-410f-bd3d-22ae7ae18f4b', 'chatcmpl-8f2896a4-4ccc-4cfe-aa61-66c6fe779457', 'chatcmpl-e03d398e-70e8-4089-85c2-a564977d2e8a', 'chatcmpl-6aabbf98-18e3-42dc-af14-7e9db78e97b1'], resumed_from_preemption=[false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false], new_token_ids=[], new_block_ids=[null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null], num_computed_tokens=[931, 966, 1041, 1041, 1041, 1043, 930, 928, 1037, 1037, 1039, 925, 1042, 926, 1042, 1371, 1002]), num_scheduled_tokens={chatcmpl-5ba69b99-71cc-421f-8dfb-e6ffebd8229d: 1, chatcmpl-db4f9bfe-0728-48a2-9ed1-fb868fb56643: 1, chatcmpl-b8c9445b-4a53-443b-b4ed-e259086d629d: 1, chatcmpl-b00a1f46-828b-4756-bcff-de522f60f68f: 1, chatcmpl-b32d86f4-2baa-4015-bfc0-a91687bcde78: 1, chatcmpl-8f2896a4-4ccc-4cfe-aa61-66c6fe779457: 1, chatcmpl-e03d398e-70e8-4089-85c2-a564977d2e8a: 1, chatcmpl-1b16ef2b-f385-4505-bb0a-eb17b7b8007f: 1, chatcmpl-a6c08e31-653f-432a-9641-4868f15d7155: 1, chatcmpl-134428e4-7c44-4ed7-bbb0-3b9d3231a463: 1, 
chatcmpl-b2151b82-2ade-473c-b6eb-bef0c422163d: 1, chatcmpl-6aabbf98-18e3-42dc-af14-7e9db78e97b1: 1, chatcmpl-45ce2d2e-e94b-4955-80dc-b92203f46160: 1, chatcmpl-5b562f28-360b-44a1-8e82-0a2edc6d1d32: 1, chatcmpl-61f63aec-da89-48b1-8f29-8693704b0866: 1, chatcmpl-6c4acd84-2946-4029-96c3-2d934b05e544: 1, chatcmpl-84f98c5c-c529-410f-bd3d-22ae7ae18f4b: 1}, total_num_scheduled_tokens=17, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[0], finished_req_ids=['chatcmpl-d890cf90-df89-4e75-ab69-9924960f62fa', 'chatcmpl-b8d32886-1b60-4407-88db-591862a0716c', 'chatcmpl-903bb957-bbe3-4e29-b07b-f0c5cce4ca0f', 'chatcmpl-9e934981-edef-4fcb-a2fe-02350992b8cb', 'chatcmpl-df40c129-60f9-4979-81b7-d28c66b21e43', 'chatcmpl-ce3771e9-0811-4cd1-9027-d8958b4da14e', 'chatcmpl-a201131f-d362-475a-8a80-a4bd6b031655', 'chatcmpl-de73026a-f887-4f70-8290-07c0dd10168a'], free_encoder_mm_hashes=[], structured_output_request_ids={}, grammar_bitmask=null, kv_connector_metadata=null, ec_connector_metadata=ECMooncakeStorageConnectorMetadata(mm_datas=[]))
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [dump_input.py:79] Dumping scheduler stats: SchedulerStats(num_running_reqs=17, num_waiting_reqs=0, step_counter=0, current_wave=0, kv_cache_usage=0.11708860759493667, prefix_cache_stats=PrefixCacheStats(reset=False, requests=0, queries=0, hits=0), spec_decoding_stats=None, kv_connector_stats=None, num_corrupted_reqs=0)
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] EngineCore encountered a fatal error.
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] Traceback (most recent call last):
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 701, in run_engine_core
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] engine_core.run_busy_loop()
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 728, in run_busy_loop
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] self._process_engine_step()
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 754, in _process_engine_step
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] outputs, model_executed = self.step_fn()
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 284, in step
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] model_output = self.execute_model_with_error_logging(
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 270, in execute_model_with_error_logging
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] raise err
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 261, in execute_model_with_error_logging
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] return model_fn(scheduler_output)
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] ^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] File "/vllm-workspace/vllm/vllm/v1/executor/abstract.py", line 103, in execute_model
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] output = self.collective_rpc("execute_model",
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] File "/vllm-workspace/vllm/vllm/executor/uniproc_executor.py", line 83, in collective_rpc
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] return [run_method(self.driver_worker, method, args, kwargs)]
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] File "/vllm-workspace/vllm/vllm/utils/__init__.py", line 3122, in run_method
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] return func(*args, **kwargs)
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/worker_v1.py", line 257, in execute_model
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] output = self.model_runner.execute_model(scheduler_output,
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] return func(*args, **kwargs)
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 2133, in execute_model
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] valid_sampled_token_ids = sampled_token_ids.tolist()
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] ^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] RuntimeError: The Inner error is reported as above. The process exits for this inner error, and the current working operator name is PagedAttentionOperation.
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] Since the operator is called asynchronously, the stacktrace may be inaccurate. If you want to get the accurate stacktrace, please set the environment variable ASCEND_LAUNCH_BLOCKING=1.
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] Note: ASCEND_LAUNCH_BLOCKING=1 will force ops to run in synchronous mode, resulting in performance degradation. Please unset ASCEND_LAUNCH_BLOCKING in time after debugging.
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] [ERROR] 2025-11-14-16:46:40 (PID:34250, Device:0, RankID:-1) ERR00100 PTA call acl api failed.
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] [PID: 34250] 2025-11-14-16:46:40.305.301 Memory_Allocation_Failure(EL0004): Failed to allocate memory.
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] Possible Cause: Available memory is insufficient.
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] Solution: Close applications not in use.
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] TraceBack (most recent call last):
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710] alloc device memory failed, runtime result = 207001[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:162]
(EngineCore_DP0 pid=34250) ERROR 11-14 16:46:40 [core.py:710]
(EngineCore_DP0 pid=34250) Process EngineCore_DP0:
(APIServer pid=33795) ERROR 11-14 16:46:40 [async_llm.py:480] AsyncLLM output_handler failed.
(APIServer pid=33795) ERROR 11-14 16:46:40 [async_llm.py:480] Traceback (most recent call last):
(APIServer pid=33795) ERROR 11-14 16:46:40 [async_llm.py:480] File "/vllm-workspace/vllm/vllm/v1/engine/async_llm.py", line 439, in output_handler
(APIServer pid=33795) ERROR 11-14 16:46:40 [async_llm.py:480] outputs = await engine_core.get_output_async()
(APIServer pid=33795) ERROR 11-14 16:46:40 [async_llm.py:480] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=33795) ERROR 11-14 16:46:40 [async_llm.py:480] File "/vllm-workspace/vllm/vllm/v1/engine/core_client.py", line 846, in get_output_async
(APIServer pid=33795) ERROR 11-14 16:46:40 [async_llm.py:480] raise self._format_exception(outputs) from None
(APIServer pid=33795) ERROR 11-14 16:46:40 [async_llm.py:480] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(EngineCore_DP0 pid=34250) Traceback (most recent call last):
(EngineCore_DP0 pid=34250) File "/usr/local/python3.11.13/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=34250) self.run()
(EngineCore_DP0 pid=34250) File "/usr/local/python3.11.13/lib/python3.11/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=34250) self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=34250) File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 712, in run_engine_core
(EngineCore_DP0 pid=34250) raise e
(EngineCore_DP0 pid=34250) File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 701, in run_engine_core
(EngineCore_DP0 pid=34250) engine_core.run_busy_loop()
(EngineCore_DP0 pid=34250) File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 728, in run_busy_loop
(EngineCore_DP0 pid=34250) self._process_engine_step()
(EngineCore_DP0 pid=34250) File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 754, in _process_engine_step
(EngineCore_DP0 pid=34250) outputs, model_executed = self.step_fn()
(EngineCore_DP0 pid=34250) ^^^^^^^^^^^^^^
(APIServer pid=33795) INFO: 127.0.0.1:46534 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
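A rough back-of-the-envelope check of the OOM numbers above, assuming `--gpu-memory-utilization` is treated as a fraction of total device memory: with 0.98 on a 29.50 GiB device, only about 600 MiB is left outside vLLM's budget, so a transient PagedAttention workspace allocation can exhaust it. Note the failed 166.91 MiB request is nominally smaller than the 248.07 MiB reported free, which also suggests fragmentation (no single contiguous block of that size).

```python
# Back-of-the-envelope check of the numbers in the error log above.
# Assumption: --gpu-memory-utilization is a fraction of total device memory.
total_mib = 29.50 * 1024                 # 29.50 GiB total capacity, in MiB
headroom_mib = (1 - 0.98) * total_mib    # memory left outside vLLM's budget

print(f"headroom outside vLLM budget: {headroom_mib:.1f} MiB")  # ~604.2 MiB

# The allocator reported 248.07 MiB free when a 166.91 MiB workspace request
# failed; since the request fits within the reported free total, the failure
# also points at fragmentation rather than pure exhaustion.
requested_mib, free_mib = 166.91, 248.07
print(requested_mib < free_mib)  # True
```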
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.