
[Bug]: Qwen2.5-VL-7B-Instruct NVFP4 trtllm-bench Support Issue #8077

@jyj0w0

Description


System Info

x86_64 / arm64
H100 / GB200
nvcr.io/nvidia/tensorrt-llm/release:1.0.0rc5
NVIDIA-SMI 550.90.07 Driver Version: 550.90.07 CUDA Version: 12.4

The issue happens on both H100 and GB200.

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

The errors: "size mismatch for weight" and "Only Tensors of floating point and complex dtype can require gradients".
Model: https://huggingface.co/nvidia/Qwen2.5-VL-7B-Instruct-FP4
Dataset: a self-generated dataset matching the trtllm-bench multi-modal input format (https://nvidia.github.io/TensorRT-LLM/performance/perf-benchmarking.html#running-multi-modal-models-in-the-pytorch-workflow); a sketch of the generator is shown below.
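
For reference, the dataset was produced with a small script along these lines. This is only a hedged sketch: the JSON field names ("prompt", "media_paths", "output_tokens") and the image path are assumptions based on my reading of the linked documentation, which defines the actual schema.

```python
# Hedged sketch of the dataset generator used above.
# Assumption: one JSON object per line with "task_id", "prompt", "media_paths",
# and "output_tokens" fields, per the multi-modal trtllm-bench docs linked above.
import json

NUM_REQUESTS = 1000
OUT_PATH = "inlen1024_outlen110_prefixlen0_MM_imgnum1_imgsize736x480.jsonl"
IMAGE_PATH = "/scratch/images/sample_736x480.jpg"  # hypothetical 736x480 test image

with open(OUT_PATH, "w") as f:
    for task_id in range(NUM_REQUESTS):
        record = {
            "task_id": task_id,
            "prompt": "Describe the image in detail. " * 150,  # roughly 1024 input tokens
            "media_paths": [IMAGE_PATH],  # one image per request
            "output_tokens": 110,         # fixed output length per request
        }
        f.write(json.dumps(record) + "\n")
```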

Reproduction steps:

  1. docker run --gpus all --ipc=host -it -v /mnt/disk:/scratch nvcr.io/nvidia/tensorrt-llm/release:1.0.0rc5 bash
  2. trtllm-bench --model nvidia/Qwen2.5-VL-7B-Instruct-FP4 throughput --dataset /scratch/my_generated_datasets/Qwen_Qwen2.5-VL-7B-Instruct/inlen1024_outlen110_prefixlen0_MM_imgnum1_imgsize736x480.jsonl --backend pytorch --modality image --num_requests 1000 --streaming --max_input_len 8192 --max_batch_size 2 --kv_cache_free_gpu_mem_fraction 0.95
  3. Running the command, we get the following output:
[09/29/2025-18:25:43] [TRT-LLM] [I] KV cache quantization set to "auto". Using checkpoint KV quantization.
Unknown quantization type, got modelopt - supported types are: ['awq', 'bitsandbytes_4bit', 'bitsandbytes_8bit', 'gptq', 'aqlm', 'quanto', 'quark', 'eetq', 'higgs', 'hqq', 'compressed-tensors', 'fbgemm_fp8', 'torchao', 'bitnet', 'vptq', 'spqr', 'fp8', 'auto-round']. Hence, we will skip the quantization. To remove the warning, you can delete the quantization_config attribute in config.json
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
[09/29/2025-18:25:43] [TRT-LLM] [I] Fallback to regular model init: Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py", line 1028, in _load_model
    model = AutoModelForCausalLM.from_config(config)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
tensorrt_llm._torch.models.modeling_utils.MetaInitException: Meta tensor used in unsupported function: aten.detach.default


Unknown quantization type, got modelopt - supported types are: ['awq', 'bitsandbytes_4bit', 'bitsandbytes_8bit', 'gptq', 'aqlm', 'quanto', 'quark', 'eetq', 'higgs', 'hqq', 'compressed-tensors', 'fbgemm_fp8', 'torchao', 'bitnet', 'vptq', 'spqr', 'fp8', 'auto-round']. Hence, we will skip the quantization. To remove the warning, you can delete the quantization_config attribute in config.json
Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
[09/29/2025-18:25:43] [TRT-LLM] [E] Failed to initialize executor on rank 0: Error(s) in loading state_dict for Linear:
	size mismatch for weight: copying a param with shape torch.Size([3584, 9472]) from checkpoint, the shape in current model is torch.Size([3584, 18944]).
  4. If we manually edit the size in the model's config.json to match [3584, 9472], loading instead fails with: While copying the parameter named "weight", whose dimensions in the model are torch.Size([3584, 9472]) and whose dimensions in the checkpoint are torch.Size([3584, 9472]), an exception occurred : ('Only Tensors of floating point and complex dtype can require gradients',).

Expected behavior

trtllm-bench should benchmark the Qwen2.5-VL NVFP4 model with the PyTorch backend.

actual behavior

The run fails exactly as shown in the Reproduction section above: after the modelopt quantization_config warning and the meta-tensor fallback, executor initialization fails on rank 0 with a size mismatch for the Linear weight (torch.Size([3584, 9472]) in the checkpoint vs. torch.Size([3584, 18944]) in the model).

additional notes

Per https://nvidia.github.io/TensorRT-LLM/reference/support-matrix.html#models-pytorch-backend, NVFP4 should be supported for this model on the PyTorch backend. Note also that 9472 is exactly half of 18944, which looks consistent with the checkpoint storing NVFP4 weights packed two 4-bit values per byte; the fallback Hugging Face loading path apparently does not handle this packed layout.
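
To check what the checkpoint actually stores, the shard tensors can be inspected directly. A minimal sketch, assuming the model has been downloaded locally as *.safetensors shards; the local path and the "mlp"/"proj" name filter are placeholders, not the real tensor names:

```python
# Hedged sketch: print shapes/dtypes of MLP projection tensors in the FP4 checkpoint.
from pathlib import Path
from safetensors import safe_open

ckpt_dir = Path("/scratch/models/Qwen2.5-VL-7B-Instruct-FP4")  # hypothetical local download path

for shard in sorted(ckpt_dir.glob("*.safetensors")):
    with safe_open(str(shard), framework="pt") as f:
        for name in f.keys():
            if "mlp" in name and "proj" in name:  # focus on the layers from the error
                t = f.get_tensor(name)
                print(f"{shard.name}: {name} shape={tuple(t.shape)} dtype={t.dtype}")
```

If the gate/up projection weights show up with the second dimension halved (9472 instead of 18944) and an integer dtype, that would support the packed-FP4 reading above.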

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.

Labels

  • Inference runtime <NV>: General operational aspects of TRTLLM execution not in other categories
  • Model customization <NV>: Adding support for new model architectures or variants
  • Multimodal: Label for issues & PRs regarding Multimodal related objects
  • Pytorch <NV>: Pytorch backend related issues
  • bug: Something isn't working
