Description
System Info
x86_64 / arm64
H100 / GB200
nvcr.io/nvidia/tensorrt-llm/release:1.0.0rc5
NVIDIA-SMI 550.90.07 Driver Version: 550.90.07 CUDA Version: 12.4
The issue happens on both H100 and GB200.
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
The errors: `size mismatch for weight` / `Only Tensors of floating point and complex dtype can require gradients`.
Model: https://huggingface.co/nvidia/Qwen2.5-VL-7B-Instruct-FP4
DataSet: Self generated dataset matching trtllm-bench multi-model input format: https://nvidia.github.io/TensorRT-LLM/performance/perf-benchmarking.html#running-multi-modal-models-in-the-pytorch-workflow
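For reference, a minimal sketch of how the dataset was generated. The field names (task_id, prompt, media_paths, output_tokens) are my reading of the linked multi-modal dataset format and should be verified against that page; the image path is hypothetical.

```python
# Sketch of the dataset generator (assumption: record fields follow the linked
# multi-modal dataset docs; verify field names against that page before use).
import json

image_path = "/scratch/images/sample_736x480.jpg"  # hypothetical local image

with open("inlen1024_outlen110_prefixlen0_MM_imgnum1_imgsize736x480.jsonl", "w") as f:
    for task_id in range(1000):  # matches --num_requests 1000
        record = {
            "task_id": task_id,
            "prompt": "Describe the image in detail.",  # padded to ~1024 input tokens in the real script
            "media_paths": [image_path],                # one 736x480 image per request
            "output_tokens": 110,
        }
        f.write(json.dumps(record) + "\n")
```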
Reproduction steps:
docker run --gpus all --ipc=host -it -v /mnt/disk:/scratch nvcr.io/nvidia/tensorrt-llm/release:1.0.0rc5 bash
trtllm-bench --model nvidia/Qwen2.5-VL-7B-Instruct-FP4 throughput --dataset /scratch/my_generated_datasets/Qwen_Qwen2.5-VL-7B-Instruct/inlen1024_outlen110_prefixlen0_MM_imgnum1_imgsize736x480.jsonl --backend pytorch --modality image --num_requests 1000 --streaming --max_input_len 8192 --max_batch_size 2 --kv_cache_free_gpu_mem_fraction 0.95
And we get:
[09/29/2025-18:25:43] [TRT-LLM] [I] KV cache quantization set to "auto". Using checkpoint KV quantization.
Unknown quantization type, got modelopt - supported types are: ['awq', 'bitsandbytes_4bit', 'bitsandbytes_8bit', 'gptq', 'aqlm', 'quanto', 'quark', 'eetq', 'higgs', 'hqq', 'compressed-tensors', 'fbgemm_fp8', 'torchao', 'bitnet', 'vptq', 'spqr', 'fp8', 'auto-round']. Hence, we will skip the quantization. To remove the warning, you can delete the quantization_config attribute in config.json
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
[09/29/2025-18:25:43] [TRT-LLM] [I] Fallback to regular model init: Traceback (most recent call last):
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py", line 1028, in _load_model
model = AutoModelForCausalLM.from_config(config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
tensorrt_llm._torch.models.modeling_utils.MetaInitException: Meta tensor used in unsupported function: aten.detach.default
Unknown quantization type, got modelopt - supported types are: ['awq', 'bitsandbytes_4bit', 'bitsandbytes_8bit', 'gptq', 'aqlm', 'quanto', 'quark', 'eetq', 'higgs', 'hqq', 'compressed-tensors', 'fbgemm_fp8', 'torchao', 'bitnet', 'vptq', 'spqr', 'fp8', 'auto-round']. Hence, we will skip the quantization. To remove the warning, you can delete the quantization_config attribute in config.json
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
[09/29/2025-18:25:43] [TRT-LLM] [E] Failed to initialize executor on rank 0: Error(s) in loading state_dict for Linear:
size mismatch for weight: copying a param with shape torch.Size([3584, 9472]) from checkpoint, the shape in current model is torch.Size([3584, 18944]).
- In the model file config.json, manually edit the size to [3584, 9472], and we then get: While copying the parameter named "weight", whose dimensions in the model are torch.Size([3584, 9472]) and whose dimensions in the checkpoint are torch.Size([3584, 9472]), an exception occurred : ('Only Tensors of floating point and complex dtype can require gradients',).
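Side note: the "Unknown quantization type, got modelopt" warning can be silenced exactly as the log suggests, by deleting the quantization_config attribute from the checkpoint's config.json. A rough sketch follows (the path is hypothetical; this only removes the warning, it does not fix the size mismatch):

```python
# Sketch: drop the quantization_config block from a local copy of config.json,
# as the warning message suggests. Point config_path at your local snapshot of
# nvidia/Qwen2.5-VL-7B-Instruct-FP4 (path below is hypothetical).
import json

config_path = "/scratch/models/Qwen2.5-VL-7B-Instruct-FP4/config.json"

with open(config_path) as f:
    config = json.load(f)

config.pop("quantization_config", None)  # remove the attribute if present

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
```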
Expected behavior
trtllm-bench should successfully benchmark the Qwen2.5-VL model with the PyTorch backend.
Actual behavior
[09/29/2025-18:25:43] [TRT-LLM] [I] KV cache quantization set to "auto". Using checkpoint KV quantization.
Unknown quantization type, got modelopt - supported types are: ['awq', 'bitsandbytes_4bit', 'bitsandbytes_8bit', 'gptq', 'aqlm', 'quanto', 'quark', 'eetq', 'higgs', 'hqq', 'compressed-tensors', 'fbgemm_fp8', 'torchao', 'bitnet', 'vptq', 'spqr', 'fp8', 'auto-round']. Hence, we will skip the quantization. To remove the warning, you can delete the quantization_config attribute in config.json
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
[09/29/2025-18:25:43] [TRT-LLM] [I] Fallback to regular model init: Traceback (most recent call last):
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py", line 1028, in _load_model
model = AutoModelForCausalLM.from_config(config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
tensorrt_llm._torch.models.modeling_utils.MetaInitException: Meta tensor used in unsupported function: aten.detach.default
Unknown quantization type, got modelopt - supported types are: ['awq', 'bitsandbytes_4bit', 'bitsandbytes_8bit', 'gptq', 'aqlm', 'quanto', 'quark', 'eetq', 'higgs', 'hqq', 'compressed-tensors', 'fbgemm_fp8', 'torchao', 'bitnet', 'vptq', 'spqr', 'fp8', 'auto-round']. Hence, we will skip the quantization. To remove the warning, you can delete the quantization_config attribute in config.json
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
[09/29/2025-18:25:43] [TRT-LLM] [E] Failed to initialize executor on rank 0: Error(s) in loading state_dict for Linear:
size mismatch for weight: copying a param with shape torch.Size([3584, 9472]) from checkpoint, the shape in current model is torch.Size([3584, 18944]).
Additional notes
Per https://nvidia.github.io/TensorRT-LLM/reference/support-matrix.html#models-pytorch-backend, NVFP4 should be supported.
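One observation on the size mismatch: 9472 is exactly 18944 / 2, which would be consistent with the checkpoint storing FP4 weights packed two-per-byte while the model builds the Linear with the unpacked shape. This is my guess at the cause, not a confirmed diagnosis:

```python
# Quick sanity check of the shapes from the error message (my interpretation,
# not confirmed): FP4 packs two 4-bit values per uint8, halving the last dim.
checkpoint_shape = (3584, 9472)   # shape stored in the FP4 checkpoint
model_shape = (3584, 18944)       # shape the current model's Linear expects
assert checkpoint_shape[1] * 2 == model_shape[1]  # 9472 * 2 == 18944
```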
Before submitting a new issue...
- Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.