Description
System Info
x86_64 / arm64
H100 / GB200
nvcr.io/nvidia/tensorrt-llm/release:1.0.0rc5
NVIDIA-SMI 550.90.07 Driver Version: 550.90.07 CUDA Version: 12.4
The issue happens on both H100 and GB200.
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
The errors: `size mismatch for weight` / `Only Tensors of floating point and complex dtype can require gradients`.
Model: https://huggingface.co/nvidia/Qwen2.5-VL-7B-Instruct-FP4
DataSet: Self generated dataset matching trtllm-bench multi-model input format: https://nvidia.github.io/TensorRT-LLM/performance/perf-benchmarking.html#running-multi-modal-models-in-the-pytorch-workflow
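For reference, a minimal sketch of how the dataset was generated. The field names (task_id, prompt, media_paths, output_tokens) are my reading of the linked multi-modal dataset format and should be verified against that page; the image path is hypothetical.

```python
# Sketch of the dataset generator (assumption: record fields follow the linked
# multi-modal dataset docs; verify field names against that page before use).
import json

image_path = "/scratch/images/sample_736x480.jpg"  # hypothetical local image

with open("inlen1024_outlen110_prefixlen0_MM_imgnum1_imgsize736x480.jsonl", "w") as f:
    for task_id in range(1000):  # matches --num_requests 1000
        record = {
            "task_id": task_id,
            "prompt": "Describe the image in detail.",  # padded to ~1024 input tokens in the real script
            "media_paths": [image_path],                # one 736x480 image per request
            "output_tokens": 110,
        }
        f.write(json.dumps(record) + "\n")
```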
Reproduction steps:
docker run --gpus all --ipc=host -it -v /mnt/disk:/scratch nvcr.io/nvidia/tensorrt-llm/release:1.0.0rc5 bash
trtllm-bench --model nvidia/Qwen2.5-VL-7B-Instruct-FP4 throughput --dataset /scratch/my_generated_datasets/Qwen_Qwen2.5-VL-7B-Instruct/inlen1024_outlen110_prefixlen0_MM_imgnum1_imgsize736x480.jsonl --backend pytorch --modality image --num_requests 1000 --streaming --max_input_len 8192 --max_batch_size 2 --kv_cache_free_gpu_mem_fraction 0.95
And we get:
[09/29/2025-18:25:43] [TRT-LLM] [I] KV cache quantization set to "auto". Using checkpoint KV quantization.
Unknown quantization type, got modelopt - supported types are: ['awq', 'bitsandbytes_4bit', 'bitsandbytes_8bit', 'gptq', 'aqlm', 'quanto', 'quark', 'eetq', 'higgs', 'hqq', 'compressed-tensors', 'fbgemm_fp8', 'torchao', 'bitnet', 'vptq', 'spqr', 'fp8', 'auto-round']. Hence, we will skip the quantization. To remove the warning, you can delete the quantization_config attribute in config.json
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
[09/29/2025-18:25:43] [TRT-LLM] [I] Fallback to regular model init: Traceback (most recent call last):
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py", line 1028, in _load_model
model = AutoModelForCausalLM.from_config(config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
tensorrt_llm._torch.models.modeling_utils.MetaInitException: Meta tensor used in unsupported function: aten.detach.default
Unknown quantization type, got modelopt - supported types are: ['awq', 'bitsandbytes_4bit', 'bitsandbytes_8bit', 'gptq', 'aqlm', 'quanto', 'quark', 'eetq', 'higgs', 'hqq', 'compressed-tensors', 'fbgemm_fp8', 'torchao', 'bitnet', 'vptq', 'spqr', 'fp8', 'auto-round']. Hence, we will skip the quantization. To remove the warning, you can delete the quantization_config attribute in config.json
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
[09/29/2025-18:25:43] [TRT-LLM] [E] Failed to initialize executor on rank 0: Error(s) in loading state_dict for Linear:
size mismatch for weight: copying a param with shape torch.Size([3584, 9472]) from checkpoint, the shape in current model is torch.Size([3584, 18944]).
- In the model file config.json, manually edit the size to [3584, 9472], and we then get: While copying the parameter named "weight", whose dimensions in the model are torch.Size([3584, 9472]) and whose dimensions in the checkpoint are torch.Size([3584, 9472]), an exception occurred : ('Only Tensors of floating point and complex dtype can require gradients',).
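Side note: the "Unknown quantization type, got modelopt" warning can be silenced exactly as the log suggests, by deleting the quantization_config attribute from the checkpoint's config.json. A rough sketch follows (the path is hypothetical; this only removes the warning, it does not fix the size mismatch):

```python
# Sketch: drop the quantization_config block from a local copy of config.json,
# as the warning message suggests. Point config_path at your local snapshot of
# nvidia/Qwen2.5-VL-7B-Instruct-FP4 (path below is hypothetical).
import json

config_path = "/scratch/models/Qwen2.5-VL-7B-Instruct-FP4/config.json"

with open(config_path) as f:
    config = json.load(f)

config.pop("quantization_config", None)  # remove the attribute if present

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
```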
Expected behavior
trtllm-bench should successfully benchmark the Qwen2.5-VL model with the PyTorch backend.
Actual behavior
[09/29/2025-18:25:43] [TRT-LLM] [I] KV cache quantization set to "auto". Using checkpoint KV quantization.
Unknown quantization type, got modelopt - supported types are: ['awq', 'bitsandbytes_4bit', 'bitsandbytes_8bit', 'gptq', 'aqlm', 'quanto', 'quark', 'eetq', 'higgs', 'hqq', 'compressed-tensors', 'fbgemm_fp8', 'torchao', 'bitnet', 'vptq', 'spqr', 'fp8', 'auto-round']. Hence, we will skip the quantization. To remove the warning, you can delete the quantization_config attribute in config.json
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
[09/29/2025-18:25:43] [TRT-LLM] [I] Fallback to regular model init: Traceback (most recent call last):
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py", line 1028, in _load_model
model = AutoModelForCausalLM.from_config(config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
tensorrt_llm._torch.models.modeling_utils.MetaInitException: Meta tensor used in unsupported function: aten.detach.default
Unknown quantization type, got modelopt - supported types are: ['awq', 'bitsandbytes_4bit', 'bitsandbytes_8bit', 'gptq', 'aqlm', 'quanto', 'quark', 'eetq', 'higgs', 'hqq', 'compressed-tensors', 'fbgemm_fp8', 'torchao', 'bitnet', 'vptq', 'spqr', 'fp8', 'auto-round']. Hence, we will skip the quantization. To remove the warning, you can delete the quantization_config attribute in config.json
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
[09/29/2025-18:25:43] [TRT-LLM] [E] Failed to initialize executor on rank 0: Error(s) in loading state_dict for Linear:
size mismatch for weight: copying a param with shape torch.Size([3584, 9472]) from checkpoint, the shape in current model is torch.Size([3584, 18944]).
Additional notes
Per https://nvidia.github.io/TensorRT-LLM/reference/support-matrix.html#models-pytorch-backend, NVFP4 should be supported.
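One observation on the size mismatch: 9472 is exactly 18944 / 2, which would be consistent with the checkpoint storing FP4 weights packed two-per-byte while the model builds the Linear with the unpacked shape. This is my guess at the cause, not a confirmed diagnosis:

```python
# Quick sanity check of the shapes from the error message (my interpretation,
# not confirmed): FP4 packs two 4-bit values per uint8, halving the last dim.
checkpoint_shape = (3584, 9472)   # shape stored in the FP4 checkpoint
model_shape = (3584, 18944)       # shape the current model's Linear expects
assert checkpoint_shape[1] * 2 == model_shape[1]  # 9472 * 2 == 18944
```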
Before submitting a new issue...
- Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.