[Bug]: Missing keys when loading Qwen2.5-VL-FP8 weights in TensorRT-LLM #8569

@Lsy-1997

Description

System Info

When running the official example (https://huggingface.co/nvidia/Qwen2.5-VL-7B-Instruct-FP8) for Qwen2.5-VL-FP8 with TensorRT-LLM, I encountered missing keys when loading the model weights. The model failed to initialize correctly due to a mismatch between the checkpoint and the model definition.

Environment:
GPU: H20
Driver Version: 535.216.01
CUDA Version: 13.0
OS: Ubuntu 24.04

Python Environment:
Python 3.12.3
torch 2.8.0a0+34c6371d24.nv25.8
tensorrt_llm 1.2.0rc0
transformers 4.56.0

Error logs:

RuntimeError: Error(s) in loading state_dict for Qwen2VisionModelBase:
        Missing key(s) in state_dict: "visual.blocks.0.attn.qkv_proj.weight_scale", "visual.blocks.0.attn.qkv_proj.input_scale", "visual.blocks.0.attn.qkv_proj.inv_input_scale", "visual.blocks.0.attn.qkv_proj.kv_scales", "visual.blocks.0.attn.qkv_proj.inv_kv_scales", 
........
"visual.blocks.31.attn.o_proj.inv_input_scale", "visual.blocks.31.attn.o_proj.kv_scales", "visual.blocks.31.attn.o_proj.inv_kv_scales".

Reproduction

from tensorrt_llm import LLM, SamplingParams

def main():
    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    llm = LLM(model="nvidia/Qwen2.5-VL-7B-Instruct-FP8", tensor_parallel_size=1)
    outputs = llm.generate(prompts, sampling_params)

    for output in outputs:
        print(f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")

if __name__ == "__main__":
    main()

Expected behavior

The model should load successfully and generate text, just like the already supported original model (Qwen/Qwen2.5-VL-7B-Instruct).

Actual behavior

When I ran the official FP8 example for Qwen2.5-VL, the model failed to load.
The script raised a missing/unexpected keys error during weight loading (see the error logs above).

Additional notes

It seems the FP8 checkpoint on HuggingFace may not include the quantization scale tensors (e.g., weight_scale, input_scale, kv_scales) expected by the TensorRT-LLM Qwen2VL model definition.
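
To help narrow this down, below is a minimal sketch (not part of the original report) that lists the tensor names in the published checkpoint and checks whether the visual.* scale tensors are actually there. It assumes the huggingface_hub and safetensors packages are installed; the repo id is the one used in the reproduction above.

from glob import glob
import os

from huggingface_hub import snapshot_download
from safetensors import safe_open

# Download only the safetensors shards of the FP8 checkpoint.
repo_dir = snapshot_download(
    "nvidia/Qwen2.5-VL-7B-Instruct-FP8",
    allow_patterns=["*.safetensors"],
)

# Collect every visual.* tensor whose name mentions a scale.
scale_keys = []
for shard in glob(os.path.join(repo_dir, "*.safetensors")):
    with safe_open(shard, framework="pt") as f:
        for name in f.keys():
            if name.startswith("visual.") and "scale" in name:
                scale_keys.append(name)

print(f"Found {len(scale_keys)} visual scale tensors")
for name in sorted(scale_keys)[:10]:
    print(name)

If this prints zero visual scale tensors, the checkpoint itself lacks them, and the mismatch would be on the checkpoint side rather than in the TensorRT-LLM weight loader.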

Could you please confirm:
• Whether the FP8 vision weights for Qwen2.5-VL are fully compatible with tensorrt_llm>=1.2.0rc0?
• Or if there’s an updated checkpoint or branch that supports this model?

Labels

Model customization <NV> (Adding support for new model architectures or variants), Multimodal (Label for issues & PRs regarding Multimodal related objects), bug (Something isn't working)
