Description
System Info
When running the official example (https://huggingface.co/nvidia/Qwen2.5-VL-7B-Instruct-FP8) for Qwen2.5-VL FP8 with TensorRT-LLM, I encountered missing keys while loading the model weights. The model failed to initialize because the checkpoint does not match the model definition.
Environment:
GPU: H20
Driver Version: 535.216.01
CUDA Version: 13.0
OS: Ubuntu 24.04
Python Environment:
Python 3.12.3
torch 2.8.0a0+34c6371d24.nv25.8
tensorrt_llm 1.2.0rc0
transformers 4.56.0
Error logs:
RuntimeError: Error(s) in loading state_dict for Qwen2VisionModelBase:
Missing key(s) in state_dict: "visual.blocks.0.attn.qkv_proj.weight_scale", "visual.blocks.0.attn.qkv_proj.input_scale", "visual.blocks.0.attn.qkv_proj.inv_input_scale", "visual.blocks.0.attn.qkv_proj.kv_scales", "visual.blocks.0.attn.qkv_proj.inv_kv_scales",
........
"visual.blocks.31.attn.o_proj.inv_input_scale", "visual.blocks.31.attn.o_proj.kv_scales", "visual.blocks.31.attn.o_proj.inv_kv_scales".
Reproduction
from tensorrt_llm import LLM, SamplingParams


def main():
    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
    llm = LLM(model="nvidia/Qwen2.5-VL-7B-Instruct-FP8", tensor_parallel_size=1)
    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        print(f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")


if __name__ == "__main__":
    main()
Expected behavior
The model should load successfully and perform text generation, as the original supported model (Qwen/Qwen2.5-VL-7B-Instruct) does.
Actual behavior
When I ran the official FP8 example for Qwen2.5-VL, the model failed to load.
The script raised missing-key (and unexpected-key) errors during weight loading.
Additional notes
It seems the FP8 checkpoint on Hugging Face may not include the quantization scale tensors (e.g., weight_scale, input_scale, kv_scales) that the TensorRT-LLM Qwen2VL vision model definition expects.
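A quick way to cross-check this is to list the tensor names in the published checkpoint. Below is a minimal sketch, assuming huggingface_hub and safetensors are installed; the "visual." prefix is only an assumption taken from the missing-key names reported above and may differ in the actual checkpoint.

    # Minimal diagnostic sketch (not part of the repro): list tensor names in the
    # published checkpoint and look for FP8 scale tensors under the vision tower.
    # Assumes huggingface_hub and safetensors are installed; the "visual." prefix
    # is taken from the missing-key names above and may not match the checkpoint.
    import glob
    import os

    from huggingface_hub import snapshot_download
    from safetensors import safe_open

    repo_dir = snapshot_download("nvidia/Qwen2.5-VL-7B-Instruct-FP8",
                                 allow_patterns=["*.safetensors"])
    keys = []
    for shard in glob.glob(os.path.join(repo_dir, "*.safetensors")):
        with safe_open(shard, framework="pt") as f:
            keys.extend(f.keys())

    vision_scale_keys = [k for k in keys if k.startswith("visual.") and "scale" in k]
    print(f"total tensors: {len(keys)}, vision scale tensors: {len(vision_scale_keys)}")
    print(vision_scale_keys[:10])

If vision_scale_keys comes back empty, that would support the theory that the vision-tower scale tensors are simply absent from the checkpoint rather than being mis-mapped during loading.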
Could you please confirm:
• Whether the FP8 vision weights for Qwen2.5-VL are fully compatible with tensorrt_llm>=1.2.0rc0?
• Or if there’s an updated checkpoint or branch that supports this model?