Bridging the gap between traditional medical IQA and human-like reasoning with Multi-modal Large Language Models
- [2025.10] VLMEvalKit integration now supported!
- [2025.09] MedQ-Bench paper submitted to arXiv
- [2025.09] Dataset publicly released on Hugging Face
MedQ-Bench is the first comprehensive benchmark for evaluating Medical Image Quality Assessment (IQA) capabilities of Multi-modal Large Language Models (MLLMs). Unlike traditional score-based IQA methods, MedQ-Bench introduces a perception-reasoning paradigm that mirrors clinicians' cognitive workflow for quality assessment.
- Medical-Specialized Focus: Designed specifically for medical imaging quality assessment across 5 modalities
- Perception-Reasoning Paradigm: Evaluates both visual quality perception and reasoning skills
- Comprehensive Evaluation: 3,308 samples covering 40+ quality attributes with multi-dimensional assessment
- Human-AI Alignment: Validated against expert radiologist assessments with strong agreement (κ_w > 0.77); an illustrative kappa computation follows below
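The κ_w figure above refers to a weighted kappa between model and expert quality ratings. As a rough illustration of how such agreement can be computed (an assumption about the protocol, not the paper's evaluation code), a quadratic-weighted Cohen's kappa over 0-2 ratings looks like this:

```python
# Illustrative only: hypothetical 0-2 quality ratings, not MedQ-Bench data.
from sklearn.metrics import cohen_kappa_score

model_ratings  = [2, 1, 0, 2, 1, 1, 0, 2]  # hypothetical model scores
expert_ratings = [2, 1, 0, 1, 1, 2, 0, 2]  # hypothetical radiologist scores

kappa_w = cohen_kappa_score(model_ratings, expert_ratings, weights="quadratic")
print(f"Weighted kappa: {kappa_w:.3f}")
```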
Examples of MedQ-Bench evaluation tasks across different modalities, covering perception (MCQA) tasks, no-reference reasoning, and comparative reasoning scenarios.
- MedQ-Perception: Probes low-level perceptual capability via human-curated questions
  - Yes-or-No, What, and How question types
  - General medical vs. modality-specific questions
  - No degradation vs. mild/severe degradation levels
- MedQ-Reasoning: Encompasses reasoning tasks that align with human-like quality assessment (an illustrative item sketch follows this list)
  - No-reference reasoning (single-image analysis with a detailed quality description)
  - Comparative reasoning (paired-image evaluation and comparison)
  - Coarse-grained vs. fine-grained difficulty levels
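To make the two tracks concrete, the sketch below shows roughly what a perception item and a reasoning prompt look like. All field names, file names, and question text are hypothetical illustrations, not the released dataset schema:

```python
# Hypothetical examples; field names are NOT the official MedQ-Bench schema.
perception_item = {
    "modality": "CT",
    "question_type": "Yes-or-No",   # Yes-or-No / What / How
    "question": "Are metal artifacts present in this image?",
    "options": ["Yes", "No"],
    "degradation_level": "severe",  # none / mild / severe
}

reasoning_item = {
    "task": "comparative",          # "no-reference" or "comparative"
    "images": ["scan_a.png", "scan_b.png"],
    "instruction": "Compare the overall quality of the two images and justify your judgment.",
}
```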
| Modality | Samples | Key Degradation Types |
|---|---|---|
| CT | 878 | Metal artifacts, noise, streak artifacts, ... |
| MRI | 848 | Motion artifacts, undersampling, susceptibility, ... |
| Histopathology | 758 | Staining artifacts, focus issues, compression, ... |
| Endoscopy | 555 | Illumination, specular reflection, motion blur, ... |
| Fundus | 269 | Color distortion, illumination, contrast issues, ... |
Reasoning Tasks Assessed via 4 Dimensions (a scoring sketch follows the list):
- Completeness (0-2): Coverage of key visual information
- Preciseness (0-2): Consistency with reference assessment
- Consistency (0-2): Logical coherence between reasoning and conclusion
- Quality Accuracy (0-2): Correctness of final quality judgment
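The Overall values reported in the reasoning leaderboards further down are consistent with a simple sum of these four 0-2 dimension scores. The helper below is a minimal illustration of that aggregation, not the official evaluation code:

```python
# Minimal sketch: Overall appears to be the sum of the four 0-2 judge scores
# (e.g., 1.195 + 1.118 + 1.837 + 1.529 = 5.679 for GPT-5). Illustrative only.
def overall_reasoning_score(completeness: float, preciseness: float,
                            consistency: float, quality_accuracy: float) -> float:
    scores = (completeness, preciseness, consistency, quality_accuracy)
    for s in scores:
        assert 0.0 <= s <= 2.0, "each dimension is scored on a 0-2 scale"
    return sum(scores)

print(f"{overall_reasoning_score(1.195, 1.118, 1.837, 1.529):.3f}")  # 5.679
```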
| Model Category | Best Performer | Overall Score |
|---|---|---|
| Commercial | GPT-5 | 68.97% |
| Open-Source | Qwen2.5-VL-72B | 63.14% |
| Medical-Specialized | MedGemma-27B | 57.16% |
- Substantial Human-AI Performance Gap: Despite performing above chance, the best AI model (GPT-5, 68.97%) still falls well short of human experts (82.50%). This 13.53-point gap highlights that accuracy is not yet sufficient for reliable clinical deployment without further optimization.
- Mild Degradation Detection Challenges: Models are weakest on mild degradations (56% average accuracy) compared with no degradation (72%) and severe degradation (67%), indicating difficulty detecting the subtle quality issues for which reliable quality control is most clinically critical.
- Medical-Specialized Models Underperform: Contrary to expectations, medical-specialized models (best: MedGemma-27B at 57.16%) consistently lag behind general-purpose models, suggesting that current domain-adaptation strategies prioritize high-level diagnostic reasoning over the low-level visual perception required for quality assessment.
- Limited Reasoning Capabilities: Even advanced models achieve only moderate completeness (1.293/2.0) and preciseness (1.556/2.0) on reasoning tasks, demonstrating preliminary but unstable perceptual and reasoning abilities that fall short of complete, accurate quality descriptions.
The MedQ-Bench dataset has been made available through 🤗 Hugging Face (jiyaoliufd/MedQ-Bench).
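For a quick look at the data, the benchmark can be pulled with the Hugging Face `datasets` library. This is a minimal sketch assuming the default configuration; consult the dataset card for the exact subsets and split names:

```python
# Minimal loading sketch; subset/split names are assumptions, see the dataset card.
from datasets import load_dataset

medq_bench = load_dataset("jiyaoliufd/MedQ-Bench")
print(medq_bench)  # shows the available splits and their columns
```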
MedQ-Bench is now integrated with VLMEvalKit for seamless evaluation of vision-language models.
Please refer to the VLMEvalKit Quick Start Guide for installation instructions.
Evaluate your model on MedQ-Bench tasks using the following commands:
1. Perception Task (Multiple Choice Questions):

```bash
python run.py --model grok-4 --data MedqbenchMCQ_test --api-nproc 32 --retry 3 --reuse
```

2. No-Reference Reasoning Task (Caption):

```bash
python run.py --model grok-4 --data MedqbenchCaption_test --api-nproc 32 --retry 3 --reuse
```

3. Comparative Reasoning Task (Paired Description):

```bash
python run.py --model grok-4 --data MedqbenchPairedDescription_dev --judge gpt-4o --api-nproc 32 --retry 3 --reuse
```

Parameters:

- `--model`: The model to evaluate (e.g., `grok-4`, `gpt-4o`, `qwen2.5-vl-72b`)
- `--data`: The MedQ-Bench dataset split (`MedqbenchMCQ_test`, `MedqbenchCaption_test`, `MedqbenchPairedDescription_dev`)
- `--judge`: Judge model for reasoning tasks (e.g., `gpt-4o`); required for comparative reasoning
- `--api-nproc`: Number of parallel API calls (default: 32)
- `--retry`: Number of retry attempts for failed API calls (default: 3)
- `--reuse`: Reuse existing results to avoid redundant API calls
| Rank | Model | Yes-or-No ↑ | What ↑ | How ↑ | Overall ↑ |
|---|---|---|---|---|---|
| 🥇 | GPT-5 | 82.26% | 60.47% | 58.28% | 68.97% |
| 🥈 | GPT-4o | 78.48% | 49.64% | 57.32% | 64.79% |
| 🥉 | Qwen2.5-VL-72B | 78.67% | 42.25% | 56.44% | 63.14% |
| 🥉 | Grok-4 | 73.30% | 48.84% | 59.10% | 63.14% |
| 5 | Gemini-2.5-Pro | 75.13% | 55.02% | 50.54% | 61.88% |
| Rank | Model | Comp. ↑ | Prec. ↑ | Cons. ↑ | Qual. ↑ | Overall ↑ |
|---|---|---|---|---|---|---|
| 🥇 | GPT-5 | 1.195 | 1.118 | 1.837 | 1.529 | 5.679 |
| 🥈 | GPT-4o | 1.009 | 1.027 | 1.878 | 1.407 | 5.321 |
| 🥉 | Qwen2.5-VL-32B | 1.077 | 0.928 | 1.977 | 1.290 | 5.272 |
| 4 | Gemini-2.5-Pro | 0.878 | 0.891 | 1.688 | 1.561 | 5.018 |
| 5 | Grok-4 | 0.982 | 0.846 | 1.801 | 1.389 | 5.017 |
| Rank | Model | Comp. ↑ | Prec. ↑ | Cons. ↑ | Qual. ↑ | Overall ↑ |
|---|---|---|---|---|---|---|
| 🥇 | GPT-5 | 1.293 | 1.556 | 1.925 | 1.564 | 6.338 |
| 🥈 | GPT-4o | 1.105 | 1.414 | 1.632 | 1.562 | 5.713 |
| 🥉 | Grok-4 | 1.150 | 1.233 | 1.820 | 1.459 | 5.662 |
| 4 | Gemini-2.5-Pro | 1.053 | 1.233 | 1.774 | 1.534 | 5.594 |
| 5 | InternVL3-8B | 0.985 | 1.278 | 1.797 | 1.474 | 5.534 |
Scores are on a 0-2 scale for each dimension; the Overall column is the sum of the four dimensions (maximum 8).
- Best Overall: CT and MRI imaging (higher contrast, clearer artifacts)
- Most Challenging: Histopathology (subtle staining variations, texture complexity)
- Modality Gap: 15-20% performance difference between easiest and hardest modalities
- Perception Errors: Difficulty distinguishing mild vs severe degradations
- Reasoning Gaps: Incomplete description of quality factors
- Consistency Issues: Mismatch between observed artifacts and quality conclusion
- Medical Knowledge: Limited understanding of clinical significance
We welcome contributions! Please see our Contributing Guidelines for details.
If you use MedQ-Bench in your research, please cite our paper:
@misc{liu2025medqbenchevaluatingexploringmedical,
title={MedQ-Bench: Evaluating and Exploring Medical Image Quality Assessment Abilities in MLLMs},
author={Jiyao Liu and Jinjie Wei and Wanying Qu and Chenglong Ma and Junzhi Ning and Yunheng Li and Ying Chen and Xinzhe Luo and Pengcheng Chen and Xin Gao and Ming Hu and Huihui Xu and Xin Wang and Shujian Gao and Dingkang Yang and Zhongying Deng and Jin Ye and Lihao Liu and Junjun He and Ningsheng Xu},
year={2025},
eprint={2510.01691},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2510.01691},
}

- Jiyao Liu: [email protected]
- Lihao Liu: [email protected]
- Junjun He: [email protected]
- Q-Bench Team: For the foundational framework for vision quality assessment
- VLMEvalKit: For the comprehensive evaluation infrastructure
- All Radiologists: Who contributed to human evaluation and validation
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
⭐ Star us on GitHub if MedQ-Bench helps your research! ⭐


