
MedQ-Bench: Evaluating and Exploring Medical Image Quality Assessment Abilities in MLLMs

Bridging the gap between traditional medical IQA and human-like reasoning with Multi-modal Large Language Models


Jiyao Liu1*, Jinjie Wei1*, Wanying Qu1, Chenglong Ma1,2, Junzhi Ning2, Yunheng Li1, Ying Chen2, Xinzhe Luo3, Pengcheng Chen2, Xin Gao1, Ming Hu2, Huihui Xu2, Xin Wang2, Shujian Gao1, Dingkang Yang1, Zhongying Deng4, Jin Ye2, Lihao Liu2†, Junjun He2†, Ningsheng Xu1
1Fudan University, 2Shanghai Artificial Intelligence Laboratory, 3Imperial College London, 4University of Cambridge
*Equal contribution. †Corresponding author.
Figure: Overview of the MedQ-Bench framework.

🔥 News

  • [2025.10] VLMEvalKit integration now supported!
  • [2025.09] MedQ-Bench paper submitted to arXiv
  • [2025.09] Dataset publicly released on Hugging Face

🎯 Overview

MedQ-Bench is the first comprehensive benchmark for evaluating Medical Image Quality Assessment (IQA) capabilities of Multi-modal Large Language Models (MLLMs). Unlike traditional score-based IQA methods, MedQ-Bench introduces a perception-reasoning paradigm that mirrors clinicians' cognitive workflow for quality assessment.

Figure: Why reasoning-based IQA? Comparison of traditional score-based IQA vs. our reasoning-based approach. Unlike purely numerical scores, reasoning-based IQA identifies distortion types and their relative impact, yielding results more consistent with human judgment.

πŸ” Key Innovations

  • πŸ₯ Medical-Specialized Focus: Designed specifically for medical imaging quality assessment across 5 modalities
  • 🧠 Perception-Reasoning Paradigm: Evaluates both Visual quality perceptual abilities and reasoning skills
  • πŸ“Š Comprehensive Evaluation: 3,308 samples covering 40+ quality attributes with multi-dimensional assessment
  • πŸ‘¨β€βš•οΈ Human-AI Alignment: Validated against expert radiologist assessments with strong correlation (ΞΊw > 0.77)
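
Agreement of this kind is commonly quantified with weighted Cohen's κ between model and expert quality ratings. The snippet below is a minimal sketch of that computation using scikit-learn with made-up ratings; it is illustrative only and not the benchmark's evaluation code.

```python
# Illustrative only: quadratic-weighted Cohen's kappa between hypothetical
# expert and model quality ratings (0 = severe, 1 = mild, 2 = no degradation).
# The encoding and the ratings below are assumptions, not MedQ-Bench data.
from sklearn.metrics import cohen_kappa_score

expert_ratings = [2, 1, 0, 2, 1, 1, 0, 2]
model_ratings  = [2, 1, 1, 2, 0, 1, 0, 2]

kappa_w = cohen_kappa_score(expert_ratings, model_ratings, weights="quadratic")
print(f"Weighted kappa: {kappa_w:.3f}")
```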

πŸ—οΈ Benchmark Architecture

📋 Two Complementary Tasks

Figure: Examples of MedQ-Bench evaluation tasks across different modalities, covering perception (MCQA) tasks, no-reference reasoning, and comparative reasoning scenarios.

  1. MedQ-Perception: Probes low-level perceptual capability via human-curated questions

    • Yes-or-No, What, How question types
    • General medical vs. modality-specific questions
    • No degradation vs. mild/severe degradation levels
  2. MedQ-Reasoning: Encompasses reasoning tasks aligned with human-like quality assessment (illustrative samples for both tasks are sketched after this list)

    • No-reference reasoning (single image analysis with detailed quality description)
    • Comparative reasoning (paired image evaluation and comparison)
    • Coarse-grained vs. fine-grained difficulty levels
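
To make the two task formats concrete, here is a sketch of what a perception MCQ item and a no-reference reasoning prompt might look like. The field names, file names, and wording are illustrative assumptions, not the released dataset schema.

```python
# Hypothetical samples for the two MedQ-Bench task families.
# All field names and values here are illustrative assumptions.
perception_sample = {
    "image": "ct_case_0042.png",            # assumed file name
    "question_type": "Yes-or-No",           # one of: Yes-or-No, What, How
    "question": "Is there visible motion artifact in this CT image?",
    "options": ["Yes", "No"],
    "answer": "Yes",
}

reasoning_sample = {
    "image": "mri_case_0007.png",           # assumed file name
    "task": "no-reference reasoning",
    "prompt": (
        "Describe the quality of this MRI image, identify any degradations, "
        "and conclude with an overall quality judgment."
    ),
    # The model's free-text answer is judged on completeness, preciseness,
    # consistency, and quality accuracy (0-2 each; see the next section).
}
```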

πŸ₯ Coverage Across Medical Imaging

| Modality       | Samples | Key Degradation Types                                 |
|----------------|---------|-------------------------------------------------------|
| CT             | 878     | Metal artifacts, noise, streak artifacts, ...         |
| MRI            | 848     | Motion artifacts, undersampling, susceptibility, ...  |
| Histopathology | 758     | Staining artifacts, focus issues, compression, ...    |
| Endoscopy      | 555     | Illumination, specular reflection, motion blur, ...   |
| Fundus         | 269     | Color distortion, illumination, contrast issues, ...  |

🎯 Multi-Dimensional Evaluation

Reasoning tasks are assessed along four dimensions (a scoring sketch follows the list):

  • Completeness (0-2): Coverage of key visual information
  • Preciseness (0-2): Consistency with reference assessment
  • Consistency (0-2): Logical coherence between reasoning and conclusion
  • Quality Accuracy (0-2): Correctness of final quality judgment
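
Consistent with the leaderboard tables below, the Overall reasoning score is the sum of the four per-dimension scores (0-8). A minimal sketch of that aggregation, with made-up scores, is shown here; the actual per-dimension judging protocol is described in the paper.

```python
# Aggregate four 0-2 dimension scores into the 0-8 "Overall" reported in the
# reasoning leaderboards. The example scores are made up for illustration.
DIMENSIONS = ("completeness", "preciseness", "consistency", "quality_accuracy")

def overall_score(scores: dict) -> float:
    assert all(0.0 <= scores[d] <= 2.0 for d in DIMENSIONS), "each dimension is scored 0-2"
    return sum(scores[d] for d in DIMENSIONS)

example = {"completeness": 1.2, "preciseness": 1.1, "consistency": 1.8, "quality_accuracy": 1.5}
print(overall_score(example))  # 5.6
```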

📊 Key Findings

πŸ† Model Performance Hierarchy

| Model Category      | Best Performer | Overall Score |
|---------------------|----------------|---------------|
| Commercial          | GPT-5          | 68.97%        |
| Open-Source         | Qwen2.5-VL-72B | 63.14%        |
| Medical-Specialized | MedGemma-27B   | 57.16%        |

πŸ” Critical Insights

  1. Substantial Human-AI Performance Gap: Although the best-performing model (GPT-5, 68.97%) scores well above chance, it still trails human experts (82.50%) by 13.53 percentage points, indicating that accuracy is not yet sufficient for reliable clinical deployment without further optimization.

  2. Mild Degradation Detection Challenges: Models are weakest on mild degradations (56% average accuracy) compared with no degradation (72%) and severe degradation (67%), showing difficulty with precisely the subtle quality issues where reliable quality control is most clinically critical.

  3. Medical-Specialized Models Underperform: Contrary to expectations, medical-specialized models (best: MedGemma-27B at 57.16%) consistently lag behind general-purpose models, suggesting that current domain-adaptation strategies prioritize high-level diagnostic reasoning over the low-level visual perception required for quality assessment.

  4. Limited Reasoning Capabilities: Even advanced models reach only moderate completeness (1.293/2.0) and preciseness (1.556/2.0) on reasoning tasks, demonstrating preliminary but unstable perceptual and reasoning abilities that fall short of complete, accurate quality descriptions.

🚀 Getting Started

💾 Dataset Access

The MedQ-Bench dataset is publicly available on 🤗 Hugging Face (jiyaoliufd/MedQ-Bench).
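
For quick inspection, the data can be pulled with the datasets library. This is a minimal sketch: the repo id comes from the line above, but the available configs, splits, and column names are assumptions, so check the dataset card for the exact layout.

```python
# Minimal sketch for browsing the MedQ-Bench data from the Hugging Face Hub.
# Split and column names are assumptions; consult the dataset card first.
from datasets import load_dataset

ds = load_dataset("jiyaoliufd/MedQ-Bench")   # repo id from the dataset card above
print(ds)                                    # lists the available splits and columns

first_split = next(iter(ds))                 # take whichever split comes first
sample = ds[first_split][0]
print(sample.keys())                         # inspect the fields of one record
```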

🔬 Evaluation

MedQ-Bench is now integrated with VLMEvalKit for seamless evaluation of vision-language models.

Installation

Please refer to the VLMEvalKit Quick Start Guide for installation instructions.

Usage

Evaluate your model on MedQ-Bench tasks using the following commands:

1. Perception Task (Multiple Choice Questions):

python run.py --model grok-4 --data MedqbenchMCQ_test --api-nproc 32 --retry 3 --reuse

2. No-Reference Reasoning Task (Caption):

python run.py --model grok-4 --data MedqbenchCaption_test --api-nproc 32 --retry 3 --reuse

3. Comparative Reasoning Task (Paired Description):

python run.py --model grok-4 --data MedqbenchPairedDescription_dev --judge gpt-4o --api-nproc 32 --retry 3 --reuse

Parameters:

  • --model: The model to evaluate (e.g., grok-4, gpt-4o, qwen2.5-vl-72b)
  • --data: The MedQ-Bench dataset split (MedqbenchMCQ_test, MedqbenchCaption_test, MedqbenchPairedDescription_dev)
  • --judge: Judge model for reasoning tasks (e.g., gpt-4o) - required for comparative reasoning
  • --api-nproc: Number of parallel API calls (default: 32)
  • --retry: Number of retry attempts for failed API calls (default: 3)
  • --reuse: Reuse existing results to avoid redundant API calls
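
To run all three tasks for one model in a single go, a small Python wrapper around the commands above might look like the sketch below. Run it from the VLMEvalKit root; the model and judge names are just the examples used above, and the flags are exactly those documented in this section.

```python
# Convenience wrapper: run the three MedQ-Bench tasks back-to-back by
# shelling out to VLMEvalKit's run.py with the commands shown above.
import subprocess

MODEL = "grok-4"                                   # any model VLMEvalKit supports
COMMON = ["--api-nproc", "32", "--retry", "3", "--reuse"]

TASKS = [
    ("MedqbenchMCQ_test", []),                     # perception (MCQA)
    ("MedqbenchCaption_test", []),                 # no-reference reasoning
    ("MedqbenchPairedDescription_dev", ["--judge", "gpt-4o"]),  # comparative reasoning
]

for dataset, extra in TASKS:
    cmd = ["python", "run.py", "--model", MODEL, "--data", dataset, *extra, *COMMON]
    print("Running:", " ".join(cmd))
    subprocess.run(cmd, check=True)                # stop if any run fails
```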

📈 Leaderboard

Perception Tasks (Test Set)

| Rank | Model          | Yes-or-No ↑ | What ↑ | How ↑  | Overall ↑ |
|------|----------------|-------------|--------|--------|-----------|
| 🥇   | GPT-5          | 82.26%      | 60.47% | 58.28% | 68.97%    |
| 🥈   | GPT-4o         | 78.48%      | 49.64% | 57.32% | 64.79%    |
| 🥉   | Qwen2.5-VL-72B | 78.67%      | 42.25% | 56.44% | 63.14%    |
| 🥉   | Grok-4         | 73.30%      | 48.84% | 59.10% | 63.14%    |
| 5    | Gemini-2.5-Pro | 75.13%      | 55.02% | 50.54% | 61.88%    |

No-Reference Reasoning Tasks (Test Set)

| Rank | Model          | Comp. ↑ | Prec. ↑ | Cons. ↑ | Qual. ↑ | Overall ↑ |
|------|----------------|---------|---------|---------|---------|-----------|
| 🥇   | GPT-5          | 1.195   | 1.118   | 1.837   | 1.529   | 5.679     |
| 🥈   | GPT-4o         | 1.009   | 1.027   | 1.878   | 1.407   | 5.321     |
| 🥉   | Qwen2.5-VL-32B | 1.077   | 0.928   | 1.977   | 1.290   | 5.272     |
| 4    | Gemini-2.5-Pro | 0.878   | 0.891   | 1.688   | 1.561   | 5.018     |
| 5    | Grok-4         | 0.982   | 0.846   | 1.801   | 1.389   | 5.017     |

Comparative Reasoning Tasks (Test Set)

| Rank | Model          | Comp. ↑ | Prec. ↑ | Cons. ↑ | Qual. ↑ | Overall ↑ |
|------|----------------|---------|---------|---------|---------|-----------|
| 🥇   | GPT-5          | 1.293   | 1.556   | 1.925   | 1.564   | 6.338     |
| 🥈   | GPT-4o         | 1.105   | 1.414   | 1.632   | 1.562   | 5.713     |
| 🥉   | Grok-4         | 1.150   | 1.233   | 1.820   | 1.459   | 5.662     |
| 4    | Gemini-2.5-Pro | 1.053   | 1.233   | 1.774   | 1.534   | 5.594     |
| 5    | InternVL3-8B   | 0.985   | 1.278   | 1.797   | 1.474   | 5.534     |

Scores for each dimension are on a 0-2 scale; the Overall column is the sum of the four dimensions (0-8).

📊 Analysis & Insights

Performance by Modality

  • Best Overall: CT and MRI imaging (higher contrast, clearer artifacts)
  • Most Challenging: Histopathology (subtle staining variations, texture complexity)
  • Modality Gap: 15-20% performance difference between easiest and hardest modalities

Error Analysis

  1. Perception Errors: Difficulty distinguishing mild vs severe degradations
  2. Reasoning Gaps: Incomplete description of quality factors
  3. Consistency Issues: Mismatch between observed artifacts and quality conclusion
  4. Medical Knowledge: Limited understanding of clinical significance

🤝 Contributing

We welcome contributions! Please see our Contributing Guidelines for details.

📄 Citation

If you use MedQ-Bench in your research, please cite our paper:

@misc{liu2025medqbenchevaluatingexploringmedical,
      title={MedQ-Bench: Evaluating and Exploring Medical Image Quality Assessment Abilities in MLLMs},
      author={Jiyao Liu and Jinjie Wei and Wanying Qu and Chenglong Ma and Junzhi Ning and Yunheng Li and Ying Chen and Xinzhe Luo and Pengcheng Chen and Xin Gao and Ming Hu and Huihui Xu and Xin Wang and Shujian Gao and Dingkang Yang and Zhongying Deng and Jin Ye and Lihao Liu and Junjun He and Ningsheng Xu},
      year={2025},
      eprint={2510.01691},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.01691},
}

📞 Contact

πŸ™ Acknowledgments

  • Q-Bench Team: For the foundational framework for visual quality assessment
  • VLMEvalKit: For the comprehensive evaluation infrastructure
  • Participating Radiologists: For contributing to the human evaluation and validation

📜 License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.


🌟 Star us on GitHub if MedQ-Bench helps your research! 🌟
