ncbi-nlp/MLLM4BioMed

Multimodal Large Language Models in Biomedicine and Healthcare. Summaries and Guidelines of multimodal LLMs deployed in biomedicine.
This is an actively updated list of practical guide resources for Biomedical Multimodal Large Language Models (BioMed-MLLMs). It's based on our survey paper:

Multimodal Large Language Models in Biomedicine

💬 Update News

[2025-11-11] We released the first version of MLLM4BioMed.

[2025-09-18] We released the repository for this survey, aiming to collect and organize recent developments in MLLMs for biomedicine.

📑 Table of Contents

- ⚙️ Methods of Biomedical MLLMs
- 🧪 Biomedical Datasets for MLLMs
- 💼 For-Profit Multimodal LLMs
- 🧾 Citation & Acknowledgement

⚙️ Methods of Biomedical MLLMs

📷 Image-LLMs

🩻 Radiology LLMs

| Model | Paper name | Conf / Journal | Date |
| --- | --- | --- | --- |
| GLoRIA | GLoRIA: A Multimodal Global-Local Representation Learning Framework for Label-efficient Medical Image Recognition | ICCV 2021 | 2021-10-10 |
| MedViLL | Multi-Modal Understanding and Generation for Medical Images and Text via Vision-Language Pre-Training | arXiv; JBHI | 2022-09-19 |
| RepsNet | RepsNet: Combining Vision with Language for Automated Medical Reports | arXiv; MICCAI 2022 | 2022-10-01 |
| ConVIRT | Contrastive Learning of Medical Visual Representations from Paired Images and Text | arXiv; MLCH 2022 | 2022-10-02 |
| MedCLIP | MedCLIP: Contrastive Learning from Unpaired Medical Images and Text | arXiv; EMNLP 2022 | 2022-12-07 |
| ViewXGen | Vision-Language Generative Model for View-Specific Chest X-ray Generation | arXiv; CHIL 2024 | 2023-02-23 |
| RAMM | Retrieval-augmented Biomedical Visual Question Answering with Multi-modal Pre-training | arXiv; MM 2023 | 2023-03-01 |
| X-REM | Multimodal Image-Text Matching Improves Retrieval-based Chest X-Ray Report Generation | arXiv; MIDL 2023 | 2023-03-29 |
| PubMedCLIP | How Much Does CLIP Benefit Visual Question Answering in the Medical Domain? | arXiv; EACL 2023 | 2023-05-01 |
| MUMC | Masked Vision and Language Pre-training with Unimodal and Multimodal Contrastive Losses for Medical Visual Question Answering | arXiv; MICCAI 2023 | 2023-10-08 |
| MAIRA-1 | A specialised large multimodal model for radiology report generation | arXiv; BioNLP-ACL 2024 | 2023-11-01 |
| Med-MLLM | A medical multimodal large language model for future pandemics | npj Digital Medicine | 2023-12-02 |
| CheXagent | A Vision-Language Foundation Model to Enhance Efficiency of Chest X-ray Interpretation | arXiv | 2024-01-22 |
| XrayGPT | Chest Radiographs Summarization using Medical Vision-Language Models | arXiv; BioNLP-ACL 2024 | 2024-08-16 |
| OaD | An Organ-aware Diagnosis Framework for Radiology Report Generation | TMI | 2024-07-01 |
| MCPL | Multi-Modal Collaborative Prompt Learning for Medical Vision-Language Model | TMI | 2024-06-24 |
| ClinicalBLIP | Vision-Language Model for Generating Textual Descriptions From Clinical Images | JMIR Form Res | 2024-08-02 |
| CT2Rep | Automated Radiology Report Generation for 3D Medical Imaging | arXiv; MICCAI 2024 | 2024-10-01 |
| CTChat | Developing Generalist Foundation Models from a Multimodal Dataset for 3D CT | arXiv | 2024-10-16 |
| Merlin | A Vision Language Foundation Model for 3D Computed Tomography | arXiv; Res Sq | 2025-06-28 |
| LLMSeg | LLM-driven multimodal target volume contouring in radiation oncology | arXiv; Nat. Commun. | 2024-10-24 |
| Flamingo-CXR | Collaboration between clinicians and vision–language models in radiology report generation | arXiv; Nat. Med. | 2024-11-07 |
| MAIRA-2 | Grounded Radiology Report Generation | arXiv | 2024-06-06 |
| MAIRA-Seg | Segmentation-Aware Multimodal Large Language Models | arXiv; MLHS 2024 | 2024-12-15 |
| CXR-LLaVA | A multimodal large language model for interpreting chest X-rays | arXiv; Eur. Radiol. | 2025-01-15 |
| RoentGen | A vision–language foundation model for the generation of realistic chest X-ray images | Nat. Biomed. Eng. | 2025-04-09 |
| RaDialog | A Large Vision-Language Model for Radiology Report Generation and Conversational Assistance | arXiv; MIDL 2025 | 2025-07-09 |
| RadFM | Towards Generalist Foundation Model for Radiology by Leveraging Web-scale 2D&3D Data | arXiv; Nat. Commun. | 2025-08-23 |

🧫 Pathology LLMs

| Model | Paper name | Conf / Journal | Date |
| --- | --- | --- | --- |
| TraP-VQA | Vision-language transformer for interpretable pathology visual question answering | JBHI 2022 | 2022-03-31 |
| K-PathVQA | K-PathVQA: Knowledge-Aware Multimodal Representation for Pathology Visual Question Answering | JBHI 2023 | 2023-07-11 |
| PLIP | A visual-language foundation model for pathology image analysis using medical Twitter | Nat. Med. | 2023-09-29 |
| PathAsst | PathAsst: A Generative Foundation AI Assistant Towards Artificial General Intelligence of Pathology | arXiv; AAAI 2024 | 2024-02-24 |
| CONCH | A visual-language foundation model for computational pathology | Nat. Med. | 2024-03-19 |
| Prov-GigaPath | A whole-slide foundation model for digital pathology from real-world data | Nat. | 2024-05-22 |
| PathChat | A multimodal generative AI copilot for human pathology | Nat. | 2024-06-12 |
| Quilt-LLaVA | Quilt-LLaVA: Visual Instruction Tuning by Extracting Localized Narratives from Open-Source Histopathology Videos | arXiv; CVPR 2024 | 2024-06-19 |
| ViLa-MIL | Dual-scale Vision-Language Multiple Instance Learning for Whole Slide Image Classification | arXiv; CVPR 2024 | 2024-06-19 |
| WsiCaption | WsiCaption: Multiple Instance Generation of Pathology Reports for Gigapixel Whole-Slide Images | arXiv; MICCAI 2024 | 2024-10-16 |
| WSI-VQA | WSI-VQA: Interpreting Whole Slide Images by Generative Visual Question Answering | arXiv; ECCV 2024 | 2024-10-25 |
| CHIEF | A pathology foundation model for cancer diagnosis and prognosis prediction | Nat. | 2024-09-04 |
| TITAN | Multimodal Whole Slide Foundation Model for Pathology | arXiv | 2024-11-29 |
| MUSK | A vision–language foundation model for precision oncology | Nat. | 2025-01-08 |
| PathologyVLM | PathologyVLM: a large vision-language model for pathology image understanding | Artif. Intell. Rev. | 2025-03-28 |
| CPath-Omni | CPath-Omni: A Unified Multimodal Foundation Model for Patch and Whole Slide Image Analysis in Computational Pathology | CVPR 2025 | 2025-06-11 |
| SlideChat | SlideChat: A Large Vision-Language Assistant for Whole-Slide Pathology Image Understanding | arXiv; CVPR 2025 | 2025-06-11 |
| PRISM | PRISM: A Multi-Modal Generative Foundation Model for Slide-Level Histopathology | arXiv | 2025-05-16 |
| PRISM2 | PRISM2: Unlocking Multi-Modal General Pathology AI with Clinical Dialogue | arXiv | 2025-06-16 |
| WSI-LLaVA | WSI-LLaVA: A Multimodal Large Language Model for Whole Slide Image | arXiv; ICCV 2025 | 2025-10-19 |
πŸ‘οΈ Ophthalmology LLMs
Model Paper name Conf / Journal Date
EyeDoctorA Role-specific Guided Large Language Model for Ophthalmic Consultation Based on Stylistic DifferentiationArxiv2024-06-24
OphGLMOphGLM: Training an Ophthalmology Large Language-and-Vision Assistant based on Instructions and DialogueArxiv; AI in Medicine 20242024-11-01
VisionFMDevelopment and Validation of a Multimodal Multitask Vision Foundation Model for Generalist Ophthalmic Artificial IntelligenceArxiv; NEJM AI 20242024-11-27
IOMIDSMultimodal machine learning enables AI chatbot to diagnose ophthalmic diseases and provide high-quality medical responsesnpj Digit. Med.2025-01-27
FLAIRA Foundation Language-Image Model of the Retina (FLAIR): encoding expert knowledge in text supervisionMed. Img. Anal.2025-01-01
LMODA Large Multimodal Ophthalmology Dataset and Benchmark for Vision-Language ModelsArxiv; NAACL 20252025-04-29
EyeCLIPA Multimodal Generalist Foundation Model for Ophthalmic ImagingArxiv; npj Digit. Med.2025-06-21
RetiZeroEnhancing diagnostic accuracy in rare and common fundus diseases with a knowledge-rich vision-language modelNat. Commun.2025-07-01
VisionUniteVisionUnite: A Vision-Language Foundation Model for Ophthalmology Enhanced with Clinical KnowledgeArxiv; TPAMI 20252025-08-13
EyeFMAn eyecare foundation model for clinical assistance: a randomized controlled trialNat. Med.2025-08-28

🩺 Endoscopy & Surgical LLMs

| Model | Paper name | Conf / Journal | Date |
| --- | --- | --- | --- |
| Surgical-VQA | Surgical-VQA: visual question answering in surgical scenes using transformer | MICCAI 2022 | 2022-09-17 |
| MIU-VL | Medical Image Understanding with Pretrained Vision Language Models: A Comprehensive Study | arXiv; ICLR 2023 | 2023-02-01 |
| CAT-ViL DeiT | CAT-ViL: Co-Attention Gated Vision-Language Embedding for Visual Question Localized-Answering in Robotic Surgery | arXiv; MICCAI 2023 | 2023-10-01 |
| SurgicalGPT | SurgicalGPT: End-to-end language-vision GPT for visual question answering in surgery | arXiv; MICCAI 2023 | 2023-10-01 |
| LLaVA-Surg | LLaVA-Surg: Towards Multimodal Surgical Assistant via Structured Surgical Video Learning | arXiv | 2024-08-15 |
| SurgRAW | SurgRAW: Multi-Agent Workflow with Chain-of-Thought Reasoning for Surgical Intelligence | arXiv | 2025-03-13 |
| SurgVidLM | SurgVidLM: Towards Multi-grained Surgical Video Understanding with Large Language Model | arXiv | 2025-06-22 |

🧴 Dermatology LLMs

| Model | Paper name | Conf / Journal | Date |
| --- | --- | --- | --- |
| MONET | Transparent medical image AI via an image–text foundation model grounded in medical literature | Nat. Med. | 2024-04-16 |
| SkinGPT-4 | Pre-trained multimodal large language model enhances dermatological diagnosis using SkinGPT-4 | Nat. Commun. | 2024-06-05 |
| PanDerm | A multimodal vision foundation model for clinical dermatology | Nat. Med. | 2025-06-06 |

🧠 Multidomain LLMs

| Model | Paper name | Conf / Journal | Date |
| --- | --- | --- | --- |
| TV-SAM | TV-SAM: Increasing Zero-Shot Segmentation Performance on Multimodal Medical Images Using GPT-4 Generated Descriptive Prompts Without Human Annotation | arXiv; BDMA 2024 | 2024-12-04 |
| MRI-PTPCa | An MRI–pathology foundation model for noninvasive diagnosis and grading of prostate cancer | Nat. Can. | 2025-09-02 |

🧬 Omics-LLMs

| Model | Paper name | Conf / Journal | Date |
| --- | --- | --- | --- |
| Precious3GPT | Multimodal Multi-Species Multi-Omics Multi-Tissue Transformer for Aging Research and Drug Discovery | bioRxiv | 2024-07-25 |
| GenePT | Simple and effective embedding model for single-cell biology built from ChatGPT | Nat. Biomed. Eng. | 2024-12-06 |
| scELMo | Embeddings from Language Models are Good Learners for Single-cell Data Analysis | bioRxiv | 2023-12-08 |
| LangCell | Language-Cell Pre-training for Cell Identity Understanding | arXiv | 2024-06-11 |
| CellWhisperer | Multimodal learning of transcriptomes and text enables interactive single-cell RNA-seq data exploration with natural-language chats | bioRxiv | 2024-10-18 |
| scMulan | A Multitask Generative Pre-trained Language Model for Single-Cell Analysis | bioRxiv; RECOMB 2024 | 2024-05-17 |
| scInterpreter | Training Large Language Models to Interpret scRNA-seq Data for Cell Type Annotation | arXiv | 2024-02-18 |
| Cell2Sentence | Teaching Large Language Models the Language of Biology | bioRxiv; ICML 2024 | 2024-05-01 |
| GPT-4 (for scRNA-seq) | Assessing GPT-4 for cell type annotation in single-cell RNA-seq analysis | bioRxiv; Nat. Methods Brief Comm. | 2024-03-25 |
| CELLama | Foundation Model for Single Cell and Spatial Transcriptomics by Cell Embedding Leveraging Language Model Abilities | bioRxiv | 2024-05-10 |

🌐 Generalist Models

| Model | Paper name | Conf / Journal | Date |
| --- | --- | --- | --- |
| LLaVA-Med | Training a Large Language-and-Vision Assistant for Biomedicine in One Day | arXiv; NeurIPS 2023 | 2023-09-25 |
| GPT-4v-med | GPT-4v | OpenAI Report | 2023-09-25 |
| Med-Flamingo | A Multimodal Medical Few-shot Learner | arXiv; ML4H 2023 | 2023-12-10 |
| Med-PaLM M | Towards Generalist Biomedical AI | arXiv; NEJM AI | 2024-02-22 |
| ChatCAD+ | Toward a Universal and Reliable Interactive CAD Using LLMs | arXiv; IEEE 2024 | 2024-05-08 |
| InternVL | Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | arXiv; CVPR 2024 | 2024-06-17 |
| MedVersa | A Generalist Foundation Model for Medical Image Interpretation | arXiv | 2024-05-13 |
| BiomedGPT | A Generalist Vision–Language Foundation Model for Diverse Biomedical Tasks | arXiv; Nat. Med. | 2024-08-07 |
| Dragonfly-Med | Multi-Resolution Zoom-In Encoding Enhances Vision-Language Models | arXiv; OpenReview 2024 | 2024-10-14 |
| BiomedCLIP | A Multimodal Biomedical Foundation Model Pretrained from Fifteen Million Scientific Image-Text Pairs | arXiv; NEJM AI | 2024-12-20 |
| Vision-BioLLM | Large Vision Language Model for Visual Dialogue in Biomedical Imagery | BSPC 2025 | 2025-01-03 |
| MedGemini | Capabilities of Gemini Models in Medicine | arXiv | 2024-04-29 |
| Lingshu | A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning | arXiv | 2025-06-08 |
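
Some of the models above, such as BiomedCLIP, are CLIP-style dual encoders with openly released weights. As a hedged illustration, the sketch below runs zero-shot image-text matching with BiomedCLIP through the `open_clip_torch` package; the Hugging Face hub identifier, the image path, and the candidate label prompts are placeholders that should be checked against the official model card.

```python
# Zero-shot image-text matching sketch with BiomedCLIP via open_clip_torch.
# Hub ID, image path, and prompts are illustrative placeholders.
import torch
from PIL import Image
from open_clip import create_model_from_pretrained, get_tokenizer

HUB_ID = "hf-hub:microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224"  # verify against the model card
model, preprocess = create_model_from_pretrained(HUB_ID)
tokenizer = get_tokenizer(HUB_ID)
model.eval()

image = preprocess(Image.open("example_scan.png")).unsqueeze(0)       # placeholder image
prompts = ["a chest X-ray", "a brain MRI", "a histopathology slide"]  # placeholder labels

with torch.no_grad():
    img_emb = model.encode_image(image)
    txt_emb = model.encode_text(tokenizer(prompts))
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    probs = (100.0 * img_emb @ txt_emb.T).softmax(dim=-1)  # fixed temperature for illustration

for label, p in zip(prompts, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```

Chat-style assistants in the table (e.g., LLaVA-Med) are instead used through instruction-following interfaces rather than this encoder-only matching pattern.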

🧪 Biomedical Datasets for MLLMs

🩻 Radiology Datasets

| Dataset | Modality | Scale (imgs) | Text (QA) | Conf / Journal | Date | Capability |
| --- | --- | --- | --- | --- | --- | --- |
| CT-RATE | Img, Text | 25.6k | 25.6k | arXiv | 2024-10-16 | classification, report |
| RadGPT | Img, Text | 2.7M | 1.8M | arXiv; ICCV 2025 | 2025-10-19 | segmentation, report |
| PadChest | Img, Text | 160k | 160k | arXiv; Med. Image Anal. | 2020-08-20 | classification, captioning |
| MIMIC-CXR | EHR, Img, Text | 377k | 227k | arXiv; Nat. Sci. Data | 2019-12-12 | classification, captioning |
| CANDID-PTX | Img, Mask, Text | 19k | 19k | Radiology AI | 2021-10-13 | classification, segmentation, report |
| CheXpert Plus | Img, Text | 223k | 223k | arXiv | 2024-05-29 | classification, captioning, report |
| PadChest-GR | Img, Bbox, Text | 4.5k | 10.4k | arXiv | 2024-11-07 | classification, report |
| ROCO | Img, Text | 87k | 87k | LABELS 2018 | 2018-10-17 | classification, captioning |
| ROCOv2 | Img, Text | 79k | 79k | arXiv; Nat. Sci. Data | 2024-06-26 | classification, captioning |
| ImageCLEF-Med | Img, Text | 2.8k | 6.4k | CLEF 2024 | 2024-09-19 | classification, VQA |
| GEMeX | Img, Bbox, Text | 151k | 1.6M | arXiv | 2024-11-25 | VQA |
| RadMD | Img, Text | 5M | – | arXiv; Nat. Commun. | 2025-08-23 | VQA, report generation |

🧫 Histopathology Datasets

| Dataset | Modality | Scale (imgs) | Text (QA) | Conf / Journal | Date | Capability |
| --- | --- | --- | --- | --- | --- | --- |
| TCGA | Img, Gene, Text | 44k* | – | – | – | classification, survival analysis |
| OpenPath | Img, Text | 208k | 208k | Nat. Med. | 2023-08-17 | classification, report |
| Quilt-1M | Img, Audio, Text | 802k | 802k | arXiv; NeurIPS 2023 | 2023-09-25 | classification, report |
| PathVQA | Img, Text | 4.9k | 32.7k | arXiv | 2020-05-07 | VQA |
| WSI-VQA | Img, Text | 977 | 8.6k | arXiv; ECCV 2024 | 2024-10-25 | VQA |
| QUILT-Instruct | Img, Text | 107k | 107k | CVPR 2024 | 2024-06-09 | VQA |
| WSI-Bench | Img, Text | 180k | 180k | ICCV 2025 | 2025-11-19 | VQA |
πŸ‘οΈ Ophthalmology Datasets
Dataset Modality Scale (imgs) Text (QA) Conf / Journal Date Capability
FFA-IRImg, Text1M10kNeurIPS 20212021-10-11report
FairVLMedImg, Text10k10karXiv; CVPR 20242024-09-16report
LMODImg, Text21.9k21.9karXiv; NAACL 2025TBDclassification, detection, segmentation
MM-RetinalImg, Text4.3k4.3karXiv; MICCAI 20242024-05-20report
OphthalVQAImg, Text60600medRxiv; BJO2024-09-20VQA

🔬 Endoscopy Datasets

| Dataset | Modality | Scale (imgs) | Text (QA) | Conf / Journal | Date | Capability |
| --- | --- | --- | --- | --- | --- | --- |
| EndoVis-18-VQLA | Video, Text | 2k | 12k | arXiv; ICRA 2023 | 2023-06-20 | VQA |

🧬 Omics Datasets

| Dataset | Modality | Scale | Text (QA) | Conf / Journal | Date | Capability |
| --- | --- | --- | --- | --- | --- | --- |
| Immune Tissue Dataset | scRNA-seq | 40k | – | Science | 2022-05-13 | Raw data for Cell2Sentence |
| CellxGene | scRNA-seq | 107.5M | – | bioRxiv | 2021-04-06 | Used by LangCell & CellWhisperer |
| HubMAP (Azimuth) | scRNA-seq | – | – | Nature | 2019-10-09 | Raw data for GPTcelltype |
| GTEx gene matrix | scRNA-seq | 209k | – | Science | 2022-05-13 | Used by GPTcelltype |
| Human Cell Landscape | scRNA-seq | 700k | – | Nature | 2020-03-25 | Used by GPTcelltype |
| Mouse Cell Atlas | scRNA-seq | 400k | – | Cell | 2018-02-22 | Used by GPTcelltype |
| B-cell Lymphoma | scRNA-seq | – | – | Cell Discovery | 2023-06-12 | Used by GPTcelltype |
| Colon Cancer | scRNA-seq | 63k | – | Nat. Genet. | 2020-05-25 | Used by GPTcelltype |
| Lung Cancer | scRNA-seq | 208k | – | Nat. Commun. | 2020-05-08 | Used by GPTcelltype |
| Tabula Sapiens | scRNA-seq | 500k | – | Science | 2022-05-13 | Used by GPTcelltype |
| NCBI Summary of Genes | scRNA-seq | 93k | – | Nat. Biomed. Eng. | 2024-12-06 | Used by GenePT |
| GEO Repository | scRNA-seq | – | – | NAR | 2012-11-26 | Used by CellWhisperer |
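
Several rows above serve mainly as text sources for embedding-based omics models (e.g., the NCBI gene summaries used by GenePT). The sketch below shows that general recipe in simplified form, not the published GenePT pipeline: embed each gene's summary with a general-purpose text-embedding model, then build a cell vector as an expression-weighted average of its genes' embeddings. The gene summaries, the embedding model name, and the weights are made-up placeholders.

```python
# Toy sketch: text-embedding-based gene and cell representations
# (simplified, GenePT-style idea; not the published pipeline).
import numpy as np
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

# Hypothetical snippets standing in for NCBI Gene summaries.
gene_summaries = {
    "CD3E": "CD3e molecule, part of the T-cell receptor/CD3 complex on T lymphocytes.",
    "MS4A1": "Membrane-spanning 4-domains A1, the B-lymphocyte surface antigen CD20.",
    "LYZ": "Lysozyme, an antimicrobial enzyme abundant in monocytes and macrophages.",
}

resp = client.embeddings.create(
    model="text-embedding-3-small",  # any text-embedding model could be substituted
    input=list(gene_summaries.values()),
)
gene_emb = {g: np.array(d.embedding) for g, d in zip(gene_summaries, resp.data)}

# A toy "cell" described by normalized expression weights over its top genes.
cell_top_genes = {"CD3E": 0.7, "LYZ": 0.2, "MS4A1": 0.1}
cell_vector = sum(w * gene_emb[g] for g, w in cell_top_genes.items())
print(cell_vector.shape)  # one dense vector per cell, usable for clustering or annotation
```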

🧠 Multimodal Datasets

| Dataset | Modality | Scale (imgs) | Text (QA) | Conf / Journal | Date | Capability |
| --- | --- | --- | --- | --- | --- | --- |
| MedICaT | Img, Text | 217k | 217k | arXiv; EMNLP 2020 | 2020-11-16 | classification, captioning |
| PMC-OA | Img, Text | 1.6M | 1.6M | – | – | classification, captioning |
| ChiMed-VL-Alignment | Img, Text | 580k | 580k | arXiv | 2023-11-01 | classification, captioning (Chinese) |
| MedTrinity-25M | Img, ROI, Text | 25M | 25M | arXiv; OpenReview | 2024-08-06 | classification, detection, segmentation, captioning, report |
| SLAKE | Img, Mask, BBox, Text | 642 | 14k | arXiv; ISBI 2021 | 2021-05-25 | segmentation, detection, VQA (English/Chinese) |
| PMC-VQA | Img, Text | 149k | 227k | arXiv | 2024-09-08 | VQA |
| OmniMedVQA | Img, Text | 118k | 127.9k | arXiv; CVPR 2024 | 2024-09-16 | VQA |
| PubMedVision | Img, Text | 914.9k | 1.3M | arXiv | 2024-09-30 | VQA |
| MedMD | Img, Text | 16M | – | arXiv; Nat. Commun. | 2025-08-23 | VQA, report generation |

💼 For-Profit Multimodal LLMs

🏢 Commercial Models

| Model | Parent | License | Input token limit | Output token limit | Release Date | Knowledge Cutoff | Modalities |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Claude 3 Haiku | Anthropic | Proprietary | 200K | 4096 | March 2024 | August 2023 | Text, Image |
| Claude 3 Opus | Anthropic | Proprietary | 200K | 4096 | March 2024 | August 2023 | Text, Image |
| Claude 3 Sonnet | Anthropic | Proprietary | 200K | 4096 | March 2024 | August 2023 | Text, Image |
| Gemini 1.0 Pro | Google | Proprietary | 32.8K | 8192 | December 2023 | – | Text, Image, Audio, Video |
| Gemini 1.0 Ultra | Google | Proprietary | 32.8K | 8192 | February 2024 | November 2023 | Text, Image, Audio, Video |
| Gemini 1.5 Flash (001) | Google | Proprietary | 1M | 8192 | May 2024 | November 2023 | Text, Image, Audio, Video |
| Gemini 1.5 Pro (001) | Google | Proprietary | 2M | 8192 | February 2024 | November 2023 | Text, Image, Audio, Video |
| Gemini 2.0 Flash | Google | Proprietary | 1M | 8192 | December 2024 | August 2024 | Text, Image, Audio, Video |
| Gemini 2.0 Pro | Google | Proprietary | 2M | 8192 | December 2024 | August 2024 | Text, Image, Audio, Video |
| GPT-4 | OpenAI | Proprietary | 8192 | 8192 | June 2023 | September 2021 | Text, Image |
| GPT-4o | OpenAI | Proprietary | 128K | 16.4K | August 2024 | October 2023 | Text, Image |
| GPT-o1 | OpenAI | Proprietary | 200K | 100K | December 2024 | October 2023 | Text, Image |
| Grok-2 | xAI | Proprietary | 128K | 8K | August 2024 | June 2023 | Text, Image |
| Grok-3 | xAI | Proprietary | 128K | 8K | February 2025 | – | Text, Image |
| Nova Lite | Amazon | Proprietary | 300K | 5K | December 2024 | – | Text, Image, Video |
| Nova Pro | Amazon | Proprietary | 300K | 5K | December 2024 | – | Text, Image, Video |
| Flamingo | DeepMind | Open | 2048 | – | April 2022 | – | Text, Image |
| PaLM-E | Google | Open | 8196 | 2014 | March 2023 | Mid 2021 | Text, Image |
| PaLM 2 | Google | Open | 8196 | 1024 | May 2023 | Mid 2021 | Text, Image |
| InternVL | Shanghai AI Laboratory | Open | – | – | June 2024 | – | Text, Image |
| InternVL2.5 | Shanghai AI Laboratory | Open | – | – | December 2024 | – | Text, Image |
| InternVL3 | Shanghai AI Laboratory | Open | – | – | April 2025 | – | Text, Image |
| InternVL3.5 | Shanghai AI Laboratory | Open | – | – | August 2025 | – | Text, Image |
| LLaVA | University of Wisconsin-Madison | Open | – | – | September 2023 | – | Text, Image |
| LLaMA 3.2 11B Vision | Meta | Open | 128K | 128K | September 2024 | December 2023 | Text, Image |
| LLaMA 3.2 90B Vision | Meta | Open | 128K | 128K | September 2024 | December 2023 | Text, Image |
| Phi-3.5-vision-instruct | Microsoft | Open | 128K | 128K | August 2024 | October 2023 | Text, Image |
| Pixtral Large | Mistral | Open | 128K | 128K | November 2024 | – | Text, Image |
| Pixtral-12B | Mistral | Open | 128K | 8K | September 2024 | – | Text, Image |
| QvQ-72B-Preview | Qwen | Open | 32.8K | 32.8K | December 2024 | – | Text, Image |
| Qwen2-VL-2B-Instruct | Qwen | Open | 32.8K | 32.8K | August 2024 | June 2023 | Text, Image |
| Qwen2-VL-7B-Instruct | Qwen | Open | 32.8K | 32.8K | August 2024 | June 2023 | Text, Image |
| Qwen2-VL-72B-Instruct | Qwen | Open | 32.8K | 32.8K | September 2024 | June 2023 | Text, Image |
| Qwen2.5-VL-3B-Instruct | Qwen | Open | 32.8K | 32.8K | February 2025 | June 2023 | Text, Image |
| Qwen2.5-VL-7B-Instruct | Qwen | Open | 32.8K | 32.8K | February 2025 | June 2023 | Text, Image |
| Qwen2.5-VL-72B-Instruct | Qwen | Open | 32.8K | 32.8K | February 2025 | June 2023 | Text, Image |
| Qwen2.5-VL-32B-Instruct | Qwen | Open | 32.8K | 32.8K | March 2025 | – | Text, Image |
| Qwen2.5-Omni-3B | Qwen | Open | 32.8K | – | March 2025 | – | Text, Image, Audio, Video |
| Qwen2.5-Omni-7B | Qwen | Open | 32.8K | – | March 2025 | – | Text, Image, Audio, Video |
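
The proprietary entries above are typically reached through vendor SDKs or REST APIs that accept interleaved text and images. The sketch below sends a base64-encoded image plus a question to a chat-completions endpoint using the OpenAI Python SDK; the model name, file path, and prompt are placeholders, and other vendors (Anthropic, Google, Amazon, etc.) use their own request formats.

```python
# Illustrative multimodal query via the OpenAI Python SDK (v1.x).
# Model name, image path, and prompt are placeholders; research use only, not medical advice.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("example_scan.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Describe any notable findings in this image. This is a research query, not a request for medical advice."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
    max_tokens=300,
)
print(response.choices[0].message.content)
```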

🧾 Citation & Acknowledgement

If you use this repository, please cite:

@article{gu2025biomedmllm,
  title={Multimodal Large Language Models in Biomedicine and Healthcare},
  author={Ran Gu and Benjamin Hou and Yin Fang and Lauren He and Qingqing Zhu and Zhiyong Lu},
  journal={},
  year={2025}
}

Maintained by BioNLP Group, Division of Intramural Research, National Library of Medicine, National Institutes of Health.

