This is an actively updated list of practical guide resources for Biomedical Multimodal Large Language Models (BioMed-MLLMs). It is based on our survey paper:
Multimodal Large Language Models in Biomedicine
[2025-11-11] We released the first version of MLLM4BioMed.
[2025-09-18] We released the repository for this survey, which aims to collect and organize updates on MLLMs in biomedicine.
- ⚙️ Methods of Biomedical MLLMs
- 🧪 Biomedical Datasets for MLLMs
- 💼 For-Profit Multimodal LLMs
- 🧾 Citation & Acknowledgement
📷 Image-LLMs
🩻 Radiology LLMs
| Model | Paper name | Conf / Journal | Date |
|---|---|---|---|
| GLoRIA | GLoRIA: A Multimodal Global-Local Representation Learning Framework for Label-efficient Medical Image Recognition | ICCV 2021 | 2021-10-10 |
| MedViLL | Multi-Modal Understanding and Generation for Medical Images and Text via Vision-Language Pre-Training | arXiv; JBHI | 2022-09-19 |
| RepsNet | RepsNet: Combining Vision with Language for Automated Medical Reports | arXiv; MICCAI 2022 | 2022-10-01 |
| ConVIRT | Contrastive Learning of Medical Visual Representations from Paired Images and Text | arXiv; MLHC 2022 | 2022-10-02 |
| MedCLIP | MedCLIP: Contrastive Learning from Unpaired Medical Images and Text | arXiv; EMNLP 2022 | 2022-12-07 |
| ViewXGen | Vision-Language Generative Model for View-Specific Chest X-ray Generation | arXiv; CHIL 2024 | 2023-02-23 |
| RAMM | Retrieval-augmented Biomedical Visual Question Answering with Multi-modal Pre-training | arXiv; MM 2023 | 2023-03-01 |
| X-REM | Multimodal Image-Text Matching Improves Retrieval-based Chest X-Ray Report Generation | arXiv; MIDL 2023 | 2023-03-29 |
| PubMedCLIP | How Much Does CLIP Benefit Visual Question Answering in the Medical Domain? | arXiv; EACL 2023 | 2023-05-01 |
| MUMC | Masked Vision and Language Pre-training with Unimodal and Multimodal Contrastive Losses for Medical Visual Question Answering | arXiv; MICCAI 2023 | 2023-10-08 |
| MAIRA-1 | A specialised large multimodal model for radiology report generation | arXiv; BioNLP-ACL 2024 | 2023-11-01 |
| Med-MLLM | A medical multimodal large language model for future pandemics | NPJ Digital Medicine | 2023-12-02 |
| CheXagent | A Vision-Language Foundation Model to Enhance Efficiency of Chest X-ray Interpretation | arXiv | 2024-01-22 |
| XrayGPT | Chest Radiographs Summarization using Medical Vision-Language Models | arXiv; BioNLP-ACL 2024 | 2024-08-16 |
| OaD | An Organ-aware Diagnosis Framework for Radiology Report Generation | TMI | 2024-07-01 |
| MCPL | Multi-Modal Collaborative Prompt Learning for Medical Vision-Language Model | TMI | 2024-06-24 |
| ClinicalBLIP | Vision-Language Model for Generating Textual Descriptions From Clinical Images | JMIR Form Res | 2024-08-02 |
| CT2Rep | Automated Radiology Report Generation for 3D Medical Imaging | arXiv; MICCAI 2024 | 2024-10-01 |
| CTChat | Developing Generalist Foundation Models from a Multimodal Dataset for 3D CT | arXiv | 2024-10-16 |
| Merlin | A Vision Language Foundation Model for 3D Computed Tomography | arXiv; Res Sq | 2025-06-28 |
| LLMSeg | LLM-driven multimodal target volume contouring in radiation oncology | arXiv; Nat. Commun. | 2024-10-24 |
| Flamingo-CXR | Collaboration between clinicians and vision–language models in radiology report generation | arXiv; Nat. Med. | 2024-11-07 |
| MAIRA-2 | Grounded Radiology Report Generation | arXiv | 2024-06-06 |
| MAIRA-Seg | Segmentation-Aware Multimodal Large Language Models | arXiv; ML4H 2024 | 2024-12-15 |
| CXR-LLaVA | A multimodal large language model for interpreting chest X-rays | arXiv; Eur. Radiol. | 2025-01-15 |
| RoentGen | A vision–language foundation model for the generation of realistic chest X-ray images | Nat. Biomed. Eng. | 2025-04-09 |
| RaDialog | A Large Vision-Language Model for Radiology Report Generation and Conversational Assistance | arXiv; MIDL 2025 | 2025-07-09 |
| RadFM | Towards Generalist Foundation Model for Radiology by Leveraging Web-scale 2D&3D Data | arXiv; Nat. Commun. | 2025-08-23 |
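Several models in the table above (ConVIRT, GLoRIA, MedCLIP) build on CLIP-style contrastive pretraining over paired images and reports. Below is a minimal NumPy sketch of the symmetric InfoNCE objective those methods share; the temperature value and batch shapes are illustrative, not taken from any of the papers.

```python
import numpy as np

def info_nce(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss for a batch of paired image/text embeddings.

    img_emb, txt_emb: (B, D) arrays where row i of each is a matched pair.
    """
    # L2-normalise so dot products become cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature        # (B, B); matched pairs on the diagonal
    idx = np.arange(len(logits))

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[idx, idx].mean()    # target class = own pair

    # Average the image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Matched pairs should yield a lower loss than mismatched ones, which is what drives the two encoders to align image and report embeddings in a shared space.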
🧫 Pathology LLMs
| Model | Paper name | Conf / Journal | Date |
|---|---|---|---|
| TraP-VQA | Vision-language transformer for interpretable pathology visual question answering | JBHI 2022 | 2022-03-31 |
| K-PathVQA | K-PathVQA: Knowledge-Aware Multimodal Representation for Pathology Visual Question Answering | JBHI 2023 | 2023-07-11 |
| PLIP | A visual-language foundation model for pathology image analysis using medical Twitter | Nat. Med. | 2023-09-29 |
| PathAsst | PathAsst: A Generative Foundation AI Assistant Towards Artificial General Intelligence of Pathology | arXiv; AAAI 2024 | 2024-02-24 |
| CONCH | A visual-language foundation model for computational pathology | Nat. Med. | 2024-03-19 |
| Prov-GigaPath | A whole-slide foundation model for digital pathology from real-world data | Nature | 2024-05-22 |
| PathChat | A multimodal generative AI copilot for human pathology | Nature | 2024-06-12 |
| Quilt-LLaVA | Quilt-LLaVA: Visual Instruction Tuning by Extracting Localized Narratives from Open-Source Histopathology Videos | arXiv; CVPR 2024 | 2024-06-19 |
| ViLa-MIL | Dual-scale Vision-Language Multiple Instance Learning for Whole Slide Image Classification | arXiv; CVPR 2024 | 2024-06-19 |
| WsiCaption | WsiCaption: Multiple Instance Generation of Pathology Reports for Gigapixel Whole-Slide Images | arXiv; MICCAI 2024 | 2024-10-16 |
| WSI-VQA | WSI-VQA: Interpreting Whole Slide Images by Generative Visual Question Answering | arXiv; ECCV 2024 | 2024-10-25 |
| CHIEF | A pathology foundation model for cancer diagnosis and prognosis prediction | Nature | 2024-09-04 |
| TITAN | Multimodal Whole Slide Foundation Model for Pathology | arXiv | 2024-11-29 |
| MUSK | A vision–language foundation model for precision oncology | Nature | 2025-01-08 |
| PathologyVLM | PathologyVLM: a large vision-language model for pathology image understanding | Artif. Intell. Rev. | 2025-03-28 |
| CPath-Omni | CPath-Omni: A Unified Multimodal Foundation Model for Patch and Whole Slide Image Analysis in Computational Pathology | CVPR 2025 | 2025-06-11 |
| SlideChat | SlideChat: A Large Vision-Language Assistant for Whole-Slide Pathology Image Understanding | arXiv; CVPR 2025 | 2025-06-11 |
| PRISM | PRISM: A Multi-Modal Generative Foundation Model for Slide-Level Histopathology | arXiv | 2025-05-16 |
| PRISM2 | PRISM2: Unlocking Multi-Modal General Pathology AI with Clinical Dialogue | arXiv | 2025-06-16 |
| WSI-LLaVA | WSI-LLaVA: A Multimodal Large Language Model for Whole Slide Image | arXiv; ICCV 2025 | 2025-10-19 |
👁️ Ophthalmology LLMs
| Model | Paper name | Conf / Journal | Date |
|---|---|---|---|
| EyeDoctor | A Role-specific Guided Large Language Model for Ophthalmic Consultation Based on Stylistic Differentiation | arXiv | 2024-06-24 |
| OphGLM | OphGLM: Training an Ophthalmology Large Language-and-Vision Assistant based on Instructions and Dialogue | arXiv; AI in Medicine 2024 | 2024-11-01 |
| VisionFM | Development and Validation of a Multimodal Multitask Vision Foundation Model for Generalist Ophthalmic Artificial Intelligence | arXiv; NEJM AI 2024 | 2024-11-27 |
| IOMIDS | Multimodal machine learning enables AI chatbot to diagnose ophthalmic diseases and provide high-quality medical responses | npj Digit. Med. | 2025-01-27 |
| FLAIR | A Foundation Language-Image Model of the Retina (FLAIR): encoding expert knowledge in text supervision | Med. Img. Anal. | 2025-01-01 |
| LMOD | A Large Multimodal Ophthalmology Dataset and Benchmark for Vision-Language Models | arXiv; NAACL 2025 | 2025-04-29 |
| EyeCLIP | A Multimodal Generalist Foundation Model for Ophthalmic Imaging | arXiv; npj Digit. Med. | 2025-06-21 |
| RetiZero | Enhancing diagnostic accuracy in rare and common fundus diseases with a knowledge-rich vision-language model | Nat. Commun. | 2025-07-01 |
| VisionUnite | VisionUnite: A Vision-Language Foundation Model for Ophthalmology Enhanced with Clinical Knowledge | arXiv; TPAMI 2025 | 2025-08-13 |
| EyeFM | An eyecare foundation model for clinical assistance: a randomized controlled trial | Nat. Med. | 2025-08-28 |
🩺 Endoscopy & Surgical LLMs
| Model | Paper name | Conf / Journal | Date |
|---|---|---|---|
| Surgical-VQA | Surgical-VQA: visual question answering in surgical scenes using transformer | MICCAI 2022 | 2022-09-17 |
| MIU-VL | Medical Image Understanding with Pretrained Vision Language Models: A Comprehensive Study | arXiv; ICLR 2023 | 2023-02-01 |
| CAT-ViL DeiT | CAT-ViL: Co-Attention Gated Vision-Language Embedding for Visual Question Localized-Answering in Robotic Surgery | arXiv; MICCAI 2023 | 2023-10-01 |
| SurgicalGPT | SurgicalGPT: End-to-end language-vision GPT for visual question answering in surgery | arXiv; MICCAI 2023 | 2023-10-01 |
| LLaVA-Surg | LLaVA-Surg: Towards Multimodal Surgical Assistant via Structured Surgical Video Learning | arXiv | 2024-08-15 |
| SurgRAW | SurgRAW: Multi-Agent Workflow with Chain-of-Thought Reasoning for Surgical Intelligence | arXiv | 2025-03-13 |
| SurgVidLM | SurgVidLM: Towards Multi-grained Surgical Video Understanding with Large Language Model | arXiv | 2025-06-22 |
🧴 Dermatology LLMs
| Model | Paper name | Conf / Journal | Date |
|---|---|---|---|
| MONET | Transparent medical image AI via an image–text foundation model grounded in medical literature | Nat. Med. | 2024-04-16 |
| SkinGPT-4 | Pre-trained multimodal large language model enhances dermatological diagnosis using SkinGPT-4 | Nat. Commun. | 2024-06-05 |
| PanDerm | A multimodal vision foundation model for clinical dermatology | Nat. Med. | 2025-06-06 |
🧠 Multidomain LLMs
| Model | Paper name | Conf / Journal | Date |
|---|---|---|---|
| TV-SAM | TV-SAM: Increasing Zero-Shot Segmentation Performance on Multimodal Medical Images Using GPT-4 Generated Descriptive Prompts Without Human Annotation | arXiv; BDMA 2024 | 2024-12-04 |
| MRI-PTPCa | An MRI–pathology foundation model for noninvasive diagnosis and grading of prostate cancer | Nat. Cancer | 2025-09-02 |
🧬 Omics-LLMs
| Model | Paper name | Conf / Journal | Date |
|---|---|---|---|
| Precious3GPT | Multimodal Multi-Species Multi-Omics Multi-Tissue Transformer for Aging Research and Drug Discovery | bioRxiv | 2024-07-25 |
| GenePT | Simple and effective embedding model for single-cell biology built from ChatGPT | Nat. Biomed. Eng. | 2024-12-06 |
| scELMo | Embeddings from Language Models are Good Learners for Single-cell Data Analysis | bioRxiv | 2023-12-08 |
| LangCell | Language-Cell Pre-training for Cell Identity Understanding | arXiv | 2024-06-11 |
| CellWhisperer | Multimodal learning of transcriptomes and text enables interactive single-cell RNA-seq data exploration with natural-language chats | bioRxiv | 2024-10-18 |
| scMulan | A Multitask Generative Pre-trained Language Model for Single-Cell Analysis | bioRxiv; RECOMB 2024 | 2024-05-17 |
| scInterpreter | Training Large Language Models to Interpret scRNA-seq Data for Cell Type Annotation | arXiv | 2024-02-18 |
| Cell2Sentence | Teaching Large Language Models the Language of Biology | bioRxiv; ICML 2024 | 2024-05-01 |
| GPT-4 (for scRNA-seq) | Assessing GPT-4 for cell type annotation in single-cell RNA-seq analysis | bioRxiv; Nat. Methods | 2024-03-25 |
| CELLama | Foundation Model for Single Cell and Spatial Transcriptomics by Cell Embedding Leveraging Language Model Abilities | bioRxiv | 2024-05-10 |
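Several of the models above (Cell2Sentence, GenePT, CELLama) let a language model consume transcriptomes by turning each cell's expression vector into text, typically by rank-ordering gene names. Below is a minimal sketch of that rank transformation; the gene names and the top_k cutoff are toy values for illustration, not from any specific paper.

```python
def cell_to_sentence(expression, gene_names, top_k=50):
    """Build a 'cell sentence': gene names sorted by descending expression,
    truncated to the top_k most-expressed genes. Ties keep original order."""
    order = sorted(range(len(expression)), key=lambda i: (-expression[i], i))
    return " ".join(gene_names[i] for i in order[:top_k])

# Toy 4-gene cell: CD3E is the most highly expressed gene
sentence = cell_to_sentence([0.0, 7.2, 1.5, 3.3],
                            ["ACTB", "CD3E", "GAPDH", "MS4A1"], top_k=3)
# -> "CD3E MS4A1 GAPDH"
```

The resulting strings are tokenized like ordinary text, which is what lets a pretrained language model be fine-tuned or prompted on single-cell data.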
🌐 Generalist Models
| Model | Paper name | Conf / Journal | Date |
|---|---|---|---|
| LLaVA-Med | Training a Large Language-and-Vision Assistant for Biomedicine in One Day | arXiv; NeurIPS 2023 | 2023-09-25 |
| GPT-4V-med | GPT-4V(ision) System Card | OpenAI Report | 2023-09-25 |
| Med-Flamingo | A Multimodal Medical Few-shot Learner | arXiv; ML4H 2023 | 2023-12-10 |
| Med-PaLM M | Towards Generalist Biomedical AI | arXiv; NEJM AI | 2024-02-22 |
| ChatCAD+ | Toward a Universal and Reliable Interactive CAD Using LLMs | arXiv; TMI 2024 | 2024-05-08 |
| InternVL | Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | arXiv; CVPR 2024 | 2024-06-17 |
| MedVersa | A Generalist Foundation Model for Medical Image Interpretation | arXiv | 2024-05-13 |
| BiomedGPT | A Generalist VisionβLanguage Foundation Model for Diverse Biomedical Tasks | arXiv; Nat. Med. | 2024-08-07 |
| Dragonfly-Med | Multi-Resolution Zoom-In Encoding Enhances Vision-Language Models | arXiv; OpenReview 2024 | 2024-10-14 |
| BiomedCLIP | A Multimodal Biomedical Foundation Model Pretrained from Fifteen Million Scientific Image-Text Pairs | arXiv; NEJM AI | 2024-12-20 |
| Vision-BioLLM | Large Vision Language Model for Visual Dialogue in Biomedical Imagery | BSPC 2025 | 2025-01-03 |
| MedGemini | Capabilities of Gemini Models in Medicine | arXiv | 2024-04-29 |
| Lingshu | A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning | arXiv | 2025-06-08 |
🩻 Radiology Datasets
| Dataset | Modality | Scale (imgs) | Text (QA) | Conf / Journal | Date | Capability |
|---|---|---|---|---|---|---|
| CT-RATE | Img, Text | 25.6k | 25.6k | arXiv | 2024-10-16 | classification, report |
| RadGPT | Img, Text | 2.7M | 1.8M | arXiv; ICCV 2025 | 2025-10-19 | segmentation, report |
| PadChest | Img, Text | 160k | 160k | arXiv; Med. Image Anal. | 2020-08-20 | classification, captioning |
| MIMIC-CXR | EHR, Img, Text | 377k | 227k | arXiv; Nat. Sci. Data | 2019-12-12 | classification, captioning |
| CANDID-PTX | Img, Mask, Text | 19k | 19k | Radiology AI | 2021-10-13 | classification, segmentation, report |
| CheXpert Plus | Img, Text | 223k | 223k | arXiv | 2024-05-29 | classification, captioning, report |
| PadChest-GR | Img, Bbox, Text | 4.5k | 10.4k | arXiv | 2024-11-07 | classification, report |
| ROCO | Img, Text | 87k | 87k | LABELS 2018 | 2018-10-17 | classification, captioning |
| ROCOv2 | Img, Text | 79k | 79k | arXiv; Nat. Sci. Data | 2024-06-26 | classification, captioning |
| ImageCLEF-Med | Img, Text | 2.8k | 6.4k | CLEF 2024 | 2024-09-19 | classification, VQA |
| GEMeX | Img, Bbox, Text | 151k | 1.6M | arXiv | 2024-11-25 | VQA |
| RadMD | Img, Text | 5M | — | arXiv; Nat. Commun. | 2025-08-23 | VQA, report generation |
🧫 Histopathology Datasets
| Dataset | Modality | Scale (imgs) | Text (QA) | Conf / Journal | Date | Capability |
|---|---|---|---|---|---|---|
| TCGA | Img, Gene, Text | 44k* | — | — | — | classification, survival analysis |
| OpenPath | Img, Text | 208k | 208k | Nat. Med. | 2023-08-17 | classification, report |
| Quilt-1M | Img, Audio, Text | 802k | 802k | arXiv; NeurIPS 2023 | 2023-09-25 | classification, report |
| PathVQA | Img, Text | 4.9k | 32.7k | arXiv | 2020-05-07 | VQA |
| WSI-VQA | Img, Text | 977 | 8.6k | arXiv; ECCV 2024 | 2024-10-25 | VQA |
| QUILT-Instruct | Img, Text | 107k | 107k | CVPR 2024 | 2024-06-09 | VQA |
| WSI-Bench | Img, Text | 180k | 180k | ICCV 2025 | 2025-11-19 | VQA |
👁️ Ophthalmology Datasets
| Dataset | Modality | Scale (imgs) | Text (QA) | Conf / Journal | Date | Capability |
|---|---|---|---|---|---|---|
| FFA-IR | Img, Text | 1M | 10k | NeurIPS 2021 | 2021-10-11 | report |
| FairVLMed | Img, Text | 10k | 10k | arXiv; CVPR 2024 | 2024-09-16 | report |
| LMOD | Img, Text | 21.9k | 21.9k | arXiv; NAACL 2025 | TBD | classification, detection, segmentation |
| MM-Retinal | Img, Text | 4.3k | 4.3k | arXiv; MICCAI 2024 | 2024-05-20 | report |
| OphthalVQA | Img, Text | 60 | 600 | medRxiv; BJO | 2024-09-20 | VQA |
🔬 Endoscopy Datasets
| Dataset | Modality | Scale (imgs) | Text (QA) | Conf / Journal | Date | Capability |
|---|---|---|---|---|---|---|
| EndoVis-18-VQLA | Video, Text | 2k | 12k | arXiv; ICRA 2023 | 2023-06-20 | VQA |
🧬 Omics Datasets
| Dataset | Modality | Scale (imgs) | Text (QA) | Conf / Journal | Date | Capability |
|---|---|---|---|---|---|---|
| Immune Tissue Dataset | scRNA-seq | 40k | — | Science | 2022-05-13 | Raw data for Cell2Sentence |
| CellxGene | scRNA-seq | 107.5M | — | bioRxiv | 2021-04-06 | Used by LangCell & CellWhisperer |
| HubMAP (Azimuth) | scRNA-seq | — | — | Nature | 2019-10-09 | Raw data for GPTcelltype |
| GTEx gene matrix | scRNA-seq | 209k | — | Science | 2022-05-13 | Used by GPTcelltype |
| Human Cell Landscape | scRNA-seq | 700k | — | Nature | 2020-03-25 | Used by GPTcelltype |
| Mouse Cell Atlas | scRNA-seq | 400k | — | Cell | 2018-02-22 | Used by GPTcelltype |
| B-cell Lymphoma | scRNA-seq | — | — | Cell Discovery | 2023-06-12 | Used by GPTcelltype |
| Colon Cancer | scRNA-seq | 63k | — | Nat. Genet. | 2020-05-25 | Used by GPTcelltype |
| Lung Cancer | scRNA-seq | 208k | — | Nat. Commun. | 2020-05-08 | Used by GPTcelltype |
| Tabula Sapiens | scRNA-seq | 500k | — | Science | 2022-05-13 | Used by GPTcelltype |
| NCBI Summary of Genes | scRNA-seq | 93k | — | Nat. Biomed. Eng. | 2024-12-06 | Used by GenePT |
| GEO Repository | scRNA-seq | — | — | NAR | 2012-11-26 | Used by CellWhisperer |
🧠 Multimodal Datasets
| Dataset | Modality | Scale (imgs) | Text (QA) | Conf / Journal | Date | Capability |
|---|---|---|---|---|---|---|
| MedICaT | Img, Text | 217k | 217k | arXiv; EMNLP 2020 | 2020-11-16 | classification, captioning |
| PMC-OA | Img, Text | 1.6M | 1.6M | — | — | classification, captioning |
| ChiMed-VL-Alignment | Img, Text | 580k | 580k | arXiv | 2023-11-01 | classification, captioning (Chinese) |
| MedTrinity-25M | Img, ROI, Text | 25M | 25M | arXiv; OpenReview | 2024-08-06 | classification, detection, segmentation, captioning, report |
| SLAKE | Img, Mask, BBox, Text | 642 | 14k | arXiv; ISBI 2021 | 2021-05-25 | segmentation, detection, VQA (English/Chinese) |
| PMC-VQA | Img, Text | 149k | 227k | arXiv | 2024-09-08 | VQA |
| OmniMedVQA | Img, Text | 118k | 127.9k | arXiv; CVPR 2024 | 2024-09-16 | VQA |
| PubMedVision | Img, Text | 914.9k | 1.3M | arXiv | 2024-09-30 | VQA |
| MedMD | Img, Text | 16M | — | arXiv; Nat. Commun. | 2025-08-23 | VQA, report generation |
🏢 Commercial Models
| Model | Parent | License | Input token limit | Output token limit | Release Date | Knowledge Cutoff | Modalities |
|---|---|---|---|---|---|---|---|
| Claude 3 Haiku | Anthropic | Proprietary | 200K | 4096 | March 2024 | August 2023 | Text, Image |
| Claude 3 Opus | Anthropic | Proprietary | 200K | 4096 | March 2024 | August 2023 | Text, Image |
| Claude 3 Sonnet | Anthropic | Proprietary | 200K | 4096 | March 2024 | August 2023 | Text, Image |
| Gemini 1.0 Pro | Google | Proprietary | 32.8K | 8192 | December 2023 | — | Text, Image, Audio, Video |
| Gemini 1.0 Ultra | Google | Proprietary | 32.8K | 8192 | February 2024 | November 2023 | Text, Image, Audio, Video |
| Gemini 1.5 Flash (001) | Google | Proprietary | 1M | 8192 | May 2024 | November 2023 | Text, Image, Audio, Video |
| Gemini 1.5 Pro (001) | Google | Proprietary | 2M | 8192 | February 2024 | November 2023 | Text, Image, Audio, Video |
| Gemini 2.0 Flash | Google | Proprietary | 1M | 8192 | December 2024 | August 2024 | Text, Image, Audio, Video |
| Gemini 2.0 Pro | Google | Proprietary | 2M | 8192 | December 2024 | August 2024 | Text, Image, Audio, Video |
| GPT-4 | OpenAI | Proprietary | 8192 | 8192 | June 2023 | September 2021 | Text, Image |
| GPT-4o | OpenAI | Proprietary | 128K | 16.4K | August 2024 | October 2023 | Text, Image |
| GPT-o1 | OpenAI | Proprietary | 200K | 100K | December 2024 | October 2023 | Text, Image |
| Grok-2 | xAI | Proprietary | 128K | 8K | August 2024 | June 2023 | Text, Image |
| Grok-3 | xAI | Proprietary | 128K | 8K | February 2025 | — | Text, Image |
| Nova Lite | Amazon | Proprietary | 300K | 5K | December 2024 | — | Text, Image, Video |
| Nova Pro | Amazon | Proprietary | 300K | 5K | December 2024 | — | Text, Image, Video |
| Flamingo | DeepMind | Open | 2048 | — | April 2022 | — | Text, Image |
| PaLM-E | Google | Open | 8192 | 2014 | March 2023 | Mid 2021 | Text, Image |
| PaLM 2 | Google | Open | 8192 | 1024 | May 2023 | Mid 2021 | Text, Image |
| InternVL | Shanghai AI Laboratory | Open | — | — | June 2024 | — | Text, Image |
| InternVL2.5 | Shanghai AI Laboratory | Open | — | — | December 2024 | — | Text, Image |
| InternVL3 | Shanghai AI Laboratory | Open | — | — | April 2025 | — | Text, Image |
| InternVL3.5 | Shanghai AI Laboratory | Open | — | — | August 2025 | — | Text, Image |
| LLaVA | University of Wisconsin-Madison | Open | — | — | September 2023 | — | Text, Image |
| LLaMA 3.2 11B Vision | Meta | Open | 128K | 128K | September 2024 | December 2023 | Text, Image |
| LLaMA 3.2 90B Vision | Meta | Open | 128K | 128K | September 2024 | December 2023 | Text, Image |
| Phi-3.5-vision-instruct | Microsoft | Open | 128K | 128K | August 2024 | October 2023 | Text, Image |
| Pixtral Large | Mistral | Open | 128K | 128K | November 2024 | — | Text, Image |
| Pixtral-12B | Mistral | Open | 128K | 8K | September 2024 | — | Text, Image |
| QvQ-72B-Preview | Qwen | Open | 32.8K | 32.8K | December 2024 | — | Text, Image |
| Qwen2-VL-2B-Instruct | Qwen | Open | 32.8K | 32.8K | August 2024 | June 2023 | Text, Image |
| Qwen2-VL-7B-Instruct | Qwen | Open | 32.8K | 32.8K | August 2024 | June 2023 | Text, Image |
| Qwen2-VL-72B-Instruct | Qwen | Open | 32.8K | 32.8K | September 2024 | June 2023 | Text, Image |
| Qwen2.5-VL-3B-Instruct | Qwen | Open | 32.8K | 32.8K | February 2025 | June 2023 | Text, Image |
| Qwen2.5-VL-7B-Instruct | Qwen | Open | 32.8K | 32.8K | February 2025 | June 2023 | Text, Image |
| Qwen2.5-VL-72B-Instruct | Qwen | Open | 32.8K | 32.8K | February 2025 | June 2023 | Text, Image |
| Qwen2.5-VL-32B-Instruct | Qwen | Open | 32.8K | 32.8K | March 2025 | — | Text, Image |
| Qwen2.5-Omni-3B | Qwen | Open | 32.8K | — | March 2025 | — | Text, Image, Audio, Video |
| Qwen2.5-Omni-7B | Qwen | Open | 32.8K | — | March 2025 | — | Text, Image, Audio, Video |
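Most of the proprietary models above are reached through chat-completion APIs that accept interleaved text and images. The sketch below builds the OpenAI-style request body pairing a prompt with an inline base64 image; the model name and prompt are placeholders, and other vendors in the table use similar but not identical schemas.

```python
import base64

def build_vision_request(model, prompt, image_bytes, image_mime="image/png"):
    """Assemble an OpenAI-style chat request that pairs a text prompt with an
    inline base64 image (the payload shape accepted by GPT-4o-class endpoints)."""
    data_url = f"data:{image_mime};base64," + base64.b64encode(image_bytes).decode()
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": data_url}},
            ],
        }],
    }
```

Note that inlining the image as a data URL keeps the request self-contained, but token limits in the table above still bound how much image and text context a single call can carry.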
If you use this repository, please cite:

```bibtex
@article{gu2025biomedmllm,
  title={Multimodal Large Language Models in Biomedicine and Healthcare},
  author={Gu, Ran and Hou, Benjamin and Fang, Yin and He, Lauren and Zhu, Qingqing and Lu, Zhiyong},
  journal={},
  year={2025}
}
```

Maintained by the BioNLP Group, Division of Intramural Research, National Library of Medicine, National Institutes of Health.

