OpenGVLab


75 repositories

Mono-InternVL
Public
[CVPR 2025] Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training
Python • MIT License • 0 • 9 • 0 • 0 • Updated Mar 12, 2025
EgoVideo
Public
[CVPR 2024 Champions][ICLR 2025] Solutions for EgoVis Challenges in CVPR 2024
Jupyter Notebook • 3 • 124 • 8 • 0 • Updated Mar 12, 2025
STM-Evaluation
Public
Python • MIT License • 6 • 70 • 1 • 0 • Updated Mar 10, 2025
VideoChat-Flash
Public
VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling
Python • MIT License • 8 • 350 • 7 • 0 • Updated Mar 9, 2025
InternImage
Public
[CVPR 2023 Highlight] InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions
backbone semantic-segmentation deformable-convolution foundation-model object-detection
Python • MIT License • 244 • 2.6k • 180 • 1 • Updated Mar 4, 2025
InternVideo
Public
[ECCV2024] Video Foundation Models & Data for Multimodal Understanding
benchmark action-recognition video-understanding video-data self-supervised multimodal video-dataset open-set-recognition video-retrieval video-question-answering
Python • Apache License 2.0 • 103 • 1.7k • 115 • 3 • Updated Feb 27, 2025
VisionLLM
Public
VisionLLM Series
object-detection large-language-models generalist-model
Python • Apache License 2.0 • 41 • 1k • 15 • 0 • Updated Feb 27, 2025
PVC
Public
[CVPR 2025] PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models
Python • MIT License • 0 • 34 • 2 • 0 • Updated Feb 27, 2025
InternVL
Public
[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal dialogue model approaching GPT-4o performance.
image-classification gpt multi-modal semantic-segmentation video-classification image-text-retrieval llm vision-language-model gpt-4v vit-6b
Python • MIT License • 559 • 7.2k • 167 • 3 • Updated Feb 26, 2025
Vision-RWKV
Public
[ICLR 2025 Spotlight] Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures
Python • Apache License 2.0 • 17 • 423 • 24 • 0 • Updated Feb 18, 2025
TimeSuite
Public
[ICLR 2025] TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning
temporal-grounding long-video-understanding
Python • MIT License • 1 • 21 • 2 • 0 • Updated Feb 12, 2025
LCL
Public
[NeurIPS 2024] Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning
Python • MIT License • 4 • 68 • 4 • 0 • Updated Feb 11, 2025
Ask-Anything
Public
[CVPR2024 Highlight][VideoChatGPT] ChatGPT with video understanding! And many more supported LMs such as miniGPT4, StableLM, and MOSS.
chat video gradio big-model video-understanding captioning-videos video-question-answering foundation-models large-model large-language-models
Python • MIT License • 260 • 3.2k • 69 • 5 • Updated Jan 18, 2025
PIIP
Public
[NeurIPS 2024 Spotlight ⭐️] Parameter-Inverted Image Pyramid Networks (PIIP)
computer-vision image-classification object-detection semantic-segmentation instance-segmentation vision-transformer multimodal-large-language-models vision-language-models
Python • MIT License • 2 • 85 • 2 • 0 • Updated Jan 15, 2025
vinci
Public
Vinci: A Real-time Embodied Smart Assistant based on Egocentric Vision-Language Model
Python • 2 • 48 • 2 • 1 • Updated Jan 13, 2025
TPO
Public
Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment
Python • 2 • 42 • 1 • 0 • Updated Jan 2, 2025
V2PE
Public
[ArXiv] V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding
Python • MIT License • 1 • 30 • 0 • 0 • Updated Dec 13, 2024
VLMEvalKit_InternVL2_5
Public
Open-source evaluation toolkit for large vision-language models (LVLMs), supporting 160+ VLMs and 50+ benchmarks
Python • Apache License 2.0 • 294 • 0 • 0 • 0 • Updated Dec 9, 2024
Hulk
Public
An official implementation of "Hulk: A Universal Knowledge Translator for Human-Centric Tasks"
Python • MIT License • 5 • 121 • 14 • 0 • Updated Dec 4, 2024
MM-NIAH
Public
[NeurIPS 2024] Needle In A Multimodal Haystack (MM-NIAH): A comprehensive benchmark designed to systematically evaluate the capability of existing MLLMs to comprehend long multimodal documents.
benchmark long-context vision-language-model multimodal-large-language-models
Python • 6 • 114 • 0 • 0 • Updated Nov 25, 2024
OmniCorpus
Public
[ICLR 2025 Spotlight] OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text
Python • 7 • 319 • 0 • 0 • Updated Nov 17, 2024
GUI-Odyssey
Public
GUI Odyssey is a comprehensive dataset for training and evaluating cross-app navigation agents. GUI Odyssey consists of 7,735 episodes from 6 mobile devices, spanning 6 types of cross-app tasks, 201 apps, and 1.4K app combos.
Python • 4 • 93 • 3 • 0 • Updated Nov 12, 2024
.github
Public
1 • 0 • 0 • 0 • Updated Oct 30, 2024
OV-OAD
Public
This repo takes the initial step towards leveraging text learning for online action detection without explicit human supervision.
1 • 1 • 0 • 0 • Updated Oct 28, 2024
InternVL-MMDetSeg
Public
Train InternViT-6B in MMSegmentation and MMDetection with DeepSpeed
object-detection semantic-segmentation vision-foundation
Jupyter Notebook • 6 • 80 • 1 • 0 • Updated Oct 25, 2024
PhyGenBench
Public
Code and data for the paper "Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation"
Python • 1 • 90 • 3 • 0 • Updated Oct 25, 2024
VideoMAEv2
Public
[CVPR 2023] VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking
video-understanding action-detection self-supervised-learning temporal-action-detection foundation-model cvpr2023 action-recognition
Python • MIT License • 68 • 593 • 17 • 0 • Updated Oct 8, 2024
EfficientQAT
Public
EfficientQAT: Efficient Quantization-Aware Training for Large Language Models
Python • 19 • 251 • 6 • 0 • Updated Oct 8, 2024
OmniQuant
Public
[ICLR2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.
quantization large-language-models llm
Python • MIT License • 59 • 781 • 25 • 1 • Updated Oct 8, 2024
MMIU
Public
[ICLR2025] MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models
Python • 2 • 64 • 3 • 0 • Updated Sep 14, 2024