OpenGVLab

74 repositories

InternVideo
Public
[ECCV2024] Video Foundation Models & Data for Multimodal Understanding
benchmark action-recognition video-understanding video-data self-supervised multimodal video-dataset open-set-recognition video-retrieval video-question-answering
Python • Apache License 2.0 • 101 • 1.7k • 113 • 4 • Updated Feb 27, 2025
VideoChat-Flash
Public
VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling
Python • MIT License • 6 • 322 • 4 • 0 • Updated Feb 27, 2025
VisionLLM
Public
VisionLLM Series
object-detection large-language-models generalist-model
Python • Apache License 2.0 • 40 • 1k • 15 • 0 • Updated Feb 27, 2025
PVC
Public
[CVPR 2025] PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models
Python • MIT License • 0 • 30 • 2 • 0 • Updated Feb 27, 2025
InternVL
Public
[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal chat model approaching GPT-4o performance.
image-classification gpt multi-modal semantic-segmentation video-classification image-text-retrieval llm vision-language-model gpt-4v vit-6b
Python • MIT License • 544 • 7.1k • 166 • 3 • Updated Feb 26, 2025
InternImage
Public
[CVPR 2023 Highlight] InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions
backbone semantic-segmentation deformable-convolution foundation-model object-detection
Python • MIT License • 243 • 2.6k • 180 • 5 • Updated Feb 25, 2025
Vision-RWKV
Public
[ICLR 2025 Spotlight] Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures
Python • Apache License 2.0 • 17 • 409 • 24 • 0 • Updated Feb 18, 2025
TimeSuite
Public
[ICLR 2025] TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning
temporal-grounding long-video-understanding
Python • MIT License • 1 • 21 • 1 • 0 • Updated Feb 12, 2025
LCL
Public
[NeurIPS 2024] Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning
Python • MIT License • 4 • 68 • 4 • 0 • Updated Feb 11, 2025
Ask-Anything
Public
[CVPR2024 Highlight][VideoChatGPT] ChatGPT with video understanding! And many more supported LMs such as miniGPT4, StableLM, and MOSS.
chat video gradio big-model video-understanding captioning-videos video-question-answering foundation-models large-model large-language-models
Python • MIT License • 259 • 3.2k • 69 • 5 • Updated Jan 18, 2025
STM-Evaluation
Public
Python • MIT License • 6 • 70 • 1 • 0 • Updated Jan 18, 2025
PIIP
Public
[NeurIPS 2024 Spotlight ⭐️] Parameter-Inverted Image Pyramid Networks (PIIP)
computer-vision image-classification object-detection semantic-segmentation instance-segmentation vision-transformer multimodal-large-language-models vision-language-models
Python • MIT License • 2 • 85 • 0 • 0 • Updated Jan 15, 2025
vinci
Public
Vinci: A Real-time Embodied Smart Assistant based on Egocentric Vision-Language Model
Python • 2 • 43 • 2 • 1 • Updated Jan 13, 2025
TPO
Public
Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment
Python • 2 • 41 • 1 • 0 • Updated Jan 2, 2025
V2PE
Public
[ArXiv] V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding
Python • MIT License • 1 • 30 • 0 • 0 • Updated Dec 13, 2024
VLMEvalKit_InternVL2_5
Public
Open-source evaluation toolkit for large vision-language models (LVLMs), supporting 160+ VLMs and 50+ benchmarks
Python • Apache License 2.0 • 280 • 0 • 0 • 0 • Updated Dec 9, 2024
Hulk
Public
An official implementation of "Hulk: A Universal Knowledge Translator for Human-Centric Tasks"
Python • MIT License • 4 • 119 • 13 • 0 • Updated Dec 4, 2024
MM-NIAH
Public
[NeurIPS 2024] Needle In A Multimodal Haystack (MM-NIAH): A comprehensive benchmark designed to systematically evaluate the capability of existing MLLMs to comprehend long multimodal documents.
benchmark long-context vision-language-model multimodal-large-language-models
Python • 6 • 112 • 1 • 0 • Updated Nov 25, 2024
OmniCorpus
Public
OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text
Python • 7 • 311 • 0 • 0 • Updated Nov 17, 2024
GUI-Odyssey
Public
GUI Odyssey is a comprehensive dataset for training and evaluating cross-app navigation agents. GUI Odyssey consists of 7,735 episodes from 6 mobile devices, spanning 6 types of cross-app tasks, 201 apps, and 1.4K app combos.
Python • 4 • 89 • 2 • 0 • Updated Nov 12, 2024
.github
Public
1 • 0 • 0 • 0 • Updated Oct 30, 2024
OV-OAD
Public
This repo takes the initial step towards leveraging text learning for online action detection without explicit human supervision.
1 • 1 • 0 • 0 • Updated Oct 28, 2024
InternVL-MMDetSeg
Public
Train InternViT-6B in MMSegmentation and MMDetection with DeepSpeed
object-detection semantic-segmentation vision-foundation
Jupyter Notebook • 6 • 79 • 1 • 0 • Updated Oct 25, 2024
PhyGenBench
Public
Code and data for the paper "Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation"
Python • 1 • 86 • 3 • 0 • Updated Oct 25, 2024
VideoMAEv2
Public
[CVPR 2023] VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking
video-understanding action-detection self-supervised-learning temporal-action-detection foundation-model cvpr2023 action-recognition
Python • MIT License • 68 • 590 • 17 • 0 • Updated Oct 8, 2024
EfficientQAT
Public
EfficientQAT: Efficient Quantization-Aware Training for Large Language Models
Python • 19 • 250 • 6 • 0 • Updated Oct 8, 2024
OmniQuant
Public
[ICLR2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.
quantization large-language-models llm
Python • MIT License • 60 • 773 • 25 • 1 • Updated Oct 8, 2024
MMIU
Public
[ICLR2025] MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models
Python • 2 • 62 • 3 • 0 • Updated Sep 14, 2024
ChartAst
Public
[ACL 2024] ChartAssistant is a chart-based vision-language model for universal chart comprehension and reasoning.
Python • Other • 9 • 112 • 7 • 0 • Updated Sep 7, 2024
EgoExoLearn
Public
[CVPR 2024] Data and benchmark code for the EgoExoLearn dataset
Python • MIT License • 0 • 54 • 2 • 0 • Updated Sep 3, 2024