[Model] VLM2Vec, the first multimodal embedding model in vLLM (vllm-project#9303)
MengqingCao · Oct 16, 2024 · 7abba39
1 parent 7e7eae3
commit 7abba39

Showing 16 changed files with 465 additions and 261 deletions.
diff --git a/docs/source/models/supported_models.rst b/docs/source/models/supported_models.rst
@@ -3,7 +3,7 @@
 Supported Models
 ================
 
-vLLM supports a variety of generative Transformer models in `HuggingFace Transformers <https://huggingface.co/models>`_.
+vLLM supports a variety of generative Transformer models in `HuggingFace (HF) Transformers <https://huggingface.co/models>`_.
 The following is the list of model architectures that are currently supported by vLLM.
 Alongside each architecture, we include some popular models that use it.
 
@@ -19,7 +19,7 @@ Text Generation
 
   * - Architecture
     - Models
-    - Example HuggingFace Models
+    - Example HF Models
     - :ref:`LoRA <lora>`
     - :ref:`PP <distributed_serving>`
   * - :code:`AquilaForCausalLM`
@@ -280,7 +280,7 @@ Text Embedding
 
   * - Architecture
     - Models
-    - Example HuggingFace Models
+    - Example HF Models
     - :ref:`LoRA <lora>`
     - :ref:`PP <distributed_serving>`
   * - :code:`Gemma2Model`
@@ -303,7 +303,7 @@ Reward Modeling
 
   * - Architecture
     - Models
-    - Example HuggingFace Models
+    - Example HF Models
     - :ref:`LoRA <lora>`
     - :ref:`PP <distributed_serving>`
   * - :code:`Qwen2ForRewardModel`
@@ -316,86 +316,93 @@ Reward Modeling
     As an interim measure, these models are supported via Embeddings API. See `this RFC <https://github.com/vllm-project/vllm/issues/8967>`_ for upcoming changes.
 
 Multimodal Language Models
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The following modalities are supported depending on the model:
+
+- **T**\ ext
+- **I**\ mage
+- **V**\ ideo
+- **A**\ udio
 
 .. _supported_vlms:
 
 Text Generation
 ---------------
 
 .. list-table::
-  :widths: 25 25 25 25 5 5
+  :widths: 25 25 15 25 5 5
   :header-rows: 1
 
   * - Architecture
     - Models
-    - Modalities
-    - Example HuggingFace Models
+    - Inputs
+    - Example HF Models
     - :ref:`LoRA <lora>`
     - :ref:`PP <distributed_serving>`
   * - :code:`Blip2ForConditionalGeneration`
     - BLIP-2
-    - Image\ :sup:`E`
+    - T + I\ :sup:`E`
     - :code:`Salesforce/blip2-opt-2.7b`, :code:`Salesforce/blip2-opt-6.7b`, etc.
     -
     - ✅︎
   * - :code:`ChameleonForConditionalGeneration`
     - Chameleon
-    - Image
+    - T + I
     - :code:`facebook/chameleon-7b` etc.
     - 
     - ✅︎
   * - :code:`FuyuForCausalLM`
     - Fuyu
-    - Image
+    - T + I
     - :code:`adept/fuyu-8b` etc.
     - 
     - ✅︎
   * - :code:`ChatGLMModel`
     - GLM-4V
-    - Image
+    - T + I
     - :code:`THUDM/glm-4v-9b` etc.
     - 
     - ✅︎
   * - :code:`InternVLChatModel`
     - InternVL2
-    - Image\ :sup:`E+`
+    - T + I\ :sup:`E+`
     - :code:`OpenGVLab/InternVL2-4B`, :code:`OpenGVLab/InternVL2-8B`, etc.
     - 
     - ✅︎
   * - :code:`LlavaForConditionalGeneration`
     - LLaVA-1.5
-    - Image\ :sup:`E+`
+    - T + I\ :sup:`E+`
     - :code:`llava-hf/llava-1.5-7b-hf`, :code:`llava-hf/llava-1.5-13b-hf`, etc.
     -
     - ✅︎
   * - :code:`LlavaNextForConditionalGeneration`
     - LLaVA-NeXT
-    - Image\ :sup:`E+`
+    - T + I\ :sup:`E+`
     - :code:`llava-hf/llava-v1.6-mistral-7b-hf`, :code:`llava-hf/llava-v1.6-vicuna-7b-hf`, etc.
     -
     - ✅︎
   * - :code:`LlavaNextVideoForConditionalGeneration`
     - LLaVA-NeXT-Video
-    - Video
+    - T + V
     - :code:`llava-hf/LLaVA-NeXT-Video-7B-hf`, etc.
     -
     - ✅︎
   * - :code:`LlavaOnevisionForConditionalGeneration`
     - LLaVA-Onevision
-    - Image\ :sup:`+` / Video
+    - T + I\ :sup:`+` + V
     - :code:`llava-hf/llava-onevision-qwen2-7b-ov-hf`, :code:`llava-hf/llava-onevision-qwen2-0.5b-ov-hf`, etc.
     -
     - ✅︎
   * - :code:`MiniCPMV`
     - MiniCPM-V
-    - Image\ :sup:`E+`
+    - T + I\ :sup:`E+`
     - :code:`openbmb/MiniCPM-V-2` (see note), :code:`openbmb/MiniCPM-Llama3-V-2_5`, :code:`openbmb/MiniCPM-V-2_6`, etc.
     - ✅︎
     - ✅︎
   * - :code:`MllamaForConditionalGeneration`
     - Llama 3.2
-    - Image
+    - T + I
     - :code:`meta-llama/Llama-3.2-90B-Vision-Instruct`, :code:`meta-llama/Llama-3.2-11B-Vision`, etc.
     -
     -
@@ -407,43 +414,43 @@ Text Generation
     - ✅︎
   * - :code:`NVLM_D_Model`
     - NVLM-D 1.0
-    - Image\ :sup:`E+`
+    - T + I\ :sup:`E+`
     - :code:`nvidia/NVLM-D-72B`, etc.
     - 
     - ✅︎
   * - :code:`PaliGemmaForConditionalGeneration`
     - PaliGemma
-    - Image\ :sup:`E`
+    - T + I\ :sup:`E`
     - :code:`google/paligemma-3b-pt-224`, :code:`google/paligemma-3b-mix-224`, etc.
     - 
     - ✅︎
   * - :code:`Phi3VForCausalLM`
     - Phi-3-Vision, Phi-3.5-Vision
-    - Image\ :sup:`E+`
+    - T + I\ :sup:`E+`
     - :code:`microsoft/Phi-3-vision-128k-instruct`, :code:`microsoft/Phi-3.5-vision-instruct` etc.
     -
     - ✅︎
   * - :code:`PixtralForConditionalGeneration`
     - Pixtral
-    - Image\ :sup:`+`
+    - T + I\ :sup:`+`
     - :code:`mistralai/Pixtral-12B-2409`
     -
     - ✅︎
   * - :code:`QWenLMHeadModel`
     - Qwen-VL
-    - Image\ :sup:`E+`
+    - T + I\ :sup:`E+`
     - :code:`Qwen/Qwen-VL`, :code:`Qwen/Qwen-VL-Chat`, etc.
     -
     - ✅︎
   * - :code:`Qwen2VLForConditionalGeneration`
     - Qwen2-VL
-    - Image\ :sup:`E+` / Video\ :sup:`+`
+    - T + I\ :sup:`E+` + V\ :sup:`+`
     - :code:`Qwen/Qwen2-VL-2B-Instruct`, :code:`Qwen/Qwen2-VL-7B-Instruct`, :code:`Qwen/Qwen2-VL-72B-Instruct`, etc.
     -
     - ✅︎
   * - :code:`UltravoxModel`
     - Ultravox
-    - Audio\ :sup:`E+`
+    - T + A\ :sup:`E+`
     - :code:`fixie-ai/ultravox-v0_3`
     -
     - ✅︎
@@ -455,6 +462,26 @@ Text Generation
   For :code:`openbmb/MiniCPM-V-2`, the official repo doesn't work yet, so we need to use a fork (:code:`HwwwH/MiniCPM-V-2`) for now.
   For more details, please see: https://github.com/vllm-project/vllm/pull/4087#issuecomment-2250397630
 
+Multimodal Embedding
+--------------------
+
+.. list-table::
+  :widths: 25 25 15 25 5 5
+  :header-rows: 1
+
+  * - Architecture
+    - Models
+    - Inputs
+    - Example HF Models
+    - :ref:`LoRA <lora>`
+    - :ref:`PP <distributed_serving>`
+  * - :code:`Phi3VForCausalLM`
+    - Phi-3-Vision-based
+    - T + I
+    - :code:`TIGER-Lab/VLM2Vec-Full`
+    - 🚧
+    - ✅︎
+
 ----
 
 If your model uses one of the above model architectures, you can seamlessly run your model with vLLM.

diff --git a/examples/offline_inference_vision_language_embedding.py b/examples/offline_inference_vision_language_embedding.py
@@ -0,0 +1,21 @@
+from vllm import LLM
+from vllm.assets.image import ImageAsset
+
+image = ImageAsset("cherry_blossom").pil_image.convert("RGB")
+prompt = "<|image_1|> Represent the given image with the following question: What is in the image"  # noqa: E501
+
+# Create an LLM.
+llm = LLM(
+    model="TIGER-Lab/VLM2Vec-Full",
+    trust_remote_code=True,
+    max_model_len=4096,
+    max_num_seqs=2,
+    mm_processor_kwargs={"num_crops": 16},
+)
+
+# Generate embedding. The output is a list of EmbeddingRequestOutputs.
+outputs = llm.encode({"prompt": prompt, "multi_modal_data": {"image": image}})
+
+# Print the outputs.
+for output in outputs:
+    print(output.outputs.embedding)  # list of 3072 floats
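
The example above prints each embedding as a plain list of floats. A common next step is to compare two such embeddings, for instance an image-plus-text query against a candidate. The following is a minimal sketch and is not part of this commit: `cosine_similarity` is a hypothetical helper, and the short vectors are toy stand-ins for the 3072-float lists returned in `output.outputs.embedding`. It does not assume the embeddings are already normalized.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors.

    Hypothetical helper for illustration; not part of vLLM.
    """
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy stand-ins (assumption) for two `output.outputs.embedding` lists.
query = [1.0, 0.0, 1.0]
candidate = [1.0, 1.0, 0.0]

score = cosine_similarity(query, candidate)
print(score)
```

With real outputs, the two arguments would be the `embedding` lists taken from two separate `llm.encode(...)` results.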