
Add support for DeepseekAI's DeepseekVL #36248

Draft · wants to merge 8 commits into base: main

Conversation

@geetu040 (Contributor) commented on Feb 18, 2025:

What does this PR do?

Fixes #36110

This PR adds DeepseekAI's DeepseekVL model to Hugging Face Transformers.

DeepseekVL is an open-source vision-language (VL) model designed for real-world vision and language understanding applications. It has general multimodal understanding capabilities and can process logical diagrams, web pages, formulas, scientific literature, natural images, and embodied-intelligence scenarios.
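For context, here is a minimal usage sketch of what the model could look like once the classes land; the checkpoint id, the auto classes, and the processor/generate call signatures below are assumptions for illustration only, not part of this PR.

import requests
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

# Assumed checkpoint id and auto classes; the final names may differ.
checkpoint = "deepseek-ai/deepseek-vl-1.3b-chat"
processor = AutoProcessor.from_pretrained(checkpoint)
model = AutoModelForVision2Seq.from_pretrained(checkpoint)

# A standard vision-language generation loop.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(images=image, text="Describe this image.", return_tensors="pt")
generated_ids = model.generate(**inputs, max_new_tokens=50)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])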

Relevant Links

CC: @Benjamin-eecs, @RERV (GitHub contributors of deepseek-ai/DeepSeek-VL)

Before submitting

Who can review?

@ArthurZucker, @Rocketknight1, @Cyrilvallez, @zucchini-nlp

TODOs

  • Documentation
  • Tests
  • Import Statements and Auto Modeling
  • Modeling
    • VisionModel
    • TextModel
    • AlignerModel
  • Weight Conversion Script
  • Tokenizer/Processor
  • Fix CI/CD tests

@geetu040 (Contributor, Author) commented:

@zucchini-nlp , @Rocketknight1, @Cyrilvallez

Deepseek-VL uses SAM as the backbone for encoding high-resolution images.
To be more specific, the backbone is SamVisionEncoder rather than SamModel, and SamVisionEncoder is not available as a public class. That is, you can do the following with SamModel but not with SamVisionEncoder:

from transformers import SamConfig, SamModel
config = SamConfig()
model = SamModel(config)

I think we should rename SamVisionEncoder -> SamVisionModel, make it inherit from SamPreTrainedModel, and expose it to the user. I don't think this breaks backward compatibility in any way.

Otherwise, we would have to copy all the classes that build SamVisionEncoder into the Deepseek model. There is nothing wrong with that either, but having a SamVisionModel alongside SamModel makes sense, since it might benefit someone else as well.

If you think having a SamVisionModel makes sense, should that be done in a separate PR?

Btw, the final result would look like this:

from transformers import SamVisionConfig, SamVisionModel
config = SamVisionConfig()
model = SamVisionModel(config)

and SamVisionConfig is already publicly available.
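For reference, a minimal sketch of how such a public wrapper could look, assuming the forward kwargs mirror what SamModel already passes to its vision encoder; the class body below is an illustration, not the actual implementation:

from transformers import SamVisionConfig
from transformers.models.sam.modeling_sam import SamPreTrainedModel, SamVisionEncoder

class SamVisionModel(SamPreTrainedModel):
    # Illustrative sketch only; the name and wiring here are assumptions.
    config_class = SamVisionConfig
    main_input_name = "pixel_values"

    def __init__(self, config: SamVisionConfig):
        super().__init__(config)
        # Reuse the existing (currently non-public) encoder as-is.
        self.vision_encoder = SamVisionEncoder(config)
        self.post_init()

    def forward(self, pixel_values, output_attentions=None, output_hidden_states=None, return_dict=None):
        # Delegate directly to the encoder, as SamModel does internally.
        return self.vision_encoder(
            pixel_values,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )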

@zucchini-nlp (Member) commented:

@geetu040 we had a similar situation with IdeficsVision, afair. Yes, in that case we can just make it public and add it to the docs. Renaming, though, would be breaking; imo we can leave the name as is.

@geetu040 (Contributor, Author) commented:

@zucchini-nlp is it okay to do it in the same PR, or should I create a new one?

@zucchini-nlp (Member) commented:

@geetu040 imo a new PR will make it easier for us to iterate and review

@geetu040 (Contributor, Author) commented:

Hi @zucchini-nlp, I am working on SamVisionEncoder (going to create the PR soon) and I have a quick question.
I realized that SamVisionAttention and SamVisionSdpaAttention produce attn_weights of different shapes when output_attentions=True.

Could you please answer these two questions:

  1. Is it allowed in transformers for the two attention implementations to produce outputs of different shapes?
  2. And suppose we make a change that alters the shape of output_attentions; does that break backward compatibility?

@zucchini-nlp (Member) commented:

@geetu040 no, they are not expected to have different shapes. Usually, using SDPA attention means that no attn_weights are returned, so they should be available only through the 'eager' attention modules.

I see that the weights are calculated on top of SDPA by a manual matmul of query and key, which imo defeats the purpose of using SDPA in the first place. Can you remove the returned attention weights and raise a warning similar to what is done in ViT?
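For reference, a minimal sketch of that ViT-style fallback, assuming SamVisionSdpaAttention subclasses SamVisionAttention and its forward takes (hidden_states, output_attentions); the SDPA computation itself is elided:

from transformers.models.sam.modeling_sam import SamVisionAttention
from transformers.utils import logging

logger = logging.get_logger(__name__)

class SamVisionSdpaAttention(SamVisionAttention):
    def forward(self, hidden_states, output_attentions=False):
        if output_attentions:
            # SDPA cannot return attention weights: warn once and fall back
            # to the eager implementation, as the ViT SDPA attention does.
            logger.warning_once(
                "`scaled_dot_product_attention` does not support `output_attentions=True`. "
                "Falling back to the eager attention implementation."
            )
            return super().forward(hidden_states, output_attentions=output_attentions)
        # ... existing SDPA path goes here, returning the attended hidden
        # states and `None` in place of attention weights ...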

@geetu040 (Contributor, Author) commented:

@zucchini-nlp sure, I'll do that.
