
Add support for DeepseekAI's DeepseekVL #36248

Draft · wants to merge 8 commits into base: main

Conversation

@geetu040 (Contributor) commented on Feb 18, 2025:

What does this PR do?

Fixes #36110

This PR adds DeepseekAI's DeepseekVL model to Hugging Face Transformers.

DeepseekVL is an open-source vision-language (VL) model designed for real-world vision and language understanding applications. It has general multimodal understanding capabilities and can process logical diagrams, web pages, formulas, scientific literature, natural images, and embodied-intelligence scenarios.
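For context, here is a minimal usage sketch of what the model could look like once the classes land; the checkpoint id, the auto classes, and the processor/generate call signatures below are assumptions for illustration only, not part of this PR.

import requests
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

# Assumed checkpoint id and auto classes; the final names may differ.
checkpoint = "deepseek-ai/deepseek-vl-1.3b-chat"
processor = AutoProcessor.from_pretrained(checkpoint)
model = AutoModelForVision2Seq.from_pretrained(checkpoint)

# A standard vision-language generation loop.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(images=image, text="Describe this image.", return_tensors="pt")
generated_ids = model.generate(**inputs, max_new_tokens=50)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])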

Relevant Links

CC: @Benjamin-eecs, @RERV (GitHub contributors of deepseek-ai/DeepSeek-VL)

Before submitting

Who can review?

@ArthurZucker, @Rocketknight1, @Cyrilvallez, @zucchini-nlp

TODOs

  • Documentation
  • Tests
  • Import Statements and Auto Modeling
  • Modeling
    • VisionModel
    • TextModel
    • AlignerModel
  • Weight Conversion Script
  • Tokenizer/Processor
  • Fix CI/CD tests

@geetu040 (Contributor, Author) commented:

@zucchini-nlp , @Rocketknight1, @Cyrilvallez

Deepseek-VL uses SAM as the backbone for encoding high-resolution images.
To be more specific, the backbone is SamVisionEncoder rather than SamModel, and SamVisionEncoder is not available as a public class. That is, you can do the following with SamModel but not with SamVisionEncoder:

from transformers import SamConfig, SamModel
config = SamConfig()
model = SamModel(config)

I think we should rename SamVisionEncoder -> SamVisionModel, make it inherit from SamPreTrainedModel, and expose it to the user. I don't think this breaks backward compatibility in any way.

Otherwise, we would have to copy all the classes that build SamVisionEncoder into the Deepseek model. There is nothing wrong with that either, but having a SamVisionModel alongside SamModel makes sense, since it might benefit someone else as well.

If you think having a SamVisionModel makes sense, should that be done in a separate PR?

Btw, the final result would look like this:

from transformers import SamVisionConfig, SamVisionModel
config = SamVisionConfig()
model = SamVisionModel(config)

and SamVisionConfig is already publicly available.
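For reference, a minimal sketch of how such a public wrapper could look, assuming the forward kwargs mirror what SamModel already passes to its vision encoder; the class body below is an illustration, not the actual implementation:

from transformers import SamVisionConfig
from transformers.models.sam.modeling_sam import SamPreTrainedModel, SamVisionEncoder

class SamVisionModel(SamPreTrainedModel):
    # Illustrative sketch only; the name and wiring here are assumptions.
    config_class = SamVisionConfig
    main_input_name = "pixel_values"

    def __init__(self, config: SamVisionConfig):
        super().__init__(config)
        # Reuse the existing (currently non-public) encoder as-is.
        self.vision_encoder = SamVisionEncoder(config)
        self.post_init()

    def forward(self, pixel_values, output_attentions=None, output_hidden_states=None, return_dict=None):
        # Delegate directly to the encoder, as SamModel does internally.
        return self.vision_encoder(
            pixel_values,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )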

@zucchini-nlp (Member) commented:

@geetu040 we had a similar situation with IdeficsVision, afair. Yes, in that case we can just make it public and add it to the docs. Renaming, though, would be breaking; imo we can leave the name as is.

@geetu040 (Contributor, Author) commented:

@zucchini-nlp is it okay to do it in the same PR, or should I create a new one?

@zucchini-nlp (Member) commented:

@geetu040 imo a new PR will make it easier for us to iterate and review

@geetu040 (Contributor, Author) commented:

Hi @zucchini-nlp, I am working on SamVisionEncoder (going to create the PR soon) and I have a quick question.
I realized that SamVisionAttention and SamVisionSdpaAttention produce attn_weights of different shapes when output_attentions=True.

Could you please answer these two questions:

  1. Is it allowed in transformers for the two attention implementations to produce outputs of different shapes?
  2. And suppose we make a change that alters the shape of output_attentions; does that break backward compatibility?

@zucchini-nlp (Member) commented:

@geetu040 no, they are not expected to have different shapes. Usually, using SDPA attention means that no attn_weights are returned, so they should be available only through the 'eager' attention modules.

I see that the weights are calculated on top of SDPA by a manual matmul of query and key, which imo defeats the purpose of using SDPA in the first place. Can you remove the returned attention weights and raise a warning similar to what is done in ViT?
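For reference, a minimal sketch of that ViT-style fallback, assuming SamVisionSdpaAttention subclasses SamVisionAttention and its forward takes (hidden_states, output_attentions); the SDPA computation itself is elided:

from transformers.models.sam.modeling_sam import SamVisionAttention
from transformers.utils import logging

logger = logging.get_logger(__name__)

class SamVisionSdpaAttention(SamVisionAttention):
    def forward(self, hidden_states, output_attentions=False):
        if output_attentions:
            # SDPA cannot return attention weights: warn once and fall back
            # to the eager implementation, as the ViT SDPA attention does.
            logger.warning_once(
                "`scaled_dot_product_attention` does not support `output_attentions=True`. "
                "Falling back to the eager attention implementation."
            )
            return super().forward(hidden_states, output_attentions=output_attentions)
        # ... existing SDPA path goes here, returning the attended hidden
        # states and `None` in place of attention weights ...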

@geetu040 (Contributor, Author) commented:

@zucchini-nlp sure, I'll do that.
