Fix Hunyuan I2V for transformers>4.47.1
#11293
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
LGTM if the outputs match before/after these changes! The prompt template and token indices are very unlikely to be changed by end users, but even if they are, they would need to provide the correct indices for it to work, so I would assume it should be okay on our end.
Comment on `def _expand_input_ids_with_image_tokens(`:
Maybe you could point to a specific commit by adding a comment about where this was adapted from (assuming it was adapted).
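For context, a minimal sketch of what such an expansion helper might look like; the signature, padding rules, and names here are illustrative assumptions, not the PR's verbatim code:

```python
import torch

def _expand_input_ids_with_image_tokens(
    input_ids: torch.Tensor,   # (batch, seq_len) tokenized prompt
    image_token_id: int,       # id of the "<image>" placeholder token
    num_image_tokens: int,     # positions the image features will occupy
    max_sequence_length: int,  # applies to the *expanded* sequence
    pad_token_id: int,
):
    """Repeat each image placeholder so the text encoder has one slot
    per image feature, then truncate/pad to max_sequence_length."""
    expanded_ids, masks = [], []
    for row in input_ids.tolist():
        new_row = []
        for tok in row:
            if tok == image_token_id:
                new_row.extend([image_token_id] * num_image_tokens)
            else:
                new_row.append(tok)
        new_row = new_row[:max_sequence_length]
        n_valid = len(new_row)
        new_row += [pad_token_id] * (max_sequence_length - n_valid)
        expanded_ids.append(new_row)
        masks.append([1] * n_valid + [0] * (max_sequence_length - n_valid))
    device = input_ids.device
    return (torch.tensor(expanded_ids, device=device),
            torch.tensor(masks, device=device))
```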
Hmm, so the prompt embeds do change, because of the way the input embeds are created in LLava. With the current changes, I think the padding token embedding is used to pad to the end of the sequence. Since the attention mask is passed in both cases, all the differences in the embeddings appear to be where the pad tokens are supposed to be. Visually the pipeline outputs look pretty much the same, but numerically the prompt embeds are different.

Old: [output screenshot]

Current fix: [output screenshot]
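One way to check that claim numerically, as a hedged sketch (`prompt_embeds_old`, `prompt_embeds_new`, and `attention_mask` stand in for tensors captured before and after the change):

```python
import torch

# Differences should be ~0 at attended positions and nonzero only at
# pad positions, which the attention mask already excludes downstream.
valid = attention_mask.bool().unsqueeze(-1)      # (batch, seq, 1)
diff = (prompt_embeds_old - prompt_embeds_new).abs()
print("max diff at valid tokens:", diff.masked_select(valid).max().item())
print("max diff at pad tokens:  ", diff.masked_select(~valid).max().item())
```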
@vladmandic Do you have some time to check if this fixes the issue you were facing?
Quick test: lgtm!
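For reference, a quick test along these lines might look like the following; the checkpoint id and generation parameters are assumptions, so adjust them to your setup:

```python
import torch
from diffusers import HunyuanVideoImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

# Assumed community checkpoint; substitute the one you actually use.
pipe = HunyuanVideoImageToVideoPipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo-I2V", torch_dtype=torch.float16
)
pipe.vae.enable_tiling()
pipe.enable_model_cpu_offload()

image = load_image("input.png")
frames = pipe(
    image=image,
    prompt="A cat walks on the grass, realistic style.",
    num_frames=33,
    num_inference_steps=30,
).frames[0]
export_to_video(frames, "output.mp4", fps=15)
```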
What does this PR do?
We need to manually expand the `text_input_ids` to include the token ids for the image. The function responsible for this in transformers has been deprecated, and the logic in the current `LLaVAProcessor` isn't compatible with the way `max_sequence_length` should be applied in this model. This PR adds an `_expand_input_ids_with_image_tokens` function to replicate some of the behaviour of the LLava model in transformers<4.47.1: https://github.com/huggingface/transformers/blob/241c04d36867259cdf11dbb4e9d9a60f9cb65ebc/src/transformers/models/llava/modeling_llava.py#L302
This should keep backwards compatibility and work with newer transformers versions, provided the prompt template and the token indices for the image token do not change (which seems unlikely).
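In pipeline terms, the encoding flow presumably ends up something like the sketch below; `text_encoder`, `image_token_id`, `num_image_tokens`, and `image_pixel_values` are assumed names, and the helper is the one sketched earlier in this thread:

```python
# Tokenize the templated prompt without padding, then expand the image
# placeholder ourselves instead of relying on the deprecated transformers
# merge logic; padding to max_sequence_length happens inside the helper.
text_inputs = tokenizer(prompt, return_tensors="pt")
input_ids, attention_mask = _expand_input_ids_with_image_tokens(
    text_inputs.input_ids,
    image_token_id=image_token_id,
    num_image_tokens=num_image_tokens,
    max_sequence_length=max_sequence_length,
    pad_token_id=tokenizer.pad_token_id,
)
# The LLaVA-style text encoder can then scatter the image features into
# the repeated placeholder slots, independent of the transformers version.
outputs = text_encoder(
    input_ids=input_ids,
    attention_mask=attention_mask,
    pixel_values=image_pixel_values,
    output_hidden_states=True,
)
prompt_embeds = outputs.hidden_states[-1]
```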
Fixes #11118
Before submitting
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.