[Model][VLM] Add LLaVA-Onevision model support #8486
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
Co-authored-by: Cyrus Leung <[email protected]>
@litianjian Thank you for making this contribution! I did a first pass so please take a look!
```python
# TODO: support multiple videos
num_videos = mm_counts["video"]
if num_videos > _MAX_NUM_VIDEOS:
    raise NotImplementedError(
        f"Only {_MAX_NUM_VIDEOS} videos are supported")

# TODO: support configuring the number of frames
frames_per_video = _MAX_FRAMES_PER_VIDEO
video_feature_size = get_llava_onevision_video_tokens(
    ctx, frames_per_video)
```
This is a question for later, if needed: now that I think about it, do you see any issue with basing this check on the total number of frames?
In most VLMs, the maximum number of frames depends on the number of tokens per frame and the context length of the LLM.
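The constraint described above can be sketched as a simple budget calculation. (The function name and all numbers below are illustrative assumptions, not LLaVA-OneVision's actual values.)

```python
# Hypothetical budget check: how many video frames fit in the LLM context.
# The numbers here are illustrative, not LLaVA-OneVision's real values.

def max_frames_for_context(context_len: int, tokens_per_frame: int,
                           reserved_for_text: int) -> int:
    """Return the largest frame count whose tokens fit in the context."""
    budget = context_len - reserved_for_text
    return max(budget // tokens_per_frame, 0)

# e.g. a 32k-token context, 196 tokens per frame, 1k tokens kept for text:
print(max_frames_for_context(32768, 196, 1024))  # -> 161
```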
That's right. I was just wondering whether it makes sense to validate by `total_num_frames` instead of `num_videos` and `frames_per_video`. This would indeed make the validation more complicated, and we can definitely discuss it more later!
My understanding is that, in a multi-video setting, one should be able to send two videos of 5 and 8 frames, or of 4 and 9 frames, since to the sequence both are equivalent to inference with 13 images. Is that correct?
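The alternative being discussed might look roughly like this. (A sketch only: `_MAX_TOTAL_FRAMES`, its value, and the `validate_videos` helper are hypothetical, not part of this PR.)

```python
# Sketch: validate by total frame count across all videos, rather than
# by num_videos and a fixed frames_per_video.
# _MAX_TOTAL_FRAMES and the helper below are hypothetical.
_MAX_TOTAL_FRAMES = 16

def validate_videos(frames_per_video: list[int]) -> None:
    total = sum(frames_per_video)
    if total > _MAX_TOTAL_FRAMES:
        raise NotImplementedError(
            f"At most {_MAX_TOTAL_FRAMES} total frames are supported, "
            f"got {total}")

# Both splits below total 13 frames, so both pass the same check:
validate_videos([5, 8])
validate_videos([4, 9])
```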
…into llava-one-vision
@litianjian FYI I'm going to spend some time testing this PR this week. Overall it's in good shape, so we can probably get it merged by the end of this week or early next week!
Hey @litianjian! I have finished some testing on this PR. A few questions:
Thank you for your patience. I will fix problems 1 and 2 and update ASAP. The model doesn't support mixed image-video input for now; I can add that in another PR if support lands at some point in the future.
@litianjian Thanks! I can confirm the image example runs correctly too. I just pushed a change (92c827f) to clean up some code and add an image test. Given that this is not an issue on vLLM's side, and this PR came directly from the model vendor, I'm okay with merging this PR as long as you think the model quality doesn't need to be checked for now.
Thank you for your patience. It's OK with me.
This PR adds support for the LLaVA-OneVision model.
FIX #7420
Requirements
This PR requires `transformers` with the corresponding upstream PR merged. (You can install it via `pip install git+https://github.com/huggingface/transformers`.)
Example Usage
Roadmap
- `LlavaOnevisionForConditionalGeneration`.
Notes
- `spatial_pool_mode`, `spatial_pool_stride`, `mm_newline_position`: we will follow up with updates from Hugging Face.
- `post_layer_norm` issues in LLaVA-OneVision may be addressed in other PRs.