Description
Hello, thank you for releasing this great work! I am trying to reproduce your training results, and while reviewing the pretraining setup, I found a few points that are unclear and would appreciate some clarification:
- The paper states that LLaVA Video-7B was used, but the pretrain.sh script and the Hugging Face release mention llava-1.5-7b. Could you clarify which model was actually used?
- In pretrain.sh, the dataset file llava-3d-instruct-1m.json is referenced, but I couldn't find it. On Hugging Face there is a release of LLaVA-3D-Instruct-860K.json. Is that the same dataset, or are they different versions?
- The paper reports training with 32 frames, 3096 tokens, and a learning rate of 1e-5, while pretrain.sh sets 20 frames, 1152 tokens, and a learning rate of 1e-3. Could you clarify which configuration was actually used during training?
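To make the mismatch concrete, here is a side-by-side sketch of the two configurations. The flag names below are my paraphrase of typical training-script arguments, not a verbatim quote of pretrain.sh, so the actual names in the repository may differ:

```shell
# Configuration as described in the paper:
#   --frames 32  --max_tokens 3096  --learning_rate 1e-5
#
# Configuration as currently set in pretrain.sh:
#   --frames 20  --max_tokens 1152  --learning_rate 1e-3
```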
- Overall, the training script and the paper description don't fully align. Would it be possible to update or clarify the training script so that others can reproduce the reported results more faithfully?
Thank you very much for your help!