Description
Hello, thank you for releasing this great work! I am trying to reproduce your training results, and while reviewing the pretraining setup, I found a few points that are unclear and would appreciate some clarification:
- The paper states that LLaVA Video-7B was used, but the pretrain.sh script and the Hugging Face release mention llava-1.5-7b. Could you clarify which model was actually used?
- In pretrain.sh, the dataset file llava-3d-instruct-1m.json is referenced, but I couldn't find it. On Hugging Face there is a release of LLaVA-3D-Instruct-860K.json. Is that the same dataset, or are they different versions?
- The paper reports training with 32 frames, 3096 tokens, and a learning rate of 1e-5, while pretrain.sh sets 20 frames, 1152 tokens, and a learning rate of 1e-3. Could you clarify which configuration was actually used during training?
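To make the mismatch concrete, here is a side-by-side sketch of the two configurations. The flag names below are my paraphrase of typical training-script arguments, not a verbatim quote of pretrain.sh, so the actual names in the repository may differ:

```shell
# Configuration as described in the paper:
#   --frames 32  --max_tokens 3096  --learning_rate 1e-5
#
# Configuration as currently set in pretrain.sh:
#   --frames 20  --max_tokens 1152  --learning_rate 1e-3
```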
- Overall, the training script and the paper description don't fully align. Would it be possible to update or clarify the training script so that others can reproduce the reported results more faithfully?
Thank you very much for your help!