
About only given 2D video as input. #58

Open
Kevin56-dot opened this issue Feb 27, 2025 · 2 comments

Comments

@Kevin56-dot

Thanks for your great work.
I have the following questions:

  1. You mention that Chat-Scene enables the processing of 2D video using a tracking-based detector when 3D input is unavailable. How is this implemented in data preprocessing and in model training/evaluation when only 2D video is given, for tasks such as ScanRefer, ScanQA, Scan2Cap, etc.? Could you please provide an example pipeline for running the code with only 2D video as input? I haven't found any instructions about video-only input in the current repository (apologies if I missed them).
  2. It also seems that the evaluation procedure for the video-only setting is not yet provided (e.g., the video grounding metric based on the ST-IoU threshold mentioned in your paper).

This work is well done, and it would be even better with more detailed instructions on how to use the project with only 2D video as input, covering dataset preprocessing, model training, and inference, since video-only input is very common in practical applications.

I would appreciate your kind response and help. Thanks again for your great work.

@ZzZZCHS
Owner

ZzZZCHS commented Feb 27, 2025

Hi,

Thank you for your interest in our work!

Apologies for the delay in releasing instructions for processing video input only. I've been quite busy lately, so I’m unable to provide a specific timeline for when the detailed instructions will be available. Additionally, the related code still needs some organization. However, I can give you a brief overview here:

  1. We follow DEVA to retrieve object proposals, where each object is represented by masks from several video frames.
  2. Using the masks, we extract 2D features with DINOv2 (similar to the process for 3D objects, except we don't need to project point cloud masks onto images); see the first sketch after this list.
  3. For ST-IoU, we pre-compute the IoU between each object proposal's masks and the GT masks on the video frames; see the second sketch below.
  4. The training and evaluation process is similar to the 3D+2D version, but without the need for 3D features. When calculating Acc@0.25 / Acc@0.5 for video grounding, we use the pre-computed ST-IoU directly.
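For step 2, here is a minimal sketch of what per-object feature extraction could look like. This is not the code from the repo, just an illustration: the `object_feature` helper, the masked-crop preprocessing, and the choice of `dinov2_vitb14` are assumptions on my side, and loading the DEVA outputs is omitted.

```python
# Sketch only (not the repo's actual script): pool one DINOv2 feature per object
# from its per-frame masks. Frame/mask loading from DEVA is assumed to happen elsewhere.
import numpy as np
import torch
import torchvision.transforms.functional as TF

device = "cuda" if torch.cuda.is_available() else "cpu"
dinov2 = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").to(device).eval()

IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

@torch.no_grad()
def object_feature(frames, masks):
    """frames: list of HxWx3 uint8 arrays; masks: list of HxW bool arrays for one object."""
    feats = []
    for img, mask in zip(frames, masks):
        ys, xs = np.where(mask)
        if len(ys) == 0:
            continue
        # Crop the mask's bounding box and blank out background pixels.
        crop = img[ys.min():ys.max() + 1, xs.min():xs.max() + 1].copy()
        crop[~mask[ys.min():ys.max() + 1, xs.min():xs.max() + 1]] = 0
        x = TF.to_tensor(crop)                      # HWC uint8 -> CHW float in [0, 1]
        x = TF.resize(x, [224, 224], antialias=True)  # 224 is divisible by the ViT patch size 14
        x = TF.normalize(x, IMAGENET_MEAN, IMAGENET_STD)
        feats.append(dinov2(x.unsqueeze(0).to(device)))  # [1, 768] CLS feature
    return torch.cat(feats).mean(dim=0)             # average over frames where the object appears
```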
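For steps 3-4, a rough sketch of the ST-IoU pre-computation and how the pre-computed values feed into the video grounding accuracy. I'm using the common spatio-temporal IoU formulation (summed per-frame intersections over summed per-frame unions, across the frames where either mask appears); the released code may differ in details, and `st_iou` / `grounding_acc` are illustrative names.

```python
# Sketch of the ST-IoU pre-computation between one proposal's masks and one GT object's masks.
import numpy as np

def st_iou(pred_masks, gt_masks):
    """pred_masks / gt_masks: dict frame_id -> HxW bool mask."""
    inter, union = 0, 0
    for fid in set(pred_masks) | set(gt_masks):
        p, g = pred_masks.get(fid), gt_masks.get(fid)
        if p is None:
            union += g.sum()          # GT visible, proposal absent in this frame
        elif g is None:
            union += p.sum()          # proposal present, GT absent in this frame
        else:
            inter += np.logical_and(p, g).sum()
            union += np.logical_or(p, g).sum()
    return inter / union if union > 0 else 0.0

# These values are pre-computed once per scene (proposals x GT objects) and simply
# looked up at evaluation time; Acc@0.25 / Acc@0.5 then just threshold them.
def grounding_acc(pred_to_gt_stiou, threshold):
    """pred_to_gt_stiou: one pre-computed ST-IoU per grounding query
    (between the predicted proposal and that query's GT object)."""
    hits = [iou >= threshold for iou in pred_to_gt_stiou]
    return sum(hits) / len(hits)
```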

Hope this helps! Let me know if you have any further questions.

@Kevin56-dot
Author

Thanks for your kind reply. I will try it as you described.
It would be very helpful if you could provide detailed instructions for processing video-only input when you have time.
Good luck, and I look forward to the next release of this repository~
