
About only given 2D video as input. #58

Open
Kevin56-dot opened this issue Feb 27, 2025 · 2 comments

Comments

@Kevin56-dot

Thanks for your great work.
I have the following questions:

  1. You mention that Chat-Scene enables the processing of 2D video using a tracking-based detector when 3D input is unavailable. How is this implemented in data preprocessing and in model training/evaluation when only 2D video is given, for tasks such as ScanRefer, ScanQA, Scan2Cap, etc.? Could you please provide an example pipeline for running the code with only 2D video as input? I haven't found any instructions about video-only input in the current repository (apologies if I missed them).
  2. It also seems that the evaluation procedure for the video-only setting is not yet provided (e.g., the video grounding metric based on the ST-IoU threshold mentioned in your paper).

This work is well done, and it would be even better with more detailed instructions on how to use the project with only 2D video as input, covering dataset preprocessing, model training, and inference, since video-only input is very common in practical applications.

I would appreciate your kind response and help. Thanks again for your great work.

@ZzZZCHS
Owner

ZzZZCHS commented Feb 27, 2025

Hi,

Thank you for your interest in our work!

Apologies for the delay in releasing instructions for processing video input only. I've been quite busy lately, so I’m unable to provide a specific timeline for when the detailed instructions will be available. Additionally, the related code still needs some organization. However, I can give you a brief overview here:

  1. We follow DEVA to retrieve object proposals, where each object is represented by masks from several video frames.
  2. Using the masks, we extract 2D features with DINOv2 (similar to the process for 3D objects, except we don't need to project point cloud masks onto images); see the first sketch after this list.
  3. For ST-IoU, we pre-compute the IoU between each object proposal's masks and the GT masks on the video frames; see the second sketch below.
  4. The training and evaluation process is similar to the 3D+2D version, but without the need for 3D features. When calculating Acc@0.25 / Acc@0.5 for video grounding, we use the pre-computed ST-IoU directly.
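For step 2, here is a minimal sketch of what per-object feature extraction could look like. This is not the code from the repo, just an illustration: the `object_feature` helper, the masked-crop preprocessing, and the choice of `dinov2_vitb14` are assumptions on my side, and loading the DEVA outputs is omitted.

```python
# Sketch only (not the repo's actual script): pool one DINOv2 feature per object
# from its per-frame masks. Frame/mask loading from DEVA is assumed to happen elsewhere.
import numpy as np
import torch
import torchvision.transforms.functional as TF

device = "cuda" if torch.cuda.is_available() else "cpu"
dinov2 = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").to(device).eval()

IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

@torch.no_grad()
def object_feature(frames, masks):
    """frames: list of HxWx3 uint8 arrays; masks: list of HxW bool arrays for one object."""
    feats = []
    for img, mask in zip(frames, masks):
        ys, xs = np.where(mask)
        if len(ys) == 0:
            continue
        # Crop the mask's bounding box and blank out background pixels.
        crop = img[ys.min():ys.max() + 1, xs.min():xs.max() + 1].copy()
        crop[~mask[ys.min():ys.max() + 1, xs.min():xs.max() + 1]] = 0
        x = TF.to_tensor(crop)                      # HWC uint8 -> CHW float in [0, 1]
        x = TF.resize(x, [224, 224], antialias=True)  # 224 is divisible by the ViT patch size 14
        x = TF.normalize(x, IMAGENET_MEAN, IMAGENET_STD)
        feats.append(dinov2(x.unsqueeze(0).to(device)))  # [1, 768] CLS feature
    return torch.cat(feats).mean(dim=0)             # average over frames where the object appears
```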
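For steps 3-4, a rough sketch of the ST-IoU pre-computation and how the pre-computed values feed into the video grounding accuracy. I'm using the common spatio-temporal IoU formulation (summed per-frame intersections over summed per-frame unions, across the frames where either mask appears); the released code may differ in details, and `st_iou` / `grounding_acc` are illustrative names.

```python
# Sketch of the ST-IoU pre-computation between one proposal's masks and one GT object's masks.
import numpy as np

def st_iou(pred_masks, gt_masks):
    """pred_masks / gt_masks: dict frame_id -> HxW bool mask."""
    inter, union = 0, 0
    for fid in set(pred_masks) | set(gt_masks):
        p, g = pred_masks.get(fid), gt_masks.get(fid)
        if p is None:
            union += g.sum()          # GT visible, proposal absent in this frame
        elif g is None:
            union += p.sum()          # proposal present, GT absent in this frame
        else:
            inter += np.logical_and(p, g).sum()
            union += np.logical_or(p, g).sum()
    return inter / union if union > 0 else 0.0

# These values are pre-computed once per scene (proposals x GT objects) and simply
# looked up at evaluation time; Acc@0.25 / Acc@0.5 then just threshold them.
def grounding_acc(pred_to_gt_stiou, threshold):
    """pred_to_gt_stiou: one pre-computed ST-IoU per grounding query
    (between the predicted proposal and that query's GT object)."""
    hits = [iou >= threshold for iou in pred_to_gt_stiou]
    return sum(hits) / len(hits)
```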

Hope this helps! Let me know if you have any further questions.

@Kevin56-dot
Author

Thanks for your kind reply. I will try it as you described.
It would be very helpful if you could provide detailed instructions for processing video-only input when you have time.
Good luck, and I look forward to the next release of this repository~
