Thanks for your great work.
I have the following questions:
You said Chat-Scene enables the processing of 2D video using a tracking-based detector when 3D input is unavailable. How is this implemented in data preprocessing and in model training/evaluation when only 2D video is given for the various tasks, such as ScanRefer, ScanQA, Scan2Cap, etc.? Could you please provide an example pipeline for running the code with only 2D video as input? I haven't found any instructions about video-only input in the current repository (apologies if I missed them).
It also seems that the evaluation for video-only input is not yet provided (e.g., the video grounding metric based on the ST-IoU threshold mentioned in your paper).
This work is well done, and I think it would be even better if you could provide more detailed instructions on how to use the project with only 2D video as input, including dataset preprocessing, model training, and inference, since video-only input is very common in practical applications.
I would appreciate your kind response and help. Thanks again for your great work.
Apologies for the delay in releasing instructions for processing video input only. I've been quite busy lately, so I’m unable to provide a specific timeline for when the detailed instructions will be available. Additionally, the related code still needs some organization. However, I can give you a brief overview here:
We follow DEVA to retrieve object proposals, where each object is represented by masks from several video frames.
Using the masks, we extract 2D features with DINOv2 (similar to the process for 3D objects, except we don’t need to project point cloud masks onto images).
For ST-IoU, we pre-compute the IoU between each object proposal’s masks and the GT masks on the video frames.
The training and evaluation process is similar to the 3D+2D version, but without the need for 3D features. When calculating Acc@0.25 / Acc@0.5 for video grounding, we use the pre-computed ST-IoU directly.
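For the feature-extraction step above, here is a rough sketch of how the DEVA-style tracked masks could be turned into one DINOv2 feature per object. This is not the released Chat-Scene code: the proposal format (object id -> per-frame binary masks), the bounding-box cropping, and the averaging over frames are my own assumptions for illustration.

```python
# Sketch only: pool DINOv2 features over a tracked object's frames.
# Assumes `frames` maps frame_idx -> HxWx3 uint8 image and
# `masks_per_frame` maps frame_idx -> HxW bool mask for one tracked object.
import numpy as np
import torch
import torchvision.transforms as T
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
dino = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").to(device).eval()

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])

@torch.no_grad()
def object_feature(frames, masks_per_frame):
    crops = []
    for t, mask in masks_per_frame.items():
        ys, xs = np.where(mask)
        if len(ys) == 0:
            continue
        # Crop the object's bounding box from this frame (an assumed design choice).
        y0, y1, x0, x1 = ys.min(), ys.max() + 1, xs.min(), xs.max() + 1
        crops.append(preprocess(Image.fromarray(frames[t][y0:y1, x0:x1])))
    if not crops:
        return None
    batch = torch.stack(crops).to(device)
    cls_tokens = dino(batch)        # (num_frames, feat_dim) class-token features
    return cls_tokens.mean(dim=0)   # average over the sampled frames
```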
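And for the ST-IoU pre-computation and the grounding metric, a minimal sketch is below, assuming ST-IoU is the sum of per-frame mask intersections divided by the sum of per-frame unions, taken over every frame where either the proposal or the GT object appears (the exact definition in the released code may differ).

```python
# Sketch only: pre-compute ST-IoU between a tracked proposal and a GT object,
# then score video grounding with Acc@0.25 / Acc@0.5.
import numpy as np

def st_iou(pred_masks, gt_masks):
    """pred_masks / gt_masks: {frame_idx: HxW bool array} for one object track."""
    inter, union = 0, 0
    for t in set(pred_masks) | set(gt_masks):
        p, g = pred_masks.get(t), gt_masks.get(t)
        if p is None:
            union += g.sum()            # GT visible, proposal missing
        elif g is None:
            union += p.sum()            # proposal present, no GT in this frame
        else:
            inter += np.logical_and(p, g).sum()
            union += np.logical_or(p, g).sum()
    return inter / union if union > 0 else 0.0

def grounding_accuracy(predicted_ious, thresholds=(0.25, 0.5)):
    """predicted_ious: one pre-computed ST-IoU per grounding query
    (between the model's chosen proposal and the GT object)."""
    ious = np.asarray(predicted_ious, dtype=float)
    return {f"Acc@{t}": float((ious >= t).mean()) for t in thresholds}

# e.g. grounding_accuracy([0.6, 0.3, 0.1]) -> {'Acc@0.25': ~0.67, 'Acc@0.5': ~0.33}
```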
Hope this helps! Let me know if you have any further questions.
Thanks for your kind reply. I will try it as you suggested.
It would be very helpful if you could provide some detailed instructions for processing video input only when you have time.
Good luck, and I'm looking forward to the next release of this repository!