Guidance needed: Processing GRIT-20M dataset in .parquet format for Alpha-CLIP #60

Open
qingpowuwu opened this issue Aug 17, 2024 · 1 comment

Comments


qingpowuwu commented Aug 17, 2024

Hello,

I'm working with the GRIT-20M dataset for the Alpha-CLIP project as described in the training README. However, I've encountered a discrepancy between the instructions and the format of the dataset I downloaded.

  1. Dataset Format:
    • The data preparation script (sam_grit.py) is configured to use .tar files, as evidenced by the line:
      parser.add_argument('--tar-pth', type=str, default="GRIT-1m/00001.tar")
    • However, the dataset I've downloaded is in .parquet format (e.g., coyo_0_snappy.parquet, coyo_10_snappy.parquet, etc.).
    • Could you confirm if this .parquet format is correct for the latest version of the dataset?

Thank you for your time and assistance.

Owner

SunzeY commented Aug 18, 2024

You can follow the download script in KOSMOS-2 to download the .tar files. If you download from Hugging Face, you need to adjust the script. (By the way, this script only uses SAM to convert boxes into masks, so it's easy to reimplement for the .parquet format.)
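
For anyone adapting the script, here is a minimal sketch of the box-extraction half of that reimplementation. It is not taken from sam_grit.py. It assumes each parquet row has `width`, `height`, and a `ref_exps` column whose entries look like `[start, end, x0, y0, x1, y1, score]` with box coordinates normalized to [0, 1]; check those column names and the layout against your own files before relying on it. The helper names (`to_pixel_box`, `boxes_from_row`, `iter_rows`) are hypothetical, and reading the file assumes pandas with a parquet engine such as pyarrow is installed.

```python
from typing import Iterator, List, Sequence, Tuple

Box = Tuple[float, float, float, float]


def to_pixel_box(norm_box: Sequence[float], width: int, height: int) -> Box:
    """Scale a normalized (x0, y0, x1, y1) box to pixel coordinates."""
    x0, y0, x1, y1 = norm_box
    return (x0 * width, y0 * height, x1 * width, y1 * height)


def boxes_from_row(row: dict) -> List[Box]:
    """Collect pixel-space boxes for one row.

    ASSUMPTION: row["ref_exps"] entries are
    [start, end, x0, y0, x1, y1, score] with normalized coordinates.
    """
    return [
        to_pixel_box(exp[2:6], row["width"], row["height"])
        for exp in row["ref_exps"]
    ]


def iter_rows(parquet_path: str) -> Iterator[dict]:
    """Yield each row of a parquet file as a plain dict."""
    import pandas as pd  # imported lazily so the helpers above work without it

    df = pd.read_parquet(parquet_path)
    for record in df.to_dict(orient="records"):
        yield record


if __name__ == "__main__":
    # Hand-made row in the assumed layout, used instead of a real file.
    example = {
        "width": 200,
        "height": 100,
        "ref_exps": [[0, 5, 0.25, 0.5, 0.75, 1.0, 0.9]],
    }
    print(boxes_from_row(example))
```

Each resulting pixel-space box could then be passed to SAM as a box prompt in place of the boxes the script reads from the .tar shards. If the parquet files hold only metadata and image URLs rather than image bytes, the images still have to be fetched separately, which appears to be what the KOSMOS-2 download script handles.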
