write datasets in a JLD2 or Arrow format for faster read #125
Comments
Have done this for large vision datasets like COCO that have annotations in JSON, which can be slow to parse. One thing to keep in mind is the size of the JLD2 files, though of course it shouldn't be a problem for MNIST. Arrow.jl can also be a good format, with built-in compression, when the data has samples made up of primitive types and arrays.
What's to be expected from the JLD2 sizes? Hopefully not larger than the original data, right?
It depends. If you have a large dataset of .jpg images and store them as raw arrays (hence losslessly), the files can be several times the size of the originals.
I agree too, Arrow.jl is a good format:
HuggingFace's datasets library also uses Arrow: https://huggingface.co/docs/datasets/about_arrow
Some code showing how to read/write color arrays from/to Arrow tables:
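A minimal sketch of one way to do this, not the exact snippet from the comment (the `to_row`/`from_row` helper names are made up; it leans on ImageCore's `channelview`/`rawview`/`normedview`):

    using Arrow, Tables, ImageCore

    # Colorant arrays are not primitive types, so flatten each image into a
    # plain UInt8 channel vector plus its spatial size before writing.
    to_row(img) = (data = vec(collect(rawview(channelview(img)))),
                   h = size(img, 1), w = size(img, 2))

    # Reverse the flattening: UInt8 channels -> N0f8 -> RGB matrix.
    function from_row(row)
        raw = reshape(Vector{UInt8}(row.data), 3, row.h, row.w)
        return colorview(RGB, normedview(N0f8, raw))
    end

    imgs = [colorview(RGB, rand(N0f8, 3, 28, 28)) for _ in 1:10]
    Arrow.write("images.arrow", [to_row(img) for img in imgs]; compress = :zstd)

    tbl = Arrow.Table("images.arrow")
    restored = [from_row(row) for row in Tables.rows(tbl)]

Storing the raw channels keeps every column a primitive type, which is what lets Arrow's built-in compression (`compress = :zstd` above) do its job.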
We could have a "processed" folder in each dataset folder where we write the dataset object the first time we create it. On subsequent constructions, e.g.

    d = MNIST()

we just load the JLD2 file.
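A minimal sketch of that caching pattern (the `build_mnist` helper, the file names, and the folder layout are assumptions for illustration):

    using JLD2

    # Hypothetical slow path: parse the raw MNIST files into arrays.
    function build_mnist(dir)
        # ... read and decode the original binary files here ...
    end

    function load_mnist(dir)
        cache = joinpath(dir, "processed", "mnist.jld2")
        if isfile(cache)
            # Fast path on every later construction: reload the serialized object.
            return jldopen(cache, "r") do f
                f["dataset"]
            end
        end
        dataset = build_mnist(dir)
        mkpath(dirname(cache))
        jldsave(cache; dataset)  # stored under the key "dataset"
        return dataset
    end

One thing to watch with this pattern: the cache needs to be invalidated whenever the preprocessing logic changes, e.g. by embedding a version number in the file name.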