Add Medical Decathlon Datasets #47
For reference, Project MONAI (https://monai.io) has this functionality and has already prepared a public Dropbox location (https://github.com/Project-MONAI/MONAI/blob/master/monai/apps/datasets.py).
At a glance, this dataset looks to be more than 50 GB on Google Drive; I'm afraid that is not suitable for this repo.
It seems odd to me that MONAI created a Dropbox for this when the downloads can be accessed via a link to a public Google Drive account. I mentioned Artifacts on Slack: using an Artifacts.toml in a package basically gives you a version-controlled script for downloading things.
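For concreteness, something along these lines could be used to register an entry (the artifact name, URL, and tarball name below are placeholders for illustration, not the real dataset locations):

```julia
using Pkg.Artifacts, SHA

artifacts_toml = joinpath(@__DIR__, "Artifacts.toml")
url = "https://example.com/Task01_BrainTumour.tar.gz"  # placeholder URL

# Download the upstream tarball once and record its sha256.
tarball = download(url)
tarball_sha256 = bytes2hex(open(sha256, tarball))

# Unpack it into a fresh artifact directory to get the content (git-tree-sha1)
# hash. Assumes a `tar` binary is available on the system.
tree_hash = create_artifact() do dir
    run(`tar -xzf $tarball -C $dir`)
end

# Bind the entry into Artifacts.toml so `artifact"Task01_BrainTumour"`
# resolves to this exact content for every user of the package.
bind_artifact!(artifacts_toml, "Task01_BrainTumour", tree_hash;
               download_info = [(url, tarball_sha256)],
               lazy = true)
```

With `lazy = true`, nothing is fetched at `Pkg.add` time; the data only comes down when the artifact is first used.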
It's still not good practice (or a good experience) to host an artifact of over 1 GB. Besides, the whole Julia ecosystem currently produces less than 558 GB of data for all artifacts combined, so adding this dataset as artifacts would increase the disk pressure by roughly a tenth, and I don't think we should advertise this "solution". Ref: see the Julia item for storage in https://mirrors.bfsu.edu.cn/status/#server-status
I didn't know we hosted the download associated with each artifact. Why do we download the URL associated with an artifact? According to the Pkg documentation, the user still receives the download from the URL in the artifact.
I think I had that wrong. MONAI hosts the dataset on AWS. My best guess as to why they chose to do this is that Google Drive has a daily download limit (which I often ran into when previously downloading the dataset from the original Google Drive).
I'm not sure I understand you correctly. Generally we download artifacts from the Pkg servers, which are backed by storage servers hosted by julialang.org. So no, we don't download the dataset from the original URL unless the download from the Pkg server fails. Currently, every artifact with a URL provided has a copy kept on all of those storage servers, and there isn't a hard size limit on it. This means that if we added the dataset to Artifacts.toml with a URL provided, we would effectively be running a stress test on the storage servers... cc: @staticfloat
FYI, I think dvc is a better tool for managing large datasets and experiments. It works with any language.
It's a little complex; the
If users want to create 50 GB artifacts, they are more than welcome to, but we will probably prevent them from being cached on the Pkg servers. :) That would then cause the downloads to fall back to the original location, so that's totally fine. That being said, I also suggest
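For what it's worth, consuming such a lazy artifact from package code would look roughly like this (using the same hypothetical artifact name as in the sketch above):

```julia
# Julia ≥ 1.6: LazyArtifacts must be a dependency of the package for lazy
# artifacts; on older versions `using Pkg.Artifacts` provides the same macro.
using LazyArtifacts

# Nothing is downloaded until this line first runs. Pkg tries the Pkg/storage
# servers first and, if the artifact is not cached there (e.g. because it is
# too large), falls back to the url recorded in Artifacts.toml.
data_dir = artifact"Task01_BrainTumour"  # hypothetical artifact name

readdir(data_dir)  # inspect the unpacked contents
```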
I am new to Julia and I am working on a medical imaging research project. I would like to add the Medical Decathlon datasets (http://medicaldecathlon.com, https://arxiv.org/pdf/1902.09063.pdf) to this repo, as I think it would be a great way for me to learn what's going on, and it would likely benefit the entire Julia community. I will definitely need help in this endeavor, though, so please let me know if that is something of interest to the contributors of this project.