
Add Medical Decathlon Datasets #47

Open
Dale-Black opened this issue Oct 22, 2020 · 9 comments

@Dale-Black

I am new to Julia and working on a medical imaging research project. I would like to add the Medical Decathlon datasets (http://medicaldecathlon.com, https://arxiv.org/pdf/1902.09063.pdf) to this repo: I think it would be a great way for me to learn what's going on, and it would likely benefit the entire Julia community. I will definitely need help with this endeavor, though, so please let me know if it is of interest to the contributors of this project.

@Dale-Black
Author

For reference, Project MONAI (https://monai.io) has this functionality, and they have already prepared a public Dropbox location (https://github.com/Project-MONAI/MONAI/blob/master/monai/apps/datasets.py).

@johnnychen94
Member

johnnychen94 commented Oct 22, 2020

A glance at this dataset suggests it is >50 GB in size on Google Drive; I'm afraid that is not suitable for this repo.

  • Downloading and storing such a large dataset can be problematic for normal usage. This repo only serves small datasets for test usage.
  • The dataset is stored as multiple files in Google Drive, so extra code is needed to communicate with the Google Drive API, and I don't know whether such a tool exists in Julia.

@Tokazama

It seems odd to me that MONAI created a Dropbox for this when the downloads can be accessed via a link to a public Google Drive account. I mentioned Artifacts on Slack. Using an Artifacts.toml in a package basically gives you a version-controlled recipe for downloading data; a rough sketch is below.
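
To illustrate the idea, here is a minimal sketch of how an entry could be bound into an Artifacts.toml with Pkg.Artifacts. The task name, URL, and hashes are placeholders, not the real Medical Decathlon values:

```julia
# Hypothetical sketch only: the artifact name, URL, and hashes are placeholders.
using Pkg.Artifacts

artifacts_toml = joinpath(@__DIR__, "Artifacts.toml")

# Tree hash of the unpacked dataset directory (normally computed once after
# downloading and unpacking the tarball).
tree_hash = Base.SHA1("0000000000000000000000000000000000000000")

bind_artifact!(artifacts_toml, "Task01_BrainTumour", tree_hash;
    download_info = [("https://example.com/Task01_BrainTumour.tar.gz",  # source URL
                      "0"^64)],                                         # sha256 of the tarball (placeholder)
    lazy = true)  # a lazy artifact is only downloaded on first use
```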

@johnnychen94
Member

johnnychen94 commented Oct 22, 2020

It's still not a good practice/experience to host an artifact of over 1 GB. Besides, the whole Julia ecosystem currently produces <558 GB of data across all artifacts; adding this dataset as an artifact would dramatically increase the disk pressure, by roughly 1/10, and I don't think we should advertise this "solution".

Ref: see the Julia entry under storage at https://mirrors.bfsu.edu.cn/status/#server-status

@Tokazama

I didn't know we hosted the downloads associated with each artifact. Why do we mirror the URL associated with an artifact? According to the Pkg documentation, the user still receives the download from the URL in the artifact.

@Dale-Black
Author

I think I had that wrong. MONAI hosts the dataset on AWS. My best guess as to why they chose to do this is that Google Drive has a daily download limit (which I often ran into when previously downloading it from the original Google Drive link).

@johnnychen94
Member

johnnychen94 commented Oct 22, 2020

I didn't know we hosted the downloads associated with each artifact. Why do we mirror the URL associated with an artifact? According to the Pkg documentation, the user still receives the download from the URL in the artifact.

I'm not sure I understand you correctly. Generally we download artifacts from Pkg servers, which are backed by storage servers hosted by julialang.org. So no, we don't download the dataset from the original URL unless we fail to download it from the Pkg server.

Currently, every artifact with a URL provided has a copy kept on all of those storage servers, and there isn't a hard size limit on it. That is to say, if we encoded the dataset in an Artifacts.toml with the URL provided, we would be running a stress test on the storage servers...

cc: @staticfloat

@johnnychen94
Member

johnnychen94 commented Oct 22, 2020

FYI, I think dvc is a better tool for managing large datasets and experiments. It works with any language.

@staticfloat

According to the Pkg documentation, the user still receives the download from the URL in the artifact.

It's a little complex; the Artifacts.toml contains URLs, which are the source URLs that your client can download from, but it will first attempt to download from a Pkg server, because those are generally closer and higher-performance.
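
As a rough illustration of that client-side behavior (the artifact name here is a placeholder and assumes a lazy entry exists in the package's Artifacts.toml), accessing the artifact is what triggers the Pkg-server-first, then source-URL, download:

```julia
# Hypothetical usage sketch inside a package whose Artifacts.toml declares a
# lazy "Task01_BrainTumour" entry (placeholder name).
using Pkg.Artifacts

# For a lazy artifact, the download happens on first access: the client asks a
# Pkg server first and falls back to the source URLs listed in Artifacts.toml.
dataset_dir = artifact"Task01_BrainTumour"

readdir(dataset_dir)  # the unpacked dataset files live here
```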

Currently, every artifact with a URL provided has a copy kept on all of those storage servers, and there isn't a hard size limit on it.

If users want to create 50GB artifacts, they are more than welcome to, but we will probably prevent them from being cached in the Pkg servers. :) That would then cause the downloads to fall back to the original location. So that's totally fine.

That being said, I also suggest DataDeps.jl as a natural solution for these very large datasets. It should make it easier for your package to download the dataset directly from the origin server, no matter what format it's in; a rough sketch follows below.
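
As a minimal sketch of what such a registration might look like (the dataset name, URL, and description are placeholders, not verified Medical Decathlon values):

```julia
# Hypothetical DataDeps.jl registration; the name, URL, and text are placeholders.
using DataDeps

register(DataDep(
    "MedicalDecathlon_Task01",
    """
    Medical Segmentation Decathlon, Task01_BrainTumour (placeholder entry).
    Website: http://medicaldecathlon.com
    Please cite: https://arxiv.org/abs/1902.09063
    """,
    "https://example.com/Task01_BrainTumour.tar.gz";  # placeholder origin URL
    # a real registration would also pin a checksum here
    post_fetch_method = unpack,  # extract the tarball after download
))

# First use resolves (and, if needed, downloads) the data:
data_dir = datadep"MedicalDecathlon_Task01"
```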
