
Add Medical Decathlon Datasets #47

Open
Dale-Black opened this issue Oct 22, 2020 · 9 comments

@Dale-Black

I am new to Julia and working on a medical imaging research project. I would like to add the Medical Decathlon datasets (http://medicaldecathlon.com, https://arxiv.org/pdf/1902.09063.pdf) to this repo: I think it would be a great way for me to learn what's going on, and it would likely benefit the entire Julia community. I will definitely need help with this endeavor, though, so please let me know if it is of interest to the contributors of this project.

@Dale-Black
Author

For reference, Project MONAI (https://monai.io) has this functionality, and they have already prepared a public Dropbox location (https://github.com/Project-MONAI/MONAI/blob/master/monai/apps/datasets.py).

@johnnychen94
Member

johnnychen94 commented Oct 22, 2020

A glance at this dataset suggests it is >50 GB in size on Google Drive; I'm afraid that is not suitable for this repo.

  • Downloading and storing such a large dataset can be problematic for normal usage. This repo only serves small datasets for test usage.
  • The dataset is stored as multiple files in Google Drive, so extra code is needed to communicate with the Google Drive API, and I don't know whether such a tool exists in Julia.

@Tokazama

It seems odd to me that MONAI created a Dropbox for this when the downloads can be accessed via a link to a public Google Drive account. I mentioned Artifacts on Slack. Using an Artifacts.toml in a package basically gives you a version-controlled recipe for downloading data; a rough sketch is below.
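
To illustrate the idea, here is a minimal sketch of how an entry could be bound into an Artifacts.toml with Pkg.Artifacts. The task name, URL, and hashes are placeholders, not the real Medical Decathlon values:

```julia
# Hypothetical sketch only: the artifact name, URL, and hashes are placeholders.
using Pkg.Artifacts

artifacts_toml = joinpath(@__DIR__, "Artifacts.toml")

# Tree hash of the unpacked dataset directory (normally computed once after
# downloading and unpacking the tarball).
tree_hash = Base.SHA1("0000000000000000000000000000000000000000")

bind_artifact!(artifacts_toml, "Task01_BrainTumour", tree_hash;
    download_info = [("https://example.com/Task01_BrainTumour.tar.gz",  # source URL
                      "0"^64)],                                         # sha256 of the tarball (placeholder)
    lazy = true)  # a lazy artifact is only downloaded on first use
```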

@johnnychen94
Member

johnnychen94 commented Oct 22, 2020

It's still not a good practice/experience to host an artifact of over 1 GB. Besides, the whole Julia ecosystem currently produces <558 GB of data across all artifacts; adding this dataset as an artifact would dramatically increase the disk pressure, by roughly 1/10, and I don't think we should advertise this "solution".

Ref: see the Julia entry under storage at https://mirrors.bfsu.edu.cn/status/#server-status

@Tokazama

I didn't know we hosted the downloads associated with each artifact. Why do we mirror the URL associated with an artifact? According to the Pkg documentation, the user still receives the download from the URL in the artifact.

@Dale-Black
Author

I think I had that wrong. MONAI hosts the dataset on AWS. My best guess as to why they chose to do this is that Google Drive has a daily download limit (which I often ran into when previously downloading it from the original Google Drive link).

@johnnychen94
Member

johnnychen94 commented Oct 22, 2020

I didn't know we hosted the downloads associated with each artifact. Why do we mirror the URL associated with an artifact? According to the Pkg documentation, the user still receives the download from the URL in the artifact.

I'm not sure I understand you correctly. Generally we download artifacts from Pkg servers, which are backed by storage servers hosted by julialang.org. So no, we don't download the dataset from the original URL unless we fail to download it from the Pkg server.

Currently, every artifact with a URL provided has a copy kept on all of those storage servers, and there isn't a hard size limit on it. That is to say, if we encoded the dataset in an Artifacts.toml with the URL provided, we would be running a stress test on the storage servers...

cc: @staticfloat

@johnnychen94
Member

johnnychen94 commented Oct 22, 2020

FYI, I think dvc is a better tool for managing large datasets and experiments. It works with any language.

@staticfloat

According to the Pkg documentation, the user still receives the download from the URL in the artifact.

It's a little complex; the Artifacts.toml contains URLs, which are the source URLs that your client can download from, but it will first attempt to download from a Pkg server, because those are generally closer and higher-performance.
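
As a rough illustration of that client-side behavior (the artifact name here is a placeholder and assumes a lazy entry exists in the package's Artifacts.toml), accessing the artifact is what triggers the Pkg-server-first, then source-URL, download:

```julia
# Hypothetical usage sketch inside a package whose Artifacts.toml declares a
# lazy "Task01_BrainTumour" entry (placeholder name).
using Pkg.Artifacts

# For a lazy artifact, the download happens on first access: the client asks a
# Pkg server first and falls back to the source URLs listed in Artifacts.toml.
dataset_dir = artifact"Task01_BrainTumour"

readdir(dataset_dir)  # the unpacked dataset files live here
```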

Currently, every artifact with a URL provided has a copy kept on all of those storage servers, and there isn't a hard size limit on it.

If users want to create 50GB artifacts, they are more than welcome to, but we will probably prevent them from being cached in the Pkg servers. :) That would then cause the downloads to fall back to the original location. So that's totally fine.

That being said, I also suggest DataDeps.jl as a natural solution for these very large datasets. It should make it easier for your package to download the dataset directly from the origin server, no matter what format it's in; a rough sketch follows below.
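
As a minimal sketch of what such a registration might look like (the dataset name, URL, and description are placeholders, not verified Medical Decathlon values):

```julia
# Hypothetical DataDeps.jl registration; the name, URL, and text are placeholders.
using DataDeps

register(DataDep(
    "MedicalDecathlon_Task01",
    """
    Medical Segmentation Decathlon, Task01_BrainTumour (placeholder entry).
    Website: http://medicaldecathlon.com
    Please cite: https://arxiv.org/abs/1902.09063
    """,
    "https://example.com/Task01_BrainTumour.tar.gz";  # placeholder origin URL
    # a real registration would also pin a checksum here
    post_fetch_method = unpack,  # extract the tarball after download
))

# First use resolves (and, if needed, downloads) the data:
data_dir = datadep"MedicalDecathlon_Task01"
```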
