TCGA-Tools is a Python package that provides a clean, modular interface for downloading and organizing datasets from the NCI Genomic Data Commons (GDC), The Cancer Imaging Archive (TCIA), and Patho-Bench metadata hosted on Hugging Face. It allows you to fetch raw data together with directly usable metadata tables and, on the GDC path, optional annotations such as clinical, molecular, diagnostic reports, and diagnosis.
- Simple one-liner to fetch project files (e.g., whole-slide images / .svs).
- Write analysis-ready CSVs with file metadata and patient grouping.
- Emit optional annotation CSVs: clinical (survival/outcomes/treatments), molecular (DNA/RNA/CNV/methylation), free-text reports, and diagnosis/subtype.
- Be resilient to missing or sparse fields across projects.
- Source-aware interface for gdc, tcia, and patho-bench.
- Clean, modular architecture with explicit ports/adapters for GDC and TCIA.
- Easy interface to the GDC API and TCIA/NBIA public API.
- Multi-dataset support (download one or multiple TCGA projects at once).
- TCIA imaging collection support with collection listing, modality listing, and series downloads.
- Patho-Bench benchmark metadata support with direct Hugging Face split download plus raw-data access manifests.
- Annotation options:
  - clinical: survival, treatment outcomes, patient metadata
  - molecular: genomic, transcriptomic, and methylation data
  - report: free-text pathology or clinical reports
  - diagnosis: tumor subtype and diagnostic information
  - all: fetch everything available
- Progress bars for downloads.
- Logging of all transformations for reproducibility.
- raw=True option for "dry runs" (inspect available data without downloading).
- Optional statistics and visualizations: class distributions, survival curves, annotation summaries.

pip install tcga-tools

uv pip install tcga-tools

pip install gdc-api-wrapper

git clone https://github.com/LUMCPathAI/TCGA-Tools.git
cd TCGA-Tools
pip install -e .

List the available GDC / TCGA projects first:

import tcga_tools as tt
datasets = tt.list_datasets(source="gdc")
print(datasets[["project_id", "name", "summary.case_count"]].head())

Preview a single dataset without downloading files:

import tcga_tools as tt
artifacts = tt.download(
    dataset_name="TCGA-LUSC",
    source="gdc",
    datatype=["WSI"],
    annotations=["clinical", "molecular", "report"],
    output_dir="./TCGA-LUSC",
    raw=True,
)
print(artifacts["files_csv"])

# Download multiple datasets
tt.download(
    dataset_name=["TCGA-LUSC", "TCGA-LUAD", "TCGA-BRCA"],  # list of datasets
    source="gdc",
    filetypes=[".svs", ".maf"],  # multiple file types
    annotations="all",  # fetch everything
    output_dir="./TCGA",
)

Download actual files instead of only metadata by removing raw=True.

Preview a TCIA imaging collection without downloading DICOM ZIP files:

tcia_artifacts = tt.download(
    dataset_name="TCGA-BRCA",
    source="tcia",
    datatype=["MR"],
    annotations=["clinical"],
    output_dir="./TCIA-TCGA-BRCA",
    raw=True,
)
print(tcia_artifacts["files_csv"])

Download Patho-Bench metadata for one benchmark dataset and task:

patho_bench_artifacts = tt.download(
    dataset_name="cptac_coad",
    source="patho-bench",
    task_name="KRAS_mutation",
    output_dir="./PathoBench-COAD",
)
print(patho_bench_artifacts["raw_data_manifest_json"])

Use the high-level portal to query pathology metadata and download slides from TCGA (GDC) and TCIA using clean, modular services.

from tcga_tools.pathology import PathologyDataPortal
from tcga_tools.services.tcia_pathology import TciaSeriesQuery
portal = PathologyDataPortal()
# --- TCIA: SOP Instance lookup and downloads ---
query = TciaSeriesQuery(series_instance_uid="uid.series.instance", format_="JSON")
sop_result = portal.list_tcia_sop_instance_uids(query)
portal.download_tcia_series(series_instance_uid="uid.series.instance", output_dir="./TCIA")
# --- TCGA: download pathology files via GDC wrapper ---
tcga_files = portal.download_tcga_project(
    project_id="TCGA-LUSC",
    filetypes=[".svs"],
    output_dir="./TCGA-LUSC",
)

Optional statistics and visualizations:
- Summary log of transformations and queries
- Distributions of diagnosis categories
- Survival curves based on clinical annotations
- Counts per file type and annotation

Output files (depending on source and requested annotations):
- data/ (downloads)
- files_metadata.csv (flattened file + case/sample fields)
- groups.csv (per-case: paired / tumor_only / normal_only)
- clinical.csv, molecular_index.csv, reports_index.csv, diagnosis.csv (if requested)
- gdc_manifest.tsv (for the GDC Data Transfer Tool)
- patient_studies.csv (TCIA patient/study metadata)
- raw_data_manifest.json (Patho-Bench raw image access instructions)

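The per-case grouping recorded in groups.csv can be pictured with a minimal sketch. This is not the package's implementation: the substring matching on sample type, the `case_id` / `sample_type` keys, and the extra `unknown` label for cases without a usable sample type are all assumptions made for illustration.

```python
from collections import defaultdict


def classify_sample(sample_type):
    """Return 'tumor', 'normal', or None for a missing/unrecognized sample type."""
    if not sample_type:
        return None
    label = str(sample_type).lower()
    # Assumption: simple substring checks on sample-type labels such as
    # "Primary Tumor" or "Solid Tissue Normal".
    if "normal" in label:
        return "normal"
    if "tumor" in label or "metastatic" in label:
        return "tumor"
    return None


def group_cases(rows):
    """Map each case ID to paired / tumor_only / normal_only / unknown."""
    seen_by_case = defaultdict(set)
    for row in rows:
        case_id = row.get("case_id")
        if not case_id:
            continue
        seen = seen_by_case[case_id]  # registers the case even without a sample type
        kind = classify_sample(row.get("sample_type"))
        if kind is not None:
            seen.add(kind)
    labels = {
        frozenset({"tumor", "normal"}): "paired",
        frozenset({"tumor"}): "tumor_only",
        frozenset({"normal"}): "normal_only",
    }
    # "unknown" is a hypothetical fallback label, not one documented by TCGA-Tools.
    return {
        case_id: labels.get(frozenset(seen), "unknown")
        for case_id, seen in seen_by_case.items()
    }
```

Keeping the fallback explicit means a case with no sample type still gets a row instead of being dropped silently.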
If you need controlled-access files, set an environment variable with your token:

export GDC_TOKEN="<your token>"

Use list_datasets() to enumerate the identifiers that can be passed to dataset_name.
For GDC / TCGA downloads, these are project IDs such as TCGA-LUSC, TCGA-BRCA, and TCGA-LUAD.

import tcga_tools as tt
df = tt.list_datasets(source="gdc")
print(df[["project_id", "name", "primary_site"]].head(10))

Return plain Python records instead of a pandas DataFrame:

projects = tt.list_datasets(source="gdc", as_dataframe=False)
print(projects[0]["project_id"])

For TCIA downloads, these are collection names such as TCGA-BRCA, LIDC-IDRI, or CPTAC-CCRCC.

tcia_collections = tt.list_datasets(source="tcia")
print(tcia_collections.head())

List TCIA data types (modalities) globally or for one collection:

tt.list_data_types(source="tcia").head()
tt.list_data_types(source="tcia", dataset_name="TCGA-BRCA")

From the CLI:

tcga-tools list-datasets --source gdc --program TCGA
tcga-tools list-datasets --source tcia
tcga-tools list-datasets --source patho-bench --include-data-types
tcga-tools list-data-types --source tcia --dataset TCGA-BRCA
tcga-tools list-data-types --source patho-bench --dataset cptac_coad

Use the dataset_name argument with one or more IDs returned by list_datasets().

GDC metadata-only dry run:

tcga-tools download --source gdc --dataset TCGA-LUSC --datatype WSI --raw --out ./TCGA-LUSC

GDC WSI download (actual files):

tcga-tools download --source gdc --dataset TCGA-LUSC --datatype WSI \
    --download-workers 4 --out ./TCGA-LUSC

GDC RNA-seq download (metadata and files):

tcga-tools download --source gdc --dataset TCGA-LUSC --datatype rna-seq --out ./TCGA-LUSC_RNA

Multiple GDC projects in one run:

tcga-tools download --source gdc --dataset TCGA-LUSC TCGA-LUAD TCGA-BRCA \
    --datatype WSI --download-workers 4 --raw --out ./TCGA-WSI

TCIA metadata-only preview for one imaging collection and modality:

tcga-tools download --source tcia --dataset TCGA-BRCA --datatype MR --raw --out ./TCIA-BRCA

TCIA metadata-only preview for one patient:

tcga-tools download --source tcia --dataset TCGA-BRCA --datatype MR \
    --patient-id TCGA-AR-A1AQ --raw --out ./TCIA-BRCA-patient

TCIA DICOM ZIP download for one series (actual files):

tcga-tools download --source tcia --dataset TCGA-BRCA \
    --series-instance-uid 1.3.6.1.4.1.14519.5.2.1.3344.4002.142000486987125226950494153345 \
    --out ./TCIA-series

For TCIA, datatype corresponds to imaging modalities such as MR, CT, PT, MG, or DX.

Patho-Bench metadata download from Hugging Face:

tcga-tools download --source patho-bench --dataset cptac_coad \
    --task KRAS_mutation --out ./PathoBench-COAD

Patho-Bench metadata for a full benchmark dataset:

tcga-tools download --source patho-bench --dataset bracs --out ./PathoBench-BRACS

Patho-Bench metadata plus delegated raw-image download when the dataset is TCIA-backed:

tcga-tools download --source patho-bench --dataset cptac_coad \
    --download-raw-data --out ./PathoBench-COAD-full

Patho-Bench contains only metadata splits and labels on Hugging Face. Raw image downloads still come from the original dataset portals. For CPTAC-style entries already hosted on TCIA, --download-raw-data delegates to the TCIA backend. For the remaining Patho-Bench datasets, the downloader writes a raw_data_manifest.json with the original access URL.

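The delegate-or-manifest decision described above can be sketched roughly as follows. Everything here is hypothetical: the function name, the `tcia_backed` set, the manifest keys, and the example URL are illustrative placeholders and do not reflect the real layout of raw_data_manifest.json.

```python
import json
from pathlib import Path


def plan_raw_data(dataset, access_url, tcia_backed, output_dir, download_raw_data=False):
    """Either hand the dataset to a TCIA backend or write an access manifest.

    Hypothetical helper: returns a small dict describing what was decided.
    """
    if download_raw_data and dataset in tcia_backed:
        # A real downloader would call its TCIA backend at this point.
        return {"dataset": dataset, "action": "delegate_to_tcia"}

    # Assumed manifest layout: dataset name plus the original access URL.
    manifest = {"dataset": dataset, "action": "manual_access", "access_url": access_url}
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "raw_data_manifest.json").write_text(
        json.dumps(manifest, indent=2), encoding="utf-8"
    )
    return manifest
```

For example, `plan_raw_data("bracs", "https://example.org/bracs", {"cptac_coad"}, "./PathoBench-BRACS")` would write a manifest, while the same call for `cptac_coad` with `download_raw_data=True` would delegate instead.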
The downloader is now designed for bounded parallel transfers within each dataset or imaging collection.
- Use dataset_name as a list to process multiple datasets in one run.
- Use --download-workers or download_workers= to control parallel file or series downloads.
- Keep worker counts modest on login nodes; values like 2 to 4 are usually appropriate there.
- On dedicated transfer or batch nodes, increase the worker count if the filesystem and network can sustain it.
- For very large GDC runs, --tar can still be preferable when one archive per dataset is easier to manage than many parallel file transfers.
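As an illustration of what "bounded parallel transfers" means in practice, here is a minimal sketch using the standard library thread pool. It is not the package's code; `download_all` and `fetch_one` are hypothetical names, and `fetch_one` stands in for whatever function transfers a single file or series.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(items, fetch_one, workers=4):
    """Call fetch_one(item) for every item with at most `workers` running at once.

    Returns (results, errors), both keyed by item, so one failed transfer
    does not abort the rest of the run.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max(1, workers)) as pool:
        futures = {pool.submit(fetch_one, item): item for item in items}
        for future in as_completed(futures):
            item = futures[future]
            try:
                results[item] = future.result()
            except Exception as exc:  # collect the failure and keep going
                errors[item] = exc
    return results, errors
```

The pool size is the only knob: a value of 2 to 4 keeps pressure on a shared login node low, while a dedicated transfer node can take a larger value.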
Pass any subset of:
- "clinical": survival/clinical outcome/treatment effect (diagnoses, treatments, follow-ups, exposures)
- "molecular": DNA/RNA/CNV/methylation file index
- "report": free-text clinical/pathology reports (XML/PDF)
- "diagnosis": diagnostic subtype, morphology, stage/grade
- "all": everything above

On the GDC path, these map to GDC case/file annotations.
On the TCIA path, only clinical is supported, and it is limited to patient/study metadata returned by the public imaging API. TCIA does not expose a standardized public cross-collection API for GDC-style molecular, report, or diagnosis tables.
On the Patho-Bench path, benchmark tasks are selected with task_name / --task instead of the annotations argument. The Hugging Face repo stores split TSVs and YAML metadata only.
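The per-source rules above can be summarized in a small validation sketch. The function name and the choice to raise ValueError on an unsupported annotation are assumptions; the package may instead warn or skip.

```python
GDC_ANNOTATIONS = ("clinical", "molecular", "report", "diagnosis")

# Which annotation names each source accepts, per the rules described above.
SUPPORTED = {
    "gdc": GDC_ANNOTATIONS,
    "tcia": ("clinical",),
    "patho-bench": (),  # tasks are selected with task_name instead
}


def resolve_annotations(source, annotations=None):
    """Expand "all" and check the requested annotations against the source."""
    if source not in SUPPORTED:
        raise ValueError(f"unknown source: {source!r}")
    supported = SUPPORTED[source]
    if annotations is None:
        return []
    requested = [annotations] if isinstance(annotations, str) else list(annotations)
    if "all" in requested:
        return list(supported)
    unsupported = [name for name in requested if name not in supported]
    if unsupported:
        hint = " (use task_name instead)" if source == "patho-bench" else ""
        raise ValueError(f"{source} does not support {unsupported}{hint}")
    # Return in a stable, canonical order regardless of how the caller listed them.
    return [name for name in supported if name in requested]
```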
GDC projects vary in completeness. TCGA-Tools is defensive:
- Broad field requests; if the API rejects fields (HTTP 400), it retries without fields to maximize returned content.
- JSON is flattened into wide CSVs; absent fields simply do not appear, or appear with empty values.
- Grouping logic remains robust even if sample types are missing.
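The two defensive steps in this list, retrying without a field list and flattening JSON into wide rows, can be sketched as below. This is an illustration under assumptions: `FieldsRejected` stands in for an HTTP 400 response, `fetch` is a hypothetical callable, and the dotted column naming is chosen to resemble the summary.case_count column used in the listing example.

```python
import csv
import io


class FieldsRejected(Exception):
    """Stand-in for an HTTP 400 caused by a field the API does not accept."""


def fetch_with_fallback(fetch, fields):
    """Try a broad field request first; on rejection, retry with no field list."""
    try:
        return fetch(fields)
    except FieldsRejected:
        return fetch(None)


def flatten(record, prefix=""):
    """Flatten nested dicts/lists into one level with dotted column names."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        elif isinstance(value, list):
            for index, item in enumerate(value):
                if isinstance(item, dict):
                    flat.update(flatten(item, prefix=f"{name}.{index}."))
                else:
                    flat[f"{name}.{index}"] = item
        else:
            flat[name] = value
    return flat


def to_csv(records):
    """Write records as a wide CSV; fields missing from a record stay empty."""
    rows = [flatten(record) for record in records]
    columns = sorted({column for row in rows for column in row})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Because the column set is the union over all records, a project that lacks a field entirely never gets that column, while a project where only some cases lack it gets empty cells.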

tcga-tools download --source gdc --dataset TCGA-LUSC --datatype WSI \
    --annotations clinical molecular report diagnosis --raw --out ./TCGA-LUSC

Show all valid TCGA project IDs:

tcga-tools list-datasets --source gdc --program TCGA

Show all valid TCIA collection IDs and modalities:

tcga-tools list-datasets --source tcia
tcga-tools list-data-types --source tcia --dataset TCGA-BRCA

Show all valid Patho-Bench benchmark datasets and tasks:

tcga-tools list-datasets --source patho-bench --include-data-types
tcga-tools list-data-types --source patho-bench --dataset cptac_coad

The legacy form is still supported:

python -m tcga_tools --dataset TCGA-LUSC --datatype WSI --raw --out ./TCGA-LUSC

Validate an existing output directory offline:

tcga-tools validate-output ./TCGA-LUSC --dataset TCGA-LUSC

- Python ≥ 3.9
- Tested on Linux, macOS, Windows
- Dependencies are listed in pyproject.toml and installed automatically.

All downloads and transformations are logged to download.log in your output directory for reproducibility.
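A per-run log file of this kind can be set up with the standard logging module. The sketch below is an assumption about how such a logger might be wired, not the package's own code; only the download.log file name comes from the text above, and `get_run_logger` is a hypothetical helper.

```python
import logging
from pathlib import Path


def get_run_logger(output_dir, name="tcga_tools.sketch"):
    """Return a logger that appends to <output_dir>/download.log."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    # One logger per output directory, so separate runs do not share handlers.
    logger = logging.getLogger(f"{name}.{out.resolve()}")
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid duplicate lines if called twice
        handler = logging.FileHandler(out / "download.log", encoding="utf-8")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

A call such as `get_run_logger("./TCGA-LUSC").info("query project=%s", "TCGA-LUSC")` would then add one timestamped line to the log.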
Preview available data without downloading:

tt.download(dataset_name="TCGA-LUSC", raw=True)

Run unit tests:

pytest tests/

- For very large downloads, prefer the emitted gdc_manifest.tsv with the GDC Data Transfer Tool.
- Extend config.py to add/modify field lists or filetype preferences as needed.
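For readers unfamiliar with the manifest format, the sketch below writes a tab-separated manifest with the column layout the GDC Data Transfer Tool reads (id, filename, md5, size, state) from GDC-style file records. Whether TCGA-Tools emits exactly these columns is an assumption, and `write_manifest` is a hypothetical helper.

```python
import csv
import io

MANIFEST_COLUMNS = ["id", "filename", "md5", "size", "state"]


def write_manifest(file_records, handle):
    """Write GDC-style file records to `handle` as a tab-separated manifest."""
    writer = csv.DictWriter(
        handle, fieldnames=MANIFEST_COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for record in file_records:
        # Map GDC file fields onto manifest columns; missing values stay empty.
        writer.writerow(
            {
                "id": record.get("file_id", ""),
                "filename": record.get("file_name", ""),
                "md5": record.get("md5sum", ""),
                "size": record.get("file_size", ""),
                "state": record.get("state", ""),
            }
        )


# Small in-memory demonstration with a made-up record.
demo = io.StringIO()
write_manifest(
    [{"file_id": "abc", "file_name": "slide.svs", "md5sum": "d41d",
      "file_size": 123, "state": "released"}],
    demo,
)
```

A manifest like this is typically passed to the transfer tool with `gdc-client download -m gdc_manifest.tsv`.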

Apache 2.0: free for research and commercial use.
Contributions are welcome! Please open an issue or PR on GitHub.
