Pre-cancer colorectal polyps: HTAN Vanderbilt dataset. Chen et al. Cell 2021 (Discovery & Validation Sets) #2019

Open
wants to merge 3 commits into
base: master

Conversation

rmadupuri
Collaborator

@rmadupuri rmadupuri commented Apr 12, 2024

Fixes #1915
Testing Instance:
Triage Portal: https://triage.cbioportal.mskcc.org/study/summary?id=crc_hta11_htan_2021
Private Portal: https://private.cbioportal.mskcc.org/study/summary?id=crc_hta11_htan_2021

Curation and transformation of Pre-cancer HTAN CRC Vanderbilt Dataset:

Data collection:

  • Data Source: Data on HTAN Portal
  • Reference publication: PubMed Reference
  • Reference genome used: GRCh37 for WXS and GRCh38 for scRNA-seq
  • Datatypes available: Clinical, WES MAF, scRNA-seq (transformed to absolute and relative cell counts, as well as pseudo-bulk RNA counts), MxIF, and H&E images.

Sample size selection

  • Discovery & Validation set samples were used to generate the study (61 polyps). 30 samples were whole-exome sequenced, and 55 samples went through scRNA-seq expression analysis.
  • Two patients (HTA11_3918 and HTA11_4781) were not listed in the supplementary clinical tables but were included in the Oncoplot analysis (Figure 1). They have been added to the cohort to match the oncoplot.

Clinical data

  • Patient-Level Data: Table S1, Participants tab in HTAN Portal
  • Sample-Level Data: Table S1, Biospecimens tab in HTAN Portal

Mutation data

  • The Level 3 Bulk DNA filtered data files were obtained from the Vanderbilt team.
  • Variants were annotated using Genome Nexus.

scRNA-seq data

  • The Discovery and Validation set h5ad files for both epithelial and non-epithelial cells were obtained from cellxgene.
    • Files used: "Discovery (DIS) set of human colorectal tumor: Epithelial", "Validation (Val) set of human colorectal tumor: Epithelial", and "VAL and DIS datasets: Non-Epithelial". They contain the normalized counts.
  • The absolute and relative cell frequencies were calculated and written in Generic Assay format.
  • The pseudo-bulk RNA expression counts per sample were calculated from the scRNA-seq data by averaging the values across the cells.
  • The script to generate the absolute and relative cell frequencies and the pseudo-bulk RNA-seq data from the h5ad files: https://gist.github.com/rmadupuri/447b5689b256ccc4880aa26fb91b72ff
  • The pseudo-bulk RNA expression data was log-transformed and z-scores were calculated.
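
For reviewers who want the shape of these transformations without opening the gist, the steps above can be sketched roughly as follows. This is a minimal illustration and not the linked script: the toy `CELLS` records, the function names, the log base and pseudocount, and the choice to z-score each gene across all samples are assumptions made for the example. In practice the per-cell values would come from the h5ad files.

```python
from collections import Counter, defaultdict
from math import log2
from statistics import mean, pstdev

# Hypothetical toy input: one record per cell,
# (sample_id, cell_type, {gene: normalized value}).
CELLS = [
    ("S1", "Epithelial", {"GENE_A": 2.0, "GENE_B": 0.0}),
    ("S1", "Epithelial", {"GENE_A": 4.0, "GENE_B": 2.0}),
    ("S1", "T cell", {"GENE_A": 0.0, "GENE_B": 4.0}),
    ("S2", "Epithelial", {"GENE_A": 1.0, "GENE_B": 1.0}),
]


def cell_frequencies(cells):
    """Return (absolute, relative) cell-type frequencies per sample."""
    absolute = defaultdict(Counter)
    for sample_id, cell_type, _ in cells:
        absolute[sample_id][cell_type] += 1
    # Relative frequency = count of a cell type / total cells in that sample.
    relative = {
        s: {t: n / sum(counts.values()) for t, n in counts.items()}
        for s, counts in absolute.items()
    }
    return {s: dict(c) for s, c in absolute.items()}, relative


def pseudo_bulk(cells):
    """Average each gene's value across all cells of a sample."""
    per_sample = defaultdict(lambda: defaultdict(list))
    for sample_id, _, expression in cells:
        for gene, value in expression.items():
            per_sample[sample_id][gene].append(value)
    return {
        s: {g: mean(values) for g, values in genes.items()}
        for s, genes in per_sample.items()
    }


def log_transform(bulk, pseudocount=1.0):
    """Apply log2(x + pseudocount); base and pseudocount are assumptions."""
    return {
        s: {g: log2(v + pseudocount) for g, v in genes.items()}
        for s, genes in bulk.items()
    }


def zscores(logged):
    """Z-score each gene across samples (assumes every sample has every gene)."""
    genes = sorted({g for expression in logged.values() for g in expression})
    result = {s: {} for s in logged}
    for gene in genes:
        values = [logged[s][gene] for s in logged]
        mu, sd = mean(values), pstdev(values)
        for s in logged:
            # Genes with zero variance across samples are set to 0.0.
            result[s][gene] = (logged[s][gene] - mu) / sd if sd > 0 else 0.0
    return result


absolute, relative = cell_frequencies(CELLS)
bulk = pseudo_bulk(CELLS)
z = zscores(log_transform(bulk))
```

Cell types that are absent from a sample are simply omitted here rather than reported as zero; whether the real output fills them in is not stated in this description.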

Imaging data

  • H&E data was available for 26 samples: H&E tab
  • MxIF images were available for 25 samples: MxIF tab. Multiple images were available per sample. As we do not currently support this, the images have been split across multiple tabs for now, named MxIF Image 1, 2, 3, and so on.
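
The tab-splitting workaround for MxIF images can be pictured with a small sketch like the one below. It only illustrates the idea of giving each sample at most one image per numbered tab; the function name, sample IDs, and URLs are hypothetical, and it does not reproduce the actual resource files in this PR.

```python
from collections import defaultdict


def split_images_into_tabs(image_records, tab_prefix="MxIF Image"):
    """Map each tab name to {sample_id: image_url}, one image per sample per tab."""
    per_sample = defaultdict(list)
    for sample_id, url in image_records:
        per_sample[sample_id].append(url)

    tabs = defaultdict(dict)
    for sample_id, urls in per_sample.items():
        # The n-th image of a sample goes to the n-th tab.
        for index, url in enumerate(urls, start=1):
            tabs[f"{tab_prefix} {index}"][sample_id] = url
    return dict(tabs)


# Hypothetical sample IDs and URLs, for illustration only.
records = [
    ("SAMPLE_1", "https://example.org/s1_a.tif"),
    ("SAMPLE_1", "https://example.org/s1_b.tif"),
    ("SAMPLE_2", "https://example.org/s2_a.tif"),
]
tabs = split_images_into_tabs(records)
```

With this layout, the number of tabs equals the largest number of images attached to any single sample, and samples with fewer images just do not appear in the higher-numbered tabs.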

@rmadupuri rmadupuri changed the title from "Pre-cancer colorectal polyps: HTAN Vanderbilt dataset. Chen et al. Cell 2021" to "Pre-cancer colorectal polyps: HTAN Vanderbilt dataset. Chen et al. Cell 2021 (Discovery & Validation Sets)" on Apr 12, 2024