Understanding the Sources of Performance in Deep Learning Drug Response Prediction Models

Exploration of how different datatypes and subnetworks of state-of-the-art drug response prediction (DRP) models affect performance, and an implementation of BinaryET for DRP.

Quick start

To run the mean model baseline/benchmark with the example data (a subset of the full data; example_data = True), simply run:

python source_code/mean_model_benchmark.py 

Using the repository

It is recommended to first use the example data provided, which builds the models on a subset of the full dataset and demonstrates how the code functions. To create the full models used in the paper, the datasets must be downloaded from external websites (see the Datasets section for full instructions); remember to then set example_data = False.

Running the benchmarks

The benchmarks that do not use omics or chemical structure data are run with marker_benchmark.py and mean_model_benchmark.py: marker_benchmark.py runs the marker benchmark, while mean_model_benchmark.py runs the drug-average and cell line-average benchmarks. To run on the example data (subset of the full dataset), set example_data = True in the .py file.
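As a sketch of the idea behind the drug-average benchmark (the toy data, column names, and code here are illustrative assumptions, not the repo's actual implementation): each drug's prediction for an unseen cell line is simply its mean IC50 over the training cell lines.

```python
import pandas as pd

# Toy training responses; stands in for the GDSC2 IC50 table.
train = pd.DataFrame({
    "drug": ["d1", "d1", "d2", "d2"],
    "cell_line": ["A", "B", "A", "B"],
    "ic50": [1.0, 3.0, 2.0, 4.0],
})

# Drug-average baseline: predict each drug's mean training IC50
# for every unseen cell line.
drug_means = train.groupby("drug")["ic50"].mean()

test = pd.DataFrame({"drug": ["d1", "d2"], "cell_line": ["C", "C"]})
test["pred"] = test["drug"].map(drug_means)
# d1 -> 2.0, d2 -> 3.0
```

The cell line-average benchmark is the mirror image: group by cell line instead of drug.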

The code used to create these models can be found in source_code/models.

If you are not using the example data and the datasets (download instructions below) are not at the paths specified in the code, or have different names, then these paths (omic_dir_path, gdsc2_target_path, pubchem_ids_path and genomics_path) need to be set in the code to match the locations and file names of your datasets.

Running the literature models

The main_run_model.py file can be run for each model and testing type by specifying both as arguments. The arguments are model_type, split_type, and epochs. E.g.

python source_code/main_run_model.py tcnn c_blind 10 

runs tCNNs with cancer blind splitting for 10 epochs.

Setting example_data = True runs the models on the example data provided (a subset of the full dataset). To run the models with all data, follow the instructions in the Datasets section, and note that the paths to the datasets may need to be set as described above.

Datasets

The datasets needed to re-train the models are publicly available.

  • Transcriptomics and genomics data can be downloaded from the Genomics of Drug Sensitivity in Cancer database (https://www.cancerrxgene.org/ -> https://www.cancerrxgene.org/gdsc1000/GDSC1000_WebResources/Home.html). Here the expression profiles, genomic profiles and metadata mapping the cell names (Annotated list of cell-lines) can be found. To run out of the box, the expression profiles file needs to be converted to CSV and renamed gdsc_expresstion_dat.csv, and the cell names file needs to be converted to CSV and renamed gdsc_cell_names.csv (or the file names can be changed in the read_rna_gdsc function in data_loading.py). The genomic profiles come from the multi-omics Cancer Functional Events (CFEs) BEMs for cell-lines zip download; PANCAN_simple_MOBEM.rdata.tsv is the file required.

  • Drug response data in the form of IC50 values can be downloaded from GDSC (https://www.cancerrxgene.org/ -> https://www.cancerrxgene.org/downloads/bulk_download). GDSC2 IC50 values are used for this study.

PubChem IDs and SMILES strings for the drugs in GDSC can be found at https://pubchem.ncbi.nlm.nih.gov/

Once downloaded, the paths to these datasets need to be set (see 'Using the repository' above).

Processing data

Train/test splits are available in the data folder, and new train/test splits can be generated using the train_test_split notebook.
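A cancer-blind split (the c_blind option above) holds out entire cell lines, so no cell line in the test set appears in the training set. A minimal sketch of the idea, using a hypothetical helper and toy data (not the repo's actual split code):

```python
import numpy as np
import pandas as pd

def cancer_blind_split(pairs, test_frac=0.25, seed=0):
    """Split cell line-drug pairs so that the test set's cell lines
    are entirely unseen during training (illustrative helper)."""
    rng = np.random.default_rng(seed)
    cells = pairs["cell_line"].unique()
    rng.shuffle(cells)
    n_test = int(len(cells) * test_frac)
    test_cells = set(cells[:n_test])
    test_mask = pairs["cell_line"].isin(test_cells)
    return pairs[~test_mask], pairs[test_mask]

# Toy example: 4 cell lines x 2 drugs = 8 pairs.
pairs = pd.DataFrame({
    "cell_line": ["A", "A", "B", "B", "C", "C", "D", "D"],
    "drug": ["d1", "d2"] * 4,
    "ic50": np.arange(8.0),
})
train, test = cancer_blind_split(pairs)

# No cell line is shared between train and test.
assert set(train["cell_line"]).isdisjoint(test["cell_line"])
```

A mixed split, by contrast, samples pairs at random, so the same cell line can appear on both sides.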

A dict mapping drug names to SMILES strings can be created and saved using the create_drug_to_smiles_mapping_gdsc2 method from data_loading.py. This dict can then be read in for later use.
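For example, the resulting dict could be saved and reloaded with pickle. The toy mapping and the use of pickle here are illustrative assumptions; the real dict comes from create_drug_to_smiles_mapping_gdsc2.

```python
import pickle

# Toy mapping standing in for the dict returned by
# create_drug_to_smiles_mapping_gdsc2 (drug name -> SMILES string).
drug_to_smiles = {"drug_a": "CCO", "drug_b": "c1ccccc1"}

# Save once...
with open("drug_to_smiles.pkl", "wb") as f:
    pickle.dump(drug_to_smiles, f)

# ...then read it back in later runs.
with open("drug_to_smiles.pkl", "rb") as f:
    loaded = pickle.load(f)

assert loaded == drug_to_smiles
```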

Running BinaryET

This can be done inside the BinaryET folder; see the readme inside that folder for further running instructions. The datasets required are the same as those outlined here.

Environment

The yml file gives the environment used, with package versions. The key packages are pytorch, torch_geometric, transformers, numpy, pandas and sklearn.

Problem Formulation

The goal of DRP is to predict how effective different drugs are for different cancer types. Here we predict IC50 values, the concentration of a drug needed to inhibit the activity of a cell line by 50%, as a measure of efficacy. This is typically done using omics profiles of cell lines and chemical profiles of drugs.

Consider the training set $T = \lbrace (\boldsymbol{x_{c,i}}, \boldsymbol{x_{d,i}}, y_i) \rbrace$, where $\boldsymbol{x_{c,i}}$ and $\boldsymbol{x_{d,i}}$ are representations of the $i^{th}$ cell line and drug respectively, and $y_i$ is the IC50 value associated with the $i^{th}$ cell line-drug pair.

Thus, we want to find a model, $M$, that takes $\boldsymbol{x_{c,i}}$ and $\boldsymbol{x_{d,i}}$ as inputs and predicts the corresponding IC50 value, $\hat{y_i}$, such that $M(\boldsymbol{x_{c,i}}, \boldsymbol{x_{d,i}})=\hat{y_i}$.
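This formulation can be illustrated with a minimal numpy sketch of such a model $M$: one subnetwork encodes the cell line representation, another encodes the drug representation, and a joint head predicts the IC50 value. Dimensions, weights and architecture here are illustrative only, not those of the models in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

# Illustrative dimensions only.
cell_dim, drug_dim, hidden = 100, 50, 32
W_c = rng.normal(size=(cell_dim, hidden)) * 0.1   # cell line subnetwork
W_d = rng.normal(size=(drug_dim, hidden)) * 0.1   # drug subnetwork
w_out = rng.normal(size=(2 * hidden,)) * 0.1      # joint prediction head

def M(x_c, x_d):
    """M(x_c, x_d) = y_hat: encode each input with its own subnetwork,
    concatenate the encodings, and predict the IC50 value."""
    z = np.concatenate([relu(x_c @ W_c), relu(x_d @ W_d)], axis=-1)
    return z @ w_out

x_c = rng.normal(size=(4, cell_dim))   # 4 cell line profiles
x_d = rng.normal(size=(4, drug_dim))   # 4 drug representations
y_hat = M(x_c, x_d)                    # one predicted IC50 per pair
assert y_hat.shape == (4,)
```

The models studied in the paper differ in how each subnetwork is built (e.g. convolutional or graph-based drug encoders), but share this overall cell line-drug-to-IC50 structure.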
