Understanding the Sources of Performance in Deep Learning Drug Response Prediction Models

An exploration of how different data types and subnetworks of state-of-the-art drug response prediction (DRP) models affect performance.

Quick start

To run the mean model benchmark on the example data (a subset of the full data) with example_data = True, simply run:

python source_code/mean_model_benchmark.py

Using the repository

It is recommended to first use the provided example data to create the models on a subset of the full dataset. This demonstrates how the code functions. Creating the full models used in the paper requires downloading the datasets from external websites (see the Datasets section for full instructions) and then setting example_data = False.

Running the benchmarks

The benchmarks that do not use omics or chemical structure data are run from marker_benchmark.py and mean_model_benchmark.py. marker_benchmark.py runs the marker benchmark, while mean_model_benchmark.py runs the drug average and cell line average benchmarks.

To run on the example data (a subset of the full dataset), set example_data = True in the .py file.
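The idea behind the drug average and cell line average benchmarks can be illustrated with a minimal sketch. This is not the code in mean_model_benchmark.py. The function names, the (cell_line, drug, ic50) row layout, the fallback to the global mean for unseen keys and the toy values are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def fit_mean_benchmark(train_rows, key_index):
    """Return (per-key mean IC50, global mean) from (cell_line, drug, ic50) rows.

    key_index=0 groups by cell line, key_index=1 groups by drug.
    """
    grouped = defaultdict(list)
    for row in train_rows:
        grouped[row[key_index]].append(row[2])
    key_means = {key: mean(values) for key, values in grouped.items()}
    global_mean = mean(row[2] for row in train_rows)
    return key_means, global_mean


def predict_mean_benchmark(key_means, global_mean, rows, key_index):
    """Predict each pair's IC50 as the training mean of its drug or cell line.

    Keys not seen in training fall back to the global mean (an assumption).
    """
    return [key_means.get(row[key_index], global_mean) for row in rows]


# Toy, made-up values purely for illustration (not GDSC data).
train = [
    ("cell_A", "drug_1", 1.0),
    ("cell_A", "drug_2", 3.0),
    ("cell_B", "drug_1", 2.0),
    ("cell_B", "drug_2", 6.0),
]
test = [("cell_C", "drug_1", None), ("cell_A", "drug_3", None)]

drug_means, overall = fit_mean_benchmark(train, key_index=1)
cell_means, _ = fit_mean_benchmark(train, key_index=0)

drug_avg_preds = predict_mean_benchmark(drug_means, overall, test, key_index=1)
cell_avg_preds = predict_mean_benchmark(cell_means, overall, test, key_index=0)
print(drug_avg_preds, cell_avg_preds)
```

Neither benchmark looks at omics or chemical structure data, which is what makes them useful reference points for the deep learning models.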

The code used to create these models can be found in source_code/models.

If you are not using the example data, and the datasets (download instructions are given below) are not at the paths specified in the code or have different file names, then the path variables in the code (omic_dir_path, gdsc2_target_path, pubchem_ids_path and genomics_path) need to be set to the locations and file names of your datasets.

Running the literature models

main_run_model.py can be run for each model and testing type by specifying both as arguments. The arguments are model_type, split_type, and epochs. For example:

python source_code/main_run_model.py tcnn c_blind 10

runs tCNNs with cancer blind splitting for 10 epochs.
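As a rough illustration of how three positional arguments like these can be read, the sketch below uses argparse. It is hypothetical and not taken from main_run_model.py, which may parse its arguments differently. Only the argument names and the example values tcnn, c_blind and 10 come from this README.

```python
import argparse


def build_parser():
    """Build a parser for the three positional arguments named in this README."""
    parser = argparse.ArgumentParser(
        description="Illustrative sketch of a DRP model runner command line."
    )
    parser.add_argument("model_type", help="model to run, e.g. tcnn")
    parser.add_argument("split_type", help="testing type, e.g. c_blind")
    parser.add_argument("epochs", type=int, help="number of training epochs")
    return parser


# Parse the example command from above instead of the real sys.argv.
args = build_parser().parse_args(["tcnn", "c_blind", "10"])
print(args.model_type, args.split_type, args.epochs)
```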

Setting example_data = True runs the models on the provided example data (a subset of the full dataset). To run the models on all data, follow the instructions in the Datasets section and note that the paths to the datasets may need to be set as described above.

Datasets

The datasets needed to re-train the models are publicly available.

Transcriptomics and genomics data can be downloaded from the Genomics of Drug Sensitivity in Cancer (GDSC) database (https://www.cancerrxgene.org/ -> https://www.cancerrxgene.org/gdsc1000/GDSC1000_WebResources/Home.html). The expression profiles, genomic profiles and metadata mapping the cell names (Annotated list of cell-lines) can be found there. The expression profiles data file needs to be converted to csv and renamed to gdsc_expresstion_dat.csv to run out of the box. The cell names file also needs to be converted to csv and renamed to gdsc_cell_names.csv to run out of the box. (Alternatively, the file names can be changed in the read_rna_gdsc function in data_loading.py.) The genomic profiles come from the MULT omics Cancer functional events (CFEs) BEMs for cell-lines zip download, and PANCAN_simple_MOBEM.rdata.tsv is the file required.

Drug response data in the form of IC50 values can be downloaded from GDSC (https://www.cancerrxgene.org/ -> https://www.cancerrxgene.org/downloads/bulk_download). GDSC2 IC50 values are used for this study.

PubChem IDs and SMILES strings for the drugs in GDSC can be found at https://pubchem.ncbi.nlm.nih.gov/

Once the datasets are downloaded, the paths to them need to be set (see 'Using the repository' above).

Processing data

Train test splits are available in the data folder, and new train test splits can be generated using the train_test_split notebook.
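For intuition, a blind split can be sketched as below. This is not the code from the train_test_split notebook. It assumes that "cancer blind" means no cell line in the test set appears in the training set, and the function name, test_fraction, seed and the toy pair names are hypothetical.

```python
import random


def cell_line_blind_split(pairs, test_fraction=0.2, seed=0):
    """Split (cell_line, drug) pairs so test cell lines never appear in training."""
    # Sort first so the shuffle is reproducible for a given seed.
    cell_lines = sorted({cell for cell, _drug in pairs})
    rng = random.Random(seed)
    rng.shuffle(cell_lines)

    # Hold out a fraction of the cell lines (at least one).
    n_test = max(1, round(len(cell_lines) * test_fraction))
    test_cells = set(cell_lines[:n_test])

    train = [pair for pair in pairs if pair[0] not in test_cells]
    test = [pair for pair in pairs if pair[0] in test_cells]
    return train, test


# Toy pairs: 10 made-up cell lines, each tested with 3 made-up drugs.
pairs = [(f"cell_{c}", f"drug_{d}") for c in range(10) for d in range(3)]
train_pairs, test_pairs = cell_line_blind_split(pairs)
print(len(train_pairs), len(test_pairs))
```

Splitting on cell lines rather than on individual pairs is what keeps every test cell line unseen during training.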

A dict mapping drug names to SMILES strings can be created and saved using the create_drug_to_smiles_mapping_gdsc2 method from data_loading.py. This dict can then be read back in for later use.
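The save-then-reload pattern can be sketched as follows. This does not reproduce create_drug_to_smiles_mapping_gdsc2, whose storage format is not described here. JSON, the helper names and the file name are assumptions, and the two entries are simple well-known molecules rather than GDSC drugs.

```python
import json
import tempfile
from pathlib import Path

# Toy mapping for illustration only (ethanol and benzene, not GDSC drugs).
drug_to_smiles = {"ethanol": "CCO", "benzene": "c1ccccc1"}


def save_mapping(mapping, path):
    """Write a drug name -> SMILES dict to disk as JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(mapping, handle, indent=2)


def load_mapping(path):
    """Read a drug name -> SMILES dict back from a JSON file."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Round trip through a temporary directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp_dir:
    mapping_path = Path(tmp_dir) / "drug_to_smiles.json"
    save_mapping(drug_to_smiles, mapping_path)
    reloaded = load_mapping(mapping_path)

print(reloaded)
```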

Running BinaryET

BinaryET can be run from inside the BinaryET folder. See the readme inside that folder for further running instructions. The datasets required are the same as the ones outlined here.

Environment

The environment.yml file gives the environment used, with package versions. The key packages are pytorch, torch_geometric, transformers, numpy, pandas and sklearn.

Problem Formulation

The goal of DRP is to predict how effective different drugs are for different cancer types. Here we predict IC50 values, the concentration of a drug needed to inhibit the activity of a cell line by 50%, as a measure of efficacy. This is typically done using omics profiles of cell lines and chemical profiles of drugs.

Consider the training set $T = \lbrace \boldsymbol{x_{c,i}}, \boldsymbol{x_{d,i}}, y_i \rbrace$ where $\boldsymbol{x_{c,i}}$ and $\boldsymbol{x_{d,i}}$ are representations of the $i^{th}$ cell line and drug respectively, and $y_i$ is the IC50 value associated with the $i^{th}$ cell line drug pair.

Thus, we want to find a model, $M$, that takes $\boldsymbol{x_{c,i}}$ and $\boldsymbol{x_{d,i}}$ as inputs and predicts the corresponding IC50 value, $\hat{y_i}$, such that $M(\boldsymbol{x_{c,i}}, \boldsymbol{x_{d,i}})=\hat{y_i}$.
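The formulation can be made concrete with a small sketch of the interface: a model is any function of a cell line representation and a drug representation that returns a predicted IC50. The toy model, the feature vectors and the use of mean squared error to compare $\hat{y_i}$ with $y_i$ are assumptions for illustration, not the models or metrics used in the paper.

```python
from typing import Callable, List, Sequence, Tuple

# One training example: (x_c, x_d, y) as in the set T above.
Example = Tuple[Sequence[float], Sequence[float], float]
Model = Callable[[Sequence[float], Sequence[float]], float]


def mean_squared_error(model: Model, examples: List[Example]) -> float:
    """Average squared difference between M(x_c, x_d) and the true value y."""
    errors = [(model(x_c, x_d) - y) ** 2 for x_c, x_d, y in examples]
    return sum(errors) / len(errors)


def toy_model(x_c: Sequence[float], x_d: Sequence[float]) -> float:
    """Placeholder M: sums all input features (not a real DRP model)."""
    return sum(x_c) + sum(x_d)


# Toy training set with made-up feature vectors and IC50 values.
T: List[Example] = [
    ([1.0, 0.0], [0.5], 1.5),
    ([0.0, 2.0], [1.0], 2.0),
]

mse = mean_squared_error(toy_model, T)
print(mse)
```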
