Exploration of how different datatypes and subnetworks of state-of-the-art drug response prediction (DRP) models affect performance.
To run the mean model baseline/benchmark with the example data (a subset of the full data; example_data = True), simply run:
python source_code/mean_model_benchmark.py
It is recommended to start with the example data provided, which builds the models on a subset of the full dataset and demonstrates how the code functions. To create the full models used in the paper, the datasets must be downloaded from external websites (see the Datasets section for full instructions); remember to then set example_data = False.
The benchmarks that do not use omics or chemical structure data are run from marker_benchmark.py and mean_model_benchmark.py. marker_benchmark.py runs the marker benchmark, while mean_model_benchmark.py runs the drug average and cell line average benchmarks. To run on the example data (subset of the full dataset), set example_data = True in the .py file.
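As a rough illustration of what a drug average and a cell line average benchmark compute, the sketch below predicts each test IC50 from training-set means. It is not the code in mean_model_benchmark.py: the function names, the dict keys (cell_line, drug, ic50), the fallback to the global mean for unseen drugs or cell lines, and the toy values are all assumptions made for this example.

```python
from collections import defaultdict
from statistics import mean


def group_means(rows, key):
    """Mean IC50 per distinct value of `key` (hypothetical helper)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row["ic50"])
    return {name: mean(values) for name, values in groups.items()}


def mean_baselines(train, test):
    """Predict each test pair from training averages only.

    Unseen drugs or cell lines fall back to the global training mean
    (an assumption of this sketch, not necessarily the repository's choice).
    """
    global_mean = mean(row["ic50"] for row in train)
    drug_mean = group_means(train, "drug")
    cell_mean = group_means(train, "cell_line")
    return [
        {
            "drug_avg": drug_mean.get(row["drug"], global_mean),
            "cell_avg": cell_mean.get(row["cell_line"], global_mean),
        }
        for row in test
    ]


# Toy data, not taken from GDSC.
train = [
    {"cell_line": "A", "drug": "d1", "ic50": 1.0},
    {"cell_line": "A", "drug": "d2", "ic50": 3.0},
    {"cell_line": "B", "drug": "d1", "ic50": 2.0},
    {"cell_line": "B", "drug": "d2", "ic50": 6.0},
]
test = [
    {"cell_line": "A", "drug": "d2"},
    {"cell_line": "C", "drug": "d1"},  # cell line "C" is unseen in training
]
preds = mean_baselines(train, test)
```

Neither baseline looks at omics or chemical structure, which is why they serve as a reference point for the learned models.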
The code used to create these models can be found in source_code/models.
If you are not using the example data, and the datasets (download instructions are given below) are not at the paths specified in the code or have different file names, then the paths omic_dir_path, gdsc2_target_path, pubchem_ids_path and genomics_path need to be set in the code to point to where you have the datasets, with the matching file names.
The main_run_model.py file can be run for each model and testing type by specifying both as arguments. The arguments are model_type, split_type, and epochs. E.g.
python source_code/main_run_model.py tcnn c_blind 10
runs tCNNs with cancer blind splitting for 10 epochs.
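To make the idea of a blind split concrete, here is a minimal sketch. It assumes that "cancer blind" means no cell line appears in both the training and the test set; the function name, the test_fraction and seed parameters, and the toy pairs are hypothetical and are not the repository's split code.

```python
import random


def cancer_blind_split(pairs, test_fraction=0.2, seed=0):
    """Split (cell_line, drug) pairs so that test cell lines are unseen in training."""
    cell_lines = sorted({cell for cell, _ in pairs})
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(cell_lines)
    n_test = max(1, round(len(cell_lines) * test_fraction))
    test_cells = set(cell_lines[:n_test])
    train = [pair for pair in pairs if pair[0] not in test_cells]
    test = [pair for pair in pairs if pair[0] in test_cells]
    return train, test


# Toy pairs: 5 cell lines x 2 drugs.
pairs = [(cell, drug) for cell in ["A", "B", "C", "D", "E"] for drug in ["d1", "d2"]]
train_pairs, test_pairs = cancer_blind_split(pairs)
```

Splitting on cell lines rather than on individual pairs is what prevents a model from being scored on cell lines it has already seen with other drugs.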
Setting example_data = True runs the models on the example data provided (subset of the full dataset). To run the models with all data, follow the instructions in the datasets section, and note that the paths to the datasets may need to be set as described above.
The datasets needed to re-train the models are publicly available.
- Transcriptomics and genomics data can be downloaded from the Genomics of Drug Sensitivity in Cancer (GDSC) database https://www.cancerrxgene.org/ -> https://www.cancerrxgene.org/gdsc1000/GDSC1000_WebResources/Home.html Here the expression profiles, the genomic profiles and the metadata mapping the cell names (Annotated list of cell-lines) can be found. The expression profiles data file needs to be converted to csv and renamed to gdsc_expresstion_dat.csv to run out of the box. The cell names file also needs to be converted to csv and renamed to gdsc_cell_names.csv to run out of the box. (Alternatively, the file names can be changed in the read_rna_gdsc function in data_loading.py.) The genomic profiles come from the MULT omics Cancer functional events (CFEs) BEMs for cell-lines zip download; PANCAN_simple_MOBEM.rdata.tsv is the file required.
- Drug response data in the form of IC50 values can be downloaded from GDSC https://www.cancerrxgene.org/ -> https://www.cancerrxgene.org/downloads/bulk_download GDSC2 IC50 values are used for this study.
PubChem IDs and SMILES strings for the drugs in GDSC can be found at https://pubchem.ncbi.nlm.nih.gov/
Once downloaded, the paths to these datasets need to be set (see 'using the repository' above).
Train test splits are available in the data folder, and new train test splits can be generated using the train_test_split notebook.
A dict mapping drug names to SMILES strings can be created and saved using the create_drug_to_smiles_mapping_gdsc2 method from data_loading.py. This dict can then be read back in for later use.
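The save-once, read-later pattern can be sketched as follows. This is only an illustration: the on-disk format used by create_drug_to_smiles_mapping_gdsc2 is not stated here, so JSON, the helper names save_mapping and load_mapping, and the file name are all assumptions, and the two molecules are toy examples rather than GDSC drugs.

```python
import json
import tempfile
from pathlib import Path


def save_mapping(mapping, path):
    """Write a drug-name -> SMILES dict to disk as JSON (assumed format)."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(mapping, fh, indent=2, sort_keys=True)


def load_mapping(path):
    """Read the dict back for later use."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


# Toy molecules, not GDSC drugs.
drug_to_smiles = {"ethanol": "CCO", "benzene": "c1ccccc1"}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "drug_to_smiles.json"  # hypothetical file name
    save_mapping(drug_to_smiles, path)
    restored = load_mapping(path)
```

Caching the mapping this way avoids repeating the PubChem lookup every time the models are run.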
This can be done inside the BinaryET folder; see the readme inside the folder for further running instructions. The datasets required are the same as the ones outlined here.
The yml file gives the environment used, with package versions. The key packages are pytorch, torch_geometric, transformers, numpy, pandas and sklearn.
The goal of DRP is to predict how effective different drugs are for different cancer types. Here we predict the IC50 values, the concentration of a drug needed to inhibit the activity of a cell line by 50%, as a measure of efficacy. This is typically done using omics profiles of cell lines and chemical profiles of drugs.
Consider the training set
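To illustrate what an IC50 value means, the sketch below uses a simple Hill-type dose-response curve in which remaining activity falls to exactly one half at the IC50. The curve shape, the function name viability, the hill_slope parameter and the concentrations are assumptions chosen for illustration; this is not how GDSC fits its IC50 values.

```python
def viability(concentration, ic50, hill_slope=1.0):
    """Fraction of cell activity remaining at a given drug concentration.

    Simple Hill-type curve (an assumption of this sketch):
    activity is 1 at zero dose and 0.5 when concentration == ic50.
    """
    return 1.0 / (1.0 + (concentration / ic50) ** hill_slope)


ic50 = 2.0  # arbitrary units, toy value
low = viability(0.2, ic50)    # well below the IC50: most activity remains
half = viability(2.0, ic50)   # at the IC50: activity is halved
high = viability(20.0, ic50)  # well above the IC50: little activity remains
```

A lower IC50 therefore indicates a more potent drug for that cell line, since less of it is needed to halve the activity.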
Thus, we want to find a model,