
2019-ncovgroup/HTDockingDataInstructions


HT Docking COVID Data Prep Instructions

This document defines the vocabulary we use and the naming system that keeps track of changing proteins, novel states, and pocket locations.

Vocab:

  • Protein -- a particular protein (e.g. PLPro, NSP-15) from the COVID virus.
  • Version -- a particular PDB structure, generated either by light source or by deep drive MD. Each version includes a source tag.
  • Pocket_id -- an entry in the pocket_id file, which contains the coordinates for a particular protein and version of that protein.
  • Receptor/Target -- a protein + a version + a pocket_id. This combination results in an .oeb file, which is the only thing we can dock against.

Thus the formula for a target is:

{protein name}_{original source of structure}_{workflow_came_from}{version tracking}_pocket{id}.oeb
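The naming formula above can be sketched as a small helper. This is only an illustration: the function name `target_filename`, its parameter names, and the example values are hypothetical and do not come from the repository scripts. It assumes, as the formula is written, that the workflow tag and the version tag are joined with no separator.

```python
def target_filename(protein, source, workflow, version, pocket_id):
    """Build a target file name following the formula
    {protein}_{source}_{workflow}{version}_pocket{id}.oeb.

    All argument names are hypothetical; the formula itself is from the text.
    """
    return f"{protein}_{source}_{workflow}{version}_pocket{pocket_id}.oeb"


# Placeholder values, chosen only to show the shape of the result.
name = target_filename("PLPro", "srcA", "wfB", "v1", 3)
print(name)  # PLPro_srcA_wfBv1_pocket3.oeb
```

Building names through one function like this keeps the separator placement consistent across teams, which matters if the names are later split back into their parts.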

TODO: apply this naming scheme to the target names on the box drug screening: raw_data/docking_data_march_17. We are working to add the original source of structure to the pocket id csv so the names can be tagged.

Preparing pocket receptors

The input folder should contain the fp pocket output, laid out the same way as the data I was originally given.

python scripts/prepare_receptors.py /Users/austin/Downloads/Pocket-analysis-ANL-structures_with_receptors/

Aggregating data from the workflow teams

We currently get data from two groups. The steps below show how to ingest both sources, prepare them, and combine them.

Run the two gather scripts, then the combine script:

python scripts/balsam_data_gather.py /Users/austin/results-2020-03-17.pkl
python scripts/radical_data_gather.py {path_to_folder_with .out files}
python scripts/combine_source.py 

This produces out.{.pkl, .csv, .csv.gz}, all containing the same gathered and merged results. The script also picks which targets are OK to start ML on and writes them to dock_out_ml_v1.{.pkl, .csv, .csv.gz}.

I do not know which SMILES are canonical and which are not; this still needs to be handled.

TODO: add a merge index (is it canonical SMILES?) for descriptors/downstream tasks. SMILES is good; rec. ID scheme.

Details for ML

Use the dock_out_ml_v1.{.pkl, .csv, .csv.gz} files. From now on they will all be versioned and share the same format. Each row represents a single ligand, identified by the 'smiles' column. Every other column holds values for one receptor/target; the _dock suffix means the values are docking scores. Most columns are docking scores, though some end in _minimize or _mmgbsa, and those are very sparse.

In this file, you should probably only do machine learning on a single column at a time (multitask doesn't make too much sense, but I could be wrong). All failures and docking scores above zero were converted to zero. Some columns have a lot of failures, but those are informative, so don't remove the zeros. NaNs in the _minimize or _mmgbsa columns are different: a NaN indicates the task was not computed, so remove those rows.
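As a rough sketch of the advice above, the helper below selects one target column, keeps the zeros, and drops only the rows where the value is NaN. The function name `single_target_frame`, the column names, and the toy values are all made up for illustration; only the 'smiles' column and the suffix convention come from the text.

```python
import numpy as np
import pandas as pd


def single_target_frame(df, column):
    """Return 'smiles' plus one target column for single-task ML.

    Rows where the value is NaN (task not computed) are dropped.
    Zeros (failures or clipped positive scores) are kept on purpose.
    """
    sub = df[["smiles", column]]
    return sub.dropna(subset=[column]).reset_index(drop=True)


# Toy data, invented for this example only.
toy = pd.DataFrame({
    "smiles": ["CCO", "c1ccccc1", "CC(=O)O"],
    "PLPro_pocket1_dock": [-7.2, 0.0, -5.1],
    "PLPro_pocket1_mmgbsa": [np.nan, -30.5, np.nan],
})

dock = single_target_frame(toy, "PLPro_pocket1_dock")      # all 3 rows kept, including the zero
mmgbsa = single_target_frame(toy, "PLPro_pocket1_mmgbsa")  # only the computed row remains
```

In a real run, `toy` would be replaced by the table loaded from one of the dock_out_ml_v1 files.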

Columns are titled either 'smiles' or 'protein_pocketid_{dock, mmgbsa, minimize}'. The suffix names the kind of value stored; dock, of course, refers to the docking score.
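If the column titles need to be split back into their parts, something like the following could work. It is a sketch under one stated assumption: the pocket id contains no underscores, so the last two underscore-separated tokens are the pocket id and the value kind. The function name `parse_column` and the example column names are hypothetical.

```python
KINDS = ("dock", "mmgbsa", "minimize")


def parse_column(name):
    """Split 'protein_pocketid_kind' into (protein, pocket_id, kind).

    Returns None for the 'smiles' column. Assumes the pocket id has no
    underscores; the protein part may contain them.
    """
    if name == "smiles":
        return None
    head, _, kind = name.rpartition("_")
    if kind not in KINDS or not head:
        raise ValueError(f"unrecognized column name: {name!r}")
    protein, _, pocket_id = head.rpartition("_")
    if not protein or not pocket_id:
        raise ValueError(f"unrecognized column name: {name!r}")
    return protein, pocket_id, kind


# Hypothetical column name following the pattern in the text.
print(parse_column("PLPro_pocket1_dock"))  # ('PLPro', 'pocket1', 'dock')
```

Splitting from the right means a protein or source tag that itself contains underscores still ends up whole in the first element.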

Joining on smiles

There is some confusion here: SMILES are only canonical up to the method used to canonicalize them. I recommend canonicalizing both tables yourself right before the join, like this:

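from rdkit import Chem
import pandas as pd

def cannon_smile(smi):
    """Return the RDKit canonical SMILES for smi, or smi unchanged if it cannot be parsed."""
    try:
        mol = Chem.MolFromSmiles(smi)
        if mol is not None:
            return Chem.MolToSmiles(mol, canonical=True)
        print(smi)  # RDKit could not parse this SMILES
    except Exception:
        print('error', smi)
    return smi  # fall back to the original string

# df1 and df2 are the two tables to join; canonicalize both the same way first
df1['smiles'] = df1['smiles'].apply(cannon_smile)
df2['smiles'] = df2['smiles'].apply(cannon_smile)
merged = pd.merge(df1, df2, on='smiles')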
