HT Docking COVID Data Prep Instructions

This outlines the vocabulary we use and the system for keeping track of changing proteins, novel states, and pocket locations.

Vocab:

Protein -- a particular protein (e.g. PLPro, NSP-15) from the COVID virus.
Version -- a particular PDB structure, generated either by the light source or by deep drive MD. Each version includes a source tag.
Pocket_id -- listed in the pocket_id file, which contains the pocket coordinates for a particular protein and version of that protein.
Receptor/Target -- a protein + a version + a pocket_id. This yields an .oeb file, which is the only thing we can dock against.

Thus the formula for a target is:

{protein name}_{original source of structure}_{workflow_came_from}{version tracking}_pocket{id}.oeb
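
As a minimal sketch of the formula above, the helper below assembles a target filename from its parts. The function name `target_name` and its argument names are hypothetical, and the example values are placeholders, not real entries from the pocket_id file.

```python
def target_name(protein, source, workflow, version, pocket_id):
    """Build a target filename following the formula above.

    All argument names are illustrative. Note that the formula places no
    separator between the workflow tag and the version tracking string.
    """
    return f"{protein}_{source}_{workflow}{version}_pocket{pocket_id}.oeb"


# Placeholder values, not real targets.
example = target_name("PLPro", "src", "wf", "v1", 3)
print(example)
```

Because the workflow tag and version string are not separated, going the other way (parsing a name back into its parts) is ambiguous without extra conventions, so this sketch only builds names.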

TODO: this still needs to be done for the target names on the box drug screening folder: raw_data/docking_data_march_17. We are working to add the original source of structure to the pocket id csv so the names can be tagged.

Preparing pocket receptors

The input folder should contain the fpocket output, laid out the same way as the folder I was originally given.

python scripts/prepare_receptors.py /Users/austin/Downloads/Pocket-analysis-ANL-structures_with_receptors/

Aggregating data from the workflow teams

We currently get data from two groups. The steps below show how to ingest both, prepare them, and combine them.

Run the two gather scripts, then the combine script:

python scripts/balsam_data_gather.py /Users/austin/results-2020-03-17.pkl
python scripts/radical_data_gather.py {path_to_folder_with .out files}
python scripts/combine_source.py

This produces out.{pkl, csv, csv.gz}, all containing the same gathered and merged results. The script also picks which targets are OK to start ML on and writes dock_out_ml_v1.{pkl, csv, csv.gz}.

I do not know which SMILES are canonical and which are not; this still needs to be handled.

TODO: add a merge index (is it canonical SMILES?) for descriptors/downstream tasks. SMILES works as a key for now, but I recommend an ID scheme.

Details for ML

Use the dock_out_ml_v1.{pkl, csv, csv.gz} files. They will all be versioned from now on and share the same format. Each row represents a single ligand, identified by the 'smiles' column. Every other column holds values for one receptor/target; a _dock suffix means the values are docking scores. Most columns are docking scores, though some end in _minimize or _mmgbsa, and those are very sparse.

In this file, you should probably only do machine learning on a single column at a time (multitask doesn't make much sense, but I could be wrong). All failures and all docking scores above zero were converted to zero. Some columns have a lot of failures, but those are informative, so don't remove the zeros. NaNs in the _minimize or _mmgbsa columns are different: NaN indicates the task was not computed, and those rows should be removed.
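
The handling described above could be sketched with pandas as follows. The toy DataFrame, its column names (`protA_pocket1_dock`, `protA_pocket1_mmgbsa`), and the helper `single_task` are all made up for illustration; a real table would come from loading one of the dock_out_ml_v1 files.

```python
import numpy as np
import pandas as pd

# Toy stand-in for dock_out_ml_v1; column names and values are hypothetical.
df = pd.DataFrame({
    "smiles": ["CCO", "CCN", "c1ccccc1"],
    "protA_pocket1_dock": [-7.2, 0.0, -5.1],          # 0.0 marks a failure or a score above zero
    "protA_pocket1_mmgbsa": [np.nan, -30.5, np.nan],  # NaN means the task was not computed
})


def single_task(frame, column):
    """Return a (smiles, column) table for one target column.

    Rows with NaN in the column are dropped (task not computed);
    zeros are kept because failures are informative.
    """
    return frame[["smiles", column]].dropna(subset=[column]).reset_index(drop=True)


dock = single_task(df, "protA_pocket1_dock")      # all three rows survive, zero included
mmgbsa = single_task(df, "protA_pocket1_mmgbsa")  # only the computed row survives
print(len(dock), len(mmgbsa))
```

Dropping NaNs per column, not across the whole table, avoids throwing away ligands that have docking scores but no _minimize or _mmgbsa values.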

Columns are titled 'smiles' or 'protein_pocketid_{dock, mmgbsa, minimize}'. The suffix names the type of value, with dock referring to docking score.
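
To pick out columns by value type, one option is to group them on the trailing suffix. This is a rough sketch in plain Python; `split_columns` and the example column names are hypothetical, and it assumes the suffix is always the part after the last underscore.

```python
def split_columns(columns):
    """Group non-'smiles' column names by their trailing value type."""
    groups = {"dock": [], "mmgbsa": [], "minimize": []}
    for name in columns:
        if name == "smiles":
            continue
        # Assumption: the value type is whatever follows the last underscore.
        suffix = name.rsplit("_", 1)[-1]
        if suffix in groups:
            groups[suffix].append(name)
    return groups


# Hypothetical column names in the documented pattern.
cols = ["smiles", "protA_pocket1_dock", "protA_pocket1_mmgbsa", "protB_pocket2_dock"]
groups = split_columns(cols)
print(groups["dock"])
```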

Joining on smiles

There is some confusion here: SMILES are canonical only with respect to the method used to canonicalize them. I recommend canonicalizing both tables yourself right before the join, like this:

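from rdkit import Chem
import pandas as pd

def cannon_smile(smi):
    """Return the RDKit canonical SMILES, or the input unchanged if it cannot be parsed."""
    try:
        mol = Chem.MolFromSmiles(smi)
        if mol is not None:
            return Chem.MolToSmiles(mol, canonical=True)
        # RDKit returns None for SMILES it cannot parse
        print('could not parse', smi)
    except Exception:
        # e.g. a non-string value such as NaN
        print('error', smi)
    return smi

# Canonicalize both tables the same way before joining
df1['smiles'] = df1['smiles'].apply(cannon_smile)
df2['smiles'] = df2['smiles'].apply(cannon_smile)
merged = pd.merge(df1, df2, on='smiles')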
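
One hedged way to check how many ligands fail to match in a join on smiles is an outer merge with pandas' `indicator=True` option. The two toy frames below are invented for illustration and stand in for tables whose 'smiles' columns have already been canonicalized.

```python
import pandas as pd

# Toy frames; in practice both would already have canonicalized 'smiles'.
df1 = pd.DataFrame({"smiles": ["CCO", "CCN"], "a": [1, 2]})
df2 = pd.DataFrame({"smiles": ["CCO", "CCC"], "b": [3, 4]})

# An outer merge with indicator=True adds a '_merge' column marking
# whether each row came from the left frame, the right frame, or both.
check = pd.merge(df1, df2, on="smiles", how="outer", indicator=True)
counts = check["_merge"].value_counts()
print(counts)
```

A large number of left_only or right_only rows would suggest the two tables were canonicalized differently.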
