This outlines the vocabulary we use and the naming system for keeping track of changing proteins, novel states, and pocket locations.
- Protein -- a particular protein (e.g. PLPro, NSP-15) from the COVID virus.
- Version -- a particular pdb structure, generated either by light source or by deep drive MD. Each version includes a source tag.
- Pocket_id -- these are listed in the pocket_id file, which contains coordinates for a particular protein and version of that protein.
- Receptor/Target -- a protein + a version + a pocket_id. This results in an .oeb file, which is the only thing we can dock against.
Thus the formula for a target is:
{protein name}_{original source of structure}_{workflow_came_from}{version tracking}_pocket{id}.oeb
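As a rough illustration of the formula above, the sketch below assembles a target file name from its parts. The class name `TargetName`, its field names, and the example values are all hypothetical and chosen only for illustration; the real source, workflow, and version tags come from the pocket id csv.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TargetName:
    """Parts of a target name (hypothetical helper, not part of scripts/)."""
    protein: str    # e.g. "PLPro"
    source: str     # original source of the structure
    workflow: str   # workflow the structure came from
    version: str    # version tracking tag
    pocket_id: int  # id from the pocket_id file

    def filename(self) -> str:
        # Mirrors the formula: workflow and version are joined with no separator.
        return (
            f"{self.protein}_{self.source}_"
            f"{self.workflow}{self.version}_pocket{self.pocket_id}.oeb"
        )


# Made-up values, for illustration only.
example = TargetName("PLPro", "lightsource", "wf", "v1", 3)
print(example.filename())
```

Note that if any individual field contains an underscore, the resulting name can no longer be split back into its parts unambiguously.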
TODO: this needs to be done for the target names on the box drug screening: raw_data/docking_data_march_17. We are working to add the original source of structure to the pocket id csv so the names can be tagged.
The input should be a folder containing fp pocket output, laid out like the one I was originally given:
python scripts/prepare_receptors.py /Users/austin/Downloads/Pocket-analysis-ANL-structures_with_receptors/
We currently get data from two groups. I will show how to ingest, prepare, and combine both.
Run the two gather scripts, one per group:
python scripts/balsam_data_gather.py /Users/austin/results-2020-03-17.pkl
python scripts/radical_data_gather.py {path_to_folder_with .out files}
Then combine them:
python scripts/combine_source.py
This produces out.{pkl, csv, csv.gz}, all containing the same gathered and merged results. The script also picks which targets are OK to start ML on and writes them to dock_out_ml_v1.{pkl, csv, csv.gz}.
I do not know which smiles are canonical and which are not; this still needs to be handled.
TODO: add merge index (is it canonical smiles?) for descriptors/downstream tasks. Smiles is good, rec ID scheme.
Use the dock_out_ml_v1.{pkl, csv, csv.gz} files. They will all be versioned from now on and kept in the same format. Each row represents a single ligand, identified by the 'smiles' column. Every other column holds values for one receptor/target, with a _dock suffix meaning docking score. Most columns are docking scores, though some end in _minimize or _mmgbsa, and those are very sparse.
In this file, you should probably only do machine learning on a single column at a time (multitask doesn't make too much sense, but I could be wrong). All failures and docking scores above zero were converted to zero. Some targets have a lot of failures, but those are informative, so don't remove the zeros. NaNs in the _minimize or _mmgbsa columns are different: a NaN indicates the task was not computed, so those rows should be removed.
Columns are titled 'smiles' or 'protein_pocketid_{dock, mmgbsa, minimize}'. The suffix says which value the column holds, with dock referring to the docking score.
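The column rules above can be sketched roughly as follows. This is a minimal example on made-up data, not the real dock_out_ml_v1 file: the helper names `split_column` and `single_task_frame`, the column names, and the scores are all hypothetical. It keeps zeros in every column and drops NaN rows only for _minimize and _mmgbsa columns.

```python
import numpy as np
import pandas as pd


def split_column(name):
    """Split 'protein_pocketid_kind' into (target, kind) at the last underscore."""
    target, kind = name.rsplit("_", 1)
    return target, kind


def single_task_frame(df, column):
    """Return 'smiles' plus one value column for single-column ML."""
    sub = df[["smiles", column]]
    _, kind = split_column(column)
    if kind in ("minimize", "mmgbsa"):
        # NaN means the task was not computed, so drop those rows.
        sub = sub.dropna(subset=[column])
    # Zeros (failures or clipped positive scores) are kept on purpose.
    return sub.reset_index(drop=True)


# Made-up toy data standing in for dock_out_ml_v1.
toy = pd.DataFrame({
    "smiles": ["CCO", "c1ccccc1", "CC(=O)O"],
    "PLPro_pocket3_dock": [-7.2, 0.0, -5.1],
    "PLPro_pocket3_mmgbsa": [np.nan, np.nan, -31.4],
})

dock = single_task_frame(toy, "PLPro_pocket3_dock")
mmgbsa = single_task_frame(toy, "PLPro_pocket3_mmgbsa")
print(len(dock), len(mmgbsa))
```

For the real files, `toy` would be replaced by something like `pd.read_csv` on the csv output, assuming the column layout described above.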
There is some confusion here: smiles are canonical only up to the method used to canonicalize them. I recommend canonicalizing both tables with the same method right before the join, like this:
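from rdkit import Chem
import pandas as pd


def cannon_smile(smi):
    """Return the RDKit canonical SMILES for smi, or smi unchanged on failure."""
    try:
        cannon = Chem.MolFromSmiles(smi)
        if cannon is not None:
            return Chem.MolToSmiles(cannon, canonical=True)
        else:
            # RDKit could not parse this SMILES; print it and fall through.
            print(smi)
    except Exception:
        print('error', smi)
    return smi


# df1 and df2 are the two DataFrames to join; each has a 'smiles' column.
df1.smiles = df1.smiles.apply(cannon_smile)
df2.smiles = df2.smiles.apply(cannon_smile)
pd.merge(df1, df2, on='smiles')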
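One way to check how well the join worked is an outer merge with pandas' `indicator=True`, which adds a `_merge` column marking each row as `both`, `left_only`, or `right_only`. The sketch below uses made-up frames and column names, and assumes the smiles have already been canonicalized the same way on both sides.

```python
import pandas as pd

# Made-up toy frames; smiles are assumed to be canonicalized already.
df1 = pd.DataFrame({"smiles": ["CCO", "c1ccccc1"], "a_dock": [-6.0, 0.0]})
df2 = pd.DataFrame({"smiles": ["CCO", "CC(=O)O"], "b_dock": [-4.5, -3.0]})

# The outer merge keeps unmatched rows so they can be counted.
audit = pd.merge(df1, df2, on="smiles", how="outer", indicator=True)
counts = audit["_merge"].value_counts().to_dict()
print(counts)

# Smiles that appear in only one of the two tables.
unmatched = audit.loc[audit["_merge"] != "both", "smiles"].tolist()
print(unmatched)
```

A large number of unmatched rows may be a hint that the two sources were canonicalized differently.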