# DIPS-Plus

An enhanced version of the Database of Interacting Protein Structures (DIPS)

## How to run creation tools

First, install and configure the Conda environment:

```bash
# Clone the project:
git clone https://github.com/amorehead/DIPS-Plus

# Change to the project directory:
cd DIPS-Plus

# (If on an HPC cluster) Load the 'open-ce' module:
module load open-ce-1.1.3-py38-0

# (If on an HPC cluster) Clone the provided 'open-ce' environment into a new Conda environment:
conda create --name DIPS-Plus --clone open-ce-1.1.3-py38-0

# (If on an HPC cluster - Optional) Create the Conda environment in a particular directory using the provided 'open-ce' environment:
conda create --prefix MY-VENV-DIR --clone open-ce-1.1.3-py38-0

# (Else, if on a local machine) Set up the Conda environment locally:
conda env create --name DIPS-Plus -f environment.yml

# (Else, if on a local machine - Optional) Create the Conda environment in a particular directory using the local 'environment.yml' file:
conda env create --prefix MY-VENV-DIR -f environment.yml

# Activate the newly-created 'DIPS-Plus' Conda environment:
conda activate DIPS-Plus

# (Optional) Activate a Conda environment created in a particular directory:
conda activate MY-VENV-DIR

# (Optional) Deactivate the currently-activated Conda environment:
conda deactivate

# (If on a local machine - Optional) Perform a full update of the Conda environment described in 'environment.yml':
conda env update -f environment.yml --prune

# (Optional) To remove the long environment prefix from your shell prompt, modify the env_prompt setting in your .condarc file:
conda config --set env_prompt '({name})'
```
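
Optionally, you can verify that the environment resolved correctly before continuing. The commands below are standard Conda and Python utilities rather than project-specific tooling:

```bash
# List all Conda environments; the active one is marked with an asterisk:
conda env list

# Confirm the expected interpreter version (the 'open-ce' builds above target Python 3.8):
python3 --version
```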

(If on an HPC cluster) Install all project dependencies:

```bash
# Install the project as an editable pip dependency in the currently-activated Conda environment:
pip3 install -e .

# Install external pip dependencies in the currently-activated Conda environment:
pip3 install -r requirements.txt

# Install the pip dependencies used for unit testing in the currently-activated Conda environment:
pip3 install -r tests/requirements.txt
```
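
To double-check that the pinned requirements resolved cleanly, `pip check` (a standard pip subcommand, not a project script) reports any broken or conflicting dependencies:

```bash
# Prints "No broken requirements found." when the installed packages are mutually compatible:
pip3 check
```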

## How to compile DIPS-Plus from scratch

Retrieve protein complexes from the RCSB PDB:

```bash
# Remove all existing training/testing sample lists:
rm project/datasets/DIPS/final/raw/pairs-postprocessed.txt project/datasets/DIPS/final/raw/pairs-postprocessed-train.txt project/datasets/DIPS/final/raw/pairs-postprocessed-val.txt project/datasets/DIPS/final/raw/pairs-postprocessed-test.txt

# Create the data directories (if not already created):
mkdir project/datasets/DIPS/raw project/datasets/DIPS/raw/pdb project/datasets/DIPS/interim project/datasets/DIPS/interim/external_feats project/datasets/DIPS/final project/datasets/DIPS/final/raw project/datasets/DIPS/final/processed

# Download the raw PDB files:
rsync -rlpt -v -z --delete --port=33444 --include='*.gz' --include='*.xz' --include='*/' --exclude '*' \
rsync.rcsb.org::ftp_data/biounit/coordinates/divided/ project/datasets/DIPS/raw/pdb

# Extract the raw PDB files:
python3 project/datasets/builder/extract_raw_pdb_gz_archives.py project/datasets/DIPS/raw/pdb

# Process the raw PDB data into associated pair files:
python3 project/datasets/builder/make_dataset.py project/datasets/DIPS/raw/pdb project/datasets/DIPS/interim --num_cpus 28 --source_type rcsb --bound

# Apply additional filtering criteria:
python3 project/datasets/builder/prune_pairs.py project/datasets/DIPS/interim/pairs project/datasets/DIPS/filters project/datasets/DIPS/interim/pairs-pruned --num_cpus 28

# Generate externally-sourced features:
python3 project/datasets/builder/generate_psaia_features.py "$PSAIADIR" "$PROJDIR"/project/datasets/builder/psaia_config_file_dips.txt "$PROJDIR"/project/datasets/DIPS/raw/pdb "$PROJDIR"/project/datasets/DIPS/interim/parsed "$PROJDIR"/project/datasets/DIPS/interim/pairs-pruned "$PROJDIR"/project/datasets/DIPS/interim/external_feats --source_type rcsb
python3 project/datasets/builder/generate_hhsuite_features.py "$PROJDIR"/project/datasets/DIPS/interim/parsed "$PROJDIR"/project/datasets/DIPS/interim/pairs-pruned "$HHSUITE_DB" "$PROJDIR"/project/datasets/DIPS/interim/external_feats --num_cpu_jobs 4 --num_cpus_per_job 8 --num_iter 2 --source_type rcsb --write_file

# Add new features to the filtered pairs, ensuring that the pruned pairs' original PDB files are stored locally for DSSP:
python3 project/datasets/builder/download_missing_pruned_pair_pdbs.py "$PROJDIR"/project/datasets/DIPS/raw/pdb "$PROJDIR"/project/datasets/DIPS/interim/pairs-pruned --num_cpus 32 --rank "$1" --size "$2"
python3 project/datasets/builder/postprocess_pruned_pairs.py "$PROJDIR"/project/datasets/DIPS/raw/pdb "$PROJDIR"/project/datasets/DIPS/interim/pairs-pruned "$PROJDIR"/project/datasets/DIPS/interim/external_feats "$PROJDIR"/project/datasets/DIPS/final/raw --num_cpus 32 --full-run

# Downsample the negative class, partition the dataset filenames, aggregate statistics, and impute missing features:
python3 project/datasets/builder/downsample_negative_class.py "$PROJDIR"/project/datasets/DIPS/final/raw --source_type rcsb --num_cpus 32 --rank "$1" --size "$2"
python3 project/datasets/builder/partition_dataset_filenames.py "$PROJDIR"/project/datasets/DIPS/final/raw --source_type rcsb --filter_by_seq_length True --max_seq_length 1000 --rank "$1" --size "$2"
python3 project/datasets/builder/collect_dataset_statistics.py "$PROJDIR"/project/datasets/DIPS/final/raw --rank "$1" --size "$2"
python3 project/datasets/builder/impute_missing_feature_values.py "$PROJDIR"/project/datasets/DIPS/final/raw --num_cpus 32 --rank "$1" --size "$2"
```
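
The feature-generation and postprocessing commands above rely on the environment variables `$PROJDIR`, `$PSAIADIR`, and `$HHSUITE_DB`, and on the positional arguments `$1` and `$2`, which suggests they are intended to be saved in a shell script and launched with a worker rank and a total worker count. The sketch below only illustrates that setup; the script name `compile_dips.sh` and all paths are placeholders rather than files shipped with the project:

```bash
# Placeholder values - adjust to your own installation:
export PROJDIR=/path/to/DIPS-Plus        # root of this repository
export PSAIADIR=/path/to/PSAIA           # directory of your PSAIA installation
export HHSUITE_DB=/path/to/hhsuite_db    # an HH-suite-compatible sequence database

# If the commands above are saved as compile_dips.sh, "$1" is this worker's rank and
# "$2" is the total number of workers, so a single-worker run would be:
bash compile_dips.sh 0 1
```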

## How to assemble DB5-Plus

Fetch prepared protein complexes from Dataverse:

```bash
# Download the prepared DB5 files:
wget -O project/datasets/DB5.tar.gz https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=doi:10.7910/DVN/H93ZKK/BXXQCG

# Extract the downloaded DB5 archive:
tar -xzf project/datasets/DB5.tar.gz --directory project/datasets/

# Remove the (now) redundant DB5 archive and other miscellaneous files:
rm project/datasets/DB5.tar.gz project/datasets/DB5/.README.swp
rm -rf project/datasets/DB5/interim project/datasets/DB5/processed

# Create the relevant interim and final data directories:
mkdir project/datasets/DB5/interim project/datasets/DB5/interim/external_feats
mkdir project/datasets/DB5/final project/datasets/DB5/final/raw project/datasets/DB5/final/processed

# Construct the DB5 dataset pairs:
python3 project/datasets/builder/make_dataset.py "$PROJDIR"/project/datasets/DB5/raw "$PROJDIR"/project/datasets/DB5/interim --num_cpus 32 --source_type db5 --unbound

# Generate externally-sourced features:
python3 project/datasets/builder/generate_psaia_features.py "$PSAIADIR" "$PROJDIR"/project/datasets/builder/psaia_config_file_db5.txt "$PROJDIR"/project/datasets/DB5/raw "$PROJDIR"/project/datasets/DB5/interim/parsed "$PROJDIR"/project/datasets/DB5/interim/parsed "$PROJDIR"/project/datasets/DB5/interim/external_feats --source_type db5
python3 project/datasets/builder/generate_hhsuite_features.py "$PROJDIR"/project/datasets/DB5/interim/parsed "$PROJDIR"/project/datasets/DB5/interim/parsed "$HHSUITE_DB" "$PROJDIR"/project/datasets/DB5/interim/external_feats --num_cpu_jobs 4 --num_cpus_per_job 8 --num_iter 2 --source_type db5 --write_file

# Add new features to the filtered pairs:
python3 project/datasets/builder/postprocess_pruned_pairs.py "$PROJDIR"/project/datasets/DB5/raw "$PROJDIR"/project/datasets/DB5/interim/pairs "$PROJDIR"/project/datasets/DB5/interim/external_feats "$PROJDIR"/project/datasets/DB5/final/raw --num_cpus 32 --source_type db5 --full-run

# Prepackage labels, partition the dataset filenames, aggregate statistics, and impute missing features:
python3 project/datasets/builder/downsample_negative_class.py "$PROJDIR"/project/datasets/DB5/final/raw --source_type rcsb --num_cpus 32 --rank "$1" --size "$2"
python3 project/datasets/builder/partition_dataset_filenames.py "$PROJDIR"/project/datasets/DB5/final/raw --source_type db5 --rank "$1" --size "$2"
python3 project/datasets/builder/collect_dataset_statistics.py "$PROJDIR"/project/datasets/DB5/final/raw --rank "$1" --size "$2"
python3 project/datasets/builder/impute_missing_feature_values.py "$PROJDIR"/project/datasets/DB5/final/raw --num_cpus 32 --rank "$1" --size "$2"
```
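
The pair files produced by the pipeline are serialized with `dill`, so a quick way to spot-check the output is to load one back into Python. The path below is a placeholder; substitute any '.dill' pair file that actually exists under your interim or final data directories:

```python
import dill

# Placeholder path - point this at a pair file produced by the steps above:
pair_path = "project/datasets/DB5/final/raw/some_pair.dill"

with open(pair_path, "rb") as f:
    pair = dill.load(f)

# Inspect what was stored for this complex:
print(type(pair))
print(pair)
```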

## Python 2 to 3 pickle file solution

While using Python 3 in this project, you may encounter the following error when postprocessing '.dill' pruned pairs that were created with Python 2:

`ModuleNotFoundError: No module named 'dill.dill'`

1. To resolve it, ensure that the 'dill' package's version is greater than 0.3.2.
2. If the problem persists, edit the pickle.py file corresponding to your Conda environment's Python 3 installation (e.g. ~/DIPS-Plus/venv/lib/python3.8/pickle.py) and add the statement

```python
if module == 'dill.dill': module = 'dill._dill'
```

to the end of the

```python
if self.proto < 3 and self.fix_imports:
```

block in the Unpickler class's find_class() function (e.g. line 1577 of Python 3.8.5's pickle.py).
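
If you would rather not patch the standard library's pickle.py in place, the same module-name remapping can be done in user code by overriding the documented `pickle.Unpickler.find_class()` hook. The helper below is only a sketch of that alternative; the class and function names are illustrative, not part of this project:

```python
import pickle


class DillCompatUnpickler(pickle.Unpickler):
    """Unpickler that remaps the legacy Python 2 module name 'dill.dill'
    to its Python 3 location, 'dill._dill', while loading."""

    def find_class(self, module, name):
        if module == 'dill.dill':
            module = 'dill._dill'
        return super().find_class(module, name)


def load_legacy_pair(path):
    """Load a '.dill' pair file that was originally written under Python 2."""
    with open(path, 'rb') as f:
        return DillCompatUnpickler(f).load()
```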

## Citation

```
@article{DIPS-Plus,
    title={DIPS-Plus: The Enhanced Database of Interacting Protein Structures},
    author={Morehead, Alex and Chen, Chen and Cheng, Jianlin},
    journal={NeurIPS},
    year={2021}
}
```