
Commit f9206a5

Add first commit in batch
1 parent 369bf3c commit f9206a5

33 files changed: +530,323 -26 lines

.gitignore (+37 -25)

@@ -2,6 +2,7 @@
 __pycache__/
 *.py[cod]
 *$py.class
+.github
 
 # C extensions
 *.so
@@ -20,13 +21,21 @@ parts/
 sdist/
 var/
 wheels/
-pip-wheel-metadata/
-share/python-wheels/
 *.egg-info/
 .installed.cfg
 *.egg
 MANIFEST
 
+# Lightning /research
+test_tube_exp/
+tests/tests_tt_dir/
+tests/save_dir
+default/
+test_tube_logs/
+test_tube_data/
+model_weights/
+processed/
+
 # PyInstaller
 # Usually these files are written by a python script from a template
 # before PyInstaller builds the exe, so as to inject date/other infos into it.
@@ -40,14 +49,12 @@ pip-delete-this-directory.txt
 # Unit test / coverage reports
 htmlcov/
 .tox/
-.nox/
 .coverage
 .coverage.*
 .cache
 nosetests.xml
 coverage.xml
 *.cover
-*.py,cover
 .hypothesis/
 .pytest_cache/
 
@@ -59,7 +66,6 @@ coverage.xml
 *.log
 local_settings.py
 db.sqlite3
-db.sqlite3-journal
 
 # Flask stuff:
 instance/
@@ -77,26 +83,11 @@ target/
 # Jupyter Notebook
 .ipynb_checkpoints
 
-# IPython
-profile_default/
-ipython_config.py
-
 # pyenv
 .python-version
 
-# pipenv
-# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
-# However, in case of collaboration, if having platform-specific dependencies or dependencies
-# having no cross-platform support, pipenv may install dependencies that don't work, or not
-# install all needed dependencies.
-#Pipfile.lock
-
-# PEP 582; used by e.g. github.com/David-OConnor/pyflow
-__pypackages__/
-
-# Celery stuff
+# celery beat schedule file
 celerybeat-schedule
-celerybeat.pid
 
 # SageMath parsed files
 *.sage.py
@@ -109,6 +100,9 @@ venv/
 ENV/
 env.bak/
 venv.bak/
+.conda/
+miniconda3
+venv.tar.gz
 
 # Spyder project settings
 .spyderproject
@@ -122,8 +116,26 @@ venv.bak/
 
 # mypy
 .mypy_cache/
-.dmypy.json
-dmypy.json
 
-# Pyre type checker
-.pyre/
+# IDEs
+.idea
+.vscode
+
+# DIPS
+project/datasets/DIPS/complexes/
+project/datasets/DIPS/interim/**
+project/datasets/DIPS/pairs/
+project/datasets/DIPS/parsed/
+project/datasets/DIPS/ptt/
+project/datasets/DIPS/raw/**
+project/datasets/DIPS/final/raw/**
+project/datasets/DIPS/final/raw.tar.gz
+project/datasets/DIPS/final/raw (copy)
+
+# DB5
+project/datasets/DB5/processed
+project/datasets/DB5/README
+project/datasets/DB5/raw/**
+project/datasets/DB5/interim/**
+project/datasets/DB5/final/raw
+project/datasets/DB5/final/raw (copy)/**

LICENSE (+13)

@@ -672,3 +672,16 @@ may consider it more useful to permit linking proprietary applications with
 the library. If this is what you want to do, use the GNU Lesser General
 Public License instead of this License. But first, please read
 <https://www.gnu.org/licenses/why-not-lgpl.html>.
+
+-----------------------------------------------------------------------------------------------------------------------
+DIPS-Plus makes use of source code derived from DIPS (https://github.com/drorlab/DIPS)
+for compiling our new version of the DIPS dataset. The MIT licensing for DIPS is as follows:
+
+The MIT License (MIT)
+Copyright (c) 2019, Patricia Suriana
+
+Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

README.md (+168 -1)
@@ -1,2 +1,169 @@
 # DIPS-Plus
-The Enhanced Database of Interacting Protein Structures
+
+The enhanced Database of Interacting Protein Structures (DIPS)
+
+## How to run creation tools
+
+First, install and configure the Conda environment:
+
+```bash
+# Clone project:
+git clone https://github.com/amorehead/DIPS-Plus
+
+# Change to project directory:
+cd DIPS-Plus
+
+# (If on HPC cluster) Load 'open-ce' module
+module load open-ce-1.1.3-py38-0
+
+# (If on HPC cluster) Clone Conda environment into this directory using provided 'open-ce' environment:
+conda create --name DIPS-Plus --clone open-ce-1.1.3-py38-0
+
+# (If on HPC cluster - Optional) Create Conda environment in a particular directory using provided 'open-ce' environment:
+conda create --prefix MY-VENV-DIR --clone open-ce-1.1.3-py38-0
+
+# (Else, if on local machine) Set up Conda environment locally:
+conda env create --name DIPS-Plus -f environment.yml
+
+# (Else, if on local machine - Optional) Create Conda environment in a particular directory using local 'environment.yml' file:
+conda env create --prefix MY-VENV-DIR -f environment.yml
+
+# Activate the newly-created Conda environment:
+conda activate DIPS-Plus
+
+# (Optional) Activate a Conda environment located in another directory:
+conda activate MY-VENV-DIR
+
+# (Optional) Deactivate the currently-activated Conda environment:
+conda deactivate
+
+# (If on local machine - Optional) Perform a full update on the Conda environment described in 'environment.yml':
+conda env update -f environment.yml --prune
+
+# (Optional) To remove this long prefix in your shell prompt, modify the env_prompt setting in your .condarc file with:
+conda config --set env_prompt '({name})'
+```
+
+(If on HPC cluster) Install all project dependencies:
+
+```bash
+# Install project as a pip dependency in the Conda environment currently activated:
+pip3 install -e .
+
+# Install external pip dependencies in the Conda environment currently activated:
+pip3 install -r requirements.txt
+
+# Install pip dependencies used for unit testing in the Conda environment currently activated:
+pip3 install -r tests/requirements.txt
+```
+
+## How to compile DIPS-Plus from scratch
+
+Retrieve protein complexes from the RCSB PDB:
+
+```bash
+# Remove all existing training/testing sample lists
+rm project/datasets/DIPS/final/raw/pairs-postprocessed.txt project/datasets/DIPS/final/raw/pairs-postprocessed-train.txt project/datasets/DIPS/final/raw/pairs-postprocessed-val.txt project/datasets/DIPS/final/raw/pairs-postprocessed-test.txt
+
+# Create data directories (if not already created):
+mkdir project/datasets/DIPS/raw project/datasets/DIPS/raw/pdb project/datasets/DIPS/interim project/datasets/DIPS/interim/external_feats project/datasets/DIPS/final project/datasets/DIPS/final/raw project/datasets/DIPS/final/processed
+
+# Download the raw PDB files:
+rsync -rlpt -v -z --delete --port=33444 --include='*.gz' --include='*.xz' --include='*/' --exclude '*' \
+rsync.rcsb.org::ftp_data/biounit/coordinates/divided/ project/datasets/DIPS/raw/pdb
+
+# Extract the raw PDB files:
+python3 project/datasets/builder/extract_raw_pdb_gz_archives.py project/datasets/DIPS/raw/pdb
+
+# Process the raw PDB data into associated pair files:
+python3 project/datasets/builder/make_dataset.py project/datasets/DIPS/raw/pdb project/datasets/DIPS/interim --num_cpus 28 --source_type rcsb --bound
+
+# Apply additional filtering criteria:
+python3 project/datasets/builder/prune_pairs.py project/datasets/DIPS/interim/pairs project/datasets/DIPS/filters project/datasets/DIPS/interim/pairs-pruned --num_cpus 28
+
+# Generate externally-sourced features:
+python3 project/datasets/builder/generate_psaia_features.py "$PSAIADIR" "$PROJDIR"/project/datasets/builder/psaia_config_file_dips.txt "$PROJDIR"/project/datasets/DIPS/raw/pdb "$PROJDIR"/project/datasets/DIPS/interim/parsed "$PROJDIR"/project/datasets/DIPS/interim/pairs-pruned "$PROJDIR"/project/datasets/DIPS/interim/external_feats --source_type rcsb
+python3 project/datasets/builder/generate_hhsuite_features.py "$PROJDIR"/project/datasets/DIPS/interim/parsed "$PROJDIR"/project/datasets/DIPS/interim/pairs-pruned "$HHSUITE_DB" "$PROJDIR"/project/datasets/DIPS/interim/external_feats --num_cpu_jobs 4 --num_cpus_per_job 8 --num_iter 2 --source_type rcsb --write_file
+
+# Add new features to the filtered pairs, ensuring that the pruned pairs' original PDB files are stored locally for DSSP:
+python3 project/datasets/builder/download_missing_pruned_pair_pdbs.py "$PROJDIR"/project/datasets/DIPS/raw/pdb "$PROJDIR"/project/datasets/DIPS/interim/pairs-pruned --num_cpus 32 --rank "$1" --size "$2"
+python3 project/datasets/builder/postprocess_pruned_pairs.py "$PROJDIR"/project/datasets/DIPS/raw/pdb "$PROJDIR"/project/datasets/DIPS/interim/pairs-pruned "$PROJDIR"/project/datasets/DIPS/interim/external_feats "$PROJDIR"/project/datasets/DIPS/final/raw --num_cpus 32 --full-run
+
+# Downsample negative class, partition dataset filenames, aggregate statistics, and impute missing features
+python3 project/datasets/builder/downsample_negative_class.py "$PROJDIR"/project/datasets/DIPS/final/raw --source_type rcsb --num_cpus 32 --rank "$1" --size "$2"
+python3 project/datasets/builder/partition_dataset_filenames.py "$PROJDIR"/project/datasets/DIPS/final/raw --source_type rcsb --filter_by_seq_length True --max_seq_length 1000 --rank "$1" --size "$2"
+python3 project/datasets/builder/collect_dataset_statistics.py "$PROJDIR"/project/datasets/DIPS/final/raw --rank "$1" --size "$2"
+python3 project/datasets/builder/impute_missing_feature_values.py "$PROJDIR"/project/datasets/DIPS/final/raw --num_cpus 32 --rank "$1" --size "$2"
+```
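The commands above reference a few shell variables and positional arguments that are not defined in this README: "$PROJDIR" (the repository root), "$PSAIADIR" (where PSAIA is installed), "$HHSUITE_DB" (the HH-suite sequence database), and "$1"/"$2" (which feed --rank and --size, presumably for splitting work across nodes). A minimal sketch of how they might be set before running the pipeline; every path below is a placeholder, not the project's actual location:

```bash
# Hypothetical values for the variables referenced above; adjust each path to your system.
export PROJDIR="$HOME/DIPS-Plus"                                   # repository root
export PSAIADIR="$HOME/programs/PSAIA/bin"                         # directory containing the PSAIA executables
export HHSUITE_DB="$HOME/databases/uniclust30/uniclust30_2018_08"  # example HH-suite database prefix for hhblits

# "$1" and "$2" are the positional arguments of the surrounding shell script;
# they are passed to --rank and --size. On a single machine, rank 0 of size 1 is a sensible default:
set -- 0 1
```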
+
+## How to assemble DB5-Plus
+
+Fetch prepared protein complexes from Dataverse:
+
+```bash
+# Download the prepared DB5 files:
+wget -O project/datasets/DB5.tar.gz https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=doi:10.7910/DVN/H93ZKK/BXXQCG
+
+# Extract downloaded DB5 archive:
+tar -xzf project/datasets/DB5.tar.gz --directory project/datasets/
+
+# Remove (now) redundant DB5 archive and other miscellaneous files:
+rm project/datasets/DB5.tar.gz project/datasets/DB5/.README.swp
+rm project/datasets/DB5.tar.gz "$MYLOCAL"/datasets/DB5/.README.swp
+rm -rf project/datasets/DB5/interim "$MYLOCAL"/datasets/DB5/processed
+
+# Create relevant interim and final data directories:
+mkdir project/datasets/DB5/interim project/datasets/DB5/interim/external_feats
+mkdir project/datasets/DB5/final project/datasets/DB5/final/raw project/datasets/DB5/final/processed
+
+# Construct DB5 dataset pairs:
+python3 project/datasets/builder/make_dataset.py "$PROJDIR"/project/datasets/DB5/raw "$PROJDIR"/project/datasets/DB5/interim --num_cpus 32 --source_type db5 --unbound
+
+# Generate externally-sourced features:
+python3 project/datasets/builder/generate_psaia_features.py "$PSAIADIR" "$PROJDIR"/project/datasets/builder/psaia_config_file_db5.txt "$PROJDIR"/project/datasets/DB5/raw "$PROJDIR"/project/datasets/DB5/interim/parsed "$PROJDIR"/project/datasets/DB5/interim/parsed "$PROJDIR"/project/datasets/DB5/interim/external_feats --source_type db5
+python3 project/datasets/builder/generate_hhsuite_features.py "$PROJDIR"/project/datasets/DB5/interim/parsed "$PROJDIR"/project/datasets/DB5/interim/parsed "$HHSUITE_DB" "$PROJDIR"/project/datasets/DB5/interim/external_feats --num_cpu_jobs 4 --num_cpus_per_job 8 --num_iter 2 --source_type db5 --write_file
+
+# Add new features to the filtered pairs:
+python3 project/datasets/builder/postprocess_pruned_pairs.py "$PROJDIR"/project/datasets/DB5/raw "$PROJDIR"/project/datasets/DB5/interim/pairs "$PROJDIR"/project/datasets/DB5/interim/external_feats "$PROJDIR"/project/datasets/DB5/final/raw --num_cpus 32 --source_type db5 --full-run
+
+# Prepackage labels, partition dataset filenames, aggregate statistics, and impute missing features
+python3 project/datasets/builder/downsample_negative_class.py "$PROJDIR"/project/datasets/DB5/final/raw --source_type rcsb --num_cpus 32 --rank "$1" --size "$2"
+python3 project/datasets/builder/partition_dataset_filenames.py "$PROJDIR"/project/datasets/DB5/final/raw --source_type db5 --rank "$1" --size "$2"
+python3 project/datasets/builder/collect_dataset_statistics.py "$PROJDIR"/project/datasets/DB5/final/raw --rank "$1" --size "$2"
+python3 project/datasets/builder/impute_missing_feature_values.py "$PROJDIR"/project/datasets/DB5/final/raw --num_cpus 32 --rank "$1" --size "$2"
+```
+
+## Python 2 to 3 pickle file solution
+
+While using Python 3 in this project, you may encounter the following error if you try to postprocess '.dill' pruned
+pairs that were created using Python 2.
+
+ModuleNotFoundError: No module named 'dill.dill'
+
+1. To resolve it, ensure that the 'dill' package's version is greater than 0.3.2.
+2. If the problem persists, edit the pickle.py file corresponding to your Conda environment's Python 3 installation
+(e.g. ~/DIPS-Plus/venv/lib/python3.8/pickle.py) and add the statement
+
+```python
+if module == 'dill.dill': module = 'dill._dill'
+```
+
+to the end of the
+
+```python
+if self.proto < 3 and self.fix_imports:
+```
+
+block in the Unpickler class' find_class() function
+(e.g. line 1577 of Python 3.8.5's pickle.py).
+
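Concretely, the patched block inside find_class() would then look roughly like the sketch below (paraphrased from the stock CPython 3.8 pickle.py; the surrounding lines in your installation may differ slightly):

```python
# Sketch of the patched section of Unpickler.find_class() in pickle.py (CPython 3.8.x).
# Only the relevant block is shown; the added line is the final one inside the block.
if self.proto < 3 and self.fix_imports:
    if (module, name) in _compat_pickle.NAME_MAPPING:
        module, name = _compat_pickle.NAME_MAPPING[(module, name)]
    elif module in _compat_pickle.IMPORT_MAPPING:
        module = _compat_pickle.IMPORT_MAPPING[module]
    # Added line: map the legacy Python 2 'dill.dill' module name to its Python 3 equivalent.
    if module == 'dill.dill': module = 'dill._dill'
```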
+### Citation
+
+```
+@article{DIPS-Plus,
+    title={DIPS-Plus: The Enhanced Database of Interacting Protein Structures},
+    author={Morehead, Alex and Chen, Chen and Cheng, Jianlin},
+    journal={NeurIPS},
+    year={2021}
+}
+```

environment.yml (+25)

@@ -0,0 +1,25 @@
+name: DIPS-Plus
+channels:
+  - defaults
+  - anaconda
+  - conda-forge
+  - bioconda
+  - pytorch
+  - salilab
+  - dglteam
+dependencies:
+  - python=3.8
+  - pip
+  - scipy
+  - pandas
+  - scikit-learn
+  # - After creating initial Conda environment, uncomment and run the following if not already installed on your machine
+  # - biopython=1.78  # For PDB parsing
+  # - hhsuite=3.3.0  # For generating sequence profile HMMs
+  # - msms=2.6.1  # For computing residue depths
+  # - dssp=3.0.0  # For computing secondary structures - must be compiled from source for PowerPC architectures (e.g. Summit)
+  # - pytorch  # Install with 'conda install pytorch -c pytorch -c conda-forge' - already installed on Summit as 1.7.1
+  # - dgl-cudaXX.X  # Replace XX.X with the version of cudatoolkit installed by 'conda install pytorch -c pytorch -c conda-forge' directly above - must be manually compiled and installed on Summit
+  - pip:
+    - -e .
+    - -r file:requirements.txt
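The commented-out entries above are meant to be installed manually after the environment exists. One possible way to do that, assuming the channels listed in the file carry these packages for your platform (the package/channel pairings below are an illustration, not verified for every architecture):

```bash
# Hypothetical follow-up installs for the commented-out dependencies; channels and versions
# may need adjusting for your platform (e.g. some are unavailable on PowerPC/Summit).
conda activate DIPS-Plus
conda install -c conda-forge biopython=1.78
conda install -c bioconda hhsuite=3.3.0
conda install -c bioconda msms=2.6.1
conda install -c salilab dssp=3.0.0
conda install pytorch -c pytorch -c conda-forge   # as noted in the comment above
```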
