RitAreaSciencePark/dpc_fam_and_struct_webapp

Web Application for DPCfam and DPCstruct Data Exploration

Hi! Thank you for visiting our repository. This Django project facilitates the exploration of proteins at the domain level. It works with the DPCfam and DPCstruct datasets, which cluster protein sequences and protein structures, respectively, in order to classify protein domains at large scale.

Dataset     Description                       Zenodo
DPCfam      Sequence-based domain clusters    DOI
DPCstruct   Structure-based domain clusters   DOI

The project currently consists of two applications, dpcfam and dpcstruct, one for each of the datasets above. It is still under development; to reproduce its current state, follow the steps below.


1. Prerequisites

Our development environment uses:

  • Ubuntu 24.04.3 LTS
  • Python 3.12.3
  • Visual Studio Code 1.109.3
  • Git 2.43.0
  • PostgreSQL 16.11
  • CSV Data Files: Available upon request; place them in static/dataframes/.
  • Static files: To enable the "Downloads" feature and serve domain data, organize the static/ directory as follows:
static/
├── downloads/
│   ├── dpcfam/
│   │   ├── alphafolddb_reps.zip
│   │   ├── dpcfamb_dataset.zip
│   │   ├── dpcfam_full_seeds.zip
│   │   ├── dpcfam_hmm_profiles.zip
│   │   └── dpcfam_msa_profiles.zip
│   └── dpcstruct/
│       ├── dpcstruct_reps_pdbs.tar.gz
│       └── dpcstruct_reps_seqs.tar.gz
└── production_files/
    ├── dpcfam/
    │   ├── metaclusters_fasta/       # MCID.fasta files
    │   ├── metaclusters_hmms/        # MCID.hmm files
    │   └── metaclusters_msas_cdhit/  # MCID.msa files
    └── dpcstruct/
        ├── dpcstruct_reps_seqs/           # MCID.fasta files (representatives only)
        ├── dpcstruct_reps_pdbs_zipped/    # MCID_pdb.zip files (representatives only)
        └── dpcstruct_reps_pdbs/           # MCID.pdb files (representatives only)
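
As a convenience, the empty directory skeleton shown above can be created in one step. The following is a minimal sketch, not part of the repository: the function name make_static_skeleton and the STATIC_ROOT variable are hypothetical, and the archives and per-metacluster files still have to be copied in by hand.

```shell
#!/bin/sh
# Sketch: create the empty static/ skeleton described above.
# STATIC_ROOT is a hypothetical override; it defaults to ./static.
STATIC_ROOT="${STATIC_ROOT:-static}"

# make_static_skeleton ROOT: create every expected sub-directory under ROOT.
make_static_skeleton() {
    root="$1"
    for dir in \
        dataframes \
        downloads/dpcfam \
        downloads/dpcstruct \
        production_files/dpcfam/metaclusters_fasta \
        production_files/dpcfam/metaclusters_hmms \
        production_files/dpcfam/metaclusters_msas_cdhit \
        production_files/dpcstruct/dpcstruct_reps_seqs \
        production_files/dpcstruct/dpcstruct_reps_pdbs_zipped \
        production_files/dpcstruct/dpcstruct_reps_pdbs
    do
        # mkdir -p is idempotent, so re-running the script is harmless.
        mkdir -p "$root/$dir" || return 1
    done
}

make_static_skeleton "$STATIC_ROOT"
```

Run it from the repository root so that the directories land next to manage.py.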

2. Clone the Repository

If this is your first time, clone the project:

git clone https://github.com/emmanuelnyandukagarabi/dpc_fam_and_struct_webapp
cd dpc_fam_and_struct_webapp

Otherwise, pull the latest changes:

cd dpc_fam_and_struct_webapp
git pull

3. Installation

  1. Create a virtual environment (first time only) and activate it:

    python3 -m venv .venv
    source .venv/bin/activate
  2. Install dependencies:

    pip install -r requirements.txt
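
Before installing dependencies, it can help to confirm that the virtual environment is really active in the current shell. The sketch below relies on the VIRTUAL_ENV variable that the activate script exports; the helper name venv_status is hypothetical.

```shell
#!/bin/sh
# Sketch: report whether a virtual environment is active in this shell.
# The venv activate script exports VIRTUAL_ENV; venv_status is a hypothetical helper.
venv_status() {
    if [ -n "${VIRTUAL_ENV:-}" ]; then
        echo "active: $VIRTUAL_ENV"
    else
        echo "not active"
        return 1
    fi
}

# Print a hint when no environment is active.
venv_status || echo "Run: source .venv/bin/activate"
```

If the helper reports "not active", pip would install into the system or user site-packages instead of .venv.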

4. Database Initialization

Start the PostgreSQL service:

sudo service postgresql start

4.1 Create User and Database (only once)

Use the provided script to set up the PostgreSQL user and database:

sudo -u postgres psql -f static/scripts/create_a_user_and_a_database.sql

4.2 Create Tables and Indexes, then Populate Tables from CSV Files

I. Application I: dpcfam (almost done)
  1. Run the following script to create the dpcfam tables and indexes:

    PGPASSWORD="EmmaPSQL2026" psql -U enyanduk -h localhost -d dpcfam_mcs_db -f static/scripts/dpcfam/create_dpcfam_tables.sql
  2. Run the following script to populate the dpcfam tables from the CSV files. This takes a while; wait until the process completes.

    PGPASSWORD="EmmaPSQL2026" psql -U enyanduk -h localhost -d dpcfam_mcs_db -f static/scripts/dpcfam/populate_dpcfam_tables.sql
II. Application II: dpcstruct (under development)
  1. Run the following script to create the dpcstruct tables and indexes:

    PGPASSWORD="EmmaPSQL2026" psql -U enyanduk -h localhost -d dpcfam_mcs_db -f static/scripts/dpcstruct/create_dpcstruct_tables.sql
  2. Run the following script to populate the dpcstruct tables from the CSV files:

    PGPASSWORD="EmmaPSQL2026" psql -U enyanduk -h localhost -d dpcfam_mcs_db -f static/scripts/dpcstruct/populate_dpcstruct_tables.sql
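
The four commands above differ only in the script path, so they can be driven from a small loop. This is a sketch under assumptions, not the project's own tooling: DB_USER, DB_NAME, DRY_RUN and run_sql_scripts are hypothetical names, the password is expected to be exported as PGPASSWORD beforehand, and the ON_ERROR_STOP setting is an addition that the original commands do not use.

```shell
#!/bin/sh
# Sketch: run the four SQL scripts in the order given above.
# DB_USER, DB_NAME and DRY_RUN are hypothetical variable names.
# psql reads the password from the PGPASSWORD environment variable.
DB_USER="${DB_USER:-enyanduk}"
DB_NAME="${DB_NAME:-dpcfam_mcs_db}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

run_sql_scripts() {
    for script in \
        static/scripts/dpcfam/create_dpcfam_tables.sql \
        static/scripts/dpcfam/populate_dpcfam_tables.sql \
        static/scripts/dpcstruct/create_dpcstruct_tables.sql \
        static/scripts/dpcstruct/populate_dpcstruct_tables.sql
    do
        if [ "$DRY_RUN" = "1" ]; then
            echo "psql -U $DB_USER -h localhost -d $DB_NAME -v ON_ERROR_STOP=1 -f $script"
        else
            # ON_ERROR_STOP makes psql exit non-zero on the first SQL error,
            # so a failed "create" step is not followed by a "populate" step.
            psql -U "$DB_USER" -h localhost -d "$DB_NAME" \
                 -v ON_ERROR_STOP=1 -f "$script" || return 1
        fi
    done
}

run_sql_scripts
```

By default the loop only prints the commands; after exporting PGPASSWORD, set DRY_RUN=0 to execute them.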

5. Migrations

All migrations are already committed to the repository. Optionally, you may run:

python3 manage.py makemigrations
python3 manage.py migrate

6. Run the Server

python3 manage.py runserver

7. Usage

Visit the following URL in your web browser (Chrome is my friend!):

http://127.0.0.1:8000/

Note: Congratulations, you made it! To stop the server, use Ctrl+C. To stop PostgreSQL, run sudo service postgresql stop. Once the database is successfully populated, you may delete the CSV files (static/dataframes/) to save space. For any feedback, reach out to us via any address on our profile. More features are coming soon!


References

If you use this project or the associated datasets, please cite:

  1. Barone, F., Laio, A., Punta, M., Cozzini, S., Ansuini, A., & Cazzaniga, A. (2025). Unsupervised domain classification of AlphaFold2-predicted protein structures. PRX Life, 3(2), 023009. https://doi.org/10.1103/PRXLife.3.023009

  2. Russo, E. T., Barone, F., Bateman, A., Cozzini, S., Punta, M., & Laio, A. (2022). DPCfam: Unsupervised protein family classification by density peak clustering of large sequence datasets. PLOS Computational Biology, 18(10), e1010610. https://doi.org/10.1371/journal.pcbi.1010610

BibTeX
@article{barone2025unsupervised,
  title={Unsupervised domain classification of AlphaFold2-predicted protein structures},
  author={Barone, Federico and Laio, Alessandro and Punta, Marco and Cozzini, Stefano and Ansuini, Alessio and Cazzaniga, Alberto},
  journal={PRX Life},
  volume={3},
  number={2},
  pages={023009},
  year={2025},
  publisher={APS}
}

@article{russo2022dpcfam,
  title={{DPCfam}: Unsupervised protein family classification by density peak clustering of large sequence datasets},
  author={Russo, Elena Tea and Barone, Federico and Bateman, Alex and Cozzini, Stefano and Punta, Marco and Laio, Alessandro},
  journal={PLOS Computational Biology},
  volume={18},
  number={10},
  pages={e1010610},
  year={2022},
  publisher={Public Library of Science San Francisco, CA USA}
}
