- Introduction
- Project Structure
- Detailed Component Description
- Installation
- Usage
- Data Requirements
- Output Details
- Customization and Extension
- Troubleshooting
- License
- Contact
## Introduction

This project implements a Deep Embedded Clustering (DEC) approach for analyzing genomic data. It combines the power of autoencoders for dimensionality reduction and information extraction with density-based clustering techniques to uncover patterns in high-dimensional genomic datasets. The project is designed specifically to work with one-hot encoded genomic data, making it suitable for a wide range of genomic analysis tasks.
The main goals of this project are:
- To reduce the dimensionality of complex genomic data with deep autoencoders while preserving relevant information and "diluting" bias
- To perform clustering on the reduced-dimension data to identify meaningful groups or patterns
- To visualize the clustered data using t-SNE for intuitive interpretation
- To select the optimal hyperparameter for the density-based clustering approach used
This project is for researchers and data scientists working in bioinformatics, genomics, and related fields who need to analyze large-scale genomic datasets efficiently.
## Project Structure

The project is organized into several Python scripts, each handling a specific aspect of the analysis pipeline:
```
.
├── main_v2_individual.py
├── DenseDEC.py
├── DEC.py
├── datasets.py
└── data/
    ├── prokka_onehot_nay60k_1000.parquet
    ├── prokka_onehot_allGUT_combined.parquet
    └── humags_prokka_onehotencoded_dataset.parquet
```
## Detailed Component Description

### main_v2_individual.py

This is the main entry point of the project. It orchestrates the entire analysis pipeline and provides a command-line interface for running the autoencoder training and clustering processes. Key features include:
- Flexible configuration through command-line arguments
- Support for multiple trials and cross-validation folds (seeds are set for reproducibility, so folds can be run in parallel)
- Ability to run specific combinations of trials, folds, and encoding dimensions
- Comprehensive logging and result saving
### DenseDEC.py

This script implements the Dense Deep Embedded Clustering model, a variation of the DEC algorithm. It includes:
- A custom dense autoencoder architecture
- Methods for initializing and pretraining the autoencoder
- Integration with the base DEC class for clustering
### DEC.py

This script contains the base implementation of the Deep Embedded Clustering algorithm. It provides:
- Methods for model initialization and pretraining
- Implementation of the DenMune clustering algorithm
- Functions for evaluating clustering quality using various metrics (DBCV, CDbw)
- Utilities for data sampling and visualization
### datasets.py

This utility script handles data loading and preprocessing. It includes:
- Functions for loading one-hot encoded genomic data from Parquet files
- Data validation and error checking to ensure input data quality
- Conversion of data to formats suitable for TensorFlow processing
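As a rough illustration of what such loading and validation involves, here is a minimal sketch. This is not the project's actual `datasets.py` code, and `load_onehot_parquet` is a hypothetical name:

```python
import numpy as np
import pandas as pd

def load_onehot_parquet(path: str) -> np.ndarray:
    """Load a one-hot encoded Parquet table into a float32 feature matrix."""
    df = pd.read_parquet(path)
    features = df.iloc[:, 1:]  # the first column is an identifier; skip it
    # Basic validation: all feature values must be binary (0 or 1).
    if not features.isin([0, 1]).all().all():
        raise ValueError(f"Non-binary values found in {path}")
    # float32 is the dtype TensorFlow expects by default.
    return features.to_numpy(dtype=np.float32)

x = load_onehot_parquet("data/prokka_onehot_nay60k_1000.parquet")
print(x.shape)  # (n_genomes, n_genes)
```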
## Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/JotaKas/GenoClust-DEC.git
   cd GenoClust-DEC
   ```

2. Create a virtual environment (optional but recommended):

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
   ```

3. Install the required dependencies:
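Assuming the repository ships a `requirements.txt` (not shown here), the usual command is:

```bash
pip install -r requirements.txt
```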
Note: Depending on your system and CUDA compatibility, you might need to install a specific version of TensorFlow. Refer to the TensorFlow installation guide for more details.
## Usage

The project supports two main modes of operation: Autoencoder Mode and Clustering Mode.
### Autoencoder Mode

This mode is used for training the autoencoder to reduce the dimensionality of the input genomic data.

```bash
python main_v2_individual.py autoencoder [options]
```

Options:

- `--dbs`: List of databases to process. Default: `['prokka_onehot_nay60k_1000', 'prokka_onehot_allGUT_combined', 'humags_prokka_onehotencoded_dataset']`
- `--expdir`: Directory to save results. Default: `'results/genomic_exp'`
- `--trials`: Number of trials to run. Default: `2`
- `--verbose`: Verbosity level (0, 1, or 2). Default: `1`
- `--encoding_dims`: List of encoding dimensions to try. Default: `[2, 64, 128, 256, 1024]`
- `--specific_trial`: Run a specific trial (optional)
- `--specific_fold`: Run a specific fold (optional)
- `--specific_dim`: Run a specific encoding dimension (optional)
Example:

```bash
python main_v2_individual.py autoencoder --dbs prokka_onehot_nay60k_1000 --trials 3 --encoding_dims 64 128 --verbose 2
```

This command will run the autoencoder training on the `prokka_onehot_nay60k_1000` database for 3 trials, using encoding dimensions of 64 and 128, with maximum verbosity.
### Clustering Mode

This mode performs clustering on the reduced-dimension data obtained from the autoencoder.

```bash
python main_v2_individual.py clustering [options]
```

Options:

- `--dbs`: List of databases to process
- `--expdir`: Directory to save results. Default: `'results/genomic_exp'`
- `--knn_values`: List of KNN values for clustering. Default: `[10, 20, 50, 100, 200]`
- `--seeds`: List of random seeds for reproducibility. Default: `[42, 123, 456, 789, 101]`
- `--encoding_dims`: List of encoding dimensions to use. Default: `[2, 64, 128, 256, 1024]`
- `--specific_trial`: Run a specific trial (optional)
- `--specific_fold`: Run a specific fold (optional)
- `--specific_dim`: Run a specific encoding dimension (optional)
Example:

```bash
python main_v2_individual.py clustering --dbs prokka_onehot_allGUT_combined --knn_values 15 30 --seeds 42 789 --encoding_dims 128
```

This command will perform clustering on the `prokka_onehot_allGUT_combined` database using KNN values of 15 and 30, seeds 42 and 789, and an encoding dimension of 128.
## Data Requirements

The project expects genomic data in Parquet format. The data should be:
- One-hot encoded
- Stored in the `data/` directory relative to the project root
- Named according to the database names used in the command-line arguments
Each Parquet file should contain a table where:
- Rows represent individual genomes or samples
- Columns represent genes or features
- The first column is typically an identifier (which is ignored during processing)
- All other columns contain binary (0 or 1) values representing the one-hot encoding
Example of expected data structure:
| ID | Gene1 | Gene2 | Gene3 | ... | GeneN |
|---|---|---|---|---|---|
| 1 | 0 | 1 | 0 | ... | 1 |
| 2 | 1 | 0 | 1 | ... | 0 |
| ... | ... | ... | ... | ... | ... |
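To make this layout concrete, the hypothetical snippet below writes a tiny conforming Parquet file. The file name is invented, and pandas with a Parquet engine such as pyarrow is assumed:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_genomes, n_genes = 5, 4

# Binary gene-presence matrix with the expected column layout.
df = pd.DataFrame(
    rng.integers(0, 2, size=(n_genomes, n_genes)),
    columns=[f"Gene{i + 1}" for i in range(n_genes)],
)
df.insert(0, "ID", range(1, n_genomes + 1))  # identifier column, ignored by the pipeline

# The file name (made up here) must match the name passed via --dbs.
df.to_parquet("data/my_onehot_dataset.parquet", index=False)
```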
## Output Details

The project generates various outputs in the specified experiment directory:
- **Trained Models:**
  - `autoencoder_model.h5`: the full autoencoder model
  - `encoder_model.h5`: the encoder part of the autoencoder
  - `ae_weights.h5`: weights of the trained autoencoder
- **Training Logs:**
  - `pretrain_log.csv`: log of the autoencoder pretraining process
- **Encoded Features:**
  - `full_features.txt`: encoded features for the entire dataset
- **t-SNE Visualizations:**
  - `tsne_features_seed_{seed}.txt`: t-SNE transformed features for each seed
- **Clustering Results:**
  - `predicted_clusters_knn_{knn}_seed_{seed}.txt`: predicted cluster labels
  - `clustering_results_seed_{seed}_knn_{knn}.csv`: detailed clustering metrics
- **Plots:**
  - `2D_plot_knn_{knn}_seed_{seed}.png`: 2D scatter plot of clustered data
These outputs allow for comprehensive analysis and interpretation of the results, including model performance, clustering quality, and visual representation of the data structure.
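For quick inspection, the outputs can also be reloaded and re-plotted outside the pipeline. The sketch below assumes the `.txt` files are whitespace-delimited (e.g., written with `np.savetxt`) and uses simplified paths; the actual files may sit in per-database, per-trial, or per-fold subdirectories of the experiment directory:

```python
import matplotlib.pyplot as plt
import numpy as np

expdir = "results/genomic_exp"  # default --expdir; adjust to your run
seed, knn = 42, 20              # values from the default --seeds / --knn_values grids

# Whitespace-delimited text is assumed here (np.savetxt-style output).
tsne = np.loadtxt(f"{expdir}/tsne_features_seed_{seed}.txt")
labels = np.loadtxt(f"{expdir}/predicted_clusters_knn_{knn}_seed_{seed}.txt")

plt.scatter(tsne[:, 0], tsne[:, 1], c=labels, s=5, cmap="tab20")
plt.title(f"Clusters (knn={knn}, seed={seed})")
plt.savefig("clusters_replot.png", dpi=150)
```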
## Customization and Extension

The project is designed to be modular and extensible. Here are some ways you can customize or extend the functionality:
- **Adding New Databases:** Simply add new Parquet files to the `data/` directory and include their names in the `--dbs` argument.
- **Modifying the Autoencoder Architecture:** Edit the `DenseAutoencoder` function in `DenseDEC.py` to change the network architecture.
- **Implementing New Clustering Algorithms:** Extend the `DEC` class in `DEC.py` to include additional clustering methods.
- **Adding New Evaluation Metrics:** Incorporate additional clustering quality metrics in the `evaluate_clustering` method of `DEC.py` (see the sketch after this list).
- **Customizing Visualizations:** Modify the plotting code in `DEC.py` to create different types of visualizations.
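For example, a silhouette score could be added alongside DBCV and CDbw. The sketch below is hypothetical: the actual `evaluate_clustering` signature in `DEC.py` may differ, and this only illustrates where a new metric would slot in:

```python
import numpy as np
from sklearn.metrics import silhouette_score

def evaluate_clustering(features: np.ndarray, labels: np.ndarray) -> dict:
    """Compute clustering quality metrics; extend with more entries as needed."""
    metrics = {}
    # Density-based methods such as DenMune may mark noise points with -1;
    # exclude them before scoring.
    mask = labels >= 0
    # Silhouette is only defined when at least 2 clusters remain.
    if len(np.unique(labels[mask])) >= 2:
        metrics["silhouette"] = silhouette_score(features[mask], labels[mask])
    return metrics
```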
## Troubleshooting

Common issues and their solutions:
- **Out of Memory Errors:** If you encounter memory issues, try reducing the batch size or using a smaller encoding dimension (a GPU memory-growth workaround is sketched after this list).
- **CUDA Errors:** Ensure that your TensorFlow installation is compatible with your CUDA version. You may need to install a specific TensorFlow version.
- **Data Loading Errors:** Verify that your Parquet files are in the correct format and location. Use the `--verbose` flag to get more detailed error messages.
- **Poor Clustering Results:** Experiment with different KNN values and encoding dimensions. Consider preprocessing your data further if results are consistently poor.
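For out-of-memory and CUDA issues, one standard TensorFlow-level mitigation (independent of this project's code) is to enable GPU memory growth so TensorFlow allocates memory on demand instead of reserving it all at startup:

```python
import tensorflow as tf

# Enable on-demand GPU memory allocation. This must run before the
# first GPU operation initializes the devices.
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)
```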
## License

This project is licensed under the GPL-3.0 license.
## Contact

Jonas C Kasmanas (Email: [email protected])
For bug reports and feature requests, please open an issue on the GitHub repository.
## References

Abbas, M., El-Zoghabi, A., & Shoukry, A. (2021). DenMune: Density peak based clustering using mutual nearest neighbors. Pattern Recognition, 109, 107589. https://doi.org/10.1016/j.patcog.2020.107589

Ren, Y., Wang, N., Li, M., & Xu, Z. (2020). Deep density-based image clustering. Knowledge-Based Systems, 197, 105841. https://doi.org/10.1016/j.knosys.2020.105841