- Introduction
- Project Structure
- Detailed Component Description
- Installation
- Usage
- Data Requirements
- Output Details
- Customization and Extension
- Troubleshooting
- License
- Contact
## Introduction

This project implements a Deep Embedded Clustering (DEC) approach for analyzing genomic data. It combines the power of autoencoders for dimensionality reduction and information extraction with density-based clustering techniques to uncover patterns in high-dimensional genomic datasets. The project is designed specifically to work with one-hot encoded genomic data, making it suitable for a wide range of genomic analysis tasks.
The main goals of this project are:
- To reduce the dimensionality of complex genomic data with deep autoencoders while preserving relevant information and "diluting" bias
- To perform clustering on the reduced-dimension data to identify meaningful groups or patterns
- To visualize the clustered data using t-SNE for intuitive interpretation
- To select the optimal hyperparameter for the density-based clustering approach used
This project is for researchers and data scientists working in bioinformatics, genomics, and related fields who need to analyze large-scale genomic datasets efficiently.
## Project Structure

The project is organized into several Python scripts, each handling a specific aspect of the analysis pipeline:
```
.
├── main_v2_individual.py
├── DenseDEC.py
├── DEC.py
├── datasets.py
└── data/
    ├── prokka_onehot_nay60k_1000.parquet
    ├── prokka_onehot_allGUT_combined.parquet
    └── humags_prokka_onehotencoded_dataset.parquet
```
## Detailed Component Description

### main_v2_individual.py

This is the main entry point of the project. It orchestrates the entire analysis pipeline and provides a command-line interface for running the autoencoder training and clustering processes. Key features include:
- Flexible configuration through command-line arguments
- Support for multiple trials and cross-validation folds (seeds are set for reproducibility, so folds can be run in parallel)
- Ability to run specific combinations of trials, folds, and encoding dimensions
- Comprehensive logging and result saving
### DenseDEC.py

This script implements the Dense Deep Embedded Clustering model, a variation of the DEC algorithm. It includes:
- A custom dense autoencoder architecture
- Methods for initializing and pretraining the autoencoder
- Integration with the base DEC class for clustering
### DEC.py

This script contains the base implementation of the Deep Embedded Clustering algorithm. It provides:
- Methods for model initialization and pretraining
- Implementation of the DenMune clustering algorithm
- Functions for evaluating clustering quality using various metrics (DBCV, CDbw)
- Utilities for data sampling and visualization
### datasets.py

This utility script handles data loading and preprocessing. It includes:
- Functions for loading one-hot encoded genomic data from Parquet files
- Data validation and error checking to ensure input data quality
- Conversion of data to formats suitable for TensorFlow processing
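As a rough illustration of what such loading and validation involves, here is a minimal sketch. This is not the project's actual `datasets.py` code, and `load_onehot_parquet` is a hypothetical name:

```python
import numpy as np
import pandas as pd

def load_onehot_parquet(path: str) -> np.ndarray:
    """Load a one-hot encoded Parquet table into a float32 feature matrix."""
    df = pd.read_parquet(path)
    features = df.iloc[:, 1:]  # the first column is an identifier; skip it
    # Basic validation: all feature values must be binary (0 or 1).
    if not features.isin([0, 1]).all().all():
        raise ValueError(f"Non-binary values found in {path}")
    # float32 is the dtype TensorFlow expects by default.
    return features.to_numpy(dtype=np.float32)

x = load_onehot_parquet("data/prokka_onehot_nay60k_1000.parquet")
print(x.shape)  # (n_genomes, n_genes)
```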
## Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/JotaKas/GenoClust-DEC.git
   cd GenoClust-DEC
   ```

2. Create a virtual environment (optional but recommended):

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
   ```

3. Install the required dependencies:
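Assuming the repository ships a `requirements.txt` (not shown here), the usual command is:

```bash
pip install -r requirements.txt
```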
Note: Depending on your system and CUDA compatibility, you might need to install a specific version of TensorFlow. Refer to the TensorFlow installation guide for more details.
## Usage

The project supports two main modes of operation: Autoencoder Mode and Clustering Mode.
### Autoencoder Mode

This mode is used for training the autoencoder to reduce the dimensionality of the input genomic data.

```bash
python main_v2_individual.py autoencoder [options]
```

Options:

- `--dbs`: List of databases to process. Default: `['prokka_onehot_nay60k_1000', 'prokka_onehot_allGUT_combined', 'humags_prokka_onehotencoded_dataset']`
- `--expdir`: Directory to save results. Default: `'results/genomic_exp'`
- `--trials`: Number of trials to run. Default: `2`
- `--verbose`: Verbosity level (0, 1, or 2). Default: `1`
- `--encoding_dims`: List of encoding dimensions to try. Default: `[2, 64, 128, 256, 1024]`
- `--specific_trial`: Run a specific trial (optional)
- `--specific_fold`: Run a specific fold (optional)
- `--specific_dim`: Run a specific encoding dimension (optional)
Example:

```bash
python main_v2_individual.py autoencoder --dbs prokka_onehot_nay60k_1000 --trials 3 --encoding_dims 64 128 --verbose 2
```

This command will run the autoencoder training on the `prokka_onehot_nay60k_1000` database for 3 trials, using encoding dimensions of 64 and 128, with maximum verbosity.
### Clustering Mode

This mode performs clustering on the reduced-dimension data obtained from the autoencoder.

```bash
python main_v2_individual.py clustering [options]
```

Options:

- `--dbs`: List of databases to process
- `--expdir`: Directory to save results. Default: `'results/genomic_exp'`
- `--knn_values`: List of KNN values for clustering. Default: `[10, 20, 50, 100, 200]`
- `--seeds`: List of random seeds for reproducibility. Default: `[42, 123, 456, 789, 101]`
- `--encoding_dims`: List of encoding dimensions to use. Default: `[2, 64, 128, 256, 1024]`
- `--specific_trial`: Run a specific trial (optional)
- `--specific_fold`: Run a specific fold (optional)
- `--specific_dim`: Run a specific encoding dimension (optional)
Example:

```bash
python main_v2_individual.py clustering --dbs prokka_onehot_allGUT_combined --knn_values 15 30 --seeds 42 789 --encoding_dims 128
```

This command will perform clustering on the `prokka_onehot_allGUT_combined` database using KNN values of 15 and 30, seeds 42 and 789, and an encoding dimension of 128.
## Data Requirements

The project expects genomic data in Parquet format. The data should be:
- One-hot encoded
- Stored in the `data/` directory relative to the project root
- Named according to the database names used in the command-line arguments
Each Parquet file should contain a table where:
- Rows represent individual genomes or samples
- Columns represent genes or features
- The first column is typically an identifier (which is ignored during processing)
- All other columns contain binary (0 or 1) values representing the one-hot encoding
Example of expected data structure:
| ID | Gene1 | Gene2 | Gene3 | ... | GeneN |
|---|---|---|---|---|---|
| 1 | 0 | 1 | 0 | ... | 1 |
| 2 | 1 | 0 | 1 | ... | 0 |
| ... | ... | ... | ... | ... | ... |
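To make this layout concrete, the hypothetical snippet below writes a tiny conforming Parquet file. The file name is invented, and pandas with a Parquet engine such as pyarrow is assumed:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_genomes, n_genes = 5, 4

# Binary gene-presence matrix with the expected column layout.
df = pd.DataFrame(
    rng.integers(0, 2, size=(n_genomes, n_genes)),
    columns=[f"Gene{i + 1}" for i in range(n_genes)],
)
df.insert(0, "ID", range(1, n_genomes + 1))  # identifier column, ignored by the pipeline

# The file name (made up here) must match the name passed via --dbs.
df.to_parquet("data/my_onehot_dataset.parquet", index=False)
```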
## Output Details

The project generates various outputs in the specified experiment directory:
- **Trained Models:**
  - `autoencoder_model.h5`: the full autoencoder model
  - `encoder_model.h5`: the encoder part of the autoencoder
  - `ae_weights.h5`: weights of the trained autoencoder
- **Training Logs:**
  - `pretrain_log.csv`: log of the autoencoder pretraining process
- **Encoded Features:**
  - `full_features.txt`: encoded features for the entire dataset
- **t-SNE Visualizations:**
  - `tsne_features_seed_{seed}.txt`: t-SNE transformed features for each seed
- **Clustering Results:**
  - `predicted_clusters_knn_{knn}_seed_{seed}.txt`: predicted cluster labels
  - `clustering_results_seed_{seed}_knn_{knn}.csv`: detailed clustering metrics
- **Plots:**
  - `2D_plot_knn_{knn}_seed_{seed}.png`: 2D scatter plot of clustered data
These outputs allow for comprehensive analysis and interpretation of the results, including model performance, clustering quality, and visual representation of the data structure.
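For quick inspection, the outputs can also be reloaded and re-plotted outside the pipeline. The sketch below assumes the `.txt` files are whitespace-delimited (e.g., written with `np.savetxt`) and uses simplified paths; the actual files may sit in per-database, per-trial, or per-fold subdirectories of the experiment directory:

```python
import matplotlib.pyplot as plt
import numpy as np

expdir = "results/genomic_exp"  # default --expdir; adjust to your run
seed, knn = 42, 20              # values from the default --seeds / --knn_values grids

# Whitespace-delimited text is assumed here (np.savetxt-style output).
tsne = np.loadtxt(f"{expdir}/tsne_features_seed_{seed}.txt")
labels = np.loadtxt(f"{expdir}/predicted_clusters_knn_{knn}_seed_{seed}.txt")

plt.scatter(tsne[:, 0], tsne[:, 1], c=labels, s=5, cmap="tab20")
plt.title(f"Clusters (knn={knn}, seed={seed})")
plt.savefig("clusters_replot.png", dpi=150)
```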
## Customization and Extension

The project is designed to be modular and extensible. Here are some ways you can customize or extend the functionality:
- **Adding New Databases:** Simply add new Parquet files to the `data/` directory and include their names in the `--dbs` argument.
- **Modifying the Autoencoder Architecture:** Edit the `DenseAutoencoder` function in `DenseDEC.py` to change the network architecture.
- **Implementing New Clustering Algorithms:** Extend the `DEC` class in `DEC.py` to include additional clustering methods.
- **Adding New Evaluation Metrics:** Incorporate additional clustering quality metrics in the `evaluate_clustering` method of `DEC.py` (see the sketch after this list).
- **Customizing Visualizations:** Modify the plotting code in `DEC.py` to create different types of visualizations.
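For example, a silhouette score could be added alongside DBCV and CDbw. The sketch below is hypothetical: the actual `evaluate_clustering` signature in `DEC.py` may differ, and this only illustrates where a new metric would slot in:

```python
import numpy as np
from sklearn.metrics import silhouette_score

def evaluate_clustering(features: np.ndarray, labels: np.ndarray) -> dict:
    """Compute clustering quality metrics; extend with more entries as needed."""
    metrics = {}
    # Density-based methods such as DenMune may mark noise points with -1;
    # exclude them before scoring.
    mask = labels >= 0
    # Silhouette is only defined when at least 2 clusters remain.
    if len(np.unique(labels[mask])) >= 2:
        metrics["silhouette"] = silhouette_score(features[mask], labels[mask])
    return metrics
```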
## Troubleshooting

Common issues and their solutions:
- **Out of Memory Errors:** If you encounter memory issues, try reducing the batch size or using a smaller encoding dimension (a GPU memory-growth workaround is sketched after this list).
- **CUDA Errors:** Ensure that your TensorFlow installation is compatible with your CUDA version. You may need to install a specific TensorFlow version.
- **Data Loading Errors:** Verify that your Parquet files are in the correct format and location. Use the `--verbose` flag to get more detailed error messages.
- **Poor Clustering Results:** Experiment with different KNN values and encoding dimensions. Consider preprocessing your data further if results are consistently poor.
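For out-of-memory and CUDA issues, one standard TensorFlow-level mitigation (independent of this project's code) is to enable GPU memory growth so TensorFlow allocates memory on demand instead of reserving it all at startup:

```python
import tensorflow as tf

# Enable on-demand GPU memory allocation. This must run before the
# first GPU operation initializes the devices.
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)
```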
## License

This project is licensed under the GPL-3.0 license.
## Contact

Jonas C Kasmanas (Email: [email protected])
For bug reports and feature requests, please open an issue on the GitHub repository.
## References

Abbas, M., El-Zoghabi, A., & Shoukry, A. (2021). DenMune: Density peak based clustering using mutual nearest neighbors. Pattern Recognition, 109, 107589. https://doi.org/10.1016/j.patcog.2020.107589

Ren, Y., Wang, N., Li, M., & Xu, Z. (2020). Deep density-based image clustering. Knowledge-Based Systems, 197, 105841. https://doi.org/10.1016/j.knosys.2020.105841