lucaskearns/sc_transcriptomic_cell_type_classifier


Single Cell Transcriptomic Cell Type Classifier

The goal of this project was to build a containerized cell type classifier that predicts broad cell type categories from single-cell transcriptomic data. This was accomplished by training a logistic regression classifier (SGDClassifier with log_loss) on the Tabula Sapiens dataset.

Key features

** This classifier was trained in a resource-limited environment (a Docker container limited to 16 GB of RAM, running on an Apple M4 Pro processor). Consequently, most of the features revolve around reducing memory usage. **

  • Feature selection: the top 2,000 highly variable genes (HVGs) are calculated in a memory-efficient manner across all *.h5ad files
  • Incremental training with 'partial_fit', which allows training on the entire dataset even with limited memory
  • Global train / test splitting, which enables robust evaluation

Performance

When tested on the Tabula Sapiens dataset, the classifier attained a global accuracy of $\approx$ 72%. This is approximately 4x the accuracy of the naive majority-class baseline ( $\approx$ 18%).
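For reference, a majority-class baseline always predicts the most common label seen during training. The sketch below shows one way such a baseline can be computed. The `majority_class_accuracy` function and the toy labels are hypothetical and are not taken from the repository's evaluation script.

```python
import numpy as np


def majority_class_accuracy(train_labels, test_labels):
    """Accuracy obtained by always predicting the most frequent training label."""
    values, counts = np.unique(train_labels, return_counts=True)
    majority = values[np.argmax(counts)]
    return float(np.mean(np.asarray(test_labels) == majority))


# Toy labels, made up for illustration only
train = ["T cell"] * 5 + ["B cell"] * 3 + ["fibroblast"] * 2
test = ["T cell"] * 2 + ["B cell"] * 2 + ["fibroblast"] * 1

baseline = majority_class_accuracy(train, test)
```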

However, classification accuracy varied greatly across both broad cell class and tissue type. A more detailed overview / analysis of performance can be found here.

Classifier usage / Dependencies

The classifier training, evaluation, and usage workflow has been dockerized for easy setup. Consequently, a working Docker installation is required.

Before training, evaluating, or using the classifier, the repository must be cloned and the Docker image built.

Clone Repository

git clone https://github.com/lucaskearns/sc_transcriptomic_cell_type_classifier.git

Initialize Docker Environment

cd /path/to/sc_transcriptomic_cell_type_classifier/docker_files
docker build -t ml_image .

** Note: This command will build a docker image named ml_image, but the image can be named as desired. However, if the name is changed, subsequent commands will need to be adjusted accordingly. **

Classifier Training

To produce a logistic regression classifier using my framework, the Tabula Sapiens dataset is required. The code could be refactored to accommodate a different dataset fairly easily, but it would require modification to be compatible with the layout of the new data.

Make directory to house classifier and other output files

mkdir out_dir

Calculate Gene Variance

docker run \
--rm -it -v \
/path/to/sc_transcriptomic_cell_type_classifier/scripts/:/mounted_scripts \
-v \
/path/to/tabula_sapiens_data:/tabula_sapiens_data \
-v \
/path/to/out_dir:/out_dir \
ml_image \
python /mounted_scripts/find_hvg.py \
--h5ad_dir /tabula_sapiens_data \
--output_prefix /out_dir/gene_variances
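
One memory-efficient way to compute per-gene variance is to accumulate running sums over chunks of cells, so that the full expression matrix is never loaded at once. The sketch below illustrates that idea on synthetic data. It is an assumption about the general approach, not the implementation in find_hvg.py, and `streaming_gene_variance` is a hypothetical name.

```python
import numpy as np


def streaming_gene_variance(chunks):
    """Per-gene sample variance from an iterable of (cells x genes) arrays."""
    n = 0
    total = None
    total_sq = None
    for chunk in chunks:
        chunk = np.asarray(chunk, dtype=np.float64)
        if total is None:
            total = np.zeros(chunk.shape[1])
            total_sq = np.zeros(chunk.shape[1])
        n += chunk.shape[0]
        total += chunk.sum(axis=0)            # running sum per gene
        total_sq += (chunk ** 2).sum(axis=0)  # running sum of squares per gene
    mean = total / n
    return (total_sq - n * mean ** 2) / (n - 1)


# Synthetic data: 4 "genes" with different spreads, processed in 7 chunks
rng = np.random.default_rng(0)
data = rng.normal(scale=[0.5, 1.0, 3.0, 2.0], size=(1000, 4))
variances = streaming_gene_variance(np.array_split(data, 7))

# Indices of the 2 most variable genes, highest variance first
top2 = np.argsort(variances)[::-1][:2]
```

The sum-of-squares form is simple but can lose precision when means are large relative to the variance. A Welford-style update is a common alternative in that case.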

Train Classifier

docker run \
--rm -it -v \
/path/to/sc_transcriptomic_cell_type_classifier/scripts/:/mounted_scripts \
-v \
/path/to/tabula_sapiens_data:/tabula_sapiens_data \
-v \
/path/to/out_dir:/out_dir \
ml_image \
python /mounted_scripts/make_classifier.py \
--h5ad_dir /tabula_sapiens_data \
--variance_csv /out_dir/gene_variances.csv \
--clf_out_prefix /out_dir/user_generated_clf \
--out_split_csv_prefix /out_dir/train_test_split

out_dir will contain a sorted list of gene variances (gene_variances.csv), the train / test split (train_test_split.csv), and the classifier (user_generated_clf.joblib).

Classifier Evaluation

To evaluate the trained classifier simply run

docker run \
--rm -it -v \
/path/to/sc_transcriptomic_cell_type_classifier/scripts/:/mounted_scripts \
-v \
/path/to/tabula_sapiens_data:/tabula_sapiens_data \
-v \
/path/to/out_dir:/out_dir \
ml_image \
--split_csv /out_dir/train_test_split.csv \
--trained_clf /out_dir/user_generated_clf.joblib \
--h5ad_dir /tabula_sapiens_data \
--variance_csv /out_dir/gene_variances.csv \
--cm_out_prefix /out_dir/cm_out \
--cell_classification_out_prefix /out_dir/cell_classifications

This will write the tissue-wise, global, and class-wise accuracy (computed on the test portion of the train / test split) to stdout and save the corresponding histograms (/out_dir/*_accuracy.pdf). It will also produce a confusion matrix plot (/out_dir/cm_out.pdf) and a list of predicted and true assignments for each cell (/out_dir/cell_classifications.csv).
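The three kinds of accuracy can also be recomputed from a per-cell table of true and predicted labels. The sketch below uses a small in-memory DataFrame. The column names `tissue`, `true_class`, and `predicted_class` are assumptions for illustration and may not match the columns in cell_classifications.csv.

```python
import pandas as pd

# Toy per-cell table; column names are hypothetical
cells = pd.DataFrame({
    "tissue": ["lung", "lung", "blood", "blood", "blood"],
    "true_class": ["immune", "epithelial", "immune", "immune", "immune"],
    "predicted_class": ["immune", "immune", "immune", "immune", "stromal"],
})

# A cell counts as correct when the prediction matches the true label
cells["correct"] = cells["true_class"] == cells["predicted_class"]

global_accuracy = cells["correct"].mean()
class_accuracy = cells.groupby("true_class")["correct"].mean()
tissue_accuracy = cells.groupby("tissue")["correct"].mean()
```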

Classifier Usage

Trained classifiers can be easily loaded and used with joblib. An example Python script would be:

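import joblib
import numpy as np
import pandas as pd
import scanpy as sc

# Load the trained classifier
clf = joblib.load("/path/to/classifier.joblib")

# Load the cell data you want to classify (backed mode keeps X on disk)
cell_data = sc.read_h5ad("/path/to/log_normalized_scanpy_file.h5ad", backed="r")

# Load the top HVGs from the sorted variance CSV
hvg_count = 2000
variance_csv = "/path/to/gene_variances.csv"
variance_df = pd.read_csv(variance_csv)
ensembl_ids = variance_df["ensembl_id"]
HVG = np.array(ensembl_ids[:hvg_count])

# Map the HVG Ensembl IDs to column indices, in CSV order
# (assumes cell_data.var_names holds Ensembl IDs; adjust if they are stored in a .var column)
HVG_idxs = cell_data.var_names.get_indexer(HVG)
if (HVG_idxs < 0).any():
    raise ValueError("Some HVGs are missing from the input data")

# Subset features
features = cell_data.layers["log_normalized"][:, HVG_idxs]

# Predict
predicted_classes = clf.predict(features)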

The relevant folders could then be mounted and the python script run via the provided docker image.

Misc

**The gene variances, train / test split, and classifier produced when I ran the training pipeline are located in the gene_variance, train_test_split, and classifier folders within this repository. They are useful if you'd like to replicate my training conditions exactly, so that you can check whether your model improves accuracy without random variance between different splits affecting the comparison.**

References

Tabula Sapiens reveals transcription factor expression, senescence effects, and sex-specific features in cell types from 28 human organs and tissues Stephen R Quake, The Tabula Sapiens Consortium bioRxiv 2024.12.03.626516; doi: https://doi.org/10.1101/2024.12.03.626516
