The goal of this project was to build a containerized cell type classifier that could predict broad cell type categories from single-cell transcriptomic data. This was accomplished by training a Logistic Regression classifier (`SGDClassifier` with `log_loss`) on the Tabula Sapiens dataset.
** This classifier was trained in a resource-limited environment (a Docker container limited to 16 GB of RAM, running on an Apple M4 Pro processor); consequently, most of the features revolve around streamlining memory usage: **
- Feature selection by computing the top 2000 highly variable genes (HVGs) in a memory-efficient manner across all *.h5ad files
- Incremental training with `partial_fit`, enabling training on the entire dataset even with limited memory
- Global train/test splitting, enabling robust evaluation
When tested on the Tabula Sapiens dataset, the classifier attained a global accuracy of
However, classification accuracy varied greatly across both broad cell class and tissue type. A more detailed overview and analysis of performance can be found here.
The classifier training/evaluation/usage architecture has been dockerized for easy setup. Consequently, a valid Docker installation is required to use the ML architecture.
Prior to classifier training, evaluation, or use, the repository must be cloned and the Docker image built.
Clone Repository
git clone https://github.com/lucaskearns/sc_transcriptomic_cell_type_classifier.git
Initialize Docker Environment
cd /path/to/sc_transcriptomic_cell_type_classifier/docker_files
docker build -t ml_image .
** Note: This command builds a Docker image named ml_image, but the image can be named as desired. If the name is changed, subsequent commands will need to be adjusted accordingly. **
To produce a Logistic Regression classifier using my framework, the Tabula Sapiens dataset is required. This code could be refactored to accommodate a different dataset fairly easily, but it would require modification to be compatible with the layout of the new data.
Make directory to house classifier and other output files
mkdir out_dir
Calculate Gene Variance
docker run \
--rm -it \
-v /path/to/sc_transcriptomic_cell_type_classifier/scripts/:/mounted_scripts \
-v /path/to/tabula_sapiens_data:/tabula_sapiens_data \
-v /path/to/out_dir:/out_dir \
ml_image \
python /mounted_scripts/find_hvg.py \
--h5ad_dir /tabula_sapiens_data \
--output_prefix /out_dir/gene_variances
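The memory-efficient variance calculation can be done by accumulating per-gene sums and sums of squares one chunk at a time, so only one chunk of cells is ever in memory. This is a sketch of that general technique under the assumption that rows are cells and columns are genes; the actual find_hvg.py may use a different (e.g. more numerically stable) formulation.

```python
import numpy as np

def streaming_gene_variance(chunks, n_genes):
    """Accumulate per-gene sum and sum-of-squares across row chunks,
    then recover variance as E[x^2] - E[x]^2."""
    s = np.zeros(n_genes)   # running per-gene sum
    ss = np.zeros(n_genes)  # running per-gene sum of squares
    n = 0
    for X in chunks:        # each X: (cells_in_chunk, n_genes)
        s += X.sum(axis=0)
        ss += (X ** 2).sum(axis=0)
        n += X.shape[0]
    mean = s / n
    return ss / n - mean ** 2  # population variance per gene

# Verify against the in-memory result on synthetic data
rng = np.random.default_rng(1)
X_full = rng.normal(size=(1000, 5))
var_stream = streaming_gene_variance(np.array_split(X_full, 4), 5)
```

Sorting genes by this variance and keeping the top 2000 yields the HVG feature set without ever loading all files at once.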
Train Classifier
docker run \
--rm -it \
-v /path/to/sc_transcriptomic_cell_type_classifier/scripts/:/mounted_scripts \
-v /path/to/tabula_sapiens_data:/tabula_sapiens_data \
-v /path/to/out_dir:/out_dir \
ml_image \
python /mounted_scripts/make_classifier.py \
--h5ad_dir /tabula_sapiens_data \
--variance_csv /out_dir/gene_variances.csv \
--clf_out_prefix /out_dir/user_generated_clf \
--out_split_csv_prefix /out_dir/train_test_split
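The global train/test split recorded in train_test_split.csv can be sketched like this: shuffle once over all cells pooled across files, so the split is global rather than per-file, and write each cell's assignment to a CSV. The cell IDs and column names here are hypothetical; check the header of the CSV that make_classifier.py actually produces.

```python
import numpy as np
import pandas as pd

# Hypothetical cell IDs pooled from several .h5ad files
rng = np.random.default_rng(42)
cell_ids = [f"cell_{i}" for i in range(100)]

# Shuffle once over ALL cells so the split is global, not per-file
perm = rng.permutation(len(cell_ids))
n_test = int(0.2 * len(cell_ids))          # hold out 20% for testing
test_idx = set(perm[:n_test].tolist())

split_df = pd.DataFrame({
    "cell_id": cell_ids,
    "split": ["test" if i in test_idx else "train"
              for i in range(len(cell_ids))],
})
# split_df.to_csv("train_test_split.csv", index=False)
```

Persisting the split as a CSV is what lets evaluation (and any retraining) reuse exactly the same held-out cells.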
out_dir will contain a sorted list of gene variances (gene_variances.csv), the train / test split (train_test_split.csv), and the classifier (user_generated_clf.joblib).
To evaluate the trained classifier, run the evaluation script from the scripts directory (shown below as <evaluation_script>.py; substitute its actual filename):
docker run \
--rm -it \
-v /path/to/sc_transcriptomic_cell_type_classifier/scripts/:/mounted_scripts \
-v /path/to/tabula_sapiens_data:/tabula_sapiens_data \
-v /path/to/out_dir:/out_dir \
ml_image \
python /mounted_scripts/<evaluation_script>.py \
--split_csv /out_dir/train_test_split.csv \
--trained_clf /out_dir/user_generated_clf.joblib \
--h5ad_dir /tabula_sapiens_data \
--variance_csv /out_dir/gene_variances.csv \
--cm_out_prefix /out_dir/cm_out \
--cell_classification_out_prefix /out_dir/cell_classifications
This will write tissue-wise, global, and class-wise accuracy (in accordance with the train/test split) to stdout, along with corresponding histograms (/out_dir/*_accuracy.pdf) and a confusion matrix plot (/out_dir/cm_out.pdf). Additionally, a list of predicted and true assignments for each cell will be produced (/out_dir/cell_classifications.csv).
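The per-cell output also makes it easy to recompute accuracies yourself with pandas. The column names below ("true_class", "predicted_class") are assumptions; check the header of your cell_classifications.csv and adjust accordingly.

```python
import pandas as pd

# Stand-in for pd.read_csv("/out_dir/cell_classifications.csv");
# column names are hypothetical.
df = pd.DataFrame({
    "true_class": ["immune", "immune", "stromal", "epithelial"],
    "predicted_class": ["immune", "stromal", "stromal", "epithelial"],
})

# Global accuracy: fraction of cells where prediction matches truth
global_acc = (df["true_class"] == df["predicted_class"]).mean()

# Class-wise accuracy: same fraction, grouped by true class
class_acc = (
    df.assign(correct=df["true_class"] == df["predicted_class"])
      .groupby("true_class")["correct"]
      .mean()
)
print(global_acc)   # 0.75 on this toy data
```

The same groupby pattern on a tissue column (if present in the CSV) reproduces the tissue-wise accuracies.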
Trained classifiers can be easily loaded and used with joblib. An example Python script:
import scanpy as sc
import joblib
import pandas as pd
import numpy as np
# Load classifier
clf = joblib.load("/path/to/classifier.joblib")
# Load the cell data you want to classify
cell_data = sc.read_h5ad("/path/to/log_normalized_scanpy_file.h5ad")
# Load the HVGs from a sorted variance CSV
hvg_count = 2000
variance_df = pd.read_csv("/path/to/gene_variances.csv")
hvg_ids = np.array(variance_df["ensembl_id"][:hvg_count])
# Map HVG Ensembl IDs to column indices in the AnnData object
hvg_idxs = [cell_data.var_names.get_loc(g) for g in hvg_ids]
# Subset features to the HVG columns used at training time
features = cell_data.layers["log_normalized"][:, hvg_idxs]
# Predict
predicted_classes = clf.predict(features)
The relevant folders can then be mounted and the Python script run via the provided Docker image.
**The gene variances, train/test split, and classifier produced when I ran the pipeline are located in the gene_variance, train_test_split, and classifier folders within this repository. This is useful in instances where you'd like to exactly replicate my training conditions, e.g. to see whether your model improves accuracy while minimizing random variance between different splits.**
Quake SR, The Tabula Sapiens Consortium. Tabula Sapiens reveals transcription factor expression, senescence effects, and sex-specific features in cell types from 28 human organs and tissues. bioRxiv 2024.12.03.626516; doi: https://doi.org/10.1101/2024.12.03.626516