This repository contains the code for the capstone project "Identification of Disease-Relevant Cell-Types in Rheumatoid Arthritis Using an Integrated scRNA-seq/GWAS Approach" by Chris Teng, Rucha Deo, and Rachel Zeng for the Summer 2023 cohort of the University of Chicago's Masters of Biomedical Informatics program.
The Python modules of this repository create a series of wrappers around MAGMA and PLINK to handle munging and processing of input data to be used by scDRS, a tool which associates individual cells within scRNA-seq data with disease GWAS data. The modules are then used within the associated Jupyter notebooks.
The R module of this repository and the associated Quarto documents use Seurat to perform standard pre-processing on scRNA-seq data, then map disease scores and p-values (generated by scDRS) onto individual cells for visualization.
The expected flow of notebooks is:
make_reference.ipynb
-> gwas_processing.ipynb
-> celseq_processing.qmd
-> score_cells.ipynb
-> cell_score_viz.qmd
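The order above can also be expressed as a small script. This is only a sketch: it assumes the notebooks sit in the current directory and that the `jupyter` and `quarto` command-line tools are installed, neither of which is stated by this repository. The helper name `build_command` is hypothetical. The script only builds and prints the commands; running them is left to the reader.

```python
from pathlib import Path

# Notebook order as listed in this README.
NOTEBOOK_ORDER = [
    "make_reference.ipynb",
    "gwas_processing.ipynb",
    "celseq_processing.qmd",
    "score_cells.ipynb",
    "cell_score_viz.qmd",
]


def build_command(notebook: str) -> list[str]:
    """Return a command that executes one notebook (hypothetical helper)."""
    suffix = Path(notebook).suffix
    if suffix == ".ipynb":
        # Assumption: Jupyter notebooks are executed in place with nbconvert.
        return ["jupyter", "nbconvert", "--to", "notebook",
                "--execute", "--inplace", notebook]
    if suffix == ".qmd":
        # Assumption: Quarto documents are rendered with the quarto CLI.
        return ["quarto", "render", notebook]
    raise ValueError(f"Unsupported notebook type: {notebook}")


commands = [build_command(nb) for nb in NOTEBOOK_ORDER]
for command in commands:
    # To actually run a step: subprocess.run(command, check=True)
    print(" ".join(command))
```

Running the notebooks interactively in the same order works equally well; the script is just a compact way to record the dependency between steps.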
Python dependencies for this repository are handled with poetry. Alternatively, a requirements.txt file has been provided.
R dependencies for this repository are handled with renv.
In addition to Python and R dependencies, a local installation of both MAGMA and PLINK is required. To ensure that the notebooks run smoothly, make sure that both the MAGMA and PLINK installations are accessible through $PATH.
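A quick way to confirm that both tools are reachable before running the notebooks is sketched below. The executable names `magma` and `plink` are assumptions; a local installation may use a different name (for example `plink2`), in which case adjust `REQUIRED_TOOLS`. The helper `find_missing_tools` is hypothetical and not part of this repository.

```python
import shutil

# Assumed executable names; change these to match the local installation.
REQUIRED_TOOLS = ["magma", "plink"]


def find_missing_tools(tools):
    """Return the tools that cannot be located through PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = find_missing_tools(REQUIRED_TOOLS)
if missing:
    print("Not found on PATH: " + ", ".join(missing))
else:
    print("All required tools are reachable through PATH.")
```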
In each notebook, the run parameters are defined outside the function calls so that they can be modified easily to suit the local environment. File paths should be provided relative to the location of the notebook.
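As an illustration of this pattern, a parameter cell might look like the sketch below. All names here (`GWAS_FILE`, `from_notebook`, `process_gwas`) and the example file path are hypothetical and do not correspond to the actual notebooks; the stand-in function only echoes its inputs.

```python
from pathlib import Path

# Run parameters, kept outside the function call so they are easy to edit.
# Paths are written relative to the notebook's location (hypothetical values).
GWAS_FILE = "data/ra_gwas_summary_stats.tsv"
OUTPUT_DIR = "output"
TMP_DIR = "tmp"


def from_notebook(relative_path, notebook_dir=None):
    """Join a relative path onto the notebook directory.

    Assumption: the notebook is run from its own directory, so the current
    working directory is used when no directory is given.
    """
    base = Path.cwd() if notebook_dir is None else Path(notebook_dir)
    return base / relative_path


def process_gwas(gwas_file, output_dir, tmp_dir):
    """Hypothetical stand-in for a wrapper function; it only echoes inputs."""
    return {"gwas_file": gwas_file, "output_dir": output_dir, "tmp_dir": tmp_dir}


# The function call itself contains no hard-coded values.
run_config = process_gwas(
    from_notebook(GWAS_FILE),
    from_notebook(OUTPUT_DIR),
    from_notebook(TMP_DIR),
)
print(run_config)
```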
Each notebook should create/use both a tmp directory and an output directory. The output directory holds the primary/final outputs of the notebook, whereas the tmp directory contains work directories holding intermediate files that are generated as part of the workflow. Only the final outputs are listed below.
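The directory layout described above can be sketched as follows. The helper `prepare_dirs` and the work directory name are hypothetical; the notebooks may create these directories differently.

```python
from pathlib import Path


def prepare_dirs(notebook_dir, work_name):
    """Create output/ and tmp/<work_name>/ next to a notebook.

    Returns (output_dir, work_dir). Existing directories are reused.
    """
    notebook_dir = Path(notebook_dir)
    output_dir = notebook_dir / "output"      # primary/final outputs
    work_dir = notebook_dir / "tmp" / work_name  # intermediate files
    output_dir.mkdir(parents=True, exist_ok=True)
    work_dir.mkdir(parents=True, exist_ok=True)
    return output_dir, work_dir


# Example, when run from the notebook's own directory:
# output_dir, work_dir = prepare_dirs(".", "gwas_processing")
```

Keeping intermediate files under tmp means that directory can be deleted after a run without losing any final results.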
- a merged PLINK bed file
- a merged PLINK bim file
- a merged PLINK fam file
- a .genes.annot file mapping GWAS SNPs to their associated genes
- a .genes.out file containing the gene analysis results (i.e. which genes are significantly associated with the trait of interest)
- cell_type_distribution.png - a histogram of the cell type distribution
- pca_10dims.png - a PCA of the first 10 dimensions
- umap_cell_clusters_v1.png and umap_cell_clusters_v2.png - UMAPs of the single-cell data that attempt to recapitulate the cell clusters of the source article
- canonical_markers.png - a UMAP of the single-cell data annotated with canonical cell markers in Fibroblasts, Monocytes, B-cells, and T-cells
- CSVs of individual cells in the scRNA-seq data scored on their disease relevance
- plots of the disease relevance scores (as calculated by scDRS) mapped onto individual cells from scRNA-seq
- plots of the disease relevance p-values (as calculated by scDRS) mapped onto individual cells from scRNA-seq
- plot of markers that differentiate rheumatoid arthritis (RA)-relevant vs RA-irrelevant cell subpopulations within an analyzed cell type
- table of differentially expressed markers between RA-relevant and RA-irrelevant cell subpopulations within a defined cell type
- plot of distribution of RA associated cells (case) vs osteoarthritis (OA) associated cells (control) on the previously generated UMAP clusters
- a CSV of RA markers
- a CSV of markers differentiating RA and OA cells within a defined cell type
- a CSV of markers differentiating RA and OA cells within the disease relevant cells of a defined cell type
- a CSV of markers differentiating RA and OA cells within the disease irrelevant cells of a defined cell type