This repository contains a pipeline to reprocess the Human Cell Landscape data for cleavage site identification as reported in:
Fansler, M.M., Mitschka, S. & Mayr, C. Quantifying 3′UTR length from scRNA-seq data reveals changes independent of gene expression. Nat Commun 15, 4050 (2024). https://doi.org/10.1038/s41467-024-48254-9
🤔 Important Note: This pipeline is provided for the scientific record, not necessarily with reuse in mind. However, we made some engineering improvements when writing the analogous pipeline for mouse data (https://github.com/Mayrlab/mca-utrome), making it more geared toward reuse. In particular, we moved what are really pipeline parameters out of the Snakefile and into the config.yaml, where they belong. If you are considering rerunning this pipeline or applying it to other Microwell-seq data, you may want to start from that version instead, or at least incorporate those pipeline improvements here. Also, be mindful that both of these are resource-heavy pipelines. We may be able to provide useful intermediate files to expedite generating output variants that do not require rerunning alignments (open an Issue).
The folders in the repository have the following purposes:
data - (created at runtime) output data files
envs - Conda environment YAML files for recreating the execution environment
metadata - metadata files that annotate input data files
scripts - scripts used by the Snakefile
qc - (created at runtime) output quality checks
All code is expected to be executed with this repository as the present working directory.
The primary source code is found in the Snakefile and the scripts folder.
Files in the metadata folder describe most of the information necessary to download
the raw input sequencing files, as well as annotate the cells.
This pipeline also requires a HISAT2 index, which is not automatically retrieved. The location
of this should be specified with the hisatIndex key in the config.yaml.
This repository can be cloned with:
    git clone https://github.com/Mayrlab/hcl-utrome.git
This requires Conda/Mamba and Snakemake. If you do not already have a Conda installation, we strongly recommend Miniforge.
Two configuration options in config.yaml should be adjusted by the user prior to running:
tmpdir: temporary directory for scratch
hisatIndex: human HISAT2 index
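Before launching a run, it can help to confirm that both required settings point at something usable. The following is a minimal sketch and not part of the pipeline: the function names check_tmpdir and check_hisat_index are hypothetical, and it assumes that hisatIndex holds an index prefix (the value one would pass to hisat2 -x), so that the index files are named <prefix>.1.ht2 and so on (or .ht2l for large indexes). If your config stores something else, adjust accordingly.

```shell
# Pre-flight checks for the two required config.yaml settings.
# Both function names are hypothetical helpers, not part of this repository.

# Succeeds if the scratch directory exists and is writable.
check_tmpdir() {
    local dir="$1"
    [ -d "$dir" ] && [ -w "$dir" ]
}

# Succeeds if a HISAT2 index is present at the given prefix.
# Assumption: hisatIndex is an index prefix, so the first index file
# is <prefix>.1.ht2 (small index) or <prefix>.1.ht2l (large index).
check_hisat_index() {
    local prefix="$1"
    [ -f "${prefix}.1.ht2" ] || [ -f "${prefix}.1.ht2l" ]
}

# Example call (placeholder paths; substitute the values from your config.yaml):
#   check_tmpdir /path/to/scratch && check_hisat_index /path/to/index/prefix \
#       || echo "config.yaml needs attention" >&2
```

Both functions only inspect the filesystem, so they are safe to run on a login node before submitting any jobs.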
Optional parameters in the config.yaml that could be adjusted are:
minReadLength: the minimum read length required for the resulting merged read to be included
radiusGENCODE: radius for merging GENCODE transcripts
radiusPAS: radius for merging PolyASite entries
extUTR3: downstream distance from annotated gene locus to classify as "extended 3'UTR"
extUTR5: upstream distance from annotated gene locus to classify as "extended 5'UTR"
Additional parameters of interest in the Snakefile are:
epsilon: the initial radius within which read ends are merged to the mode
threshold: minimum TPM per cell type cutoff for filtering low-frequency cleavage sites
version: the human GENCODE version to be built upon
tpm: the minimum TPM threshold for PolyASite entries to be used as "supporting" evidence
likelihood: the minimum CleanUpdTSeq score that a cleavage site is not from internal priming to be considered a "likely" cleavage site
width: the width for truncating the UTRome
merge: the distance within which to merge 3'ends during scUTRquant quantification
The full pipeline can be executed simply with:
    snakemake --use-conda
We encourage HPC users to configure a Snakemake profile and use it via the --profile argument.
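Because the pipeline is resource heavy, previewing the job plan before committing to a full run may be worthwhile. The wrappers below are a small sketch: the function names preview_pipeline and run_with_profile are hypothetical, and the profile name is whatever you have configured for your cluster. They rely only on Snakemake's --dry-run and --profile options, and pass any extra arguments through unchanged.

```shell
# Hypothetical convenience wrappers around the snakemake command line.

# List the jobs Snakemake would run, without executing any of them.
preview_pipeline() {
    snakemake --use-conda --dry-run "$@"
}

# Run the pipeline through a named Snakemake profile.
# The first argument is the profile name; the rest are passed to snakemake.
run_with_profile() {
    local profile="$1"
    shift
    snakemake --use-conda --profile "$profile" "$@"
}

# Example calls (the profile name is a placeholder):
#   preview_pipeline
#   run_with_profile my-cluster-profile
```

Note that recent Snakemake releases may also require an explicit --cores or --jobs value for a real run; if so, it can be set in the profile or appended to either call.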