Hey everybody!
In this section you will find an explanation of the script that runs CellRanger pipelines, such as converting BCL files to fastq files, alignment with or without multiple samples, and various pipeline options such as whether to include introns in the alignment, whether to read 3’ or 5’, and so on.
Location: CellRangerIDE is currently hosted on GitHub at: https://github.com/yoavnahum21/CellRangerIDE.git.
You can find the implementation code in CellRangerIDE\Backhand\Scripts.
Instructions:
- Fill out the general config file.
- The general_config.yml file is located inside the const_files folder.
- In the pipeline field, choose the pipeline you want to run. Your options are:
a) mkfastq - creates fastq files from BCL files
b) count - creates count matrices and h5 files for a single sample
c) multi - creates count matrices and h5 files for multiple samples using hashtag oligos
d) vdj - creates count matrices for VDJ
e) mkref - creates a custom/filtered transcriptome reference - TBD
f) qc score - returns a score for the specific process
g) cellbender - removes ambient RNA noise
- According to the pipeline you have chosen, fill out the matching second config file, named {pipeline}_config.yml.
- The fields in the config.yml file are divided into several types:
a) Optional: this field is not necessary for the pipeline’s inputs.
b) Required: the pipeline won’t work if you leave this field out or submit something invalid.
c) Default: if you don’t submit anything, the pipeline uses a default input (the default is also written in the comment above the field).
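The three field types above can be sketched in Python as follows. This is only an illustration of the idea, not the code the script actually uses: the function name `resolve_field` and the `kind` argument are hypothetical, and treating the literal string "Default" as "nothing submitted" is an assumption.

```python
from typing import Any


def resolve_field(config: dict, key: str, kind: str, default: Any = None) -> Any:
    """Return the value to use for `key`, given its kind: optional, required or default.

    Hypothetical helper for illustration; the real script may handle this differently.
    """
    value = config.get(key)
    # Assumption: an empty string, an empty list, None or "Default" all mean "not submitted".
    empty = value is None or value == "" or value == [] or value == "Default"

    if kind == "required":
        if empty:
            raise ValueError(f"required field '{key}' is missing or empty")
        return value
    if kind == "default":
        return default if empty else value
    if kind == "optional":
        return None if empty else value
    raise ValueError(f"unknown field kind: {kind}")
```

For example, a required `id` must be present, while an empty optional field simply resolves to `None` and an untouched default field falls back to whatever default the caller passes in.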
- To run the cellranger software, use one of the following:
- Option 1: if you are running on Wexac, just run the following first:
ml CellRanger/9.0.0
ml CellRanger-ATAC/2.0.0
- Option 2: just download cellranger/cellranger-atac and add it to your path:
For cellranger:
#!/bin/bash
cd /opt
wget -O cellranger-9.0.1.tar.gz "https://cf.10xgenomics.com/releases/cell-exp/cellranger-9.0.1.tar.gz?Expires=1739846992&Key-Pair-Id=APKAI7S6A5RYOXBWRPDA&Signature=lWj4nhoNebGYpV7bZzkYzxxmnhdlFD0zuLKtbLQsqPevb5VRscVDDsEdlnXdjnavCB6vyQMJ8DzgH9CHZqFIWwIfJz8jL4iPDGXXIZq4zqW1LG46hR18xergOgDrLaysRxzUZiFI2BOimjDtARViyyxZRSeVEsN3oILMLpWRukOPRt3czKfSbffpRq4Nw-QSlsTQQovruwQ5x27AgZ7ENApYSOgGKF5GF~hbOJYVbTchDUHNvyHChwmLPgENTefM3ZeGS1-Vs0X2XUL~pbIDeVdVUvLrwj~McjPzZuvXq-XB26qbD3jWNQrhmEG31OUVUfGvsbC2xdkur9EJwTCssA__"
tar -xzvf cellranger-9.0.1.tar.gz
export PATH=/opt/cellranger-9.0.1:$PATH
For cellranger-atac:
#!/bin/bash
cd /opt
wget -O cellranger-atac-2.1.0.tar.gz "https://cf.10xgenomics.com/releases/cell-atac/cellranger-atac-2.1.0.tar.gz?Expires=1739846838&Key-Pair-Id=APKAI7S6A5RYOXBWRPDA&Signature=e91w5zq4U~sU4G3ibOj1fvO5HW19FrwFMs8WpsresMLUy~IoBbI2FfZbB3QsC1UvrXZjqZ2f4WEkLz36Ww7nfdI37-AkOnpaVZVt3gjwjnoUPfAbLdM3p1S37AgEtGJ00TOS4xzP3l1rxfV-9aGnIlGCVGojtQfT20L3j0mydvUVPmhvs2HXqzdbtgDcUeFU-d8YBt7GvcFrSaM6d4veWXgMKeX1K8fn7s9AlsvBfKeRAKTZu6UPK8w4DTbCpB9--nmTDNyKJjRH9I6AhIXp0NmDDJ81wuhVDQ5f6x0o1q0yX1UWnv8oWbvF5bAtUgLqF68E9v8L-2cANe1pgpjKCA__"
tar -xzvf cellranger-atac-2.1.0.tar.gz
export PATH=/opt/cellranger-atac-2.1.0:$PATH
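After either option, it can be useful to confirm that the aligner is actually reachable before starting a run. The sketch below uses Python's standard `shutil.which`; the helper name `find_aligner` is hypothetical and is not part of CellRangerIDE.

```python
import shutil
from typing import Optional


def find_aligner(name_or_path: str) -> Optional[str]:
    """Return the resolved executable path, or None if it cannot be found.

    Accepts either a bare name (looked up on PATH) or a path to the binary.
    """
    return shutil.which(name_or_path)


if __name__ == "__main__":
    for aligner in ("cellranger", "cellranger-atac"):
        location = find_aligner(aligner)
        print(f"{aligner}: {location or 'not found on PATH'}")
```

If an aligner is reported as not found, re-run the `ml` or `export PATH=...` step above in the same terminal session.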
- To run cellbender, first use the following environment:
cbenv.yaml
To run any other pipeline, use the following environment:
crenv.yaml
# In case you are using cellbender
conda env create -f cbenv.yml
conda activate cellbender
# In case you are using any other pipeline
conda env create -f crenv.yml
conda activate cellranger
NOTE:
- In the config files there are 3 types of values: string, list, int. Don't change the type of any field!
- If you want to run the mkfastq pipeline, first run:
module load bcl2fastq2
in the terminal. If you would rather use another bcl2fastq2 binary, just make sure you add its parent folder to your path using:
export PATH={your_bcl2fastq2_parent_folder_path}:$PATH
in the terminal.
- In the config files, several lists need to have the same number of elements (this should be clarified in the comments above each attribute).
- If you decide to use an aligner, we encourage you to use the multi pipeline. It generalizes to all kinds of feature types, including: Gene Expression, Antibody Capture, VDJ, etc.
- Please get in touch with Yoav with any further questions :)
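The note above about lists needing the same number of elements can be checked before a run with a few lines of Python. This is a minimal sketch: the helper name `check_equal_lengths` is hypothetical, and which fields must match is taken from the comments in each config file, not decided by this code.

```python
def check_equal_lengths(config: dict, keys: list) -> None:
    """Raise ValueError if the list fields named in `keys` differ in length."""
    lengths = {key: len(config[key]) for key in keys}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"list fields must have the same number of elements: {lengths}")


if __name__ == "__main__":
    # Illustrative values only; the field names come from the multi config described in this document.
    example = {
        "sample_name": ["gex_sample", "hto_sample"],
        "fastq_folders_name": ["gex_prefix", "hto_prefix"],
        "feature_types": ["Gene Expression", "Antibody Capture"],
    }
    check_equal_lengths(example, ["sample_name", "fastq_folders_name", "feature_types"])
    print("list lengths are consistent")
```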
- config_general.yml:
project_name: give your project a name
id: the name of your cellranger output, in case you are using one of its pipelines
pipeline: choose your pipeline out of the list: cellbender, demulti, flex, mkfastq, mkref, multi, qc
aligner_software_path: the path to the aligner (in case you are using one). If you have added your aligner to the $PATH environment variable, just use the aligner name instead. The aligner options are:
- cellranger
- cellranger-atac
running_machine: are you using Wexac or your PC? For Wexac, write Wexac or use Default; for your PC, write PC or leave it empty
aws_ec2: TBD
Runtime parameters: nothing to elaborate
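As a rough illustration of the constraints listed above, the following Python sketch checks the `pipeline` and `running_machine` values of an already-loaded general config. The function names are hypothetical, and the exact matching rules (for instance, case sensitivity) are assumptions rather than the script's documented behavior.

```python
# Pipeline names as listed for the `pipeline` field above.
ALLOWED_PIPELINES = {"cellbender", "demulti", "flex", "mkfastq", "mkref", "multi", "qc"}


def normalize_running_machine(value):
    """Map the documented spellings to one of two values: "Wexac" or "PC"."""
    if value in ("Wexac", "Default"):
        return "Wexac"
    if value in ("PC", "", None):
        return "PC"
    raise ValueError(f"unknown running_machine: {value!r}")


def check_general_config(config: dict) -> list:
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    if config.get("pipeline") not in ALLOWED_PIPELINES:
        problems.append(f"pipeline must be one of {sorted(ALLOWED_PIPELINES)}")
    try:
        normalize_running_machine(config.get("running_machine"))
    except ValueError as err:
        problems.append(str(err))
    return problems
```

Collecting problems in a list instead of failing on the first one lets you fix the whole config in one pass.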
- config_count_pipeline.yml:
id: A unique run ID string (e.g., sample345). The name is arbitrary and will be used to name the directory containing all pipeline-generated files and outputs.
alignment_ref_genome_file: Path to the folder containing a Cell Ranger GEX/ATAC reference.
fastq_path: Path to the folder containing the fastq files.
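To show how these three fields relate to the aligner's command line, here is a hedged Python sketch that turns them into a `count` invocation. `--id`, `--transcriptome` and `--fastqs` are standard `cellranger count` arguments, and `cellranger-atac count` takes `--reference` instead of `--transcriptome`. Whether CellRangerIDE assembles the command this way is an assumption, and `build_count_command` is a hypothetical name. Recent Cell Ranger versions also expect `--create-bam` to be set explicitly; it is left out here because this config section does not list it.

```python
def build_count_command(config: dict, aligner: str = "cellranger") -> list:
    """Build an argument list for `<aligner> count` from the count config fields."""
    # cellranger-atac names its reference argument differently from cellranger.
    ref_flag = "--reference" if aligner.endswith("cellranger-atac") else "--transcriptome"
    return [
        aligner,
        "count",
        f"--id={config['id']}",
        f"{ref_flag}={config['alignment_ref_genome_file']}",
        f"--fastqs={config['fastq_path']}",
    ]


if __name__ == "__main__":
    # Illustrative paths only.
    example = {
        "id": "sample345",
        "alignment_ref_genome_file": "/path/to/reference",
        "fastq_path": "/path/to/fastqs",
    }
    print(" ".join(build_count_command(example)))
```

Returning a list rather than a single string makes it safe to pass the result to `subprocess.run` without shell quoting issues.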
- config_cellbender_pipeline.yml:
data_path: the path to your cellranger output directory
chosen_pipeline: the pipeline you used to align your samples
- config_demulti_pipeline.yml:
adata_path: the path to the h5 file you wish to use
sample_names: an arbitrary name for each entry in adata_path
hto_list_per_sample: the hash names in each sample
demultiplex_method: we use 2 demultiplexing algorithms, hashsolo and demultiplex2; check the documentation of both to see how they differ.
hashsolo: https://www.sciencedirect.com/science/article/pii/S2405471220301952?via%3Dihub
demultiplex2: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03177-y
priors: only relevant when using hashsolo; these three elements describe the prior probabilities of each hypothesis: singlet, doublet, empty droplet
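A quick sanity check for the `priors` field might look like the sketch below. It assumes that the three hypotheses are exhaustive, so the priors should be probabilities that sum to 1; that requirement, and the helper name `check_priors`, are assumptions rather than something the config file states. The order of the three values is whatever the comment in the config file specifies.

```python
import math


def check_priors(priors: list) -> None:
    """Raise ValueError unless `priors` holds three probabilities that sum to 1."""
    if len(priors) != 3:
        raise ValueError(f"expected 3 priors, got {len(priors)}")
    if any(p < 0 or p > 1 for p in priors):
        raise ValueError(f"each prior must be between 0 and 1: {priors}")
    # Assumption: the three hypotheses cover every droplet, so the priors sum to 1.
    if not math.isclose(sum(priors), 1.0, abs_tol=1e-9):
        raise ValueError(f"priors must sum to 1, got {sum(priors)}")
```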
- config_flex_pipeline.yml:
alignment_ref_genome_file: the reference genome path for gene expression
When using flex you have 2 options:
Option 1: using an INCPM link
INCPM_link: the link to the fastqs that you get from INCPM
INCPM_directory: where all the data will be downloaded
Note: if you are using option 1, be aware that you will still be asked for more inputs while running main.py
Option 2: manual option
fastq_path: the paths to the fastqs' parent folder
probe_set_path:
create_bam: True/False, do you wish to include a BAM file in your output?
include_intros: True/False, do you wish to include introns while aligning?
expected_cell: an estimate of the number of expected cells in your experiment; you can leave it blank
sample_name: an arbitrary name for your sample, usually named after the prefix in the fastq
fastq_folders_name: must be the prefix of the fastqs.
E.g., the fastq file is: PAC-i03-HmYC003-UNT-BMd000-xIPx-1-GEX-C_S2_L001_I1_001.fastq.gz
The prefix is: PAC-i03-HmYC003-UNT-BMd000-xIPx-1-GEX-C
lanes_used: just leave as Default
multiplexing method: in case you are using hashes, choose your method: “feature_barcode” (hashtag oligos) or “cmo_barcode”
probe_barcode_csv: a csv file that describes your multiplexing experiment, in the cellranger csv format.
sample_id_probe: arbitrary name for your sample
probe_barcode_ids: arbitrary number for your barcode
probe_description: copied from sample_id_probe
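The fastq prefix expected by fastq_folders_name can be derived from a file name by stripping the Illumina-style suffix (`_S<n>_L<lane>_<read>_001.fastq.gz`). The sketch below is one way to do it in Python; the helper name `fastq_prefix` is hypothetical, and it assumes your files follow that naming scheme.

```python
import re

# Illumina-style suffix: _S<sample number>_L<3-digit lane>_<R1|R2|I1|I2>_001.fastq(.gz)
_SUFFIX = re.compile(r"_S\d+_L\d{3}_[RI]\d_001\.fastq(\.gz)?$")


def fastq_prefix(filename: str) -> str:
    """Return the part of a fastq file name that precedes the Illumina suffix."""
    prefix, replaced = _SUFFIX.subn("", filename)
    if replaced == 0:
        raise ValueError(f"not an Illumina-style fastq name: {filename}")
    return prefix


if __name__ == "__main__":
    name = "PAC-i03-HmYC003-UNT-BMd000-xIPx-1-GEX-C_S2_L001_I1_001.fastq.gz"
    print(fastq_prefix(name))  # PAC-i03-HmYC003-UNT-BMd000-xIPx-1-GEX-C
```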
TBD
- Config for the multi pipeline:
alignment_ref_genome_file: the reference genome path for gene expression
alignment_ref_vdj_file: the reference genome path for VDJ
When using multi you have 2 options:
Option 1: using an INCPM link
INCPM_link: the link to the fastqs that you get from INCPM
INCPM_directory: where all the data will be downloaded
Note: if you are using option 1, be aware that you will still be asked for more inputs while running main.py
Option 2: manual option
fastq_path: the paths to the fastqs' parent folder
create_bam: True/False, do you wish to include a BAM file in your output?
include_intros: True/False, do you wish to include introns while aligning?
expected_cell: an estimate of the number of expected cells in your experiment; you can leave it blank
R1_length: the number of nucleotides to use from R1
R2_length: the number of nucleotides to use from R2
sample_name: an arbitrary name for your sample, usually named after the prefix in the fastq
fastq_folders_name: must be the prefix of the fastqs.
E.g., the fastq file is: PAC-i03-HmYC003-UNT-BMd000-xIPx-1-GEX-C_S2_L001_I1_001.fastq.gz
The prefix is: PAC-i03-HmYC003-UNT-BMd000-xIPx-1-GEX-C
lanes_used: just leave as Default
multiplexing method: in case you are using hashes, choose your method: “feature_barcode” (hashtag oligos) or “cmo_barcode”
feature_types: for each entry you submitted in sample_name/fastq_folders_name, in the same order, add its type: Gene Expression, Antibody Capture, CRISPR Guide Capture, Multiplexing Capture, VDJ-B, VDJ-T, VDJ-T-GD, Antigen Capture #5', Antigen Capture only
In case you are using feature_barcode:
feature_reference_csv: in case you have a prepared csv, add its path
Otherwise, fill in the fields manually:
hto_id: give the hashes a name
hto_names: usually the same as hto_id
hto_read: declare which read contains your hashes: “R1” or “R2”
hto_pattern: describe where exactly the barcode is located in the read
hto_sequence: the sequence of each hash
HTO_feature_type: most of the time this is Antibody Capture
In case you are using cmo_barcode:
cmo_barcode_csv: in case you have a prepared csv, add its path
sample_id_cmo:
cmo_id:
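For the feature_barcode case, the manual hto_* fields map naturally onto the columns of a Cell Ranger feature reference CSV (`id`, `name`, `read`, `pattern`, `sequence`, `feature_type`). The Python sketch below shows that mapping. It assumes that hto_id, hto_names and hto_sequence are lists of equal length and that hto_read, hto_pattern and HTO_feature_type are single values shared by all hashes. Those assumptions, the helper name `feature_reference_csv` and the example values are illustrative, not taken from the script.

```python
import csv
import io

# Column names of a Cell Ranger feature reference CSV.
COLUMNS = ["id", "name", "read", "pattern", "sequence", "feature_type"]


def feature_reference_csv(cfg: dict) -> str:
    """Render the hto_* config fields as feature reference CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(COLUMNS)
    # Assumption: one row per hash, with read, pattern and feature type shared by all rows.
    for hto_id, name, sequence in zip(cfg["hto_id"], cfg["hto_names"], cfg["hto_sequence"]):
        writer.writerow(
            [hto_id, name, cfg["hto_read"], cfg["hto_pattern"], sequence, cfg["HTO_feature_type"]]
        )
    return buffer.getvalue()


if __name__ == "__main__":
    # Made-up example values; replace them with the details of your own hashing kit.
    example = {
        "hto_id": ["HTO1", "HTO2"],
        "hto_names": ["HTO1", "HTO2"],
        "hto_read": "R2",
        "hto_pattern": "5P(BC)",
        "hto_sequence": ["ACGTACGTACGTACG", "TGCATGCATGCATGC"],
        "HTO_feature_type": "Antibody Capture",
    }
    print(feature_reference_csv(example))
```

If you already have a prepared csv, pointing feature_reference_csv at it avoids this step entirely.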
- config_qc.yml:
Nothing needs to be added