This repository contains scripts that may be helpful when working with the Decombinator software.
This script should be run using R.
The folder also contains examples of the plots produced by the script.
This script calculates the TCR overlap (using the 5-part Decombinator id, DCR) between any two samples produced using Collapsinator from Decombinator V4.
The overlap is calculated as follows: let A be the overlap matrix. Then for any two samples i and j, the overlap with respect to row i is

A[i,j] = (number of distinct DCRs found in both i and j) / (number of unique DCRs in i)

Note that this is in general asymmetric (A[i,j] is not equal to A[j,i]).
The overlap matrix is then plotted as a heatmap, with red squares marking an overlap greater than (mean + 3 standard deviations). This threshold can be adjusted in the script.
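To make the calculation concrete, here is a minimal Python sketch of the overlap matrix and the flagging threshold (illustrative only; the actual script is written in R, and the sample data here is hypothetical):

```python
# Minimal sketch of the asymmetric overlap calculation and the
# mean + 3 SD flagging threshold (illustrative; the real script is in R).
# Each sample is represented as a set of unique DCR identifiers.
import statistics

samples = {  # hypothetical data
    "s1": {"dcr_a", "dcr_b", "dcr_c"},
    "s2": {"dcr_b", "dcr_c", "dcr_d", "dcr_e"},
    "s3": {"dcr_a", "dcr_f"},
}

# A[i,j] = |DCRs shared by i and j| / |unique DCRs in i|
overlap = {
    (i, j): len(samples[i] & samples[j]) / len(samples[i])
    for i in samples
    for j in samples
}

print(overlap[("s1", "s2")])  # 2/3 ~= 0.667
print(overlap[("s2", "s1")])  # 2/4 = 0.5 -- differs, hence asymmetry

# Flag off-diagonal entries above mean + 3 standard deviations
off_diag = [v for (i, j), v in overlap.items() if i != j]
threshold = statistics.mean(off_diag) + 3 * statistics.stdev(off_diag)
flagged = [(i, j) for (i, j), v in overlap.items() if i != j and v > threshold]
```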
If comparing all samples in a particular sequencing run, note that as some runs contain several samples from the same individual/s, the absolute expected 'background' overlap will differ between runs.
The script calculates and plots the overlap of the alpha files first, followed by the beta files.
This script generates a summary .csv file of all the data stored in the Logs folder within the Decombinator repository. It should be run using Python 3.
How to run:
python LogSummary.py -l /path/to/LogsFolder/ -o /path/to/outfile.csv
The output .csv file contains the following fields:
- sample
- NumberReadsInput
- NumberReadsDecombined
- PercentReadsDecombined
- UniqueDCRsPassingFilters
- TotalDCRsPassingFilters
- PercentDCRPassingFilters(withbarcode)
- UniqueDCRsPostCollapsing
- TotalDCRsPostCollapsing
- PercentUniqueDCRsKept
- PercentTotalDCRsKept
- AverageInputTCRAbundance
- AverageOutputTCRAbundance
- AverageRNAduplication
By default, the output summary lists samples containing 'alpha' in their name first, followed by those containing 'beta', then any samples containing neither, with each group sorted alphabetically.
Alternatively, an additional argument can be used to specify the order of samples in the output summary file:
python LogSummary.py -l /path/to/LogsFolder/ -o /path/to/outfile.csv -s /path/to/order/of/samples.csv
The first column of this file should contain the sample names in the desired order, and the file should be comma-separated if it contains more than one column. This is designed to match the formatting of the index sample sheet used with Demultiplexor (which can simply be re-used here for the summary ordering).
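For reference, the default ordering (alpha, then beta, then the rest) can be sketched as follows (illustrative Python; not the script's actual code):

```python
# Illustrative sketch of the default output ordering: samples containing
# 'alpha' first, then 'beta', then everything else, each group sorted
# alphabetically. Sample names are hypothetical.
def default_order(sample_names):
    alphas = sorted(s for s in sample_names if "alpha" in s)
    betas = sorted(s for s in sample_names if "beta" in s)
    others = sorted(
        s for s in sample_names if "alpha" not in s and "beta" not in s
    )
    return alphas + betas + others

print(default_order(["donor2_beta", "donor1_alpha", "control1", "donor2_alpha"]))
# ['donor1_alpha', 'donor2_alpha', 'donor2_beta', 'control1']
```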
This script should be run using Python 3.
This script can be used to randomly sub-sample a file produced from the Decombinator pipeline (.fq, .n12, .freq, .cdr3, .np, .dcrcdr3) down to a specified sample size.
How to run:
python RandomlySample.py -in /path/to/infile.n12 -n 50
where the -in argument should be supplied with the file to be subsampled, and the -n argument with the desired final sample size. The output filename is the input filename prefixed with subsampled_. Note that the input path will not be retained: the output file will be written to the current working directory. Similarly, the Logs file will be stored in the current working directory.
To generate a subsampled file and Logs file in the Decombinator directory, run from within Decombinator:
python /path/to/Decombinator-Tools/RandomlySample.py -in infile.n12 -n 50
A full list of arguments can be viewed by running:
python RandomlySample.py -h
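Conceptually, the subsampling step works like the following sketch (illustrative only; the actual script also handles gzipped input, FASTQ's four-line records, output naming, and Log files):

```python
# Minimal sketch of random subsampling for line-based formats such as
# .n12 or .freq, where each line is one record (illustrative only).
import random

def subsample_lines(in_path, out_path, n, seed=None):
    rng = random.Random(seed)
    with open(in_path) as f:
        records = f.readlines()
    if n > len(records):
        raise ValueError("requested sample size exceeds the number of records")
    with open(out_path, "w") as f:
        f.writelines(rng.sample(records, n))

# e.g. subsample_lines("infile.n12", "subsampled_infile.n12", 50)
```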
This script should be run using Python 3.
This script can be used to generate a histogram plot displaying the frequency of average UMI cluster sizes present in collapsed TCR repertoire data. It should be run as an optional analysis tool in the Collapsinator stage of the Decombinator pipeline.
If Collapsinator has been run using the -uh argument, a .csv file will be saved to the logs folder with the suffix _UMIhistogram.csv. Supplying this file to the UMIHistogram script will generate the histogram.
How to run:
python UMIHistogram.py -in /path/to/filename_UMIhistogram.csv
The histogram plot will be saved to the same path as the input file. To change the output file, supply the -o argument:
python UMIHistogram.py -in /path/to/filename_UMIhistogram.csv -o /path/to/outfile.png
Additional arguments include -b to adjust the histogram bin size, -c to change the colour of the histogram (supply as text, e.g. pink, or as hexcode within quotes, e.g. '#34623F'), and -d to change the DPI of the output plot:
python UMIHistogram.py -in /path/to/filename_UMIhistogram.csv -b 80 -c '#34623F' -d 300
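A minimal sketch of the kind of plot produced, assuming (hypothetically) that the _UMIhistogram.csv file holds one average UMI cluster size per row:

```python
# Illustrative histogram sketch (the real script's CSV layout and styling
# may differ). Assumes one average UMI cluster size per row.
import csv
import matplotlib.pyplot as plt

with open("filename_UMIhistogram.csv") as f:
    sizes = [float(row[0]) for row in csv.reader(f) if row]

plt.hist(sizes, bins=80, color="#34623F")  # analogues of -b and -c
plt.xlabel("Average UMI cluster size")
plt.ylabel("Frequency")
plt.savefig("outfile.png", dpi=300)        # analogue of -d
```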
This script should be run using Python 3.
The first five comma-delimited fields in a Decombinator or Collapsinator output file are referred to as the DCR identifier. The first two of these five fields are integer indices that correspond to V and J genes.
The DCRtoGeneName script reads in a file containing a DCR identifier and converts the first two fields (the V and J indices) to the real gene names. The output (path and) file name is the same as the input name, but the file extension is prefixed with tr_, e.g. example.n12.gz will convert to example.tr_n12.gz.
How to run:
python DCRtoGeneName.py -in path/to/input_alpha_file.n12.gz
If the chain name is not present in the input file name, it should be supplied by the user, e.g.:
python DCRtoGeneName.py -in path/to/input_file.freq -c beta
The default species for this script is assumed to be human, but can be changed with the -sp argument, e.g. -sp mouse. A full list of additional arguments can be viewed by running:
python DCRtoGeneName.py -h
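The conversion itself amounts to an index lookup, as in this sketch (hypothetical gene lists shown; the real script loads the names from the Decombinator-Tags-FASTAs reference files for the relevant chain and species):

```python
# Conceptual sketch of the V/J index-to-name lookup (hypothetical gene
# lists; the real reference data comes from Decombinator-Tags-FASTAs).
v_genes = ["TRBV1", "TRBV2", "TRBV3"]  # index 0, 1, 2, ...
j_genes = ["TRBJ1-1", "TRBJ1-2"]

def translate_dcr_line(line):
    fields = [f.strip() for f in line.rstrip("\n").split(",")]
    fields[0] = v_genes[int(fields[0])]  # first field: V index
    fields[1] = j_genes[int(fields[1])]  # second field: J index
    return ", ".join(fields)

print(translate_dcr_line("1, 0, 5, 3, ACGTACGT"))
# TRBV2, TRBJ1-1, 5, 3, ACGTACGT
```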
The automatic test data generator can be used to produce customised FASTQ files as test input for the Decombinator script. This script should be run using Python 3.
Generic TCR alpha and beta chain FASTQ files for Decombinator can be generated by simply running:
python TestDataGenerator.py
Separate FASTQ files containing alpha and beta TCR sequences will be saved to a newly created TestData directory. Each file will contain 100 FASTQ reads, each containing a V and a J tag. By default, these reads are generated in the reverse direction (which is also the default for Decombinator) and include dummy barcodes (with a default barcode length of 42).
The id of each read in the FASTQ file contains information regarding which chain is present in the read, the direction of the read, whether the read contains one (SingleTag) or two (DualTag) tags, and a read-id.
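To make the read structure concrete, here is a hypothetical sketch of assembling one barcoded read (the real generator samples genuine V and J tag sequences, handles orientation and errors, and may order the components differently):

```python
# Hypothetical sketch of one barcoded test read; component order and
# naming here are illustrative, not the generator's actual layout.
import random

def make_read(read_id, v_tag, j_tag, read_len=100, barcode_len=42):
    rng = random.Random(0)
    filler_len = read_len - len(v_tag) - len(j_tag)
    filler = "".join(rng.choice("ACGT") for _ in range(filler_len))
    barcode = "".join(rng.choice("ACGT") for _ in range(barcode_len))
    seq = barcode + v_tag + filler + j_tag  # dummy barcode attached to read
    qual = "I" * len(seq)                   # dummy quality string
    return f"@{read_id}\n{seq}\n+\n{qual}\n"

print(make_read("alpha_Reverse_DualTag_1", "ACGT" * 5, "TGCA" * 5))
```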
The test data generator can be provided a number of input arguments to customise the output test data. Information on these arguments can be found by running:
python TestDataGenerator.py -h
| Argument | Description |
|---|---|
| -d | change the name of the output directory for the test data |
| -rl | change the read length of the output sequences (before barcoding) |
| -n | change the total number of output reads per chain |
| -v | choose whether to produce dual tag or single tag output data |
| -mc | produce a single output file of mixed alpha and beta chain TCRs |
| -nbc | produce non-barcoded data |
| -bl | change the length of the attached barcode |
| -it | provide a percentage value for how many of the reads contain TCR sequences |
| -ie | provide a percentage value for how many tags contain one "sequencing" error |
| -mt | provide a percentage value for how many alpha and beta tags will have an overlapping match in single tag mode |
| -tf | direct the script to the local directory where tag files are stored; useful for offline work. By default the script will first attempt to download the tags from the Decombinator-Tags-FASTAs directory |
| -or | change the orientation of the output reads: 'reverse', 'forward', or 'both' (50% forward and 50% reverse) |
| -sp | change the species ('human' or 'mouse') |
| -ol | change the oligo used ('i8' or 'm13') |
- To create a file of 250 non-barcoded, mixed alpha and beta chain reads of 160 bp in length, stored in a directory called MyData, run:
python TestDataGenerator.py -nbc -mc -n 250 -rl 160 -d MyData
- To create alpha and beta files for Single Tag Decombinator, in the forward direction, with 40% of sequences containing a sequencing error (1 bp substitution), run:
python TestDataGenerator.py -v s -or forward -ie 40
The run_scripts folder contains a number of bash scripts that can be used or modified as a shorthand to run over multiple input files in a directory.
This guide is only relevant for members of the Innate2Adaptive group at UCL. At present it is designed to work with batches that have samples generated by a single protocol and species (e.g. RACE on Humans). If you are from another group utilising the COLCC resource, feel free to adapt the scripts found here. If you would like help with this, raise an issue in this repository and we will try to assist.
- You will need access to the UCL CS HPC cluster which can be requested here.
- You will need to be added to the `inn2adap` user group to get access to our project storage. Ask another member of the group to give you the directory path and request access for you when you obtain your login details.
  - In order to `cd` to the project storage you will need to `cd` directly to the project; due to dynamic mounting, the directory will not appear in `ls` or a directory GUI.
  - For convenience, create a bash script in your home directory with a `cd` command to the project directory.
- Create a new batch directory in the project directory and place your `.tar` file there. Make sure the `.tar` file is also backed up to the RDS, as the project directory is not itself backed up.
  - See the UCL CS HPC website for information on how to set up port-forwarding if you need to `scp` data to the cluster.
- Copy the contents of `Decombinator-Tools/jobs/cshpc/` to your batch directory.
  - There are 4 scripts: `submit.sh`, which you will call to unpack your `.tar` file and submit all of your samples as jobs; `dcr_job.qsub.sh`, the job script which instructs the cluster how to treat your task; `resubmit.sh`, a handy script that you can use to resubmit any jobs which fail for various reasons (see the troubleshooting section below); and `repack.sh`, which will format your processed data into the Chain lab standard storage format.
- Check `dcr_job.qsub.sh` to make sure it has the correct pipeline settings for your data.
  - If you have samples processed by different protocols in your batch, you will have to split them manually and edit `submit.sh` to skip the already completed steps.
- Run `sh submit.sh` to submit all samples in the `.tar` file to the job scheduler.
- Check that your jobs begin running successfully with `qstat`.
- TROUBLESHOOTING: If a subset of jobs fails to complete, it may be due to memory or runtime limits. Investigate why using the `.o` files in each job subdirectory (feel free to ask for help with this), edit the copy of `dcr_job.qsub.sh` located in the top level of your batch directory, and then run `sh resubmit.sh`. This will resubmit all jobs that failed to complete, but using your new memory or runtime limits.
  - If your `.o` file ends with `MemoryError`, increase the memory requested in `dcr_job.qsub.sh`.
  - If your `.o` file ends without any error message, the job likely hit the runtime limit. Increase the runtime requested in `dcr_job.qsub.sh`.
  - For other errors, inspect the error stack message carefully, as often it will be related to naming convention issues. Feel free to open an issue on this repository for help.
- Create a `.csv` with your batch name and the order in which you would like your samples presented in the summary sheet.
- Once all jobs are complete, check you have all the results you expected with `find temp/ -type f -name "*tsv*" | wc -l` (remove `| wc -l` to see filenames rather than a count), and then run `sh summarise.sh` to package the results for transfer to the RDS.
- When you have verified that the data is safely saved onto the RDS, delete the batch directory containing the analysis on the CS HPC cluster. Only you have the permissions to delete this folder, and we have a 500GB limit in the project directory.
This script will run Decombinator for all .fq and .fq.gz files in the Decombinator directory. Note that this script assumes the Decombinator and Decombinator-Tools repositories are located in the same parent directory. Otherwise, a path to the Decombinator repository should be supplied. Additionally, the Logs folder for these files will be located in the working directory from which the script is run - to keep log files in the Decombinator repository, follow step 4 below.
How to run:
1. From the run_scripts folder:
./Decombinator.sh
or
source Decombinator.sh
2. From the Decombinator-Tools folder:
./run_scripts/Decombinator.sh
or
source run_scripts/Decombinator.sh
3. From the Decombinator folder:
source /path/to/Decombinator-Tools/run_scripts/Decombinator.sh
4. Supplying a path to the Decombinator repository:
source Decombinator.sh /path/to/Decombinator
This script will run Collapsinator for all .n12.gz files in the Decombinator directory in parallel. Note that this script assumes the Decombinator and Decombinator-Tools repositories are located in the same parent directory. Otherwise, a path to the Decombinator repository should be supplied. Additionally, the Logs folder for these files will be located in the working directory from which the script is run - to keep log files in the Decombinator repository, follow step 4 below.
How to run:
1. From the run_scripts folder:
./Collapsinator.sh
or
source Collapsinator.sh
2. From the Decombinator-Tools folder:
./run_scripts/Collapsinator.sh
or
source run_scripts/Collapsinator.sh
3. From the Decombinator folder:
source /path/to/Decombinator-Tools/run_scripts/Collapsinator.sh
4. Supplying a path to the Decombinator repository:
source Collapsinator.sh /path/to/Decombinator
The Job Scripts directory contains an example job script for running the Decombinator test data, compatible with the University College London cluster setup. Some modifications may be required for other setups.
- You will need to install Decombinator-Tools in your local space on the cluster. After logging in, run the command: `git clone https://github.com/innate2adaptive/Decombinator-Tools.git`
- To use the job script, you must first have Decombinator and its required packages installed within a conda virtual environment in your local space on the cluster. Instructions are provided in the main Decombinator README.
- After Decombinator has been installed in your local space on the cluster, create an output directory for your data in your local Scratch space: `mkdir Scratch/DCRExample`
- Next, you will need to modify step 6 labelled in the job script `testdatajob.sh`, replacing `<user-id>` with your own user id. If you have named your output directory differently to `DCRExample`, this should also be changed in step 6. If you have named your conda environment differently from the example environment, this should be changed in step 9 of the script.
- After making these changes, submit your job: `qsub run_scripts/jobscripts/testdatajob.sh`
- You can monitor your job using the `qstat` command. The initial state should show `qw` when waiting in the queue, and `r` when the job is running.
- The output of running the test data will be saved to `Scratch/DCRExample/files_from_job_<job_number>.tar.gz`. Extract your data, replacing `<job_number>` with your cluster job number, using: `tar -xvzf Scratch/DCRExample/files_from_job_<job_number>.tar.gz`
This template script can be modified and extended to run the Decombinator pipeline on any appropriate data.
This script should be run using Python 3.
This script can be used to automatically run the test data stored in the Decombinator-Test-Data repository through the Decombinator pipeline. The script assumes the directory structure shown below, where Decombinator, Decombinator-Tools, and Decombinator-Test-Data are stored in the same parent directory:
+-- parent-directory
| +-- Decombinator
| +-- Decombinator-Test-Data
| +-- Decombinator-Tools
| | +-- RunTestData.py
How to run:
python RunTestData.py
Note that the output files and Log files will be stored in the current working directory (Decombinator-Tools). To generate output and Log files in the Decombinator directory, run from within the Decombinator directory:
python /path/to/Decombinator-Tools/RunTestData.py
The RunTestData script can be supplied with two checkpoint arguments, -c1 and -c2, that tell the script at which part of the pipeline to start and end. For example, to run the test data through only Demultiplexor and Decombinator, run:
python /path/to/Decombinator-Tools/RunTestData.py -c1 demultiplexor -c2 decombinator
or to run the test data through only Collapsinator, run:
python /path/to/Decombinator-Tools/RunTestData.py -c1 collapsinator -c2 collapsinator
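Conceptually, the checkpoints select a contiguous slice of the pipeline stages, as in this sketch (illustrative only; not the script's actual code, and the stage list shown is an assumption):

```python
# Illustrative checkpoint logic: run the contiguous slice of stages from
# -c1 to -c2 inclusive. The stage list here is an assumption.
STAGES = ["demultiplexor", "decombinator", "collapsinator", "cdr3translator"]

def stages_to_run(c1=STAGES[0], c2=STAGES[-1]):
    return STAGES[STAGES.index(c1) : STAGES.index(c2) + 1]

print(stages_to_run("demultiplexor", "decombinator"))
# ['demultiplexor', 'decombinator']
print(stages_to_run("collapsinator", "collapsinator"))
# ['collapsinator']
```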