SDrecall is a specialized variant caller designed to improve variant detection in segmental duplication (SD) regions where conventional callers often struggle due to mapping ambiguity.
SDrecall works by:
- Identifying SDs for realignment and recall: SDs overlapping user-defined target regions (protein-coding regions by default) and regions covered by multi-aligned reads
- Identifying the homologous counterparts of the targeted SDs and creating masked reference genomes for these regions
- Recruiting reads from the counterparts and realigning them
- Phasing the realigned reads and assembling them into micro-haplotypes, then eliminating less optimal haplotypes from the realignments with Linear Integer Constraint Programming
- Calling variants with BCFtools on the filtered realignments
- Merging the resulting variants with conventional caller output (suggested follow-up)
- Annotating common variants using a cohort VCF (optional)
SDrecall significantly improves small variant (SNV and small indel) detection in SDs, where conventional callers typically miss variants.
The resulting callset does not reach the precision of callsets generated by GATK or DeepVariant. SDrecall is primarily designed for the molecular diagnosis of patients with Mendelian diseases. Despite the limited precision, its false positive (FP) control measures limit the number of FPs that survive to become causal variant candidates. In a systematic evaluation targeting the entire exome, SDrecall left only 1-3 rare and deleterious FPs to cloud the final selection of causal variants among the candidates, while raising detection sensitivity to approximately 95%.
For this use case, SDrecall provides a comprehensive inspection of SD regions that would otherwise be missed by traditional NGS analysis pipelines, while introducing only marginal noise that could interfere with causal variant identification.
Users should first clone this repository to a local directory.
For mamba/conda users, create an environment from YAML:
mamba env create -f ./env/SDrecall.yml
mamba activate SDrecall
Given SDrecall's long list of dependencies, we are still working on a Dockerfile / Singularity recipe. Contributions are most welcome.
SDrecall provides three main execution modes:
# Run with conventional caller integration and cohort annotation
SDrecall run \
-i input.bam \
-r /path/to/reference.fa \
-m /path/to/sd_map.bed \
-b /path/to/target.bed \
-o /path/to/output_dir \
-t 16 \
-s <sample_id> \
--target_tag <label_of_target_region> \
--conventional_vcf /path/to/deep_variant.vcf \
--caller_name DeepVariant \
--cohort_vcf /path/to/control_cohort.vcf \
--inhouse_common_cutoff 0.01 \
--cohort_conf_level 0.999
# Run the complete SDrecall pipeline
SDrecall run \
-i input.bam \
-r /path/to/reference.fa \
-m /path/to/sd_map.bed \
-o /path/to/output_dir \
-b /path/to/target.bed \
-t 16 \
-s <sample_id> \
--target_tag <label_of_target_region>
# Run only the preparation phase (identifies SD regions, creates masked references)
SDrecall prepare \
-i input.bam \
-r /path/to/reference.fa \
-m /path/to/sd_map.bed \
-o /path/to/output_dir \
-b /path/to/target.bed \
-t 16 \
-s <sample_id> \
--target_tag <label_of_target_region> \
--high_quality_depth 10 \
--minimum_depth 3
# Run only realignment and recall (requires preparation output)
SDrecall realign \
-i input.bam \
-r /path/to/reference.fa \
-m /path/to/sd_map.bed \
-b /path/to/target.bed \
-o /path/to/output_dir \
-s <sample_id> \
-t 16 \
--target_tag <label_of_target_region> \
--numba_threads 4
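When SDrecall is launched from a workflow script rather than typed by hand, the command line can be assembled programmatically. The following is a minimal sketch that only uses the flags shown in the commands above; the helper name build_sdrecall_run_command and all file paths are hypothetical, and the sketch builds and prints the command without executing it.

```python
import shlex


def build_sdrecall_run_command(input_bam, ref_genome, sd_map, target_bed,
                               outdir, sample_id, target_tag, threads=16):
    """Assemble the argument list for `SDrecall run` (hypothetical helper).

    Only flags that appear in the example commands of this README are used.
    """
    return [
        "SDrecall", "run",
        "-i", input_bam,
        "-r", ref_genome,
        "-m", sd_map,
        "-b", target_bed,
        "-o", outdir,
        "-t", str(threads),
        "-s", sample_id,
        "--target_tag", target_tag,
    ]


# Placeholder paths and labels, for illustration only.
cmd = build_sdrecall_run_command(
    input_bam="input.bam",
    ref_genome="/path/to/reference.fa",
    sd_map="/path/to/sd_map.bed",
    target_bed="/path/to/target.bed",
    outdir="/path/to/output_dir",
    sample_id="sample01",
    target_tag="exome",
)

# Print a shell-safe version of the command for inspection or logging.
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) once the SDrecall environment is activated.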
This stage identifies SD regions with mapping issues:
- Extracts multi-aligned regions based on mapping quality (pick_multialigned_regions())
- Compares to a reference SD map
- Creates a multiplex graph with SD pairs (build_SD_graph())
- Builds masked reference genomes for each SD group (build_beds_and_masked_genomes())
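To illustrate how SD pairs can be turned into SD groups, here is a simplified sketch that treats each SD interval as a graph node, each SD pair as an edge, and each connected component as one group of homologous regions. This is an assumption made for illustration: the actual build_SD_graph() builds a multiplex graph and may group regions differently. The function name group_sd_pairs and the coordinates are hypothetical.

```python
from collections import defaultdict


def group_sd_pairs(sd_pairs):
    """Group SD intervals into connected components (illustrative only).

    sd_pairs: iterable of (interval_a, interval_b), where each interval is
    a (chrom, start, end) tuple describing one copy of a duplication.
    """
    adjacency = defaultdict(set)
    for a, b in sd_pairs:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    groups = []
    for start in sorted(adjacency):
        if start in seen:
            continue
        # Depth-first traversal collects every interval reachable from `start`.
        stack = [start]
        seen.add(start)
        component = []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        groups.append(sorted(component))
    return groups


# Hypothetical SD pairs: the first two share an interval, so they merge.
pairs = [
    (("chr1", 100, 500), ("chr1", 9000, 9400)),
    (("chr1", 9000, 9400), ("chr5", 200, 600)),
    (("chr7", 50, 300), ("chr7", 7000, 7250)),
]
sd_groups = group_sd_pairs(pairs)
for group in sd_groups:
    print(group)
```

In this simplified view, each group would receive its own masked reference genome, so that reads recruited from any member of the group are realigned together.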
This stage performs targeted variant calling:
- Extracts reads from identified SD regions (imap_prepare_masked_align_region_per_RG())
- Realigns to masked references (imap_process_masked_bam())
- Eliminates misalignments (eliminate_misalignments())
- Performs variant calling on filtered alignments
- Tags variants for provenance
Final steps may include:
- Annotating variants with cohort data (identify_common_vars.py)
- Merging with conventional caller output (merge_with_priority() in src/merge_variants_with_priority.py)
- Prioritizing variants based on quality metrics
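The idea of a priority merge can be sketched on plain dictionaries. The example below is not the merge_with_priority() function from the repository. It assumes, for illustration only, that variants are keyed by (chrom, pos, ref, alt), that the conventional caller record wins when both callers report the same variant, and that a hypothetical "source" field records provenance.

```python
def merge_calls_sketch(primary, secondary, primary_name, secondary_name):
    """Merge two callsets, keeping the primary record on conflicts.

    primary / secondary: dicts mapping (chrom, pos, ref, alt) to a record
    dict. The "source" field added here is a hypothetical provenance tag.
    """
    merged = {}
    for key, record in secondary.items():
        merged[key] = {**record, "source": secondary_name}
    for key, record in primary.items():
        # A variant seen by both callers keeps the primary record
        # but lists both callers as its source.
        if key in secondary:
            source = f"{primary_name},{secondary_name}"
        else:
            source = primary_name
        merged[key] = {**record, "source": source}
    # Return records sorted by chromosome name, then position.
    return dict(sorted(merged.items()))


# Toy callsets with made-up coordinates and genotypes.
deepvariant_calls = {
    ("chr1", 1000, "A", "G"): {"GT": "0/1"},
    ("chr1", 2000, "C", "T"): {"GT": "1/1"},
}
sdrecall_calls = {
    ("chr1", 2000, "C", "T"): {"GT": "0/1"},
    ("chr1", 3000, "G", "A"): {"GT": "0/1"},
}

merged_calls = merge_calls_sketch(
    deepvariant_calls, sdrecall_calls, "DeepVariant", "SDrecall"
)
for key, record in merged_calls.items():
    print(key, record)
```

Keeping the source of each record makes it possible to review SDrecall-only variants separately, which matters because their precision is lower than that of conventional calls.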
- BAM file: Aligned sequencing reads (must be sorted by coordinates and indexed)
- Reference genome: FASTA format (hg19 or hg38 supported)
- Reference SD map: BED file with segmental duplication coordinates (two gzipped BED files are provided in data/hg19(hg38)/ref_SD)
- Target BED: Regions in which you want to ensure detection sensitivity. For molecular diagnosis of Mendelian diseases, this can be the whole exome or the coding regions of functionally relevant genes.
- Supplementary VCF (optional): A VCF of the same sample called by a conventional caller such as GATK or DeepVariant. If provided, SDrecall merges its own output with this VCF to produce a final VCF for downstream analysis.
- Cohort VCF (optional): Population data for identifying common variants. It is recommended to run SDrecall on dozens of control samples with a similar coverage profile, merge the results with bcftools, and have the AC and AN INFO tags calculated in the merged VCF. The AC and AN values of each variant called by SDrecall within your in-house control cohort can then be used to estimate whether it is truly a common variant in the general population. This is important because traditional population databases such as gnomAD and 1000g are based on NGS data and therefore have gaps in regions like segmental duplications due to mapping ambiguity.
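One way to turn AC and AN into a "common variant" decision is a one-sided binomial test against an allele frequency cutoff. The sketch below is an assumption about how such an estimate could work and not a description of what SDrecall implements; the function names are hypothetical. The 0.01 cutoff and 0.999 confidence level mirror the default values quoted in this README.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return 1.0 - sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))


def looks_common(ac, an, af_cutoff=0.01, conf_level=0.999):
    """Flag a variant as common if its cohort counts are very unlikely
    under a true allele frequency equal to af_cutoff (illustrative rule).

    ac: alternate allele count in the control cohort (AC INFO tag)
    an: total called alleles in the control cohort (AN INFO tag)
    """
    if an <= 0:
        return False
    # Probability of seeing at least `ac` alternate alleles if AF == cutoff.
    p_value = binomial_upper_tail(ac, an, af_cutoff)
    return p_value < 1.0 - conf_level


# A hypothetical cohort of 20 diploid controls gives AN = 40.
for ac in (0, 1, 5):
    print(ac, looks_common(ac, an=40))
```

With a cohort this small, a single observed allele is not enough to call a variant common under this rule, which is one reason to include dozens of control samples.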
The main outputs include:
- Filtered BAM files: Realigned reads in SD regions (in <output_dir>/<sample_id>_<assembly>_<target_tag>_SDrecall/recall_results)
- Variant calls: VCF files with variants in SD regions (in <output_dir>/<sample_id>_<assembly>_<target_tag>_SDrecall/recall_results/<sample_id>.sdrecall.vcf.gz)
- Final merged VCF: Combined results with appropriate filters/tags (in <output_dir>/final_vcf/)
- Log files: Detailed processing information
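For downstream scripts, the output locations listed above can be derived from the run parameters. The helper below is a hypothetical convenience function that only encodes the path patterns given in this README; the sample ID, assembly and target tag in the example are placeholders.

```python
from pathlib import Path


def expected_outputs(output_dir, sample_id, assembly, target_tag):
    """Build the output paths described in this README (hypothetical helper)."""
    work_dir = Path(output_dir) / f"{sample_id}_{assembly}_{target_tag}_SDrecall"
    recall_dir = work_dir / "recall_results"
    return {
        "recall_dir": recall_dir,
        "recall_vcf": recall_dir / f"{sample_id}.sdrecall.vcf.gz",
        "final_vcf_dir": Path(output_dir) / "final_vcf",
    }


# Placeholder run parameters, for illustration only.
paths = expected_outputs("out", "sample01", "hg38", "exome")
for name, path in paths.items():
    print(name, path.as_posix())
```

Checking paths["recall_vcf"].exists() after a run is a quick way to confirm that the recall stage finished before starting the merge step.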
- Mapping quality filtering: Adjust thresholds with --mq_cutoff (default: 41)
- Depth filtering: Control with --high_quality_depth (default: 10; used when picking multi-aligned regions; the maximal depth of high-MAPQ reads at which coverage is still considered insufficient for downstream variant calling) and --minimum_depth (default: 3; used when picking regions affected by multi-alignment; the minimal required depth regardless of MAPQ)
- Confidence levels: Set statistical confidence with --conf_level (default: 0.999, used for common variant estimation)
- Variant filtering: Filter with --inhouse_common_cutoff (default: 0.01) when using cohort data
- Performance tuning: Adjust --threads for overall parallelism and --numba_threads for computational acceleration
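The interplay of the two depth thresholds can be shown on toy per-base depths. The sketch below is a simplified reading of the parameter descriptions and not the logic of pick_multialigned_regions(): it assumes a position is flagged when its total depth reaches --minimum_depth while its high-MAPQ depth does not exceed --high_quality_depth. Whether the real comparisons are inclusive is also an assumption, and the function names are hypothetical.

```python
def flag_multialigned_positions(total_depth, high_mapq_depth,
                                high_quality_depth=10, minimum_depth=3):
    """Return 0-based positions that look affected by multi-alignment.

    total_depth: per-base depth counting all reads regardless of MAPQ
    high_mapq_depth: per-base depth counting only high-MAPQ reads
    """
    flagged = []
    for pos, (total, high) in enumerate(zip(total_depth, high_mapq_depth)):
        if total >= minimum_depth and high <= high_quality_depth:
            flagged.append(pos)
    return flagged


def positions_to_intervals(positions):
    """Collapse sorted positions into half-open (start, end) intervals."""
    intervals = []
    for pos in positions:
        if intervals and pos == intervals[-1][1]:
            # Position is adjacent to the current interval: extend it.
            intervals[-1][1] = pos + 1
        else:
            intervals.append([pos, pos + 1])
    return [tuple(interval) for interval in intervals]


# Toy depths for six consecutive bases (made-up numbers).
total = [30, 30, 2, 25, 25, 40]
high = [28, 4, 0, 3, 10, 35]

flagged_positions = flag_multialigned_positions(total, high)
flagged_intervals = positions_to_intervals(flagged_positions)
print(flagged_positions)
print(flagged_intervals)
```

Under these assumptions, raising --high_quality_depth flags more positions for recall, while raising --minimum_depth excludes positions with too few reads overall.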
Common Arguments
Use SDrecall --help and SDrecall run/prepare/realign --help to see the full argument list.
-i, --input_bam Input BAM file path (must be indexed)
-r, --ref_genome Reference genome path
-m, --reference_sd_map Reference segmental duplication map
-o, --outdir Output directory
-b, --target_bed Required target regions for analysis (default: whole exome, offered in data/hg19(38)/default_target)
-s, --sample_id Sample ID (default: extracted from BAM filename)
-t, --threads Number of threads to use (default: 10)
-v, --verbose Verbosity level (INFO, DEBUG, etc.)
--target_tag Label of the target region (recommended to specify)
SDrecall is under active development. We welcome all kinds of suggestions and collaborations.
Xingtian Yang ([email protected]), Louis She ([email protected])