SDrecall is designed for sensitive variant detection in segmental duplications

snakesch/SDrecall

SDrecall

SDrecall is a specialized variant caller designed to improve variant detection in segmental duplication (SD) regions where conventional callers often struggle due to mapping ambiguity.

Overview

SDrecall works by:

  1. Identifying SDs for realignment and recall: SDs overlapping user-defined target regions (protein-coding regions by default) and regions covered by multi-aligned reads
  2. Identifying the homologous counterparts of the targeted SDs and creating masked reference genomes for these regions
  3. Recruiting reads from the counterparts and performing realignment
  4. Phasing the realigned reads and assembling them into micro-haplotypes, then eliminating less optimal haplotypes from the realignments with Linear Integer Constraint Programming, and calling variants with BCFtools on the filtered realignments
  5. Merging the resulting variants with conventional caller output (suggested follow-up)
  6. Annotating common variants using a cohort VCF (optional)

SDrecall significantly improves the detection of small variants (SNVs and small indels) in SDs, where conventional callers typically miss variants (false negatives).

The resulting callset does not reach the precision of callsets generated by GATK or DeepVariant. SDrecall is primarily designed for the molecular diagnosis of patients with Mendelian diseases. Despite the limited precision, the false positive (FP) control measures in SDrecall limit the number of FPs that survive to become causal variant candidates. In a systematic evaluation targeting the entire exome, SDrecall left only 1-3 rare and deleterious FPs to cloud the final selection of causal variants among the candidates, while raising detection sensitivity to approximately 95%.

For molecular diagnosis of Mendelian disease patients, SDrecall provides a comprehensive inspection of SD regions that would otherwise be missed by traditional NGS analysis pipelines, while introducing only marginal noise that could interfere with causal variant identification.

Installation

Using conda/mamba

Users should first clone this repository to a local directory.

For mamba/conda users, create an environment from YAML:

mamba env create -f ./env/SDrecall.yml
mamba activate SDrecall

Using docker/singularity

Given SDrecall's long list of dependencies, we are still working on a Dockerfile / Singularity recipe. Contributions are most welcome.

Usage

SDrecall provides three main execution modes:

Complete Pipeline

With Supplementary VCF and Cohort Annotation (Recommended way to run SDrecall)

# Run with conventional caller integration and cohort annotation
SDrecall run \
  -i input.bam \
  -r /path/to/reference.fa \
  -m /path/to/sd_map.bed \
  -b /path/to/target.bed \
  -o /path/to/output_dir \
  -t 16 \
  -s <sample_id> \
  --target_tag <label_of_target_region> \
  --conventional_vcf /path/to/deep_variant.vcf \
  --caller_name DeepVariant \
  --cohort_vcf /path/to/control_cohort.vcf \
  --inhouse_common_cutoff 0.01 \
  --cohort_conf_level 0.999

Without Supplementary VCF and Cohort Annotation

# Run the complete SDrecall pipeline
SDrecall run \
  -i input.bam \
  -r /path/to/reference.fa \
  -m /path/to/sd_map.bed \
  -o /path/to/output_dir \
  -b /path/to/target.bed \
  -t 16 \
  -s <sample_id> \
  --target_tag <label_of_target_region>

Preparation Only

# Run only the preparation phase (identifies SD regions, creates masked references)
SDrecall prepare \
  -i input.bam \
  -r /path/to/reference.fa \
  -m /path/to/sd_map.bed \
  -o /path/to/output_dir \
  -b /path/to/target.bed \
  -t 16 \
  -s <sample_id> \
  --target_tag <label_of_target_region> \
  --high_quality_depth 10 \
  --minimum_depth 3

Realignment and Recall Only

# Run only realignment and recall (requires preparation output)
SDrecall realign \
  -i input.bam \
  -r /path/to/reference.fa \
  -m /path/to/sd_map.bed \
  -b /path/to/target.bed \
  -o /path/to/output_dir \
  -s <sample_id> \
  -t 16 \
  --target_tag <label_of_target_region> \
  --numba_threads 4

Workflow Stages

1. Preparation (prepare_recall_regions.py)

This stage identifies SD regions with mapping issues:

  • Extracts multi-aligned regions based on mapping quality (pick_multialigned_regions())
  • Compares to a reference SD map
  • Creates a multiplex graph with SD pairs (build_SD_graph())
  • Builds masked reference genomes for each SD group (build_beds_and_masked_genomes())
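
As a rough illustration of how SD pairs can be collected into groups before a masked genome is built per group, the sketch below clusters homologous intervals with a union-find structure. It is not the code behind build_SD_graph(): the (chrom, start, end) tuples and the function name group_sd_pairs are hypothetical, and SDrecall builds a multiplex graph rather than plain connected components.

```python
from collections import defaultdict


def group_sd_pairs(pairs):
    """Group intervals into connected components.

    `pairs` is an iterable of (interval_a, interval_b), where each interval
    is a hypothetical (chrom, start, end) tuple and a pair means the two
    intervals are homologous copies of one another.
    """
    parent = {}

    def find(x):
        # Register unseen nodes as their own root, then walk up to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(list)
    for node in parent:
        groups[find(node)].append(node)
    # Sort for a deterministic result: intervals within a group, then groups.
    return sorted(sorted(members) for members in groups.values())


pairs = [
    (("chr1", 100, 200), ("chr5", 300, 400)),
    (("chr5", 300, 400), ("chr9", 10, 110)),
    (("chrX", 1, 50), ("chrY", 1, 50)),
]
groups = group_sd_pairs(pairs)
```

Here the first three intervals end up in one group because chr5:300-400 is paired with both of the others, so reads from all three copies would be considered together.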

2. Realignment and Recall (realign_and_recall.py)

This stage performs targeted variant calling:

  • Extracts reads from identified SD regions (imap_prepare_masked_align_region_per_RG())
  • Realigns to masked references (imap_process_masked_bam())
  • Eliminates misalignments (eliminate_misalignments())
  • Performs variant calling on filtered alignments
  • Tags variants for provenance

3. Post-processing (post_process_vcf() in SDrecall.py)

Final steps may include:

  • Annotating variants with cohort data (identify_common_vars.py)
  • Merging with conventional caller output (merge_with_priority() in src/merge_variants_with_priority.py)
  • Prioritizing variants based on quality metrics
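
The idea of a priority merge can be sketched as follows. This is a minimal illustration, not merge_with_priority() itself: it assumes that calls are plain dictionaries, that two calls match when chrom, pos, ref and alt are identical, and that the conventional caller's record wins on a match. Which callset actually takes priority in SDrecall, and how records are matched, should be checked in src/merge_variants_with_priority.py.

```python
def merge_calls(priority_calls, secondary_calls):
    """Merge two call lists; on an identical site and allele, keep the priority record.

    Each call is a hypothetical dict with at least the keys
    "chrom", "pos", "ref" and "alt".
    """
    merged = {}
    # Insert secondary calls first so that priority calls overwrite them.
    for source, calls in (("secondary", secondary_calls), ("priority", priority_calls)):
        for call in calls:
            key = (call["chrom"], call["pos"], call["ref"], call["alt"])
            merged[key] = {**call, "source": source}
    # Note: chromosomes are ordered as strings here, not in karyotype order.
    return [merged[key] for key in sorted(merged)]


conventional = [{"chrom": "chr1", "pos": 100, "ref": "A", "alt": "G", "qual": 50}]
sdrecall = [
    {"chrom": "chr1", "pos": 100, "ref": "A", "alt": "G", "qual": 20},
    {"chrom": "chr1", "pos": 250, "ref": "C", "alt": "T", "qual": 30},
]
merged_calls = merge_calls(conventional, sdrecall)
```

The added "source" field plays the role of a provenance tag, so downstream filters can tell which caller contributed each record.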

Inputs

  • BAM file: Aligned sequencing reads (must be sorted by coordinates and indexed)
  • Reference genome: FASTA format (hg19 or hg38 supported)
  • Reference SD map: BED file with segmental duplication coordinates (two gzipped BED files are provided in data/hg19(hg38)/ref_SD)
  • Target BED: Regions in which you want to ensure detection sensitivity. For molecular diagnosis of Mendelian diseases, this can be the whole exome or the coding regions of functionally relevant genes.
  • Supplementary VCF (optional): Calls for the same sample from a conventional caller such as GATK or DeepVariant. If provided, SDrecall attempts to merge its own output with this VCF to produce a final VCF for downstream analysis.
  • Cohort VCF (optional): Population data for identifying common variants. We recommend running SDrecall on dozens of control samples with a similar coverage profile, merging the results with bcftools, and calculating the AC and AN INFO tags in the merged VCF. The AC and AN values of each SDrecall variant in your in-house control cohort can then be used to estimate whether it is truly common in the general population. This matters because traditional population databases such as gnomAD and 1000g are based on NGS data and therefore have gaps in regions like segmental duplications due to mapping ambiguity.
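
To make the role of AC and AN concrete, here is one way a cohort-based "common variant" test could work. It is only a sketch under stated assumptions: it uses a one-sided Wilson score lower bound on the allele frequency and calls a variant common when that bound exceeds the cutoff. The statistical test used by identify_common_vars.py may differ, and the function names below are hypothetical; only the default values 0.01 and 0.999 are taken from this README.

```python
from math import sqrt
from statistics import NormalDist


def wilson_lower_bound(ac, an, conf_level=0.999):
    """One-sided Wilson score lower bound for the allele frequency AC/AN."""
    if an <= 0:
        return 0.0
    z = NormalDist().inv_cdf(conf_level)  # standard normal quantile
    p = ac / an
    denom = 1 + z * z / an
    centre = p + z * z / (2 * an)
    margin = z * sqrt(p * (1 - p) / an + z * z / (4 * an * an))
    return max(0.0, (centre - margin) / denom)


def is_common(ac, an, cutoff=0.01, conf_level=0.999):
    """Assumed rule: common if we are confident the frequency exceeds the cutoff."""
    return wilson_lower_bound(ac, an, conf_level) > cutoff


# 30 control samples give AN = 60 chromosomes at a fully genotyped site.
seen_in_half = is_common(ac=30, an=60)
seen_once = is_common(ac=1, an=60)
```

With a small cohort, a variant observed once cannot be confidently called common, which is why a confidence bound is used here rather than the raw AC/AN ratio.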

Outputs

The main outputs include:

  • Filtered BAM files: Realigned reads in SD regions (in <output_dir>/<sample_id>_<assembly>_<target_tag>_SDrecall/recall_results)
  • Variant calls: VCF files with variants in SD regions (in <output_dir>/<sample_id>_<assembly>_<target_tag>_SDrecall/recall_results/<sample_id>.sdrecall.vcf.gz)
  • Final merged VCF: Combined results with appropriate filters/tags (in <output_dir>/final_vcf/)
  • Log files: Detailed processing information
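
For scripting around a run, the output locations listed above can be assembled from the run parameters. The helper below is a hypothetical convenience function, not part of SDrecall; it only mirrors the path templates given in this section and assumes assembly is a string such as hg19 or hg38.

```python
from pathlib import Path


def expected_outputs(outdir, sample_id, assembly, target_tag):
    """Build the output paths described in this README from the run parameters."""
    work_dir = Path(outdir) / f"{sample_id}_{assembly}_{target_tag}_SDrecall"
    recall_dir = work_dir / "recall_results"
    return {
        "recall_dir": recall_dir,  # filtered BAM files live here
        "sdrecall_vcf": recall_dir / f"{sample_id}.sdrecall.vcf.gz",
        "final_vcf_dir": Path(outdir) / "final_vcf",  # merged VCF, if produced
    }


paths = expected_outputs("/data/out", "S1", "hg38", "exome")
```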

Advanced Features

  • Mapping quality filtering: Adjust thresholds with --mq_cutoff (default: 41)
  • Depth filtering: Control with --high_quality_depth (default: 10; used when picking up multi-aligned regions, the maximal depth of high-MAPQ reads at which coverage is still considered insufficient for downstream variant calling) and --minimum_depth (default: 3; used when picking up regions suffering from multi-alignment, the minimal required depth regardless of MAPQ)
  • Confidence levels: Set statistical confidence with --conf_level (default: 0.999, used for common variant estimation)
  • Variant filtering: Filter with --inhouse_common_cutoff (default: 0.01) when using cohort data
  • Performance tuning: Adjust --threads for overall parallelism and --numba_threads for computational acceleration
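
The interplay of the two depth thresholds can be illustrated with a small sketch. It assumes a position is picked up when its total depth is at least --minimum_depth and its high-MAPQ depth is at most --high_quality_depth, with both boundaries inclusive; the exact comparisons in pick_multialigned_regions() may differ, and the function name and input layout here are hypothetical.

```python
def flag_regions(depths, high_quality_depth=10, minimum_depth=3):
    """Return half-open [start, end) intervals of flagged positions.

    `depths` holds one (total_depth, high_mapq_depth) tuple per consecutive
    0-based position. A position is flagged when enough reads cover it
    overall but too few of them are confidently mapped.
    """
    regions = []
    start = None
    for pos, (total, high_mapq) in enumerate(depths):
        flagged = total >= minimum_depth and high_mapq <= high_quality_depth
        if flagged and start is None:
            start = pos  # open a new interval
        elif not flagged and start is not None:
            regions.append((start, pos))  # close the current interval
            start = None
    if start is not None:
        regions.append((start, len(depths)))
    return regions


depths = [(30, 30), (25, 4), (20, 2), (2, 0), (15, 10), (40, 35)]
regions = flag_regions(depths)
```

In this toy input, position 3 is skipped because its total depth of 2 is below the minimum, and positions 0 and 5 are skipped because they already have plenty of high-MAPQ reads.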

Common Arguments

Use SDrecall --help and SDrecall run/prepare/realign --help to see the full argument list.

-i, --input_bam        Input BAM file path (must be indexed)
-r, --ref_genome       Reference genome path
-m, --reference_sd_map Reference segmental duplication map
-o, --outdir           Output directory
-b, --target_bed       Required target regions for analysis (default: whole exome, offered in data/hg19(38)/default_target)
-s, --sample_id        Sample ID (default: extracted from BAM filename)
-t, --threads          Number of threads to use (default: 10)
-v, --verbose          Verbosity level (INFO, DEBUG, etc.)
--target_tag           Label of the target region (recommended to specify)

Code development and feature requests

SDrecall is under active development. We welcome all kinds of suggestions and collaborations.

Contact and correspondence

Xingtian Yang ([email protected]), Louis She ([email protected])
