SummarizeVCF.py is a command line tool for producing per-sample summary statistics for variants in a VCF file. Conceptually, SummarizeVCF.py is intended to be for VCF files what FastQC is for FASTQ files.
cd Pipeline-Tools
python ./SummarizeVCF.py <summary_type> --vcf <vcf_file> [options]
summary types:
Mutect, Multisample
Detailed descriptions of the additional options are available in the help menu:
python ./SummarizeVCF.py --help
usage: SummarizeVCF.py <summary_type> [options]
positional arguments:
{Mutect,Multisample} VCF Summary type.
optional arguments:
-h, --help show this help message and exit
--vcf VCF_FILE Path to vcf file to summarize.
--max-records MAX_RECORDS
Maximum number of records to process. Default: ALL.
--max-indel-len MAX_INDEL_LEN
Upper bound of indel length summary.
--max-depth MAX_DEPTH
Upper bound of variant depth summary.
--max-qual MAX_QUAL Upper bound of variant quality summary.
--afs-bins NUM_AFS_BINS
Number of bins to use for Allele Frequency Spectrum.
-v Increase verbosity of the program.Multiple -v's
increase the verbosity level: 0 = Errors 1 = Errors +
Warnings 2 = Errors + Warnings + Info 3 = Errors +
Warnings + Info + Debug
Although VCF is supposed to be a standardized format, variant calling programs differ in how they use specific fields. Currently, SummarizeVCF.py is able to handle both standard multisample VCF files (produced by GATK GenotypeGVCFs or Samtools mpileup):
python ./SummarizeVCF.py Multisample --vcf <vcf_file> [options]
And VCF files produced by somatic callers such as Mutect2:
python ./SummarizeVCF.py Mutect --vcf <vcf_file> [options]
Disclaimer: Choosing the correct summary type is important because a mismatch will cause the program to crash. Make sure you know what kind of VCF you are working with.
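If you are unsure which kind of VCF you have, the header usually gives a hint. The sketch below is a heuristic only, not part of SummarizeVCF.py: the function name `guess_summary_type` is hypothetical, and it assumes that Mutect-derived files mention "mutect" somewhere in their `##` meta lines (for example in a source or command-line entry). Always confirm the result against how the file was actually produced.

```python
import gzip


def guess_summary_type(vcf_path):
    """Guess 'Mutect' or 'Multisample' from the VCF header (heuristic only).

    Assumption: Mutect output mentions 'mutect' in at least one '##' meta line.
    """
    opener = gzip.open if vcf_path.endswith(".gz") else open
    with opener(vcf_path, "rt") as handle:
        for line in handle:
            if not line.startswith("##"):
                break  # meta lines are over; stop before reading records
            if "mutect" in line.lower():
                return "Mutect"
    return "Multisample"
```

The returned string can be passed as the `<summary_type>` argument once you have verified it.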
For each sample in the VCF, SummarizeVCF.py reports the number of:
- Missing GT
- Called GT
- Variant GT
- Heterozygous loci
- Homozygous-Alt
- Deletions
- Insertions
- Monomorphs
- SNPs
- Ts
- Tv
- 12 SNP substitution types (e.g. C -> A, G -> T)
- dbSNP variants
- Structural variants
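To make the Ts, Tv and substitution-type counts concrete, here is a minimal sketch of how a single-base change can be classified. It is an illustration, not the code used by SummarizeVCF.py; `classify_snp` and `substitution_counts` are hypothetical names. A transition (Ts) swaps a purine for a purine or a pyrimidine for a pyrimidine, and every other single-base change is a transversion (Tv).

```python
from collections import Counter

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_snp(ref, alt):
    """Return 'Ts' for a transition and 'Tv' for a transversion."""
    ref, alt = ref.upper(), alt.upper()
    bases = PURINES | PYRIMIDINES
    if ref not in bases or alt not in bases or ref == alt:
        raise ValueError(f"not a SNP: {ref}>{alt}")
    # Both bases in the same chemical class means a transition.
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "Ts" if same_class else "Tv"


def substitution_counts(snps):
    """Count each REF>ALT substitution type plus Ts and Tv totals."""
    counts = Counter()
    for ref, alt in snps:
        counts[f"{ref.upper()}>{alt.upper()}"] += 1
        counts[classify_snp(ref, alt)] += 1
    return counts
```

For example, `substitution_counts([("C", "T"), ("C", "A")])` records one `C>T` transition and one `C>A` transversion.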
If annotation information is provided, the following can also be reported if present:
- Number of variants by impact (HIGH, MODERATE, LOW, MODIFIER)
- Number of variants by type
- intergenic
- intronic
- synonymous_SNV
- nonsynonymous_SNV
- stoploss, stopgain
- nonframeshift_deletion
- nonframeshift_insertion
- frameshift_deletion
- frameshift_insertion
SummarizeVCF.py also produces per-sample distribution summaries of the following:
- Genotype Quality
- Read depth
- Allele frequency (AF=percentage of samples with that variant)
- Insertion length
- Deletion length
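The allele frequency definition above (percentage of samples with the variant) can be sketched as follows. This is an illustration under stated assumptions, not the tool's implementation: the function names are hypothetical, and excluding samples with a missing genotype from the denominator is an assumption that the source does not confirm.

```python
def _alleles(gt):
    """Split a GT string such as '0/1' or '1|1' into its allele codes."""
    return gt.replace("|", "/").split("/")


def is_called(gt):
    """True if no allele in the genotype is missing ('.')."""
    return all(a != "." for a in _alleles(gt))


def carries_alt(gt):
    """True if the genotype has at least one non-reference allele."""
    return any(a not in (".", "0") for a in _alleles(gt))


def sample_allele_frequency(genotypes):
    """Percentage of called samples that carry the variant.

    Assumption: samples with a missing genotype are left out of the denominator.
    Returns None when no sample has a called genotype.
    """
    called = [gt for gt in genotypes if is_called(gt)]
    if not called:
        return None
    carriers = sum(carries_alt(gt) for gt in called)
    return 100.0 * carriers / len(called)
```

With genotypes `["0/1", "0/0", "1/1", "./."]`, two of the three called samples carry the variant.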
If run in Mutect mode, the following are also computed for each sample:
- Number of variants that passed the Mutect filter
Input VCF files must meet the following requirements:
- VCF format v4.0+
- Annotated with either Annovar, SnpEff, both, or no annotation.
- No multi-allelic variants
- Real loci can of course be multi-allelic, but VCF files should be normalized using bcftools norm so that each allele is placed on a separate line in the VCF
- Allows each allele to have separate annotations
- If in Mutect mode, VCF file must be Mutect output or merged VCF from multiple Mutect runs
- If in Multisample mode, VCF must not be Mutect output
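Before running a summary, it can be worth checking that no multi-allelic records remain. The sketch below is an assumption-labeled helper, not part of the toolkit: `multiallelic_records` is a hypothetical name, and it relies only on the standard VCF layout in which ALT is the fifth tab-separated column and multiple alternate alleles are comma-separated.

```python
def multiallelic_records(vcf_lines):
    """Yield (chrom, pos, alt) for records whose ALT column lists several alleles."""
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # not a complete record; ignore rather than guess
        chrom, pos, alt = fields[0], fields[1], fields[4]
        if "," in alt:
            yield chrom, int(pos), alt
```

Usage might look like `with open("input.vcf") as fh: bad = list(multiallelic_records(fh))`, where `input.vcf` is a placeholder path. If the list is not empty, the file probably still needs normalizing, for example with `bcftools norm -m -any`.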
--max-records limits the number of VCF records processed, for faster runtimes. The default is to process all records.
--max-indel-len sets the upper bound for summarizing the indel size distribution.
--max-depth sets the upper bound for summarizing the variant read depth distribution.
--max-qual sets the upper bound for summarizing the variant quality distribution.
--afs-bins specifies the number of bins for summarizing the allele frequency spectrum of alternate alleles.
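The upper-bound and bin-count options can be pictured as histogram parameters. The sketch below shows one plausible reading and is not taken from the tool's source: the function names are hypothetical, folding values above the upper bound into the last bin is an assumption, and so is the use of equal-width bins over the range 0 to 1 for the allele frequency spectrum.

```python
def capped_histogram(values, upper_bound):
    """Count integer values in bins 0..upper_bound.

    Assumption: values above upper_bound are counted in the last bin.
    """
    counts = [0] * (upper_bound + 1)
    for value in values:
        index = min(max(int(value), 0), upper_bound)
        counts[index] += 1
    return counts


def afs_histogram(frequencies, num_bins):
    """Count allele frequencies in [0, 1] using num_bins equal-width bins."""
    counts = [0] * num_bins
    for freq in frequencies:
        if not 0.0 <= freq <= 1.0:
            raise ValueError(f"frequency out of range: {freq}")
        # A frequency of exactly 1.0 belongs to the last bin.
        index = min(int(freq * num_bins), num_bins - 1)
        counts[index] += 1
    return counts
```

Under these assumptions, a depth of 250 with an upper bound of 5 would be counted in bin 5 alongside any depth of exactly 5.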
SummarizeVCF.py produces a tab-delimited output file with 6 sections:
- COUNTS: counts described above for each sample.
- DEPTH: variant depth distribution for each sample
- QUAL: variant quality score distribution for each sample
- INSERT_LEN: insertion length distribution for each sample
- DELETE_LEN: deletion length distribution for each sample
- AAFS: Allele frequency spectrum for each sample
Each section starts with a header row, followed by rows for each sample.
Check out the Example VCFSummary for a better idea.
The helper program CatVCFSummary.py is designed to merge VCFSummaries to facilitate parallelized processing.
cd Pipeline-Tools
python CatVCFSummary.py --help
usage: CatVCFSummary [-h] -i INPUT_FILES [INPUT_FILES ...] [-v]
optional arguments:
-h, --help show this help message and exit
-i INPUT_FILES [INPUT_FILES ...]
Space-delimited list of VCFSummary files to combine
-v Increase verbosity of the program.Multiple -v's
increase the verbosity level: 0 = Errors 1 = Errors +
Warnings 2 = Errors + Warnings + Info 3 = Errors +
Warnings + Info + Debug
To drastically decrease processing time, VCF files can be split by chromosome using SnpEff (its companion tool SnpSift provides a split command) and summarized in parallel. CatVCFSummary.py is designed to merge these splits back into a single VCFSummary.
Example:
Summarize variants in VCF splits using SummarizeVCF.py
cd ./Pipeline-Tools
python ./SummarizeVCF.py Multisample --vcf chr1.vcf > chr1.sum.txt
python ./SummarizeVCF.py Multisample --vcf chr2.vcf > chr2.sum.txt
python ./SummarizeVCF.py Multisample --vcf chr3.vcf > chr3.sum.txt
Merge using CatVCFSummary.py
cd ./Pipeline-Tools
python ./CatVCFSummary.py -i chr1.sum.txt chr2.sum.txt chr3.sum.txt \
> full.sum.txt
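The per-chromosome summaries are independent, so they can be launched concurrently from a small driver script. The following is a minimal sketch, assuming it is run from the Pipeline-Tools directory and that both programs write their results to standard output (as the redirect in the SummarizeVCF.py example suggests; this is an assumption for CatVCFSummary.py). All function names are hypothetical, and only the `--vcf` and `-i` flags shown in the help text are used.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def summary_command(vcf_path, summary_type="Multisample"):
    """Build the SummarizeVCF.py command line for one VCF split."""
    return ["python", "./SummarizeVCF.py", summary_type, "--vcf", vcf_path]


def summary_output_path(vcf_path):
    """Map 'chr1.vcf' to 'chr1.sum.txt'."""
    return vcf_path.rsplit(".vcf", 1)[0] + ".sum.txt"


def merge_command(summary_paths):
    """Build the CatVCFSummary.py command line for a list of summaries."""
    return ["python", "./CatVCFSummary.py", "-i", *summary_paths]


def summarize_one(vcf_path):
    """Summarize one split, capturing standard output in a summary file."""
    out_path = summary_output_path(vcf_path)
    with open(out_path, "w") as out:
        subprocess.run(summary_command(vcf_path), stdout=out, check=True)
    return out_path


def summarize_and_merge(vcf_paths, merged_path, workers=4):
    """Summarize all splits concurrently, then merge them into one file."""
    # Threads are enough here: the real work happens in the child processes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        summaries = list(pool.map(summarize_one, vcf_paths))
    with open(merged_path, "w") as out:
        subprocess.run(merge_command(summaries), stdout=out, check=True)
    return merged_path
```

A call such as `summarize_and_merge(["chr1.vcf", "chr2.vcf", "chr3.vcf"], "full.sum.txt")` would reproduce the example above in one step.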