A user-friendly tool for quality control of single-sample variants in VCF
format. Note: this tool only supports VCF files
generated by GATK 4.x.
- Please add the following ANNOVAR scripts to the bin directory: annotate_variation.pl, coding_change.pl, convert2annovar.pl, table_annovar.pl
- Python 3.x
- Python packages: pandas, feather-format, numpy, tensorflow 2.x, tqdm
- samtools (make sure samtools is available as an executable command)
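The requirements above can be checked before a run with a short script. This is only a sketch and not part of VariantsQC: the function name missing_prerequisites is hypothetical, the default bin path assumes you run it from the repository root, and the import name feather for the feather-format package is an assumption.

```python
import importlib.util
import shutil
from pathlib import Path

# ANNOVAR scripts that the README asks you to copy into bin/
ANNOVAR_SCRIPTS = [
    "annotate_variation.pl",
    "coding_change.pl",
    "convert2annovar.pl",
    "table_annovar.pl",
]

# Import names (assumption: the feather-format package is imported as "feather")
PYTHON_MODULES = ["pandas", "feather", "numpy", "tensorflow", "tqdm"]


def missing_prerequisites(bin_dir="bin", modules=PYTHON_MODULES, tools=("samtools",)):
    """Return human-readable descriptions of every prerequisite that was not found."""
    missing = []
    for script in ANNOVAR_SCRIPTS:
        if not (Path(bin_dir) / script).is_file():
            missing.append(f"ANNOVAR script: {script}")
    for module in modules:
        # find_spec returns None when a top-level module is not installed
        if importlib.util.find_spec(module) is None:
            missing.append(f"Python module: {module}")
    for tool in tools:
        # shutil.which returns None when the command is not on PATH
        if shutil.which(tool) is None:
            missing.append(f"executable: {tool}")
    return missing


if __name__ == "__main__":
    problems = missing_prerequisites()
    print("\n".join(problems) if problems else "All prerequisites found.")
```

The check does not verify package versions (for example tensorflow 2.x); it only reports what is absent.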
- Separating multi-allelic variants
bcftools norm -m -any -f Homo_sapiens_assembly38.fasta raw.vcf.gz -Oz --threads 6 > raw.norm.vcf.gz
- There are several ways to do this; we use bcftools for this step.
- Please note that if multi-allelic variants are not properly separated, VariantsQC will automatically remove those variants in later steps.
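To confirm that the normalization step left no multi-allelic records, the VCF can be scanned for ALT fields that still list more than one allele. The following is a minimal sketch using only the Python standard library; the helper names open_vcf and multiallelic_sites are hypothetical and not part of VariantsQC.

```python
import gzip


def open_vcf(path):
    """Open a plain-text or gzip/bgzip-compressed VCF in text mode."""
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path, "rt")


def multiallelic_sites(path):
    """Yield (chrom, pos, alt) for records whose ALT column holds several alleles."""
    with open_vcf(path) as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip meta-information and header lines
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 5:
                continue  # not a complete VCF record
            chrom, pos, alt = fields[0], fields[1], fields[4]
            if "," in alt:  # multiple ALT alleles are comma-separated
                yield chrom, int(pos), alt


if __name__ == "__main__":
    import sys

    remaining = list(multiallelic_sites(sys.argv[1]))
    print(f"{len(remaining)} multi-allelic record(s) remaining")
```

Running it on raw.norm.vcf.gz should report zero remaining records if the bcftools step worked as intended.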
- get annotated matrix
python 1_vcf2matrix.py -i raw.norm.vcf.gz -o annoWithSeq.matrix.tsv -r Homo_sapiens_assembly38.fasta -d /Path_of_humandb/ --thread 6 -j 32 --reserved
- The available parameters can be listed with python 1_vcf2matrix.py -h:
- -i, --vcf: variants file in VCF format generated by GATK (required).
- -o, --annofile: name of the result file; if not set, the result is named result.matrix.
- -r, --reference: reference FASTA. Chromosome names in the reference must start with chr; otherwise edit the script getRef.sh in bin and remove chr in line 8.
- -d, --humandb: ANNOVAR database directory, which must contain hgxx_refGene, hgxx_rmsk, hgxx_cpgIslandExt and hgxx_genomicSuperDups.
- -t, --tmp: temporary directory. [./tmp/]
- --dp: depth threshold for filtering variants. [0]
- --refTag: reference build of the variants, hg19 or hg38. [hg38]
- --thread: number of threads for ANNOVAR annotation. [1]
- -j, --job: number of parallel jobs used to get the reference sequence; if -j > 1, please install parallel. [1]
- -n, --paranum: number of lines handled per job when getting the reference sequence; only applies when -j > 1. Recommended: 50k-1m. [100000]
- --reserved: whether to keep the temporary files.
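The naming requirement for -r can be checked up front by listing the sequence names in the reference FASTA that do not start with chr. This is a sketch under the assumption that the reference is an uncompressed FASTA file; the function name contigs_without_chr is hypothetical and not part of VariantsQC.

```python
def contigs_without_chr(fasta_path):
    """Return the names of FASTA sequences whose name does not start with 'chr'."""
    bad = []
    with open(fasta_path, "rt") as handle:
        for line in handle:
            if not line.startswith(">"):
                continue  # sequence line, not a header
            parts = line[1:].split()
            name = parts[0] if parts else ""  # the name is the first token after '>'
            if not name.startswith("chr"):
                bad.append(name)
    return bad


if __name__ == "__main__":
    import sys

    for name in contigs_without_chr(sys.argv[1]):
        print(name)
```

An empty result means every sequence name carries the chr prefix. Note that some references also contain extra sequences (for example decoy or HLA entries) whose names do not start with chr, so a non-empty list is not necessarily a problem for the primary chromosomes.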
- filter with multi dnn