-
See some 2022 SV validations here: https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkac039/6517943?login=false
-
Benchmark (DEL calls) Trost2018
I used an updated benchmark (personal communication from Brett Trost), which includes calls from PacBio: 7,952 deletions after bedtools merge, sizes: [50 – 171,021,482].
For comparison, GIAB’s NA12878 truth_DEL.merged.bed has 2,668 deletions of [50 - 139,620].
module load sratoolkit
prefetch -c SRR8595488 --max-size 60GB
qsub ~/bioscripts/fastq.sra2fq.sh -v srr=SRR8595488,sample=huref_blood1
Result
38G huref_blood1_1.fq.gz
42G huref_blood1_2.fq.gz
details:
- algorithm:
aligner: bwa
effects: false
mark_duplicates: true
realign: false
recalibrate: false
save_diskspace: true
tools_on:
- svplots
- qualimap
variantcaller: false
analysis: variant2
description: huref_blood1
files:
- /path/huref/input/huref_blood1_1.fq.gz
- /path/huref/input/huref_blood1_2.fq.gz
genome_build: GRCh37d5
metadata:
batch: huref
fc_name: huref
resources:
default:
cores: 7
jvm_opts:
- -Xms750m
- -Xmx7000m
memory: 7G
upload:
dir: ../final
details:
- algorithm:
aligner: false
effects: false
mark_duplicates: true
realign: false
recalibrate: false
remove_lcr: false
save_diskspace: true
svcaller:
- lumpy
- manta
- cnvkit
- metasv
- wham
tools_off:
- vqsr
tools_on:
- svplots
- qualimap
- noalt_calling
variantcaller: false
analysis: variant2
description: huref_blood1
files:
- /path/huref/input/huref_blood1.bam
genome_build: GRCh37
metadata:
batch: huref
fc_name: huref
resources:
default:
cores: 7
jvm_opts:
- -Xms750m
- -Xmx7000m
memory: 7G
upload:
dir: ../final
_evaluate_one
function in bcbio/structural/validate.py uses pybedtool’s intersect to overlap intervals.
By default the overlap reported if there is min 1bp overlap, the equivalent of
bedtools intersect -u -a test.bed -b truth.bed
For deletions using 50% overlap threshold would be more accurate, because events length >> 1bp, i.e.
bedtools intersect -u -r 0.5 -R 0.5 -a test.bed -b truth.bed
Surprisingly, in my test it does not make a big difference, just 15 deletions out of 2668 in GIAB, and it seems that pybedtools does not support parameters -f and -F, so I continued with the default intersect method.
I modified size bins in bcbio/structural/validate.py to be compatible with Trost2018 article:
EVENT_SIZES = [(50, 100), (100, 300), (300, 500), (500, 1000), (1000, 10000),
(10000, 100000), (100000, int(1e6))]
For each tool (except of cnvkit which does not set PASS filter):
export tool=metasv
bcftools query -i 'FILTER="PASS" && SVTYPE="DEL"' -f '%CHROM\t%POS\t%END\n' $tool.vcf.gz | grep -v GL > $tool.PASS.DEL.bed
bedtools merge -i $tool.PASS.DEL.bed | awk -v tl=$tool '{print $0"\t""DEL_"tl}' > $tool.PASS.DEL.merged.bed
crg.sv.validate_bed.py $tool.PASS.DEL.merged.bed HuRef.SV.DEL.merged.bed $tool
Merge all dataframes from all tools and plot the picture:
head -n1 cnvkit.df.csv > del.csv
cat *.df.csv | grep -v svtype >> del.csv
python /path/bcbio/anaconda/lib/python2.7/site-packages/bcbio/structural/validate.py del.csv
versions:
- bcbio-nextgen,1.0.9a0-5e3e041
- manta,1.3.0
- lumpy-sv,0.2.13
- cnvkit,0.9.2a0
- wham,1.7.0.311
- metasv,0.4.0
- the alignment was done vs GRCh37 without decoy
versions:
- bcbio-nextgen,1.1.2
- manta,1.5.0
- lumpy-sv,0.2.14a
- cnvkit,0.9.6a0
- wham,1.8.0.1.2017.05.03
- metasv,0.4.0 (=consesnsus of manta, lumpy, wham, and cnvkit)
- the alignment was done vs GRCh37 with decoy and decoy reads were filtered out
Tool | [50,100] | [100,300] | [300,500] | [500,1000] | [1k,10k] | [10k,100k] | [100k,1mln] | total |
---|---|---|---|---|---|---|---|---|
cnvkit | 0 | 0 | 0 | +1 | +2 | -9 | -2 | -8 |
wham | +3 | +13 | +35 | +4 | +14 | +4 | 0 | +73 |
manta | +22 | +13 | +24 | +4 | +11 | +1 | +1 | +76 |
lumpy | -41 | -18 | -91 | +101 | +327 | +28 | +1 | +307 |
metasv | -15 | -22 | -5 | -8 | -24 | -1 | 0 | -75 |