Skip to content

vcf2fasta

deena-b edited this page Jul 1, 2019 · 4 revisions

Compress a variant call file into a fasta file using gatk FastaAlternateReferenceMaker

Abbreviations

FARM = FastaAlternateReferenceMaker

Read the docs and help page

Note that the docs are for gatk3, but FARM has been piped to gatk4

FARM docs

./gatk FastaAlternateReferenceMaker --help

Activate environment

Need to build the environment? See install_gatk.md

Navigate to your cloned gatk repo

cd Documents/work/lawson_lab/gatk/

Activate your environment (which has R in it, but nothing else)

  • I don't know if this is necessary

conda activate gatk

Unzip files

E.g.

gunzip -k human_g1k_v37.fasta.gz

Flags

-k keeps the zipped file

Run FARM 'out of the box' with minimum inputs

Inputs

  • Reference fasta
    • ucsc.hg19.fasta
  • Index & dictionary files must be in the same folder as reference fasta
    • ucsc.hg19.fasta.fai
    • ucsc.hg19.dict
  • VCF file
    • recalibrated_duplicates_marked_reordered_sorted_filtered_realigned_Basal-1-2016-A10_CGAGGCTG-GCGTAAGA_L008_R1_001.fastq.gz.raw_variants.vcf

Description of code

./gatk FastaAlternateReferenceMaker \
    -O <path/to/output-file.fa> \
    -R <path/to/ref-file.fasta> \
    -V <path/to/file.vcf>

My actual code

./gatk FastaAlternateReferenceMaker \
        -O /Users/drb/Documents/work/lawson_lab/deepcelllineage/mitolin/data/generated/20190613-fastas/1457-1sttry.fa \
        -R /Users/drb/Documents/work/lawson_lab/deepcelllineage/mitolin/downloads/ref/ucsc.hg19.fasta \
        -V /Users/drb/Documents/work/lawson_lab/data/nguyen_nc_2018/20190502_output/Basal-1-2016-A10_CGAGGCTG-GCGTAAGA_L008_R1_001.gz/recalibrated_duplicates_marked_reordered_sorted_filtered_realigned_Basal-1-2016-A10_CGAGGCTG-GCGTAAGA_L008_R1_001.fastq.gz.raw_variants.vcf 

Flags

-O = output
-R = reference
-V = vcf file

Note

  • .gz in the above file names will be cleaned up in the next round of file naming
  • The reference fasta needs to be the same one that was used to create the vcf index and dictionary files
  • It took about 14 minutes for one vcf file to be turned into a fasta

Look at the output

Move into the output directory

List files with human readable sizes

ls -lh

-rw-r--r--  1 drb  staff   5.1K Jun 13 17:03 1457-1sttry.dict
-rw-r--r--  1 drb  staff   3.0G Jun 13 17:03 1457-1sttry.fa
-rw-r--r--  1 drb  staff   2.5K Jun 13 17:03 1457-1sttry.fa.fai

Note that the .fa (aka the fasta file) is 3 GBs. My VScode didn't want to open it without me changing the settings!

How many lines does the .fa file have?

wc -l 1457-1sttry.fa

52286210 

What do the first 4 lines look like?

head -4 1457-1sttry.fa

>1 chrM:1-16571
GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTT
CGTCTGGGGGGTGTGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTC
GCAGTATCTGTCTTTGATTCCTGCCTCATCCTATTATTTATCGCACCTACGTTCAATATT

Cut off the end of the file, after the 16571st nucleotide of chromosome M

I plan to make an issue to ask someone to write a script for this