-
Notifications
You must be signed in to change notification settings - Fork 3
vcf2fasta
FARM = FastaAlternateReferenceMaker
Note that the docs are for gatk3, but FARM has been piped to gatk4
./gatk FastaAlternateReferenceMaker --help
Need to build the environment? See install_gatk.md
Navigate to your cloned gatk repo
cd Documents/work/lawson_lab/gatk/
Activate your environment (which has R in it, but nothing else)
- I don't know if this is necessary
conda activate gatk
E.g.
gunzip -k human_g1k_v37.fasta.gz
Flags
-k keeps the zipped file
Inputs
- Reference fasta
- ucsc.hg19.fasta
- Index & dictionary files must be in the same folder as reference fasta
- ucsc.hg19.fasta.fai
- ucsc.hg19.dict
- VCF file
- recalibrated_duplicates_marked_reordered_sorted_filtered_realigned_Basal-1-2016-A10_CGAGGCTG-GCGTAAGA_L008_R1_001.fastq.gz.raw_variants.vcf
Description of code
./gatk FastaAlternateReferenceMaker \
-O <path/to/output-file.fa> \
-R <path/to/ref-file.fasta> \
-V <path/to/file.vcf>
My actual code
./gatk FastaAlternateReferenceMaker \
-O /Users/drb/Documents/work/lawson_lab/deepcelllineage/mitolin/data/generated/20190613-fastas/1457-1sttry.fa \
-R /Users/drb/Documents/work/lawson_lab/deepcelllineage/mitolin/downloads/ref/ucsc.hg19.fasta \
-V /Users/drb/Documents/work/lawson_lab/data/nguyen_nc_2018/20190502_output/Basal-1-2016-A10_CGAGGCTG-GCGTAAGA_L008_R1_001.gz/recalibrated_duplicates_marked_reordered_sorted_filtered_realigned_Basal-1-2016-A10_CGAGGCTG-GCGTAAGA_L008_R1_001.fastq.gz.raw_variants.vcf
Flags
-O = output
-R = reference
-V = vcf file
Note
-
.gz
in the above file names will be cleaned up in the next round of file naming - The reference fasta needs to be the same one that was used to create the vcf index and dictionary files
- It took about 14 minutes for one vcf file to be turned into a fasta
Move into the output directory
List files with human readable sizes
ls -lh
-rw-r--r-- 1 drb staff 5.1K Jun 13 17:03 1457-1sttry.dict
-rw-r--r-- 1 drb staff 3.0G Jun 13 17:03 1457-1sttry.fa
-rw-r--r-- 1 drb staff 2.5K Jun 13 17:03 1457-1sttry.fa.fai
Note that the .fa (aka the fasta file) is 3 GBs. My VScode didn't want to open it without me changing the settings!
How many lines does the .fa file have?
wc -l 1457-1sttry.fa
52286210
What do the first 4 lines look like?
head -4 1457-1sttry.fa
>1 chrM:1-16571
GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTT
CGTCTGGGGGGTGTGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTC
GCAGTATCTGTCTTTGATTCCTGCCTCATCCTATTATTTATCGCACCTACGTTCAATATT
I plan to make an issue to ask someone to write a script for this