---
title: "Using SnpEff on Galaxy"
author: "Kayla Barton, Lindsey Fenderson, and Ben King"
date: "`r Sys.Date()`"
output:
  html_document:
    toc: yes
    df_print: paged
---
## 1. Why we need good tools to visualize annotation
Using the UCSC Genome Browser, we can visualize the locations of our prospective sites, SnpEff VCF outputs, and BED files.


SnpEff lets us see how and where our variants change a gene. Here is what a raw SnpEff file looks like. I've highlighted in purple the important parts that I'll put into a spreadsheet.




Note that, just like in the picture of DAB2 on UCSC, we found the same amino acid changes!

### a. Opening UCSC Genome Browser
You can access the white-throated sparrow genome on the UCSC Genome Browser by searching for it or by using this link:
https://genome.ucsc.edu/cgi-bin/hgTracks?db=hub_2172235_GCF_000385455.1&lastVirtModeType=default&lastVirtModeExtraState=&virtModeType=default&virtMode=0&nonVirtPosition=&position=NW_005081537v1%3A11248104%2D11248175&hgsid=1271870139_caIUP1mE0oiZeK7TPV2S5EB8EPno

### b. Loading Custom Tracks on UCSC
Click the custom track button to add our files.

First, let's try loading a BAM file from http://katahdin.acg.maine.edu/~benking/GECO/
On this page we have large BAM files for each species. BAM files are compressed binary versions of SAM files, the tab-delimited text format for sequence alignment data. Because they are so large, downloading them and then uploading them to UCSC is impractical. Instead, right-click each link, copy its address, and paste it into the custom track data box so UCSC reads the file directly. For this example we'll load all of the Nelson's sparrow BAM files. (You can either copy each link or copy and paste the block below.)
```{bash eval=FALSE}
http://katahdin.acg.maine.edu/~benking/GECO/2391-71617_Anelsoni_BassHarborME_20100715_trimmed-bwamem-Zalbicollis-1.0.1.sorted_UCSC.bam
http://katahdin.acg.maine.edu/~benking/GECO/2631-21304_Anelsoni_MaquoitBayME_20150812_trimmed-bwamem-Zalbicollis-1.0.1.sorted_UCSC.bam
http://katahdin.acg.maine.edu/~benking/GECO/2781-84960_Anelsoni_PleasantRiverAddisonME_20190624_trimmed-bwamem-Zalbicollis-1.0.1.sorted_UCSC.bam
```
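
Pasting the bare URLs works, but UCSC also accepts explicit `track` lines, which let us give each BAM a short label. The sketch below prints one such line per URL. The helper name `make_bam_track_lines` and the choice to label each track with the sample ID at the start of the file name are our own assumptions, not part of the original workflow. UCSC also looks for a matching `.bai` index next to each BAM file.

```bash
# Print one UCSC custom-track line per BAM URL.
# The track name is taken from the sample ID at the start of the file name.
make_bam_track_lines() {
  for url in "$@"; do
    name=$(basename "$url" .bam)   # drop the directory part and the .bam suffix
    name=${name%%_*}               # keep only the text before the first underscore
    printf 'track type=bam name="%s" bigDataUrl=%s\n' "$name" "$url"
  done
}

base="http://katahdin.acg.maine.edu/~benking/GECO"
make_bam_track_lines \
  "$base/2391-71617_Anelsoni_BassHarborME_20100715_trimmed-bwamem-Zalbicollis-1.0.1.sorted_UCSC.bam" \
  "$base/2631-21304_Anelsoni_MaquoitBayME_20150812_trimmed-bwamem-Zalbicollis-1.0.1.sorted_UCSC.bam" \
  "$base/2781-84960_Anelsoni_PleasantRiverAddisonME_20190624_trimmed-bwamem-Zalbicollis-1.0.1.sorted_UCSC.bam"
```

The printed lines can be pasted into the custom track box in place of the bare URLs.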

If the file was uploaded successfully, you should then see the manage custom tracks page, where you can continue to add files or delete unwanted tracks.

Next we'll add a bed file with gene regions. From http://katahdin.acg.maine.edu/~benking/GECO/
copy the link for UCSC_WTSFoundGenesFull.bed and upload it to custom tracks.
```{bash eval=FALSE}
http://katahdin.acg.maine.edu/~benking/GECO/UCSC_WTSFoundGenesFull.bed
```

BED files can be finicky on UCSC, so make sure your BED file follows the same format as this one. UCSC's basic guidelines for BED files are here: https://genome.ucsc.edu/FAQ/FAQformat.html#format1
Finally, press go to see your custom tracks alongside the genome! (Use the toolbar to toggle the view.)
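
If a BED file refuses to load, a quick sanity check before uploading can save some time. The sketch below only tests the basics from UCSC's guidelines: tab-separated columns, at least three fields, whole-number start and end, and start not greater than end. The function name `check_bed` and the gene name in the example line are made up for illustration.

```bash
check_bed() {
  # $1 = path to a BED file; prints suspicious lines and returns non-zero if any are found
  awk -F'\t' '
    /^(track|browser|#)/ { next }   # header and comment lines are allowed
    NF < 3 || $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ || $2 + 0 > $3 + 0 {
      printf "line %d looks malformed: %s\n", NR, $0
      bad = 1
    }
    END { exit (bad + 0) }
  ' "$1"
}

# Example with a single illustrative line (the gene name is hypothetical).
bed=$(mktemp)
printf 'NW_005081537v1\t11247728\t11248940\texample_gene\n' > "$bed"
if check_bed "$bed"; then
  echo "BED file looks OK"
fi
rm -f "$bed"
```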

Now you can browse different regions of the genome. With our BED track we can see that this particular section is a part of GREM1.

Note: if your screen doesn't match this, change the location by copying and pasting the text below:
```{bash eval=FALSE}
NW_005081537v1:11,248,139-11,248,210
```
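
The position string above uses UCSC's display format, with commas in the coordinates and `v1` in the scaffold name. The region we restrict to in Galaxy later in this tutorial is written as `NW_005081537.1` with plain numbers. Here is a small sketch for converting one form to the other. It assumes the `v1` and `.1` spellings refer to the same scaffold, which is what the two spellings in this tutorial suggest, and the function name `ucsc_to_region` is our own.

```bash
ucsc_to_region() {
  # $1 = position string copied from the UCSC Genome Browser
  printf '%s\n' "$1" |
    sed -e 's/,//g' \
        -e 's/^\([A-Za-z_]*[0-9]*\)v\([0-9]*\):/\1.\2:/'
}

# Prints NW_005081537.1:11248139-11248210
ucsc_to_region "NW_005081537v1:11,248,139-11,248,210"
```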
Click the bars on the new BAM tracks to expand them and look at the reads and variants.

Notice anything interesting about these 3 different individuals?
As with DAB2, we can use SnpEff output to find variants that have a different allele frequency.

Now that we understand why we need these visualization tools, let's make our SnpEff file!

## 2. Why use SnpEff and Galaxy?
Galaxy provides a user-friendly interface for running jobs without using the command line. SnpEff is a tool for predicting the effects of genetic variants on genes and proteins. In this tutorial we will go through how to use SnpEff to help narrow down the list of genes that will be most impactful for the GTseq assay.

Note: all files used in this tutorial, and more, are available at http://katahdin.acg.maine.edu/~benking/GECO/

### Workflow

## 3. Importing Histories in Galaxy
### Logging in and importing shared histories
First, log in or create a Galaxy account: https://usegalaxy.org/login
The basic Galaxy layout consists of three major sections: history, menu, and tools.

Use the following link to obtain the premade history:
Galaxy link: https://usegalaxy.org/u/suika_64/h/snpeffanalysisgtseq2
After clicking the link you'll be redirected to the history. In the upper right-hand corner, click the plus button to add the history to your Galaxy account.

If this worked, your Galaxy homepage should now look like this:
(If you have another history, go to view all histories and switch to the imported one.)

## 4. Finding tools in Galaxy
### Tool search and selection
First, search for and select the tool you want to use. Let's start by searching for "isec" and selecting "bcftools isec".

### Options and parameters
When you select a tool, the middle menu section changes to show that tool's options. Here you can select the files to use and set parameters.

After selecting options, just click execute! (You can also have Galaxy email you a notification when the job is done.)

## 5. Running tools in Galaxy
### a. bcftools isec
Bcftools isec allows you to create intersections, unions and complements of VCF files. Specifically we'll be using this for finding variants that are unique to a certain species when compared to another.


Recommended options: Slide the "Complement" option to yes. This outputs positions that are unique to the first file when compared to the others.

Select the two species you'd like to compare. (Helpful tip: isec always treats the earliest dataset in your history as the first file. The way around this is to make a new history and copy over the species you're interested in, in the order you want. For example, NESP-BCF.bcf.gz is the 1st dataset in my history, so no matter what combination I choose, isec will always output the variants unique to Nelson's sparrow.)

Example usage: Let's say I want to find all the variants that are unique to the Saltmarsh sparrow when compared to the Swamp sparrow. First I would create a new history and copy over SALS-BCF.bcf.gz (drag it from the previous history to the current one), then copy over SWSP-BCF.bcf.gz, making them datasets 1 and 2 respectively. Then search for the isec tool, select it, slide the Complement option to yes, select SALS-BCF.bcf.gz and SWSP-BCF.bcf.gz, and hit execute.

Note: You can also restrict this output to a certain region. Click "Restrict to", scroll down to "Regions", and select "Specify one or more Region(s) directly" in the drop-down menu.
```{bash eval=FALSE}
NW_005081537.1
11247728
11248940
```
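
For anyone who wants to see what these Galaxy settings roughly correspond to on the command line, here is a sketch. It assumes bcftools is installed locally and that the two BCF files have been downloaded into the working directory. The output directory name `isec_out` is arbitrary. If either assumption fails, the script only prints a message.

```bash
# Region from the example above, written as scaffold:start-end.
chrom="NW_005081537.1"
start=11247728
end=11248940
region="${chrom}:${start}-${end}"

first="SALS-BCF.bcf.gz"    # variants unique to this file are reported
second="SWSP-BCF.bcf.gz"   # file to compare against

if command -v bcftools >/dev/null 2>&1 && [ -f "$first" ] && [ -f "$second" ]; then
  # Region queries need indexed inputs; build the indexes if they are missing.
  [ -f "$first.csi" ] || bcftools index "$first"
  [ -f "$second.csi" ] || bcftools index "$second"
  # --complement: keep positions present only in the first file
  # --prefix: write the result into the isec_out/ directory
  bcftools isec --complement --regions "$region" --prefix isec_out "$first" "$second"
else
  echo "bcftools or the input files were not found; nothing was run for $region"
fi
```

On the command line the file order is explicit, so swapping `first` and `second` gives the variants unique to the Swamp sparrow instead.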

The isec output should look like this!

Another way to think about bcftools isec is as a Venn diagram.


Note that this list matches our isec results!
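
The Venn diagram idea can be mimicked with plain text lists and the standard `comm` utility. This is only an illustration of the set logic, not a replacement for bcftools isec, and the positions below are hypothetical.

```bash
workdir=$(mktemp -d)

# Hypothetical variant positions for two species, one "scaffold:position" per line.
printf '%s\n' "scaffold1:100" "scaffold1:250" "scaffold1:400" | LC_ALL=C sort > "$workdir/first.txt"
printf '%s\n' "scaffold1:250" "scaffold1:400" "scaffold1:900" | LC_ALL=C sort > "$workdir/second.txt"

# comm -23 keeps lines found only in the first file (the left-only part of the Venn diagram).
unique_to_first=$(LC_ALL=C comm -23 "$workdir/first.txt" "$workdir/second.txt")
# comm -12 keeps lines found in both files (the overlap).
shared=$(LC_ALL=C comm -12 "$workdir/first.txt" "$workdir/second.txt")

echo "Unique to first file: $unique_to_first"
echo "Shared by both files:"
echo "$shared"

rm -r "$workdir"
```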

### b. bcftools stats
This tool is pretty straightforward: just select the file(s) you'd like to compare or get stats on and hit execute. (We can keep all the defaults for this tool.)

Note: we can also run this on our finished SnpEff file down the line.

### c. bcftools view
Bcftools view is a tool that allows us to filter by region and allele frequency.

Recommended options: Set the desired region. Click Filter options and scroll down to Min Af and Max AF to set allele frequency filters. As in bcftools isec, we can also restrict by region if we're only interested in certain genes.
Example usage: keep only variants in the GREM1 region with an allele frequency between 10% and 90%.
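
To make the allele frequency filter concrete, here is a small stand-in written in awk. On the command line, bcftools view has `--min-af` and `--max-af` options for this. The sketch below only imitates the idea on a tiny made-up VCF by computing AC/AN from the INFO column. The function name `filter_by_af` and all of the records are hypothetical, and only the first ALT allele is considered.

```bash
filter_by_af() {
  # $1 = VCF file, $2 = minimum allele frequency, $3 = maximum allele frequency
  awk -F'\t' -v min="$2" -v max="$3" '
    /^#/ { print; next }                              # keep header lines
    {
      ac = ""; an = ""
      n = split($8, info, ";")
      for (i = 1; i <= n; i++) {
        if (info[i] ~ /^AC=/) ac = substr(info[i], 4)
        if (info[i] ~ /^AN=/) an = substr(info[i], 4)
      }
      if (ac == "" || an == "" || an + 0 == 0) next   # AF cannot be computed
      af = ac / an
      if (af >= min && af <= max) print
    }
  ' "$1"
}

# Tiny made-up VCF with allele frequencies of 0.05, 0.5 and 0.95.
vcf=$(mktemp)
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf 'scaffold1\t100\t.\tA\tG\t.\t.\tAC=1;AN=20\n'
  printf 'scaffold1\t200\t.\tC\tT\t.\t.\tAC=10;AN=20\n'
  printf 'scaffold1\t300\t.\tG\tA\t.\t.\tAC=19;AN=20\n'
} > "$vcf"

# Keep only records with 0.1 <= AF <= 0.9 (here, the record at position 200).
filter_by_af "$vcf" 0.1 0.9
rm -f "$vcf"
```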

#### Bcftools view - Restrict by region

#### Bcftools view - Restrict by allele frequency

### d. SnpEff eff
SnpEff will allow us to identify the specific changes caused by variants.

Select your filtered file, say yes to creating a CSV report, and change the genome source to "Custom snpEff database in your history", which should automatically select the "SnpEff4.3 database for Zonotrichia_albicollis".

After the parameters are set, hit the blue "Execute" button to run the job!
The job may take a while, so feel free to click away or do other work while it is running. (It'll turn grey when submitted, yellow when running, and green once completed.) You can even check the status on your phone or have Galaxy email you once it's completed.
Once the job is finished, you should have a VCF/BED file!

### e. Download files from Galaxy
Once your new VCF or BED file is made, you can download it by clicking the floppy disk icon.
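
Once the annotated VCF is on your computer, the parts worth copying into a spreadsheet can be pulled out of the `ANN` field instead of being picked out by eye. The sketch below assumes the standard SnpEff `ANN` layout (allele, effect, impact, gene name, and so on, separated by `|`) and reports only the first annotation for each variant. The function name `ann_to_tsv` and the example record are hypothetical.

```bash
ann_to_tsv() {
  # $1 = SnpEff-annotated VCF; prints a tab-separated table of the first annotation per variant
  awk -F'\t' '
    BEGIN { OFS = "\t"; print "chrom", "pos", "gene", "effect", "impact", "protein_change" }
    /^#/ { next }
    {
      n = split($8, info, ";")
      for (i = 1; i <= n; i++) {
        if (info[i] !~ /^ANN=/) continue
        split(substr(info[i], 5), anns, ",")   # annotations are comma-separated
        split(anns[1], f, "[|]")               # subfields are pipe-separated
        # f[2] = effect, f[3] = impact, f[4] = gene name, f[11] = protein change (HGVS.p)
        print $1, $2, f[4], f[2], f[3], f[11]
      }
    }
  ' "$1"
}

# Made-up annotated record, for illustration only.
vcf=$(mktemp)
{
  printf '##fileformat=VCFv4.2\n'
  printf 'scaffold1\t200\t.\tA\tG\t.\t.\tAC=10;AN=20;ANN=G|missense_variant|MODERATE|GENE1|GENE1|transcript|T1|protein_coding|2/3|c.100A>G|p.Thr34Ala|100/900|100/600|34/199||\n'
} > "$vcf"
ann_to_tsv "$vcf"
rm -f "$vcf"
```

The tab-separated output opens directly in a spreadsheet program.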

## Appendix
SnpEff intro: http://pcingola.github.io/SnpEff/se_introduction/

Galaxy tutorial: https://training.galaxyproject.org/training-material/topics/introduction/tutorials/galaxy-intro-101/tutorial.html?utm_source=redirect&utm_medium=learn&utm_campaign=galaxyhub

Custom SnpEff database generation workflow:

#### Simple

#### Detailed
