-
Notifications
You must be signed in to change notification settings - Fork 8
Expand file tree
/
Copy pathsequenceAnalysis.R
More file actions
119 lines (99 loc) · 4.74 KB
/
sequenceAnalysis.R
File metadata and controls
119 lines (99 loc) · 4.74 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
# sequenceAnalysis.R
#
# +-----------------------------------------------------------------+
# | |
# | Do not edit this file! Edit "mySequenceAnalysis.R" instead. |
# | |
# +-----------------------------------------------------------------+
#
# Purpose:
#
# Version: 1.1
#
# Date: 2019 05 12
# Author: Boris Steipe (boris.steipe@utoronto.ca)
#
# V 1.1 2019 updates
# V 1.0 First code 2018
#
# TODO:
#
#
# == HOW TO WORK WITH THIS FILE ================================================
#
# This file contains scenarios and tasks, we will discuss them in detail in
# class. Edit profusely, write code, experiment with options, or just play.
# Especially play.
#
# If there is anything you don't understand, use R's help system,
# Google for an answer, or ask. Especially ask. Don't continue if you don't
# understand what's going on. That's not how it works ...
#
# ==============================================================================
#TOC> ==========================================================================
#TOC>
#TOC> Section Title Line
#TOC> -----------------------------------------
#TOC> 1 SCENARIO 47
#TOC> 2 READ DATA 56
#TOC> 3 ANALYZE THE DATA 95
#TOC> 4 PLOT DATA 111
#TOC>
#TOC> ==========================================================================
# = 1 SCENARIO ============================================================
# We will consider 100,000 nucleotides from human chromosome 20. What are the
# dinucleotide frequencies? Are they random? We wish to create a plot that
# looks like this:
source("./sampleSolutions/sequenceAnalysisSampleSolutions-ShowPlot.R")
# = 2 READ DATA ===========================================================
# Task 2.1: Find a source for nucleotides 58,815,001 to 58,915,000 on the
# hg38 assembly of chromosome 20. Download the data to your project
# directory. Name the file "chr20-100kbp.fasta". What format is this
# in? How is this format specified?
#
# Task 2.2: Prepare to read the data into R.
# - How do we read this in?
# - How do we keep the data in memory?
# Write a function that takes a filename as an argument and
# returns the sequence as a vector with one character per element.
#
# Hint: the functions readLines(), grep(), paste(), and strsplit()
# will be useful. But don't run off to look up the help files.
# Your first task is to define - step by step - what your
# function needs to do. Once you have defined this, THEN you
# start thinking about how to implement it. Make a copy of
# functionTemplate.R, call it readFASTA.R, and put it into
# the "./R" folder. Open it for editing. Then put your
# "pseudocode" into the function body as comments. Once this
# is done, all you need to do is to translate pseudocode into
# R syntax. Also: it may be smart to develop code with a
# mini-dataset - not your genome-scale production data.
# Note: Focus on development workflow. How do you define a task,
# break it down into steps, figure out how to approach each step
# and finally implement each step in code.
# Note: Once your function is defined, and saved in the "./R" folder,
# it will be automatically loaded in your session when you type
# init() - that's the way the init() function is defined.
#
# Task 2.3: Test and verify that your function is correct.
# - What should you test for?
# - How should you test?
#
# Task 2.4: Actually read "chr20-100kbp.fasta" and assign it to "mySeq".
# - confirm that this has worked. How?
# = 3 ANALYZE THE DATA ====================================================
#
# Task 3.1: Calculate the GC contents
# - How?
# Task 3.2: Is this value expected? What is the human average?
# - Find the value. Is it constant throughout the genome?
# Task 3.3: What are the dinucleotide frequencies?
# - Make a vector of dinucleotides (hint: use paste() )
# - Get the counts (hint: use table() )
# Task 3.4: What are the expected dinucleotide frequencies?
# - Can we get them from first principles?
# - Can we simulate them (hint: use sample())
# = 4 PLOT DATA ===========================================================
#
# Task 4.1: Plot observed and expected frequencies as a barplot().
# [END]