R-Intro/sequenceAnalysis.R at master · hyginn/R-Intro · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
# sequenceAnalysis.R
#
#   +-----------------------------------------------------------------+
#   |                                                                 |
#   |  Do not edit this file! Edit "mySequenceAnalysis.R" instead.    |
#   |                                                                 |
#   +-----------------------------------------------------------------+
#
# Purpose:
#
# Version: 1.1
#
# Date:    2019  05  12
# Author:  Boris Steipe (boris.steipe@utoronto.ca)
#
# V 1.1    2019 updates
# V 1.0    First code 2018
#
# TODO:
#
#
# == HOW TO WORK WITH THIS FILE ================================================
#
#  This file contains scenarios and tasks, we will discuss them in detail in
#  class. Edit profusely, write code, experiment with options, or just play.
#  Especially play.
#
#  If there is anything you don't understand, use R's help system,
#  Google for an answer, or ask. Especially ask. Don't continue if you don't
#  understand what's going on. That's not how it works ...
#
# ==============================================================================


#TOC> ==========================================================================
#TOC>
#TOC>   Section  Title                     Line
#TOC> -----------------------------------------
#TOC>   1        SCENARIO                    47
#TOC>   2        READ DATA                   56
#TOC>   3        ANALYZE THE DATA            95
#TOC>   4        PLOT DATA                  111
#TOC>
#TOC> ==========================================================================


# =    1  SCENARIO  ============================================================

# We will consider 100,000 nucleotides from human chromosome 20. What are the
# dinucleotide frequencies? Are they random? We wish to create a plot that
# looks like this:

source("./sampleSolutions/sequenceAnalysisSampleSolutions-ShowPlot.R")


# =    2  READ DATA  ===========================================================

# Task 2.1: Find a source for nucleotides 58,815,001 to 58,915,000 on the
#           hg38 assembly of chromosome 20. Download the data to your project
#           directory. Name the file "chr20-100kbp.fasta". What format is this
#           in? How is this format specified?
#
# Task 2.2: Prepare to read the data into R.
#           - How do we read this in?
#           - How do we keep the data in memory?
#           Write a function that takes a filename as an argument and
#           returns the sequence as a vector with one character per element.
#
#           Hint: the functions readLines(), grep(), paste(), and strsplit()
#                 will be useful. But don't run off to look up the help files.
#                 Your first task is to define - step by step - what your
#                 function needs to do. Once you have defined this, THEN you
#                 start thinking about how to implement it. Make a copy of
#                 functionTemplate.R, call it readFASTA.R, and put it into
#                 the "./R" folder. Open it for editing. Then put your
#                 "pseudocode" into the function body as comments. Once this
#                 is done, all you need to do is to translate pseudocode into
#                 R syntax. Also: it may be smart to develop code with a
#                 mini-dataset - not your genome-scale production data.
#           Note: Focus on development workflow. How do you define a task,
#                 break it down into steps, figure out how to approach each step
#                 and finally implement each step in code.
#           Note: Once your function is defined, and saved in the "./R" folder,
#                 it will be automatically loaded in your session when you type
#                 init() - that's the way the init() function is defined.
#
# Task 2.3: Test and verify that your function is correct.
#           - What should you test for?
#           - How should you test?
#
# Task 2.4: Actually read "chr20-100kbp.fasta" and assign it to "mySeq".
#           - confirm that this has worked. How?


# =    3  ANALYZE THE DATA  ====================================================
#
# Task 3.1: Calculate the GC contents
#           - How?

# Task 3.2: Is this value expected? What is the human average?
#           - Find the value. Is it constant throughout the genome?

# Task 3.3: What are the dinucleotide frequencies?
#           - Make a vector of dinucleotides (hint: use paste() )
#           - Get the counts (hint: use table() )

# Task 3.4: What are the expected dinucleotide frequencies?
#           - Can we get them from first principles?
#           - Can we simulate them (hint: use sample())

# =    4  PLOT DATA  ===========================================================
#

# Task 4.1: Plot observed and expected frequencies as a barplot().


# [END]