|
| 1 | +```@setup traitevol_fixednet |
| 2 | +using PhyloNetworks |
| 3 | +using DataFrames |
| 4 | +mkpath("../assets/figures") |
| 5 | +``` |
| 6 | + |
1 | 7 | # Discrete Trait Evolution
|
2 | 8 |
|
3 | 9 | With a phylogenetic network structure inferred, we can now estimate how quickly traits
|
4 |
| -have evolved over time using a likelihood model. These traits should be discrete |
5 |
| -characteristics of a species such as feather color, diet type, or dna in aligned |
6 |
| -genetic sequences. |
| 10 | +have evolved over time using a likelihood model. These traits should be discrete |
| 11 | +characteristics of a species such as feather color, diet type, |
| 12 | +or DNA in aligned genetic sequences. |
7 | 13 |
|
8 | 14 | ## Discrete Trait Data
|
9 | 15 |
|
10 |
| -As with continuous trait evolution, we assume a fixed network, correctly rooted, |
| 16 | +As with continuous trait evolution, we assume a fixed network, correctly rooted, |
11 | 17 | with branch lengths proportional to calendar time. We start with a network, then
|
12 | 18 | add data about the tips of this network. We allow data of two types.
|
13 |
| -1. A vector of species names with a data frame of traits: |
14 | 19 |
|
15 |
| -```@setup fitDiscrete |
16 |
| -using PhyloNetworks, DataFrames |
17 |
| -mkpath("../assets/figures") |
18 |
| -``` |
19 |
| - |
20 |
| -```@example fitDiscrete |
21 |
| -#read in network |
22 |
| -simple_net = readTopology("(A:3.0,(B:2.0,(C:1.0,D:1.0):1.0):1.0);"); |
23 |
| -
|
24 |
| -#read in trait data |
25 |
| -simple_species = ["C","A","B","D"] |
26 |
| -simple_dat = DataFrame(trait=["hi","lo","lo","hi"]) |
27 |
| -``` |
28 |
| - |
29 |
| -If your species names and trait data are in the same data frame, read in your data |
30 |
| -frame then subset the data like this: |
31 |
| -```@example fitDiscrete |
32 |
| -dat = DataFrame(species=["C","A","B","D"], trait=["hi","lo","lo","hi"]) |
33 |
| -simple_species = dat[:species] |
34 |
| -simple_dat = DataFrame(trait = dat[:trait]) |
35 |
| -``` |
36 |
| - |
37 |
| -2. To use dna data, read in the network structure then start with a fasta |
38 |
| -file. Reading the data from this file using the `readfastatodna` function. This |
39 |
| -creates a data frame of dna data and a vector of dna pattern weights. |
40 |
| - |
41 |
| -```@example fitDiscrete |
42 |
| -#read in network |
43 |
| -dna_net = readTopology("((((((((((((((Ae_caudata_Tr275:1.0,Ae_caudata_Tr276:1.0):1.0,Ae_caudata_Tr139:1.0):1.0)#H1:1.0::0.6,((((((Ae_longissima_Tr241:1.0,Ae_longissima_Tr242:1.0):1.0,Ae_longissima_Tr355:1.0):1.0,(Ae_sharonensis_Tr265:1.0,Ae_sharonensis_Tr264:1.0):1.0):1.0,((Ae_bicornis_Tr408:1.0,Ae_bicornis_Tr407:1.0):1.0,Ae_bicornis_Tr406:1.0):1.0):1.0,((Ae_searsii_Tr164:1.0,Ae_searsii_Tr165:1.0):1.0,Ae_searsii_Tr161:1.0):1.0):1.0)#H2:1.0::0.6):1.0,(((Ae_umbellulata_Tr266:1.0,Ae_umbellulata_Tr257:1.0):1.0,Ae_umbellulata_Tr268:1.0):1.0,#H1:1.0::0.4):1.0):1.0,((Ae_comosa_Tr271:1.0,Ae_comosa_Tr272:1.0):1.0,(((Ae_uniaristata_Tr403:1.0,Ae_uniaristata_Tr357:1.0):1.0,Ae_uniaristata_Tr402:1.0):1.0,Ae_uniaristata_Tr404:1.0):1.0):1.0):1.0,(((Ae_tauschii_Tr352:1.0,Ae_tauschii_Tr351:1.0):1.0,(Ae_tauschii_Tr180:1.0,Ae_tauschii_Tr125:1.0):1.0):1.0,(#H2:1.0::0.4,((((Ae_mutica_Tr237:1.0,Ae_mutica_Tr329:1.0):1.0,Ae_mutica_Tr244:1.0):1.0,Ae_mutica_Tr332:1.0):1.0)#H4:1.0::0.6):1.0):1.0):1.0,(((T_boeoticum_TS8:1.0,(T_boeoticum_TS10:1.0,T_boeoticum_TS3:1.0):1.0):1.0,T_boeoticum_TS4:1.0):1.0,((T_urartu_Tr315:1.0,T_urartu_Tr232:1.0):1.0,(T_urartu_Tr317:1.0,T_urartu_Tr309:1.0):1.0):1.0):1.0):1.0,(((((Ae_speltoides_Tr320:1.0,Ae_speltoides_Tr323:1.0):1.0,Ae_speltoides_Tr223:1.0):1.0,Ae_speltoides_Tr251:1.0):1.0):1.0,#H4:1.0::0.4):1.0):1.0):1.0,Ta_caputMedusae_TB2:1.0):1.0,S_vavilovii_Tr279:1.0):1.0,Er_bonaepartis_TB1:1.0):1.0,H_vulgare_HVens23:1.0);"); |
| 20 | +1. A vector of species names with a data frame of traits: |
44 | 21 |
|
45 |
| -#read in dna data |
46 |
| -fastafile = joinpath(dirname(pathof(PhyloNetworks)), "..","examples", |
47 |
| -"Ae_bicornis_Tr406_Contig10132.aln") |
48 |
| -dna_dat, dna_weights = readfastatodna(fastafile, true); |
49 |
| -``` |
| 22 | + ```@example traitevol_fixednet |
| 23 | + # read in network |
| 24 | + net = readTopology("(A:3,((B:0.4)#H1:1.6::0.92,((C:0.4,#H1:0::0.08):0.6,D:1):1):1);"); |
| 25 | + # read in trait data |
| 26 | + species = ["C","A","B","D"] |
| 27 | + dat = DataFrame(trait=["hi","lo","lo","hi"]) |
| 28 | + ``` |
| 29 | + |
| 30 | + If your species names and trait data are in the same data frame, |
| 31 | + read in your data frame then subset the data like this: |
| 32 | + ```@example traitevol_fixednet |
| 33 | + dat = DataFrame(species=["C","A","B","D"], trait=["hi","lo","lo","hi"]) |
| 34 | + species = dat[:species] |
| 35 | + dat = DataFrame(trait = dat[:trait]) |
| 36 | + ``` |
| 37 | + |
| 38 | +2. To use dna data, read in the network structure then start with a fasta |
| 39 | + file. Reading the data from this file using the `readfastatodna` function. |
| 40 | + This creates a data frame of dna data and a vector of dna pattern weights. |
| 41 | + |
| 42 | + ```@example traitevol_fixednet |
| 43 | + # read in network |
| 44 | + dna_net = readTopology("((((((((((((((Ae_caudata_Tr275:1.0,Ae_caudata_Tr276:1.0):1.0,Ae_caudata_Tr139:1.0):1.0)#H1:1.0::0.6,((((((Ae_longissima_Tr241:1.0,Ae_longissima_Tr242:1.0):1.0,Ae_longissima_Tr355:1.0):1.0,(Ae_sharonensis_Tr265:1.0,Ae_sharonensis_Tr264:1.0):1.0):1.0,((Ae_bicornis_Tr408:1.0,Ae_bicornis_Tr407:1.0):1.0,Ae_bicornis_Tr406:1.0):1.0):1.0,((Ae_searsii_Tr164:1.0,Ae_searsii_Tr165:1.0):1.0,Ae_searsii_Tr161:1.0):1.0):1.0)#H2:1.0::0.6):1.0,(((Ae_umbellulata_Tr266:1.0,Ae_umbellulata_Tr257:1.0):1.0,Ae_umbellulata_Tr268:1.0):1.0,#H1:1.0::0.4):1.0):1.0,((Ae_comosa_Tr271:1.0,Ae_comosa_Tr272:1.0):1.0,(((Ae_uniaristata_Tr403:1.0,Ae_uniaristata_Tr357:1.0):1.0,Ae_uniaristata_Tr402:1.0):1.0,Ae_uniaristata_Tr404:1.0):1.0):1.0):1.0,(((Ae_tauschii_Tr352:1.0,Ae_tauschii_Tr351:1.0):1.0,(Ae_tauschii_Tr180:1.0,Ae_tauschii_Tr125:1.0):1.0):1.0,(#H2:1.0::0.4,((((Ae_mutica_Tr237:1.0,Ae_mutica_Tr329:1.0):1.0,Ae_mutica_Tr244:1.0):1.0,Ae_mutica_Tr332:1.0):1.0)#H4:1.0::0.6):1.0):1.0):1.0,(((T_boeoticum_TS8:1.0,(T_boeoticum_TS10:1.0,T_boeoticum_TS3:1.0):1.0):1.0,T_boeoticum_TS4:1.0):1.0,((T_urartu_Tr315:1.0,T_urartu_Tr232:1.0):1.0,(T_urartu_Tr317:1.0,T_urartu_Tr309:1.0):1.0):1.0):1.0):1.0,(((((Ae_speltoides_Tr320:1.0,Ae_speltoides_Tr323:1.0):1.0,Ae_speltoides_Tr223:1.0):1.0,Ae_speltoides_Tr251:1.0):1.0):1.0,#H4:1.0::0.4):1.0):1.0):1.0,Ta_caputMedusae_TB2:1.0):1.0,S_vavilovii_Tr279:1.0):1.0,Er_bonaepartis_TB1:1.0):1.0,H_vulgare_HVens23:1.0);"); |
| 45 | + # read in dna data |
| 46 | + fastafile = joinpath(dirname(pathof(PhyloNetworks)), "..","examples","Ae_bicornis_Tr406_Contig10132.aln") |
| 47 | + dna_dat, dna_weights = readfastatodna(fastafile, true); |
| 48 | + dna_dat |
| 49 | + dna_weights |
| 50 | + ``` |
50 | 51 |
|
51 | 52 | ## Choosing a Substitution Model
|
52 | 53 |
|
53 |
| -After reading in your data, choose a model to describe how evolutionary changes |
54 |
| -(or substitutions, in the case of DNA) happened over time. We offer a selection |
55 |
| -of Markov substitution models to describe the evolutionary process. |
| 54 | +After reading in your data, choose a model to describe how evolutionary changes |
| 55 | +(or substitutions, in the case of DNA) happened over time. |
| 56 | +Available Markov substitution models are described below. |
56 | 57 |
|
57 |
| -### Generic Trait Models |
| 58 | +### Generic Trait Models |
58 | 59 |
|
59 | 60 | These models works well for any type of trait we may want to model. For general
|
60 |
| -trait types, use one of these three models: |
61 |
| -`:ERSM` Equal Rates Substitution Model |
62 |
| -`:BTSM` Binary Trait Substitution Model |
63 |
| -`:TBTSM` Two Binary Trait Substituion Model |
64 |
| - |
65 |
| -### DNA-Specific Models |
66 |
| - |
67 |
| -The DNA-specific models are optimized for aligned sequence data. We offer JC69 |
68 |
| -and HKY85 models in both relative and absolute versions. The JC69 model was |
69 |
| -developed by Jukes and Cantor in 1969 and uses one rate for all type of substitutions. |
70 |
| -The HKY85 model was developed in 1985 by Hasegawa, Kishino, & Yano. It treats |
71 |
| -transitions differently from transversions. |
72 |
| -`:JC69` Jukes Cantor 69 Model |
73 |
| -`:HKY85` Hasegawa, Kishino and Yano 1985 |
74 |
| - |
75 |
| -## Running FitDiscrete |
76 |
| - |
77 |
| -To infer evolutionary rates, run the `fitDiscrete` function on the network and data. |
78 |
| -It will calculate the maximum likelihood score of a network given one or more |
79 |
| -discrete trait characters at the tips. Along each edge, evolutionary changes |
80 |
| -are modeled with a continous time Markov model, with parameters estimated by |
81 |
| -maximizing the likelihood. At each hybrid node, the trait is assumed to be |
82 |
| -inherited from the immediate parent (or parents, in the case of a hybrid edge). |
83 |
| -If there is a hybrid edge, the trait is modeled according to the parents' weighted |
84 |
| -average genetic contributions, as measured by inheritance gamma γ. The model |
85 |
| -currently ignores incomplete lineage sorting. |
| 61 | +trait types, use one of these three models: |
| 62 | +- `:BTSM` Binary Trait Substitution Model (2 states, rates unconstrained) |
| 63 | +- `:ERSM` Equal Rates Substitution Model |
| 64 | + (`k` states, all transitions possible with equal rates) |
| 65 | +- `:TBTSM` Two Binary Trait Substituion Model (though not fully implemented yet) |
| 66 | + |
| 67 | +### DNA-Specific Models |
| 68 | + |
| 69 | +The DNA-specific models are optimized for aligned sequence data. |
| 70 | +The 4 nucleotide states are from |
| 71 | +[BioSymbols](https://github.com/BioJulia/BioSymbols.jl) |
| 72 | +(listed [here](http://biojulia.net/BioSymbols.jl/stable/nucleicacids/)). |
| 73 | +Each model has a relative and an absolute version. |
| 74 | +- `:JC69` Jukes & Cantor 1969 model: one single rate for all transitions. |
| 75 | + The relative version has values -1 along the diagonal of the rate matrix |
| 76 | + (1 expected transition / unit of time). The absolute version has an extra |
| 77 | + parameter to scale the rate matrix. |
| 78 | +- `:HKY85` Hasegawa, Kishino & Yano 1985: treats transitions differently |
| 79 | + from transversions. |
| 80 | + |
| 81 | +## Fitting the model |
| 82 | + |
| 83 | +To infer evolutionary rates, run the `fitdiscrete` function on the network and data. |
| 84 | +It will calculate the maximum likelihood score of a fixed network |
| 85 | +given one or more discrete trait characters at the tips. |
| 86 | +Along each edge, evolutionary changes |
| 87 | +are modeled with a continous time Markov model, with parameters estimated by |
| 88 | +maximizing the likelihood. At each hybrid node, the trait is assumed to be |
| 89 | +inherited from the immediate parent (or parents, in the case of a hybrid node). |
| 90 | +At a hybrid node, the trait is assumed to be inherited from one or the other |
| 91 | +parent, with probabilities equal to the inheritance γ of each parent edge |
| 92 | +(which is given by the network). |
| 93 | +The model ignores incomplete lineage sorting (e.g. hemiplasy). |
86 | 94 |
|
87 | 95 | ### General Trait Data
|
88 | 96 |
|
89 |
| -```@example fitDiscrete |
90 |
| -s1 = fitDiscrete(simple_net, :ERSM, simple_species, simple_dat; optimizeQ=false, |
91 |
| -optimizeRVAS=false) |
92 |
| -s2 = fitDiscrete(simple_net, :BTSM, simple_species, simple_dat; optimizeQ=false, |
93 |
| -optimizeRVAS=false) |
| 97 | +```@repl traitevol_fixednet |
| 98 | +s1 = fitdiscrete(net, :ERSM, species, dat; optimizeQ=false) |
| 99 | +s2 = fitdiscrete(net, :BTSM, species, dat; optimizeQ=false) |
94 | 100 | ```
|
95 |
| -In this `fitDiscrete` call, we do not optimize rates or allow for rate variation |
96 |
| -across sites. |
97 |
| - |
98 |
| -If `optimizeQ = true`, the `fitDiscrete` function estimates the evolutionary rate |
99 |
| -or rates. Because we didn't allow for rate variation across sites in these models, |
100 |
| -we do not optimize the way rates may vary across trait types. |
101 |
| - |
102 |
| -```@example fitDiscrete |
103 |
| -s3 = fitDiscrete(simple_net, :ERSM, simple_species, simple_dat; optimizeQ=true, |
104 |
| -optimizeRVAS=true) |
105 |
| -s4 = fitDiscrete(simple_net, :BTSM, simple_species, simple_dat; optimizeQ=true, |
106 |
| -optimizeRVAS=true) |
| 101 | +In this `fitdiscrete` call, we do not optimize rates or allow for rate variation |
| 102 | +across sites. The default rates (which act as starting value if rates |
| 103 | +were to be optimized) are chosen equal to the inverse of the total edge lengths |
| 104 | +in the network (or 1/ntax if all branch lengths are missing). |
| 105 | + |
| 106 | +If `optimizeQ = true` (which is the default), the `fitdiscrete` |
| 107 | +function estimates the parameters of the rate matrix. |
| 108 | +Because we didn't allow for rate variation across sites in these models, |
| 109 | +there is nothing to optimize in the way rates may vary across traits (sites). |
| 110 | + |
| 111 | +```@repl traitevol_fixednet |
| 112 | +s3 = fitdiscrete(net, :ERSM, species, dat) |
| 113 | +s4 = fitdiscrete(net, :BTSM, species, dat) |
107 | 114 | ```
|
108 | 115 |
|
109 | 116 | ### DNA Data
|
110 | 117 |
|
111 |
| -For DNA data, use `:JC69` or `:HKY85` models. |
112 |
| -```@example fitDiscrete |
113 |
| -d1 = fitDiscrete(dna_net, :JC69, dna_dat, dna_weights, :RV; optimizeQ=false, |
114 |
| -optimizeRVAS=false) |
115 |
| -d2 = fitDiscrete(dna_net, :HKY85, dna_dat, dna_weights, :RV; optimizeQ=false, |
116 |
| -optimizeRVAS=false) |
| 118 | +For DNA data, use one of `:JC69` or `:HKY85`. |
| 119 | +To allow for rate variation across sites, use the `:RV` option. |
| 120 | + |
| 121 | +```@example traitevol_fixednet |
| 122 | +d1 = fitdiscrete(dna_net, :JC69, dna_dat, dna_weights, :RV; optimizeQ=false, optimizeRVAS=false) |
| 123 | +d2 = fitdiscrete(dna_net, :HKY85, dna_dat, dna_weights, :RV; optimizeQ=false, optimizeRVAS=false) |
117 | 124 | ```
|
118 |
| -In these `fitDiscrete` models, we do not optimize rates (`optimizeQ=false`), but |
119 |
| -we do allow for rate variation across sites. |
| 125 | +In these `fitdiscrete` models, we do not optimize rates (`optimizeQ=false`), but |
| 126 | +we do allow for rate variation across sites, with a default α of 1. |
120 | 127 |
|
121 | 128 | ### Rate Variation Across Sites
|
122 | 129 |
|
123 |
| -In its default version, `fitDiscrete` does not allow for rate variation across sites. |
124 |
| -To allow for rate variation across sites in your estimate of evolutionary rates |
125 |
| -(or rate variation across trait types, in the case of general trait types), |
126 |
| -include `:RV`. If you include `:RV` and `optimizeRVAS = true`, the model will |
127 |
| -not only allow for rate variation, but it will also optimize how rates vary across |
128 |
| -sites. |
| 130 | +In its default version, `fitdiscrete` does not allow for rate variation across sites. |
| 131 | +To allow for rate variation across sites in your estimate of evolutionary rates |
| 132 | +(or rate variation across traits in the case of general traits), |
| 133 | +include `:RV`. If you include `:RV` and `optimizeRVAS = true`, |
| 134 | +the model will allow for rate variation and |
| 135 | +also optimize the parameter α of the distribution of rates across sites. |
129 | 136 |
|
130 | 137 | We optimize the evolutionary rates and the way rates vary across sites for the
|
131 |
| -DNA data here: |
132 |
| -```@example fitDiscrete |
133 |
| -d3 = fitDiscrete(dna_net, :JC69, dna_dat, dna_weights, :RV; optimizeQ=true, |
134 |
| -optimizeRVAS=false) |
135 |
| -d4 = fitDiscrete(dna_net, :HKY85, dna_dat, dna_weights, :RV; optimizeQ=true, |
136 |
| -optimizeRVAS=false) |
137 |
| -``` |
| 138 | +DNA data here: |
| 139 | +```@repl traitevol_fixednet |
| 140 | +d3 = fitdiscrete(dna_net, :JC69, dna_dat, dna_weights, :RV; optimizeRVAS=false) |
| 141 | +d4 = fitdiscrete(dna_net, :HKY85, dna_dat, dna_weights, :RV) |
| 142 | +``` |
0 commit comments