Skip to content

Commit 8ec36e5

Browse files
coraallensaviettacecileane
authored andcommitted
JC69, HKY85 and rate variation across sites (JuliaPhylo#94)
1 parent 736e2c8 commit 8ec36e5

25 files changed

+1834
-373
lines changed

.gitignore

+5-1
Original file line numberDiff line numberDiff line change
@@ -5,6 +5,10 @@ docs/build/
55
docs/site/
66
docs/*.toml
77

8+
# benchmark tuning parameters #
9+
benchmark/params.json
10+
benchmark/tune.json
11+
812
# Compiled source #
913
###################
1014
*.com
@@ -95,4 +99,4 @@ Backup[ ]of[ ]*.numbers/
9599
*.jpg
96100
*.jpeg
97101
*.png
98-
*.dot
102+
*.dot

REQUIRE

+1
Original file line numberDiff line numberDiff line change
@@ -5,6 +5,7 @@ Combinatorics 0.7
55
CSV 0.4
66
DataFrames 0.13
77
DataStructures 0.9
8+
Distributions 0.15.0
89
GLM 1.0
910
NLopt 0.5.1
1011
SpecialFunctions 0.7

benchmark/README.md

+32
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,32 @@
1+
# Using PkgBenchmark to Compare the Efficiency of Two Different Commits using Benchmark
2+
3+
PkgBenchmarks allows us to compare the performance of a package at different branches, commits, or tags. For full documentation, see the PkgBenchmark [documentation here] (https://juliaci.github.io/PkgBenchmark.jl/stable/)
4+
5+
# Comparing Two Commits on Speed using Benchmarks
6+
This benchmark compares the speed of your current version of PhyloNetworks to the
7+
version in a previous commit.
8+
9+
To use, enter PhyloNetworks' benchmark directory and run:
10+
```bash
11+
bash compareCommits.sh oldCommitNumber
12+
```
13+
For example, in .julia/dev/PhyloNetworks/benchmark, run:
14+
```bash
15+
bash compareCommits.sh oldCommitNumber
16+
```
17+
variables:
18+
oldCommitNumber: a GitHub commit number
19+
20+
# Adding New Benchmarks
21+
22+
To add new benchmarks, use the dictionary interface introduced by Benchmarks.jl. [docs here](https://github.com/JuliaCI/BenchmarkTools.jl/blob/master/doc/manual.md#defining-benchmark-suites)
23+
24+
First, open <PKGROOT>/benchmark/benchmarks.jl. In this file, create a new suite. I've created a suite to test nucleic acid substitution models. It has two subparts, jc69 and hky85.
25+
```julia
26+
SUITE["nasm"] = BenchmarkGroup(["jc69", "hky85"])
27+
```
28+
We can then add to this suite:
29+
```julia
30+
SUITE["nasm"]["jc69"] = @benchmarkable JC69([0.5])
31+
```
32+

benchmark/REQUIRE

+3
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,3 @@
1+
BenchmarkTools 0.2
2+
PkgBenchmark 0.2
3+
Logging

benchmark/benchmarks.jl

+56
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,56 @@
1+
using BenchmarkTools, PhyloNetworks, DataFrames, Logging
2+
3+
#suppresses @warn and @info for benchmarks
4+
logger = SimpleLogger(stderr, Logging.Error);
5+
old_logger = global_logger(logger);
6+
7+
# Define a parent BenchmarkGroup to contain our SUITE
8+
const SUITE = BenchmarkGroup()
9+
10+
SUITE["nasm"] = BenchmarkGroup(["JC69", "HKY85"])
11+
SUITE["fitDiscreteFixed"] = BenchmarkGroup(["ERSM", "BTSM", "JC69", "HKY85"])
12+
SUITE["fitDiscrete"] = BenchmarkGroup(["ERSM", "BTSM", "JC69", "HKY85"])
13+
14+
# Add benchmarks to nasm group
15+
SUITE["nasm"]["JC69"] = @benchmarkable JC69([0.5])
16+
m1 = HKY85([.5], [0.25, 0.25, 0.25, 0.25]);
17+
SUITE["nasm"]["HKY85"] = @benchmarkable P!(P(m1, 1.0), m1, 3.0)
18+
19+
# fitDiscreteFixed benchmarks
20+
net_dat = readTopology("(((A:2.0,(B:1.0)#H1:0.1::0.9):1.5,(C:0.6,#H1:1.0::0.1):1.0):0.5,D:2.0);")
21+
species_alone = ["C","A","B","D"]
22+
dat_alone = DataFrame(trait=["hi","lo","lo","hi"])
23+
SUITE["fitDiscreteFixed"]["ERSM"] = @benchmarkable fitDiscrete(net_dat, :ERSM, species_alone, dat_alone; optimizeQ=false, optimizeRVAS=false)
24+
SUITE["fitDiscreteFixed"]["BTSM"] = @benchmarkable fitDiscrete(net_dat, :BTSM, species_alone, dat_alone; optimizeQ=false, optimizeRVAS=false)
25+
26+
fastafile = joinpath(@__DIR__, "..", "examples", "Ae_bicornis_Tr406_Contig10132.aln")
27+
dna_dat, dna_weights = readfastatodna(fastafile, true);
28+
net_dna = readTopology("((((((((((((((Ae_caudata_Tr275,Ae_caudata_Tr276),Ae_caudata_Tr139))#H1,#H2),(((Ae_umbellulata_Tr266,Ae_umbellulata_Tr257),Ae_umbellulata_Tr268),#H1)),((Ae_comosa_Tr271,Ae_comosa_Tr272),(((Ae_uniaristata_Tr403,Ae_uniaristata_Tr357),Ae_uniaristata_Tr402),Ae_uniaristata_Tr404))),(((Ae_tauschii_Tr352,Ae_tauschii_Tr351),(Ae_tauschii_Tr180,Ae_tauschii_Tr125)),(((((((Ae_longissima_Tr241,Ae_longissima_Tr242),Ae_longissima_Tr355),(Ae_sharonensis_Tr265,Ae_sharonensis_Tr264)),((Ae_bicornis_Tr408,Ae_bicornis_Tr407),Ae_bicornis_Tr406)),((Ae_searsii_Tr164,Ae_searsii_Tr165),Ae_searsii_Tr161)))#H2,#H4))),(((T_boeoticum_TS8,(T_boeoticum_TS10,T_boeoticum_TS3)),T_boeoticum_TS4),((T_urartu_Tr315,T_urartu_Tr232),(T_urartu_Tr317,T_urartu_Tr309)))),(((((Ae_speltoides_Tr320,Ae_speltoides_Tr323),Ae_speltoides_Tr223),Ae_speltoides_Tr251))H3,((((Ae_mutica_Tr237,Ae_mutica_Tr329),Ae_mutica_Tr244),Ae_mutica_Tr332))#H4))),Ta_caputMedusae_TB2),S_vavilovii_Tr279),Er_bonaepartis_TB1),H_vulgare_HVens23);");
29+
for edge in net_dna.edge #adds branch lengths
30+
setLength!(edge,1.0)
31+
if edge.gamma < 0
32+
setGamma!(edge, 0.5)
33+
end
34+
end
35+
SUITE["fitDiscreteFixed"]["JC69"] = @benchmarkable fitDiscrete(net_dna, :JC69, dna_dat, dna_weights; optimizeQ=false, optimizeRVAS=false)
36+
SUITE["fitDiscreteFixed"]["HKY85"] = @benchmarkable fitDiscrete(net_dna, :HKY85, dna_dat, dna_weights; optimizeQ=false, optimizeRVAS=false)
37+
38+
## fitDiscrete benchmarks
39+
SUITE["fitDiscrete"]["ERSM"] = @benchmarkable fitDiscrete(net_dat, :ERSM, species_alone, dat_alone; optimizeQ=true, optimizeRVAS=true)
40+
SUITE["fitDiscrete"]["BTSM"] = @benchmarkable fitDiscrete(net_dat, :BTSM, species_alone, dat_alone; optimizeQ=true, optimizeRVAS=true)
41+
SUITE["fitDiscrete"]["JC69"] = @benchmarkable fitDiscrete(net_dna, :JC69, dna_dat, dna_weights; optimizeQ=true, optimizeRVAS=true)
42+
SUITE["fitDiscrete"]["HKY85"] = @benchmarkable fitDiscrete(net_dna, :HKY85, dna_dat, dna_weights; optimizeQ=true, optimizeRVAS=true)
43+
44+
# If a cache of tuned parameters already exists, use it, otherwise, tune and cache
45+
# the benchmark parameters. Reusing cached parameters is faster and more reliable
46+
# than re-tuning `SUITE` every time the file is included.
47+
paramspath = joinpath(dirname(@__FILE__), "params.json")
48+
49+
if isfile(paramspath)
50+
loadparams!(SUITE, BenchmarkTools.load(paramspath)[1], :evals);
51+
else
52+
tune!(SUITE)
53+
BenchmarkTools.save(paramspath, params(SUITE));
54+
end
55+
56+
global_logger(old_logger) #restores typical logging at end of benchmarks

benchmark/compareCommits.sh

+18
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,18 @@
1+
#!/bin/bash
2+
# This benchmark compares the speed of your current version of PhyloNetworks to a
3+
# version in a previous commit.
4+
5+
# To use, enter PhyloNetworks' benchmark directory and run:
6+
# bash compareCommits.sh oldCommitNumber
7+
8+
# For example, in .julia/dev/PhyloNetworks/benchmark, run:
9+
# bash compareCommits.sh oldCommitNumber
10+
11+
# variables:
12+
# oldCommitNumber: a GitHub commit number
13+
14+
currBranch=$(git branch | sed -e '/^[^*]/d' -e 's/* \(.*\)/\1/')
15+
git checkout $1
16+
julia runBenchmark.jl
17+
git checkout $currBranch
18+
julia runBenchmark.jl

benchmark/runBenchmark.jl

+6
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,6 @@
1+
using Pkg
2+
Pkg.activate("/Users/cora/.julia/environments/net") #sets up development environment
3+
using PkgBenchmark
4+
using BenchmarkTools
5+
using PhyloNetworks
6+
benchmarkpkg("PhyloNetworks")

docs/Project.toml

+4-4
Original file line numberDiff line numberDiff line change
@@ -1,16 +1,16 @@
11
[deps]
2+
BioSymbols = "3c28c6f8-a34d-59c4-9654-267d177fcfa9"
3+
CSV = "336ed68f-0bac-5ca0-87d4-7b16caf5d00b"
4+
DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
25
Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4"
36
DocumenterMarkdown = "997ab1e6-3595-5248-9280-8efb232c3433"
47
# will be added by Travis as being developed:
58
# PhyloNetworks = "33ad39ac-ed31-50eb-9b15-43d0656eaa72"
6-
# packages used in the manual pages:
7-
CSV = "336ed68f-0bac-5ca0-87d4-7b16caf5d00b"
8-
DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
99
# will be added to track master version in make.jl:
1010
# PhyloPlots = "c0d5b6db-e3fc-52bc-a87d-1d050989ed3b"
1111
RCall = "6f49c342-dc21-5d91-9882-a32aef131414"
1212
StatsBase = "2913bbd2-ae8a-5f71-8c99-4fb6c76f3a91"
1313
StatsModels = "3eaba693-59b7-5ba5-a881-562e759f1c8d"
1414

1515
[compat]
16-
Documenter = "~0.21"
16+
Documenter = "~0.22"

docs/make.jl

+1
Original file line numberDiff line numberDiff line change
@@ -24,6 +24,7 @@ makedocs(
2424
"Multiple Alleles" => "man/multiplealleles.md",
2525
"Continuous Trait Evolution" => "man/trait_tree.md",
2626
"Parsimony on networks" => "man/parsimony.md",
27+
"Discrete Trait Evolution" => "man/fitDiscrete.md",
2728
],
2829
"Library" => [
2930
"Public" => "lib/public.md",

docs/readme.md

+6
Original file line numberDiff line numberDiff line change
@@ -20,6 +20,12 @@
2020
3. run `deploydocs(...)` also from Documenter:
2121
to push the files on github, gh-pages branch.
2222

23+
for now, docstrings are automatically used to build an entry for
24+
- each internal thing that has a docstring (e.g. not exported in `src/PhyloNetworks.jl`)
25+
- each public *type*
26+
Therefore: any public *function* needs to be manually listed in `docs/src/man/public.md`,
27+
in a section to get a nice organization of all these manual entries.
28+
2329
## The "Documenter md" format
2430

2531
### Note on the format

docs/src/index.md

+2-1
Original file line numberDiff line numberDiff line change
@@ -47,7 +47,8 @@ Pages = [
4747
"man/bootstrap.md",
4848
"man/multiplealleles.md",
4949
"man/trait_tree.md",
50-
"man/parsimony.md"
50+
"man/parsimony.md",
51+
"man/fitDiscrete.md"
5152
]
5253
Depth = 3
5354
```

docs/src/lib/public.md

+8-2
Original file line numberDiff line numberDiff line change
@@ -8,6 +8,7 @@ See [Internal Documentation](@ref) for documentation on internal functions.
88
DocTestSetup = quote
99
using PhyloNetworks
1010
end
11+
DocTestFilters = r" PhyloNetworks .*:\d+"
1112
```
1213

1314
```@contents
@@ -52,11 +53,14 @@ getNodeAges
5253
pairwiseTaxonDistanceMatrix
5354
biconnectedComponents
5455
blobDecomposition
56+
getlabels
57+
nparams
5558
```
5659

5760
## data and topology read/write
5861

5962
```@docs
63+
readfastatodna
6064
readTopology
6165
readTopologyLevel1
6266
readInputTrees
@@ -130,15 +134,17 @@ vcv
130134
```@docs
131135
parsimonySoftwired
132136
parsimonyGF
133-
nStates
134137
Q
135-
P
136138
randomTrait
137139
randomTrait!
138140
fitDiscrete
139141
maxParsimonyNet
142+
nstates
143+
setrates!
144+
setalpha!
140145
```
141146

142147
```@meta
143148
DocTestSetup = nothing
149+
DocTestFilters = nothing
144150
```

docs/src/man/fitDiscrete.md

+137
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,137 @@
1+
# Discrete Trait Evolution
2+
3+
With a phylogenetic network structure inferred, we can now estimate how quickly traits
4+
have evolved over time using a likelihood model. These traits should be discrete
5+
characteristics of a species such as feather color, diet type, or dna in aligned
6+
genetic sequences.
7+
8+
## Discrete Trait Data
9+
10+
As with continuous trait evolution, we assume a fixed network, correctly rooted,
11+
with branch lengths proportional to calendar time. We start with a network, then
12+
add data about the tips of this network. We allow data of two types.
13+
1. A vector of species names with a data frame of traits:
14+
15+
```@setup fitDiscrete
16+
using PhyloNetworks, DataFrames
17+
mkpath("../assets/figures")
18+
```
19+
20+
```@example fitDiscrete
21+
#read in network
22+
simple_net = readTopology("(A:3.0,(B:2.0,(C:1.0,D:1.0):1.0):1.0);");
23+
24+
#read in trait data
25+
simple_species = ["C","A","B","D"]
26+
simple_dat = DataFrame(trait=["hi","lo","lo","hi"])
27+
```
28+
29+
If your species names and trait data are in the same data frame, read in your data
30+
frame then subset the data like this:
31+
```@example fitDiscrete
32+
dat = DataFrame(species=["C","A","B","D"], trait=["hi","lo","lo","hi"])
33+
simple_species = dat[:species]
34+
simple_dat = DataFrame(trait = dat[:trait])
35+
```
36+
37+
2. To use dna data, read in the network structure then start with a fasta
38+
file. Reading the data from this file using the `readfastatodna` function. This
39+
creates a data frame of dna data and a vector of dna pattern weights.
40+
41+
```@example fitDiscrete
42+
#read in network
43+
dna_net = readTopology("((((((((((((((Ae_caudata_Tr275:1.0,Ae_caudata_Tr276:1.0):1.0,Ae_caudata_Tr139:1.0):1.0)#H1:1.0::0.6,((((((Ae_longissima_Tr241:1.0,Ae_longissima_Tr242:1.0):1.0,Ae_longissima_Tr355:1.0):1.0,(Ae_sharonensis_Tr265:1.0,Ae_sharonensis_Tr264:1.0):1.0):1.0,((Ae_bicornis_Tr408:1.0,Ae_bicornis_Tr407:1.0):1.0,Ae_bicornis_Tr406:1.0):1.0):1.0,((Ae_searsii_Tr164:1.0,Ae_searsii_Tr165:1.0):1.0,Ae_searsii_Tr161:1.0):1.0):1.0)#H2:1.0::0.6):1.0,(((Ae_umbellulata_Tr266:1.0,Ae_umbellulata_Tr257:1.0):1.0,Ae_umbellulata_Tr268:1.0):1.0,#H1:1.0::0.4):1.0):1.0,((Ae_comosa_Tr271:1.0,Ae_comosa_Tr272:1.0):1.0,(((Ae_uniaristata_Tr403:1.0,Ae_uniaristata_Tr357:1.0):1.0,Ae_uniaristata_Tr402:1.0):1.0,Ae_uniaristata_Tr404:1.0):1.0):1.0):1.0,(((Ae_tauschii_Tr352:1.0,Ae_tauschii_Tr351:1.0):1.0,(Ae_tauschii_Tr180:1.0,Ae_tauschii_Tr125:1.0):1.0):1.0,(#H2:1.0::0.4,((((Ae_mutica_Tr237:1.0,Ae_mutica_Tr329:1.0):1.0,Ae_mutica_Tr244:1.0):1.0,Ae_mutica_Tr332:1.0):1.0)#H4:1.0::0.6):1.0):1.0):1.0,(((T_boeoticum_TS8:1.0,(T_boeoticum_TS10:1.0,T_boeoticum_TS3:1.0):1.0):1.0,T_boeoticum_TS4:1.0):1.0,((T_urartu_Tr315:1.0,T_urartu_Tr232:1.0):1.0,(T_urartu_Tr317:1.0,T_urartu_Tr309:1.0):1.0):1.0):1.0):1.0,(((((Ae_speltoides_Tr320:1.0,Ae_speltoides_Tr323:1.0):1.0,Ae_speltoides_Tr223:1.0):1.0,Ae_speltoides_Tr251:1.0):1.0):1.0,#H4:1.0::0.4):1.0):1.0):1.0,Ta_caputMedusae_TB2:1.0):1.0,S_vavilovii_Tr279:1.0):1.0,Er_bonaepartis_TB1:1.0):1.0,H_vulgare_HVens23:1.0);");
44+
45+
#read in dna data
46+
fastafile = joinpath(dirname(pathof(PhyloNetworks)), "..","examples",
47+
"Ae_bicornis_Tr406_Contig10132.aln")
48+
dna_dat, dna_weights = readfastatodna(fastafile, true);
49+
```
50+
51+
## Choosing a Substitution Model
52+
53+
After reading in your data, choose a model to describe how evolutionary changes
54+
(or substitutions, in the case of DNA) happened over time. We offer a selection
55+
of Markov substitution models to describe the evolutionary process.
56+
57+
### Generic Trait Models
58+
59+
These models works well for any type of trait we may want to model. For general
60+
trait types, use one of these three models:
61+
`:ERSM` Equal Rates Substitution Model
62+
`:BTSM` Binary Trait Substitution Model
63+
`:TBTSM` Two Binary Trait Substituion Model
64+
65+
### DNA-Specific Models
66+
67+
The DNA-specific models are optimized for aligned sequence data. We offer JC69
68+
and HKY85 models in both relative and absolute versions. The JC69 model was
69+
developed by Jukes and Cantor in 1969 and uses one rate for all type of substitutions.
70+
The HKY85 model was developed in 1985 by Hasegawa, Kishino, & Yano. It treats
71+
transitions differently from transversions.
72+
`:JC69` Jukes Cantor 69 Model
73+
`:HKY85` Hasegawa, Kishino and Yano 1985
74+
75+
## Running FitDiscrete
76+
77+
To infer evolutionary rates, run the `fitDiscrete` function on the network and data.
78+
It will calculate the maximum likelihood score of a network given one or more
79+
discrete trait characters at the tips. Along each edge, evolutionary changes
80+
are modeled with a continous time Markov model, with parameters estimated by
81+
maximizing the likelihood. At each hybrid node, the trait is assumed to be
82+
inherited from the immediate parent (or parents, in the case of a hybrid edge).
83+
If there is a hybrid edge, the trait is modeled according to the parents' weighted
84+
average genetic contributions, as measured by inheritance gamma γ. The model
85+
currently ignores incomplete lineage sorting.
86+
87+
### General Trait Data
88+
89+
```@example fitDiscrete
90+
s1 = fitDiscrete(simple_net, :ERSM, simple_species, simple_dat; optimizeQ=false,
91+
optimizeRVAS=false)
92+
s2 = fitDiscrete(simple_net, :BTSM, simple_species, simple_dat; optimizeQ=false,
93+
optimizeRVAS=false)
94+
```
95+
In this `fitDiscrete` call, we do not optimize rates or allow for rate variation
96+
across sites.
97+
98+
If `optimizeQ = true`, the `fitDiscrete` function estimates the evolutionary rate
99+
or rates. Because we didn't allow for rate variation across sites in these models,
100+
we do not optimize the way rates may vary across trait types.
101+
102+
```@example fitDiscrete
103+
s3 = fitDiscrete(simple_net, :ERSM, simple_species, simple_dat; optimizeQ=true,
104+
optimizeRVAS=true)
105+
s4 = fitDiscrete(simple_net, :BTSM, simple_species, simple_dat; optimizeQ=true,
106+
optimizeRVAS=true)
107+
```
108+
109+
### DNA Data
110+
111+
For DNA data, use `:JC69` or `:HKY85` models.
112+
```@example fitDiscrete
113+
d1 = fitDiscrete(dna_net, :JC69, dna_dat, dna_weights, :RV; optimizeQ=false,
114+
optimizeRVAS=false)
115+
d2 = fitDiscrete(dna_net, :HKY85, dna_dat, dna_weights, :RV; optimizeQ=false,
116+
optimizeRVAS=false)
117+
```
118+
In these `fitDiscrete` models, we do not optimize rates (`optimizeQ=false`), but
119+
we do allow for rate variation across sites.
120+
121+
### Rate Variation Across Sites
122+
123+
In its default version, `fitDiscrete` does not allow for rate variation across sites.
124+
To allow for rate variation across sites in your estimate of evolutionary rates
125+
(or rate variation across trait types, in the case of general trait types),
126+
include `:RV`. If you include `:RV` and `optimizeRVAS = true`, the model will
127+
not only allow for rate variation, but it will also optimize how rates vary across
128+
sites.
129+
130+
We optimize the evolutionary rates and the way rates vary across sites for the
131+
DNA data here:
132+
```@example fitDiscrete
133+
d3 = fitDiscrete(dna_net, :JC69, dna_dat, dna_weights, :RV; optimizeQ=true,
134+
optimizeRVAS=false)
135+
d4 = fitDiscrete(dna_net, :HKY85, dna_dat, dna_weights, :RV; optimizeQ=true,
136+
optimizeRVAS=false)
137+
```
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,12 @@
1+
>test_1
2+
GCGCCGAG
3+
>test_2
4+
GCACCGAG
5+
>test_3
6+
GCACCGAG
7+
>test_4
8+
GTACCGAG
9+
>test_5
10+
GTACCGAG
11+
>test_6
12+
GTACCGGG

0 commit comments

Comments
 (0)