Skip to content

Commit b4fdce9

Browse files
authored
bug fix to read table of qCFs (#188)
1. fix to delete uninformative 4-taxon sets, see this google group issue: https://groups.google.com/g/phylonetworks-users/c/w71ImPXIk58/m/ywrHVcIdBwAJ 2. fix #143 by adding new option 'mergerows' for custom use-cases 3. added tests for multiple alleles, and tested examples in manual on multiple alleles
1 parent 5e2e477 commit b4fdce9

File tree

6 files changed

+140
-128
lines changed

6 files changed

+140
-128
lines changed

Project.toml

+1-1
Original file line numberDiff line numberDiff line change
@@ -31,7 +31,7 @@ BioSequences = "2.0, 3"
3131
BioSymbols = "4.0, 5"
3232
CSV = "0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.10"
3333
Combinatorics = "0.7, 1.0"
34-
DataFrames = "0.21, 0.22, 1.0"
34+
DataFrames = "1.3"
3535
DataStructures = "0.9, 0.10, 0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18"
3636
Distributions = "0.15, 0.16, 0.17, 0.18, 0.19, 0.20, 0.21, 0.22, 0.23, 0.24, 0.25"
3737
FASTX = "1.1, 2"

docs/src/man/multiplealleles.md

+39-27
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,7 @@
1+
```@setup multialleles
2+
using PhyloNetworks
3+
```
4+
15
# Multiple alleles per species
26

37
## between-species 4-taxon sets
@@ -7,20 +11,27 @@ to a taxon (a tip) in the network. If instead each allele/individual can be mapp
711
to a species, and if only the species-level network needs to be estimated,
812
then the following functions can be used:
913

10-
```julia
11-
tm = DataFrame(CSV.File(mappingFile)) # taxon map as a data frame
12-
taxonmap = Dict(tm[i,:allele] => tm[i,:species] for i in 1:110) # taxon map as a dictionary
14+
```@repl multialleles
15+
using CSV, DataFrames
16+
mappingfile = joinpath(dirname(pathof(PhyloNetworks)), "..","examples","mappingIndividuals.csv");
17+
tm = CSV.read(mappingfile, DataFrame) # taxon map as a data frame
18+
taxonmap = Dict(row[:individual] => row[:species] for row in eachrow(tm)) # taxon map as a dictionary
1319
```
1420

15-
The mapping file can be a text (or `csv`) file with two columns (at least):
16-
one column named `allele` and one named `species`,
17-
mapping each allele name to a species name. Next, read in the gene trees
21+
The [mapping file](https://github.com/crsl4/PhyloNetworks/blob/master/examples/mappingIndividuals.csv)
22+
can be a text (or `csv`) file with two columns (at least):
23+
one for the individuals, named `allele` or `individual`,
24+
and one column containing the species names, named `species`.
25+
Each row should map an allele name to a species name.
26+
Next, read in the [gene trees](https://github.com/crsl4/PhyloNetworks/blob/master/examples/genetrees_alleletips.tre)
1827
and calculate the quartet CFs at the species level:
1928

2029

21-
```julia
22-
genetrees = readMultiTopology(genetreeFile)
23-
df_sp = writeTableCF(countquartetsintrees(genetrees, taxonmap)...)
30+
```@repl multialleles
31+
genetreefile = joinpath(dirname(pathof(PhyloNetworks)), "..","examples","genetrees_alleletips.tre");
32+
genetrees = readMultiTopology(genetreefile);
33+
sort(tipLabels(genetrees[1])) # multiple tips in species S1
34+
df_sp = writeTableCF(countquartetsintrees(genetrees, taxonmap, showprogressbar=false)...)
2435
```
2536

2637
Now `df_sp` is a data frame containing the quartet concordance factors
@@ -34,9 +45,10 @@ to calculated of the CF for `A,B,C,D` (such that the total weight from
3445
this particular gene trees is 1).
3546
It is safe to save this data frame, then use it for `snaq!` like this:
3647

37-
```julia
38-
CSV.write("tableCF_species.csv", df) # to save the data frame to a file
39-
d_sp = readTableCF("tableCF_species.csv") # to get a "DataCF" object for use in snaq!.
48+
```@repl multialleles
49+
CSV.write("tableCF_species.csv", df_sp); # to save the data frame to a file
50+
d_sp = readTableCF("tableCF_species.csv"); # to get a "DataCF" object for use in snaq!
51+
summarizeDataCF(d_sp)
4052
```
4153

4254
## within-species 4-taxon sets
@@ -48,16 +60,17 @@ using this extra information. To get quartet CFs from sets of 4 individuals
4860
in which 2 individuals are from the same species, the following functions
4961
should be used:
5062

51-
```julia
52-
df_ind = writeTableCF(countquartetsintrees(genetrees)...) # no mapping here: so quartet CFs across individuals
53-
CSV.write("tableCF_individuals.csv", df) # to save to a file
54-
df_sp = mapAllelesCFtable(mappingFile, "tableCF_individuals.csv");
55-
d_sp = readTableCF!(df_sp);
63+
```@repl multialleles
64+
df_ind = writeTableCF(countquartetsintrees(genetrees, showprogressbar=false)...); # no mapping: CFs across individuals
65+
first(df_ind, 4) # to see the first 4 rows
66+
CSV.write("tableCF_individuals.csv", df_ind); # to save to a file
67+
df_sp = mapAllelesCFtable(mappingfile, "tableCF_individuals.csv");
68+
d_sp = readTableCF!(df_sp, mergerows=true);
5669
```
5770
where the mapping file can be a text (or `csv`) file with two columns
58-
named `allele` and `species`, mapping each allele name to a species name.
71+
named `allele` (or `individual`) and `species`, mapping each allele name to a species name.
5972
The data in `df_ind` is the table of concordance factors at the level of individuals.
60-
In other words, it list CFs using one row for each set of 4 alleles/individuals.
73+
In other words, it lists CFs using one row for each set of 4 alleles/individuals.
6174

6275
`mapAllelesCFtable` creates a new data frame `df_sp` of quartet concordance factors at the
6376
species level: with the allele names replaced by the appropriate species names.
@@ -81,9 +94,9 @@ But before, it is safe to save the concordance factor of quartets of species,
8194
which can be calculated by averaging the CFs of quartets of individuals
8295
from the associated species:
8396

84-
```julia
97+
```@repl multialleles
8598
df_sp = writeTableCF(d_sp) # data frame, quartet CFs averaged across individuals of same species
86-
CSV.write("CFtable_species.csv", df_sp) # save to file
99+
CSV.write("CFtable_species.csv", df_sp); # save to file
87100
```
88101

89102
Some quartets have the same species repeated twice,
@@ -115,15 +128,14 @@ This can be done as in the first section ("between-species 4-taxon sets")
115128
to give equal weight to all genes,
116129
or as shown below to give more weight to genes that have more alleles:
117130

118-
```julia
119-
df_sp = writeTableCF(d_sp) # some quartets have the same species twice
131+
```@repl multialleles
132+
first(df_sp, 3) # some quartets have the same species twice
120133
function hasrep(row) # see if a row (4-taxon set) has a species name ending with "__2": repeated species
121-
occursin(r"__2$", row[:tx1]) || occursin(r"__2$", row[:tx2]) ||
122-
occursin(r"__2$", row[:tx3]) || occursin(r"__2$", row[:tx4])
134+
occursin(r"__2$", row[:t1]) || occursin(r"__2$", row[:t2]) || # replace :t1 :t2 etc. by appropriate column names in your data,
135+
occursin(r"__2$", row[:t3]) || occursin(r"__2$", row[:t4]) # e.g. by :taxon1 :taxon2 etc.
123136
end
124137
df_sp_reduced = filter(!hasrep, df_sp) # removes rows with repeated species
125-
df_sp_reduced # should have fewer rows than df_sp
126-
CSV.write("CFtable_species_norep.csv", df_sp_reduced) # to save to file
138+
CSV.write("CFtable_species_norep.csv", df_sp_reduced); # to save to file
127139
d_sp_reduced = readTableCF(df_sp_reduced) # DataCF object, for input to snaq!
128140
```
129141

examples/genetrees_alleletips.tre

+16
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,16 @@
1+
(((S4:1.4835,S5:1.4835):0.7879,((S1C:0.014,S1B:0.014):1.251,(S3:0.7704,S1A:0.7704):0.4946):1.0064):0.1762,S2:2.4476);
2+
(((S3:1.2587,(S1C:0.6223,(S1A:0.1231,S1B:0.1231):0.4992):0.6364):1.1552,(S5:1.0332,S4:1.0332):1.3807):0.1454,S2:2.5592);
3+
((S3:2.6379,(S4:0.8263,S5:0.8263):1.8116):0.212,((S1A:0.8435,(S1B:0.0441,S1C:0.0441):0.7995):0.8692,S2:1.7127):1.1372);
4+
(S2:3.1288,(((S1A:0.2974,S1C:0.2974):0.4802,S3:0.7776):1.9636,(S1B:2.129,(S5:0.8288,S4:0.8288):1.3002):0.6122):0.3876);
5+
(((S5:2.098,S4:2.098):0.6214,((S3:0.7126,(S1B:0.5033,S1A:0.5033):0.2094):0.8673,S1C:1.58):1.1394):0.8885,S2:3.6079);
6+
((((S1B:1.2942,S1A:1.2942):0.8796,(S4:1.3542,S5:1.3542):0.8195):0.2455,(S3:0.9679,S1C:0.9679):1.4514):2.9787,S2:5.398);
7+
((((S1B:0.0372,S1A:0.0372):1.2205,S1C:1.2577):0.2329,S3:1.4906):0.9159,(S2:1.932,(S5:0.9554,S4:0.9554):0.9766):0.4745);
8+
(S3:4.6311,(((S2:1.4604,S5:1.4604):0.4353,S4:1.8957):0.2433,(S1C:0.1665,(S1A:0.1024,S1B:0.1024):0.0642):1.9725):2.4921);
9+
((S2:1.4673,(S5:0.7265,S4:0.7265):0.7409):1.0067,(S3:2.1067,((S1B:0.2711,S1C:0.2711):1.0749,S1A:1.3461):0.7606):0.3673);
10+
((((S1B:0.0573,S1C:0.0573):0.3884,S1A:0.4457):0.4876,S3:0.9333):1.9679,((S4:1.5713,S5:1.5713):0.9966,S2:2.568):0.3333);
11+
((S1A:3.036,((S5:0.7604,S4:0.7604):1.6442,(S3:1.1176,(S1C:0.0465,S1B:0.0465):1.0711):1.287):0.6314):1.7552,S2:4.7912);
12+
((S2:1.4523,((S1B:0.4758,(S1A:0.0138,S1C:0.0138):0.462):0.3946,S3:0.8704):0.5818):1.3858,(S5:1.6356,S4:1.6356):1.2024);
13+
((S2:2.3144,(S4:0.7552,S5:0.7552):1.5592):0.8309,(S3:1.1027,((S1A:0.1344,S1C:0.1344):0.5612,S1B:0.6956):0.4071):2.0425);
14+
(((S5:1.5464,S4:1.5464):1.9521,((S1B:0.6516,(S1A:0.0281,S1C:0.0281):0.6235):0.172,S3:0.8236):2.6749):1.8799,S2:5.3785);
15+
(((S4:1.5191,S2:1.5191):0.7628,((S1A:1.1262,(S1B:0.3126,S1C:0.3126):0.8135):0.0776,S3:1.2038):1.0782):2.4257,S5:4.7076);
16+
(((S3:1.8527,(S1C:1.2485,(S1B:0.4803,S1A:0.4803):0.7682):0.6042):0.4991,(S5:1.7785,S4:1.7785):0.5734):0.1705,S2:2.5224);

src/multipleAlleles.jl

+45-82
Original file line numberDiff line numberDiff line change
@@ -33,8 +33,8 @@ in the example below, this file is best read later with the option
3333
```julia
3434
mapAllelesCFtable("allele-species-map.csv", "allele-quartet-CF.csv";
3535
filename = "quartetCF_speciesNames.csv")
36-
df_sp = DataFrame(CSV.File("quartetCF_speciesNames.csv"); copycols=false); # DataFrame object
37-
dataCF_specieslevel = readTableCF!(df_sp); # DataCF object
36+
df_sp = CSV.read("quartetCF_speciesNames.csv", DataFrame); # DataFrame object
37+
dataCF_specieslevel = readTableCF!(df_sp, mergerows=true); # DataCF object
3838
```
3939
"""
4040
function mapAllelesCFtable(alleleDF::AbstractString, cfDF::AbstractString;
@@ -65,12 +65,12 @@ as its first argument.
6565
function mapAllelesCFtable!(cfDF::DataFrame, alleleDF::DataFrame, co::Vector{Int},write::Bool,filename::AbstractString)
6666
size(cfDF,2) >= 7 || error("CF DataFrame should have 7+ columns: 4taxa, 3CF, and possibly ngenes")
6767
if length(co)==0 co=[1,2,3,4]; end
68-
compareTaxaNames(alleleDF,cfDF,co)
68+
allelecol, speciescol = compareTaxaNames(alleleDF,cfDF,co)
6969
for j in 1:4
7070
for ia in 1:size(alleleDF,1) # for all alleles
7171
cfDF[!,co[j]] = map(x->replace(string(x),
72-
Regex("^$(string(alleleDF[ia,:allele]))\$") =>
73-
alleleDF[ia,:species]),
72+
Regex("^$(string(alleleDF[ia,allelecol]))\$") =>
73+
alleleDF[ia,speciescol]),
7474
cfDF[!,co[j]])
7575
end
7676
end
@@ -85,98 +85,50 @@ end
8585
# inside readTableCF!
8686
# by deleting rows that are not informative like sp1 sp1 sp1 sp2
8787
# keepOne=true: we only keep one allele per species
88-
function cleanAlleleDF!(newdf::DataFrame, cols::Vector{Int};keepOne=false::Bool)
89-
withngenes = (length(cols)==8)
88+
function cleanAlleleDF!(newdf::DataFrame, cols::Vector{<:Integer}; keepOne=false::Bool)
9089
delrows = Int[] # indices of rows to delete
91-
repSpecies = String[]
90+
repSpecies = Set{String}()
9291
if(isa(newdf[1,cols[1]],Integer)) #taxon names as integers: we need this to be able to add __2
93-
newdf[!,cols[1]] = map(string, newdf[!,cols[1]])
94-
newdf[!,cols[2]] = map(string, newdf[!,cols[2]])
95-
newdf[!,cols[3]] = map(string, newdf[!,cols[3]])
96-
newdf[!,cols[4]] = map(string, newdf[!,cols[4]])
92+
for j in 1:4
93+
newdf[!,cols[j]] .= map(string, newdf[!,cols[j]])
94+
end
9795
end
9896
row = Vector{String}(undef, 4)
99-
for i in 1:size(newdf,1) #check all rows
100-
@debug "row number: $i"
101-
# fixit: check for no missing value, or error below
97+
for i in 1:nrow(newdf)
10298
map!(j -> newdf[i,cols[j]], row, 1:4)
103-
@debug "row $(row)"
10499
uniq = unique(row)
105-
@debug "unique $(uniq)"
106100

107-
keep = false # default: used if 1 unique name, or 2 in some cases
108101
if(length(uniq) == 4)
109-
keep = true
110-
else
111-
if(!keepOne)
112-
if(length(uniq) == 3) #sp1 sp1 sp2 sp3
102+
continue
103+
end
104+
# by now, at least 1 species is repeated
105+
if !keepOne # then we may choose to keep this row
106+
# 3 options: sp1 sp1 sp2 sp3; or sp1 sp1 sp2 sp2 (keep)
107+
# or sp1 sp1 sp1 sp2; or sp1 sp1 sp1 sp1 (do not keep)
108+
keep = false
109+
for u in uniq
110+
ind = row .== u # indices of taxon names matching u
111+
if sum(ind) == 2
113112
keep = true
114-
for u in uniq
115-
@debug "u $(u), typeof $(typeof(u))"
116-
ind = row .== u #taxon names matching u
117-
@debug "taxon names matching u $(ind)"
118-
if(sum(ind) == 2)
119-
push!(repSpecies,string(u))
120-
found = false
121-
for k in 1:4
122-
if(ind[k])
123-
if(found)
124-
@debug "found the second one in k $(k), will change newdf[i,cols[k]] $(newdf[i,cols[k]]), typeof $(typeof(newdf[i,cols[k]]))"
125-
newdf[i,cols[k]] = string(u, repeatAlleleSuffix)
126-
break
127-
else
128-
found = true
129-
end
130-
end
131-
end
132-
break
133-
end
134-
end
135-
elseif(length(uniq) == 2)
136-
# keep was initialized to false
137-
for u in uniq
138-
@debug "length uniq is 2, u $(u)"
139-
ind = row .== u
140-
if(sum(ind) == 1 || sum(ind) == 3)
141-
@debug "ind $(ind) is 1 or 3, should not keep"
142-
break
143-
elseif(sum(ind) == 2)
144-
@debug "ind $(ind) is 2, should keep"
145-
keep = true
146-
found = false
147-
push!(repSpecies,string(u))
148-
for k in 1:4
149-
if(ind[k])
150-
if(found)
151-
newdf[i,cols[k]] = string(u, repeatAlleleSuffix)
152-
break
153-
else
154-
found = true
155-
end
156-
end
157-
end
158-
end
159-
end
113+
push!(repSpecies, string(u))
114+
# change the second instance of a repeated taxon name with suffix
115+
k = findlast(ind)
116+
newdf[i,cols[k]] = string(u, repeatAlleleSuffix)
160117
end
161-
@debug "after if, keep is $(keep)"
162118
end
163119
end
164120
keep || push!(delrows, i)
165-
@debug "" keep
166121
end
167-
@debug "" delrows
168-
@debug "" repSpecies
169122
nrows = size(newdf,1)
170123
nkeep = nrows - length(delrows)
171124
if nkeep < nrows
172125
print("""found $(length(delrows)) 4-taxon sets uninformative about between-species relationships, out of $(nrows).
173126
These 4-taxon sets will be deleted from the data frame. $nkeep informative 4-taxon sets will be used.
174127
""")
175128
nkeep > 0 || @warn "All 4-taxon subsets are uninformative, so the dataframe will be left empty"
176-
deleterows!(newdf, delrows)
129+
deleteat!(newdf, delrows) # deleteat! requires DataFrames 1.3
177130
end
178-
# @show size(newdf)
179-
return unique(repSpecies)
131+
return collect(repSpecies)
180132
end
181133

182134

@@ -240,10 +192,9 @@ end
240192
# function to compare the taxon names in the allele-species matching table
241193
# and the CF table
242194
function compareTaxaNames(alleleDF::DataFrame, cfDF::DataFrame, co::Vector{Int})
243-
checkMapDF(alleleDF)
244-
#println("found $(length(alleleDF[1])) allele-species matches")
195+
allelecol, speciescol = checkMapDF(alleleDF)
245196
CFtaxa = string.(mapreduce(x -> unique(skipmissing(x)), union, eachcol(cfDF[!,co[1:4]])))
246-
alleleTaxa = map(string, alleleDF[!,:allele]) # as string, too
197+
alleleTaxa = map(string, alleleDF[!,allelecol]) # as string, too
247198
sizeCF = length(CFtaxa)
248199
sizeAllele = length(alleleTaxa)
249200
if sizeAllele > sizeCF
@@ -260,14 +211,26 @@ function compareTaxaNames(alleleDF::DataFrame, cfDF::DataFrame, co::Vector{Int})
260211
for n in unchanged warnmsg *= " $n"; end
261212
@warn warnmsg
262213
end
263-
return nothing
214+
return allelecol, speciescol
264215
end
265216

266-
# function to check that the allele df has one column labelled alleles and one column labelled species
217+
"""
218+
checkMapDF(mapping_allele2species::DataFrame)
219+
220+
Check that the data frame has one column named "allele" or "individual",
221+
and one column named "species". Output: indices of these column.
222+
"""
267223
function checkMapDF(alleleDF::DataFrame)
268224
size(alleleDF,2) >= 2 || error("Allele-Species matching Dataframe should have at least 2 columns")
269-
:allele in DataFrames.propertynames(alleleDF) || error("In allele mapping file there is no column named allele")
270-
:species in DataFrames.propertynames(alleleDF) || error("In allele mapping file there is no column named species")
225+
colnames = DataFrames.propertynames(alleleDF)
226+
allelecol = findfirst(x -> x == :allele, colnames)
227+
if isnothing(allelecol)
228+
allelecol = findfirst(x -> x == :individual, colnames)
229+
end
230+
isnothing(allelecol) && error("In allele mapping file there is no column named 'allele' or 'individual'")
231+
speciescol = findfirst(x -> x == :species, colnames)
232+
isnothing(speciescol) && error("In allele mapping file there is no column named species")
233+
return allelecol, speciescol
271234
end
272235

273236

0 commit comments

Comments
 (0)