
Response to Integrative phylogenomics positions sponges at the root of the animal tree


This repository contains eLetter text, figures, and Supplementary Materials for our response to Steenwyk and King 2025.

eLetter

Authors:

  • Casey W. Dunn, Yale University
  • Xiaofan Zhou, South China Agricultural University
  • Jingxuan Chen, Zhejiang University
  • Jacob M. Musser, Yale University
  • Yuanning Li, Shandong University
  • Steven H.D. Haddock, Monterey Bay Aquarium Research Institute
  • Daniel S. Rokhsar, University of California Berkeley
  • Xing-Xing Shen, Zhejiang University

Steenwyk and King (1) (SK) addressed a fundamental question in animal evolution: were ctenophores or sponges the first lineage to split from other animals at the root of the metazoan tree? Modifying the approach of (2), they presented metrics to quantify the degree to which individual genes sampled across diverse species support the ctenophore-sister or sponge-sister hypotheses using two different phylogenomic frameworks (“concatenation” and “coalescence”). Genes that support the same hypothesis in both frameworks were deemed “consistent”, while genes that support different hypotheses in the two frameworks were “inconsistent” and not considered further. The counts of consistent sponge-sister and ctenophore-sister genes were then evaluated with chi-squared tests. The authors applied this approach to a new dataset of 869 BUSCO (3) genes sampled across 100 animal and outgroup species, as well as derivatives of this new dataset and ten previously published datasets. Although they found that the vast majority of genes were inconsistent in all datasets, they observed a statistically significant excess of consistent genes supporting the sponge-sister hypothesis in most analyses and no significant difference in others; none of their analyses supported ctenophore-sister (their Fig. 4). SK concluded that their “integrative phylogenomics” approach provides compelling evidence for the sponge-sister hypothesis.

Here we note several problems that compromise their conclusions. Supporting material, including figures, for our new analyses can be found at https://github.com/caseywdunn/sk25 (archived at https://doi.org/10.5281/zenodo.18022339). We focus our analyses on their “92.5” matrix (the top left matrix in their Fig. 4A), as this is the most inclusive version of their new 869-gene dataset, from which all results in their Fig. 4A–D are derived. In their scoring, they found that this matrix had 82 genes consistent for sponge-sister but only 6 consistent for ctenophore-sister, and that this difference was significant according to the chi-squared test.

As a first step toward understanding their phylogenetic signal, we inferred concatenation-based phylogenies with the same tool (iqtree2) and model (LG+I+G4+C60) they used. We found that phylogenetic trees based on the 88 consistent genes identified by SK in the “92.5” matrix support the ctenophore-sister hypothesis (Fig. 1a), even though most of these genes were scored as consistent for sponge-sister. Strikingly, we found that the phylogenetic tree based solely on the 82 genes SK scored as consistent for sponge-sister also strongly supports the ctenophore-sister hypothesis (Fig. 1b).

To further explore this discrepancy, we examined the taxonomic coverage of genes. As a matter of principle, testing the relative support for ctenophore-sister vs. sponge-sister requires sampling at least one species from each of four groups: sponges, ctenophores, other animals (Placozoa + Cnidaria + Bilateria), and the outgroup (non-animals). Yet 56 of the 869 genes analysed by SK are missing ctenophores, outgroups, or both. Surprisingly, 45 of these genes that have no information about the animal root are scored by SK as consistent for sponge-sister. This indicates problems with their scoring procedure. We were able to trace the causes of these problems to issues in both their quartet-based (coalescent) and likelihood-based (concatenation) analyses.
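The four-group requirement can be expressed as a simple check on the taxa present for each gene. The following is a minimal sketch rather than the code used in this repository; the taxon names, group labels, and function names are hypothetical and chosen only for illustration.

```python
# Illustrative sketch: decide whether a gene samples all four groups needed
# to test the animal root. Group labels and taxon names are hypothetical.
REQUIRED_GROUPS = {"sponge", "ctenophore", "other_animal", "outgroup"}


def missing_groups(gene_taxa, taxon_to_group):
    """Return the required groups that have no sampled taxon for this gene."""
    present = {taxon_to_group[t] for t in gene_taxa if t in taxon_to_group}
    return REQUIRED_GROUPS - present


def has_sufficient_sampling(gene_taxa, taxon_to_group):
    """True if at least one taxon from every required group is present."""
    return not missing_groups(gene_taxa, taxon_to_group)


# Toy mapping of taxa to groups (made up for this example)
taxon_to_group = {
    "sponge_A": "sponge",
    "sponge_B": "sponge",
    "cteno_A": "ctenophore",
    "animal_A": "other_animal",
    "outgroup_A": "outgroup",
}

complete_gene = ["sponge_A", "cteno_A", "animal_A", "outgroup_A"]
no_cteno_gene = ["sponge_A", "sponge_B", "animal_A", "outgroup_A"]

print(has_sufficient_sampling(complete_gene, taxon_to_group))
print(sorted(missing_groups(no_cteno_gene, taxon_to_group)))
```

A gene like `no_cteno_gene` cannot distinguish the two hypotheses, so under this check it would be excluded before any scoring.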

In the coalescence framework, SK reported problematic quartet scores, where genes without ctenophore sequences nevertheless support sponge-sister. These genes should not support either hypothesis. We found that this error arises from the interaction of three methodological choices: structurally inappropriate reference trees for evaluating quartets, imbalanced taxon sampling, and the inclusion of all induced quartets. Together these choices create a systematic bias in favor of sponge-sister because the dataset contains substantially more sponges (29) than ctenophores (13). The critical issue is the scoring of quartets based on their concordance with two incorrectly defined reference trees. For scoring quartets against the ctenophore-sister hypothesis, SK collapsed “sponges + other animals” into a single clade, and alternatively scored quartets for sponge-sister by combining “ctenophores + other animals”. With this scoring system, quartets that contain two sponges and two other animals are systematically incompatible with the collapsed ctenophore-sister tree but will often match the collapsed sponge-sister tree (despite being inherently uninformative about the root). While other uninformative quartets effectively cancel out, this specific subset generates a spurious directional bias in favor of sponge-sister. This error is eliminated when quartets are properly scored relative to reference trees that maintain ctenophores, sponges, and other animals as separate clades.
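To make the distinction between informative and uninformative quartets concrete, here is a simplified sketch that keeps the four groups separate when scoring a single unrooted quartet. It is an illustration under stated assumptions, not SK's procedure or our corrected pipeline: it assumes a quartet bears on the root only when it contains exactly one taxon from each group, and all names are hypothetical.

```python
# Illustrative sketch: score one unrooted quartet ab|cd while keeping
# sponges, ctenophores, other animals, and outgroups as separate groups.
# Assumption: only quartets with one taxon per group bear on the root.
FOUR_GROUPS = ["ctenophore", "other_animal", "outgroup", "sponge"]


def score_quartet(split, group_of):
    """split is a pair of pairs, e.g. (("a", "b"), ("c", "d")) for ab|cd."""
    taxa = [t for side in split for t in side]
    groups = sorted(group_of[t] for t in taxa)
    if groups != FOUR_GROUPS:
        return "uninformative"
    # The group paired with the outgroup is sister to the remaining animals.
    for side in split:
        side_groups = {group_of[t] for t in side}
        if "outgroup" in side_groups:
            partner = (side_groups - {"outgroup"}).pop()
    if partner == "ctenophore":
        return "ctenophore-sister"
    if partner == "sponge":
        return "sponge-sister"
    return "neither"


# Hypothetical taxa for the example
group_of = {
    "out": "outgroup", "cten": "ctenophore",
    "sp1": "sponge", "sp2": "sponge",
    "bil1": "other_animal", "bil2": "other_animal",
}

print(score_quartet((("out", "cten"), ("sp1", "bil1")), group_of))
print(score_quartet((("sp1", "sp2"), ("bil1", "bil2")), group_of))
```

Under this scheme the quartet with two sponges and two other animals is labeled uninformative, so it cannot contribute to either hypothesis regardless of how many sponges are sampled.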

In the concatenation framework, SK reported log-likelihood differences $|\Delta \ln L|$ between the sponge- and ctenophore-sister hypotheses that are orders of magnitude larger than typically observed for single genes, reaching into the thousands (Fig. 1). This was due to a procedural error in the use of iqtree (4). As SK note in their supplementary methods, “phylogenetic trees [used to calculate site log-likelihoods] were specified using the -z argument.” The trees specified with -z should be fully resolved phylogenies; in this case these should be the two maximum-likelihood phylogenies inferred under the ctenophore- and sponge-sister constraint trees. Inspection of SK’s iqtree log files, however, indicates that the tree file specified with -z was Ctenophore_and_Sponge_first_trees.tre, which contains the constraint trees themselves (available in the TRADITIONAL_TOPOLOGY_TESTS folder in their figshare (5)). A critical step was therefore skipped. Instead of using the constraint trees to infer maximum-likelihood trees and then calculating site log-likelihoods on those trees, they calculated the site log-likelihoods on the constraint trees themselves. The unresolved internal branches of the constraint trees do not yield interpretable site log-likelihoods. When we reran the analyses by first inferring maximum-likelihood trees under the ctenophore-sister and sponge-sister constraints and then calculating site log-likelihoods on those inferred trees, the resulting $|\Delta \ln L|$ values fell back into typical single-gene ranges, most with magnitude less than 10 (Fig. 2). These corrected results are consistent with the original presentation of these methods in (2).
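The two-step procedure described above can be sketched as follows. This is a hedged outline, not the exact commands we ran: the file names and function names are hypothetical, and the flags (`-g` for a constraint tree, `-z` for a set of trees to evaluate, `-n 0` to skip tree search, `-wsl` to write site log-likelihoods, `--prefix` for output names) reflect our reading of the IQ-TREE 2 documentation and should be checked against the installed version. The sign convention for the per-gene difference is also an assumption.

```python
# Illustrative sketch of the two-step workflow. Commands are built as
# argument lists (suitable for subprocess.run) but are not executed here.
# All file names are hypothetical placeholders.

def constrained_ml_cmd(alignment, model, constraint_tree, prefix):
    """Step 1: infer a fully resolved ML tree under a topological constraint."""
    return ["iqtree2", "-s", alignment, "-m", model,
            "-g", constraint_tree, "--prefix", prefix]


def site_lnl_cmd(alignment, model, resolved_trees, prefix):
    """Step 2: compute site log-likelihoods on the resolved ML trees."""
    return ["iqtree2", "-s", alignment, "-m", model,
            "-z", resolved_trees, "-n", "0", "-wsl", "--prefix", prefix]


def gene_delta_lnl(site_lnl_cten, site_lnl_sponge, start, end):
    """Sum per-site lnL over one gene's sites [start, end) and return
    lnL(ctenophore-sister) - lnL(sponge-sister) for that gene."""
    return sum(site_lnl_cten[start:end]) - sum(site_lnl_sponge[start:end])


model = "LG+I+G4+C60"  # model named in the text above
step1_cten = constrained_ml_cmd("matrix.fa", model, "cten_constraint.tre", "ml_cten")
step1_sponge = constrained_ml_cmd("matrix.fa", model, "sponge_constraint.tre", "ml_sponge")
# The two resolved ML trees would then be concatenated into one file for -z.
step2 = site_lnl_cmd("matrix.fa", model, "ml_both.tre", "site_lnl")
print(" ".join(step1_cten))
print(" ".join(step2))
```

The key point is that the file passed with `-z` in step 2 holds the resolved trees produced by step 1, not the constraint trees passed with `-g`.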

After correcting both likelihood and quartet scoring, the sponge-sister signal reported by SK disappears and is replaced by strong support for ctenophore-sister. Using corrected scoring, 544 genes are classified as consistent (out of 813 genes in the “92.5” matrix that sample all four groups required to test the animal root), indicating substantially less conflict within the data than SK reported. Furthermore, in the reanalyzed “92.5” matrix, significantly more genes are consistent with the ctenophore-sister hypothesis (370 genes) than with the sponge-sister hypothesis (174 genes). As expected, and in contrast to the results obtained using SK’s reported sponge-sister genes (Fig. 1b), phylogenetic analyses of the consistent sponge-sister gene set recover sponge-sister (Fig. 2b), whereas analyses of the consistent ctenophore-sister gene set recover ctenophore-sister (Fig. 2c). A combined phylogenomic analysis of all 544 consistent genes strongly supports ctenophores as the sister group of all other animals (Fig. 2a). Although we focus here on their “92.5” matrix, these methodological issues apply to SK’s other analyses.
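As a rough check on the counts reported above, the sketch below classifies a gene from the signs of its two scores and runs a one-degree-of-freedom chi-squared test on the consistent-gene counts. It assumes equal expected counts under the null hypothesis and a sign convention in which positive values favor ctenophore-sister; both are our assumptions for illustration and may differ from the exact test SK applied.

```python
import math


def classify_gene(delta_lnl, delta_qs):
    """Classify a gene from two score differences.
    Assumed convention: positive values favor ctenophore-sister."""
    if delta_lnl > 0 and delta_qs > 0:
        return "ctenophore-sister"
    if delta_lnl < 0 and delta_qs < 0:
        return "sponge-sister"
    return "inconsistent"


def chi_squared_equal(count_a, count_b):
    """Chi-squared goodness-of-fit test against equal expected counts (df = 1).
    Returns (statistic, p_value)."""
    expected = (count_a + count_b) / 2
    stat = ((count_a - expected) ** 2 + (count_b - expected) ** 2) / expected
    # For df = 1 the survival function is erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Corrected counts for the "92.5" matrix reported above
stat, p_value = chi_squared_equal(370, 174)
print(f"chi-squared = {stat:.2f}, p = {p_value:.3g}")
```

With 370 versus 174 genes the statistic is about 70.6, far beyond the 3.84 threshold for significance at the 0.05 level with one degree of freedom.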

A more comprehensive analysis will be presented elsewhere. We are grateful to Steenwyk and King for discussions about their work, their quick responses to our questions, and for their feedback on this letter.

  1. J. L. Steenwyk, N. King, Integrative phylogenomics positions sponges at the root of the animal tree. Science 390, 751–756 (2025).

  2. X. X. Shen, J. L. Steenwyk, A. Rokas, Dissecting incongruence between concatenation- and quartet-based approaches in phylogenomic data. Systematic Biology 70, 997–1014 (2021).

  3. M. Manni, M. R. Berkeley, M. Seppey, F. A. Simão, E. M. Zdobnov, BUSCO update: Novel and streamlined workflows along with broader and deeper phylogenetic coverage for scoring of eukaryotic, prokaryotic, and viral genomes. Molecular Biology and Evolution 38, 4647–4654 (2021).

  4. B. Q. Minh, H. A. Schmidt, O. Chernomor, D. Schrempf, M. D. Woodhams, A. von Haeseler, R. Lanfear, Corrigendum to: IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era. Mol. Biol. Evol. 37, 2461 (2020).

  5. J. Steenwyk, Additional information for "Integrative phylogenomics positions sponges at the root of the animal tree". doi:10.6084/m9.figshare.28229990.v4 (2025).

Figures

Figure 1

Fig. 1. Analyses of the published consistent genes scored by SK in their “92.5” matrix. The left column indicates the absolute value of the log-likelihood differences $|\Delta \ln L|$ between the sponge- and ctenophore-sister hypotheses above the axis, and the absolute difference in quartet score $|\Delta QS|$ below the axis. Genes consistent with ctenophore-sister are orange; genes consistent with sponge-sister are blue. Note that $|\Delta \ln L|$ values are in the hundreds to thousands, orders of magnitude higher than expected for single-gene analyses. The right column has phylogenies inferred from the indicated set of genes. (a) All 88 genes scored as consistent by SK. (b) The 82 genes scored by SK as consistent for sponge-sister. (c) The 6 genes scored by SK as consistent for ctenophore-sister.

Figure 2

Fig. 2. Analyses of gene scores for the “92.5” matrix using corrected methods. Colors and layout are the same as in Fig. 1. Note that most $|\Delta \ln L|$ values are less than 10, as expected. (a) All consistent genes. (b) Genes consistent for sponge-sister. (c) Genes consistent for ctenophore-sister.

Supplementary Materials

Files for our analyses are in the analyses/ directory of this repository.

Original gene taxon sampling

Files for our analyses of the originally published gene taxon sampling are in the analyses/sampling directory. These analyses cover all the matrices analyzed by SK in the left column of their Fig. 4.

Table 1. For each matrix from SK Fig. 4, the total number of genes, number of genes consistent for sponge-sister or ctenophore-sister in their original scoring, and the same counts considering only genes with sufficient sampling (at least one sponge, one ctenophore, one other animal, and one outgroup). Note that the sufficient sampling columns reflect their original scoring, not our corrected scoring.

matrix        genes_orig  sponge_orig  cten_orig  genes_suff  sponge_suff  cten_suff
Borowiec2015        1080            0          0         601            0          0
Chang2015            152            0          0         152            0          0
ClipKIT850           869           81          6         813           37          6
ClipKIT875           869           80          5         813           36          5
ClipKIT900           869           80          5         813           36          5
ClipKIT925           869           82          6         813           37          6
ClipKITkpi           869           85          6         813           40          6
ClipKITkpic          869           84          5         813           40          5
Dunn2008             150            0          0          55            0          0
Nosenko2013           84           21          0          83           20          0
Philippe2009         128            1          0         128            1          0
Ryan2013C            406            8          0         351            7          0
Ryan2013H            406           13          4         357           11          2
Whelan2015A          251           34          1         243           26          1
Whelan2015S          210           22          2         203           15          2
Whelan2017           117           51          8          76           14          8

A few things of note:

  • The Borowiec2015, Chang2015, and Dunn2008 matrices have no consistent genes in the original published analyses
  • Except in the Ryan2013H matrix, the number of genes consistent for ctenophore-sister is unchanged when considering only genes with sufficient sampling
  • For many matrices, more than half the original genes consistent for sponge-sister are removed when considering only genes with sufficient sampling.

These patterns emerge because:

  • Sponges tend to be much better sampled in these matrices than ctenophores are
  • This makes it more probable that a given gene will have no ctenophore sequences than no sponge sequences
  • Most genes without ctenophore sequences are scored as consistent for sponge-sister in the original SK scoring. These genes have no information about the animal root, so they should not be scored for either hypothesis.

Consider the ClipKIT925 matrix (also referred to as the "92.5" matrix) in more detail:

  • There are 869 genes in total
  • 813 of these genes have sufficient sampling to test the animal root
  • This means that 56 genes are missing either ctenophores, outgroups, or both (all have sponges and other animals)
  • There are 6 genes consistent for ctenophore-sister in the original SK scoring; all 6 have sufficient sampling
  • There are 82 genes consistent for sponge-sister in the original SK scoring; only 37 of these have sufficient sampling
  • Therefore, 45 genes are scored as consistent for sponge-sister but have no information about the animal root
  • This means that 45 out of 56 genes (80.4%) missing ctenophores or outgroups are incorrectly scored as consistent for sponge-sister in the original SK scoring.

Again, note that all of these results are based on the original SK scoring, not our corrected scoring.
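The arithmetic in the list above can be reproduced directly from the Table 1 rows. The sketch below hard-codes three rows for illustration; the dictionary layout and function name are ours and do not mirror the notebook in analyses/sampling.

```python
# Rows copied from Table 1:
# (genes_orig, sponge_orig, cten_orig, genes_suff, sponge_suff, cten_suff)
TABLE1 = {
    "ClipKIT925": (869, 82, 6, 813, 37, 6),
    "Nosenko2013": (84, 21, 0, 83, 20, 0),
    "Whelan2017": (117, 51, 8, 76, 14, 8),
}


def sampling_summary(row):
    """Derive counts of insufficiently sampled genes from one table row."""
    genes_orig, sponge_orig, cten_orig, genes_suff, sponge_suff, cten_suff = row
    return {
        # genes lacking at least one of the four required groups
        "insufficient": genes_orig - genes_suff,
        # sponge-sister genes that drop out when sampling is required
        "sponge_removed": sponge_orig - sponge_suff,
        # ctenophore-sister genes that drop out when sampling is required
        "cten_removed": cten_orig - cten_suff,
    }


summary = sampling_summary(TABLE1["ClipKIT925"])
percent = 100 * summary["sponge_removed"] / summary["insufficient"]
print(summary)
print(f"{percent:.1f}% of insufficiently sampled genes scored sponge-sister")
```

For ClipKIT925 this gives 56 insufficiently sampled genes, 45 of which were scored as consistent for sponge-sister, matching the figures listed above.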

Instructions to reproduce these sampling analyses from the original SK files are provided below.

Preparation

Clone this repository.

In the root of this repository, create a directory to hold the figshare data from the original manuscript:

mkdir figshare

Download the compressed files for the following directories from the SK figshare, uncompress them, and place them in the directory you created above:

figshare/CONCATENATED_DATA_MATRICES/
figshare/GENE-WISE_SCORES/ # Optional, only needed if you recompute gene wise files

Create and activate a conda environment for the analyses:

conda env create -f environment.yml
conda activate sk

Already done

These are steps that have already been done and have results committed to the repo. They are here for provenance, and you can rerun them if you like, but you don't need to.

Run analyses/sampling/scores.ipynb to process the gene wise scores from analyses/sampling/figshare/GENE-WISE_SCORES/. It does a few things:

  • Consolidates the scores from the different files into a single dataframe
  • Calculates some summary stats
  • Writes the results to gene_wise_scores_combined.tsv and gene_wise_scores_summary.tsv

We also need a dataframe that maps taxa to clades. I did this by dumping all the taxon names in all matrices and adding clade names in taxa_clades.tsv.

Matrix processing

Run the following to process the matrices:

cd analyses/sampling
python process_matrices.py
python validate_matrices.py

The last validation step is optional, but is recommended to ensure that the matrices were processed correctly. Note that the Chang2015 and Borowiec2015 matrices have no consistent genes in the original published analyses, so they will be skipped in the validation step.

This will create two sets of matrices:

  • analyses/sampling/matrices/full, which has the same partitions as figshare/CONCATENATED_DATA_MATRICES/ but with Dryodora glandiformis (the rogue ctenophore initially identified by SK) removed, partition names fixed to be consistent with those in the gene wise files, and models removed from the partition files.

  • analyses/sampling/matrices/filtered, derived from analyses/sampling/matrices/full but including only the consistent genes originally identified by SK. The additionally identified rogue sponge Cliona orientalis is also removed here.

Given their large sizes, these matrices are not committed to the repository, but can be regenerated by running the above commands.

Analyze gene statistics

Once the above steps are done, run analyses/sampling/gene_stats.ipynb to reproduce our analyses of gene taxon sampling and other gene and matrix statistics. This generates the results presented above in this section.

Corrected scoring

Files with original and corrected gene scores are in the analyses/scoring directory. These are for the "92.5" matrix only.

Phylogenetic analyses

Files with phylogenetic analyses of original and corrected matrices are in the analyses/phylogenies directory. These are for the "92.5" matrix only.

Additional issues addressed

Thanks to Steenwyk and King for helping us address the following issues:

  • The Ryan matrices were missing from figshare; the authors added them on 11/26/2025
  • We found that the sponge Cliona orientalis in the ClipKIT BUSCO trees was placed with Bilateria. The authors confirmed that, like the ctenophore Dryodora glandiformis, it should be removed from analyses
