Abstract

A protein superfamily contains distantly related proteins that have acquired diverse biological functions through a long evolutionary history. Phylogenetic analysis of the early evolution of protein superfamilies is a key challenge because existing phylogenetic methods show poor performance when protein sequences are too diverged to construct an informative multiple sequence alignment (MSA). Here, we propose the Graph Splitting (GS) method, which rapidly reconstructs a protein superfamily-scale phylogenetic tree using a graph-based approach. Evolutionary simulation showed that the GS method can accurately reconstruct phylogenetic trees and be robust to major problems in phylogenetic estimation, such as biased taxon sampling, heterogeneous evolutionary rates, and long-branch attraction when sequences are substantially diverge. Its application to an empirical data set of the triosephosphate isomerase (TIM)-barrel superfamily suggests rapid evolution of protein-mediated pyrimidine biosynthesis, likely taking place after the RNA world. Furthermore, the GS method can also substantially improve performance of widely used MSA methods by providing accurate guide trees.

One of the grand challenges in evolutionary biology is the phylogenetic analysis of a protein superfamily that contains remote homologs, proteins that share a very distant evolutionary origin (Rojas et al. 2012). The early evolution of protein superfamilies is of particular interest because it would tell us how the building blocks of life were established or how living things and the earth coevolved in their early histories (Gaucher et al. 2010). Moreover, the recent massive accumulation of sequence data is increasing the need for reconstructing large phylogenetic trees that include remote homologs (Warnow 2013). Whereas protein superfamilies are usually identified based on structural similarity, a sequence-based approach is needed for their phylogenetic analysis because sequence data can be obtained and analyzed in a highly systematic manner (Matsuda et al. 2003). However, the existing widely used molecular phylogenetic methods (e.g., Neighbor-Joining (NJ) (Saitou and Nei 1987), maximum likelihood (ML) (Felsenstein 1981), Maximum Parsimony (MP) (Camin and Sokal 1965), and Bayesian Inference (BI) methods (Yang and Rannala 1997)) show poor performance when applied to protein superfamily-scale data sets (Chan and Ragan 2013).

The most fundamental problem in the phylogenetic analysis of a protein superfamily is that remote homologs substantially increase the difficulty of constructing an informative multiple sequence alignment (MSA) (Ogden and Rosenberg 2006). Accumulating evidence suggests that the quality of an MSA critically affects the performance of the existing methods (Tan et al. 2015). Even classical procedures in phylogenetic analysis, such as constructing and filtering of MSA, are now being critically re-evaluated (Tan et al. 2015). To avoid the problem of the dependence on informative MSA, it has been proposed to use all-to-all pairwise sequence alignment (PSA) for calculating distance matrices and reconstructing phylogenetic trees using the NJ method (Thorne and Kishino 1992). Indeed, the pioneering studies showed that such PSA-based NJ methods (e.g., PhyPA) can outperform the ML and BI methods when data sets contain remote homologs (Xia 2016), suggesting that phylogenetic tree reconstruction based on distance matrices (or sequence similarity graphs [SSGs]) created by all-to-all PSA is promising for estimating protein superfamily-scale trees. It may also be notable that such a graph- or network-based approach has been successfully applied to analyze complex evolutionary histories, such as those of fusion proteins, haplotype genomes, and viruses (Bansal and Bafna 2008; Zhang et al. 2011; Jachiet et al. 2013; Corel et al. 2016). As a graph- or network-based tree reconstruction method, the Buneman tree method was previously proposed (Buneman 1971); however, this method requires large computational cost (Bryant and Moulton 1999) and cannot reconstruct superfamily-scale trees accurately (Besenbacher et al. 2005).

Here, we propose the Graph Splitting (GS) method, which rapidly reconstructs a protein superfamily-scale phylogenetic tree that contains remote homologs. Evolutionary simulation showed that the GS method can accurately reconstruct phylogenetic trees and be robust to major problems in phylogenetic estimation when sequences are substantially diverged. In addition, to statistically evaluate branch reliabilities like the bootstrap values and posterior probabilities do, the Edge Perturbation (EP) values were introduced for the GS method. An application of the GS method to an empirical data set of the triosephosphate isomerase (TIM)-barrel superfamily suggests rapid evolution of protein-mediated pyrimidine biosynthesis, likely taking place after the RNA world. Furthermore, it was also shown that the GS method can substantially improve performance of widely used MSA methods by providing accurate guide trees.

Materials and Methods

Overview

The overview of the GS method is shown in the top panel of Figure 1a. In brief, given a set of protein sequences, the GS method first calculates symmetrical relative sequence similarities by performing all-to-all PSA using MMSeqs2 (Steinegger and Söding 2017) and obtains a weighted and undirected SSG, whose nodes and edges represent the proteins and detected similarities, respectively. Then, the node-set in SSG is recursively split in a top-down manner by the spectral clustering algorithm (Paccanaro et al. 2006) until a tree (dendrogram) is obtained (Supplementary Fig. S1 and Text S1 available on Dryad at http://dx.doi.org/10.5061/dryad.ps0qf4r.). The spectral clustering method was adopted because it shows better performance than other methods, such as component analysis, hierarchical clustering, and the Markov cluster algorithm for protein family classification (Paccanaro et al. 2006). To make the method scalable to huge data sets, we devised an approximate and fast algorithm for minimizing the normalized cut (⁠|$N_{\rm cut})$| values (Shi and Malik 2000), which are used in the clustering algorithm.

GS and EP methods. a) Schematic diagrams of the methods. The GS method builds an SSG in which nodes and edges represent proteins and sequence similarities, respectively. The SSG is then recursively split by spectral clustering, and its clustering hierarchy produces a phylogenetic tree (dendrogram). The EP method evaluates the reliability of each phylogenetic branch by performing three steps. First, the sequence similarity scores of an SSG are randomly perturbed to produce a large number of SSGs. Second, the GS method is run on each SSG to reconstruct a phylogenetic tree. Third, a recovery rate is obtained for each branch in the original tree. b) The pseudocode of the GS method.
Figure 1.

GS and EP methods. a) Schematic diagrams of the methods. The GS method builds an SSG in which nodes and edges represent proteins and sequence similarities, respectively. The SSG is then recursively split by spectral clustering, and its clustering hierarchy produces a phylogenetic tree (dendrogram). The EP method evaluates the reliability of each phylogenetic branch by performing three steps. First, the sequence similarity scores of an SSG are randomly perturbed to produce a large number of SSGs. Second, the GS method is run on each SSG to reconstruct a phylogenetic tree. Third, a recovery rate is obtained for each branch in the original tree. b) The pseudocode of the GS method.

Aside from accurately reconstructing trees, any practical phylogenetic method needs to be able to provide reliability measurements on the estimated branches, such as the bootstrap values and posterior probabilities that are used in existing methods (Felsenstein 1985). These two measures cannot be easily obtained for a method that does not rely on MSA because their calculations require an informative MSA whose columns are independent. To overcome this obstacle, we developed the EP method for statistically evaluating branch reliability (Fig. 1a, bottom). Based on the fact that sequence similarity scores follow the extreme value distribution (Bastien and Maréchal 2008), the EP method repeatedly makes random perturbations to the scores. Subsequently, the GS method is run using the perturbed SSGs, and the proportion of times that each phylogenetic branch is recovered is calculated (this is called the EP value).

GS Method

Given a protein sequence data set of size |$n$|⁠, the GS method reconstructs a phylogenetic tree as explained below. For each protein sequence pair |$i$| and |$j$| in the data set, the symmetrical relative sequence similarity score between them is given by the following equation (Dufour et al. 2010):
where |$S^{'}_{bits} \left(i,j\right)$| is an alignment bit score calculated by all-to-all PSA, and |$i$| and |$j$| are the query and target sequences, respectively. It should be noted that |$0\le w_{ij} \le 1$|⁠, |$w_{ij} $| is symmetrical for |$i$| and |$j$| regardless of the implementation, and |$w_{ii} =1$|⁠. Then, an undirected and weighted graph |$G$| = (⁠|$V$|⁠, E, W) is defined where |$V$| is the set of nodes representing the protein sequences (i.e., |$\vert V\vert =n)$|⁠, |$E$| is a set of edges representing the sequence similarity hits, and |$W$| is a set of edge weights represented by a symmetric |$n\times n$| matrix whose each element is |$w_{ij}$|⁠. We set the threshold of |$E$|-value to 10 to make use of as much information as possible. To calculate all-to-all PSA, we use MMSeqs2 (Steinegger and Söding 2017) with the option -s 7.5 (most sensitive mode).
The GS method recursively splits a given node-set |$K$| (initially |$K=V)$| into two groups, |$A$| and |$B$| (⁠|$A\mathop \cup B=K$| and |$A\mathop \cap B=\phi)$|⁠, so that proteins in the same group and those in different groups are densely and sparsely connected, respectively. In graph theory, such a split can be calculated by minimizing the normalized cut value (⁠|$N_{\rm cut})$|⁠, which quantifies the amount of intergroup edges weighted by a normalizing factor for avoiding unnaturally biased splits (Shi and Malik 2000). |$N_{\rm cut}$| is defined by
where |$W_{AB} :=\mathop \sum_{i\in A,j\in B} w_{ij} $|⁠, |$d_i :=\mathop \sum_{j\in K} w_{ij} $|⁠, and |$d_X :=\mathop \sum_{i\in X} d_i $|. By defining an indicator vector |$p$| as
where
|$p$| that minimizes |$N_{\rm cut}$| can be efficiently calculated (Shi and Malik 2000). Because we allowed the elements of the indicator vector |$p$| to take continuous values, |$p$| needs to be converted to a discrete-value vector. Based on numerical simulations, we found that sorting its elements and splitting them into two groups where the difference between the neighboring elements is the largest efficiently gives a near-optimal discrete-value vector (data not shown). We refer to this approach as the relaxed algorithm. Note that this approach does not give a trivial split between a null set and all nodes theoretically.

Based on the theoretical background above, the GS method recursively splits the input node-set |$V $|until a tree (or dendrogram) is obtained by the method that is represented as pseudocode (Fig. 1b). The GS method is implemented in C++ (sources and binary packages are freely available at https://github.com/MotomuMatsui).

EP Method

The EP method provides reliability measures of branches estimated by the GS method and is analogous to the bootstrap values and posterior probabilities used in existing methods. The brief procedure of EP method is that 1) a Gumbel distribution is fitted to the observed distribution of sequence similarity scores, 2) randomized sequence similarity scores are sampled from the distribution, and 3) trees are repetitively reconstructed by the randomized sequence similarity scores using the GS method.

We describe its theoretical basis below. Sequence similarity bit scores are known to follow these equations:
where |$S\left( {i,j} \right)$| is the logarithmic odds of alignment scores and |$\lambda $| and |$K_{ij} $| are parameters derived from scoring matrices (Altschul et al. 1990). |$S\left( {i,j} \right)$| is known to follow the Gumbel distribution, a type of extreme value distribution (Altschul et al. 1990). Thus, we assumed that |$w_{ij}$| can also be explained by the Gumbel distribution. Numerical simulations confirmed that |$w_{ij}$| actually fits the Gumbel distribution, particularly when branch lengths are large (Supplementary Fig. S2 available on Dryad). This observation let us assume that the reliability of each branch estimated using the GS method can be evaluated by inducing random perturbations to sequence similarity scores of given SSG according to the Gumbel distribution. The scale and shape parameters of the Gumbel distribution were determined by simulations under an observed indel rate of 1/35.4 (Benner et al. 1993) (Supplementary Fig. S2 available on Dryad). Given |$G$| = (⁠|$V$|⁠, E, W), the algorithm of the EP method can be briefly outlined as follows:
  1. Reconstruct a tree |$T_{0}$| by applying the GS method to |$G$|⁠.

  2. Induce random perturbations to |$W$| according to the extreme value distribution with parameters location=|$w_{ij} $|⁠, scale=|$w_{ij} \left( {1-w_{ij} } \right)/3.5$|⁠, and shape=|$e^{-3w_{ij} }-1$|⁠.

  3. Reconstruct a tree |$T_{EP}$| using the GS method.

  4. Repeat steps 2–3 |$N$| times.

  5. Calculate the recovery rate of each branch of |$T$| in |$T_{EP}s$|⁠.

The EP method is implemented in C++.

Evolutionary Simulation

We generated 2,400 trees that had 20 external nodes. The tree shapes were randomly determined by the backward Yule process with a constant birth rate (one per unit time) using Bio::Tree::RandomFactory in BioPerl 1.6.924 (Stajich et al. 2002). Under this model, the lengths of the edges follow |$1-\ln \left( {\mbox{rand}\left( 1 \right)\times \left( {e-1} \right)+1} \right)$| substitution/site, where |$\mbox{rand}\left( 1 \right)$| represents a uniform random variable between 0 and 1. For each 400 trees, we applied different scaling factors (0.1, 0.2, 0.5, 1.0, 1.5, and 2.0) to shrink or expand the whole trees, and we prepared trees of various sizes. For each tree, the molecular evolution of a 1000 amino-acid protein sequence was simulated using the WAG+|$\Gamma $| substitution model (Whelan and Goldman 2001) with four different indel frequencies that followed the Zipfian (power-law) model (Fletcher and Yang 2009) using INDELible v1.03 (Fletcher and Yang 2009). The scaling parameter of the power-law distribution was set to the default value (1.7). The indel frequencies were set to 0.0, 0.02, 0.05, or 0.1 relative to the substitution rate (note that 600 simulations were performed for each indel frequency). We also performed simulations for different tree sizes (50, 100, and 200 external nodes) and different protein lengths (500 and 2,000 amino acids) with the same settings. For comparison with PhyPA (Xia 2016), we performed simulations with “AABlosumSym,” “AABlosumSymHalf,” “AABlosumAsym,” and “AABlosumAsymHalf” models using INDELible. Here, the comparison was conducted separately because PhyPA is a GUI tool, but under the same conditions of the PhyPA paper. Briefly, evolutionary simulation of 24 amino-acid sequences was conducted under conditions that tree topology and each interior branch length were symmetrical and 0.6 substitutions/site, symmetrical and 0.3 substitutions/site, asymmetrical and 0.3 substitutions/site, and asymmetrical and 0.15 substitutions/site, respectively.

To evaluate the performance of each phylogenetic method in the face of typical problems (biased taxon sampling, heterogeneous evolutionary rates, and long-branch attraction), additional simulations were performed. Because the simulation settings were almost unchanged from the original simulation, only the differences are described here. For simulation under biased taxon sampling, we generated 32-taxon trees that had eight different tree topologies, where topologies 1 and 7 were the most biased among the first seven and topology 8 was a complete pectinate tree (Fig. 4a, b). For each topology, four trees whose branch lengths were 0.1, 0.2, 0.5, or 1.0 substitution/site were prepared. Four indel frequencies were tested as in the original simulation. For each condition, 100 independent evolutionary simulations were performed (i.e., |$8 \times 4 \times 4 \times 100 = 12, 800$| simulations in total). For simulation under heterogeneous evolutionary rates, we used 32-taxon trees whose branch lengths were sampled from a normal distribution (using the rnorm function in R 3.4.3 (R Core Team 2015)) (Fig. 4d, e). For the branch lengths, the mean was set 0.1, 0.2, 0.5, or 1.0, and the standard deviation was set to 0.0, 0.1, 0.2, 0.5, or 1.0. Four indel frequencies were tested as in the original simulation. If negative branch lengths were obtained, they were set to 0. For each condition, 100 independent evolutionary simulations were performed (i.e., |$4 \times 5 \times 4 \times 100 = 8,000$| simulations in total). For simulation under long-branch attraction, we prepared 32-taxon trees that had two long branches (Fig. 4g, h). The lengths of the short branches were set to 0.1, 0.2, or 0.5, and the ratio between the lengths of the long and short branches was set to 1, 1.5, 2.0, 2.5, or 3.0. Four indel frequencies were tested as in the original simulation. For each condition, 100 independent evolutionary simulations were performed (i.e., |$3 \times 5 \times 4 \times 100 = 6,000$| simulations in total).

Phylogenetic Tree Reconstruction

MSA of the amino-acid sequences was conducted using MAFFT (mafft-linsi) v7.273 (Katoh and Standley 2013) with the default options. Distance matrices used in the MSA-with-complete-gap-deletion and the MSA-with-pairwise-gap-deletion approaches of the NJ method were computed using the phangorn 2.4.0 package (Schliep 2011) on R environment with the WAG+|$\Gamma $| substitution model and dist.ml function with parameter exclude=“all” and “pairwise,” respectively. Distance matrices used in the pairwise-gap-alignment approach of the NJ method were calculated based on the sequence similarity score. Phylogenetic reconstruction was conducted using the same package. Bootstrap values were computed using the bootstrap.phyDat function of the same package (bs = 100). For comparison, the NJ+PhyD* (Criscuolo and Gascuel 2008), BioNJ (Gascuel 1997), and BioNJ+PhyD* methods were also tested but showed little difference. In addition, PhyPA in DAMBE v7.0.12 (Xia 2018) was run to reconstruct PSA-based phylogenetic trees with options of gap open = 20, gap extension = 2, substitution matrix = Blosum62, tree-building algorithm = FastME with default options, and genetic distance = PoissonP.

The same MSA was used in the ML, MP, and BI methods. The ML method was conducted using RAxML v8.2.8. (Stamatakis 2014) with options -m PROTGAMMAIWAGX -f d. Bootstrap values were computed with options of -m PROTGAMMAIWAGX -f a -N 100. For comparison with other ML methods, PhyML v20120412 . (Guindon et al. 2010), IQ-Tree v1.6.9 (Nguyen et al. 2015), and FastTree v2.1.10 (Price et al. 2010) were also used with the following options: -d aa -m WAG -f e -a e, -st AA -m WAG+F, and -wag, respectively. The MP method used RAxML v8.2.8 (as the starting tree) and pratchet and acctran functions of the phangorn package. Bootstrap values were computed using the bootstrap.phyDat function of the same package (bs = 100). The BI method used MrBayes v3.2.6 (Ronquist et al. 2012) with options of lset rates = gamma; prset aamodelpr = fixed(wag); mcmcp ngen = 100,000, samplefreq = 1,000, nruns = 1, nchains = 4; and sumt burnin = 20, contype = halfcompat.

The GS method was computed as described above, where MMSeqs2 Release 2-23394 (Steinegger and Söding 2017) was used for the all-to-all PSA. EP values were computed with 100 replicates.

Computations were performed on a Mac Pro 6, 12-Core Intel Xeon E5 (2.7 GHz), with 64 GB memory and a 1 TB SSD (5.0 Giga transfers/s).

Evaluation of Accuracy of Phylogenetic Tree Reconstruction

The correctly recovered branch ratio (⁠|$R_{CRB})$| was calculated as follows:
where RF is the Robinson–Foulds distance (Robinson and Foulds 1981) between the estimated and true trees, calculated using the phangorn 2.4.0 package (Schliep 2011), |$M$| is the number of external nodes in the true tree, and |$O$| is the difference in the numbers of branches between the true and estimated trees. In brief, this metric represents a ratio of edges of the true tree that are recovered in the estimated tree. Effects of long-branch attraction were measured by numbers of branches in the shortest path (⁠|$D_{S})$| between nodes 1 and 32 (Fig. 4g, h).

Data Visualization and Network Analysis

We used iTOL 4.2.1 (Letunic and Bork 2011) and the R package Ape 5.1 (Paradis et al. 2004) to visualize phylogenetic trees. Cytoscape 3.6.1 (Smoot et al. 2011) was used for visualization of the networks. Violin plots were created using vioplot 0.2 in R. The transitivity of the SSGs was calculated using the transitivity function of iGraph 1.0.1 (Csardi and Nepusz 2006) in R.

Empirical Superfamily Data Analysis

Sequence data of 233 TIM-barrel superfamily proteins (ID: 2000031) were downloaded from the SCOP database (Andreeva et al. 2014) (http://scop2.mrc-lmb.cam.ac.uk/, downloaded April 22, 2015). Redundant sequences were removed using CD-HIT v4.6.8 (Li and Godzik 2006) to retrieve 72 protein sequences. Their protein structural data were downloaded from PDB (Berman et al. 2000) (http://www.rcsb.org/pdb, downloaded April 29, 2015). The root-mean-square deviations (RMSDs) and TM-scores of protein structural pairs were calculated using Mican v2013.1.29 (Minami et al. 2013) with the -|$r$| option. Stereo views were produced by rotating structures |$5^{\circ}$| and lining them up side-by-side using UCSF Chimera 1.10.1 (Pettersen et al. 2004). UCSF Chimera was also used to produce movies.

The significance of the clustering of proteins with similar structures in the reconstructed phylogenetic trees was statistically tested in the following steps: First, for each branch in the tree, two sets of external nodes that were divided by the branch were retrieved; second, an average RMSD and TM-score were calculated for every protein pair in each set; third, the differences between distributions of the RMSDs and TM-scores were tested using the two-sided Wilcoxon rank-sum test.

Sequence data for 243 protein superfamilies were obtained from SCOPe 2.07 database (http://scop.berkeley.edu/, downloaded June 17, 2018). Sequence similarity scores were calculated using MMSeqs2.

Application of the GS Method to Improve MSA and Existing Phylogenetic Methods

Using 600 sequence data sets obtained from evolutionary simulation for each indel rate, MSAs were estimated using MAFFT (mafft-linsi) v7.273 (Katoh and Standley 2013) and Clustal|$\Omega $| v1.2.4 (Sievers et al. 2011) with the default options. The accuracy of the estimated MSAs was assessed by modeler’s score (⁠|$f_{M})$| and developer’s score (⁠|$f_{D})$| (Nuin et al. 2006). These scores are defined as |$f_{M}=c$|/|$t$| and |$f_{D}=c$|/|$r$|⁠, where |$c$|⁠, |$t$| and |$r$| are the number of correctly aligned residue pairs in the calculated MSA, whole residue pairs in the calculated MSA and whole residue pairs in the reference (true) MSA, respectively. Then, the GS method was applied to each data set to reconstruct a phylogenetic tree (GS tree). MAFFT and Clustal|$\Omega$| were run again by feeding the GS tree as a guide tree with options treein and guidetree-in, respectively. The accuracy of the estimated MSAs was also assessed by |$f_{D}$|⁠. Differences in |$f_{D}$| of the original MSA and MSA with the GS tree as the guide tree were then examined.

A GS tree was used as an initial tree in the tree search algorithms of the ML (RAxML) and BI (MrBayes) methods, with the startingtree and -|$t$| options enabled, respectively. In addition, the ML and BI methods were run on MSAs that were estimated by MAFFT with the GS tree as the guide tree.

Results

The GS Method Accurately and Rapidly Reconstructs Superfamily-Scale Trees with Reliability Information

The performances of the NJ, ML, MP, BI, and GS methods were evaluated using evolutionary simulation. First, 2400 phylogenetic trees that had 20 external nodes were randomly generated based on the backward Yule process with a constant birth rate (one per unit time). Then, to prepare trees of various sizes, we applied different scaling factors (0.1, 0.2, 0.5, 1, 1.5, and 2) for every set of 400 of the 2,400 trees to shrink/expand the entire trees. On each tree, the molecular evolution of a 1000 amino-acid protein sequence was simulated using the WAG+|$\Gamma $| substitution model (Whelan and Goldman 2001) with four different indel frequencies that followed the power-law model (Fletcher and Yang 2009). Using the generated sequences at the external nodes (i.e., “modern” sequences) as input data, each of the methods reconstructed a tree. For the NJ, ML, MP, and BI methods, the MSA was constructed using Multiple Alignment using Fast Fourier Transform (MAFFT) (Katoh and Standley 2013). For the NJ, ML, and BI methods, the WAG+|$\Gamma $| substitution model was assumed as an evolutionary model. For the NJ method, three different approaches were used for calculating the distance matrices. An MSA-with-complete-deletion approach (NJ|$_{{\rm MSA}\_{\rm c}})$| removed every column in MSA that contained any gap, an MSA-with-pairwise-deletion approach (NJ|$_{{\rm MSA}\_{\rm p}})$| removed gaps only if necessary in the pairwise distance calculation, and a pairwise-sequence-alignment approach (NJ|$_{PSA})$| used similarity scores of PSAs without construction of an MSA (as similar to what PhyPA (Xia 2016) does). Finally, consistencies between the reconstructed and “true” trees were evaluated based on the number of correctly recovered branches (splits). This measure is theoretically equivalent to the Robinson–Foulds distance (Robinson and Foulds 1981) if the reconstructed trees are bifurcation trees (note that the BI method sometimes produces trees with multifurcations).

As shown by a typical example in Figure 2 and by ratios of correctly recovered branches in Figure 3a, the GS method outperformed the existing methods on the simulated data, if the sequences are sufficiently diverged (average sequence similarity scores |$<0.06$| [under frequent indels]–0.09 [under no indels]). Other ML programs (PhyML, IQ-Tree, and FastTree) gave results comparable to those of RAxML (Supplementary Fig. S3 available on Dryad). Indeed, average sequence similarity scores of protein families for which the GS method reconstructed phylogenetic trees with more supported branches (EP values |$\ge 0.9$|⁠) than those reconstructed by the NJ|$_{{\rm MSA}\_{\rm p}}$| method (bootstrap values |$\ge 0.8$|⁠) were smaller than those of the other protein families and fell into this category (Fig. 3b). We found that the GS method outperformed traditional methods when the transitivity of SSG reaches approximately 0.68, regardless of the indel rates (Fig. 3c) and sequence lengths (500 and 2,000 amino acids) (Supplementary Figs. S5 and S6 available on Dryad). Here, transitivity is a measure used in graph theory that quantifies how strongly nodes are clustered in a given graph (Barrat et al. 2004). Assume one sequence A is randomly chosen from a sequence data set and two sequences B and C are randomly chosen from the sequences that were hit by querying A against the data set. Then, transitivity = 0.68 means that B and C directly hit each other at the 68% probability (note that if A, B, and C are similar enough each other, transitivity asymptotically reaches 1). As we saw, this condition is fulfilled when average sequence similarity scores are less than 0.09, which approximately corresponds to 10% amino-acid identities on average, if there are no indels. Our observation indicates that transitivity can be used as a guide for choosing the GS method instead of other methods. The effectiveness of the EP method was also evaluated using the same simulation data set. Overall, branches with EP values |$>0.9$| or 0.8 showed high prediction precision of |$>80\%$| or 60%, respectively (Fig. 3d).

A typical simulation example that the GS method outperformed the existing methods. a) A randomly generated reference tree. The red and blue open circles indicate interior branches that were correctly recovered by the GS and Neighbor-Joining using MSA-with-pairwise-gap-deletion (NJ$_{{\rm MSA}\_{\rm p}})$ methods, respectively. The NJ$_{{\rm MSA}\_{\rm p}}$ method was used as an NJ method to obtain bootstrap values. b) A 300-bp portion of the MSA generated. c) Topological comparison between phylogenetic trees reconstructed by the GS (left) and NJ$_{{\rm MSA}\_{\rm p}}$ (right) methods. The numbers below the interior branches indicate EP and bootstrap values, respectively. The red branches are those supported by large EP ($\ge $0.9) and bootstrap ($\ge $0.8) values, respectively. The red and blue open circles represent correctly recovered branches by the GS and NJ$_{{\rm MSA}\_{\rm p}}$ methods, respectively.
Figure 2.

A typical simulation example that the GS method outperformed the existing methods. a) A randomly generated reference tree. The red and blue open circles indicate interior branches that were correctly recovered by the GS and Neighbor-Joining using MSA-with-pairwise-gap-deletion (NJ|$_{{\rm MSA}\_{\rm p}})$| methods, respectively. The NJ|$_{{\rm MSA}\_{\rm p}}$| method was used as an NJ method to obtain bootstrap values. b) A 300-bp portion of the MSA generated. c) Topological comparison between phylogenetic trees reconstructed by the GS (left) and NJ|$_{{\rm MSA}\_{\rm p}}$| (right) methods. The numbers below the interior branches indicate EP and bootstrap values, respectively. The red branches are those supported by large EP (⁠|$\ge $|0.9) and bootstrap (⁠|$\ge $|0.8) values, respectively. The red and blue open circles represent correctly recovered branches by the GS and NJ|$_{{\rm MSA}\_{\rm p}}$| methods, respectively.

Performance of the GS method in evolutionary simulations. a) Correctly recovered branch ratios of GS and existing methods against average sequence similarity scores. Colored lines are the LOWESS curves (solid black, NJ using MSA-with-complete-gap-deletion; dashed black, NJ using MSA-with-pairwise-gap-deletion; dotted black, NJ using pairwise-sequence-alignment; blue, ML; green, MP; gray, BI; red, GS). Note that the ML, MP, BI and NJ$_{{\rm MSA}\_{\rm p}}$ lines are heavily overlapped. The vertical lines indicate the average sequence similarity score thresholds where the GS method outperformed the other methods (through c to d). SSS, sequence similarity score. b) Average sequence similarity scores of protein families for which the GS method reconstructed phylogenetic trees with more supported branches (EP values $\ge $0.9) than the NJ$_{{\rm MSA}\_{\rm p}}$ method did (bootstrap values $\ge $0.8) (top) and those of the other protein families (bottom). The 243 protein superfamilies data set in total was obtained from the SCOPe database. The violin plots, red points, and gray boxes are the distributions, averages, and standard deviations, respectively. c) Transitivity of SSG against average sequence similarity scores. The gray plots and red lines are the calculated transitivities and LOWESS curves, respectively. The horizontal lines indicate the transitivities where the GS method outperformed the other methods regardless of the indel rate. d) Precision of branches estimated by the GS method under different thresholds of EP values. The colored lines are LOWESS curves of precisions of branches with EP values $\ge $0.9 (red), 0.8 (blue), or 0.0 (i.e., all branches, black). e) Correctly recovered branch ratios of the GS and existing methods for different tree sizes. The error bars are standard deviations. For NJ trees, only the one built using pairwise alignment is shown because it performed best. f) Computational times for different tree sizes. The error bars are standard deviations. Dotted purple, MAFFT; dotted orange, all-to-all MMSeqs2; dotted black, NJ with bootstrap analysis (100 replications); and dotted red, GS with EP analysis (100 replications).
Figure 3.

Performance of the GS method in evolutionary simulations. a) Correctly recovered branch ratios of GS and existing methods against average sequence similarity scores. Colored lines are the LOWESS curves (solid black, NJ using MSA-with-complete-gap-deletion; dashed black, NJ using MSA-with-pairwise-gap-deletion; dotted black, NJ using pairwise-sequence-alignment; blue, ML; green, MP; gray, BI; red, GS). Note that the ML, MP, BI and NJ|$_{{\rm MSA}\_{\rm p}}$| lines are heavily overlapped. The vertical lines indicate the average sequence similarity score thresholds where the GS method outperformed the other methods (through c to d). SSS, sequence similarity score. b) Average sequence similarity scores of protein families for which the GS method reconstructed phylogenetic trees with more supported branches (EP values |$\ge $|0.9) than the NJ|$_{{\rm MSA}\_{\rm p}}$| method did (bootstrap values |$\ge $|0.8) (top) and those of the other protein families (bottom). The 243 protein superfamilies data set in total was obtained from the SCOPe database. The violin plots, red points, and gray boxes are the distributions, averages, and standard deviations, respectively. c) Transitivity of SSG against average sequence similarity scores. The gray plots and red lines are the calculated transitivities and LOWESS curves, respectively. The horizontal lines indicate the transitivities where the GS method outperformed the other methods regardless of the indel rate. d) Precision of branches estimated by the GS method under different thresholds of EP values. The colored lines are LOWESS curves of precisions of branches with EP values |$\ge $|0.9 (red), 0.8 (blue), or 0.0 (i.e., all branches, black). e) Correctly recovered branch ratios of the GS and existing methods for different tree sizes. The error bars are standard deviations. For NJ trees, only the one built using pairwise alignment is shown because it performed best. f) Computational times for different tree sizes. The error bars are standard deviations. Dotted purple, MAFFT; dotted orange, all-to-all MMSeqs2; dotted black, NJ with bootstrap analysis (100 replications); and dotted red, GS with EP analysis (100 replications).

Performance of the GS method in the face of typical problems such as biased taxon sampling (a–c), heterogeneous evolutionary rates (d–f), and long-branch attraction (g–i). a) Schematic figure showing how 32 taxa were selected from a 64-taxon balanced tree for topologies 1–7. Topology 8 was a complete pectinate tree. b) Dendrograms of the topologies 1–7. c) Performance of the GS method in evolutionary simulation under biased taxon sampling (indel rate = 0.02). Violin plots are distributions of correctly recovered branch ratios at different tree topologies (for detailed results, see Supplementary Fig. S7 available on Dryad). Orange, NJ using MSA-with-complete-gap-deletion; yellow, NJ using MSA-with-pairwise-gap-deletion; black, NJ using pairwise-sequence-alignment; blue, ML method; green, MP method; gray, BI method; red, GS method. d) Dendrogram of a 32-taxon tree. The length of each branch ($a)$ was randomly determined based on the normal distribution (mean = 0.5, SD = 0.0, 0.2, 0.5, or 1.0). e) Examples of generated trees at different standard deviation parameters. f) Performance of the GS method in evolutionary simulation under heterogeneous evolutionary rates (indel rate = 0.02). Violin plots are distributions of correctly recovered branch ratios at different SDs (for detailed results, see Supplementary Fig. S8 available on Dryad). g) A phylogenetic tree with two long branches. h) Examples of generated trees at different branch length ratios between the long and normal branches. i) Performance of the GS method in evolutionary simulation under long-branch attraction (in accuracy of the entire tree reconstruction; indel rate = 0.02, branch length [$a$] = 0.5). Violin plots are distributions of correctly recovered branch ratios at different branch length ratios (for detailed results, see Supplementary Fig. S10 available on Dryad).
Figure 4.

Performance of the GS method in the face of typical problems such as biased taxon sampling (a–c), heterogeneous evolutionary rates (d–f), and long-branch attraction (g–i). a) Schematic figure showing how 32 taxa were selected from a 64-taxon balanced tree for topologies 1–7. Topology 8 was a complete pectinate tree. b) Dendrograms of the topologies 1–7. c) Performance of the GS method in evolutionary simulation under biased taxon sampling (indel rate = 0.02). Violin plots are distributions of correctly recovered branch ratios at different tree topologies (for detailed results, see Supplementary Fig. S7 available on Dryad). Orange, NJ using MSA-with-complete-gap-deletion; yellow, NJ using MSA-with-pairwise-gap-deletion; black, NJ using pairwise-sequence-alignment; blue, ML method; green, MP method; gray, BI method; red, GS method. d) Dendrogram of a 32-taxon tree. The length of each branch (⁠|$a)$| was randomly determined based on the normal distribution (mean = 0.5, SD = 0.0, 0.2, 0.5, or 1.0). e) Examples of generated trees at different standard deviation parameters. f) Performance of the GS method in evolutionary simulation under heterogeneous evolutionary rates (indel rate = 0.02). Violin plots are distributions of correctly recovered branch ratios at different SDs (for detailed results, see Supplementary Fig. S8 available on Dryad). g) A phylogenetic tree with two long branches. h) Examples of generated trees at different branch length ratios between the long and normal branches. i) Performance of the GS method in evolutionary simulation under long-branch attraction (in accuracy of the entire tree reconstruction; indel rate = 0.02, branch length [|$a$|] = 0.5). Violin plots are distributions of correctly recovered branch ratios at different branch length ratios (for detailed results, see Supplementary Fig. S10 available on Dryad).

Additional evolutionary simulation with trees that had 50, 100, and 200 nodes also showed that the GS method outperformed the other methods especially for larger trees (Fig. 3e). Finally, the computational times required for phylogenetic analyses using the GS and EP methods are smaller than NJ-method-based approaches and much smaller than those required for ML- and BI-method-based approaches (Fig. 3f).

Finally, we compared the GS method with an existing PSA-based phylogenetic method, PhyPA (Xia 2016), regarding accuracy (including that on diverged sequences) and computational speed using the same simulation procedures. Because PhyPA is a GUI tool, comparison was conducted independently and not thoroughly. While their accuracies were comparable, the GS method ran much faster than PhyPA (0.1 s compared to 2.5 min in case of 24 nodes) (Supplementary Fig. S4, available on Dryad).

The GS Method Is Robust to Major Problems in Phylogenetics

Further simulations were conducted to address major problems in phylogenetic analysis: biased taxon sampling, heterogeneous evolutionary rates, and long-branch attraction. To address biased taxon sampling, eight tree topologies that represented different levels of taxon sampling biases were tested using the same evolutionary simulation framework (Fig. 4a, b). The performance of the GS method was nearly unaffected by the biases and was better than the other methods, especially when the indel rates and branch lengths were large (Fig. 4c and Supplementary S7 available on Dryad). For the second factor, we tested different perturbations to the evolutionary rates in the simulations (Fig. 4d, e). Although strong perturbations worsened the performance of every method, the GS method was the best if the indel rates and branch lengths were large (Fig. 4f and Supplementary S8 available on Dryad). For the last factor, trees with two long branches were considered (Fig. 4g, h). According to the theory of long-branch attraction, those two long branches tend to become close to each other in reconstructed trees (Bergsten 2005). In the simulation analyses, GS method showed similar or less signs of long-branch attraction compared to the other methods (Supplementary Fig. S9 available on Dryad), whereas its performance in accurately reconstructing the entire tree topology was much better than the other methods (Fig. 4i and Supplementary S10 available on Dryad).

The GS Tree of TIM-Barrel Superfamily Proteins Shows Consistency With Their 3D Structures and Suggests Their Early History

As an empirical data set, we applied the GS method to the TIM-barrel superfamily (Copley and Bork 2000), one of the largest superfamilies that contain proteins with diverse and important enzymatic functions and whose “backbone” phylogeny is still under debate (Nagano et al. 2002; Goldman et al. 2016). The sequences of 72 TIM-barrel proteins across the three domains of life were downloaded from the SCOP database (Andreeva et al. 2014), and a phylogenetic tree (GS tree) was reconstructed using the GS method (Fig. 5a). Proteins of similar enzymatic function were properly clustered in the GS tree, and the EP values of many branches were greater than 0.9, showing high statistical robustness. The BI method was also applied to this data set to reconstruct another phylogenetic tree (BI tree), and most of the tree’s branches were supported by small posterior probabilities (Fig. 5b). Specifically, there were several notable topological discrepancies between the GS and BI trees (Fig. 5c). For example, dihydroorotate (DHO) dehydrogenases formed a single cluster in the GS tree but were split into three clusters in the BI tree. To evaluate which topology is more likely, we examined protein 3D structure data, which should reflect the distant evolutionary relationships more sensitively than the primary amino-acid sequences (Thornton et al. 1999). By quantifying structural similarities by RMSDs (Kabsch 1978) and TM-scores (Zhang and Skolnick 2004), we found that proteins with similar structures were more significantly clustered within the GS tree than the BI tree (⁠|$p = 1.55$|e|$^{-9}$| and 5.50e|$^{-10}$| for RMSD and TM-score, respectively; two-sided Wilcoxon rank-sum test) (Fig. 6a). As a typical example, we examined structures of DHO dehydrogenases, dihydropyrimidine (DPY) dehydrogenases, and glutamate synthases (Fig. 6b and Supplementary Movie S1 available on Dryad). Consistent with the GS tree, the structural alignment supported a close relationship between the DHO and DPY dehydrogenases (RMSD 2.81 Å, TM-score 0.78), closer than the relationship between DHO dehydrogenases and glutamate synthases (RMSD 3.71 Å, TM-score 0.59). Secondary structural analysis also supported a closer relationship between DHO and DPY dehydrogenases (Supplementary Fig. S11 available on Dryad). Another example was a specific group of flavin oxidoreductases (denoted by Flavin oxidoreductases [Group 1] in Fig. 5), whose positions were very different in the GS and BI trees. Again, consistent with the GS tree, the structural alignment supported a closer relationship to DHO dehydrogenases (RMSD 3.36 Å, TM-score 0.62) than to another group of flavin oxidoreductases (RMSD 3.64 Å, TM-score 0.68) (Fig. 6c and Supplementary Movie S2 available on Dryad). This result suggests that those two groups of flavin oxidoreductases likely originated independently, and their secondary structures actually show differences (Supplementary Fig. S11 available on Dryad). A comparison to the NJ|$_{{\rm MSA}\_{\rm p}}$|⁠, ML, and MP methods also showed that the GS tree better reflects the structural similarities overall (Supplementary Figs. S12–S14 available on Dryad).

Application of the GS method to TIM-barrel superfamily proteins. a) Phylogenetic tree of TIM-barrel superfamily proteins reconstructed using the GS method. The red branches are those supported by large EP values ($\ge $0.9). The inner circle represents functional classes. The outer circle summarizes those classes into several groups for comparison. b) Phylogenetic tree reconstructed using the BI method. The red branches are those supported by large posterior probabilities ($\ge $ 0.9). c) Summary of topological differences between the GS and BI trees. The PDB IDs of the proteins shown in Figure 6 are presented for reference. (S)-GGGP synthase, (S)-3-O-geranylgeranylglyceryl phosphate synthase; DEC reductase, 2, 4-dienoyl-CoA reductase; DHO dehydrogenase, dihydroorotate dehydrogenase; DPY dehydrogenase, dihydropyrimidine dehydrogenase; HepGP synthase, heptaprenylglyceryl phosphate synthase; IDP isomerase, isopentenyl-diphosphate delta-isomerase; OP reductase, 12-oxophytodienoate reductase; PENT reductase, pentaerythritol tetranitrate reductase; TMA dehydrogenase, trimethylamine dehydrogenase.
Figure 5.

Application of the GS method to TIM-barrel superfamily proteins. a) Phylogenetic tree of TIM-barrel superfamily proteins reconstructed using the GS method. The red branches are those supported by large EP values (⁠|$\ge $|0.9). The inner circle represents functional classes. The outer circle summarizes those classes into several groups for comparison. b) Phylogenetic tree reconstructed using the BI method. The red branches are those supported by large posterior probabilities (⁠|$\ge $| 0.9). c) Summary of topological differences between the GS and BI trees. The PDB IDs of the proteins shown in Figure 6 are presented for reference. (S)-GGGP synthase, (S)-3-O-geranylgeranylglyceryl phosphate synthase; DEC reductase, 2, 4-dienoyl-CoA reductase; DHO dehydrogenase, dihydroorotate dehydrogenase; DPY dehydrogenase, dihydropyrimidine dehydrogenase; HepGP synthase, heptaprenylglyceryl phosphate synthase; IDP isomerase, isopentenyl-diphosphate delta-isomerase; OP reductase, 12-oxophytodienoate reductase; PENT reductase, pentaerythritol tetranitrate reductase; TMA dehydrogenase, trimethylamine dehydrogenase.

Assessment of the GS tree by 3D structural analysis of TIM-barrel superfamily proteins. a) Violin plots of root-mean-square deviations (top) and TM scores (bottom) averaged over every protein pair in every subtree of the GS and BI trees. The red points and gray boxes indicate the averages and standard deviations, respectively. b) A stereo (parallel) view of 3D structural alignments of a DHO dehydrogenase (red, 4OQV), DPY dehydrogenase (blue, 1GTE) and glutamate synthase (green, 1EA0). Note that $\alpha $$_{1}$–$\alpha $$_{8}$ are characteristic alpha-helices of the ($\beta \alpha $)$_{8}$-barrel structure. Black letters indicate helices that showed particularly large differences in the 3D structural alignment. c) A stereo (parallel) view of 3D structural alignments of flavin oxidoreductase (group 1) (yellow, 1VHN), DHO dehydrogenase (red, 4OQV), and flavin oxidoreductase (group 2) (black, 1Z41).
Figure 6.

Assessment of the GS tree by 3D structural analysis of TIM-barrel superfamily proteins. a) Violin plots of root-mean-square deviations (top) and TM scores (bottom) averaged over every protein pair in every subtree of the GS and BI trees. The red points and gray boxes indicate the averages and standard deviations, respectively. b) A stereo (parallel) view of 3D structural alignments of a DHO dehydrogenase (red, 4OQV), DPY dehydrogenase (blue, 1GTE) and glutamate synthase (green, 1EA0). Note that |$\alpha $||$_{1}$||$\alpha $||$_{8}$| are characteristic alpha-helices of the (|$\beta \alpha $|)|$_{8}$|-barrel structure. Black letters indicate helices that showed particularly large differences in the 3D structural alignment. c) A stereo (parallel) view of 3D structural alignments of flavin oxidoreductase (group 1) (yellow, 1VHN), DHO dehydrogenase (red, 4OQV), and flavin oxidoreductase (group 2) (black, 1Z41).

Discussion

In this era of sequence data deluge, we envision that the GS method will contribute to deciphering the early evolution of various protein superfamilies. In this study, we applied the GS method to the TIM-barrel protein superfamily and obtained a highly resolved phylogenetic tree. DHO and DPY dehydrogenases are involved in pyrimidine biosynthesis. Under the RNA-world hypothesis, protein-mediated pyrimidine biosynthesis is presumed to be one of the most primordial pathways that was needed during a transitional phase from the RNA world to the modern DNA-RNA-protein world (Caetano-Anollés et al. 2007). Both the monophyly of the DHO dehydrogenases and the proximity of the DHO and DPY dehydrogenases would suggest an early appearance of these enzymes with a common origin, consistent with the RNA-world hypothesis. Alternatively, the polyphyly of the flavin oxidoreductases may reflect the universality of flavin as a coenzyme and the importance of oxidation-reduction reactions in various contexts (Walsh and Wencewicz 2013). Although many TIM-barrel proteins are multi-domain proteins and phylogenetic analyses of a single domain cannot decipher the entire evolutionary history of the TIM-barrel protein superfamily, the GS method is a promising scheme for obtaining clues to early evolution of protein superfamilies. To assess how the GS method can be generally applied to other protein superfamilies, we downloaded sequence data from the SCOPe database, which is the extension of the SCOP database and comprehensively and hierarchically classifies proteins by considering structure information. When compared with the NJ trees that are based on the same superfamily data sets, the GS trees have more strongly supported deep branches, which may be further examined using multiple approaches (Fig. 7).

Application of the GS method to protein superfamily data sets obtained from the SCOPe database. Phylogenetic trees of six protein superfamilies were reconstructed using both the GS and NJ$_{{\rm MSA}\_{\rm p}}$ methods. The red branches of the GS and NJ trees are those supported by large EP ($\ge $0.9) and bootstrap ($\ge $0.8) values, respectively. The outer circles represent SCOP concise classification strings (functional classes) according to the color legend on each right side (for description, see Supplementary Table S1 available on Dryad). A number in a dotted box indicates an average sequence similarity score of each superfamily proteins. A number below each tree indicates a ratio of interior branches that are supported.
Figure 7.

Application of the GS method to protein superfamily data sets obtained from the SCOPe database. Phylogenetic trees of six protein superfamilies were reconstructed using both the GS and NJ|$_{{\rm MSA}\_{\rm p}}$| methods. The red branches of the GS and NJ trees are those supported by large EP (⁠|$\ge $|0.9) and bootstrap (⁠|$\ge $|0.8) values, respectively. The outer circles represent SCOP concise classification strings (functional classes) according to the color legend on each right side (for description, see Supplementary Table S1 available on Dryad). A number in a dotted box indicates an average sequence similarity score of each superfamily proteins. A number below each tree indicates a ratio of interior branches that are supported.

The better performance of the GS method can be regarded as an extension of the previous reports of the good performance of the PSA-based NJ method on data sets that contain remote homologs (Thorne and Kishino 1992; Xia 2016). The reduced performances of the ML, BI, and NJ (except for NJ|$_{\rm PSA})$| methods correlate with the reduced accuracy of MSA (Supplementary Fig. S15 available on Dryad). In addition, the observation that the accuracies of the NJ methods follow the order NJ|$_{\rm PSA} >$| NJ|$_{{\rm MSA}\_{\rm p}} >$| NJ|$_{{\rm MSA}\_{\rm c}}$| also suggests that the deterioration of MSA quality substantially inhibits correct phylogenetic tree reconstruction (Supplementary Fig. S16 available on Dryad) (Xia 2016). Direct comparisons of the alignment accuracies of PSA and MSA were also conducted (Supplementary Fig. S17 available on Dryad). When sequences are substantially diverged, precision of the aligned residues (⁠|$f_{M})$| based on PSA becomes much better than that based on MSA, indicating that MSA tends to “forcibly” align residues too much and introduces substantial noise that impedes downstream analyses. Recall (⁠|$f_{D})$| based on PSA is, in contrast, slightly worse than that based on MSA, probably because information from multiple sequences sometimes helps MSA find correct alignment among several candidates. These results indicate that the former effect is more important than the latter in extracting informative phylogenetic signals from superfamily-scale data sets. It should also be noted that comparisons of the PSA-based and MSA-based GS methods show that the former outperforms the latter, indicating that the algorithm also contributes to the accuracy of the GS method (Supplementary Fig. S18 available on Dryad). A change of |$N_{\rm cut}$| to |$M_{\rm cut}$| (Ding et al. 2001) did not have a significant effect on the accuracy of the GS method (data not shown). A notable difference between the GS method and the other methods is that the GS method accepts only protein sequences as input data. In theory, all-to-all PSA and subsequent spectral clustering can also be applied to nucleotide sequences; however, because extracting remote evolutionary information from nucleotide sequences is thought to be difficult (Zhang and Kumar 1997), the GS method is currently optimized for protein sequences.

In addition to elucidation of the early evolution of protein superfamilies, the GS method can also help us solve “modern” problems, such as the extremely rapid molecular evolution of viruses (Clementi et al. 2004), drug-resistant microbes (Baym et al. 2016), and cancer cells (Gerlinger et al. 2014). As another promising application of the GS method, it can improve the performance of MSA methods by providing guide trees that empower their progressive algorithms (Yamada et al. 2016), because the GS method estimates trees without a precalculated MSA. We actually used the GS method to provide guide trees for MAFFT and Clustal |$\Omega $| (Sievers et al. 2011) and observed improvement in MSA qualities (Fig. 8a). This improvement as guide trees for MSA was similarly observed when PhyPA (Xia 2016) was used if computational speed was ignored (Supplementary Fig. S19 available on Dryad); thus, PSA-based phylogenetic methods can improve MSA performance. Furthermore, the GS method can provide the ML and BI methods with initial trees, which serve as promising starting points for efficiently searching a large space of tree topologies (Ronquist et al. 2012; Stamatakis 2014). Although we did not observe significant improvement in the ML and BI methods using GS-based initial trees, we did find that using MSA with a GS-based guide tree improves performance (Fig. 8b). Because GS trees reconstruct tree topologies without estimating branch lengths, combining GS and ML or BI methods are expected to be especially effective if branch lengths need to be estimated. We have provided a web server for running the GS and EP methods and for downloading the program source codes: http://gs.bs.s.u-tokyo.ac.jp/ and https://github.com/MotomuMatsui.

Effects of providing GS tree for phylogenetics as guide and initial trees. a) Effects of providing GS trees as guide trees for MSA. The $x$-axis represents average sequence similarity scores. The $y$-axis represents differences in accuracies of MSA with and without a GS tree as a guide tree (measured by $f_{D})$. Red lines are LOWESS curves. b) Effects of providing GS trees as initial trees in the ML and BI methods. LOWESS curves of correctly recovered branch ratios of phylogenetic trees are plotted against average sequence similarity scores. For the ML (top) and BI (bottom) methods, reconstruction was performed using the default method (blue), using a GS tree as an initial tree (dashed blue), using a GS-tree-guided MSA (black), and using a GS tree as an initial tree and a GS-tree-guided MSA (dashed black). For comparison, the results of the GS method are also shown (red).
Figure 8.

Effects of providing GS tree for phylogenetics as guide and initial trees. a) Effects of providing GS trees as guide trees for MSA. The |$x$|-axis represents average sequence similarity scores. The |$y$|-axis represents differences in accuracies of MSA with and without a GS tree as a guide tree (measured by |$f_{D})$|⁠. Red lines are LOWESS curves. b) Effects of providing GS trees as initial trees in the ML and BI methods. LOWESS curves of correctly recovered branch ratios of phylogenetic trees are plotted against average sequence similarity scores. For the ML (top) and BI (bottom) methods, reconstruction was performed using the default method (blue), using a GS tree as an initial tree (dashed blue), using a GS-tree-guided MSA (black), and using a GS tree as an initial tree and a GS-tree-guided MSA (dashed black). For comparison, the results of the GS method are also shown (red).

Supplementary Material

Data available from the Dryad Digital Repository: http://dx.doi.org/10.5061/dryad.ps0qf4r.

Funding

This work was supported by the Japan Society for the Promotion of Science [15J07743, 16H06154, 17H05834, 18H02493, 18H04136, and 17K07509], the Ministry of Education, Culture, Sports, Science and Technology in Japan [221S0002 and 16H06279], and the Canon Foundation.

Acknowledgments

The authors thank members of the Iwasaki laboratory, the editor, and three anonymous reviewers for helpful comments on this research. Computations were partially performed on the NIG supercomputer at the ROIS National Institute of Genetics and the SuperComputer System, Institute for Chemical Research, Kyoto University. The GS method source code and web server are freely available at the project web site (http://gs.bs.s.u-tokyo.ac.jp/).

References

Altschul
S.F.
,
Gish
W.
,
Miller
W.
,
Myers
E.W.
,
Lipman
D.J.
1990
.
Basic local alignment search tool
.
J. Mol. Biol.
215
:
403
410
.

Andreeva
A.
,
Howorth
D.
,
Chothia
C.
,
Kulesha
E.
,
Murzin
A.G.
2014
.
SCOP2 prototype: a new approach to protein structure mining
.
Nucleic Acids Res.
42
:
D310
D314
.

Bansal
V.
,
Bafna
V.
2008
.
HapCUT: an efficient and accurate algorithm for the haplotype assembly problem
.
Bioinformatics.
24
:
i153
i159
.

Barrat
A.
,
Barthélemy
M.
,
Pastor-Satorras
R.
,
Vespignani
A.
2004
.
The architecture of complex weighted networks
.
Proc. Natl. Acad. Sci. USA.
101
:
3747
3752
.

Bastien
O.
,
Maréchal
E.
2008
.
Evolution of biological sequences implies an extreme value distribution of type I for both global and local pairwise alignment scores
.
BMC Bioinformatics.
9
:
332
.

Baym
M.
,
Lieberman
T.D.
,
Kelsic
E.D.
,
Chait
R.
,
Gross
R.
,
Yelin
I.
,
Kishony
R.
2016
.
Spatiotemporal microbial evolution on antibiotic landscapes
.
Science.
353
:
1147
1151
.

Benner
S.A.
,
Cohen
M.A.
,
Gonnet
G.H.
1993
.
Empirical and structural models for insertions and deletions in the divergent evolution of proteins
.
J. Mol. Biol.
229
:
1065
1082
.

Bergsten
J.
2005
.
A review of long-branch attraction
.
Cladistics.
21
:
163
193
.

Berman
H.M.
,
Westbrook
J.
,
Feng
Z.
,
Gilliland
G.
,
Bhat
T.N.
,
Weissig
H.
,
Shindyalov
I.N.
,
Bourne
P.E.
2000
.
The protein data bank
.
Nucleic Acids Res.
28
:
235
242
.

Besenbacher
S.
,
Mailund
T.
,
Westh-Nielsen
L.
,
Pedersen
C.
2005
.
RBT—a tool for building refined Buneman trees
.
Bioinformatics.
21
:
1711
1712
.

Bryant
D.
,
Moulton
V.
1999
.
A polynomial time algorithm for constructing the refined Buneman tree
.
Appl Math Lett.
12
:
51
56
.

Buneman
P.
1971
.
The recovery of trees from measures of dissimilarity
. In:
Hodson
F.R.
,
Kendall
D.G.
,
Tăutu
P.
, editors.
Mathematics in the Archaeological and Historical Sciences
.
Edinburgh
:
Edinburgh University Press
. p.
387
395
.

Caetano-Anollés
G.
,
Kim
H.S.
,
Mittenthal
J.E.
2007
.
The origin of modern metabolic networks inferred from phylogenomic analysis of protein architecture
.
Proc. Natl. Acad. Sci. USA.
104
:
9358
9363
.

Camin
J.H.
,
Sokal
R.R.
1965
.
A method for deducing branching sequences in phylogeny
.
Evolution.
19
:
311
326
.

Chan
C.X.
,
Ragan
M.A.
2013
.
Next-generation phylogenomics
.
Biol. Direct.
8
:
3
.

Clementi
M.
,
Canducci
F.
,
Bagnarelli
P.
,
Menzo
S.
2004
.
Intra-host evolution of human immunodeficiency virus type 1 and viral fitness
.
New Microbiol.
27
:
41
44
.

Copley
R.R.
,
Bork
P.
2000
.
Homology among (⁠|$\beta \alpha )$|8 barrels: implications for the evolution of metabolic pathways
.
J. Mol. Biol.
303
:
627
641
.

Corel
E.
,
Lopez
P.
,
Méheust
R.
,
Bapteste
E.
2016
.
Network-thinking: graphs to analyze microbial complexity and evolution
.
Trends Microbiol.
24
:
224
237
.

Criscuolo
A.
,
Gascuel
O.
2008
.
Fast NJ-like algorithms to deal with incomplete distance matrices
.
BMC Bioinformatics.
9
:
166
.

Csardi
G.
,
Nepusz
T.
2006
.
The igraph software package for complex network research
.
InterJournal Complex Systems.
1695
:
1
9
.

Ding
C.H.Q.
,
He
X.
,
Zha
H.
,
Gu
M.
,
Simon
H.
2001
.
A min-max cut algorithmfor graph partitioning and data clustering
. In:
Proceedings of 2001 IEEE International Conference on Data Mining; 2001 November 29-December 2; San Jose, California
. p.
107
114
.

Dufour
Y.S.
,
Kiley
P.J.
,
Donohue
T.J.
2010
.
Reconstruction of the core and extended regulons of global transcription factors
.
PLoS Genet.
6
:
e1001027
.

Felsenstein
J.
1981
.
Evolutionary trees from DNA sequences: a maximum likelihood approach
.
J. Mol. Evol.
17
:
368
376
.

Felsenstein
J.
1985
.
Confidence-limits on phylogenies: an approach using the bootstrap
.
Evolution.
39
:
783
791
.

Fletcher
W.
,
Yang
Z.
2009
.
INDELible: a flexible simulator of biological sequence evolution
.
Mol. Biol. Evol.
26
:
1879
1888
.

Gascuel
O.
1997
.
BIONJ: an improved version of the NJ algorithm based on a simple model of sequence data
.
Mol. Biol. Evol.
14
:
685
695
.

Gaucher
E.A.
,
Kratzer
J.T.
,
Randall
R.N.
2010
.
Deep phylogeny—how a tree can help characterize early life on Earth
.
Cold Spring Harb. Perspect. Biol.
2
:
a002238
.

Gerlinger
M.
,
McGranahan
N.
,
Dewhurst
S.M.
,
Burrell
R.A.
,
Tomlinson
I.
,
Swanton
C.
2014
.
Cancer: evolution within a lifetime
.
Ann. Rev. Genet.
48
:
215
236
.

Goldman
A.D.
,
Beatty
J.T.
,
Landweber
L.F.
2016
.
The TIM barrel architecture facilitated the early evolution of protein-mediated metabolism
.
J. Mol. Evol.
82
:
17
26
.

Guindon
S.
,
Dufayard
J.F.
,
Lefort
V.
,
Anisimova
M.
,
Hordijk
W.
,
Gascuel
O.
2010
.
New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0
.
Syst. Biol
.
59
:
307
321
.

Jachiet
P.A.
,
Pogorelcnik
R.
,
Berry
A.
,
Lopez
P.
,
Bapteste
E.
2013
.
MosaicFinder: identification of fused gene families in sequence similarity networks
.
Bioinformatics.
29
:
837
844
.

Kabsch
W.
1978
.
A discussion of the solution for the best rotation to relate two sets of vectors
.
Acta Crystallogr. A.
34
:
827
828
.

Katoh
K.
,
Standley
D.M.
2013
.
MAFFT multiple sequence alignment software version 7: improvements in performance and usability
.
Mol. Biol. Evol.
30
:
772
780
.

Letunic
I.
,
Bork
P.
2011
.
Interactive Tree of Life v2: online annotation and display of phylogenetic trees made easy
.
Nucleic Acids Res.
39
:
W475
W478
.

Matsuda
K.
,
Nishioka
T.
,
Kinoshita
K.
,
Kawabata
T.
,
Go
N.
2003
.
Finding evolutionary relations beyond superfamilies: fold-based superfamilies
.
Protein Sci.
12
:
2239
2251
.

Minami
S.
,
Sawada
K.
,
Chikenji
G.
2013
.
MICAN: a protein structure alignment algorithm that can handle multiple-chains, inverse alignments, C(⁠|$\alpha )$| only models, alternative alignments, and non-sequential alignments
.
BMC Bioinformatics.
14
:
24
.

Nagano
N.
,
Orengo
C.A.
,
Thornton
J.M.
2002
.
One fold with many functions: the evolutionary relationships between TIM barrel families based on their sequences, structures and functions
.
J. Mol. Biol.
321
:
741
765
.

Nguyen
L.T.
,
Schmidt
H.A.
,
von Haeseler
A.
,
Minh
B.Q.
2015
.
IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies
.
Mol. Biol. Evol.
32
:
268
274
.

Nuin
P.A.
,
Wang
Z.
,
Tillier
E.R.
2006
.
The accuracy of several multiple sequence alignment programs for proteins
.
BMC Bioinformatics.
7
:
471
.

Ogden
T.H.
,
Rosenberg
M.S.
2006
.
Multiple sequence alignment accuracy and phylogenetic inference
.
Syst. Biol.
55
:
314
328
.

Paccanaro
A.
,
Casbon
J.A.
,
Saqi
M.A.
2006
.
Spectral clustering of protein sequences
.
Nucleic Acids Res.
34
:
1571
1580
.

Paradis
E.
,
Claude
J.
,
Strimmer
K.
2004
.
APE: analyses of phylogenetics and evolution in R language
.
Bioinformatics.
20
:
289
290
.

Pettersen
E.F.
,
Goddard
T.D.
,
Huang
C.C.
,
Couch
G.S.
,
Greenblatt
D.M.
,
Meng
E.C.
,
Ferrin
T.E.
2004
.
UCSF Chimera—a visualization system for exploratory research and analysis
.
J. Comput. Chem.
25
:
1605
1612
.

Price
M.N.
,
Dehal
P.S.
,
Arkin
A.P.
2010
.
FastTree 2—approximately maximum-likelihood trees for large alignments
.
PLoS One.
5
:
e9490
.

R Core Team
.
2015
.
R: a language and environment for statistical computing. Vienna (Austria): R Foundation for Statistical Computing
.

Robinson
D.F.
,
Foulds
L.R.
1981
.
Comparison of phylogenetic trees
.
Math. Biosci.
53
:
131
147
.

Rojas
A.M.
,
Fuentes
G.
,
Rausell
A.
,
Valencia
A.
2012
.
The Ras protein superfamily: evolutionary tree and role of conserved amino acids
.
J. Cell Biol.
196
:
189
201
.

Ronquist
F.
,
Teslenko
M.
,
van der Mark
P.
,
Ayres
D.L.
,
Darling
A.
,
Hohna
S.
,
Larget
B.
,
Liu
L.
,
Suchard
M.A.
,
Huelsenbeck
J.P.
2012
.
MrBayes 3.2: efficient Bayesian phylogenetic inference and model choice across a large model space
.
Syst. Biol.
61
:
539
542
.

Saitou
N.
,
Nei
M.
1987
.
The neighbor-joining method: a new method for reconstructing phylogenetic trees
.
Mol. Biol. Evol.
4
:
406
425
.

Schliep
K.P.
2011
.
phangorn: phylogenetic analysis in R
.
Bioinformatics.
27
:
592
593
.

Shi
J.B.
,
Malik
J.
2000
.
Normalized cuts and image segmentation
.
IEEE Trans. Pattern Anal. Mach. Intell.
22
:
888
905
.

Sievers
F.
,
Wilm
A.
,
Dineen
D.
,
Gibson
T.J.
,
Karplus
K.
,
Li
W.
,
Lopez
R.
,
McWilliam
H.
,
Remmert
M.
,
Söding
J.
,
Thompson
J.D.
,
Higgins
D.G.
2011
.
Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega
.
Mol. Syst. Biol.
7
:
539
.

Smoot
M.E.
,
Ono
K.
,
Ruscheinski
J.
,
Wang
P.L.
,
Ideker
T.
2011
.
Cytoscape 2.8: new features for data integration and network visualization
.
Bioinformatics
.
27
:
431
432
.

Stajich
J.E.
,
Block
D.
,
Boulez
K.
,
Brenner
S.E.
,
Chervitz
S.A.
,
Dagdigian
C.
,
Fuellen
G.
,
Gilbert
J.G.
,
Korf
I.
,
Lapp
H.
,
Lehväslaiho
H.
,
Matsalla
C.
,
Mungall
C.J.
,
Osborne
B.I.
,
Pocock
M.R.
,
Schattner
P.
,
Senger
M.
,
Stein
L.D.
,
Stupka
E.
,
Wilkinson
M.D.
,
Birney
E.
2002
.
The Bioperl toolkit: Perl modules for the life sciences
.
Genome Res.
12
:
1611
1618
.

Stamatakis
A.
2014
.
RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies
.
Bioinformatics.
30
:
1312
1313
.

Steinegger
M.
,
Söding
J.
2017
.
MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets
.
Nat. Biotechnol.
35
:
1026
1028
.

Tan
G.
,
Muffato
M.
,
Ledergerber
C.
,
Herrero
J.
,
Goldman
N.
,
Gil
M.
,
Dessimoz
C.
2015
.
Current methods for automated filtering of multiple sequence alignments frequently worsen single-gene phylogenetic inference
.
Syst. Biol.
64
:
778
791
.

Thorne
J.L.
,
Kishino
H.
1992
.
Freeing phylogenies from artifacts of alignment
.
Mol. Biol. Evol.
9
:
1148
1162
.

Thornton
J.M.
,
Orengo
C.A.
,
Todd
A.E.
,
Pearl
F.M.
1999
.
Protein folds, functions and evolution
.
J. Mol. Biol.
293
:
333
342
.

Walsh
C.T.
,
Wencewicz
T.A.
2013
.
Flavoenzymes: versatile catalysts in biosynthetic pathways
.
Nat. Prod. Rep.
30
:
175
200
.

Warnow
T.
2013
.
Large-scale multiple sequence alignment and phylogeny estimation
. In:
Chauve
C.
,
El-Mabrouk
N.
,
Tannier
E.
, editors.
Models and algorithms for genome evolution
.
London
:
Springer
. p.
85
146
.

Whelan
S.
,
Goldman
N.
2001
.
A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach
.
Mol. Biol. Evol.
18
:
691
699
.

Xia
X.
2016
.
PhyPA: phylogenetic method with pairwise sequence alignment outperforms likelihood methods in phylogenetics involving highly diverged sequences
.
Mol. Phylogenet. Evol.
102
:
331
343
.

Xia
X.
2018
.
DAMBE7: new and improved tools for data analysis in molecular biology and evolution
.
Mol. Biol. Evol.
35
:
1550
1552
.

Yamada
K.D.
,
Tomii
K.
,
Katoh
K.
2016
.
Application of the MAFFT sequence alignment program to large data-reexamination of the usefulness of chained guide trees
.
Bioinformatics.
32
:
3246
3251
.

Yang
Z.H.
,
Rannala
B.
1997
.
Bayesian phylogenetic inference using DNA sequences: A Markov Chain Monte Carlo method
.
Mol. Biol. Evol.
14
:
717
724
.

Zhang
J.
,
Kumar
S.
1997
.
Detection of convergent and parallel evolution at the amino acid sequence level
.
Mol. Biol. Evol.
14
:
527
536
.

Zhang
S.B.
,
Zhou
S.Y.
,
He
J.G.
,
Lai
J.H.
2011
.
Phylogeny inference based on spectral graph clustering
.
J. Comput. Biol.
18
:
627
637
.

Zhang
Y.
,
Skolnick
J.
2004
.
Scoring function for automated assessment of protein structure template quality
.
Proteins.
57
:
702
710
.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected]
Associate Editor: Michael Charelston
Michael Charelston
Associate Editor
Search for other works by this author on: