Estimating amino acid substitution models from genome datasets: a simulation study on the performance of estimated models

Tinh, Nguyen Huy; Dang, Cuong Cao; Vinh, Le Sy

doi:10.1093/jeb/voad017

Abstract

Estimating parameters of amino acid substitution models is a crucial task in bioinformatics. The maximum likelihood (ML) approach has been proposed to estimate amino acid substitution models from large datasets. The quality of newly estimated models is normally assessed by comparing with the existing models in building ML trees. Two important questions remained are the correlation of the estimated models with the true models and the required size of the training datasets to estimate reliable models. In this article, we performed a simulation study to answer these two questions based on simulated data. We simulated genome datasets with different numbers of genes/alignments based on predefined models (called true models) and predefined trees (called true trees). The simulated datasets were used to estimate amino acid substitution model using the ML estimation methods. Our experiments showed that models estimated by the ML methods from simulated datasets with more than 100 genes have high correlations with the true models. The estimated models performed well in building ML trees in comparison with the true models. The results suggest that amino acid substitution models estimated by the ML methods from large genome datasets are a reliable tool for analyzing amino acid sequences.

Open in new tab Download slide

amino acid substitution models, time-reversible models, time-nonreversible models, maximum likelihood estimation methods, simulated amino acid data

Introduction

The amino acid (AA) substitution model describes the substitution rates among 20 AAs during the evolution. The substitutions among AAs are typically modelled by a time-continuous, time-homogeneous Markov process. The model is continuous in time indicating that the AA substitutions can happen at any point during the evolution process. The time-homogeneous property assumes that the substitution rates among AAs remain constant throughout the entire process. The AA substitution process is irreducible; therefore, it has a stationary distribution, i.e., the frequencies of AAs during the evolution are at equilibrium. The core component of the Markov process is the transition rate matrix Q representing the number of substitutions from one AA to another AA per time unit. The transition rates can be inferred from the exchangeability rates (i.e., the rate of change) between AAs and their frequencies. The AA substitution process might be assumed to be time-reversible indicating that the exchangeability rates between two AAs are the same in both directions. This assumption reduces the computational effort in estimating and using the models, however, the time-reversible property might be not biologically realistic and does not allow us to reconstruct rooted trees.

The AA substitution models play a vital role in analyzing the evolutionary relationships among protein sequences. They are used to calculate the likelihood values of phylogenetic trees to determine the maximum likelihood (ML) tree. They can be also employed as the score matrices to determine the similarity between protein sequences in sequence similarity searches or multiple sequence alignment algorithms.

The nucleotide substitution model has a small number of parameters that can be directly estimated from the alignment under the study. The AA substitution model contains a much larger number of parameters, i.e., 189 parameters for time-reversible models or 379 parameters for time-nonreversible models. A single-protein alignment does not provide enough information to estimate the parameters of the AA model. The parameters of the AA substitution model can be better estimated by using more data (Müller et al., 2002). Therefore, a big dataset of protein alignments is required to estimate the parameters of the AA substitution model to overcome the overfitting problem when estimating a large number of parameters (Dang et al., 2022; Le & Gascuel, 2008; Minh et al., 2021; Vinh, 2021; Whelan & Goldman, 2001).

A number of models have been empirically estimated from general datasets containing a large number of protein alignments of various species such as WAG (Whelan & Goldman, 2001) or LG (Le & Gascuel, 2008). Several clade-specific models have been estimated for different clades such as Q.mammal, Q.insect, Q.bird, Q.yeast, or Q.plant for mammals, insects, birds, yeasts, and plants, respectively (Minh et al., 2021). The empirically estimated models are called fixed models as their parameters are fixed.

In ML analyses, the best-fitting model for a specific alignment is normally selected from a set of available fixed models. However, it is possible that none of the available fixed models is appropriate for the alignment under the study. A number of methods have been proposed to enhance the use of fixed AA substitution models in Bayesian inferences such as average inferences over a set of fixed models (Huelsenbeck et al., 2008).

Several ML methods have been proposed to estimate AA substitution models (Dang et al., 2014, 2022; Le & Gascuel, 2008; Minh et al., 2021; Whelan & Goldman, 2001). The ML method QMaker has been introduced to estimate the time-reversible models from whole genome datasets (Minh et al., 2021). The authors applied QMaker to estimate several time-reversible models from genomes of plants, birds, yeasts, mammals, and insects. The clade-specific models outperform the existing models in building ML trees for their corresponding data.

The time-reversible assumption is mathematically and computationally convenient, however, studies revealed that the assumption may be violated (Naser-Khdour et al., 2019; Squartini & Arndt, 2008). Recently, the nQMaker method has been proposed to estimate time-nonreversible models from genome datasets (Dang et al., 2022). Experiments show that the time-nonreversible models are likely better than the time-reversible models in analyzing large alignments. Crucially, the time-nonreversible models allow us to reconstruct rooted phylogenetic trees.

To date, the quality of a newly estimated model is normally assessed by comparing it with existing models in building ML trees. For a real training dataset, the true AA substitution model is unknown, therefore, we are unable to evaluate the correlation between the estimated model and the true model; or compare the performance of the estimated model and the true model in building ML trees. The required size of a training dataset to estimate a reliable model is an important open question.

In this article, we used the alignment simulation program AliSim (Ly-Trong et al., 2022) to simulate different datasets each containing N alignments/genes $(N = 1, 2, 5, 10, 20, 50, 100, 200, 250)$ generated from four predefined AA substitution models (i.e., Q.plant, NQ.plant, Q.bird, and NQ.bird) and two predefined phylogenetic trees (i.e., the plant tree (Ran et al., 2018) and the bird tree (Jarvis et al., 2015)). The predefined model and the predefined tree used to generate a simulated dataset are considered/called the true model and the true tree of the simulated dataset. For each simulated dataset, we used the QMaker and nQMaker programs (Dang et al., 2022; Minh et al., 2021) to estimate a time-reversible model and a time-nonreversible model, respectively. We performed a number of analyses to directly compare the estimated models with the true models. We also assessed the ability of the estimated models in building ML trees in comparison with the true models.

Materials and methods

Data

We used the AliSim program (Ly-Trong et al., 2022) to simulate AA alignments. The AliSim program requires five parameters to simulate an alignment: a predefined AA substitution model, a site rate heterogeneous model, a phylogenetic tree, and the number of sequences together with the sequence length of the alignment. We simulated alignments with both time-reversible AA substitution models (called time-reversible alignments) and time-nonreversible AA substitution models (called time-nonreversible alignments).

We used real parameters obtained from the plant genome dataset of 1,000 genes, each containing 35 sequences (Ran et al., 2018) and the bird genome dataset of 6,295 genes, each consisting of 48 sequences (Jarvis et al., 2015) to simulate plant and bird genome datasets.

- Simulating plant genome datasets: To create a simulated dataset of N alignments/genes $(N = 1, 2, 5, 10, 20, 50, 100, 200, 250)$ ⁠, we randomly selected N genes from the plant genome. For each selected gene, the sequence length, the time-reversible substitution model Q.plant, the site rate heterogeneity model (estimated from the selected gene), and the unrooted phylogenetic tree (Minh et al., 2021) were provided to simulate a time-reversible alignment. Similarly, we used the time-nonreversible substitution model NQ.plant and the rooted phylogenetic tree (Dang et al., 2022) to simulate a time-nonreversible alignment.
- Simulating bird genome datasets: We used the same above procedure to simulate bird genome datasets with different numbers of alignments $(N = 1, 2, 5, 10, 20, 50, 100, 200, 250)$ ⁠.

For each simulation condition (a combination of plant or bird with the number of alignments N), we produced five simulated datasets. Each simulated dataset was used to estimate one AA substitution model. These models are called simulated models. Thus, we obtained five simulated models for each simulation condition.

To compare the performance of simulated models and the true models in building ML trees, we used the testing datasets including 308 plant testing genes and a subset of 500 bird testing genes obtained from the QMaker’s paper (Minh et al., 2021).

Methods

Estimating parameters of amino acid substitution model

We assume that the evolutionary process is independent among AA sites. We use a Markov process to model the AA substitution process with time-homogeneous, time-continuous, and stationary assumptions. The core component of the model is the so-called $Q = {q_{x y}}$ matrix of 20 × 20 instantaneous substitution rates among AAs. Technically, $q_{x y}$ represents the number of substitutions from AA $x$ to AA $y$ per time unit. We assign the diagonal values $q_{x x} = - \sum_{y \neq x} q_{x y}$ such that the sum of all elements on row $x$ of $Q$ equals zero. In phylogenetic analyses, the branch length of phylogenetic trees indicates the number of AA substitutions per site, therefore, we normalize $Q$ so that the total number of substitutions is one. Technically, $Q$ is divided by the normalization factor $μ = - \sum_{x = 1}^{20} q_{x x} π_{x}$ ⁠, where $π_{x}$ is the equilibrium frequency of AA $x$ ⁠. The normalized matrix $Q$ has 379 parameters.

To reduce the computational burdens, we might assume that the AA substitution process is time-reversible (i.e., the exchangeability rates between AA $x$ and AA $y$ are equal in both directions). As a result, the matrix $Q$ can be decomposed into two parts: a symmetric exchangeability rate matrix $R = {r_{x y}}$ and an AA frequency vector $Π = {π_{x}}$ such that $q_{x y} = π_{y} r_{x y}$ ⁠. The R matrix contains 189 parameters and the $Π$ vector consists of 19 parameters.

A training dataset used to estimate an AA substitution model includes N alignments denoted by $D = {D_{1}, \dots, D_{N}}$ ⁠. Let $T = {T_{1}, \dots, T_{N}}$ be the set of N trees for the training alignments, i.e., $T_{i}$ is the tree of alignment $D_{i}$ ⁠. As AA substitution rates vary among sites, the best site rate model is selected for each alignment. Typically, a site rate model combines a distribution of rates, a proportion of invariant sites (Gu et al., 1995; Yang, 1993), or a distribution-free rate model (Yang, 1995). Let $P = {ρ_{1}, \dots, ρ_{N}$ } be the set of site rate models for the training alignments, i.e., $ρ_{i}$ is the site rate model for alignment $D_{i}$ ⁠.

The ML estimation method determines the parameters of model Q, the phylogenetic trees T, and the parameters of site rate models P to maximize the likelihood value $L (Q | T, P, D)$ ⁠. We can calculate the likelihood value over N alignments of D as following:

\begin{array}{l} L (Q | T, P, D) = \prod_{i = 1 \dots N} L (Q | T_{i}, ρ_{i}, D_{i}) \end{array}

Optimizing $L (Q | T, P, D)$ with a large number of parameters is a computationally expensive task. Fortunately, the previous studies show that the parameters of Q can be optimized based on nearly optimal trees T and site rate models P (Dang et al., 2014, 2022; Le & Gascuel, 2008; Minh et al., 2021). Thus, the parameters of Q, T, and P can be estimated iteratively instead of simultaneously to overcome the computational burdens. In this article, we used the recently published ML estimation methods QMaker and nQMaker (Dang et al., 2022; Minh et al., 2021) to estimate time-reversible models and time-nonreversible models from the simulated datasets, respectively.

Alignment simulation

We used the AliSim program (Ly-Trong et al., 2022) to simulate alignments based on parameters of real alignments. Given a real alignment $R_{i}$ ⁠, the procedure to simulate an alignment $D_{i} \in D$ includes two steps (see Figure 1) described as follows:

Figure 1.

The procedure to simulate an alignment using parameters from a real alignment, a predefined amino acid substitution model, a predefined phylogenetic tree, and a site rate heterogeneity model.

Open in new tab Download slide

- Parameter inference step: Consider a real alignment $R_{i}$ including $m_{i}$ sequences with sequence length $l_{i}$ ⁠. If $R_{i}$ is selected from plant (bird) genome dataset, set the predefined AA substitution model Q^defined by the Q.plant (Q.bird) model, and the predefined tree T^defined by the plant (bird) tree. We used the IQ-TREE2 program (Minh et al., 2020) to determine the best site-rate heterogeneity model $ρ_{i}$ for $R_{i}$ ⁠.
- Simulation step: The model Q^defined, the tree T^defined, and the site-rate model $ρ_{i}$ were subsequently used to simulate the alignment $D_{i}$ with $m_{i}$ sequences of length $l_{i}$ using the AliSim program.

To generate a dataset D of N alignments, we conducted the simulation procedure N times with N different real alignments.

Model analyses and comparison

Let Q^sim be the AA substitution model estimated from a simulated dataset. The model Q^sim is called a simulated model. More precisely, let Q^simN be the AA substitution model estimated from a simulated dataset of N alignments, e.g., Q^sim100 be the AA substitution model estimated from a simulated dataset of 100 alignments.

First, we calculated the Pearson correlation between the true substitution model and the estimated model. We also analyzed the difference between AA substitution rates of the true model and that of the simulated model.

Second, we compared the quality of the simulated model Q^sim and the true model Q^defined in building ML trees on testing alignments. For each testing alignment $A_{i}$ if the ML tree $T_{i}^{d e f i n e d}$ constructed with the true model Q^defined is better (worse) than the tree $T_{i}^{s i m}$ constructed with the simulated model Q^sim, the true model Q^defined is considered better (worse) than the simulated model Q^sim.

Third, we measured the similarity between trees constructed with the true models and those constructed with the simulated models using the Robinson–Foulds (RF) distance (Robinson & Foulds, 1981). The RF distance is the number of splits which exist in one tree but not in the other tree. The RF distance is normalized by dividing by the total number of splits. The normalized RF distance (nRF) ranges from 0 (i.e., two trees are identical) to 1 (i.e., two trees are completely different).

Results

Model analysis

For each simulation condition, the average correlation between the true model and five simulated models was analyzed. First, we compared the true models Q.plant with simulated models Q^sim.plant; and the true model NQ.plant with simulated models NQ^sim.plant with respect to different numbers of genes (see Figure 2). The simulated models Q^sim.plant and NQ^sim.plant estimated from ≥ 100 genes have high correlations with the true models. For example, the average correlation between the true model Q.plant and simulated models Q^sim100.plant is 0.997. The average correlation between the true model NQ.plant and simulated models NQ^sim100.plant is close to 1. We note that the correlations between the true models and simulated models estimated from 200 or 250 alignments are all close to 1.

Figure 2.

The average Pearson correlations between the true model and simulated models for different simulation conditions. The simulated models estimated from datasets with ≥ 100 genes are highly correlated with the true models.

Open in new tab Download slide

We observed similar results for the bird datasets. The simulated model Q^sim.bird estimated with ≥ 100 genes has high correlations with the true model Q.bird (i.e., 0.999 for 100 genes; 0.9997 for 200 genes, and 0.9999 for 250 genes). The correlations between simulated models and the true models get higher with the increase in training data size used to estimate the models. For example, the average correlations between the true model NQ.bird and the simulated models NQ^sim.bird with 10, 100, and 200 genes are 0.9954, 0.9995, and 0.9997, respectively.

We calculated the standard deviation of correlations between the true model and the simulated models for every simulation condition (see Table 1). All standard deviations are small and close to 0 for the simulation conditions with ≥ 100 genes.

Table 1.

Open in new tab

The standard deviation of correlations between the true models and simulated models for different simulation conditions.

#genes Models	1	2	5	10	20	50	100	200	250
Q.plant	0.0110	0.0136	0.0043	0.0032	0.0032	0.0009	0.0011	0.0009	0.0006
NQ.plant	0.0048	0.0082	0.0007	0.0016	0.0004	0.0001	0.0001	0.0001	0.0001
Q.bird	0.0291	0.0105	0.0063	0.0060	0.0024	0.0014	0.0006	0.0003	0.0002
NQ.bird	0.0143	0.0147	0.0023	0.0020	0.0007	0.0002	0.0001	0.0001	0.0001

#genes Models	1	2	5	10	20	50	100	200	250
Q.plant	0.0110	0.0136	0.0043	0.0032	0.0032	0.0009	0.0011	0.0009	0.0006
NQ.plant	0.0048	0.0082	0.0007	0.0016	0.0004	0.0001	0.0001	0.0001	0.0001
Q.bird	0.0291	0.0105	0.0063	0.0060	0.0024	0.0014	0.0006	0.0003	0.0002
NQ.bird	0.0143	0.0147	0.0023	0.0020	0.0007	0.0002	0.0001	0.0001	0.0001

Table 1.

Open in new tab

The standard deviation of correlations between the true models and simulated models for different simulation conditions.

#genes Models	1	2	5	10	20	50	100	200	250
Q.plant	0.0110	0.0136	0.0043	0.0032	0.0032	0.0009	0.0011	0.0009	0.0006
NQ.plant	0.0048	0.0082	0.0007	0.0016	0.0004	0.0001	0.0001	0.0001	0.0001
Q.bird	0.0291	0.0105	0.0063	0.0060	0.0024	0.0014	0.0006	0.0003	0.0002
NQ.bird	0.0143	0.0147	0.0023	0.0020	0.0007	0.0002	0.0001	0.0001	0.0001

#genes Models	1	2	5	10	20	50	100	200	250
Q.plant	0.0110	0.0136	0.0043	0.0032	0.0032	0.0009	0.0011	0.0009	0.0006
NQ.plant	0.0048	0.0082	0.0007	0.0016	0.0004	0.0001	0.0001	0.0001	0.0001
Q.bird	0.0291	0.0105	0.0063	0.0060	0.0024	0.0014	0.0006	0.0003	0.0002
NQ.bird	0.0143	0.0147	0.0023	0.0020	0.0007	0.0002	0.0001	0.0001	0.0001

Calculating the Pearson correlation coefficient is a simple way to measure the linear correlation between two matrices. However, we note that the Pearson correlation depends on the range of substitution rates of the matrices. It is sensitive to large substitution rates, i.e., large substitution rates have a high impact on the value of Pearson correlation between two matrices.

For more details, we directly analyzed the relative differences between substitution rates of the matrices. We counted the number of substitution rates in the simulated models that are at least 2 times or 5 times larger (smaller) than that in the true models. Table 2 shows the comparisons between the true model Q.plant and its highest correlation model Q^sim.plant for the plant datasets; and between the true model Q.bird and its highest correlation model Q^sim.bird for the bird datasets. For example, 29 out of 400 elements in the simulated model Q^sim100.plant are at least 2 times greater than that in the true model Q.plant; and 9 elements are at least 2 times smaller than that in Q.plant. The simulated models estimated from datasets with less than 100 genes contain a considerable number of substitution rates that are at least 2 times different from that in the true models. There are only few elements in the simulated models estimated from 200 or 250 genes that are 2 times or 5 times different from that in the true models. Figure 3 demonstrates the relative difference between the true model Q.plant and its highest correlation model Q^sim100.plant (Figure 3A), and between the true model Q.bird and its highest correlation model Q^sim100.bird (Figure 3B).

Table 2.

Open in new tab

The average number of substitution rates that 2 times larger (2×), 5 times larger (5×), 2 times smaller (−2×), and 5 times smaller (−5×) when comparing time reversible model Q.plant and simulated model Q^sim.plant; and between Q.bird and simulated model Q^sim.bird.

		Number of genes
Comparing models	Larger/Smaller	1	2	5	10	20	50	100	200	250
Q^sim.plant vs. Q.plant	2×	171.8	154.6	108.6	85.2	56.6	38.8	29.8	18.6	13.2
	5×	150	126	80.4	59.6	34.4	19.8	13	8.4	6.8
	−2×	48	48	32	33	18.8	11.4	9.8	8.8	6.4
	−5×	17.6	14.2	9.8	8.8	4.4	4.6	4.4	4.2	4
Q^sim.bird vs. Q.bird	2×	218.8	148.2	136.6	93.6	69.6	38.8	36.2	18.4	14.4
	5×	199	132.8	109.4	71.6	34.4	12.4	8	3.6	2
	−2×	39.4	49	34.6	27.2	12.8	5.8	0.6	1.8	1.2
	−5×	12.6	9.6	5.6	3.6	0.4	0.4	0	0.4	0

		Number of genes
Comparing models	Larger/Smaller	1	2	5	10	20	50	100	200	250
Q^sim.plant vs. Q.plant	2×	171.8	154.6	108.6	85.2	56.6	38.8	29.8	18.6	13.2
	5×	150	126	80.4	59.6	34.4	19.8	13	8.4	6.8
	−2×	48	48	32	33	18.8	11.4	9.8	8.8	6.4
	−5×	17.6	14.2	9.8	8.8	4.4	4.6	4.4	4.2	4
Q^sim.bird vs. Q.bird	2×	218.8	148.2	136.6	93.6	69.6	38.8	36.2	18.4	14.4
	5×	199	132.8	109.4	71.6	34.4	12.4	8	3.6	2
	−2×	39.4	49	34.6	27.2	12.8	5.8	0.6	1.8	1.2
	−5×	12.6	9.6	5.6	3.6	0.4	0.4	0	0.4	0

Table 2.

Open in new tab

The average number of substitution rates that 2 times larger (2×), 5 times larger (5×), 2 times smaller (−2×), and 5 times smaller (−5×) when comparing time reversible model Q.plant and simulated model Q^sim.plant; and between Q.bird and simulated model Q^sim.bird.

		Number of genes
Comparing models	Larger/Smaller	1	2	5	10	20	50	100	200	250
Q^sim.plant vs. Q.plant	2×	171.8	154.6	108.6	85.2	56.6	38.8	29.8	18.6	13.2
	5×	150	126	80.4	59.6	34.4	19.8	13	8.4	6.8
	−2×	48	48	32	33	18.8	11.4	9.8	8.8	6.4
	−5×	17.6	14.2	9.8	8.8	4.4	4.6	4.4	4.2	4
Q^sim.bird vs. Q.bird	2×	218.8	148.2	136.6	93.6	69.6	38.8	36.2	18.4	14.4
	5×	199	132.8	109.4	71.6	34.4	12.4	8	3.6	2
	−2×	39.4	49	34.6	27.2	12.8	5.8	0.6	1.8	1.2
	−5×	12.6	9.6	5.6	3.6	0.4	0.4	0	0.4	0

		Number of genes
Comparing models	Larger/Smaller	1	2	5	10	20	50	100	200	250
Q^sim.plant vs. Q.plant	2×	171.8	154.6	108.6	85.2	56.6	38.8	29.8	18.6	13.2
	5×	150	126	80.4	59.6	34.4	19.8	13	8.4	6.8
	−2×	48	48	32	33	18.8	11.4	9.8	8.8	6.4
	−5×	17.6	14.2	9.8	8.8	4.4	4.6	4.4	4.2	4
Q^sim.bird vs. Q.bird	2×	218.8	148.2	136.6	93.6	69.6	38.8	36.2	18.4	14.4
	5×	199	132.8	109.4	71.6	34.4	12.4	8	3.6	2
	−2×	39.4	49	34.6	27.2	12.8	5.8	0.6	1.8	1.2
	−5×	12.6	9.6	5.6	3.6	0.4	0.4	0	0.4	0

Figure 3.

The bubble plots show relative differences between amino acid substitution rates in Q.plant and its highest correlation model Q^sim100.plant (A); and between Q.bird and its highest correlation model Q^sim100.bird (B). Both Q^sim100.plant and Q^sim100.bird were estimated from 100 genes. Notations: 2 × (5×) indicates that the substitution rate between two models is at least 2 times (5 times) different.

Open in new tab Download slide

Figure 4 displays the distribution of mean squared errors (MSE) between substitution rates of the true model and those of the simulated models (the MSE were calculated from five simulations). The highest MSE are 0.443 for the bird matrices and 0.329 for the plant matrices. For the case of 100 genes, the highest MSE of $Q^{s i m 100} . p l a n t$ and $Q^{s i m 100} . b i r d$ are 0.052 and 0.172, respectively. The average of all MSE of $Q^{s i m 100}$ is 0.00076 for the bird matrices and 0.00056 for the plant matrices.

Figure 4.

The distribution of mean squared errors between substitution rates of the true model Q.plant and simulated models Q^sim.plant (left); and the true model Q.bird and simulated models Q^sim.bird (right). The mean squared errors were calculated from five simulations.

Open in new tab Download slide

Similarly, we compared the time-nonreversible models NQ^sim.plant and NQ^sim.bird with their corresponding true models NQ.plant and NQ.bird (see Table 3). For example, the NQ^sim100.plant model consists of 39 substitution rates that are at least 2 times different from that in the true model NQ.plant. Similarly, the NQ^sim100.bird model contains 30 substitution rates that are at least 2 times different from that in the true model NQ.bird. As the time-nonreversible models consist of more parameters than the time-reversible models, more training data are required to estimate time-nonreversible models so that they are close to the true models. Figure 5 demonstrates the relative difference between the true model NQ.plant and its highest correlation model NQ^sim100.plant (Figure 5A); and between the true model NQ.bird and its highest correlation model NQ^sim100.bird (Figure 5B).

Table 3.

Open in new tab

The average number of amino acid substitution rates that 2 times larger (2×), 5 times larger (5×), 2 times smaller (−2×), and 5 times smaller (−5×) when comparing the true model NQ.plant with simulated model NQ^sim.plant; and between NQ.bird with simulated model NQ^sim.bird.

		Number of genes
Comparing models	Larger/Smaller	1	2	5	10	20	50	100	200	250
NQ^sim.plant vs. NQ.plant	2×	210	178.6	126	106.8	74.8	49.6	39.2	25	18.8
	5×	196.2	164.8	110.2	88.2	60.2	37.2	25.4	18.8	14
	−2×	51.4	54.4	52.4	49.4	43	28.4	23.2	17.6	17.2
	−5×	22.6	22.6	14	12.6	12.2	9.6	4	4	3.6
NQ^sim.bird vs. NQ.bird	2×	244.2	182.4	146.8	106	81.6	48.2	30.6	15	15.6
	5×	236.2	171	134.4	92.4	61.6	28.4	15.4	7	8.2
	−2×	41.8	53.6	57.6	56.8	41	18.6	11.2	7.2	7.6
	−5×	19.6	18.8	16.2	13.6	5.6	2	2	1.8	1.2

		Number of genes
Comparing models	Larger/Smaller	1	2	5	10	20	50	100	200	250
NQ^sim.plant vs. NQ.plant	2×	210	178.6	126	106.8	74.8	49.6	39.2	25	18.8
	5×	196.2	164.8	110.2	88.2	60.2	37.2	25.4	18.8	14
	−2×	51.4	54.4	52.4	49.4	43	28.4	23.2	17.6	17.2
	−5×	22.6	22.6	14	12.6	12.2	9.6	4	4	3.6
NQ^sim.bird vs. NQ.bird	2×	244.2	182.4	146.8	106	81.6	48.2	30.6	15	15.6
	5×	236.2	171	134.4	92.4	61.6	28.4	15.4	7	8.2
	−2×	41.8	53.6	57.6	56.8	41	18.6	11.2	7.2	7.6
	−5×	19.6	18.8	16.2	13.6	5.6	2	2	1.8	1.2

Table 3.

Open in new tab

The average number of amino acid substitution rates that 2 times larger (2×), 5 times larger (5×), 2 times smaller (−2×), and 5 times smaller (−5×) when comparing the true model NQ.plant with simulated model NQ^sim.plant; and between NQ.bird with simulated model NQ^sim.bird.

		Number of genes
Comparing models	Larger/Smaller	1	2	5	10	20	50	100	200	250
NQ^sim.plant vs. NQ.plant	2×	210	178.6	126	106.8	74.8	49.6	39.2	25	18.8
	5×	196.2	164.8	110.2	88.2	60.2	37.2	25.4	18.8	14
	−2×	51.4	54.4	52.4	49.4	43	28.4	23.2	17.6	17.2
	−5×	22.6	22.6	14	12.6	12.2	9.6	4	4	3.6
NQ^sim.bird vs. NQ.bird	2×	244.2	182.4	146.8	106	81.6	48.2	30.6	15	15.6
	5×	236.2	171	134.4	92.4	61.6	28.4	15.4	7	8.2
	−2×	41.8	53.6	57.6	56.8	41	18.6	11.2	7.2	7.6
	−5×	19.6	18.8	16.2	13.6	5.6	2	2	1.8	1.2

		Number of genes
Comparing models	Larger/Smaller	1	2	5	10	20	50	100	200	250
NQ^sim.plant vs. NQ.plant	2×	210	178.6	126	106.8	74.8	49.6	39.2	25	18.8
	5×	196.2	164.8	110.2	88.2	60.2	37.2	25.4	18.8	14
	−2×	51.4	54.4	52.4	49.4	43	28.4	23.2	17.6	17.2
	−5×	22.6	22.6	14	12.6	12.2	9.6	4	4	3.6
NQ^sim.bird vs. NQ.bird	2×	244.2	182.4	146.8	106	81.6	48.2	30.6	15	15.6
	5×	236.2	171	134.4	92.4	61.6	28.4	15.4	7	8.2
	−2×	41.8	53.6	57.6	56.8	41	18.6	11.2	7.2	7.6
	−5×	19.6	18.8	16.2	13.6	5.6	2	2	1.8	1.2

Figure 5.

The bubble plots show relative differences between amino acid substitution rates in NQ.plant and its highest correlation model NQ^sim100.plant (A); and between NQ.bird and its highest correlation model NQ^sim100.bird (B). Both NQ^sim100.plant and NQ^sim100.bird were estimated from 100 genes. Notations: 2 × (5×) indicates that the substitution rate between two models is at least 2 times (5 times) different.

Open in new tab Download slide

Figure 6 shows the distribution of MSE between substitution rates of nonreversible models (the MSE were calculated from five simulations). The MSE of the time-nonreversible matrices are lower than that of the time-reversible matrices. They are all smaller than 0.205 and close to zero when the number of genes ≥ 100.

Figure 6.

The distribution of mean squared errors between substitution rates of the true model NQ.plant and simulated models NQ^sim.plant (left); and the true model NQ.bird and simulated models NQ^sim.bird (right). The mean squared errors were calculated from five simulations.

Open in new tab Download slide

Likelihood comparison

We used log-likelihood values to examine the performance of simulated models in building ML trees (see Figure 7). We compared the average log-likelihood values of 5 simulated models with that of the true model for each simulation condition. We note that a simulated model is considered better (worse) than the true model for a testing alignment if the tree constructed with the simulated model has a higher (lower) log-likelihood value than that constructed with the true model.

Figure 7.

The performance of the true models and the simulated models in building ML trees for different simulation conditions. The average proportion of testing alignments where the simulated model yields better a log-likelihood value than the true model.

Open in new tab Download slide

Figure 7 shows that trees constructed with true models have higher likelihood values than those constructed with simulated models for a large number of testing alignments. The likelihood of trees with simulated models gets closer to that with the true models when more genes are used to estimate models. For example, trees constructed with Q.plant have higher likelihood values than those with Q^sim10.plant, Q^sim50.plant, and Q^sim100.plant for 89.9%, 72.1%, and 68.8% of testing alignments, respectively.

We also examined the performance of reversible models Q^sim and nonreversible models NQ^sim in building ML trees (see Figure 7). The nonreversible models NQ^sim estimated from small datasets (i.e., less than 50 genes) were not as good as their corresponding reversible models Q^sim. The performance of NQ^sim gets better with the increase in training data size used to estimate the models. For example, NQ^100sim.plant was better than Q^100sim.plant in building ML trees for 61.69% of the plant alignments; and NQ^100sim.bird was better than Q^100sim.bird for 56.8% of the bird alignments. It is recommended to use model selection programs such as ModelFinder (Kalyaanamoorthy et al., 2017) to determine the best-fit model for the alignment under study (Dang et al., 2022; Minh et al., 2021).

The simulated plant models estimated from ≥ 100 genes performed considerably well in building ML trees in comparison with the true models. For example, the simulated model Q^sim100.plant helped build better ML trees than the true model Q.plant in 31.1% of testing alignments. Similarly, the simulated model NQ^sim100.plant outperformed the true model NQ.plant in building ML trees for about 27% of testing alignments.

We obtained similar results for the bird datasets. The simulated models Q^sim100.bird and NQ^sim100.bird were better than the true models Q.bird and NQ.bird in 38.6% and 31.4% of the testing alignments, respectively. The simulated bird models estimated from small datasets including 10 or 20 genes were worse than the true models for a majority of testing cases.

Topology comparison

We examined the topological difference between trees constructed with the true models and those with the simulated models. The average nRF distances (see Table 4) show that trees constructed with the true models and those with the simulated models are mostly identical. For plant alignments, trees inferred using the simulated models estimated from ≥ 20 genes are identical to trees constructed with the true model. For the bird alignments, the average nRF distances between trees constructed with the true models and those inferred using the simulated models estimated from ≥ 100 alignments are close to zero. For reversible models, the average nRF distances between trees constructed with simulated models and those inferred with the true models are zero when ≥ 50 genes were used to estimate the models (the average nRF distances between trees built with the general model LG and those with the true models are 0.006 and 0.022 for the plant and bird datasets, respectively). Table 4 shows that the nRF distances between true trees and trees inferred from individual genes are small for the plant dataset, and considerably large for the bird dataset.

Table 4.

Open in new tab

The average nRF distances among the plant tree, the bird tree, and trees constructed with the simulated models and the true models. If $n R F = 0$ ⁠, two trees have identical topology.

#genes nRF	1	2	5	10	20	50	100	200	250
Q.plant vs. Q^sim.plant	0	0.012	0.006	0	0	0	0	0	0
NQ.plant vs. NQ^sim.plant	0.018	0.012	0.012	0.006	0	0	0	0	0
Q.bird vs. Q^sim.bird	0.020	0	0.106	0.004	0.004	0	0	0	0
NQ.bird vs. NQ^sim.bird	0.120	0.04	0.238	0.048	0.088	0	0.035	0.09	0.039
Q^sim.plant vs. plant tree	0.031	0.018	0.025	0.031	0.031	0.031	0.031	0.031	0.031
NQ^sim.plant vs. plant tree	0.018	0.018	0.025	0.031	0.031	0.031	0.031	0.031	0.031
Q^sim.bird vs. bird tree	0.556	0.529	0.556	0.556	0.556	0.556	0.556	0.556	0.556
NQ^sim.bird vs. bird tree	0.507	0.511	0.409	0.538	0.528	0.538	0.542	0.451	0.416

#genes nRF	1	2	5	10	20	50	100	200	250
Q.plant vs. Q^sim.plant	0	0.012	0.006	0	0	0	0	0	0
NQ.plant vs. NQ^sim.plant	0.018	0.012	0.012	0.006	0	0	0	0	0
Q.bird vs. Q^sim.bird	0.020	0	0.106	0.004	0.004	0	0	0	0
NQ.bird vs. NQ^sim.bird	0.120	0.04	0.238	0.048	0.088	0	0.035	0.09	0.039
Q^sim.plant vs. plant tree	0.031	0.018	0.025	0.031	0.031	0.031	0.031	0.031	0.031
NQ^sim.plant vs. plant tree	0.018	0.018	0.025	0.031	0.031	0.031	0.031	0.031	0.031
Q^sim.bird vs. bird tree	0.556	0.529	0.556	0.556	0.556	0.556	0.556	0.556	0.556
NQ^sim.bird vs. bird tree	0.507	0.511	0.409	0.538	0.528	0.538	0.542	0.451	0.416

Table 4.

Open in new tab

The average nRF distances among the plant tree, the bird tree, and trees constructed with the simulated models and the true models. If $n R F = 0$ ⁠, two trees have identical topology.

#genes nRF	1	2	5	10	20	50	100	200	250
Q.plant vs. Q^sim.plant	0	0.012	0.006	0	0	0	0	0	0
NQ.plant vs. NQ^sim.plant	0.018	0.012	0.012	0.006	0	0	0	0	0
Q.bird vs. Q^sim.bird	0.020	0	0.106	0.004	0.004	0	0	0	0
NQ.bird vs. NQ^sim.bird	0.120	0.04	0.238	0.048	0.088	0	0.035	0.09	0.039
Q^sim.plant vs. plant tree	0.031	0.018	0.025	0.031	0.031	0.031	0.031	0.031	0.031
NQ^sim.plant vs. plant tree	0.018	0.018	0.025	0.031	0.031	0.031	0.031	0.031	0.031
Q^sim.bird vs. bird tree	0.556	0.529	0.556	0.556	0.556	0.556	0.556	0.556	0.556
NQ^sim.bird vs. bird tree	0.507	0.511	0.409	0.538	0.528	0.538	0.542	0.451	0.416

#genes nRF	1	2	5	10	20	50	100	200	250
Q.plant vs. Q^sim.plant	0	0.012	0.006	0	0	0	0	0	0
NQ.plant vs. NQ^sim.plant	0.018	0.012	0.012	0.006	0	0	0	0	0
Q.bird vs. Q^sim.bird	0.020	0	0.106	0.004	0.004	0	0	0	0
NQ.bird vs. NQ^sim.bird	0.120	0.04	0.238	0.048	0.088	0	0.035	0.09	0.039
Q^sim.plant vs. plant tree	0.031	0.018	0.025	0.031	0.031	0.031	0.031	0.031	0.031
NQ^sim.plant vs. plant tree	0.018	0.018	0.025	0.031	0.031	0.031	0.031	0.031	0.031
Q^sim.bird vs. bird tree	0.556	0.529	0.556	0.556	0.556	0.556	0.556	0.556	0.556
NQ^sim.bird vs. bird tree	0.507	0.511	0.409	0.538	0.528	0.538	0.542	0.451	0.416

We also analyzed the average branch length of trees reconstructed with the simulated models for different simulation conditions (see Table 5). The average branch length of the true plant tree is 0.051. The average branch length of plant trees constructed with the simulated models ranges from 0.0061 to 0.065. For example, the average branch lengths of trees constructed with Q^sim100.plant and NQ^sim100.plant are 0.062 and 0.061, respectively. The true bird tree has an average branch length of 0.017. The bird trees built with the simulated models have average branch lengths from 0.027 to 0.032. The average branch lengths of trees constructed with Q^sim100.bird and NQ^sim100.bird are both equal to 0.027. In general, the bigger training datasets resulted in better models for estimating the branch lengths of trees. Note that the simulated models estimated from ≥ 100 genes produced trees with similar branch lengths.

Table 5.

Open in new tab

The average of branch length of trees constructed with the simulated models for different simulation conditions.

#genes	1	2	5	10	20	50	100	200	250
Q^sim.plant	0.065	0.064	0.063	0.062	0.062	0.062	0.062	0.062	0.062
NQ^sim.plant	0.064	0.063	0.062	0.061	0.061	0.061	0.061	0.061	0.061
Q^sim.bird	0.030	0.028	0.028	0.027	0.027	0.027	0.027	0.027	0.027
NQ^sim.bird	0.032	0.029	0.027	0.027	0.027	0.027	0.027	0.026	0.027

#genes	1	2	5	10	20	50	100	200	250
Q^sim.plant	0.065	0.064	0.063	0.062	0.062	0.062	0.062	0.062	0.062
NQ^sim.plant	0.064	0.063	0.062	0.061	0.061	0.061	0.061	0.061	0.061
Q^sim.bird	0.030	0.028	0.028	0.027	0.027	0.027	0.027	0.027	0.027
NQ^sim.bird	0.032	0.029	0.027	0.027	0.027	0.027	0.027	0.026	0.027

Table 5.

Open in new tab

The average of branch length of trees constructed with the simulated models for different simulation conditions.

#genes	1	2	5	10	20	50	100	200	250
Q^sim.plant	0.065	0.064	0.063	0.062	0.062	0.062	0.062	0.062	0.062
NQ^sim.plant	0.064	0.063	0.062	0.061	0.061	0.061	0.061	0.061	0.061
Q^sim.bird	0.030	0.028	0.028	0.027	0.027	0.027	0.027	0.027	0.027
NQ^sim.bird	0.032	0.029	0.027	0.027	0.027	0.027	0.027	0.026	0.027

#genes	1	2	5	10	20	50	100	200	250
Q^sim.plant	0.065	0.064	0.063	0.062	0.062	0.062	0.062	0.062	0.062
NQ^sim.plant	0.064	0.063	0.062	0.061	0.061	0.061	0.061	0.061	0.061
Q^sim.bird	0.030	0.028	0.028	0.027	0.027	0.027	0.027	0.027	0.027
NQ^sim.bird	0.032	0.029	0.027	0.027	0.027	0.027	0.027	0.026	0.027

We also calculated the MSE and standard deviations of branch lengths of trees constructed with the simulated models in comparison with the true trees (see Table 6). The MSE values are small and decrease with the increase in the number of genes. For example, the average MSE values of trees with Q^sim1.plant is 0.001003 and that with Q^sim100.plant is 0.000848.

Table 6.

Open in new tab

The average of mean square errors (×1,000) and standard deviations (×1,000) of the average branch length of trees inferred with the simulated models in comparison with the plant tree and the bird tree.

#genes	Q^sim.plant		Q^sim.bird		NQ^sim.plant		NQ^sim.bird
#genes	MSE	SD	MSE	SD	MSE	SD	MSE	SD
1	1.003	0.120	2.218	0.310	1.033	0.124	2.044	0.542
2	0.974	0.133	1.989	0.167	0.877	0.056	1.913	0.338
5	0.893	0.032	2.219	0.312	0.827	0.036	1.836	0.357
10	0.845	0.009	1.901	0.130	0.811	0.012	1.789	0.115
20	0.856	0.018	1.773	0.103	0.806	0.004	1.916	0.568
50	0.851	0.014	1.774	0.130	0.795	0.005	1.727	0.053
100	0.848	0.009	1.844	0.114	0.795	0.004	1.840	0.135
200	0.853	0.005	1.775	0.133	0.795	0.004	1.945	0.234
250	0.848	0.003	1.735	0.106	0.793	0.004	1.829	0.209

#genes	Q^sim.plant		Q^sim.bird		NQ^sim.plant		NQ^sim.bird
#genes	MSE	SD	MSE	SD	MSE	SD	MSE	SD
1	1.003	0.120	2.218	0.310	1.033	0.124	2.044	0.542
2	0.974	0.133	1.989	0.167	0.877	0.056	1.913	0.338
5	0.893	0.032	2.219	0.312	0.827	0.036	1.836	0.357
10	0.845	0.009	1.901	0.130	0.811	0.012	1.789	0.115
20	0.856	0.018	1.773	0.103	0.806	0.004	1.916	0.568
50	0.851	0.014	1.774	0.130	0.795	0.005	1.727	0.053
100	0.848	0.009	1.844	0.114	0.795	0.004	1.840	0.135
200	0.853	0.005	1.775	0.133	0.795	0.004	1.945	0.234
250	0.848	0.003	1.735	0.106	0.793	0.004	1.829	0.209

Table 6.

Open in new tab

The average of mean square errors (×1,000) and standard deviations (×1,000) of the average branch length of trees inferred with the simulated models in comparison with the plant tree and the bird tree.

#genes	Q^sim.plant		Q^sim.bird		NQ^sim.plant		NQ^sim.bird
#genes	MSE	SD	MSE	SD	MSE	SD	MSE	SD
1	1.003	0.120	2.218	0.310	1.033	0.124	2.044	0.542
2	0.974	0.133	1.989	0.167	0.877	0.056	1.913	0.338
5	0.893	0.032	2.219	0.312	0.827	0.036	1.836	0.357
10	0.845	0.009	1.901	0.130	0.811	0.012	1.789	0.115
20	0.856	0.018	1.773	0.103	0.806	0.004	1.916	0.568
50	0.851	0.014	1.774	0.130	0.795	0.005	1.727	0.053
100	0.848	0.009	1.844	0.114	0.795	0.004	1.840	0.135
200	0.853	0.005	1.775	0.133	0.795	0.004	1.945	0.234
250	0.848	0.003	1.735	0.106	0.793	0.004	1.829	0.209

#genes	Q^sim.plant		Q^sim.bird		NQ^sim.plant		NQ^sim.bird
#genes	MSE	SD	MSE	SD	MSE	SD	MSE	SD
1	1.003	0.120	2.218	0.310	1.033	0.124	2.044	0.542
2	0.974	0.133	1.989	0.167	0.877	0.056	1.913	0.338
5	0.893	0.032	2.219	0.312	0.827	0.036	1.836	0.357
10	0.845	0.009	1.901	0.130	0.811	0.012	1.789	0.115
20	0.856	0.018	1.773	0.103	0.806	0.004	1.916	0.568
50	0.851	0.014	1.774	0.130	0.795	0.005	1.727	0.053
100	0.848	0.009	1.844	0.114	0.795	0.004	1.840	0.135
200	0.853	0.005	1.775	0.133	0.795	0.004	1.945	0.234
250	0.848	0.003	1.735	0.106	0.793	0.004	1.829	0.209

Discussion

The ML methods have been proposed to estimate both time-reversible and time-nonreversible models from genome datasets. The true AA substitution models are unknown for real data; therefore, we are unable to evaluate the quality of estimated models in comparison with the true models based on real data. In this article, we compared the estimated models and the true models based on the simulated data. Our study based on the simulated genome datasets with different numbers of genes revealed that both time-reversible and time-nonreversible models estimated by the ML methods from large genome datasets (i.e., datasets with 100 or more genes) are highly correlated with the true model.

The estimated models from large datasets performed well in building ML phylogenetic trees. The significant tests showed that the models derived from large genome datasets and the true models were not significantly better than each other for a majority of testing alignments. The tree topologies constructed with the simulated models and that inferred with the true models are nearly identical. As the true models are not available, the models estimated from large datasets can be reliably used in studying the evolution of protein sequences.

Estimating AA substitution models from large genome datasets is very time-consuming. Only five simulations per set were performed in this study because it took us months to estimate all models from both QMaker and nQMaker estimation methods with different simulation conditions. We note that the standard deviations of all estimated models with ≥ 100 genes and the true models are close to zero.

Data availability

The datasets and scripts used in this article are available at https://doi.org/10.6084/m9.figshare.22336384.

Author contributions

Nguyen Huy Tinh (Investigation [equal], Software [equal], Visualization [equal], Writing—original draft [equal]), Cuong Dang (Conceptualization [equal], Methodology [equal], Resources [lead], Supervision [equal], Writing—review & editing [supporting]), and Vinh Le (Conceptualization [equal], Funding acquisition [lead], Methodology [equal], Project administration [lead], Supervision [equal], Writing—review & editing [lead])

Funding

This research was funded by the Vietnam National Foundation for Science and Technology Development (NAFOSTED; [102.01.2019.06 to C.C.D. and L.S.V.]).

Conflicts of interest

All authors declare that they have no conflict of interest.

References

Dang

,

C. C.

,

Le

,

V. S.

,

Gascuel

,

O.

, …

Le

,

Q. S.

(

2014

).

FastMG: A simple, fast, and accurate maximum likelihood procedure to estimate amino acid replacement rate matrices from large data sets

.

BMC Bioinformatics

,

15

(

1

),

341

. https://doi.org/10.1186/1471-2105-15-341

Dang

,

C. C.

,

Minh

,

B. Q.

,

McShea

,

H.

, …

Lanfear

,

R.

(

2022

).

NQMaker: Estimating time nonreversible amino acid substitution models

.

Systematic Biology

,

71

(

5

),

1110

–

1123

. https://doi.org/10.1093/sysbio/syac007

Gu

,

X.

,

Fu

,

Y. X.

, &

Li

,

W. H.

(

1995

).

Maximum likelihood estimation of the heterogeneity of substitution rate among nucleotide sites

.

Molecular Biology and Evolution

,

12

(

4

),

546

–

557

. https://doi.org/10.1093/oxfordjournals.molbev.a040235

Huelsenbeck

,

J. P.

,

Joyce

,

P.

,

Lakner

,

C.

, &

Ronquist

,

F.

(

2008

).

Bayesian analysis of amino acid substitution models

.

Philosophical Transactions of the Royal Society B: Biological Sciences

,

363

(

1512

),

3941

–

3953

. https://doi.org/10.1098/rstb.2008.0175

Jarvis

,

E. D.

,

Mirarab

,

S.

,

Aberer

,

A. J.

, …

Ho

,

S. Y. W.

(

2015

).

Phylogenomic analyses data of the avian phylogenomics project

.

GigaScience

,

4

(

1

),

1

–

9

. https://doi.org/10.1186/s13742-014-0038-1

Kalyaanamoorthy

,

S.

,

Minh

,

B. Q.

,

Wong

,

T. K. F.

, …

Jermiin

,

L. S.

(

2017

).

ModelFinder: Fast model selection for accurate phylogenetic estimates

.

Nature Methods

,

14

(

6

),

587

–

589

. https://doi.org/10.1038/nmeth.4285

Le

,

S. Q.

, &

Gascuel

,

O.

(

2008

).

An improved general amino acid replacement matrix

.

Molecular Biology and Evolution

,

25

(

7

),

1307

–

1320

. https://doi.org/10.1093/molbev/msn067

Ly-Trong

,

N.

,

Naser-Khdour

,

S.

,

Lanfear

,

R.

, &

Minh

,

B. Q.

(

2022

).

AliSim: A fast and versatile phylogenetic sequence simulator for the genomic era

.

Molecular Biology and Evolution

,

39

(

5

),

msac092

. https://doi.org/10.1093/molbev/msac092

Minh

,

B. Q.

,

Dang

,

C. C.

,

Vinh

,

Le Sy

, &

Lanfear

,

R.

(

2021

).

QMaker: Fast and accurate method to estimate empirical models of protein evolution

.

Systematic Biology

,

70

(

5

),

1046

–

1060

. https://doi.org/10.1093/sysbio/syab010

Minh

,

B. Q.

,

Schmidt

,

H.

,

Chernomor

,

O.

, …

Lanfear

,

R.

(

2020

).

IQ-TREE 2: New models and efficient methods for phylogenetic inference in the genomic era

.

Molecular Biology and Evolution

,

37

(

5

),

1530

–

1534

.

Müller

,

T.

,

Spang

,

R.

, &

Vingron

,

M.

(

2002

).

Estimating amino acid substitution models: A comparison of Dayhoff’s estimator, the resolvent approach and a maximum likelihood method

.

Molecular Biology and Evolution

,

19

(

1

),

8

–

13

. https://doi.org/10.1093/oxfordjournals.molbev.a003985

Naser-Khdour

,

S.

,

Minh

,

B. Q.

,

Zhang

,

W.

, …

Bryant

,

D.

(

2019

).

The prevalence and impact of model violations in phylogenetic analysis

.

Genome Biology and Evolution

,

11

(

12

),

3341

–

3352

. https://doi.org/10.1093/gbe/evz193

Ran

,

J. H.

,

Shen

,

T. T.

,

Wang

,

M. M.

, &

Wang

,

X. Q.

(

2018

).

Phylogenomics resolves the deep phylogeny of seed plants and indicates partial convergent or homoplastic evolution between Gnetales and angiosperms

.

Proceedings of the Royal Society B: Biological Sciences

,

285

(

1881

),

1

–

9

. https://doi.org/10.1098/rspb.2018.1012

Robinson

,

D. F.

, &

Foulds

,

L. R.

(

1981

).

Comparison of phylogenetic trees

.

Mathematical Biosciences

,

53

(

1-2

),

131

–

147

. https://doi.org/10.1016/0025-5564(81)90043-2

Squartini

,

F.

, &

Arndt

,

P. F.

(

2008

).

Quantifying the stationarity and time reversibility of the nucleotide substitution process

.

Molecular Biology and Evolution

,

25

(

12

),

2525

–

2535

. https://doi.org/10.1093/molbev/msn169

Vinh

,

Le Sy.

(

2021

).

Modeling amino acid substitutions for whole genomes

.

Journal of Computer Science and Cybernetics

,

37

(

4

),

351

–

363

. https://doi.org/10.15625/1813-9663/37/4/15937

Whelan

,

S.

, &

Goldman

,

N.

(

2001

).

A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach

.

Molecular Biology and Evolution

,

18

(

5

),

691

–

699

. https://doi.org/10.1093/oxfordjournals.molbev.a003851

Yang

,

Z.

(

1993

).

Maximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites

.

Molecular Biology and Evolution

,

10

(

6

),

1396

–

1401

. https://doi.org/10.1093/oxfordjournals.molbev.a040082

Yang

,

Z.

(

1995

).

A space-time process model for the evolution of DNA sequences

.

Genetics

,

139

(

2

),

993

–

1005

. https://doi.org/10.1093/genetics/139.2.993

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://dbpia.nl.go.kr/pages/standard-publication-reuse-rights)

Download all slides

Month:	Total Views:
December 2023	14
January 2024	50
February 2024	97
March 2024	65
April 2024	4
May 2024	4
June 2024	7
July 2024	21
August 2024	15
September 2024	6
October 2024	6
November 2024	4
December 2024	7
January 2025	18
February 2025	19
March 2025	30
April 2025	20
May 2025	4

Article Contents

Estimating amino acid substitution models from genome datasets: a simulation study on the performance of estimated models

Abstract

Introduction

Materials and methods

Data

Methods

Estimating parameters of amino acid substitution model

Alignment simulation

Model analyses and comparison

Results

Model analysis

Likelihood comparison

Topology comparison

Discussion

Data availability

Author contributions

Funding

Conflicts of interest

References

Citations

Views

Altmetric

Email alerts

Related articles in

Citing articles via

Latest

Most Read

Most Cited

Article Contents

Estimating amino acid substitution models from genome datasets: a simulation study on the performance of estimated models

Abstract

Introduction

Materials and methods

Data

Methods

Estimating parameters of amino acid substitution model

Alignment simulation

Model analyses and comparison

Results

Model analysis

Likelihood comparison

Topology comparison

Discussion

Data availability

Author contributions

Funding

Conflicts of interest

References

Citations

Views

Altmetric

Email alerts

Related articles in

Citing articles via

Latest

Most Read

Most Cited

This Feature Is Available To Subscribers Only