-
PDF
- Split View
-
Views
-
Cite
Cite
Asif U Tamuri, Mario dos Reis, A Mutation–Selection Model of Protein Evolution under Persistent Positive Selection, Molecular Biology and Evolution, Volume 39, Issue 1, January 2022, msab309, https://doi.org/10.1093/molbev/msab309
- Share Icon Share
Abstract
We use first principles of population genetics to model the evolution of proteins under persistent positive selection (PPS). PPS may occur when organisms are subjected to persistent environmental change, during adaptive radiations, or in host–pathogen interactions. Our mutation–selection model indicates protein evolution under PPS is an irreversible Markov process, and thus proteins under PPS show a strongly asymmetrical distribution of selection coefficients among amino acid substitutions. Our model shows the criteria (where ω is the ratio of nonsynonymous over synonymous codon substitution rates) to detect positive selection is conservative and indeed arbitrary, because in real proteins many mutations are highly deleterious and are removed by selection even at positively selected sites. We use a penalized-likelihood implementation of the PPS model to successfully detect PPS in plant RuBisCO and influenza HA proteins. By directly estimating selection coefficients at protein sites, our inference procedure bypasses the need for using ω as a surrogate measure of selection and improves our ability to detect molecular adaptation in proteins.
Introduction
Equation (1), which is the relative probability of fixation of selected over neutral mutations (Fisher 1930; Wright 1931; Kimura 1962; McCandlish et al. 2015), has important implications for understanding molecular adaptation in proteins. For a sample of protein-coding sequences from various species, the ratio between the number of substitutions at nonsynonymous sites (which are under selection) and at synonymous sites (which are under weak or no selection) should approximately follow the dynamics of equation (1) (Nielsen and Yang 2003). This ratio, commonly known as , is widely used as a test of molecular adaptation in proteins, with , ω = 1, and interpreted as evidence of molecular adaptation (positive selection), neutral evolution, and purifying selection respectively.
However, Kimura’s relative rate of molecular evolution (eq. 1), based on the infinite-sites model (Kimura 1969; Sawyer and Hartl 1992), assumes all new mutations appear at new sites in the genome. This assumption appears unrealistic for proteins. Nielsen and Yang (2003) have argued that if amino acid fitnesses are reassigned every time a new mutation appears at a site in a protein (so that the selection coefficient, S, is always the same at the site), then equation (1) gives the relationship of S and ω under a finite-sites model. However, it is not clear in which condition this fitness reassignment should apply: if an i to j mutation has selection coefficient S, then the reverse j to i mutation should have coefficient , but Nielsen and Yang’s model assumes it reverts to S. Without this assumption it does not appear possible to equate .
Spielman and Wilke (2015), and dos Reis (2015), used the Fisher–Wright mutation–selection model (Fisher 1930; Wright 1931; Halpern and Bruno 1998) to derive the relationship between ω and the selection coefficients acting on codon sites within a finite-sites model. They showed that when selection coefficients are constant over time, that is, they are not reassigned (dos Reis 2015; Spielman and Wilke 2015); whereas can be achieved for a short period of time after selection coefficients undergo a single shift during an adaptive event, for example, when a virus adapts to a new host (dos Reis 2015).
However, the relationship between ω and selection coefficients under the more general case of persistent changes in selection over time appears unclear. This case, which we term persistent positive selection (PPS), is important because selection coefficients acting at codon sites may change repeatedly during persistent environmental changes, during adaptive radiations, and in host–pathogen interactions (such as in a virus evading herd immunity in a host population). Thus, understanding how PPS affects ω in proteins can inform the development of methods to detect positive selection and give us insight onto the mechanisms of adaptive evolution in general.
Here, we develop a mutation–selection model of codon substitution under PPS. The new model can be used to study the mechanistic relationship between the scaled selection coefficients and ω, providing insight into the evolutionary dynamics of proteins under PPS. Furthermore, we develop a penalized-likelihood implementation of the model and successfully use it to detect PPS directly in real proteins bypassing the need to use ω as a surrogate measure of selection. Analysis under the new model indicates codon substitution is an irreversible Markov process, leading to a highly asymmetrical distribution of selection coefficients among substitutions in proteins under PPS. More strikingly, the PPS model shows the criteria to detect molecular adaptation in proteins is conservative and indeed arbitrary, as we find evidence of PPS at codon sites where .
New Approaches
The PPS Codon Substitution Model
We develop the new model by integrating the nonhomogeneous selection model of Kimura and Ota (1970) with the mutation–selection codon substitution model of Halpern and Bruno (1998). Consider a population of organisms with haploid genome number N. That is, the number of copies of the genome in the population is N (i.e., the population size is N if the organism is haploid and if it is diploid). Suppose a site k in a protein-coding gene is fixed for codon i in the population, and the scaled Malthusian fitness of i is . A new mutant codon j appears at the site and has initial selective advantage . The selective advantage then decays exponentially as a function of time (Kimura and Ota 1970), for example, due to gradual environmental change. Kimura and Ota (1970) showed the fixation probability of j is approximately where is constant and . In other words, the fixation probability of j is the same as that of an allele with intermediate, but constant, selective advantage .
It appears other types of decay function lead to the same fixation probability. For example, the same result is obtained in the case of frequency-dependent selection when the fitness of j decays exponentially as a function of the frequency of j in the population (dos Reis 2013). In the case of frequency-dependent selection, once j becomes fixed, any new mutant alleles may have high fitnesses because they would be rare. We expect this type of dynamics in, for example, viruses escaping the herd immunity of a host population. Similarly, if the environment gradually shifts between two states, then the selective advantage of j or i would be continuously reset depending on the particular environment. This would then lead to resetting (or reassignment) of the fitnesses of i and j. This persistent change in the selection coefficient is what we term PPS. We formalize codon substitution under the PPS model next.
Let the selection coefficient for the mutation be , where and are constant. Let the selection coefficient for the reverse mutant, , be . In other words, we have partitioned the fitnesses of j and i into two components: a constant component, and , representing structural constrains of the protein on the amino acid encoded by the codon; and Zk, the PPS component. Thus, when , the selection coefficient is persistently reset with new mutations.
Irreversibility of Codon Substitution under PPS
Equation (2) describes codon substitution as a continuous Markov process. Polymorphisms are ignored and the population is assumed to switch from i to j instantaneously. This assumption appears reasonable if , for all μij (Bulmer 1991). The proportion of time location k remains fixed for j (i.e., the stationary frequency of j) is . A Markov process is said to be reversible in equilibrium if it satisfies the detailed-balance condition (Grimmet and Stirzaker 2004). When Zk = 0, the model of equation (2) is reversible (Yang and Nielsen 2008). However, when the process is, in general, irreversible because the detailed balance condition does not hold. When , the stationary frequencies are found by solving the system of equations with the constraint . We calculate the irreversibility index for site k as , where indicates evolution at site k is irreversible, and Ik = 0 otherwise (Huelsenbeck et al. 2002).
Identifying Protein Locations under PPS
Given an alignment of protein-coding genes with corresponding phylogeny, the model of equation (2) can be used to estimate the and Zk using maximum penalized likelihood. To estimate the , we use the Dirichlet-based penalty of Tamuri et al. (2014) and for Zk, we use an exponential penalty with parameter λ (see Materials and Methods). For each site in the alignment, we compare the null model Zk = 0 (no PPS) against (PPS) using a likelihood-ratio test. Because of the boundary condition () in the test and the use of penalized likelihood, the distribution of the likelihood-ratio statistic does not follow the typical distribution. Thus, we use Cox (1962) simulation approach as used in phylogenetics (Goldman 1993) to obtain the appropriate null distribution (see Materials and Methods).
The Relationship between Selection Coefficients and ω
See dos Reis (2015) for the full derivation. Spielman and Wilke (2015) give a slightly different definition of ωk (see also Jones et al. 2016, Youssef et al. 2020).
Results
Detection of PPS in Simulated Data
Extensive simulations on the estimation of are available in Tamuri et al. (2012, 2014). Here, our focus is on using simulations to assert whether sites under PPS () can be identified using Cox’s method. We simulate codon alignments (1,000 codons in length) on a 512-taxa phylogeny, under various strengths of PPS, with and 10. The values of are drawn from random distributions to produce sharp amino acid profiles as in real proteins (see Materials and Methods). These and Zk values result in ωk values roughly between 0.05 and 6 (eq. 3). When Zk = 0, 6.6% of sites are incorrectly detected to be under PPS, which is slightly higher than the 5% error I threshold (table 1). When the selective advantage is slight (, the method roughly identifies 44% of sites under PPS (table 1). The power of the method is excellent and roughly over 95% when the selective advantage is strong (). We note the exponential penalty on Zk has a noticeable, albeit slight, effect on the power of the test. When the penalty parameter, λ is small, the resulting penalty is diffuse and the penalty is weak. However, as λ increases, the penalty becomes stronger with probability density in the exponential moving toward zero. In this case, estimates of are more strongly penalized and this translates in a small reduction in the power of the test (table 1). We note the penalized likelihood method used here is essentially the same as posterior mode finding giving our penalties are proper probability densities (Cox and O’Sullivan 1990; Tamuri et al. 2014), and thus the penalties on Zk and act as prior densities which regularize the parameter estimates (Cox and O’Sullivan 1990).
Performance of the LRT for Detecting PPS Sites in Simulated Data after FDR Correction (5%).
True Model . | . | . | . |
---|---|---|---|
swMutSel | FPR at 0.05 Significance | ||
0.066 | 0.066 | 0.066 | |
swMutSel+PPS | TPR at 0.05 Significance | ||
(Z = 2) | 0.441 | 0.452 | 0.449 |
(Z = 5) | 0.952 | 0.952 | 0.947 |
(Z = 10) | 0.965 | 0.963 | 0.960 |
True Model . | . | . | . |
---|---|---|---|
swMutSel | FPR at 0.05 Significance | ||
0.066 | 0.066 | 0.066 | |
swMutSel+PPS | TPR at 0.05 Significance | ||
(Z = 2) | 0.441 | 0.452 | 0.449 |
(Z = 5) | 0.952 | 0.952 | 0.947 |
(Z = 10) | 0.965 | 0.963 | 0.960 |
Note.—FPR, false-positive rate; TPR, true-positive rate.
Performance of the LRT for Detecting PPS Sites in Simulated Data after FDR Correction (5%).
True Model . | . | . | . |
---|---|---|---|
swMutSel | FPR at 0.05 Significance | ||
0.066 | 0.066 | 0.066 | |
swMutSel+PPS | TPR at 0.05 Significance | ||
(Z = 2) | 0.441 | 0.452 | 0.449 |
(Z = 5) | 0.952 | 0.952 | 0.947 |
(Z = 10) | 0.965 | 0.963 | 0.960 |
True Model . | . | . | . |
---|---|---|---|
swMutSel | FPR at 0.05 Significance | ||
0.066 | 0.066 | 0.066 | |
swMutSel+PPS | TPR at 0.05 Significance | ||
(Z = 2) | 0.441 | 0.452 | 0.449 |
(Z = 5) | 0.952 | 0.952 | 0.947 |
(Z = 10) | 0.965 | 0.963 | 0.960 |
Note.—FPR, false-positive rate; TPR, true-positive rate.
Detection of PPS in Real Proteins
We tested for PPS sites in three real sequence data sets: the hemagglutinin protein (HA) from human influenza H1N1 virus, the rbcL protein subunit from flowering plants, and the mitochondrial cytochrome b (CYTB) protein from mammals (table 2). Given the multiple sequence alignment, phylogeny, and mutational parameters, we estimated the and Zk using two penalty strengths, and 0.05. We then performed the LRT of PPS versus no PPS and used false discovery rate (FDR) at the 5% level to identify sites under PPS. Using the weak penalty, , we detected PPS () at 65 sites in the plant rbcL and 18 sites in the influenza HA, but we found no PPS sites in mammal CYTB (table 2). Interestingly, only 55 out 65 of PPS sites in rbcL have . For HA, all 18 PPS sites also have . The location of PPS sites and estimated ωk values are shown in figure 1A to A″. When using the stronger penalty, , the number of sites detected in rbcL and HA are reduced to 50 and 17 sites respectively (table 2). This is not unexpected because, as noted above, stronger penalties push estimates of Zk toward zero affecting the likelihood ratio test.

Analysis of proteins under the PPS mutation–selection model. (A–A″) Estimates of ω at protein sites. (B–B″) Distribution of selection coefficients among nonsynonymous substitutions. (C–C″) Relationship between ω and average selection at protein sites. Sites under PPS () are indicated in red in A–A″ and C–C″, and their contribution to the distribution of selection coefficients indicated in red in B–B″. In C–C′, the solid line is equation (1). The penalty on Zk is .
Data Set . | # Taxa . | # Sites . | # Z > 0 . | () . |
---|---|---|---|---|
. | . | . | . | . |
Plant rbcL | 478 | 466 | 65 (55) | 50 (40) |
Influenza HA | 466 | 589 | 18 (18) | 17 (14) |
Mammal CYTB | 418 | 407 | 0 (0) | — |
Data Set . | # Taxa . | # Sites . | # Z > 0 . | () . |
---|---|---|---|---|
. | . | . | . | . |
Plant rbcL | 478 | 466 | 65 (55) | 50 (40) |
Influenza HA | 466 | 589 | 18 (18) | 17 (14) |
Mammal CYTB | 418 | 407 | 0 (0) | — |
Data Set . | # Taxa . | # Sites . | # Z > 0 . | () . |
---|---|---|---|---|
. | . | . | . | . |
Plant rbcL | 478 | 466 | 65 (55) | 50 (40) |
Influenza HA | 466 | 589 | 18 (18) | 17 (14) |
Mammal CYTB | 418 | 407 | 0 (0) | — |
Data Set . | # Taxa . | # Sites . | # Z > 0 . | () . |
---|---|---|---|---|
. | . | . | . | . |
Plant rbcL | 478 | 466 | 65 (55) | 50 (40) |
Influenza HA | 466 | 589 | 18 (18) | 17 (14) |
Mammal CYTB | 418 | 407 | 0 (0) | — |
The Distribution of Selection Coefficients at Sites under PPS Is Asymmetrical
We estimated the distribution of selection coefficients among nonsynonymous substitutions (Tamuri et al. 2014) in the three protein-coding genes analyzed (fig. 1B to B″). For non-PPS sites (i.e., sites where Zk = 0), the distribution of selection coefficients is symmetrical, with a mode at S = 0, because in this case codon substitution is reversible and the detailed balance condition guarantees the proportions of slightly advantageous and deleterious mutations fixed in the population will be equal over time (Yang and Nielsen 2008). However, among PPS sites in plant rbcL and influenza HA, the distribution is highly skewed with a mode at S > 10 because irreversibility of the substitution process means the detailed balance condition does not apply, and hence there is a persistent excess of advantageous mutations being substituted into the population. For example, for sites with , the irreversibility index is as high as 0.12, indicating there is a deviation of 12% of substitutions from detailed balance, which is a strong deviation (fig. 2A). Larger values of Zk are also associated with faster substitution rates (fig. 2B) and larger ωk values (fig. 2C). For example, for sites with , the corresponding ωk values range from about 1 to 4 (fig. 2C).

Relationship between Zk and evolutionary parameters for PPS sites in HA and rbcL. (A) Irreversibility index, Ik, versus Zk. The index is normalized to give the expected excess number of substitutions from detailed balance. (B) Site substitution rate, , versus Zk. Note the are scaled so that they give the relative rate with respect to a neutral sequence (Tamuri et al. 2014). Thus, if rk = 1, then the site evolves at the same rate as, say, a pseudogene. (C) Nonsynonymous rate, ωk versus Zk. The penalty on Zk is in all cases.
PPS Sites Are under Strong Purifying Constraints
Figure 1C to C″ shows the estimated for the three data sets plotted against ωk. For 43 PPS sites in rbcL and one PPS site in HA, we find that . This shows PPS sites are effectively under a mixture of purifying selection against deleterious amino acid substitutions, and diversifying selection in favor of a few amino acids that substitute rapidly among each other. This trend is evidenced when studying the pattern of PPS substitution in the influenza HA protein. The H1N1 influenza virus entered the human population sometime prior to the 1918 influenza pandemic (Taubenberger et al. 2005; dos Reis et al. 2009) and has remained largely as a single lineage since then, except from the introduction of a separate lineage of reassortant H1N1 swine virus in the 2009 pandemic (Smith et al. 2009). Figure 3 shows the pattern of amino acid substitution for the 18 PPS sites in influenza HA between 1918 and 2009. For example, site 3 remained virtually fixed for alanine between 1918 and the late 1990s, and then suffered several back and forth substitutions between alanine and valine between the late 1990s and 2009, whereas site 142 has been characterized by shifts between lysine and asparagine between 1918 and 2009. It is clear from figure 3 that the majority of PPS sites in the HA protein are characterized by back-and-forth substitutions among a fairly reduced set of amino acids.

Pattern of amino acid substitution in PPS sites of human influenza (H1N1) HA protein between 1918 and 2009. The penalty on Zk is . Each colored dot represents a particular amino acid as indicated in the legend.
Discussion
Mutation–selection models of codon substitution have been successfully used to study the distribution of selection coefficients in proteins (Rodrigue et al. 2010; Tamuri et al. 2014), to detect selection shifts during adaptation (Parto and Lartillot 2017), shifting balance (Jones et al. 2016), and to understand protein evolution given structural constraints (Youssef et al. 2020). Previous works have also accommodated a ω parameter within the mutation–selection model to detect adaptation at amino acid sites (Yang and Nielsen 2008; Rodrigue and Lartillot 2017; Rodrigue et al. 2021). However in these works ω is a separate parameter and not a function of the selection coefficients and thus its population genetics interpretation is not clear (Rodrigue and Lartillot 2017). Here, we extended the mutation–selection framework to the case of PPS without the need for the additional ω parameter. Instead, in the new model, ω is a function of the selection coefficients and we believe this modeling approach can help gain insight on the nature of protein adaptation.
The new PPS model is flexible as it appears to have performed well for the different modes of selection studied here. For example, rbcL is the major subunit of the RuBisCo enzyme responsible for the fixation of carbon during photosynthesis. The efficiency of RuBisCo is affected by environmental factors and rbcL has been under persistent adaptive pressures during the successful adaptive radiation of angiosperms around the ecoregions of the world (Kapralov and Filatov 2007; Parto and Lartillot 2018). This is akin to the environment change model envisaged by Kimura and Ota (1970). On the other hand, the influenza HA protein is the classic example of positive selection on a pathogen evading its hosts’ herd immunity (Fitch et al. 1997), and we showed here the PPS model performed well in detecting this mode of adaptation. We believe the new PPS model, together with previous mutation–selection models that relaxed the assumption of constant fitnesses (Tamuri et al. 2012; Parto and Lartillot 2017), now encompass the major modes of selection in proteins.
We would like to note here two features of coding-sequence evolution that are ignored in our formulation of the PPS mutation–selection model. First, the model assumes amino acid sites within the protein evolve independently. This is unrealistic because amino acids are linked and their substitution pattern is affected by interactions with other amino acids within the folded protein (Pollock et al. 2012; Youssef et al. 2020). In particular, substitutions toward suboptimal amino acids can be compensated by rapid substitution in another interacting amino acid, so as to reduce contact energies in the folded protein (Pollock et al. 2012). How these rapid substitutions affect evolutionary dynamics within PPS and how they should be accommodated within the inference model will require further research (Youssef et al. 2020). Second, the model assumes polymorphism is absent and new mutations either become fixed or lost instantaneously. This assumption appears reasonable for most populations of plants and animals because, in these, the scaled mutation rates, , are much less than one (Lynch and Conery 2003). Even for influenza, a fast-evolving RNA virus, estimates of are in the order of (Zhao and Illingworth 2019). However, levels of standing polymorphism can be substantial in many microorganisms (Lynch and Conery 2003) or for some loci under certain forms of selection (e.g., selection in favor heterozygotes, Hughes and Nei 1988). Incorporating polymorphism within the mutation–selection inference machinery will be challenging, but recent polymorphism-aware phylogenetic approaches may provide a way forward (De Maio et al. 2015).
Perhaps the most important insight from the application of the PPS model to real data is that the criteria to detect positive selection in proteins is conservative. As we show here, sites under PPS are also under strong purifying constraints, and, at equilibrium, produce many deleterious mutations that are removed by selection. Because ωk is the weighted average over the rate of all possible synonymous substitutions at a site, it follows ωk will be reduced if there are many deleterious mutations at the site even if the site is shifting rapidly among a few positively selection amino acids. We believe this insight should be incorporated into the much faster codon substitution models used in phylogenomic analyses, such as the branch-site model (Yang and Nielsen 2002), to improve power in detecting adaptation in proteins.
Materials and Methods
Maximum Penalized Likelihood Estimation and Likelihood Ratio Test of PPS
The swMutSel model (Tamuri et al. 2012, 2014) is the special case of swMutSel-PPS when Zk = 0. We use swMutSel as a null model () and swMutSel-PPS as the alternative model () in a likelihood-ratio test. The vector of fitnesses at site k, and the PPS component, Zk are estimated by maximizing a penalized likelihood. The penalty on is the Dirichlet-based penalty of Tamuri et al. (2014), whereas for Zk we use an exponential penalty , where the regularization parameter, λ, controls the strength of the penalty. When λ = 0, there is no penalty while leads to increasingly stronger penalties on the estimation of Zk. During inference, the (eq. 2) are scaled in terms of the expected number of neutral substitutions per site (Tamuri et al. 2012). This guarantees all sites are normalized to the same timescale. To speed up computation, the mutational parameters, required to construct μij, and the branch lengths on the phylogeny are estimated under the FMutSel0 model (Yang and Nielsen 2008) as explained in Tamuri et al. (2014). We note only the differences among the enter equation (2), thus, the fitness for the most common amino acid at site k is set to zero. Large negative values are capped to −10 during numerical optimization. We recommend the optimization routine is repeated three times using different parameter start values to ensure convergence to the correct estimates.
Let the maximum penalized log-likelihood for site k be and , under the H0 and H1 hypotheses respectively. The test statistic is the difference in log-likelihoods . If the test statistic is significantly different from zero, this is evidence site k is evolving under PPS. The distribution of the statistic, when the null hypothesis is true, does not follow a distribution. There are two reasons for this. First, because Zk = 0 is at the boundary of parameter space, the test statistic would be, asymptotically, distributed as a 50:50 mixture of a distribution and a 0.5 point probability mass at 0 (Self and Liang 1987; Goldman and Whelan 2000). The second reason is that the penalty on Zk affects the 50:50 proportion because the penalty forces the estimates of Zk toward zero.
Because we do not know what the asymptotic distribution of δk should be, we use Cox’s (1962) Monte Carlo simulation to obtain the null distribution of δk. For a given site k in the alignment, we simulate N replicate sites on the phylogeny using the maximum penalized likelihood estimates (MPLEs) of under H0. The distribution, Δk, is determined by the difference in log-likelihood between the two models for each simulated site: where is the log-likelihood difference for the i-th simulation. If the test statistic from the real data (δk) is larger than, say, 95% of Δk, we reject the null hypothesis H0 (no PPS) and accept the alternative hypothesis H1 (PPS) at the significance level. Cox’s approach has been shown to work well in phylogenetic data sets (Goldman 1993). When analyzing an ensemble of sites in a multiple sequence alignment, we correct for multiple testing using a FDR procedure to select candidate PPS sites (Benjamini and Hochberg 1995).
Padé Approximation to Calculate the Matrix Exponential
Appropriate values for q and j are chosen according to the size of A and the desired accuracy in the calculation of (Moler and Van Loan 2003: table 1).
In our model, the calculation of the likelihood at a site involves multiple computations of for every branch in the phylogeny. We choose q and j according to the largest branch length t. Because , we calculate all necessary and once and cache these in memory throughout the likelihood calculation. Calculating once is more efficient than setting and applying the Padé approximation directly. Instead, we compute , and finally for each value of t. We found this matrix exponentiation algorithm is approximately 1.5 times faster than the Taylor series approximation suggested in phylogenetics (Yang 2014), albeit using more memory to store the precalculated matrix powers.
Simulated Data
To test the specificity and sensitivity of the LRT for PPS, we simulated sites on a balanced 512-taxa tree with branch lengths equal to 0.0125 neutral substitutions per site (Tamuri et al. 2014). We simulated sites under a null model with no PPS (), and under the alternative model with PPS () with three strengths of selection , and with 1,000 sites simulated under each model setup. Following Tamuri et al. (2014), amino acid fitnesses for each site were sampled from a bimodal normal distribution with ten randomly selected amino acids chosen to have and the remaining amino acids to have . This simulation setup was chosen because it leads to simulated data that captures two important features seen in real data: 1) A bimodal distribution of selection coefficients among mutations, and 2) a sharp distribution of amino acid preferences among sites (Tamuri et al. 2012, 2014).
Simulated data were then analyzed with the swMutSel software to estimate model parameters. The branch lengths and mutation parameters were fixed to their true values (k = 2, ) throughout the analysis and only the sitewise fitnesses and diversifying selection parameters were estimated. For each simulation setup, we calculated the MPLE and the LRT as described above, using N = 100 replicates in Cox’s procedure. In all analyses, the Dirichlet penalty on has , and three strengths of penalty on Zk were tested, {0.01, 0.5, 1.0}.
Using the LRT results, we determined the false-positive and false-negative rates. The false-positive rate is calculated by determining the number of tests that incorrectly rejected the null hypothesis (Zk = 0). The true-positive rate is calculated from the number of tests that correctly rejected the null hypothesis ().
Real Sequence Data
We downloaded 3,120 HA protein-coding sequences of human influenza H1 viruses (excluding 2009 pandemic-H1N1 and partial sequences) from the NIAID Influenza Research Database (Squires et al. 2012); we downloaded 3,490 RuBisCO eudicotyledon sequences from a previous study (Stamatakis et al. 2010); and we downloaded CYTB genes of placental mammals from NCBI RefSeq (O’Leary et al. 2016) mitochondria genomes. We reduced the HA and RuBisCO data sets to 466 and 478 sequences respectively by using CD-HIT (Fu et al. 2012) with clustering thresholds of 99.3% and 96% of amino acid sequence identity. The CYTB data were reduced to 418 sequences by keeping one sequence per mammal genus. Sequences were aligned using PRANK (Loytynoja and Goldman 2005), and the alignments used to estimate tree topologies with RAxML under the GTRCAT model (Stamatakis 2014). Because the swMutSel-PPS model is irreversible, trees must be rooted. Thus, outgroups were used to root the trees: avian influenza (HA), monocotyledons (RuBisCO), and monotremes (CYTB). Outgroups were removed and analyses carried out on the rooted ingroup tree (for the PPS model), and the unrooted ingroup tree (for the no PPS model). Sites with residues in fewer than 50 taxa were not analyzed. This corresponds to 31, 23, and 27 sites in the rbcL, HA, and CYTB alignments respectively. We note Zk is not identifiable if a site is conserved for a single amino acid. Such conserved sites have the same likelihood under the H0 and H1 hypotheses. MPLE and LRT were carried out as described above using in the Dirichlet penalty. We note estimates of are different between the two models (Zk = 0 vs. Zk > 0, supplementary fig. S1, Supplementary Material online). Before carrying out the FDR correction to select candidate sites under PPS, we verify the distribution of P values is uniform (supplementary fig. S2, Supplementary Material online).
Supplementary Material
Supplementary data are available at Molecular Biology and Evolution online.
Acknowledgments
We thank Nick Goldman for valuable comments in the design of the LRT test. M.d.R. is supported by Biotechnology and Biological Sciences Research Council (BBSRC, UK) award BB/T01282X/1.
Data Availability
Alignments, phylogenies, and analysis output are available in FigShare, doi: 10.6084/m9.figshare.14637765.v1. The mutation–selection PPS codon substitution model is implemented in the swMutSel computer program available at github.com/tamuri.