Accurate Estimation of Gene Evolutionary Rates Using XRATE, with an Application to Transmembrane Proteins

Abstract

XRATE implements algorithms for comparative annotation, ancestral reconstruction, evolutionary rate estimation, and simulation. Its modeling repertoire includes phylogenetic stochastic context–free grammars and phylo-hidden Markov models. Following earlier tests of XRATE as a machine-learning tool suitable for alignment annotation, we now report the first tests of XRATE as a precise quantitative instrument for estimating evolutionary rates. We implement a codon model similar to that of Goldman and Yang (1994) (A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol Biol Evol 11: 725–736) and show that XRATE's parameter estimates are consistent with those of PAML. To demonstrate its utility, we apply the model to measure the difference in selective strength (ω) between intracellular and secreted regions of type I transmembrane proteins. In 224 of 303 instances, a complex model with an individual ω for each region provides a better fit to the data than the simpler model with a single ω value. Secreted portions of type I transmembrane proteins show an elevation in ω similar to that seen for secreted protein genes. Less stringent purifying selection is thus a general property of the extracellular milieu, rather than being specific to soluble, secreted proteins.

evolutionary rates, expectation maximization, transmembrane proteins, stochastic context–free grammar

Introduction

Evolutionary theory lies at the core of biology, and stochastic evolutionary models form the core of bioinformatics theory. Phylo-grammars (Klosterman et al. 2006) generalize hidden Markov models (HMMs, Durbin et al. 1998) and phylogenetic models of sequence evolution into a single class of models that are applicable to phylogenetic reconstruction, ancestral sequence reconstruction, genome annotation, and substitution rate measurement, among others.

XRATE is a toolset for implementing and estimating phylo-grammars. An XRATE user supplies a phylo-grammar, written in a customizable format, that specifies the underlying structure of the evolutionary model. XRATE's fast, generic Expectation Maximization (EM) algorithm then fits this model to training data. Model parameters can be analyzed and interpreted; the model can then be used further to discover similar features in new alignments, to reconstruct probable ancestral sequences, or to simulate other alignments of comparable structure. A single framework for these applications offers great consistency: For example, a gene predictor can be prototyped and trained on real data, its substitution rates inspected, and its false-positive rate estimated by direct simulation, all using one program. XRATE can be used both from the command line for large-scale analyses and through a web interface that assists in model building, fitting, and analysis (http://harmony.biowiki.org/xrei).

Due to its general nature, XRATE is able to implement models that are more usually provided by tools specialized for particular applications: comparative exon finding as in EXONIPHY (Siepel and Haussler 2004); detection of conserved elements as in PHASTCONS (Siepel et al. 2005); detection of local lineage-specific accelerations as in DLESS (Pollard et al. 2006); and RNA structure prediction as in PFOLD (Knudsen and Hein 1999). The specialization of each of these tools for a particular purpose limits its ability to be extended for use in other, related studies. Tools such as PAML (Yang 2007), PAUP (Swofford 2002), and the general-purpose HYPHY (Pond et al. 2005) can estimate rates, trees, and ancestral states for a broad range of point substitution models but not for context-dependent substitution models or phylo-HMMs. By providing a uniform interface to all these models and a documented model format, XRATE allows the user to experiment much more widely.

Our previous paper benchmarked XRATE as a machine-learning tool for predictive annotation (Klosterman et al. 2006). Here, we evaluate the performance of XRATE, not as an annotation tool but as a rate measurement tool. We implement the Goldman and Yang model (Goldman and Yang 1994) of nonsynonymous–synonymous codon substitution in XRATE and, using simulated data, directly compare its performance with that of PAML, a widely used software implementing this model. The comparable performance of XRATE and PAML provides a robust foundation for building models that are more powerful, and where the rate parameters themselves are of direct biological interest.

In order to demonstrate XRATE's flexibility, we have used it to address the biological question of whether a gene's evolutionary rate is influenced by the intra- or extracellular location in which its protein is expressed. It has previously been observed that soluble secreted proteins tend to be associated with increased ω values relative to cytoplasmic proteins (Winter et al. 2004; Julenius and Pedersen 2006), with ω = d_N/d_S, the nonsynonymous (d_N) to synonymous (d_S) substitution rate ratio. These elevated ω values are interpreted as resulting from reduced purifying selection or enhanced positive selection on secreted protein sequences. We were interested in investigating whether elevated ω values reflect a tendency that is specific to secreted, soluble proteins, perhaps because of the distinctive biology of these molecules, or whether elevated ω values are a more general trend applicable to any protein region that is exposed to the extracellular milieu.

Type I membrane proteins provide an ideal opportunity to distinguish between these two hypotheses. These proteins consist of a single transmembrane segment that separates the secreted N-terminal region from the intracellular C-terminal region. If elevated ω is a property of the extracellular environment, we might expect a tendency for elevated ω values within these proteins’ secreted N-terminal regions relative to their nonsecreted C-terminal regions. Here, we use XRATE to assess whether a difference in ω values can be discerned between these two distinct regions within type I membrane proteins. This analysis exploits the ability of XRATE to estimate single values of the synonymous substitution rate per site d_S and the transition–transversion rate ratio κ shared across both parts of the protein, while simultaneously estimating different nonsynonymous substitution rates per site d_N for each of the two regions.

Methods

Rate Estimation

The “phylo-EM” algorithm for continuous-time Markov chains on trees is an efficient and accurate technique for obtaining maximum likelihood estimates of parameters for diverse molecular substitution models. The algorithm, a specialization of the general EM algorithm (Dempster et al. 1977), was described initially for the general reversible rate matrix by Holmes and Rubin (2002). A more comprehensive and statistically rigorous presentation is that of Hobolth and Jensen (2005) that describes how the algorithm can be applied to parametric models commonly used in molecular evolution, such as the HKY85 model (Hasegawa et al. 1985).

The XRATE program (Klosterman et al. 2006) implements a combination of the phylo-EM algorithm with the Baum–Welch and Inside–Outside algorithms (Durbin et al. 1998), which are other specializations of the EM algorithm used for training HMMs and stochastic context–free grammars (SCFGs). The resulting probabilistic models, which combine genetic features (exons, introns, foldback structures, etc.) with phylogenetic trees, have been called “phylo-HMMs” and “phylo-SCFGs.”

The codon models of Goldman and Yang (1994) implemented in the software package PAML (Yang 2007) have frequently been applied to the analysis of protein-coding sequence. XRATE is able to implement this class of models using the codon substitution rate matrix Q = {q_ij}, with the substitution rate q_ij from codon i to codon j given as

The parameters R_s and R_n are the synonymous and nonsynonymous substitution rates per codon, respectively, and R_ti and R_tv are the transition and transversion rates per codon, respectively. The codon equilibrium frequency of codon j (π_j) can be estimated from the data or fixed at the observed codon frequencies. The nonsynonymous and synonymous substitution rates per site, d_N and d_S, their ratio, ω, and the transition–transversion rate ratio, κ, are determined from Q as described in Goldman and Yang (1994).
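The displayed equation for q_ij does not survive in this text. A form consistent with the parameter definitions above would be the following, under the assumption that the synonymous/nonsynonymous and transition/transversion rates enter multiplicatively, as in the usual Goldman and Yang style of parameterization. The multiplicative structure is an assumption made here for illustration and is not a quotation of the original formula.

```latex
q_{ij} =
\begin{cases}
0 & \text{if codons } i \text{ and } j \text{ differ at more than one position,}\\
\pi_j \, R_s \, R_{ti} & \text{for a synonymous transition,}\\
\pi_j \, R_s \, R_{tv} & \text{for a synonymous transversion,}\\
\pi_j \, R_n \, R_{ti} & \text{for a nonsynonymous transition,}\\
\pi_j \, R_n \, R_{tv} & \text{for a nonsynonymous transversion.}
\end{cases}
```

Under this assumed form, the ratios R_n/R_s and R_ti/R_tv play the roles that ω and κ play in the usual Goldman and Yang notation.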

We used XRATE's ability to combine the codon substitution model with a hidden Markov chain to model the difference of ω between the N-terminal and C-terminal segments of transmembrane proteins. Here, we assume that the synonymous substitution rate per site d_S and the transition–transversion rate ratio, κ, are unaffected by changes in selective strength and the relevant parameters are thus shared between both parts of the protein. In the resultant “two_ω” model, R_s, R_ti, and R_tv are shared between the N-terminal and C-terminal segments, whereas R_n is allowed to vary. This model was compared with the “one_ω” model, in which R_n is shared between the two segments as well. Such an analysis comes naturally to XRATE as it can fit both models directly using annotations in the multiple alignment that denote the locations of the N- and C-terminal regions. Reported ω values are maximum likelihood estimates as returned by the EM algorithm.
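The parameter sharing between the two regions can be illustrated with a small Python sketch. This is not the XRATE grammar used in the study (those grammars are in the Supplementary Material); it assumes a multiplicative rate matrix of the form pi_j × (R_s or R_n) × (R_ti or R_tv), uniform codon frequencies by default, and illustrative parameter values. The function name `codon_rate_matrix` is hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks stop codons.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
SENSE = [c for c in CODE if CODE[c] != "*"]  # the 61 sense codons
PURINES = {"A", "G"}


def codon_rate_matrix(r_s, r_n, r_ti, r_tv, pi=None):
    """Return Q as a dict {(i, j): rate} over sense codons.

    Assumed form (hypothetical): q_ij = pi_j * (r_s | r_n) * (r_ti | r_tv)
    for single-nucleotide changes, and 0 otherwise.
    """
    if pi is None:
        pi = {c: 1.0 / len(SENSE) for c in SENSE}
    q = {}
    for i in SENSE:
        row_sum = 0.0
        for j in SENSE:
            diffs = [(a, b) for a, b in zip(i, j) if a != b]
            if len(diffs) != 1:
                continue  # identical codons or multi-nucleotide changes
            a, b = diffs[0]
            transition = (a in PURINES) == (b in PURINES)
            synonymous = CODE[i] == CODE[j]
            rate = pi[j] * (r_s if synonymous else r_n) * (r_ti if transition else r_tv)
            q[(i, j)] = rate
            row_sum += rate
        q[(i, i)] = -row_sum  # rows of a rate matrix sum to zero
    return q


# "two_omega" style setup: R_s, R_ti, R_tv shared; R_n differs per region.
shared = dict(r_s=1.0, r_ti=2.0, r_tv=1.0)  # illustrative values only
q_n_terminal = codon_rate_matrix(r_n=0.18, **shared)
q_c_terminal = codon_rate_matrix(r_n=0.08, **shared)
```

In the one_ω counterpart, a single `r_n` would be passed for both regions, so the two matrices would coincide.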

Benchmarking

One hundred sequence pairs of 1,000 codons were simulated using EVOLVER of the PAML package for various values of ω, κ, and d_S. Each set of sequence pairs was then submitted to rate estimation by either PAML or XRATE. Codons were sampled from a uniform distribution where each codon appears at an identical frequency (1/61). For estimation, codon frequencies were described by the F3X4 model, which employs an empirical codon usage table derived from observed nucleotide frequencies in each of the three codon positions. Results are reported as relative errors: |(expected − observed)|/expected.
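As a rough illustration of the two quantities named above, the following Python sketch computes F3X4-style codon frequencies (products of position-specific nucleotide frequencies, renormalized over the 61 sense codons) and the relative error. The function names are hypothetical, and the counting details (for example, skipping codons with gaps or ambiguous bases) are assumptions rather than a description of PAML's implementation.

```python
from itertools import product

BASES = "TCAG"
STOPS = {"TAA", "TAG", "TGA"}
SENSE = ["".join(c) for c in product(BASES, repeat=3) if "".join(c) not in STOPS]


def f3x4_frequencies(codon_sequences):
    """Codon frequencies from nucleotide frequencies at the three codon positions."""
    counts = [{b: 0 for b in BASES} for _ in range(3)]
    for seq in codon_sequences:
        for start in range(0, len(seq) - len(seq) % 3, 3):
            codon = seq[start:start + 3]
            if all(b in BASES for b in codon):  # skip gaps / ambiguity codes
                for pos, base in enumerate(codon):
                    counts[pos][base] += 1
    freqs = []
    for pos_counts in counts:
        total = sum(pos_counts.values())
        freqs.append({b: n / total for b, n in pos_counts.items()})
    raw = {c: freqs[0][c[0]] * freqs[1][c[1]] * freqs[2][c[2]] for c in SENSE}
    norm = sum(raw.values())  # renormalize after excluding stop codons
    return {c: v / norm for c, v in raw.items()}


def relative_error(expected, observed):
    """|(expected - observed)| / expected, as used for the benchmark."""
    return abs(expected - observed) / expected


pi = f3x4_frequencies(["ATGGCC", "ATGGCA"])
```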

In a second experiment, we varied the sequence length between 50 and 5,000 codons with ω = 0.1, κ = 2.0, and d_S = 1.0. Except for the parameter “mininc,” which determines the precision of estimates by specifying the minimum fractional increase in log-likelihood per optimization step before the EM algorithm is deemed to have converged, XRATE was used with default parameters. We found that the default value of mininc (10⁻⁴) was too high to achieve reliable estimates and thus instead set it to 10⁻⁶. We used default parameters in PAML for the Goldman and Yang model. Estimations that failed to converge were discarded.
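The effect of a stopping criterion of this kind can be sketched in Python. The example below is not XRATE code: it assumes the fractional increase is computed as (new − old)/|old|, and it uses a toy update rule on a binomial log-likelihood as a stand-in for an EM step. All names (`fit_until_converged`, `toy_step`) are hypothetical.

```python
import math


def fit_until_converged(step, params, mininc=1e-6, max_iter=10000):
    """Repeat `step` until the fractional log-likelihood gain falls below mininc."""
    params, loglike = step(params)
    for _ in range(max_iter):
        new_params, new_loglike = step(params)
        gain = (new_loglike - loglike) / abs(loglike)  # assumed definition
        params, loglike = new_params, new_loglike
        if gain < mininc:
            break
    return params, loglike


def toy_step(p, successes=30, trials=100):
    """Toy stand-in for an EM step: move halfway toward the ML estimate."""
    p = p + 0.5 * (successes / trials - p)
    loglike = successes * math.log(p) + (trials - successes) * math.log(1.0 - p)
    return p, loglike


p_loose, _ = fit_until_converged(toy_step, 0.9, mininc=1e-4)
p_tight, _ = fit_until_converged(toy_step, 0.9, mininc=1e-6)
```

In this toy setting the tighter threshold stops closer to the maximum likelihood value of 0.3, which mirrors the motivation for lowering mininc in the benchmark.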

In order to benchmark the “two_ω” model, we constructed pseudotransmembrane sequences. Each pseudotransmembrane sequence was constructed by concatenating the sequences of two individual EVOLVER simulations with the same parameters κ = 2.0 and d_S = 1.0 but with different lengths and ω values. The first segment was kept constant at 500 codons and ω = 0.1, whereas the length of the second segment varied from 10 to 500 codons and its ω varied from 0.1 to 0.2.

We compared the one_ω and two_ω models with log-likelihood ratio tests (Silvey 1970) that are widely used in computational biology (Goldman 1993; Felsenstein 2003). The two models are nested with one degree of freedom difference. The two_ω model was considered to fit the data better than the one_ω model if the statistic 2Δ = 2(ln(L_2ω) − ln(L_1ω)) exceeded the 95% point of a χ² distribution with one degree of freedom, with L_1ω and L_2ω being the estimated likelihoods of the one_ω and two_ω models, respectively.
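The test described above can be written in a few lines of Python. This is a minimal sketch using only the standard library; the function name and the example log-likelihood values are hypothetical. For one degree of freedom the χ² tail probability has the closed form erfc(√(x/2)), which avoids any external dependency.

```python
import math

CHI2_95_1DF = 3.841  # 95% point of the chi-square distribution with 1 df


def likelihood_ratio_test(lnl_one, lnl_two):
    """Compare nested models given their maximized log-likelihoods.

    Returns the statistic 2*(lnL_two - lnL_one), its P value under a
    chi-square distribution with one degree of freedom, and whether the
    richer model is preferred at the 5% level.
    """
    stat = 2.0 * (lnl_two - lnl_one)
    # For 1 df, P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(max(stat, 0.0) / 2.0))
    return stat, p_value, stat > CHI2_95_1DF


# Hypothetical log-likelihoods for the one_omega and two_omega fits.
stat, p_value, significant = likelihood_ratio_test(-1234.5, -1230.0)
```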

The true positive rate was defined as the proportion of significant log-likelihood ratio tests if ω varied between the N-terminal and C-terminal segments. Similarly, the false-positive rate was defined as the proportion of significant log-likelihood ratio tests for identical ω.

Transmembrane Protein Sets

We used a set of annotated transmembrane proteins from TOPDB (Tusnády et al. 2008), revision 1.0. TOPDB represents experimentally verified topology information gathered from literature and public databases. For our analysis, we selected 169 human type I transmembrane proteins spanning the plasma membrane and with an extracellular N-terminal domain. TOPDB provides the sequence coordinates of both the membrane-spanning helix and the signal peptide.

The transmembrane proteins were subsequently mapped to multiple alignments from OPTIC (Heger and Ponting 2008) of simple 1:1 ortholog sets from four mammals (Homo sapiens, Mus musculus, Canis familiaris, and Monodelphis domestica). Multiple alignment columns that represented the transmembrane segment or the signal peptide, or that contained gaps, were removed. Multiple alignments containing fewer than 100 codons in either terminus were discarded. Sixty-three multiple alignments fulfilled these criteria.
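The column filtering can be sketched as follows. This is a hypothetical helper, not the pipeline used in the study: it assumes codon-aligned sequences, region boundaries given as 0-based codon columns of the alignment (end exclusive), and gap removal at the level of whole codon columns.

```python
def split_and_filter(alignment, signal_end, tm_start, tm_end, min_codons=100):
    """Split a codon alignment into N- and C-terminal regions.

    alignment: dict mapping sequence name -> codon-aligned nucleotide string.
    Columns in the signal peptide [0, signal_end) and the transmembrane
    segment [tm_start, tm_end) are dropped, as are columns with gaps.
    Returns None if either region keeps fewer than min_codons columns.
    """
    rows = list(alignment.values())
    n_codons = len(rows[0]) // 3
    kept = {"N": [], "C": []}
    for col in range(n_codons):
        if col < signal_end or tm_start <= col < tm_end:
            continue  # signal peptide or transmembrane segment
        codons = [row[3 * col:3 * col + 3] for row in rows]
        if any("-" in codon for codon in codons):
            continue  # drop columns containing gaps
        kept["N" if col < tm_start else "C"].append(col)
    if min(len(kept["N"]), len(kept["C"])) < min_codons:
        return None
    return {
        region: {name: "".join(seq[3 * c:3 * c + 3] for c in cols)
                 for name, seq in alignment.items()}
        for region, cols in kept.items()
    }


toy = {"human": "ATGGCTAAATTTGGGCCC", "mouse": "ATGGCA---TTCGGACCA"}
regions = split_and_filter(toy, signal_end=1, tm_start=3, tm_end=4, min_codons=1)
```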

A second, larger set was obtained by extracting proteins annotated as “Single-pass type I membrane protein” in UniprotKB/Swissprot (The UniProt Consortium 2008) and containing both a signal peptide and a single transmembrane segment. In this set, 303 multiple alignments were available for rate estimation. However, the presence and location of the signal peptide and transmembrane helix in UniprotKB annotations are uncertain, as these are mostly inferred from homologs or have been predicted computationally.

Type II transmembrane proteins that have an intracellular N-terminal region and an extracellular C-terminal region could provide a complementary test. We attempted to build a reference set similar to type I transmembrane proteins but obtained insufficient data with at most nine multiple alignments.

Sets of secreted and cytoplasmic proteins were obtained from UniprotKB/Swissprot by filtering for species (H. sapiens). Proteins with isoforms in the mitochondrion or with transmembrane segments were discarded. A protein was assigned to the secreted set if the subcellular location annotation of its UniprotKB entry contained the word “secreted” and to the cytoplasmic set if it contained the word “cytoplasmic.” Proteins present in both sets due to isoforms were removed. Multiple alignments were built in the same way as for transmembrane proteins by mapping to simple 1:1 orthology sets from four mammals (see above). We obtained 740 and 954 multiple alignments for secreted and cytoplasmic proteins, respectively.

Implementation

XRATE is part of the DART package and is implemented in C++. We have extended XRATE with a library of Python modules that allow transparent model creation, execution, and result parsing. Source code is freely available under an open source license from http://biowiki.org/DART. Sample grammars of popular models are part of the distribution or can be obtained from the DART web site (http://biowiki.org/XrateGrammars). The grammars used in this work have been added as Supplemental Information (Supplementary Material online).

Results

Rate Estimation

XRATE estimated d_S, ω, and κ with an accuracy equivalent to that of PAML (fig. 1) for a typical range of biological sequence variation (d_S (0.1–2.0), ω (0.05–1.4), and κ (0.5–4.0)) using simulated sequences of 1,000 codons. The average relative error was identical between XRATE and PAML and was less than 10% for almost the entire parameter range (fig. 2). The error increased for very large ω (>1), small d_S (<0.2), and small κ (<1.0). At large d_S = 2.0, rate estimation is possible with both methods if ω is small but fails if ω is 1.0 or greater. Maximum likelihood estimates of PAML correspond to previous benchmarks (Yang and Nielsen 2000). The complete benchmark is available as Supplementary Information (Supplementary Material online). Equivalent results were obtained when nonuniform codon frequencies were used (data not shown).

FIG. 1.—

XRATE estimates evolutionary rates with similar accuracy to CODONML. Shown are mean estimates of d_S (left) and ω (right) for CODONML (top) and XRATE (bottom) from 100 randomly generated pairwise alignments produced by EVOLVER (κ = 2.0, 1,000 codons). Error bars indicate 1SD. Diagonal lines represent a perfect estimator.


FIG. 2.—

XRATE estimates evolutionary rates with similar accuracy to CODONML. Plot of relative errors in CODONML and XRATE. The three panels show the mean, maximum, and SD of relative errors. Each panel displays the results of the two methods side by side in a series of blocks. Each block is a set of experiments for a fixed κ and varying ω (x-axis) and d_S (y-axis). The color indicates the magnitude of the relative error at a given set of parameter values. Each data point represents the results of 100 pairwise alignments randomly generated by EVOLVER (1,000 codons) for each combination of parameters.


Both programs required the same amount of data to make accurate estimates (fig. 3): At least 700 codons are necessary to achieve an average error of less than 10% in all three parameters d_S, ω, and κ. As expected from their reduced information content, the error increases with shorter sequences.

FIG. 3.—

Mean relative error of estimates of d_S, κ, and ω varies with sequence length. Shown are estimates from 100 simulated sequences for ω = 0.1, κ = 2.0, and d_S = 1.0.


We tested the capability of XRATE to detect differences in ω between two segments of a protein evolving with different rates. To this end, we simulated sequences with two regimes of ω and employed the log-likelihood ratio test between one_ω and two_ω models.

The true positive rate depended both on sequence length and on the magnitude of the difference (fig. 4): More data are required to detect smaller changes reliably. At a length of 100 codons, the true positive rate for detecting a difference between ω_N = 0.13 and ω_C = 0.10 was 33%. We estimate the false-positive rate to be at most 14%, which is higher than the 5% type I error rate expected from the log-likelihood ratio test.

FIG. 4.—

True and false-positive rates of detecting differences in ω_N and ω_C vary with sequence length and Δω = ω_N − ω_C. Shown are the proportions of passed log-likelihood ratio tests with varying length and Δω in experiments of 100 simulated sequences. Each simulated sequence consists of a segment of length 500 codons, ω_N = 0.1, κ = 2.0, and d_S = 1.0 and a segment of the same κ and d_S, but varying length and ω_C. The first column indicates the false-positive rate of the test as the proportion of significant differences detected when Δω = 0. The other columns indicate true positive rates as the proportion of significant differences when Δω > 0.


Transmembrane Proteins

The two_ω model was found to provide a better fit to the data than the one_ω model for 47 of the 63 TOPDB multiple alignments (log-likelihood ratio test, P < 0.05), indicating that selection often acts differentially between the N- and C-terminal regions of type I transmembrane proteins. The distribution of ω for extracellular N-terminal regions (ω_N) is shifted toward larger ω values than for intracellular C-terminal regions (ω_C; fig. 5A). The difference between ω_N and ω_C values, Δω = ω_N − ω_C, for single genes tends to be small (fig. 5B, mean = 0.075, median = 0.042, standard deviation, SD = 0.13) yet statistically significant (Wilcoxon one-sided signed rank test, P < 10⁻⁵) and comparable with the difference in ω between secreted and cytoplasmic protein genes (difference of means = 0.049, difference of medians = 0.049).

FIG. 5.—

The secreted portions of type I transmembrane proteins show the same shift toward higher ω values as secreted and soluble proteins. (A) Cumulative distributions of ω values for N- and C-terminal regions of 63 type I transmembrane proteins and the estimated ω values for soluble secreted (ω_E, n = 740) and cytoplasmic (ω_I, n = 954) proteins. The latter two sets were derived in a similar fashion to type I transmembrane proteins. The distributions of ω for soluble secreted and intracellular protein genes are significantly different (Kolmogorov–Smirnov-test, P < 10⁻⁵). (B) Cumulative distribution of the difference in ω between the N-terminal and C-terminal regions for type I transmembrane proteins.


We confirmed the statistical significance of a 0.042 median difference using 63 shuffled TOPDB multiple alignments. The shuffling procedure randomly permuted alignment columns while keeping codons intact and permitted columns to switch between N- and C-terminal regions. In these shuffled alignments, the median absolute difference Δ|ω| = |ω_N − ω_C| is 0.012 (mean = 0.016, SD = 0.011), which is more than 3-fold smaller than the observed difference of 0.042.
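A shuffling step of this kind might look as follows in Python. This is a sketch under stated assumptions, not the code used in the study: the alignment is assumed to be codon aligned, whole codon columns are permuted, and the region boundaries are assumed to stay at fixed column positions so that columns can move between regions. The function name is hypothetical.

```python
import random


def shuffle_codon_columns(alignment, rng=None):
    """Randomly permute codon columns of an alignment, keeping codons intact.

    alignment: dict mapping sequence name -> codon-aligned nucleotide string.
    The same permutation is applied to every sequence, so each column's
    codons stay together across species.
    """
    rng = rng or random.Random()
    names = list(alignment)
    n_codons = len(alignment[names[0]]) // 3
    order = list(range(n_codons))
    rng.shuffle(order)
    return {
        name: "".join(seq[3 * c:3 * c + 3] for c in order)
        for name, seq in alignment.items()
    }


toy = {"human": "ATGGCTAAATTTGGG", "mouse": "ATGGCAAAGTTCGGA"}
shuffled = shuffle_codon_columns(toy, rng=random.Random(0))
```

Fitting the two_ω model to such shuffled alignments, with the original region annotation left in place, gives a null distribution for |ω_N − ω_C|.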

In order to assess the biological significance of Δω, we compared the median ω_N and ω_C values with the distribution of ω values computed for 15,158 human–mouse 1:1 orthologous genes. The median ω_N (0.18) and median ω_C (0.08) lie at the 76th and 44th percentiles, respectively, of the human–mouse ω distribution. Thus, half of all N-terminal regions are as weakly constrained as roughly the least constrained quarter of protein-coding genes, whereas half of all C-terminal regions are as constrained as the most constrained 44% of protein-coding genes.
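Placing a value within a reference distribution, as done above for the median ω_N and ω_C, amounts to computing a percentile rank. The following is a minimal sketch; the function name is hypothetical, the "less than or equal" convention is an assumption, and the reference values are made up for illustration.

```python
from bisect import bisect_right


def percentile_rank(value, reference):
    """Percentage of reference values that are less than or equal to value."""
    ordered = sorted(reference)
    return 100.0 * bisect_right(ordered, value) / len(ordered)


# Made-up reference omega values, for illustration only.
rank = percentile_rank(0.18, [0.05, 0.1, 0.2, 0.4])  # 50.0
```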

The set of type I transmembrane proteins based on Uniprot is expected to contain a greater number of annotation errors than the TOPDB set owing to the presence and location of signal peptide and transmembrane helix often being inferred, rather than being experimentally observed. Nevertheless, for this Uniprot set, the greater divergence of secreted protein sequence remained detectable. For 224 of 303 log-likelihood ratio tests, the two_ω model provided a significantly better fit to the data than the one_ω model. These 224 proteins showed significant differences between ω_N and ω_C of 0.028 (mean) and 0.016 (median) (SD = 0.13, Wilcoxon one-sided signed rank test, P < 10⁻⁴).

We conclude that a tendency for an increased ω value is, indeed, a property of the extracellular environment and not specific to secreted, soluble proteins.

Discussion

We have demonstrated the utility of XRATE in estimating evolutionary rates and have shown that the popular codon models of PAML lie within its capability. Our application to type I transmembrane proteins elaborates previous findings that rates tend to be higher in secreted proteins (Winter et al. 2004; Julenius and Pedersen 2006) by demonstrating a tendency for higher divergence in the secreted portion of a transmembrane molecule relative to its cytoplasmic region. We were able to exploit the ability of XRATE to estimate single values of d_S and κ for each molecule, while simultaneously estimating d_N and thus ω in different regions of this molecule.

The tendency for secreted sequences, whether soluble or membrane tethered, to evolve more rapidly than sequences localized to the cytoplasm may reflect a reduction in the stringency of purifying selection. Secreted molecules tend to be expressed in only limited numbers of tissues (Winter et al. 2004) and therefore tend to accumulate fixed deleterious mutations at greater rates than widely expressed housekeeping genes, whose mutations are more likely to be lethal or substantially deleterious. Nevertheless, because extracellular molecules are more likely to have confronted, or been susceptible to subversion by, pathogens and parasites, it remains possible that positive selection of amino acid changes has contributed to the elevation of ω values we observed for extracellular sequence regions.

Here, we have only introduced the simplest models of PAML, but more complex models that vary rates between branches and between sites are equally possible.

Both XRATE and PAML employ optimization algorithms that approximate the global maximum likelihood by local gradient ascent. The accuracy of an estimate thus also depends on the stopping criterion. We obtained high accuracy only after decreasing XRATE's mininc parameter to 10⁻⁶ but have not investigated whether PAML's accuracy could be improved in a similar fashion.

The flexibility of XRATE comes at a computational price. XRATE is, for equivalent models and on the same hardware, slower than the specialist PAML by a factor of 130 or more. Nevertheless, recent large increases in computational power ameliorate this price and place the extensibility of XRATE within reach of most bioinformaticians and evolutionary biologists.

The benchmarks described here build confidence in the accuracy of XRATE's quantitative estimates of evolutionary rate parameters. This also helps establish a basis for reusing models as components of more complex models or using the same models for different purposes. For example, the empirical codon model, estimated by Kosiol et al. (2007) in order to analyze rates of dinucleotide and trinucleotide substitutions, has since been used to reconstruct ancestral exons and is also a component of more elaborate models for 1) pinpointing loss-of-selection in pseudogene formation and 2) simulating realistic alignments of genome sequence.

Our experience has been that significantly more accurate rate estimates are required of a molecular evolution tool than of a gene-finding or alignment tool. This is therefore the most rigorous benchmark of XRATE that has yet been performed, and it builds new confidence in the goal of model integration for sequence analysis. XRATE is demonstrated to be a reliable instrument for rate estimation as well as for alignment annotation. The results and methodology of this benchmark will strongly inform the design of the nascent XRATE grammar library.

Research was performed at the MRC Functional Genomics Unit, University of Oxford, Department of Physiology, Anatomy and Genetics, South Parks Road, Oxford OX1 3QX, United Kingdom.

References

Dempster, Laird, Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Series B (Methodol).
