Abstract

XRATE implements algorithms for comparative annotation, ancestral reconstruction, evolutionary rate estimation, and simulation. Its modeling repertoire includes phylogenetic stochastic context–free grammars and phylo-hidden Markov models. Following earlier tests of XRATE as a machine-learning tool suitable for alignment annotation, we now report the first tests of XRATE as a precise quantitative instrument for estimating evolutionary rates. We implement a codon model similar to that of Goldman and Yang (1994) (A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol Biol Evol 11: 725–736) and show that XRATE's parameter estimates are consistent with those of PAML. To demonstrate its utility, we apply the model to measure the difference in selective strength (ω) between intracellular and secreted regions of type I transmembrane proteins. In 215 of 303 instances, a complex model with individual ω for each region provides a better fit to the data than the simpler single ω value model. Secreted portions of type I transmembrane proteins show an elevation in ω similar to that seen for secreted protein genes. Less stringent purifying selection is thus a general property of the extracellular milieu, rather than being specific to only soluble and secreted proteins.

Introduction

Evolutionary theory lies at the core of biology, and stochastic evolutionary models form the core of bioinformatics theory. Phylo-grammars (Klosterman et al. 2006) generalize hidden Markov models (HMMs, Durbin et al. 1998) and phylogenetic models of sequence evolution into a single class of models that are applicable to phylogenetic reconstruction, ancestral sequence reconstruction, genome annotation, and substitution rate measurement, among others.

XRATE is a toolset for implementing and estimating phylo-grammars. An XRATE user supplies a phylo-grammar, written in a customizable format, that specifies the underlying structure of the evolutionary model. XRATE's fast, generic Expectation Maximization (EM) algorithm then fits this model to training data. Model parameters can be analyzed and interpreted; the model can then be used further to discover similar features in new alignments, to reconstruct probable ancestral sequences, or to simulate other alignments of comparable structure. A single framework for these applications offers great consistency: For example, a gene predictor can be prototyped and trained on real data, its substitution rates inspected, and its false-positive rate estimated by direct simulation, all using one program. XRATE can be used both from the command line for large-scale analyses and through a web interface that assists in model building, fitting, and analysis (http://harmony.biowiki.org/xrei).

Due to its general nature, XRATE is able to implement models that are more usually provided by tools specialized for particular applications: comparative exon finding as in EXONIPHY (Siepel and Haussler 2004); detection of conserved elements as in PHASTCONS (Siepel et al. 2005); detection of local lineage-specific accelerations as in DLESS (Pollard et al. 2006); and RNA structure prediction as in PFOLD (Knudsen and Hein 1999). The specialization of each of these tools for a particular purpose limits its ability to be extended for use in other, related studies. Tools such as PAML (Yang 2007), PAUP (Swofford 2002), and the general-purpose HYPHY (Pond et al. 2005) can estimate rates, trees, and ancestral states for a broad range of point substitution models but not context-dependent substitution models or phylo-HMMs. By providing a uniform interface to all these models and a documented model format, XRATE allows the user to experiment much more widely.

Our previous paper benchmarked XRATE as a machine-learning tool for predictive annotation (Klosterman et al. 2006). Here, we evaluate the performance of XRATE, not as an annotation tool but as a rate measurement tool. We implement the Goldman and Yang model (Goldman and Yang 1994) of nonsynonymous–synonymous codon substitution in XRATE and, using simulated data, directly compare its performance with that of PAML, a widely used software implementing this model. The comparable performance of XRATE and PAML provides a robust foundation for building models that are more powerful, and where the rate parameters themselves are of direct biological interest.

To demonstrate XRATE's flexibility, we have used it to address the biological question of whether a gene's evolutionary rate is influenced by the intra- or extracellular location in which its protein is expressed. It has previously been observed that soluble secreted proteins tend to be associated with increased ω values relative to cytoplasmic proteins (Winter et al. 2004; Julenius and Pedersen 2006), with ω = dN/dS, the nonsynonymous (dN) to synonymous (dS) substitution rate ratio. These elevated ω values are interpreted as resulting from reduced purifying selection or enhanced positive selection on secreted protein sequences. We were interested in investigating whether elevated ω values reflect a tendency that is specific to secreted, soluble proteins, perhaps because of the distinctive biology of these molecules, or whether elevated ω values are a more general trend applicable to any protein region that is exposed to the extracellular milieu.

Type I membrane proteins provide an ideal opportunity to distinguish between these two hypotheses. These proteins consist of a single transmembrane segment that separates the secreted N-terminal region from the intracellular C-terminal region. If elevated ω is a property of the extracellular environment, we might expect a tendency for elevated ω values within these proteins’ secreted N-terminal regions relative to their nonsecreted C-terminal regions. Here, we use XRATE to assess whether a difference in ω values can be discerned between these two distinct regions within type I membrane proteins. This analysis exploits the ability of XRATE to estimate single values of the synonymous substitution rate per site dS and the transition–transversion rate ratio κ shared across both parts of the protein, while simultaneously estimating different nonsynonymous substitution rates per site dN for each of the two regions.

Methods

Rate Estimation

The “phylo-EM” algorithm for continuous-time Markov chains on trees is an efficient and accurate technique for obtaining maximum likelihood estimates of parameters for diverse molecular substitution models. The algorithm, a specialization of the general EM algorithm (Dempster et al. 1977), was described initially for the general reversible rate matrix by Holmes and Rubin (2002). A more comprehensive and statistically rigorous presentation is that of Hobolth and Jensen (2005) that describes how the algorithm can be applied to parametric models commonly used in molecular evolution, such as the HKY85 model (Hasegawa et al. 1985).

The XRATE program (Klosterman et al. 2006) implements a combination of the phylo-EM algorithm with the Baum–Welch and Inside–Outside algorithms (Durbin et al. 1998), which are other specializations of the EM algorithm used for training HMMs and stochastic context–free grammars (SCFGs). The resulting probabilistic models, which combine genetic features (exons, introns, foldback structures, etc.) with phylogenetic trees, have been called “phylo-HMMs” and “phylo-SCFGs.”

The codon models of Goldman and Yang (1994) implemented in the software package PAML (Yang 2007) have frequently been applied to the analysis of protein-coding sequence. XRATE is able to implement this class of models using the codon substitution rate matrix Q = {qij}, with the substitution rate qij from codon i to codon j given as

\[
q_{ij} =
\begin{cases}
0 & \text{if } i \text{ and } j \text{ differ at more than one position},\\
\pi_j R_s R_{ti} & \text{for a synonymous transition},\\
\pi_j R_s R_{tv} & \text{for a synonymous transversion},\\
\pi_j R_n R_{ti} & \text{for a nonsynonymous transition},\\
\pi_j R_n R_{tv} & \text{for a nonsynonymous transversion}.
\end{cases}
\]

The parameters Rs and Rn are the synonymous and nonsynonymous substitution rates per codon, respectively, and Rti and Rtv are the transition and transversion rates per codon, respectively. The codon equilibrium frequency of codon j (πj) can be estimated from the data or fixed at the observed codon frequencies. The nonsynonymous and synonymous substitution rates per site, dN and dS, their ratio, ω, and the transition–transversion rate ratio, κ, are determined from Q as described in Goldman and Yang (1994).
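
As an illustration of how such a rate matrix can be assembled, the following Python sketch assumes that each off-diagonal rate is the product of the target codon frequency πj, a selection term (Rs or Rn), and a mutation term (Rti or Rtv), and that codons differing at more than one position have rate zero. The standard genetic code is assumed, and all function and variable names are hypothetical; this is not code from XRATE or PAML.

```python
from itertools import product

# Standard genetic code in TCAG order (an assumption of this sketch).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
SENSE_CODONS = [c for c, aa in CODON_TABLE.items() if aa != "*"]
PURINES = {"A", "G"}


def is_transition(a, b):
    """A transition stays within purines (A, G) or within pyrimidines (C, T)."""
    return (a in PURINES) == (b in PURINES)


def codon_rate_matrix(pi, Rs, Rn, Rti, Rtv):
    """Build a sparse 61 x 61 rate matrix as a dict of dicts.

    pi maps each sense codon to its equilibrium frequency. Rates between
    codons that differ at more than one position are zero and are omitted.
    """
    Q = {i: {} for i in SENSE_CODONS}
    for i in SENSE_CODONS:
        for j in SENSE_CODONS:
            diffs = [(a, b) for a, b in zip(i, j) if a != b]
            if len(diffs) != 1:
                continue  # identical codons or multiple substitutions
            a, b = diffs[0]
            selection = Rs if CODON_TABLE[i] == CODON_TABLE[j] else Rn
            mutation = Rti if is_transition(a, b) else Rtv
            Q[i][j] = pi[j] * selection * mutation
        # Diagonal entry makes each row sum to zero.
        Q[i][i] = -sum(Q[i].values())
    return Q


uniform_pi = {c: 1.0 / len(SENSE_CODONS) for c in SENSE_CODONS}
Q = codon_rate_matrix(uniform_pi, Rs=1.0, Rn=0.1, Rti=2.0, Rtv=1.0)
```

In this toy parameterization, lowering Rn relative to Rs corresponds to stronger purifying selection, and the ratio Rti/Rtv plays the role of the transition–transversion bias.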

We used XRATE's ability to combine the codon substitution model with a hidden Markov chain to model the difference of ω between the N-terminal and C-terminal segments of transmembrane proteins. Here, we assume that the synonymous substitution rate per site dS and the transition–transversion rate ratio, κ, are unaffected by changes in selective strength and the relevant parameters are thus shared between both parts of the protein. In the resultant “two_ω” model, Rs, Rti, and Rtv are shared between the N-terminal and C-terminal segments, whereas Rn is allowed to vary. This model was compared with the “one_ω” model, in which Rn is shared between the two segments as well. Such an analysis comes naturally to XRATE as it can fit both models directly using annotations in the multiple alignment that denote the locations of the N- and C-terminal regions. Reported ω values are maximum likelihood estimates as returned by the EM algorithm.

Benchmarking

One hundred sequence pairs of 1,000 codons were simulated using EVOLVER of the PAML package for various values of ω, κ, and dS. Each set of sequence pairs was then submitted to rate estimation by either PAML or XRATE. Codons were sampled from a uniform distribution where each codon appears at an identical frequency (1/61). For estimation, codon frequencies were described by the F3X4 model, which employs an empirical codon usage table derived from observed nucleotide frequencies in each of the three codon positions. Results are reported as relative errors: |(expected − observed)|/expected.
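
The two quantities described above can be sketched in a few lines of Python. The sketch assumes that F3X4 frequencies are the product of position-specific nucleotide frequencies, renormalized over the 61 sense codons of the standard genetic code, and it computes the relative error exactly as defined in the text. Function names are hypothetical and the code is not taken from PAML or XRATE.

```python
from collections import Counter
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code assumed
SENSE_CODONS = [
    "".join(c) for c in product("TCAG", repeat=3)
    if "".join(c) not in STOP_CODONS
]


def f3x4_frequencies(sequences):
    """Codon frequencies from nucleotide frequencies at codon positions 1-3."""
    counts = [Counter(), Counter(), Counter()]
    for seq in sequences:
        usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
        for start in range(0, usable, 3):
            for pos, base in enumerate(seq[start:start + 3]):
                if base in "TCAG":
                    counts[pos][base] += 1
    freqs = [
        {b: counts[pos][b] / sum(counts[pos].values()) for b in "TCAG"}
        for pos in range(3)
    ]
    raw = {c: freqs[0][c[0]] * freqs[1][c[1]] * freqs[2][c[2]]
           for c in SENSE_CODONS}
    total = sum(raw.values())  # renormalize after excluding stop codons
    return {c: value / total for c, value in raw.items()}


def relative_error(expected, observed):
    """|expected - observed| / expected, as used for the benchmark."""
    return abs(expected - observed) / expected


pi = f3x4_frequencies(["ATGGCC", "ATGGCA"])
err = relative_error(expected=0.1, observed=0.11)
```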

In a second experiment, we varied the sequence length between 50 and 5,000 codons with ω = 0.1, κ = 2.0, and dS = 1.0. Except for the parameter “mininc,” which determines the precision of estimates by specifying the minimum fractional increase in log-likelihood per optimization step before the EM algorithm is deemed to have converged, XRATE was used with default parameters. We found that the default value of mininc (10⁻⁴) was too high to achieve reliable estimates and thus instead set it to 10⁻⁶. We used default parameters in PAML for the Goldman and Yang model. Estimations that failed to converge were discarded.
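
To make the role of such a stopping criterion concrete, the following Python sketch tests whether the fractional increase in log-likelihood between two successive optimization steps has dropped below a threshold. The exact formula used inside XRATE is not given here, so the definition below (gain divided by the absolute value of the previous log-likelihood) is an assumption, and the function name is hypothetical.

```python
def has_converged(previous_loglike, current_loglike, mininc):
    """Return True when the fractional log-likelihood gain is below mininc.

    Assumes previous_loglike is nonzero; the fractional gain is measured
    relative to the magnitude of the previous log-likelihood.
    """
    gain = current_loglike - previous_loglike
    return gain / abs(previous_loglike) < mininc


# A step that improves the log-likelihood from -1000.00 to -999.99 is a
# fractional gain of about 1e-5: it stops the loose criterion but not the
# strict one.
stops_at_loose = has_converged(-1000.0, -999.99, mininc=1e-4)
stops_at_strict = has_converged(-1000.0, -999.99, mininc=1e-6)
```

Under this reading, a looser threshold can halt optimization while the parameters are still drifting, which would be consistent with the less reliable estimates seen at the default setting.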

To benchmark the “two_ω” model, we constructed pseudotransmembrane sequences. Each pseudotransmembrane sequence was constructed by concatenating the sequences of two individual EVOLVER simulations run with the same parameters (κ = 2.0, dS = 1.0) but with different lengths and ω values. The first segment was kept constant at 500 codons and ω = 0.1, whereas the length of the second segment varied from 10 to 500 codons and its ω varied from 0.1 to 0.2.

We compared the one_ω and two_ω models with log-likelihood ratio tests (Silvey 1970) that are widely used in computational biology (Goldman 1993; Felsenstein 2003). The two models are nested with one degree of freedom difference. The two_ω model was considered to fit the data better than the one_ω model if the statistic $2\Delta = 2(\ln L_{2\omega} - \ln L_{1\omega})$ exceeded the 95% point of a χ² distribution with one degree of freedom, with $L_{1\omega}$ and $L_{2\omega}$ being the estimated likelihoods of the one_ω and two_ω models, respectively.
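
The test can be sketched in Python using only the standard library. For one degree of freedom, the upper-tail probability of a χ² variable x equals erfc(√(x/2)), and the 95% point is approximately 3.841. The function name and the example log-likelihoods below are illustrative and are not values from this study.

```python
import math

CHI2_1DF_95 = 3.841458820694124  # 95% point of chi-square with 1 d.f.


def likelihood_ratio_test(loglike_one_omega, loglike_two_omega):
    """Compare nested models that differ by one free parameter.

    Returns the statistic 2 * (lnL_two - lnL_one), its chi-square (1 d.f.)
    P value, and whether the two_omega model fits significantly better.
    """
    statistic = 2.0 * (loglike_two_omega - loglike_one_omega)
    # Upper tail of chi-square with 1 d.f.; clamp tiny negative statistics.
    p_value = math.erfc(math.sqrt(max(statistic, 0.0) / 2.0))
    return statistic, p_value, statistic > CHI2_1DF_95


stat, p, significant = likelihood_ratio_test(-1000.0, -997.0)
```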

The true positive rate was defined as the proportion of significant log-likelihood ratio tests if ω varied between the N-terminal and C-terminal segments. Similarly, the false-positive rate was defined as the proportion of significant log-likelihood ratio tests for identical ω.

Transmembrane Protein Sets

We used a set of annotated transmembrane proteins from TOPDB (Tusnády et al. 2008), revision 1.0. TOPDB represents experimentally verified topology information gathered from literature and public databases. For our analysis, we selected 169 human type I transmembrane proteins spanning the plasma membrane and with an extracellular N-terminal domain. TOPDB provides the sequence coordinates of both the membrane-spanning helix and the signal peptide.

The transmembrane proteins were subsequently mapped to multiple alignments from OPTIC (Heger and Ponting 2008) of simple 1:1 ortholog sets from four mammals (Homo sapiens, Mus musculus, Canis familiaris, and Monodelphis domestica). Multiple alignment columns that represented the transmembrane segment or the signal peptide, or that contained gaps, were removed. Multiple alignments containing fewer than 100 codons in either terminus were discarded. Sixty-three multiple alignments fulfilled these criteria.
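
The column filtering described above might look as follows in Python. The sketch assumes that each alignment row is a list of codons and that every column carries one of four hypothetical labels ("SP" for signal peptide, "N", "TM", "C"); the data layout and names are illustrative and do not reflect the OPTIC or TOPDB file formats.

```python
def filter_codon_columns(alignment, region_labels):
    """Keep gap-free codon columns from the N- and C-terminal regions.

    alignment: list of rows (one per species), each a list of codons.
    region_labels: one label per column, "SP", "N", "TM", or "C".
    Returns a dict mapping "N" and "C" to lists of retained columns.
    """
    kept = {"N": [], "C": []}
    for col, label in enumerate(region_labels):
        if label not in kept:
            continue  # drop signal peptide and transmembrane columns
        codons = [row[col] for row in alignment]
        if any("-" in codon for codon in codons):
            continue  # drop columns containing gaps
        kept[label].append(codons)
    return kept


def passes_length_filter(kept, min_codons=100):
    """Require at least min_codons retained columns in each terminus."""
    return all(len(columns) >= min_codons for columns in kept.values())


toy_alignment = [
    ["ATG", "GCC", "---", "CTG", "TTT"],
    ["ATG", "GCA", "AAA", "CTC", "TTC"],
]
toy_labels = ["SP", "N", "N", "TM", "C"]
toy_kept = filter_codon_columns(toy_alignment, toy_labels)
```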

A second, slightly larger set was obtained by extracting proteins annotated as “Single-pass type I membrane protein” in UniprotKB/Swissprot (The UniProt Consortium 2008) and containing both a signal peptide and a single transmembrane segment. In this set, 303 multiple alignments were available for rate estimation. However, the presence and location of signal peptide and transmembrane helix in UniprotKB annotations is uncertain as these are mostly inferred from homologs or have been predicted computationally.

Type II transmembrane proteins that have an intracellular N-terminal region and an extracellular C-terminal region could provide a complementary test. We attempted to build a reference set similar to type I transmembrane proteins but obtained insufficient data with at most nine multiple alignments.

Sets of secreted and cytoplasmic proteins were obtained from UniprotKB/Swissprot filtering for species (H. sapiens). Proteins with isoforms in the mitochondrion or with transmembrane segments were discarded. The set of secreted proteins contained the word “secreted” and the set of cytoplasmic proteins contained the word “cytoplasmic” in the subcellular location annotation of a UniprotKB entry. Proteins present in both sets due to isoforms were removed. Multiple alignments were built in the same way as for transmembrane proteins by mapping to simple 1:1 orthology sets from four mammals (see above). We obtained 740 and 954 multiple alignments for secreted and cytoplasmic proteins, respectively.

Implementation

XRATE is part of the DART package and is implemented in C++. We have extended XRATE with a library of Python modules that allow transparent model creation, execution, and result parsing. Source code is freely available under an open source license from http://biowiki.org/DART. Sample grammars of popular models are part of the distribution or can be obtained from the DART web site (http://biowiki.org/XrateGrammars). The grammars used in this work have been added as Supplemental Information (Supplementary Material online).

Results

Rate Estimation

XRATE estimated dS, ω, and κ with an accuracy equivalent to that of PAML (fig. 1) for a typical range of biological sequence variation (dS (0.1–2.0), ω (0.05–1.4), and κ (0.5–4.0)) using simulated sequences of 1,000 codons. The average relative error was identical between XRATE and PAML and was less than 10% for almost the entire parameter range (fig. 2). The error increased for very large ω (>1), small dS (<0.2), and small κ (<1.0). At large dS (2.0), rate estimation is possible with both methods if ω is small but fails if ω is 1.0 or greater. Maximum likelihood estimates of PAML correspond to previous benchmarks (Yang and Nielsen 2000). The complete benchmark is available as Supplementary Information (Supplementary Material online). Equivalent results were obtained when nonuniform codon frequencies were used (data not shown).

FIG. 1.—XRATE estimates evolutionary rates with similar accuracy to CODONML. Shown are mean estimates of dS (left) and ω (right) for CODONML (top) and XRATE (bottom) from 100 randomly generated pairwise alignments produced by EVOLVER (κ = 2.0, 1,000 codons). Error bars indicate 1 SD. Diagonal lines represent a perfect estimator.

FIG. 2.—XRATE estimates evolutionary rates with similar accuracy to CODONML. Plot of relative errors in CODONML and XRATE. The three panels show the mean, maximum, and SD of relative errors. Each panel displays the results of the two methods side by side in a series of blocks. Each block is a set of experiments for a fixed κ and varying ω (x-axis) and dS (y-axis). The color indicates the magnitude of the relative error at a given set of parameter values. Each data point represents the results of 100 pairwise alignments randomly generated by EVOLVER (1,000 codons) for each combination of parameters.

Both programs required the same amount of data to make accurate estimates (fig. 3): At least 700 codons are necessary to achieve an average error of less than 10% in all three parameters dS, ω, and κ. As expected from their reduced information content, the error increases with shorter sequences.

FIG. 3.—Mean relative error of estimates of dS, κ, and ω varies with sequence length. Shown are estimates of 100 simulated sequences for ω = 0.1, κ = 2.0, and dS = 1.0.

We tested the capability of XRATE to detect differences in ω between two segments of a protein evolving with different rates. To this end, we simulated sequences with two regimes of ω and employed the log-likelihood ratio test between one_ω and two_ω models.

The true positive rate depended on both sequence length and the magnitude of the difference (fig. 4): More data are required to detect smaller changes reliably. At a length of 100 codons, the true positive rate for detecting a difference between ωN = 0.13 and ωC = 0.10 was 33%. We estimate the false-positive rate to be at most 14%, higher than the 5% type I error rate expected for the log-likelihood ratio test.

FIG. 4.—True and false-positive rates of detecting differences in ωN and ωC vary with sequence length and Δω = ωN − ωC. Shown are the proportions of passed log-likelihood ratio tests with varying length and Δω in experiments of 100 simulated sequences. Each simulated sequence consists of a segment of length 500 codons, ωN = 0.1, κ = 2.0, and dS = 1.0 and a segment of the same κ and dS, but varying length and ωC. The first column indicates the false-positive rate of the test as the proportion of significant differences detected when Δω = 0. The other columns indicate true positive rates as the proportion of significant differences when Δω > 0.

Transmembrane Proteins

The two_ω model was found to provide a better fit to the data than the one_ω model for 47 of the 63 TOPDB multiple alignments (log-likelihood ratio test, P < 0.05), indicating that selection often acts differentially between the N- and C-terminal regions of type I transmembrane proteins. The distribution of ω for extracellular N-terminal regions (ωN) is shifted toward larger ω values than for intracellular C-terminal regions (ωC; fig. 5A). The difference between ωN and ωC values, Δω = ωN − ωC, for single genes tends to be small (fig. 5B, mean = 0.075, median = 0.042, standard deviation, SD = 0.13) yet statistically significant (Wilcoxon one-sided signed rank test, P < 10⁻⁵) and comparable with the difference in ω between secreted and cytoplasmic protein genes (difference of means = 0.049, difference of medians = 0.049).

FIG. 5.—The secreted portions of type I transmembrane proteins show the same shift toward higher ω values as secreted and soluble proteins. (A) Cumulative distributions of ω values for N- and C-terminal regions of 63 type I transmembrane proteins and the estimated ω values for soluble secreted (ωE, n = 740) and cytoplasmic (ωI, n = 954) proteins. The latter two sets were derived in a similar fashion to type I transmembrane proteins. The distributions of ω for soluble secreted and intracellular protein genes are significantly different (Kolmogorov–Smirnov test, P < 10⁻⁵). (B) Cumulative distribution of the difference in ω between the N-terminal and C-terminal regions for type I transmembrane proteins.

We confirmed the statistical significance of the 0.042 median difference using 63 shuffled TOPDB multiple alignments. The shuffling procedure randomly permuted columns but kept codons intact and permitted columns to switch between N- and C-terminal regions. In these shuffled alignments, the median absolute difference |Δω| = |ωN − ωC| is 0.012 (mean = 0.016, SD = 0.011), more than 3-fold smaller than the observed difference of 0.042.
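
A minimal Python sketch of such a shuffling procedure is given below. It assumes that each alignment row is stored as a list of codons, so that permuting column indices keeps every codon intact, and that the N- and C-terminal region boundaries stay at fixed column positions, so that shuffled columns can move between regions. The function name and data layout are hypothetical.

```python
import random


def shuffle_codon_columns(alignment, seed=None):
    """Permute codon columns identically across all rows of an alignment.

    alignment: list of rows (one per species), each a list of codons.
    The same permutation is applied to every row, so each column of
    aligned codons is preserved as a unit.
    """
    rng = random.Random(seed)
    order = list(range(len(alignment[0])))
    rng.shuffle(order)
    return [[row[i] for i in order] for row in alignment]


toy_alignment = [
    ["ATG", "GCC", "AAA", "CTG", "TTT", "GGA"],
    ["ATG", "GCA", "AAG", "CTC", "TTC", "GGT"],
]
shuffled = shuffle_codon_columns(toy_alignment, seed=1)
```

Re-estimating ωN and ωC on many such shuffled alignments gives a null distribution of |ωN − ωC| against which the observed difference can be compared.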

To assess the biological significance of Δω, we compared the median ωN and ωC values with the distribution of ω values computed for 15,158 human–mouse 1:1 orthologous genes. The median ωN (0.18) and median ωC (0.08) lie at the 76th and 44th percentiles, respectively, of the human–mouse ω distribution. Thus, half of all N-terminal regions show as low a constraint as the roughly 25% least constrained protein-coding genes, whereas half of all C-terminal regions are as constrained as the 44% most constrained protein-coding genes.
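
The percentile comparison can be illustrated with a short Python sketch. It assumes that the percentile of a value is the percentage of reference values less than or equal to it; the reference list below is a toy example, not the human–mouse ω distribution, and the function name is hypothetical.

```python
from bisect import bisect_right


def percentile_rank(value, reference):
    """Percentage of reference values that are less than or equal to value."""
    ordered = sorted(reference)
    return 100.0 * bisect_right(ordered, value) / len(ordered)


toy_reference = [0.05, 0.10, 0.15, 0.20]
rank = percentile_rank(0.18, toy_reference)
```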

The set of type I transmembrane proteins based on Uniprot is expected to contain a greater number of annotation errors than the TOPDB set owing to the presence and location of signal peptide and transmembrane helix often being inferred, rather than being experimentally observed. Nevertheless, for this Uniprot set, the greater divergence of secreted protein sequence remained detectable. For 224 of 303 log-likelihood ratio tests, the two_ω model provided a significantly better fit to the data than the one_ω model. These 224 proteins showed significant differences between ωN and ωC of 0.028 (mean) and 0.016 (median) (SD = 0.13, Wilcoxon one-sided signed rank test, P < 10⁻⁴).

We conclude that a tendency for an increased ω value is, indeed, a property of the extracellular environment and not specific to secreted, soluble proteins.

Discussion

We have demonstrated the utility of XRATE in estimating evolutionary rates and have shown that the popular codon models of PAML lie within its capability. Our application to type I transmembrane proteins elaborates previous findings that rates tend to be higher in secreted proteins (Winter et al. 2004; Julenius and Pedersen 2006) by demonstrating a tendency for higher divergence in the secreted portion of a transmembrane molecule relative to its cytoplasmic region. We were able to exploit the ability of XRATE to estimate single values of dS and κ for each molecule, while simultaneously estimating dN and thus ω in different regions of this molecule.

The tendency for secreted sequences, whether soluble or membrane tethered, to evolve more rapidly than sequences localized to the cytoplasm may reflect a reduction in the stringency of purifying selection. This is because secreted molecules tend to be expressed in only limited numbers of tissues (Winter et al. 2004) and therefore tend to accumulate deleterious fixed mutations at greater rates when compared with widely expressed housekeeping genes, whose mutations are more likely to be lethal or substantially deleterious. Nevertheless, because extracellular molecules are more likely to have confronted, or been susceptible to subversion by, pathogens and parasites, it remains possible that positive selection of amino acid changes has contributed to the elevation of ω values we observed for extracellular sequence regions.

Here, we have only introduced the simplest models of PAML, but more complex models that vary rates between branches and between sites are equally possible.

Both XRATE and PAML employ optimization algorithms that approximate the global maximum likelihood by local gradient ascent. The accuracy of an estimate is thus also dependent on the stopping criterion. We obtained high accuracy only after decreasing the XRATE parameter mininc to 10⁻⁶ but have not investigated whether PAML's accuracy could be optimized in a similar fashion.

The flexibility of XRATE comes at a computational price. XRATE is, for equivalent models and on the same hardware, slower than the specialist PAML by a factor of 130 or more. Nevertheless, recent large increases in computational power ameliorate this price and place the extensibility of XRATE within reach of most bioinformaticians and evolutionary biologists.

The benchmarks described here build confidence in the accuracy of XRATE's quantitative estimates of evolutionary rate parameters. This also helps establish a basis for reusing models as components of more complex models or using the same models for different purposes. For example, the empirical codon model, estimated by Kosiol et al. (2007) in order to analyze rates of dinucleotide and trinucleotide substitutions, has since been used to reconstruct ancestral exons and is also a component of more elaborate models for 1) pinpointing loss-of-selection in pseudogene formation and 2) simulating realistic alignments of genome sequence.

Our experience has been that significantly more accurate rate estimates are required of a molecular evolution tool than of a gene-finding or alignment tool. Therefore, this is the most rigorous benchmark of XRATE that has yet been performed, building new confidence in the goal of model integration for sequence analysis. XRATE is demonstrated to be a reliable instrument for rate estimation, as well as alignment annotation. The results and methodology of this benchmark will strongly inform the design of the nascent XRATE grammar library.

Research was performed at MRC Functional Genomics Unit, University of Oxford, Department of Physiology, Anatomy and Genetics, South Parks Road, Oxford OX1 3QX, United Kingdom.

References

Dempster AP, Laird NM, Rubin DB. 1977. Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Series B (Methodol). 39:1–38.
Durbin R, Eddy SR, Krogh A, Mitchison G. 1998. Biological sequence analysis. Cambridge (MA): Cambridge University Press.
Felsenstein J. 2003. Inferring phylogenies. Sunderland (MA): Sinauer Associates.
Goldman N. 1993. Statistical tests of models of DNA substitution. J Mol Evol. 36:182–198.
Goldman N, Yang Z. 1994. A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol Biol Evol. 11:725–736.
Hasegawa M, Kishino H, Yano T. 1985. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J Mol Evol. 22:160–174.
Heger A, Ponting CP. 2008. OPTIC: orthologous and paralogous transcripts in clades. Nucleic Acids Res. 36:D267–D270.
Hobolth A, Jensen JL. 2005. Statistical inference in evolutionary models of DNA sequences via the EM algorithm. Stat Appl Genet Mol Biol. 4:Article 18.
Holmes I, Rubin GM. 2002. An expectation maximization algorithm for training hidden substitution models. J Mol Biol. 317:753–764.
Julenius K, Pedersen AG. 2006. Protein evolution is faster outside the cell. Mol Biol Evol. 23:2039–2048.
Klosterman P, Uzilov A, Bendana Y, Bradley R, Chao S, Kosiol C, Goldman N, Holmes I. 2006. XRate: a fast prototyping, training and annotation tool for phylo-grammars. BMC Bioinformatics. 7:428.
Knudsen B, Hein J. 1999. RNA secondary structure prediction using stochastic context-free grammars and evolutionary history. Bioinformatics. 15:446–454.
Kosiol C, Holmes I, Goldman N. 2007. An empirical codon model for protein sequence evolution. Mol Biol Evol. 24:1464–1479.
Pollard KS, Salama SR, Lambert N, et al. (16 co-authors). 2006. An RNA gene expressed during cortical development evolved rapidly in humans. Nature. 443:167–172.
Pond SLK, Frost SDW, Muse SV. 2005. HyPhy: hypothesis testing using phylogenies. Bioinformatics. 21:676–679.
Siepel A, Bejerano G, Pedersen JS, et al. (16 co-authors). 2005. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 15:1034–1050.
Siepel A, Haussler D. 2004. Computational identification of evolutionarily conserved exons. In: Bourne PE, Gusfield D, editors. Proceedings of the eighth annual international conference on research in computational molecular biology. [Internet] ACM, San Diego (CA) p. 177–186. Available from: http://portal.acm.org/citation.cfm?doid=974614.974638 [cited 2008 Aug 26]
Silvey SD. 1970. Statistical inference. London: Chapman and Hall.
Swofford DL. 2002. PAUP*: phylogenetic analysis using parsimony (*and other methods). Version 4. Sunderland (MA): Sinauer Associates.
The UniProt Consortium. 2008. The universal protein resource (UniProt). Nucleic Acids Res. 36:D190–D195.
Tusnády GE, Kalmár L, Simon I. 2008. TOPDB: topology data bank of transmembrane proteins. Nucleic Acids Res. 36:D234–D239.
Winter EE, Goodstadt L, Ponting CP. 2004. Elevated rates of protein secretion, evolution, and disease among tissue-specific genes. Genome Res. 14:54–61.
Yang Z. 2007. PAML 4: phylogenetic analysis by maximum likelihood. Mol Biol Evol. 24:1586–1591.
Yang Z, Nielsen R. 2000. Estimating synonymous and nonsynonymous substitution rates under realistic evolutionary models. Mol Biol Evol. 17:32–43.

Author notes

Jeffrey Thorne, Associate Editor
