Positive and Negative Selection on Noncoding DNA in Drosophila simulans

Abstract

There is now a wealth of evidence that some of the most important regions of the genome are found outside those that encode proteins, and noncoding regions of the genome have been shown to be subject to substantial levels of selective constraint, particularly in Drosophila. Recent work has suggested that these regions may also have been subject to the action of positive selection, with large fractions of noncoding divergence having been driven to fixation by adaptive evolution. However, this work has focused on Drosophila melanogaster, which is thought to have experienced a reduction in effective population size (N_e), and thus a reduction in the efficacy of selection, compared with its closest relative Drosophila simulans. Here, we examine patterns of evolution at several classes of noncoding DNA in D. simulans and find that all noncoding DNA is subject to the action of negative selection, indicated by reduced levels of polymorphism and divergence and a skew in the frequency spectrum toward rare variants. We find that the signature of negative selection on noncoding DNA and nonsynonymous sites is obscured to some extent by purifying selection acting on preferred to unpreferred synonymous codon mutations. We investigate the extent to which divergence in noncoding DNA is inferred to be the product of positive selection and to what extent these inferences depend on selection on synonymous sites and demography. Based on patterns of polymorphism and divergence for different classes of synonymous substitution, we find the divergence excess inferred in noncoding DNA and nonsynonymous sites in the D. simulans lineage difficult to reconcile with demographic explanations.

Keywords: Drosophila simulans, noncoding DNA, natural selection, adaptive evolution, McDonald–Kreitman test, codon usage bias

Introduction

Noncoding DNA makes up a large fraction of most eukaryotic genomes, and yet relatively little is known about the functional importance of sequences that are not translated into proteins. Recently, a number of multilocus and comparative genomics studies have examined patterns of molecular evolution in noncoding DNA in various Drosophila species, and these have revealed slower rates of evolution and higher levels of selective constraint in long (>80 base pairs [bp]) introns and intergenic sequences, when compared with synonymous sites in coding regions (Bergman and Kreitman 2001; Halligan et al. 2004; Kohn et al. 2004; Andolfatto 2005; Haddrill, Charlesworth, et al. 2005; Bachtrog and Andolfatto 2006; Halligan and Keightley 2006). This is assumed to be due to the presence of cis-regulatory elements (Casillas et al. 2007) or conserved RNA secondary structures, for which there is some direct experimental evidence (e.g., Stephan and Kirby 1993; Kirby et al. 1995; Leicht et al. 1995; Carlini et al. 2001; Chen and Stephan 2003; Bergman et al. 2005; Gallo et al. 2006).

Although these comparative genomics approaches show that divergence between species is reduced in some classes of noncoding DNA, indicating that these sequences are functionally constrained and thus subject to negative selection, this conclusion cannot be firmly supported without evidence from within-species patterns of variability to rule out mutation rate variation. Similarly, if noncoding sequences are functionally important, they are likely to be subject to the action of positive selection as well as negative selection, and the signatures of these different types of selection can only be distinguished using an approach that combines within-species polymorphism data with between-species measures of divergence (McDonald and Kreitman 1991). This type of approach has previously provided evidence that a considerable proportion of the protein-coding sequence divergence between species has been driven to fixation by positive selection (Fay et al. 2002; Smith and Eyre-Walker 2002; Sawyer et al. 2003).

Andolfatto (2005), expanding upon previous findings by Jenkins et al. (1995) and Kohn et al. (2004), examined patterns of molecular evolution in several classes of noncoding DNA (long introns, untranslated transcribed regions [UTRs], and intergenic regions), using within-species polymorphism data in Drosophila melanogaster and between-species divergence to the closely related Drosophila simulans. For all classes of noncoding sequence (compared with synonymous sites), Andolfatto (2005) found reduced levels of polymorphism and divergence, high selective constraint (ca. 40–70%), and a skew in the frequency spectrum of mutations toward rare variants, all of which indicate the action of negative selection. However, he also found a significant excess of between-species divergence relative to polymorphism (again compared with synonymous sites) for almost all classes of noncoding sequence, which is a signature of adaptive evolution. Using an extension of the McDonald–Kreitman approach (McDonald and Kreitman 1991; Fay et al. 2001), Andolfatto (2005) estimated that a substantial fraction of the divergence in noncoding regions between D. melanogaster and D. simulans was driven to fixation by positive selection (ca. 20% for intronic and intergenic sequences and 60% for UTRs). These results indicate that noncoding regions of the D. melanogaster genome are functionally significant and have been subject to the action of both positive and negative selection.

However, D. melanogaster may not be the most appropriate species for studying these types of patterns of molecular evolution because it may be unusual compared with other Drosophila species. Based on measures of nucleotide diversity at synonymous sites, D. melanogaster is thought to have experienced a reduction in effective population size (N_e) compared with its closest relative D. simulans (Aquadro et al. 1988; Akashi 1995, 1996; Moriyama and Powell 1996; Andolfatto 2001; Eyre-Walker et al. 2002). This difference in N_e predicts a reduction in the efficacy of selection in D. melanogaster compared with D. simulans and thus lower levels of adaptive evolution and higher rates of fixation of mildly deleterious mutations in the D. melanogaster lineage (Hill and Robertson 1966; Kimura 1983).

There is some evidence for a reduction in the efficacy of selection and thus a lower N_e in the D. melanogaster lineage compared with the D. simulans lineage. In particular, levels of codon usage bias are reduced, and proteins are longer in D. melanogaster relative to D. simulans (Akashi 1995, 1996). In addition, levels of amino acid (relative to synonymous) polymorphism are higher in D. melanogaster, consistent with a smaller N_e (Choudhary and Singh 1987; Aquadro et al. 1988; Moriyama and Powell 1996; Andolfatto 2001).

N_e is likely to be particularly important when selection is weak, such as selection for codon usage (Akashi 1995; Akashi and Schaeffer 1997; McVean and Vieira 2001), because a mutation with mild fitness effects in a large population could be effectively neutral in a smaller population. Given that selection on noncoding sites may be fairly weak compared with nonsynonymous sites (Haddrill et al. 2007), the difference in effective population size between D. melanogaster and D. simulans may have a substantial impact on patterns of evolution at noncoding sites. Of particular concern is that a lower population size, and thus a lower efficacy of selection, in D. melanogaster relative to D. simulans may have led to inaccurate estimates of adaptive evolution parameters in previous studies. With polymorphism data from D. simulans, we can ask whether previous inferences are robust to the choice of species in which polymorphism was surveyed. Further, with information from a suitable outgroup species (Drosophila yakuba), we can investigate lineage-specific changes and distinguish divergence that has accumulated due to a relaxation in selection from that accumulated due to adaptive evolution.

A recent study has examined genome-wide polymorphism and divergence data for D. simulans, finding evidence for the effects of both negative and positive selection (Begun et al. 2007). The scale of this analysis is impressive. However, the level of coverage in this study was, on average, only 3.9 individuals per locus. This small sample size prohibits analyses based on the frequencies of polymorphisms, which can be extremely useful in identifying signatures of negative and positive selection (Akashi 1999; Nielsen 2005). In addition, D. simulans populations are structured (Hamblin and Veuille 1999), and some are likely to have experienced recent bottleneck events associated with the expansion of the species out of Africa (Hamblin and Veuille 1999; Andolfatto 2001; Wall et al. 2002; Baudry et al. 2006). Thus, mixed population samples, or specifically those focusing on non-African populations, may not be ideal for making inferences about selection, given that patterns of variation may be dominated by demographic factors.

Here, we analyze polymorphism and divergence data for 67 loci (∼33 kb per individual in total). We surveyed polymorphism in a sample of 20 lines of D. simulans from a Madagascan population (the proposed geographic origin of the species; Dean and Ballard 2004), yielding considerable information about polymorphism frequencies. The loci are largely a subset of the loci examined by Andolfatto (2005) in D. melanogaster and include coding DNA and both intronic and UTR noncoding DNA. We use these data to look for signatures of both negative and positive selection in noncoding DNA in order to examine the significance of these processes in shaping patterns of molecular evolution in D. simulans and also to assess whether previous results for D. melanogaster are typical of the melanogaster subgroup.

Materials and Methods

Data Collection

We collected data for 21 coding regions (average surveyed length 675 bp), 22 UTRs (average surveyed length 382 bp), and 24 introns (average surveyed length 449 bp). A subset of these were surveyed by Andolfatto (2005) in D. melanogaster (19 coding, 20 UTRs, and 9 introns). All loci are X-linked genomic fragments that reside in regions of high recombination in D. simulans (Wall et al. 2002) and were surveyed in a sample of 20 D. simulans individuals from a Madagascan population (Dean and Ballard 2004). Further information about all 67 surveyed loci can be found in supplementary tables 1 and 2 (Supplementary Material online).

Briefly, sequence data were collected as follows. A single male fly was selected from each line and genomic DNA extracted using the Puregene DNA extraction kit. Polymerase chain reaction was used to amplify the appropriate genomic DNA fragment and then primers and unincorporated nucleotides were removed using exonuclease I and shrimp alkaline phosphatase. Fragments were directly sequenced on both strands using the Big Dye version 3.0 cycle sequencing kit (Applied Biosystems, Foster City, CA) and run on an ABI 3730 capillary sequencer. Sequence trace files were edited using Sequencher (Gene Codes Corporation, Ann Arbor, MI). The orthologous regions from D. melanogaster and D. yakuba were added to each alignment, using sequences downloaded from FlyBase (http://flybase.org/, Release 4.2) and the D. yakuba genome project (http://insects.eugenes.org/species/blast/), respectively, and aligned using MUSCLE (http://www.drive5.com/muscle) with adjustments to preserve reading frames. In some cases, regions that were particularly difficult to align were masked. This disproportionately affected introns and is expected to bias estimates of divergence downward. Details of these regions are given in supplementary table 3 (Supplementary Material online). The sequence data from this study have been submitted to GenBank under accession numbers EU744978–EU746317.

Analysis

Estimates of the number of synonymous and nonsynonymous sites, average pairwise diversity (π), and average pairwise divergence to D. melanogaster (D_xy), as well as counts of the number of polymorphisms (S), were obtained with a library of Perl scripts (Polymorphorama) written by P.A. and D.B. The number of nonsynonymous and synonymous sites was estimated using the Nei and Gojobori (1986) method. Average pairwise divergence (D_xy) estimates were corrected for multiple hits using either a Jukes–Cantor correction (Jukes and Cantor 1969) or, in the case of synonymous sites, the Kimura (1980) 2-parameter model. Multiply hit sites were included in all analyses, but insertion–deletion polymorphisms and polymorphic sites overlapping alignment gaps were excluded.
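The summary statistics described above can be illustrated with a minimal Python sketch. This is not the Polymorphorama code (which is written in Perl and is not reproduced here); the function names and the toy alignment are assumptions made for illustration, and the sketch ignores alignment gaps and does not separate synonymous from nonsynonymous sites.

```python
import math
from itertools import combinations


def pairwise_diversity(seqs):
    """Average pairwise differences per site (pi) among aligned, gap-free sequences."""
    n_sites = len(seqs[0])
    pairs = list(combinations(seqs, 2))
    diffs = sum(sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs)
    return diffs / (len(pairs) * n_sites)


def jukes_cantor(p):
    """Jukes-Cantor distance from the proportion of differing sites p (p < 0.75)."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def kimura_2p(ts, tv):
    """Kimura 2-parameter distance from the proportions of sites showing
    transitions (ts) and transversions (tv)."""
    return -0.5 * math.log((1.0 - 2.0 * ts - tv) * math.sqrt(1.0 - 2.0 * tv))


# Toy alignment (hypothetical data, three alleles of four sites each).
toy = ["ACGT", "ACGA", "ACTT"]
pi = pairwise_diversity(toy)        # 4 differences over 3 pairs x 4 sites
d_jc = jukes_cantor(0.10)           # correction applied to 10% raw divergence
d_k2p = kimura_2p(0.03, 0.06)       # equals the Jukes-Cantor value when tv = 2 * ts
```

Both corrections inflate the raw proportion of differences to account for multiple substitutions at the same site, and the Kimura model reduces to the Jukes–Cantor model when transversions are twice as frequent as transitions.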

For lineage-specific estimates of divergence, we reconstructed the sequence of the D. melanogaster–D. simulans ancestor (ANC), using D. yakuba as an outgroup, by maximum likelihood as implemented in the “codeml” (for coding regions, free ratio model [model = 1]) and “baseml” (for noncoding regions) programs of PAML (Yang 1997). We assigned codon usage states (with the most likely states given probabilities of 1), unpreferred (U) or preferred (P), to each codon according to the codon preference table of Andolfatto (2007), which is based on a genome-wide analysis of codon usage in D. melanogaster.
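The assignment of codon usage states can be sketched as follows. The codon table and the set of preferred codons below are toy, hypothetical values covering only two amino acids; they are not the preference table of Andolfatto (2007), and the function names are likewise assumptions. The sketch only shows how a synonymous change between an ancestral and a derived codon might be sorted into the classes P → U, U → P, P → P, and U → U.

```python
from collections import Counter

# Toy codon table restricted to two amino acids (CTN = Leu, GCN = Ala).
CODON_TO_AA = {
    "CTG": "Leu", "CTC": "Leu", "CTA": "Leu", "CTT": "Leu",
    "GCC": "Ala", "GCT": "Ala", "GCA": "Ala", "GCG": "Ala",
}
# Hypothetical preferred codons, chosen for illustration only.
PREFERRED = {"CTG", "GCC"}


def codon_state(codon):
    """Return 'P' for a preferred codon and 'U' for an unpreferred codon."""
    return "P" if codon in PREFERRED else "U"


def classify_synonymous_change(ancestral, derived):
    """Return 'PU', 'UP', 'PP', or 'UU' for a synonymous change, else None."""
    if ancestral == derived:
        return None
    aa_anc = CODON_TO_AA.get(ancestral)
    aa_der = CODON_TO_AA.get(derived)
    if aa_anc is None or aa_anc != aa_der:
        return None  # unknown codon or nonsynonymous change
    return codon_state(ancestral) + codon_state(derived)


# Hypothetical (ancestral, derived) codon pairs along one lineage.
changes = [("CTG", "CTA"), ("CTA", "CTG"), ("CTA", "CTT"), ("GCC", "GCC")]
counts = Counter(
    c for c in (classify_synonymous_change(a, d) for a, d in changes) if c
)
```

Separating the classes in this way makes it possible to restrict a neutral standard to changes that do not alter the codon usage state (P → P and U → U).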

For analyses comparing polymorphism and divergence, we pooled site classes across loci. Each of the putatively selected noncoding site classes was compared with a subset of synonymous sites (see below) as a putatively neutral standard. Using the McDonald–Kreitman approach (McDonald and Kreitman 1991), we compared the numbers of polymorphic and divergent sites in the putatively neutral and putatively selected classes using a Fisher's exact test. Following Andolfatto (2005), we used an extension of this approach (Fay et al. 2001) to estimate the proportion of divergence driven to fixation by positive selection (α) as α = 1 − (D_S P_X)/(D_X P_S), where the subscripts S and X denote the putatively neutral and putatively selected sequence classes, respectively, and D = Σ_{i=1}^{n} D_i and P = Σ_{i=1}^{n} P_i, where D_i and P_i are the number of divergent and polymorphic variants at locus i, respectively, and n is the number of loci in a particular sequence class. The number of divergent sites at a locus (D) was corrected for multiple hits using a Jukes–Cantor correction (Jukes and Cantor 1969). Confidence intervals (CIs) for α were estimated using a nonparametric bootstrap procedure, with resampling by site. This method has been shown to only slightly underestimate CIs, given estimates of recombination and nucleotide diversity in Drosophila (Andolfatto 2005a).
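The calculations described in this paragraph can be sketched in Python as below. This is an illustration under stated assumptions rather than the procedure actually used: all counts are hypothetical, the function names are invented for this example, each site is treated as divergent, polymorphic, or neither, and the bootstrap draws sites from the pooled class rather than from individual loci.

```python
import random
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    tail = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))
    return min(1.0, tail)


def alpha(d_s, p_s, d_x, p_x):
    """alpha = 1 - (D_S * P_X) / (D_X * P_S), using counts pooled across loci."""
    return 1.0 - (d_s * p_x) / (d_x * p_s)


def resample_counts(n_div, n_poly, n_sites, rng):
    """Resample sites with replacement and recount divergent/polymorphic sites."""
    weights = [n_div, n_poly, n_sites - n_div - n_poly]
    labels = rng.choices(["D", "P", "N"], weights=weights, k=n_sites)
    return labels.count("D"), labels.count("P")


def bootstrap_alpha_ci(neutral, selected, n_boot=1000, seed=1):
    """Percentile 95% CI for alpha; each class is (n_div, n_poly, n_sites)."""
    rng = random.Random(seed)
    reps = []
    for _ in range(n_boot):
        d_s, p_s = resample_counts(*neutral, rng)
        d_x, p_x = resample_counts(*selected, rng)
        if d_x > 0 and p_s > 0:  # skip replicates where alpha is undefined
            reps.append(alpha(d_s, p_s, d_x, p_x))
    reps.sort()
    return reps[int(0.025 * len(reps))], reps[int(0.975 * len(reps)) - 1]


# Hypothetical pooled counts: (divergent, polymorphic, total sites).
neutral = (40, 80, 2000)
selected = (300, 400, 8000)
p_value = fisher_exact_two_sided(neutral[0], neutral[1], selected[0], selected[1])
alpha_hat = alpha(neutral[0], neutral[1], selected[0], selected[1])
ci_low, ci_high = bootstrap_alpha_ci(neutral, selected, n_boot=200)
```

With these hypothetical counts, the ratio of divergence to polymorphism is 0.5 in the neutral class and 0.75 in the selected class, so α is one-third; a positive α indicates an excess of divergence in the selected class relative to the neutral expectation.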

For comparison, we also carried out these analyses using nonsynonymous sites as the putatively selected site class. In addition, for nonsynonymous sites, it was possible to compare estimates of α calculated using the summing across loci method above to those calculated using the methods of Smith and Eyre-Walker (2002), Bierne and Eyre-Walker (2004), and Welch (2006) to ensure that estimates from summing across loci were not markedly different from other methods (see supplementary results, Supplementary Material online).

Results

Reduced Polymorphism and Divergence in Noncoding Regions

We surveyed a total of ∼14 kb of coding sequence and ∼19 kb of noncoding sequence per individual. Polymorphism and divergence summaries for each sequence class are shown in table 1, using both D. melanogaster and ANC as an outgroup. Data for individual loci are presented in supplementary table 4 (Supplementary Material online).

Table 1

Polymorphism and Divergence in Coding and Noncoding DNA of Drosophila simulans

Sequence Class	No. Regions	Mean π^a	Mean D_xy^b	D^c	P^all (count)^d	P^all (MK test probability)^e	P¹ (count)^d,f	P¹ (MK test probability)^e,f
melanogaster outgroup
Synonymous	21	3.02	13.87	59^g	110^g	—	65^g	—
Nonsynonymous	21	0.19	1.24	123	158	0.025	50	<10⁻⁴
Noncoding	46	1.13	5.44	809	1,212	0.078	463	<10⁻⁴
Introns	24	1.29	6.13	505	804	0.188	312	0.001
UTRs	22	0.95	4.69	304	408	0.021	151	<10⁻⁴
5′ UTRs	10	0.89	4.95	166	189	0.003	78	<10⁻⁴
3′ UTRs	12	0.99	4.47	138	219	0.218	73	<10⁻⁴
ANC outgroup^h
Synonymous	21	3.02	6.10	29^g	110^g	—	63^g	—
Nonsynonymous	21	0.18	0.68	57	142	0.128	43	<10⁻⁴
Noncoding	46	1.01	2.31	233	983	0.650	357	0.167
Introns	24	1.22	2.33	115	678	0.074	247	1.000
UTRs	22	0.79	2.29	118	305	0.119	110	0.001
5′ UTRs	10	0.87	2.45	67	180	0.180	71	0.014
3′ UTRs	12	0.71	2.14	51	125	0.118	39	0.001

^a π is the average pairwise divergence per nucleotide site between alleles (%).

^b D_xy is the average Jukes–Cantor corrected divergence from the outgroup (%).

^c D is the number of divergent sites (Jukes–Cantor corrected).

^d P^all/P¹ are the number of polymorphic sites including all polymorphisms/excluding those at a frequency of less than 5%.

^e P^all/P¹ are probabilities from McDonald–Kreitman tests including all polymorphisms/excluding those at a frequency of less than 5%.

^f Excluding polymorphisms present at a frequency of less than 10% and 15% did not alter the conclusions.

^g The synonymous site class counts for the McDonald–Kreitman tests include only P → P and U → U changes (see text for details).

^h ANC outgroup is the reconstructed ancestor of D. simulans and Drosophila melanogaster used as the outgroup.
