Abstract

Evolution of gene expression mediated by cis-regulatory changes is thought to be an important contributor to organismal adaptation, but identifying adaptive cis-regulatory changes is challenging due to the difficulty in knowing the expectation under no positive selection. A new approach for detecting positive selection on transcription factor binding sites (TFBSs) was recently developed, thanks to the application of machine learning in predicting transcription factor (TF) binding affinities of DNA sequences. Given a TFBS sequence from a focal species and the corresponding inferred ancestral sequence that differs from the former at n sites, one can predict the TF-binding affinities of many n-step mutational neighbors of the ancestral sequence and obtain a null distribution of the derived binding affinity, which allows testing whether the binding affinity of the real derived sequence deviates significantly from the null distribution. Applying this test genomically to all experimentally identified binding sites of 3 TFs in humans, a recent study reported positive selection for elevated binding affinities of TFBSs. Here, we show that this genomic test suffers from an ascertainment bias because, even in the absence of positive selection for strengthened binding, the binding affinities of known human TFBSs are more likely to have increased than decreased in evolution. We demonstrate by computer simulation that this bias inflates the false positive rate of the selection test. We propose several methods to mitigate the ascertainment bias and show that almost all previously reported positive selection signals disappear when these methods are applied.

Introduction

Organism-level phenotypic changes in adaptive evolution are believed to be often caused by gene expression alterations brought by changes in cis-regulatory sequences (King and Wilson 1975; Wray 2007; Carroll 2008; Jones et al. 2012; Signor and Nuzhdin 2018). However, it is challenging to test positive selection on cis-regulatory sequences because the neutral expectation is difficult to know, unlike the test of positive selection on protein-coding sequences where the neutral expectation is usually assumed to be reflected by synonymous substitutions (Li et al. 1985; Nei and Gojobori 1986; McDonald and Kreitman 1991; Nei and Kumar 2000). While one can test positive selection on the cis-regulatory sequence of a gene by comparing it with the synonymous sites in the gene (Andolfatto 2005), the comparison would rely on the assumption that the regions being compared have equal mutation rates. The nonneutrality of many synonymous mutations (Lind et al. 2010; Lawrie et al. 2013; Sharon et al. 2018; She and Jarosz 2018; Shen et al. 2022) further complicates the test. Comparing the cis-regulatory sequence of a gene with the intron sequences of the gene is another choice (Haygood et al. 2007), but it similarly depends on the assumption that the regions being compared have equal mutation rates and that introns evolve neutrally. Additionally, because the number of functional sites in a cis-regulatory sequence is usually small, the above comparisons are generally statistically underpowered.

Because functional changes of cis-regulatory sequences typically occur via altering their bindings to trans-regulatory factors such as transcription factors (TFs), selection on cis-regulatory sequences is in a large part mediated by selection on binding affinities (Berg et al. 2004; Moses 2009). In theory, one can compare an observed evolutionary change in binding affinity with a null distribution in the absence of selection to test whether the cis-regulatory sequence has been positively selected. Such a test is possible only if the binding affinities of numerous potential mutant sequences, which would be labor-intensive to quantify experimentally, are known. This problem was recently solved by using machine learning to predict TF-binding affinities (Liu and Robinson-Rechavi 2020). Specifically, based on previously established techniques (Ghandi et al. 2014; Lee et al. 2015; Ghandi et al. 2016), Liu and Robinson-Rechavi trained a gapped k-mer support vector machine (gkm-SVM) using TF-binding sites (TFBSs) experimentally identified through chromatin immunoprecipitation followed by sequencing (ChIP-seq). This trained program is then used to calculate SVM weights (i.e. contribution to the overall binding affinity) of all possible 10-mers, based on which a binding affinity score (SVM score) can be calculated for an arbitrary sequence by summing up SVM weights of all 10-mers that it contains.

With the above tool, one can predict the binding affinities of many mutational neighbors of an ancestral regulatory sequence to obtain a null distribution of binding affinity changes and compare the observed evolutionary change (deltaSVM) with this null distribution. If the observed deltaSVM is in the right 5% or 1% tail of the null distribution, one could conclude that positive selection for a higher binding affinity to the TF of interest acted in the evolution of the regulatory sequence (Prabhakar et al. 2006; Gittelman et al. 2015). For example, based on this method, Liu and Robinson-Rechavi (2020) reported positive selection for elevated binding affinities in human evolution since the human–chimpanzee split for a few percent of the binding sites of each of 3 TFs examined: CEBPA, HNF4A, and CTCF.

In such tests of positive selection, the TFBSs considered are typically acquired from ChIP-seq peaks (Park 2009) identified in a focal species such as the human in the above study. This means that only TFBSs with relatively high binding affinities are included in the analysis, which could cause an ascertainment bias because binding affinities are more likely to have increased for TFBSs with relatively high affinities than for TFBSs with relatively low affinities even in the absence of positive selection for higher affinities (Fig. 1). As a result, one may miscall chance increases of binding affinities as signals of positive selection, raising the false positive rate in the test of positive selection. An analogy of this problem is to test whether students generally perform better in the final exam than in the midterm exam while considering only those who perform well in the final exam.

Schematic illustration of the ascertainment bias in the selection test of Liu and Robinson-Rechavi (2020). The left panel represents the scenario where all potential TFBSs in an ancestor are included in the test so there is no ascertainment bias, whereas the right panel represents the scenario in the actual test of Liu and Robinson-Rechavi where there is an ascertainment bias. Here, n1, n2, n3, and n4 denote the number of potential TFBSs that underwent each type of binding affinity change, respectively. Pincrease denotes the observed proportion of TFBSs with increased binding affinities and is unbiasedly estimated in the left panel. In the right panel, however, Pincrease is overestimated because TFBSs with low-derived affinities (in the focal species) are not included.
Fig. 1.

Schematic illustration of the ascertainment bias in the selection test of Liu and Robinson-Rechavi (2020). The left panel represents the scenario where all potential TFBSs in an ancestor are included in the test so there is no ascertainment bias, whereas the right panel represents the scenario in the actual test of Liu and Robinson-Rechavi where there is an ascertainment bias. Here, n1, n2, n3, and n4 denote the number of potential TFBSs that underwent each type of binding affinity change, respectively. Pincrease denotes the observed proportion of TFBSs with increased binding affinities and is unbiasedly estimated in the left panel. In the right panel, however, Pincrease is overestimated because TFBSs with low-derived affinities (in the focal species) are not included.

In this study, we use computer simulation of neutral evolution to demonstrate the influence of the ascertainment bias on the test of positive selection on TFBSs. We propose 2 methods to mitigate this impact and show that almost all signals of positive selection previously reported for the binding sites of the 3 human TFs (Liu and Robinson-Rechavi 2020) disappear upon ascertainment bias mitigations.

Results

Interpretation of P-values in the Selection Test

For each TFBS subject to Liu and Robinson-Rechavi's one-tailed test of positive selection for elevated binding affinity, the P-value is the probability that deltaSVM under neutrality (denoted Δ for short) is greater than the observed value (Δobs), given the ancestral sequence and the number of nucleotide substitutions separating the ancestral and derived sequences (note that Liu and Robinson-Rechavi required this number to be at least 2). That is, P=Prob(Δ>Δobs). However, for a TFBS from a real dataset, the probability that it shows Δ>Δobs under neutrality is the conditional probability P=Prob(Δ>Δobs|I)=Prob(Δ>ΔobsI)Prob(I), where I denotes the event that the TFBS is included in the dataset. That is, the P-value is conditioned not only on the ancestral state but also implicitly on the derived state as long as Prob(I) is dependent on the derived state. Because TFBSs with higher Δobs should have higher Prob(I) but lower Prob(Δ>Δobs), events I and Δ>Δobs are not mutually independent but tend to avoid each other. That is, Prob(Δ>ΔobsI)<Prob(Δ>Δobs)Prob(I). Thus, P=Prob(Δ>Δobs|I)=Prob(Δ>ΔobsI)Prob(I)<Prob(Δ>Δobs)Prob(I)Prob(I)=P. This inequation shows that P will be underestimated if P′ is interpreted as P. Under neutrality, if the probability for P < 0.01 is 0.01, the probability for P’ < 0.01 must be greater than 0.01, inflating the false positive rate in the test of positive selection. If the test is unbiased, the action of positive selection can be inferred when the proportion of TFBSs showing P < 0.01 is greater than 1%. However, this inference is no longer valid if P′ instead of P is calculated because the fraction of TFBSs showing P′ < 0.01 under neutrality is expected to exceed 0.01. Therefore, the ascertainment bias is a problem both in assessing positive selection on an individual TFBS and for a dataset containing many TFBSs.

False Positive Rates Under Various Levels of the Ascertainment Bias

We performed a computer simulation to evaluate the impact of the ascertainment bias on the false positive rate. We first generated 50,000 10-nucleotide random sequences as ancestral TFBS sequences. While TFBSs in ChIP-seq data are much longer than 10 nucleotides, actual TF-binding sequences are around 10 nucleotides. Hence, our simulation is biologically relevant. In each sequence, we then introduced 2 random substitutions at 2 randomly picked sites to generate the corresponding derived sequence. We treated these sequences as potential binding sites of human CEBPA studied by Liu and Robinson-Rechavi. To mimic the ascertainment bias, we removed 10%, 20%, …, 90% of the simulated TFBSs with the lowest derived SVM scores, obtaining 9 datasets with increasing ascertainment biases. We then performed a one-tailed test of positive selection for an elevated CEBPA binding affinity for each TFBS. Following Liu and Robinson-Rechavi, we called a case statistically significant when the right-tail probability is below 0.01 (see Materials and Methods). Because all substitutions introduced are random, all cases of positive selection identified are false positives. Indeed, each of the 9 datasets with ascertainment biases showed a significantly higher false positive rate than that of the original dataset without the ascertainment bias (P<0.05, two-sample Z-test for proportions; Table 1) and the false positive rate increased with the proportion of TFBSs removed (Spearman's ρ>0.99, P<10300; Table 1). For the most extreme level of ascertainment bias considered, the false positive rate is 11 times the expected level. We repeated the above analysis by treating the sequences as potential binding sites of 2 other human TFs studied by Liu and Robinson-Rechavi (HNF4A and CTCF) and obtained highly similar results (supplementary tables S1 and S2, Supplementary Material online).

Table 1

All simulated TFBSs for human CEBPA and subsets of the simulated TFBSs with the highest derived SVM scores

SubsetNumber of TFBSsNumber (fraction) of TFBSs with positive deltaSVMsNumber (fraction) of false positivesP-value
Complete dataset50,00024,908 (49.82%)553 (1.106%)N.A.
Top 90%45,00024,380 (54.18%)553 (1.230%)4.2×102
Top 80%40,00023,281 (58.20%)553 (1.383%)1.0×104
Top 70%35,00021,810 (62.31%)553 (1.580%)1.2×109
Top 60%30,00020,003 (66.68%)553 (1.843%)3.5×1018
Top 50%25,00017,748 (70.99%)553 (2.212%)1.7×1032
Top 40%20,00015,085 (75.43%)553 (2.765%)5.3×1057
Top 30%15,00012,012 (80.08%)553 (3.687%)6.9×10102
Top 20%10,0008,464 (84.64%)553 (5.530%)8.2×10198
Top 10%5,0004,448 (88.96%)552 (11.040%)<10300
SubsetNumber of TFBSsNumber (fraction) of TFBSs with positive deltaSVMsNumber (fraction) of false positivesP-value
Complete dataset50,00024,908 (49.82%)553 (1.106%)N.A.
Top 90%45,00024,380 (54.18%)553 (1.230%)4.2×102
Top 80%40,00023,281 (58.20%)553 (1.383%)1.0×104
Top 70%35,00021,810 (62.31%)553 (1.580%)1.2×109
Top 60%30,00020,003 (66.68%)553 (1.843%)3.5×1018
Top 50%25,00017,748 (70.99%)553 (2.212%)1.7×1032
Top 40%20,00015,085 (75.43%)553 (2.765%)5.3×1057
Top 30%15,00012,012 (80.08%)553 (3.687%)6.9×10102
Top 20%10,0008,464 (84.64%)553 (5.530%)8.2×10198
Top 10%5,0004,448 (88.96%)552 (11.040%)<10300

P-value in the last column is from a test of the null hypothesis that the false positive rate in a subset equals that in the complete dataset (two-sample Z-test for proportions).

Table 1

All simulated TFBSs for human CEBPA and subsets of the simulated TFBSs with the highest derived SVM scores

SubsetNumber of TFBSsNumber (fraction) of TFBSs with positive deltaSVMsNumber (fraction) of false positivesP-value
Complete dataset50,00024,908 (49.82%)553 (1.106%)N.A.
Top 90%45,00024,380 (54.18%)553 (1.230%)4.2×102
Top 80%40,00023,281 (58.20%)553 (1.383%)1.0×104
Top 70%35,00021,810 (62.31%)553 (1.580%)1.2×109
Top 60%30,00020,003 (66.68%)553 (1.843%)3.5×1018
Top 50%25,00017,748 (70.99%)553 (2.212%)1.7×1032
Top 40%20,00015,085 (75.43%)553 (2.765%)5.3×1057
Top 30%15,00012,012 (80.08%)553 (3.687%)6.9×10102
Top 20%10,0008,464 (84.64%)553 (5.530%)8.2×10198
Top 10%5,0004,448 (88.96%)552 (11.040%)<10300
SubsetNumber of TFBSsNumber (fraction) of TFBSs with positive deltaSVMsNumber (fraction) of false positivesP-value
Complete dataset50,00024,908 (49.82%)553 (1.106%)N.A.
Top 90%45,00024,380 (54.18%)553 (1.230%)4.2×102
Top 80%40,00023,281 (58.20%)553 (1.383%)1.0×104
Top 70%35,00021,810 (62.31%)553 (1.580%)1.2×109
Top 60%30,00020,003 (66.68%)553 (1.843%)3.5×1018
Top 50%25,00017,748 (70.99%)553 (2.212%)1.7×1032
Top 40%20,00015,085 (75.43%)553 (2.765%)5.3×1057
Top 30%15,00012,012 (80.08%)553 (3.687%)6.9×10102
Top 20%10,0008,464 (84.64%)553 (5.530%)8.2×10198
Top 10%5,0004,448 (88.96%)552 (11.040%)<10300

P-value in the last column is from a test of the null hypothesis that the false positive rate in a subset equals that in the complete dataset (two-sample Z-test for proportions).

Methods for Mitigating the Ascertainment Bias

With the ascertainment bias, the expected false positive rate under neutrality is unknown for a given dataset. Hence, one cannot directly use the proportion of TFBSs in a dataset that show significant positive selection signals to infer whether positive selection has occurred. Rather, one can infer the occurrence of positive selection on an individual TFBS by calculating the adjusted P-value after correcting for multiple testing, where the number of tests should be the sum of the considered and missing TFBSs. Below we present 2 different methods for estimating the number of missing TFBSs.

The first method regards twice the number of TFBSs with positive deltaSVMs as the total number of tests, because the number of TFBSs with negative deltaSVMs should be similar to that with positive deltaSVMs in the absence of ascertainment bias. This expectation is supported by the observation that almost exactly 50% of all simulated TFBSs have positive deltaSVMs (Table 1, supplementary tables S1 and S2, Supplementary Material online). Similarly, in the mouse data analyzed by Liu and Robinson-Rechavi (2020), which consists of TFBSs identified in the focal lineage and 2 nonfocal lineages, ascertainment bias is expected to be much less severe for the “conserved” subset of TFBSs that are identified from all 3 lineages (see Discussion). Indeed, the fraction of “conserved” TFBSs with positive deltaSVMs is 48.14%, 48.66%, and 48.73% for CEBPA, HNF4A, and FOXA1, respectively. Note that this method likely still undercounts the total number of tests, because it cannot recover TFBSs with low binding affinities in both the ancestral and focal taxa (i.e. n4 in Fig. 1).

The second method regards the number of potential binding sites of a given TF in the genome that have at least 2 nucleotide differences between the ancestral and derived sequences as the number of total tests. If the binding sites for the TF is on average l nucleotides long, there are (L-l + 1) P(n2)L  P(n2) total tests, where L >> l is the total number of nucleotides in the genome and P(n2) is the fraction of TFBSs that have at least 2 nucleotide differences between the ancestral and derived sequences (see Materials and Methods). In the above calculation, potential binding sites are overlapping. However, to be more conservative in computing the total number of tests, we may require the potential binding sites to be nonoverlapping, which yields an estimate of L  P(n2)/l potential binding sites in the genome.

Most Positive Selection Signals Disappear Upon Ascertainment Bias Mitigations

As mentioned, Liu and Robinson-Rechavi (2020) reported the detection of positive selection on the binding sites of each of 3 human TFs investigated, because >1% of the TFBSs fall in the right 1% tail of the null distribution of deltaSVM. The TFBSs examined were all identified from human ChIP-seq data, so the test suffered from ascertainment biases. Below we use the 2 methods proposed in the preceding section to mitigate the ascertainment bias and reevaluate the evidence for positive selection by estimating adjusted P-values. If we use the cutoff of adjusted P-value = 1%, there are no positively selected binding sites for any of the 3 TFs when the ascertainment bias is uncorrected or corrected by either method (Table 2). If we use the cutoff of adjusted P-value = 5%, 109 CEBPA binding sites show significant positive selection signals when the ascertainment bias is uncorrected. This number reduces to 87 and 0 under the 2 methods of correction, respectively (Table 2). Binding sites of the other 2 human TFs show even weaker signals of positive selection. No HNF4A binding sites have adjusted P < 0.05 even when the ascertainment is not corrected (Table 2). A total of 161 CTCF binding sites have adjusted P < 0.05 when the ascertainment bias is uncorrected, but this number reduces to 0 when method 1 or 2 is used to correct the ascertainment bias (Table 2).

Table 2

Number and fraction of positively selected human TFBSs after correction for multiple testing

Multiple testing correctionaTranscription factor
CEBPA
(n = 5,807)
HNF4A
(n = 12,734)
CTCF
(n = 80,074)
Adjusted P < 0.01No correction436 (7.51%)612 (4.81%)2,821 (3.52%)
Method 00 (0%)0 (0%)0 (0%)
Method 10 (0%)0 (0%)0 (0%)
Method 20 (0%)0 (0%)0 (0%)
Adjusted P < 0.05No correction1,135 (19.55%)1,922 (15.09%)9,132 (11.40%)
Method 0109 (1.88%)0 (0%)161 (0.20%)
Method 187 (1.50%)0 (0%)0 (0%)
Method 20 (0%)0 (0%)0 (0%)
Multiple testing correctionaTranscription factor
CEBPA
(n = 5,807)
HNF4A
(n = 12,734)
CTCF
(n = 80,074)
Adjusted P < 0.01No correction436 (7.51%)612 (4.81%)2,821 (3.52%)
Method 00 (0%)0 (0%)0 (0%)
Method 10 (0%)0 (0%)0 (0%)
Method 20 (0%)0 (0%)0 (0%)
Adjusted P < 0.05No correction1,135 (19.55%)1,922 (15.09%)9,132 (11.40%)
Method 0109 (1.88%)0 (0%)161 (0.20%)
Method 187 (1.50%)0 (0%)0 (0%)
Method 20 (0%)0 (0%)0 (0%)

aThe total number of tests considered in the multiple testing correction is the observed number of TFBSs in the dataset (method 0), twice the number of observed TFBSs with positive deltaSVMs (method 1), or an estimate of the number of potential binding sites in the genome (method 2), respectively. Results for “no correction” is based on P-values reported by Liu and Robinson-Rechavi (2020).

Table 2

Number and fraction of positively selected human TFBSs after correction for multiple testing

Multiple testing correctionaTranscription factor
CEBPA
(n = 5,807)
HNF4A
(n = 12,734)
CTCF
(n = 80,074)
Adjusted P < 0.01No correction436 (7.51%)612 (4.81%)2,821 (3.52%)
Method 00 (0%)0 (0%)0 (0%)
Method 10 (0%)0 (0%)0 (0%)
Method 20 (0%)0 (0%)0 (0%)
Adjusted P < 0.05No correction1,135 (19.55%)1,922 (15.09%)9,132 (11.40%)
Method 0109 (1.88%)0 (0%)161 (0.20%)
Method 187 (1.50%)0 (0%)0 (0%)
Method 20 (0%)0 (0%)0 (0%)
Multiple testing correctionaTranscription factor
CEBPA
(n = 5,807)
HNF4A
(n = 12,734)
CTCF
(n = 80,074)
Adjusted P < 0.01No correction436 (7.51%)612 (4.81%)2,821 (3.52%)
Method 00 (0%)0 (0%)0 (0%)
Method 10 (0%)0 (0%)0 (0%)
Method 20 (0%)0 (0%)0 (0%)
Adjusted P < 0.05No correction1,135 (19.55%)1,922 (15.09%)9,132 (11.40%)
Method 0109 (1.88%)0 (0%)161 (0.20%)
Method 187 (1.50%)0 (0%)0 (0%)
Method 20 (0%)0 (0%)0 (0%)

aThe total number of tests considered in the multiple testing correction is the observed number of TFBSs in the dataset (method 0), twice the number of observed TFBSs with positive deltaSVMs (method 1), or an estimate of the number of potential binding sites in the genome (method 2), respectively. Results for “no correction” is based on P-values reported by Liu and Robinson-Rechavi (2020).

Discussion

We showed that a previously proposed genomic test for positive selection for increased binding affinities of TFBSs suffers from inflated false positive errors due to an ascertainment bias. Because this bias is inevitable when real data are used, without an appropriate correction, the test would be unable to identify positively selected TFBSs accurately. After applying corrections as well as performing additional analyses, we found no evidence for positive selection on the binding sites of human HNF4A and CTCF and no to weak signals of positive selection on the binding sites of human CEBPA.

Using simulated data, we demonstrated that the false positive rate increases with the severity of the ascertainment bias, which is mimicked by removing from the dataset a subset of TFBSs with the lowest derived SVM scores. However, it is unclear whether the above approach accurately recapitulates the ascertainment bias that real data are subject to, because the exact relationship between the binding affinity of a TFBS and its probability of inclusion in the dataset is unknown. In addition, it is challenging to predict the overall binding affinity of a relatively long (i.e. >>10 bp) DNA segment. While deltaSVM, which is the sum of differences in SVM scores of all 10-mers encompassing the mutated sites, is sufficient for quantifying the effect of mutations, the summed SVM score for a long DNA segment is unlikely to be a good measure of the binding affinity of the segment because not all 10-mers in the segment contribute equally to TF binding and ChIP-seq detectability. As a result, it is difficult to estimate accurately the impact of the ascertainment bias on the selection test in real data analysis, and future studies are needed to address this problem. Indeed, when we examined a simulated dataset consisting of shuffled TFBS sequences detected by ChIP-seq (i.e. nucleotide sites in a real TFBS sequence are randomly reordered; see Materials and Methods), we found the distribution of the summed SVM score largely overlapping between the shuffled and unshuffled sequences (supplementary fig. S1, Supplementary Material online), indicating that the summed SVM score is not a reliable measure of the overall binding affinity of a TFBS and that removing TFBSs with the smallest summed SVM scores is unlikely to accurately emulate the ascertainment bias in actual data analysis. Notwithstanding, a positive correlation between the fraction of TFBSs removed and the false positive rate is still observed in this dataset (Spearman's ρ=0.976, P<10300; supplementary table S3, Supplementary Material online).

We considered 2 different methods to recover the missing TFBSs caused by the ascertainment bias. While method 1 probably recovers TFBSs that have detectable binding affinities in the ancestral node but not in the focal species (i.e. n2 in Fig. 1), it does not recover TFBSs that lack detectable binding in both species (i.e. n4 in Fig. 1). Hence, method 1 likely provides insufficient corrections. This said, if there are already no significant positive selection signals after the application of method 1, more stringent corrections would not be needed. However, if there are positive selection signals after the application of method 1, the identified TFBSs should be further tested by applying more rigorous corrections or be treated as candidates for further tests such as an experimental verification of fitness effects or a population genetic test of selective sweeps.

The second correction method (method 2) considers the number of potential TFBSs in the whole genome. In theory, this method provides the most complete correction. However, estimating the total number of potential TFBSs (i.e. number of sequence segments that can potentially become TFBSs in evolution) can be challenging. In this study, we assumed that all genomic regions can potentially become a TFBS. This assumption is reasonable at least qualitatively, because “leaky expression” caused by fortuitous TF binding is indeed widespread, as evidenced by the observation that over three quarters of the human genome is transcribed (Kellis et al. 2014) and that about one half of 120-nucleotide random sequences can drive gene expression in yeast (Xu et al. 2023). However, if different genomic regions have different potentials to evolve into a TFBS in the focal lineage, method 2 could overcorrect the ascertainment bias when regions with low potentials are counted. This said, if such heterogeneities can be modeled, method 2 should provide an appropriate correction of the ascertainment bias.

It should be noted that, because the P-value of the selection test is calculated based on a finite number of simulated mutants, the test inevitably loses power when multiple testing is corrected. For instance, Liu and Robinson-Rechavi (2020) considered 104 simulated mutants per test, so the smallest possible P-value was 10−4, yet the number of TFBSs in the dataset can be very large (e.g. on the order of 104 for HNF4A and CTCF; Table 2). While possible, the missing TFBSs are unlikely to be true targets of selection for stronger binding given their lack of detectable binding. Hence, the probability is low for missing TFBSs to have small P-values (i.e. below 0.05 or 0.01) if they are included in the test. But, when this is the case, multiple testing corrections could make the test overly conservative.

Our reanalysis showed no to weak signals of positive selection on the TFBSs of CEBPA (Table 2). This result appears consistent with the validation tests Liu and Robinson-Rechavi (2020) performed. Specifically, one of their validations, inspired by the McDonald–Kreitman test (McDonald and Kreitman 1991), compares the ratio of the number of substitutions to the number of polymorphisms between TFBSs detected to be under positive selection and the other TFBSs in the dataset. This test alone, however, is not a sufficient means of validation, because many binding sites not under positive selection are likely subject to negative selection because of stabilizing selection on their binding affinities, and negative selection can reduce the divergence-to-polymorphism ratio (Eyre-Walker 2002). Liu and Robinson-Rechavi further showed that the target genes of the positively selected TFBSs tend to have lowered expression variance across human populations, suggesting that the binding affinities of TFBSs not under positive selection are not generally under stronger stabilizing selection than TFBSs under positive selection. The 2 validation tests, together with our reanalysis, suggest the possibility of positive selection on some CEBPA binding sites in the human lineage. Liu and Robinson-Rechavi (2020) reported relatively high fractions of positively selected CTCF binding sites in human brain tissues relative to other tissues, but no statistic test for this comparison was performed. We note that, because no CTCF binding site has an adjusted P-value smaller than 0.05 under either method 1 or method 2 in the test of positive selection, the proportion of positively selected binding sites is zero in every tissue and thereby is not higher in brain tissues than in other tissues.

In addition to human TFBSs, Liu and Robinson-Rechavi also analyzed TFBSs in mice and fruit flies. These 2 datasets do not suffer from the ascertainment bias as severely as do the human datasets, because TFBSs detected in focal and nonfocal species (or lineages) are all included in the mouse and fly analyses. However, these datasets are not completely free of the ascertainment bias because only TFBSs with sufficiently strong binding in at least one species can be included, whereas proto-TFBSs that have low binding in all species considered (and presumably in their most recent common ancestor too) cannot. That is, the low-to-low class (n4) in Fig. 1 are still missing, and the number of TFBSs in the dataset is still smaller than the size of the ideal “complete” dataset. As a result, selection tests have inflated false positive errors. For example, the fraction of TFBSs with increased binding should be n1n1+n2+n3+n4 but is instead computed by n1n1+n2+n3 (see Fig. 1). Because n4 is presumably the largest among the four n values, the above 2 fractions are very different. This problem can be mitigated to some extent by broader phylogenetic sampling, though ancestral sequence reconstruction would be subject to more errors when more divergent species are included. Another problem in analyses involving divergent species is that there could be divergence in the TF's binding preference. If this is the case, and only binding to a focal species’ TF is considered, there could be a spurious excess of increase in binding affinity.

It is worth noting that predictive sequence patterns learned by gkm-SVM can potentially be interpreted by different means (Shrikumar et al. 2019). Both Liu and Robinson-Rechavi (2020) and the present study used deltaSVM to quantify the effect of substitutions on binding affinity when testing selection on TFBSs. Although deltaSVM has potential pitfalls, the selection test is unlikely to be sensitive to the interpretation method because the test is nonparametric. Even if a different interpretation method is used in the test, the ascertainment bias is still expected as long as the TFBSs tested for positive selection come from experiments that detect TF binding.

We conclude that the previously proposed genomic test is unable to rigorously estimate the prevalence of positive selection on TFBSs due to an ascertainment bias. There are multiple ways to mitigate the bias, but it remains challenging to adequately correct but not overcorrect the bias. We suggest that it is necessary to combine multiple methods to verify the signal of positive selection on TFBSs, including broader phylogenetic sampling, correction for ascertainment bias, correction for multiple testing, and other means to test selection, to robustly estimate the prevalence of positive selection on TFBSs and identify individual TFBSs where nucleotide substitutions have been driven by positive selection.

Finally, it is worth pointing out that the type of ascertainment bias encountered here is not unique to the test of positive selection on TFBSs, especially at the postgenomic era. Evolutionary analyses of subsets of genomic data that satisfy certain criteria are potentially subject to ascertainment biases when the criteria are correlated with the factors being investigated. A prime example is the test of Ohno's hypothesis of X-chromosome dosage compensation in mammals (i.e. doubling of the expressions of X-linked genes to compensate the degeneration of their Y-linked counterparts) (Ohno 1967), where considering only genes with expression levels higher than a cutoff produced misleading results (He et al. 2011; Lin et al. 2012). Caution should be exerted in designing such tests.

Materials and Methods

Analysis of Simulated TFBSs

To generate a dataset that represents potential TFBSs, we randomly generated 50,000 distinct 10-nucleotide sequences with equal probabilities of the 4 nucleotides at each site. In each sequence, we randomly picked 2 sites and make a random change at each site such that the derived sequence differs from the original, ancestral sequence by 2 substitutions (with no multiple hits and no mutational bias).

Following Liu and Robinson-Rechavi (2020), for each pair of ancestral and derived TFBSs, we performed a one-tailed test of positive selection for an increased binding affinity of the derived TFBS. We separately considered 3 human TFs: CEBPA, HNF4A, and CTCF. Binding affinities (SVM weights) of 10-mers were acquired from https://github.com/ljljolinq1010/A-robust-method-for-detecting-positive-selection-on-regulatory-sequences/blob/master/data/human_SVM_model/CEBPA/kmer_10_library_weigths.txt, https://github.com/ljljolinq1010/A-robust-method-for-detecting-positive-selection-on-regulatory-sequences/blob/master/data/human_SVM_model/HNF4A/kmer_10_library_weigths.txt, and https://github.com/ljljolinq1010/A-robust-method-for-detecting-positive-selection-on-regulatory-sequences/blob/master/data/human_CTCF_adaptation/human_SVM_model/all_merged_ctcf_kmer_10_library_weigths.txt for CEBPA, HNF4A, and CTCF, respectively. CEBPA and HNF4A are both liver-specific TFs; CTCF is expressed in multiple tissues and the binding sites studied are the union of the binding sites identified from multiple individual tissues (Liu and Robinson-Rechavi 2020). For each ancestral TFBS, we generated 1,000 random mutants as the control set. Each mutant was a 2-step neighbor of the ancestral sequence and was generated as in the generation of the derived sequences. We simulated only 1,000 random mutants as the control set for each TFBS because the total number of 2-step neighbors of a 10-mer is only (102)×3×3=405. Positive selection is inferred for a TFBS if fewer than 10 sequences in the control set has higher SVM scores than that of the derived sequence (i.e. P<0.01). Because the simulated TFBSs were subject to no selection, all positive cases identified in this analysis were false positives.

To mimic the ascertainment bias, we ranked the simulated TFBSs based on the SVM score of the derived sequence and removed the bottom 10%, 20%, … 90% of TFBSs, respectively, obtaining 9 subsets subject to increasing ascertainment biases. We then counted the number of TFBSs that showed P<0.01 in the selection test in each subset to investigate the relationship between the severity of the ascertainment bias and the false positive rate.

Analysis of Shuffled TFBS Sequences

For each human CEBPA binding site identified from ChIP-seq, we randomly reordered its nucleotide sites to obtain a sequence that matches the real sequence's length and nucleotide composition. The same reordering was also applied to the corresponding ancestral sequence, such that the specific types of nucleotide substitutions in the real sequences are also preserved. For each sequence, the total SVM score for CEBPA binding was calculated by adding up the SVM scores of all 10-mers in the human sequence. Ascertainment bias was generated by removing shuffled sequences with the smallest derived total SVM. For the selection test, 2,000 random mutants were generated for each TFBS as the control set. Although Liu and Robinson-Rechavi generated 10,000 mutants when performing the test, using a smaller number of mutants should not increase the number of positive test results. Because there are no true positives in the simulated data and because it is the overall false positive rate that is of interest, a smaller control set should not affect the conclusion.

Reanalysis of Human TFBSs

We examined the binding sites of human CEBPA, HNF4A, and CTCF, respectively. These TFBSs were previously studied by Liu and Robinson-Rechavi (2020), who used human (Homo sapiens) as the focal species and used human, chimpanzee (Pan troglodytes), and gorilla (Gorilla gorilla) to infer the sequence of the human–chimpanzee common ancestor. Data files analyzed here were made available by Liu and Robinson-Rechavi at https://github.com/ljljolinq1010/A-robust-method-for-detecting-positive-selection-on-regulatory-sequences/tree/master/data/human_deltaSVM (CEBPA and HNF4A binding sites) and https://github.com/ljljolinq1010/A-robust-method-for-detecting-positive-selection-on-regulatory-sequences/blob/master/data/human_CTCF_adaptation/human_deltaSVM/ctcf_deltaSVM_highertailTest.txt (CTCF binding sites).

We performed multiple testing corrections following the Benjamini–Hochberg procedure (Benjamini and Hochberg 1995) and computed adjusted P-values from P-values of the right-tail test in Liu and Robinson-Rechavi (2020). For the TFBS with the ith smallest P-value in a given dataset, the adjusted P-value is Padjust=min{1,MPii}, where Pi is the P-value reported by Liu and Robinson-Rechavi (2020) and M is the total number of tests to correct for. When multiple TFBSs have equal P-values, the greatest rank number is used; for example, if there are k (>1) TFBSs with the same P-value and there are i1 TFBSs with P-values that are smaller than these k TFBSs, the k TFBSs all have the rank number of j=i+k1. We applied 3 methods to calculate Padjust, which we refer to as methods 0, 1, and 2, respectively. In method 0, M is simply the number of TFBSs in the dataset. In method 1, M equals twice the number of TFBSs with positive deltaSVMs in the dataset. In method 2, M=P(n2)Ll, where L is the genome size, l is the mean length of TFBS sequences in the dataset, and P(n2) is the probability that there are at least 2 substitutions between the ancestral TFBS and its derived version. L was set to be 6×109 nucleotides, which is a rough estimate of human's haploid genome size multiplied by 2 because a TFBS can be on either the Watson or Crick strand and l was computed from the empirical data to equal 269, 340, and 537 nucleotides for CEBPA, HNF4A, and CTCF, respectively. P(n2) was calculated as the ratio of the number of TFBSs used in the selection test (first row of Table 2) and the total number of TFBSs identified by ChIP-seq (16,212 for CEBPA, 27,782 for HNF4A, and 118,970 for CTCF).

We assume that the ranks of P-values of TFBSs with P < 0.01 or P < 0.05 do not change upon the inclusion of missing TFBSs, because the missing TFBSs have low binding affinities and are unlikely to be true positives in the selection test.

All simulations and analyses were conducted in an R environment (R Core Development Team 2010).

Supplementary Material

Supplementary material is available at Molecular Biology and Evolution online.

Acknowledgments

We thank Dr. Jialin Liu for his generous help during this study and S. Song, H. Xu, and 2 anonymous reviewers for their valuable comments.

Funding

This work was supported by the research grant R35GM139484 from the U.S. National Institutes of Health to J.Z. D.J. was supported by the Rackham Predoctoral Fellowship from the University of Michigan when performing this study.

Data Availability

Data needed to replicate the work are provided in the paper, supplementary materials, and GitHub (https://github.com/RexJiangEvoBio/TFBS).

References

Andolfatto
 
P
.
Adaptive evolution of non-coding DNA in Drosophila
.
Nature
.
2005
:
437
(
7062
):
1149
1152
. https://doi.org/10.1038/nature04107.

Benjamini
 
Y
,
Hochberg
 
Y
.
Controlling the false discovery rate—a practical and powerful approach to multiple testing
.
J R Stat Soc Series B Stat Methodol
.
1995
:
57
:
289
300
. https://doi.org/10.1111/j.2517-6161.1995.tb02031.x.

Berg
 
J
,
Willmann
 
S
,
Lässig
 
M
.
Adaptive evolution of transcription factor binding sites
.
BMC Evol Biol
.
2004
:
4
(
1
):
42
. https://doi.org/10.1186/1471-2148-4-42.

Carroll
 
SB
.
Evo-devo and an expanding evolutionary synthesis: a genetic theory of morphological evolution
.
Cell
.
2008
:
134
(
1
):
25
36
. https://doi.org/10.1016/j.cell.2008.06.030.

Eyre-Walker
 
A
.
Changing effective population size and the McDonald-Kreitman test
.
Genetics
.
2002
:
162
(
4
):
2017
2024
. https://doi.org/10.1093/genetics/162.4.2017.

Ghandi
 
M
,
Lee
 
D
,
Mohammad-Noori
 
M
,
Beer
 
MA
.
Enhanced regulatory sequence prediction using gapped k-mer features
.
PLoS Comput Biol
.
2014
:
10
(
7
):
e1003711
. https://doi.org/10.1371/journal.pcbi.1003711.

Ghandi
 
M
,
Mohammad-Noori
 
M
,
Ghareghani
 
N
,
Lee
 
D
,
Garraway
 
L
,
Beer
 
MA
.
gkmSVM: an R package for gapped-kmer SVM
.
Bioinformatics
.
2016
:
32
(
14
):
2205
2207
. https://doi.org/10.1093/bioinformatics/btw203.

Gittelman
 
RM
,
Hun
 
E
,
Ay
 
F
,
Madeoy
 
J
,
Pennacchio
 
L
,
Noble
 
WS
,
Hawkins
 
RD
,
Akey
 
JM
.
Comprehensive identification and analysis of human accelerated regulatory DNA
.
Genome Res
.
2015
:
25
(
9
):
1245
1255
. https://doi.org/10.1101/gr.192591.115.

Haygood
 
R
,
Fedrigo
 
O
,
Hanson
 
B
,
Yokoyama
 
KD
,
Wray
 
GA
.
Promoter regions of many neural- and nutrition-related genes have experienced positive selection during human evolution
.
Nat Genet
.
2007
:
39
(
9
):
1140
1144
. https://doi.org/10.1038/ng2104.

He
 
X
,
Chen
 
X
,
Xiong
 
Y
,
Chen
 
Z
,
Wang
 
X
,
Shi
 
S
,
Wang
 
X
,
Zhang
 
J
.
He et al. reply
.
Nat Genet
.
2011
:
43
(
12
):
1171
1172
. https://doi.org/10.1038/ng.1010.

Jones
 
FC
,
Grabherr
 
MG
,
Chan
 
YF
,
Russell
 
P
,
Mauceli
 
E
,
Johnson
 
J
,
Swofford
 
R
,
Pirun
 
M
,
Zody
 
MC
,
White
 
S
, et al.  
The genomic basis of adaptive evolution in threespine sticklebacks
.
Nature
.
2012
:
484
(
7392
):
55
61
. https://doi.org/10.1038/nature10944.

Kellis
 
M
,
Wold
 
B
,
Snyder
 
MP
,
Bernstein
 
BE
,
Kundaje
 
A
,
Marinov
 
GK
,
Ward
 
LD
,
Birney
 
E
,
Crawford
 
GE
,
Dekker
 
J
, et al.  
Defining functional DNA elements in the human genome
.
Proc Natl Acad Sci U S A
.
2014
:
111
(
17
):
6131
6138
. https://doi.org/10.1073/pnas.1318948111.

King
 
MC
,
Wilson
 
AC
.
Evolution at two levels in humans and chimpanzees
.
Science
.
1975
:
188
(
4184
):
107
116
. https://doi.org/10.1126/science.1090005.

Lawrie
 
DS
,
Messer
 
PW
,
Hershberg
 
R
,
Petrov
 
DA
.
Strong purifying selection at synonymous sites in D. melanogaster
.
PLoS Genet
.
2013
:
9
(
5
):
e1003527
. https://doi.org/10.1371/journal.pgen.1003527.

Lee
 
D
,
Gorkin
 
DU
,
Baker
 
M
,
Strober
 
BJ
,
Asoni
 
AL
,
McCallion
 
AS
,
Beer
 
MA
.
A method to predict the impact of regulatory variants from DNA sequence
.
Nat Genet
.
2015
:
47
(
8
):
955
961
. https://doi.org/10.1038/ng.3331.

Li
 
WH
,
Wu
 
CI
,
Luo
 
CC
.
A new method for estimating synonymous and nonsynonymous rates of nucleotide substitution considering the relative likelihood of nucleotide and codon changes
.
Mol Biol Evol
.
1985
:
2
(
2
):
150
174
. https://doi.org/10.1093/oxfordjournals.molbev.a040343.

Lin
 
F
,
Xing
 
K
,
Zhang
 
J
,
He
 
X
.
Expression reduction in mammalian X chromosome evolution refutes Ohno’s hypothesis of dosage compensation
.
Proc Natl Acad Sci U S A
.
2012
:
109
(
29
):
11752
11757
. https://doi.org/10.1073/pnas.1201816109.

Lind
 
PA
,
Berg
 
OG
,
Andersson
 
DI
.
Mutational robustness of ribosomal protein genes
.
Science
.
2010
:
330
(
6005
):
825
827
. https://doi.org/10.1126/science.1194617.

Liu
 
J
,
Robinson-Rechavi
 
M
.
Robust inference of positive selection on regulatory sequences in the human brain
.
Sci Adv
.
2020
:
6
(
48
):.

McDonald
 
JH
,
Kreitman
 
M
.
Adaptive protein evolution at the Adh locus in Drosophila
.
Nature
.
1991
:
351
(
6328
):
652
654
. https://doi.org/10.1038/351652a0.

Moses
 
AM
.
Statistical tests for natural selection on regulatory regions based on the strength of transcription factor binding sites
.
BMC Evol Biol
.
2009
:
9
(
1
):
286
. https://doi.org/10.1186/1471-2148-9-286.

Nei
 
M
,
Gojobori
 
T
.
Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions
.
Mol Biol Evol
.
1986
:
3
(
5
):
418
426
. https://doi.org/10.1093/oxfordjournals.molbev.a040410.

Nei
 
M
,
Kumar
 
S
.
Molecular evolution and phylogenetics
.
New York
:
Oxford University Press
;
2000
.

Ohno
 
S
.
Sex chromosomes and sex-linked genes
.
New York
:
Springer-Verlag
;
1967
.

Park
 
PJ
.
ChIP-seq: advantages and challenges of a maturing technology
.
Nat Rev Genet
.
2009
:
10
(
10
):
669
680
. https://doi.org/10.1038/nrg2641.

Prabhakar
 
S
,
Noonan
 
JP
,
Pääbo
 
S
,
Rubin
 
EM
.
Accelerated evolution of conserved noncoding sequences in humans
.
Science
.
2006
:
314
(
5800
):
786
. https://doi.org/10.1126/science.1130738.

R Core Development Team
.
R: a language and environment for statistical computing
.
Vienna (Austria)
:
R Core Development Team
;
2010
.

Sharon
 
E
,
Chen
 
SA
,
Khosla
 
NM
,
Smith
 
JD
,
Pritchard
 
JK
,
Fraser
 
HB
.
Functional genetic variants revealed by massively parallel precise genome editing
.
Cell
.
2018
:
175
(
2
):
544
557.e516
. https://doi.org/10.1016/j.cell.2018.08.057.

She
 
R
,
Jarosz
 
DF
.
Mapping causal variants with single-nucleotide resolution reveals biochemical drivers of phenotypic change
.
Cell
.
2018
:
172
(
3
):
478
490.e15
. https://doi.org/10.1016/j.cell.2017.12.015.

Shen
 
X
,
Song
 
S
,
Li
 
C
,
Zhang
 
J
.
Synonymous mutations in representative yeast genes are mostly strongly non-neutral
.
Nature
.
2022
:
606
(
7915
):
725
731
. https://doi.org/10.1038/s41586-022-04823-w.

Shrikumar
 
A
,
Prakash
 
E
,
Kundaje
 
A
.
GkmExplain: fast and accurate interpretation of nonlinear gapped k-mer SVMs
.
Bioinformatics
.
2019
:
35
(
14
):
i173
i182
. https://doi.org/10.1093/bioinformatics/btz322.

Signor
 
SA
,
Nuzhdin
 
SV
.
The evolution of gene expression in cis and trans
.
Trends Genet
.
2018
:
34
(
7
):
532
544
. https://doi.org/10.1016/j.tig.2018.03.007.

Wray
 
GA
.
The evolutionary significance of cis-regulatory mutations
.
Nat Rev Genet
.
2007
:
8
(
3
):
206
216
. https://doi.org/10.1038/nrg2063.

Xu
 
H
,
Li
 
C
,
Xu
 
C
,
Zhang
 
J
.
Chance promoter activities illuminate the origins of eukaryotic intergenic transcriptions
.
Nat Commun
.
2023
:
14
(
1
):
1826
. https://doi.org/10.1038/s41467-023-37610-w.

This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected]
Associate Editor: Katja Nowick
Katja Nowick
Associate Editor
Search for other works by this author on:

Supplementary data