Using Environmental Correlations to Identify Loci Underlying Local Adaptation

Coop, Graham; Witonsky, David; Di Rienzo, Anna; Pritchard, Jonathan K

doi:10.1534/genetics.110.114819

Abstract

Loci involved in local adaptation can potentially be identified by an unusual correlation between allele frequencies and important ecological variables or by extreme allele frequency differences between geographic regions. However, such comparisons are complicated by differences in sample sizes and the neutral correlation of allele frequencies across populations due to shared history and gene flow. To overcome these difficulties, we have developed a Bayesian method that estimates the empirical pattern of covariance in allele frequencies between populations from a set of markers and then uses this as a null model for a test at individual SNPs. In our model the sample frequencies of an allele across populations are drawn from a set of underlying population frequencies; a transform of these population frequencies is assumed to follow a multivariate normal distribution. We first estimate the covariance matrix of this multivariate normal across loci using a Monte Carlo Markov chain. At each SNP, we then provide a measure of the support, a Bayes factor, for a model where an environmental variable has a linear effect on the transformed allele frequencies compared to a model given by the covariance matrix alone. This test is shown through power simulations to outperform existing correlation tests. We also demonstrate that our method can be used to identify SNPs with unusually large allele frequency differentiation and offers a powerful alternative to tests based on pairwise or global F_ST. Software is available at http://www.eve.ucdavis.edu/gmcoop/.

LOCAL adaptation to divergent environments can lead to dramatic differences in average phenotype between populations of the same species. Such variation offers particularly compelling evidence of natural selection when it is correlated with variation in environmental factors over multiple independent geographic regions and/or species. For example, in warm-blooded species, individuals at higher latitudes tend to be smaller than those found at the equator (Bergmann 1847; Allen 1877) and birds in humid climates tend to be darker in pigmentation than those in drier habitats (Gloger 1833). There are many other examples of phenotypic adaptations to local environments, including cyptic pigmentation in deer mice (Sumner 1929; Mullen and Hoekstra 2008), body size and pigmentation gradients in Drosophila (e.g., Coyne and Beecham 1987; Hueyet al. 2000; Pool and Aquadro 2007), skin pigmentation clines in humans (Relethford 1997), and toxic soil resistance in plants (Jain and Bradshaw 1966). Such patterns were among the earliest types of evidence used to demonstrate the action of local adaptation as a force driving phenotypic differences between populations within a species (e.g., Huxley 1939; Mayr 1942).

Correlations between phenotype and environment may be mirrored at the level of individual genetic polymorphisms, where at some loci, allele frequencies strongly differentiate populations that live in different environments. Such correlations can arise when selection pressures exerted by the environmental variable are sufficiently divergent between geographic locations, such that differences in allele frequency can be maintained in the face of gene flow (e.g., Haldane 1948; Slatkin 1973; Nagylaki 1975; Lenormand 2002). The study of geographic patterns of genetic variation has a long history, with some of the earliest work on genetic polymorphism being the study of clines in the frequency of cytologically visible inversion polymorphisms (Dobzhansky 1948). Other examples include loci involved in adaptations to high altitude (Storzet al. 2007; McCrackenet al. 2009), pigmentation (Hoekstraet al. 2004), and life-history changes (Schmidtet al. 2008). One particularly impressive example of adaptive response to selection is provided by the ADH locus in Drosophila melanogaster, alleles of which show a strong gradient with latitude (Berry and Kreitman 1993). It has been observed that the ADH cline is quickly reestablished after the introduction of the species onto different continents and that it responds quickly to climate change (Uminaet al. 2005).

The advent of genome-wide data sets with individuals from many populations, across a wide geographic range (e.g., Nordborget al. 2005; Jakobssonet al. 2008; Liet al. 2008; Autonet al. 2009), allows investigators to obtain a systematic view of the processes shaping local adaptation and to gain valuable insights into the genetic and ecological basis of adaptation and speciation. It can also provide support for adaptive explanations for phenotypic variation, for example, suggesting an impact of selection on variation that is linked to human metabolic diseases (Thompsonet al. 2004; Younget al. 2005; Hancocket al. 2008, 2010; Pickrellet al. 2009).

Some of the earliest tests of selection on genetic markers were based on identifying loci that showed extreme allele frequency differences among populations (Cavalli-Sforza 1966; Lewontin and Krakauer 1973), using statistics such as F_ST, and there are now a range of methods predicated on this idea (e.g., Beaumont and Balding 2004; Foll and Gaggiotti 2008). Our goal here differs, as we seek to identify loci where the allele frequencies show unusually strong correlations with one or more environmental variables. Such loci may be under selection driven by those environmental factors or correlated selection pressures. However, this goal is complicated by the fact that allele frequencies are typically correlated among closely related populations; since such populations tend to be geographically proximate they often share environmental variables (see Novembre and DiRienzo 2009 for a recent discussion). This means that neighboring populations can rarely be treated as independent observations. Thus, a naive test of correlation between population frequency and an environmental variable will often have a high false positive rate. This situation is somewhat analogous to the reduced number of independent contrasts when comparing traits across species due to the shared phylogeny (Felsenstein 1985). The nonindependence of populations is also known to be an issue when using F_ST as a summary statistic to identify selected loci (Robertson 1975; Excoffieret al. 2009).

To illustrate the problem, Figure 1 shows the allele frequencies of a SNP in a series of 52 human populations, as a function of the distance of each population from the equator (Figure 1 is redrawn from a similar plot in Thompsonet al. 2004). The SNP is AGT M235T and the allele that increases in frequency with latitude is known to reduce sodium retention (Liftonet al. 1993), which may have been selectively favored in cooler northern climes. However, as is apparent in Figure 1, populations cluster by broad geographic region for both allele frequency and distance from the equator. Thus, the correlation between allele frequency and environmental variable is clearly supported by far fewer independent observations than the 52 points plotted in Figure 1. Moreover, it is not clear how much of the variation in allele frequency in Figure 1 is due to sampling error in some of the smaller samples or genetic drift. For example, are the low allele frequencies in Oceania—which support an environmental correlation—simply due to sampling error or genetic drift?

Figure 1.—

The distance from the equator for each of 52 human populations, plotted against sample allele frequencies for the SNP AGT M235T in each population. The points are colored according to the geographic region each population belongs to, following region definitions of Rosenberget al. (2002). The data were generated using HGDP samples by Thompsonet al. (2004) and are replotted on the basis of a figure in that article.

Open in new tab Download slide

In this article, we develop a model to overcome these difficulties by accounting for differences in sample sizes and for the null correlation of allele frequencies across populations when testing for correlation between an environmental variable and allele frequencies. To do this, we use a set of control loci to estimate a null model of how allele frequencies covary across populations. We can then test whether the correlation seen between the allele frequencies at a marker of interest and an environmental variable is greater than expected given this null model. We concentrate on markers such as SNPs that are codominant and usually biallelic, but we note in the discussion how the model can be extended to other types of markers. The method developed here can be applied to continuous or discrete environmental variables. We demonstrate the method by applying it to genome-wide SNP data from humans. Elsewhere we have applied this method to human genome-wide SNP data, for a range of environmental and ecological variables (Hancocket al. 2008, 2010).

METHODS

We develop a model for the joint allele frequencies across populations. One way to do so would be to use a fully explicit model of demographic history, but such models are very difficult to parameterize and to fit for so many populations. Instead, we intentionally avoid using an explicit model of the historical relationships between populations and opt for a flexible, less model-based parameterization. Under our null model, the population allele frequency in each population may deviate away from an ancestral (or global) allele frequency due to genetic drift. Populations covary in the deviation that they take, allowing some populations to be more closely related genetically to each other due to the effects of shared population history or gene flow. Thus, our null model is specified by the covariance structure of allele frequencies across populations. To estimate this covariance structure, we assume that a transform of the population frequency of an allele across populations has a multivariate normal distribution. We estimate the covariance matrix of this normal distribution using our control SNPs. This provides the null model that we use to assess the correlation between the environmental variable and allele frequencies at a SNP of interest.

Null model:

Suppose that L unlinked SNPs have been typed in K populations. First we estimate the null model using all L of these markers; if the number of markers is very large, then to improve computational speed we may estimate the null model using a large random subset L of all the markers typed. (As discussed below, we find that the null model parameters estimated from different subsets of the data are highly consistent, provided that L is large enough.) In some applications, the markers that we are interested in testing for environmental correlations may be among the L (since we expect that a small fraction of selected loci will have little impact on the parameter estimates for the null model). Alternatively, the L may represent a set of well-matched control markers (see Hancocket al. 2008, 2010, for discussion).

The data at locus l consist of the number of times we see alleles 1 and 2, respectively, in each of the k populations. (The two alleles are labeled arbitrarily.) Then let n_l = {n_1l, … ,n_Kl} be the observed counts of allele 1 and m_l = {m_1l, · · · , m_Kl} be the counts of allele 2, in populations

\(1,{\ldots},K\)

⁠; hence the total number of allele copies genotyped at locus l in population k is n_kl + m_kl. This notation ignores any missing data at site l, which is appropriate if we assume that any missing data are missing at random with respect to genotype. The population allele frequencies at locus l will be denoted by x_1l, … , x_Kl, and we assume that the observed counts of each allele are the result of binomial draws from these frequencies, independently for each k and l. The population allele frequencies are unknown in advance, and must be estimated from the data.

We aim to construct a model for the joint distribution of allele frequencies across the k populations. If these frequencies were not constrained to be between zero and one, it would be natural to model the population allele frequencies of a SNP across populations as being multivariate normal. To overcome the constrained nature of allele frequencies, we follow Nicholsonet al. (2002) by assuming that the population allele frequency in a subpopulation, x_kl, is normally distributed around an ancestral allele frequency ε_l (0 < ε_l < 1), but that the densities of x_kl above 1 and below 0 are replaced with point masses on 1 and 0, respectively. Further, we adopt the assumption of Nicholsonet al. (2002) that, for a particular subpopulation, the variance of this normal distribution is a product of a factor that is constant across loci multiplied by a locus-specific term: i.e., ε_l(1 − ε_l). This model was introduced to describe a pure drift model where the allele frequency within each population independently drifts from the ancestral allele frequency. The normal distribution was chosen because, when the frequency of an allele in the current generation is ε, the binomial sampling of the next generation can be approximated as the frequency in the next generation being ∼N(ε_l, ε_l(1 − ε_l)/(2N_e)) (in a population of effective size N_e). Thus, after τ generations of genetic drift, the distribution of allele frequency can be approximated as ∼N(y, C_kε_l(1 − ε_l)), where C_k = τ/(2N_e) is shared across loci, for τ ≪ N_e (Nicholsonet al. 2002). The estimate C_k can also be viewed as a model-based population-specific estimate of F_ST, a relationship that holds for low values of C_k (Nicholsonet al. 2002), a point that we return to briefly in the discussion.

Put in notational form, we assume that the population allele frequency x_kl is related by a simple transform g() to a surrogate population allele variable θ_kl that is not constrained to be between 0 and 1:

\[x_{kl}{=}g(\mathrm{{\theta}}_{kl}){=}\left\{\begin{array}{ll}0&\mathrm{if}\mathrm{{\theta}}_{kl}{<}0\\\mathrm{{\theta}}_{kl}&0{\leq}\mathrm{{\theta}}_{kl}{\leq}1\\1&\mathrm{{\theta}}_{kl}{>}1.\end{array}\right.\]

(1)

Therefore, for a locus l there is a set of θ_l = θ_1l, … , θ_Kl, where θ_kl has a marginal distribution ∼N(ε_l, ε_l(1 – ε_l)C_k) (Nicholsonet al. 2002). The point masses in the distribution of x_kl on zero and one (θ_kl ≤ 0 and θ_kl ≥ 1) represent the probability that the allele has been lost or fixed in the kth population, respectively.

Since we want to explicitly estimate the covariance in allele frequencies across populations, rather than assuming independent normally distributed variables across populations (as in Nicholsonet al. 2002), we assume that the θ_l have a multivariate normal distribution

\[P(\mathrm{{\theta}}_{l}\mathrm{{\vert}}{\Omega},\mathrm{{\varepsilon}}_{l})\mathrm{{\sim}}N(\mathrm{{\varepsilon}}_{l},\mathrm{{\varepsilon}}_{l}(1{-}\mathrm{{\varepsilon}}_{l}){\Omega}),\]

(2)

where N(μ, V) is a multivariate normal distribution with mean μ and variance–covariance matrix V. When the off-diagonal components of Ω are set to zero, this model is the same as that of Nicholsonet al. (2002) with minor differences in parameterization.

We can write the joint posterior of the parameters (θ_l, Ω, ε_l) at a single locus l up to a normalizing constant, as

\[P(\mathrm{{\theta}}_{l},{\Omega},\mathrm{{\varepsilon}}_{l}\mathrm{{\vert}}\mathbf{\mathrm{n}}_{l},\mathbf{\mathrm{m}}_{l}){\propto}P(\mathbf{\mathrm{n}}_{l},\mathbf{\mathrm{m}}_{l}\mathrm{{\vert}}x_{l}{=}g(\mathrm{{\theta}}_{l}))P(\mathrm{{\theta}}_{l}\mathrm{{\vert}}{\Omega},\mathrm{{\varepsilon}}_{l})P({\Omega})P(\mathrm{{\varepsilon}}_{l}),\]

(3)

where P(Ω) and P(ε_l) are priors on the covariance matrix and ancestral frequency, respectively. Our prior on the ancestral allele frequency ε is symmetric Beta with parameter λ. For the results presented here, we set λ = 1, reflecting the fact that the SNPs have been ascertained to be polymorphic. But in practice the choice of λ makes relatively little difference, as when there are many populations, there is reasonably good information about the parameter ε; we note, however, that ε is generally “biased” toward matching the allele frequency of geographic regions with large numbers of populations sampled.

Our prior on the variance–covariance matrix Ω, P(Ω), is somewhat more complicated, as it must have weight only on the set of positive definite matrices (a requirement of the multivariate normal distribution). We use the inverse Wishart prior, which is often used as a prior on variance–covariance matrices because it is the conjugate prior for the variance–covariance matrix of a multivariate normal. The inverse Wishart is parameterized by W(ρR⁻¹,ρ), where R is the prior K × K shape matrix and ρ (where ρ ≥ K) is a parameter controlling the strength of the prior (i.e., how much the posterior draws of the covariance matrix resemble the shape matrix). To make the prior as weak as possible we set ρ = K. We set R to be the identity matrix (i.e., R_ij = 1 if i = j, and 0 otherwise), but investigate the effect of this choice of prior later.

The covariance matrix is shared over loci, and so the joint posterior for all the loci is

\[P({\Omega},\mathrm{{\theta}}_{1},{\ldots},\mathrm{{\theta}}_{L},\mathrm{{\varepsilon}}_{1},{\ldots},\mathrm{{\varepsilon}}_{L}\mathrm{{\vert}}\mathbf{\mathrm{n}}_{1},\mathbf{\mathrm{m}}_{1},{\ldots},\mathbf{\mathrm{n}}_{L},\mathbf{\mathrm{m}}_{L}){\propto}\left\{{{\prod}_{l{=}1}^{l{=}L}}P(\mathbf{\mathrm{n}}_{l},\mathbf{\mathrm{m}}_{l}\mathrm{{\vert}}{\,}\mathbf{\mathrm{x}}_{l}{=}g(\mathrm{{\theta}}_{l}))P(\mathrm{{\theta}}_{l}\mathrm{{\vert}}{\Omega},\mathrm{{\varepsilon}}_{l})P(\mathrm{{\varepsilon}}_{l})\right\}P({\Omega}).\]

(4)

We use Markov chain Monte Carlo (MCMC) to explore the posterior of the covariance matrix and other parameters. Our MCMC scheme is presented in appendixa.

Alternative model:

Having estimated a null model of how SNP frequencies vary across populations (i.e., the posterior of the covariance matrix Ω), we now use this model to investigate whether allele frequencies at a SNP of interest are significantly correlated with an environmental variable Y. To do this, we allow the allele frequency, θ_l, to be dependent on Y (Y is standardized to have mean zero and a standard deviation of 1). The surrogate allele frequencies θ_l have a deviation from the ancestral allele frequency ε_l that is linearly proportional to the environmental variable Y with coefficient β; i.e.,

\[P(\mathrm{{\theta}}_{l}\mathrm{{\vert}}{\Omega},\mathrm{{\varepsilon}}_{l},\mathrm{{\beta}})\mathrm{{\sim}}N(\mathrm{{\varepsilon}}_{l}{+}\mathrm{{\beta}}\mathrm{Y},\mathrm{{\varepsilon}}_{l}(1{-}\mathrm{{\varepsilon}}_{l}){\Omega}).\]

(5)

We note that while this model predicts a linear relationship between θ_l and Y, this does not necessarily imply a linear relationship between the population allele frequencies x_l and Y due to the boundaries for the population allele frequency at 0 and 1. Note that β and Y could also be multivariate, allowing combinations of climate variables to be investigated. However, the allele frequencies at any one locus are intrinsic noisy; thus there is limited information about even a single climate variable correlation and so we refrain from implementing this multivariate option.

We place a prior on β, P(β), and then estimate the posterior

\[P(\mathrm{{\theta}}_{l}{\Omega},\mathrm{{\varepsilon}}_{l},\mathrm{{\beta}}\mathrm{{\vert}}\mathbf{\mathrm{n}}_{l},\mathbf{\mathrm{m}}_{l}){\propto}P(\mathbf{\mathrm{n}}_{l},\mathbf{\mathrm{m}}_{l}\mathrm{{\vert}}\mathbf{\mathrm{x}}_{l}{=}g(\mathrm{{\theta}}_{l}))P(\mathrm{{\theta}}_{l}\mathrm{{\vert}{\Omega}},\mathrm{{\varepsilon}}_{l},\mathrm{{\beta}})P({\Omega})P(\mathrm{{\varepsilon}}_{1})P(\mathrm{{\beta}}),\]

(6)

where P(β) is defined to be a uniform distribution between our choice of the minimum and maximum values of β, β_min and β_max. We use the posterior of the covariance matrix estimated from the control SNPs as the prior for the covariance matrix for a single locus. Since we perform the test at each locus separately, we further assume that the information from the control loci provides all of the information about the covariance matrix. Thus, rather than exploring the posterior of the covariance matrix given both the control SNPs and the test SNP, we can simply use draws from the posterior of the matrix that have previously been generated from the control SNPs. In practice, in our applications to human SNP data, we found that we have sufficient information (i.e., enough control loci) that the posterior mass of the covariance matrix is tightly concentrated on a single matrix and so it makes little difference if we use a single draw of covariance matrix from the posterior given the control loci and ignore the small uncertainty in this matrix. Thus, for the applications in this article, we simply use a single draw from the posterior of the matrix, while noting that for applications with smaller numbers of markers, it may be important to incorporate the uncertainty in the matrix.

To summarize the support for the alternative model M₁ (i.e., the model with β) compared to the null model M₀ we calculate the Bayes factor

\[\frac{P(M_{1}\mathrm{{\vert}}\mathbf{\mathrm{n}}_{l},\mathbf{\mathrm{m}}_{l})}{P(M_{0}\mathrm{{\vert}}\mathbf{\mathrm{n}}_{l},\mathbf{\mathrm{m}}_{l})}{=}\frac{{{\int}}P(\mathbf{\mathrm{n}}_{l},\mathbf{\mathrm{m}}_{l}\mathrm{{\vert}}\mathrm{{\theta}}_{l})P(\mathrm{{\theta}}_{l}\mathrm{{\vert}}\mathrm{{\beta}},\mathrm{{\varepsilon}}_{l},{\Omega})P({\Omega})P(\mathrm{{\varepsilon}}_{l})P(\mathrm{{\beta}})d\mathrm{{\beta}}d\mathrm{{\theta}}_{l}d\mathrm{{\varepsilon}}_{l}d{\Omega}}{{{\int}}P(\mathbf{\mathrm{n}}_{l},\mathbf{\mathrm{m}}_{l}\mathrm{{\vert}}\mathrm{{\theta}}_{l})P(\mathrm{{\theta}}_{l}\mathrm{{\vert}}\mathrm{{\varepsilon}}_{l},{\Omega})P({\Omega})P(\mathrm{{\varepsilon}}_{l})d\mathrm{{\theta}}_{l}d\mathrm{{\varepsilon}}_{l}d{\Omega}},\]

(7)

where P(M₀ | n_l, m_l) is the posterior probability of the data at locus l under the null model (β = 0), found by integrating the right-hand side of Equation 3 over all the parameters of the null model, and P(M₁ | n_l, m_l) is given by the integral of Equation 6 over all parameters of the alternative model. The calculation of the Bayes factor is discussed in appendixb, where we present a way in which many environmental correlations can be tested using a single run of the MCMC for a SNP. The importance sampler will perform best when the target distribution is close to that simulated under, i.e., when the posterior of the data under the alternative model is close to that under the null model. This suggests that the importance sampler will converge quickly when the data resemble data generated under the null model; i.e., our estimates of large Bayes factors will be noisier. As with all MCMC algorithms the estimates of Bayes factors produced by the method should be checked with multiple runs of the algorithm. We find good agreement between Bayes factors over independent runs and also between the posterior of β estimated by the importance sampler and posteriors for β estimated by MCMC (results not shown).

Assessing significance:

Various aspects of population history are likely to violate the simple assumptions of our model; therefore the addition of the linear dependence on an environmental variable might improve the fit to the frequencies of even neutral alleles. Indeed, when investigating the performance of the method, on control sets of SNPs chosen to be neutral (e.g., Conradet al. 2006; Hancocket al. 2008), we found that the distribution of Bayes factors differed dramatically between environmental variables. This variation in the distribution of Bayes factors, across environmental variables, was not seen in data sets simulated using the matrix estimated from the Human Genome Diversity Project (HGDP) data, suggesting that it represents a feature of the data (results not shown). Thus, a large Bayes factor supporting the alternative model may not be strong evidence that the SNP has been the target of selection. A robust way to overcome this problem is to apply the method to all of the control SNPs and build an “empirical distribution” of Bayes factors from the control SNPs. The Bayes factor for the SNP of interest (for a given environmental variable) can then be compared to this distribution to judge its significance (Hancocket al. 2008, 2010). For applications of the method we recommend that these empirical distributions are constructed separately for bins of mean global allele frequency and ascertainment scheme; this will control for features of the data not well captured by the model. We direct interested readers to Hancocket al. (2010), where these recommendations have been implemented in a genome-wide scan for adaptive alleles.

RESULTS

To explore our method, we initially applied our method to the genotype data gathered by Conradet al. (2006) for 927 individuals from the 52 human populations of the HGDP panel. These populations represent a reasonable sampling from around the world, although there are some notable gaps in the sampling. Conradet al. (2006) typed 2333 SNPs in 32 autosomal regions to study patterns of linkage disequilibrium. Although the SNPs within each region are in partial linkage disequilibrium, and thus violate our assumption of independence between SNPs, parameter estimates of the model should not be biased as a result (although this violation may lead to overconfident parameter estimates). Consistent with this, we get very similar results if we run the analysis on subsets of the genomic regions (results not shown). We first estimated the 52 × 52 variance–covariance matrix of the HGDP populations. We show a single draw from the posterior of the covariance matrix in Figure 2A and the correlation matrix computed from this matrix in Figure 2B. These matrices reveal the close genetic relationship of populations from the same geographic region, which is qualitatively similar to the groups identified by STRUCTURE for these data (Conradet al. 2006) and samples (Rosenberget al. 2002; Liet al. 2008). Also, the Uygur and Hazara populations are clearly picked out as showing higher covariance between the East Asian and Western Eurasia blocks than other populations within the blocks, consistent with the hypothesis that these populations result from recent admixture events between these broad geographic regions (Rosenberget al. 2002).

Figure 2.—

(A) A single draw from the posterior of the covariance matrix estimated for the HGDP SNPs of Conradet al. (2006). (B) The correlation matrix calculated from the covariance matrix shown in A. The matrices are displayed as heat maps with lighter colors corresponding to higher values. The rows and columns of these matrices have been arranged by broad geographic label.

Open in new tab Download slide

The convergence of the MCMC to the posterior is relatively quick and dropping the first 5000 iterations was more than sufficient as a burn-in. Multiple independent runs with different starting positions quickly converged to similar matrices. Draws from the posterior showed relatively small fluctuations around the matrix displayed in Figure 2A, suggesting that the matrix is reliably estimated from this data set.

The posterior of the matrix estimated from these data is reasonably unaffected by the choice of prior of the covariance matrix. If instead of setting the shape matrix of the prior on the covariance matrix (R) to the identity matrix, we set it to specify a strong prior correlation (i.e., R_ij = 1 if i = j and 0.99 for i ≠ j), there is no notable difference in the posterior estimate of the covariance matrix. For example, the Mantel matrix correlations between draws of the posterior matrix within a run are very similar to those observed between runs with different priors (results not shown). For applications to small numbers of markers and/or sample sizes, it may be advisable to use this prior shape matrix that reflects a strong correlation of allele frequencies between populations. This will encourage the method to combine information over small samples and will reduce the chance that populations with small samples will add noise to the test.

Power simulation:

Next we explored the power of our method to detect correlations between allele frequencies and environmental variables in the HGDP data. Simulating selection in a large number of populations under nonequilibrium models of history is challenging and unappealing, as it requires the arbitrary choice of many parameters. Instead we took an empirical approach and altered the sample frequency of the SNPs in the Conradet al. (2006) HGDP panel by adding a linear effect of the environment variable. If the current sample frequency of an allele in population k is X_k = n_k/(n_k + m_k), then our new frequency is

\(X{^\prime}_{k}\)

= X_k+βY_k, where Y_k is the environmental variable in the kth population (we rescaled Y to have mean zero and variance 1). We converted these to sample frequencies by rounding n_k

\(X{^\prime}_{k}\)

to the nearest integer

\(n_{k}{^\prime}\)

⁠; if

\(n_{k}{^\prime}\)

is negative or exceeds the sample size (n_k+m_k), then

\(n_{k}{^\prime}\)

is set to zero or n_k + m_k, respectively. Informally, this linear shift may be thought of as modeling strong selection that acted recently on the frequencies of the allele across populations. We conducted these power simulations for a range of β.

We then calculated the Bayes factor to assess support for a correlation between Y and the modified SNP frequencies. For comparison, we also calculated the power of a number of other test statistics aimed at detecting the correlation between the environmental variable and the sample allele frequencies: Spearman's rank correlation ρ, the P-value from a linear regression model, and the P-value in a linear model obtained after first regressing out the first three principal components of the genetic data. We note that none of these three alternative methods offers a well-calibrated statistic; i.e., the P-values were not uniform under the null model (i.e., β = 0). Therefore, for these alternative methods and our Bayes factors we used the empirical distribution of a test statistic to correctly set the cutoff threshold for significance. To do this, we calculate the test statistic for all SNPs, applying no linear effect of the environmental variable (i.e., β = 0), and create an empirical distribution for each of these test statistics. We then find the 5% cutoff for this empirical distribution and any test statistic lower than this is declared significant at the 5% level. To explore the power of the various methods to detect an environmental correlation, we chose a geographic variable, i.e., latitude, and a climate variable, i.e., summer precipitation. Latitude was chosen because it has been used in a number of previous studies (e.g., Beckmanet al. 1994; Thompsonet al. 2004; Younget al. 2005; Hancocket al. 2008) and summer precipitation was chosen as an example where all methods should have good power because summer precipitation is relatively uncorrelated with genetic patterns (it has only mildly significant correlations with the first four principal components of the genetic data). In Figure 3, we show our power to detect a latitudinal effect on population allele frequency. As can be seen, the methods that account for the genetic structure of the populations outperform those that do not, and our method is the most powerful.

Figure 3.—

The power of various methods to detect a correlation between latitude and allele frequency.

Open in new tab Download slide

When the environmental variable is reasonably uncorrelated with the major axes of variation in genetic data, the power of all of the methods improves, compared to the latitudinal results, as the “noise” caused by genetic relatedness between populations is less confounded with the signal that we are trying to detect. This can be seen clearly by comparing the power to detect an effect with summer precipitation (see Figure 4) to the latitudinal case, where all methods have lower power (note that both environmental variables have been standardized and so the average change in frequency is approximately the same in both cases). Also, the improvement in power of methods that account for genetic structure over those that do not is greatly reduced if the environmental variable is not strongly correlated with the principal axes of variation in the genetic data. For example, the power of Spearman's rank test to detect an effect of summer precipitation is comparable to that of the principal component method.

Figure 4.—

The power of various methods to detect a correlation between summer precipitation and allele frequency.

Open in new tab Download slide

We also explored our power to detect a “continental” effect. To this end, we created an environmental variable Y, where Y = 1 for all European populations and Y = −1 for all non-European populations. We again calculated our power to detect such an effect using our estimated Bayes factor as a test statistic. For comparison, we also calculated the power to detect the effect using various F_ST-based measures between broad geographic areas: pairwise F_ST between the Middle East and Europe, pairwise F_ST between Central Asia and Europe, and a European population-specific measure of F_ST (these were all calculated according to Cockerham and Weir 1986 and Weir and Hill 2002). We also calculated a global F_ST using “continent”-level labels, but it performed very poorly, presumably because the changes in allele frequency were restricted to a subset of populations, and so was not included. The results of the power simulations are shown in Figure 5. Our method has higher power than the standard F_ST-based methods, perhaps because it effectively combines information from all of the pairwise comparisons in the data while accounting for the noise and covariance in each comparison (see also Excoffieret al. 2009 for discussion).

Figure 5.—

The power of various methods to detect a “European effect” on allele frequency.

Open in new tab Download slide

Data application:

To demonstrate the utility of the method for genome-wide data, we applied the method to the 640,698 autosomal SNPs typed by Liet al. (2008) in the HGDP–CEPH. An expanded exploration of these data for a range of environmental variables is given in Hancocket al. (2010). Since the genotyped SNPs come from three different ascertainment panels (Eberleet al. 2007), we estimated three different covariance matrices, by sampling three different sets of 10,000 SNPs at random from the three different SNP sets. Draws from the posterior of the three covariance matrices estimated were qualitatively very similar to the example shown in Figure 2 and to each other and showed little variation within runs of the MCMC (results not shown).

We calculated the Bayes factors for all autosomal SNPs for a number of different continental effects, using the covariance matrix for each SNP that matched its ascertainment set. All of the calculated Bayes factors, along with the matrices used, are available for download at http://www.eve.ucdavis.edu/gmcoop/. In Figure 6, we show the Bayes factors for all autosomal SNPs for two of the effects tested: a European effect and a Western Eurasian effect (Europe, Middle East, and Central Asia). For example, for the European effect we set Y = 1 for all European populations and Y = −1 for all other populations and then standardize Y.

A plot of the log10 Bayes factor for each SNP along the human genome for (A) a European effect and (B) a western Eurasian effect. Bayes factors <1 are not plotted. The numbers on the x-axis indicate chromosome number, with SNPs on different chromosomes colored alternately in red and black. We list the name of the gene that is nearest to each of the six highest-ranking SNPs in each plot (considering only the peak SNP in each cluster of high Bayes factors).

Figure 6.—

A plot of the log₁₀ Bayes factor for each SNP along the human genome for (A) a European effect and (B) a western Eurasian effect. Bayes factors <1 are not plotted. The numbers on the x-axis indicate chromosome number, with SNPs on different chromosomes colored alternately in red and black. We list the name of the gene that is nearest to each of the six highest-ranking SNPs in each plot (considering only the peak SNP in each cluster of high Bayes factors).

Open in new tab Download slide

Among our top hits for a European effect are previously identified hits at TLR6 (Toddet al. 2007; Pickrellet al. 2009), SLC45A2 (Nortonet al. 2007), and HERC2 (Sulemet al. 2007). Variants at these genes are known to be involved in immune response, skin pigmentation, and eye color, respectively. All of these alleles strongly differentiate the European populations from the closely related Middle Eastern and Central Asian populations. Instructively, SNPs in the Lactase gene are not among our top European effects, despite the selected allele being at high frequency within Europe and almost absent in the Middle East. This is likely due to the fact that the selected allele is also at high frequency in Central Asia (Bersaglieriet al. 2004) and in fact SNPs near the lactase gene have some of the largest Bayes factors for a joint Europe–Central Asian effect genome-wide.

Our top hit for a Western Eurasian effect (Figure 6) is the previously identified signal at SLC24A5 (Lamasonet al. 2005). Interestingly, EDAR is among our top hits for a Western Eurasian effect; this gene was previously identified by Sabetiet al. (2007) as the putative target of the strong selective sweep in East Asians (and is among the top Bayes factor signals for an East Asian–America effect). The Bayes factor signal in Western Eurasia hints that this genomic region might have undergone separate sweeps in Western Eurasia and East Asia. To explore the signal for a second sweep using haplotype patterns in this region, we calculated the empirical P-value for the XPEHH statistic (Sabetiet al. 2007), which compared haplotype homozygosity between two populations, for a window containing this gene (see Pickrellet al. 2009 for details). The empirical P-value was 0.029 in Europe, 0.046 in the Middle East, and 0.026 in Central Asia, suggesting evidence of a second sweep, although less strongly than the XP-EHH signal in East Asians for this region (empirical P-value of 0.0015).

DISCUSSION

In this article, we develop a flexible model for examining the correlation in allele frequencies across populations, parameterized by the covariance matrix. Our main goal was to use this estimated covariance matrix to perform a parametric test of the effect of an environmental or continental variable on the frequency of an allele at a SNP, while controlling for the correlation of allele frequencies across populations. Therefore, this model has strong similarities to generalized linear mixed models, where the environmental variable is a fixed effect and the covariance matrix governs the random effects. Thus, the approach is conceptually similar to the approaches introduced to map phenotypes across individuals in strongly structured populations; in these approaches a kinship matrix is used to account for differences in the background genetic relatedness (Yuet al. 2006; Kanget al. 2008). The proposed test results in a considerable improvement in power to estimate effects over methods widely used in the literature. This gain of power comes from the fact that the estimated covariance matrix informs the model as to how to weight the different populations.

For a single locus, the covariance matrix is proportional to the variance and covariance of allele frequencies around a common mean, where the constant of proportionality is the binomial variance ε_l(1 – ε_l). This means that the elements of the estimated matrix have a direct intepretation as a parametric estimate of the pairwise and population-specific F_ST. This interpretation holds only for relatively low levels of drift, as the boundaries at zero and one distort the relationship between the variance of the multivariate normal and F_ST for large levels of drift. Nicholsonet al. (2002) have an extensive discussion of this interpretation of the variance in the case where each population drifts independently from some shared ancestral population, and Weir and Hill (2002) and Samantaet al. (2009) discuss this connection for the maximum-likelihood estimator of the sample covariance matrix when levels of drift are low and sample sizes are large. The framework presented here thus provides a Bayesian model-based estimate of F_ST matrices discussed in Weir and Hill (2002). An alternative model-based formulation of population-specific F_ST is offered by the beta-binomial island-model framework (Balding and Nichols 1995; see also Balding 2003). This island-model framework considers populations at an equilibrium of mutation–migration–drift balance, as opposed to the nonequilibrium pure drift model of Nicholsonet al. (2002); the merits of these two approaches are discussed in Nicholsonet al. (2002), Baldinget al. (2002), and Balding (2003). The island-model framework has been extended to identify loci that have outlying allele frequencies with respect to particular populations (Beaumont and Balding, 2004; Bazinet al. 2010). Our aim here has been to develop a test of environmental selective gradients; thus while we have implemented this in the spirit of the pure-drift model of Nicholsonet al. (2002) our framework could be implemented into the island-model framework and would likely perform comparably [see also the discussion of choice of g() below].

The study of the pattern of the covariance of allele frequencies across populations has a long history as the covariance contains information about the history and levels of gene flow between populations (Cavalli-Sforzaet al. 1964; see also the review by Felsenstein 1982). It may be fruitful to adapt the Bayesian framework presented here to infer the historical relationships between the sampled populations. For example, the drift on different branches of a tree of populations could be approximated by normal deviations, which would allow rapid calculation of the branch lengths (see RoyChoudhuryet al. 2008 for a recent presentation of the problem). However, the pairwise covariance of allele frequencies across populations can be used only to learn about the average coalescent times within and between pairs of populations (Slatkin 1991) and so this approach could not be directly used to distinguish between isolation models and migration models (see McVean 2009, for discussion).

Here, we take a Bayesian approach that fully models the uncertainty in allele frequencies due to sampling. Thus, while we have discussed the model in terms of allele frequencies across population samples, it could be applied as an individual-based analysis where the genotype of each individual represents two draws from a unique “population.” This formulation may be useful when population labels are not known a priori. The current interest in principal component analysis (PCA), now almost exclusively based upon individual-level rather than population-level analyses, suggests that such applications would be useful, given that PCA is a decomposition of the covariance matrix (see McVean 2009, for discussion). Our method can also be extended to other marker types, e.g., for biallelic dominant markers, by using a different form from P(n_l, m_l | x_l) (e.g., Falushet al. 2007). Likewise, the method could also be applied to data for pooled sample data or next-generation short-read sequencing data—once again by modifiying P(n_l, m_l | x_l). We have also experimented with different transforms of the population frequencies [i.e., g()], e.g., a logit transform, and found that they gave very similar results, particularly in terms of the test of environmental variables (results not shown). Such transforms may be useful for applying the method to multiallelic systems such as microsatellites (e.g., Wasseret al. 2004).

We summarize the support for the model with an effect on an environmental variable compared to a model without a linear effect using a Bayes factor. In the application to the HGDP data, we ranked the SNPs by Bayes factor. A posterior predictive P-value (Rubin 1984) could be obtained by simulation from the posterior distribution of the null model, which would likely lead to a similar ranking. However, we have deliberately refrained from utilizing the method to make statements about the absolute “significance” of the correlation seen at specific SNPs, as we are somewhat skeptical about the fit of the null model even to data with no environmental dependence. Rather we suggest that careful comparison of the empirical distribution of test statistics (in our case Bayes factors) between a set of putatively selectively neutral control markers and candidate SNPs of interest is the most convincing way forward (Hancocket al. 2008). This can be accomplished in a genome-wide setting by genic to nongenic SNPs (assuming that nongenic sites are less likely to be functional) to judge the evidence for an enrichment of selection signals in the tails of a test statistic (Barreiroet al. 2008; Coopet al. 2009; Hancocket al. 2010). The empirical approach in turn has some serious drawbacks, the most obvious of which is deciding what statistical cutoff to use, as the choice of cutoff reflects one's prior beliefs of the prelevance of selection (Teshimaet al. 2006).

It is hard to predict in advance how often strong correlations between allele frequencies and environmental variables will form across a species range. However, it is likely that strong gene flow and the parallel mutation will both act to reduce the likelihood of strong correlations. If selection is not strongly divergent across a species range, i.e., the locally adaptated alleles are not selected against in other regions, then the selected allele will be spread across the species range by migration. Under these circumstances correlations may temporarily form but they may not persist for long, and these occurrences will also depend critically upon where the mutation arose and patterns of migration. [Even standard allele frequency differentiation-based methods may not identify a rapidly spreading sweep, and haplotype-based methods may be more informative (e.g., Voightet al. 2005).] In addition, the method will tend to detect only those loci where the environmental variable had a consistent effect on the frequency of a particular allele (due to either hitchhiking or the direct action of selection) and so may not detect regions of the genome where in different populations the same selection pressure has caused different haplotypes to go to fixation. Thus, if rates of gene flow are low across a species range compared to mutation rates toward the adaptive phenotype, then repeated evolution of a phenotype may occur by different genetic routes in different parts of the species range. For example, the genetic basis of pigmentation differs between geographic regions within a number of species (e.g., Hoekstra and Nachman 2003; Nortonet al. 2007; Edwardset al. 2010). Under these circumstances, the frequency of an allele will be correlated with an environmental variable only in parts of the species range. This suggests that it may be profitable to perform the analysis, including the estimation of the covariance matrix, separately in different geographic regions.

In closing, we note that while the method presented here is potentially very useful in identifying selected loci via their correlation with environmental variables, we caution against overintepreting the correlations (or lack of correlations) found. It is unlikely that causal selection pressures can be identified by such correlations as many environmental and ecological variables covary. Further, as outlined above, correlations may exist as a transient stage during the spread of a selected allele (even in the absence of a causal relationship). Thus we view this method as a powerful way of highlighting interesting loci and correlations that can be further explored by follow-up studies.

APPENDIX A: SAMPLING FROM THE POSTERIOR

We first describe how we calculate the posterior under the null model, and then we discuss the calculation of the Bayes factor in the next section. We wish to estimate the posterior of Ω integrating over our uncertainty in θ and ε. To do this, we use MCMC, where in each iteration of the MCMC algorithm we sequentially update the different parameters. Conditional on Ω and θ_l, we update ε_l at each locus using a standard Metropolis update, i.e., add a small normally distributed deviation to ε_l and accept the new ε_l if it falls within the range (0, 1), with a probability given by the ratio of the posteriors under the current and new value of ε.

We could use a similar proposal to update each θ_kl but since the θ_l are highly correlated over populations, we found that updating them individually results in a high rejection rate. Thus, we update the entire vector of transformed population frequencies (θ_l) simultaneously for a locus in a way that attempts to account for that correlation. Our proposal for the new transformed frequencies at a locus l,

\(\mathrm{{\theta}{^\prime}}_{l}\)

⁠, is a small deviation from our current vector of θ_l. This random deviation is chosen to have the correct covariance structure over populations

\(\mathrm{{\theta}{^\prime}}_{l}\)

= θ_l + Δ, where Δ = CX, X is a length K vector of standard uncorrelated normals, and C is the Cholesky decomposition of the current covariance matrix (i.e., Ω = CC^T). We then accept

\(\mathrm{{\theta}{^\prime}}_{l}\)

with probability given by the ratio of the posteriors.

As the inverse Wishart is the conjugate prior for the covariance–variance matrix of a multivariate normal, given all the θ_l and ε_l over all loci, the posterior is itself inverse Wishart with form

\[W((\mathrm{{\rho}}R{+}L{\hat{S}})^{{-}1},{\,}\mathrm{{\rho}}{+}L),\]

(A1)

where

\({\hat{S}}\)

is the sample estimate of covariance matrix of θ_l over loci,

\({\hat{S}}{=}(1/K){\sum}_{l{=}1}^{L}(1/\mathrm{{\varepsilon}}_{l}(1{-}\mathrm{{\varepsilon}}_{l}))(\mathrm{{\theta}}_{l}{-}\mathrm{{\varepsilon}}_{l})(\mathrm{{\theta}}_{l}{-}\mathrm{{\varepsilon}}_{l})^{T}\)

⁠. Therefore, we can update our covariance matrix using Gibbs sampling, by sampling from the Wishart distribution given in Equation A1 using the algorithm described in Odell and Feiveson (1966).

APPENDIX B: EVALUATING THE BAYES FACTOR

There are a number of ways to evaluate this Bayes factor; for example, we could allow the MCMC to move between the models with and without the linear factor β and estimate the Bayes factor by the proportion of time the MCMC spends in the alternative model. However, because we wish to apply the test to large sets of SNPs, e.g., genome-wide data sets for a large number of environment variables, we make use of importance sampling to quickly calculate the Bayes factor from a single run of the MCMC under the null distribution. Specifically the ratio of probabilities

\(P(M_{1}\mathrm{{\vert}}\mathbf{\mathrm{n}}_{l},\mathbf{\mathrm{m}}_{l})/P(M_{0}\mathrm{{\vert}}\mathbf{\mathrm{n}}_{l},\mathbf{\mathrm{m}}_{l})\)

can be written as

\[{=}\frac{{{\int}}P(\mathbf{\mathrm{n}}_{l},\mathbf{\mathrm{m}}_{l}\mathrm{{\vert}}\mathrm{{\theta}}_{l})P(\mathrm{{\theta}}_{l}\mathrm{{\vert}}\mathrm{{\beta}},\mathrm{{\varepsilon}}_{l},{\Omega})P({\Omega})P(\mathrm{{\varepsilon}}_{l})P(\mathrm{{\beta}})d\mathrm{{\beta}}d\mathrm{{\theta}}_{l}d\mathrm{{\varepsilon}}_{l}d{\Omega}}{{{\int}}P(\mathbf{\mathrm{n}}_{l},\mathbf{\mathrm{m}}_{l}\mathrm{{\vert}}\mathrm{{\theta}}_{l})P(\mathrm{{\theta}}_{l}\mathrm{{\vert}{\varepsilon}}_{l},{\Omega})P({\Omega})P(\mathrm{{\varepsilon}}_{l})d\mathrm{{\theta}}_{l}d\mathrm{{\varepsilon}}_{l}d{\Omega}}\]

(B1)

\[{=}\frac{{{\int}}P(\mathbf{\mathrm{n}}_{l},\mathbf{\mathrm{m}}_{l}\mathrm{{\vert}}\mathrm{{\theta}}_{l})(P(\mathrm{{\theta}}_{l}\mathrm{{\vert}}\mathrm{{\beta}},\mathrm{{\varepsilon}}_{l},{\Omega})/P(\mathrm{{\theta}}_{l}\mathrm{{\vert}{\varepsilon}}_{l},{\Omega}))P(\mathrm{{\theta}}_{l}\mathrm{{\vert}{\varepsilon}}_{l},{\Omega})P({\Omega})P(\mathrm{{\varepsilon}}_{l})d\mathrm{{\theta}}_{l}d\mathrm{{\varepsilon}}_{l}d{\Omega}}{{{\int}}P(\mathbf{\mathrm{n}}_{l},\mathbf{\mathrm{m}}_{l}\mathrm{{\vert}}\mathrm{{\theta}}_{l})P(\mathrm{{\theta}}_{l}\mathrm{{\vert}{\varepsilon}}_{l},{\Omega})P({\Omega})P(\mathrm{{\varepsilon}}_{l})d\mathrm{{\theta}}_{l}d\mathrm{{\varepsilon}}_{l}d{\Omega}}.\]

(B2)

Writing

\[W(\mathrm{{\theta}}_{l},\mathrm{{\beta}},\mathrm{{\varepsilon}}_{l},{\Omega}){=}\frac{P(\mathrm{{\theta}}_{l}\mathrm{{\vert}}\mathrm{{\beta}},\mathrm{{\varepsilon}}_{l},{\Omega})}{P(\mathrm{{\theta}}_{l}\mathrm{{\vert}{\varepsilon}}_{l},{\Omega})}\]

(B3)

we see that

\[\frac{{{\int}}W(\mathrm{{\theta}}_{l},\mathrm{{\beta}},\mathrm{{\varepsilon}}_{l},{\Omega})P(\mathbf{\mathrm{n}}_{l},\mathbf{\mathrm{m}}_{l}\mathrm{{\vert}}\mathrm{{\theta}}_{l})P(\mathrm{{\theta}}_{l}\mathrm{{\vert}{\varepsilon}}_{l},{\Omega})P({\Omega})P(\mathrm{{\varepsilon}}_{l})d\mathrm{{\theta}}_{l}d\mathrm{{\varepsilon}}_{l}d{\Omega}}{{{\int}}P(\mathbf{\mathrm{n}}_{l},\mathbf{\mathrm{m}}_{l}\mathrm{{\vert}}\mathrm{{\theta}}_{l})P(\mathrm{{\theta}}_{l}\mathrm{{\vert}{\varepsilon}}_{l},{\Omega})P({\Omega})P(\mathrm{{\varepsilon}}_{l})d\mathrm{{\theta}}_{l}d\mathrm{{\varepsilon}}_{l}d{\Omega}}{=}{{\int}}W(\mathrm{{\theta}}_{l},\mathrm{{\beta}},\mathrm{{\varepsilon}}_{l},{\Omega})P(\mathrm{{\theta}}_{l},\mathrm{{\varepsilon}}_{l},\mathrm{{\Omega}{\vert}}\mathbf{\mathrm{n}}_{l},\mathbf{\mathrm{m}}_{l})d\mathrm{{\theta}}_{l}d\mathrm{{\varepsilon}}_{l}d{\Omega}\]

(B4)

and thus that the Bayes factor is the expected value of W(θ_l, β, ε_l, Ω) integrated over the null model posterior and so can be evaluated by averaging W(θ_l, β, ε_l, Ω) over the MCMC on θ_l, ε_l, Ω (see, for example, Liu 2002). We do this by averaging W(θ_l, β, ε_l, Ω) simultaneously for a grid of β-values and obtain an estimate of the Bayes factor by numerically integrating these grid points over the uniform prior. The fact that the Bayes factor for a particular environmental variable can be evaluated using the MCMC for the null model means that we can evaluate the Bayes factor quickly for as many environmental variables as required, using a single run of the MCMC for each SNP.

Footnotes

Communicating editor: M. K. Uyenoyama

Acknowledgements

The authors thank Molly Przeworski, John Novembre, and members of the Di Rienzo, Pritchard, Przeworski, Stephens groups and the University of California at Davis Evolutionary Discussion Group for helpful discussions. We thank John Novembre, Molly Przeworski, Peter Ralph, and the reviewers for helpful comments on the manuscript. This work was supported by a Packard Foundation and Howard Hughes Medical Institute grant to J.K.P., by National Institutes of Health grants (DK56670 and GM79558) to A.D., and by a Sloan Foundation Fellowship to G.C.

References

Allen, J.,

1877

The influence of physical conditions in the genesis of species.

Radic. Rev.

1

:

108

–140.

Auton, A., K. Bryc, A. R. Boyko, K. E. Lohmueller, J. Novembre et al.,

2009

Global distribution of genomic diversity underscores rich complex history of continental human populations.

Genome Res.

19

:

795

–803.

Month:	Total Views:
January 2021	9
February 2021	66
March 2021	135
April 2021	77
May 2021	69
June 2021	60
July 2021	62
August 2021	66
September 2021	55
October 2021	67
November 2021	96
December 2021	63
January 2022	92
February 2022	64
March 2022	82
April 2022	69
May 2022	100
June 2022	54
July 2022	57
August 2022	90
September 2022	95
October 2022	87
November 2022	69
December 2022	79
January 2023	77
February 2023	73
March 2023	94
April 2023	81
May 2023	78
June 2023	70
July 2023	55
August 2023	63
September 2023	80
October 2023	84
November 2023	115
December 2023	75
January 2024	89
February 2024	101
March 2024	81
April 2024	93

Article Contents

Using Environmental Correlations to Identify Loci Underlying Local Adaptation Free

Abstract

METHODS

Null model:

Alternative model:

Assessing significance:

RESULTS

Power simulation:

Data application:

DISCUSSION

APPENDIX A: SAMPLING FROM THE POSTERIOR

APPENDIX B: EVALUATING THE BAYES FACTOR

Footnotes

Acknowledgements

References

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

This Feature Is Available To Subscribers Only

Using Environmental Correlations to Identify Loci Underlying Local Adaptation