-
PDF
- Split View
-
Views
-
Cite
Cite
Andreas Futschik, Christian Schlötterer, The Next Generation of Molecular Markers From Massively Parallel Sequencing of Pooled DNA Samples, Genetics, Volume 186, Issue 1, 1 September 2010, Pages 207–218, https://doi.org/10.1534/genetics.110.114397
- Share Icon Share
Abstract
Next generation sequencing (NGS) is about to revolutionize genetic analysis. Currently NGS techniques are mainly used to sequence individual genomes. Due to the high sequence coverage required, the costs for population-scale analyses are still too high to allow an extension to nonmodel organisms. Here, we show that NGS of pools of individuals is often more effective in SNP discovery and provides more accurate allele frequency estimates, even when taking sequencing errors into account. We modify the population genetic estimators Tajima's π and Watterson's θ to obtain unbiased estimates from NGS pooling data. Given the same sequencing effort, the resulting estimators often show a better performance than those obtained from individual sequencing. Although our analysis also shows that NGS of pools of individuals will not be preferable under all circumstances, it provides a cost-effective approach to estimate allele frequencies on a genome-wide scale.
NEXT generation sequencing (NGS) is about to revolutionize biology. Through a massive parallelization, NGS provides an enormous number of reads, which permits sequencing of entire genomes at a fraction of the costs for Sanger sequencing. Hence, for the first time it has become feasible to obtain the complete genomic sequence for a large number of individuals. For several organisms, including humans, Drosophila melanogaster, and Arabidopsis thaliana, large resequencing projects are well on their way. Nevertheless, despite the enormous cost reduction, genome sequencing on a population scale is still out of reach for the budget of most laboratories. The extraction of as much statistical information as possible at cost as low as possible has therefore already attracted considerable interest. See, for instance, Jiang et al. (2009) for the modeling of sequencing errors and Erlich et al. (2009) for the efficient tagging of sequences.
Current genome-wide resequencing projects collect the sequences individual by individual. To obtain full coverage of the entire genome and to have high confidence that all heterozygous sites were discovered, it is required that genomes are sequenced at a sufficiently high coverage. As many of the reads provide only redundant information, cost could be reduced by a more effective sampling strategy.
In this report, we explore the potential of DNA pooling to provide a more cost-effective approach for SNP discovery and genome-wide population genetics. Sequencing a large pool of individuals simultaneously keeps the number of redundant DNA reads low and provides thus an economic alternative to the sequencing of individual genomes. On the other hand, more care has to be taken to establish an appropriate control of sequencing errors. Obviously haplotype information is not available from pooling experiments, but this will often be outweighed by the increased accuracy in population genetic inference.
Focusing on biallelic loci, our analysis shows that with sufficiently large pool sizes, pooling usually outperforms the separate sequencing of individuals, both for estimating allele frequencies and for inference of population genetic parameters. When sequencing errors are not too common, pooling seems also to be a good choice for SNP detection experiments. To avoid the additional challenges encountered with individual sequencing of diploid individuals, we compare pooling with individual sequencing of haploid individuals. See Lynch (2008, 2009) for a discussion of next generation sequencing of diploid individuals. Our results for the pooling experiments should be also applicable to a diploid setting, as we are just merging pools of size 2 to a larger pool in this case, leading to a pool size of n = 2nd for nd diploid individuals. In the methods section, we derive several mathematical expressions that permit us to compare pooling with separate sequencing of individuals. These formulas are then applied in the results section to illustrate the differences in accuracy between the approaches. A reader who is interested only in the actual differences under several scenarios might therefore want to move directly to the results section.
METHODS
Throughout, we consider an individual sequencing project where k individuals are sequenced each with an expected coverage λ, by which we mean that any given locus is sequenced λ times on average. For a comparable pooling experiment that involves the same amount of sequencing effort, the expected coverage will then be kλ; i.e., any particular locus will be read kλ times on average from the pool consisting of n individuals. In practice, one might for instance sequence each of the k individuals on a separate Illumina lane with coverage λ. With the same sequencing effort, the pool could be sequenced on k lanes simultaneously, leading to a total coverage of kλ.
For the convenience of the reader, we summarize our notation in Table 1.
Symbol or notation . | Description . |
---|---|
k | No. of haploid individuals used for separate sequencing |
λ | Expected no. of times a locus is read for an individual using separate sequencing |
n | Size of the pool in a pooling experiment |
J | Random no. of individuals for which reads are actually available at a particular locus with individual sequencing (J ≤ k) |
M | Random no. of reads for a particular locus in a pooling experiment [E(M) = kλ] |
p | Relative frequency of the allele of interest in the population |
F(P)(b, γ) | Poisson cumulative distribution function ( \(F_{(\mathrm{P})}(b,{\,}\mathrm{{\gamma}}){=}{\sum}_{i{=}0}^{b}(\mathrm{{\gamma}}^{i}{\,}/i!)\mathrm{exp}({-}\mathrm{{\gamma}})\) ) |
F(B)(x, M, p) | Binomial cumulative distribution function \((F_{(\mathrm{B})}(x,{\,}M,{\,}p){=}{\sum}_{i{=}0}^{x}{[}\begin{array}{l}M\\i\end{array}{]}p^{i}(1{-}p)^{M{-}i})\) |
\(\mathrm{{\hat{{\theta}}}}_{\mathrm{{\pi}}}^{(b){\ast}}\) | Bias-corrected version of Tajima's π for a pooling experiment when the minor allele frequency is required to be at least b. For b = 1, \(\mathrm{{\hat{{\theta}}}}_{\mathrm{{\pi}}}^{(b){\ast}}{=}\mathrm{{\hat{{\theta}}}}_{\mathrm{{\pi}}}^{{\ast}}\) |
\(\mathrm{{\hat{{\theta}}}}_{\mathrm{W}}^{(b){\ast}}\) | Bias-corrected version of Watterson's θ for a pooling experiment when the minor allele frequency is required to be at least b ≥ 1 |
Symbol or notation . | Description . |
---|---|
k | No. of haploid individuals used for separate sequencing |
λ | Expected no. of times a locus is read for an individual using separate sequencing |
n | Size of the pool in a pooling experiment |
J | Random no. of individuals for which reads are actually available at a particular locus with individual sequencing (J ≤ k) |
M | Random no. of reads for a particular locus in a pooling experiment [E(M) = kλ] |
p | Relative frequency of the allele of interest in the population |
F(P)(b, γ) | Poisson cumulative distribution function ( \(F_{(\mathrm{P})}(b,{\,}\mathrm{{\gamma}}){=}{\sum}_{i{=}0}^{b}(\mathrm{{\gamma}}^{i}{\,}/i!)\mathrm{exp}({-}\mathrm{{\gamma}})\) ) |
F(B)(x, M, p) | Binomial cumulative distribution function \((F_{(\mathrm{B})}(x,{\,}M,{\,}p){=}{\sum}_{i{=}0}^{x}{[}\begin{array}{l}M\\i\end{array}{]}p^{i}(1{-}p)^{M{-}i})\) |
\(\mathrm{{\hat{{\theta}}}}_{\mathrm{{\pi}}}^{(b){\ast}}\) | Bias-corrected version of Tajima's π for a pooling experiment when the minor allele frequency is required to be at least b. For b = 1, \(\mathrm{{\hat{{\theta}}}}_{\mathrm{{\pi}}}^{(b){\ast}}{=}\mathrm{{\hat{{\theta}}}}_{\mathrm{{\pi}}}^{{\ast}}\) |
\(\mathrm{{\hat{{\theta}}}}_{\mathrm{W}}^{(b){\ast}}\) | Bias-corrected version of Watterson's θ for a pooling experiment when the minor allele frequency is required to be at least b ≥ 1 |
Symbol or notation . | Description . |
---|---|
k | No. of haploid individuals used for separate sequencing |
λ | Expected no. of times a locus is read for an individual using separate sequencing |
n | Size of the pool in a pooling experiment |
J | Random no. of individuals for which reads are actually available at a particular locus with individual sequencing (J ≤ k) |
M | Random no. of reads for a particular locus in a pooling experiment [E(M) = kλ] |
p | Relative frequency of the allele of interest in the population |
F(P)(b, γ) | Poisson cumulative distribution function ( \(F_{(\mathrm{P})}(b,{\,}\mathrm{{\gamma}}){=}{\sum}_{i{=}0}^{b}(\mathrm{{\gamma}}^{i}{\,}/i!)\mathrm{exp}({-}\mathrm{{\gamma}})\) ) |
F(B)(x, M, p) | Binomial cumulative distribution function \((F_{(\mathrm{B})}(x,{\,}M,{\,}p){=}{\sum}_{i{=}0}^{x}{[}\begin{array}{l}M\\i\end{array}{]}p^{i}(1{-}p)^{M{-}i})\) |
\(\mathrm{{\hat{{\theta}}}}_{\mathrm{{\pi}}}^{(b){\ast}}\) | Bias-corrected version of Tajima's π for a pooling experiment when the minor allele frequency is required to be at least b. For b = 1, \(\mathrm{{\hat{{\theta}}}}_{\mathrm{{\pi}}}^{(b){\ast}}{=}\mathrm{{\hat{{\theta}}}}_{\mathrm{{\pi}}}^{{\ast}}\) |
\(\mathrm{{\hat{{\theta}}}}_{\mathrm{W}}^{(b){\ast}}\) | Bias-corrected version of Watterson's θ for a pooling experiment when the minor allele frequency is required to be at least b ≥ 1 |
Symbol or notation . | Description . |
---|---|
k | No. of haploid individuals used for separate sequencing |
λ | Expected no. of times a locus is read for an individual using separate sequencing |
n | Size of the pool in a pooling experiment |
J | Random no. of individuals for which reads are actually available at a particular locus with individual sequencing (J ≤ k) |
M | Random no. of reads for a particular locus in a pooling experiment [E(M) = kλ] |
p | Relative frequency of the allele of interest in the population |
F(P)(b, γ) | Poisson cumulative distribution function ( \(F_{(\mathrm{P})}(b,{\,}\mathrm{{\gamma}}){=}{\sum}_{i{=}0}^{b}(\mathrm{{\gamma}}^{i}{\,}/i!)\mathrm{exp}({-}\mathrm{{\gamma}})\) ) |
F(B)(x, M, p) | Binomial cumulative distribution function \((F_{(\mathrm{B})}(x,{\,}M,{\,}p){=}{\sum}_{i{=}0}^{x}{[}\begin{array}{l}M\\i\end{array}{]}p^{i}(1{-}p)^{M{-}i})\) |
\(\mathrm{{\hat{{\theta}}}}_{\mathrm{{\pi}}}^{(b){\ast}}\) | Bias-corrected version of Tajima's π for a pooling experiment when the minor allele frequency is required to be at least b. For b = 1, \(\mathrm{{\hat{{\theta}}}}_{\mathrm{{\pi}}}^{(b){\ast}}{=}\mathrm{{\hat{{\theta}}}}_{\mathrm{{\pi}}}^{{\ast}}\) |
\(\mathrm{{\hat{{\theta}}}}_{\mathrm{W}}^{(b){\ast}}\) | Bias-corrected version of Watterson's θ for a pooling experiment when the minor allele frequency is required to be at least b ≥ 1 |
SNP detection:
A SNP is detected at a site if the site is polymorphic, i.e., if at least two alleles A and a are found in the sequenced sample. We consider SNP detection both in the context of pooling experiments and for individual sequencing. To assess the performance of these two competing scenarios, we look both at the power and at the probability of falsely calling a SNP due to sequencing errors.
Generally speaking, an experimental design that provides high power while keeping the probability of incorrectly detecting a SNP small is preferable. When individuals are sequenced separately, the probability of sequencing errors being interpreted as true SNPs can be reduced by a sufficiently high expected coverage if the genotype of an individual is inferred by the majority of reads. Note that in the case of diploid individuals, the distinction between sequencing errors and true SNPs is significantly more complicated. In pooling experiments, a simple way to control the probability of falsely detecting SNPs both in the haploid and in the diploid case is to require a certain minimum number of reads for the minor allele to call a SNP. We extend work by Eberle and Kruglyak (2000) on SNP detection and derive both the power and error rates for pooling experiments and for separate sequencing.
Separate sequencing of individuals:
The resulting error probabilities can be made very small by ensuring a coverage λ that is large enough. Obviously a more sophisticated rule will be needed when sequencing diploid individuals.
Pooling experiment:
Allele frequency inference:
We now turn to the pooling experiment, assuming again a population proportion, p, of A alleles. With LA again denoting the number of A alleles in a pooled sample of size n, we assume MA (Ma) reads of the A (a) allele from this sample. This leads to M = MA + Ma reads for the site under investigation.
Allele frequency estimators for pooled samples that also take into account quality scores of the individual reads are discussed in Holt et al. (2009). The computation of variances for these estimators would depend on the specific assumptions of a probability model for the quality scores.
Estimating population genetic parameters:
Two widely used summary statistics in population genetics are Tajima's π and Watterson's θ. We investigate the influence of the two sequencing strategies on the accuracy of these summary statistics. According to our simulations, both summary statistics show a significantly smaller variance for pooled samples. However, in particular for small pools, the estimators show some bias. The reason for the bias is that multiple reads of the same sequence are entering the normalizing constant as independently sampled sequences, if the estimators are computed in a standard way for pooled samples. Sequencing errors also lead to bias, and if a minimum minor allele frequency is required to make sequencing errors rare, this needs to be taken into account. For individual sequencing, the effect of omitting singletons has been studied by Knudsen and Miyamoto (2009) as well as Achaz (2008). On the basis of the expected values of Tajima's π and Watterson's θ, we introduce modified normalizing constants that make the resulting estimators unbiased under neutrality. These bias-corrected estimators are then compared with those obtained from individual sequencing. (See results.)
To also correct for sequencing errors, two approaches seem feasible. If an unbiased estimate for the sequencing errors is available, such an estimate could be used to correct
RESULTS
SNP detection:
For many biological applications SNP genotyping provides a cost-effective approach, and SNP discovery is the first step required. We compared the efficiency of SNP discovery using an approach in which each individual is sequenced separately with a pooling approach. Figure 1 shows that the comparative efficiency of pooling depends both on the expected coverage and on the minimum number of reads for allele calling used for error protection. While pooling experiments provide a higher probability of SNP detection in most cases, it is expected to be less efficient, if both the coverage is small and a high minimum number of reads is required. This is not entirely unexpected, since an increased number of reads required for the inference of the minor allele reduces the probability of detecting SNPs in a pooling experiment. The higher the expected coverage, the more inefficient individual sequencing becomes. As long as not chosen too small, the size of the pool seems to play a less important role. Figure 2 addresses the problem of wrongly identifying a sequencing error as a SNP. Irrespective of the assumed model of sequencing errors (see methods for further details), a high probability of sequencing errors makes SNP calling from pools highly unreliable. On the other hand, if sequencing error rates are reduced (e.g., by quality filtering), a suitable lower bound on the minimum allele frequency for detecting a SNP makes pooling very reliable for the identification of SNPs. Interestingly, in some cases, we found pooling to result in fewer erroneous SNP calls than individual sequencing.
![Probability of detecting a SNP with relative minor allele frequency p in the population when a certain minimum number of reads is required as a detection threshold. The colored lines indicate the probabilities for sequencing experiments using a pooled sample [purple dashed, no error correction; red dotted, minor allele frequency (MAF) at least 2; blue dashed-dotted, MAF ≥ 4; green long dashed, MAF ≥ 6]. Solid black line: experiment where k = 10 haploid individuals are sequenced separately, with expected coverage: (A) λ = 5, (B) λ = 10, and (C) λ = 20 per individual. For pooling experiments, the expected total coverage is kλ. Pool sizes are either 50 (left) or 200 (right).](https://oup.silverchair-cdn.com/oup/backfile/Content_public/Journal/genetics/186/1/10.1534_genetics.110.114397/5/m_207fig1.jpeg?Expires=1750550323&Signature=RBoklucSOQheMvoCwQWhYhG-0GnFesmASiXmCOLud1WR7yDrqk4B4p~g8GG1cl7i201RubaNwTL8zBWFBOd92SmV1oTLM-9NqXh9ZqNxpzTuLYCGJWkDUYDeyBp5dg0vNCRt6Z1xwj6lNzCBw-kavlL4kR7R8T09PFI0WSqnXj8O490va8fDPBB4Vf84ApFs5KuM0bQf8Z~e6-8vxPxWrpcNZ1m4FshyV4Vi8j~mYzyFztJn~s7aqzlaKzadbQZmBYKW1twdoJBgSq81OEpPYlKI8sTCQma-EKxg7dJFhPb0-DemZcYti2D-6bDlcSFEj2F0VjsT0hh1RfsnfrvSRg__&Key-Pair-Id=APKAIE5G5CRDK6RD3PGA)
Probability of detecting a SNP with relative minor allele frequency p in the population when a certain minimum number of reads is required as a detection threshold. The colored lines indicate the probabilities for sequencing experiments using a pooled sample [purple dashed, no error correction; red dotted, minor allele frequency (MAF) at least 2; blue dashed-dotted, MAF ≥ 4; green long dashed, MAF ≥ 6]. Solid black line: experiment where k = 10 haploid individuals are sequenced separately, with expected coverage: (A) λ = 5, (B) λ = 10, and (C) λ = 20 per individual. For pooling experiments, the expected total coverage is kλ. Pool sizes are either 50 (left) or 200 (right).
![Log probability of falsely detecting a SNP at a nonsegregating site, in dependence on the logarithm of the sequencing error probability. The colored lines indicate the probabilities for sequencing experiments using a pooled sample [purple dashed, no error correction; red dotted, minor allele frequency (MAF) at least 2; blue dashed-dotted, MAF ≥ 4; green long dashed, MAF ≥ 6]. Solid black line: experiment where k = 10 haploid individuals are sequenced separately and the most frequently read base at a position is chosen for the sequenced individual. Expected coverage: (A) λ = 5, (B) λ = 10, and (C) λ = 20 per individual. For pooling experiments, the expected total coverage is kλ. Since the pool size is not relevant in this context, we plot results for completely dependent (left) and independent (right) sequencing errors instead. See methods for a more detailed description of these scenarios.](https://oup.silverchair-cdn.com/oup/backfile/Content_public/Journal/genetics/186/1/10.1534_genetics.110.114397/5/m_207fig2.jpeg?Expires=1750550323&Signature=OoqWeeFl8wZEGN~t03WN-1mZwjdOcz9iVDvBUKoKLFODpxKW10F7iznpcgvvd72KFcajLD13lZ~MBT9AOPG59VrqPaOedMjc7WHC65sST7tojhsXTS13eHzdTw1uNqKXQr0K1Q2SUmUKHtg~RQfr8UatUq7u-7upz-p61FCWHWLEcHSydEPY4eQAxMM-IgSUzgdtIPfaxCod~SSOBNb8HP0Ig6zfn-nzz9kZ3J5ZyCXEAS1HWDneVIPAjWbY347JT~GHAdYbyWSJz7~pLyXIOt7nbub2r8jnR-zS9~xmKPoJsEVSMqcfCmffIXwzJ-uVX51XXX6YWYvUr3F2bcCV8Q__&Key-Pair-Id=APKAIE5G5CRDK6RD3PGA)
Log probability of falsely detecting a SNP at a nonsegregating site, in dependence on the logarithm of the sequencing error probability. The colored lines indicate the probabilities for sequencing experiments using a pooled sample [purple dashed, no error correction; red dotted, minor allele frequency (MAF) at least 2; blue dashed-dotted, MAF ≥ 4; green long dashed, MAF ≥ 6]. Solid black line: experiment where k = 10 haploid individuals are sequenced separately and the most frequently read base at a position is chosen for the sequenced individual. Expected coverage: (A) λ = 5, (B) λ = 10, and (C) λ = 20 per individual. For pooling experiments, the expected total coverage is kλ. Since the pool size is not relevant in this context, we plot results for completely dependent (left) and independent (right) sequencing errors instead. See methods for a more detailed description of these scenarios.
Allele frequency inference:
In population genetics, the allele frequency spectrum is of central interest. Estimating the allele frequency spectrum of a population is subject to sampling variation. In an individual-based sequencing strategy, most of the sampling variation comes from the selection of individuals used for DNA sequencing. The advantage of the pooling approach is that this sampling error can be dramatically reduced by including a large number of individuals in the pool. On the other hand, a second level of sampling error arises in the pooling approach from the fact that not all chromosomes in the pool are sequenced and some chromosomes may be sequenced more than once. We start by discussing the situation where individuals contribute equal amounts of probe material and refer to the last paragraph of the section for the case when this assumption is violated.
In methods, we obtained expression (10) for the ratio of the variances of the estimated relative allele frequency both for a pooling experiment Rp (pool size n) and for a classical experiment with individual sequencing Rc. For a large enough expected coverage λ and with k individuals sequenced, this equation can be approximated by the following quick rule of thumb: Pooling will lead to a smaller variance for those experimental setups that satisfy
So far we compared the individual-based and pooling strategies only for the same number of sequenced reads. Alternatively, the superiority of the pooling approach could be expressed by the reduction of sequencing costs. Figure 3 compares the pooling approach to sequencing of individuals when both methods provide the same accuracy for allele frequency estimates. Suppose that k individuals are sequenced separately, each at an expected coverage λ. Then k* indicates the cost in single-genome sequencing equivalents that results in the same accuracy as sequencing k genomes individually. If, for instance, k = 20 and k* = 10, then pooling would give the same accuracy with half the sequencing effort, corresponding to an individual sequencing project with 10 instead of 20 individuals. Figure 3 clearly indicates that larger pool sizes increase the advantage of sequencing pools. A higher sequence coverage (λ) for sequencing of individuals further improves the cost effectiveness of pooling.

Sequencing effort k* of a pooling experiment to get allele frequency estimates with the same accuracy as in a standard experimental setup where k individuals are sequenced separately. (“o”, pool size n = 50; “+”, n = 100; “x”, n = 500.)
In genome-wide association studies, the association between allele frequencies and traits (diseases) is investigated. A possible approach is to test whether alleles have different frequencies in two pools that differ with respect to the trait of interest (see Sham et al. 2002). Since the ratio of variances (10) does not depend on the allele frequencies in the subpopulations, the standard deviation entering the test statistic will differ by the square root of (10) between a pooling and a classical experimental setup. If the square root of (10) is
Estimating population genetic parameters:
We now compare the estimation of the scaled mutation parameter using Watterson's θ and Tajima's π under our two experimental setups. For this purpose, we simulated 100 samples each consisting of 500 sequences under neutrality with mutation parameter θ = 10, using the ms software (Hudson 2002). For separate sequencing, we took random subsamples of size k = 10 from each sample, thus simulating separate sequencing of 10 individuals each with an expected number λ of reads. With pooling, we took samples of size n of the 500 simulated sequences. From this pool, reads were taken independently for each locus l by making a random number of draws Ml with replacement. The quantities Ml were chosen according to a Poisson distribution with expected value kλ. Figure 4 illustrates that there can be considerable bias when Watterson's θ and Tajima's are used naively. For Tajima's π, we therefore used the bias correction (14) for individual loci and added the estimates across loci using (15). For Watterson's θ, the bias was corrected using Equation 18 for each locus.
Neglecting sequencing errors for the moment, it turns out that the pooling approach with bias correction leads to more accurate estimates of θ and π, provided that the size of the pool is large enough. For small pools, multiple reads of the same chromosome become more common, which affects the accuracy of the estimates negatively (Figure 4).

Expected value of the estimates obtained from pooled samples depending on the pool size n: Watterson's θ and Tajima's π. True value θ = 10 (green line). There is a considerable bias, if n is small compared to kλ, illustrating the need to use a bias correction with the estimates. Solid black line, λ = 30; red dashed line, λ = 5. (For Tajima's π, the bias does not depend on λ.)
We now investigate the pooling approach when including a protection against sequencing errors by removing all segregating sites where the minor allele that has frequency x satisfies x = 1 or alternatively x ≤ 2. Again, the normalizing constants have been adapted to avoid bias. Let b denote the minimum required minor allele frequency.
Figure 5 shows the relative advantage of pooling conditional on different minimum minor allele frequencies. Pooling still leads to a decreased variance under neutrality as long as the pool size is large enough. Not unexpectedly, the reduction in variance is now somewhat smaller for Watterson's
![Variance ratio (Varpooled/Varstandard) of the bias-corrected version of Watterson's θ and Tajima's π depending on the pool size n. We consider pooling both without [minor allele frequency (MAF) ≥ 1] and with a protection (MAF ≥ 2, MAF ≥ 3) against sequencing errors. (Only segregating sites with MAF above the stated threshold are included.) The horizontal green line denotes the break-even ratio of 1, where both the pooled and the classical experiment leads to estimates with equal variances. Pooling always performs better, as soon as the size of the pool exceeds the number of separately sequenced individuals. Solid black line, λ = 30; red dashed line, λ = 5. Standard setup is shown with k = 10 individuals sequenced separately.](https://oup.silverchair-cdn.com/oup/backfile/Content_public/Journal/genetics/186/1/10.1534_genetics.110.114397/5/m_207fig5.jpeg?Expires=1750550323&Signature=BvImlj0Uo8PEWWeb~wX~E5uboVjadO1-3JBTq5SKkrBY7WLKHLLqqDpc5qc01ywoimlyqwUqk30Gum2vejr6tpfHGkSdU78yNMQ8hAQfYyA8bft-U4YzSOaYxZrrbH4y-eWjOuwG2EAOvL14bVkjOCXw0EVZUuxpYFKk0Q6Dt3L9fpfX6M2UbybrRMvgE4HI1HQShA2wjF35loL8LY88TG4SSq5bf4Zn1aztUfR95IZnDAJNFWR8N7LkyVhf0XfgCkpB6dIrcI8tHcPWhjP4DToKP6RDwxbukgX9ge2-dcmZa0GyqNzOaazj9SKhaJxreXDp6S7L4j3W53dSH6nTpw__&Key-Pair-Id=APKAIE5G5CRDK6RD3PGA)
Variance ratio (Varpooled/Varstandard) of the bias-corrected version of Watterson's θ and Tajima's π depending on the pool size n. We consider pooling both without [minor allele frequency (MAF) ≥ 1] and with a protection (MAF ≥ 2, MAF ≥ 3) against sequencing errors. (Only segregating sites with MAF above the stated threshold are included.) The horizontal green line denotes the break-even ratio of 1, where both the pooled and the classical experiment leads to estimates with equal variances. Pooling always performs better, as soon as the size of the pool exceeds the number of separately sequenced individuals. Solid black line, λ = 30; red dashed line, λ = 5. Standard setup is shown with k = 10 individuals sequenced separately.
Unequal amounts of probe material:
One obvious source of error in the pooling approach is the heterogeneity in DNA amounts due to measurement errors. In experiments that rely on PCR amplification, the heterogeneity can be expected to be particularly strong.
Individuals for which a larger DNA amount has been included in the DNA pool will be overrepresented, which potentially causes a change in allele frequency estimates. This affects the bias and the variance also for our considered population genetic summary statistics.
To investigate the sensitivity of population genetic estimates based on pooling experiments, we simulated a scenario involving unequal amounts of probe material. We set the expected amount of probe material to one and allowed for log-normally distributed multiplicative deviations from this expected value. More specifically, the deviation factors were chosen independently for each individual contributing to the pool according to exp(Xi), where Xi (1 ≤ i ≤ n) are normal N(0, log(scale)) random variables. Thus the median amount of probe material is always equal to 1. If the deviation factor has a value of exp(Xi) = 1.5, this means that the respective individual will have a 50% higher chance of being sequenced than another with a factor of exp(Xi) = 1. Similarly, a value of 0.8 means a 20% decreased chance of being read.
As our first scenario (scale = 2), slightly more than 30% of all individuals differed at least twofold from the median. In other words, for a pool of size n = 100, the most abundant individual contributed ∼16 times the probe material of the least abundant individual. We also simulated a more extreme scenario (scale = 8), where ∼30% of the individuals differed at least eightfold from the median. As further parameters we chose λ = 30, k = 10, n ∈ [5, 200].
As the amount of heterogeneity in the sample will usually be unknown, we applied the same bias correction as for equal amounts of probe material. We measured the deviation from the true θ by the mean squared error, as this accounts for bias and variance.
![Bias (solid lines) and variance (dashed lines) of Watterson's θ and Tajima's π depending on the extent of heterogeneity in probe material. Black lines, moderate heterogeneity (scale = 2); red lines, high heterogeneity (scale = 8). In the top row bias and standard deviations are plotted for the population genetic estimates. The bottom row contains the squared bias and the variance that add up to the mean squared error. [Further parameters, λ = 30, k = 10; log-normal parameters, μ = 0, σ = log(scale).]](https://oup.silverchair-cdn.com/oup/backfile/Content_public/Journal/genetics/186/1/10.1534_genetics.110.114397/5/m_207fig6.jpeg?Expires=1750550323&Signature=lH6dQ7k74Q6EJjADIKzeDRdH4ZrEOlDDEJxptwUcv8HvgVUaLab4JiXsYKqPH9TqaKnfO0aooK-1G9Rxw0twGISMyLNQOIVJh0L0F5QNXyCEKPGv38eQ2t2tjvoseKbqKWZyldeJIWvW3iJwqSshcHUuzsa9oMh5mX3rSGvMA2~QMjIiWiSNebOyFgOYLhpyhCkA4UWcj1fuWPDWjp0WPfguKiAY34o9Lq~ddzUblKsd8f3VS2swmxsHXreYfN5UhXXjMO7XkmkffuZVdSQdBMoXeGEP5C99GI2q19r2LRo7Uuu~GqVGn1Vob9xe2VZQuqJdqcH0jSZjBao9LqLCXA__&Key-Pair-Id=APKAIE5G5CRDK6RD3PGA)
Bias (solid lines) and variance (dashed lines) of Watterson's θ and Tajima's π depending on the extent of heterogeneity in probe material. Black lines, moderate heterogeneity (scale = 2); red lines, high heterogeneity (scale = 8). In the top row bias and standard deviations are plotted for the population genetic estimates. The bottom row contains the squared bias and the variance that add up to the mean squared error. [Further parameters, λ = 30, k = 10; log-normal parameters, μ = 0, σ = log(scale).]
![Mean squared error ratio (MSEpooled/MSEstandard) of Watterson's θ and Tajima's π depending on the pool size n and for λ = 30. Solid black line, the same amount of probe material is available for all individuals; red dashed line, the amount of probe material differs from individual to individual according to log-normal factors. For the top two panels, both curves are nearly identical. The median factor is always 1, and with a scale of 2, ∼32% of all probes deviate by a factor of more than the value given by scale. For a scale value of 2 (for instance), 16% of probes involve more than double the median probe amount, and another 16% contain less than one-half the median amount. [Log-normal parameters: μ = 0, σ = log(scale), scale ∈ {2, 8}.]](https://oup.silverchair-cdn.com/oup/backfile/Content_public/Journal/genetics/186/1/10.1534_genetics.110.114397/5/m_207fig7.jpeg?Expires=1750550323&Signature=ik1O9fH2PouRNAXLyKn0cZWsp56iBpIKtsL6KR6rlDZRnAQC86xnj8F~biVJSs2mBFr6HaKvcHkM79zbQZOL7NJ5va3K-s-9ozoDSFUsRKx5DQg1JqWpUtStq3ocMJot0FvIL-cN5w4lmIC~oihmc8QCCrNwiMW4NIttxSwldLo4vRx-JCDZD5ihShfxpzpDONPIn4rAmxpVbt5wH5Y87Ww5z-MLkdoCWwyqhMKYjrC08RCC~eHFzst~L69NfwQPJywaNiMT8yyuPx2r3ltoVd~-WWxWrx5LASz-BadK0d9jgccBI10jV1lqi2ZLj8y9d~RZmx4pkGwM2ztDmY~PNg__&Key-Pair-Id=APKAIE5G5CRDK6RD3PGA)
Mean squared error ratio (MSEpooled/MSEstandard) of Watterson's θ and Tajima's π depending on the pool size n and for λ = 30. Solid black line, the same amount of probe material is available for all individuals; red dashed line, the amount of probe material differs from individual to individual according to log-normal factors. For the top two panels, both curves are nearly identical. The median factor is always 1, and with a scale of 2, ∼32% of all probes deviate by a factor of more than the value given by scale. For a scale value of 2 (for instance), 16% of probes involve more than double the median probe amount, and another 16% contain less than one-half the median amount. [Log-normal parameters: μ = 0, σ = log(scale), scale ∈ {2, 8}.]
Heterogeneity in probe material also affects the accuracy of the estimated allele frequencies, as the variance of the estimator based on a pooled sample becomes larger. However, this effect can be kept small, by choosing a pool of a large enough size. This is illustrated in Figure 8, where it can be seen that pooling leads for large enough pool sizes eventually to smaller variances even for a high level of heterogeneity in probe material (scale = 8).
![Variance ratios (Varpooled/Varstandard) when estimating allele frequencies in the case where the amount of probe material also differs from individual to individual according to log-normal factors. The median factor is always 1, and with a scale of s, ∼32% of all probes deviate by a factor of more than s. For the scale value s = 2 (for instance) 16% of probes involve more than double the median probe amount, and another 16% contain less than one-half the median amount. Ratios <1 indicate that pooling leads to estimates with a smaller variance. Individual sequencing is carried out for 10 individuals with an expected coverage of λ = 10. Scales: s = 2 (red dashed line), s = 4 (green dotted line), s = 8 (blue dashed-dotted line). [Log-normal parameters: μ = 0, σ = log(scale), scale ∈ {2, 4, 8}.]](https://oup.silverchair-cdn.com/oup/backfile/Content_public/Journal/genetics/186/1/10.1534_genetics.110.114397/5/m_207fig8.jpeg?Expires=1750550323&Signature=jfn24aD3rk05BfwumiOmMi8uoDSaF994jI6epArM7XaO5f4pTQBG0DeawjYI1xrE80jbxCiz2SQmGeAFn084J2e2uoIaQTTCuznoJOj16bCuOxwdzIMWwXtteKJYgY4lSUFr8akZdNZ2JQ8wvFdMCD8iMywkMojrlxXacEbYIhH4NQyTSehjgDCQ4fCqCsJdijZoWEreCRp83CNB~hRGZMG-XonjbvTbRknM~IOkksKzkfZC-J89ZDdje8Om3UM49GAskO9gLdrQ~Tr7TGk2rlzieDK-S7GA~qhtd3xWoBjcG6dibFZrDe0ucFMzWl~jiCpulp68JILOIo21UEkvqA__&Key-Pair-Id=APKAIE5G5CRDK6RD3PGA)
Variance ratios (Varpooled/Varstandard) when estimating allele frequencies in the case where the amount of probe material also differs from individual to individual according to log-normal factors. The median factor is always 1, and with a scale of s, ∼32% of all probes deviate by a factor of more than s. For the scale value s = 2 (for instance) 16% of probes involve more than double the median probe amount, and another 16% contain less than one-half the median amount. Ratios <1 indicate that pooling leads to estimates with a smaller variance. Individual sequencing is carried out for 10 individuals with an expected coverage of λ = 10. Scales: s = 2 (red dashed line), s = 4 (green dotted line), s = 8 (blue dashed-dotted line). [Log-normal parameters: μ = 0, σ = log(scale), scale ∈ {2, 4, 8}.]
DISCUSSION
Over the past decades we have been witnessing a continuous turnover of molecular markers used in genetic research. To a large extent this turnover has been driven by the advances in molecular biology and technology. With the arrival of the second-generation sequencing technologies, this race is about to come to an end—rather than relying on a more or less representative fraction of the genome, it has come into reach to have full genomic sequences available for multiple individuals.
With further technological advances, it is anticipated that it will become possible to sequence individual genomes at a cost that allows even small laboratories to perform population analyses on a genome scale. Currently, this is not possible as the costs are still too high. In this study, we showed that sequencing pools of individuals provides an excellent alternative that permits genome-wide polymorphism surveys at very moderate costs.
This is the first report systematically exploring the parameter range for which DNA pooling provides an advantage compared to individual genome sequencing.
Our result that NGS of DNA pools often provides a reliable and cost-effective means for genome-wide allele frequency estimates is supported by some recent studies using NGS to analyze DNA pools of selected genomic regions. Van Tassell et al. (2008) sequenced a complexity-reduced DNA pool using the Illumina Genome Analyzer. For a subset of the identified SNPs, they compared the allele frequency estimates from the Illumina sequencing to those obtained by genotyping the same individuals. Despite that SNP frequency estimates were undoubtedly affected by a substantial assignment error (Palmieri and Schlötterer 2009) due to the short reads and the complexity-reducing procedure, Van Tassell et al. (2008) observed a correlation of 0.67 between the two methods. Hence, there is very little doubt that NGS is an effective tool to provide accurate genome-wide allele frequency estimates from DNA pools.
We anticipate that the analysis of DNA pools will provide a wide range of applications. In population genetics, it will be possible to compare patterns of differentiation on a genomic scale. Thus, patterns of local adaptation and heterogeneity in gene flow among different genomic regions can be identified. Also, for association mapping DNA pools are very powerful (Sham et al. 2002). In contrast to SNP arrays, however, resequencing of DNA pools will always include the causative SNP and thus provide a higher statistical power. Our study provides the basis for an adequate experimental design of future pooling experiments.
Footnotes
Communicating editor: D. Begun
Acknowledgements
We especially thank C. Kosiol, N. de Maio, and R. Kofler for helpful comments on earlier versions of the manuscript. We are grateful to A. Vasemägi and J. Wolf for general discussions about pooling for NGS, and we also thank the reviewers for helpful comments. This work was supported by a Wiener Wissenschafts-, Forschungs- und Technologiefonds grant to A.F. and C.S. as well as by Fonds zur Förderung der wissenschaftlichen Forschung grants (P 19832-B11 and L403-B11) awarded to C.S.
References
Achaz, G.,
Eberle, M., and L. Kruglyak,
Erlich, Y., K. Chang, A. Gordon, R. Ronen, O. Navon et al.,
Futschik, A., and F. Gach,
Holt, K., Y. Teo, H. Li, S. Nair, G. Dougan et al.,
Hudson, R. R.,
Jiang, R., S. Tavaré and P. Marjoram,
Knudsen, B., and M. M. Miyamoto,
Lynch, M.,
Lynch, M.,
Palmieri, N., and C. Schlötterer,
Sham, P., J. S. Bader, I. Craig, M. O'Donovan and M. Owen,