-
PDF
- Split View
-
Views
-
Cite
Cite
Emanuel M Fonseca, Linh N Tran, Hannah Mendoza, Ryan N Gutenkunst, Modeling Biases from Low-Pass Genome Sequencing to Enable Accurate Population Genetic Inferences, Molecular Biology and Evolution, Volume 42, Issue 1, January 2025, msaf002, https://doi.org/10.1093/molbev/msaf002
- Share Icon Share
Abstract
Low-pass genome sequencing is cost-effective and enables analysis of large cohorts. However, it introduces biases by reducing heterozygous genotypes and low-frequency alleles, impacting subsequent analyses such as model-based demographic history inference. Several approaches exist for inferring an unbiased allele frequency spectrum (AFS) from low-pass data, but they can introduce spurious noise into the AFS. Rather than correcting the AFS, here, we developed an approach that incorporates low-pass biases into the demographic modeling and directly analyzes the AFS from low-pass data. Our probabilistic model captures biases from the Genome Analysis Toolkit multisample calling pipeline, and we implemented it in the population genomic inference software dadi. We evaluated the model using simulated low-pass datasets and found that it alleviated low-pass biases in inferred demographic parameters. We further validated the model by downsampling 1000 Genomes Project data, demonstrating its effectiveness on real data. Our model is widely applicable and substantially improves model-based inferences from low-pass population genomic data.
Introduction
Enabled by reduced sequencing costs, population genetics has experienced a revolution, from focusing on a limited number of loci to now encompassing entire genomes (Maddison et al. 1992; Reid et al. 2016; Marchi et al. 2022). Yet researchers must still trade off (i) the extent of the genome to be sequenced, (ii) the depth of coverage for each sample, and (iii) the number of sequenced samples (Lou et al. 2021; Martin et al. 2021; Duckett et al. 2023). One way to address this trade off is to sequence one reference sample at high coverage depth while sequencing others at lower depth (Lou et al. 2021). Low-pass sequencing, in which the genome is sequenced at a lower depth of coverage, avoids many of the financial, methodological, and computational challenges of high-pass sequencing (Li et al. 2011). Furthermore, limited availability of DNA can also make high depth impractical, especially for ancient samples and museum or herbarium specimens (Mota et al. 2023).
Despite its advantages, low-pass sequencing may lead to an incomplete and biased representation of genetic diversity within a population (e.g. Vieira et al. 2013; Han et al. 2014; Fox et al. 2019). Low-frequency genomic variants may not be detected (Fumagalli 2013), and genotypes may be less accurate (Nielsen et al. 2011). Low-pass sequencing increases the likelihood of miscalling heterozygous loci as homozygous (Duitama et al. 2011; Gorjanc et al. 2015), due to a lack of sufficient reads on homologous chromosomes to distinguish between different alleles at a given locus. These issues can then bias downstream analyses. It is thus important for analysis methods to accommodate low-pass sequencing (see Carstens et al. 2022 for a discussion of related issues).
The allele frequency spectrum (AFS) is a powerful summary of population genomic data (Sawyer and Hartl 1992; Wakeley 2009). Briefly, the AFS is matrix which records the number alleles observed at given frequencies in a sample of individuals from one or more populations. The AFS is often the basis for inferring demographic history (Gutenkunst et al. 2009) or distributions of fitness effects (Kim et al. 2017). In low-pass sequencing, the loss of alleles and the excess of homozygosity can bias the estimation of the AFS (Fumagalli 2013) and thus those inferences.
To address the challenges of low-pass data, several tools have emerged (Bryc et al. 2013; Blischak et al. 2018; Meisner and Albrechtsen 2018) to estimate the AFS from low-pass data. One of the most widely adopted is ANGSD (Korneliussen et al. 2014), which offers a diverse range of analyses tailored for low-pass sequencing data. To infer an AFS, ANGSD uses sample allele frequency likelihoods, which can be computed either directly from raw data or, more frequently, from genotype likelihoods (Nielsen et al. 2012). These likelihoods quantify the probability of observing the complete set of read data for multiple individuals at specific genomic sites, given particular sample allele frequencies (Nielsen et al. 2012; Korneliussen et al. 2014), enabling ANGSD to estimate allele frequencies. However, as the number of samples increases, ANGSD becomes computationally inefficient and numerically unstable (Han et al. 2015). To address this, a score-limited dynamic programming algorithm was introduced, offering significantly greater efficiency by scaling linearly with the number of genomes, unlike ANGSD’s quadratic complexity (Han et al. 2015). Another method, winsfs, uses a stochastic expectation–maximization (EM) algorithm developed to address common problems like overfitting, high-memory usage, and long runtime associated with standard methods for estimating AFS from low-pass sequencing data (Rasmussen et al. 2022). A related tool, the Bayesian genotype caller, calls genotypes from high-throughput sequencing, including low-pass sequencing, using population-level information and sequencing error rates to improve accuracy (Maruki and Lynch 2017).
While tools for inferring an unbiased AFS have demonstrated substantial utility, they also present notable limitations. For instance, many analyses depend on differentiating between variant types, such as synonymous versus nonsynonymous sites. But such differentiation requires calling genotypes, which the developers of ANGSD advise against (Korneliussen et al. 2014). In cases of low or moderate sequencing depth, AFS inference process can introduce significant uncertainty due to limited data, potentially leading to errors or biases in downstream analyses (Lou et al. 2021). Furthermore, we observe here that these tools sometimes fall short of fully correcting for biases inherent in low-pass sequencing.
Rather than attempting to estimate an unbiased AFS from low-pass data, we developed a probabilistic model of low-pass AFS biases. We incorporated it into the population genomic inference software dadi (Gutenkunst et al. 2009), so the biased low-pass AFS can be directly analyzed. Our model is based on the multisample genotype calling pipeline of the Genome Analysis Toolkit (GATK), the most widely used tool for calling variants from read data (McKenna et al. 2010; Van der Auwera and O’Connor 2020). We assessed the accuracy of our model using simulated low-depth data as well as subsampled data from the 1000 Genomes Project (Fairley et al. 2020, https://www.internationalgenome.org/). We found that our model accurately captures low-depth biases in the AFS and enables accurate inference of demographic history from low-pass data.
Model for Low-Pass Biases
Our approach is to incorporate low-pass sequencing biases into the model inference process. From a user perspective, our approach begins when the AFS is computed from an input Variant Call Format (VCF) file containing population genomic data. While parsing the VCF, population-specific distributions for depth of coverage per site per individual are also extracted. (Note that our model does not incorporate quality scores, beyond the filtering done by GATK. The defaults for those filters ensure only high-quality reads are included in our analysis.) These distributions are passed to a function which wraps the demographic model function implemented in dadi. The demographic model function takes in an assumed set of demographic model parameters and returns the corresponding model AFS. The wrapper function takes as fixed inputs , the number of sequenced individuals; , the number of individuals the AFS is subsampled down to; and , the distribution of sequencing depths. (See Table 1 for a summary of key mathematical notation.) The wrapper function acts on the AFS returned by the demographic model function and applies the transformations described below to generate an AFS that is biased by low-pass sequencing. The user uses this wrapped function in place of the original demographic model function during parameter optimization, to maximize the composite likelihood of their (biased) AFS data.
Number of sequenced individuals | |
Number of individuals AFS is subsampled to | |
Distribution of sequencing depths d | |
f | True alternate allele count |
Genotype partition function | |
Numbers of true reference homozygotes, heterozygotes, and alternate homozygotes | |
Probability of observing n alternative reads | |
Probability a heterozygous individual is miscalled |
Number of sequenced individuals | |
Number of individuals AFS is subsampled to | |
Distribution of sequencing depths d | |
f | True alternate allele count |
Genotype partition function | |
Numbers of true reference homozygotes, heterozygotes, and alternate homozygotes | |
Probability of observing n alternative reads | |
Probability a heterozygous individual is miscalled |
Number of sequenced individuals | |
Number of individuals AFS is subsampled to | |
Distribution of sequencing depths d | |
f | True alternate allele count |
Genotype partition function | |
Numbers of true reference homozygotes, heterozygotes, and alternate homozygotes | |
Probability of observing n alternative reads | |
Probability a heterozygous individual is miscalled |
Number of sequenced individuals | |
Number of individuals AFS is subsampled to | |
Distribution of sequencing depths d | |
f | True alternate allele count |
Genotype partition function | |
Numbers of true reference homozygotes, heterozygotes, and alternate homozygotes | |
Probability of observing n alternative reads | |
Probability a heterozygous individual is miscalled |
In the dadi command-line interface, dadi-cli (Huang et al. 2023), low-pass bias correction is seamlessly incorporated into the workflow. When using GenerateFs, users can specify—calc-coverage to create a .coverage.pickle file containing per-sample depth of coverage data. During demographic inference, the InferDM command accepts this file through the—coverage-model flag, allowing the LowPass model to adjust the AFS based on observed coverage, thereby addressing low-pass sequencing biases directly within the optimization process. Integration with dadi and dadi-cli require no additional free parameters, enabling an efficient yet thorough adjustment for low-pass bias correction developed in this study.
When biases arises from low-pass sequencing, the AFS may be affected by both the loss of low-frequency variants and the misidentification of heterozygous individuals as homozygous. These two effects result in a deficit of variant sites and misleading shifts in allele frequencies, respectively. Moreover, the data must often be subsampled to generate an AFS for analysis, because not all individuals will be called at all sites. We account for these distortions by sequentially modeling the probabilities of a variable site being called, of that site having enough called individuals for subsampling, and of having its allele frequency misestimated.
The specific choices in our model are motivated by the default GATK multisample calling algorithm, in which information from all samples is used to identify whether a site is variant. In particular, we assume that a site will only be called as variant if at least two alternate allele reads are observed. Once a site is identified as variant, an individual will be called as missing if zero reads are observed, homozygous if all reads correspond to a single allele, and heterozygous if at least one reference and one alternate read are observed. For simplicity, we first describe the case of sequencing individuals from a single population.
First, we calculate the probability that a true variant site is called as variant. Consider a site in which the true alternate allele count within our sample of individuals is f. Those f alternate alleles can be distributed among the sampled alleles in many ways. To quantify those ways, we define the partition function , which is an array of integer partitions with n entries that sum to the allele frequency f such that all entries in the partition are 0, 1, or 2 (corresponding to the possible genotype values). For example, the partitions defined by are and . Each possible partition within can occur in ways, where , , and denote the number of partition entries equal to 0, 1, or 2. (The factor of accounts for the two possible haplotypes the alternate allele could lie on in each heterozygote.) The corresponding probability of each partition within is then the number of ways it can occur divided by the total over all partitions within . The relation between the sample allele frequency and potential genotypes has previously been explored to develop exact tests of Hardy–Weinberg equilibrium (Wigginton et al. 2005).
Let denote the distribution of read depth d within the population sample, which we assume to be shared among all individuals. For an individual homozygous for the alternate allele, the probability of observing a alternate reads is simply . For a heterozygous individual, the probability of zero alternate reads is
Here, we sum over the distribution of depths, and at each depth each read has a chance of containing the reference allele, so the probability of all reads being reference is . Similarly, the probability of exactly one alternate read is
Note that for depth d, there are d possible configurations with one alternate read and reference reads.
For a given partition within that has true genotype counts , , and , there are multiple ways of failing to identify the variant site. The probability of zero reads supporting the alternate allele is
The probability of exactly one read supporting the alternate allele is
Here, the two terms account for the probability that the alternate read occurs in one of the heterozygotes or homozygotes, respectively. The overall probability of not calling a variant site for a given partition is thus . And the overall probability of not calling a variant site with a given true allele frequency f is the sum of these probabilities over partitions , weighted by the partition probabilities. For any given coverage distribution, the probability of calling a variant site increases rapidly with allele frequency f (supplementary fig. S1, Supplementary Material online).
Second, we calculate the probability that a site has enough called individuals to enter the AFS. When analyzing low-pass data, generating an AFS for the full sample size may result in the loss of many sites where not all individuals were called. Consequently, it is common to subsample the data to some lower sample size ; only sites with calls for at least individuals can then be analyzed. To calculate the probability that a site can be subsampled, we want to calculate the probability that at least individuals have at least one read, conditional on the site having been called as variant. This conditional probability is complex, so we make a simple approximation. If a site has been called as variant, then at least one individual was called, so we sum the probability that an additional c individuals are called out of the remaining , where c is at least :
The projection of the model AFS from sample size down to sample size is already implemented in dadi, based on sampling without replacement as initially described by Marth et al. (2004). Equation (5) is an additional overall factor by which the AFS is reduced, due to sites that cannot be projected downward because they do not have at least called individuals. From this point onward, we consider partitions over the subsampled individuals, where is the true alternate allele count in the subsample.
Lastly, we calculate the distortion in estimated allele frequencies. Once a site as called as variant, low-pass sequencing can bias the estimation of the allele frequency at that site, if one or more heterozygotes are miscalled because all their reads are reference or alternate. For each heterozygous individual, this occurs with total probability
where , is the distribution of depths conditioned on having at least one read. For a partition with true heterozygotes, the number of miscalled heterozygotes is binomially distributed with mean . Each miscalled heterozygote has equal chance of being called as homozygous reference or alternate, so the number of miscalls to homozygous reference is binomially distributed with mean , and the number of miscalls to homozygous alternate is . The net change in estimated alternative allele frequency is then .
The biases caused by low-pass sequencing do not depend on the underlying AFS; for each true allele frequency a given fraction will always, on average, be miscalled as any given other allele frequency. The correction above can be thus be calculated once for a given data set then applied to all model AFS generated, for example, during demographic parameter optimization. For efficiency, we calculate and cache an by transition matrix that can be multiplied by any given model AFS for individuals to apply the low coverage correction. When analyzing multiple populations, we calculate and apply transitions matrices for each population, because variant calling is independent among populations once a variant has been identified. Variant identification is, however, not independent among populations, which we address using simulated calling described next.
When calculating the probability of miscalling a heterozygote (Equation (6)), the correct distribution of depth is not simply ; rather it is the distribution conditional on the site being identified as variant. The lower the true sample alternate allele count, the more these distributions will differ. The conditional distribution is complex to calculate, particularly when multiple populations are involved. Instead, for true allele frequencies for which the probability of not identifying is above a user-defined threshold (by default ), we simulate the calling process rather than using our analytic results. For multiple populations, we calculate this threshold assuming that a variant must be identified independently in all populations, which gives a lower bound on the true probability of not identifying. To simulate calling, for a given true allele frequency (or combination in the multipopulation case) we simulate reads (default 1,000) using the coverage distribution and simulate variant identification and genotype calling for each potential partition of genotypes across the populations, proportional to its probability. For each combination of input true allele frequencies simulated, we estimate and store probability of each potential output allele frequency. These distortions are then applied in place of the transition matrices from the analytic model.
For inbred populations, there is an excess of homozygotes compared to the Hardy–Weinberg expectation, which reduces biases associated with low-pass sequencing. In this case, we follow Blischak et al. (2020) and within each genotype partition calculate the probability of reference homozygotes, heterozygotes, and alternate homozygotes using results from Balding and Nichols (1995, 1997), given the inbreeding coefficient F. The partition probability is then multinomial given these probabilities. In these calculations, we approximate the population allele frequency by the true sample allele frequency. Because calculation of the low-pass correction is expensive compared to typical normal model AFS calculation, we precalculate and cache transition matrices and calling simulations. But inbreeding is often an inferred model parameter, to be optimized during analysis. In this case, users can specify an assumed inbreeding parameter for the low-pass model, optimize the inbreeding parameter in their demographic model, update the inbreeding coefficient assumed in the low-pass model, and iterate until convergence.
Results
Low-Pass Sequencing Biases the AFS
We used simulated data to assess the biases introduced by low-pass sequencing with GATK multisample calling, along with our model of those biases. For a simulated population undergoing growth (supplementary fig. S2a, Supplementary Material online), low-pass sequencing reduces the number of observed low-frequency alleles (Fig. 1). Our model accurately captures these biases (Fig. 1). In contrast with our model, ANGSD attempts to reconstruct the true AFS from low-pass data. In our simulations, ANGSD reconstructed the mean shape of the AFS well, but it introduced dramatic fluctuations into the reconstructed AFS at low depth (Fig. 2). winsfs performed better than ANGSD, producing smoother AFS estimates and reducing errors at low coverage (Fig. 2).

The low-pass AFS is biased, which our model captures. Simulated sequence data from an exponential growth demographic model for 20 individuals were called by GATK and subsampled to 16 individuals (to accommodate missing data at low depth). The GATK-called AFS (green) is biased compared to the true AFS (orange), and our dadi model for low-pass sequencing (purple) fits those biases well. Coverage was a) , b) , c) , and d) .

ANGSD and winsfs correct for low-pass bias of the AFS, but ANGSD introduces fluctuations. For the same simulations as Fig. 1, ANGSD (blue) and winsfs (green) were used to reconstruct the true AFS (red). Coverage was a) , b) , c) , and d) .
When a pair of populations undergoing a split and isolation (supplementary fig. S2b, Supplementary Material online) is analyzed through a joint AFS, similar low coverage biases occur (supplementary fig. S3, Supplementary Material online). Again, our model corrects those biases well (supplementary fig. S3, Supplementary Material online). Similar to the single-population case, ANGSD also introduces large fluctuations in the joint AFS (supplementary fig. S4, Supplementary Material online). While winsfs enhances AFS estimation at low depth of coverage, it tends to introduce more noise as the depth of coverage increases (supplementary fig. S4, Supplementary Material online).
Low-pass biases are expected to be smaller in inbred populations, due to the reduction of heterozygosity. In a simulated population recovering from a bottleneck with inbreeding (supplementary fig. S2c, Supplementary Material online), biases are still observed, which our model corrects (supplementary fig. S5, Supplementary Material online). Again, ANGSD introduced large fluctuations in low-pass AFS, beyond those expected from inbreeding (supplementary fig. S6, Supplementary Material online).
Demographic History Inference from Low-Pass AFS
To assess effects on inference, we first fit demographic models to single-population data simulated under the same growth model as our prior simulations (supplementary fig. S2a, Supplementary Material online). When not modeling low-pass biases, the final population size was underestimated (Fig. 3a), consistent with a deficit of low-frequency alleles. The timing of growth onset was also inaccurately inferred, underestimated at depth and overestimated at depth (Fig. 3b). When the same data were fit with our low-pass model, both model parameters were accurately recovered (Fig. 3a, b) even at the lowest depth. Fits to the AFS reconstructed by ANGSD and winsfs also yielded accurate model parameters (Fig. 3a, b).

Our low-pass model, ANGSD and winsfs enable accurate demographic parameter inference. a, b) From data simulated under a single-population growth model, the final population size and time of growth onset (T) were accurately inferred using our low-pass model and a GATK-called AFS or using normal dadi and an ANGSD and winsfs-called AFS. But they were biased if low depth was not accounted for when fitting a GATK-called AFS. (Dashed horizontal lines are simulated true values.) c) The likelihoods using the GATK AFS were similar whether or not low-pass biases were modeled. d) The fluctuations introduced into the AFS by ANGSD and winsfs caused low likelihoods at low depth of coverage for ANGSD. e–h) For a two-population split with isolation model, similar results were found, although inferences from our low-pass model were slightly biased at coverage.
The logarithm of the likelihood is commonly used to assess the quality of model fit. ANGSD and winsfs reconstruct the AFS for the full sequenced sample size, while we subsample in our approach to deal with missing genotypes, so the likelihoods are not directly comparable. The likelihoods of models fit to the subsampled GATK data were similar whether or not low-pass biases were modeled (Fig. 3c), suggesting that the likelihood itself cannot be used to detect unmodeled low-pass bias. When fitting AFS estimated by ANGSD, likelihoods were much lower at low coverage than high coverage (Fig. 3d), likely driven by the fluctuations ANGSD introduced into the estimated AFS (Fig. 2). Conversely, likelihood estimates were more stable with winsfs, as the variance in likelihoods decreased with increasing depth of coverage (Fig. 3d).
For two-population data simulated under an isolation model (supplementary fig. S2b, Supplementary Material online), similar results were found. Fitting the observed low-pass AFS with our model enabled accurate parameter inference (although there was some bias at coverage) as did fitting the AFS estimated by ANGSD and winsfs (Fig. 3e, f). As in the single-population case, likelihoods were substantially lower when fitting the ANGSD and winsfs-estimated AFS, consistent with introduced fluctuations in the AFS (Fig. 3g, h). For both models, at coverage, ANGSD and winsfs showed slightly better results than dadi-low-pass in parameter inference (Fig. 3a, e).
For one-population data simulated under a growth model with inbreeding (supplementary fig. S2c, Supplementary Material online), failing to correct for low-pass biases at low inbreeding ( or ) led to similar biases as with no inbreeding, which our low-pass model corrected (supplementary fig. S7, Supplementary Material online). For high inbreeding (), the impact of low-pass sequencing on accuracy was smaller, because inbreeding reduces heterozygosity (supplementary fig. S7, Supplementary Material online).
When applying our low-pass bias correction, the user must specify a value for inbreeding, while they may separately estimate it during demographic parameter optimization. We tested the impact of misspecifying inbreeding in the low-bias correction using data simulated with moderate inbreeding of . Large inbreeding values were inferred if inbreeding was initially underestimated in the low-coverage model, and small values were inferred if inbreeding was initially overestimated (supplementary fig. S8c, Supplementary Material online). A substantial difference between the inbreeding coefficient used for correction and the inferred value thus suggests that the assumed inbreeding coefficient was not optimal. Users can thus iterate and update the value assumed in the low-pass correction to converge to a best inference of inbreeding.
Analysis of Human Data
To empirically validate our approach and compare with ANGSD and winsfs, we used chromosome 20 sequencing data from the 1000 Genomes Project, focusing on two sets of samples: Yoruba from Ibadan, Nigeria (YRI) and Utah residents of Northern and Western European ancestry (CEU). We inferred a single-population two-epoch demographic model (supplementary figure S9a, Supplementary Material online) from the YRI samples, and a two-population isolation-with-migration model (supplementary fig. S9b, Supplementary Material online) from the combined YRI and CEU samples. To mimic low-pass sequencing, we subsampled the original high-depth data (which averaged per site per individual) to create data with low to medium depth.
As with simulated data, the observed AFS from low-pass subsampled data was biased compared to high-pass data (Fig. 4a). Using the GATK pipeline, low-pass data yielded few low- and high-frequency derived alleles. In contrast to the simulated data, on these real data ANGSD and winsfs failed to recover the correct number of low-frequency alleles at and depth, while still introducing large fluctuations at intermediate frequencies (Fig. 4b, c).

Allele frequency spectra from 20 YRI samples versus subsampled sequencing depth. a) Spectra generated using the GATK pipeline and subsampled to 32 haplotypes to accommodate missing genotypes. b) Spectra generated using ANGSD genotype likelihood optimization with BAM files input. c) Spectra generated using winsfs genotype likelihood optimization.
If low-pass biases were corrected for, we expected the inferred demographic parameters from subsampled low-pass data to match those from the original high-pass data. For the two-epoch model fit to YRI data, we found that with a GATK-called AFS and no low-pass model (Table 2), the inferred population sizes were biased downward and the times were inaccurate, similar to the growth model fit to simulated data. With the low-pass model, inferred values for low depth were similar to those for high depth, with some deviation at (Table 2). Results from fitting ANGSD and winsfs-estimated spectra were similar to not modeling low depth, suggesting that ANGSD and winsfs are ineffective for these data (Table 2). In the simulated datasets, only ANGSD showed notably low likelihoods at low depth of coverage, whereas in the real data, both ANGSD and winsfs exhibited poor performance.
. | . | . | Depth . | |||
---|---|---|---|---|---|---|
Parameter . | AFS . | Model . | . | . | . | . |
GATK | dadi | 1.82 | 1.76 | 1.55 | 0.05 | |
GATK | Low-pass | 1.83 | 1.79 | 1.74 | 1.57 | |
ANGSD | dadi | 1.82 | 1.83 | 1.71 | 0.08 | |
winsfs | dadi | 1.80 | 1.84 | 1.72 | 0.08 | |
T | GATK | dadi | 0.43 | 0.51 | 0.91 | 0.001 |
GATK | Low-pass | 0.43 | 0.47 | 0.47 | 0.44 | |
ANGSD | dadi | 0.60 | 0.73 | 1.02 | 0.001 | |
winsfs | dadi | 0.55 | 0.74 | 1.03 | 0.001 | |
θ | GATK | dadi | 5.14 | 5.07 | 4.77 | 5.99 |
GATK | Low-pass | 5.15 | 5.11 | 5.12 | 5.16 | |
ANGSD | dadi | 5.48 | 5.24 | 4.93 | 6.80 | |
winsfs | dadi | 5.52 | 5.21 | 4.90 | 6.81 | |
Log-likelihood | GATK | dadi | 289 | 258 | 557 | 1,301 |
GATK | Low-pass | 294 | 289 | 327 | 347 | |
ANGSD | dadi | 467 | 495 | 1,188 | 5,674 | |
winsfs | dadi | 497 | 584 | 1,207 | 3,443 |
. | . | . | Depth . | |||
---|---|---|---|---|---|---|
Parameter . | AFS . | Model . | . | . | . | . |
GATK | dadi | 1.82 | 1.76 | 1.55 | 0.05 | |
GATK | Low-pass | 1.83 | 1.79 | 1.74 | 1.57 | |
ANGSD | dadi | 1.82 | 1.83 | 1.71 | 0.08 | |
winsfs | dadi | 1.80 | 1.84 | 1.72 | 0.08 | |
T | GATK | dadi | 0.43 | 0.51 | 0.91 | 0.001 |
GATK | Low-pass | 0.43 | 0.47 | 0.47 | 0.44 | |
ANGSD | dadi | 0.60 | 0.73 | 1.02 | 0.001 | |
winsfs | dadi | 0.55 | 0.74 | 1.03 | 0.001 | |
θ | GATK | dadi | 5.14 | 5.07 | 4.77 | 5.99 |
GATK | Low-pass | 5.15 | 5.11 | 5.12 | 5.16 | |
ANGSD | dadi | 5.48 | 5.24 | 4.93 | 6.80 | |
winsfs | dadi | 5.52 | 5.21 | 4.90 | 6.81 | |
Log-likelihood | GATK | dadi | 289 | 258 | 557 | 1,301 |
GATK | Low-pass | 294 | 289 | 327 | 347 | |
ANGSD | dadi | 467 | 495 | 1,188 | 5,674 | |
winsfs | dadi | 497 | 584 | 1,207 | 3,443 |
Inferred demographic parameters in dadi using empirical GATK and ANGSD AFS. We analyzed GATK empirical spectra without (dadi) and with low-pass correction (low-pass).
. | . | . | Depth . | |||
---|---|---|---|---|---|---|
Parameter . | AFS . | Model . | . | . | . | . |
GATK | dadi | 1.82 | 1.76 | 1.55 | 0.05 | |
GATK | Low-pass | 1.83 | 1.79 | 1.74 | 1.57 | |
ANGSD | dadi | 1.82 | 1.83 | 1.71 | 0.08 | |
winsfs | dadi | 1.80 | 1.84 | 1.72 | 0.08 | |
T | GATK | dadi | 0.43 | 0.51 | 0.91 | 0.001 |
GATK | Low-pass | 0.43 | 0.47 | 0.47 | 0.44 | |
ANGSD | dadi | 0.60 | 0.73 | 1.02 | 0.001 | |
winsfs | dadi | 0.55 | 0.74 | 1.03 | 0.001 | |
θ | GATK | dadi | 5.14 | 5.07 | 4.77 | 5.99 |
GATK | Low-pass | 5.15 | 5.11 | 5.12 | 5.16 | |
ANGSD | dadi | 5.48 | 5.24 | 4.93 | 6.80 | |
winsfs | dadi | 5.52 | 5.21 | 4.90 | 6.81 | |
Log-likelihood | GATK | dadi | 289 | 258 | 557 | 1,301 |
GATK | Low-pass | 294 | 289 | 327 | 347 | |
ANGSD | dadi | 467 | 495 | 1,188 | 5,674 | |
winsfs | dadi | 497 | 584 | 1,207 | 3,443 |
. | . | . | Depth . | |||
---|---|---|---|---|---|---|
Parameter . | AFS . | Model . | . | . | . | . |
GATK | dadi | 1.82 | 1.76 | 1.55 | 0.05 | |
GATK | Low-pass | 1.83 | 1.79 | 1.74 | 1.57 | |
ANGSD | dadi | 1.82 | 1.83 | 1.71 | 0.08 | |
winsfs | dadi | 1.80 | 1.84 | 1.72 | 0.08 | |
T | GATK | dadi | 0.43 | 0.51 | 0.91 | 0.001 |
GATK | Low-pass | 0.43 | 0.47 | 0.47 | 0.44 | |
ANGSD | dadi | 0.60 | 0.73 | 1.02 | 0.001 | |
winsfs | dadi | 0.55 | 0.74 | 1.03 | 0.001 | |
θ | GATK | dadi | 5.14 | 5.07 | 4.77 | 5.99 |
GATK | Low-pass | 5.15 | 5.11 | 5.12 | 5.16 | |
ANGSD | dadi | 5.48 | 5.24 | 4.93 | 6.80 | |
winsfs | dadi | 5.52 | 5.21 | 4.90 | 6.81 | |
Log-likelihood | GATK | dadi | 289 | 258 | 557 | 1,301 |
GATK | Low-pass | 294 | 289 | 327 | 347 | |
ANGSD | dadi | 467 | 495 | 1,188 | 5,674 | |
winsfs | dadi | 497 | 584 | 1,207 | 3,443 |
Inferred demographic parameters in dadi using empirical GATK and ANGSD AFS. We analyzed GATK empirical spectra without (dadi) and with low-pass correction (low-pass).
For the isolation-with-migration model fit to YRI and CEU data, the results were broadly similar (supplementary table S1, Supplementary Material online). For population sizes and the divergence time, inferences were more stable from GATK genotyping and our low-pass model than from ANGSD and winsfs-estimated AFS. By contrast, the inferred migration rate was similar across analyses. For both the two-epoch and isolation-with-migration models, the selection of GATK parameters had negligible influence on the results, as evidenced by the consistent similarity in the outcomes when using minMapQ = 30 (Table 2 and supplementary table S1, Supplementary Material online) and minMapQ = 1 (supplementary tables S2 and S3, Supplementary Material online).
Discussion
We assessed the biases introduced by low-pass sequencing using GATK multisample genotype calling and developed a model to mitigate these biases. In contrast to existing approaches (e.g. Korneliussen et al. 2014; Han et al. 2015; Mas-Sandoval et al. 2022), which attempt to estimate an unbiased AFS to which standard model-based analyses can be applied, we directly analyzed the low-pass AFS by accounting for low-pass biases in our modeling framework. In a simulated population undergoing growth, we found that low-pass sequencing reduced the presence of low-frequency alleles (Fig. 1), consistent with findings from previous studies (Han et al. 2014). Our model accounted for these biases, contrasting with ANGSD, which created fluctuations in the AFS at low depth (Fig. 2). In scenarios involving two populations, we observed similar biases, which our model effectively corrected, whereas ANGSD introduced additional noise (supplementary figs S3 and S4, Supplementary Material online). For demographic inference, using our model enabled accurate parameter estimates even at low-pass depths, while neglecting low-depth biases resulted in substantial inaccuracies (Fig. 3). ANGSD also yielded accurate estimates, but worse likelihoods. Empirical testing using human data from the 1000 Genomes Project showcased the accuracy of our correction method in improving demographic inference from low-pass data, outperforming both uncorrected analysis and ANGSD results (Fig. 4 and Table 2 and supplementary table S1, Supplementary Material online).
While ANGSD is recognized for its effectiveness in managing low-pass sequencing, our results showed its difficulties in modeling medium-frequency alleles. This is reflected in lower likelihood scores, particularly when comparing low-pass datasets to high-pass ones (Fig. 3). Despite their utility in incorporating uncertainty related to low-pass sequencing (Nielsen et al. 2011; Fumagalli 2013; Korneliussen et al. 2014), genotype likelihoods might not always accurately capture the entire range of allele frequencies. Despite the AFS fluctuations, ANGSD yielded reliable parameter estimates for simulated data. But ANGSD was unable to accurately estimate the demographic parameters of real datasets, as demonstrated in the analysis of the 1000 Genomes Project data (Table 2 and supplementary table S1, Supplementary Material online). This underscores the need for rigorous and critical assessments of results by evaluating the likelihood of the model and conducting uncertainty analysis. Similarly, winsfs showed the same pattern as ANGSD, performing well in simulations and with less noise in AFS estimation (Figs. 2 and 3), but producing inaccurate results in the analysis of 1000 Genomes data (Table 2 and supplementary table S1, Supplementary Material online).
The observed discrepancy between simulated and empirical data performance in ANGSD and winsfs likely stems from multiple underlying factors. While simulated data offers well-controlled and known parameters, which facilitate accurate model fitting, real-world datasets such as the 1kGP introduce complexities like sequencing biases or population structure. These elements may not be fully captured by ANGSD’s model assumptions, contributing to its underperformance with empirical data. Specifically, ANGSD’s reliance on genotype likelihoods, which are sensitive to the accurate modeling of sequencing errors, can pose challenges when real data exhibit issues such as base quality miscalibration or variable sequencing depth. A more detailed exploration of ANGSD’s assumptions, particularly regarding error profiles and filtering strategies, is required to fully understand these discrepancies and improve performance with empirical datasets.
Variant discovery using GATK involves two main approaches: multisample (classic joint-calling) and single-sample calling (Nielsen et al. 2011). We modeled multisample calling, which has higher statistical power compared to single-sample calling (Nielsen et al. 2011; Poplin et al. 2018). But multisample calling can become computationally burdensome with larger sample sizes, leading to the development of incremental single-calling as a scalable alternative (McKenna et al. 2010; Van der Auwera and O’Connor 2020). When our model was applied to incremental single-calling AFS from subsampled 1000 Genomes Project data, parameter inference was poor (supplementary table S4, Supplementary Material online). Therefore, our model should only be used with multisample calling, and a slightly different model may need to be developed for incremental single-calling.
We present a GATK multisample calling model designed to compensate for AFS biases introduced by low-pass sequencing. Although tailored for GATK, our model’s design allows for its extension to different pipelines with modifications to address the unique aspects of each calling algorithm. For example, our model currently assumes that a site is called when at least two reads supporting the alternative allele are found (Equations (3) and (4)), but this could be modified for other pipelines with different calling criteria. Our approach can thus be generalized to other calling pipelines, including those using short reads, long reads, and hybrid approaches (e.g. Bankevich et al. 2012; Poplin et al. 2018). Note that our mathematical model assumes a shared read depth distribution among all individuals, and some studies may vary depth among individuals. Simulations suggest, however, that our model remain accurate with uneven depths (supplementary fig. S10, Supplementary Material online).
Our approach can also be integrated into other AFS-based inference tools such as moments (Portik et al. 2017; Leaché et al. 2019), fastsimcoal2 (Excoffier et al. 2013, 2021), GADMA (Noskova et al. 2020), and delimitR (Smith and Carstens 2020), because our approach modifies the model AFS, independent of how it is computed. Our approach may also be useful in approximate Bayesian computation (Beaumont 2010; Csilléry et al. 2012) and machine learning workflows (Pudlo et al. 2016; Smith and Carstens 2020), facilitating simulation of low-pass datasets. Note, however, that we model bias in the mean shape of the AFS under low-pass sequencing, not its full variance (supplementary fig. S11, Supplementary Material online). Furthermore, AFS-based analyses are used not only for demographic studies but also to examine natural selection, including inferring the distribution of fitness effects of new mutations (Eyre-Walker and Keightley 2007; Huang et al. 2021). Our approach can thus facilitate population genomics research across tools, approaches, and problem domains.
In conclusion, we have developed a robust correction for low-pass sequencing biases, significantly enhancing the accuracy of demographic parameter estimation at various coverage depths. As the genetic research community continues to address challenges associated with low-pass data (Bryc et al. 2013; Korneliussen et al. 2014; Blischak et al. 2018; Meisner and Albrechtsen 2018), especially when constrained by economics or sample availability, our methodology provides enables more reliable genetic analysis.
Material and Methods
Simulating AFS Under Low-Pass Sequencing
We used msprime (Kelleher et al. 2016; Baumdicker et al. 2022) to generate SNP datasets via coalescent simulations. We simulated two demographic models. The demographic models were visualized using demesdraw (Gower et al. 2022). The first model, single-population exponential growth (supplementary fig. S2a, Supplementary Material online), involved two parameters: the relative population size and time of past growth (in units of two times the effect population size generations). The second model, two-population isolation (supplementary fig. S2b, Supplementary Material online), involved three parameters: equal relative sizes of populations 1 and 2, , and divergence time in the past T = 0.1. For each model, we conducted 25 independent simulations. For the exponential growth model, we sampled 20 diploid individuals, whereas for the isolation model, we sampled 10 individuals per population. Both demographic scenarios used an ancestral effective population size of 10,000, a sequence length of bp, a mutation rate of per site per generation, and recombination rate of per site per generation.
For simulations incorporating inbreeding, we used SLiM 4 (Messer 2013; Haller and Messer 2023). Datasets were generated under a bottleneck and growth model (supplementary figure S2c, Supplementary Material online), with a population bottleneck of , followed by a population expansion to . The time of the past bottleneck was set at , and the level of inbreeding was varied with . Inbreeding was introduced using the selfing rate, set to . Twenty-five independent simulations were conducted, with 20 individuals sampled for each replicate. Simulation parameters were , bp, , and , with a burn-in of 10,000 generations.
To create low-pass datasets, we used synthetic diploid genomes. For each simulation replicate, we generated a random reference genome spanning 10 Mb with a GC content of 40%, resembling the human genome. Mutations were incorporated by altering single nucleotides at the positions observed in the SNP matrix generated during each simulation, assuming that all sites were biallelic. Diploid individual genomes were generated by randomly selecting two chromosomes from the population pool.
Using the synthetic individual genomes as templates, we simulated 126 bp paired-end short reads for each individual with InSilicoSeq v2.0.1. (Gourlé et al. 2019). We calculated the number of reads per scenario as , where L is the genome length, C the coverage depth, and R the read length. Reads for each diploid chromosome were simulated with equal probability. Depth of coverage per individual was sampled from a normal distribution with means of 3, 5, 10, and 30 and corresponding standard deviations of 0.3, 0.5, 1, and 3 to explore coverage variability, which increased with coverage levels. These standard deviations were selected based on preliminary simulations that suggested they offer a realistic variance for each coverage level.
For each individual, we aligned simulated reads to the reference genome using BWA v0.7.17 (Li et al. 2009). We then processed the aligned reads with SAMTools v1.10 (Li 2013) to perform sorting, indexing, and pileup generation. To generate GATK spectra, we used the GATK multisample approach via HaplotypeCaller v4.2 (McKenna et al. 2010; Van der Auwera and O’Connor 2020). To minimize false positives, the identified variants underwent filtering based on GATK’s Best Practices guidelines, with thresholds tailored to expected error rates and variant quality. These thresholds included depth-normalized variant confidence (), mapping quality (), strand bias estimate (), and strand bias (). The filtered SNP VCF files were subsequently used in demographic inference analyses to estimate population parameters based on the AFS of these variants. To generate ANGSD spectra, we used the BAM files containing information about each individual with reads aligned to the reference genome. Subsequently, realAFS was used to estimate a maximum-likelihood AFS through the EM algorithm. The ANGSD v0.94 analysis was conducted with the following parameters: doSaf = 1 enabled the calculation of site allele frequency likelihoods, minMapQ = 30 set a minimum mapping quality score of 30 to filter low-quality alignments, minQ = 20 applied a minimum base quality score of 20 to exclude low-quality bases, and GL = 2 specified the genotype likelihood model (using model 2, which is the GATK-like model).
Empirical Subsampling of Human Data
We used high-quality whole-genome sequencing data () from the 1000 Genomes Project (1kGP), sourced from The International Genome Sample Resource data portal (https://www.internationalgenome.org/ Fairley et al. 2020). The data comprised CRAM files aligned to the GRCh38 human reference genome. We focused on two sets of samples for our analysis: 10 randomly selected individuals from the Yoruba from Ibadan, Nigeria (YRI) samples and 10 from the Utah residents with Northern and Western European ancestry (CEU) samples. The specific individuals included for the YRI were NA18486, NA18499, NA18510, NA18853, NA18858, NA18867, NA18878, NA18909, NA18917, NA18924, and for the CEU NA07037, NA11829, NA11892, NA11918, NA11932, NA11994, NA12004, NA12144, NA12249, and NA12273. Additionally, for a single-population demographic model, 20 YRI individuals were analyzed, which includes the initial 10 plus an additional 10 samples: NA19092, NA19116, NA19117, NA19121, NA19138, NA19159, NA19171, NA19184, NA19204, and NA19223.
Initially, we converted the CRAM files to BAM format and indexed them using Picard tools (https://broadinstitute.github.io/picard/). We then isolated reads from chromosome 20 at the original 30× coverage, which we subsequently subsampled to 10×, 5×, and 3× coverage using samtools v.1.10 (Li 2013) to emulate varying sequencing depths. Next, using GATK version 4.2.5 HaplotypeCaller (McKenna et al. 2010; Van der Auwera and O’Connor 2020), we called SNPs and indels from these varying coverage depths for each population. We employed multisample SNP calling, merging BAM files with identical coverage prior to processing with HaplotypeCaller. This approach yielded a raw output VCF file.
We also carried out a single-sample calling procedure. For this, individual BAM files were used directly as inputs for the GATK HaplotypeCaller with the -ERC GVCF flag to enable GVCF mode. Following this, we used GATK GenomicsDBImport to compile the individual variant calls into a cohesive data structure. This setup allowed us to conduct joint genotyping using GATK GenotypeGVCFs, ultimately producing a multisample VCF.
Following SNP calling, we employed GATK SelectVariants to filter out indels for both approaches, retaining only SNPs. Quality filtering of SNPs was conducted using GATK VariantFiltration, applying criteria such as depth-normalized variant confidence (), mapping quality (), strand bias estimate (), and overall strand bias (). After quality filtering, the VCF files were annotated with ancestral allele information using the vcftools fill-aa module, based on data from the Ensembl Release 110 Database (Danecek et al. 2011).
Finally, we used ANGSD to generate an AFS by using BAM files as input. The sample allele frequencies were first estimated using ANGSD’s -doSaf flag, using GATK genotype likelihoods. These likelihoods were then used to calculate the AFS via the EM algorithm using ANGSD’s realAFS program. In this way, we maintained the original sample sizes from the BAM files, resulting in AFS for 40 chromosomes in the single-population analysis and 20 chromosomes per population in the two-population analysis. ANGSD v0.94 analysis was executed with the following settings: doSaf = 1, minMapQ = 30, minQ = 20, and GL = 2. We used winsfs v0.7 to generate AFS based on site allele frequency likelihoods calculated with ANGSD, using its default parameters. We also generate AFS using winsfs, based on site allele frequency likelihoods computed with ANGSD.
Demographic Inference Using dadi
We used dadi (Gutenkunst et al. 2009) to fit demographic models to simulated and empirical datasets. For the GATK spectra, we used the VCF files as input and subsampled individuals to accommodate missing data. For the ANGSD spectra, we used them as input directly. Within dadi, we used three demographic models for the simulated datasets: (i) an exponential growth model: dadi.Demographics1D.growth; (ii) a divergence model with migration fixed to zero: dadi.Demographics2D.split_mig; (iii) an bottleneck then exponential growth model modified to incorporate inbreeding: dadi.Demographics1D.bottlegrowth. For the human datasets, we used two models: (i) a divergence with migration model: dadi.Demographics2D.split_mig and (ii) an instantaneous growth model: dadi.Demographics1D.two_epoch. The extrapolation grid points were set using the formula , where ns is the sample size of the AFS. Our low-pass correction is also implemented in dadi-cli (Huang et al. 2023).
Supplementary Material
Supplementary material is available at Molecular Biology and Evolution online.
Funding
This work was supported by the National Institute of General Medical Sciences of the National Institutes of Health (R01GM127348 and R35GM149235 to R.N.G.).
Data Availability
The correction for low-pass sequencing is implemented in the dadi Python package (https://bitbucket.org/gutenkunstlab/dadi) and in the corresponding dadi-cli command-line interface (https://github.com/xin-huang/dadi-cli). The codebase for creating and analyzing both simulated and empirical datasets is on GitHub at https://github.com/emanuelmfonseca/low-coverage-sfs and https://github.com/lntran26/low-coverage-sfs/tree/main/empirical_analysis. Furthermore, we provide a user guide to help users implement our methodology on the dadi webpage (https://dadi.readthedocs.io/en/latest/user-guide/low-pass). The empirical data were downloaded from The International Genome Sample Resource data portal (https://www.internationalgenome.org Fairley et al. 2020).
References
Author notes
Conflict of interests: None declared.