-
PDF
- Split View
-
Views
-
Cite
Cite
Veronika Boskova, Tanja Stadler, PIQMEE: Bayesian Phylodynamic Method for Analysis of Large Data Sets with Duplicate Sequences, Molecular Biology and Evolution, Volume 37, Issue 10, October 2020, Pages 3061–3075, https://doi.org/10.1093/molbev/msaa136
- Share Icon Share
Abstract
Next-generation sequencing of pathogen quasispecies within a host yields data sets of tens to hundreds of unique sequences. However, the full data set often contains thousands of sequences, because many of those unique sequences have multiple identical copies. Data sets of this size represent a computational challenge for currently available Bayesian phylogenetic and phylodynamic methods. Through simulations, we explore how large data sets with duplicate sequences affect the speed and accuracy of phylogenetic and phylodynamic analysis within BEAST 2. We show that using unique sequences only leads to biases, and using a random subset of sequences yields imprecise parameter estimates. To overcome these shortcomings, we introduce PIQMEE, a BEAST 2 add-on that produces reliable parameter estimates from full data sets with increased computational efficiency as compared with the currently available methods within BEAST 2. The principle behind PIQMEE is to resolve the tree structure of the unique sequences only, while simultaneously estimating the branching times of the duplicate sequences. Distinguishing between unique and duplicate sequences allows our method to perform well even for very large data sets. Although the classic method converges poorly for data sets of 6,000 sequences when allowed to run for 7 days, our method converges in slightly more than 1 day. In fact, PIQMEE can handle data sets of around 21,000 sequences with 20 unique sequences in 14 days. Finally, we apply the method to a real, within-host HIV sequencing data set with several thousand sequences per patient.
Introduction
Phylogenetic and phylodynamic studies of pathogen spread at both between- and within-host scales rely on genetic sequence data as input. For estimating the between-host dynamics, a single (usually consensus) pathogen sequence per patient is often sufficient for conducting successful analyses (Drummond et al. 2005; Stadler et al. 2013, 2014; Volz and Pond 2014; Faria et al. 2016). The usual data set size in such studies is in the order of tens to hundreds of sequences.
Many pathogens, such as RNA viruses, replicate and mutate very quickly within a host (Hué et al. 2005; Pybus and Rambaut 2009; Alizon and Fraser 2013). By consequence, within a very short time such pathogens create an entire population of a virus called quasispecies (Eigen and Schuster 1977; Wilke 2005; Domingo et al. 2012), characterized by high sequence duplicity and diversity within a single host (Boeras et al. 2011; Domingo et al. 2012; Töpfer et al. 2014; Wu et al. 2014). In the context of viral quasispecies, each unique sequence is referred to as a haplotype (Töpfer et al. 2013).
The sequences from the quasispecies population are usually obtained using either labor-intensive cloning combined with Sanger sequencing or faster and more efficient next-generation sequencing (Schuster 2008; Goodwin et al. 2016). Next-generation sequencing, however, produces short reads that require further processing, whereby the reads are stitched together in order to reconstruct the original sequence (Goodwin et al. 2016). Assuming we can overcome, or at least correct for, the known errors of the RNA amplification procedure (McKinley et al. 2011), sample preprocessing (Vrancken et al. 2016), and data postprocessing errors (Beerenwinkel et al. 2012), we can reconstruct the within-host pathogen diversity from next-generation sequencing data sets in great detail (Zagordi et al. 2011; Schirmer et al. 2014; Prosperi et al. 2013; Töpfer et al. 2013, 2014; Pandit and de Boer 2014; Malhotra et al. 2016). Depending on the depth of the sequencing and performance of the assembly method, next-generation sequencing followed by haplotype reconstruction yields data sets of tens to hundreds of haplotypes each in multiple copies, annotated as a haplotype frequency.
The reconstructed haplotypes are then aligned to create a multiple sequence alignment. Before it can be used as input for current phylogenetic and phylodynamic inference methods, this alignment needs to be re-expanded, each unique haplotype being duplicated many times, proportional to its frequency. The phylogenetic inference methods then produce a phylogenetic tree, on which all sequences in the alignment are represented as tips. However, when re-expanded, the sequence alignment contains thousands of sequences. Reconstructing the phylogeny and the population dynamics from such big data sets represents a substantial computational hurdle.
Maximum-likelihood (ML) approaches have been especially useful to treat such large data sets (Stamatakis 2014; Montoya et al. 2016; Minh et al. 2020). However, the disadvantage of these methods is that they first maximize the likelihood over different phylogenetic trees, heuristically searching tree space to find the best-fitting tree. Then, given this ML tree, the pathogen dynamics are inferred. During tree inference, identical sequences will be grouped together, and because the most likely divergence between two identical sequences is always 0, the branch lengths of these subtrees will be (very close to) 0 or set to some default minimum (e.g.,
In contrast, Bayesian phylodynamic methods provide a natural way of integrating over phylogenetic trees, τ, when inferring population dynamic parameters, η, and sequence evolution parameters, θ. These methods infer the joint posterior distribution of τ, η, and θ given the data,
As the denominator at the right-hand side of the equation is difficult to evaluate, classic phylodynamic methods rely on numerically approximating the posterior distribution by sampling from it. Sampling from the posterior distribution is performed using the Metropolis–Hastings Markov chain Monte Carlo (MCMC) procedure (Metropolis et al. 1953; Hastings 1970). However, this procedure is computationally expensive and is not well suited for data sets larger than a few hundred sequences (Poon et al. 2012).
Options to speed up the calculation of the posterior density have been proposed and implemented. One approach is to parallelize the phylogenetic likelihood
To address the issue of computational expense when applying Bayesian methods to large data sets with duplicate sequences, one of the following two approaches is usually employed. The first approach is to randomly subsample the full data set to keep the diversity but to decrease the computational burden (Poon et al. 2011, 2012). This approach should yield unbiased estimates of population dynamic model parameters, but may lead to inference of a most recent common ancestor (MRCA) that is younger than, and thus not representative of, the MRCA of the full data set. This happens if the random sample chosen does not contain the two most divergent sequences that define the MRCA of the full data set. Furthermore, as a lot of information is left out, precision of the parameter estimates is compromised if a subsample instead of the whole data set is used. The second way of reducing the computational burden in the Bayesian methods is to perform inference on only the unique sequences, completely ignoring their respective frequencies (Bull et al. 2011; Recarey and Cristina 2014). However, this amounts to inference on a biased data set and thus parameter estimates may be biased as well. To our knowledge, these biases have not been explored previously.
In this study, we show that analyses with only unique sequences or only randomly subsampled data sets result in less accurate and/or less precise parameter estimates than when all data are being used. More importantly, we present a new method that improves convergence of the MCMC by keeping track of the duplicate sequence branching times but not of the full topology formed by these duplicate sequences. The rationale for this is that the duplicate sequences provide information regarding the evolutionary rate and thus the branching times, and can further help refine the topology on the unique sequences (DeWitt et al. 2018). However, these same sequences provide no information to narrow down the plausible tree topology space of the duplicate sequence subtrees (Dudas and Bedford 2019), making the tracking of the duplicate sequence subtree topology unnecessary. We implement this method as a BEAST 2 (Bouckaert et al. 2014) package called PIQMEE, which stands for “phylogenetic inference of quasispecies molecular evolution and epidemiology.” The method works with any sequence evolution model currently available in BEAST 2. As the population dynamics model, we adapted the birth–death skyline (BDsky) model (Stadler et al. 2013). We show on simulations that the PIQMEE method is accurate and precise, and faster than the classic method when the alignment analyzed contains duplicate sequences. Finally, we apply our method to an empirical HIV data set containing thousands of sequences.
New Approaches
The main idea of the new approach is based on the observation that a set of identical sequences contains no information regarding the underlying tree topology. Sampling various topological configurations of the subtree of identical sequences in MCMC is therefore a waste of computational time. Thus, especially for large data sets with many duplicate sequences, when including all sequences and representing each sequence as a tip in the tree, the MCMC chains mix very slowly. This then needs to be compensated for by simulating long chains. A lot of computational effort could be saved if, during MCMC, the topology of identical sequences was not sampled, meaning that the duplicate sequences were not treated as separate tips in the phylogenetic tree. However, the duplicates cannot be completely ignored, because such an approach would lead to biases.
We therefore propose calculating the posterior probability of the tree and the parameters for a reduced tree structure (see the Materials and Methods section for details). The reduced phylogenetic tree is built only from unique sequences but is complemented by an array of branching and sampling times of all the duplicate sequences.
The main assumption of the method is that the duplicate sequences of the haplotype always arise from an already existing sequence through branching (duplication) and not through mutation. In other words, we allow each haplotype to arise only once during the population history. If we were to represent this assumption on a full, nonreduced tree, it would lead to a tree which is “recursively monophyletic” with respect to the identical sequences. In such a recursively monophyletic tree, there is always at least one monophyletic group of identical sequences. If we were to remove it from the tree, there would be another monophyletic group of identical sequences formed. One could continue with the cycle of identification and removal of monophyletic groups of identical sequences until no tips would be left in the tree.
We adapted the calculation of the phylogenetic likelihood
Results
Impact of Input Number of Sequences on Method Performance
We simulated 100 trees for each of the following tip counts: 300, 1,200, 2,100, 3,000, and 6,000 tips. All tips were sampled at the same time point. The tree model used was a birth–death model with a constant birth and death rate. We simulated sequences on these trees according to a Jukes–Cantor model with a strict clock (for details, see the Materials and Methods section). We analyzed the resulting sequences using four inference methods. First, we analyzed the full data set with our new method, which we will refer to in the results as PIQMEE_all. Second, we analyzed the full data set with the classic BEAST 2 method, referred to as CLASSIC_all. Third, we analyzed only the unique sequences using the classic method (CLASSIC_unique). Fourth, we randomly subsampled the full data set to the size equal to the number of unique sequences and analyzed this subset with the classic method (CLASSIC_random). The BDsky model was assumed for the tree prior in all four analyses.
The results of the analyses are shown in figure 1 and table 1. Analyses of all data sets ran well with the exception of 68 data sets with 6,000 sequences in the CLASSIC_all method. These runs never started due to the BEAST 2 Java application having insufficient memory to handle the amount of data loaded. The remaining 32 runs ran for 7 days but only 20 reached an effective sample size (ESS) of 200 for all the metrics and model parameters. This shows that data sets of this size are at the limit of what can be handled by the software, if each duplicate sequence corresponds to a separate tip in the tree.

Performance comparison of the PIQMEE method and the CLASSIC method on full, unique and randomly subsampled data sets of various sizes. For each setting, 100 data sets were analyzed. All the summary statistics shown are only for the runs that reached ESS of 200 for each parameter. The total number of such runs out of 100 is shown in smaller, colored numbers on the x-axis below each figure. (A) The number of sequences analyzed by each method. For PIQMEE_all and CLASSIC_all, the method was considering all the sequences. For CLASSIC_unique and CLASSIC_random, the size of the data set was smaller. (B) The distribution of the CPU seconds elapsed until the runs reached ESS of 200 for all the parameters. (C) The clock rate used for inference, in units of substitutions/site/time unit. As all our data sets have the sequences sampled at one time point, we need to fix the clock rate to infer the tree height. The clock rate for all analyses was fixed to the values used for simulations. (D) The distribution of the normalized relative error of median posterior estimates of the tree heights (
Number of Simulated Sequences . | Method . | Median Number of Analyzed Sequences . | Runs Reaching 200 ESS . | CPU Seconds per 200 ESS . | Tree Height . | Re . | Death Rate . | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
True . | MoM . | RMSE . | Coverage . | HPD Size . | True . | MoM . | RMSE . | Coverage . | HPD Size . | True . | MoM . | RMSE . | Coverage . | HPD Size . | |||||
300 | PIQMEE_all | 300 | 100 | 510.716 | 189.414 | 164.452 | 5.089 | 92 | 1.286 | 1 | 0.998 | 0.008 | 100 | 0.023 | 1 | 1.023 | 0.352 | 99 | 1.352 |
CLASSIC_all | 300 | 98 | 868.493 | 189.414 | 167.685 | 5.098 | 90 | 1.291 | 1 | 0.998 | 0.007 | 98 | 0.022 | 1 | 1.016 | 0.352 | 97 | 1.355 | |
CLASSIC_unique | 19 | 100 | 61.457 | 189.414 | 153.707 | 5.670 | 81 | 1.135 | 1 | 12.932 | 17.343 | 99 | 643.745 | 1 | 0.004 | 0.935 | 0 | 0.204 | |
CLASSIC_random | 19 | 100 | 51.851 | 189.414 | 147.843 | 5.410 | 91 | 1.309 | 1 | 0.998 | 0.177 | 100 | 0.031 | 1 | 1.025 | 1.080 | 96 | 3.001 | |
1,200 | PIQMEE_all | 1,200 | 98 | 3,457.728 | 664.673 | 631.564 | 10.484 | 93 | 1.437 | 1 | 0.999 | 0.004 | 98 | 0.006 | 1 | 1.031 | 0.308 | 95 | 1.274 |
CLASSIC_all | 1,200 | 93 | 13,950.919 | 664.673 | 679.082 | 10.661 | 88 | 1.354 | 1 | 0.999 | 0.002 | 93 | 0.006 | 1 | 1.032 | 0.304 | 91 | 1.238 | |
CLASSIC_unique | 19 | 100 | 91.485 | 664.673 | 671.776 | 12.251 | 78 | 1.126 | 1 | 11.717 | 15.352 | 90 | 634.224 | 1 | 0.001 | 0.977 | 0 | 0.058 | |
CLASSIC_random | 19 | 100 | 94.050 | 664.673 | 571.125 | 12.426 | 84 | 1.302 | 1 | 0.999 | 1.419 | 100 | 0.017 | 1 | 1.134 | 119.208 | 96 | 4.041 | |
2,100 | PIQMEE_all | 2,100 | 95 | 7,532.271 | 1,195.555 | 1,214.334 | 12.626 | 91 | 1.520 | 1 | 1.000 | 0.001 | 95 | 0.004 | 1 | 1.119 | 0.371 | 92 | 1.313 |
CLASSIC_all | 2,100 | 83 | 33,230.011 | 1,195.555 | 1,346.233 | 13.353 | 79 | 1.417 | 1 | 1.000 | 0.000 | 83 | 0.003 | 1 | 1.104 | 0.363 | 80 | 1.266 | |
CLASSIC_unique | 19.5 | 100 | 115.094 | 1,195.555 | 1,196.450 | 15.974 | 80 | 1.217 | 1 | 5.426 | 10.711 | 93 | 581.605 | 1 | 0.002 | 0.975 | 1 | 0.070 | |
CLASSIC_random | 19.5 | 97 | 44.540 | 1,195.555 | 1,053.862 | 14.153 | 89 | 1.488 | 1 | 1.000 | 0.917 | 97 | 0.008 | 1 | 0.985 | 1.446 | 92 | 3.468 | |
3,000 | PIQMEE_all | 3,000 | 88 | 16,509.389 | 1,802.415 | 1,900.862 | 17.910 | 86 | 1.471 | 1 | 1.000 | 0.000 | 88 | 0.002 | 1 | 1.012 | 0.333 | 84 | 1.190 |
CLASSIC_all | 3,000 | 82 | 68,108.129 | 1,802.415 | 1,989.483 | 18.058 | 81 | 1.460 | 1 | 1.000 | 0.000 | 82 | 0.002 | 1 | 1.010 | 0.343 | 79 | 1.151 | |
CLASSIC_unique | 19 | 100 | 133.010 | 1,802.415 | 1,879.452 | 21.912 | 79 | 1.234 | 1 | 6.364 | 10.479 | 92 | 586.983 | 1 | 0.001 | 0.977 | 0 | 0.046 | |
CLASSIC_random | 19 | 97 | 52.201 | 1,802.415 | 1,391.544 | 21.435 | 84 | 1.440 | 1 | 1.000 | 0.119 | 97 | 0.005 | 1 | 0.921 | 2.555 | 96 | 3.420 | |
6,000 | PIQMEE_all | 6,000 | 79 | 92,081.925 | 3,685.930 | 4,097.601 | 25.200 | 75 | 1.407 | 1 | 1.000 | 0.000 | 79 | 0.001 | 1 | 1.000 | 0.457 | 78 | 1.086 |
CLASSIC_all | 6,000 | 20 | 352,521.226 | 3,685.930 | 4,737.040 | 21.652 | 19 | 1.390 | 1 | 1.000 | 0.000 | 20 | 0.001 | 1 | 0.957 | 0.781 | 19 | 1.023 | |
CLASSIC_unique | 20 | 100 | 181.031 | 3,685.930 | 4,006.775 | 31.893 | 81 | 1.290 | 1 | 3.683 | 8.946 | 82 | 550.668 | 1 | 0.001 | 0.980 | 1 | 0.041 | |
CLASSIC_random | 20 | 95 | 44.510 | 3,685.930 | 3,012.714 | 25.474 | 86 | 1.421 | 1 | 1.000 | 0.214 | 95 | 0.005 | 1 | 1.068 | 63.933 | 92 | 3.754 |
Number of Simulated Sequences . | Method . | Median Number of Analyzed Sequences . | Runs Reaching 200 ESS . | CPU Seconds per 200 ESS . | Tree Height . | Re . | Death Rate . | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
True . | MoM . | RMSE . | Coverage . | HPD Size . | True . | MoM . | RMSE . | Coverage . | HPD Size . | True . | MoM . | RMSE . | Coverage . | HPD Size . | |||||
300 | PIQMEE_all | 300 | 100 | 510.716 | 189.414 | 164.452 | 5.089 | 92 | 1.286 | 1 | 0.998 | 0.008 | 100 | 0.023 | 1 | 1.023 | 0.352 | 99 | 1.352 |
CLASSIC_all | 300 | 98 | 868.493 | 189.414 | 167.685 | 5.098 | 90 | 1.291 | 1 | 0.998 | 0.007 | 98 | 0.022 | 1 | 1.016 | 0.352 | 97 | 1.355 | |
CLASSIC_unique | 19 | 100 | 61.457 | 189.414 | 153.707 | 5.670 | 81 | 1.135 | 1 | 12.932 | 17.343 | 99 | 643.745 | 1 | 0.004 | 0.935 | 0 | 0.204 | |
CLASSIC_random | 19 | 100 | 51.851 | 189.414 | 147.843 | 5.410 | 91 | 1.309 | 1 | 0.998 | 0.177 | 100 | 0.031 | 1 | 1.025 | 1.080 | 96 | 3.001 | |
1,200 | PIQMEE_all | 1,200 | 98 | 3,457.728 | 664.673 | 631.564 | 10.484 | 93 | 1.437 | 1 | 0.999 | 0.004 | 98 | 0.006 | 1 | 1.031 | 0.308 | 95 | 1.274 |
CLASSIC_all | 1,200 | 93 | 13,950.919 | 664.673 | 679.082 | 10.661 | 88 | 1.354 | 1 | 0.999 | 0.002 | 93 | 0.006 | 1 | 1.032 | 0.304 | 91 | 1.238 | |
CLASSIC_unique | 19 | 100 | 91.485 | 664.673 | 671.776 | 12.251 | 78 | 1.126 | 1 | 11.717 | 15.352 | 90 | 634.224 | 1 | 0.001 | 0.977 | 0 | 0.058 | |
CLASSIC_random | 19 | 100 | 94.050 | 664.673 | 571.125 | 12.426 | 84 | 1.302 | 1 | 0.999 | 1.419 | 100 | 0.017 | 1 | 1.134 | 119.208 | 96 | 4.041 | |
2,100 | PIQMEE_all | 2,100 | 95 | 7,532.271 | 1,195.555 | 1,214.334 | 12.626 | 91 | 1.520 | 1 | 1.000 | 0.001 | 95 | 0.004 | 1 | 1.119 | 0.371 | 92 | 1.313 |
CLASSIC_all | 2,100 | 83 | 33,230.011 | 1,195.555 | 1,346.233 | 13.353 | 79 | 1.417 | 1 | 1.000 | 0.000 | 83 | 0.003 | 1 | 1.104 | 0.363 | 80 | 1.266 | |
CLASSIC_unique | 19.5 | 100 | 115.094 | 1,195.555 | 1,196.450 | 15.974 | 80 | 1.217 | 1 | 5.426 | 10.711 | 93 | 581.605 | 1 | 0.002 | 0.975 | 1 | 0.070 | |
CLASSIC_random | 19.5 | 97 | 44.540 | 1,195.555 | 1,053.862 | 14.153 | 89 | 1.488 | 1 | 1.000 | 0.917 | 97 | 0.008 | 1 | 0.985 | 1.446 | 92 | 3.468 | |
3,000 | PIQMEE_all | 3,000 | 88 | 16,509.389 | 1,802.415 | 1,900.862 | 17.910 | 86 | 1.471 | 1 | 1.000 | 0.000 | 88 | 0.002 | 1 | 1.012 | 0.333 | 84 | 1.190 |
CLASSIC_all | 3,000 | 82 | 68,108.129 | 1,802.415 | 1,989.483 | 18.058 | 81 | 1.460 | 1 | 1.000 | 0.000 | 82 | 0.002 | 1 | 1.010 | 0.343 | 79 | 1.151 | |
CLASSIC_unique | 19 | 100 | 133.010 | 1,802.415 | 1,879.452 | 21.912 | 79 | 1.234 | 1 | 6.364 | 10.479 | 92 | 586.983 | 1 | 0.001 | 0.977 | 0 | 0.046 | |
CLASSIC_random | 19 | 97 | 52.201 | 1,802.415 | 1,391.544 | 21.435 | 84 | 1.440 | 1 | 1.000 | 0.119 | 97 | 0.005 | 1 | 0.921 | 2.555 | 96 | 3.420 | |
6,000 | PIQMEE_all | 6,000 | 79 | 92,081.925 | 3,685.930 | 4,097.601 | 25.200 | 75 | 1.407 | 1 | 1.000 | 0.000 | 79 | 0.001 | 1 | 1.000 | 0.457 | 78 | 1.086 |
CLASSIC_all | 6,000 | 20 | 352,521.226 | 3,685.930 | 4,737.040 | 21.652 | 19 | 1.390 | 1 | 1.000 | 0.000 | 20 | 0.001 | 1 | 0.957 | 0.781 | 19 | 1.023 | |
CLASSIC_unique | 20 | 100 | 181.031 | 3,685.930 | 4,006.775 | 31.893 | 81 | 1.290 | 1 | 3.683 | 8.946 | 82 | 550.668 | 1 | 0.001 | 0.980 | 1 | 0.041 | |
CLASSIC_random | 20 | 95 | 44.510 | 3,685.930 | 3,012.714 | 25.474 | 86 | 1.421 | 1 | 1.000 | 0.214 | 95 | 0.005 | 1 | 1.068 | 63.933 | 92 | 3.754 |
Note.—Summary statistics for the analyses of 100 simulations per setting using four different methods to analyze the same data sets. The “true” value of the parameter, that is, the value of the parameter under which the data were simulated, is displayed in the respective column for each parameter. For the tree height parameter, the median of the “true” tree heights is displayed. MoM represents the median of medians of those analyses out of 100 that reached ESS 200 for each parameter. RMSE is the root mean squared error of the medians of those analyses out of 100 that reached ESS 200 for each parameter. RMSE was normalized by the true value of the respective parameter. Coverage displays the number of analyses (out of 100) that reached ESS of 200 for each parameter and for which the true value of a parameter was contained in the 95% HPD interval. HPD size is the median size of the 95% HPD interval of those analyses out of 100 that reached ESS of 200 for each parameter. HPD size was normalized by the true value of the respective parameter. The table also shows the median CPU time of those runs that finished within the runtime limit of 7 days and reached 200 ESS for each parameter.
Number of Simulated Sequences . | Method . | Median Number of Analyzed Sequences . | Runs Reaching 200 ESS . | CPU Seconds per 200 ESS . | Tree Height . | Re . | Death Rate . | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
True . | MoM . | RMSE . | Coverage . | HPD Size . | True . | MoM . | RMSE . | Coverage . | HPD Size . | True . | MoM . | RMSE . | Coverage . | HPD Size . | |||||
300 | PIQMEE_all | 300 | 100 | 510.716 | 189.414 | 164.452 | 5.089 | 92 | 1.286 | 1 | 0.998 | 0.008 | 100 | 0.023 | 1 | 1.023 | 0.352 | 99 | 1.352 |
CLASSIC_all | 300 | 98 | 868.493 | 189.414 | 167.685 | 5.098 | 90 | 1.291 | 1 | 0.998 | 0.007 | 98 | 0.022 | 1 | 1.016 | 0.352 | 97 | 1.355 | |
CLASSIC_unique | 19 | 100 | 61.457 | 189.414 | 153.707 | 5.670 | 81 | 1.135 | 1 | 12.932 | 17.343 | 99 | 643.745 | 1 | 0.004 | 0.935 | 0 | 0.204 | |
CLASSIC_random | 19 | 100 | 51.851 | 189.414 | 147.843 | 5.410 | 91 | 1.309 | 1 | 0.998 | 0.177 | 100 | 0.031 | 1 | 1.025 | 1.080 | 96 | 3.001 | |
1,200 | PIQMEE_all | 1,200 | 98 | 3,457.728 | 664.673 | 631.564 | 10.484 | 93 | 1.437 | 1 | 0.999 | 0.004 | 98 | 0.006 | 1 | 1.031 | 0.308 | 95 | 1.274 |
CLASSIC_all | 1,200 | 93 | 13,950.919 | 664.673 | 679.082 | 10.661 | 88 | 1.354 | 1 | 0.999 | 0.002 | 93 | 0.006 | 1 | 1.032 | 0.304 | 91 | 1.238 | |
CLASSIC_unique | 19 | 100 | 91.485 | 664.673 | 671.776 | 12.251 | 78 | 1.126 | 1 | 11.717 | 15.352 | 90 | 634.224 | 1 | 0.001 | 0.977 | 0 | 0.058 | |
CLASSIC_random | 19 | 100 | 94.050 | 664.673 | 571.125 | 12.426 | 84 | 1.302 | 1 | 0.999 | 1.419 | 100 | 0.017 | 1 | 1.134 | 119.208 | 96 | 4.041 | |
2,100 | PIQMEE_all | 2,100 | 95 | 7,532.271 | 1,195.555 | 1,214.334 | 12.626 | 91 | 1.520 | 1 | 1.000 | 0.001 | 95 | 0.004 | 1 | 1.119 | 0.371 | 92 | 1.313 |
CLASSIC_all | 2,100 | 83 | 33,230.011 | 1,195.555 | 1,346.233 | 13.353 | 79 | 1.417 | 1 | 1.000 | 0.000 | 83 | 0.003 | 1 | 1.104 | 0.363 | 80 | 1.266 | |
CLASSIC_unique | 19.5 | 100 | 115.094 | 1,195.555 | 1,196.450 | 15.974 | 80 | 1.217 | 1 | 5.426 | 10.711 | 93 | 581.605 | 1 | 0.002 | 0.975 | 1 | 0.070 | |
CLASSIC_random | 19.5 | 97 | 44.540 | 1,195.555 | 1,053.862 | 14.153 | 89 | 1.488 | 1 | 1.000 | 0.917 | 97 | 0.008 | 1 | 0.985 | 1.446 | 92 | 3.468 | |
3,000 | PIQMEE_all | 3,000 | 88 | 16,509.389 | 1,802.415 | 1,900.862 | 17.910 | 86 | 1.471 | 1 | 1.000 | 0.000 | 88 | 0.002 | 1 | 1.012 | 0.333 | 84 | 1.190 |
CLASSIC_all | 3,000 | 82 | 68,108.129 | 1,802.415 | 1,989.483 | 18.058 | 81 | 1.460 | 1 | 1.000 | 0.000 | 82 | 0.002 | 1 | 1.010 | 0.343 | 79 | 1.151 | |
CLASSIC_unique | 19 | 100 | 133.010 | 1,802.415 | 1,879.452 | 21.912 | 79 | 1.234 | 1 | 6.364 | 10.479 | 92 | 586.983 | 1 | 0.001 | 0.977 | 0 | 0.046 | |
CLASSIC_random | 19 | 97 | 52.201 | 1,802.415 | 1,391.544 | 21.435 | 84 | 1.440 | 1 | 1.000 | 0.119 | 97 | 0.005 | 1 | 0.921 | 2.555 | 96 | 3.420 | |
6,000 | PIQMEE_all | 6,000 | 79 | 92,081.925 | 3,685.930 | 4,097.601 | 25.200 | 75 | 1.407 | 1 | 1.000 | 0.000 | 79 | 0.001 | 1 | 1.000 | 0.457 | 78 | 1.086 |
CLASSIC_all | 6,000 | 20 | 352,521.226 | 3,685.930 | 4,737.040 | 21.652 | 19 | 1.390 | 1 | 1.000 | 0.000 | 20 | 0.001 | 1 | 0.957 | 0.781 | 19 | 1.023 | |
CLASSIC_unique | 20 | 100 | 181.031 | 3,685.930 | 4,006.775 | 31.893 | 81 | 1.290 | 1 | 3.683 | 8.946 | 82 | 550.668 | 1 | 0.001 | 0.980 | 1 | 0.041 | |
CLASSIC_random | 20 | 95 | 44.510 | 3,685.930 | 3,012.714 | 25.474 | 86 | 1.421 | 1 | 1.000 | 0.214 | 95 | 0.005 | 1 | 1.068 | 63.933 | 92 | 3.754 |
Number of Simulated Sequences . | Method . | Median Number of Analyzed Sequences . | Runs Reaching 200 ESS . | CPU Seconds per 200 ESS . | Tree Height . | Re . | Death Rate . | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
True . | MoM . | RMSE . | Coverage . | HPD Size . | True . | MoM . | RMSE . | Coverage . | HPD Size . | True . | MoM . | RMSE . | Coverage . | HPD Size . | |||||
300 | PIQMEE_all | 300 | 100 | 510.716 | 189.414 | 164.452 | 5.089 | 92 | 1.286 | 1 | 0.998 | 0.008 | 100 | 0.023 | 1 | 1.023 | 0.352 | 99 | 1.352 |
CLASSIC_all | 300 | 98 | 868.493 | 189.414 | 167.685 | 5.098 | 90 | 1.291 | 1 | 0.998 | 0.007 | 98 | 0.022 | 1 | 1.016 | 0.352 | 97 | 1.355 | |
CLASSIC_unique | 19 | 100 | 61.457 | 189.414 | 153.707 | 5.670 | 81 | 1.135 | 1 | 12.932 | 17.343 | 99 | 643.745 | 1 | 0.004 | 0.935 | 0 | 0.204 | |
CLASSIC_random | 19 | 100 | 51.851 | 189.414 | 147.843 | 5.410 | 91 | 1.309 | 1 | 0.998 | 0.177 | 100 | 0.031 | 1 | 1.025 | 1.080 | 96 | 3.001 | |
1,200 | PIQMEE_all | 1,200 | 98 | 3,457.728 | 664.673 | 631.564 | 10.484 | 93 | 1.437 | 1 | 0.999 | 0.004 | 98 | 0.006 | 1 | 1.031 | 0.308 | 95 | 1.274 |
CLASSIC_all | 1,200 | 93 | 13,950.919 | 664.673 | 679.082 | 10.661 | 88 | 1.354 | 1 | 0.999 | 0.002 | 93 | 0.006 | 1 | 1.032 | 0.304 | 91 | 1.238 | |
CLASSIC_unique | 19 | 100 | 91.485 | 664.673 | 671.776 | 12.251 | 78 | 1.126 | 1 | 11.717 | 15.352 | 90 | 634.224 | 1 | 0.001 | 0.977 | 0 | 0.058 | |
CLASSIC_random | 19 | 100 | 94.050 | 664.673 | 571.125 | 12.426 | 84 | 1.302 | 1 | 0.999 | 1.419 | 100 | 0.017 | 1 | 1.134 | 119.208 | 96 | 4.041 | |
2,100 | PIQMEE_all | 2,100 | 95 | 7,532.271 | 1,195.555 | 1,214.334 | 12.626 | 91 | 1.520 | 1 | 1.000 | 0.001 | 95 | 0.004 | 1 | 1.119 | 0.371 | 92 | 1.313 |
CLASSIC_all | 2,100 | 83 | 33,230.011 | 1,195.555 | 1,346.233 | 13.353 | 79 | 1.417 | 1 | 1.000 | 0.000 | 83 | 0.003 | 1 | 1.104 | 0.363 | 80 | 1.266 | |
CLASSIC_unique | 19.5 | 100 | 115.094 | 1,195.555 | 1,196.450 | 15.974 | 80 | 1.217 | 1 | 5.426 | 10.711 | 93 | 581.605 | 1 | 0.002 | 0.975 | 1 | 0.070 | |
CLASSIC_random | 19.5 | 97 | 44.540 | 1,195.555 | 1,053.862 | 14.153 | 89 | 1.488 | 1 | 1.000 | 0.917 | 97 | 0.008 | 1 | 0.985 | 1.446 | 92 | 3.468 | |
3,000 | PIQMEE_all | 3,000 | 88 | 16,509.389 | 1,802.415 | 1,900.862 | 17.910 | 86 | 1.471 | 1 | 1.000 | 0.000 | 88 | 0.002 | 1 | 1.012 | 0.333 | 84 | 1.190 |
CLASSIC_all | 3,000 | 82 | 68,108.129 | 1,802.415 | 1,989.483 | 18.058 | 81 | 1.460 | 1 | 1.000 | 0.000 | 82 | 0.002 | 1 | 1.010 | 0.343 | 79 | 1.151 | |
CLASSIC_unique | 19 | 100 | 133.010 | 1,802.415 | 1,879.452 | 21.912 | 79 | 1.234 | 1 | 6.364 | 10.479 | 92 | 586.983 | 1 | 0.001 | 0.977 | 0 | 0.046 | |
CLASSIC_random | 19 | 97 | 52.201 | 1,802.415 | 1,391.544 | 21.435 | 84 | 1.440 | 1 | 1.000 | 0.119 | 97 | 0.005 | 1 | 0.921 | 2.555 | 96 | 3.420 | |
6,000 | PIQMEE_all | 6,000 | 79 | 92,081.925 | 3,685.930 | 4,097.601 | 25.200 | 75 | 1.407 | 1 | 1.000 | 0.000 | 79 | 0.001 | 1 | 1.000 | 0.457 | 78 | 1.086 |
CLASSIC_all | 6,000 | 20 | 352,521.226 | 3,685.930 | 4,737.040 | 21.652 | 19 | 1.390 | 1 | 1.000 | 0.000 | 20 | 0.001 | 1 | 0.957 | 0.781 | 19 | 1.023 | |
CLASSIC_unique | 20 | 100 | 181.031 | 3,685.930 | 4,006.775 | 31.893 | 81 | 1.290 | 1 | 3.683 | 8.946 | 82 | 550.668 | 1 | 0.001 | 0.980 | 1 | 0.041 | |
CLASSIC_random | 20 | 95 | 44.510 | 3,685.930 | 3,012.714 | 25.474 | 86 | 1.421 | 1 | 1.000 | 0.214 | 95 | 0.005 | 1 | 1.068 | 63.933 | 92 | 3.754 |
Note.—Summary statistics for the analyses of 100 simulations per setting using four different methods to analyze the same data sets. The “true” value of the parameter, that is, the value of the parameter under which the data were simulated, is displayed in the respective column for each parameter. For the tree height parameter, the median of the “true” tree heights is displayed. MoM represents the median of medians of those analyses out of 100 that reached ESS 200 for each parameter. RMSE is the root mean squared error of the medians of those analyses out of 100 that reached ESS 200 for each parameter. RMSE was normalized by the true value of the respective parameter. Coverage displays the number of analyses (out of 100) that reached ESS of 200 for each parameter and for which the true value of a parameter was contained in the 95% HPD interval. HPD size is the median size of the 95% HPD interval of those analyses out of 100 that reached ESS of 200 for each parameter. HPD size was normalized by the true value of the respective parameter. The table also shows the median CPU time of those runs that finished within the runtime limit of 7 days and reached 200 ESS for each parameter.
Table 1 and figure 1 clearly show that the analyses of subsets of the full data, that is, unique or random subsets of sequences, are much faster than the analyses of the full data set under either the CLASSIC or PIQMEE methods. Note that as the number of sequences in the unique and random data sets stays more-or-less constant (median of 19–20 sequences), these analyses always take approximately the same amount of time no matter how many sequences there are in the full data set.
In what follows, we use the root mean squared error (RMSE) of the median parameter estimates as a measure of accuracy. Systematic deviation from the true value calculated as “MoM—true value”, where MoM stands for the median of medians estimates, is used as a measure of bias. Precision refers to the width of the 95% highest posterior density (HPD) interval.
First, we will discuss the results obtained from the CLASSIC_unique method. Analyses of the unique sequences lead to biased parameter estimates as indicated by the systematic deviation of the MoM from the true values of the parameters. The posterior estimates of the tree height parameter are above the true parameter value. This overestimation likely happens because in the unique sequence tree only the branches with mutation are seen. By consequence, at the considered substitution rate the plausible trees have longer branches than the true tree in order to accommodate all this “quickly appearing” diversity. The population dynamic parameter Re (effective reproductive number), which is the birth rate divided by the death rate, is biased upwards and the death rate is biased downwards. The bias in the population dynamic parameters could be purely a consequence of the overestimation of the tree height. However, it could also be explained by the lack of short terminal branches in the tree, which constitute a significant proportion of all branches leading to tips in our simulations. These two hypotheses can be probed by fixing the tree and reestimating the parameters (see supplementary fig. S1, Supplementary Material online). Results of these analyses show that the estimated death rate is still low, but higher, and closer to the true value, than when the tree is unfixed (fig. 1). However, not all of the bias can be explained by the missing short terminal branches. Thus, the bias in the death rate and Re, comes partly from the overestimation of the tree height and partly from the missing short terminal branches. Using the CLASSIC_unique method, the Re, and for most settings also the tree height, are estimated with the lowest accuracy, that is, the RMSE is the highest. Similarly, for each single parameter across all five data set sizes the coverage percentage for the mixed runs (defined as the percentage of analyses, out of those for which an ESS of 200 was reached for all parameters, whose HPD interval includes the true parameter value) is the lowest when the unique sequences only are analyzed (table 1). The coverage ranges between 82% and 99% for Re and is close to 0% for the death rate. The higher coverage for the Re can be partially explained by very low precision, that is, very wide 95% HPD intervals, for that parameter.
In contrast to the CLASSIC_unique results, the analyses of random subsets of the data do not lead to bias in the estimates of the population dynamic parameters. For both the Re and the death rate, the accuracy and the precision are nevertheless lower than for the CLASSIC or PIQMEE analyses of full data sets. Due to the reduced sample size, larger HPD intervals without biases are expected when analyzing random samples as opposed to the full data set. High coverage is therefore expected as a direct consequence of the increased HPD intervals. However, the tree height inferred is generally smaller than the true tree height of the full data set. As mentioned in the introduction, this is due to the fact that the random subset of sequences may not contain the two most divergent sequences from the full set, leading to a tree with a younger MRCA. In such a case, the estimate of tree height from the subsample will have seemingly increased bias when compared with the MRCA of the full data set, an observation reproduced in our results (table 1).
The PIQMEE_all and CLASSIC_all analyses lead to very similar parameter estimates, confirming that our PIQMEE implementation works correctly. Both the table 1 and the figure 1 show that the accuracy, precision, and coverage for all parameters inferred with PIQMEE are almost identical to those obtained using the CLASSIC method applied to the full data set. However, the PIQMEE_all analyses are much faster. With increasing number of sequences, the speed difference gets bigger. The reason for this is that the sequences were simulated on each tree such that the median number of unique sequences was around 19–20. This also means that the amount of duplicates for each unique sequence increases with increasing number of total sequences. This in turn results in the speed advantage of the PIQMEE method increasing as the data set grows in size. The analyses with the CLASSIC_all method start to mix very slowly at 2,100 sequences. At 6,000 sequences, many of the CLASSIC_all analyses fail early on due to too large memory requirements. Of those that run many converge slowly, resulting in only 20 out of 100 well-mixed runs. By contrast, when the number of input sequences reaches 6,000, 79 out of 100 PIQMEE runs mix well.
To find out the maximum size of the data set that PIQMEE is able to analyze within a 2-week runtime, we simulated 100 trees for each of following sizes: 12,000, 15,000, 18,000, 21,000, 24,000, 27,000, and 30,000 tips. All tips in the trees were sampled at one point in time. We then simulated sequences on these trees under two different substitution rates, such that there were ∼20 and 200 unique sequences. The resulting sequence data sets were analyzed by PIQMEE. Figure 2 and supplementary table S1, Supplementary Material online, show for which data sets we obtained well-mixed MCMC chains.

Performance of the PIQMEE method on very large data sets. We analyzed data sets consisting of 12,000, 15,000, 18,000, 21,000, 24,000, 27,000, and 30,000 sequences with PIQMEE. All sequences were sampled at one point in time (homochronous sampling). (A) The chain length reached in 14 days. (B) The burn-in, expressed as the fraction of the entire chain, that needed to be removed in order to obtain well-mixing part of the chain. We note that if we have to remove more than half of the chain, it indicates that convergence is not necessarily reached and the chain should thus be run for longer. (C) The ESS reached. The y-axis in (C) has a log-scale.
As expected, with increasing number of sequences in the data set the sampling from the posterior distribution gets slower, as indicated by the decreased number of steps the MCMC chain achieved within the given run time of 14 days. The mixing also gets slower, as indicated by the increasing number of burn-in steps.
The data sets with ∼20 unique sequences ran and also converged faster than those with ∼200 unique sequences. For the data sets with 20 unique sequences, the majority of the analyses of data sets with 21,000 or less sequences mixed well (ESS
Method Performance under Various Scenarios
Sequence Evolution Models
We have shown above that the parameter inference under PIQMEE works very well when the sequences are simulated and analyzed with the JC69 model under the strict molecular clock model. The phylogenetic likelihood calculation under PIQMEE also works for more complex models of sequence evolution. The parameter inference is equally good under PIQMEE as under CLASSIC method with the full data set when the sequences are simulated with HKY (Hasegawa–Kishino–Yano) or GTR (general time reversible) model (see supplementary fig. S2, Supplementary Material online, for an example of data sets with 300 simulated sequences).
Heterochronous Sampling
In addition to the homochronous sequences (sampling at one point in time) shown above, our PIQMEE method can also handle data sets with heterochronous sequences, that is, sampled through several points in time. Again, as expected given a correct PIQMEE implementation, the parameter estimates under PIQMEE_all and CLASSIC_all are the same (supplementary fig. S3 and table S2, Supplementary Material online), despite an additional parameter, that is, clock rate, being estimated. Similarly to the analysis of homochronous sequences, the analysis of heterochronous sequences by the PIQMEE method is faster than by the CLASSIC method. In fact, the speed advantage of PIQMEE is more pronounced on the heterochronous data set as compared with the homochronous data set. With just 300 sequences being analyzed, PIQMEE is, based on the median runtime, five times faster on the heterochronous data set (3,203 vs. 16,208 CPU seconds), whereas it is 1.7 times faster on the homochronous data set (511 vs. 868 CPU seconds).
Patient Number . | Estimated Time since Infection (years) . | Total Number of Sequences . | Number of Unique Sequences . |
---|---|---|---|
p1 | 8.21 | 10,394 | 52 |
p2 | 5.53 | 6,998 | 13 |
p3 | 8.44 | 3,625 | 41 |
p5 | 5.89 | 7,408 | 35 |
p6 | 7.00 | 10,798 | 15 |
p8 | 4.96 | 4,689 | 31 |
p9 | 8.10 | 5,253 | 17 |
p11 | 5.60 | 4,693 | 35 |
Patient Number . | Estimated Time since Infection (years) . | Total Number of Sequences . | Number of Unique Sequences . |
---|---|---|---|
p1 | 8.21 | 10,394 | 52 |
p2 | 5.53 | 6,998 | 13 |
p3 | 8.44 | 3,625 | 41 |
p5 | 5.89 | 7,408 | 35 |
p6 | 7.00 | 10,798 | 15 |
p8 | 4.96 | 4,689 | 31 |
p9 | 8.10 | 5,253 | 17 |
p11 | 5.60 | 4,693 | 35 |
Note.—The second column shows clinically established estimated time since infection for each patient. The third column shows the total sequence count covering the C2-V5 region of HIV genome. The fourth column shows the number of unique sequences in each patient data set.
Patient Number . | Estimated Time since Infection (years) . | Total Number of Sequences . | Number of Unique Sequences . |
---|---|---|---|
p1 | 8.21 | 10,394 | 52 |
p2 | 5.53 | 6,998 | 13 |
p3 | 8.44 | 3,625 | 41 |
p5 | 5.89 | 7,408 | 35 |
p6 | 7.00 | 10,798 | 15 |
p8 | 4.96 | 4,689 | 31 |
p9 | 8.10 | 5,253 | 17 |
p11 | 5.60 | 4,693 | 35 |
Patient Number . | Estimated Time since Infection (years) . | Total Number of Sequences . | Number of Unique Sequences . |
---|---|---|---|
p1 | 8.21 | 10,394 | 52 |
p2 | 5.53 | 6,998 | 13 |
p3 | 8.44 | 3,625 | 41 |
p5 | 5.89 | 7,408 | 35 |
p6 | 7.00 | 10,798 | 15 |
p8 | 4.96 | 4,689 | 31 |
p9 | 8.10 | 5,253 | 17 |
p11 | 5.60 | 4,693 | 35 |
Note.—The second column shows clinically established estimated time since infection for each patient. The third column shows the total sequence count covering the C2-V5 region of HIV genome. The fourth column shows the number of unique sequences in each patient data set.
Furthermore, we obtain the same patterns with respect to speed, bias, accuracy, and precision for the analyses of the unique subsets of the data as we did in the homochronous case. The only exception is that in the homochronous scenario the Re was overestimated, whereas in the heterochronous data sets it is underestimated.
For the random subsets, the results are generally less accurate than in the homochronous scenario. The CLASSIC_random method overestimates the clock rate and the death rate and underestimates the Re. This is most likely due to a lack of information content in the subsampled data set, insufficient for the inference of all parameters of the model. When we reanalyze the same data set with a stronger prior around the true value of the death rate parameter (supplementary fig. S4, Supplementary Material online), estimates of the death rate and clock rate improve.
Clock Models
Analyses of subsets of the (heterochronous) data using the relaxed clock model lead to similar results to those observed when the strict clock model is used. Both the CLASSIC_random and CLASSIC_unique methods are still faster than full data set analyses; however, they also lead to biased parameter estimates. We further observe that there is a slight deviation of the PIQMEE method results from those of the CLASSIC method when applied to full data sets (supplementary fig. S5, Supplementary Material online). The main difference is that the confidence intervals for the Re and the average clock rate parameter are larger for the PIQMEE method as compared with the CLASSIC method. The coverage of the PIQMEE method is very good, and similar to that of the CLASSIC_all method, for all parameters displayed in supplementary figure S5, Supplementary Material online, with exception of the death rate (see supplementary tables S3 and S4, Supplementary Material online). In addition, the PIQMEE method tends to slightly underestimate the death rate and the clock rate, while overestimating the tree height. This pattern is seen even if the tree is fixed (supplementary fig. S6, Supplementary Material online) to the true tree. The distribution of substitution rates associated with different types of branches (internal vs. external) in our tree (supplementary figs. S7 and S8, Supplementary Material online) makes it clear that the bias in the parameter estimates is a result of the PIQMEE method’s assumption that each haplotype only evolves once within the tree. This assumption translates into the PIQMEE method requiring that for each haplotype subtree the sequence at the MRCA of the subtree is exactly the same as the sequence at the tips. This has two consequences that are reflected in our results. Firstly, it can be seen from the transition probability formula (see the Materials and Methods section, phylogenetic likelihood) that the lower the substitution rate for a fixed, large haplotype subtree is, the higher the phylogenetic likelihood value will be. This association between low rates and large subtrees can be seen in our results (supplementary figs. S7 and S8, Supplementary Material online). In an unfixed tree, this can then translate in forcing the root of the tree to be older, and by consequence the death rate (as well as the birth rate) to be lower.
Secondly, the PIQMEE method forces the sequence within haplotype subtrees to remain unchanged until further in the past than the CLASSIC method does. The (internal) branches in the tree that are above the haplotype subtrees need to take relatively higher substitution rates (see supplementary figs. S7 and S8, Supplementary Material online), such that the sequences change fast enough between the MRCA of the haplotype subtrees and the next internal node, where they join with another tip or MRCA of another haplotype’s subtree.
Skyline Model
Finally, the tree prior in the PIQMEE method has been implemented as an extension of a BDsky model. PIQMEE preserves the skyline functionality of BDsky and is thus able to capture the changes in the population dynamics parameters (Re and death rate) over time (see supplementary fig. S9, Supplementary Material online). As in the previous analyses, where the sequences were sampled at multiple time points, both the CLASSIC_random and the CLASSIC_unique method perform poorly. In contrast to the nonskyline methods, the clock rate is severely underestimated by both CLASSIC_random and CLASSIC_unique skyline methods. This leads to overestimation of the tree height. Also the population dynamics parameter estimates have large biases, with both methods overestimating the Re and underestimating the death rate. These biased parameter estimates are very likely due to the lack of sufficient information in the data for inference of all parameters of the model, confirmed by the results of the analyses with more defined priors (see supplementary fig. S10, Supplementary Material online).
Analysis of the Real Within-Host HIV Data Sets
We applied the PIQMEE method to the publicly available HIV sequence data from 8 out of 11 patients published in Zanini et al. (2015). Two patients, patient 4 and patient 7, were omitted because they have been superinfected. One additional patient, patient 10, was removed from the analyses, because for the genomic region we analyzed, sequences were successfully obtained for one time point only, insufficient to perform inference without fixing the clock rate. Table 2 shows summary statistics of the analyzed data sets.
We used a skyline model with 1, 3, or 5 intervals for Re. We ran the analyses once with and once without sequences (option “Sample from prior” in BEAST 2) to check how the priors interfere with each other and whether the distributions obtained from the two runs are different, that is, whether the sequence data contain enough information to allow for inference of the parameters of interest. Running the analyses when excluding sequences in BEAST 2 is not equivalent to running the analyses under the prior in the Bayesian sense, as the birth–death model still uses the sampling dates of the sequences as source of information (Boskova et al. 2018).
None of the runs using the CLASSIC_all method mixed within the 7-day runtime. We obtained well-mixed chains for both when using the sequence data and when excluding the sequence data from the PIQMEE analyses for the following patients and settings: patients 2, 8, and 9 when using one interval for Re, patient 8 when using three intervals for Re, and patient 5 when using five intervals for Re (see supplementary table S5, Supplementary Material online). However, in the analyses of patients 2 and 9, when using a single interval for Re in the PIQMEE method, the distribution of the Re parameter was essentially invariant to the inclusion or exclusion of the sequence data. This indicates that the sequence data bring very little information to the model on this parameter (see supplementary figs. S11 and S12, Supplementary Material online, for patients 2 and 9, respectively).
In addition to the analyses of full data sets with the CLASSIC and PIQMEE methods, we also analyzed the data set consisting of only the unique sequences (see table 2 for unique sequence counts) and a random subset of 600 sequences. For patients 2 and 8 when using one interval for Re (supplementary figs. S11 and S13, Supplementary Material online) and for patient 8 when using three intervals for Re (supplementary fig. S14, Supplementary Material online), only the analyses with PIQMEE_all and CLASSIC_unique methods mixed well. There was only a single case for which all three analyses (PIQMEE_all, CLASSIC_random, and CLASSIC_unique) mixed well both with and without the sequence data: when using one interval for Re with patient 9 data (supplementary fig. S12, Supplementary Material online). However, same as for PIQMEE_all, inclusion or exclusion of the sequence data in the CLASSIC_random analyses did not make a difference for the distribution of Re. Finally, only the analyses with PIQMEE_all and CLASSIC_random mixed well when using five intervals for Re for patient 5 data (fig. 3).

Skyline plot of chronically infected HIV patient number 5 from Zanini et al. (2015), analyzed using the skyline model with five intervals for Re. We plot the distribution of the clock rate (A and C), and the tree height, the effective reproductive number Re (orange) as well as the sampling proportion (blue) (B and D). (A) and (B) correspond to analyses performed with the PIQMEE_all method. (C) and (D) The results of analyses with the CLASSIC_random method. The results of the CLASSIC_unique method are not displayed here because the analyses when the sequence data were excluded did not mix well. For all analyses, the death rate was fixed to 124 per year. The distributions obtained when excluding the sequence data are shown in lighter color, whereas the distributions obtained when the sequence data are included are plotted in darker shades. The gray interval shows the distribution of the tree height. Note that when excluding the sequence data, the distribution of tree height for the random subsample ranged from 5.6 to 14.6 years (the plot is truncated at 8 years). The clock rate in plots (A) and (C) is in units of substitutions/site/year. The time on the x-axes of plots (B) and (D) goes from the time of the last sample (0) backwards and is displayed in units of calendar years.
Across the successfully converged analyses (fig. 3 and supplementary figs. S11–S14, Supplementary Material online), the clock rate is estimated to be higher by the CLASSIC_random and CLASSIC_unique methods than the PIQMEE_all method. Additionally, the tree height and the Re estimates provided by the CLASSIC methods often differ from those obtained with PIQMEE_all method. When using one interval for Re, the Re estimates by the CLASSIC methods are always below the PIQMEE_all estimates. In addition, for patient number 5 and five intervals for Re (fig. 3), the PIQMEE method estimates that the HIV population was relatively stable (Re is around or slightly above 1) since the time of infection. The virus population was growing the fastest between 1 and 2.3 years before the last sample, and this growth continued, though at a slower pace in the year immediately preceding the last sample. However, the random data set does not capture this trend well. The CLASSIC_random method estimates that the virus population was increasing (
In summary, the analyses of the random and unique subsets of real HIV data sets provide very different parameter estimates as compared with the analyses of full data sets using the PIQMEE method, which is consistent with the analyses of the simulated data sets.
Discussion
Understanding the dynamics of pathogen dissemination is crucial for introducing appropriate measures to either stop or slow down its spread. There are two different scales at which information about pathogen dynamics can be gained: between- and within-host. For most infectious diseases, the between-host level is of importance. This is especially the case for infections that spread quickly among individuals, for example, the recent Ebola epidemic (Althaus 2014), the Zika virus epidemic (Ferguson et al. 2016), or the ongoing Covid-19 pandemic (Ferguson et al. 2020). If the infection is long-lasting with damaging effects to the host, such as HIV or HCV, insights into within-host disease progression and pathogen evolution are necessary for effectively personalizing treatment of individual patients (Wei et al. 1995; Perelson et al. 1996; Gray et al. 2011; Ribeiro et al. 2012).
The within-host populations are often sequenced using next-generation sequencing methods. Advances in sequencing technology provide us with an in-depth view into pathogen population diversity (Schuster 2008; Zagordi et al. 2011; Töpfer et al. 2013; Pandit and de Boer 2014). Many studies report that these within-host data sets contain many biological duplicates (Boeras et al. 2011; Töpfer et al. 2014; Wu et al. 2014). The development of sophisticated phylodynamic methods to fully exploit such data is now apt.
We have shown through simulations that analyses using only the unique sequences lead to biased parameter estimates. Additionally, when compared with analyses of full data set, analyses using a random subset of the data show decreased precision. Analyses of random subsets of data can lead to parameter estimates, for example, of the tree height, that do not correctly reflect the properties of the full data set. Furthermore, the analyses of full data sets with the classic method slowed down significantly with increasing number of sequences. In fact, only 20% of the analyses with 6,000 sequences reached ESS of 200 for all parameters, despite the fact that the complexity of the data set, as measured by the number of unique sequences, remained the same across all data sets.
In current Bayesian phylodynamic methods, each sequence corresponds to a separate tip in a tree. The inclusion of duplicate sequences means that tree space increases significantly but not in an informative way. Identical sequences can be freely exchanged on the tree without the phylogenetic likelihood and the tree prior changing. Such inefficient tree space exploration causes poor mixing of MCMC chains. Long chains are therefore needed to achieve satisfactory ESSs for all estimated parameters. This is especially problematic for, and may thus completely preclude, analysis of large data sets with many duplicate sequences.
We have therefore proposed a new method, PIQMEE, that takes advantage of the fact that duplicate sequences can be treated differently than separate tips in the tree, and thus only unique sequences are represented as tips in the tree. It is founded on the observation that the topology of duplicate sequences cannot be resolved, but the timing of their branching can, because this is informed by a combination of the phylogenetic likelihood
We have shown that the PIQMEE method is as accurate and as precise as the classic implementation of the likelihood and tree prior for the full data set. The PIQMEE method is capable of analyzing both homochronous and heterochronous samples. The estimates of parameters are the same as in analyses using the classic method for data sets evolving under the strict clock model.
It is also possible to use the PIQMEE method with relaxed clock models of sequence evolution. However, slight biases in the clock rate, the tree height, and the death rate may arise due to the PIQMEE assumption of each unique sequence arising only once within the tree. The inference will thus show slight deviations from the classic method results.
We have shown that the PIQMEE method tied with the strict clock model can process data sets of up to 21,000 sequences, of which 20 were unique and the rest were duplicates. This means that larger amounts of data can now be processed using Bayesian phylodynamic methods than was possible before. By being able to use all available sequences and not only the unique subset of them, the MRCA (proxy for the start of the infection) and the within-host pathogen population dynamics can be studied in more detail.
The composition of quasispecies populations, in terms of their unique sequence spectrum and frequency, is temporally dynamic (Bull et al. 2011; Pandit and de Boer 2014). This can be attributed to factors such as sequence adaptation to host immune system and population bottlenecks due to the patient’s drug regimen (Pybus and Rambaut 2009). Detailed information on the pathogen population composition and dynamics will yield insight into the dependency (of speed) of drug resistance development on population composition and history. The bottlenecks could be identified and correlated with events such as a change of the drug regimen. Similarly, if a sudden expansion in diversity of quasispecies was seen, this could be correlated with events such as a failure of the patient to stick to the treatment. The PIQMEE method is ideal for the study of large data sets with many duplicate sequences, for example, those obtained from chronic infections of a host with pathogen such as HIV or HCV. We have shown the usefulness of our method for such data sets by successfully applying it to large sequence data sets from patients chronically infected with HIV.
Within-host population dynamics is an important factor to include in the model if one wants to correctly reconstruct the between-host transmission network (Didelot et al. 2014; Romero-Severson et al. 2016; Didelot et al. 2017). Although for some studies using a single sequence from a rich within-host quasispecies seemed to be sufficient to approximate the date of infection (Poon et al. 2011), others claim that one sequence per patient may not be enough to allow for correct reconstruction of the transmission chain (Ypma et al. 2013; Worby et al. 2014; Volz et al. 2017). Corrections in the form of modeling the within-host population dynamics are necessary (Ypma et al. 2013; Didelot et al. 2014, 2017; Klinkenberg et al. 2017; Volz et al. 2017) even if several unique sequences are used (Vrancken et al. 2014). Only the correct transmission network can lead to reliable parameter estimates of the transmission dynamics (Ypma et al. 2013; Volz et al. 2017). Using a model that can accommodate many sequences from a single host while reconstructing the between-host transmission network should lead to more reliable estimates of the transmission network structure and dynamics. Our method can serve as a starting point for designing such models.
Several of the nested within- and between-host models treat each patient as a separate compartment. Often, the within-host dynamics are modeled using a coalescent approach, with coalescent events among sequences being allowed to happen only within, but not between these compartments (Hall et al. 2015; De Maio et al. 2016). For the PIQMEE method to be compatible with these approaches, the tree prior would need to be adapted to the coalescent framework. For a coalescent event between two identical sequences, the coalescent rate would depend not only on the number of identical sequences present in that host at that time point (see calculation of the factor γ for the tree prior in the Materials and Methods section) but also on the population size of the pathogen within that host.
Although the PIQMEE method has originally been conceived for quantifying within-host quasispecies evolution and dynamics, there is no reason why the method could not be applied to other data sets, such as between-host data sets, where many sequences are identical. The only difference to the within-host model would be the meaning of the parameters, for example, the Re in the within-host context would represent the ability of the pathogen to spread within the host, whereas for the between-host dynamics the Re would refer to the between-host spread.
A drawback of our PIQMEE method is that it requires nonrecombining sequences as input. Although recombination occurs in many pathogens (Simon-Loriere and Holmes 2011), only a few phylodynamic methods can currently handle such data sets (Bloomquist and Suchard 2010; Vaughan et al. 2017). Using portions of the genes or genomes that are known or assumed to not be recombining (González-Candelas et al. 2011; Smyth et al. 2014) is the usual workaround to this problem, and we would recommend this approach if using PIQMEE with such data sets.
In summary, the PIQMEE method is a significant step toward faster analysis and accurate estimation of population dynamics based on deeply sequenced quasispecies data sets, or any other large data set with a high proportion of duplicate sequences. The method could further be improved by implementing the phylogenetic likelihood calculation to work with BEAGLE (Suchard and Rambaut 2009; Ayres et al. 2019), by making corrections such that the relaxed clock models would be fully compatible with the PIQMEE model assumptions, and by allowing for recombination of sequences. Also, tree priors other than the birth–death-based BDsky model could be implemented to work with our tree structure allowing for analyses under various population dynamic models.
Materials and Methods
PIQMEE Method Description
New Tree Structure
As mentioned in the introduction, PIQMEE uses a reduced tree structure (fig. 4). It is composed of the tree of unique sequences, τu, an array of branching times (colored dashes on the branches of τu in fig. 4), and sampling times of all the duplicate sequences, denoted ζ. Note that each unique sequence (haplotype) corresponds to exactly one tip in the tree, and that tip represents the most recently sampled duplicate of the haplotype.

Reduced representation of a tree of three unique haplotypes (A–C) and their duplicates. The origin of the process (the tree) is denoted as X0. We call the internal nodes that join the unique haplotypes (X1, X2) the bifurcation nodes of the tree. The total count for each haplotype is noted below the corresponding tip in the tree. Note that the branching times of the duplicates are depicted as colored dashes on the tree. The notation of these times in the tree prior and the phylogenetic likelihood is shown to the right. We only show the times of the bifurcation nodes (solid lines leading to the time axis), and first few branching times of duplicates (dark dash-dotted lines). Due to space constraints, the rest of the branching times of duplicates are only shown as gray dash-dotted lines. The last sampling time of each haplotype is shown (
Let the number of unique sequences be nu. The letters X
The total count of copies of a given haplotype H is denoted as
Phylogenetic Model
We denote the time at which the bifurcation node
The lower case letters xi or
The substitution model parameters are composed of
As our tree τu only represents the topology of the unique sequences, we needed to rewrite the phylogenetic likelihood formulated using the Felsenstein’s peeling algorithm (Felsenstein 1981) to fit such a structure. Instead of the full tree, τ, the adapted Felsenstein likelihood takes τu and ζ as input.
The new phylogenetic likelihood formula accommodates our main assumption of each haplotype arising only once during the history of the process (i.e., in the tree) by imposing that the duplicates of a haplotype do not mutate at all. Thus, the probability of no mutation event on a branch with length t is
There are two direct consequences of our method’s assumption reflected in the formula. First, the haplotype sequence stays the same for the duration of the sum of the branch lengths defined by the distances from each of the duplicate sequence branching points X
Tree Prior Model
The tree prior within PIQMEE is an extension of the BDsky model (Stadler et al. 2013). Let λ be the birth rate, δ the death rate (referred to as “total rate of becoming noninfectious” in Stadler et al. [2013]), ψ the sampling rate through time, ρ the sampling probability at the time of special sampling effort,
The factors
Implementation in BEAST 2 Software
In order to implement the PIQMEE method in BEAST 2, we rewrote the basic tree class of BEAST 2 to accommodate the new tree structure. If the duplicate sequences of the same haplotype are sampled at different time points, all such time points are merged to a single representative tip to fulfill the assumption that each sequence arose only once during the process. The tip representing each haplotype in τu is assigned the date corresponding to the most recent sampling time of that haplotype sequence. The different sampling times, counts of the duplicates, and their branching times are tracked internally. The BDsky model (Stadler et al. 2013) was adapted for the phylodynamic inference, as was the general implementation of the phylogenetic likelihood function
Simulations and (Real Data) Analyses
For details on how we simulated the sequences, performed the analyses of simulated and real data sets, please refer to the Supplement sup1Supplementary Material online.
Acknowledgments
We would like to thank to Joëlle Barido-Sottani, Denise Kühnert, Timothy G. Vaughan, Alexei J. Drummond, and Stephen Crotty for helpful discussions. We thank Stephen Crotty and Chi Zhang for critically reading parts of the manuscript. We also would like to thank two anonymous reviewers for valuable comments. This work was supported by the ETH Zurich and European Research Council under the Seventh Framework Programme of the European Commission (PhyPD: Grant Agreement No. 335529). V.B. would additionally like to thank Swiss National Science foundation for funding (Early PostDoc.Mobility Fellowship grant number: P2EZP3_184543).
Author Contributions
V.B. and T.S. designed the method and the analyses. V.B. implemented and tested the method within the BEAST 2 framework. V.B. performed the analyses. V.B. and T.S. interpreted the results and wrote the manuscript.