Quantifying and Predicting Ongoing Human Immunodeficiency Virus Type 1 Transmission Dynamics in Switzerland Using a Distance-Based Clustering Approach

Table 1.

Network Properties That Were Calculated for Clusters and Nodes

Variable	Object	Description
Node degree	Node	Number of links
Past node growth	Node	Number of links gained over the past 3 years
Future node growth	Node	Number of links gained over the next 3 years
Closeness	Node	(n − 1)/Σ_{i=1}^{n} p_i, where n is the number of nodes in the cluster and p_i is the length of the shortest path from the node of interest to node i
Betweenness	Node	Number of shortest paths between each pair of nodes in the cluster that pass through the node of interest
Cluster size	Cluster	Number of nodes in the cluster
Past cluster growth	Cluster	Number of nodes gained over the past 3 years
Future cluster growth	Cluster	Number of nodes gained over the next 3 years
Density	Cluster	m/(n × (n − 1)/2), where m is the total number of links and n is the total number of nodes in the cluster
Transitivity	Cluster	Probability of 2 neighbors of the same node being linked directly
Median degree	Cluster	Median of all node degrees in the cluster
Median distance	Cluster	Median Tamura-Nei 93 distance of all the links in the cluster
Median closeness	Cluster	Median of the node closenesses
Median betweenness	Cluster	Median of the node betweennesses
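
As a rough illustration of how several of the quantities in Table 1 follow from their definitions, the sketch below computes them for a small hypothetical cluster stored as an adjacency dictionary. The function names and the toy cluster are assumptions made for illustration and are not the authors' implementation; betweenness is omitted for brevity, and closeness assumes a connected cluster.

```python
from collections import deque
from statistics import median


def shortest_path_lengths(adj, source):
    """Breadth-first search: number of links on the shortest path to each node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def node_degree(adj, node):
    """Number of links of a node."""
    return len(adj[node])


def closeness(adj, node):
    """(n - 1) divided by the summed shortest-path lengths to all other nodes."""
    dist = shortest_path_lengths(adj, node)
    n = len(adj)
    return (n - 1) / sum(d for v, d in dist.items() if v != node)


def density(adj):
    """m / (n * (n - 1) / 2) for an undirected cluster with m links and n nodes."""
    n = len(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) // 2
    return m / (n * (n - 1) / 2)


def transitivity(adj):
    """Share of neighbor pairs of a common node that are themselves linked."""
    closed = triples = 0
    for nbrs in adj.values():
        ordered = sorted(nbrs)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                triples += 1
                closed += b in adj[a]
    return closed / triples if triples else 0.0


# Hypothetical 4-node cluster: a triangle A-B-C plus node D linked only to C.
cluster = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}

median_degree = median(node_degree(cluster, v) for v in cluster)
```

In this toy cluster, node C is linked to every other node, so its closeness is 1, whereas the peripheral node D has a closeness of 3/5.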

Cluster- and Node-Level Growth Modeling

We used Poisson regression to model the number of nodes acquired by each cluster from 2014 to 2017 and to assess factors associated with cluster growth. One factor considered was past cluster growth (defined as the change in cluster size from 2011 to 2014). We used logistic regression to model whether individual nodes acquired new links within the first 3 years of being enrolled in the cohort (a binary outcome variable). We included variables that have been found to be predictive of cluster growth or clustering in similar work [4, 19, 28], such as age, sex, CD4 cell count, viral load, and transmission risk factor. More details on the models and included variables can be found in the Supplementary Material.

Cross-Validation

To predict whether a node will acquire a new link in the following 3 years, we compared logistic regressions and classification random forests [29] built on several subsets of variables. For each set of predictors, a logistic regression model and a random forest model were trained on the same training data. To supplement the manually chosen sets of predictors, we also employed automated variable selection as implemented in VSURF (variable selection using random forests) [30]. For more detailed information on the methods employed, please refer to the “Methods” section in the Supplementary Material.

RESULTS

Analyzing a total of 13 299 sequences with the distance-based clustering algorithm yielded 998 clusters that were highly robust over the observed time frame, making it possible to assess the dynamics of the clusters and their constituent nodes over the 13-year period (Supplementary Figures 2–4). Of the 13 299 included sequences, 4074 (30.6%) clustered with at least one other sequence at the time of sampling. At the last observed time point (31 December 2020), 5415 (40.7%) of all sequences were linked to at least one other sequence. Although intravenous (IV)-drug users represented 2572 (19.3%) of all sequences, they constituted 29% of the clustered sequences (Table 2). In contrast, patients in the heterosexual acquisition risk category represented 4782 (36%) of all sequences but only 25% of the clustered sequences, indicating potentially less frequent transmission in this subpopulation (χ² test, P < .001).
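
The over- and underrepresentation described above can be checked directly from the counts in Table 2. The following sketch computes the shares and a Pearson χ² statistic for a 2 × 2 table (risk group vs all others, clustered vs not clustered). The 2 × 2 layout is an assumption; the text does not specify how the reported test was set up.

```python
def share(part, total):
    """Percentage of `total` made up by `part`."""
    return 100 * part / total


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


clustered_total, not_clustered_total = 4074, 9225

# Counts taken from Table 2: (clustered, not clustered).
groups = {"IV-drug users": (1180, 1392), "Heterosexuals": (1017, 3765)}

for name, (cl, ncl) in groups.items():
    stat = chi_square_2x2(cl, ncl, clustered_total - cl, not_clustered_total - ncl)
    # With 1 degree of freedom, a statistic above 10.83 corresponds to P < .001.
    print(
        f"{name}: {share(cl + ncl, 13299):.1f}% of all sequences, "
        f"{share(cl, clustered_total):.1f}% of clustered sequences, "
        f"chi-square = {stat:.1f}"
    )
```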

Table 2.

Characteristics of the Patients Whose HIV-Pol Sequences Were Used in the Analysis

Characteristic	Clustered^a (n = 4074)	Not Clustered^a (n = 9225)	All (n = 13 299)
Age, y^a
  Mean (SD)	37.3 (9.73)	38.7 (10.7)	38.3 (10.4)
  Median (Q1, Q3)	36 (31, 42)	37 (31, 45)	37 (31, 44)
Sex, No. (%)
  Female	899 (22.1)	2844 (30.8)	3743 (28.1)
  Male	3175 (77.9)	6381 (69.2)	9556 (71.9)
Acquisition risk group, No. (%)
  MSM	1751 (43.0)	3588 (38.9)	5339 (40.1)
  Heterosexuals	1017 (25.0)	3765 (40.8)	4782 (36.0)
  Intravenous-drug users	1180 (29.0)	1392 (15.1)	2572 (19.3)
  Unknown	126 (3.1)	480 (5.2)	606 (4.6)
RNA concentration^b
  Median, copies/mL (Q1, Q3)	27 162 (3345, 110 023)	15 900 (790, 87 078)	19 605 (1260, 95 938)
  Missing, No. (%)	686 (16.8)	2054 (22.3)	2740 (20.6)
CD4 cell count^b
  Median, cells/μL (min, max)	379 (222, 568)	340 (180, 527)	350 (191, 540)
  Missing, No. (%)	68 (1.7)	184 (2.0)	252 (1.9)

Abbreviation: MSM, men who have sex with men.

^a At the time when the sample for the genotypic resistance test was taken.

^b At the follow-up visit closest to the sampling for the genotypic resistance test.

Most clusters had fewer than 10 nodes at the end of 2020 (943/998, 94.5%). The largest identified cluster contained 1577 nodes, 43.7% of which were categorized as IV-drug users and 24.0% as heterosexuals.

The obtained clusters exhibited a large heterogeneity in terms of composition, size, and growth patterns (Figure 1 and Figure 2; and Supplementary Figures 4–7). Of 575 clusters identified up to 31 December 2007, only 134 (23.3%) gained any new nodes in the following 13 years, of which only 33 (5.7%) gained 5 or more new nodes (Figure 2A). Despite the small fraction of clusters that gained 5 or more nodes, they accounted for 443 (70.9%) of all 625 nodes that were gained by all 575 clusters collectively. The clusters that gained 5 or more nodes were disproportionately MSM clusters (27 of 33, 81.8%). We found a strong correlation between cluster size in 2007 and number of new nodes acquired up to 2020 (Spearman r = 0.72, P < .001; Figure 2A). Similarly, of the 9308 SHCS patients with sequences sampled as of 2007, only 1079 (11.6%) gained links to new sequences up to the year 2020 (Figure 2B). Most patients that acquired new links only gained very few: only 206 (2.2%) gained links to 3 or more new sequences, and they accounted for 1103 (50.4%) of the total 2190 new links over the studied period.

Figure 1.

Graph representations of 4 different clusters (for a larger selection of clusters, see Supplementary Figure 7). Each node represents a single patient. Two linked nodes are patients whose HIV pol sequences have a Tamura-Nei 93 genetic distance of less than or equal to 0.01. The sample year refers to the year when the sample for the genotypic drug resistance test was taken. A small amount of noise was added to the coordinates of each node for better readability of clusters with many overlapping links. Abbreviation: MSM, men who have sex with men.
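
The linking rule in the caption (two sequences are linked when their distance is at most 0.01, and clusters are the resulting connected groups) can be sketched as follows. This is a simplified illustration that takes precomputed pairwise distances as input; the function name and the toy distances are hypothetical, and the sketch does not reproduce how HIV-TRACE computes Tamura-Nei 93 distances.

```python
def clusters_from_distances(pairwise, threshold=0.01):
    """Link pairs with distance <= threshold; return components with >= 2 members."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), distance in pairwise.items():
        root_a, root_b = find(a), find(b)
        if distance <= threshold and root_a != root_b:
            parent[root_a] = root_b

    groups = {}
    for node in parent:
        groups.setdefault(find(node), set()).add(node)
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)


# Hypothetical pairwise distances between five sequences.
toy_distances = {
    ("s1", "s2"): 0.004,
    ("s2", "s3"): 0.010,
    ("s1", "s3"): 0.013,
    ("s3", "s4"): 0.020,
    ("s4", "s5"): 0.008,
}

print(clusters_from_distances(toy_distances))
```

Note that s1 and s3 end up in the same cluster even though their direct distance exceeds the threshold, because both are linked to s2.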

Figure 2.

A, Cluster growth from 31 December 2007 to 31 December 2020 as a function of cluster size in 2007. Clusters where the most common acquisition risk group did not constitute >50% of all members were assigned to a combined (hyphenated) category consisting of the 2 most common risk groups in alphabetical order. B, Node growth from 31 December 2007 to 31 December 2020 as a function of node degree, that is, the number of links, in 2007. Abbreviations: Het, heterosexuals; IDU, intravenous drug users; MSM, men who have sex with men.
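
The category rule described for panel A can be written down compactly. The sketch below is an assumption-laden illustration: the function name is hypothetical, and ties in group frequency are broken alphabetically, which the caption does not specify.

```python
from collections import Counter


def cluster_risk_category(member_groups):
    """Majority risk group (>50% of members), else the 2 most common groups, hyphenated."""
    counts = Counter(member_groups)
    # Most common first; alphabetical order breaks ties (an assumption).
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    top_group, top_count = ranked[0]
    if top_count > len(member_groups) / 2:
        return top_group
    return "-".join(sorted(group for group, _ in ranked[:2]))


print(cluster_risk_category(["MSM", "MSM", "Het"]))
print(cluster_risk_category(["MSM", "Het"]))
```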

When modeling cluster growth using Poisson regression (with log₁₀ cluster size in the year 2014 as an offset), we found that past growth of a cluster was a good predictor of future growth (Figure 3A; adjusted incidence rate ratio [aIRR], 5.11 [95% confidence interval (CI), 2.62–9.95] and aIRR, 11.03 [95% CI, 6.44–18.88] for past cluster growth of 2–3 and ≥4 nodes, respectively). Besides past growth, no other variable yielded a statistically significant estimate in the multivariable model. Clusters with older individuals had lower growth rates in the univariable models (aIRR, 0.62 [95% CI, .42–.91] and aIRR, 0.17 [95% CI, .11–.29] for median ages 40–49 and ≥50 years, respectively), as did clusters made up of mostly heterosexuals (aIRR, 0.38 [95% CI, .24–.60]), clusters with more than 90% virally suppressed patients (aIRR, 0.48 [95% CI, .37–.62]), and clusters with high rates of condom use with occasional partners (aIRR, 0.15 [95% CI, .07–.31]). On the other hand, clusters with more patients using non-IV drugs had significantly higher growth rates in the univariable model (aIRR, 2.18 [95% CI, 1.41–3.37]). Because none of these associations remained significant in the multivariable model, the effect of these variables appears to be captured by past cluster growth, which acts as a proxy for behavioral and demographic risk factors. Similar results were obtained when restricting the analysis to clusters where MSM was the most common acquisition risk category (Supplementary Figure 8) and when varying the time period considered (Supplementary Figures 9–12).
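
To make the notion of an incidence rate ratio with an offset concrete, the sketch below handles the simplest possible case: a Poisson model with a single binary predictor, for which the maximum likelihood estimate has a closed form (ratio of events per unit of exposure in the two groups). It is not the multivariable model reported here. The data are hypothetical, the function name is an assumption, and cluster size itself is used as the exposure (equivalent to a natural-log offset), which simplifies the log₁₀ offset described in the text.

```python
import math


def poisson_rate_ratio(clusters):
    """Unadjusted incidence rate ratio with a Wald 95% CI.

    clusters: iterable of (new_nodes, exposure, exposed), with exposed in {0, 1}.
    """
    events = [0, 0]
    exposure = [0.0, 0.0]
    for new_nodes, size, exposed in clusters:
        events[exposed] += new_nodes
        exposure[exposed] += size
    irr = (events[1] / exposure[1]) / (events[0] / exposure[0])
    # Standard error of log(IRR) for two Poisson totals.
    se_log = math.sqrt(1 / events[1] + 1 / events[0])
    lower = math.exp(math.log(irr) - 1.96 * se_log)
    upper = math.exp(math.log(irr) + 1.96 * se_log)
    return irr, (lower, upper)


# Hypothetical clusters: (nodes gained 2014-2017, size in 2014, past growth >= 2 nodes).
toy_clusters = [
    (0, 2, 0), (1, 5, 0), (0, 3, 0), (1, 10, 0),
    (4, 6, 1), (6, 12, 1), (2, 2, 1),
]

irr, (ci_low, ci_high) = poisson_rate_ratio(toy_clusters)
print(f"IRR = {irr:.2f} (95% CI, {ci_low:.2f}-{ci_high:.2f})")
```

With so few hypothetical events the interval is wide; adjusted estimates such as those in Figure 3A require fitting the full multivariable model.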

Figure 3.

A, Factors associated with the number of nodes gained by a cluster over the span of 3 years, as assessed by a Poisson regression model. Parameters estimated and 95% CI from univariable and multivariable regressions are represented. Past cluster growth was calculated as the number of nodes gained from 31 December 2011 to 31 December 2014, and future cluster growth was calculated as the number of nodes gained from 31 December 2014 to 31 December 2017. B, Factors associated with the gain of new links for a node within 3 years, as modeled by a logistic regression. Odds ratio and 95% CI from univariable and multivariable models are represented. The outcome was a binary variable based on the number of links gained in the first 3 years after the date of the genotypic resistance test. Abbreviations: CI, confidence interval; IV, intravenous; MSM, men who have sex with men; ref, reference.

To quantify the relevant factors of growth at the individual node level, that is, a node's risk of acquiring new links over time, we specified a logistic regression model with a similar set of variables to predict the addition of new links to a given node within 3 years of being sequenced (Figure 3B). Node degree had a significant effect on the outcome, with larger node degrees being associated with a higher probability of gaining new links (odds ratio [OR], 2.41 [95% CI, 1.94–3.00], OR, 4.98 [95% CI, 3.98–6.24], and OR, 11.35 [95% CI, 8.34–15.45] for node degree 1, 2–4, and ≥5, respectively). Accordingly, removing node degree from the regression model led to a significantly worse model fit (likelihood ratio test, P < .001). In other words, the growth of the network occurs by preferential attachment, meaning that more connected nodes acquire more new links, which also explains the approximately scale-free pattern observed for the degree distribution of the whole network (Supplementary Figure 13). Besides node degree, several epidemiological and virological factors were associated with acquisition of new links: patients between 40 and 49 years old were at a significantly lower risk than younger patients (OR, 0.52 [95% CI, .42–.64]), as were IV-drug users compared to MSM (OR, 0.70 [95% CI, .54–.89]). Viral loads above 10 000 copies/mL were associated with the gain of new links (OR, 1.35 [95% CI, 1.05–1.74]), as were CD4 cell counts above 300 cells/µL (OR, 1.59 [95% CI, 1.31–1.93]) and inconsistent condom use with occasional partners (OR, 1.37 [95% CI, 1.13–1.67]). Restricting the analysis to MSM patients yielded similar results (Supplementary Figure 14), as did adding the enrolment year as a linear effect (Supplementary Figure 15) and random subsampling of 75% or 50% of the available sequences (Supplementary Figures 16 and 17).
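
For readers less familiar with odds ratios, the following sketch shows how an unadjusted OR and its Wald 95% CI are obtained from a 2 × 2 table of counts. The counts are hypothetical and the function name is an assumption; the ORs reported above come from a multivariable logistic regression and cannot be reproduced from a single table.

```python
import math


def odds_ratio(exposed_yes, exposed_no, ref_yes, ref_no):
    """Unadjusted odds ratio with a Wald (Woolf) 95% confidence interval."""
    estimate = (exposed_yes * ref_no) / (exposed_no * ref_yes)
    se_log = math.sqrt(1 / exposed_yes + 1 / exposed_no + 1 / ref_yes + 1 / ref_no)
    lower = math.exp(math.log(estimate) - 1.96 * se_log)
    upper = math.exp(math.log(estimate) + 1.96 * se_log)
    return estimate, (lower, upper)


# Hypothetical counts: nodes with degree >= 1 (60 gained a link, 240 did not)
# versus nodes with degree 0 (50 gained a link, 650 did not).
estimate, (ci_low, ci_high) = odds_ratio(60, 240, 50, 650)
print(f"OR = {estimate:.2f} (95% CI, {ci_low:.2f}-{ci_high:.2f})")
```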

Model Comparison

We trained multiple models using 5 different sets of predictors (specified in Table 3 and Supplementary Table 2) with the goal of identifying the best model for predicting whether a certain node is going to acquire a link to a new node within 3 years. To assess the performance of these models, we performed a 10-fold cross-validation and compared the median areas under the curve (AUCs) of the receiver operating characteristic (ROC)-curves based on the model predictions.
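
The evaluation scheme (10-fold cross-validation, one AUC per fold, summarized by the median) can be sketched with a deliberately trivial stand-in model. Everything below is an assumption for illustration: the data are hypothetical, folds are assigned by index rather than at random, and the "model" is just the observed share of nodes gaining a link per node degree, not the logistic regressions or random forests used in the study.

```python
from statistics import median


def auc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def fit_rate_by_degree(rows):
    """Toy model: share of training nodes that gained a link, per node degree."""
    gained, total = {}, {}
    for degree, y in rows:
        gained[degree] = gained.get(degree, 0) + y
        total[degree] = total.get(degree, 0) + 1
    return {d: gained[d] / total[d] for d in total}


def cross_validated_aucs(rows, k=10):
    """Train on k - 1 folds, score the held-out fold, and return one AUC per fold."""
    aucs = []
    for fold in range(k):
        train = [r for i, r in enumerate(rows) if i % k != fold]
        test = [r for i, r in enumerate(rows) if i % k == fold]
        model = fit_rate_by_degree(train)
        scores = [model.get(degree, 0.0) for degree, _ in test]
        aucs.append(auc(scores, [y for _, y in test]))
    return aucs


# Hypothetical data: (node degree, gained a link within 3 years), 50 nodes per degree.
positives_per_degree = {0: 5, 1: 15, 2: 25, 3: 40}
rows = [(d, int(i < n_pos)) for d, n_pos in positives_per_degree.items() for i in range(50)]

aucs = cross_validated_aucs(rows)
print(f"median AUC over {len(aucs)} folds: {median(aucs):.2f}")
```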

Table 3.

Predictor Sets Used in the Model Comparison

Predictor Set	Network-Based Predictors	Demographical Predictors	Clinical Predictors
Mix	Node degree, past cluster growth, cluster size	Acquisition risk group, registration center, age, sex	RNA concentration
Cluster	Node degree, past cluster growth, cluster size, median closeness in the cluster, node closeness, cluster density, median distance in the cluster	None	None
Patient	None	Acquisition risk group, registration center, age, sex	RNA concentration

Mix, cluster, and patient are predictor sets with mixed predictors, only cluster-based predictors, and only demographical and clinical predictors, respectively. The model comparison further included 2 predictor sets generated with an automatic variable selection algorithm, which are described in Supplementary Table 2.

Among models with preselected predictor sets (Table 3), models that used both network and patient characteristics yielded the most accurate predictions (Figure 4). Random forests and logistic regression models performed similarly in all cases except one. Notably, restricting the set of predictors to demographical and clinical variables (patient predictor set) resulted in a large drop in accuracy: from the mix to the patient predictor set, the median AUC decreased from 0.78 to 0.67 for the logistic regression and from 0.76 to 0.55 for the random forest. On the other hand, restricting the set of predictors to variables pertaining to the topological characteristics of clusters and nodes (cluster predictor set) did not decrease accuracy to the same degree, as the median AUC was 0.76 both for the logistic regression and the random forest. Accordingly, variables with the highest variable importance in the mix random forest model were cluster characteristics, namely node degree, past cluster growth, and cluster size (Supplementary Figure 18).

Figure 4.

Comparison of the predictive abilities of 10 different classification models. These models are based on the combination of 5 different sets of predictors (described in Table 3 and Supplementary Table 2) with 2 different modeling methods: logistic regression and random forest. Each of these combinations was assessed in a 10-fold cross-validation. Predictive ability was assessed by comparing the areas under the curve (AUCs) of the receiver operating characteristic (ROC) curves of the models.

Additionally, we identified 2 more subsets of predictors using variable selection using random forests (VSURF) [30]. From a mix of demographic, clinical, and cluster topology-related predictors, VSURF repeatedly selected only the last category of variables (Supplementary Table 2). The performance of random forests based on the variables selected by VSURF was similar to that of the mix and cluster models, with median AUCs of 0.77 and 0.74 for VSURF_interpretation and VSURF_prediction, respectively (Figure 4). ROC curves for each model are displayed in Supplementary Figure 19.

DISCUSSION

Here we combined the evolutionary-distance-based clustering method HIV-TRACE [16] with longitudinal cohort data and statistical learning approaches to analyze cluster growth dynamics in the Swiss HIV Cohort Study. In concordance with previous work [4], we found that, in the timespan from 2007 to 2020, only a minority of the HIV clusters in Switzerland were growing. Similarly, only a small fraction of patients enrolled up to the year 2007 formed any new links, which would be an indication of onward transmission of HIV. Consistent with earlier work [4], we found that the fraction of virally suppressed patients and behavioral risk factors were predictive of cluster growth. When adjusting for network characteristics, however, these associations were no longer statistically significant, suggesting that part of the information provided by these variables can be captured by the characteristics of the network.

When modeling the risk of acquiring new links on the patient level, we found that viral loads of more than 10 000 copies/mL were associated with a high risk of gaining links, adding to the evidence that suppressing viral loads is essential for HIV prevention [5, 31–34]. Additionally, we observed a subgroup of MSM with a sudden burst of growth early in the studied period, indicating that this subgroup or their undiagnosed or HIV–negative contacts might benefit from targeted preventive efforts (Figure 2B, and Supplementary Figures 4 and 5).

As has been partly shown previously in a US-based study [35], we found that cluster size and previous growth activity are predictive of future growth. When comparing demographical, clinical, and behavioral variables with network-based variables, we observed a significant improvement in the predictive capacity of both cluster-level and patient-level growth models when network-based variables were added as predictors. Keeping in mind the goal of prospectively analyzing the state of HIV epidemics, these variables derived from the network topology provided a substantial increase in predictive accuracy that should not be ignored. The predictive power of past cluster growth and node degree, the small fraction of active clusters and patients, and the degree distributions observed in the clusters also suggest that some degree of preferential attachment is responsible for the generation of the clusters. This underlines the need for approaches that allow the precise and timely identification of foci of ongoing transmission to enable preventive action. The predictive models established in this work could thus form the basis of such precision public health approaches to HIV prevention.

One limitation of our analyses is that they depend on an ad hoc choice of clustering threshold. Here, in line with previous work [19, 36, 37], we chose a threshold of 0.01. This conservative threshold maximizes the number of clusters in our cohort and thereby provides the best resolution for our analysis. In this way, we avoid the two extremes of an unnecessarily strict threshold, which would fail to cluster sequences even if they correspond to real transmission pairs, and an overly lenient threshold, which would combine even very different sequences into large uninformative clusters that do not reflect the underlying transmission network. In addition, a relatively strict threshold was preferable in the case of the patient-based prediction models. We also conducted a sensitivity analysis showing that in this study, results were robust to the threshold choice. Another limitation is that we cannot establish individual transmission events between linked patients. Furthermore, the SHCS contains only part of the Swiss HIV–positive population, which means that the analyzed clusters are missing patients who are not enrolled in the cohort. Consequently, the appearance of new links between patients of the SHCS can be caused by undiagnosed or otherwise not enrolled PWH. However, with close to 21 000 total patients and nearly 10 000 patients under follow-up as of 2020, the SHCS is representative of the Swiss HIV epidemic [38]. Another limitation is the use of the first sequence per patient only, which does not account for intrapatient evolution of the virus. Because there was only a single sequence available for most (63.0%) SHCS patients for whom a genotypic resistance test had been performed, this was a practical choice with the added benefit of maximizing the long-term robustness of the clusters generated by HIV-TRACE. Future extensions of this study could take this intrapatient evolution into account and thereby model the real epidemic more precisely, although this is contingent on the availability of sequence data from a large number of longitudinally sampled patients.

Despite these limitations, this study provides insight into the long-term dynamics of cluster growth of HIV in Switzerland. It makes use of the densely sampled SHCS, representing a significant and representative part of the Swiss HIV–positive population. The clustering method used makes longitudinal follow-up on individual clusters feasible and opens the possibility of prospective analyses performed in real time. Additionally, it demonstrates the importance of considering cluster-derived variables in addition to demographic and clinical variables when modeling cluster and individual growth dynamics.

In conclusion, we present new insights into the long-term dynamics of HIV cluster growth including the value of using cluster-based variables in predicting future growth both on the level of clusters and individual patients in the Swiss HIV epidemic.

Supplementary Data

Supplementary materials are available at The Journal of Infectious Diseases online. Consisting of data provided by the authors to benefit the reader, the posted materials are not copyedited and are the sole responsibility of the authors, so questions or comments should be addressed to the corresponding author.

Notes

Acknowledgments. We thank the participants of the Swiss HIV Cohort Study (SHCS); the physicians and study nurses for excellent patient care; the resistance laboratories for high-quality genotypic drug resistance testing; the SHCS data center (A. Scherrer, K. Kusejko, J. Meier, Y. Schäfer, and O. Follonier) for excellent data management; and D. Perraudin and M. Amstad for administrative assistance. The data were gathered by the 5 Swiss university hospitals, 2 cantonal hospitals, 15 affiliated hospitals, and 36 private physicians (listed in Rodger et al [34]).

Members of the Swiss HIV Cohort Study. I. Abela, K. Aebi-Popp, A. Anagnostopoulos, M. Battegay, E. Bernasconi, D. L. Braun, H. C. Bucher, A. Calmy, M. Cavassini, A. Ciuffi, G. Dollenmaier, M. Egger, L. Elzi, J. Fehr, J. Fellay, H. Furrer, C. A. Fux, H. F. Günthard (President of the SHCS), A. Hachfeld, D. Haerry (Deputy of Positive Council), B. Hasse, H. H. Hirsch, M. Hoffmann, I. Hösli, M. Huber, C. R. Kahlert (Chairman of the Mother and Child Substudy), L. Kaiser, O. Keiser, T. Klimkait, R. D. Kouyos, H. Kovari, K. Kusejko (Head of Data Centre), G. Martinetti, B. Martinez de Tejada, C. Marzolini, K. J. Metzner, N. Müller, J. Nemeth, D. Nicca, P. Paioni, G. Pantaleo, M. Perreau, A. Rauch (Chairman of the Scientific Board), P. Schmid, R. Speck, M. Stöckle (Chairman of the Clinical and Laboratory Committee), P. Tarr, A. Trkola, G. Wandeler, and S. Yerly.

Financial support. This work was supported by the Swiss National Science Foundation (grant numbers 33CS30_177499 to H. F. G. in the framework of the Swiss HIV Cohort Study [SHCS]; 324730B_179571 and 310030_141067 to H. F. G.; and 324730_207957 and BSSGI0_155851 to R. D. K.); the Yvonne-Jacob Foundation (to H. F. G.); the University of Zurich Clinical Research Priority Program for Viral Infectious Disease, the Zurich Primary HIV Infection Cohort Study (to H. F. G.); and an unrestricted research grant from Gilead Sciences (to the SHCS Research Foundation).

References

1. Rehle TM, Hallett TB, Shisana O, et al. A decline in new HIV infections in South Africa: estimating HIV incidence from three national HIV surveys in 2002, 2005 and 2008. PLoS One 2010; 5:e11094.

2. Joint United Nations Programme on HIV/AIDS (UNAIDS). 90-90-90: an ambitious treatment target to help end the AIDS epidemic. Geneva, Switzerland: UNAIDS, 2014.

3. Kusejko K, Marzel A, Hampel B, et al. Quantifying the drivers of HIV transmission and prevention in men who have sex with men: a population model-based analysis in Switzerland. HIV Med 2018; 19:688–97.

4. Bachmann N, Kusejko K, Nguyen H, et al. Phylogenetic cluster analysis identifies virological and behavioral drivers of human immunodeficiency virus transmission in men who have sex with men. Clin Infect Dis 2021; 72:2175–83.

5. Cohen MS, Chen YQ, McCauley M, et al. Prevention of HIV-1 infection with early antiretroviral therapy. N Engl J Med 2011; 365:493–505.

6. Celum C, Baeten J. PrEP for HIV prevention: evidence, global scale-up, and emerging options. Cell Host Microbe 2020; 27:502–6.

7. UNAIDS. Global HIV and AIDS statistics: fact sheet. Geneva, Switzerland: UNAIDS, 2021.

8. Dennis AM, Herbeck JT, Brown AL, et al. Phylogenetic studies of transmission dynamics in generalized HIV epidemics: an essential tool where the burden is greatest? J Acquir Immune Defic Syndr 2014; 67:181–95.

9. Sivay MV, Hudelson SE, Wang J, et al. HIV-1 diversity among young women in rural South Africa: HPTN 068. PLoS One 2018; 13:e0198999.

10. Castro-Nallar E, Pérez-Losada M, Burton GF, Crandall KA. The evolution of HIV: inferences using phylogenetics. Mol Phylogenet Evol 2012; 62:777–92.

11. Oster AM, France AM, Mermin J. Molecular epidemiology and the transformation of HIV prevention. JAMA 2018; 319:1657–58.

12. Grabowski MK, Herbeck JT, Poon AFY. Genetic cluster analysis for HIV prevention. Curr HIV/AIDS Rep 2018; 15:182–9.

13. Beloukas A, Psarris A, Giannelou P, Kostaki E, Hatzakis A, Paraskevis D. Molecular epidemiology of HIV-1 infection in Europe: an overview. Infect Genet Evol 2016; 46:180–9.

14. Peeters M, Jung M, Ayouba A. The origin and molecular epidemiology of HIV. Expert Rev Anti Infect Ther 2013; 11:885–96.

15. Hassan AS, Pybus OG, Sanders EJ, Albert J, Esbjörnsson J. Defining HIV-1 transmission clusters based on sequence data. AIDS 2017; 31:1211–22.

16. Kosakovsky Pond SL, Weaver S, Leigh Brown AJ, Wertheim JO. HIV-TRACE (transmission cluster engine): a tool for large scale molecular epidemiology of HIV-1 and other rapidly evolving pathogens. Mol Biol Evol

2018

;

35

:

1812

–

9

.

17

Xia

Q

,

Wertheim

JO

,

Braunstein

SL

,

Misra

K

,

Udeagu

CC

,

Torian

LV

.

Use of molecular HIV surveillance data and predictive modeling to prioritize persons for transmission-reduction interventions

.

AIDS

2020

;

34

:

459

–

67

.

18

Oster

AM

,

France

AM

,

Panneer

N

, et al.

Identifying clusters of recent and rapid HIV transmission through analysis of molecular surveillance data

.

J Acquir Immune Defic Syndr

2018

;

79

:

543

–

50

.

19

Wertheim

JO

,

Kosakovsky Pond

SL

,

Forgione

LA

, et al.

Social and genetic networks of HIV-1 transmission in New York City

.

PLOS Pathog

2017

;

13

:

e1006000

.

20

Villandre

L

,

Stephens

DA

,

Labbe

A

, et al.

Assessment of overlap of phylogenetic transmission clusters and communities in simple sexual contact networks: applications to HIV-1

.

Plos One

2016

;

11

:

e0148459

.

21

Scherrer

AU

,

Traytel

A

,

Braun

DL

, et al.

Cohort profile update: the Swiss HIV cohort study (SHCS)

.

Int J Epidemiol

2022

;

51

:

33

–

4j

.

22

Tamura

K

,

Nei

M

.

Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees

.

Mol Biol Evol

1993

;

10

:

512

–

26

.

PubMed

23

Fujimoto

K

,

Bahl

J

,

Wertheim

JO

, et al.

Methodological synthesis of Bayesian phylodynamics, HIV-TRACE, and GEE: HIV-1 transmission epidemiology in a racially/ethnically diverse Southern U.S. context

.

Sci Rep

2021

;

11

:

3325

.

24

Chato

C

,

Kalish

ML

,

Poon

AFY

.

Public health in genetic spaces: a statistical framework to optimize cluster-based outbreak detection

.

Virus Evol

2020

;

6

:

veaa011

.

25

Csardi

G

,

Nepusz

T

.

The iGraph software package for complex network research

.

InterJournal

2006

;

1695

:

1

–

9

.

26

R Core Team

.

R: A language and environment for statistical computing

.

Vienna

,

Austria

:

R Foundation for Statistical Computing

,

2020

.

Google Preview

27

Wickham

H

.

ggplot2: elegant graphics for data analysis

.

New York

,

NY

;

Springer-Verlag

,

2016

.

28

Wertheim

JO

,

Murrell

B

,

Mehta

SR

, et al.

Growth of HIV-1 molecular transmission clusters in New York City

.

J Infect Dis

2018

;

218

:

1943

–

53

.

29

Breiman

L

.

Random forests

.

Mach Learn

2001

;

45

:

5

–

32

.

Crossref

30

Genuer

R

,

Poggi

JM

,

Tuleau-Malot

C

.

VSURF: an R package for variable selection using random forests

.

R J

2015

;

7

:

19

–

33

.

Crossref