Abstract

Background

The detection of communicable disease clusters in genomic surveillance data typically involves the application of rule-based signaling criteria, which can be arbitrary. In contrast, scan statistics that are used for spatiotemporal cluster detection can flexibly scan in calendar time, and scan statistics that are used for pharmacovigilance can flexibly scan along hierarchical tree structures that are based on diagnosis codes.

Methods

New York City (NYC) Health Department staff applied tree-temporal scan statistics prospectively to genomic surveillance data with a hierarchical nomenclature for COVID-19 and salmonellosis cases that were diagnosed among NYC residents. We searched weekly for recent case increases at any granularity, from large phylogenetic branches to small groups of indistinguishable isolates. Using free and open-source TreeScan software, we looked for emerging SARS-CoV-2 variants based on Pango lineages during August 2021–November 2023 and emerging clusters of Salmonella isolates based on allele codes during November 2022–November 2023.

Results

The SARS-CoV-2 Omicron subvariant EG.5.1 first signaled as locally emerging on 22 June 2023, 7 weeks before the World Health Organization designated it as a variant of interest. During 1 year of salmonellosis analyses, TreeScan detected 15 credible clusters that were worth investigating for common exposures and two data-quality issues for correction.

Conclusion

A challenge was the maintenance of timely and specific lineage assignments, and a limitation was that genetic distances between tree nodes were not considered. By automatically sifting through genomic data and generating ranked shortlists of nodes with statistically unusual recent case increases, TreeScan assisted in detecting emerging variants and clusters of communicable diseases and in prioritizing them for investigation.

Key Messages
  • Current practices for the detection of emerging disease clusters and variants in genomic surveillance data often rely on arbitrary criteria and time-consuming, manual laboratory data review.

  • Tree-temporal scan statistics can detect emerging clusters by flexibly scanning prospectively in calendar time while also flexibly scanning along a tree structure, such as for genetic relatedness.

  • We detected credible disease clusters for investigation and data-quality problems for correction, which helped health department officials to focus investigative resources.

Whole-genome sequencing (WGS) data are increasingly used by public health officials for communicable disease surveillance and cluster detection [1]. For example, SARS-CoV-2 variant surveillance allows officials to monitor the effects of new variants on COVID-19 disease severity, transmission, diagnostics, therapeutics, and immunity from prior infections and vaccinations [2]. In the USA, variant data have guided decisions around COVID-19 vaccine composition and the revocation of emergency-use authorizations for monoclonal antibody therapies with decreased clinical efficacy [3]. Such data have been used in New York City (NYC) to summarize epidemiologic characteristics of newly emerging variants [4], assess illness severity [5], and elucidate community transmission patterns [6]. Timely knowledge of emerging variants with increased transmissibility or immune escape can prompt actions to limit spread. Such actions are particularly important in congregate settings and for populations who are at increased risk of severe illness, such as people who are older or living with comorbid conditions [7].

When many SARS-CoV-2 variants and recombinants co-circulate, a key challenge is deciding in near real time which ones to closely monitor over which time increments. Bioinformatic methods and phylodynamic models can be used to estimate variant-specific growth rates and prioritize variants [8], although this can be onerous to operationalize with many co-circulating variants. In the COVID data tracker that was developed by the US Centers for Disease Control and Prevention (CDC), lineages are displayed if they either account for >1% of sequences nationally during a 2-week period or have been classified as a variant of interest or concern [2, 9].

Moreover, WGS-based subtyping is revolutionizing population-based enteric bacterial disease surveillance. When officials can quickly identify patients who are infected with genetically similar pathogens, the probability of identifying a common exposure and preventing further infections is increased [10]. The Public Health Laboratory (PHL) at the NYC Health Department performs core genome multilocus sequence typing (cgMLST) by using prescribed CDC PulseNet methods [11]. To detect Salmonella clusters, PHL staff compare cgMLST profiles of sequences that are stored in a local database of isolates (pathogens isolated from clinical specimens) that have been tested and sequenced at PHL. This process excludes isolates that are sequenced out-of-jurisdiction and also requires time-consuming manual input, which can lead to missed or delayed cluster detection.

Enteric disease cluster detection is typically operationalized by using static, rule-based definitions in which isolates from patients in a geographic area are grouped within fixed cutoffs of genetic relatedness and time [12, 13]. A commonly used working cluster definition is ≥3 Salmonella clinical isolates within a 60-day window and within 10 alleles, of which ≥2 cases are within ≤5 alleles [11, 14]. Such rules, which can be arbitrary, vary across pathogens according to genetic diversity, ecology, and prevalence [15]. Existing cluster detection tools [16–18] do not analyse the extent to which cases are spread out versus concentrated in time, despite the importance of temporal clustering for cluster detection and investigation.

In contrast, space–time scan statistics search flexibly in both space and time [19, 20]. Several public health institutions previously used a rule-based aberration detection method (the historical limits method) [21]; CDC discontinued this approach in 2020 [22]. In 2014, the NYC Health Department transitioned to using scan statistics to quickly detect unusual clusters of any geographical size or duration for many reportable communicable diseases [23]. We wish to similarly search WGS data in a flexible manner in time. However, rather than flexibly searching for increases in geographical location and size, we wish to be flexible in the location of the patients’ WGS isolates on a phylogenetic tree and the granularity of nodes on that tree.

Tree-temporal scan statistics [24, 25] are used by CDC, the US Food and Drug Administration, and academic scientists to detect and evaluate unanticipated adverse reactions to pharmaceutical drugs and vaccines [26–28]. In this pharmacovigilance context, potential adverse events can be classified in a tree structure based on the International Classification of Diseases, Tenth Revision (ICD-10) diagnosis codes. The codes are grouped hierarchically, reflecting general or specific disease conditions that affect different body systems, with related diagnoses located on the same tree branch. Unusual increases in diagnoses at any level of specificity can be detected in sequential analyses, at any length of time after vaccine or drug administration.

Herein, we marry ideas of flexibly scanning prospectively in calendar time (as for spatiotemporal cluster detection) with flexibly scanning along a hierarchical tree structure (as is conducted for pharmacovigilance). We thereby establish an “innovation at the edge” of infectious disease epidemiology and pharmacoepidemiology [29]. We describe the real-time application of prospective tree-temporal scan statistics by the NYC Health Department. We selected SARS-CoV-2 and Salmonella because of their substantial disease burdens and availability of genomic surveillance data with a hierarchical nomenclature, with the potential to guide local public health actions.

Methods

Genomic surveillance data

SARS-CoV-2

PHL and other laboratories perform WGS on a portion of specimens from confirmed COVID-19 cases [30] that are diagnosed among NYC residents, as previously described [4, 31]. Weekly, starting on 12 August 2021, we determined the counts of each lineage assignment during a rolling 12-week period, ending on the most recent specimen collection date (Table 1). Pango lineages, which represent the dynamic nomenclature applied to genetically distinct SARS-CoV-2 lineages [32], were assigned by using the pangolin software tool [33]. Initially, we used the PangoLEARN machine-learning model to assign a lineage name to each WGS result [33]. To improve lineage assignment stability, on 16 December 2021, we switched to the UShER method for placing new genome sequences onto a phylogeny [34].
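The weekly counting step can be illustrated with a minimal Python sketch. It assumes that each sequenced case is a (lineage, specimen collection date) pair and that the 12-week period spans 84 days, inclusive of the most recent collection date. The function name and record layout are illustrative assumptions, not the code used in this study.

```python
from collections import Counter
from datetime import date, timedelta


def rolling_lineage_counts(records, weeks=12):
    """Count cases per (lineage, collection date) in a rolling window.

    records: iterable of (lineage, collection_date) pairs (hypothetical layout).
    The window ends on the most recent collection date and is assumed to
    span weeks * 7 days, inclusive of that end date.
    """
    records = list(records)
    end = max(day for _, day in records)
    start = end - timedelta(days=7 * weeks - 1)
    counts = Counter(
        (lineage, day) for lineage, day in records if start <= day <= end
    )
    return counts, start, end


# Made-up example records, for illustration only.
example = [
    ("XBB.1.5", date(2023, 3, 1)),   # falls before the window, so it is dropped
    ("EG.5.1", date(2023, 6, 20)),
    ("EG.5.1", date(2023, 6, 20)),
    ("XBB.1.16", date(2023, 6, 22)),
]
counts, start, end = rolling_lineage_counts(example)
print(start, end, dict(counts))
```

Keeping the counts at daily resolution, rather than aggregating by week, mirrors the time precision listed in Table 1.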

Table 1.

Specifications for analyses using tree-temporal scan statistics applied prospectively to genomic surveillance data among NYC residents

| Feature | SARS-CoV-2 | Salmonella | Notes |
| --- | --- | --- | --- |
| Genomic data resolution | Pango lineages | Allele codes | The “nodes” in our genomic surveillance trees represent SARS-CoV-2 variants or Salmonella allele codes |
| Temporal element | Specimen collection date | Specimen collection date (or upload date, in sensitivity analyses) | The specimen collection date is the most epidemiologically relevant date, representing when patients sought care. For Salmonella, to accommodate delays between specimen collection and allele code assignment, we also conducted sensitivity analyses, in which the temporal element was the date uploaded to the System for Enteric Disease Response, Investigation, and Coordination (SEDRIC) |
| Time precision | Day | Day | We used data at daily resolution (as opposed to aggregating by week or month) to improve precision in cluster start dates |
| Study period | 12-week period ending on the most recent specimen collection date | 1-year period ending on the most recent specimen collection (or upload) date | For SARS-CoV-2, due to rapid variant turnover, we used a short study period that was three times as long as the maximum temporal cluster size (see below). For Salmonella, we used the standard study period of 1 year [35] |
| Only allow data on leaves of tree | No | No | For genomic surveillance data, valid patient results could be anywhere on the tree, not only at the most specific nodes |
| Allow multiple parents for the same node | Yes | No | We assigned multiple parents for recombinant SARS-CoV-2 lineages, effective in January 2024. For Salmonella, each node had only one parent |
| Type of scan | Tree and time | Tree and time | We scanned for increases in cases at any node or group of related nodes and over any recent time period |
| Conditional analysis | Node and time | Node and time | We conditioned on time to adjust nonparametrically for any citywide purely temporal patterns, such as data-reporting lags or increasing or decreasing trends. We also conditioned on node to account for whether cases historically had been common or rare at each node during the baseline period. This is because we were interested in detecting newly emerging nodes, not nodes that were also common during the baseline period |
| Scan for branches with: | High rates | High rates | We wished to detect clusters as they emerged rather than declined |
| Maximum temporal size | 28 days | 90 days | For SARS-CoV-2, we searched for increases in variants during the most recent 14, 15, 16, …, 27, or 28 days to balance recency and persistence. For Salmonella, we searched for allele codes with increases during the most recent 1, 2, 3, …, 89, or 90 days to encompass the standard 60 days in the rule-based Salmonella definition, plus an additional 30 days to accommodate data lags |
| Minimum temporal size | 14 days | 1 day | |
| Prospective evaluation | Yes | Yes | Prospective analyses were used to search for emerging clusters rather than historical clusters by only considering temporal windows reaching up to the study period end date |
| Perform node by day-of-week adjustment | No | No | Sequencing results were unlikely to vary by the day of the week on which the specimen was collected |
| Inference method | Sequential Monte Carlo | Sequential Monte Carlo | We used a sequential method with an early termination cutoff, which allowed runs to terminate early if there were no unusual clusters |
| Monte Carlo replications | 999 999 | 99 999 | To slightly improve performance, we used more than the standard 999 Monte Carlo replications, as allowed based on computing time, which is determined by the number of tree nodes and time intervals |
| Prospective analysis frequency | Weekly | Weekly | We performed analyses weekly (as opposed to daily) to match the frequency with which input data were refreshed |
| Minimum number of cases | 2 | 2 | We retained the default minimum so as not to miss any emerging clusters |
| Signal definition | Recurrence interval (RI) ≥ 365 days | RI ≥ 100 days | We considered RI 100 to <365 days as a weak cluster, RI 365 days to <5 years as a moderate cluster, RI 5 to <100 years as a strong cluster, and RI ≥100 years as a very strong cluster [35] |

Occasionally, as with XBB.1.5 and then XBB.1.16, a variant newly emerged during the rolling 12-week study period and quickly became the primary signaling node. In these instances, so as not to obscure more recently emerging variants, we reset the study period to begin after that variant stabilized as a percentage of sequenced cases. Once 12 weeks had elapsed since that stabilization, we returned to a rolling 12-week period. This is similar to an approach that is used to fine-tune a spatiotemporal cluster detection system when data in the temporal window and the baseline period are not comparable [35].

Salmonella

When a NYC resident tests positive for Salmonella infection, city and state laws require the laboratory to report the result and submit the patient’s isolate to PHL or the New York State Department of Health [36, 37]. These laboratories conduct WGS on the isolates. WGS data (including serotype and cgMLST allele calls) and patient demographic data are uploaded to CDC PulseNet, where allele codes are assigned at the national level [11]. Allele codes are then populated in CDC’s System for Enteric Disease Response, Investigation, and Coordination (SEDRIC). In parallel, graduate student interns at the NYC Health Department attempt to interview all NYC residents with salmonellosis as soon as is feasible after the initial report to collect possible exposure information.

Weekly, starting on 16 November 2022, we downloaded from SEDRIC the allele codes for salmonellosis cases (typhoidal and non-typhoidal) among New York State residents and then parsed patient addresses to restrict the data to NYC residents. We determined counts of each Salmonella allele code among NYC residents during a rolling 365-day period, ending on the most recent specimen collection date (Table 1).

Health equity

The population benefits of genomic surveillance might be inequitably distributed if particular groups are underrepresented in WGS results [38]. Underrepresentation might be a consequence of inequitable access to healthcare and laboratory testing, and, for SARS-CoV-2 infections, nonrandom sampling practices for sequencing [31]. We assessed WGS result availability for confirmed and probable cases [30] of COVID-19 and salmonellosis among NYC residents who were diagnosed during a 2-year period that ended in October 2023. We stratified by patient-level race or ethnicity and by the Index of Concentration at the Extremes—an area-based measure of economic and racial or ethnic segregation [39].

Hierarchical tree files

SARS-CoV-2

Pango lineage notes were used to determine parent–child relationships for all detected SARS-CoV-2 variants [33, 40]. For example, for the analysis that was conducted on 17 August 2023, all detected variants were descended from B.1. Thus, B.1 was designated as the tree root, which progressively branched into increasingly specific lineages, including the Omicron variant (i.e. B.1.1.529), culminating in more specific nodes, such as the Omicron subvariant EG.5.1.1. For recombinant lineages (e.g. XBB), in January 2024, we switched to assigning multiple parents, but in the earlier analyses that are presented here, we assigned the most recent common ancestor as the parent (Table 2).
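As a rough illustration of how such parent–child relationships can be derived, the Python sketch below expands aliases and walks from a lineage up to the tree root. The alias prefixes and the parental lineages of XBB are read off the notes in Table 2. The function names, the use of full (unaliased) names as node identifiers, and the computation of the most recent common ancestor as the longest shared dotted prefix are assumptions made for this example, not the authors' implementation.

```python
# Alias prefixes read off the notes in Table 2 (a tiny, illustrative subset).
ALIASES = {
    "BA": "B.1.1.529",
    "BJ": "B.1.1.529.2.10.1",
    "BM": "B.1.1.529.2.75.3",
    "EG": "XBB.1.9.2",
}
# Recombinant lineage -> its two parental lineages (from Table 2).
RECOMBINANT_PARENTS = {"XBB": ("BJ.1", "BM.1.1.1")}


def unalias(lineage):
    """Expand a leading alias, e.g. 'BA.2' -> 'B.1.1.529.2'."""
    head, _, tail = lineage.partition(".")
    full_head = ALIASES.get(head, head)
    return f"{full_head}.{tail}" if tail else full_head


def mrca(a, b):
    """Most recent common ancestor, taken as the longest shared dotted prefix."""
    common = []
    for x, y in zip(unalias(a).split("."), unalias(b).split(".")):
        if x != y:
            break
        common.append(x)
    return ".".join(common)


def parent(full_name):
    """Single parent of an unaliased lineage name, or None if none is known."""
    if "." in full_name:
        return full_name.rsplit(".", 1)[0]
    if full_name in RECOMBINANT_PARENTS:
        return mrca(*RECOMBINANT_PARENTS[full_name])
    return None


def path_from_root(lineage, root="B.1"):
    """List of nodes from the tree root down to the given lineage."""
    node = unalias(lineage)
    path = [node]
    while node != root:
        node = parent(node)
        if node is None:
            raise ValueError(f"{lineage} does not descend from {root}")
        path.append(node)
    return path[::-1]


for level, node in enumerate(path_from_root("EG.5.1.1"), start=1):
    print(level, node)
```

For EG.5.1.1, the walk passes through XBB and then through B.1.1.529.2 (i.e. BA.2), the shared ancestor of BJ.1 and BM.1.1.1, which matches the single-parent assignment described above.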

Table 2.

Example hierarchical nomenclature for SARS-CoV-2 variants assigned to Pango lineages, showing tree levels

| Level | Node (Pango lineage) | Note |
| --- | --- | --- |
| 1 | B.1 | |
| 2 | B.1.1 | |
| 3 | B.1.1.529 | Omicron |
| 4 | BA.2 | Alias of B.1.1.529.2 |
| 5 | XBB | Recombinant lineage of BJ.1 (alias of B.1.1.529.2.10.1.1) and BM.1.1.1 (alias of B.1.1.529.2.75.3.1.1.1) |
| 6 | XBB.1 | |
| 7 | XBB.1.9 | |
| 8 | XBB.1.9.2 | |
| 9 | EG.5 | Alias of XBB.1.9.2.5 |
| 10 | EG.5.1 | |
| 11 | EG.5.1.1 | |

Salmonella

We designated ‘SAL’ as the tree root and each Salmonella serotype (e.g. Typhi, Enteritidis, Kottbus) as the second tree level. We appended the allele code, which can be up to six digits, to the serotype. Whereas laboratory scientists typically compare isolates manually by using allele ranges, we used allele codes because of the standardized hierarchical nomenclature. Isolates with more allele code digits in common have fewer allele differences (Table 3).
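The node naming scheme can be sketched in a few lines of Python. The example assumes that allele code digits are separated by periods, as in Table 3, and copies the maximum expected allele differences from that table. The function names and the lookup dictionary are hypothetical and are shown only to make the hierarchy concrete.

```python
# Maximum expected allele difference for isolates matching at the given
# allele code digit, copied from Table 3 (digit position -> alleles).
MAX_ALLELE_DIFF = {1: 80, 2: 28, 3: 15, 4: 7, 5: 4, 6: 0}


def node_path(serotype, allele_code):
    """Tree nodes from the root 'SAL' down to the full allele code."""
    digits = allele_code.split(".")
    path = ["SAL", f"SAL.{serotype}"]
    for i in range(1, len(digits) + 1):
        path.append(f"SAL.{serotype}." + ".".join(digits[:i]))
    return path


def shared_digits(code_a, code_b):
    """Number of leading allele code digits that two isolates have in common."""
    n = 0
    for x, y in zip(code_a.split("."), code_b.split(".")):
        if x != y:
            break
        n += 1
    return n


print(node_path("Kottbus", "6185.1.1.2.1.1"))

# Two made-up isolates of the same serotype that diverge at the fourth digit.
n = shared_digits("6185.1.1.2.1.1", "6185.1.1.5.3.2")
print(n, MAX_ALLELE_DIFF[n])
```

Because every isolate contributes to each node along its path, an increase can surface at a coarse node even when no single full allele code stands out.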

Table 3.

Example hierarchical nomenclature for Salmonella isolates assigned a serotype and allele code

| Level | Node (serotype, allele code) | Maximum expected allele difference for isolates matching at specified allele code digit |
| --- | --- | --- |
| 1 | SAL | Not applicable |
| 2 | SAL.Kottbus | Not applicable |
| 3 | SAL.Kottbus.6185 | 80 |
| 4 | SAL.Kottbus.6185.1 | 28 |
| 5 | SAL.Kottbus.6185.1.1 | 15 |
| 6 | SAL.Kottbus.6185.1.1.2 | 7 |
| 7 | SAL.Kottbus.6185.1.1.2.1 | 4 |
| 8 | SAL.Kottbus.6185.1.1.2.1.1 | 0 (a) |
(a) Isolates matching at the sixth digit of the allele code are 0 alleles apart, i.e. indistinguishable by core genome multilocus sequence typing.

Prospective tree-temporal scan statistic

We conducted prospective analyses by using tree-temporal scan statistics [24, 41] (Table 1) in the free and open-source TreeScan™ software (Martin Kulldorff and Information Management Services, Inc., Calverton, Maryland). We searched for unusual increases that emerged over any recent time period at any node, from large phylogenetic branches to small groups of genetically indistinguishable isolates. Under the null hypothesis, for each variant or group of related variants, its proportion among all variants is constant over time. Under the alternative hypothesis, for one variant or group of related variants, its proportion among all variants is higher during some recent period. Mathematical formulae for calculating the expected number of cases, excess cases, and relative risk are available in the TreeScan user guide [41].

For each node (candidate cluster), a likelihood ratio-based test statistic is calculated when the observed number of cases during the time window at the node exceeds the expected number. The candidate cluster with the maximum likelihood ratio test statistic is the cluster that is least likely to be due to chance under the null hypothesis of no node-by-time interaction, after adjusting for purely temporal variation and total node counts during the study period. For example, if a node has 5.4% of cases during the baseline period (i.e. prior to the cluster period) and there are 100 total cases with WGS results during the cluster period, then the expected number of cases during the cluster period at that node is 5.4.
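The arithmetic in the example above can be sketched in a few lines of Python. This mirrors only the simplified illustration in the text (baseline share multiplied by the cluster-period total); the exact formulae that TreeScan applies are in its user guide [41]. The function names and the baseline counts of 54 and 1000 are hypothetical, chosen to give a 5.4% baseline share.

```python
def expected_cases(baseline_node_cases, baseline_total_cases, window_total_cases):
    """Expected cases at a node if its share of all cases stayed at the baseline share."""
    baseline_share = baseline_node_cases / baseline_total_cases
    return baseline_share * window_total_cases


def is_candidate(observed, expected):
    """A test statistic is calculated only when observed cases exceed expected cases."""
    return observed > expected


# Hypothetical baseline: 54 of 1000 cases (5.4%) at the node; 100 cases in the cluster period.
expected = expected_cases(54, 1000, 100)
print(f"expected cases: {expected:.1f}")
print("candidate with 11 observed cases:", is_candidate(11, expected))
```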

Monte Carlo hypothesis testing is used to assess the statistical strength of clusters, controlling for the multiplicity of overlapping nodes and time windows evaluated. To create simulated datasets under the null hypothesis, case dates are shuffled and randomly assigned to the original nodes. The maximum likelihood ratio test statistic for each simulated dataset is calculated in the same way as for the observed dataset.
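The permutation idea can be illustrated with a toy Python sketch. It is not TreeScan's implementation: it evaluates a single cluster window and leaf nodes only, and it uses a Poisson-style log likelihood ratio as a stand-in statistic. All function names, the example data, and the seed are assumptions made for illustration.

```python
import math
import random


def llr(observed, expected):
    """Stand-in Poisson-style log likelihood ratio; zero unless observed exceeds expected."""
    if expected <= 0 or observed <= expected:
        return 0.0
    return observed * math.log(observed / expected) - (observed - expected)


def max_statistic(cases, window_start):
    """cases: list of (node, day). Return the largest statistic over nodes for one window."""
    total = len(cases)
    in_window_total = sum(1 for _, day in cases if day >= window_start)
    best = 0.0
    for node in {n for n, _ in cases}:
        node_total = sum(1 for n, _ in cases if n == node)
        node_in_window = sum(1 for n, d in cases if n == node and d >= window_start)
        # Expected count conditions on the node total and the window total.
        expected = node_total * in_window_total / total
        best = max(best, llr(node_in_window, expected))
    return best


def monte_carlo_p_value(cases, window_start, replications=999, seed=0):
    """Shuffle case dates across the original nodes and rank the observed maximum."""
    rng = random.Random(seed)
    observed_max = max_statistic(cases, window_start)
    nodes = [n for n, _ in cases]
    days = [d for _, d in cases]
    rank = 1  # the observed dataset counts as one of the ranked datasets
    for _ in range(replications):
        rng.shuffle(days)
        if max_statistic(list(zip(nodes, days)), window_start) >= observed_max:
            rank += 1
    return rank / (replications + 1)


# Hypothetical data: node A has all 10 cases in the last days; node B is spread evenly.
cases = [("A", 80 + i % 4) for i in range(10)] + [("B", d) for d in range(90)]
print(monte_carlo_p_value(cases, window_start=70))
```

Because only the maximum statistic from each simulated dataset is compared with the observed maximum, the resulting P-value accounts for the many nodes evaluated, which is the multiplicity adjustment described above.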

The maximum likelihood ratio for the observed dataset is ranked among those from the simulated runs under the null hypothesis, and a P-value is derived from this ranking as P = rank/(999 999 + 1) for the SARS-CoV-2 analyses [41]. For prospective analyses, a recurrence interval (RI) is calculated as the reciprocal of the P-value. For a weekly analysis frequency, this is further divided by 52 for the number of analyses per year. The RI represents the duration of weekly surveillance required for the expected number of clusters that are at least as unusual as the observed cluster to be equal to 1 by chance [42]. For example, when the null hypothesis of no clusters is true, then, during a 1-year period, the expected number of clusters with RI ≥ 365 days is 1. Supplementary Appendix S1 provides cluster-reporting details.
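The conversion from a Monte Carlo rank to a recurrence interval follows directly from the two steps described above and can be sketched as follows. The function names are illustrative assumptions; the 999 999 replications and the weekly frequency of 52 analyses per year come from the text.

```python
def p_value(rank, replications):
    """P-value from the rank of the observed maximum among the simulated maxima."""
    return rank / (replications + 1)


def recurrence_interval_years(p, analyses_per_year=52):
    """Reciprocal of the P-value, divided by the number of analyses per year."""
    return (1 / p) / analyses_per_year


# Strongest possible signal with 999 999 replications: the observed maximum ranks first.
p_min = p_value(rank=1, replications=999_999)
ri_years = recurrence_interval_years(p_min)
print(f"P = {p_min:.6f}, RI = {ri_years:,.0f} years")
```

With rank 1, this gives the maximum RI of about 19 231 years that is reported for the SARS-CoV-2 analyses.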

Performance assessment

SARS-CoV-2

In the absence of national guidance for ways in which jurisdictions should select variants to monitor locally, we compiled illustrative examples of successes and challenges in using TreeScan results to focus attention on emerging variants during weekly analyses that were conducted during August 2021–November 2023.

Salmonella

We characterized clusters that were prospectively detected by using TreeScan during the first year of weekly analyses, from 16 November 2022 to 8 November 2023. We considered clusters to be “solved” if investigators identified a common food source, animal exposure, exposure site, or travel history that likely explained the association among cluster patients. Supplementary Appendix S2 provides further details about cluster definitions, cluster prioritization, and consideration of typhoidal clusters.

Results

Completeness and representativeness

Among NYC residents who were diagnosed during November 2021–October 2023, WGS was conducted for 7% of COVID-19 cases (151 944 of 2 266 600) and for 62% of non-typhoidal salmonellosis cases (1679 of 2722; Supplementary Table S1). Of 1068 salmonellosis cases with no allele code, 937 (88%) were probable cases with only a positive culture-independent diagnostic test result, 106 (10%) were culture-positive but had no isolate available for WGS, 5 (<1%) underwent WGS but failed quality control, and 20 (2%) were unique sequences that CDC’s naming algorithm could not match to an existing allele code. The median lag from specimen collection to allele code assignment was 22 days (interquartile range: 20–28 days). Of 1654 salmonellosis cases with an allele code assigned, 1322 (80%) had a fully or partially completed interview; interviews are necessary to collect information for identifying common exposures among cluster patients.

Patient demographic characteristics were similarly distributed between reported cases overall and the subset with WGS results. Distributions were within ±2.5% for every stratum of race or ethnicity and the Index of Concentration at the Extremes (Supplementary Table S1). Although substantial proportions of patients lacked WGS results, there was no evidence of systematic underrepresentation during this period.

SARS-CoV-2 illustrative examples

Rapid detection of a locally emerging variant

The analysis that was performed on 22 June 2023 searched for variants that were emerging during the 14, 15, 16, …, 27, or 28-day period that ended on 12 June 2023 (Figure 1), across 200 nodes (excerpted in Figure 2). This analysis, with a computer running time of 4 minutes and 15 seconds, identified six SARS-CoV-2 variants emerging among NYC residents (Table 4). Three of the six nodes were the same as in the prior week’s analysis (Supplementary Table S2), including persistent, strong signals for XBB.1.16, XBB.2.3, and their subvariants. Of the newly signaling nodes, EG.5.1 (RI = 35 years) first signaled more strongly than its grandparent (XBB.1.9.2, Figure 2), with 11 specimens collected during 17 May–12 June 2023. EG.5 or EG.5.1 continued to signal for 13 consecutive weekly analyses, from 22 June to 14 September 2023, after which more specific subvariants (e.g. EG.5.1.6) began to signal more strongly.

Figure 1.

Example of prospective temporal scan windows for a 12-week study period at daily resolution ending on 12 June 2023, with 14-day minimum and 28-day maximum cluster periods. In prospective analyses, only cluster periods extending to the study period end date are evaluated. The baseline period, which is used for comparison, is prior to each cluster period.

Image description: A schematic of 15 stacked horizontal bars, each representing an 84-day study period, 21 March–12 June 2023. Each bar has an unshaded portion starting from the left side (representing the baseline period) meeting in the middle with a shaded portion starting from the right side (representing the cluster period). The top bar displays the minimum cluster period of 14 days and the longest baseline period of 70 days. Each subsequent bar lengthens the cluster period and shortens the baseline period in 1-day increments. The bottom bar displays the maximum cluster period of 28 days and the shortest baseline period of 56 days.
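The temporal scan windows described for Figure 1 can be enumerated with a short Python sketch. The function name and the dictionary keys are assumptions; the 84-day study period, the 14- to 28-day cluster periods, and the end date of 12 June 2023 are taken from the text.

```python
from datetime import date, timedelta


def prospective_windows(end, study_days=84, min_days=14, max_days=28):
    """List every cluster period that ends on the study period end date."""
    study_start = end - timedelta(days=study_days - 1)
    windows = []
    for cluster_days in range(min_days, max_days + 1):
        windows.append({
            "baseline_start": study_start,
            "baseline_days": study_days - cluster_days,
            "cluster_start": end - timedelta(days=cluster_days - 1),
            "cluster_days": cluster_days,
            "cluster_end": end,
        })
    return windows


windows = prospective_windows(date(2023, 6, 12))
print(len(windows), "windows")
for w in (windows[0], windows[-1]):
    print(w["cluster_start"], "to", w["cluster_end"],
          f"({w['cluster_days']}-day cluster, {w['baseline_days']}-day baseline)")
```

The shortest window starts on 30 May 2023 and the longest on 16 May 2023, which brackets the cluster start dates reported in Table 4.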

Figure 2.

Example of scanning windows for a SARS-CoV-2 genomic surveillance tree excerpt, showing XBB.1.9.2 and its subvariants as detected among NYC residents as of 12 June 2023. Each detected variant is a node that is connected to its parent and any children. Ovals indicate the complete set of scanning windows within this excerpt, evaluating each node together with its descendants.

Image description: A dendrogram showing genomic relationships for 17 nodes representing XBB.1.9.2, its seven children, and its nine grandchildren. The nine grandchildren are distributed across three children, with one, two, or six grandchildren per child. Seventeen overlapping ovals, which represent scanning windows, are overlaid on the dendrogram. One large oval encompasses all 17 nodes. Three intermediate ovals encompass three children together with their own children. Thirteen small ovals encompass each terminal node, which are the nine grandchildren and the four children with no children of their own.
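Evaluating each node together with its descendants, as in Figure 2, amounts to rolling case counts up the tree. The Python sketch below assumes that lineage names have already been expanded to their unaliased dotted form (e.g. EG.5.1 written as XBB.1.9.2.5.1); the function name and all counts are hypothetical.

```python
from collections import Counter


def counts_with_descendants(case_counts):
    """Sum each lineage's cases into the lineage itself and every ancestor."""
    totals = Counter()
    for lineage, n_cases in case_counts.items():
        parts = lineage.split(".")
        for depth in range(1, len(parts) + 1):
            totals[".".join(parts[:depth])] += n_cases
    return totals


# Hypothetical counts for a small excerpt of the XBB.1.9.2 branch.
cases = {
    "XBB.1.9.2": 4,
    "XBB.1.9.2.1": 3,
    "XBB.1.9.2.5": 2,
    "XBB.1.9.2.5.1": 11,
}
totals = counts_with_descendants(cases)
for node in sorted(cases):
    print(node, totals[node])
```

Each node's rolled-up total corresponds to one oval in Figure 2, so a signal can appear at a specific subvariant, at a broader branch, or at both.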

Table 4.

Analysis conducted on 22 June 2023 to apply prospective tree-temporal scan statistics to detect emerging SARS-CoV-2 variants in specimens that were collected among NYC residents during the 14- to 28-day period that ended on 12 June 2023

| Variant | No. of cases with specimens collected during 12-week study period, 21 March–12 June 2023 | Cluster start date, ending on 12 June 2023 | No. of cases with specimens collected during cluster window | No. of expected cases | Relative risk | No. of excess cases | Test statistic | RI (years) (b) | No. of consecutive weeks signaling | Percentage of sequenced cases with specimens collected in week ending 12 June 2023 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| XBB.1.16 (a) | 213 | 16 May | 107 | 52.1 | 3.7 | 78.0 | 22.8 | 19 231 | 11 | 28 |
| XBB.2.3 (a) | 83 | 18 May | 41 | 17.5 | 3.9 | 30.5 | 11.5 | 4808 | 3 | 17 |
| XBB.1.24.1 | 5 | 30 May | 5 | 0.4 | | 5.0 | 8.2 | 55 | 1 | <1 |
| EG.5.1 (c) | 11 | 17 May | 11 | 2.5 | | 11.0 | 7.9 | 35 | 1 | 6 |
| XBB.1.5.68 | 8 | 19 May | 8 | 1.6 | | 8.0 | 6.6 | 5 | 1 | <1 |
| XBB.1.5.16 (a) | 13 | 16 May | 11 | 3.2 | 17.3 | 10.4 | 5.8 | 2 | 2 | <1 |

(a) Subvariants included.

(b) Nodes with RI ≥ 1 year were included. The maximum possible RI for this analysis was 19 231 years. When using 999 999 Monte Carlo replications, the smallest possible P-value is 1/(999 999 + 1) = 0.000001. With a weekly prospective analysis frequency, the maximum RI was thus (1/0.000001)/52 analyses per year = 19 231 years.

(c) Alias of XBB.1.9.2.5.1.

The WHO designated EG.5 as a variant under monitoring on 19 July 2023 and a variant of interest on 9 August 2023 [43], 4 and 7 weeks, respectively, after our first EG.5.1 signal. During a period with many co-circulating variants and when EG.5 initially constituted a small number and percentage of cases with WGS results, the TreeScan analysis, trend visualization (Figure 3), and dendrogram (Supplementary Figure S1; see online supplementary material for a color version of this figure) led NYC Health Department officials to focus attention on this variant.

Figure 3.

Count and percentage of cases with EG.5.1 as the SARS-CoV-2 sequencing result among NYC residents with specimens collected during 21 March–12 June 2023. Results are as of 22 June 2023, the first analysis week that EG.5.1 signaled as emerging, with the cluster window starting on 17 May 2023 shaded in gray.

Image description: A time-series graph showing zero cases of the SARS-CoV-2 variant EG.5.1 until 17 May 2023, followed by 11 cases during 17 May through 7 June 2023, with zero, one, or two cases per day.

Delayed detection of locally emerging variants

In the analysis that was performed on 20 October 2022, multiple BE.1.1.1 subvariants signaled for the first time, indicating delayed detection of BQ.1, BQ.1.1 (RI = 19 231 years for both nodes), and BQ.1.3 (RI = 2.4 years). In the concurrent UShER version update, a subset of cases had been reassigned to BQ lineages, revealing that BQ lineages, which descended from BE.1.1.1 [40], had been present for >1 month. Once the input data were updated, TreeScan analyses appropriately detected the emergence of BQ lineages.

Assurance of no other locally emerging variants

The Omicron variant was first detected in NYC in clinical and wastewater samples that were collected in November 2021 and quickly became predominant [31, 44]. While staff were urgently focused on the characterization of local effects on population health, TreeScan analyses provided confirmation that additional lineages that required attention and response were not emerging concurrently.

Salmonella

During the first year of analyses, which covered 128 serotypes, TreeScan detected 16 unique clusters in the primary analysis, which used the specimen collection date as the temporal element, and one additional cluster in the sensitivity analysis, which used the upload date (Table 5). TreeScan detects statistical anomalies, which must be investigated to distinguish true clusters from data-quality issues. Of the 17 clusters, two arose because >1 isolate had been sequenced from the same patient. These data-quality issues were identified by reviewing line lists and were quickly resolved in SEDRIC. The remaining 15 clusters were plausibly outbreaks and worth epidemiological investigation. Of these 15 clusters, two were typhoidal and associated with travel to an endemic area. Of the 13 non-typhoidal clusters, two comprised family members with shared exposures, two reflected larger, interjurisdictional outbreaks, two were persistent strains that caused illnesses over a long time, and seven were unsolved. One of the 13 non-typhoidal clusters was detected at the third allele code digit and so encompassed a broader allele range than rule-based cluster definitions. The remaining 12 clusters were detected to at least the fifth allele code digit (i.e. had a maximum expected difference of 4 alleles; Table 3), aligning with rule-based cluster definitions (Supplementary Appendix S2). Of these 12 clusters, nine were concurrently detected by PHL, two consisted entirely of isolates that were tested at other jurisdictions’ public health laboratories and so could not have been detected by PHL, and one was not concurrently detected by PHL because temporary technological issues disrupted portions of PHL’s cluster detection workflow.

Table 5.

Salmonellosis clusters among NYC residents that were detected by using prospective tree-temporal scan statistics in weekly analyses conducted during 16 November 2022–8 November 2023

| Serotype | Allele code | Date first detected (DD/MM/YY) | Cluster start date (specimen collection date) (DD/MM/YY) | Observed cases | Relative risk | Excess cases | Recurrence interval | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Typhi | 6788.1.1.47.1.97 | 15/3/23 | 15/2/23 | 4 | | 4.0 | 481 years | Same family. All traveled to endemic area |
| Typhimurium | 6766.47.1.12.4.3 | 1/3/23 | 10/2/23 | 2 | | 2.0 | 33 years | Unsolved |
| Paratyphi B Var. L(+) Tartrate+ | 316.6.3.13.1 | 19/7/23 | 4/7/23 | 2 | | 2.0 | 31 years | Unsolved |
| Typhimurium | 6745.20.1.1.10.9 | 1/2/23 | 9/1/23 | 3 | | 3.0 | 12 years | Same family. Multiple shared exposures |
| Typhimurium | 6745.56.1.2.3.116 | 8/12/22 | 25/10/22 | 4 | | 4.0 | 7.5 years | Unsolved |
| Newport | 6809.9.1.1.1.954 | 12/10/23 | 24/9/23 | 2 | | 2.0 | 6.2 years | Persisting enteric bacterial strain |
| I 4:i:- | 772.1.3.1.2.2 | 11/1/23 | 8/12/22 | 3 | | 3.0 | 2.9 years | Unsolved |
| Hadar | 6771.1.1.30.1 | 8/11/23 | 17/10/23 | 3 | 126.1 | 3.0 | 2.6 years | Persisting enteric bacterial strain |
| Kottbus | 6185.1.1.2.1 | 4/1/23 | 1/12/22 | 5 | 25.6 | 4.8 | 1.9 years | Unsolved |
| Typhimurium | 6766.32.1.3.8.3 | 17/5/23 | 28/4/23 | 2 | | 2.0 | 339 days | Unsolved |
| Javiana | 6765.1.1 | 7/9/23 | 21/7/23 | 7 | 14.6 | 6.5 | 213 days | Three subgroups of more closely related isolates, two of which were multistate investigations |
| Typhimurium | 459.3.1.8.1.1 | 7/6/23 | 15/5/23 | 2 | | 2.0 | 174 days | Same family. Shared meal |
| Typhi | 6788.1.1.4.18 | 3/5/23 | 10/4/23 | 2 | | 2.0 | 147 days | Data-quality issue: removed isolate from a follow-up specimen |
| Oranienburg | 6760.67.167.5.1.1 | 21/12/22 | 19/11/22 | 2 | | 2.0 | 130 days | Data-quality issue: removed duplicate isolate |
| Paratyphi A | 1082.1.3.1.1.11 | 19/4/23 | 24/3/23 | 2 | | 2.0 | 119 days | Same family. All traveled to endemic area |
| IIIb 61:k:1,5 | 135.5.33.6.1.1 | 27/9/23 | 26/9/23 (a) | 2 | | 2.0 | 114 days | Unsolved |
| Infantis | 6747.16.3.109.1 | 13/9/23 | 28/7/23 | 4 | | 4.0 | 101 days | Multistate investigation |

(a) This was the only unique cluster detected in sensitivity analyses in which the temporal element was the upload date instead of the specimen collection date.

Investigators considered TreeScan results to be helpful in focusing staff attention and investigation resources. For example, TreeScan clusters in NYC occasionally reflected concurrent, aberrant clusters across other jurisdictions, prompting multijurisdictional collaboration to identify common exposures. The TreeScan cluster at the third allele code digit alerted investigators to a local increase in Salmonella Javiana, spurring further investigation into possible subclusters. Moreover, in analysing isolates according to the location of residence, TreeScan detected clusters among NYC residents that otherwise might not have been detected because patients were tested by different laboratories. TreeScan also provided coverage when external technological issues disrupted certain processes at PHL.

Discussion

By applying tree-temporal scan statistics prospectively to genomic surveillance data with a standardized hierarchical nomenclature, we automatically sifted through large quantities of data in minutes and generated weekly ranked shortlists of nodes with statistically unusual numbers of recent cases. This method flexibly evaluates all candidate clusters, across many degrees of genetic relatedness and date ranges. It dynamically accounts for any purely temporal trends, such as data lags or changes in WGS result availability, and minimizes false signals by adjusting for the multiplicity of nodes and cluster windows scanned. With real-time application to SARS-CoV-2 and Salmonella data, the NYC Health Department detected credible clusters for investigation and data-quality problems for correction.

Limitations

WGS results were available for only 7% of COVID-19 cases and 62% of salmonellosis cases. After the federal COVID-19 public health emergency declaration ended in May 2023 and with reduced funding, specimen and sequence availability declined [45], which could have reduced population representativeness and delayed the detection of new variants. Additionally, although the NYC Health Code requires laboratories to reflexively culture certain enteric pathogens, including Salmonella [36], the widespread use of culture-independent diagnostic testing has reduced the proportion of salmonellosis cases with recovered isolates. Patients without WGS results cannot contribute their exposure histories to cluster investigations, making it more challenging to solve outbreaks. Improving population-based WGS data completeness, representativeness, and timeliness requires strengthening partnerships with clinics and submitting laboratories, as well as deepening investments in laboratory capacity and bioinformatics infrastructure, including applying culture-independent sequencing methods [46, 47].

Where WGS results were available, the tree nomenclature imposed certain limitations. For SARS-CoV-2, assigning Pango lineages by using UShER allowed accurate and stable lineage assignments at the expense of timeliness. As in the BQ lineages example, delays in updating nomenclature to recognize new lineage designations resulted in delayed detection. For Salmonella, the signal-to-noise ratio to detect a new cluster was poor for allele codes that ended in ‘x’, as the underlying four-, five-, and six-digit codes were masked due to low genomic diversity that increased the within-code distance beyond the assignment thresholds. More broadly, we rely on a standardized nomenclature, with no consideration of genetic distances between tree nodes.

TreeScan should complement, not replace, other cluster detection approaches using laboratory-based data, such as by examining allele ranges [11, 14]. Tree-temporal scan statistics could miss outbreaks where genetic or temporal clustering is weak. Zoonotic disease outbreaks, such as those associated with exposure to reptiles or backyard poultry, often involve multiple serotypes with large allelic diversity [15]. Patients’ isolates might be weakly clustered temporally for outbreaks due to persistent environmental contamination or following delays in accessing medical care or obtaining WGS results. Supplementary Appendix S3 provides additional minor limitations.

Conclusions

By decreasing the reliance on time-consuming, manual laboratory data review and by simultaneously analysing data not only by genetic relatedness, but also by temporal clustering, TreeScan analyses can help officials to focus limited investigative resources on emerging clusters and variants. Future work could apply this approach to additional pathogens [11, 48], use additional hierarchical nomenclature systems, analyse additional pathogen characteristics (e.g. antimicrobial resistance patterns), and analyse state- and national-level data to support multijurisdictional outbreak response. Incorporating TreeScan as a free and open-source tool into analytical pipelines [49] could strengthen strategic frameworks for genomic surveillance [50], including in low- and middle-income countries.

Health departments should routinely apply multiple cluster detection methods to quickly detect different types of outbreaks. In NYC, spatiotemporal analyses have quickly detected outbreaks with strong geographic clustering before laboratory subtyping results became available. However, these methods could miss geographically diffuse outbreaks, such as those following exposure to a widely disseminated source, or outbreaks that affect only a few patients. Despite lags in subtyping data availability, such outbreaks could be detected more quickly by the application of tree-temporal scan statistics to WGS data. TreeScan thus fills an important gap in the public health practitioner’s automated cluster detection and monitoring toolkit.

Ethics approval

The Institutional Review Board of the NYC Health Department determined this activity (No. 21–072) meets the definition of public health surveillance as set forth under 45 CFR §46.102(l)(2).

Acknowledgements

The authors thank Scott Hostovich at Information Management Services, Inc. for incorporating updates into the TreeScan software. We thank Helly Amin, Ahmed Rahat, Dr Faten Taki, Olivia Samson, Naama Kipperman, and Dr Mustapha Mustapha for analytic contributions. Mark Alexander and Kuan Chen set up SQL databases for automated data transfers. We thank Team Salmonella (including Allison Holmes, Alyssa Prince, Eve Curran, Kabeer Majhu, Lauren Hall, Olivia Wang, Elle Palmer, Jenna Mandell, Kimberly Kolsch, Nora Kuka, Sara del Aguila, and Tara Higgins) for exceptional work to complete timely patient interviews, which is key to identifying outbreak sources, as well as Lenka Malec, Marisa Gerard, Danielle Martinez, Samuel Davey, John Croft, and Athanasia Papadopoulos for managing and leading numerous salmonellosis cluster investigations. Christian King, Wai Sum So, Bun Tha, Moinuddin Chowdhury, and Nelson De La Cruz worked tirelessly in PHL’s Molecular Typing Laboratory. We thank Vasudha Reddy for sustained support and guidance on these activities. We thank Wadsworth Center for sequencing Salmonella isolates from NYC residents that were not sent to PHL. We thank CDC PulseNet database managers for assigning allele codes and Dr Lyndsay Bottichio and the Surveillance, Information Management, and Statistics Office in CDC’s Division of Foodborne, Waterborne, and Environmental Diseases for SEDRIC data provision. We thank CDC FoodCORE for continued funding to support both epidemiology and laboratory capacity for food-borne disease surveillance in NYC. Aviva Goldstein, Lauren da Fonte, and the Fund for Public Health in New York City provided fundraising and grant management support for TreeScan. Drs. Judy Maro and Katherine Yih provided valuable suggestions on an early manuscript draft. A preliminary version of this work was presented at the Integrated Foodborne Outbreak Response and Management (InFORM) Virtual Conference, 26–29 April 2022. 
This article was preprinted at https://www.medrxiv.org/content/10.1101/2024.08.28.24312512v1.

Author contributions

S.K.G.: conceptualization, formal analysis, funding acquisition, investigation, methodology, project administration, software, supervision, writing—original draft, writing—review & editing. J.L.: conceptualization, data curation, formal analysis, investigation, project administration, writing—review & editing. E.R.P.: formal analysis, software, writing—review & editing. A.L.-R.: formal analysis, methodology, software, visualization, writing—review & editing. E.L.: formal analysis, visualization, writing—review & editing. J.C.W.: conceptualization, data curation, writing—review & editing. K.B.: data curation, writing—review & editing. A.O.: data curation, writing—review & editing. L.L.: data curation, investigation, writing—review & editing. H.W.: data curation, writing—review & editing. A.M: formal analysis, writing—review & editing. R.R.: formal analysis, writing—review & editing. M.K.: conceptualization, methodology, formal analysis, software, writing—review & editing.

Supplementary data

Supplementary data are available at IJE online.

Conflict of interest: None declared.

Funding

This work was supported by the US Centers for Disease Control and Prevention (NU90TP922035-05, NU50CK000517-01–09, NU50CK000517-05–00).

Data availability

SARS-CoV-2 variant data for NYC residents are available on GitHub (https://github.com/nychealth/coronavirus-data/tree/master/variants). Allele codes for Salmonella isolates are available to CDC partners via SEDRIC (https://www.cdc.gov/foodborne-outbreaks/php/foodsafety/tools/). SAS code for generating TreeScan input files is available on GitHub (https://github.com/CityOfNewYork/communicable-disease-surveillance-nycdohmh). The TreeScan software (www.treescan.org) and source code (https://github.com/scanstatistics/treescan) are freely available.

Use of artificial intelligence (AI) tools

While TreeScan is a data mining tool, no AI tools were used in the writing of this publication.

References

1. Kwong JC, McCallum N, Sintchenko V, Howden BP. Whole genome sequencing in clinical and public health microbiology. Pathology 2015;47:199–210.

2. Paul P, France AM, Aoki Y et al. Genomic surveillance for SARS-CoV-2 variants circulating in the United States, December 2020-May 2021. MMWR Morb Mortal Wkly Rep 2021;70:846–50.

3. Ma KC, Shirk P, Lambrou AS et al. Genomic surveillance for SARS-CoV-2 variants: circulation of Omicron lineages - United States, January 2022-May 2023. MMWR Morb Mortal Wkly Rep 2023;72:651–6.

4. Luoma E, Rohrer R, Parton H et al. Notes from the field: epidemiologic characteristics of SARS-CoV-2 recombinant variant XBB.1.5 - New York City, November 1, 2022-January 4, 2023. MMWR Morb Mortal Wkly Rep 2023;72:212–4.

5. Greene SK, Levin-Rector A, Kyaw NTT et al. Comparative hospitalization risk for SARS-CoV-2 Omicron and Delta variant infections, by variant predominance periods and patient-level sequencing results, New York City, August 2021-January 2022. Influenza Other Respir Viruses 2023;17:e13062.

6. Liu D, Cheng Y, Zhou H et al. Early introduction and community transmission of SARS-CoV-2 Omicron variant, New York, New York, USA. Emerg Infect Dis 2023;29:371–80.

7. Grubaugh ND, Hodcroft EB, Fauver JR, Phelan AL, Cevik M. Public health actions to control new SARS-CoV-2 variants. Cell 2021;184:1127–32.

8. Turner S, Alisoltani A, Bratt D et al. U.S. National Institutes of Health prioritization of SARS-CoV-2 variants. Emerg Infect Dis 2023;29:e221646.

9. CDC. Variant Proportions, Variants & Genomic Surveillance, COVID Data Tracker. https://covid.cdc.gov/covid-data-tracker/#variants-genomic-surveillance (15 August 2023, date last accessed).

10. Nadon C, Van Walle I, Gerner-Smidt P et al.; FWD-NEXT Expert Panel. PulseNet International: vision for the implementation of whole genome sequencing (WGS) for global food-borne disease surveillance. Euro Surveill 2017;22:30544.

11. Tolar B, Joseph LA, Schroeder MN et al. An overview of PulseNet USA databases. Foodborne Pathog Dis 2019;16:457–62.

12

Paranthaman
K
,
Mook
P
,
Curtis
D
, et al.  
Development and evaluation of an outbreak surveillance system integrating whole genome sequencing data for non-typhoidal Salmonella in London and South East of England, 2016-17
.
Epidemiol Infect.
 
2021
;
149
:
e164
.

13

Schadron
T
,
van den Beld
M
,
Mughini-Gras
L
,
Franz
E.
 
Use of whole genome sequencing for surveillance and control of foodborne diseases: status quo and quo vadis
.
Front Microbiol
 
2024
;
15
:
1460335
.

14

Medus
C
,
Boxrud
D
,
Carleton
H
, Chapter 4: Foodborne illness surveillance and outbreak detection. In: Hedberg C (ed.), CIFOR Guidelines for Foodborne Disease Outbreak Response, 3rd edn. Council to Improve Foodborne Outbreak Response. Published
2020
. http://cifor.us/products/guidelines (15 August 2023, date last accessed).

15

Besser
JM
,
Carleton
HA
,
Trees
E
 et al.  
Interpretation of whole-genome sequencing for enteric disease surveillance and outbreak investigation
.
Foodborne Pathog Dis
 
2019
;
16
:
504
12
.

16

Barratt
JLN
,
Plucinski
MM.
 
Epidemiologic utility of a framework for partition number selection when dissecting hierarchically clustered genetic data evaluated on the intestinal parasite Cyclospora cayetanensis
.
Am J Epidemiol
 
2023
;
192
:
772
81
.

17

Mixao
V
,
Pinto
M
,
Sobral
D
,
Di Pasquale
A
,
Gomes
JP
,
Borges
V.
 
ReporTree: a surveillance-oriented tool to strengthen the linkage between pathogen genetic clusters and epidemiological data
.
Genome Med
 
2023
;
15
:
43
.

18

Payne
M
,
Hu
D
,
Wang
Q
 et al.  
DODGE: automated point source bacterial outbreak detection using cumulative long term genomic surveillance
.
Bioinformatics
 
2024
;
40
:
btae427
.

19

Kulldorff
M.
 
A spatial scan statistic
.
Commun Stat Theory Methods
 
1997
;
26
:
1481
96
.

20

Kulldorff
M
,
Heffernan
R
,
Hartman
J
,
Assunção
R
,
Mostashari
F.
 
A space-time permutation scan statistic for disease outbreak detection
.
PLoS Med
 
2005
;
2
:
e59
.

21

Stroup
DF
,
Wharton
M
,
Kafadar
K
,
Dean
AG.
 
Evaluation of a method for detecting aberrations in public health surveillance data
.
Am J Epidemiol
 
1993
;
137
:
373
80
.

22

CDC
. Readers’ Guide: Understanding Weekly and Annual National Notifiable Diseases Surveillance System WONDER Tables (rev. 04/21/2021). https://www.cdc.gov/nndss/docs/Readers-Guide-WONDER-Tables-20210421-508.pdf (8 May 2024, date last accessed).

23

Greene
SK
,
Peterson
ER
,
Kapell
D
,
Fine
AD
,
Kulldorff
M.
 
Daily reportable disease spatiotemporal cluster detection, New York City, New York, USA, 2014-2015
.
Emerg Infect Dis
 
2016
;
22
:
1808
12
.

24

Kulldorff
M
,
Fang
Z
,
Walsh
SJ.
 
A tree-based scan statistic for database disease surveillance
.
Biometrics
 
2003
;
59
:
323
31
.

25

Kulldorff
M
,
Dashevsky
I
,
Avery
TR
 et al.  
Drug safety data mining with a tree-based scan statistic
.
Pharmacoepidemiol Drug Saf
 
2013
;
22
:
517
23
.

26

Huybrechts
KF
,
Kulldorff
M
,
Hernández-Díaz
S
 et al.  
Active surveillance of the safety of medications used during pregnancy
.
Am J Epidemiol
 
2021
;
190
:
1159
68
.

27

Yih
WK
,
Daley
MF
,
Duffy
J
 et al.  
A broad assessment of COVID-19 vaccine safety using tree-based data-mining in the Vaccine Safety Datalink
.
Vaccine
 
2023
;
41
:
826
35
.

28

U.S. Food and Drug Administration
. Use of TreeScan by Non-Sentinel Investigators. https://www.sentinelinitiative.org/methods-data-tools/signal-identification-sentinel-system/use-treescan-non-sentinel-investigators (15 August 2023, date last accessed).

29

Shmuel
S
,
Leonard
CE
,
Bykov
K
,
Filion
KB
,
Seamans
MJ
,
Lund
JL.
 
Breaking research silos and stimulating "innovation at the edges" in epidemiology
.
Am J Epidemiol
 
2023
;
192
:
323
7
.

30

CDC
. Surveillance Case Definitions for Current and Historical Conditions.  https://ndc.services.cdc.gov/ (15 August 2023, date last accessed).

31

New York City Department of Health and Mental Hygiene
. NYC Coronavirus Disease 2019 (COVID-19) Data: Variants of the SARS-CoV-2 Virus. https://github.com/nychealth/coronavirus-data#variants-of-the-sars-cov-2-virus (15 August 2023, date last accessed).

32

Rambaut
A
,
Holmes
EC
,
O'Toole
A
 et al.  
A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology
.
Nat Microbiol
 
2020
;
5
:
1403
7
.

33

O'Toole
A
,
Scher
E
,
Underwood
A
 et al.  
Assignment of epidemiological lineages in an emerging pandemic using the pangolin tool
.
Virus Evol
 
2021
;
7
:
veab064
.

34

de Bernardi Schneider
A
,
Su
M
,
Hinrichs
AS
 et al.  
SARS-CoV-2 lineage assignments using phylogenetic placement/UShER are superior to pangoLEARN machine-learning method
.
Virus Evol
 
2024
;
10
:
vead085
.

35

Levin-Rector
A
,
Kulldorff
M
,
Peterson
ER
,
Hostovich
S
,
Greene
SK.
 
Prospective spatiotemporal cluster detection using SaTScan: tutorial for designing and fine-tuning a system to detect reportable communicable disease outbreaks
.
JMIR Public Health Surveill.
 
2024
;
10
:
e50653
.

36

New York City Health Code
. Article 11 (Reportable Diseases and Conditions) and Article 13 (Laboratories).  https://www.nyc.gov/site/doh/about/about-doh/health-code-and-rules.page (15 August 2023, date last accessed).

37

Consolidated Laws of New York
. Section 576-C. Electronic reporting of disease and specimen submission. Chapter 45 (Public Health), Article 5 (Laboratories), Title 5 (Clinical Laboratory and Blood Banking Services). Published 22 September 2014. https://www.nysenate.gov/legislation/laws/PBH/576-C (15 August 2023, date last accessed).

38

Twohig
KA
,
Harman
K
,
Zaidi
A
, et al.  
Representativeness of whole genome sequencing approaches in England—the importance for understanding inequalities associated with SARS-CoV-2 infection
.
Epidemiol Infect.
 
2023
:
151
:
e169
.

39

Krieger
N
,
Waterman
PD
,
Spasojevic
J
,
Li
W
,
Maduro
G
,
Van Wye
G.
 
Public health monitoring of privilege and deprivation with the Index of Concentration at the Extremes
.
Am J Public Health
 
2016
;
106
:
256
63
.

40

The Centre for Genomic Pathogen Surveillance, Big Data Institute, University of Oxford
. Pango Designation Lineage Notes. https://github.com/cov-lineages/pango-designation/blob/master/lineage_notes.txt (15 August 2023, date last accessed).

41

Kulldorff
M.
 TreeScanTM User Guide for Version 2.1. Published July
2022
. https://www.treescan.org/ (15 August 2023, date last accessed)

42

Kleinman
K
,
Lazarus
R
,
Platt
R.
 
A generalized linear mixed models approach for detecting incident clusters of disease in small areas, with an application to biological terrorism
.
Am J Epidemiol
 
2004
;
159
:
217
24
.

43

WHO
. EG.5 Initial Risk Evaluation. Published 9 August 2023. https://www.who.int/docs/default-source/coronaviruse/09082023eg.5_ire_final.pdf (10 August 2023, date last accessed).

44

Kirby
AE
,
Welsh
RM
,
Marsh
ZA
 et al.  
Notes from the field: early evidence of the SARS-CoV-2 B.1.1.529 (Omicron) variant in community wastewater—United States, November-December 2021
.
MMWR Morb Mortal Wkly Rep
 
2022
;
71
:
103
5
.

45

Silk
BJ
,
Scobie
HM
,
Duck
WM
 et al.  
COVID-19 surveillance after expiration of the public health emergency declaration—United States, May 11, 2023
.
MMWR Morb Mortal Wkly Rep
 
2023
;
72
:
523
8
.

46

Carleton
HA
,
Besser
J
,
Williams-Newkirk
AJ
,
Huang
A
,
Trees
E
,
Gerner-Smidt
P.
 
Metagenomic approaches for public health surveillance of foodborne infections: opportunities and challenges
.
Foodborne Pathog Dis
 
2019
;
16
:
474
9
.

47

Ko
KKK
,
Chng
KR
,
Nagarajan
N.
 
Metagenomics-enabled microbial surveillance
.
Nat Microbiol
 
2022
;
7
:
486
96
.

48

McBroome
J
,
de Bernardi Schneider
A
,
Roemer
C
 et al.  
A framework for automated scalable designation of viral pathogen lineages from genomic data
.
Nat Microbiol
 
2024
;
9
:
550
60
.

49

Association of Public Health Laboratories
. PulseNet 2.0: A future-proof infrastructure for genomic data management and analytics. White paper version 1.0. Published May
2023
. https://stacks.cdc.gov/view/cdc/138367 (27 March 2025, date last accessed).

50

Broberg
E
,
Revez
J
,
Alm
E
,
Walle
IV
,
Palm
D
,
Struelens
M.
ECDC strategic framework for the integration of molecular and genomic typing into European surveillance and multi-country outbreak investigations, 2019–2021. Published 4 April 2019. https://www.ecdc.europa.eu/en/publications-data/ecdc-strategic-framework-integration-molecular-and-genomic-typing-european (15 August 2023, date last accessed).

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
