Abstract

Whole genome sequencing (WGS) has revolutionized the genotyping of bacterial pathogens and is expected to become the new gold standard for tracing the transmission of bacterial infectious diseases for public health purposes. Traditional genomic epidemiology often uses WGS as a verification tool: when a common source or epidemiological link is suspected, the collected isolates are sequenced to determine their clonal relationships. However, increasingly frequent international travel and food transportation, and the associated potential for the cross-border transmission of bacterial pathogens, often leave bacterial transmission routes undocumented. Here we introduce the concept of ‘reverse genomic epidemiology’, i.e. when genome comparison shows isolates to be sufficiently similar to one another, they are assumed to result from infection from a common source. Through BacWGSTdb (http://bacdb.org/BacWGSTdb/), a database we have developed for bacterial genome typing and source tracking, we have found that almost all of the 20 bacterial species analyzed exhibit cross-border clonal dissemination. Five networks were further identified in which isolates sharing nearly identical genomes were collected from at least five different countries. Three of these have been documented as real infectious disease outbreaks, demonstrating the feasibility and reliability of reverse genomic epidemiology. Our survey and proposed strategy are of potential value in establishing a global surveillance system for tracing bacterial transmissions and outbreaks; the related database and techniques require urgent standardization.

Introduction

Throughout human history, infectious disease has been a major and enduring threat to public health, with significant social and financial implications. Bacterial infection has been shown to be involved in the development of various diseases, including cholera, inflammatory bowel disease, tuberculosis, tetanus and various types of cancer [1–6]. While several advances, including the discovery of antibiotics, have greatly limited the burden of bacterial infections, epidemics still pose a frequent threat to vulnerable populations. A well-known example is the 2010–11 Haiti cholera outbreak, which killed 4672 people and hospitalized thousands more [1, 7]. In the past, plagues were often regionally restricted because limited international migration, to some extent, isolated the transmission of bacterial infectious diseases. However, increasingly close economic ties have transformed the world into a global village. Accordingly, frequent worldwide travel and food transportation increase the likelihood of the rapid and widespread dissemination of hospital-acquired infections and foodborne illnesses and, therefore, have created a pressing need for international surveillance and early outbreak detection to prevent the dissemination of bacterial infectious diseases [8].

An outbreak alert might be triggered by the presence of groups of patients infected by a particular bacterial pathogen. Thus, a prerequisite for tracing bacterial transmissions and detecting outbreaks in their early stages is the development of a rapid and accurate typing technology and an associated international nomenclature database. The performance of a genotyping method is reflected in its resolution: the more finely a scheme resolves genetic relationships, the better suited it is to informing follow-up public health actions. The previous gold-standard techniques for strain classification were multi-locus sequence typing (MLST) and pulsed-field gel electrophoresis (PFGE; see Box 1). MLST is an unambiguous procedure for characterizing isolates of bacterial species using the internal fragment sequences of (usually) seven housekeeping genes [9]. Unfortunately, this technique lacks sufficient discriminatory power due to the high degree of similarity between the strains of particular successful lineages [10]. For example, ST208, the predominant lineage of the multidrug-resistant bacterial pathogen Acinetobacter baumannii, has disseminated worldwide, which prevents MLST from delineating more intricate transmission routes when outbreaks occur [11]. A few bacterial pathogens, including Bacillus anthracis, Mycobacterium tuberculosis and Yersinia pestis, are even more monomorphic and, therefore, lack MLST schemes. PFGE relies on the variable migration of large DNA restriction fragments in an electrical field of alternating polarity. By comparing the fingerprints of any two isolates, one can investigate the degree to which they are genetically related [12]. Similarly, this method lacks the resolution to distinguish between isolates with nearly identical band patterns. For example, methicillin-resistant Staphylococcus aureus of PFGE type USA300 is distributed throughout the United States and even worldwide, making it impossible to identify its more detailed transmission routes during outbreaks [13].

Box 1: Bacterial typing techniques.

  • Serotyping: a method of grouping microorganisms on the basis of the reactions between a given antiserum and cell surface antigens that allows classification at the subspecies level. The serotyping of most bacterial species is based on the detection of flagellar and somatic antigens, although capsular antigens may also be used. Serotyping has poor discriminatory power due to the large numbers of serotypes, cross-reactions to antigens and untypeable strains.

  • PFGE: PFGE is based upon the variable migration of large DNA restriction fragments within an electrical field of alternating polarity. By comparing the fingerprints of any two isolates, one can investigate whether they belong to the same strain (i.e. the two isolates are clonal) or are genetically unrelated. PFGE was introduced two decades ago and remains the ‘gold standard’ for molecular typing methods by PulseNet International for foodborne pathogens. Using PFGE, bacterial isolates that differ by a single genetic event (two or three band differences) are defined as closely related, while multiple band differences are indicative of unrelated strains. However, since PFGE relies on gel electrophoresis and the resolution of large bands, it is not sufficiently sensitive to detect single nucleotide polymorphisms (SNPs).

  • MLST: an unambiguous procedure for characterizing bacterial isolates using the sequences of internal fragments of (usually) seven housekeeping genes. For each housekeeping gene, the different sequences present within a bacterial species are assigned as distinct alleles and, for each isolate, the alleles at each of the seven loci define the allelic profile or sequence type (ST). Isolates with matching STs are defined as clonal, i.e. having a common ancestor. MLST is traditionally performed through polymerase chain reaction amplification and Sanger sequencing of the target regions. It is now cheaper to sequence an entire genome by massively parallel sequencing than to sequence the seven MLST genes individually; accordingly, several user-friendly platforms, such as the Center for Genomic Epidemiology (http://www.genomicepidemiology.org), can calculate STs from a whole genome sequence (in silico MLST). Within the past decade, MLST has broadened its basic scheme from housekeeping genes to incorporate further molecular markers, such as ribosomal proteins (rMLST) and polymorphic repeated sequences (MLVA); most recently, it has begun integrating draft and complete genomes (cgMLST/wgMLST). As a result, traditional MLST has also been called ‘Legacy MLST’.

  • WGS: this refers to sequence determination of the complete or nearly complete genome of a microorganism. This technique is critically and irreversibly changing the way microbial DNA is analyzed, replacing traditional, targeted molecular methods for microbial detection, molecular epidemiology investigations and more in-depth pathogenomic studies. The WGS-based approaches rely on either the characterization of whole genome SNPs or a whole genome gene-by-gene multi-locus sequence typing approach encompassing a stable set of core genome genes (cgMLST).

  • SNP: a popular method for constructing phylogenies on the basis of whole genome SNPs that have been identified either by mapping short-read sequence data to a reference genome or by aligning de novo-assembled sequences to a reference genome. This approach has been used successfully to investigate the epidemiology and evolution of a number of single-clone pathogens or members of the same lineage, but the requirement for a reference sequence makes its application to diverse or highly recombining organisms challenging. In contrast to a reference-based approach, a reference-free approach identifies SNPs directly from the sequence data. This eliminates any bias potentially introduced by the selection of a reference and allows for the detection of SNPs not present in the reference genome. However, a reference-free approach may lead to a higher false discovery rate in SNP calling in the absence of appropriate thresholds.

  • cgMLST: an analysis method that detects variations in genes that are present in the majority (>95%) of strains of a given species. A comparison of the core genome of a given group provides high-resolution data across a group of related but non-identical isolates. As an extension of conventional seven-gene MLST, the cgMLST approach allows immediate comparisons of newly determined genotypes with historical data, enabling continuous surveillance, in contrast to the SNP-based approaches, which require recalculation whenever the data set changes unless a predefined reference genome is used. A cgMLST scheme is usually less discriminatory than the more ‘SNP-like’ wgMLST approach, which types isolates by detecting variations in all genes (core and accessory) of a bacterial species. However, wgMLST can be affected by genetic events such as homologous recombination and the lateral transfer of mobile genetic elements, which can result in high-density SNPs within short segments and thereby distort true phylogenetic relationships. A trade-off between cgMLST and wgMLST is warranted, but the detailed methodology requires future research.
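The MLST entry in Box 1 describes how the alleles at each locus define an allelic profile and hence a sequence type. A minimal Python sketch of that lookup is given here for illustration only: the locus names, allele numbers and ST table are invented placeholders and are not taken from any real MLST scheme or from BacWGSTdb.

```python
from typing import Dict, Optional, Tuple

# Hypothetical seven-locus scheme; the locus names are placeholders.
LOCI = ("locA", "locB", "locC", "locD", "locE", "locF", "locG")

# Hypothetical nomenclature table: allelic profile -> sequence type (ST).
ST_TABLE: Dict[Tuple[int, ...], int] = {
    (1, 1, 1, 1, 1, 1, 1): 1,
    (1, 3, 3, 2, 2, 97, 3): 2,
}


def assign_st(alleles: Dict[str, int]) -> Optional[int]:
    """Return the ST for an isolate, or None if its profile is not in the table."""
    profile = tuple(alleles[locus] for locus in LOCI)
    return ST_TABLE.get(profile)


def allelic_distance(a: Dict[str, int], b: Dict[str, int]) -> int:
    """Count the loci at which two isolates carry different alleles."""
    return sum(a[locus] != b[locus] for locus in LOCI)
```

The same profile comparison carries over to cgMLST and wgMLST, where the number of loci grows from seven to hundreds or thousands.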

The emergence of next-generation sequencing and its continuously decreasing costs have paved the way for the use of whole genome sequencing (WGS) as an accessible and affordable tool for genotyping and outbreak tracking [14]. Analysis of entire pathogen genomes via WGS can provide unprecedented resolution, even discriminating between very closely related bacteria that differ from one another by only a single base pair (bp). In addition, unlike conventional typing methods such as PFGE and MLST, WGS yields a wealth of information from a single technique, including in silico predictions of virulence and antimicrobial resistance [15–17]. In the past few decades, the application of WGS to clinical and public health microbiology has already been demonstrated in proof-of-concept studies conducted retrospectively or in response to emerging outbreaks (see Box 1) [18].

Given that most infectious disease outbreaks take place within a hospital ward, an entire hospital, a city or, at most, a country, most WGS-based outbreak studies have been performed at these scales. However, increasingly frequent travel and food transportation, and the consequent potential for cross-border transmission of bacterial pathogens, have encouraged us to believe that WGS-based global surveillance might become much more important in the near future for raising alerts about ongoing global outbreaks [19, 20]. Until now, most genomic epidemiological studies have followed a top-down approach, i.e. when a common source or epidemiological link is suspected, the isolates collected from foods and patients are sequenced to determine their clonal relationships [21]. In many cases, however, despite enhanced surveillance during outbreaks, missing transmission links are inevitable due to missed sampling, suppression by antimicrobial therapy or delays in identifying contacts. Here we propose the concept of ‘reverse genomic epidemiology’, i.e. when isolates are found to be sufficiently similar by genome comparison, they are deemed to result from infection from a common source. In this way, clinicians and epidemiologists can directly identify similar isolates in the database rather than relying on other epidemiological evidence. In this study, we demonstrate the reliability and feasibility of this bottom-up approach by comparing the in silico prediction of outbreak events with real epidemiological records. It is, therefore, of particular value to establish a database, similar to PulseNet International for PFGE, for multi-country, cross-border surveillance of bacterial transmission and outbreak dynamics.

Results and discussion

Availability of an online database

Box 2 lists several web resources that are widely used for bacterial genotyping. Among them, BacWGSTdb (http://bacdb.org/BacWGSTdb/) is a free, publicly accessible database we have developed for tracing infection sources [22]. The deposited genomic sequences and metadata are retrieved from the NCBI GenBank database. The recently updated version of this database infers phylogenetic relationships using both the MLST and single nucleotide polymorphism (SNP) strategies. In addition, the prediction of antimicrobial resistance and virulence genes is integrated, with the aim of providing a one-stop solution to bacterial genomic analysis for epidemiologists, clinicians and bench scientists. Currently, the database encompasses 20 bacterial species of medical importance.

Figure 1: A screenshot of the Browse interface and an overview of the graphic outputs of the online database, BacWGSTdb. Users can select the isolates of interest based on their sequence type, host, clinical outcome, geographical location or any other of the possible attributes. After selecting a reference genome (for the SNP approach) and an MLST scheme (for the cgMLST/wgMLST approach), the users can click on the ‘Phylogeny construction’ button to initiate a phylogenetic analysis. Both the neighbor joining and minimum spanning trees are provided; these reflect the phylogenetic relationship between the selected isolates.

Box 2: Bacterial typing databases.

  • PulseNet (http://www.pulsenetinternational.org): a network of national and regional laboratories dedicated to tracking foodborne infections worldwide. Each laboratory utilizes standardized genotyping methods and shares information in real time. Basically, it only serves the official users and deposits the fingerprints of isolates submitted by regional Centers for Disease Control and Prevention (CDC) and hospitals. Comparisons of the query isolates with those in the database allow investigators to locate the source and alert the public in a timely manner.

  • PubMLST (http://pubmlst.org) and Pasteur MLST (http://bigsdb.pasteur.fr): two major nomenclature databases that host MLST schemes of over 130 bacterial species. They provide reference nomenclatures for microbial strains and are mainly intended for the molecular epidemiology of pathogens of public health importance and for population biology research.

  • Enterobase (http://enterobase.warwick.ac.uk): a web-based platform that assembles draft genomes from Illumina short reads that are in the public domain or uploaded by users. It provides curated databases and online resources for the molecular typing of Salmonella, E. coli, Yersinia spp. and Moraxella spp. using the cgMLST/wgMLST approach.

  • NCBI Pathogen Detection (https://www.ncbi.nlm.nih.gov/pathogens): an online platform that integrates bacterial pathogen genomic sequences retrieved from the NCBI GenBank database to share and compare SNP data on outbreak strains. It currently contains databases for 22 bacterial species, focusing on foodborne pathogens and healthcare-associated infections. It is still a beta version and does not implement a data analysis module to handle user-uploaded genome sequences.

  • BacWGSTdb (http://bacdb.org/BacWGSTdb): a database that offers genotyping by both the cgMLST/wgMLST and SNP strategies for 20 nosocomial and foodborne bacterial pathogens. With the aim of source tracking, users can either upload their own genome sequence(s) to find close isolates in the database or browse the isolates in the database and investigate their phylogenetic relationships. In addition, it also provides online analyses of in silico MLST and predictions of antimicrobial resistance and virulence genes.

Figure 2: A screenshot of the single genome analysis Analytical Tool interface and an overview of the graphic outputs. This tool offers both SNP and MLST approaches for investigating the phylogenetic relationship of the user uploaded genome sequence with those deposited in the BacWGSTdb. Predictions of virulence and the antimicrobial resistance genes are integrated into this tool.

Figure 3: A screenshot of the multiple genome analysis Analytical Tool interface and an overview of graphic outputs. Unlike the single genome analysis module, this one allows users to upload multiple query genome sequences and returns their phylogenetic relatedness following both the SNP and MLST approaches. Predictions of virulence and the antimicrobial resistance genes are also integrated into this tool.

Use of BacWGSTdb includes two major parts: Browse and Analytical Tools. The Browse function is designed to visualize and compare isolates deposited in the database. When users browse BacWGSTdb, the isolates can be sorted based on their sequence type, host, clinical outcome, geographical location or any other attributes. The phylogenetic relationship among the selected isolates is also provided (Figure 1). The Analytical Tool function can be further split into analysis of the user uploaded single or multiple genomes. For single genome analyses, the information on the genetically close isolates in the database is given on the basis of their genomic similarity to the query genome (Figure 2). For multiple genome analyses, the phylogenetic relatedness among the user uploaded genomes is provided rather than a comparison against the isolates deposited in the database (Figure 3).

The application of reverse genomic epidemiology for the detection of worldwide outbreaks

Here we screened BacWGSTdb for isolates with highly similar genomes that were obtained from different countries and regions. The typing strategy we chose was core genome multi-locus sequence typing (cgMLST), which extends conventional MLST by typing gene targets across the core genome [23]. In BacWGSTdb, the development of a species-specific cgMLST scheme involves the following steps. First, the coding sequences extracted from all of the publicly available complete genomes (the chromosomes) are pooled to construct a species-specific pan-genome. Next, the transposon and plasmid genes are removed in order to minimize the effect of lateral gene transfer on phylogenetic reconstruction. Then, genes shorter than 50 bp are filtered out, and where two genes overlap by >4 nucleotides the shorter one is also excluded. The remaining genes constitute the target genes of the whole genome multi-locus sequence typing (wgMLST) scheme. Finally, all of the publicly available genomes (both complete and draft) are compared against the wgMLST scheme, and genes that are present in >95% of the query genomes are selected as targets for the cgMLST scheme.
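The filtering steps described above can be sketched in Python as follows. This is a simplified illustration, not the BacWGSTdb implementation: the `Gene` record, the single shared coordinate system, the greedy longest-first handling of overlaps and the `presence` mapping (gene name to the set of genomes in which it was found) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class Gene:
    name: str
    start: int            # 1-based inclusive start coordinate
    end: int              # 1-based inclusive end coordinate
    mobile: bool = False  # True for transposon or plasmid genes

    @property
    def length(self) -> int:
        return self.end - self.start + 1


def overlap(a: Gene, b: Gene) -> int:
    """Number of nucleotides shared by two genes (0 if they do not overlap)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def wgmlst_targets(genes: List[Gene], min_len: int = 50,
                   max_overlap: int = 4) -> List[Gene]:
    """Drop mobile genes, genes shorter than min_len, and the shorter gene
    of any pair overlapping by more than max_overlap nucleotides."""
    candidates = [g for g in genes if not g.mobile and g.length >= min_len]
    kept: List[Gene] = []
    # Assumption: consider longer genes first so the shorter of a pair is dropped.
    for g in sorted(candidates, key=lambda g: g.length, reverse=True):
        if all(overlap(g, k) <= max_overlap for k in kept):
            kept.append(g)
    return sorted(kept, key=lambda g: g.start)


def cgmlst_targets(targets: List[Gene], presence: Dict[str, Set[str]],
                   n_genomes: int, min_fraction: float = 0.95) -> List[str]:
    """Keep wgMLST targets found in more than min_fraction of the genomes."""
    return [g.name for g in targets
            if len(presence.get(g.name, set())) / n_genomes > min_fraction]
```

In practice, gene presence would be determined by sequence comparison of each genome against the wgMLST targets; that step is outside the scope of this sketch.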

Although cgMLST has already been recommended for strain typing in public health research due to its capacity to readily scale analyses from the genus to the clone level, no consensus has yet been reached on the degree of similarity that two bacterial isolates must share to establish a common outbreak origin. The reasons are complex because the number of allelic differences may vary with the lifecycle of the bacterium, the time-frame of the sampling (a longer infection time could lead to an increased number of allelic variants), the mutation frequency of the bacterium and even the identity of the individual bacterial colony selected from the agar plate. To date, the relatedness criteria for cgMLST schemes of representative clinically relevant bacteria have ranged from 0 to 24 loci [24].

Here we calculated pairwise genetic distances using 10 loci as a provisional threshold for determining clonal relationships. Pairs of isolates that met the threshold but were obtained from different countries were selected (Table 1). While most of the identified cross-border bacterial disseminations occurred only between two or three countries, five networks involved phylogenetically closely related isolates that appeared in at least five countries. In other words, these five networks may suggest pandemic clonal outbreaks; we have thus highlighted them as examples to illustrate the use of reverse genomic epidemiology.
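The screening step described above can be sketched as follows. This is an illustrative Python example under stated assumptions: allelic profiles are given as equal-length lists of allele numbers, loci missing in either isolate are ignored when counting differences, and the cut-off is applied as fewer than 10 differing loci. The function names and data layout are hypothetical and are not BacWGSTdb's interface.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Optional, Set

# One allele number per cgMLST locus; None marks a locus missing from the assembly.
Profile = List[Optional[int]]


def cgmlst_distance(a: Profile, b: Profile) -> int:
    """Count loci with different alleles, ignoring loci missing in either isolate."""
    return sum(1 for x, y in zip(a, b)
               if x is not None and y is not None and x != y)


def cross_border_links(profiles: Dict[str, Profile],
                       country: Dict[str, str],
                       threshold: int = 10) -> Set[FrozenSet[str]]:
    """Return the set of country pairs linked by at least one pair of
    isolates differing by fewer than `threshold` loci."""
    links: Set[FrozenSet[str]] = set()
    for i, j in combinations(sorted(profiles), 2):
        if country[i] == country[j]:
            continue  # only cross-border pairs are of interest
        if cgmlst_distance(profiles[i], profiles[j]) < threshold:
            # A set collapses multiple isolate pairs from the same two countries.
            links.add(frozenset((country[i], country[j])))
    return links
```

Storing each link as an unordered country pair means that repeated isolate pairs from the same two countries are counted once.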

Case 1: the outbreak of enterohemorrhagic Escherichia coli O104:H4 in Germany (2011)

The first identified network arises from the Shiga toxin-producing E. coli O104:H4 that caused a large foodborne outbreak in the summer of 2011 (Figure 4). While originating in Germany, this outbreak soon spread across multiple countries including France, Denmark, the United States, Norway, Sweden, the UK and Canada, leading to >3000 infection cases and >40 deaths [25]. The majority of the isolates collected from this outbreak were successfully identified in this study, as they met the <10 loci criterion. As a control, we also calculated the genetic distances among the German isolates from this outbreak and found that the maximum distance reached 18 loci, suggesting that 10 loci is a conservative criterion for inferring an epidemiological link, at least in this case.

Although European epidemiologic studies, including analyses of restaurant cohorts and trace back investigations, ultimately implicated raw fenugreek sprouts as the food vehicle, none of the patients who had traveled to Germany from the United States in the period immediately preceding the epidemic definitively recalled having consumed such sprouts [26]. Strictly speaking, these American patients lacked a solid epidemiological link with the outbreak. These events highlight the challenges entailed in investigating outbreaks by the traditional top-down approach, and particularly epidemics caused by rare pathogens or associated with food vehicles that are consumed in small quantities as parts of other dishes. This is indeed why we promote reverse genomic epidemiology for practical use.

Case 2: the outbreak of Vibrio cholerae in Haiti (2010)

The second network is associated with V. cholerae. When the genomes of V. cholerae isolates from the outbreak in Haiti were placed into a phylogenetic context, we discovered that they were closely related to those recovered from Mozambique, South Africa, Russia, the United States, India and Bangladesh (Figure 4). The Haiti outbreak began in October 2010, 9 months after an earthquake; it caused cholera infection in nearly 800 000 Haitians and led to more than 9000 deaths [27]. No history of cholera had been reported in Haiti prior to 2010. Interestingly, the isolates from the other countries were collected much earlier than 2010, and their patients had all been travelers or immigrants to Haiti from the other countries, suggesting the introduction of cholera from an external source into Haiti [28]. Thus, we speculate that the pandemic V. cholerae clone emerged initially in Asia or Africa and was later introduced to Haiti by chance, where it replaced the indigenous clone and became endemic.

Table 1

The prediction of international transmission events following reverse genomic epidemiology.

Bacterial species             No. of transmission events (between countries/states)
Acinetobacter baumannii       14/2
Bacillus anthracis            3/0
Bacillus cereus               0/0
Campylobacter jejuni          0/0
Campylobacter coli            3/0
Clostridioides difficile      9/0
Enterococcus faecalis         2/2
Enterococcus faecium          5/3
Escherichia coli              137/24
Klebsiella pneumoniae         4/4
Listeria monocytogenes        13/4
Mycobacterium abscessus       6/0
Mycobacterium tuberculosis    16/0
Staphylococcus aureus         12/1
Salmonella enterica           77/79
Streptococcus agalactiae      1/1
Streptococcus pneumoniae      1/0
Streptococcus suis            0/2
Vibrio cholerae               20/0
Yersinia pestis               1/1

Note: We calculated the pairwise genetic distance between isolates stored in the BacWGSTdb by cgMLST, using 10 loci as the threshold for the determination of the clonal relationship. Pairs of isolates that met the threshold but were obtained from different countries (or from states more than 500 km apart) were selected. Multiple isolates originating from the same pair of regions were counted only once.
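The note above applies a 500 km separation rule to states. The source does not say how that distance was measured; as one possible reading, the sketch below computes the great-circle (haversine) distance between two representative coordinates per state. The function names, the choice of haversine and the mean Earth radius of 6371 km are assumptions made for illustration.

```python
import math
from typing import Tuple

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def far_apart(a: Tuple[float, float], b: Tuple[float, float],
              min_km: float = 500.0) -> bool:
    """True if two (latitude, longitude) points are more than min_km apart."""
    return haversine_km(a[0], a[1], b[0], b[1]) > min_km
```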

Cases 3 and 4: dissemination of Salmonella enterica in Africa (2000s)

The third and fourth networks involve the global dissemination of S. enterica serovar Typhi. The former reflects the prevalence of multidrug-resistant S. Typhi in the Gulf of Guinea region of Africa [29], and the latter is an unpublished event that involved 12 different countries (Figure 4). As in the V. cholerae case above, most isolates in the former network were obtained from travelers or immigrants to the area; the circumstances of the latter network are unknown. Carriers of the strains may be symptomatic or asymptomatic, with asymptomatic carriers posing the more severe risk to public health. The most famous asymptomatic carrier of S. Typhi is probably ‘Typhoid Mary’, who incited a minimum of seven outbreaks of typhoid fever, resulting in 57 cases of the disease and 3 deaths [30]. Traditional epidemiology tends to lack the resources to deal with asymptomatic carriers. The more transmitters go unseen and unsampled, the more difficult it is to reconstruct the transmission route accurately, and the more likely an outbreak is to occur eventually. However, this surveillance issue might be resolved using reverse genomic epidemiology, because this approach does not require clinical metadata.

Interestingly, the two Salmonella networks differ from one another by only 166 loci, and might be grouped together into one clone by the traditional typing approaches. In contrast, WGS can better separate out strains uninvolved in an outbreak by excluding them as epidemiologically unrelated when their divergence exceeds a pre-established threshold; this underscores the high resolution of this method.

Figure 4: Whole genome sequencing for tracking the global bacterial transmission scenarios of four representative bacterial species: A. baumannii (A), E. coli (B), S. enterica (C) and V. cholerae (D). The geographical locations of all of these closely related isolates (differing from one another by <10 loci) were pinpointed on geographic maps, and the topology of the lines connecting these isolates follows a minimum spanning tree constructed from the cgMLST analysis. The numbers on the lines give the number of allelic differences, while the collection years are color coded and listed to the right of the circles. The minimum spanning tree on the right was constructed for the control isolates, which were collected from a single site (marked with a red pentagram on each left-side map).
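The Figure 4 caption refers to minimum spanning trees built from cgMLST allelic differences. The source does not state which algorithm was used; the sketch below uses Prim's algorithm as one standard choice, with a hypothetical distance table keyed by pairs of isolate names.

```python
from typing import Dict, List, Tuple


def minimum_spanning_tree(names: List[str],
                          dist: Dict[Tuple[str, str], int]
                          ) -> List[Tuple[str, str, int]]:
    """Prim's algorithm over a complete graph of isolates.

    `dist` holds one allelic distance per unordered pair of isolates.
    Returns the tree edges as (isolate_in_tree, new_isolate, distance).
    """
    def d(a: str, b: str) -> int:
        return dist[(a, b)] if (a, b) in dist else dist[(b, a)]

    in_tree = [names[0]]
    remaining = set(names[1:])
    edges: List[Tuple[str, str, int]] = []
    while remaining:
        # Pick the cheapest edge leaving the tree; ties are broken by name
        # so that the result is deterministic.
        u, v = min(((a, b) for a in in_tree for b in remaining),
                   key=lambda e: (d(*e), e))
        edges.append((u, v, d(u, v)))
        in_tree.append(v)
        remaining.remove(v)
    return edges
```

The edge weights returned here correspond to the numbers of allelic differences that the figure prints on the connecting lines.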

Case 5: the dissemination of Acinetobacter baumannii in East Asia (2010s)

The four networks above are all associated with foodborne diseases. Given the accelerating international food trade, such cross-border outbreaks are to be expected. The last network, however, involves a hospital-associated bacterium, A. baumannii. This pathogen was initially referred to as ‘Iraqibacter’ because it came to Western attention when American soldiers returning from Iraq carried one of its clones back to the United States, where it continued to disseminate widely in the American nosocomial environment [31, 32]. In this case, we have discovered a previously unrecognized, putative transmission route for strains in East and Southeast Asia (Figure 4). The genomic relationship between these isolates was even closer than that between isolates collected from the same hospital (>19 loci), which supports a direct cross-border transmission scenario. As an opportunistic pathogen, A. baumannii is often found at the stage of colonization rather than infection, which might facilitate its wide dissemination.

Challenges facing reverse genomic epidemiology

Given the cases presented above, and particularly the verified outbreaks of E. coli, S. enterica and V. cholerae, it is reasonable to believe that the number of cross-border transmission events has been largely underestimated. The underestimation can arise at any point in the following chain: asymptomatic carriers; symptomatic carriers who do not receive medical treatment; patients who receive proper medical treatment but whose pathogens are not sampled or not saved; and patients whose pathogens are sampled but remain unsequenced, or whose genomic sequences are not submitted to a public database. Reverse genomic epidemiology could have been successfully applied to cross-border transmission surveillance in any of these cases. To fully exert the power of reverse genomic epidemiology, it is therefore imperative that a worldwide accessible database for storing and sharing bacterial genotypes be established and maintained, and that the chain issues mentioned above be properly addressed.

Aside from political and commercial issues, reverse genomic epidemiology is confronted with a variety of technical demands. The first challenge is the comparability and compatibility of sequence data generated on different platforms. Although Illumina has become the de facto standard for foodborne illness diagnosis and surveillance, newer technologies such as PacBio and MinION are emerging; their utility in outbreak investigation has recently been demonstrated [21, 33]. The disparate throughput and error profiles of different sequencers may lead to different levels of completeness and quality in the resulting genomes. As optimists, however, we believe that the rapid development of sequencing technologies will minimize these discrepancies and ultimately draw our attention back to the real biological differences between bacterial isolates.

A more urgent challenge is the lack of internationally agreed standards for sequence analysis and data interpretation. In the frequent absence of an index patient and/or isolate, it is imperative to determine the threshold for a clonal relationship [34, 35]. For MLST, the definition of a clonal complex is clear: isolates of the same clonal complex may differ by only one or two alleles. For PFGE, the definition of a clone is also explicit: isolates that differ by more than two bands are interpreted as belonging to different clones. Unfortunately, no coherent threshold has been established for WGS analysis, and what constitutes a significant difference must be determined on an organism-by-organism and even case-by-case basis. Revisiting the genomic epidemiology of the E. coli O104:H4 outbreaks in Europe reveals that isolates from the German-centered outbreak form a clade (2 SNPs) within the more diverse French outbreak strains (19 SNPs) [18]. This striking variance in diversity between the German and French outbreak samples could be explained by several hypotheses: a bottleneck eradicated the diversity in the German isolates, the mutation rates differed between the two E. coli outbreak populations, or diversity was unevenly distributed in the seed populations leading to each outbreak. Consequently, the interpretation of WGS results for epidemiologic investigations should rely both on genetic distance (SNP or allelic differences) and on the topology of the dendrogram or phylogeny in order to draw conclusions about relatedness. Although a similarity threshold can guide the identification of clusters of potentially related isolates, isolates that lie beyond this threshold but are topologically associated justify further scrutiny to determine their potential relatedness.
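One way to combine a distance threshold with tree topology, as recommended above, is sketched below in Python. The sketch builds a minimum spanning tree over a matrix of pairwise differences (Prim's algorithm) and then sorts the tree edges into those below the threshold and those in a 'review' band just beyond it. The function names, the width of the review band and the toy distance matrix are assumptions made for illustration; they do not reproduce any published protocol.

```python
def minimum_spanning_tree(dist):
    """Prim's algorithm on a symmetric distance matrix.

    Returns the tree as a list of edges (i, j, distance), starting from isolate 0.
    """
    n = len(dist)
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        # Pick the shortest edge that connects the tree to a new isolate.
        i, j = min(
            ((i, j) for i in in_tree for j in range(n) if j not in in_tree),
            key=lambda edge: dist[edge[0]][edge[1]],
        )
        edges.append((i, j, dist[i][j]))
        in_tree.add(j)
    return edges


def triage_edges(edges, threshold, review_margin):
    """Split tree edges into 'linked' (below threshold) and 'review' (just beyond it)."""
    linked, review = [], []
    for i, j, d in edges:
        if d < threshold:
            linked.append((i, j, d))
        elif d < threshold + review_margin:
            review.append((i, j, d))
    return linked, review


# Toy matrix of pairwise differences between four isolates (hypothetical values).
dist = [
    [0, 2, 12, 40],
    [2, 0, 11, 41],
    [12, 11, 0, 39],
    [40, 41, 39, 0],
]
tree = minimum_spanning_tree(dist)
linked, review = triage_edges(tree, threshold=10, review_margin=5)
print(linked)  # edges supporting a clonal relationship
print(review)  # edges beyond the threshold that still merit scrutiny
```

Here isolates 0 and 1 are linked outright, isolate 2 attaches to the same branch at a distance slightly above the threshold and is flagged for review, and isolate 3 is treated as unrelated. The point of the design is that a borderline isolate is not silently discarded when the topology places it next to a cluster.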

Conclusions and perspectives

The world’s increasing internationalization is leading to the global dissemination of bacterial infections; the mechanisms are often obscure, making reverse genomic epidemiology a necessary endeavor. The technical procedures involved and the corresponding criteria for this genomic technique require standardization, which in turn calls for extensive academic research to provide training data. Furthermore, in order to connect the growing number of genomes with a global surveillance database, a wealth of clinical and genomic sequencing data needs to be compiled for bacterial isolates collected across broad temporal and spatial ranges. Unbiased sampling is essential, as previous studies often selected isolates for sequencing and typing based on pathogenicity and convenience, while overlooking much of the diversity needed to inform public health policy decisions. From a policy-making standpoint, surveillance should no longer be carried out merely at the national level; international collaboration is urgently needed. A consensus must be reached such that sequencing data from every patient, clinician and country can be willingly uploaded and shared in real time.

An exciting future lies ahead, in which outbreak investigation is revolutionized by reverse genomic epidemiology. One day, bacterial isolates collected from patients will be sequenced in real time, and the genomic information obtained will guide treatment by revealing their predicted antibiotic resistance and virulence potential. More importantly, an automated alert will be triggered if highly related isolates are detected in the database. A predicted source and transmission route will be revealed, helping clinicians to prevent and control outbreaks in their infancy.
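The automated alert envisioned above could, in its simplest form, look like the following Python sketch. It assumes a hypothetical in-memory collection of isolates, each with a country of collection and a cgMLST profile, and raises a cross-border alert when a newly sequenced isolate closely matches a stored isolate from another country. The Isolate class, the function names, the data and the threshold are all illustrative assumptions; this is not the BacWGSTdb implementation.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Isolate:
    name: str
    country: str
    profile: Tuple[int, ...]  # allele number at each cgMLST locus


def related_isolates(query, database, threshold):
    """Return (isolate, distance) pairs below the threshold, closest first."""
    hits = []
    for record in database:
        distance = sum(1 for x, y in zip(query.profile, record.profile) if x != y)
        if distance < threshold:
            hits.append((record, distance))
    return sorted(hits, key=lambda hit: hit[1])


def cross_border_alert(query, database, threshold):
    """True when at least one close match was collected in a different country."""
    return any(
        record.country != query.country
        for record, _ in related_isolates(query, database, threshold)
    )


# Hypothetical stored isolates and a newly sequenced case.
database = [
    Isolate("iso_1", "Country A", (1, 4, 2, 7, 3, 5)),
    Isolate("iso_2", "Country B", (1, 4, 2, 7, 3, 6)),
    Isolate("iso_3", "Country C", (2, 9, 8, 1, 6, 4)),
]
query = Isolate("new_case", "Country A", (1, 4, 2, 7, 3, 6))

hits = related_isolates(query, database, threshold=2)
for record, distance in hits:
    print(record.name, record.country, distance)
print("cross-border alert:", cross_border_alert(query, database, threshold=2))
```

A production system would additionally need collection dates, quality filters and an organism-specific threshold, none of which are modeled in this sketch.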

Key Points

  • Owing to its ultimate resolution of a single base pair, WGS has become a powerful and attractive tool for the rapid typing of bacterial isolates and has gradually replaced PFGE and traditional MLST as a new ‘gold standard’ in molecular epidemiology for the future surveillance of both local and global infectious diseases.

  • Classic genomic epidemiology utilizes WGS to verify suspected epidemiological links, whereas ‘reverse genomic epidemiology’ establishes epidemiological links solely through the genomic similarity of isolates, without requiring clinical metadata.

  • Frequent and continual worldwide travel and food transportation often lead to an absence of index patients as well as epidemiological links and, therefore, call for the urgent employment of reverse genomic epidemiology.

  • To fully exert the power of reverse genomic epidemiology, it is imperative to standardize the procedure for determining clonal relationships and establish a worldwide accessible database for storing and sharing bacterial genomes as well as epidemiological metadata.

  • We accordingly developed an online database called BacWGSTdb (http://bacdb.org/BacWGSTdb/), with the aim of providing a one-stop solution to epidemiological outbreak analysis and pioneering the movement of WGS from proof-of-concept studies to routine use in the clinical microbiology laboratory.

Funding

National Natural Science Foundation of China (31670132, 81401698) and Zhejiang Province Public Welfare Technology Application Research Project (LGF18H190001).

Zhi Ruan is an assistant professor at Sir Run Run Shaw Hospital, Zhejiang University School of Medicine, China. His primary research interest is microbial genomics and the application of bioinformatics to studying the molecular epidemiology of infectious diseases.

Yunsong Yu is a professor at Sir Run Run Shaw Hospital, Zhejiang University School of Medicine, China. He has an active clinical practice of infectious diseases and currently leads several large-scale bacterial whole genome sequencing projects in elucidating the molecular epidemiology, pathogenesis and antimicrobial resistance of the so-called ‘ESKAPE’ bacterial pathogens.

Ye Feng is a bioinformatics scientist at Zhejiang University School of Medicine, China. His research interests are developing and utilizing comparative genomic approaches for the investigation of microbial evolution and epidemiology.

References

1. Chin CS, Sorenson J, Harris JB, et al. The origin of the Haitian cholera outbreak strain. N Engl J Med 2011;364:33–42.

2. Khan S, Imran A, Malik A, et al. Bacterial imbalance and gut pathologies: association and contribution of E. coli in inflammatory bowel disease. Crit Rev Clin Lab Sci 2018. doi: .

3. Hassel B. Tetanus: pathophysiology, treatment, and the possibility of using botulinum toxin against tetanus-induced rigidity and spasms. Toxins 2013;5:73–83.

4. Khan S, Zakariah M, Rolfo C, et al. Prediction of mycoplasma hominis proteins targeting in mitochondria and cytoplasm of host cells and their implication in prostate cancer etiology. Oncotarget 2017;8:30830–43.

5. Khan S. Potential role of Escherichia coli DNA mismatch repair proteins in colon cancer. Crit Rev Oncol Hematol 2015;96:475–82.

6. Alshamsan A, Khan S, Imran A, et al. Prediction of Chlamydia pneumoniae protein localization in host mitochondria and cytoplasm and possible involvements in lung cancer etiology: a computational approach. Saudi Pharm J 2017;25:1151–7.

7. Piarroux R, Barrais R, Faucher B, et al. Understanding the cholera epidemic, Haiti. Emerg Infect Dis 2011;17:1161–8.

8. Ronholm J, Nasheri N, Petronella N, et al. Navigating microbiological food safety in the era of whole-genome sequencing. Clin Microbiol Rev 2016;29:837–57.

9. Maiden MC, Bygraves JA, Feil E, et al. Multilocus sequence typing: a portable approach to the identification of clones within populations of pathogenic microorganisms. Proc Natl Acad Sci U S A 1998;95:3140–3145.

10. Gilchrist CA, Turner SD, Riley MF, et al. Whole-genome sequencing in outbreak analysis. Clin Microbiol Rev 2015;28:541–563.

11. Feng Y, Ruan Z, Shu J, et al. A glimpse into evolution and dissemination of multidrug-resistant Acinetobacter baumannii isolates in East Asia: a comparative genomics study. Sci Rep 2016;6:24342.

12. Swaminathan B, Barrett TJ, Hunter SB, et al. PulseNet: the molecular subtyping network for foodborne bacterial disease surveillance, United States. Emerg Infect Dis 2001;7:382–389.

13. Tenover FC, Goering RV. Methicillin-resistant Staphylococcus aureus strain USA300: origin and epidemiology. J Antimicrob Chemother 2009;64:441–446.

14. Quainoo S, Coolen JPM, van Hijum S, et al. Whole-genome sequencing of bacterial pathogens: the future of nosocomial outbreak analysis. Clin Microbiol Rev 2017;30:1015–1063.

15. Besser J, Carleton HA, Gerner-Smidt P, et al. Next-generation sequencing technologies and their application to the study and control of bacterial infections. Clin Microbiol Infect 2018;24:335–341.

16. Zakariah M, Khan S, Chaudhary AA, et al. To decipher the mycoplasma hominis proteins targeting into the endoplasmic reticulum and their implications in prostate cancer etiology using next-generation sequencing data. Molecules 2018;23:E994.

17. Khan S, Zakariah M, Palaniappan S. Computational prediction of Mycoplasma hominis proteins targeting in nucleus of host cell and their implication in prostate cancer etiology. Tumour Biol 2016;37:10805–10813.

18. Gardy JL, Loman NJ. Towards a genomics-informed, real-time global pathogen surveillance system. Nat Rev Genet 2018;19:9–20.

19. Pak TR, Kasarskis A. How next-generation sequencing and multiscale data analysis will transform infectious disease management. Clin Infect Dis 2015;61:1695–1702.

20. Rossen JWA, Friedrich AW, Moran-Gilad J. Practical issues in implementing whole-genome-sequencing in routine diagnostic microbiology. Clin Microbiol Infect 2018;24:355–360.

21. Deng X, den Bakker HC, Hendriksen RS. Genomic epidemiology: whole-genome-sequencing-powered surveillance and outbreak investigation of foodborne bacterial pathogens. Annu Rev Food Sci Technol 2016;7:353–374.

22. Ruan Z, Feng Y. BacWGSTdb, a database for genotyping and source tracking bacterial pathogens. Nucleic Acids Res 2016;44:D682–D687.

23. Maiden MC, Jansen van Rensburg MJ, Bray JE, et al. MLST revisited: the gene-by-gene approach to bacterial genomics. Nat Rev Microbiol 2013;11:728–736.

24. Schurch AC, Arredondo-Alonso S, Willems RJL, et al. Whole genome sequencing options for bacterial strain typing and epidemiologic analysis based on single nucleotide polymorphism versus gene-by-gene-based approaches. Clin Microbiol Infect 2018;24:350–354.

25. Grad YH, Lipsitch M, Feldgarden M, et al. Genomic epidemiology of the Escherichia coli O104:H4 outbreaks in Europe 2011. Proc Natl Acad Sci U S A 2012;109:3065–3070.

26. Hebbelstrup Jensen B, Olsen KE, Struve C, et al. Epidemiology and clinical manifestations of enteroaggregative Escherichia coli. Clin Microbiol Rev 2014;27:614–630.

27. Reimer AR, Van Domselaar G, Stroika S, et al. Comparative genomics of Vibrio cholerae from Haiti, Asia, and Africa. Emerg Infect Dis 2011;17:2113–2121.

28. Orata FD, Keim PS, Boucher Y. The 2010 cholera outbreak in Haiti: how science solved a controversy. PLoS Pathog 2014;10:e1003967.

29. Baltazar M, Ngandjio A, Holt KE, et al. Multidrug-resistant Salmonella enterica serotype Typhi, Gulf of Guinea Region, Africa. Emerg Infect Dis 2015;21:655–659.

30. Mary Mallon (Typhoid Mary). Am J Public Health Nations Health 1939;29:66–68.

31. Asif M, Alvi IA, Rehman SU. Insight into Acinetobacter baumannii: pathogenesis, global resistance, mechanisms of resistance, treatment options, and alternative modalities. Infect Drug Resist 2018;11:1249–1260.

32. Ruan Z, Chen Y, Jiang Y, et al. Wide distribution of CC92 carbapenem-resistant and OXA-23-producing Acinetobacter baumannii in multiple provinces of China. Int J Antimicrob Agents 2013;42:322–328.

33. Quick J, Ashton P, Calus S, et al. Rapid draft sequencing and real-time nanopore sequencing in a hospital outbreak of Salmonella. Genome Biol 2015;16:114.

34. Reuter S, Ellington MJ, Cartwright EJ, et al. Rapid bacterial whole-genome sequencing to enhance diagnostic and public health microbiology. JAMA Intern Med 2013;173:1397–1404.

35. Perez-Losada M, Arenas M, Castro-Nallar E. Microbial sequence typing in the genomic era. Infect Genet Evol 2018;63:346–359.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://dbpia.nl.go.kr/journals/pages/open_access/funder_policies/chorus/standard_publication_model)