Abstract

Background and Aims

Escherichia coli is over-abundant in the gut microbiome of patients with inflammatory bowel disease [IBD]. Here, we aimed to identify IBD-specific genomic functions of diverse E. coli lineages.

Methods

We investigated E. coli genomes from patients with ulcerative colitis [UC], Crohn’s disease [CD] or a pouch, and healthy subjects. The majority of genomes were reconstructed from metagenomic samples, including newly sequenced faecal metagenomes. Clinical metadata were collected. Functional analysis at the gene and mutation level were performed and integrated with IBD phenotypes and biomarkers.

Results

Overall, 530 E. coli genomes were analysed. The E. coli B2 lineage was more prevalent in UC compared with other IBD phenotypes. Genomic metabolic capacities varied across E. coli lineages and IBD phenotypes. Host mucin utilisation enzymes were present in a single lineage and depleted in patients with a pouch, whereas those involved in inulin hydrolysis were enriched in patients with a pouch. E. coli strains from patients with UC were twice as likely to encode the genotoxic molecule colibactin than strains from patients with CD or a pouch. Strikingly, patients with a pouch showed the highest inferred E. coli growth rates, even in the presence of antibiotics. Faecal calprotectin did not correlate with the relative abundance of E. coli. Finally, we identified multiple IBD-specific non-synonymous mutations in E. coli genes encoding for bacterial cell envelope components.

Conclusions

Comparative genomics indicates that E. coli is a commensal species adapted to the overactive mucosal immune milieu in IBD, rather than causing it. Our results reveal mutations that may lead to attenuated antigenicity in some E. coli strains.

1. Introduction

Inflammatory bowel diseases [IBD], including Crohn’s disease [CD], ulcerative colitis [UC], and pouchitis, are multifactorial diseases characterised by chronic and relapsing intestinal inflammation. CD typically affects the small intestine and colon, UC is confined to the colon, and pouchitis is inflammation of the previously normal small intestine comprising the pouch in patients with UC after proctocolectomy and ileal pouch-anal anastomosis.1 The aetiology of IBD is believed to involve an aberrant immune response towards the intestinal microbiota2 in a genetically predisposed host,3 induced by environmental triggers.4

The gut microbiome in patients with IBD is distinguished by taxonomic5,6 and functional7,8 dysbiosis. One of its hallmarks is the expansion of Enterobacteriaceae, in particular of Escherichia coli in both faecal8 and mucosal5,9 samples from patients with IBD. E. coli is a prevalent commensal in the human gut microbiota,10 but has high within-species variation in functional potential11 and some strains are pathogens, causing intestinal or extra-intestinal diseases.12 One E. coli strain [Nissle 1917] has been considered as a probiotic agent and reported to prolong remission in patients with UC.13

E. coli has been extensively studied in the context of IBD pathogenesis, yet the focus has been mostly on the adherent-invasive E. coli [AIEC] pathotype.14,15 AIEC strains are characterised by their phenotype of adhesion and invasion into intestinal epithelial cells and survival and replication within macrophages without prompting cell death,9 and are more prevalent in mucosal biopsies of patients with CD compared with patients with UC or healthy controls.9,16–18 Roughly 65% of the AIEC biopsy-derived isolates belong to phylogroup B2,16,19 which contains both commensal and many extra-intestinal pathogenic E. coli [ExPEC] strains. Such ExPEC strains typically have pathogenicity islands encoding capsular antigens, iron acquisition systems, cytotoxins, and haemolysins.20

Despite considerable efforts, comparative genomic analysis of AIEC strains failed to identify any discriminative genomic markers of this phenotype.21 Most studies investigating E. coli as a potential pathogen in IBD focused exclusively on AIEC, and the few previous comparative genomic studies of E. coli strains in IBD were limited by the number and diversity of the genomes available.22–24 Thus it remains unknown if there are E. coli lineages, genes, or alleles that are better adapted to patients with IBD. Here, we performed a large-scale comparative genomic study of E. coli from patients with CD, UC, and an ileal pouch, aiming to reveal IBD-specific or phenotype-specific adaptations and also test to what degree those strains are expected to promote intestinal inflammation.

2. Materials and Methods

The study was approved by the local institutional review board [0298–17] and the National Institutes of Health [NCT01266538]. All patients signed informed consent before inclusion to the study.

2.1. E. coli genomes from patients with IBD and healthy subjects and their metadata

The genomes of E. coli from patients with a pouch, CD, or UC, and from healthy subjects, were obtained from metagenomic and isolate genome data from multiple published studies [Supplementary Table S1]. In addition, 58 newly sequenced metagenomic samples from 22 patients with a pouch [including two UC samples prior to proctocolectomy and four stoma samples prior to pouch surgery], nine isolate genomes from patients with a pouch, and three isolate genomes from patients with CD from Rabin Medical Center [RMC], Israel, were added to the analysis. Publicly available isolate genomes from patients with CD or UC and from healthy subjects, and related metadata, were downloaded from the PATRIC database.25 Publicly available metagenomes used for reconstructing metagenome-assembled genomes [MAGs] were downloaded from the NCBI-SRA repository and included all phenotypes [Supplementary Table S1]. In total, the comparative genomic analysis included 530 [IBD- and healthy subject-derived] genomes and two versions of the E. coli Nissle-1917 genome.

Patients after pouch surgery were recruited at a dedicated pouch clinic in a tertiary IBD referral centre in Israel [RMC]. The study was approved by the local institutional review board [0298–17] and the National Institutes of Health [NCT01266538]. All patients signed informed consent before inclusion. Demographic and clinical data were obtained during clinic visits. Faecal sample collection, faecal calprotectin measurements, genomic DNA extraction and shotgun metagenomic sequencing, pouch disease behaviour [phenotype] definition, and treatment protocol were performed as described in Dubinsky et al.26

2.2. Pipeline for the reconstruction of genomes from metagenomes

To reconstruct the genomes of E. coli directly from metagenome samples [E. coli MAGs], we applied the following single-sample metagenomic assembly pipeline to each sample. [i] Metagenome reads were quality-filtered with Trimmomatic v0.3627 using default parameters, removing low-quality and short reads and low-quality bases, and clipping Illumina adaptors. Human-derived reads were removed by mapping them against the human genome [GRCh38] with Bowtie2 v2.2.928 in --very-sensitive local mode. [ii] The remaining reads were mapped against a reference database of the E. coli pangenome to retain only closely matching reads using MIDAS.29 [iii] Samples with a sufficient amount of mapped reads to E. coli pangenome [depth of coverage >5x] were separately de novo assembled with Unicycler v0.4.4,30 using default parameters. Contigs shorter than 1 Kb were discarded. [iv] To obtain better quality MAGs, the remaining contigs were screened with BLASTN against the NR database, and only contigs that aligned to E. coli genes as top hit with nucleotide identity of ≥90% were retained. [v] All the MAGs underwent the following quality control steps. Genome completeness and contamination measures were assessed using CheckM v1.0.731 in taxonomy-specific mode [taxonomy_wf] for the species E. coli. MAGs with levels of completeness <75% and contamination >5% were discarded. The amount of strain-level heterogeneity was estimated with CMSeq [https://github.com/SegataLab/cmseq], which calculates the polymorphism at each position [with coverage of >5x] in the contigs [a position was considered as non-polymorphic if the dominant allele frequency was >80%]. [vi] The relative abundance of each MAG from the corresponding metagenome sample was calculated by mapping reads against the reconstructed MAG from the same sample using Bowtie2 [--very-sensitive mode], and to avoid counting reads from closely related strains, only aligned reads with an edit-distance [mismatches] of two or less bases to the contig were considered, divided by the total number of quality-filtered reads in the sample.

Metagenomes where the E. coli MAGs failed to pass our minimum accepted quality criteria [completeness ≥75% and contamination ≤5%] were de novo assembled with metaSPAdes v3.1432 using default parameters and ‘--meta’ option, including the read error corrector BayesHammer, instead of the Unicycler assembler. The contigs were then used for binning with MetaBAT2 v2.1533 with ‘-m 1500’ option [minimal contig size for binning of ≥1500 bp] to obtain E. coli genome bins. The quality control steps with CheckM were repeated for the newly generated MAGs, and those that passed the accepted quality criteria were added into the analysis pipeline with the rest of the steps unchanged.

In total 412 E. coli MAGs were reconstructed and had genome completeness of >75% and contamination <5%, and 346 MAGs could be defined as high-quality according to current guidelines with completeness of >90%34; see Supplementary Table S2 for assembly quality control statistics.

2.3. Whole-genome phylogeny and average nucleotide identity analysis

Phylogenetic analysis of the E. coli genomes [MAGs and isolates] was performed using PhyloPhlAn-3.35 The phylogeny was built by annotating the input genomes with a set of species-specific marker genes identified from the UniRef90 database [2818 E. coli markers]. The following parameters were used: ‘--diversity low --accurate -min_num_entries 505 --force_nucleotides’, to generate high-resolution strain-level phylogeny [nucleotides level] with the genes defined as core if present in ≥95% of the genomes.

Whole-genome average nucleotide identity [ANI] between pairs of genomes was calculated with FastANI v1.3236 using default parameters. ANI values of ≥95% are the most frequently used as a species-level taxonomic boundary threshold for prokaryotic genomes.36

2.4. E. coli phylogroups typing

To perform in silico phylogroup typing for the E. coli genomes, the ClermonTyping tool [https://github.com/A-BN/ClermonTyping] was used, and the assembled contigs of each genome were used as input.37 This tool mimics the in vitro PCR method using a set of 30 primers that are aligned against the genomes with BLASTN.

2.5. Core genes single nucleotide polymorphism analysis

Core genes single nucleotide polymorphism [SNP] analysis of the E. coli genomes was performed using Prokka v1.13.338 in order to obtain the coding genes, and then Roary v3.1339 for pangenome building and core gene alignment. To convert the resulting core gene alignment file into an SNP distance matrix, the tool SNP-dists [https://github.com/tseemann/snp-dists] was used.

2.6. Functional annotation and analysis of the genomes

Functional annotation of the E. coli genomes was performed based on several tools and databases. EggNOG mapper v2.0.140 was used to generate the functional genes profile [pangenome] based on EggNOG orthology system with DIAMOND v0.9.2441 for the sequence similarity search. In addition, the genomes were annotated with KEGG Orthologs [KO] assignment based on hidden Markov model profiles using KofamScan v1.3.0.42 Carbohydrate-active enzymes [CAZy] families43 were identified using ‘hmmscan’ from HMMER v3.2.1 [http://hmmer.org] against dbCAN HMMs database release-9, applying stringent filtering cutoff as suggested by the authors [http://bcb.unl.edu/dbCAN2]. To add virulence genes annotation to the genomes, protein search BLASTP [DIAMOND] against the core set of experimentally verified virulence factors database [VFDB] was performed.44 Only sequences that had identity ≥90% to the proteins in VFDB were considered. Similarly, antibiotic resistance genes were identified by using the curated protein sequences [homologue model] in the comprehensive antibiotic resistance database [CARD].45 Only sequences that had identity ≥95% to the proteins in CARD were considered. For all the functional annotations, protein-coding genes within the contigs were predicted with Prodigal v2.6.3 and used as input query for sequence similarity searches.

2.7. Bacterial genome-wide association study analysis incorporating phylogenetic information

The bacterial genome-wide association algorithm PySEER46 was used to control for the phylogenetic structure of the E. coli strains. PySEER was run in linear mixed effects model mode, with a similarity matrix obtained from the previously computed whole-genome phylogenetic tree [see above]. Thus, population structure was used as a random effect and the phenotype of the subjects was used as a fixed effect. Matrices of genes presence/absence for each E. coli genome were used, and the analysis was performed based on all the functional gene groups as described above. The p-values from the mixed model associations were reported.

2.8. Genome replication rates of E. coli within subjects

Genome replication rate for the E. coli MAGs was calculated with the iRep [genome replication index] algorithm.47 This tool is based on the principle that actively replicating cells have higher coverage near the origin of replication, compared with the terminus in their genomes.48 Only MAGs that were obtained from faecal and aspirate metagenomes were used in this analysis [n = 391], since the mapping of metagenome reads to the assembled MAGs is required for this tool.

2.9. Identification of mutation in ciprofloxacin antibiotic target genes

The primary target genes for ciprofloxacin antibiotic, gyrA and parC, were extracted from the coding genes predicted by Prokka [see above]. Point mutations in these genes in positions known to confer resistance to ciprofloxacin were identified as described in Dubinsky et al.26

2.10. Mutational analysis of E. coli functional genes

To identify IBD-specific adaptations at the mutational level, Snippy v4.6.0 variant calling tool [https://github.com/tseemann/snippy] was used. The analysis was performed in a within-clade approach with a different isolate reference genome [high-quality draft] for each E. coli clade as was determined by the whole-genome phylogeny. For clade-I, clade-III, clade-IV, and clade-V, the following genomes from healthy subjects were used as reference [PATRIC genome ID]: 749549.3, 749550.3, 749540.3, and 749545.3, respectively. Functional annotations for these genomes were obtained from NCBI-Assembly according to NCBI prokaryotic genome annotation pipeline in genomic GenBank format, as required by Snippy. The analysis was not performed for clade-II due to the small number of genomes. The metagenomic reads composing the MAGs from IBD patients were mapped against the reference genomes, with a minimum read depth at a position for variant calling set to 10 [default]. Mutation variants were identified in coding and non-coding regions in the genome, and were classified as synonymous, non-synonymous, insertions/deletions, and frameshifts. Synonymous mutations and mutations in intergenic regions were not analysed further. To further validate the genes with non-synonymous mutations that were potentially IBD-specific [not present in non-IBD MAGs] from clade-III, homology search with BLASTN was performed. These genes were used as a query against a set of 23 non-IBD E. coli isolate genomes, and only genes with mutations that were absent in >80% of these non-IBD genomes [present in <20%] were labelled as IBD-specific and retained.

2.11. Statistical analysis

All statistical analysis was performed in R v3.6.3. The Kruskal–Wallis H-test with Dunn’s test correction for multiple pairwise comparisons was performed using ‘kruskal.test’ and ‘dunn.test’ functions. Wilcoxon [Mann–Whitney] rank test was performed using the ‘wilcox.test’ function. Fisher’s exact test with false-discovery rate correction [Benjamini–Hochberg] for multiple hypothesis testing was performed with the ‘fisher.test’ and ‘p.adjust’ functions. Spearman’s rank correlation was performed with the ‘cor.test’ function. Permutational multivariate analysis of variance [PERMANOVA] test was performed using the adonis function. A linear mixed effects model of iRep [dependent variable] as a function of genome and patient-related predictor variables, with individual patients from whom a set of samples were derived being specified as a random effect, was performed with the ‘lmer’ function.

2.12. Data availability [data sharing statement]

All the metagenomic sequence data generated in this study [patients with a pouch, UC and a stoma] have been deposited in NCBI-SRA, and the isolate genomes of patients with a pouch and CD were deposited in NCBI-Genome, and are available under BioProject number PRJNA730677. The assembled MAGs are available for download from Figshare data repository with the following URL: [https://doi.org/10.6084/m9.figshare.14625537].

3. Results

We studied the genomes of 530 E. coli strains from faecal [n = 438], aspirate [n = 24], and biopsy [n = 68] samples of patients with IBD [n = 459] and healthy individuals [n = 71]. We established a dataset that encompassed 21 individual studies, both publicly available and newly sequenced samples from our own cohort of patients with a pouch [Supplementary Table S1]. We reconstructed the majority of the analysed genomes [77%] directly from metagenomic samples (metagenome-assembled genomes [MAGs]) and the rest of the genomes were obtained from cultured isolates. The patients with IBD in this study, from whom we obtained E. coli genomes, included the following clinical phenotypes: pouch [n = 268], CD [n = 124], UC [n = 63], and an additional four patients after UC-associated proctocolectomy and a stoma [prior to pouch surgery]. For a subset of patients from our dataset, we had faecal calprotectin and antibiotic usage data [Supplementary Table S1].

3.1. E. coli strains from patients with IBD and healthy subjects belong to distinct lineages that are differentially distributed across phenotypes

First, the phylogenetic structure of E. coli strains was investigated using whole-genome analysis of 2818 core marker genes. We identified five distinct clades [main branches] comprising the E. coli strains in our dataset [Figure 1A], which we named Clade-I to Clade-V. Each clade included both isolate genomes and MAGs, reassuring that our computational approach for MAGs reconstruction from metagenomes is robust and the MAGs are compatible with isolate genomes. Although the majority of the E. coli strains in our dataset were from faecal samples, the genomes obtained from biopsies and aspirates clustered with the faecal strains throughout the five clades, implying that there is no clade that is specifically mucosa-associated.

Phylogenetic structure of the metagenome-assembled genomes [MAGs] and isolate genomes of E. coli from patients with IBD and healthy subjects. [A] A whole-genome phylogenetic tree based on core genes of 532 E. coli genomes. The inner ring indicates the clinical phenotype of the subject from whom the genome was derived; the middle ring corresponds to the E. coli phylogroup membership; the outer ring is the E. coli clade membership based on whole-genome phylogeny; circles/squares on tree nodes denote whether the genome is a MAG or isolate genome [WGS label], respectively; stars outside the rings denote the isolation source of the sample. The phylogenetic tree was built with RAxML [parameters ‘-p 1989 -m GTRCAT’] as part of PhyloPhlAn-3 tool,35 based on 2818 species-specific marker genes defined as core, which were present in ≥95% of the genomes. The branch scale is nucleotide substitutions per site. [B] Whole-genome average nucleotide identity [ANI] between pairs of E. coli genomes within a clade and between the clades. [C] Jaccard similarity in gene content [non-core functional genes based on EggNOG] between pairs of E. coli genomes within a clade and between the clades. [D] Core genome single-nucleotide polymorphism [SNP] distance between pairs of E. coli genomes within a clade and between the clades. In panels B to D, coloured bars are within a clade and grey bars are between the clades comparison, respectively. IBD, inflammatory bowel disease; pos., position.
Figure 1.

Phylogenetic structure of the metagenome-assembled genomes [MAGs] and isolate genomes of E. coli from patients with IBD and healthy subjects. [A] A whole-genome phylogenetic tree based on core genes of 532 E. coli genomes. The inner ring indicates the clinical phenotype of the subject from whom the genome was derived; the middle ring corresponds to the E. coli phylogroup membership; the outer ring is the E. coli clade membership based on whole-genome phylogeny; circles/squares on tree nodes denote whether the genome is a MAG or isolate genome [WGS label], respectively; stars outside the rings denote the isolation source of the sample. The phylogenetic tree was built with RAxML [parameters ‘-p 1989 -m GTRCAT’] as part of PhyloPhlAn-3 tool,35 based on 2818 species-specific marker genes defined as core, which were present in ≥95% of the genomes. The branch scale is nucleotide substitutions per site. [B] Whole-genome average nucleotide identity [ANI] between pairs of E. coli genomes within a clade and between the clades. [C] Jaccard similarity in gene content [non-core functional genes based on EggNOG] between pairs of E. coli genomes within a clade and between the clades. [D] Core genome single-nucleotide polymorphism [SNP] distance between pairs of E. coli genomes within a clade and between the clades. In panels B to D, coloured bars are within a clade and grey bars are between the clades comparison, respectively. IBD, inflammatory bowel disease; pos., position.

Population genetic studies of E. coli isolates have traditionally relied on the division into eight deep branching lineages called phylogroups49 based on a PCR-based typing method relying on four marker genes. An analysis of E. coli phylogroups [Figure 1A] indicated that the IBD strains originated from a diverse array of lineages [A, B1, B2, C, D, E, F, G]. The five clades we identified based on the whole-genome data largely corresponded to the phylogroups, with clade-I being consistent with D, clade-II with F, clade-III with B2, clade-IV with A and clade-V with B1 and C [Figure 1A; Supplementary Table S1]. As expected, strains within a clade had genomes that were more similar to one another in sequence [mean of 98.17% to 98.75% average nucleotide identity, ANI] compared with a mean ANI similarity of strains between clades (96.56% to 98.14% [Figure 1B]) and also had more similar gene content [Figure 1C and D].

We next analysed the distribution of the five E. coli clades among the clinical phenotypes. All the IBD phenotypes and healthy subjects had at least one strain from each clade, suggesting that many different E. coli lineages successfully colonise both patients with IBD and healthy subjects. Nonetheless, we observed an uneven distribution in clade-III [Fisher’s exact test, FDR p <0.05; Figure 2A], which was more prevalent in UC [49.2%] and non-IBD [40.8%] compared with CD [31.5%] and pouch [17.5%], whereas clade-I prevalence in UC was only 4.8%, the lowest among all the phenotypes.

Distribution of the E. coli clades across clinical phenotypes and lack of association between intestinal inflammation and E. coli clade and relative abundance. [A] Prevalence of E. coli clades in the different clinical phenotypes of patients with IBD [pouch, CD and UC] and non-IBD subjects. *FDR p <0.05; Fisher’s exact test. The test was performed separately for each clade, between the phenotypes—i.e., testing for each of the clades, is it differently distributed between the different phenotypes compared with the other clades. [B] Spearman correlation between E. coli MAGs relative abundance and faecal calprotectin level [log-transformed]. [C] Fecal calprotectin levels [log-transformed] associated with each E. coli clade. Box plot whiskers mark observations within the 1.5 interquartile range of the upper and lower quartiles. The blue horizontal lines mark the mean value in each clade. IBD, inflammatory bowel disease; CD, Crohn’s disease; UC, ulcerative colitis; FDR. false-discovery rate.
Figure 2.

Distribution of the E. coli clades across clinical phenotypes and lack of association between intestinal inflammation and E. coli clade and relative abundance. [A] Prevalence of E. coli clades in the different clinical phenotypes of patients with IBD [pouch, CD and UC] and non-IBD subjects. *FDR p <0.05; Fisher’s exact test. The test was performed separately for each clade, between the phenotypes—i.e., testing for each of the clades, is it differently distributed between the different phenotypes compared with the other clades. [B] Spearman correlation between E. coli MAGs relative abundance and faecal calprotectin level [log-transformed]. [C] Fecal calprotectin levels [log-transformed] associated with each E. coli clade. Box plot whiskers mark observations within the 1.5 interquartile range of the upper and lower quartiles. The blue horizontal lines mark the mean value in each clade. IBD, inflammatory bowel disease; CD, Crohn’s disease; UC, ulcerative colitis; FDR. false-discovery rate.

Finally, we tested to what degree can E. coli or specific clades of this species be associated with intestinal inflammation. Faecal calprotectin, used as a biomarker of intestinal inflammation, did not correlate with the relative abundance of the E. coli MAGs in the microbiomes from corresponding metagenomic samples [Spearman r = 0.0167, p = 0.77; Figure 2B]. Furthermore, the median calprotectin levels associated with each clade were similar [Kruskal–Wallis, p = 0.59; Figure 2C] although, in the group of patients with clade-III E. coli, the mean calprotectin was higher due to the inclusion of several patients with very high calprotectin values [>8000 μg.g-1].

3.2. E. coli clades present in patients with IBD may have particular functional potential

We next explored the functional potential of the E. coli clades through genome annotation based on several databases [see Materials and Methods]. We identified a pangenome of 7462 genes in the genomes set [with either a known gene name or KO number, according to the EggNOG], of which 2121 were non-core [present in less than 90% of the genomes] and thus might drive differences in the functional potential between the clades. A substantial variation in the functional genes was observed between the E. coli genomes [Figure 3A], which was explained to a large degree by the clade membership [PERMANOVA test, R2 = 0.437, p <0.001]. Clade-III was the most dissimilar from the other clades according to EggNOG, KO and CAZy gene annotations [Figure 3B and C] and to core-gene SNP alignment [Figure 3D]. Focusing on genes with functional annotations [KO], we detected 709 non-core genes that significantly differed in prevalence between the E. coli clades [Fisher’s exact test, FDR p <0.05; Supplementary Table S3].

Variation in the functional gene content of the E. coli genomes explained by clade membership. [A] Principal coordinates analysis [PCoA] of Jaccard distances [overall dissimilarity in the presence or absence of genes] based on non-core genes annotated with EggNOG. [B] PCoA of Jaccard distances based on non-core genes annotated with KEGG Orthologs [KO]. [C] PCoA of Jaccard distances based on non-core carbohydrate-active enzymes [CAZy] families. [D] PCoA based on dissimilarity in core-gene single-nucleotide polymorphisms [SNPs]. The variation in the functional gene content explained by the clade membership of E. coli genomes in panels A to D [PERMANOVA, R2 = 0.437, 0.401, 0.366, and 0.748, respectively; p <0.001].
Figure 3.

Variation in the functional gene content of the E. coli genomes explained by clade membership. [A] Principal coordinates analysis [PCoA] of Jaccard distances [overall dissimilarity in the presence or absence of genes] based on non-core genes annotated with EggNOG. [B] PCoA of Jaccard distances based on non-core genes annotated with KEGG Orthologs [KO]. [C] PCoA of Jaccard distances based on non-core carbohydrate-active enzymes [CAZy] families. [D] PCoA based on dissimilarity in core-gene single-nucleotide polymorphisms [SNPs]. The variation in the functional gene content explained by the clade membership of E. coli genomes in panels A to D [PERMANOVA, R2 = 0.437, 0.401, 0.366, and 0.748, respectively; p <0.001].

We also specifically analysed the potential differences in carbohydrate utilisation between the clades using CAZy protein families. Even though many of the identified 26 non-core CAZy families were broadly distributed, we nonetheless observed a distinct prevalence pattern between the clades for some of them [Supplementary Table S4]. The alpha-mannosidase family GH38 [involved in the oligosaccharide mannose degradation] was over 93% prevalent in clades IV and V, but less than 33% prevalent in the other three clades [Fisher’s exact test, FDR p <9.6 × 10-63]. On the other hand, the GH127 family involved in the degradation of other oligosaccharides, including arabinofuranose, was 98.8% and 100% prevalent in clades I and II, respectively, whereas in the other clades it was only 29.5% to 67% prevalent. Unlike the other clades, clade-III almost exclusively possessed a high prevalence [98%] of GH33 sialidases [Fisher’s exact test, FDR p <3.4 × 10-113], which are involved in the breakdown and utilisation of mucin-derived human glycans [by cleaving terminal sialic acid from sialylated mucins50].

Clade-III [B2 phylogroup equivalent] genomes encoded a high number of potentially proinflammatory proteins and toxins [based on the virulence factor database VFDB, Figure 4A, and Supplementary Table S5]. Half of the genomes in clade-III had genes encoding for biosynthesis of colibactin [pks genomic island—clb], a genotoxic molecule that can induce double-strand DNA breaks and chromosome aberrations.51 Several serine-protease autotransporters that are harmful for eukaryotic cells52 and are considered as enterotoxins [pic, sat, vat], were almost exclusively present in clade-III genomes, and their prevalence ranged from 30% to 77%. Moreover ɑ-haemolysin [hly], which was shown to disrupt epithelial cell tight junctions in patients with UC and thus promote intestinal inflammation,53 was present in 25% of clade-III genomes. Notably, only a small number of the genomes from the other clades possessed the above-mentioned virulence factors.

Virulence factor and antibiotic resistance gene profiles in the E. coli genomes from patients with IBD and healthy subjects. [A] Heatmap of virulence factors [rows] in E. coli genomes [columns] showing the presence [blue] or absence [beige] of the genes. [B] Heatmap of antibiotic resistance genes [rows] in E. coli genomes [columns] showing the presence [blue] or absence [beige] of the specific genes. Colour bars at the top of the plots represent the clinical phenotype of the subject and the clade membership of the E. coli genomes. Colour bars at the left of the plots represent the classification category of the virulence genes in panel [A] and the antibiotic resistance mechanism in panel [B]. Genes were clustered using Euclidean distances by applying the complete linkage method. For the list of p-values obtained from Fisher’s exact test for genes prevalence, see Supplementary Tables S5 and S6.
Figure 4.

Virulence factor and antibiotic resistance gene profiles in the E. coli genomes from patients with IBD and healthy subjects. [A] Heatmap of virulence factors [rows] in E. coli genomes [columns] showing the presence [blue] or absence [beige] of the genes. [B] Heatmap of antibiotic resistance genes [rows] in E. coli genomes [columns] showing the presence [blue] or absence [beige] of the specific genes. Colour bars at the top of the plots represent the clinical phenotype of the subject and the clade membership of the E. coli genomes. Colour bars at the left of the plots represent the classification category of the virulence genes in panel [A] and the antibiotic resistance mechanism in panel [B]. Genes were clustered using Euclidean distances by applying the complete linkage method. For the list of p-values obtained from Fisher’s exact test for genes prevalence, see Supplementary Tables S5 and S6.

We also explored the antibiotic resistance genes in the E. coli strains from each clade based on the CARD database for resistance genes. Genes encoding various efflux pumps [against tetracycline, macrolide, aminocoumarin, and aminoglycoside antibiotics] showed a diverse prevalence pattern between the clades [Figure 4B; and Supplementary Table S6]. Aminoglycoside resistance by inactivation had a relatively low prevalence in all the clades [<37%], but AAC[3]-IId was significantly higher in clade-II than in the other clades [Fisher’s exact test, FDR p <2.2 × 10-6]. We also detected beta-lactamase encoding genes, some of which were prevalent [ampC and TEM-1] whereas others were rare, specially those coding for the CTX-M family of extended-spectrum β-lactamases.54 Interestingly, genomes from clade-III had on average the lowest number of antibiotic resistance genes [Figure 4B].

Due to the marked differences in E. coli strains which we detected between clades, we followed with a phylogenetically-aware analysis to identify specific metabolic functions and virulence factors associated with IBD phenotypes, regardless of the clades [see below].

3.3. Antibiotic treatment in patients with IBD increases mobile antibiotic resistance genes in E. coli strains

Next we wanted to determine whether antibiotic treatment in a subset of patients with IBD affected the presence of antibiotic resistance genes in the E. coli strains. For 408 E. coli genomes, we obtained information regarding the antibiotic usage of the patients from whom the samples were obtained. By dividing these genomes into two groups, from subjects who had received antibiotic treatment in the last month prior to sampling [Abx+] and those who were free of antibiotics for at least 1 month [Abx-], we could detect 14 antibiotic resistance genes that were significantly enriched in the genomes from the Abx + group [Figure 5A]. Importantly, aac[6’]-Ib-cr, oxa-1, and catB3 genes conferring resistance against fluoroquinolones, beta-lactams, and chloramphenicol, respectively, were 16% prevalent in the genomes of Abx+, compared with only 2.54% in the Abx- [Fisher’s exact test, FDR p <8.4 × 10-5]. These genes are known to be encoded together on some plasmids of Enterobacteriaceae,55 and accordingly we observed these genes co-located on the same contigs of E. coli in our dataset. Overall, genomes in the Abx + and Abx- groups had a median of 11 and nine resistance genes, respectively [Mann-Whitney, p <5.3 × 10; Figure 5B]. Taken together, these findings suggest that E. coli strains of patients with IBD tend to be antibiotic-resistant, and antibiotic treatment further increases mobile antibiotic resistance in these strains.

Higher prevalence of antibiotic resistance genes of E. coli in subjects undergoing antibiotic treatment. [A] Heatmap showing the prevalence of antibiotic resistance genes in E. coli genomes from subjects who received antibiotic treatment in the past month [Abx+] compared with those that were free of antibiotics for at least 1 month [Abx-]. *FDR p <0.05; Fisher’s exact test. [B] The distribution of the number of antibiotic resistance genes [AMR] in E. coli genomes from Abx + and Abx- subjects as defined in panel A. *p <0.001; Mann–Whitney test. Box plot whiskers mark observations within the 1.5 interquartile range of the upper and lower quartiles. FDR, false-discovery rate.
Figure 5.

Higher prevalence of antibiotic resistance genes of E. coli in subjects undergoing antibiotic treatment. [A] Heatmap showing the prevalence of antibiotic resistance genes in E. coli genomes from subjects who received antibiotic treatment in the past month [Abx+] compared with those that were free of antibiotics for at least 1 month [Abx-]. *FDR p <0.05; Fisher’s exact test. [B] The distribution of the number of antibiotic resistance genes [AMR] in E. coli genomes from Abx + and Abx- subjects as defined in panel A. *p <0.001; Mann–Whitney test. Box plot whiskers mark observations within the 1.5 interquartile range of the upper and lower quartiles. FDR, false-discovery rate.

3.4. Genome replication rates of E. coli strains are associated with clinical phenotypes but not with intestinal inflammation or antibiotic usage

To test how well E. coli strains grow in the IBD gut, we analysed the growth dynamics of the E. coli genomes obtained from 375 faecal and 16 aspirate metagenomes of patients with IBD and of healthy subjects in vivo using iRep [genome replication index47]. This method is based on the fact that in replicating fast-growing bacteria, such as E. coli, the regions of the genome closer to the origin of replication are over-represented, and thus more metagenomic reads map to them. This analysis showed that patients with a pouch had the fastest-growing E. coli strains in vivo [Kruskal–Wallis, p <0.05; Figure 6A], followed by patients with CD, and those with UC and non-IBD subjects had the slowest-growing strains. It is worth noting that iRep was not correlated with the relative abundance of E. coli [Spearman r = -0.018, p = 0.79; Figure 6B]. A linear mixed effects model of iRep taking into account strain and patient-related parameters indicated that only the clinical phenotype of the subject was significantly associated with iRep [mixed effects model, p <0.001]. In contrast, the phylogenetic clade, relative abundance of E. coli, antibiotic usage [antibiotics in the past month], and faecal calprotectin [intestinal inflammation marker] failed to show statistically significant associations with iRep.

Inferred growth rates of E. coli strains from patients with IBD and healthy subjects. [A] Inferred growth rates [using iRep [genome replication index]] of E. coli strains from patients with a pouch, CD, UC and healthy subjects. *p <0.05; ***p <0.001; Kruskal–Wallis test. [B] Spearman correlation between iRep and the relative abundance of E. coli in the microbiomes from corresponding metagenomic samples. [C] iRep of E. coli strains from subjects who had used antibiotics in the past month [Abx+] compared with subjects who were antibiotic-free for at least 1 month [Abx-]. [D] iRep of E. coli strains from patients with a pouch who were treated with ciprofloxacin, grouped according to the number of resistant mutations in drug target genes [gyrA and parC]. Zero mutations represent ciprofloxacin-sensitive strains. ‘Abx+ during treatment’ and ‘Abx-’ division define samples from patients taken during antibiotic treatment or in the absence of antibiotics for at least 1 month, respectively. In panels A, C, and D, box plot whiskers mark observations within the 1.5 interquartile range of the upper and lower quartiles. IBD, inflammatory bowel disease; UC, ulcerative colitis; CD, Crohn’s disease.
Figure 6.

Inferred growth rates of E. coli strains from patients with IBD and healthy subjects. [A] Inferred growth rates [using iRep [genome replication index]] of E. coli strains from patients with a pouch, CD, UC and healthy subjects. *p <0.05; ***p <0.001; Kruskal–Wallis test. [B] Spearman correlation between iRep and the relative abundance of E. coli in the microbiomes from corresponding metagenomic samples. [C] iRep of E. coli strains from subjects who had used antibiotics in the past month [Abx+] compared with subjects who were antibiotic-free for at least 1 month [Abx-]. [D] iRep of E. coli strains from patients with a pouch who were treated with ciprofloxacin, grouped according to the number of resistant mutations in drug target genes [gyrA and parC]. Zero mutations represent ciprofloxacin-sensitive strains. ‘Abx+ during treatment’ and ‘Abx-’ division define samples from patients taken during antibiotic treatment or in the absence of antibiotics for at least 1 month, respectively. In panels A, C, and D, box plot whiskers mark observations within the 1.5 interquartile range of the upper and lower quartiles. IBD, inflammatory bowel disease; UC, ulcerative colitis; CD, Crohn’s disease.

Curiously, E. coli strains that grew in the presence of antibiotics had similar iRep to those from non-antibiotic treated patients [Mann–Whitney, p = 0.454; Figure 6C]. For a subset of patients with pouch [n = 233], for whom we had more comprehensive antibiotic usage data who were treated with ciprofloxacin, we could match resistance mutations with the respective iRep from the same sample. As the main mechanism of resistance again ciprofloxacin [a fluoroquinolone] is point mutations in drug target genes, gyrA and parC,26 we extracted these genes and analysed them for mutations in previously known positions conferring resistance. Even when taking into account E. coli from samples taken during active antibiotic treatment, highly resistant strains with three mutations had a similar iRep to sensitive strains [no mutations] [Figure 6D], suggesting that the resistant strains grow very well within hosts in the presence of ciprofloxacin, with rates comparable to sensitive strains growing in the absence of antibiotics.

3.5. Functional gene repertoire of IBD-associated E. coli strains

Our next goal was to determine if there is an association between specific functional genes of E. coli and patients with IBD or healthy subjects. As we have shown earlier, many functions [including metabolism-, virulence-, and resistance-related] exhibit strong clade-dependent distribution patterns. To overcome such biases, we analysed the E. coli genomes while phylogenetically controlling for their relatedness to one another [population structure] with PySEER [see Methods], and focused on the metabolic functions and virulence factors associated with specific IBD phenotypes or with healthy controls [Supplementary Tables S7 to S10]. A gene family involved in sucrose and inulin degradation [CAZy family GH32] was enriched in patients with a pouch, with a prevalence of 82.5% compared with 55.5% to 66.9% in the other phenotypes [PySEER analysis, p <0.05; Supplementary Table S7]. Furthermore, patients with a pouch also harboured E. coli strains enriched [23.5%] in alpha-galactosidase [GH36; involved in hydrolysis of alpha-galactosidic residues], which was very low in UC [8%] and CD [5.6%] and absent in healthy subjects. In contrast, patients with a pouch were depleted [1.87%] in endo-alpha-sialidase [GH58; involved in utilisation of sialylated mucins from human glycoproteins or animal-derived food], which was significantly more abundant in healthy subjects [15.5%], and patients with CD [8.06%] and UC [22.2%] [PySEER analysis, p <0.05; Supplementary Table S7]. Finally, we did not find significant differences in the prevalence of propanediol dehydratase [pduC] between patients with CD and healthy subjects [Supplementary Table S8], in agreement with previous reports.56,57.

In terms of virulence factors, the colibactin gene cluster was significantly more abundant in patients with UC [33.8%] and healthy subjects [26.1%] than in patients with CD [10.3%] and those with a pouch [6.5%] [PySEER analysis, p <0.05; Supplementary Table S9]. Surprisingly, healthy subjects were significantly enriched in senB and cnf1, encoding for enterotoxic and cytotoxic proteins, respectively. Last, iron transport [iro] genes showed significantly lower presence in healthy subjects compared with all IBD phenotypes. In terms of antibiotic resistance genes, the gene tetM, conferring resistance against tetracyclines, was almost solely found in 32.5% of the genomes in patients with a pouch [PySEER analysis, p <0.05; Supplementary Table S10]. In addition, the TEM-1 and OXA-1 beta-lactamases, conferring resistance against beta-lactam antibiotics, were significantly more prevalent in E. coli strains from patients with a pouch as well [PySEER analysis, p <0.05; Supplementary Table S10].

3.6. Polymorphism analysis of functional genes reveals IBD-specific adaptations

To complement the analysis of the differential prevalence of functional genes in patients with IBD and healthy subjects, we tested whether there are IBD-specific point mutations that are potentially adaptive for E. coli. We mapped the metagenome reads composing the MAGs from IBD phenotypes against a reference isolate genome from the same clade from a healthy phenotype, richly annotated by the NCBI prokaryotic genome pipeline to obtain all potential polymorphic positions in our E. coli strains. If a polymorphism is beneficial for the bacterium when colonising the IBD gut, we can expect it to be present in multiple IBD-associated strains and nearly absent in strains that colonise individuals without IBD. We therefore defined IBD-associated polymorphisms as those within a clade that are present in over 30% of IBD-derived genomes and absent from >80% non-IBD strains [see Methods]. We identified several polymorphisms that are predicted to change the amino acid sequence [henceforth non-synonymous mutations] in genes with annotated functions of E. coli strains from clade-III [Table 1; Supplementary Table S11].

Table 1.

Polymorphisms in genes encoding bacterial cell envelope and secretion components in IBD-derived genomes.

Protein functionVariant pos. [amino-acid]Mutation type[B]uried/
[E]xposed residue
Mutation
prevalence
Capsular polysaccharide biosynthesis protein KpsC148, 233Non-synonymousE, E0.43, 0.48
Colanic acid biosynthesis glycosyltransferase WcaC303, 357Non-synonymousE, E0.37, 0.56
OmpA family protein202, 237, 242, 257Non-synonymous,E, E, E, B,0.38, 0.43, 0.41, 0.31,
Type VI secretion system ATPase TssH261, 787, 797Non-synonymousB, E, E0.44, 0.39, 0.35
Type VI secretion system baseplate subunit TssK296Non-synonymousE0.36
Protein functionVariant pos. [amino-acid]Mutation type[B]uried/
[E]xposed residue
Mutation
prevalence
Capsular polysaccharide biosynthesis protein KpsC148, 233Non-synonymousE, E0.43, 0.48
Colanic acid biosynthesis glycosyltransferase WcaC303, 357Non-synonymousE, E0.37, 0.56
OmpA family protein202, 237, 242, 257Non-synonymous,E, E, E, B,0.38, 0.43, 0.41, 0.31,
Type VI secretion system ATPase TssH261, 787, 797Non-synonymousB, E, E0.44, 0.39, 0.35
Type VI secretion system baseplate subunit TssK296Non-synonymousE0.36

Prevalence of selected mutation variants in genes encoding for bacterial cell surface- or secretion-associated proteins discussed in the text. For the full list of mutation variants, see Supplementary Table S11.

IBD, inflammatory bowel disease.

Table 1.

Polymorphisms in genes encoding bacterial cell envelope and secretion components in IBD-derived genomes.

Protein functionVariant pos. [amino-acid]Mutation type[B]uried/
[E]xposed residue
Mutation
prevalence
Capsular polysaccharide biosynthesis protein KpsC148, 233Non-synonymousE, E0.43, 0.48
Colanic acid biosynthesis glycosyltransferase WcaC303, 357Non-synonymousE, E0.37, 0.56
OmpA family protein202, 237, 242, 257Non-synonymous,E, E, E, B,0.38, 0.43, 0.41, 0.31,
Type VI secretion system ATPase TssH261, 787, 797Non-synonymousB, E, E0.44, 0.39, 0.35
Type VI secretion system baseplate subunit TssK296Non-synonymousE0.36
Protein functionVariant pos. [amino-acid]Mutation type[B]uried/
[E]xposed residue
Mutation
prevalence
Capsular polysaccharide biosynthesis protein KpsC148, 233Non-synonymousE, E0.43, 0.48
Colanic acid biosynthesis glycosyltransferase WcaC303, 357Non-synonymousE, E0.37, 0.56
OmpA family protein202, 237, 242, 257Non-synonymous,E, E, E, B,0.38, 0.43, 0.41, 0.31,
Type VI secretion system ATPase TssH261, 787, 797Non-synonymousB, E, E0.44, 0.39, 0.35
Type VI secretion system baseplate subunit TssK296Non-synonymousE0.36

Prevalence of selected mutation variants in genes encoding for bacterial cell surface- or secretion-associated proteins discussed in the text. For the full list of mutation variants, see Supplementary Table S11.

IBD, inflammatory bowel disease.

Remarkably, some genes showed multiple independent non-synonymous mutations in the same operon in different genomes [Table 1; Supplementary Table S11]. Such convergent mutational patterns were especially notable in genes encoding bacterial cell surface- or secretion-associated components, such as colanic acid biosynthesis proteins [wcaC], outer membrane proteins OmpA, and the type VI secretion system baseplate subunit TssK. Such proteins may have direct proinflammatory roles in the gut, and some of them are known to be important antigens58–60 and thus may contribute to clearance of the bacteria. Interestingly, the gene encoding the lipoprotein metalloprotease SslE, which contributes to activation of macrophages via toll-like receptor 2,61 showed several different mutations in various strains and was also significantly more abundant among isolate genomes from non-IBD strains compared with IBD strains [10/23 vs 4/36, respectively; p <0.05; Fisher’s exact test]. To confirm that mutations could alter the antigenicity of the mutated proteins, we mapped the mutations in outer-membrane proteins of E. coli [likely to interact with host immune system] onto the three-dimensional structure of the respective proteins, using ConSurf.62 This analysis revealed that the polysaccharide export protein Wza, the OmpA family protein YiaD, and the ferrichrome porin FhuA indeed showed mutations in surface-exposed residues.

Last, several of those mutated genes showed polymorphisms in multiple other clades [Supplementary Table S12], although the mutated positions were different. One may speculate that this may represent a convergent evolution process potentially toward lower antigenicity, which could be beneficial to the bacteria in the intestines of patients with IBD; however this requires experimental validation.

4. Discussion

Comparative genomic studies of E. coli in IBD have focused mostly on AIEC strains isolated from mucosal samples of patients with CD, and were unable to define genomic markers associated with this phenotype.19,21 Analyses of commensal E. coli from faecal and aspirate samples of patients with IBD could not identify IBD-specific strains, genes, or alleles.22–24 Here we characterised large and diverse collections of E. coli genomes from different IBD phenotypes to obtain a broader understanding of this species in the context of IBD pathogenesis.

We could classify these diverse E. coli strains into five distinct clades. Although no clade was unique to IBD, clade-III was more prevalent in UC [49%] and healthy controls [41%] than in patients with CD [31.5%] or pouchitis [17.5%]. This finding corroborates reports observing a comparable prevalence of E. coli isolates of 47% to 55% from the B2 phylogroup in biopsy tissues of patients with UC.63,64 Notably, we did not detect associations between the relative abundance or the clade affiliation of E. coli and intestinal inflammation, in contrast to Mirsepasi-Lauridsen et al.65 who found that patients with active UC colonised with E. coli B2 had significantly increased levels of faecal calprotectin. However, clade-III strains did have a higher presence of intestine-relevant virulence factors, which implies that some of them can potentially contribute to the severity of IBD, especially UC.

The genomic metabolic capacities varied across the clades and IBD phenotypes. Of particular interest were the sialidase genes, that encode enzymes that cleave sialic acid from host sialylated mucins, and were exclusively present in clade-III. Strains from this clade may therefore be better adapted for using host-derived mucin glycans. Sialylated glycans increase from the ileum to the colon along the gut.66 As patients after pouch surgery are missing a colon, and the pouch is made from ileal tissue, this might explain the observation of fewer E. coli strains capable of using sialylated mucins in these patients. Indeed, those patients instead had higher abundance of strains that can use food-derived sugars, such as sucrose or inulin [poly-fructose]. Inulin supplementation was shown to reduce endoscopic and histological inflammation of the mucosa, coupled with increased butyrate concentration, in patients with a pouch67; thus having inulin-degrading E. coli strains may counter such approaches in which inulin is intended to stimulate butyrate-producing bacteria.

The pdu operon [propanediol] is part of a pathway involved in the metabolism of fucose. It was found that utilisation of 1,2-propanediol by AIEC derived from patients with CD is critical for promoting T cell-driven intestinal inflammation in a mouse colitis model68 and is correlated with increased cellular invasion and persistence in vitro.56 The ability to metabolise fucose together with sialic acid [two major components of intestinal mucus] is linked with the expansion of enteropathogenic species like Salmonella typhimurium and Clostridioides difficile, which exploit increased mucosal carbohydrate availability following disruption of commensal microbiota by antibiotics.69 We observed comparable prevalence of pduC in all IBD phenotypes and in healthy subjects.

Here we go beyond preliminary reports of higher within-host growth rates of E. coli in CD48 into a comprehensive analysis of the major three IBD phenotypes. E. coli often attains higher abundance in patients with IBD due to the fact that higher levels of nitrate, which accumulate in the inflamed gut, allow it to perform anaerobic respiration rather than fermentation and consequently grow faster than other bacteria.70 Enigmatically, we show that patients with a pouch harbour E. coli with the highest growth rates, even in the presence of antibiotics. Given that patients with a pouch also tend to have the highest stool frequency, it is possible that E. coli growth rates reflect intestinal emptying: very short transit times mean that luminal bacteria do not have time to use up all available nutrients and at the same time must divide fast enough or be diluted out. Conversely, we did not observe an association between the iRep and the calprotectin levels of the patients.

The Clade-III strains encoded the highest number of putative virulence factors, including ɑ-haemolysin, serine-protease autotransporters, the yersiniabactin iron uptake system, and the island for biosynthesis of the genotoxic molecule colibactin, confirming previous reports on B2 phylogroup strains.12,20 Recently, mutational signatures that are caused by colibactin exposure were identified in human organoids and matched the mutational signature of human colorectal cancer tumours, implying that colibactin exposure increases colorectal cancer risk.71 Conversely, the pks genomic island is also required for the anti-colitis properties of the probiotic E. coli strain Nissle 191772 that has shown clinical benefit in maintaining remission in patients with UC.73 Using publicly available metagenomic data, we previously showed that the prevalence of E. coli colibactin genes is similar between IBD and non-IBD controls, but the relative abundance of colibactin pks island is 3-fold higher in IBD.74 Here we show that E. coli from patients with UC are twice more likely to encode the pks island than strains from patients with CD or pouch. Combined with the direct effect of years of chronic inflammation of the colon that characterises UC, these patients may thus be further exposed to a higher risk of developing colorectal cancer, and should undergo surveillance colonoscopy even during years of remission.

When non-synonymous mutations occur independently in the same gene in multiple strains, this suggests that the gene in question is involved in host adaptation.75 Here, we have identified multiple IBD-specific non-synonymous mutations in E. coli genes related to bacterial cell envelope and secretion components. This could reflect adaptation to the overactive host immune response2 that enables such E. coli strains to expand in the gut of patients with IBD.5,8,9 One may speculate that a more intimate contact of E. coli with host tissue and host immunity can select for bacteria that are less ‘visible’ to the immune system, and thus more rarely attacked by neutrophils or cleared by antibodies; however, such a hypothesis requires support by further experiments. Moreover, some mutations could actually make IBD-associated strains less proinflammatory to the host.

Another example is the numerous mutations we detected in the genes encoding the metalloprotease SslE, which were also rarer in IBD strains. SslE is generally involved in E. coli degradation of mucin substrates and colonisation of the intestinal and urinary tract,76 in addition to stimulating the production of reactive nitrogen and oxygen species and proinflammatory chemokines in mouse macrophages.61 Thus, a change in SslE function could reduce the pathogenicity of strains that encode it to their hosts. One may speculate that some IBD-associated mutations, especially in proteins that may come into contact with the immune system, such as outer membrane proteins, could also result in less inflammatory variants of these bacterial proteins.

Our study has several limitations. First, genomic data can provide the genotypic potential of E. coli strains, but IBD adaptation can also take the form of changes in gene expression or post-transcriptional regulation that cannot be inferred directly from sequence. Second, we had substantially fewer genomes from UC and from healthy subjects than from CD and patients with a pouch. This is to a large extent because we relied on faecal metagenomes with relatively high E. coli abundance that enabled our analyses. It is possible that E. coli strains that are a small minority in the gut will show different genomic signatures, but to obtain MAGs from such strains will require deeper sequencing of metagenomes than is currently practised.

Our results may have important clinical implications. E. coli is often considered as an opportunistic pathogen in IBD, but direct mechanistic links between its potential virulence and inflammation in vivo [in patients with IBD] are still missing. The current study presents E. coli not just as a species that is generally well-adapted to the IBD gut, but also one that shows some strain-specific adaptations. This could explain the lower proinflammatory potential previously observed in E. coli strains from patients with pouchitis.26 Furthermore, although E. coli prospers in the IBD gut, perhaps partly because it is a fast-grower, its abundance is not associated with inflammation in our dataset. Eradicating E. coli is therefore unlikely to generally benefit patients with IBD, unless the strain in question has pro-carcinogenic potential, like colibactin-producing E. coli.

Funding

This work was supported by a generous grant from the Leona M. and Harry B. Helmsley Charitable Trust [grant number #2018PG-CD007]. VD was partially supported by a fellowship from the Edmond J. Safra Center for Bioinformatics at Tel-Aviv University. UG was also supported by the Israeli Ministry of Science and Technology and the Israeli Ministry of Health.

Conflict of Interest

ID: consultation/advisory boards for Pfizer, Janssen, Abbvie, Takeda, Roche/Genentech, Celltrion, Celgene, Medtronic/Given Imaging, Rafa Laboratories, Neopharm, Sublimity, Arena, Gilead; speaking/teaching: Pfizer, Janssen, Abbvie, Takeda, Roche/Genentech, Celltrion, Celgene, Falk Pharma, Ferring. Grant support: Pfizer, Altman Research. The remaining authors declare no competing interests.

Acknowledgements

The authors wish to thank the members of the RMC IBD Centre and the Comprehensive Pouch Clinic for their help and stimulating discussion of the project. The authors wish to thank Stefan Green of the University of Illinois at Chicago, for his continued expert help in the metagenomic sequencing.

Author Contributions

VD, UG, and ID conceived and designed the study; VD developed the bioinformatic analysis and the comparative genomic pipelines and analysed the metagenomic data; LR analysed the metagenomic data; KR and ID collected, analysed and provided the clinical data; NW and ID enrolled and examined the patients; VD, UG, and ID wrote the paper. All authors read, discussed, and approved the final manuscript.

References

1.

Tulchinsky
H
,
Dotan
I
,
Alper
A
, et al. .
Comprehensive pouch clinic concept for follow-up of patients after ileal pouch anal anastomosis: report of 3 yearsʼ experience in a tertiary referral center
.
Inflamm Bowel Dis
2008
;
14
:
1125
32
.

2.

Sartor
RB
,
Wu
GD.
Roles for intestinal bacteria, viruses, and fungi in pathogenesis of inflammatory bowel diseases and therapeutic approaches.
Gastroenterology
2017
;
152
:
327
39.e4
.

3.

Liu
JZ
,
van Sommeren
S
,
Huang
H
, et al. .;
International Multiple Sclerosis Genetics Consortium; International IBD Genetics Consortium
.
Association analyses identify 38 susceptibility loci for inflammatory bowel disease and highlight shared genetic risk across populations.
Nat Genet
2015
;
47
:
979
86
.

4.

Ananthakrishnan
AN
,
Bernstein
CN
,
Iliopoulos
D
, et al. .
Environmental triggers in IBD: a review of progress and evidence.
Nat Rev Gastroenterol Hepatol
2018
;
15
:
39
49
.

5.

Gevers
D
,
Kugathasan
S
,
Denson
LA
, et al. .
The treatment-naive microbiome in new-onset Crohn’s disease.
Cell Host Microbe
2014
;
15
:
382
92
.

6.

Reshef
L
,
Kovacs
A
,
Ofer
A
, et al. .
Pouch inflammation is associated with a decrease in specific bacterial taxa.
Gastroenterology
2015
;
149
:
718
27
.

7.

Lloyd-Price
J
,
Arze
C
,
Ananthakrishnan
AN
, et al. .
Multi-omics of the gut microbial ecosystem in inflammatory bowel diseases
.
Nature
2019
;
569
:
655
62
.

8.

Dubinsky
V
,
Reshef
L
,
Rabinowitz
K
, et al. .
Dysbiosis in metabolic genes of the gut microbiomes of patients with an ileo-anal pouch resembles that observed in Crohn’s disease
.
mSystems
2021
;
6
:
e00984
20
.

9.

Darfeuille-Michaud
A
,
Boudeau
J
,
Bulois
P
, et al. .
High prevalence of adherent-invasive Escherichia coli associated with ileal mucosa in Crohn’s disease.
Gastroenterology
2004
;
127
:
412
21
.

10.

Tenaillon
O
,
Skurnik
D
,
Picard
B
,
Denamur
E.
The population genetics of commensal Escherichia coli.
Nat Rev Microbiol
2010
;
8
:
207
17
.

11.

Horesh
G
,
Blackwell
GA
,
Tonkin-Hill
G
, et al. .
A comprehensive and high-quality collection of Escherichia coli genomes and their genes
.
Microb Genomics
2021
. Doi: 10.1099/mgen.0.000499.

12.

Denamur
E
,
Clermont
O
,
Bonacorsi
S
,
Gordon
D.
The population genetics of pathogenic Escherichia coli.
Nat Rev Microbiol
2021
;
19
:
37
54
.

13.

Kruis
W
,
Fric
P
,
Pokrotnieks
J
, et al. .
Maintaining remission of ulcerative colitis with the probiotic Escherichia coli Nissle 1917 is as effective as with standard mesalazine.
Gut
2004
;
53
:
1617
23
.

14.

Darfeuille-Michaud
A
,
Neut
C
,
Barnich
N
, et al. .
Presence of adherent Escherichia coli strains in ileal mucosa of patients with Crohn’s disease.
Gastroenterology
1998
;
115
:
1405
13
.

15.

Palmela
C
,
Chevarin
C
,
Xu
Z
, et al. .
Adherent-invasive Escherichia coli in inflammatory bowel disease.
Gut
2018
;
67
:
574
87
.

16.

Martinez-Medina
M
,
Aldeguer
X
,
Lopez-Siles
M
, et al. .
Molecular diversity of Escherichia coli in the human gut: new ecological evidence supporting the role of adherent-invasive E. coli [AIEC] in Crohnʼs disease
.
Inflamm Bowel Dis
2009
;
15
:
872
82
.

17.

Elliott
TR
,
Hudspith
BN
,
Wu
G
, et al. .
Quantification and characterization of mucosa-associated and intracellular Escherichia coli in inflammatory bowel disease
.
Inflamm Bowel Dis
2013
;
19
:
2326
38
.

18.

Conte
MP
,
Longhi
C
,
Marazzato
M
, et al. .
Adherent-invasive Escherichia coli [AIEC] in pediatric Crohn’s disease patients: phenotypic and genetic pathogenic features.
BMC Res Notes
2014
;
7
:
748
.

19.

Desilets
M
,
Deng
X
,
Rao
C
, et al. .
Genome-based definition of an inflammatory bowel disease-associated adherent-invasive Escherichia coli Pathovar
.
Inflamm Bowel Dis
2016
;
22
:
1
12
.

20.

Ostblom
A
,
Adlerberth
I
,
Wold
AE
,
Nowrouzian
FL.
Pathogenicity island markers, virulence determinants malX and usp, and the capacity of Escherichia coli to persist in infants’ commensal microbiotas.
Appl Environ Microbiol
2011
;
77
:
2303
8
.

21.

O’Brien
CL
,
Bringer
MA
,
Holt
KE
, et al. .
Comparative genomics of Crohn’s disease-associated adherent-invasive Escherichia coli
.
Gut
2017
;
66
:
1382
9
.

22.

Vejborg
RM
,
Hancock
V
,
Petersen
AM
,
Krogfelt
KA
,
Klemm
P.
Comparative genomics of Escherichia coli isolated from patients with inflammatory bowel disease.
BMC Genomics
2011
;
12
:
316
.

23.

Rakitina
DV
,
Manolov
AI
,
Kanygina
AV
, et al. .
Genome analysis of E. coli isolated from Crohn’s disease patients.
BMC Genomics
2017
;
18
:
544
.

24.

Tyakht
AV
,
Manolov
AI
,
Kanygina
AV
, et al. .
Genetic diversity of Escherichia coli in gut microbiota of patients with Crohn’s disease discovered using metagenomic and genomic analyses.
BMC Genomics
2018
;
19
:
968
.

25.

Wattam
AR
,
Davis
JJ
,
Assaf
R
, et al. .
Improvements to PATRIC, the all-bacterial Bioinformatics Database and Analysis Resource Center.
Nucleic Acids Res
2017
;
45
:
D535
42
.

26.

Dubinsky
V
,
Reshef
L
,
Bar
N
, et al. .
Predominantly antibiotic-resistant intestinal microbiome persists in patients with pouchitis who respond to antibiotic therapy.
Gastroenterology
2020
;
158
:
610
24.e13
.

27.

Bolger
AM
,
Lohse
M
,
Usadel
B.
Trimmomatic: a flexible trimmer for Illumina sequence data.
Bioinformatics
2014
;
30
:
2114
20
.

28.

Langmead
B
,
Salzberg
SL.
Fast gapped-read alignment with Bowtie 2.
Nat Methods
2012
;
9
:
357
9
.

29.

Nayfach
S
,
Rodriguez-Mueller
B
,
Garud
N
,
Pollard
KS.
An integrated metagenomics pipeline for strain profiling reveals novel patterns of bacterial transmission and biogeography.
Genome Res
2016
;
26
:
1612
25
.

30.

Wick
RR
,
Judd
LM
,
Gorrie
CL
,
Holt
KE.
Unicycler: resolving bacterial genome assemblies from short and long sequencing reads.
PLoS Comput Biol
2017
;
13
:
e1005595
.

31.

Parks
DH
,
Imelfort
M
,
Skennerton
CT
,
Hugenholtz
P
,
Tyson
GW.
CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes.
Genome Res
2015
;
25
:
1043
55
.

32.

Nurk
S
,
Meleshko
D
,
Korobeynikov
A
,
Pevzner
PA.
metaSPAdes: a new versatile metagenomic assembler.
Genome Res
2017
;
27
:
824
34
.

33.

Kang
DD
,
Li
F
,
Kirton
E
, et al. .
MetaBAT 2: an adaptive binning algorithm for robust and efficient genome reconstruction from metagenome assemblies.
PeerJ
2019
;
7
:
e7359
.

34.

Bowers
RM
,
Kyrpides
NC
,
Stepanauskas
R
, et al. .;
Genome Standards Consortium
.
Minimum information about a single amplified genome [MISAG] and a metagenome-assembled genome [MIMAG] of bacteria and archaea.
Nat Biotechnol
2017
;
35
:
725
31
.

35.

Asnicar
F
,
Thomas
AM
,
Beghini
F
, et al. .
Precise phylogenetic analysis of microbial isolates and genomes from metagenomes using PhyloPhlAn 3.0.
Nat Commun
2020
;
11
:
2500
.

36.

Jain
C
,
Rodriguez-R
LM
,
Phillippy
AM
,
Konstantinidis
KT
,
Aluru
S.
High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries.
Nat Commun
2018
;
9
:
5114
.

37.

Beghain
J
,
Bridier-Nahmias
A
,
Le Nagard
H
,
Denamur
E
,
Clermont
O.
ClermonTyping: an easy-to-use and accurate in silico method for Escherichia genus strain phylotyping
.
Microb Genomics
2018
;
4
:
e000192
.

38.

Seemann
T.
Prokka: rapid prokaryotic genome annotation.
Bioinformatics
2014
;
30
:
2068
9
.

39.

Page
AJ
,
Cummins
CA
,
Hunt
M
, et al. .
Roary: rapid large-scale prokaryote pan genome analysis.
Bioinformatics
2015
;
31
:
3691
3
.

40.

Huerta-Cepas
J
,
Forslund
K
,
Coelho
LP
, et al. .
Fast genome-wide functional annotation through orthology assignment by eggNOG-Mapper.
Mol Biol Evol
2017
;
34
:
2115
22
.

41.

Buchfink
B
,
Xie
C
,
Huson
DH.
Fast and sensitive protein alignment using DIAMOND.
Nat Methods
2015
;
12
:
59
60
.

42.

Aramaki
T
,
Blanc-Mathieu
R
,
Endo
H
, et al. .
KofamKOALA: KEGG Ortholog assignment based on profile HMM and adaptive score threshold.
Bioinformatics
2020
;
36
:
2251
2
.

43.

Lombard
V
,
Golaconda Ramulu
H
,
Drula
E
,
Coutinho
PM
,
Henrissat
B.
The carbohydrate-active enzymes database [CAZy] in 2013.
Nucleic Acids Res
2014
;
42
:
D490
5
.

44.

Chen
L
,
Zheng
D
,
Liu
B
,
Yang
J
,
Jin
Q.
VFDB 2016: hierarchical and refined dataset for big data analysis–10 years on.
Nucleic Acids Res
2016
;
44
:
D694
7
.

45.

Alcock
BP
,
Raphenya
AR
,
Lau
TT
, et al. .
CARD 2020: antibiotic resistome surveillance with the comprehensive antibiotic resistance database
.
Nucleic Acids Res
2019
. Doi: 10.1093/nar/gkz935.

46.

Lees
JA
,
Galardini
M
,
Bentley
SD
,
Weiser
JN
,
Corander
J.
pyseer: a comprehensive tool for microbial pangenome-wide association studies.
Bioinformatics
2018
;
34
:
4310
2
.

47.

Brown
CT
,
Olm
MR
,
Thomas
BC
,
Banfield
JF.
Measurement of bacterial replication rates in microbial communities.
Nat Biotechnol
2016
;
34
:
1256
63
.

48.

Korem
T
,
Zeevi
D
,
Suez
J
, et al. .
Growth dynamics of gut microbiota in health and disease inferred from single metagenomic samples.
Science
2015
;
349
:
1101
6
.

49.

Clermont
O
,
Christenson
JK
,
Denamur
E
,
Gordon
DM.
The Clermont Escherichia coli phylo-typing method revisited: improvement of specificity and detection of new phylo-groups: a new E. coli phylo-typing method
.
Environ Microbiol Rep
2013
;
5
:
58
65
.

50.

Tailford
LE
,
Crost
EH
,
Kavanaugh
D
,
Juge
N.
Mucin glycan foraging in the human gut microbiome.
Front Genet
2015
;
6
:
81
.

51.

Nougayrède
JP
,
Homburg
S
,
Taieb
F
, et al. .
Escherichia coli induces DNA double-strand breaks in eukaryotic cells.
Science
2006
;
313
:
848
51
.

52.

Ruiz-Perez
F
,
Nataro
JP.
Bacterial serine proteases secreted by the autotransporter pathway: classification, specificity, and role in virulence.
Cell Mol Life Sci
2014
;
71
:
745
70
.

53.

Mirsepasi-Lauridsen
HC
,
Du
Z
,
Struve
C
, et al. .
Secretion of alpha-hemolysin by escherichia coli disrupts tight junctions in ulcerative colitis patients
.
Clin Transl Gastroenterol
2016
;
7
:
e149
.

54.

Rossolini
GM
,
D’andrea
MM
,
Mugnaioli
C.
The spread of CTX-M-type extended-spectrum β-lactamases
.
Clin Microbiol Infect
2008
;
14
:
33
41
.

55.

Bunny
KL
,
Hall
RM
,
Stokes
HW.
New mobile gene cassettes containing an aminoglycoside resistance gene, aacA7, and a chloramphenicol resistance gene, catB3, in an integron in pBWH301.
Antimicrob Agents Chemother
1995
;
39
:
686
93
.

56.

Dogan
B
,
Suzuki
H
,
Herlekar
D
, et al. .
Inflammation-associated adherent-invasive Escherichia coli are enriched in pathways for use of propanediol and iron and M-cell translocation
.
Inflamm Bowel Dis
2014
;
20
:
1919
32
.

57.

Viladomiu
M
,
Kivolowitz
C
,
Abdulhamid
A
, et al. .
IgA-coated E. coli enriched in Crohn’s disease spondyloarthritis promote T H 17-dependent inflammation
.
Sci Transl Med
2017
;
9
:
eaaf9655
.

58.

Chaban
B
,
Hughes
HV
,
Beeby
M.
The flagellum in bacterial pathogens: for motility and a whole lot more.
Semin Cell Dev Biol
2015
;
46
:
91
103
.

59.

Confer
AW
,
Ayalew
S.
The OmpA family of proteins: roles in bacterial pathogenesis and immunity.
Vet Microbiol
2013
;
163
:
207
22
.

60.

Navarro-Garcia
F
,
Ruiz-Perez
F
,
Cataldi
Á
,
Larzábal
M.
Type VI secretion system in pathogenic Escherichia coli: structure, role in virulence, and acquisition.
Front Microbiol
2019
;
10
:
1965
.

61.

Tapader
R
,
Bose
D
,
Dutta
P
,
Das
S
,
Pal
A.
SslE [YghJ], a cell-associated and secreted lipoprotein of neonatal septicemic Escherichia coli, induces toll-like receptor 2-dependent macrophage activation and proinflammation through NF-κB and MAP kinase signaling.
Infect Immun
2018
;
86
,
e00399
18
.

62.

Ashkenazy
H
,
Abadi
S
,
Martz
E
, et al. .
ConSurf 2016: an improved methodology to estimate and visualize evolutionary conservation in macromolecules.
Nucleic Acids Res
2016
;
44
:
W344
50
.

63.

Kotlowski
R
,
Bernstein
CN
,
Sepehri
S
,
Krause
DO.
High prevalence of Escherichia coli belonging to the B2+D phylogenetic group in inflammatory bowel disease.
Gut
2007
;
56
:
669
75
.

64.

Sepehri
S
,
Khafipour
E
,
Bernstein
CN
, et al. .
Characterization of Escherichia coli isolated from gut biopsies of newly diagnosed patients with inflammatory bowel disease
.
Inflamm Bowel Dis
2011
;
17
:
1451
63
.

65.

Mirsepasi-Lauridsen
HC
,
Halkjaer
SI
,
Mortensen
EM
, et al. .
Extraintestinal pathogenic Escherichia coli are associated with intestinal inflammation in patients with ulcerative colitis.
Sci Rep
2016
;
6
:
31152
.

66.

Robbe
C
,
Capon
C
,
Coddeville
B
,
Michalski
JC.
Structural diversity and specific distribution of O-glycans in normal human mucins along the intestinal tract.
Biochem J
2004
;
384
:
307
16
.

67.

Welters
CF
,
Heineman
E
,
Thunnissen
FB
,
van den Bogaard
AE
,
Soeters
PB
,
Baeten
CG.
Effect of dietary inulin supplementation on inflammation of pouch mucosa in patients with an ileal pouch-anal anastomosis.
Dis Colon Rectum
2002
;
45
:
621
7
.

68.

Viladomiu
M
,
Metz
ML
,
Lima
SF
, et al. .;
JRI Live Cell Bank
.
Adherent-invasive E. coli metabolism of propanediol in Crohn’s disease regulates phagocytes to drive intestinal inflammation.
Cell Host Microbe
2021
;
29
:
607
19.e8
.

69.

Ng
KM
,
Ferreyra
JA
,
Higginbottom
SK
, et al. .
Microbiota-liberated host sugars facilitate post-antibiotic expansion of enteric pathogens.
Nature
2013
;
502
:
96
9
.

70.

Winter
SE
,
Winter
MG
,
Xavier
MN
, et al. .
Host-derived nitrate boosts growth of E. coli in the inflamed gut.
Science
2013
;
339
:
708
11
.

71.

Pleguezuelos-Manzano
C
,
Puschhof
J
,
Rosendahl Huber
A
, et al. .;
Genomics England Research Consortium
.
Mutational signature in colorectal cancer caused by genotoxic pks+ E. coli.
Nature
2020
;
580
:
269
73
.

72.

Olier
M
,
Marcq
I
,
Salvador-Cartier
C
, et al. .
Genotoxicity of Escherichia coli Nissle 1917 strain cannot be dissociated from its probiotic activity.
Gut Microbes
2012
;
3
:
501
9
.

73.

Schultz
M.
Clinical use of E. coli Nissle 1917 in inflammatory bowel disease
.
Inflamm Bowel Dis
2008
;
14
:
1012
8
.

74.

Dubinsky
V
,
Dotan
I
,
Gophna
U.
Carriage of colibactin-producing bacteria and colorectal cancer risk.
Trends Microbiol
2020
;
28
:
874
6
.

75.

Zhao
S
,
Lieberman
TD
,
Poyet
M
, et al. .
Adaptive evolution within gut microbiomes of healthy people.
Cell Host Microbe
2019
;
25
:
656
67.e8
.

76.

Nesta
B
,
Spraggon
G
,
Alteri
C
, et al. .
FdeC, a novel broadly conserved Escherichia coli adhesin eliciting protection against urinary tract infections
.
mBio
2012
;
3
:
9
.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://dbpia.nl.go.kr/pages/standard-publication-reuse-rights)