Abstract

The gut microbiota plays a vital role in human health, and significant effort has been devoted to predicting human phenotypes, especially diseases, from the microbiota using machine learning (ML) methods. However, prediction accuracy on metagenomic data is affected by many factors, e.g. small sample size, class imbalance, and high-dimensional features. To address these challenges, we propose MicroHDF, an interpretable deep learning framework for predicting host phenotypes, in which cascade layers of deep forest units are designed to handle sample class imbalance and high-dimensional features. The experimental results show that the performance of MicroHDF is competitive with that of existing state-of-the-art methods on 13 publicly available datasets covering six different diseases. In particular, it performs best for inflammatory bowel disease (IBD) and liver cirrhosis, with areas under the receiver operating characteristic curve of 0.9182 ± 0.0098 and 0.9469 ± 0.0076, respectively. MicroHDF also shows better performance and robustness in cross-study validation. Furthermore, MicroHDF is applied to two high-risk diseases, IBD and autism spectrum disorder, as case studies to identify potential biomarkers. In conclusion, our method provides effective and reliable prediction of the host phenotype and discovers informative features with biological insights.

Introduction

The human microbiota, which comprises bacteria, archaea, protists, fungi, and viruses, inhabits the human body and forms a complex and diverse ecosystem. Mounting evidence indicates that dysregulation of the human microbiome is associated with numerous diseases, including obesity, diabetes, inflammatory bowel disease (IBD), cardiovascular diseases (CVD), and autism spectrum disorders (ASD), among others [1]. Patients with CVD exhibit abnormal gut microorganisms; for instance, the abundance of certain gut microbiota (such as Prevotella, Bifidobacterium, and Firmicutes) in hypertensive patients is significantly lower than that in healthy individuals [2]. In addition, research has revealed that disruptions in the intestinal microbiome can influence the progression of ASD in children, owing to the intricate relationship between the nervous system and the gastrointestinal tract [3]. For example, the production of butyrate by Bacteroides and Prevotella has been found to potentially promote carbohydrate depletion in individuals with ASD [4]. Therefore, investigating disease mechanisms from a microbiome perspective may offer new insights into disease diagnosis, treatment, and prognosis.

With advances in sequencing technology, microbial data have been generated and stored in public repositories (e.g. MGnify [5], GMrepo v2 [6], Qiita [7], and the National Center for Biotechnology Information [8]), making it possible to explore diseases in a data-driven manner. Currently, machine learning (ML)-based approaches are widely applied to various microbiome analysis tasks, including phenotype prediction, microbial feature classification, the study of microbiome component interactions, and the monitoring of changes in microbiome composition [9, 10]. In this study, we focus on predicting the host phenotype using metagenomic data.

The human microbiome contains rich data expressed in different feature forms, such as operational taxonomic units, amplicon sequence variants, and k-mer representations. Usually, classical ML methods, such as SVM and random forest (RF), make host phenotype predictions based on these features or on newly generated feature representations [11–13]. In addition, several studies have focused on large-scale meta-analyses to validate findings across cohorts using 16S rRNA, shotgun metagenomic, and other omics data [12, 14, 15]. In exploring colorectal cancer, Wirbel et al. [12] revealed disease-specific microbial signatures in a meta-analysis, and Thomas et al. [14] identified replicable cross-cohort microbial signatures. Jiang et al. [15] attempted to identify consistent microbial alterations in common intestinal diseases through meta-analysis. Regarding the input feature problem, Giliberti et al. [16] performed multiple case–control studies and indicated that the presence, rather than the relative abundance, of microbial taxa is important for classification models. To evaluate models for host phenotype classification, Pasolli et al. [17] proposed a comprehensive meta-analysis framework that applied cross-validation, cross-study validation, and cross-disease validation to ML methods. Following this evaluation principle, Li et al. [18] conducted a comprehensive meta-analysis of 20 diseases in 83 cohorts and demonstrated that disease, sequencing type, and sample size are important factors for ML classifier performance. Moreover, ML-based web services and toolboxes have been developed to facilitate phenotype classification tasks, such as ML Repo [19], SIAMCAT [20], GutBalance [21], MarkerML [22], and DisBalance [23]. On the other hand, it is well known that different taxonomic ranks can significantly impact the learning and predictive performance of ML models, which poses challenges in developing robust ML models.

Recently, deep learning approaches have become popular solutions for host phenotype classification in case–control microbiome studies [10]. The basic concept is to transform microbiota abundance data into a low-dimensional representation for downstream analysis. Relevant works include a method using multilayer perceptron (MLP) neural networks with augmented samples [24], an ensemble CNN method that considers the inherent taxonomic correlations [25], and methods that obtain feature representations via an autoencoder [26] or deep variational information bottlenecks [27]. With the popularity of graph neural networks, ensemble GraphSAGE models have been used to construct a disease network module for metagenomic disease prediction from an OTU table [28]. Moreover, Liao et al. [29] constructed an inter-host microbiome similarity graph using the latent features of samples learned from a deep adaptation network, and a GCN model was trained to classify disease status. Although these methods achieved satisfactory results, they neglected the relationships among taxonomic hierarchical structures. Other researchers integrated the phylogenetic tree to improve prediction performance, traversing the tree according to different strategies for visiting tree nodes (position mapping and tree traversal). Reiman et al. [30] initially applied a CNN to predict host phenotypes by embedding microbial phylogenetic trees and the abundance of microbial taxa in a 2D matrix, maintaining spatial and quantitative information from metagenomic data. Li et al. [31] further extended phylogenetic information integration with graph normalization by considering the children of nodes, heights of layers, and patristic distance. Moreover, Chen et al. [32] proposed a framework to obtain an embedded taxonomic representation by level and post-order traversals of the phylogenetic tree.

Although significant advances have been made in phenotype prediction, notable limitations remain. Traditional ML approaches in meta-analysis are capable of yielding profound discoveries, but they fall short in terms of accuracy and cross-cohort prediction. In contrast, deep learning methods exhibit high predictive accuracy but operate as black boxes, lacking model interpretability. Moreover, these methods overlook the issue of imbalanced sample classes within a dataset, which is prevalent in clinical scenarios. Minority classes may be underrepresented when ML methods are trained on class-imbalanced data, resulting in poor predictive performance [33]. In recent years, the deep forest (DF) tree model [34] has been acknowledged for its interpretability and has been applied in metagenomic analysis [35], but it still fails to overcome these limitations [36].

Inspired by previous work [32, 34, 37], we address the challenges of sample class imbalance and high-dimensional features by introducing a new framework, named Microbiome Hybrid Deep Forest (MicroHDF), to enhance the accuracy of host phenotype prediction based on metagenomic data. We first design and integrate two types of forest units into an ensemble framework with cascading layers. One unit employs an RF combined with affinity propagation (AP) clustering and stratified under-sampling (RF-CUS [37]), specifically to address class imbalance. The other unit uses an extreme tree forest (ERTs [38]) based on data complexity reduction to handle the challenges of high-dimensional data. The phylogenetic tree, which depicts the evolutionary history and relationships among species, is used to generate a novel feature matrix incorporating both the structural information of the tree and species abundance data. This new feature matrix, along with the species abundance matrix, is then introduced into the ensemble framework through two distinct channels. Moreover, robust features for the classification task are generated by aggregating features from these two channels. The experimental results show that MicroHDF outperforms existing state-of-the-art methods on 13 publicly available datasets covering six different diseases. Our method provides effective and reliable prediction of the host phenotype and discovers informative features with biological insights.

Materials and methods

MicroHDF is composed of two modules: a feature matrix generation module and a DF module. An overview of MicroHDF is shown in Fig. 1, in which the microbial abundance profile and the phylogenetic tree-based feature matrix are the inputs of the modified DF module. New feature representations are learned on two channels, considering different data information from the DF units. Finally, class distributions are used to make decisions.

Figure 1

Overall workflow of MicroHDF. Metagenomic abundance data embedded into the phylogenetic tree are transformed into new feature matrices by two different tree traversals in the feature matrix generation module. Next, different feature representations are learned from the cascade layers with different DF-based units. Finally, the learned features are aggregated in the output layer and used to perform classification tasks.

Construction of phylogenetic tree-based feature matrix

Numerous recent studies have highlighted the potential of phylogenetic information to improve model performance, given that phylogenetic analysis considers the evolutionary relationships among species [30–32, 39]. Typically, phylogenetic relationships are illustrated in the form of a phylogenetic tree, in which related organisms tend to exhibit similar characteristics. In this study, we first use PhyloT [40] to construct a least-prunes phylogenetic tree based on the taxonomic annotations of microbial species. We then integrate the phylogenetic tree with microbial abundance profiles.

In light of reported research, taxonomic spatial information can be gleaned by traversing the phylogenetic tree [32]. In this study, we investigate two order templates based on popular traversal algorithms for tree structures: post-order and level traversal. With the post-order traversal-based order template, taxa are rearranged in a left-to-right sequence, where groups of taxa sharing the same ancestry reflect the relative temporal order of species nodes during the evolutionary process. Conversely, the order template based on level traversal captures every node at each level of the tree, from top to bottom and left to right, associating tree levels with diverse evolutionary phases and genetic information and representing the evolutionary stage of each node. Subsequently, the abundance of taxa in a sample is used to populate the order templates as new feature vectors. Finally, two phylogenetic tree-based feature matrices are constructed based on the new feature vectors (details are provided in Fig. S1).
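The two order templates can be illustrated with a minimal Python sketch. The toy tree, the taxon names, and the rule that an internal node carries the summed abundance of its leaves are all assumptions made for illustration; the exact construction used by MicroHDF is given in Fig. S1.

```python
from collections import deque

# Hypothetical toy tree: each internal node maps to its ordered children.
TREE = {
    "root": ["Firmicutes", "Bacteroidetes"],
    "Firmicutes": ["sp_A", "sp_B"],
    "Bacteroidetes": ["sp_C"],
}

def post_order(tree, node="root"):
    """Visit children left to right, then the node itself."""
    order = []
    for child in tree.get(node, []):
        order.extend(post_order(tree, child))
    order.append(node)
    return order

def level_order(tree, root="root"):
    """Visit nodes level by level, top to bottom and left to right."""
    order, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        queue.extend(tree.get(node, []))
    return order

def node_abundance(tree, leaf_abundance, node):
    """Assumption: an internal node carries the summed abundance of its leaves."""
    children = tree.get(node, [])
    if not children:
        return leaf_abundance.get(node, 0.0)
    return sum(node_abundance(tree, leaf_abundance, c) for c in children)

def fill_template(tree, template, leaf_abundance):
    """Populate an order template with one sample's abundances."""
    return [node_abundance(tree, leaf_abundance, n) for n in template]

sample = {"sp_A": 0.5, "sp_B": 0.25, "sp_C": 0.25}  # made-up relative abundances
post_vec = fill_template(TREE, post_order(TREE), sample)
level_vec = fill_template(TREE, level_order(TREE), sample)
```

Stacking such vectors over all samples would give one feature matrix per traversal; both vectors hold the same values and differ only in the order that encodes the tree structure.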

Random forest units for dealing with metagenomic data

It is widely acknowledged that analysing metagenomic data is challenging because of imbalanced samples and high-dimensional features. Traditional ML approaches typically address either sample imbalance or the challenges associated with high-dimensional features, whereas only a few models consider both aspects simultaneously. Inspired by [37], we devised two types of RF-based units (RF-CUS and ERTs) to address this challenge. The standard operational process of the RF primarily involves bootstrap sampling and feature selection. The advantages of the RF are its efficient training process and reduced susceptibility to overfitting.

In this work, we apply a DF-based unit (RF-CUS) [37] to deal with imbalanced data, which leverages a class-rebalancing strategy and RF ensemble prediction (Fig. S2). The training set first undergoes AP clustering to cluster the majority class samples. The AP clustering algorithm can dynamically discover cluster centers and autonomously assign data points to clusters. In our study, we employ the Bray–Curtis dissimilarity metric for sample similarity calculations. When the initial run results in too few or too many clusters that fail to represent the underlying data structure accurately, we adjust the number of iterations and the convergence settings to obtain the optimal number of clusters (typically k = 15) under the best classification performance. Furthermore, a fallback mechanism replaces the AP algorithm with K-means clustering if the model fails to converge. Subsequently, stratified undersampling is performed within the different clusters to generate subsets. Each cluster is sampled at a specific ratio according to the size of the minority class, and this process is repeated N times, which creates a series of sample sets for the majority class. Bootstrap sampling is repeatedly applied to the minority class samples to generate an equal number of sample sets. Subsequently, pairwise sampling is conducted to create class-balanced training sets from these two sequences of sample subsets. Finally, a basic RF classifier is trained on each of the balanced training sets, and the classification results are aggregated by voting to formulate the final prediction. Stratified undersampling performed through AP clustering preserves valuable information from the majority class while achieving a balanced merging of sampled datasets. Additionally, the RF ensemble framework effectively rebalances the training dataset to enhance the prediction accuracy.
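The resampling stage described above can be sketched in plain Python. The sketch assumes that cluster assignments of the majority class are already available (e.g. from AP clustering on Bray–Curtis dissimilarities) and that each cluster contributes in proportion to its size; the function names and the proportional rule are illustrative assumptions, not the RF-CUS implementation of [37]. Training an RF on each balanced set is omitted, and only the final vote is shown.

```python
import random

def stratified_undersample(clusters, n_target, rng):
    """Draw about n_target majority samples, proportionally from each cluster."""
    total = sum(len(members) for members in clusters.values())
    picked = []
    for members in clusters.values():
        # Assumption: proportional allocation, at least one sample per cluster.
        quota = max(1, round(n_target * len(members) / total))
        picked.extend(rng.sample(members, min(quota, len(members))))
    return picked

def balanced_training_sets(clusters, minority, n_sets, seed=0):
    """Pair each under-sampled majority subset with a bootstrapped minority subset."""
    rng = random.Random(seed)
    sets = []
    for _ in range(n_sets):
        majority_part = stratified_undersample(clusters, len(minority), rng)
        minority_part = [rng.choice(minority) for _ in minority]  # bootstrap
        sets.append((majority_part, minority_part))
    return sets

def majority_vote(predictions):
    """Aggregate the labels predicted by the per-set classifiers."""
    return max(set(predictions), key=predictions.count)

# Toy data: 9 majority samples in two clusters, 3 minority samples.
clusters = {0: ["m0", "m1", "m2", "m3", "m4", "m5"], 1: ["m6", "m7", "m8"]}
minority = ["c0", "c1", "c2"]
training_sets = balanced_training_sets(clusters, minority, n_sets=4)
```

Because every cluster contributes samples, each balanced set keeps some representation of the majority-class substructure rather than a purely random subset.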

Moreover, dimensionality reduction is an effective approach for managing the high-dimensional features of microbiome data. To enhance the prediction performance, we adopt ERT-based units to reduce the feature dimension. Initially, linear discriminant analysis (LDA) scores are used to rank the taxa that differ between disease and control samples [41]. Subsequently, the feature subset with the most discriminative ability is identified from the ranked feature dataset using a wrapper method, where ERTs serve as the basic classifier and sequential forward selection is adopted as the search strategy. Finally, the locally optimal feature subset is employed to train the ERTs, with these features serving as the candidates that are randomly selected to obtain the optimal split node.
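The wrapper step can be sketched as a generic sequential forward selection loop. In the sketch below, `score_fn` is a hypothetical stand-in for the performance of ERTs trained on a candidate subset, and the integer gains are made-up values chosen only to make the behaviour easy to follow.

```python
def sequential_forward_selection(ranked_features, score_fn):
    """Greedily add the feature that most improves the score; stop when none does.

    ranked_features: candidates in ranked order (e.g. by LDA score); ties in the
    score are resolved in favour of the higher-ranked feature.
    score_fn: maps a list of features to a score (higher is better).
    """
    selected, best = [], float("-inf")
    remaining = list(ranked_features)
    while remaining:
        scored = [(score_fn(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(scored, key=lambda pair: pair[0])
        if top_score <= best:
            break  # no candidate improves the current subset
        selected.append(top_feature)
        remaining.remove(top_feature)
        best = top_score
    return selected, best

# Hypothetical per-feature contributions (in score points) for the toy scorer.
GAINS = {"taxon_1": 30, "taxon_2": 0, "taxon_3": 20, "taxon_4": -5}

def toy_score(features):
    """Stand-in for classifier performance on the given subset."""
    return 50 + sum(GAINS[f] for f in features)

subset, score = sequential_forward_selection(list(GAINS), toy_score)
```

In this toy run the search keeps `taxon_1` and `taxon_3` and stops once the remaining candidates no longer raise the score, which is the sense in which the selected subset is only locally optimal.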

Architectures of the MicroHDF

Our framework follows a three-step procedure: (i) construction of the microbial abundance matrix and phylogenetic tree-based feature matrix to facilitate proper learning; (ii) utilization of DF units to learn different feature representations; and (iii) utilization of class distributions to make predictions. As depicted in Fig. 1, the microbiota abundance and phylogenetic tree-based features are incorporated into the two cascade forest structures within the DF module. Taking cues from deep neural networks, the cascade forest structure resembles a multilayered approach, where feature information progresses from one level to the next. Specifically, the RF-CUS and ERTs serve as embedded units to address sample imbalances and high-dimensional feature issues at each level. Essentially, each level functions as an ensemble of decision-tree forests. Subsequently, the class vectors representing the estimated class distribution for a given sample are concatenated with the raw feature vector and inputted to the subsequent cascade level. Following multilayer operations, new feature representations in the form of estimated class distributions are obtained. Furthermore, new feature representations derived from the abundance feature matrix are combined with those derived from the phylogenetic tree and inputted into the final layer. Here, the RF and ERTs serve as the basic units for the final layer, and the class distribution vectors from different units are averaged to yield the final prediction. This systematic process is formalized as follows:

(1)
(2)

where x denotes the input sample vector, h_l denotes the new feature representation after the l-th cascade layer, f is the ensemble tree model, and || denotes the vector concatenation operation. In Equation (2), for the l-th cascade layer, the tree model j is RF-CUS or ERTs, and in the last layer j is RF or ERTs; k is the modal unit index. Moreover, [p1,0, p2,1] is the class probability vector of the modal unit used for binary classification.
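The data flow through the cascade can be sketched as follows. This is only an illustrative reading of the description above: `toy_unit` is a hypothetical stand-in for a trained forest (RF-CUS, ERTs, or RF), the order of concatenation is assumed, and the same toy layers are reused for both channels purely for brevity.

```python
def run_cascade(x, layers):
    """Pass one sample through a stack of cascade layers.

    layers: list of layers; each layer is a list of units, and each unit maps a
    feature vector to a class-probability vector [p_control, p_disease].
    """
    inputs, class_vector = list(x), []
    for units in layers:
        class_vector = [p for unit in units for p in unit(inputs)]
        inputs = class_vector + list(x)  # class vectors concatenated with raw x
    return class_vector

def predict(x_abundance, x_phylo, layers_abundance, layers_phylo, final_units):
    """Merge the two channels, then average the final-layer class distributions."""
    merged = (run_cascade(x_abundance, layers_abundance)
              + run_cascade(x_phylo, layers_phylo))
    outputs = [unit(merged) for unit in final_units]
    return [sum(o[c] for o in outputs) / len(outputs) for c in range(2)]

def toy_unit(weight):
    """Hypothetical stand-in for a trained forest unit."""
    def unit(features):
        p = min(1.0, weight * sum(features) / len(features))
        return [1.0 - p, p]
    return unit

layers = [[toy_unit(0.5), toy_unit(1.0)], [toy_unit(0.8), toy_unit(0.6)]]
final = [toy_unit(0.9), toy_unit(0.7)]
probs = predict([0.2, 0.3, 0.5], [1.0, 0.7, 0.3, 0.2], layers, layers, final)
```

With two units per layer and binary classification, each channel yields a four-element class vector, so the final layer sees eight inputs and returns one averaged two-class distribution.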

Batch effect adjustment

In cross-study validation, we corrected batch effects between datasets using the R package MMUPHin, where the variable ‘dataset’ and the sample labels in the metadata are used as covariates. A PERMANOVA test, performed with the adonis function in the R package vegan, was conducted to evaluate the effect of the batch adjustment.

Performance evaluation

MicroHDF is a deep learning approach based on a DF framework, wherein each cascade layer is composed of ensemble RFs. Consequently, the ensemble of decision-tree forests provides interpretability, and the significance of features in classification tasks can be quantified using pertinent indicators. Typically, classical assessment metrics are used in decision trees, such as information gain or Gini impurities. The feature importance value (FIV) for a specific feature can be viewed as the ratio of its information gain values relative to those of all features in the trees of the cascade structure.

(3)
(4)

where, in the l-th layer, F_ld denotes the sum of the information gain of the d-th feature in the t-th tree of the n-th forest. Equation (4) expresses the normalized FIV for feature d.
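As a rough illustration of the normalization idea, the sketch below sums the information gain credited to each feature over all trees and divides by the grand total. It follows only the verbal description of the FIV as a ratio; the input layout (one dictionary of gains per tree) and the function name are assumptions, and the exact per-layer formulation is the one given by Equations (3) and (4).

```python
def feature_importance_values(gains_per_tree):
    """Normalize summed information gains so that the FIVs add up to 1.

    gains_per_tree: list of dicts, one per tree (across forests and layers),
    mapping a feature name to the information gain credited to it in that tree.
    """
    totals = {}
    for tree_gains in gains_per_tree:
        for feature, gain in tree_gains.items():
            totals[feature] = totals.get(feature, 0.0) + gain
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {feature: 0.0 for feature in totals}
    return {feature: gain / grand_total for feature, gain in totals.items()}

# Made-up gains for two trees and two taxa.
fiv = feature_importance_values([{"taxon_a": 0.5, "taxon_b": 0.25}, {"taxon_a": 0.25}])
```

Here `taxon_a` receives an FIV of 0.75 and `taxon_b` an FIV of 0.25, so features can be ranked directly by their share of the total gain.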

To demonstrate the effectiveness of MicroHDF, we use five-fold cross-validation repeated ten times to assess the performance of the model. The evaluation is based on the area under the receiver operating characteristic curve (AUC) and the area under the precision-recall curve (AUPR), with the AUPR being particularly informative in the assessment of imbalanced data [42]. In addition to these metrics, we employ other quantitative measurements, including accuracy, recall, and F1-score.
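The evaluation protocol can be sketched with two small helpers: a repeated stratified k-fold splitter and a pairwise AUC. Stratifying the folds by class is an assumption of this sketch (the text only states five-fold cross-validation repeated ten times), and the labels and scores below are toy values.

```python
import random

def repeated_stratified_kfold(labels, n_splits=5, n_repeats=10, seed=0):
    """Yield (train_idx, test_idx) pairs; each class is spread evenly over folds."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        folds = [[] for _ in range(n_splits)]
        for cls in sorted(set(labels)):
            idx = [i for i, y in enumerate(labels) if y == cls]
            rng.shuffle(idx)
            for pos, i in enumerate(idx):
                folds[pos % n_splits].append(i)
        for k in range(n_splits):
            test = sorted(folds[k])
            train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
            yield train, test

def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1] * 10 + [0] * 20
splits = list(repeated_stratified_kfold(labels))
toy_auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

Averaging a metric over the 50 resulting test folds gives a mean and spread comparable in form to the values reported in this study.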

Results

Datasets and settings

To benchmark against other methods, we collect species-level relative abundance profiles of the human gut microbiota from 13 distinct publicly available datasets covering six different diseases (Table 1): obesity [43] (Obesity), colorectal cancer [44] (Colorectal), liver cirrhosis [45] (Cirrhosis), type 2 diabetes in China [46] (C-T2D), type 2 diabetes in Europe [47] (EW-T2D), inflammatory bowel disease (IBD [48], NielsenHB_2014 [49], IjazUz_2017 [50], ICDf [51]) and ASD (Li_ASD [52], Chen_ASD [53], Arizo_ASD [54], Dan_ASD [55]). These datasets were selected as benchmarks to evaluate model performance across different sample class imbalance ratios (IRs). Additionally, the IBD and ASD datasets were used for further cross-study validation. The IR is defined as the ratio of the number of majority class samples to the number of minority class samples, which conveys the degree of class imbalance. To facilitate cross-validation and cross-study validation, we calculated the relative abundances of taxa and removed batch effects using the R package MMUPHin. For IBD, the ‘dataset’ variable explains 34.49% of the variance in microbial profiles between studies (IjazUz_2017, ICDf), which decreased to 23.52% after batch effect adjustment. For ASD (Li_ASD, Dan_ASD), the variance explained by the ‘dataset’ variable is 20.01% before and 18.74% after batch effect adjustment.
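As a small worked example of the IR definition, the snippet below recomputes the ratio from the case and control counts listed in Table 1 for three cohorts; the function name is illustrative.

```python
def imbalance_ratio(n_cases, n_controls):
    """Ratio of the majority class size to the minority class size."""
    majority, minority = max(n_cases, n_controls), min(n_cases, n_controls)
    return majority / minority

# (disease cases, healthy controls) as listed in Table 1.
cohorts = {"Obesity": (164, 89), "Colorectal": (48, 73), "IBD": (25, 85)}
ratios = {name: round(imbalance_ratio(*counts), 2) for name, counts in cohorts.items()}
```

This yields 1.84 for Obesity, 1.52 for Colorectal, and 3.4 for IBD. Note that the majority class is the disease group in the Obesity cohort but the control group in the Colorectal and IBD cohorts.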

Table 1

The detailed information of 13 cohorts used in our study. The first six cohorts are used for the five-fold-cross-validation experiment. The inflammatory bowel disease and ASD disease cohorts are used for cross study experiment.

| Disease name | Cohort name | Disease cases | Healthy controls | Total number samples (TN) | Feature numbers | Imbalance Ratio (IR) | Reference |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Obesity | Obesity | 164 | 89 | 253 | 465 | 1.84 | [17, 43] |
| Colorectal cancer | Colorectal | 48 | 73 | 121 | 507 | 1.52 | [17, 44] |
| Liver cirrhosis | Cirrhosis | 118 | 114 | 232 | 541 | 1.03 | [17, 45] |
| Type 2 diabetes | EW-T2D | 53 | 43 | 96 | 381 | 1.23 | [17, 47] |
| Type 2 diabetes | C-T2D | 170 | 174 | 344 | 606 | 1.02 | [17, 46] |
| Inflammatory bowel disease | IBD | 25 | 85 | 110 | 443 | 3.4 | [17, 48] |
| Inflammatory bowel disease | NielsenHB_2014 | 148 | 248 | 396 | 292 | 1.67 | [16, 49] |
| Inflammatory bowel disease | IjazUz_2017 | 56 | 38 | 94 | 230 | 1.34 | [16, 50] |
| Inflammatory bowel disease | ICDf | 44 | 38 | 82 | 246 | 1.15 | [30, 51] |
| Autism spectrum disorders | Li_ASD | 120 | 40 | 152 | 11342 | 3.34 | [52] |
| Autism spectrum disorders | Chen_ASD | 76 | 47 | 123 | 455 | 1.61 | [53] |
| Autism spectrum disorders | Arizo_ASD | 20 | 20 | 40 | 841 | 1.0 | [54] |
| Autism spectrum disorders | Dan_ASD | 143 | 143 | 286 | 1120 | 1.0 | [55] |

Within the cascade layers of MicroHDF, the number of decision trees is an essential hyperparameter, and it was set to 100. Additionally, the undersampling ratio in the RF-CUS unit was set to 0.4. We adopt RF-CUS*2 and ERT*2 as the forest units; more detailed information is provided in the Supplementary material (Table S1).

Simulation study

We initially apply MicroHDF to synthetic datasets to analyse the impact of different imbalance and under-sampling ratios on the RF-CUS units. To simulate species abundance, we employ SparseDOSSA2 (R package) to generate synthetic microbiome data [56]. Figure 2A illustrates the prediction performance of MicroHDF on synthetic datasets with different IRs and under-sampling ratios. On the simulated dataset, the configuration with an under-sampling ratio of 0.4 and an IR of 4 yielded the top performance, with an AUC of 0.892, as depicted in Fig. 2B. Under these conditions, our method exhibits superior performance compared to the other methods on synthetic metagenomic data, as depicted in Fig. 2C.

Figure 2

Impact of different class IRs and under-sampling ratios on the performance of MicroHDF on the synthetic dataset.

Comparisons with state-of-the-art methods on real datasets

In this section, we benchmark our model against state-of-the-art methods for host phenotype prediction, including three ensemble learning methods (MetAML [17], RF, and GMHI [57]), two classical ML methods (SVM and LASSO), and seven deep-learning-based state-of-the-art models: MLPNN [58], CNN1D [59], DeepMicro [26], Deep Forest [60], PopPhy-CNN [30], GNPI [31], and GDmicro [29].

To ensure a fair comparison, all experiments are conducted on different disease datasets with different class IRs, and five-fold cross-validation is repeated ten times. Figure 3(A) and (B) present the ROC-AUC in the C-T2D (IR:1.02, TN:344) and Obesity (IR:1.84, TN:253) cohorts, respectively. In the Obesity cohort, MicroHDF achieves an AUC of 0.6970 ± 0.0125 and an AUPR of 0.7518 ± 0.0124, outperforming all the other methods. In the C-T2D cohort, MicroHDF obtains a competitive AUC of 0.7896 ± 0.0049, which is lower than that of the first-ranked method (GDmicro), but its AUPR of 0.7592 ± 0.0147 (Table S3) is an improvement of 1.8 percentage points over GDmicro (0.7412 ± 0.0350). To achieve a comprehensive assessment, we further investigate different metrics, namely AUC, AUPR, accuracy, recall, and F1-score, on the Cirrhosis (IR:1.03, TN:232) and IBD (IR:3.4, TN:110) cohorts. The results show that MicroHDF is superior to the other methods not only in the low-IR cohort (Cirrhosis AUC: 0.9469 ± 0.0076, AUPR: 0.9480 ± 0.0049, F1-score: 0.8891 ± 0.0101) but also in the higher-IR cohort (IBD AUC: 0.9182 ± 0.0098, AUPR: 0.7962 ± 0.0117, F1-score: 0.8959 ± 0.0134), although GDmicro yields slightly higher accuracy and recall than our method (Table 2). For the EW-T2D and Colorectal datasets, MicroHDF also achieves notable performance improvements (Table S3). In addition, the Friedman test was used to analyse the performance differences among the 13 methods on the six datasets. The Friedman statistics and the corresponding p values for AUC and AUPR are 50.26 (P = 1.26e-06) and 45.27 (P = 9.25e-06), respectively (N = 6, K = 13), as shown in Supplementary Tables S5 and S6. These results indicate significant differences between our method and the other 12 algorithms. Overall, MicroHDF consistently demonstrates competitive performance, achieving higher evaluation metrics across different class IRs.
Among the baselines, ensemble-based approaches generally exhibit better prediction performance than single-learner paradigms (e.g. SVM and LASSO) on microbial abundance data, possibly because of their capacity to mitigate overfitting risks. Furthermore, although GNPI and PopPhy-CNN perform slightly worse than MicroHDF, their strong results may be attributed to the incorporation of phylogenetic tree information. In addition, although the transfer-learning-based method (GDmicro) may outperform MicroHDF on certain metrics, this can be attributed to its use of more samples to learn from different domains through semi-supervised learning and domain adaptation.
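The Friedman test mentioned above ranks the K methods on each of the N datasets and checks whether the rank sums differ more than expected by chance. A minimal sketch of the standard (tie-uncorrected) statistic is given below; the score table is a made-up toy example, not the data behind Tables S5 and S6, and the p value would additionally require the chi-square distribution with K − 1 degrees of freedom.

```python
def average_ranks(values):
    """Rank scores so that the largest gets rank 1; tied scores share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for i in order[pos:end + 1]:
            ranks[i] = mean_rank
        pos = end + 1
    return ranks

def friedman_statistic(score_table):
    """score_table[i][j]: score of method j on dataset i (N rows, K columns)."""
    n, k = len(score_table), len(score_table[0])
    rank_sums = [0.0] * k
    for row in score_table:
        for j, r in enumerate(average_ranks(row)):
            rank_sums[j] += r
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)

# Toy example: 3 datasets, 3 methods, with the same ordering on every dataset.
toy_scores = [[0.90, 0.80, 0.70], [0.85, 0.75, 0.65], [0.95, 0.90, 0.60]]
chi2 = friedman_statistic(toy_scores)
```

For this toy table the rank sums are 3, 6, and 9, giving a statistic of 6, the maximum possible for N = 3 and K = 3 because the methods are ordered identically on every dataset.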

Figure 3

ROC-AUC comparison between MicroHDF and the other 12 models. (A) the C-T2D disease dataset with low IR. (B) the obesity disease dataset with high IR.

Table 2

The performance comparison between MicroHDF and 12 baseline methods on cirrhosis and IBD disease datasets, with the best result highlighted in bold and the second rank result underlined.

| Dataset | Methods | AUC(%) | AUPR(%) | Accuracy(%) | Recall(%) | F1-score(%) |
| --- | --- | --- | --- | --- | --- | --- |
| Cirrhosis IR = 1.03 | RF | 93.35 | 93.28 | 88.35 | 88.27 | 87.23 |
| Cirrhosis IR = 1.03 | SVM | 93.20 | 92.28 | 83.19 | 85.79 | 85.59 |
| Cirrhosis IR = 1.03 | LASSO | 88.95 | 88.80 | 77.59 | 79.00 | 78.79 |
| Cirrhosis IR = 1.03 | MetAML | 92.19 | 93.55 | 87.70 | 87.40 | 87.89 |
| Cirrhosis IR = 1.03 | GMHI | 90.51 | 89.01 | 81.91 | 86.22 | 81.76 |
| Cirrhosis IR = 1.03 | MLPNN | 92.14 | 92.13 | 84.06 | 84.51 | 84.23 |
| Cirrhosis IR = 1.03 | CNN1D | 90.57 | 90.48 | 80.56 | 82.84 | 82.68 |
| Cirrhosis IR = 1.03 | DeepMicro | 88.50 | 93.20 | 81.70 | 74.00 | 80.00 |
| Cirrhosis IR = 1.03 | Deep Forest | 92.03 | 91.13 | 86.24 | 86.24 | 81.23 |
| Cirrhosis IR = 1.03 | PopPhy-CNN | 90.53 | 91.40 | 84.32 | 83.32 | 80.76 |
| Cirrhosis IR = 1.03 | GNPI | 92.23 | 93.87 | 89.36 | 87.23 | 81.07 |
| Cirrhosis IR = 1.03 | GDmicro | 94.63 | 93.82 | 91.13 | 92.32 | 88.39 |
| Cirrhosis IR = 1.03 | MicroHDF | 94.69 | 94.80 | 89.93 | 91.26 | 88.91 |
| IBD IR = 3.4 | RF | 87.36 | 75.21 | 82.70 | 80.90 | 74.87 |
| IBD IR = 3.4 | SVM | 75.97 | 61.66 | 80.00 | 78.13 | 72.51 |
| IBD IR = 3.4 | LASSO | 76.52 | 58.96 | 78.18 | 76.36 | 68.62 |
| IBD IR = 3.4 | MetAML | 88.20 | 71.55 | 75.90 | 76.00 | 75.70 |
| IBD IR = 3.4 | GMHI | 85.74 | 76.28 | 80.91 | 83.72 | 87.02 |
| IBD IR = 3.4 | MLPNN | 77.83 | 72.40 | 82.72 | 68.18 | 68.72 |
| IBD IR = 3.4 | CNN1D | 84.64 | 66.76 | 81.81 | 75.45 | 74.73 |
| IBD IR = 3.4 | DeepMicro | 85.00 | 74.20 | 77.27 | 80.00 | 85.33 |
| IBD IR = 3.4 | Deep Forest | 84.30 | 76.35 | 78.18 | 78.16 | 73.69 |
| IBD IR = 3.4 | PopPhy-CNN | 83.50 | 70.00 | 72.18 | 78.18 | 77.39 |
| IBD IR = 3.4 | GNPI | 87.53 | 73.12 | 79.94 | 79.94 | 72.84 |
| IBD IR = 3.4 | GDmicro | 88.58 | 78.89 | 87.27 | 87.24 | 87.83 |
| IBD IR = 3.4 | MicroHDF | 91.82 | 79.62 | 86.35 | 86.49 | 89.59 |

On the other hand, several studies have reported that models applied to obesity cohorts often exhibit lower performance. Consistent with these findings, our results also show only a slight improvement in performance on the obesity dataset. A possible explanation is that obesity is a complex disease caused by the interaction of genetic, environmental, and lifestyle factors, so microbial disorders may be only one of several contributing factors.

Cross-study validation

The cross-validation discussed in the previous section allows the evaluation of disease phenotype predictability in a single cohort. However, it may not effectively assess the generalizability of the model to independent validation samples, which is a scenario that is more relevant to a clinical setting. We addressed this question by applying a cross-study strategy to validate the performance of the model.

First, we focus on IBD, using four cohorts that form two types of dataset pairs (pairs whose IRs differ only slightly and pairs whose IRs differ substantially): IjazUz_2017 (IR: 1.34), ICDf (IR: 1.15), IBD (IR: 3.4), and NielsenHB_2014 (IR: 1.67). The generalization of the model is assessed by training the model on one cohort (TR) and applying it to a different test cohort (TS). Figure 4 shows the cross-study validation results for the datasets with different IR values. Our method achieves competitive performance compared to the baseline methods. Specifically, the AUC of MicroHDF on the IBD cohort increases slightly, from 0.5899 to 0.6273, when the model is trained on the NielsenHB_2014 cohort. In addition, MicroHDF outperforms all the other methods in terms of AUPR, exceeding the second-ranked method (GDmicro) by ~2%. A similar analysis is performed on ASD in four distinct cohorts (Arizo_ASD (IR: 1), Dan_ASD (IR: 1), Li_ASD (IR: 3.34), and Chen_ASD (IR: 1.61)), each representing different population characteristics from various countries. Despite clear cohort effects, we observe generalization from one study to another, as shown in Fig. 5. Validation on the Arizo_ASD and Dan_ASD cohorts reveals that the AUC of MicroHDF is lower than that of GDmicro when the model is constructed on Dan_ASD and Arizo_ASD, respectively. However, for the validation of the datasets with a larger IR difference (Li_ASD and Chen_ASD), our method achieves AUCs of 0.6946 and 0.6525, which are higher than those of the top tools. Notably, our method achieves the top AUPR compared to other methods for cross-study validation. Additionally, we conducted performance comparisons across multiple cohorts for the same disease, with the results shown in Fig. S3.
In conclusion, the experimental results indicate that a larger number of samples in the training dataset corresponds to improved model performance, and that training on a dataset with a high class IR may result in reduced generalization.
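As a rough illustration of the cross-study protocol described above, the following sketch trains on one cohort (TR), scores another (TS), and reports AUC and AUPR for every ordered cohort pair. It is a minimal, standard-library-only sketch and not the MicroHDF implementation: the nearest-centroid scorer is a stand-in model, the toy cohorts and all function names are hypothetical, AUPR is approximated by average precision, and IR is assumed to be the ratio of majority-class to minority-class samples.

```python
from itertools import permutations


def imbalance_ratio(labels):
    """Assumed IR definition: majority-class count / minority-class count."""
    pos = sum(labels)
    neg = len(labels) - pos
    return max(pos, neg) / min(pos, neg)


def auc(labels, scores):
    """ROC AUC: fraction of (case, control) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """AUPR approximated as the mean precision at the rank of each case."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for k, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / k
    return total / hits


def fit_centroids(X, y):
    """Stand-in model (not MicroHDF): per-class mean abundance vectors."""
    def mean(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]
    return (mean([x for x, t in zip(X, y) if t == 0]),
            mean([x for x, t in zip(X, y) if t == 1]))


def predict_centroids(model, X):
    """Higher score = closer to the case centroid than to the control centroid."""
    c0, c1 = model

    def sq_dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return [sq_dist(x, c0) - sq_dist(x, c1) for x in X]


def cross_study(cohorts, fit, predict):
    """Train on each cohort (TR) and evaluate on every other cohort (TS)."""
    results = {}
    for tr, ts in permutations(cohorts, 2):
        model = fit(*cohorts[tr])
        X_ts, y_ts = cohorts[ts]
        scores = predict(model, X_ts)
        results[(tr, ts)] = (auc(y_ts, scores), average_precision(y_ts, scores))
    return results


# Hypothetical toy cohorts: (feature rows, labels with 1 = case, 0 = control).
cohorts = {
    "cohort_A": ([[0.1], [0.2], [0.8], [0.9]], [0, 0, 1, 1]),
    "cohort_B": ([[0.0], [0.3], [0.7], [1.0], [0.6]], [0, 0, 1, 1, 1]),
}
for (tr, ts), (auc_value, aupr_value) in cross_study(
        cohorts, fit_centroids, predict_centroids).items():
    print(f"TR={tr} TS={ts} AUC={auc_value:.4f} AUPR={aupr_value:.4f}")
```

Passing `fit` and `predict` as arguments keeps the evaluation loop independent of the model, so any classifier that returns a per-sample score could be substituted for the stand-in.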

Figure 4

AUC and AUPR through cross-study analysis for IBD disease. The model was generated through training (TR) and then applied to test (TS). In bold we report the top value for each setting.

Figure 5

AUC and AUPR through cross-study analysis for ASD disease. The model was generated through training (TR) and then applied to test (TS). In bold we report the top value for each setting.

Components affecting prediction performance

Effect of phylogenetic information

To evaluate the effectiveness of the model, we conduct additional experiments on seven cohorts to compare prediction performance with and without phylogenetic tree information. Table 3 presents the results, in which the feature formats based on phylogenetic trees are raw features (L), raw features (P), and raw features (L + P), representing information obtained through level traversal, post-order traversal, and a combination of both, respectively. Raw feature (O) indicates that the model is trained on the dataset without phylogenetic tree information. Across several cohorts, spatial information from phylogenetic trees can improve the final prediction accuracy by 1%–5%, although performance varies among profiles with distinct taxonomic tree spatial information. Detailed model predictions are provided in Table S7.

Table 3

Comparison with and without phylogenetic tree information and different microbial features representation.

| Dataset | Raw feature (O) AUC(%) | Raw feature (O + L) AUC(%) | Raw feature (O + P) AUC(%) | Raw feature (O + L + P) AUC(%) |
| --- | --- | --- | --- | --- |
| IBD | 89.88 | 91.29 | 91.29 | 91.82 |
| Obesity | 68.09 | 68.92 | 69.06 | 69.70 |
| Colorectal | 68.90 | 70.62 | 72.01 | 76.64 |
| EW-T2D | 68.09 | 68.92 | 69.76 | 73.18 |
| Cirrhosis | 94.48 | 94.49 | 94.52 | 94.69 |
| C-T2D | 77.86 | 77.96 | 78.42 | 78.96 |
| Li_ASD | 76.48 | 78.48 | 78.96 | 80.66 |
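The level (L) and post-order (P) traversals compared in Table 3 can be illustrated with a small sketch of how a traversal order might turn a taxonomic tree into a feature vector. The toy taxonomy, the summing of child abundances into internal nodes, and the order in which the O, L, and P blocks are concatenated are assumptions made for illustration, not necessarily the authors' exact procedure.

```python
from collections import deque

# Hypothetical toy taxonomy: parent -> children; nodes without an entry are leaves.
TREE = {
    "root": ["Firmicutes", "Bacteroidetes"],
    "Firmicutes": ["sp_A", "sp_B"],
    "Bacteroidetes": ["sp_C"],
}


def node_abundance(node, leaf_abundance):
    """Abundance of a node: its own value for a leaf, else the sum over its children."""
    children = TREE.get(node, [])
    if not children:
        return leaf_abundance[node]
    return sum(node_abundance(child, leaf_abundance) for child in children)


def level_order(root="root"):
    """Breadth-first (level) traversal: parents before children, level by level."""
    order, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        queue.extend(TREE.get(node, []))
    return order


def post_order(node="root"):
    """Post-order traversal: every child subtree is visited before its parent."""
    order = []
    for child in TREE.get(node, []):
        order.extend(post_order(child))
    order.append(node)
    return order


def tree_features(leaf_abundance):
    """Concatenate raw (O), level-order (L), and post-order (P) abundance vectors."""
    raw = [leaf_abundance[name] for name in sorted(leaf_abundance)]
    lvl = [node_abundance(n, leaf_abundance) for n in level_order()]
    post = [node_abundance(n, leaf_abundance) for n in post_order()]
    return raw + lvl + post


sample = {"sp_A": 0.5, "sp_B": 0.25, "sp_C": 0.25}
print(level_order())
print(post_order())
print(tree_features(sample))
```

Both traversals visit the same nodes; they differ only in where related taxa end up relative to each other in the vector, which is the spatial information the comparison in Table 3 varies.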

Furthermore, considering that the taxonomic level of the microbial features can influence model performance, we conducted experiments to compare the performance of various models using an integrated abundance data matrix across all hierarchical levels (Table S8) as well as different taxonomic ranks (Table S12). In six datasets, our method MicroHDF-S, which does not incorporate phylogenetic tree information, achieved competitive results, attaining the second-highest AUCs of 94.41 and 78.33 for the Cirrhosis and C-T2D datasets, respectively. Most models show performance improvements due to the inclusion of additional features and implicit hierarchical information. We also evaluated MicroHDF-T, which integrates species-level abundance profiles with phylogenetic tree information. This approach yielded the highest AUC of 94.69 and AUPR of 94.80 for the Cirrhosis dataset, as well as the second-highest AUC of 69.70 and AUPR of 75.18 for the Obesity dataset. It also demonstrated competitive performance across the other datasets, outperforming most models that rely solely on aggregated abundance data across all hierarchical levels. Therefore, for phenotypic classification tasks, the effective incorporation of phylogenetic tree information can enhance model performance.

Effect of different architectures

In addition, we compare the performance of a single-channel module with that of a two-channel module based on microbial abundance and phylogenetic tree features. As shown in Table S9, the two-channel module yields improved prediction performance compared to the single-channel module across seven metagenomic datasets. The average AUCs and F1 scores increased by 1%–3% across the different datasets. The single-channel learning module utilizes a modified DF model to simultaneously learn the microbial abundance profile and phylogenetic tree features. The two-channel learning module consists of two separate and identical modules, each addressing the sample class imbalance and high dimensionality (Fig. S4). It enables independent learning of the embeddings of phylogenetic tree features and microbiological abundance features, with the learned embeddings from both channels combined as inputs to the prediction module.

Experimental ablation study

To further evaluate the capability of MicroHDF to handle imbalanced class distributions and high-dimensional microbiota data, we conduct an ablation study that compares the classification performance of RF-CUS and ERTs as embedded units in the cascade layers. For this purpose, we derive the following variants of our model to assess the impact of each unit:

gcForest: two fully randomized forests and two RFs embedded in the cascade layers.

MicroHDF-CUS: only four RF-CUS units embedded in the cascade layers of the DF.

MicroHDF-ERTs: only four ERTs units embedded in the cascade layers of the DF.
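To make the structure of the gcForest baseline in the list above concrete, the sketch below builds a single cascade layer in which each forest unit contributes out-of-fold class probabilities that are appended to the input features. It assumes scikit-learn and NumPy are available and uses synthetic data; the fully randomized forests are approximated by `ExtraTreesClassifier` with `max_features=1`, the helper names are hypothetical, and the RF-CUS and ERTs units of MicroHDF are not reproduced here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_predict


def cascade_layer(units, X, y, cv=3):
    """Append each unit's out-of-fold class-probability vector to the features."""
    class_vectors = [
        cross_val_predict(unit, X, y, cv=cv, method="predict_proba")
        for unit in units
    ]
    return np.hstack([X] + class_vectors)


def gcforest_units(seed=0):
    """Two fully randomized forests (approximation) and two random forests."""
    return [
        ExtraTreesClassifier(n_estimators=50, max_features=1, random_state=seed),
        ExtraTreesClassifier(n_estimators=50, max_features=1, random_state=seed + 1),
        RandomForestClassifier(n_estimators=50, random_state=seed + 2),
        RandomForestClassifier(n_estimators=50, random_state=seed + 3),
    ]


# Synthetic, mildly imbalanced data standing in for an abundance matrix.
X, y = make_classification(n_samples=120, n_features=30,
                           weights=[0.75, 0.25], random_state=0)
augmented = cascade_layer(gcforest_units(), X, y)
print(augmented.shape)  # 30 input features + 4 units x 2 class probabilities
```

Using out-of-fold predictions rather than predictions on the training samples themselves limits the leakage of training labels into the next layer; an ablation variant would correspond to swapping the list returned by `gcforest_units` for a different set of four units.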

Figure 6 illustrates that MicroHDF consistently outperforms the other variants. The RF-CUS unit plays a crucial role in addressing imbalanced sample distributions, as MicroHDF-CUS achieves comparable AUC performance on the IBD dataset. Moreover, MicroHDF-CUS outperforms MicroHDF-ERTs and gcForest, highlighting the contribution of the RF-CUS unit to the prediction accuracy of our model. Additionally, the ERTs unit enhances the prediction capability of MicroHDF, as evidenced by the higher AUC values of MicroHDF-ERTs compared with MicroHDF-CUS on the C-T2D dataset, which has a balanced sample distribution.

Figure 6

Comparative analysis between MicroHDF and its variants on IBD cohort (A) and C-T2D cohort (B).

Interpretable analysis of important features

MicroHDF addresses the issue of limited model interpretability in most deep-learning methods by quantifying the contribution of microbes and detecting the most discriminative features relevant to the predicted class. By sorting the features by FIV, we are able to identify the bacteria in each dataset that have a predominant influence on disease predictions. In Tables 4 and 5, we present the top 20 microbes ranked by FIV for two common diseases, IBD and ASD; the top 50 microbes related to each disease are detailed in Tables S10 and S11. Additionally, we use linear discriminant analysis effect size (LEfSe) to analyse the relationship between microbiota and disease. Linear discriminant analysis (LDA) scores are used to quantify the extent of the differences, with larger values indicating more significant differences.

Table 4

Prediction results of the top 20 IBD-associated microbes.

| Disease | Rank | FIV | LDA score | Microbe | Evidence |
| --- | --- | --- | --- | --- | --- |
| IBD | 1 | 0.2818 | 3.252 | Unclassified Oscillibacter | PMID:35533243 |
| | 2 | 0.2364 | 3.427 | Bacteroides intestinalis | PMID:25307765 |
| | 3 | 0.1745 | 3.165 | Odoribacter splanchnicus | PMID:33281770 |
| | 4 | 0.1545 | 3.244 | Ruminococcus lactaris | PMID:33330572 |
| | 5 | 0.1364 | 2.990 | Roseburia hominis | PMID:33135936 |
| | 6 | 0.1091 | 3.210 | Bifidobacterium bifidum | PMID:29796620 |
| | 7 | 0.1091 | 3.201 | Alistipes finegoldii | PMID:26288277 |
| | 8 | 0.1090 | 3.171 | Ruminococcus bromii | PMID:31378787 |
| | 9 | 0.1000 | 2.777 | Lachnospiraceae bacterium 8_1_57FAA | PMID:30936547 |
| | 10 | 0.0818 | 2.689 | Bacteroides vulgatus | PMID:32761124 |
| | 11 | 0.0808 | 2.620 | Lachnospiraceae bacterium 8_1_58FAA | PMID:30936547 |
| | 12 | 0.0799 | 2.611 | Akkermansia muciniphila | PMID:32761124 |
| | 13 | 0.0784 | 2.602 | Faecalibacterium prausnitzii | PMID:29796220 |
| | 14 | 0.0784 | 2.597 | Alistipes shahii | PMID:26288277 |
| | 15 | 0.0771 | 2.597 | Bacteroides cellulosilyticus | PMID:30124831 |
| | 16 | 0.0769 | 2.577 | Coprococcus comes | PMID:24629344 |
| | 17 | 0.0768 | 2.564 | Odoribacter_unclassified | PMID:26789999 |
| | 18 | 0.0767 | 2.537 | Butyrivibrio_crossotus | PMID:32761142 |
| | 19 | 0.0757 | 2.476 | Coprococcus_sp_ART55_1 | Unconfirmed |
| | 20 | 0.0756 | 2.426 | Barnesiella_intestinihominis | Unconfirmed |

Table 5

Prediction results of the top 20 ASD-associated microbes.

| Disease | Rank | FIV | LDA score | Microbe | Evidence |
| --- | --- | --- | --- | --- | --- |
| ASD | 1 | 0.0576 | 4.524 | Eubacterium limosum | PMID:32867322 |
| | 2 | 0.0575 | 4.503 | Ruminococcaceae UCG-003 | PMID:32546239 |
| | 3 | 0.0563 | 4.498 | Clostridium sensu stricto 13 | Unconfirmed |
| | 4 | 0.0533 | 4.441 | Prevotella | PMID:28122648 |
| | 5 | 0.0533 | 4.441 | Lachnospiraceae_NK4A136 | PMID:34562157 |
| | 6 | 0.0532 | 4.389 | Lachnospira | PMID:35275534 |
| | 7 | 0.0531 | 4.328 | Clostridium perfringens | PMID:32312186 |
| | 8 | 0.0530 | 4.328 | Bifidobacterium | PMID:32867322 |
| | 9 | 0.0512 | 4.220 | Sulfurovum | Unconfirmed |
| | 10 | 0.0503 | 4.216 | Pedobacter | Unconfirmed |
| | 11 | 0.0500 | 4.206 | Bacteroides_uniformis | PMID:20613793 |
| | 12 | 0.0497 | 4.191 | Ruminococcus_bicirculans.1 | PMID:30567928 |
| | 13 | 0.0496 | 4.179 | Blautia glucerasea | PMID:28429209 |
| | 14 | 0.0496 | 4.143 | Alistipes_putredinis.1 | PMID:30567928 |
| | 15 | 0.0496 | 4.124 | Bacteroides_stercoris | PMID:22180058 |
| | 16 | 0.0486 | 4.024 | Bacteroides_uniformis.2 | PMID:26789999 |
| | 17 | 0.0485 | 4.003 | Bacteroides_vulgatus8 | PMID:22202440 |
| | 18 | 0.0485 | 4.002 | Alistipes_putredinis.2 | PMID:24130822 |
| | 19 | 0.0482 | 4.010 | Bacteroides_vulgatus11 | PMID:20603222 |
| | 20 | 0.0479 | 4.122 | Akkermansia muciniphila | PMID:24130822 |

IBD is a general term encompassing ulcerative colitis (UC) and Crohn’s disease (CD). Recently, microorganisms have been shown to have a significant impact on the development, progression, and treatment of IBD [61]. Based on our findings, Unclassified Oscillibacter has the highest FIV and is therefore the taxon most strongly associated with IBD. Odoribacter splanchnicus, which ranks third, is generally found to decrease in abundance and is linked to the early stages of IBD [62]. Alistipes finegoldii, which ranks seventh, has been identified as a microbial driver of IBD [63–65]. The FIV computed by our model is consistent with the trend in the results of the LEfSe analysis. In other words, MicroHDF can aid in the discovery of disease biomarkers.
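One way to put a number on the agreement between the FIV ranking and the LEfSe LDA scores is a rank correlation. The sketch below computes a Spearman correlation for the five top-ranked IBD microbes in Table 4; it is an illustration only and not an analysis reported by the authors. The function names are hypothetical, and only five rows are used to keep the example short.

```python
def average_ranks(values):
    """Rank values from 1 (smallest), giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# FIV and LDA scores of the five top-ranked IBD microbes in Table 4.
fiv = [0.2818, 0.2364, 0.1745, 0.1545, 0.1364]
lda = [3.252, 3.427, 3.165, 3.244, 2.990]
print(round(spearman(fiv, lda), 3))  # 0.8
```

A value close to 1 means the two scores order the microbes similarly. Average ranks are used so that the same function also handles tied values, such as the repeated FIV of 0.1091 further down the table.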

ASD is a severe neurodevelopmental disorder whose prevalence has increased dramatically over the past several years. A recent clinical study reported a close association between a wide range of microbes and ASD: Prevotella and Firmicutes exhibited differing abundances between the ASD and neurotypical (NT) groups, whereas Lachnospira showed a pattern of decreased abundance [66]. These microorganisms are ranked in the top ten list of results from our model. In addition to the microorganisms confirmed in the literature, we discover three other microorganisms, Clostridium sensu stricto 13, Sulfurovum, and Pedobacter, which have not been directly associated with ASD. Pedobacter and Sulfurovum have been reported to be more abundant in the presence of gastrointestinal (GI) symptoms (such as gaseousness and diarrhea) [67], which often co-occur with core ASD symptoms in children with ASD [67–69]. Their significance as microbes in ASD should be confirmed in future clinical experiments. In summary, we observe that the microbial features selected by our framework are closely associated with disease, indicating that MicroHDF can predict candidate microbes for a given disease and thus greatly assist in the screening of candidate biomarkers.

Discussion

Predicting host phenotypes based on microbiome data is challenging because of the substantial individual differences in the microbial community, which are influenced by genetics, diet, lifestyle, environmental factors, and other variables. In this study, we present a novel prediction framework, MicroHDF, which is a DF-based ensemble model for inferring host phenotypes from microbial data. MicroHDF first embeds phylogenetic tree knowledge of the microbiome together with the relative abundance of microbial taxa to construct a new feature matrix. Subsequently, the RF-CUS and ERTs units in the cascade layers are used to address the class-imbalance and high-dimensionality problems, respectively. Simultaneously, the embeddings learned by the cascade layers aid feature representation learning for the output layer. Finally, predicted labels are obtained using an ensemble strategy. Additionally, FIV is introduced to rank and quantify the contribution of microbiome features to disease prediction. A comparison with state-of-the-art methods and the results of the ablation study confirm that the improved deep-forest-based approach learns more discriminative features from microbes and effectively addresses the challenges of metagenomic data. Furthermore, the experimental results demonstrate the gain obtained by fusing phylogenetic tree information into our proposed approach.

MicroHDF offers the advantage of integrating multiple microbial features regardless of the sample label imbalance and feature dimension. In addition, it performs well when trained on small- to medium-scale datasets. However, our method has some limitations. First, MicroHDF exclusively utilizes microbial features and overlooks the potential of metadata (such as age, sex, BMI, and lifestyle) in host disease prediction, potentially biasing well-known microbe-disease associations. To address this issue, collected host metadata such as diet, stress, and drug frequency can be integrated into the model. These factors are generally considered confounding variables that must be addressed before model application. In metagenomic data analysis, batch effects can arise from biological factors (e.g. variations in demographics), technical factors (e.g. experiment temperatures, reagents, runs, platforms), and computational factors (e.g. pipelines, software, parameters). Biological factors can alter microbiota composition by affecting certain microorganisms, technical factors can introduce artificial heterogeneity, and computational factors can systematically influence all microbial variables. Therefore, to achieve more accurate predictions, it is crucial to account for and remove these confounding factors. Second, the gut microbiomes associated with comorbidities may exhibit microbial patterns that differ from those linked to a single disease, which may significantly disrupt disease prediction. Hence, the impact of correlations between different diseases should be taken into account. Finally, the predictive model may not generalize to complex mental diseases, which presents an inherent challenge for disease prediction. In the future, we plan to apply more advanced deep learning models, such as graph neural networks, to further enhance disease prediction and explore mental disease screening.

Collectively, MicroHDF demonstrates outstanding performance and interpretability for host phenotype prediction, confirming its effectiveness through cross-validation and cross-study validation. MicroHDF can serve as a potent tool for disease prediction based on metagenomic data.

Key Points
  • For microbiome-based supervised disease classification, classifying the host disease phenotype is still a big challenge, partly because of imbalanced classification labels and high-dimensional taxonomic abundance features in metagenomic data.

  • We propose a framework with an improved DF based on the microbiome, termed MicroHDF, for host disease prediction.

  • In MicroHDF, two modified forest types are embedded in the cascade layer: an RF-CUS unit dealing with data imbalance and an ERTs unit dealing with high-dimensional features.

  • MicroHDF leverages the biological knowledge of microbial taxa abundance profiles through a phylogenetic tree, which improves the robustness of classification performance.

Acknowledgements

The authors thank the referees for suggestions that helped improve the manuscript substantially.

Conflict of interest: None declared.

Funding

This work was supported by the National Natural Science Foundation of China (62162019, 62166014), Shanghai Municipal Science and Technology Major Project (No. 2018SHZDZX01), Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence (LCNBI) and ZJLab, Guangxi Key Laboratory Fund of Embedded Technology and Intelligent System, Special Funds for Guiding Local Scientific and Technological Development by the Central Government (No. Guike ZY22096025), the startup grant of Guilin University of Technology, and the Innovation Project of Guangxi Graduate Education (YCSW2024357).

Data availability

The original contributions presented in this study are included in the article/supplementary material; further inquiries can be directed to the corresponding authors. The code and datasets are available online at https://github.com/glutBiolab/MicroHDF.

References

1. Perler BK, Friedman ES, Wu GD. The role of the gut microbiota in the relationship between diet and human health. Annu Rev Physiol 2023;85:449–68.

2. O'Donnell JA, Zheng T, Meric G. et al. The gut microbiome and hypertension. Nat Rev Nephrol 2023;19(3):153–67.

3. Morais LH. The gut microbiota-brain axis in behaviour and brain disorders. Microbiology 2021;19(4):241–55.

4. Morton JT, Jin DM, Mills RH. et al. Multi-level analysis of the gut-brain axis shows autism spectrum disorder-associated molecular and microbial profiles. Nat Neurosci 2023;26(7):1208–17.

5. Richardson L, Allen B, Baldi G. et al. MGnify: the microbiome sequence data analysis resource in 2023. Nucleic Acids Res 2023;51(D1):D753–9.

6. Dai D, Zhu J, Sun C. et al. GMrepo v2: a curated human gut microbiome database with special focus on disease markers and cross-dataset comparison. Nucleic Acids Res 2022;50(D1):D777–84.

7. Gonzalez A, Navas-Molina JA, Kosciolek T. et al. Qiita: rapid, web-enabled microbiome meta-analysis. Nat Methods 2018;15(10):796–8.

8. Barrett T, Clark K, Gevorgyan R. et al. BioProject and BioSample databases at NCBI: facilitating capture and organization of metadata. Nucleic Acids Res 2012;40(Database issue):D57–63.

9. Papoutsoglou G, Tarazona S, Lopes MB. et al. Machine learning approaches in microbiome research: challenges and best practices. Front Microbiol 2023;14:1261889.

10. Hernández Medina R, Kutuzova S, Nielsen KN. et al. Machine learning and deep learning applications in microbiome research. ISME Communications 2022;2(1):98.

11. Shi K, Zhang L, Yu J. et al. A 12-genus bacterial signature identifies a group of severe autistic children with differential sensory behavior and brain structures. Clin Transl Med 2021;11(2):e314.

12. Wirbel J, Pyl PT, Kartal E. et al. Meta-analysis of fecal metagenomes reveals global microbial signatures that are specific for colorectal cancer. Nat Med 2019;25(4):679–89.

13. Topçuoğlu BD, Lesniak NA, Ruffin MT IV. et al. A framework for effective application of machine learning to microbiome-based classification problems. mBio 2020;11(3):e00434–20.

14. Thomas AM, Manghi P, Asnicar F. et al. Metagenomic analysis of colorectal cancer datasets identifies cross-cohort microbial diagnostic signatures and a link with choline degradation. Nat Med 2019;25(4):667–78.

15. Jiang P, Wu S, Luo Q. et al. Metagenomic analysis of common intestinal diseases reveals relationships among microbial signatures and powers multidisease diagnostic models. Microbial systems 2021;6(3):112–21.

16. Giliberti R, Cavaliere S, Mauriello IE. et al. Host phenotype classification from human microbiome data is mainly driven by the presence of microbial taxa. PLoS Comput Biol 2022;18(4):e1010066.

17. Pasolli E, Truong DT, Malik F. et al. Machine learning meta-analysis of large metagenomic datasets: tools and biological insights. PLoS Comput Biol 2016;12(7):e1004977.

18. Li M, Liu J, Zhu J. et al. Performance of gut microbiome as an independent diagnostic tool for 20 diseases: cross-cohort validation of machine-learning classifiers. Gut Microbes 2023;15(1):2205386.

19. Vangay P, Hillmann BM, Knights D. Microbiome learning repo (ML repo): a public repository of microbiome regression and classification tasks. Gigascience 2019;8(1):giz042.

20. Wirbel J, Zych K, Essex M. et al. Microbiome meta-analysis and cross-disease comparison enabled by the SIAMCAT machine learning toolbox. Genome Biol 2021;22(1):93.

21.

Yang
F
,
Zou
Q
,
Gao
B
.
GutBalance: a server for the human gut microbiome-based disease prediction and biomarker discovery with compositionality addressed
.
Brief Bioinform
2021
;
22
(
5
):bbaa436. .

22.

Nagpal
S
,
Singh
R
,
Taneja
B
. et al.
MarkerML - marker feature identification in metagenomic datasets using interpretable machine learning
.
J Mol Biol
2022
;
434
(
11
):167589. .

23.

Yang
F
,
Quan
Z
.
DisBalance: a platform to automatically build balance-based disease prediction models and discover microbial biomarkers from microbiome data
.
Brief Bioinform
2021
;
22
(
5
):bbab094. .

24.

Lo
C
,
Marculescu
R
.
MetaNN: accurate classification of host phenotypes from metagenomic data using neural networks
.
BMC Bioinformatics
2019
;
20
(Suppl 12):
314
. .

25.

Sharma
D
,
Paterson
AD
,
Xu
W
.
TaxoNN: ensemble of neural networks on stratified microbiome data for disease prediction
.
Bioinformatics
2020
;
36
(
17
):
4544
50
. .

26.

Oh
M
,
Zhang
L
.
DeepMicro: deep representation learning for disease prediction based on microbiome data
.
Sci Rep
2020
;
10
(
1
):
6026
. .

27. Grazioli F, Siarheyeu R, Alqassem I. et al. Microbiome-based disease prediction with multimodal variational information bottlenecks. PLoS Comput Biol 2022;18(4):e1010050.

28. Syama K, Jothi JAA, Khanna N. Automatic disease prediction from human gut metagenomic data using boosting GraphSAGE. BMC Bioinformatics 2023;24(1):126.

29. Liao H, Shang J, Sun Y. GDmicro: classifying host disease status with GCN and deep adaptation network based on the human gut microbiome data. Bioinformatics 2023;39(12):747.

30. Reiman D, Metwally AA, Sun J. et al. PopPhy-CNN: a phylogenetic tree embedded architecture for convolutional neural networks to predict host phenotype from metagenomic data. IEEE J Biomed Health Inform 2020;24(10):2993–3001.

31. Li B, Zhong D, Qiao J. et al. GNPI: graph normalization to integrate phylogenetic information for metagenomic host phenotype prediction. Methods 2022;205:11–7.

32. Chen X, Zhang W, Wong KC. Human disease prediction from microbiome data by multiple feature fusion and deep learning. Cell iScience 2022;25(4):104081.

33. Sharma S, Gosain A, Jain S. A Review of the Oversampling Techniques in Class Imbalance Problem. In: International Conference on Innovative Computing and Communications. Singapore, 2022, p. 459–72.

34. Zhou Z-H, Feng J. Deep forest. Natl Sci Rev 2019;6(1):74–86.

35. Jin S, Zeng X, Xia F. et al. Application of deep learning methods in biological networks. Brief Bioinform 2021;22(2):1902–17.

36. Zhu Q, Li B, He T. et al. Robust biomarker discovery for microbiome-wide association studies. Methods 2020;173:44–51.

37. Wu L, Gao J, Zhang Y. et al. A hybrid deep forest-based method for predicting synergistic drug combinations. Cell Rep Methods 2023;3(2):100411.

38. Geurts P, Ernst D, Wehenkel L. Extremely randomized trees. Machine Learning 2006;63:3–42.

39. Wang Y, Bhattacharya T, Jiang Y. et al. A novel deep learning method for predictive modeling of microbiome data. Brief Bioinform 2021;22(3):bbaa073.

40. Letunic I, Bork P. Interactive tree of life (iTOL) v5: an online tool for phylogenetic tree display and annotation. Nucleic Acids Res 2021;49(W1):W293–6.

41. Segata N, Izard J, Waldron L. et al. Metagenomic biomarker discovery and explanation. Genome Biol 2011;12(6):R60.

42. Saito T, Rehmsmeier M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS One 2015;10(3):e0118432.

43. Le Chatelier E, Nielsen T, Qin J. et al. Richness of human gut microbiome correlates with metabolic markers. Nature 2013;500(7464):541–6.

44. Zeller G, Tap J, Voigt AY. et al. Potential of fecal microbiota for early-stage detection of colorectal cancer. Mol Syst Biol 2014;10(11):766.

45. Qin N, Yang F, Li A. et al. Alterations of the human gut microbiome in liver cirrhosis. Nature 2014;513(7516):59–64.

46. Qin J, Li Y, Cai Z. et al. A metagenome-wide association study of gut microbiota in type 2 diabetes. Nature 2012;490(7418):55–60.

47. Karlsson FH, Tremaroli V, Nookaew I. et al. Gut metagenome in European women with normal, impaired and diabetic glucose control. Nature 2013;498(7452):99–103.

48. Qin J, Li R, Raes J. et al. A human gut microbial gene catalogue established by metagenomic sequencing. Nature 2010;464(7285):59–65.

49. Nielsen HB, Almeida M, Juncker AS. et al. Identification and assembly of genomes and genetic elements in complex metagenomic samples without using reference genomes. Nat Biotechnol 2014;32(8):822–8.

50. Ijaz UZ, Quince C, Hanske L. et al. The distinct features of microbial 'dysbiosis' of Crohn's disease do not occur to the same extent in their unaffected, genetically-linked kindred. PLoS One 2017;12(2):e0172605.

51. Sokol H, Leducq V, Aschard H. et al. Fungal microbiota dysbiosis in IBD. Gut 2017;66(6):1039–48.

52. Li N, Chen H, Cheng Y. et al. Fecal microbiota transplantation relieves gastrointestinal and autism symptoms by improving the gut microbiota in an open-label study. Front Cell Infect Microbiol 2021;11:759435.

53. Chen Y, Fang H, Li C. et al. Gut bacteria shared by children and their mothers associate with developmental level and social deficits in autism spectrum disorder. Clin Vaccine Immunol 2020;5(6):e01044–20.

54. Kang DW, Park JG, Ilhan ZE. et al. Reduced incidence of Prevotella and other fermenters in intestinal microflora of autistic children. PLoS One 2013;8(7):e68322.

55. Dan Z, Mao X, Liu Q. et al. Altered gut microbial profile is associated with abnormal metabolism activity of autism spectrum disorder. Gut Microbes 2020;11(5):1246–67.

56. Ma S, Ren B, Mallick H. et al. A statistical model for describing and simulating microbial community profiles. PLoS Comput Biol 2021;17(9):e1008913.

57. Gupta VK, Kim M, Bakshi U. et al. A predictive index for health status using species-level gut microbiome profiling. Nat Commun 2020;11(1):4635.

58. Ditzler G, Polikar R, Rosen G. Multi layer and recursive neural networks for metagenomic classification. IEEE Trans Nanobioscience 2015;14(6):608–16.

59. Reiman D, Metwally A, Dai Y. Using convolutional neural networks to explore the microbiome. Annu Int Conf IEEE Eng Med Biol Soc 2017;2017:4269–72.

60. Zhu Q, Zhu Q, Pan M. et al. The Phylogenetic Tree based Deep Forest for Metagenomic Data Classification. In: 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). 2018, p. 279–282.

61. Andoh A, Nishida A. Alteration of the gut microbiome in inflammatory bowel disease. Digestion 2023;104(1):16–23.

62. Lima SF, Gogokhia L, Viladomiu M. et al. Transferable immunoglobulin A-coated Odoribacter splanchnicus in responders to fecal microbiota transplantation for ulcerative colitis limits colonic inflammation. Gastroenterology 2022;162(1):166–78.

63. Yan P, Sun Y, Luo J. et al. Integrating the serum proteomic and fecal metaproteomic to analyze the impacts of overweight/obesity on IBD: a pilot investigation. Clin Proteomics 2023;20(1):6.

64. Bi Z, Chen J, Chang X. et al. ADT-OH improves intestinal barrier function and remodels the gut microbiota in DSS-induced colitis. Front Med 2023;17(5):972–92.

65. Yang Y, Jobin C. Novel insights into microbiome in colitis and colorectal cancer. Curr Opin Gastroenterol 2017;33(6):422–7.

66. West KA, Yin X, Rutherford EM. et al. Multi-angle meta-analysis of the gut microbiome in autism spectrum disorder: a step toward understanding patient subgroups. Sci Rep 2022;12(1):17034.

67. Dovgan K, Gynegrowski K, Ferguson BJ. Bidirectional relationship between internalizing symptoms and gastrointestinal problems in youth with autism spectrum disorder. J Autism Dev Disord 2023;53(11):4488–94.

68. Lasheras I, Real-Lopez M, Santabarbara J. Prevalence of gastrointestinal symptoms in autism spectrum disorder: a meta-analysis. An Pediatr (Engl Ed) 2023;99(2):102–10.

69. Mazefsky CA, Schreiber DR, Olino TM. et al. The association between emotional and behavioral problems and gastrointestinal symptoms among children with high-functioning autism. Autism 2014;18(5):493–501.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.