Abstract

Microbes have a profound impact on human health. Identifying disease-associated microbes would provide helpful guidance for drug development and disease treatment. Despite enormous experimental effort, only a limited number of disease-associated microbes have been determined. Accurate computational approaches are therefore needed to predict potential microbe-disease associations for biomedical screening. In this study, we present an ensemble learning framework named SABMDA to improve microbe-disease association inference. We first integrate multi-source information on both microbes and diseases, and then apply two matrix completion algorithms in succession to predict microbe-disease associations. Ablation tests show that combining the two matrix completion algorithms yields better prediction performance. Moreover, comprehensive experiments, including cross-validation and an independent test, demonstrate that SABMDA significantly outperforms seven recent baseline methods. Finally, we apply SABMDA to three diseases to predict their associated microbes, and the results show SABMDA’s strong prediction ability in real situations.

Introduction

Microbes exist widely on our planet, including in oceans, soils, and the human body [1]. It is estimated that the human body hosts approximately 350 trillion microbial cells [2]. With recent advances in sequencing and bioinformatics, significant progress has been made in revealing how microbiome composition and function affect human health. For example, studies showed that changes in the composition of human microbiota might impact immunological and pathological conditions [3–5]. Additionally, it has been discovered that hepatic metabolism was regulated by microbiota in the liver through decreasing energy expenditure and promoting adiposity [6]. Because of the fundamental roles of microbes in human health, a growing number of studies have been conducted to elucidate the relationships between microbes and various types of diseases (see review [7] for more details).

Biomedical efforts to discover disease-associated microbes are often time-consuming and costly. Computational approaches to inferring potential microbe-disease associations would therefore benefit the scientific community. Developing algorithms for microbe-disease association prediction has attracted enormous interest in the bioinformatics field [8]. Generally, these computational methods apply random walk [9–11], bipartite local models (BLMs) [12–14], matrix factorization [15–17], and machine learning [18, 19] for association prediction. These methods either prioritize potential associations according to scores derived from graph theory, or treat association prediction as a classification or regression problem in machine learning.

The above computational methods have gradually improved prediction accuracy, but some shortcomings remain. For example, graph-based algorithms are sensitive to noise and sparseness in the data, which can lead to misleading predictions or severely impact performance. For machine learning methods, selecting relevant features from high-dimensional biomedical data can be difficult, and poor feature selection can lead to suboptimal prediction performance.

More recently, with the rapid development of molecular biology, new biomedical data about microbes and diseases are continuously emerging. These heterogeneous data provide complementary information but may contain noise. Integrating biomedical information from different sources would improve the accuracy of microbe-disease association prediction. Meanwhile, improved prediction performance is required to offer useful guidance for biomedical researchers. More reliable prediction algorithms therefore need to be developed alongside the fast advance of modern computer science.

In this study, we present an ensemble learning method named SABMDA, based on matrix completion, to improve microbe-disease association prediction. Specifically, SABMDA first fuses multiple sources of biomedical information on microbes and diseases to form a microbe-disease matrix. It then applies a singular value thresholding (SVT) algorithm to complete the original microbe-disease matrix. Finally, a bounded nuclear norm regularization (BNNR) algorithm with constraints is used to predict microbe-disease associations. In 5-CV, 10-CV, and independent validation tests, SABMDA achieves the best prediction performance when compared with seven baseline methods. In addition, case studies show that SABMDA exhibits reliable inference ability in real situations. In summary, the performance of SABMDA suggests that it is a powerful and effective computational tool for inferring new microbe-disease associations.

Materials and methods

Data preparation

We first download the benchmark dataset from reference [19], in which 4499 experimentally confirmed microbe-disease associations were collected and four categories of microbe-microbe and disease-disease similarities were calculated. The 4499 associations involve 1177 microbes and 134 diseases. The four microbe similarities are functional similarity (FS), cosine similarity (COS_MS), Gaussian interaction profile similarity (GIP_MS), and sigmoid kernel function similarity (SIG_MS). For diseases, semantic similarity (DS), cosine similarity (COS_DS), Gaussian interaction profile similarity (GIP_DS), and sigmoid kernel function similarity (SIG_DS) were computed.

We then fuse the four similarity matrices of microbes and diseases as follows:

(1)
(2)

where SM denotes the fused microbe similarity matrix and SD represents the fused disease similarity matrix. Finally, we integrate the two fused similarity matrices with the microbe-disease association matrix |${A}^{\prime }$| to obtain a new matrix X:

(3)
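As a rough illustration of the fusion and integration steps, the sketch below averages the four similarity matrices of each entity type and assembles a block matrix with NumPy. The averaging rule and the function names `fuse_similarities` and `build_heterogeneous_matrix` are assumptions made for illustration only (other fusion rules are possible); the block layout follows the definition of M used in the SVT step, with the association matrix stored as microbes by diseases.

```python
import numpy as np

def fuse_similarities(mats):
    """Fuse similarity matrices by element-wise averaging (an assumed rule)."""
    return np.mean(np.stack(mats, axis=0), axis=0)

def build_heterogeneous_matrix(SD, SM, A):
    """Assemble [[SD, A^T], [A, SM]]; A has shape (n_microbes, n_diseases)."""
    n_m, n_d = A.shape
    if SD.shape != (n_d, n_d) or SM.shape != (n_m, n_m):
        raise ValueError("similarity and association shapes do not match")
    return np.block([[SD, A.T], [A, SM]])

# Toy example with 3 microbes and 2 diseases (random placeholder data).
rng = np.random.default_rng(0)
n_m, n_d = 3, 2
SM = fuse_similarities([rng.random((n_m, n_m)) for _ in range(4)])
SD = fuse_similarities([rng.random((n_d, n_d)) for _ in range(4)])
A = np.array([[1, 0], [0, 1], [1, 1]], dtype=float)
X = build_heterogeneous_matrix(SD, SM, A)
print(X.shape)  # (5, 5)
```

The resulting matrix is square with side length equal to the number of diseases plus the number of microbes, which matches the dimension used for the matrices in the SVT iterations.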

The framework of SABMDA

The workflow of SABMDA for microbe-disease association prediction is shown in Fig. 1. After similarity fusion, SABMDA combines the similarity matrices with the adjacency matrix and performs matrix completion using an SVT algorithm. Then, a BNNR algorithm is applied to further optimize the completed matrix with meta-type integration, and finally a score matrix for predicting microbe-disease associations is obtained. We rank the scores to screen potential associations.

Figure 1

The workflow of SABMDA for microbe-disease association prediction.

Specifically, we treat microbe-disease association prediction as a matrix completion problem and first apply an SVT algorithm to solve it. The algorithm was previously proposed as a solution to the famous Netflix problem [20]. It hierarchizes the data with different features and finds the optimal threshold value based on the specificity of each feature to achieve an accurate classification of the relevant data.

We update the matrix values iteratively, and in each iteration a matrix Xi |$\in{\mathbb{R}}^{N_{m+d}\times{N}_{m+d}}$| (i denotes the current iteration number) is generated. When the iterations end, a matrix Xn (n denotes the total number of iterations) is obtained that contains all microbe-disease association scores. To ensure that the association score matrix XA in Xn is close to the scores in the adjacency matrix A, the following optimization problem needs to be solved:

(4)

where XA is the training matrix after removing the validation component, and |${P}_{\varOmega }$| is the orthogonal projector onto the span of matrices that vanish outside Ω. The (i, j)th component of |${P}_{\varOmega }$|(XA) is equal to XA(i, j) if (i, j) ∈ Ω and zero otherwise. |${f}_t\left({X}_A\right)$| is a nonlinear function of XA and is defined in the following form:

(5)

where |${\left\Vert{X}_A\right\Vert}_{\ast }$| is the nuclear norm of XA (the sum of its singular values), |${\left\Vert{X}_A\right\Vert}_F$| denotes the Frobenius norm of XA, that is, |${\left\Vert{X}_A\right\Vert}_F=\sqrt{\sum\nolimits_{i=1}^{n_d}\sum\nolimits_{j=1}^{n_m}{X}_A{\left(i,j\right)}^2}$|, and |$\tau$| is a threshold value.
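The quantities named above can be computed directly with NumPy. The following is a minimal sketch; the helper names are illustrative and the Boolean `mask` argument is an assumed way of encoding the index set Ω.

```python
import numpy as np

def nuclear_norm(X):
    """Sum of the singular values of X."""
    return np.linalg.svd(X, compute_uv=False).sum()

def frobenius_norm(X):
    """Square root of the sum of squared entries of X."""
    return np.sqrt((X ** 2).sum())

def project_onto_observed(X, mask):
    """P_Omega: keep entries where mask is True and set the rest to zero."""
    return np.where(mask, X, 0.0)

# Small example: a diagonal matrix with singular values 4 and 3.
X = np.array([[3.0, 0.0], [0.0, 4.0]])
mask = np.array([[True, False], [False, True]])
print(nuclear_norm(X), frobenius_norm(X))
print(project_onto_observed(np.ones((2, 2)), mask))
```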

Then, we introduce the SVT operator and consider the singular value decomposition of a matrix X |$\in{\mathbb{R}}^{N_{m+d}\times{N}_{m+d}}$| of rank r. The definition is as follows:

(6)

where U and V are |${N}_{m+d}\times r$| matrices with orthonormal columns and the singular values |${\sigma}_i$| are greater than zero. For each |$\tau \ge 0$|, we introduce the soft threshold operator |${D}_{\tau }$| defined as follows:

(7)

where |${\left\{{\sigma}_i-\tau \right\}}_{+}$| represents the positive part of |$\left\{{\sigma}_i-\tau \right\}$|. This operator applies the soft threshold rule to the singular values of X, shrinking them toward zero and setting those not larger than |$\tau$| to exactly zero.
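A minimal NumPy sketch of the soft threshold operator is given below. It assumes the usual construction, in which each singular value is replaced by the positive part of its difference with the threshold; the function name `svt_operator` is illustrative.

```python
import numpy as np

def svt_operator(X, tau):
    """Soft-threshold the singular values of X by tau (the operator D_tau)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)   # positive part of (sigma_i - tau)
    return (U * s_shrunk) @ Vt            # scale the columns of U, then recombine

rng = np.random.default_rng(0)
X = rng.random((6, 6))
shrunk = svt_operator(X, tau=0.5)

# The singular values of the result are those of X reduced by tau, floored at 0.
print(np.linalg.svd(X, compute_uv=False))
print(np.linalg.svd(shrunk, compute_uv=False))
```

Because small singular values are set to zero, the output usually has lower rank than the input, which is what makes the operator useful for low-rank matrix completion.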

According to reference [21], SVT can be optimized using the Lagrange multiplier method, and the Lagrange multiplier Y can be obtained as follows:

(8)

where M is defined as |$M=\left[\begin{array}{@{}cc@{}} SD& {A}^T\\{}A& SM\end{array}\right]$|. In each iteration, we apply two key steps from Uzawa’s algorithm [22]. The first one is to update X with Y:

(9)

Then we use X to update Y:

(10)

where |${Y}^0$| is the zero matrix [23], and |${\delta}_k$| is the step size. We assume that the iterations converge to a unique solution when |$0<{\delta}_k<2$|. In our parameter experiments, the model performs best when |${\delta}_k=0.1$|. We also set an iteration period n that limits the maximum number of iterations to avoid infinite loops; details about the setting of n are described in the Results section. Finally, we obtain the matrix Xn, which is then sigmoid-normalized to obtain |${X_n}^{\prime }$| using the following equation:

(11)
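The two-step iteration described above can be sketched as follows. The sketch assumes the standard SVT updates, X^k = D_τ(Y^(k−1)) and Y^k = Y^(k−1) + δ_k P_Ω(M − X^k), with a constant step size; the Boolean `mask` encoding Ω, the toy data, and the parameter values in the demo are illustrative assumptions and are not the settings used in this study. The sigmoid normalization step is not included.

```python
import numpy as np

def svt_operator(X, tau):
    """Soft-threshold the singular values of X by tau."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def svt_complete(M, mask, tau, delta, n_iter):
    """Run n_iter rounds of the two-step SVT iteration and return X."""
    Y = np.zeros_like(M)   # Y^0 is the zero matrix
    X = np.zeros_like(M)
    for _ in range(n_iter):
        X = svt_operator(Y, tau)                     # update X with Y
        Y = Y + delta * np.where(mask, M - X, 0.0)   # update Y with X on Omega
    return X

# Toy rank-2 matrix with roughly 70% of its entries observed.
rng = np.random.default_rng(0)
M = rng.random((15, 2)) @ rng.random((2, 15))
mask = rng.random(M.shape) < 0.7
X_hat = svt_complete(M, mask, tau=1.0, delta=1.0, n_iter=500)

initial = np.linalg.norm(np.where(mask, M, 0.0))
residual = np.linalg.norm(np.where(mask, M - X_hat, 0.0))
print(residual < initial)
```

The fixed iteration count plays the role of the iteration period n: the loop always terminates, and the residual on the observed entries indicates how closely the completed matrix reproduces the known values.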

We extract the score matrix |${X}_A$| from the corresponding position in |${X_n}^{\prime }$|⁠. We further construct a heterogeneous network G that integrates disease-disease similarity network, microbe-disease network, and microbe-microbe similarity network, as follows:

(12)

As the elements in the microbe similarity matrix SM and the disease similarity matrix SD are in the interval [0,1], and the elements in |${X}_A$| are also in the interval [0,1], we expect the predicted values of the unknown associations to be in the interval [0,1] too. Therefore, we add bound constraints to our model to ensure that the elements to be predicted also lie in the interval [0,1].

In addition, since there is a large amount of noise in the data, especially when measuring similarities, our proposed model should effectively tolerate this noise. We therefore redefine the model as:

(13)

where |$\varepsilon$| measures the level of noise. Because the level of noise is unknown, it is not easy to determine an effective value for |$\varepsilon$|. To address this problem, and inspired by references [24, 25], we further optimize the score matrix |${X}_A$| with meta-level type integration. The model is defined as follows:

(14)

where |$\alpha$| is the parameter that balances the nuclear norm and the error term. To make sure all elements in G belong to the interval [0,1], we use the alternating direction method of multipliers (ADMM) [26] to solve this problem.

Before using the ADMM framework to solve the problem, we introduce an auxiliary variable W for optimization:

(15)

Thus, the augmented Lagrangian function can be written in the following form:

(16)

where |$Y$| is the Lagrange multiplier and |$\beta$| is the penalty parameter, which is greater than 0. At the kth iteration, the model computes |${W}_{k+1}$|, |${G}_{k+1}$|, and |${Y}_{k+1}$|.

ADMM is then applied to iteratively update |${W}_{k+1}$|⁠, |${G}_{k+1}$|⁠, and |${Y}_{k+1}$|⁠. Their iterative procedures are denoted as follows:

(17)
(18)
(19)

Based on Equation (17), we can obtain the optimal solution |${W}^{\ast }$| for |${W}_{k+1}$|⁠, which is computed as follows:

(20)

According to Equation (18), |${G}_{k+1}$| can be calculated by the following equation:

(21)

Finally, we can get the microbe-disease score matrix |${X_A}^{\prime }$| from |${G}_{k+1}$| after iterations. We prioritize the received scores for association prediction.
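To make the ADMM procedure more concrete, the sketch below implements one common form of bounded nuclear norm regularization. It assumes the objective ‖G‖_* + (α/2)‖P_Ω(W − M)‖_F² subject to G = W and 0 ≤ W ≤ 1, and derives the three updates from the corresponding augmented Lagrangian; this formulation, the Boolean `mask` for Ω, the function names, and the toy data are assumptions for illustration rather than the exact equations of this study. The default values of α and β are the ones selected in the parameter analysis.

```python
import numpy as np

def svt_operator(X, tau):
    """Soft-threshold the singular values of X by tau."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def bnnr_admm(M, mask, alpha=1.0, beta=50.0, n_iter=100):
    """ADMM sketch for bounded nuclear norm regularization (assumed form)."""
    G = np.where(mask, M, 0.0)
    W = G.copy()
    Y = np.zeros_like(G)
    for _ in range(n_iter):
        # W-step: entry-wise quadratic minimizer, then clip to the [0, 1] bounds.
        observed = (alpha * M + beta * G + Y) / (alpha + beta)
        unobserved = G + Y / beta
        W = np.clip(np.where(mask, observed, unobserved), 0.0, 1.0)
        # G-step: singular value shrinkage with threshold 1 / beta.
        G = svt_operator(W - Y / beta, 1.0 / beta)
        # Multiplier update for the constraint G = W.
        Y = Y + beta * (G - W)
    return W

# Toy matrix with entries in [0, 1] and roughly 70% of them observed.
rng = np.random.default_rng(0)
M = (rng.random((12, 2)) @ rng.random((2, 12))) / 2.0
mask = rng.random(M.shape) < 0.7
scores = bnnr_admm(M, mask, n_iter=50)
print(scores.min(), scores.max())
```

Returning W rather than G guarantees that every predicted score lies in [0,1], because W is clipped in each iteration; in a full pipeline the microbe-disease block would then be extracted from the returned matrix.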

Results

Experimental setting

To evaluate the prediction ability of SABMDA, we perform 5-fold cross-validation (5-CV), 10-fold cross-validation (10-CV), and an independent test based on the benchmark dataset. For 5-CV, we randomly divide all the microbe-disease associations into 5 equal portions, of which 4 portions are used to train the model and the remaining portion is used for testing. We follow similar steps in 10-CV. For the independent test, the microbe-disease association matrix is divided by rows (diseases) into training, testing, and validation sets at a ratio of 8:1:1. We calculate AUC (area under the ROC curve), AUPR (area under the Precision-Recall curve), Recall, Precision (Pre), Accuracy (ACC), and F1-score as the metrics for evaluating the performance of the model.
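The fold construction and some of the metrics can be sketched in plain NumPy as follows. The function names are illustrative, and the way negative samples are chosen for testing is not covered here and is left to the caller; the AUC is computed as the probability that a positive sample is scored above a negative one, which is equivalent to the area under the ROC curve.

```python
import numpy as np

def kfold_association_splits(A, k, seed=0):
    """Yield (train_matrix, held_out_indices) over the known associations of A."""
    rng = np.random.default_rng(seed)
    rows, cols = np.nonzero(A)
    order = rng.permutation(len(rows))
    for fold in np.array_split(order, k):
        train = A.copy()
        train[rows[fold], cols[fold]] = 0   # hide the held-out associations
        yield train, (rows[fold], cols[fold])

def auc_score(labels, scores):
    """AUC via pairwise comparison of positives and negatives (ties count 0.5)."""
    labels = np.asarray(labels).astype(bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

def precision_recall_f1(labels, predictions):
    """Precision, recall, and F1-score for binary labels and predictions."""
    labels = np.asarray(labels).astype(bool)
    predictions = np.asarray(predictions).astype(bool)
    tp = np.sum(labels & predictions)
    precision = tp / max(predictions.sum(), 1)
    recall = tp / max(labels.sum(), 1)
    f1 = 0.0 if tp == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy association matrix with five known associations, split into 5 folds.
A = np.array([[1, 0, 1], [0, 1, 0], [1, 0, 0], [0, 0, 1]])
for train, (r, c) in kfold_association_splits(A, k=5):
    print(train.sum(), list(zip(r.tolist(), c.tolist())))
```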

Parameter analysis

Since SABMDA involves several parameters, we set their initial values based on references [27, 28] and then adjust them empirically through parameter sensitivity analysis. We examine the effects of the threshold value |$\tau$|, the step size |${\delta}_k$|, the iteration period n, the parameter |$\alpha$| that balances the nuclear norm and the error term, and the penalty parameter |$\beta$|. While keeping all other parameters fixed, we conduct experiments on the benchmark dataset and evaluate the impact of the above parameters on the performance under 10-CV.

First, we test the threshold value |$\tau$| and the step size |${\delta}_k$|. We take the threshold value |$\tau$| as 10, 50, 100, and 150, and the step size |${\delta}_k$| as 0.1, 0.5, 1.0, and 1.5. The results are shown in Fig. 2 and Table 1. The AUC, AUPR, ACC, and F1-score of our model are highest when the threshold value |$\tau =10$| and the step size |${\delta}_k=0.1$|.

Figure 2

The effect of |$\tau$| and |${\delta}_k$| on AUC and AUPR.

Table 1

The effect of |$\tau$| and |${\delta}_k$| on model performance

Parameters                      AUC     AUPR    ACC     Pre     Recall  F1-score
|$\tau =10,{\delta}_k=0.1$|     0.9934  0.9930  0.9658  0.9616  0.9706  0.9660
|$\tau =10,{\delta}_k=0.5$|     0.9931  0.9928  0.9655  0.9617  0.9697  0.9656
|$\tau =10,{\delta}_k=1.0$|     0.9931  0.9928  0.9655  0.9618  0.9698  0.9657
|$\tau =10,{\delta}_k=1.5$|     0.9930  0.9928  0.9653  0.9598  0.9713  0.9655
|$\tau =50,{\delta}_k=0.1$|     0.9895  0.9886  0.9563  0.9543  0.9586  0.9564
|$\tau =50,{\delta}_k=0.5$|     0.9911  0.9903  0.9596  0.9546  0.9653  0.9599
|$\tau =50,{\delta}_k=1.0$|     0.9909  0.9901  0.9588  0.9536  0.9648  0.9591
|$\tau =50,{\delta}_k=1.5$|     0.9909  0.9901  0.9587  0.9533  0.9649  0.9590
|$\tau =100,{\delta}_k=0.1$|    0.9835  0.9830  0.9436  0.9389  0.9495  0.9439
|$\tau =100,{\delta}_k=0.5$|    0.9906  0.9898  0.9579  0.9549  0.9615  0.9581
|$\tau =100,{\delta}_k=1.0$|    0.9901  0.9894  0.9571  0.9534  0.9613  0.9572
|$\tau =100,{\delta}_k=1.5$|    0.9899  0.9893  0.9569  0.9521  0.9624  0.9572
|$\tau =150,{\delta}_k=0.1$|    0.9771  0.9783  0.9328  0.9365  0.9290  0.9325
|$\tau =150,{\delta}_k=0.5$|    0.9894  0.9888  0.9538  0.9491  0.9593  0.9541
|$\tau =150,{\delta}_k=1.0$|    0.9901  0.9896  0.9562  0.9524  0.9604  0.9563
|$\tau =150,{\delta}_k=1.5$|    0.9896  0.9891  0.9556  0.9523  0.9595  0.9558

Note: The best results are marked in bold.

Second, we carry out experiments on the iteration period n, selecting its value from 50, 100, 200, 500, 1000, and 2000. Figure 3 and Table 2 show that the AUC and AUPR of our method are highest when the iteration period n = 500.

Figure 3

The effect of iteration period n in parameter analysis.

Table 2

The effect of iteration period n on model performance

n       AUC     AUPR    ACC     Pre     Recall  F1-score
50      0.9906  0.9904  0.9589  0.9525  0.9662  0.9592
100     0.9922  0.9919  0.9631  0.9536  0.9735  0.9634
200     0.9933  0.9929  0.9663  0.9599  0.9733  0.9665
500     0.9934  0.9930  0.9658  0.9616  0.9706  0.9660
1000    0.9932  0.9928  0.9655  0.9594  0.9722  0.9657
2000    0.9931  0.9928  0.9654  0.9600  0.9713  0.9656

Note: The best results are marked in bold.

Finally, we analyze the effects of the two parameters |$\alpha$| and |$\beta$|, taking their values from 1.0, 10.0, 50.0, and 100.0. The results in Fig. 4 and Table 3 show that our model performs best when |$\alpha =1.0$| and |$\beta =50.0$|.

Figure 4

The effect of |$\alpha$| and |$\beta$| on AUC and AUPR.

Table 3

The effect of |$\alpha$| and |$\beta$| on model performance

Parameters                        AUC     AUPR    ACC     Pre     Recall  F1-score
|$\alpha =1.0,\beta =1.0$|        0.9927  0.9925  0.9616  0.9625  0.9608  0.9616
|$\alpha =1.0,\beta =10.0$|       0.9929  0.9926  0.9631  0.9600  0.9731  0.9635
|$\alpha =1.0,\beta =50.0$|       0.9934  0.9930  0.9658  0.9616  0.9706  0.9660
|$\alpha =1.0,\beta =100.0$|      0.9911  0.9911  0.9614  0.9546  0.9688  0.9617
|$\alpha =10.0,\beta =1.0$|       0.9921  0.9919  0.9630  0.9536  0.9730  0.9623
|$\alpha =10.0,\beta =10.0$|      0.9916  0.9911  0.9604  0.9552  0.9667  0.9612
|$\alpha =10.0,\beta =50.0$|      0.9927  0.9924  0.9645  0.9585  0.9711  0.9647
|$\alpha =10.0,\beta =100.0$|     0.9909  0.9909  0.9608  0.9542  0.9682  0.9611
|$\alpha =50.0,\beta =1.0$|       0.9916  0.9915  0.9624  0.9534  0.9724  0.9628
|$\alpha =50.0,\beta =10.0$|      0.9929  0.9926  0.9647  0.9579  0.9722  0.9650
|$\alpha =50.0,\beta =50.0$|      0.9912  0.9912  0.9614  0.9541  0.9695  0.9617
|$\alpha =50.0,\beta =100.0$|     0.9903  0.9904  0.9590  0.9517  0.9673  0.9594
|$\alpha =100.0,\beta =1.0$|      0.9900  0.9902  0.9589  0.9531  0.9655  0.9592
|$\alpha =100.0,\beta =10.0$|     0.9910  0.9910  0.9611  0.9544  0.9684  0.9613
|$\alpha =100.0,\beta =50.0$|     0.9903  0.9904  0.9592  0.9515  0.9677  0.9595
|$\alpha =100.0,\beta =100.0$|    0.9898  0.9900  0.9583  0.9513  0.9662  0.9586

Note: The best results are marked in bold.

To summarize, we finally set the parameters |$\tau =10$|⁠, |${\delta}_k=0.1$|⁠, n = 500, |$\alpha =1.0$| and |$\beta =50.0$| in our model.
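A parameter sweep of this kind can be organized as a simple grid search. The sketch below covers the grid of threshold values and step sizes listed above; the `evaluate` callable is a hypothetical stand-in for a function that would train the model and return its 10-CV AUC, and `toy_evaluate` is a placeholder objective used only to make the example runnable.

```python
from itertools import product

TAU_VALUES = [10, 50, 100, 150]
DELTA_VALUES = [0.1, 0.5, 1.0, 1.5]

def grid_search(evaluate, tau_values=TAU_VALUES, delta_values=DELTA_VALUES):
    """Return the (tau, delta) pair with the highest score from `evaluate`."""
    best_params, best_score = None, float("-inf")
    for tau, delta in product(tau_values, delta_values):
        score = evaluate(tau=tau, delta=delta)
        if score > best_score:
            best_params, best_score = (tau, delta), score
    return best_params, best_score

# Placeholder objective; a real run would return the 10-CV AUC of the model.
def toy_evaluate(tau, delta):
    return -abs(tau - 10) - abs(delta - 0.1)

best_params, best_score = grid_search(toy_evaluate)
print(best_params)
```

The same loop can be reused for the other parameter groups by passing different value lists while the remaining parameters stay fixed inside `evaluate`.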

Ablation test

Our computational framework applies two matrix completion strategies (i.e. SVT and BNNR) for prediction. We therefore conduct ablation experiments based on 10-CV to investigate the impact of these components on model performance. Below are the three models we compare:

SABMDA-SVT model: we remove the SVT algorithm and predict microbe-disease associations only by the BNNR algorithm.

SABMDA-BNNR model: we remove the BNNR algorithm and make predictions only through the SVT algorithm.

SABMDA-SO model: we swap the order of the two algorithms, first performing the BNNR algorithm to generate the score matrix and then applying the SVT algorithm for further meta-type integration of the generated score matrix, with the parameters kept unchanged.

It can be concluded from Fig. 5 that both matrix completion strategies are effective at filling in the missing values in our study. After the first round of matrix completion, some of the missing values in the original matrix are recovered. Based on the new matrix, the second round of matrix completion recovers more missing values. Our ensemble learning framework SABMDA ultimately demonstrates superior performance.

Figure 5

Results of ablation experiments based on 10-CV.

Performance comparison with other methods

In this section, we compare our model SABMDA with the following baseline methods:

SGJMDA [29]: a method based on similarity fusion using graph convolution networks and jumping knowledge networks for microbe-disease association predictions.

DSAE_RF [19]: a method combining deep sparse autoencoder neural network (DSAE) and random forest (RF) for microbe-disease association predictions.

AMHMDA [30]: a method based on attention-aware multi-view similarity networks and hypergraph learning for miRNA-disease association identification.

MHCLMDA [31]: a multihypergraph contrastive learning method for miRNA–disease association predictions.

MNNMDA [32]: a method to predict microbe-disease associations (MDAs) by applying a matrix nuclear norm method to known microbe and disease data.

LRLSHMDA [13]: a method using Laplacian regularized least squares for human microbe–disease association predictions.

NTSHMDA [33]: a method to predict human microbe-disease associations based on random walk by integrating network topological similarity.

All methods are compared under the same experimental setup. For the 5-CV experiment, we plot the ROC and PR curves in Fig. 6. The results show that SABMDA has the highest AUC and AUPR values (0.9919 and 0.9920, respectively), which are 4.52 and 5.43 percentage points higher than those of the second-best method, SGJMDA. The values of all the performance indicators for 5-CV are shown in Table 4. We further calculate P-values (Table 5) based on the AUC and AUPR results obtained from the 5-CV experiments, which indicate significant differences between our method and the other seven baseline methods. All the results show that SABMDA outperforms the seven state-of-the-art methods based on 5-CV.

Figure 6

ROC and PR curves based on 5-CV experiments.

Table 4

Performance comparison based on 5-CV

Method      AUC     AUPR    ACC     Pre     Recall  F1-score
SABMDA      0.9919  0.9920  0.9614  0.9488  0.9648  0.9755
SGJMDA      0.9467  0.9377  0.8851  0.8673  0.9099  0.8879
DSAE_RF     0.9238  0.9178  0.8484  0.8490  0.8482  0.8481
AMHMDA      0.8883  0.8813  0.7902  0.8339  0.7313  0.7750
MHCLMDA     0.8841  0.8763  0.7178  0.7787  0.8635  0.8187
MNNMDA      0.9107  0.9196  0.8886  0.9020  0.8726  0.8867
LRLSHMDA    0.8233  0.7906  0.7617  0.7195  0.8606  0.7832
NTSHMDA     0.7972  0.7720  0.7128  0.6646  0.8657  0.7507

Note: The best results are marked in bold.

Table 5

Statistical test results based on 5-CV between SABMDA and the other seven methods

Method      p-value (AUC)    p-value (AUPR)
SGJMDA      2.69 × 10^-10    2.05 × 10^-7
DSAE_RF     1.00 × 10^-7     9.11 × 10^-9
AMHMDA      4.95 × 10^-11    7.68 × 10^-10
MHCLMDA     6.62 × 10^-9     1.90 × 10^-7
MNNMDA      1.40 × 10^-6     1.78 × 10^-7
LRLSHMDA    1.78 × 10^-10    1.83 × 10^-9
NTSHMDA     1.80 × 10^-13    3.40 × 10^-11

For 10-CV, we plot the ROC and PR curves in Fig. 7. The detailed results are listed in Table 6. As can be seen from Table 6, SABMDA outperforms the other seven methods in all assessment metrics. Its AUC value of 0.9934 and AUPR value of 0.9930 are 4.39 and 5.02 percentage points higher, respectively, than those of SGJMDA, which has the second-best results. Further statistical test results (Table 7) also indicate significant differences between our method and the other seven baseline methods.

Figure 7

ROC and PR curves based on 10-CV experiments.

Table 6

Performance comparison based on 10-CV

Method | AUC | AUPR | ACC | Pre | Recall | F1-score
SABMDA | 0.9934 | 0.993 | 0.9658 | 0.9616 | 0.9706 | 0.966
SGJMDA | 0.9495 | 0.9428 | 0.8901 | 0.8698 | 0.919 | 0.8933
DSAE_RF | 0.9255 | 0.9199 | 0.848 | 0.8486 | 0.848 | 0.8478
AMHMDA | 0.8922 | 0.8854 | 0.7955 | 0.8443 | 0.7274 | 0.7794
MHCLMDA | 0.8817 | 0.8673 | 0.7295 | 0.7723 | 0.8844 | 0.8237
MNNMDA | 0.9209 | 0.9347 | 0.892 | 0.8986 | 0.885 | 0.8914
LRLSHMDA | 0.8292 | 0.7949 | 0.7747 | 0.7338 | 0.8628 | 0.7928
NTSHMDA | 0.7962 | 0.7695 | 0.7148 | 0.6663 | 0.8762 | 0.7548

Note: The best results are marked in bold.

Table 7

Statistical test results based on 10-CV between SABMDA and the other seven methods

Metric | SGJMDA | DSAE_RF | AMHMDA | MHCLMDA | MNNMDA | LRLSHMDA | NTSHMDA
p-value based on AUC results | 2.11 × 10^-13 | 2.37 × 10^-15 | 8.71 × 10^-25 | 1.08 × 10^-17 | 1.53 × 10^-13 | 9.26 × 10^-21 | 5.77 × 10^-20
p-value based on AUPR results | 2.81 × 10^-13 | 2.65 × 10^-16 | 3.67 × 10^-22 | 3.43 × 10^-15 | 9.53 × 10^-16 | 3.31 × 10^-20 | 1.30 × 10^-19

Independent test

To further validate the prediction performance of SABMDA, we conduct an independent test by dividing the microbe-disease association matrix by rows (diseases) into a training set, a test set, and a validation set at a ratio of 8:1:1. We plot the ROC and PR curves of the independent test in Fig. 8 and report all metrics in Table 8. SABMDA obtains scores of 0.8570, 0.8726, 0.8034, 0.7968, 0.8195, and 0.8065 for AUC, AUPR, ACC, Pre, Recall, and F1-score, respectively. It achieves the best value among the eight methods on every metric except Recall, where NTSHMDA scores highest (0.9429). Taken together, these results highlight the effectiveness of SABMDA compared with existing methods.
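The row-wise 8:1:1 division described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `split_rows`, the fixed random seed, and the count of 50 diseases are hypothetical, and the text does not say how rows were shuffled or how remainders were assigned.

```python
import random


def split_rows(n_rows, ratios=(8, 1, 1), seed=0):
    """Shuffle row (disease) indices and split them into train/test/validation."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    total = sum(ratios)
    n_train = n_rows * ratios[0] // total
    n_test = n_rows * ratios[1] // total
    train = indices[:n_train]
    test = indices[n_train:n_train + n_test]
    validation = indices[n_train + n_test:]  # remainder goes to validation
    return train, test, validation


# Hypothetical example: 50 diseases (not the size of the benchmark dataset).
train_rows, test_rows, val_rows = split_rows(50)
```

Because whole rows are held out, every association of a test disease is unseen during training, which makes this setting harder than cross-validation over individual matrix entries.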

Figure 8

ROC and PR curves based on independent test.

Table 8

Performance comparison based on independent test

Method | AUC | AUPR | ACC | Pre | Recall | F1-score
SABMDA | 0.857 | 0.8726 | 0.8034 | 0.7968 | 0.8195 | 0.8065
SGJMDA | 0.7842 | 0.7796 | 0.7043 | 0.6552 | 0.8799 | 0.7492
DSAE_RF | 0.7796 | 0.8038 | 0.7271 | 0.7324 | 0.7271 | 0.7254
AMHMDA | 0.7782 | 0.7847 | 0.6807 | 0.7919 | 0.6537 | 0.6005
MHCLMDA | 0.6108 | 0.5912 | 0.5714 | 0.5565 | 0.926 | 0.693
MNNMDA | 0.71 | 0.743 | 0.629 | 0.5907 | 0.8848 | 0.7053
LRLSHMDA | 0.714 | 0.7754 | 0.6863 | 0.6997 | 0.7086 | 0.6948
NTSHMDA | 0.6114 | 0.5711 | 0.5777 | 0.5449 | 0.9429 | 0.6906

Note: The best results are marked in bold.

Robustness analysis

To further evaluate the generalization ability of SABMDA for association prediction, we apply our model to the HMDD v3.2 dataset [34], a collection of human microRNA-disease associations, and show the 10-CV results in Table 9. SABMDA achieves an AUC of 0.9475 and an AUPR of 0.9540 on this dataset, which indicates that its prediction performance is not limited to the microbe-disease benchmark.

Table 9

The performance of SABMDA on the HMDD v3.2 dataset

Dataset | AUC | AUPR | ACC | Pre | Recall | F1-score
HMDD v3.2 | 0.9475 | 0.9540 | 0.8885 | 0.8831 | 0.8961 | 0.8894

Case studies

In this section, we further test the prediction ability of SABMDA through case studies. We first remove all association information for two specific diseases (i.e. OBESITY [35] and ASTHMA [36]) from the benchmark dataset and then apply our model to predict the microbes associated with each disease. We search PubMed (https://pubmed.ncbi.nlm.nih.gov/) for supporting evidence and manually extract from the related papers whether the level of each microbe is increased or decreased. The validation results are listed in Tables 10 and 11, respectively.
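The masking procedure used in these two case studies can be sketched as below. The sketch makes several assumptions: rows are diseases and columns are microbes, the helper names are hypothetical, and `placeholder_scores` is a deliberately naive stand-in that is not SABMDA (it simply counts how many other diseases each microbe is linked to). Only the hide-then-rank workflow is the point.

```python
def mask_disease(assoc, disease_idx):
    """Return a copy of the matrix with every known link of one disease removed."""
    return [
        [0] * len(row) if d == disease_idx else list(row)
        for d, row in enumerate(assoc)
    ]


def placeholder_scores(assoc, disease_idx):
    """Stand-in predictor (NOT SABMDA): score each microbe by the number
    of other diseases it is linked to in the masked matrix."""
    n_microbes = len(assoc[0])
    return [
        sum(row[m] for d, row in enumerate(assoc) if d != disease_idx)
        for m in range(n_microbes)
    ]


def top_k_microbes(scores, microbe_names, k=20):
    """Names of the k highest-scoring microbes, best first."""
    ranked = sorted(zip(microbe_names, scores), key=lambda t: t[1], reverse=True)
    return [name for name, _ in ranked[:k]]


# Toy data: rows are diseases, columns are microbes (all values hypothetical).
toy_assoc = [
    [1, 0, 1],  # disease 0: the one whose links are hidden
    [1, 1, 0],
    [1, 0, 0],
]
toy_names = ["microbe_A", "microbe_B", "microbe_C"]
masked = mask_disease(toy_assoc, 0)
case_scores = placeholder_scores(masked, 0)
top2 = top_k_microbes(case_scores, toy_names, k=2)
```

In the study itself the ranked candidates are then checked against the literature; that manual step has no counterpart in this sketch.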

Table 10

The top 20 predicted microbes associated with OBESITY

Ranking | Microbe | Evidence | Description
1 | Streptococcus | NA | NA
2 | Faecalibacterium | NA | NA
3 | Haemophilus | PMID:31976177 | Increased
4 | Streptococcus mitis | NA | NA
5 | Paraprevotella | PMID:30525950 | Increased
6 | Dialister | PMID:32624568 | Decreased
7 | Parabacteroides | PMID:31530820 | Increased
8 | Prevotella | PMID:31024514 | Decreased
9 | Akkermansia | PMID:30810328 | Decreased
10 | Roseburia | PMID:34978141 | Increased
11 | Streptococcus salivarius | PMID:36264094 | Decreased
12 | Streptococcus gordonii | NA | NA
13 | Bifidobacterium | PMID:29280312 | Decreased
14 | Alloprevotella | PMID:30611080 | Decreased
15 | Ruminococcus | PMID:31315227 | Increased
16 | Megasphaera | PMID:39033197 | Increased
17 | Bacteroidetes | PMID:17183312 | NA
18 | Actinomyces | PMID:35880087 | Increased
19 | Fusobacterium | PMID:29280312 | Increased
20 | Veillonella | PMID:31024514 | Decreased

Note: NA indicates not available. Different conditions of obesity, such as childhood and adult obesity, are not distinguished in this table as only obesity is included in the benchmark dataset.

Table 11

The top 20 predicted microbes associated with ASTHMA

Ranking | Microbe | Evidence | Description
1 | Bifidobacterium | PMID:30290688 | Increased
2 | Helicobacter pylori | NA | NA
3 | Haemophilus | PMID:32072252 | Increased
4 | Faecalibacterium | PMID:26424567 | Decreased
5 | Streptococcus | PMID:25329665 | Decreased
6 | Bifidobacterium adolescentis | PMID:26840903 | Decreased
7 | Akkermansia muciniphila | PMID:35265071 | Decreased
8 | Rothia | PMID:26424567 | Decreased
9 | Faecalibacterium prausnitzii | PMID:30208875 | Decreased
10 | Prevotella | NA | NA
11 | Dialister | PMID:36969260 | Increased
12 | Bacteroidetes | PMID:20052417 | Decreased
13 | Veillonella dispar | NA | NA
14 | Veillonella | PMID:29445257 | NA
15 | Pseudomonas | PMID:25329665 | Increased
16 | Neisseria | PMID:37287344 | Decreased
17 | Burkholderia | PMID:39549985 | Increased
18 | Pseudomonas aeruginosa | NA | NA
19 | Roseburia | PMID:29031597 | NA
20 | Parascardovia | NA | NA

Moreover, we use all the information in the benchmark dataset and apply SABMDA to predict potential microbe-disease associations. We check the top 20 potential microbes for CROHN’S DISEASE [37] and the top 20 potential microbe-disease associations overall; Tables 12 and 13 show the results. The results of the four case studies indicate that SABMDA is an effective tool for predicting new microbe-disease associations.
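Selecting the top-ranked novel associations from a full score matrix can be sketched as follows. The function name `top_k_novel_pairs` and the toy matrices are hypothetical, and the scores are made up rather than produced by SABMDA; the sketch only assumes that pairs already recorded in the dataset are excluded before ranking.

```python
def top_k_novel_pairs(assoc, scores, k=20):
    """Rank disease-microbe pairs that have no known link by predicted score."""
    candidates = [
        (scores[d][m], d, m)
        for d, row in enumerate(assoc)
        for m, known in enumerate(row)
        if known == 0  # skip pairs already recorded in the dataset
    ]
    candidates.sort(key=lambda t: t[0], reverse=True)
    return [(d, m, s) for s, d, m in candidates[:k]]


# Toy example: 2 diseases x 2 microbes, with one known association.
toy_known = [[1, 0], [0, 0]]
toy_pred = [[0.99, 0.40], [0.70, 0.20]]
novel = top_k_novel_pairs(toy_known, toy_pred, k=2)
```

Restricting the ranking to a single row of the matrix gives the per-disease list of candidate microbes, while ranking over all rows gives the overall list of candidate pairs.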

Table 12

The top 20 potential microbes for CROHN’S DISEASE

Ranking | Microbe | Evidence | Description
1 | Actinobacteria | NA | NA
2 | Prevotellaceae | PMID:35890149 | Increased
3 | Porphyromonas gingivalis | NA | NA
4 | Alcaligenaceae | NA | NA
5 | Methanobrevibacter smithii | NA | NA
6 | Streptococcus sanguinis | NA | NA
7 | Lactobacillus crispatus | NA | NA
8 | Fusobacterium periodonticum | PMID:37932491 | Increased
9 | Tannerella forsythia | NA | NA
10 | Streptococcus gordonii | PMID:39438255 | Increased
11 | Treponema denticola | PMID:23060013 | Increased
12 | Streptococcus oralis | PMID:34646784 | Increased
13 | Lactobacillus iners | NA | NA
14 | Streptococcus constellatus | PMID:34725610 | NA
15 | Campylobacter rectus | PMID:31522142 | NA
16 | Eikenella corrodens | PMID:29574823 | Increased
17 | Selenomonas noxia | NA | NA
18 | Aggregatibacter actinomycetemcomitans | PMID:36768711 | Increased
19 | Capnocytophaga sputigena | NA | NA
20 | Capnocytophaga ochracea | NA | NA

Table 13

The top 20 potential microbe-disease associations predicted by SABMDA

Ranking | Microbe | Disease | Evidence | Description
1 | Veillonella | Ankylosing spondylitis | PMID:30944880 | Increased
2 | Lactobacillus | Ankylosing spondylitis | PMID:36548483 | Decreased
3 | Blautia | Autoimmune hepatitis | PMID:32640728 | Increased
4 | Veillonella | Biliary atresia | PMID:34630385 | Increased
5 | Actinomyces | Obesity | NA | NA
6 | Coprococcus | Ankylosing spondylitis | PMID:37875269 | Increased
7 | Faecalibacterium | Autoimmune hepatitis | PMID:37945156 | Increased
8 | Parabacteroides | Ulcerative colitis | PMID:36547911 | Increased
9 | Faecalibacterium | Alzheimer’s disease | NA | NA
10 | Ruminococcaceae | Major depressive disorder | PMID:32229219 | Decreased
11 | Leptotrichia | Crohn’s disease | PMID:38849764 | Increased
12 | Lactobacillus | Pancreatitis | NA | NA
13 | Bifidobacterium | Primary biliary cholangitis | PMID:36287108 | Increased
14 | Rothia | Colorectal cancer | PMID:33844851 | Increased
15 | Parabacteroides | Primary biliary cholangitis | NA | NA
16 | Veillonella | Colorectal cancer | PMID:36539569 | Increased
17 | Coprococcus | Short bowel syndrome | NA | NA
18 | Parabacteroides | Coronary artery disease | PMID:35343796 | Decreased
19 | Bifidobacterium | Osteoporosis | PMID:37118342 | Decreased
20 | Faecalibacterium | Osteoporosis | PMID:37810879 | Decreased

Conclusion

In this study, we develop SABMDA, a computational approach based on ensemble learning for microbe-disease association prediction. Our method first fuses multiple sources of information from both microbes and diseases as input features. We then develop two matrix completion strategies to recover unknown microbe-disease associations. Comprehensive experiments demonstrate the superiority of our model in inferring new associations between microbes and diseases.

The performance of our method can be attributed to three factors. First, we use reliable biomedical information as benchmark datasets. Second, we integrate multiple sources of information from both microbes and diseases as input features. Third, we apply ensemble learning for association prediction: combining the two matrix completion algorithms improves inference performance. Meanwhile, it should be noted that the mechanisms by which microbes affect human diseases are complex. Our method predicts only microbe-disease associations, and these associations do not establish causal relationships between microbes and diseases. How microbes positively or negatively contribute to human health needs further investigation. Revealing the causal effects between microbes and diseases would be more useful for biomedical research and is a direction for future work.

Key Points
  • We propose an ensemble learning method SABMDA to predict novel disease-associated microbes, in which two matrix completion strategies are developed and used for prediction.

  • Combination of the two matrix completion strategies improves microbe-disease association prediction.

  • Comprehensive experiments demonstrate SABMDA outperforms recent state-of-the-art methods significantly.

Conflict of interest: The authors have declared that no competing interests exist.

Funding

This work was supported by Jiangxi Provincial Natural Science Foundation, China (20242BAB25083).

Data availability

The data and source code for this study are available at https://github.com/IamChenHailin/SABMDA.

References

1. Human Microbiome Project C. A framework for human microbiome research. Nature 2012;486:215–21.

2. Ley RE, Peterson DA, Gordon JI. Ecological and evolutionary forces shaping microbial diversity in the human intestine. Cell 2006;124:837–48.

3. Eckburg PB, Bik EM, Bernstein CN et al. Diversity of the human intestinal microbial flora. Science 2005;308:1635–8.

4. Peterson J, Garges S, Giovanni M et al. The NIH human microbiome project. Genome Res 2009;19:2317–23.

5. Blaser MJ. Harnessing the power of the human microbiome. Proc Natl Acad Sci 2010;107:6125–6.

6. Nicholson JK, Holmes E, Kinross J et al. Host-gut microbiota metabolic interactions. Science 2012;336:1262–7.

7. Althani AA, Marei HE, Hamdi WS et al. Human microbiome and its association with health and diseases. J Cell Physiol 2016;231:1688–94.

8. Wen Z, Yan C, Duan G et al. A survey on predicting microbe-disease associations: biological data and computational methods. Brief Bioinform 2021;22:bbaa157.

9. Niu Y-W, Qu C-Q, Wang G-H et al. RWHMDA: random walk on hypergraph for microbe-disease association prediction. Front Microbiol 2019;10:1578.

10. Yan C, Duan G, Wu F-X et al. BRWMDA: predicting microbe-disease associations based on similarities and bi-random walk on disease and microbe networks. IEEE/ACM Trans Comput Biol Bioinform 2019;17:1595–604.

11. Wang L, Wang Y, Li H et al. A bidirectional label propagation based computational model for potential microbe-disease association prediction. Front Microbiol 2019;10:684.

12. Huang Y-A, You Z-H, Chen X et al. Prediction of microbe–disease association from the integration of neighbor and graph with collaborative recommendation model. J Transl Med 2017;15:1–11.

13. Wang F, Huang Z-A, Chen X et al. LRLSHMDA: Laplacian regularized least squares for human microbe–disease association prediction. Sci Rep 2017;7:7601.

14. Zou S, Zhang J, Zhang Z. Novel human microbe-disease associations inference based on network consistency projection. Sci Rep 2018;8:8034.

15. Qu J, Zhao Y, Yin J. Identification and analysis of human microbe-disease associations by matrix decomposition and label propagation. Front Microbiol 2019;10:291.

16. Wu C, Gao R, Zhang Y. mHMDA: human microbe-disease association prediction by matrix completion and multi-source information. IEEE Access 2019;7:106687–93.

17. Shi J-Y, Huang H, Zhang Y-N et al. BMCMDA: a novel model for predicting human microbe-disease associations via binary matrix completion. BMC Bioinformatics 2018;19:85–92.

18. Li H, Wang Y, Zhang Z et al. Identifying microbe-disease association based on a novel back-propagation neural network model. IEEE/ACM Trans Comput Biol Bioinform 2020;18:2502–13.

19. Wang L, Wang Y, Xuan C et al. Predicting potential microbe–disease associations based on multi-source features and deep learning. Brief Bioinform 2023;24:bbad255.

20. Bennett J, Elkan C, Liu B et al. Kdd cup and workshop 2007. ACM SIGKDD Explorations Newsletter 2007;9:51–2.

21. Bertsekas DP. Nonlinear programming. Journal of the Operational Research Society 1997;48:334–4.

22. Elman HC, Golub GH. Inexact and preconditioned Uzawa algorithms for saddle point problems. SIAM Journal on Numerical Analysis 1994;31:1645–61.

23. Cai J-F, Candès EJ, Shen Z. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization 2010;20:1956–82.

24. Candes E, Recht B. Simple bounds for recovering low-complexity models. Mathematical Programming 2013;141:577–89.

25. Chen C, He B, Yuan X. Matrix completion via an alternating direction method. IMA Journal of Numerical Analysis 2012;32:227–45.

26. Li C-N, Shao Y-H, Yin W et al. Robust and sparse linear discriminant analysis via an alternating direction method of multipliers. IEEE Transactions on Neural Networks and Learning Systems 2019;31:915–26.

27. Li J-Q, Rong Z-H, Chen X et al. MCMDA: matrix completion for MiRNA-disease association prediction. Oncotarget 2017;8:21187–99.

28. Yang M, Luo H, Li Y et al. Drug repositioning based on bounded nuclear norm regularization. Bioinformatics 2019;35:i455–63.

29. Chen H, Chen K. Predicting disease-associated microbes based on similarity fusion and deep learning. Brief Bioinform 2024;25:bbae550.

30. Ning Q, Zhao Y, Gao J et al. AMHMDA: attention aware multi-view similarity networks and hypergraph learning for miRNA-disease associations identification. Brief Bioinform 2023;24:bbad094.

31. Peng W, He Z, Dai W et al. MHCLMDA: multihypergraph contrastive learning for miRNA-disease association prediction. Brief Bioinform 2023;25:bbad524.

32. Liu H, Bing P, Zhang M et al. MNNMDA: predicting human microbe-disease association via a method to minimize matrix nuclear norm. Comput Struct Biotechnol J 2023;21:1414–23.

33. Luo J, Long Y. NTSHMDA: prediction of human microbe-disease association based on random walk by integrating network topological similarity. IEEE/ACM Trans Comput Biol Bioinform 2020;17:1341–51.

34. Huang Z, Shi J, Gao Y et al. HMDD v3.0: a database for experimentally supported human microRNA-disease associations. Nucleic Acids Res 2019;47:D1013–7.

35. Kopelman PG. Obesity as a medical problem. Nature 2000;404:635–43.

36. Borish L. The immunology of asthma: asthma phenotypes and their implications for personalized treatment. Ann Allergy Asthma Immunol 2016;117:108–14.

37. Baumgart DC, Sandborn WJ. Crohn's disease. The Lancet 2012;380:1590–605.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected]

Supplementary data