Abstract

With the development and decreasing cost of next-generation sequencing technologies, the study of the human microbiome has become a rapidly expanding research field, which provides unprecedented opportunities for various clinical applications such as drug response prediction and disease diagnosis. It is thus essential and desirable to build a prediction model for clinical outcomes based on microbiome data, which usually consist of taxon abundances and a phylogenetic tree. Importantly, microbial species are not uniformly distributed in the phylogenetic tree but tend to be clustered at different phylogenetic depths. Therefore, the phylogenetic tree represents a unique correlation structure of the microbiome, which can serve as an important prior to improve prediction performance. However, prediction methods that consider the phylogenetic tree in an efficient and rigorous way are underdeveloped. Here, we develop a novel deep learning prediction method, MDeep (microbiome-based deep learning method), to predict both continuous and binary outcomes. Conceptually, MDeep designs convolutional layers to mimic taxonomic ranks, with multiple convolutional filters on each convolutional layer to capture the phylogenetic correlation among microbial species in a local receptive field, and maintains the correlation structure across different convolutional layers via feature mapping. Taken together, the convolutional layers with their built-in convolutional filters capture microbial signals at different taxonomic levels while encouraging local smoothing and preserving the local connectivity induced by the phylogenetic tree. We use both simulation studies and real data applications to demonstrate that MDeep outperforms competing methods in both regression and binary classification. Availability and Implementation: MDeep software is available at https://github.com/lichen-lab/MDeep Contact: [email protected]

Introduction

Human microbiota, including bacteria, archaea, viruses and eukaryotes, colonizes the human body and affects host physiology to a great extent. The composition and function of the microbiota vary across different body sites, ages, genders, races and diets of the host [18]. The roles that human microbiota plays in human health are usually summarized in three aspects. First, the microbiota could potentially aid the digestive system by extracting energy from food and harvesting nutrients more efficiently [13, 41], as the microbiota provides humans with enzymes and biochemical pathways [13] produced by versatile metabolic microbial genes that far outnumber those found in the human genome. Second, the human microbiota protects its host against invasive pathogens by providing a physical barrier, producing antimicrobial substances or engaging in competitive exclusion [6, 19]. Third, the microbiota is essential in the induction, training and function of the host immune system [4, 25]. As a consequence, dysbiosis of the human microbiota, that is, an abnormal change in the relative abundances of the microbiota, can lead to various human diseases such as infection (e.g. Clostridium difficile, Helicobacter pylori, bacterial vaginosis), liver diseases (e.g. acute-on-chronic liver failure), gastrointestinal malignancy (e.g. gastric cancer, colorectal cancer), metabolic disorders (e.g. obesity, type 2 diabetes), autoimmune diseases (e.g. Crohn’s disease) and even mental or psychological diseases (e.g. autism spectrum disorder) [30].

16S rRNA gene-targeted sequencing is a cost-effective metagenomic sequencing technology that has been widely used in microbiome studies, mainly to profile bacteria and archaea by sequencing the structural components of the ribosome (V1–V9), which can be used as a molecular clock to identify phylogeny [22]. The raw sequencing reads can be processed using established bioinformatics pipelines such as Quantitative Insights Into Microbial Ecology [5], which cluster the reads into operational taxonomic units (OTUs) at different taxonomic levels. As a result, the processed microbiome data consist of an OTU abundance matrix, with rows as samples and columns as OTUs, along with a phylogenetic tree from which the phylogenetic relationships among OTUs can be inferred.

As the human microbiota is closely associated with human health, it is natural to use OTUs as ‘biomarkers’ to predict host phenotypes or clinical outcomes. It should be noted that microbiome data are usually high-dimensional (more OTUs than samples), over-dispersed (large variability) and sparse (excessive zeros in the OTU abundance matrix). These data characteristics make robust machine learning models such as random forest (RF) [17] desirable for microbiome-based prediction: RF handles high-dimensional features by randomly sampling a subset of features in each decision tree to reduce the possibility of overfitting, and the final prediction aggregates the predictions of all trees. Moreover, modern regression methods such as lasso [40], MCP [50] and elastic net (Enet) [52] are inherently designed for high-dimensional classification and regression and have been widely used in microbiome-based prediction [21, 29, 39]. These regression models usually incorporate a sparse penalty to select the most predictive taxa in the training set and thus improve the prediction performance in the testing set using the selected taxa. Besides the sparse penalty, phylogeny-regularized regression models have recently been developed by including an additional tree-guided smoothness penalty [8, 43, 44]. Though these models improve the prediction performance when the tree is informative, they are computationally expensive and sensitive to the selection of multiple tuning parameters.

Although these methods adopt different approaches to address the high-dimensional prediction task, they are limited in exploiting the phylogenetic relationship among taxa. The phylogenetic tree is an informative prior because microbial community changes are not randomly distributed but tend to occur in clades at varying phylogenetic depths corresponding to different taxonomic ranks. In other words, the tree provides a phylogenetically correlated structure among taxa, based on which we can cluster and aggregate taxon abundances to achieve better predictive performance. Moreover, the cluster size and signal density also vary. For diseases such as colorectal cancer or arthritis [34, 49], few marker taxa are found to be associated with the disease state, whereas effects on the overall composition are very mild. In contrast, obesity and inflammatory bowel disease are associated with marked changes in the overall composition [23, 26]. Taken together, an optimized prediction model should have robust prediction performance across different cluster sizes and signal densities.

Recently, deep learning methods, especially the convolutional neural network (CNN), have been widely used in bioinformatics for various prediction tasks: (i) predicting genomic regions such as TF-DNA binding sites [1, 42] and modification sites [24], (ii) predicting genomic features such as non-coding RNAs [47] and enhancers [46], (iii) predicting genomic signals such as gene expression [36] and DNA methylation [2] and (iv) predicting regulatory variants [42, 51]. The advantages of CNN lie in its ability to capture the local correlations of features, enhance local connectivity across different levels and reduce the number of parameters via convolutional operations, which improves the prediction performance. In these studies, CNN achieves overall better prediction than other methods, and the sample size of the training set usually ranges from thousands [47] to millions [51]. However, the prediction performance of CNN is rarely explored when the training set contains as few as hundreds of samples. Moreover, a recent CNN-based deep learning method, Ph-CNN, has been proposed to perform binary classification with consideration of the phylogenetic tree [12]. However, the CNN architecture of Ph-CNN is not optimized. In addition, its utilization of the phylogenetic tree, which relies on finding neighbors of each OTU via multidimensional scaling projection, is computationally intensive and limits the scalability of Ph-CNN to high-dimensional microbiome data. Importantly, Ph-CNN is designed for binary outcomes only. Considering these limitations, a computationally efficient deep learning-based prediction method that can utilize the phylogenetic tree to predict both continuous and binary outcomes is in demand.

In this work, we develop MDeep, a novel microbiome-based deep learning method, for predicting continuous and binary outcomes by utilizing both the taxon abundance and the phylogenetic tree. Based on an evolutionary model, MDeep is designed to model the taxonomic rank by the convolutional layer and capture the phylogenetic correlation at each taxonomic rank via the convolutional operation. The main contributions of MDeep can be summarized as (i) predicting both continuous and binary outcomes; (ii) utilizing the phylogenetic information besides taxon abundance to improve prediction accuracy; (iii) obtaining overall better prediction than other methods, especially CNN without the phylogenetic tree, in a comprehensive simulation study that considers signal density, cluster size and informativeness of phylogeny; and (iv) outperforming existing methods, including CNN and Ph-CNN, in real datasets. To the best of our knowledge, MDeep is the first deep learning approach that can efficiently utilize the phylogenetic tree in predicting both continuous and binary outcomes. Importantly, this is the first study to systematically explore the prediction performance of deep learning approaches in a comprehensive simulation study, which informs the favorable scenario for deep-learning-based approaches, that is, dense and large-clustered signals. We believe MDeep will be a valuable addition to the microbiome research community.

Methods

Phylogeny-induced correlation structure

Before describing the deep learning predictive model, we introduce a phylogeny-induced correlation structure |$C_{p \times p}$| among taxa based on an evolutionary model [27], which is defined as
(1) |$C_{ij} = e^{-2 \rho D_{ij}}$|
where |$D$| denotes the pairwise patristic distance between taxa (e.g. the length of the shortest path between two taxa in the tree) that could be estimated by the function cophenetic in the R package ape that takes a phylogenetic tree as the input. |$\rho \in (0, \infty )$| characterizes the evolutionary rate: fast evolution corresponds to a large |$\rho$| (a small |$C$|⁠). Alternatively, |$\rho$| can be interpreted as a parameter that controls the phylogenetic depth at which the taxa are grouped: a larger cluster at a lower phylogenetic depth indicates a larger |$\rho$| (a smaller |$C$|⁠). In other words, a large |$\rho$| represents a small phylogenetic correlation among taxa and large clusters, making the tree less informative. Because a phylogenetic depth corresponds to a taxonomic rank, |$\rho$| has a similar effect as taxonomic grouping conceptually, where taxa at different taxonomic ranks are grouped together according to their taxonomy. To avoid additional computational cost for tuning |$\rho$|⁠, we fix |$\rho$| as 2.
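
As an illustration, the correlation matrix can be computed directly from a patristic distance matrix (for example, one exported from the cophenetic function of the R package ape). The sketch below is a minimal Python version assuming the exponential form of equation (1); it is not the released implementation.

```python
import numpy as np

def phylo_correlation(D, rho=2.0):
    """Phylogeny-induced correlation from a pairwise patristic distance matrix D.

    D   : (p, p) array of patristic distances between taxa
    rho : evolutionary-rate parameter; a larger rho gives a weaker correlation
    """
    return np.exp(-2.0 * rho * D)

# Toy example with a 3-taxon distance matrix (values are illustrative only)
D = np.array([[0.0, 0.2, 0.9],
              [0.2, 0.0, 0.8],
              [0.9, 0.8, 0.0]])
C = phylo_correlation(D, rho=2.0)   # the paper fixes rho = 2
```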

Microbiome-based deep learning architecture

Overview of MDeep

MDeep is essentially a phylogeny-regularized CNN, composed of multiple convolutional layers followed by fully connected layers. The taxa are clustered based on |$C$| before the taxon abundance is passed into the network. Following the input layer, the convolutional layers are designed to include as much of the phylogenetic correlation across different phylogenetic depths as possible (Figure 1). Since the phylogenetically correlated taxa are grouped, the convolutional operation captures phylogenetic correlation in the local receptive field more efficiently, thus encouraging local smoothing. Notably, convolutional layers not only offer a solution to high-dimensional input variables by reducing the number of parameters but also allow encoding the phylogenetic information in local receptive fields, which encourages spatially local input patterns. Additionally, convolutional layers encourage local connectivity by making hidden nodes receive input from only a restricted subarea of the previous convolutional layer. In contrast to convolutional layers, fully connected layers exploit the nonlinear and high-order interactions among input features globally, further improving the feature representation. In sum, MDeep maximizes the feature representation of microbiome data to potentially improve the prediction performance.

Figure 1

(A) The conceptual analogy between MDeep and the taxonomic levels of the phylogenetic tree. OTUs at the species level are clustered based on the evolutionary model. This clustering step enables the convolutional operation to capture OTUs that are highly correlated in the phylogenetic tree. The number of hidden nodes decreases as the convolutional layers move forward, reflecting the taxonomic grouping. (B) Input layer, convolutional layers, fully connected layers and output layer of MDeep. There are three convolutional layers and three fully connected layers in MDeep. The output layer, connected to the last fully connected layer, contains a single node for a continuous outcome and two nodes for a binary outcome.

Encoding phylogenetic information by convolutional operation

To incorporate the phylogenetic information via the convolutional operation more efficiently, MDeep first clusters taxa based on |$C$|, which places phylogenetically correlated taxa close to each other. MDeep then adopts multiple 1D convolutional filters on each convolutional layer to capture the phylogenetic correlation among taxa. Specifically, we let a training batch consist of |$n$| labeled samples |$(\textbf{x},y)$|, where |$\textbf{x}$| is a taxon abundance vector of size |$p$|, |$y \in \{0,1\}$| for binary classification and |$y \in (-\infty ,+\infty )$| for regression. Suppose there are |$n_{f}$| filters of size |$l$|, and each filter performs sliding-window operations with a stride |$s$|, moving consecutively from position 1 to |$\lceil \frac{p}{s}\rceil$| across the |$p$| taxa and resulting in a two-dimensional feature map |$\textbf{M}$| (|$\lceil \frac{p}{s}\rceil \times n_{f}$|). Specifically, |$\textbf{M}$| can be derived from |$\textbf{x}$| and |$\textbf{W}$| as
(2) |$\textbf{M} = f(\textbf{x} \ast \textbf{W} + \textbf{b})$|
(3) |$m_{ij} = f\Big(\sum_{k=1}^{l} w_{kj}\, x_{(i-1)s+k} + b_{j}\Big)$|
|$m_{ij}$| is the entry of the feature map |$\textbf{M}$| for the |$i$|-th window position (|$i \in \{1,\ldots ,\lceil \frac{p}{s}\rceil\}$|) and the |$j$|-th filter (|$j \in \{1,\ldots ,n_{f}\}$|). The two-dimensional |$\textbf{W}$| (|$l \times n_{f}$|) is a matrix of weights for the |$n_{f}$| filters, with each column corresponding to the weights of an individual filter. |$b_{j}$| is a bias term specific to filter |$j$|, and |$f$| is a nonlinear activation function such as the hyperbolic tangent.
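
A minimal NumPy sketch of the convolutional operation in equations (2)-(3); zero-padding of the last window is an assumption made here so that every receptive field is complete.

```python
import numpy as np

def conv1d_feature_map(x, W, b, stride=4, f=np.tanh):
    """One convolutional layer as in equations (2)-(3).

    x : (p,) taxon abundance vector, pre-ordered by phylogenetic correlation
    W : (l, n_f) filter weights, one column per filter
    b : (n_f,) per-filter bias terms
    Returns the feature map M of shape (ceil(p / stride), n_f).
    """
    l, n_f = W.shape
    p = x.shape[0]
    n_out = int(np.ceil(p / stride))
    x_pad = np.pad(x, (0, n_out * stride + l - p))    # pad so every window is full
    M = np.empty((n_out, n_f))
    for i in range(n_out):
        window = x_pad[i * stride : i * stride + l]   # local receptive field
        M[i] = f(window @ W + b)                      # m_ij = f(sum_k w_kj x_k + b_j)
    return M

# toy example: 20 taxa, 3 filters of width 8, stride 4 -> feature map of shape (5, 3)
rng = np.random.default_rng(0)
M = conv1d_feature_map(rng.random(20), rng.normal(size=(8, 3)), np.zeros(3))
```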

Input layer, convolutional layers, fully connected layers and output layer

The conceptual analogy between MDeep and taxonomy is shown in Figure 1A, where each convolutional layer mimics a taxonomic rank: the input layer represents the species level, with each node corresponding to a species, followed by multiple convolutional layers moving from the representation of a lower phylogenetic/taxonomic level (e.g. genus) to a higher one (e.g. order). Multiple convolutional filters operate on each convolutional layer to encourage local smoothing, and feature maps between adjacent convolutional layers are restricted by local connectivity, which resembles the scenario in which taxa that are nearby at a lower phylogenetic/taxonomic level are likely to remain close at a higher level. In addition, the dimension of the feature map decreases through the series of convolutional operations across multiple convolutional layers. This decreasing dimension mimics the structure of the phylogenetic tree, where the number of taxa decreases as the taxonomic rank increases, corresponding to phylogenetic depth from high to low (Figure 1B).

The feature map of the last convolutional layer is flattened as the input for the feed-forward neural network (NN) consisting of several fully connected layers. The dimension of the fully connected layers also decreases as the network moves forward in order to capture the high-order and nonlinear interactions among the features.

The output layer, connected to the last fully connected layer, contains a single node for a continuous outcome and two nodes for a binary outcome. For a binary outcome, the output is further passed through a sigmoid function to produce the prediction probability for |$y=1$|, defined as |$p(y=1|\textbf{x})$|. The prediction probability for |$y=0$| is then obtained as |$p(y=0|\textbf{x})=1-p(y=1|\textbf{x})$|.

Parameter estimation

Prediction mean squared error (PMSE), defined as |$l_{c}=\frac{1}{n}\sum _{i=1}^{n} (y_{i}-\hat{y}_{i})^{2}$|, is used as the cost function for regression, whereas cross-entropy, defined as |$l_{c}=-\frac{1}{n}\sum _{i=1}^{n} \big(y_{i} \log p(y_{i}=1|\textbf{x}_{i})+(1-y_{i}) \log p(y_{i}=0|\textbf{x}_{i})\big)$|, is used for binary classification. The objective function is |$f_{o}=\min_{\textbf{W}}\; l_{c}+\lambda \sum \|\textbf{W}\|^{2}$|, where |$\lambda$| is the tuning parameter for the |$l_{2}$| penalty on |$\textbf{W}$|. Minimizing |$f_{o}$| reduces the training error and updates the network parameters via a back-propagation algorithm. Specifically, MDeep adopts adaptive moment estimation (Adam) [20] as the optimizer during back-propagation, since it improves on stochastic gradient descent by computing an adaptive learning rate for each weight.
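
For reference, a small NumPy sketch of the two cost functions and the L2-penalized objective; the weight list and λ below are placeholders, not the values used by MDeep.

```python
import numpy as np

def pmse(y, y_hat):
    """Prediction mean squared error for regression."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    return np.mean((y - y_hat) ** 2)

def cross_entropy(y, p1, eps=1e-12):
    """Binary cross-entropy; p1 is the predicted probability of y = 1."""
    y, p1 = np.asarray(y), np.clip(np.asarray(p1), eps, 1 - eps)
    return -np.mean(y * np.log(p1) + (1 - y) * np.log(1 - p1))

def objective(data_loss, weights, lam=1e-4):
    """Training objective: data loss plus an L2 penalty over all weight matrices."""
    return data_loss + lam * sum(np.sum(W ** 2) for W in weights)
```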

Model selection

An appropriate network architecture is essential to model fitting: deep layers with many parameters cause overfitting, while shallow layers with few parameters result in underfitting. Considering that the sample size of a typical microbiome study is relatively small compared with a traditional CNN prediction task, it is feasible to manually select the optimal number of convolutional layers and fully connected layers. Specifically, we start with a single convolutional layer and incrementally increase the number of convolutional layers, systematically exploring the number of hidden nodes and dropout rates, to find a proper architecture. We stop adding convolutional layers when increasing the number of layers no longer improves prediction performance. We use the same procedure to select the number of fully connected layers. The network architecture ends up with three convolutional layers followed by three fully connected layers.

Approaches to avoid overfitting

To improve generalization and prevent overfitting, we leverage the dropout technique and |$L_{2}$| regularization of |$\textbf{W}$| in the cost function. Each fully connected layer is followed by a dropout layer to avoid overfitting [38]; specifically, 50% of the hidden neurons in the fully connected layer are randomly dropped out.

Summary of MDeep

Altogether, MDeep is a NN, |$f(\textbf{x})=net^{3}(conv^{3}(\textbf{x}))$|, which consists of one input layer and three convolutional layers, connected with three fully connected layers and one output layer (Figure 1B). Each convolutional layer or fully connected layer is activated by the hyperbolic tangent function. There are 64 kernels in each convolutional layer, with a kernel size of 8 and a stride of 4. We train MDeep for multiple epochs. In each epoch, the whole training set is split into multiple batches and each batch passes forward and backward through the network. Here, we set the batch size to 16 and the number of epochs to 2000. When the objective function converges after multiple epochs, the estimated parameters are stabilized and finalized. Given a testing sample, the trained MDeep computes the prediction probability for binary classification and the predicted value for regression.
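
For concreteness, the sketch below builds a comparable network in TensorFlow/Keras. It follows the hyperparameters quoted above (three Conv1D layers with 64 filters of size 8 and stride 4, three fully connected layers with hyperbolic tangent activations, 50% dropout, an L2 weight penalty, Adam optimizer and batch size 16), but the widths of the dense layers, the λ value, the padding mode and the single sigmoid output node are illustrative assumptions rather than the released MDeep implementation.

```python
import tensorflow as tf

def build_mdeep_like(p, binary=True, lam=1e-4):
    """Sketch of an MDeep-style network for p phylogenetically ordered taxa."""
    reg = tf.keras.regularizers.l2(lam)
    inputs = tf.keras.Input(shape=(p, 1))
    x = inputs
    for _ in range(3):                                  # three convolutional layers
        x = tf.keras.layers.Conv1D(filters=64, kernel_size=8, strides=4,
                                   padding="same", activation="tanh",
                                   kernel_regularizer=reg)(x)
    x = tf.keras.layers.Flatten()(x)
    for units in (64, 32, 8):                           # three fully connected layers (widths assumed)
        x = tf.keras.layers.Dense(units, activation="tanh", kernel_regularizer=reg)(x)
        x = tf.keras.layers.Dropout(0.5)(x)             # 50% dropout after each dense layer
    if binary:
        # single sigmoid node, a simplification of the two-node output described above
        outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)
        loss = "binary_crossentropy"
    else:
        outputs = tf.keras.layers.Dense(1)(x)           # single node for a continuous outcome
        loss = "mse"
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss=loss)
    return model

# usage: model = build_mdeep_like(p=1087)
#        model.fit(X_train[..., None], y_train, batch_size=16, epochs=2000)
```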

Performance evaluation

We use |$R^{2}$| and PMSE as the evaluation metrics for regression, and |$R^{2}$| and AUC (Area Under the Curve) along with Sensitivity, Specificity, Accuracy, Precision, F1 score and Matthews correlation coefficient (MCC) for binary classification. These metrics are defined as

|$\mathrm{Sensitivity}=\frac{TP}{TP+FN}$|, |$\mathrm{Specificity}=\frac{TN}{TN+FP}$|, |$\mathrm{Accuracy}=\frac{TP+TN}{TP+TN+FP+FN}$|, |$\mathrm{Precision}=\frac{TP}{TP+FP}$|, |$\mathrm{F1}=\frac{2\times \mathrm{Precision}\times \mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}$| and |$\mathrm{MCC}=\frac{TP\times TN-FP\times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$|.

In the above formulas, TP and TN are the numbers of positive and negative samples that are correctly classified, and FN and FP are the numbers of positive and negative samples that are misclassified. Sensitivity or Recall (|$\in [0,1]$|) is the proportion of true positives that are correctly predicted among all true positives. Specificity (|$\in [0,1]$|) is the proportion of true negatives that are correctly predicted among all true negatives. Precision (|$\in [0,1]$|) is the proportion of true positives among all predicted positives. Accuracy is the proportion of correctly predicted samples (TP and TN) among all samples. F1 (|$\in [0,1]$|) considers both Precision and Recall. MCC (|$\in [-1,1]$|) indicates the correlation between predicted labels and true labels: +1 represents a perfect prediction, 0 a random guess and −1 total disagreement between prediction and truth.
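
These quantities can be computed directly from the predicted labels and probabilities; a short sketch using scikit-learn (an assumption of this example, not a dependency stated by the paper):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, matthews_corrcoef,
                             precision_score, recall_score, roc_auc_score)

def binary_metrics(y_true, y_prob, threshold=0.5):
    """Standard classification metrics from true labels and predicted probabilities."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    return {
        "Sensitivity": recall_score(y_true, y_pred),
        "Specificity": recall_score(y_true, y_pred, pos_label=0),
        "Accuracy": accuracy_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred),
        "F1": f1_score(y_true, y_pred),
        "MCC": matthews_corrcoef(y_true, y_pred),
        "AUC": roc_auc_score(y_true, y_prob),
    }
```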

Simulation studies

Simulation strategy

We conduct comprehensive simulation studies to evaluate the prediction performance of MDeep along with other competing methods for both regression and binary classification. We synthetically generate |$200$| independent samples for the training set and another |$200$| independent samples for the testing set. The two classes are restricted to be balanced in both training and testing sets for binary classification. Assuming that OTU abundance follows a Dirichlet-multinomial (DM) distribution, we estimate parameters, including the dispersion and mean proportions, from a real human upper respiratory tract microbiome dataset [7] consisting of 778 OTUs from 60 samples. Using these estimated parameters, we generate the read counts parametrically from the DM, with the total read count drawn from a negative binomial distribution with a mean of 5000 and a dispersion of 25. The OTU abundance is further transformed into relative abundance by dividing by the total read count. We then generate the outcome based on the abundance of outcome-associated OTUs (‘aOTUs’) that form different numbers of outcome-associated OTU clusters (‘aClusters’) with different cluster sizes. We further evaluate the effect of the number of clusters and the cluster size on the prediction performance.
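
A minimal sketch of this generative scheme, under the common Dirichlet-multinomial parameterization in which the Dirichlet parameters are α = π(1−θ)/θ for mean proportions π and dispersion θ (the parameterization and the toy composition below are assumptions for illustration; the paper estimates π and θ from the real dataset):

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_otu_table(n, pi, theta, mean_depth=5000, nb_size=25):
    """Generate an (n, p) OTU count table from a Dirichlet-multinomial model."""
    p = len(pi)
    alpha = pi * (1 - theta) / theta                    # Dirichlet parameters
    # total read count per sample: negative binomial with mean 5000, dispersion 25
    depth = rng.negative_binomial(nb_size, nb_size / (nb_size + mean_depth), size=n)
    counts = np.empty((n, p), dtype=int)
    for i in range(n):
        probs = rng.dirichlet(alpha)                    # sample-specific composition
        counts[i] = rng.multinomial(depth[i], probs)
    totals = np.maximum(counts.sum(axis=1, keepdims=True), 1)
    return counts, counts / totals                      # counts and relative abundance

# toy usage: 200 samples over 10 OTUs with a uniform mean composition
counts, X = simulate_otu_table(n=200, pi=np.full(10, 0.1), theta=0.02)
```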

Constructing outcome-associated OTU clusters

For an informative tree, aOTUs are more likely to be clustered and to have a similar magnitude and the same direction of effect on the outcome. In other words, aOTUs form outcome-associated clusters (‘aClusters’) that have an impact on the outcome. Since the density of aClusters and the size of aClusters may also vary, we systematically investigate how cluster size and the density of aClusters affect the prediction performance of MDeep and the other methods in the simulation study. In addition, we evaluate the effect of a non-informative tree on the prediction performance of MDeep, where clustered aOTUs have effects in opposite directions on the outcome. Specifically, the parameters are set as follows.

  • Cluster size: aOTUs are clustered at different phylogenetic depths, resulting in different sizes of aClusters. We cluster OTUs into 50, 20 and 10 clusters based on |$C$|⁠, representing small, medium and large aClusters.

  • Signal density: proportion of aClusters among all clusters, which is chosen from 10%, 20% and 40% to represent low, medium and high signal density.

  • Informativeness of phylogeny: for an informative tree (Scenario 1 or S1), we let aOTUs in each aCluster have the same effect in the same direction on the outcome. For a non-informative tree (Scenario 2 or S2), we let adjacent aOTUs in an aCluster have opposite effects on the outcome, which violates the assumption that closely related aOTUs have similar biological functions and thus similar effects on the outcome (Figure 2).

Figure 2

Illustration of the simulation strategy. S1 (informative phylogeny): all OTUs in C1 or C3 have the same effect size and the same effect direction on the outcome. S2 (non-informative phylogeny): adjacent OTUs in C1 or C3 have opposite effects on the outcome. C1 and C3 are two aClusters; C2 is a non-aCluster. Red circles represent aOTUs with positive effects on the outcome, while blue circles represent aOTUs with negative effects on the outcome.

Generating outcomes based on aClusters

We denote |$\mathcal{A}_{l}$| as the set containing the indices of the |$l$|th aCluster among |$m$| aClusters (|$l \in \{1,\ldots ,m\}$|), |$x_{ij}$| as the abundance of the |$j$|th OTU in sample |$i$| and |$\eta _i$| as the expected outcome value of sample |$i$|, which is generated based on the following linear relationship:
(4) |$\eta_{i} = \sum_{l=1}^{m} \beta_{l} \sum_{j \in \mathcal{A}_{l}} x_{ij}$|
(5) |$\beta_{l} \sim N(0, \sigma^{2}_{\beta})$|
Notably, |$\beta _{l}$| is sampled from a centered normal distribution and thus the effect of an aCluster can be either positive or negative.
We add a random error drawn from another centered normal distribution to obtain the continuous outcome |$y_{i}$| for the |$i$|th sample,
(6) |$y_{i} = \eta_{i} + \epsilon_{i}, \quad \epsilon_{i} \sim N(0, \sigma^{2}_{\epsilon}).$|
We apply the inverse logit transformation to |$\eta _{i}$| to obtain the probability |$\pi _{i}$|, based on which the binary outcome |$y_{i}$| for the |$i$|th sample is generated from a Bernoulli distribution,
(7) |$\pi_{i} = \frac{e^{\eta_{i}}}{1+e^{\eta_{i}}}, \quad y_{i} \sim \mathrm{Bernoulli}(\pi_{i}).$|

We set |$\sigma ^{2}_{\beta }$| to 2 for the continuous outcome and to 4 for the binary outcome. For the continuous outcome, we adjust |$\sigma ^2_{\beta }$| and |$\sigma ^{2}_{\epsilon }$| jointly to control the signal-to-noise ratio so that the percentage of variability explained by the OTUs is fixed.
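
A sketch of the outcome generation implied by equations (4)-(7); the cluster index lists and variance values are placeholders for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

def generate_outcomes(X, clusters, a_idx, sigma_beta, sigma_eps=1.0, binary=False):
    """Generate outcomes from outcome-associated OTU clusters.

    X        : (n, p) relative abundance matrix
    clusters : list of index arrays, one per OTU cluster
    a_idx    : indices of the clusters chosen as aClusters
    """
    n = X.shape[0]
    eta = np.zeros(n)
    for l in a_idx:
        beta_l = rng.normal(0.0, np.sqrt(sigma_beta))         # cluster-level effect, eq. (5)
        eta += beta_l * X[:, clusters[l]].sum(axis=1)         # linear predictor, eq. (4)
    if binary:
        pi = 1.0 / (1.0 + np.exp(-eta))                       # inverse logit, eq. (7)
        return rng.binomial(1, pi)
    return eta + rng.normal(0.0, np.sqrt(sigma_eps), size=n)  # add noise, eq. (6)
```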

Competing methods, model selection and evaluation

We compare MDeep with sparse regression models, including lasso and Enet, and machine learning models such as RF, the feed-forward NN and CNN. NN, CNN and MDeep are implemented in TensorFlow, an open-source machine learning library developed by Google. The details of selecting the number of convolutional layers, fully connected layers and other parameters of MDeep are described in the Methods section. The sparse regression models lasso and Enet are implemented in the glmnet R package. RF is implemented in the randomForest R package with the default parameter setting.

In particular, NN, designed with the same number of fully connected layers as in MDeep, is used as a baseline comparison to MDeep to demonstrate the advantages of the convolutional operation in capturing local correlations of OTUs at different phylogenetic depths while reducing possible overfitting. Since a typical CNN does not automatically exploit the phylogenetic information, we also include a typical CNN with the same architecture as MDeep as a comparison to demonstrate the importance of utilizing the phylogenetic information. Specifically, we randomly shuffle OTUs in the input layer of CNN instead of clustering phylogenetically correlated OTUs as in MDeep. In this way, each convolutional operation will likely include both aOTUs and non-aOTUs, making the convolutional operation less effective.

The optimal tuning parameters of lasso and Enet are selected based on 5-fold cross-validation (5-CV). In 5-CV, the whole training set is divided into five folds, where four folds are used to train the model and one fold is used to obtain the metric for evaluating the prediction performance of the trained model. Specifically, we use PMSE as the metric for regression and AUC for binary classification. The optimal tuning parameter is chosen based on the average metric in 5-CV, and a final model is fitted using the whole training set with the optimal tuning parameter. To evaluate an independent testing set, we use |$R^{2}$| and PMSE for regression, and |$R^{2}$| along with Sensitivity, Specificity, Accuracy, Precision, F1 score and MCC for binary classification. In each simulation setting, 50 training and testing sets are generated and the average metric is reported.

Additional data transformation and data generation strategy

We evaluate the robustness of MDeep by using different data transformation and data generation strategies. We apply the centered log-ratio (CLR) transformation to OTU counts generated from the Dirichlet-multinomial distribution to evaluate the impact of data transformation on compositional data [14]. Additionally, we generate OTU counts using a negative binomial distribution that takes microbial correlations into account [16]. First, microbial correlations are estimated from a real dataset using MAGMA [10]. Second, correlated multivariate normal data are generated with mean 0 and correlations equal to the estimated microbial correlations. Third, the correlated multivariate normal data are transformed to the copula space using the cumulative distribution function (CDF) of the standard normal distribution. Fourth, correlated OTU counts following a negative binomial distribution are generated by applying the inverse CDF of the negative binomial distribution to the values obtained in the copula space. Last, CLR is applied to the correlated OTU counts.
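
A condensed sketch of these two steps (CLR with a pseudocount, and the Gaussian-copula construction of correlated negative binomial counts); the pseudocount and the per-OTU NB parameters are illustrative assumptions, and the correlation matrix is assumed to be pre-estimated (e.g. by MAGMA):

```python
import numpy as np
from scipy.stats import nbinom, norm

rng = np.random.default_rng(3)

def clr(counts, pseudo=0.5):
    """Centered log-ratio transform of an (n, p) count matrix (pseudocount assumed)."""
    logx = np.log(counts + pseudo)
    return logx - logx.mean(axis=1, keepdims=True)

def correlated_nb_counts(n, corr, mu, size):
    """Correlated negative binomial counts via a Gaussian copula.

    corr : (p, p) estimated microbial correlation matrix
    mu   : (p,) per-OTU NB means; size : (p,) per-OTU NB dispersions
    """
    p = corr.shape[0]
    z = rng.multivariate_normal(np.zeros(p), corr, size=n)   # correlated normals
    u = norm.cdf(z)                                          # map to the copula space
    prob = size / (size + mu)                                # NB success probabilities
    return nbinom.ppf(u, size, prob).astype(int)             # inverse NB CDF per OTU

# usage sketch:
# counts = correlated_nb_counts(200, corr_hat, mu_hat, size_hat)
# X = clr(counts)
```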

Results of prediction for continuous outcome

Overall comparison: when the phylogeny is informative, regardless of signal density and cluster size, MDeep outperforms the other methods overall, obtaining a higher |$R^{2}$| (Figure 3A) and lower PMSE (Fig S1A) across different scenarios and signal densities. Particularly, when the cluster size is large or the signal density is high, MDeep has an enormous advantage over the other methods due to its ability to exploit the phylogenetic structure. The improved prediction by MDeep confirms the benefits of encoding the phylogenetic information via the convolutional operation. In particular, NN does not perform well, either due to overfitting or due to its inability to utilize the phylogenetic information. MDeep has a significant performance gain over a typical CNN that does not consider the phylogenetic information, indicating the importance of including phylogeny in the predictive modeling. Moreover, it is expected and observed that the other methods achieve similar performance between S1 and S2, as they are not able to utilize phylogenetic information. In contrast, the prediction performance of MDeep deteriorates in S2 compared with S1. Nevertheless, MDeep still outperforms the other methods in general and has a clear advantage even when the tree is non-informative. These observations indicate that utilizing phylogenetic information improves MDeep’s prediction performance, while it has little effect on methods that do not consider the tree.

Figure 3

Prediction performance measured by |$R^{2}$| for (A) continuous outcome and (B) binary outcome in the simulation study. OTU counts are generated by the Dirichlet-multinomial distribution, followed by relative abundance transformation. Scenario 1 (S1) represents informative phylogeny and Scenario 2 (S2) represents non-informative phylogeny. Cluster-S, -M and -L represent small, medium and large clusters. Signal-L, -M and -H represent low, medium and high signal density, respectively. The whisker shows the standard deviation of |$R^{2}$| for each method.

Signal density: for each cluster size, we observe an overall decrease in |$R^2$| as the signal density increases in both S1 and S2, which can be explained by the decreasing individual effects that result from an increasing number of aOTUs while the percentage of variability explained by aOTUs is fixed. This reduction in individual effects is unfavorable for both sparse regression methods and tree-based methods. The decreasing |$R^{2}$| of the sparse regression methods is attributed to the weaker individual effects of more aOTUs, a scenario in which sparse regression methods tend to have low sensitivity and specificity in identifying aOTUs. As a result, the identified OTUs cannot explain a sufficient percentage of the variability of the outcome, leading to decreasing prediction performance. The deteriorating performance of RF can be explained by an insufficient tree depth to accommodate the expansion of aOTUs as the signal density increases, resulting in potential underfitting. However, the increase in signal density imposes little adverse effect on NN, CNN or MDeep, which indicates that the NN architecture is robust across different levels of signal density. This robustness may be attributable to two reasons. First, the neural architecture does not assume sparsity in the model but utilizes the information of all aOTUs via feature mapping across multiple layers. Second, the dropout technique employed in the fully connected layers and the regularization of the weights can further reduce the potential risk of overfitting when the signal is sparse. In all, MDeep achieves overall better prediction than the other methods across various levels of signal density. To assess the significance of the improvement of MDeep, we perform a paired Wilcoxon test between the |$R^{2}$| of MDeep and the |$R^{2}$| of each other method in each scenario. We find that MDeep achieves an overall statistically significant improvement over every other method in each scenario (|$P<0.05$|), except in the ‘Cluster-S, Signal-L’ scenario where MDeep, lasso and Enet have comparable performance.

Cluster size: it is noteworthy that reducing the cluster size decreases the phylogenetic information. We thus observe that the improvement of MDeep over the other methods diminishes as the cluster size becomes small. When the signal density is low and the cluster size is small (‘Cluster-S, Signal-L’), the prediction performance of MDeep is on par with the sparse regression models, and the improvement of MDeep over the other methods increases as the cluster size increases. The trend is similar when the signal density is medium or high. Clearly, MDeep benefits more from a large aCluster because the convolutional operation will mostly capture aOTUs from a large aCluster, whereas it may include irrelevant OTUs when the aCluster is small.

Variability of |$\beta$| within an aCluster: because the effect sizes of aOTUs may vary within each aCluster in reality, we perform an additional simulation study to investigate whether MDeep is robust when the |$\beta$| of aOTUs varies but remains in the same direction within each aCluster. Without loss of generality, we choose the scenario with a large cluster and high signal density (Cluster-L and Signal-H) for predicting the continuous outcome. To add variability to the cluster-level |$\beta$| while keeping the |$\beta$| of all aOTUs in the same direction, we sample the |$\beta$| of all aOTUs in each aCluster from a normal distribution with the cluster-level |$\beta$| as the mean and 0.1 as the standard deviation. As expected, the prediction performance remains similar whether the aOTUs in each aCluster share the same |$\beta$| or have different |$\beta$| values in the same direction (Fig S2).

Results of prediction for binary outcome

We repeat the same simulations for binary classification. Compared with regression, we find that the overall trends for MDeep are similar in each setting. However, there are some changes for the other methods. First, the performance of NN improves and becomes superior to RF (Figure 3B). Second, MDeep benefits more from exploiting the phylogenetic information, as demonstrated by a larger improvement of S1 over S2 in binary classification than in regression. In addition to |$R^2$|, we also include other metrics (Fig S1B, Table 1, S1-S8). Based on all evaluation metrics, we conclude that MDeep is consistently superior to the other methods when the cluster size is relatively large or the signal is relatively dense, and comparable with the other methods in other scenarios. Without loss of generality, we use |$R^{2}$| as the metric to evaluate the significance of the improvement of MDeep. We find that MDeep achieves an overall statistically significant improvement over every other method in each scenario by the paired Wilcoxon test (|$P<0.05$|), except in the ‘Cluster-S, Signal-L’ scenario where MDeep, lasso and Enet are comparable.

Table 1

Prediction performance in binary classification (Cluster-L, Signal-L)

Metric        Scenario   NN       RF       CNN      Lasso    Enet     MDeep
Sensitivity   S1         0.6620   0.6534   0.6932   0.7208   0.7290   0.8188
              S2         0.6794   0.6730   0.6932   0.7480   0.7460   0.7904
Specificity   S1         0.6692   0.6586   0.6962   0.7346   0.7350   0.8254
              S2         0.6582   0.6508   0.6958   0.7432   0.7442   0.7720
Accuracy      S1         0.6656   0.6560   0.6947   0.7277   0.7320   0.8221
              S2         0.6688   0.6619   0.6945   0.7456   0.7451   0.7812
Precision     S1         0.6709   0.6594   0.6976   0.7387   0.7406   0.8271
              S2         0.6702   0.6593   0.6951   0.7437   0.7455   0.7777
MCC           S1         0.3354   0.3139   0.3914   0.4588   0.4665   0.6468
              S2         0.3420   0.3250   0.3909   0.4935   0.4921   0.5644
F1 score      S1         0.6622   0.6544   0.6933   0.7251   0.7310   0.8210
              S2         0.6712   0.6649   0.6923   0.7429   0.7443   0.7827

The top-performing method for each metric is in bold.


Results of additional data transformation and generation

When CLR is used as the data transformation, the simulation results are presented in Fig S3A, B. Though CLR does not necessarily improve the prediction performance compared with relative OTU abundance (Figure 3), the overall trend still holds: MDeep has a clear advantage over the other methods when the tree is informative and is comparable with the other methods when the tree is non-informative. When the negative binomial distribution with microbial correlations is used to generate the data, the simulation results show trends similar to those obtained using the Dirichlet-multinomial distribution (Fig S3C, D).

Application of MDeep in real datasets

We apply MDeep along with the other aforementioned methods to three real datasets. The first dataset is acquired from a study that explores how the gut microbiome varies across age and geography [48]. The second dataset is obtained from a longitudinal comparative study that investigates the fecal microbiome of monozygotic (MZ) and dizygotic (DZ) twin pairs born in Malawi who became discordant for kwashiorkor [37]. The third dataset is taken from a study that investigates the role of intestinal bacteria in rheumatoid arthritis [33]. To incorporate the phylogenetic information via the convolutional operation, we apply the hierarchical agglomerative clustering (HAC) algorithm to the phylogeny-induced correlation structure |$C$|, which is inferred from the phylogenetic tree. In this way, phylogenetically related OTUs are placed close to each other and unrelated OTUs far apart. HAC is an unsupervised clustering algorithm that groups OTUs in a ‘bottom-up’ manner: each OTU starts as its own cluster at the bottom, is merged with close OTUs to form new clusters, and moves up to form a hierarchy. Therefore, the number of clusters does not need to be specified and no cluster membership needs to be assigned to OTUs, since they are ordered by the hierarchical structure automatically. In addition, hierarchically clustered OTUs can potentially reflect the taxonomic grouping of OTUs.
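
A small sketch of this ordering step with SciPy (average linkage is an assumption here, as the paper does not state the linkage criterion):

```python
import numpy as np
from scipy.cluster.hierarchy import leaves_list, linkage
from scipy.spatial.distance import squareform

def hac_order(C):
    """Order OTUs by hierarchical agglomerative clustering of the correlation C.

    C : (p, p) phylogeny-induced correlation matrix.
    Returns a permutation that places phylogenetically correlated OTUs next to each other.
    """
    D = 1.0 - C                                   # convert correlation to dissimilarity
    np.fill_diagonal(D, 0.0)
    Z = linkage(squareform(D, checks=False), method="average")
    return leaves_list(Z)

# usage: order = hac_order(C)
#        X_ordered = X[:, order]   # reorder abundance columns before the input layer
```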

Predicting chronological age based on gut microbiome of individuals in the United States

In this study, there are 531 individuals (115 from Malawi, 100 from Venezuela and 316 from the United States). The gut microbiomes of all individuals are profiled using 16S rRNA gene-targeted sequencing, and the datasets are deposited in Qiita [15] under study ID 850. We first download and process the datasets using QIIME, resulting in a total of 14 170 OTUs. The phylogenetic tree is constructed using FastTree [31]. We then use the individuals from the United States for the prediction task, as the sample size is relatively large for the deep learning approach. Subsequently, we carry out a series of data pre-processing steps as described in [45], consisting of (i) removing outlier samples, (ii) removing less informative and noisy OTUs (OTU prevalence < 10%; median non-zero count < 10), (iii) normalizing zero-inflated OTU counts using GMPR [9], (iv) replacing outlier counts using winsorization at the 97% quantile and (v) reducing the influence of highly abundant taxa by a square-root transformation. As a result, we have 308 individuals profiled with 1087 OTUs for model training and testing. The phylogenetic tree is pruned accordingly to 1087 leaves, and the phylogeny-induced correlation structure |$C_{1087 \times 1087}$| is calculated based on the tree.
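
A rough sketch of steps (ii), (iv) and (v) applied to a raw count matrix; GMPR normalization (step (iii), an R package) and outlier-sample removal are omitted, and the thresholds follow the text above:

```python
import numpy as np

def preprocess_otu_table(counts, prevalence_min=0.10, median_nonzero_min=10, q=0.97):
    """Filter noisy OTUs, winsorize outlier counts and apply a square-root transform.

    counts : (n, p) raw OTU count matrix (GMPR normalization not included here).
    """
    prevalence = (counts > 0).mean(axis=0)                       # fraction of samples containing the OTU
    med_nonzero = np.array([np.median(c[c > 0]) if (c > 0).any() else 0
                            for c in counts.T])                  # median of non-zero counts per OTU
    keep = (prevalence >= prevalence_min) & (med_nonzero >= median_nonzero_min)
    x = counts[:, keep].astype(float)
    cap = np.quantile(x, q, axis=0)                              # winsorize at the 97% quantile
    x = np.minimum(x, cap)
    return np.sqrt(x), keep                                      # damp highly abundant taxa
```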

We treat age either as a continuous outcome or as a binary outcome to evaluate the prediction performance for both regression and binary classification. Specifically, for regression, the ages of all individuals are directly treated as continuous outcomes; for binary classification, we classify all individuals into three age groups, namely baby (age|$\leq$|3 years, |$n=54$|), child (3<age<18 years, |$n=125$|) and adult (age|$\geq$|18 years, |$n=129$|) [45]. In particular, we evaluate the prediction performance in classifying two pairs of age groups: ‘baby versus child’ and ‘child versus adult’. We exclude ‘baby versus adult’, an easy case in which all methods have superior and indistinguishable prediction performance due to the drastically distinct microbiome compositions. In contrast, adjacent age periods such as ‘baby versus child’ or ‘child versus adult’ are ideal for identifying differences in prediction performance among the methods.

We compare MDeep with the aforementioned methods. In addition, we include Ph-CNN as a comparison for binary classification, since it is designed only for binary outcomes. Moreover, we compare MDeep with the CNN that does not use the phylogenetic information, obtained by randomly shuffling the OTUs. To evaluate the prediction performance, we randomly split the dataset into two sets, with 80% as the training set and the remaining 20% as the testing set. Tuning parameter selection and model fitting are performed on the training set, and prediction performance is evaluated on the testing set. We repeat the randomization 50 times and report the metrics: |$R^2$| and PMSE for regression; |$R^2$|, sensitivity, specificity, accuracy, precision, F1 score and MCC for binary classification.

For continuous prediction of age, the overall trends in prediction performance of all methods are similar to the simulation study, with MDeep ranking first by achieving the highest |$R^{2}$| and lowest PMSE, followed by Enet. RF and NN do not perform well. Importantly, MDeep outperforms CNN, which indicates that the improved prediction is achieved by utilizing the phylogenetic information (Figure 4A; S4A). In addition, we perform a paired Wilcoxon test between the |$R^{2}$| of MDeep and the |$R^{2}$| of each other method and find that MDeep achieves an overall statistically significant improvement over every other method (|$P<0.05$|).

Figure 4

Prediction performance of chronological age based on the gut microbiome of individuals in the United States: (A) |$R^{2}$| for all ages, (B) |$R^{2}$| for ‘Baby versus Child’, (C) |$R^{2}$| for ‘Child versus Adult’. Prediction performance of gender based on the gut microbiome of Malawian twins: (D) |$R^{2}$| for male and female. The blue dashed line indicates the mean value of |$R^2$| for MDeep. The whisker shows the standard deviation of |$R^{2}$| for each method.

For binary classification of age groups, MDeep obtains the highest |$R^{2}$| in both the ‘baby versus child’ and ‘child versus adult’ comparisons (Figure 4B, C). In addition, other metrics further demonstrate MDeep’s overall superior performance (Fig S4B, C; Table 2). With the exception of Ph-CNN, the NN architectures are consistently robust in the ‘Baby versus Child’ comparison, where MDeep, NN and CNN perform better than the other methods (Figure 4B). In contrast, the relative performance of NN compared with CNN and MDeep deteriorates significantly in the ‘Child versus Adult’ comparison (Figure 4C) because of potential overfitting. Ph-CNN has the overall worst performance in both binary classification tasks, possibly because its NN architecture is suboptimal or the phylogenetic information is not fully exploited. Altogether, MDeep has the best overall prediction performance in predicting age values and classifying age groups. In addition, the accurate prediction of age based on the human gut microbiome further validates the hypothesis that the composition of the human gut microbiome changes with age [28]. Finally, a paired Wilcoxon test between the |$R^{2}$| of MDeep and the |$R^{2}$| of each other method demonstrates that MDeep achieves a statistically significant improvement in prediction over every other method in the ‘child versus adult’ comparison (|$P<0.05$|), and over Ph-CNN, lasso, RF and Enet in the ‘baby versus child’ comparison (|$P<0.05$|). To sum up, based on the results of both regression and binary classification, MDeep has the best overall prediction performance for chronological age.

Table 2

Prediction performance of age groups based on gut microbiome of individuals in the United States

Metric        Type   Lasso    Enet     RF       Ph-CNN   CNN      NN       MDeep
Sensitivity   BC     0.9856   0.9784   0.9856   0.9936   0.9904   0.9920   0.9896
              CA     0.7531   0.7808   0.7669   0.7108   0.8123   0.7977   0.8254
Specificity   BC     0.8018   0.8636   0.8327   0.7545   0.8764   0.8691   0.8891
              CA     0.7736   0.8112   0.7712   0.6896   0.8176   0.7760   0.8160
Accuracy      BC     0.9294   0.9433   0.9389   0.9206   0.9556   0.9544   0.9589
              CA     0.7631   0.7957   0.7690   0.7004   0.8149   0.7871   0.8208
Precision     BC     0.9216   0.9438   0.9329   0.9048   0.9493   0.9469   0.9544
              CA     0.7826   0.8165   0.7803   0.7152   0.8276   0.7917   0.8286
MCC           BC     0.8334   0.8674   0.8568   0.8155   0.8960   0.8933   0.9041
              CA     0.5317   0.5957   0.5405   0.4488   0.6340   0.5774   0.6459
F1 score      BC     0.9516   0.9601   0.9577   0.9463   0.9690   0.9684   0.9712
              CA     0.7634   0.7956   0.7717   0.6932   0.8172   0.7924   0.8242

BC, ‘Baby versus Child’; CA, ‘Child versus Adult’. The top-performing method for each metric is in bold.


Predicting gender based on gut microbiome of human twins

A previous study shows that dysbiosis of the gut microbiota is involved in the development of metabolic syndrome, which has a different incidence between men and women [32]. In light of this, we seek to study whether the composition of the gut microbiome differs between males and females discordant for kwashiorkor. To achieve this, we treat gender as a binary outcome and classify gender based on the gut microbiome using the dataset collected from twin pairs in Malawi. This dataset is also generated with 16S rRNA gene-targeted sequencing and deposited in Qiita under study ID 737. We download and process the dataset, resulting in 4321 OTUs profiling a total of 1041 twins (including MZ and DZ), consisting of 483 females, 512 males and 46 samples with missing gender information. After the aforementioned data pre-processing steps are performed, we have 995 individuals profiled with 2291 OTUs for the binary classification task. Accordingly, the phylogenetic tree is constructed using FastTree [31] and pruned so that the final number of leaves matches the number of remaining OTUs. A phylogeny-induced correlation structure |$C_{2291 \times 2291}$| is calculated based on the tree.

For binary classification of gender, MDeep significantly outperforms every other method by achieving a much higher |$R^{2}$| (|$P<0.05$| via paired Wilcoxon test). The NN architectures are superior to the sparse regression methods and RF, with MDeep, NN and CNN obtaining a higher |$R^{2}$| (Figure 4D). However, Ph-CNN still does not perform well. Overall, MDeep maintains the top performance across the various evaluation metrics (Fig S4D; Table 3). The accurate classification of gender indicates that the composition of the gut microbiome differs between men and women affected by kwashiorkor; in other words, there is a sex-dependent effect on shaping the gut microbiome.

Table 3

Prediction performance for gender based on gut microbiome of Malawian twins

Metric        Lasso    Enet     RF       Ph-CNN   CNN      NN       MDeep
Sensitivity   0.8020   0.8120   0.8504   0.8315   0.8153   0.8274   0.8227
Specificity   0.7352   0.7468   0.7315   0.6919   0.8020   0.8001   0.8263
Accuracy      0.7695   0.7804   0.7915   0.7641   0.8086   0.8136   0.8235
Precision     0.7667   0.7775   0.7743   0.7701   0.8165   0.8173   0.8364
MCC           0.5384   0.5604   0.5864   0.6263   0.6166   0.6271   0.6475
F1 score      0.7827   0.7931   0.8088   0.7901   0.8150   0.8212   0.8284

The top-performing method for each metric is in bold.


Predicting new-onset rheumatoid arthritis based on human microbiome

Besides traits (e.g. age and gender), we also apply MDeep to a disease-related study that investigates the role of intestinal bacteria in rheumatoid arthritis [33]. In this study, there are four groups of samples: new-onset rheumatoid arthritis (NORA), chronic treated rheumatoid arthritis, psoriatic arthritis and healthy controls (HLTs), with sample sizes of 44, 26, 16 and 28, respectively. The prediction task is the binary classification of NORA versus HLT. Since NORA has been found to be associated with few marker taxa (e.g. Prevotella copri), this dataset is helpful for evaluating whether MDeep can maintain robust performance in the unfavorable scenario where only a few marker taxa are associated with the outcome. The dataset is analyzed by MOTHUR [35], resulting in |$997$| OTUs and a phylogenetic tree built by FastTree [31]. After the aforementioned data pre-processing steps are performed, an OTU abundance matrix consisting of |$883$| OTUs and the corresponding phylogeny-induced correlation structure |$C_{883 \times 883}$| are obtained for model training and testing.

For the binary classification of NORA versus HLT, MDeep still achieves the highest |$R^2$| (Figure 5), and its improvement over the other methods is significant (|$P<0.05$| via paired Wilcoxon test). We also find that the NN-based approaches (MDeep, NN, CNN) outperform the sparse regression models (Enet and lasso) and RF. Additionally, MDeep maintains the top overall performance across the various evaluation metrics (Fig S5; Table 4). The accurate classification of NORA versus HLT strengthens the claim that MDeep maintains robust performance even when the signal is sparse, that is, when only a few marker taxa are associated with the outcome.

Table 4

Prediction performance of NORA based on human intestinal microbiome

Metric         Lasso    ENet     RF       Ph-CNN   CNN      NN       MDeep
Sensitivity    0.8682   0.8145   0.8831   0.7556   0.8473   0.8393   0.8407
Specificity    0.4177   0.6466   0.6208   0.5919   0.6468   0.6651   0.7166
Accuracy       0.6686   0.73     0.7629   0.6886   0.7657   0.7686   0.7857
Precision      0.699    0.7726   0.7741   0.7468   0.7907   0.7953   0.8245
MCC            0.2986   0.463    0.5245   0.3541   0.515    0.5193   0.5642
F1 score       0.7542   0.7756   0.8089   0.7377   0.8066   0.8063   0.8188

The top-performing method for each metric is shown in bold.

Figure 5

Prediction performance for NORA based on human intestinal microbiome: |$R^2$| for NORA versus HLT. The blue dashed line shows the mean value of |$R^2$| for MDeep, and the whiskers show the standard deviation of |$R^{2}$| for each method.

Computational performance

Comparing MDeep and Ph-CNN, the two deep learning models that utilize phylogenetic information, we find that MDeep provides superior computational performance because it clusters phylogenetically related OTUs using hierarchical agglomerative clustering (HAC) based only on the evolutionary model. Ph-CNN, on the other hand, relies on a multidimensional scaling projection and is much more computationally intensive. To demonstrate this, we benchmark the computational performance of MDeep and Ph-CNN on a server consisting of dual 8-core Xeon Sandy Bridge E5–2670 processors with 64 GB system memory. We use the processed dataset collected from twin pairs in Malawi, consisting of 995 individuals profiled with 2291 OTUs. Without loss of generality, we record the time of the first 100 epochs in one batch of the training process for classifying gender. We find that Ph-CNN runs slowly, with a running time (in seconds) of nearly 90 times the number of epochs, whereas the running time of MDeep is stable and increases little (Table S9; Fig S6). For example, over the first 100 epochs, Ph-CNN takes 8879 seconds, compared with 57 seconds for MDeep; that is, MDeep is 158 times faster than Ph-CNN. Similarly, MDeep is 8, 42 and 121 times faster than Ph-CNN over the first 1, 10 and 50 epochs, respectively. This observation indicates that the gain in computational efficiency for MDeep over Ph-CNN becomes more evident as the number of epochs increases. Overall, MDeep outperforms Ph-CNN not only in binary classification but also in computational performance. Additionally, we compare MDeep with all other methods in terms of both running time and memory usage when all methods converge in training. Clearly, Ph-CNN costs much more time and consumes much more memory than the others. The running time and memory consumption of MDeep and the NN-based methods (NN, CNN) are comparable but higher than those of the other methods (Fig S6, S7) due to their intrinsically more complex architecture. Overall, MDeep is computationally efficient considering its running time (100s) and memory consumption (1kb).
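To make the HAC-based ordering concrete, here is a minimal sketch using SciPy that clusters OTUs from a phylogeny-induced distance matrix and reorders the abundance matrix accordingly. The random patristic distances and the exponential correlation form exp(-2*rho*d) are illustrative assumptions rather than the exact MDeep implementation.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)

# Hypothetical symmetric patristic (tree) distance matrix for 8 OTUs
d = rng.random((8, 8))
patristic_dist = (d + d.T) / 2
np.fill_diagonal(patristic_dist, 0.0)

# Illustrative phylogeny-induced correlation with exponential decay at rate rho
rho = 1.0
corr = np.exp(-2 * rho * patristic_dist)

# HAC on the correlation-derived dissimilarity; the leaf order places
# phylogenetically correlated OTUs next to each other, so a 1-D convolutional
# filter sees related OTUs within the same local receptive field.
Z = linkage(squareform(1 - corr, checks=False), method="average")
otu_order = leaves_list(Z)
print("Reordered OTU indices:", otu_order)

# Reorder the columns of a hypothetical abundance matrix X (samples x OTUs)
X = rng.random((5, 8))
X_ordered = X[:, otu_order]
```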

Discussion

In this work, we develop a microbiome-based deep learning model, MDeep, for predicting both continuous and binary outcomes. The effectiveness of the proposed model relies on its ability to make full use of the information in microbiome data, which consists of both the phylogenetic tree and the OTU abundance. First, MDeep exploits the phylogenetic information by clustering OTUs based on phylogenetic correlation before the input layer. This strategy induces local smoothing, in which phylogenetically correlated OTUs are more likely to be captured together by the convolutional operation in a local receptive field. Second, MDeep encourages local connectivity in the feature mapping, where phylogenetically related OTUs at a lower taxonomic level are more likely to remain close to each other at a higher taxonomic level. To further improve MDeep, we will utilize parallel computing techniques to tune |$\rho$| efficiently.
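As a sketch of how |$\rho$| could be tuned in parallel, the snippet below runs a grid search with joblib; the scoring function cv_score is a hypothetical placeholder standing in for a full cross-validated training run, not a function provided by MDeep.

```python
import numpy as np
from joblib import Parallel, delayed

def cv_score(rho):
    """Hypothetical placeholder: train the model with correlation exp(-2*rho*d)
    and return a cross-validated performance score (e.g. mean R^2)."""
    return -abs(rho - 1.0)  # stand-in objective for illustration only

# Candidate values of rho evaluated in parallel across CPU cores
rho_grid = np.linspace(0.1, 5.0, 20)
scores = Parallel(n_jobs=-1)(delayed(cv_score)(rho) for rho in rho_grid)

best_rho = rho_grid[int(np.argmax(scores))]
print("Selected rho:", best_rho)
```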

We carry out comprehensive simulations to evaluate prediction performance while considering cluster size, signal density and informativeness of the phylogenetic tree. MDeep favors scenarios with dense, large-clustered signals, where it outperforms other methods, and it remains comparable with the others as the signal density and cluster size decrease. In particular, MDeep has superior prediction performance not only to NN and CNN with a similar network architecture but also to the sparse regression models and RF. This observation underscores the benefit of exploiting phylogenetic information in predictive modeling. Moreover, the same observation persists when the tree is uninformative, which demonstrates the robustness of MDeep. In reality, a noisy or misspecified tree is not uncommon because of an inappropriate tree construction method or inconsistency between DNA sequence similarity and biological similarity. Furthermore, MDeep is robust to the data transformation (e.g. relative abundance and CLR) and the underlying data-generating distribution (e.g. Dirichlet multinomial and negative binomial with microbial correlations). In the analysis of three real datasets, MDeep maintains the trend of outperforming the other methods overall. In particular, MDeep outperforms Ph-CNN, another deep learning method that utilizes phylogenetic information, in both classification performance and computational efficiency.

Although CNNs have typically been used to improve prediction performance with relatively large sample sizes ranging from thousands to millions, we demonstrate here that a CNN can achieve high prediction accuracy even when the sample size is moderate (e.g. hundreds). This is achieved by developing a ‘shallow’ NN architecture with only a few convolutional layers and fully connected layers.
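For concreteness, the following is a minimal sketch of such a shallow 1-D CNN in Keras, with two convolutional layers over the phylogenetically ordered OTUs followed by a small fully connected head; the layer widths and kernel sizes are illustrative and this is not the exact MDeep architecture.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_shallow_cnn(n_otus, task="binary"):
    """A small 1-D CNN over phylogenetically ordered OTU abundances.
    Layer widths and kernel sizes are illustrative only."""
    inputs = layers.Input(shape=(n_otus, 1))          # OTUs ordered by the phylogeny
    x = layers.Conv1D(64, kernel_size=8, strides=4, activation="relu")(inputs)
    x = layers.Conv1D(32, kernel_size=4, strides=2, activation="relu")(x)
    x = layers.Flatten()(x)
    x = layers.Dense(32, activation="relu")(x)
    x = layers.Dropout(0.5)(x)                        # dropout guards against overfitting
    if task == "binary":
        outputs = layers.Dense(1, activation="sigmoid")(x)
        loss = "binary_crossentropy"
    else:                                             # continuous outcome (regression)
        outputs = layers.Dense(1, activation=None)(x)
        loss = "mse"
    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam", loss=loss)
    return model

# Example: a binary classifier over 883 OTUs, as in the NORA dataset
model = build_shallow_cnn(n_otus=883, task="binary")
model.summary()
```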

We use mapped reads of OTUs as the feature representation. Notably, a reference-free or alignment-free approach is an alternative feature representation for microbiome data: k-mers of raw reads can be used directly as features, and a phylogenetic tree can be constructed based on k-mers [3, 11]. This approach has the benefit of skipping the computationally costly sequence alignments required in OTU picking, thus making the training of a prediction model more efficient. The comparison between OTU-based and k-mer-based MDeep needs further investigation.
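As a brief illustration of the alignment-free representation, the sketch below counts k-mers across raw reads to form a fixed-length feature vector; the toy reads and the choice k = 4 are placeholders, and this code is not part of MDeep.

```python
from collections import Counter
from itertools import product
import numpy as np

def kmer_features(reads, k=4):
    """Count all length-k substrings across reads and return a fixed-order vector."""
    alphabet = "ACGT"
    vocab = ["".join(p) for p in product(alphabet, repeat=k)]  # 4^k possible k-mers
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= set(alphabet):   # skip ambiguous bases such as 'N'
                counts[kmer] += 1
    return np.array([counts[kmer] for kmer in vocab], dtype=float)

# Toy reads standing in for raw 16S sequences
reads = ["ACGTACGTGGCA", "TTGACGTACGTT"]
features = kmer_features(reads, k=4)
print(features.shape)  # (256,) for k = 4
```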

Key Points

  • We present a novel deep learning approach named MDeep, designed based on a CNN, for performing either regression or binary classification. The novelty of MDeep lies in its ability to exploit the phylogenetic tree, which is an important prior in microbiome data.

  • A comprehensive simulation study evaluates the performance of MDeep along with competing methods, considering factors such as cluster size, signal density and informativeness of the phylogenetic tree. MDeep favors scenarios with large clusters and high signal density. Overall, MDeep is superior to other methods when the tree is informative and still achieves robust performance when the tree is uninformative.

  • Experimental results demonstrate that, for both regression and binary classification, MDeep achieves competitive performance compared with a classic CNN, Ph-CNN (another CNN-based method utilizing phylogenetic information) and other state-of-the-art machine learning methods.

Funding

Indiana University Precision Health Initiative.

Ye Wang is a PhD student at Auburn University. His research interests include deep learning, machine learning and bioinformatics. He is now a visiting student at the Center for Computational Biology and Bioinformatics at Indiana University School of Medicine (IUSM).

Tathagata Bhattacharya is a PhD student at Auburn University. His research interests include parallel computing and machine learning.

Yuchao Jiang is an assistant professor in the department of biostatistics and the department of genetics at UNC. He is also a member of the Lineberger Comprehensive Cancer Center. His research interests include statistical modeling, method development and data analysis in genetics and genomics.

Yue Wang is a research assistant professor in the department of medical and molecular genetics at IUSM. Dr Wang is experienced in neuroscience and brain biology.

Yunlong Liu is the T. K. Li professor of medical and molecular genetics at IUSM. He directs the Center for Computational Biology and Bioinformatics and the Center for Medical Genomics at IUSM. He has extensive experience in computational modeling in gene regulation and data analysis for high throughput genomics technologies.

Xiao Qin is a professor in the computer science and software engineering department at Auburn University. His research interests include parallel and distributed systems, storage systems and performance evaluation.

Andrew J. Saykin is a professor in the department of radiology and imaging sciences at IUSM. He is also the director of the Indiana Alzheimer Disease Center. His research interests include using brain imaging and genomic methods to study mechanisms of memory dysfunction and treatment response in neurological and psychiatric disorders.

Li Chen is an assistant professor of medicine at IUSM. He is also a member of the Center for Computational Biology and Bioinformatics. He works on developing statistical and computational methods for analyzing omics data such as metagenomics, epigenomics and transcriptomics data.

References

1. Alipanahi B, Delong A, Weirauch MT, et al. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat Biotechnol 2015;33(8):831–8.
2. Angermueller C, Lee HJ, Reik W, et al. DeepCpG: accurate prediction of single-cell DNA methylation states using deep learning. Genome Biol 2017;18(1):67.
3. Bertels F, Silander OK, Pachkov M, et al. Automated reconstruction of whole-genome phylogenies from short-sequence reads. Mol Biol Evol 2014;31(5):1077–88.
4. Bouskra D, Brezillon C, Berard M, et al. Lymphoid tissue genesis induced by commensals through NOD1 regulates intestinal homeostasis. Nature 2008;456(7221):507–10.
5. Caporaso JG, Kuczynski J, Stombaugh J, et al. QIIME allows analysis of high-throughput community sequencing data. Nat Methods 2010;7(5):335–6.
6. Cash HL, Whitham CV, Behrendt CL, et al. Symbiotic bacteria direct expression of an intestinal bactericidal lectin. Science 2006;313(5790):1126–30.
7. Charlson ES, Chen J, Allen RC, et al. Disordered microbial communities in the upper respiratory tract of cigarette smokers. PLoS One 2010;5(12):e15216.
8. Chen L, Liu H, Kocher JP, et al. glmgraph: an R package for variable selection and predictive modeling of structured genomic data. Bioinformatics 2015;31(24):3991–3.
9. Chen L, Reeve J, Zhang L, et al. GMPR: a robust normalization method for zero-inflated count data with application to microbiome sequencing data. PeerJ 2018;6:e4600.
10. Cougoul A, Bailly X, Wit EC, et al. MAGMA: inference of sparse microbial association networks. bioRxiv, 2019.
11. Fan H, Ives AR, Surget-Groba Y, et al. An assembly and alignment-free method of phylogeny reconstruction from next-generation sequencing data. BMC Genomics 2015;16:522.
12. Fioravanti D, Giarratano Y, Maggio V, et al. Phylogenetic convolutional neural networks in metagenomics. BMC Bioinformatics 2018;19(Suppl 2):49.
13. Gill SR, Pop M, DeBoy RT, et al. Metagenomic analysis of the human distal gut microbiome. Science 2006;312(5778):1355–9.
14. Gloor GB, Macklaim JM, Pawlowsky-Glahn V, et al. Microbiome datasets are compositional: and this is not optional. Front Microbiol 2017;8:2224.
15. Gonzalez A, Navas-Molina JA, Kosciolek T, et al. Qiita: rapid, web-enabled microbiome meta-analysis. Nat Methods 2018;15(10):796–8.
16. Hawinkel S, Mattiello F, Bijnens L, et al. A broken promise: microbiome differential abundance methods do not control the false discovery rate. Brief Bioinform 2019;20(1):210–21.
17. Ho TK. Random decision forests. In: Proceedings of 3rd International Conference on Document Analysis and Recognition, vol. 1. IEEE, 1995, 278–82.
18. Hollister EB, Gao C, Versalovic J. Compositional and functional features of the gastrointestinal microbiome and their effects on human health. Gastroenterology 2014;146(6):1449–58.
19. Hooper LV, Stappenbeck TS, Hong CV, et al. Angiogenins: a new class of microbicidal proteins involved in innate immunity. Nat Immunol 2003;4(3):269–73.
20. Kingma DP, Ba J. Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
21. Knights D, Costello EK, Knight R. Supervised classification of human microbiota. FEMS Microbiol Rev 2011;35(2):343–59.
22. Kuczynski J, Lauber CL, Walters WA, et al. Experimental and analytical tools for studying the human microbiome. Nat Rev Genet 2011;13(1):47–58.
23. Le Chatelier E, Nielsen T, Qin J, et al. Richness of human gut microbiome correlates with metabolic markers. Nature 2013;500(7464):541–6.
24. Luo F, Wang M, Yu L, et al. DeepPhos: prediction of protein phosphorylation sites with deep learning. Bioinformatics 2019;35(16):2766–73.
25. Macpherson AJ, Harris NL. Interactions between commensal intestinal bacteria and the immune system. Nat Rev Immunol 2004;4(6):478–85.
26. Manichanh C, Borruel N, Casellas F, et al. The gut microbiota in IBD. Nat Rev Gastroenterol Hepatol 2012;9(10):599–608.
27. Martins E, Hansen T. Phylogenies and the comparative method: a general approach to incorporating phylogenetic information into the analysis of interspecific data. Am Nat 1997;149(4):646–67.
28. Odamaki T, Kato K, Sugahara H, et al. Age-related changes in gut microbiota composition from newborn to centenarian: a cross-sectional study. BMC Microbiol 2016;16:90.
29. Pasolli E, Truong DT, Malik F, et al. Machine learning meta-analysis of large metagenomic datasets: tools and biological insights. PLoS Comput Biol 2016;12(7):e1004977.
30. Pflughoeft KJ, Versalovic J. Human microbiome in health and disease. Annu Rev Pathol 2012;7:99–122.
31. Price MN, Dehal PS, Arkin AP. FastTree: computing large minimum evolution trees with profiles instead of a distance matrix. Mol Biol Evol 2009;26(7):1641–50.
32. Santos-Marcos JA, Haro C, Rojas AV, et al. Sex differences in the gut microbiota as potential determinants of gender predisposition to disease. Mol Nutr Food Res 2019;63(7):e1800870.
33. Scher JU, Sczesnak A, Longman RS, et al. Expansion of intestinal Prevotella copri correlates with enhanced susceptibility to arthritis. Elife 2013;2:e01202.
34. Scher JU, Sczesnak A, Longman RS, et al. Expansion of intestinal Prevotella copri correlates with enhanced susceptibility to arthritis. Elife 2013;2:e01202.
35. Schloss PD, Westcott SL, Ryabin T, et al. Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol 2009;75(23):7537–41.
36. Singh R, Lanchantin J, Robins G, et al. DeepChrome: deep-learning for predicting gene expression from histone modifications. Bioinformatics 2016;32(17):i639–48.
37. Smith MI, Yatsunenko T, Manary MJ, et al. Gut microbiomes of Malawian twin pairs discordant for kwashiorkor. Science 2013;339(6119):548–54.
38. Srivastava N, Hinton G, Krizhevsky A, et al. Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 2014;15(1):1929–58.
39. Statnikov A, Henaff M, Narendra V, et al. A comprehensive evaluation of multicategory classification methods for microbiomic data. Microbiome 2013;1(1):11.
40. Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc B Methodol 1996;58(1):267–88.
41. Turnbaugh PJ, Ley RE, Mahowald MA, et al. An obesity-associated gut microbiome with increased capacity for energy harvest. Nature 2006;444(7122):1027–31.
42. Wang M, Tai C, Weinan E, et al. DeFine: deep convolutional neural networks accurately quantify intensities of transcription factor-DNA binding and facilitate evaluation of functional non-coding variants. Nucleic Acids Res 2018;46(11):e69.
43. Xiao J, Chen L, Johnson S, et al. Predictive modeling of microbiome data using a phylogeny-regularized generalized linear mixed model. Front Microbiol 2018;9:1391.
44. Xiao J, Chen L, Yu Y, et al. A phylogeny-regularized sparse regression model for predictive modeling of microbial community data. Front Microbiol 2018;9:3112.
45. Xiao J, Chen L, Johnson S, et al. Predictive modeling of microbiome data using a phylogeny-regularized generalized linear mixed model. Front Microbiol 2018;9:1391.
46. Yang B, Liu F, Ren C, et al. BiRen: predicting enhancers with a deep-learning-based model using the DNA sequence alone. Bioinformatics 2017;33(13):1930–6.
47. Cheng Y, Yang L, Zhou M, et al. LncADeep: an ab initio lncRNA identification and functional annotation tool based on deep learning. Bioinformatics 2018;34(22):3825–34.
48. Yatsunenko T, Rey FE, Manary MJ, et al. Human gut microbiome viewed across age and geography. Nature 2012;486(7402):222.
49. Zeller G, Tap J, Voigt AY, et al. Potential of fecal microbiota for early-stage detection of colorectal cancer. Mol Syst Biol 2014;10:766.
50. Zhang C-H. Nearly unbiased variable selection under minimax concave penalty. Ann Statist 2010;38(2):894–942.
51. Zhou J, Troyanskaya O. Predicting effects of noncoding variants with deep learning-based sequence model. Nat Methods 2015;12(10):931–4.
52. Zou H, Hastie T. Regularization and variable selection via the elastic net. J R Stat Soc Series B Stat Methodology 2005;67(2):301–20.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://dbpia.nl.go.kr/journals/pages/open_access/funder_policies/chorus/standard_publication_model)

Supplementary data