-
PDF
- Split View
-
Views
-
Cite
Cite
Huaying Fang, gmcoda: Graphical model for multiple compositional vectors in microbiome studies, Bioinformatics, Volume 39, Issue 11, November 2023, btad700, https://doi.org/10.1093/bioinformatics/btad700
- Share Icon Share
Abstract
Microbes are essential components in the ecosystem and participate in most biological procedures in environments. The high-throughput sequencing technologies help researchers directly quantify the abundance of microbes in a natural environment. Microbiome studies explore the construction, stability, and function of microbial communities with the aid of sequencing technology. However, sequencing technologies only provide relative abundances of microbes, and this kind of data is called compositional data in statistics. The constraint of the constant-sum requires flexible statistical methods for analyzing microbiome data. Current statistical analysis of compositional data mainly focuses on one compositional vector such as bacterial communities. The fungi are also an important component in microbial communities and are always measured by sequencing internal transcribed spacer instead of 16S rRNA genes for bacteria. The different sequencing methods between fungi and bacteria bring two compositional vectors in microbiome studies.
We propose a novel statistical method, called gmcoda, based on an additive logistic normal distribution for estimating the partial correlation matrix for cross-domain interactions. A majorization–minimization algorithm is proposed to solve the optimization problem involved in gmcoda. Through simulation studies, gmcoda is demonstrated to work well in estimating partial correlations between two compositional vectors. Gmcoda is also applied to infer cross-domain interactions in a real microbiome dataset and finds potential interactions between bacteria and fungi.
Gmcoda is open source and freely available from https://github.com/huayingfang/gmcoda under GNU LGPL v3.
1 Introduction
Microbes are small organisms living almost everywhere in natural environments, and most of them cannot be observed without the help of a microscope. Microbes influence their surroundings and vice versa. For example, some bacterial species in the gut are suggested to interact with host metabolism and influence host phenotypes such as obesity and diabetes (Patterson et al. 2016). Microbes do not exist solely, and they interact with each other to maintain the stability of the micro-ecosystem where they live. Microbiome studies explore and validate the structure and function of microbes including bacteria and fungi in the natural micro-ecosystem while abundances of microbes are measured by high-throughput sequencing technologies.
However, the microbiome experiment can only provide relative abundances, which is also called compositional data. Two main characteristics of compositional data are that components are non-negative, and the sum of all components is constant. This constant-sum constraint for compositional data brings difficulties in analyzing the interaction network by using classical methods. For example, Pearson correlations of components have a negative trend of correlations between components. Another challenge in microbiome data is that the sample size is much smaller than the number of microbial species in microbiome studies. The compositional and high-dimensional features of microbiome data provide new challenges and opportunities for statisticians and data scientists.
There are a variety of methods for inferring correlation structure among microbes with the compositional effect in microbiome studies (Friedman and Alm 2012, Fang et al. 2015, Cao et al. 2019). The pairwise correlations do not consider the impact of other microbes for estimating the interaction between two microbes. In contrast, the partial correlation matrix in the Gaussian graphical model can be used to measure the conditional dependence between two random variables given others, such that an element in the partial correlation matrix is 0 if and only if the two microbial species are conditionally independent given all other species (Yuan and Lin 2007). Kurtz et al. (2015) proposed an approximate procedure, named SPIEC-EASI (SParse InversE Covariance estimation for Ecological ASsociation Inference), to infer the partial correlation among components by the sparse neighborhood or the inverse covariance selection, but this approximation may fail in some situations. Fang et al. (2017) extended the Gaussian graphical model in compositional data by an additive logistic normal distribution.
To date, a majority of microbiome analyses measure the relative abundance of bacteria using the 16S RNA gene as the marker gene. The resulting data consists of a single compositional vector. Most existing graphical models for compositional data have also been designed to model this kind of compositional vector data. Increasingly, however, multiple compositional vectors have arisen in microbiome studies. As an example, fungi are important in maintaining the stability of the micro-ecosystem. The relative abundance of fungi species can be quantified based on the internal transcribed spacer (ITS), in much a similar fashion as the 16S rRNA gene for bacteria. Thus, the same biospecimen and the same sequencing run can produce two or more compositional vectors describing microbial communities: one for bacteria and another for fungi. Furthermore, multiple compositional vectors can also appear for different body sites of 16S rRNA genes in the same subject. Lastly, some recent microbiome studies have included, as environmental variables, compositional vectors representing mass-spectrometry generated metabolites. The metabolite abundances quantified from mass spectrometry are also relative data. Most existing methods do not allow for jointly modeling multiple compositional data. Tipton et al. (2018) applied an extension version of SPIEC-EASI for exploring the partial correlation between bacteria and fungi in microbial studies, and this extension version also inherits the drawback of SPIEC-EASI that the approximation may be not good in some cases.
In this article, we propose a novel statistical method called gmcoda for inferring the partial correlation matrix of microbes from multiple compositional vectors based on the additive logistic normal distribution. The additive logistic normal distribution is the extension of the model proposed in Fang et al. (2015, 2017). This model assumes that the observed compositional data are from an additive logistic transformation of latent variables and the total abundances of compositional vectors are lost in the transformation. We derive the likelihood function under the multivariate normal assumption of the latent variables and propose a penalized likelihood method to estimate a sparse partial correlation matrix for multiple compositional vectors. A majorization–minimization algorithm is proposed to solve the optimization problem in the penalized likelihood. Simulation studies and real data analysis are performed to show the advantages of the proposed method. Applied to a real dataset, we demonstrate that the proposed method detects a large number of interactions between bacteria and fungi. The inter-domain interaction, between bacteria and fungi, has been associated with inflammatory responses in humans and model organisms (Krüger et al. 2019). This type of interaction, however, cannot be found using existing methods that treat each compositional vector separately.
2 Methods
2.1 Additive logistic normal model for multiple compositional vectors
2.2 Penalized likelihood for Ω
2.3 Optimization algorithm
3 Results
3.1 Simulation studies
Simulation studies are performed to compare gmcoda with other methods for estimating the partial correlation matrix. Graphical lasso with logarithm transformation of proportions is denoted as glasso. SPIEC-EASI variants with sparse neighborhood and covariance selection are denoted as SEMB and SEGL, respectively. In simulations, each element of the mean vector for latent variables is independently generated from the uniform distribution on . The precision matrix Ω is generated in the following way. A predefined network is selected to decide the network structure. Simulation studies consider the following six structures: random, neighbor, band, hub, block, and scale-free networks, as described below.
Random network: Pairs of nodes are connected with probability 0.1 and the strengths of edges are set 0.15 and −0.15 with equal probabilities.
Neighbor network: A distance matrix of p nodes is first constructed based on the random positions on and each node is connected with its eight closest neighbors. The strengths of edges are set as 0.15 and −0.15 with equal probabilities.
Band network: Nodes are sorted in a linear order and each node is connected with its five closest neighbors. The strengths of edges are set 0.4, 0.2, −0.2, −0.1, and −0.1 as the distances are 1, 2, 3, 4, and 5.
Hub network: Four nodes are selected as hub nodes. Hub nodes are connected with hub nodes with probability 1 and connected with non-hub nodes with probability 0.75. The strength of edges involved with hub nodes is set as 0.1. Pairs of non-hub nodes are connected with probability 0.05 and strength 0.15.
Block network: Nodes are divided into eight blocks with nearly equal sizes. Pairs of nodes in the same block are connected with probability 0.45 and strength 0.2. Pairs of nodes from different blocks are connected with probability 0.05 and strength 0.15.
Scale-free network: The Barabási–Albert model (Barabási and Albert 1999) is used with a power coefficient of 2 and 10 edges are added in each step. The strengths of edges are generated from a uniform distribution on a discrete set .
Let p be the number of nodes. The edge densities for random, neighbor, band, hub, block, and scale-free networks are about 0.1, , 0.1, and , respectively. The number of nodes is set as p = 50 and the sample size is set as n = 100 and n = 300. The diagonal elements of the precision matrix are set large enough to make the matrix positive definite. Two compositional vectors (d = 2) are considered in simulation studies that and .
We first compare gmcoda with gcoda (Fang et al. 2017) in recovering edges from compositional data since gmcoda can be seen as an extension for multiple compositional vectors of gcoda. Two scenarios are considered: First, two compositional vectors are sub-components from the single compositional vector; Second, two compositional vectors are first generated and then directly merged into a single compositional vector. The performances of gmcoda and gcoda are similar in the first scenario while gmcoda performs better than gcoda for most cases in the second scenario (Supplementary Table S1). The overlap of shared edges between gmcoda and gcoda in the first scenario is higher than that in the second scenario. The results of these two approaches are consistent if multiple compositional vectors are sub-components of one compositional vector.
Performance comparison of gmcoda, glasso and SPIEC-EASI for the inferred partial correlation matrix in simulation studies with .a
Network . | Method . | FPR . | TPR (Recall) . | Precision . | F1 score . | d1 . | . | Time (s) . |
---|---|---|---|---|---|---|---|---|
Random | gmcoda | 0.023 | 0.592 | 0.768 | 0.651 | 0.012 | 0.039 | 0.242 |
glasso | 0.000 | 0.013 | 0.148 | 0.023 | 0.015 | 0.047 | 25.971 | |
SEGL | 0.000 | 0.016 | 0.546 | 0.029 | 0.015 | 0.047 | 38.180 | |
SEMB | 0.000 | 0.016 | 0.546 | 0.029 | – | – | 37.196 | |
Neighbor | gmcoda | 0.084 | 0.739 | 0.676 | 0.704 | 0.019 | 0.046 | 0.357 |
glasso | 0.001 | 0.120 | 0.980 | 0.212 | 0.027 | 0.063 | 25.173 | |
SEGL | 0.001 | 0.117 | 0.976 | 0.209 | 0.027 | 0.063 | 38.925 | |
SEMB | 0.000 | 0.089 | 0.990 | 0.162 | – | – | 34.560 | |
Band | gmcoda | 0.151 | 0.661 | 0.511 | 0.576 | 0.022 | 0.051 | 0.468 |
glasso | 0.000 | 0.267 | 1.000 | 0.421 | 0.036 | 0.096 | 24.726 | |
SEGL | 0.000 | 0.339 | 0.850 | 0.476 | 0.035 | 0.091 | 39.681 | |
SEMB | 0.000 | 0.368 | 0.749 | 0.469 | – | – | 35.569 | |
Hub | gmcoda | 0.014 | 0.148 | 0.646 | 0.233 | 0.017 | 0.046 | 0.238 |
glasso | 0.002 | 0.024 | 0.607 | 0.046 | 0.017 | 0.045 | 25.283 | |
SEGL | 0.002 | 0.030 | 0.618 | 0.056 | 0.017 | 0.045 | 36.890 | |
SEMB | 0.002 | 0.028 | 0.614 | 0.053 | – | – | 36.293 | |
Block | gmcoda | 0.022 | 0.514 | 0.740 | 0.574 | 0.015 | 0.049 | 0.262 |
glasso | 0.000 | 0.015 | 0.250 | 0.027 | 0.017 | 0.056 | 23.923 | |
SEGL | 0.000 | 0.052 | 0.550 | 0.084 | 0.017 | 0.055 | 36.163 | |
SEMB | 0.000 | 0.063 | 0.548 | 0.102 | – | – | 34.464 | |
Scale-free | gmcoda | 0.169 | 0.884 | 0.540 | 0.666 | 0.017 | 0.040 | 0.502 |
glasso | 0.006 | 0.119 | 0.701 | 0.203 | 0.026 | 0.062 | 26.543 | |
SEGL | 0.006 | 0.116 | 0.762 | 0.193 | 0.026 | 0.062 | 43.301 | |
SEMB | 0.004 | 0.134 | 0.893 | 0.232 | – | – | 36.964 |
Network . | Method . | FPR . | TPR (Recall) . | Precision . | F1 score . | d1 . | . | Time (s) . |
---|---|---|---|---|---|---|---|---|
Random | gmcoda | 0.023 | 0.592 | 0.768 | 0.651 | 0.012 | 0.039 | 0.242 |
glasso | 0.000 | 0.013 | 0.148 | 0.023 | 0.015 | 0.047 | 25.971 | |
SEGL | 0.000 | 0.016 | 0.546 | 0.029 | 0.015 | 0.047 | 38.180 | |
SEMB | 0.000 | 0.016 | 0.546 | 0.029 | – | – | 37.196 | |
Neighbor | gmcoda | 0.084 | 0.739 | 0.676 | 0.704 | 0.019 | 0.046 | 0.357 |
glasso | 0.001 | 0.120 | 0.980 | 0.212 | 0.027 | 0.063 | 25.173 | |
SEGL | 0.001 | 0.117 | 0.976 | 0.209 | 0.027 | 0.063 | 38.925 | |
SEMB | 0.000 | 0.089 | 0.990 | 0.162 | – | – | 34.560 | |
Band | gmcoda | 0.151 | 0.661 | 0.511 | 0.576 | 0.022 | 0.051 | 0.468 |
glasso | 0.000 | 0.267 | 1.000 | 0.421 | 0.036 | 0.096 | 24.726 | |
SEGL | 0.000 | 0.339 | 0.850 | 0.476 | 0.035 | 0.091 | 39.681 | |
SEMB | 0.000 | 0.368 | 0.749 | 0.469 | – | – | 35.569 | |
Hub | gmcoda | 0.014 | 0.148 | 0.646 | 0.233 | 0.017 | 0.046 | 0.238 |
glasso | 0.002 | 0.024 | 0.607 | 0.046 | 0.017 | 0.045 | 25.283 | |
SEGL | 0.002 | 0.030 | 0.618 | 0.056 | 0.017 | 0.045 | 36.890 | |
SEMB | 0.002 | 0.028 | 0.614 | 0.053 | – | – | 36.293 | |
Block | gmcoda | 0.022 | 0.514 | 0.740 | 0.574 | 0.015 | 0.049 | 0.262 |
glasso | 0.000 | 0.015 | 0.250 | 0.027 | 0.017 | 0.056 | 23.923 | |
SEGL | 0.000 | 0.052 | 0.550 | 0.084 | 0.017 | 0.055 | 36.163 | |
SEMB | 0.000 | 0.063 | 0.548 | 0.102 | – | – | 34.464 | |
Scale-free | gmcoda | 0.169 | 0.884 | 0.540 | 0.666 | 0.017 | 0.040 | 0.502 |
glasso | 0.006 | 0.119 | 0.701 | 0.203 | 0.026 | 0.062 | 26.543 | |
SEGL | 0.006 | 0.116 | 0.762 | 0.193 | 0.026 | 0.062 | 43.301 | |
SEMB | 0.004 | 0.134 | 0.893 | 0.232 | – | – | 36.964 |
SEMB and SEGL are two options for SPIEC-EASI. FPR and TPR (also called recall) are the false positive rate and the true positive rate, respectively. d1 and for SEMB are not available since SEMB doesn’t return an estimated precision matrix. The results are the averages over 20 runs.
Performance comparison of gmcoda, glasso and SPIEC-EASI for the inferred partial correlation matrix in simulation studies with .a
Network . | Method . | FPR . | TPR (Recall) . | Precision . | F1 score . | d1 . | . | Time (s) . |
---|---|---|---|---|---|---|---|---|
Random | gmcoda | 0.023 | 0.592 | 0.768 | 0.651 | 0.012 | 0.039 | 0.242 |
glasso | 0.000 | 0.013 | 0.148 | 0.023 | 0.015 | 0.047 | 25.971 | |
SEGL | 0.000 | 0.016 | 0.546 | 0.029 | 0.015 | 0.047 | 38.180 | |
SEMB | 0.000 | 0.016 | 0.546 | 0.029 | – | – | 37.196 | |
Neighbor | gmcoda | 0.084 | 0.739 | 0.676 | 0.704 | 0.019 | 0.046 | 0.357 |
glasso | 0.001 | 0.120 | 0.980 | 0.212 | 0.027 | 0.063 | 25.173 | |
SEGL | 0.001 | 0.117 | 0.976 | 0.209 | 0.027 | 0.063 | 38.925 | |
SEMB | 0.000 | 0.089 | 0.990 | 0.162 | – | – | 34.560 | |
Band | gmcoda | 0.151 | 0.661 | 0.511 | 0.576 | 0.022 | 0.051 | 0.468 |
glasso | 0.000 | 0.267 | 1.000 | 0.421 | 0.036 | 0.096 | 24.726 | |
SEGL | 0.000 | 0.339 | 0.850 | 0.476 | 0.035 | 0.091 | 39.681 | |
SEMB | 0.000 | 0.368 | 0.749 | 0.469 | – | – | 35.569 | |
Hub | gmcoda | 0.014 | 0.148 | 0.646 | 0.233 | 0.017 | 0.046 | 0.238 |
glasso | 0.002 | 0.024 | 0.607 | 0.046 | 0.017 | 0.045 | 25.283 | |
SEGL | 0.002 | 0.030 | 0.618 | 0.056 | 0.017 | 0.045 | 36.890 | |
SEMB | 0.002 | 0.028 | 0.614 | 0.053 | – | – | 36.293 | |
Block | gmcoda | 0.022 | 0.514 | 0.740 | 0.574 | 0.015 | 0.049 | 0.262 |
glasso | 0.000 | 0.015 | 0.250 | 0.027 | 0.017 | 0.056 | 23.923 | |
SEGL | 0.000 | 0.052 | 0.550 | 0.084 | 0.017 | 0.055 | 36.163 | |
SEMB | 0.000 | 0.063 | 0.548 | 0.102 | – | – | 34.464 | |
Scale-free | gmcoda | 0.169 | 0.884 | 0.540 | 0.666 | 0.017 | 0.040 | 0.502 |
glasso | 0.006 | 0.119 | 0.701 | 0.203 | 0.026 | 0.062 | 26.543 | |
SEGL | 0.006 | 0.116 | 0.762 | 0.193 | 0.026 | 0.062 | 43.301 | |
SEMB | 0.004 | 0.134 | 0.893 | 0.232 | – | – | 36.964 |
Network . | Method . | FPR . | TPR (Recall) . | Precision . | F1 score . | d1 . | . | Time (s) . |
---|---|---|---|---|---|---|---|---|
Random | gmcoda | 0.023 | 0.592 | 0.768 | 0.651 | 0.012 | 0.039 | 0.242 |
glasso | 0.000 | 0.013 | 0.148 | 0.023 | 0.015 | 0.047 | 25.971 | |
SEGL | 0.000 | 0.016 | 0.546 | 0.029 | 0.015 | 0.047 | 38.180 | |
SEMB | 0.000 | 0.016 | 0.546 | 0.029 | – | – | 37.196 | |
Neighbor | gmcoda | 0.084 | 0.739 | 0.676 | 0.704 | 0.019 | 0.046 | 0.357 |
glasso | 0.001 | 0.120 | 0.980 | 0.212 | 0.027 | 0.063 | 25.173 | |
SEGL | 0.001 | 0.117 | 0.976 | 0.209 | 0.027 | 0.063 | 38.925 | |
SEMB | 0.000 | 0.089 | 0.990 | 0.162 | – | – | 34.560 | |
Band | gmcoda | 0.151 | 0.661 | 0.511 | 0.576 | 0.022 | 0.051 | 0.468 |
glasso | 0.000 | 0.267 | 1.000 | 0.421 | 0.036 | 0.096 | 24.726 | |
SEGL | 0.000 | 0.339 | 0.850 | 0.476 | 0.035 | 0.091 | 39.681 | |
SEMB | 0.000 | 0.368 | 0.749 | 0.469 | – | – | 35.569 | |
Hub | gmcoda | 0.014 | 0.148 | 0.646 | 0.233 | 0.017 | 0.046 | 0.238 |
glasso | 0.002 | 0.024 | 0.607 | 0.046 | 0.017 | 0.045 | 25.283 | |
SEGL | 0.002 | 0.030 | 0.618 | 0.056 | 0.017 | 0.045 | 36.890 | |
SEMB | 0.002 | 0.028 | 0.614 | 0.053 | – | – | 36.293 | |
Block | gmcoda | 0.022 | 0.514 | 0.740 | 0.574 | 0.015 | 0.049 | 0.262 |
glasso | 0.000 | 0.015 | 0.250 | 0.027 | 0.017 | 0.056 | 23.923 | |
SEGL | 0.000 | 0.052 | 0.550 | 0.084 | 0.017 | 0.055 | 36.163 | |
SEMB | 0.000 | 0.063 | 0.548 | 0.102 | – | – | 34.464 | |
Scale-free | gmcoda | 0.169 | 0.884 | 0.540 | 0.666 | 0.017 | 0.040 | 0.502 |
glasso | 0.006 | 0.119 | 0.701 | 0.203 | 0.026 | 0.062 | 26.543 | |
SEGL | 0.006 | 0.116 | 0.762 | 0.193 | 0.026 | 0.062 | 43.301 | |
SEMB | 0.004 | 0.134 | 0.893 | 0.232 | – | – | 36.964 |
SEMB and SEGL are two options for SPIEC-EASI. FPR and TPR (also called recall) are the false positive rate and the true positive rate, respectively. d1 and for SEMB are not available since SEMB doesn’t return an estimated precision matrix. The results are the averages over 20 runs.
A noticeable phenomenon in Table 1 is the moderate–low recalls in most cases. The recall of gmcoda is larger than that of other methods in all cases. One possible reason is that glasso, SEGL, and SEMB are more conservative than gmcoda. Glasso, SEGL, and SEMB produced fewer edges than gmcoda but did not get significant improvements in precision for recovering networks in some cases. It seems the identification efficiency of interactions depends on the strength of partial correlation coefficients in the precision matrix. We compare the identification efficiency of edges grouped by the strength of partial correlation coefficients in the band network (Supplementary Fig. S1). We find that the proportions of identified edges for gmcoda are larger than other methods in all four non-zeros partial correlation coefficients and there is an obvious trend that the stronger the strengths of the underlying true edges, the more likely detected by gmcoda. This moderate–low recall phenomenon reminds researchers that current methods may miss some interesting edges among microbes for interpreting microbial networks.
We also compare the area under the curve (AUC) values of various combinations of the network structure, the dimension of microbes, and the sample size (Supplementary Table S4). AUC values are calculated by varying the tuning parameter. AUC values are used as the secondary measure for comparison since AUC values do not provide useful information about the estimation that will be used in practice. Both sample sizes and lengths of composition vectors influence the performance of all methods. As the sample size increases, AUC values increase for all methods. The performance of gmcoda is much better than glasso and SPEIC-EASI for recovering the network structure of random, neighbor, and scale-free networks via AUC values. For hub and block networks, SPIEC-EASI performs better than gmcoda and glasso. For the band network, glasso is the best for a large sample size (n = 300) while SPIEC-EASI is the best for a small sample size (n = 100).
We also conducted a simulation study for comparing gmcoda and other methods in a situation where there are only inter-domain interactions while no cross-domain correlation exists. Compared to gmcoda’s results that about 15% of detected cross-domain edges, the other methods almost detect no false edges for bacteria–fungi interaction (Supplementary Table S5). However, gmcoda showed the advantage of the recall and the precision in this scenario.
3.2 Application in microbiome studies
We applied the proposed approach to infer the interaction network among bacteria and fungi in an intestinal bacterial microbiome dataset from fecal samples (Haak et al. 2021). This data contains 59 samples that are subjected to 16S rRNA genes with V3–V4 regions for bacteria and ITS rDNA gene sequences for fungi. The data are filtered by removing operational taxonomic units (OTUs) with more than 80% 0s across samples and removing samples with more than 80% 0s across OTUs separately for bacteria and fungi. There are 33 bacteria OTU and 15 fungi OTU remaining for 47 samples after the filtering process.
The counts of edges inferred by gmcoda, glasso, and SPIEC-EASI are presented in Table 2. The inferred network from gmcoda includes 266 edges, which are substantially more than those from glasso and SPIEC-EASI. The overlapped edges among these four methods are shown in Fig. 1. Most edges inferred by glasso and SPIEC-EASI are also detected by gmcoda. We shuffle the count data by permuting the counts in each OTU to give an estimate of the false discovery count. The permutation is repeated 20 times for deriving a reliable estimation of the false discovery count. The false discovery counts of gmcoda, glasso, SEMB, and SEGL are 10.30, 6.70, 7.60, and 7.25, respectively. The false discovery count of gmcoda is comparable to that of its competitors.

Venn diagrams of shared edges for networks inferred from gmcoda, glasso, and SPIEC-EASI for the intestinal bacterial microbiome data.
Edge numbers for networks inferred by gmcoda, glasso, and SPIEC-EASI for the intestinal bacterial microbiome data.
Method . | Bacteria–Bacteria . | Bacteria–Fungi . | Fungi–Fungi . | Total . |
---|---|---|---|---|
gmcoda | 148 | 81 | 37 | 266 |
glasso | 11 | 0 | 7 | 18 |
SEGL | 23 | 1 | 9 | 33 |
SEMB | 25 | 1 | 7 | 33 |
Method . | Bacteria–Bacteria . | Bacteria–Fungi . | Fungi–Fungi . | Total . |
---|---|---|---|---|
gmcoda | 148 | 81 | 37 | 266 |
glasso | 11 | 0 | 7 | 18 |
SEGL | 23 | 1 | 9 | 33 |
SEMB | 25 | 1 | 7 | 33 |
Edge numbers for networks inferred by gmcoda, glasso, and SPIEC-EASI for the intestinal bacterial microbiome data.
Method . | Bacteria–Bacteria . | Bacteria–Fungi . | Fungi–Fungi . | Total . |
---|---|---|---|---|
gmcoda | 148 | 81 | 37 | 266 |
glasso | 11 | 0 | 7 | 18 |
SEGL | 23 | 1 | 9 | 33 |
SEMB | 25 | 1 | 7 | 33 |
Method . | Bacteria–Bacteria . | Bacteria–Fungi . | Fungi–Fungi . | Total . |
---|---|---|---|---|
gmcoda | 148 | 81 | 37 | 266 |
glasso | 11 | 0 | 7 | 18 |
SEGL | 23 | 1 | 9 | 33 |
SEMB | 25 | 1 | 7 | 33 |
The network inferred by gmcoda includes 81 connections between bacteria and fungi. The strongest inter-domain edge connects the bacterial genus in Agathobacter and the fungal genus in Aspergillus, with an estimated partial correlation of −0.21; SPIEC-EASI also estimated a significant, albeit weaker, partial correlation of −0.028. Consistent with this finding, an independent gut microbiome study in COVID-19 patients has reported a negative correlation between Aspergillus niger and Agathobacter (Lv et al. 2021). A second strongest inter-domain edge, detected only by gmcoda and not by SPIEC-EASI, indicates a positive partial correlation between bacterial genus Streptococcus and fungal genus Candida. This finding is supported by an independent dental microbiome study, which has reported a statistically significant, and positive, correlation between Candida albicans and Streptococcus mutans (Sridhar et al. 2020).
4 Discussion
The interactions between bacteria and fungi are related to human health and disease (Krüger et al. 2019). We propose a graphical model approach, called gmcoda, for inferring microbial cross-domain interactions in microbiome studies. The proposed model is based on a latent logistic normal model for compositional data in that multiple compositional vectors are considered simultaneously. An efficient algorithm is developed to solve the optimization problem involved in gmcoda. Simulation studies show that gmcoda provides more accurate and robust results than SPIEC-EASI and glasso in most cases. Gmcoda provides larger AUCs than SPIEC-EASI for scale-free networks in simulation studies. Gmcoda is also applied in the network inference of real microbiome data and can find connections between fungi and bacteria, which would not be possible when using existing methods that treat each compositional vector separately.
Although gmcoda provides a novel method for inferring microbial cross-domain interactions, there are still some challenges in the field of network inference from microbiome data. The first challenge is that almost all methods, including all methods mentioned in this paper, based on log-ratio transformations assume the observed zeros are not true zeros and these observed zeros should be imputed with some small positive number before calculating compositions. This assumption is not always true for complex microbial ecosystems which some microbes don’t live in some conditions. More reasonable statistical models should be built for considering true zeros in compositional data. One strategy for this zero-value problem is using a mixture model of the single point distribution at zero and the additive logistical normal distribution instead of a single logistical normal distribution. The second challenge is the selection of the tuning parameter for balancing the model fitting and the sparsity assumption in gmcoda. Currently, gmcoda applied Bayesian information criteria for choosing the tuning parameter and simulation results showed some false edges are involved in the estimation. One strategy for choosing the tuning parameter is the adaptive method with two stages. The third challenge is the computational complexity of large-scale datasets in inferring networks. Although gmcoda is much faster than other methods, gmcoda is time-consuming when the number of microbes is moderate–large. A potential future direction for inferring microbial networks is to develop scalable algorithms for dealing with large datasets in real-world scenarios.
Acknowledgements
The author would like to thank Professor Hua Tang for proofreading the manuscript and two anonymous referees for their helpful comments that led to substantial improvements in the paper.
Supplementary data
Supplementary data are available at Bioinformatics online.
Conflict of interest
None declared.
Funding
This work was supported by the National Natural Science Foundation of China [12101425].
Data availability
The data underlying this article are available in the article and in its online supplementary material.