Bayesian functional analysis for untargeted metabolomics data with matching uncertainty and small sample sizes

Abstract

Untargeted metabolomics based on liquid chromatography-mass spectrometry technology is quickly gaining widespread application, given its ability to depict the global metabolic pattern in biological samples. However, the data are noisy and plagued by the lack of clear identity of data features measured from samples. Multiple potential matchings exist between data features and known metabolites, while the truth can only be one-to-one matches. Some existing methods attempt to reduce the matching uncertainty, but are far from being able to remove the uncertainty for most features. The existence of the uncertainty causes major difficulty in downstream functional analysis. To address these issues, we develop a novel approach for Bayesian Analysis of Untargeted Metabolomics data (BAUM) to integrate previously separate tasks into a single framework, including matching uncertainty inference, metabolite selection and functional analysis. By incorporating the knowledge graph between variables and using relatively simple assumptions, BAUM can analyze datasets with small sample sizes. By allowing different confidence levels of feature-metabolite matching, the method is applicable to datasets in which feature identities are partially known. Simulation studies demonstrate that, compared with other existing methods, BAUM achieves better accuracy in selecting important metabolites that tend to be functionally consistent and assigning confidence scores to feature-metabolite matches. We analyze a COVID-19 metabolomics dataset and a mouse brain metabolomics dataset using BAUM. Even with a very small sample size of 16 mice per group, BAUM is robust and stable. It finds pathways that conform to existing knowledge, as well as novel pathways that are biologically plausible.

Graphical Abstract

Open in new tab Download slide

Untargeted metabolomics based on liquid chromatographymass spectrometry technology depicts the global metabolic pattern in biological samples. However, multiple potential matchings exist between data features and known metabolites, while the truth can only be one-to-one matches, which causes major difficulty in downstream functional analysis. To address these issues, we develop a novel approach for Bayesian Analysis of Untargeted Metabolomics data (BAUM) to integrate previously separate tasks into a single framework, including matching uncertainty inference, metabolite selection, and functional analysis.

The figure illustrates the overall setup of BAUM. The observed data for model include the feature-level summary statistics |${r}_i$| computed from the observed metabolic feature and the clinical outcome (optional), the potential feature-metabolite matches and their confidence measures |$ {q}_{ij} $|⁠, and the known metabolic network structure. The output of the model include the false discovery rate (FDR) for each metabolite, and the strength of each feature-metabolite matching. We use a Bayesian latent factor model to characterize the observed feature summary statistics and link them to the unobserved metabolite behavior. We assign a Multinomial prior with prior probabilities |$ {q}_i $| to matching indicators |$ {\lambda}_i $|⁠, a normal prior to the null component score |$ {\eta}_0 $|⁠, a Dirichlet Process prior to the alternative component score |$ {\eta}_j $| and a weighted Potts prior to metabolite latent class indicators |$ z $|⁠. Generally, the observed summary statistic of a feature is a linear combination of the unobserved scores of its linked metabolites. The weights reflect the confidence level of the metabolite-feature annotation, and are to be estimated from the data. The metabolites are segregated into two latent classes: the clinically relevant class (alternative component), and the clinically irrelevant class (null component). The two classes have different distributions of metabolite scores. Metabolites that are connected on the metabolic network are more likely to belong to the same class

Bayesian latent factor model, matching uncertainty, metabolite network analysis

INTRODUCTION

Untargeted metabolomics by liquid chromatography-mass spectrometry (LC-MS) measures small molecules in a system in an unbiased manner. It has gained increasing prominence in biomedical research for the purposes of understanding nutrition, metabolic diseases, environmental health and cancer [1, 2].

In LC-MS metabolomics analysis, a key obstacle is the uncertainty in the matching between measured features and known metabolites [3]. Unlike gene expression data measured by deep sequencing, LC-MS features lack direct chemical identity information. Potential matches between features and metabolites are largely based on the features’ mass-to-charge ratio (m/z) and retention time, often resulting in multiple potentially matched metabolites for a single feature and vice versa [4, 5]. Although some methods have utilized information, such as tandem mass spectrometry (⁠|$\text{MS}^{2}$|⁠) or feature–feature relations, to improve features-to-metabolite annotations [4, 6], the uncertainty remains for most features [7], which posts challenges for downstream analyses and data interpretation.

The downstream analysis workflow typically involves two key tasks: the selection of metabolic features exhibiting differential abundance between sample groups and the assessment of pathway significance by considering the significance levels of features associated with each pathway [8–12]. An alternative approach is to analyze the data jointly with the entire metabolic network, which avoids artificially dividing the metabolic network into pathways and utilizes the detailed connection structure between metabolites [13, 14]. Both suffer from the uncertainty in feature-metabolite matching.

In essence, these downstream tasks can be treated as on-network feature selection problems. A number of previous studies rely on parametric regression models for feature selection on biological networks [15–20]. However, linear regression models may not capture complex associations between features and clinical outcomes effectively, potentially introducing undesirable bias [21]. On the other hand, complex non-linear parametric model are more expressive but can be computationally demanding and prone to overfitting in scenarios with a large number of features and a limited number of samples, which is common in biological network analysis. Furthermore, regression models assume the presence of an outcome variable, which may not always be the case or may not align with the research focus, such as when studying biomarker expression behavior, for example, gene periodicity [21, 22]. A different method is to generate a test statistic for each feature and perform feature selection using a Bayesian nonparametric approach that takes into account both the network dependencies and the test statistic [21, 23], which does not require parametric models nor outcome variables. Jin et al. [22] have extended this framework to accommodate asymmetrical null and alternative distributions and handle missing values systematically. However, these summary-statistics-based approaches, along with other regression-based methods, are designed for single layer networks (e.g. gene networks) and cannot simultaneously address both aspects in the down-stream analyses. The statistical down-stream analyses of the LC-MS metabolomics data require inferences on two-layer networks comprising observed metabolite features and the underlying metabolite network, where the matching uncertainty in between needs to be addressed. Some methods account for matching uncertainty in pathway testing by down-weighting features matched to multiple metabolites [24]. However, such methods down-weight multiple-matched features evenly across all impacted pathways without inferences on which potential matching is more likely to be true. Furthermore, these methods rely on predefined pathways, ignoring the detailed metabolic network structure, and can be sensitive to the feature-level |$p$|-value threshold.

In this study, we propose an innovative method, Bayesian Analysis for Untargeted Metabolomics data (BAUM), which jointly models feature-metabolite matching and metabolic network behavior under a Bayesian semiparametric framework. We assume a summary statistic for each feature is precalculated, either in a supervised manner when clinical outcomes are available or unsupervised when they are not. This approach offers computational efficiency, avoids making parametric model assumptions, provides flexibility for both linear and non-linear relationships, and allows for analysis even when outcome variables are absent. Additionally, it can effectively manage data with small sample sizes, as it only necessitates summary statistics from observed features. Although similar concepts have found success in gene networks [21–23], they have not been applied to the matching uncertainty problem inherent in two-layer feature-metabolite networks. The summary statistics can be obtained by transforming |$P$|-values obtained by statistical tests into normally distributed statistics, where the transformation is monotone so that it is guaranteed significant features have a larger summary statistics value. We establish the connection between feature-level summary statistics and on-network latent metabolite scores while accounting for matching uncertainty through a Bayesian factor analysis model with one-hot constraints (Section 2.2.1), where the feature summary statistic is a noisy copy of the latent score for the matched metabolite. We then model the latent metabolite scores by a mixture distribution of two components, one representing clinically relevant metabolites (alternative component) and the other representing clinically irrelevant metabolites (null component), by which we control local false discovery rate (FDR, Section 2.2.2). We assume the null component follows a centered Gaussian distribution while the alternative component follows the Dirichlet process mixture (DPM), which has been extensively discussed in Bayesian statistics [25–28] and used in local FDR control [21–23, 29]. The DPM is proved to achieve good performance on density estimation in light of its nonparametric nature, and efficient computational techniques are available, such as Gibbs sampling for stick-breaking priors [30]. To incorporate the metabolic network information into metabolite inferences, we employ the weighted Potts prior [22] that extends the Ising prior [31] to assign class labels to all metabolites based on their network dependencies (Section 2.2.3). The Ising prior tends to assign similar labels to closely connected nodes on the network, making it suitable for sub-network significance analysis. For posterior computation, we develop an efficient Gibbs sampler (Section 2.3), where we leverage an equivalent model representation of the DPM using the stick-breaking priors, resort to the Swendsen–Wang algorithm [32] for efficiently updating metabolite class labels, and exploit conjugacy for posterior sampling.

In several regards, BAUM is the first of its kind. Firstly, it can quantitatively evaluate which metabolite is more likely to be the true match of each feature. Secondly, BAUM is the first method to perform statistical inference directly at the metabolite level while accounting for the matching uncertainty. It can identify sub-networks within the entire metabolic network based on feature-level summary statistics, enhancing biological interpretation. Lastly, BAUM demonstrates robustness even when dealing with small sample sizes, as demonstrated in our real data analysis.

METHODS

Overview

We develop a Bayesian constrained latent factor model to characterize the observed feature-level test statistics and link them to the metabolite behavior and the clinical outcome variable (Figure 1). Generally, the test statistics of the features are the observed data, and we consider all the metabolite scores to be unobserved. The observed test statistic of a feature is considered to be a linear combination of the unobserved scores of its matched metabolites. The metabolite scores are to be inferred by the model. The weights reflect the confidence level of the metabolite-feature annotation, and are to be estimated from the data. The metabolites are segregated into two latent classes—the clinically relevant class and the clinically irrelevant class. The two classes have different distributions of metabolite scores. Metabolites that are connected on the KEGG metabolic network are more likely to belong to the same class.

$The overall setup of BAUM. The observed data for model include the feature-level summary statistics $\boldsymbol{r}=(r_{1},\ldots ,r_{p})^{\top }$ computed from the observed metabolic feature and the clinical outcome (optional), the potential feature-metabolite matches and their confidence measures $\mathbf{Q}=(\boldsymbol{q}_{i})$, and the known metabolic network structure C. The output of the model include the FDR for each metabolite, and the strength of each feature-metabolite matching. We use a Bayesian latent factor model to characterize the observed feature summary statistics and link them to the unobserved metabolite behavior. We assign a Multinomial prior with prior probabilities $\boldsymbol{q}_{i}$ to matching indicators $\boldsymbol \lambda _{i}$, a normal prior to the null component score $\eta _{0}$, a Dirichlet process prior to the alternative component score $\eta _{j}$ and a weighted Potts prior to metabolite latent class indicators $\boldsymbol z$. Section 2 provides details of model setup. Generally, the observed summary statistic of a feature is considered to be a linear combination of the unobserved scores of its linked metabolites. The weights reflect the confidence level of the metabolite-feature annotation, and are to be estimated from the data. The metabolites are segregated into two latent classes—the clinically relevant class (alternative component), and the clinically irrelevant class (null component). The two classes have different distributions of metabolite scores. Metabolites that are connected on the metabolic network are more likely to belong to the same class.$

Figure 1

The overall setup of BAUM. The observed data for model include the feature-level summary statistics |$\boldsymbol{r}=(r_{1},\ldots ,r_{p})^{\top }$| computed from the observed metabolic feature and the clinical outcome (optional), the potential feature-metabolite matches and their confidence measures |$\mathbf{Q}=(\boldsymbol{q}_{i})$|⁠, and the known metabolic network structure C. The output of the model include the FDR for each metabolite, and the strength of each feature-metabolite matching. We use a Bayesian latent factor model to characterize the observed feature summary statistics and link them to the unobserved metabolite behavior. We assign a Multinomial prior with prior probabilities |$\boldsymbol{q}_{i}$| to matching indicators |$\boldsymbol \lambda _{i}$|⁠, a normal prior to the null component score |$\eta _{0}$|⁠, a Dirichlet process prior to the alternative component score |$\eta _{j}$| and a weighted Potts prior to metabolite latent class indicators |$\boldsymbol z$|⁠. Section 2 provides details of model setup. Generally, the observed summary statistic of a feature is considered to be a linear combination of the unobserved scores of its linked metabolites. The weights reflect the confidence level of the metabolite-feature annotation, and are to be estimated from the data. The metabolites are segregated into two latent classes—the clinically relevant class (alternative component), and the clinically irrelevant class (null component). The two classes have different distributions of metabolite scores. Metabolites that are connected on the metabolic network are more likely to belong to the same class.

Open in new tab Download slide

Let |$p$| denote the number of observed metabolic features and |$k$| denote the number of unobserved metabolites. For |$i=1,\dots ,p$|⁠, |$r_{i}$| denotes the feature-level summary statistics generated by a statistical test that may or may not involve clinical outcomes. For any |$i$| and |$j$|⁠, denote |$q_{ij}\in [0, 1]$| the confidence measure of matching feature |$i$| to metabolite |$j$|⁠, which can be calculated based on the multiple matching status and other characteristics of each feature [33]. Let |$\boldsymbol q_{i} = (q_{i1}, \ldots , q_{ik})^{\top }$| and |$\sum _{j=1}^{k} q_{ij}=1$| for all |$i$|⁠. Denote |$\mathbf{C} = (c_{jl})$| the adjacency matrix for the metabolic network, where |$c_{jl}$| is 1 if there is an edge between metabolite |$j$| and metabolite |$l$|⁠, and 0 otherwise. The observed data include feature-level summary statistics |$r_{i}$|⁠, potential feature-metabolite matches and their prior biological confidence measures |$\boldsymbol{q}_{i}$|⁠, and the metabolic network structure |$\mathbf{C}$|⁠. We assume the feature-level summary statistics are obtained prior to using BAUM and the feature matrix (and possibly the clinical outcome) is not part of the observed data. The output of the model includes the FDR for each metabolite, and the strength of each feature-metabolite matching.

Model

One-hot constrained factor analysis model for matching uncertainty

We develop a factor analysis approach with one-hot constraints to model the matching uncertainty between observed metabolite features and unobserved metabolites,

$$ \begin{align} r_{i} = \sum_{j=1}^{k} \lambda_{ij}\eta_{j}^{*} + \epsilon_{i}, \quad \epsilon_{i}\overset{i.i.d.}{\sim}\text{N}(0, \sigma^{2})\end{align} $$

(1)

for |$i=1, \ldots , p$|⁠, where |$\eta _{j}^{*}$| is the latent score for metabolite |$j$| (see Section 2.2.2) and |$\lambda _{ij}$| is the binary matching indicator between feature |$i$| and metabolite |$j$|⁠, with |$\lambda _{ij} = 1$| denoting a match and |$\lambda _{ij} = 0$| otherwise. Consequently, for |$i=1,\ldots ,p$|⁠, the observed likelihood for summary statistics is |$(r_{i} \; | \; \eta _{j}^{*}, \sigma ^{2}) \sim \text{N}(\eta _{j}^{*}, \sigma ^{2})$| for |$j$| such that feature |$i$| and metabolite |$j$| are matched. Since only one metabolite can be the true match of a feature, for all |$i$|⁠, we require |$\boldsymbol \lambda _{i} = (\lambda _{i1}, \ldots , \lambda _{ik})^{\top }$| to be a one-hot binary vector, that is, |$\lambda _{ij}=1$| if and only if |$\lambda _{il}=0$| for |$l \neq j$|⁠. Then, with the matching confidence measure |$\boldsymbol q_{i}$|⁠, the prior of |$\boldsymbol{\lambda }_{i}$| is |$\text{Multinomial}(1, \boldsymbol{q_{i}})$|⁠, where |$\lambda _{ij} = 0$| if |$q_{ij} = 0$|⁠, while |$\lambda _{ij}$| may take either 0 or 1 if |$q_{ij}> 0$|⁠, for all |$i=1,\ldots ,p$|⁠.

Mixture model for latent metabolite scores

We model the latent metabolite scores |$\eta _{j}^{*}$| by a mixture distribution of a null component and a alternative component,

$$ \begin{align} \eta_{j}^{*}\sim\pi_{0} g_{0}(\eta_{j}^{*}) + \pi_{1} g_{1}(\eta_{j}^{*})\end{align} $$

(2)

where |$\pi _{0}$| and |$\pi _{1}$| are the proportions of metabolites that are clinically irrelevant and clinically relevant, respectively; functions |$g_{0}$| and |$g_{1}$| represent the densities for the two components. The null component density |$g_{0}$| contributes to the null distribution of the summary statistics which can be obtained by transforming |$p$|-values from statistical tests. Such summary statistics usually have a null of centered normal if properly transformed. Therefore, we model the null component by a centered Gaussian, i.e. |$g_{0}\sim \text{N}(0, \gamma _{0})$|⁠. To allow more flexibility of the alternative component, we utilize a DPM, i.e. |$g_{1}\sim \mathcal{DP}(P_{0}, \tau )$|⁠, where |$\gamma _{0}$| is the null component variance, |$P_{0}$| is the base measure of a Dirichlet process defined on |$\mathcal{R}\times [0, \infty )$| and |$\tau $| is the precision parameter. Equivalently, we can represent this mixture model by

$$ \begin{align} \eta_{j}^{*} = \eta_{0}^{1-z_{j}}\eta_{j}^{z_{j}}, \quad \eta_{0} \; | \; \gamma_{0} \; \sim \; \text{N}(0, \gamma_{0}), \quad \eta_{j} \; \sim \; \mathcal{DP}(P_{0}, \tau)\end{align} $$

(3)

where |$\eta _{j}$| is the latent score for clinically relevant metabolite |$j$|⁠, |$\eta _{0}$| is the latent score for all clinically irrelevant metabolites, and |$z_{j}$| is the binary latent class label for metabolite |$j$| (see Section 2.2.3). The latent metabolite score |$\eta _{j}^{*} = \eta _{j}$| if metabolite |$j$| is clinically relevant (⁠|$z_{j} = 1$|⁠, the alternative component) and |$\eta _{j}^{*} = \eta _{0}$| if metabolite |$j$| is clinically irrelevant (⁠|$z_{j} = 0$|⁠, the null component). The proportion of clinically relevant metabolites |$\pi _{0} = \Pr (z_{j} = 0)$| and the proportion of clinically relevant metabolites |$\pi _{1}=\Pr (z_{j} = 1)=1-\pi _{0}$|⁠.

Weighted Potts prior for latent metabolite class labels

To incorporate the topological structure of the metabolic network, we assign a weighted Potts prior [22] to latent metabolite class labels |$\boldsymbol{z}=(z_{1},\ldots ,z_{k})^{\top }$|⁠, with the following probability mass function up to a scaling factor,

$$ \begin{align*} \pi(\boldsymbol{z}\;|\;\boldsymbol\pi, \boldsymbol\rho, \boldsymbol w, \mathbf{C}) \propto \exp\Bigg\{\sum_{j=1}^{k}\Big[\tilde{w}_{j}\log\pi_{z_{j}}+\rho_{z_{j}}\sum_{l\ne j}^{k} w_{l} c_{lj} \text{I}_{\{z_{l}=z_{j}\}}\Big]\Bigg\} \end{align*} $$

where |$\text{I}_{A}$| is the indicator function of event |$A$|⁠, |$\boldsymbol \pi = (\pi _{0}, \pi _{1})^{\top }$|⁠, |$\boldsymbol \rho = (\rho _{0}, \rho _{1})^{\top }$| with |$\rho _{0}, \rho _{1} \ge 0$|⁠, |$\boldsymbol w = (w_{1}, \ldots , w_{k})^{\top }$| with |$w_{j} \ge 0$|⁠, and |$\mathbf{C} = (c_{jl})$| is the metabolic network adjacency matrix. The parameter |$\boldsymbol \pi $| is the prior knowledge on the proportions of clinically irrelevant and relevant metabolites. The parameter |$\boldsymbol \rho $| controls the global strength of the neighbourhood similarity of |$\boldsymbol z$|⁠, while weights |$\boldsymbol w$| controls the local similarity level. The neighborhood weight |$\tilde{w}_{j}=\sum _{l=1}^{k} c_{lj}w_{l} / \sum _{l=1}^{k} c_{lj}$| represents the average neighborhood weight for metabolite |$j$|⁠.

Posterior inferences

Equivalent model representation for latent metabolite scores

For efficient posterior computation, we construct an equivalent model representation of the alternative component in (3), i.e. |$\eta _{j} \; \sim \; \mathcal{DP}(P_{0}, \tau )$|⁠, by following [27] that DPM models can be obtained by taking the limit as the number of clusters goes to infinity. We employ the stick-breaking prior to approximate the Dirichlet process [30]. Let |$G$| denote the number of clusters, and |$\mathbf{K}=(K_{1},\ldots ,K_{k})^{\top }$| denote the cluster labels for metabolites. Let Categorical|$(\boldsymbol{p})$| denote the categorical distribution with probabilities |$\boldsymbol{p}=(p_{1},\ldots ,p_{G})^{\top }$| and |$\sum _{g}^{G} p_{g} = 1$|⁠, i.e. if |$K_{j}\sim \text{Categorical}(\boldsymbol{p})$| then |$\Pr (K_{j}=g)=p_{g}$|⁠, for |$g = 1,\dots ,G$|⁠. Denote |$\boldsymbol{m}=(m_{1},\ldots ,m_{G})^{\top }$| and |$\boldsymbol{\gamma }=(\gamma _{1},\ldots ,\gamma _{G})^{\top }$| as the cluster means and variances, respectively. Then, as |$G\rightarrow \infty $|⁠, (3) is equivalent to the following, for |$j=1,\dots ,k$|⁠,

$$ \begin{align*} \eta_{j} \; | \; \boldsymbol m, \boldsymbol\gamma, \mathbf{K}\sim\text{N}\left(\sum_{g=1}^{G} \text{I}_{\{K_{j} = g\}}m_{g}, \sum_{g=1}^{G} \text{I}_{\{K_{j} = g\}}\gamma_{g}\right), \ \ K_{j} \; | \; \boldsymbol p\sim\text{Categorical}(\boldsymbol p), \end{align*} $$

where |$\boldsymbol{p}\sim \pi (\boldsymbol{s}, \boldsymbol{t})$| is the stick-breaking prior parameterized by |$\boldsymbol{s}$| and |$\boldsymbol{t}$|⁠.

Hyperpriors and hyperparameters

Denote |$\text{Gamma}(a, b)$| as the Gamma distribution and |$\text{IG}(a, b)$| as the inverse Gamma distribution, both with shape |$a$| and rate |$b$|⁠. We assign conjugate inverse Gamma priors on the noise variance |$\sigma ^{2}$|⁠, the null component variance |$\gamma _{0}$| and the alternative component cluster variances |$\gamma _{g}$| for |$g=1,\ldots ,G$|⁠, i.e. |$\sigma ^{2}\sim \text{IG}(a_{1}, b_{1})$|⁠, |$\gamma _{0}\sim \text{IG}(a_{2}, b_{2})$| and (⁠|$\gamma _{g}\;|\;\beta _{g}) \sim \text{IG}(a_{3}, \beta _{g})$| for all |$g$|⁠, where |$a_{1}$|⁠, |$a_{2}$|⁠, |$a_{3}$|⁠, |$b_{1}$| and |$b_{2}$| are hyperparameters, and |$\beta _{g}$| has a conjugate Gamma prior with hyperparameters |$a_{4}$| and |$b_{4}$|⁠, i.e. |$\beta _{g} \sim \text{Gamma}(a_{4}, b_{4})$|⁠. For the alternative cluster means |$m_{g}$| for all |$g$|⁠, we assume a normal prior |$(m_{g} \; | \; \sigma ^{2}_{g}) \sim \text{N}(\mu _{g}, \sigma ^{2}_{g})$| with |$\sigma ^{2}_{g} \sim \text{IG}(a_{5}, b_{5})$| where |$\mu _{K_{j}}$|⁠, |$a_{5}$| and |$b_{5}$| are hyperparameters.

The number of clusters |$G$| in the alternative component, the cluster means |$\mu _{g}$| and the proportion of clinically relevant metabolites |$\pi _{1}$| need to be ideally prespecified based on the unknown distribution of metabolite scores. However, in practice, we can use the distribution of feature summary statistics as a close surrogate to the distribution of latent metabolite scores to determine these hyperparameters. This is because |$r_{i}$| is a noisy copy of |$\eta _{j}^{*}$| (i.e. |$r_{i}=\eta _{j}^{*}+\epsilon _{i}$|⁠) when feature |$i$| matches metabolite |$j$|⁠. We suggest |$\mu _{g}$| for |$g=1,\ldots ,G$| being evenly spaced and cover the range of summary statistics, and then |$G$| is determined according to the interval length between two neighbouring |$\mu _{g}$|s and the range of |$\mu _{g}$|s. We set |$\pi _{1}$| to be the proportion of significant features, with |$\pi _{0} = 1-\pi _{1}$|⁠. While these estimates may not perfectly reflect the unknown distribution of metabolite scores, they provide reasonable prior knowledge. Accurate estimates can be obtained through posterior inferences. For the weighted Potts prior, we set |$\rho _{0} = \rho _{1} = 0.1$| and |$w_{j}=1$| for all |$j$|⁠. In addition, we advise using a tight prior for the null component, i.e. |$a_{2} \gg b_{2}$|⁠, to separate it from the alternative component. Because a substantial constitution of BAUM is latent, we recommend assigning informative tight priors to variances, with specific settings described in each application.

We provide extensive sensitivity analysis demonstrating that BAUM remains very stable under changes in these tight priors in real-world applications. BAUM selects highly consistent pathways across different hyperparameter settings (see Supplementary Materials S3 for details).

Posterior sampling, parameters of interest and FDR control

We develop a blocked Gibbs sampler for posterior inferences. We rely on an equivalent model representation described in Section 2.3.1 for DPM and utilize a blocked Gibbs sampling algorithm for efficient posterior sampling of |$\eta _{j}$|’s. We obtain posterior samples of the latent metabolite class labels |$\boldsymbol{z}$| by the Swendsen–Wang graph partition algorithm [22, 32] with the weighted Potts prior. Full conditionals for other parameters can be derived in a standard manner. We provide the full conditionals of each parameter and a group updating scheme of |$\boldsymbol{z}$| based on the Swendsen–Wang algorithm in Supplementary Materials S1.

The main parameters of interest are |$\boldsymbol{z}$| and |$\boldsymbol \lambda _{i}$| for all |$i$|⁠. We estimate the posterior inclusion probability of metabolite |$j$| by the posterior mean of |$z_{j}$|⁠. We then control FDR at level |$\alpha $| [34]. Denote the posterior inclusion probabilities as |$u_{1},\ldots , u_{k}$|⁠. We first sort |$\{u_{j}\}^{k}_{j=1}$| in descending order to obtain |$\{u_{(l)}\}^{k}_{l=1}$|⁠. Then, let |$\phi _{\alpha } = u_{\xi }$| with |$\xi = \max \{l^{*}: (l^{*})^{-1}\sum _{l=1}^{l^{*}} (1-u_{(l)}) \le \alpha \}$|⁠. We determine metabolite |$j$| as significant if |$u_{j}> \phi _{\xi }$|⁠.

We estimate the posterior confidence measure of matching feature |$i$| and metabolite |$j$| by the posterior mean of |$\lambda _{ij}$|⁠, denoted by |$\hat{\lambda }_{ij}$|⁠, and determine feature |$i$| matches to metabolite |$j$| if |$j=\arg \max _{l}\{\hat{\lambda }_{il}\}$|⁠. For all our analyses, we employ a burn-in of 1000 steps followed by 4000 steps for inferences, and set FDR |$\alpha =0.2$|⁠. We check convergence by trace plots and auto-regressive correlation plots.

Post-processing—a heuristic approach for quick estimation of metabolite abundance

We estimate the subject-specific metabolic values are by a convex combination of all features, where the weights are based on the estimate of matching uncertainty (e.g. the posterior mean of |$\lambda _{ij}$|⁠). Specifically, denote the posterior mean of |$\lambda _{ij}$| by |$\lambda _{ij}^{*}$|⁠, and |$\boldsymbol \lambda _{j}^{*} = (\lambda _{1j}, \lambda _{2j},..., \lambda _{nj})^{\top }$|⁠. Then, we can estimate the metabolite |$j$| abundance of subject |$s$| by |$\hat{m}_{sj}=c\boldsymbol \lambda _{j}^{*T}\boldsymbol{x}_{s}$|⁠, where |$\boldsymbol{x}_{s}\in \mathbb{R}^{n}$| is the value of the |$n$| features for subject |$s$|⁠, and |$c$| is a scaling factor such that the weights of feature values sum up to 1.

SIMULATIONS

We perform extensive and realistic simulations to evaluate the performance of BAUM, varying a number of network specifications, including feature count |$p$|⁠, metabolite count |$k$|⁠, the alternative component, unmatched metabolite percentage, potential feature-metabolite matchings and the metabolite network structures. We consider four simulation scenarios, two based on generative networks (GN1 and GN2) and two based on real-world feature-metabolite networks (RN1 and RN2). Table 1 summarizes the key differences between these scenarios. In all scenarios, metabolite labels are determined based on their vertex degrees, with higher-degree metabolites more likely to belong to the alternative component. The alternative component’s metabolite scores may vary, but the null component always has a score of zero. In GN1 and GN2, we set |$p=k=1000$| and generate metabolite networks using the Barabasi–Albert model [35]. We determine the latent metabolite labels based on their vertex degrees. Metabolites with the top 2% degrees, along with their one-step neighbors are assigned to the alternative component (⁠|$z_{j} = 1$|⁠). The alternative component’s metabolite scores follow i.i.d. |${\mathrm{N}}(10, 1)$|⁠. Then, to mimic the problem of matching uncertainty, we set up to five metabolites to be potential matches to a feature, while only one metabolite is the true match. On the other hand, we allow for a metabolite connecting with a maximum of three features. GN1 ensures every metabolite has at least one potential matching to features, while GN2 introduces 50% unmatched metabolites to mimic real-world conditions. For RN1 and RN2, we utilize the network from the COVID-19 metabolomics data [36, 37] analyzed in Section 4.1 with |$p=1153$| features and |$k=1093$| metabolites. The metabolite network used in RN1 and RN2 contains 13% metabolites without potential matches. In RN1, we assign the metabolite labels and metabolite latent scores as in GN1 and GN2, while we consider a more complex design in RN2. In RN2, 10% metabolites are randomly chosen initially, and a metabolite with a larger degree has a higher chosen probability. Then, the |$d$|-step neighbors of the initially chosen metabolites are chosen with a decaying probability of |$0.8^{d}$|⁠. All of the chosen metabolites are assigned to the alternative component while the remaining to the null component. Metabolites in the alternative component have scores follow i.i.d. |$\chi ^{2}(10)$|⁠. Finally, feature-level summary statistics are generated according to (1).

Table 1

Open in new tab

Differences in the four simulation scenarios. In both GN1 and GN2, there are |$p=1000$| features and |$k=1000$| metabolites, and the alternative components (AC) are |${\mathrm{N}}(10, 1)$|⁠. The metabolite networks are generated by simulating scale-free networks. The percentage of unmatched metabolites (% UM) is 50% in GN2 while every metabolite has potential matchings to features in GN1. In both RN1 and RN2, potential feature-metabolite matchings and the metabolite networks are obtained from the ST001849 COVID-19 metabolomics data, where we use |$p=1153$| features and |$k=1093$| human metabolites. The metabolite network has 13% metabolites with no potential matchings to any features. In RN1, the alternative component is |${\mathrm{N}}(10, 1)$| while, in RN2, it is |$\chi ^{2}(10)$|

Settings	\|$\boldsymbol p$\|	\|$\boldsymbol k$\|	Network	AC	% UM
GN1	1000	1000	Scale-free	\|${\mathrm{N}}(10, 1)$\|	0%
GN2	1000	1000	Scale-free	\|${\mathrm{N}}(10, 1)$\|	50%
RN1	1153	1093	COIVD-19	\|${\mathrm{N}}(10, 1)$\|	13%
RN2	1153	1093	COIVD-19	\|$\chi ^{2}(10)$\|	13%

Settings	\|$\boldsymbol p$\|	\|$\boldsymbol k$\|	Network	AC	% UM
GN1	1000	1000	Scale-free	\|${\mathrm{N}}(10, 1)$\|	0%
GN2	1000	1000	Scale-free	\|${\mathrm{N}}(10, 1)$\|	50%
RN1	1153	1093	COIVD-19	\|${\mathrm{N}}(10, 1)$\|	13%
RN2	1153	1093	COIVD-19	\|$\chi ^{2}(10)$\|	13%

AC, alternative component; % UM, percentage of unmatched metabolites.

Table 1

Open in new tab

Settings	\|$\boldsymbol p$\|	\|$\boldsymbol k$\|	Network	AC	% UM
GN1	1000	1000	Scale-free	\|${\mathrm{N}}(10, 1)$\|	0%
GN2	1000	1000	Scale-free	\|${\mathrm{N}}(10, 1)$\|	50%
RN1	1153	1093	COIVD-19	\|${\mathrm{N}}(10, 1)$\|	13%
RN2	1153	1093	COIVD-19	\|$\chi ^{2}(10)$\|	13%

Settings	\|$\boldsymbol p$\|	\|$\boldsymbol k$\|	Network	AC	% UM
GN1	1000	1000	Scale-free	\|${\mathrm{N}}(10, 1)$\|	0%
GN2	1000	1000	Scale-free	\|${\mathrm{N}}(10, 1)$\|	50%
RN1	1153	1093	COIVD-19	\|${\mathrm{N}}(10, 1)$\|	13%
RN2	1153	1093	COIVD-19	\|$\chi ^{2}(10)$\|	13%

AC, alternative component; % UM, percentage of unmatched metabolites.

For GN1, GN2 and RN1, we utilize a single cluster (⁠|$G=1$|⁠) to simplify the alternative component as a Gaussian distribution. The mean of the alternative component (⁠|$m_{1}$|⁠) is degenerate at 10. In contrast, for RN2, we employ 21 clusters (⁠|$G=21$|⁠) for the alternative component. These clusters have prior means (⁠|$\mu _{g}$|⁠) taking integers from 5 to 25 based on feature summary statistics. For RN2, we specify |$a_{5}=1\text{e}4$| and |$b_{5}=1$|⁠, while |$a_{5}$| and |$b_{5}$| remain unspecified in the other scenarios since |$m_{1}$| is degenerate. Common parameters across all scenarios include |$a_{1}=2\text{e}4$|⁠, |$a_{2}=a_{3}=a_{4}=b_{1}=1\text{e}4$| and |$b_{2}=b_{4}=1$|⁠. Using a histogram of feature-level statistics, we determine |$\pi _{1} = 0.15$| for GN1 and GN2, |$\pi _{1}=0.2$| for RN1 and |$\pi _{1}=0.4$| for RN2. We repeat each scenarios for 100 times.

We evaluate the metabolite selection performance on matched metabolites since unmatched metabolites do not directly contribute to the data likelihood, making their latent class inferences highly challenging. For matching estimation, instead of checking if BAUM identifies true feature-metabolite matches, we assess its capability to match the features correctly to the null component or to the alternative component, because it is intrinsically difficult to distinguish between potential matches when a feature has potentially matches to multiple metabolites with similar latent scores.

Table 2(a) presents a comparison between BAUM and local FDR control (LocFDR) for metabolite selections. We also include the performance of LocFDR after incorporating BAUM’s matching uncertainty estimation (Post-LocFDR). There are methods of Bayesian local FDR control [22] in literature but such methods are not designed for the matching uncertainty estimation and can encounter numerical issues. For LocFDR, we assume all metabolites potentially matched to a feature have the same matching probability. We compute metabolite-specific statistics as weighted averages of feature statistics for features potentially matched to the metabolite. For Post-LocFDR, we compute metabolite-specific statistics using the posterior probability of matching as weights. Metabolite selections are based on the metabolite-specific statistics while controlling the local FDR at 0.2. Table 2(b) shows the feature-metabolite matching performance only for BAUM, as it is the first method to quantify the feature-metabolite matching uncertainty.

Table 2

Open in new tab

Simulation results for two generative network scenarios (GN1 and GN2) and two real-network scenarios (RN1 in (c) and RN2) on (a) metabolite inferences and (b) matching estimations

(a) Metabolite inferences in different simulation scenarios and for different methods. We compare BAUM with LocFDR and Post-LocFDR.
Settings	Methods	ACC	AUC	FPR	TPR
GN1	BAUM	95.1% (0.8%)	93.3% (1.4%)	4.3% (0.7%)	91.0% (2.7%)
	LocFDR	91.0% (1.0%)	92.9% (2.2%)	9.6% (1.2%)	95.3% (4.5%)
	Post-LocFDR	95.6% (0.8%)	91.4% (1.4%)	3.0% (0.8%)	85.9% (2.7%)
GN2	BAUM	99.1% (0.5%)	97.4% (1.4%)	0.3% (0.3%)	95.0% (2.8%)
	LocFDR	88.6% (2.2%)	93.5% (1.3%)	13.0% (2.6%)	99.9% (0.4%)
	Post-LocFDR	95.3% (2.5%)	95.8% (1.7%)	4.8% (2.9%)	96.5% (2.5%)
RN1	BAUM	96.7% (0.6%)	91.7% (1.3%)	0.8% (0.4%)	84.1% (2.6%)
	LocFDR	36.7% (2.0%)	62.0% (1.2%)	76.0% (2.4%)	100% (0%)
	Post-LocFDR	83.7% (3.0%)	83.7% (2.8%)	16.3% (3.6%)	83.8% (2.4%)
RN2	BAUM	94.4% (1.1%)	94.2% (1.2%)	4.9% (1.4%)	93.3% (2.0%)
	LocFDR	79.9% (1.9%)	83.4% (1.5%)	32.0% (3.0%)	98.9% (0.7%)
	Post-LocFDR	85.1% (1.9%)	87.7% (1.5%)	23.8% (3.2%)	99.2% (0.6%)
(b) Matching estimation results for BAUM in different simulation scenarios.
Settings	ACC	AUC	FPR	TPR
GN1	96.4% (0.9%)	91.2% (2.0%)	2.2% (0.8%)	85.7% (4.0%)
GN2	97.8% (0.6%)	97.2% (1.1%)	2.1% (0.7%)	96.5% (2.1%)
RN1	94.2% (0.9%)	87.0% (2.1%)	0.7% (0.2%)	74.8% (4.2%)
RN2	88.6% (1.4%)	87.3% (1.5%)	4.7% (1.2%)	79.3% (2.6%)

(a) Metabolite inferences in different simulation scenarios and for different methods. We compare BAUM with LocFDR and Post-LocFDR.
Settings	Methods	ACC	AUC	FPR	TPR
GN1	BAUM	95.1% (0.8%)	93.3% (1.4%)	4.3% (0.7%)	91.0% (2.7%)
	LocFDR	91.0% (1.0%)	92.9% (2.2%)	9.6% (1.2%)	95.3% (4.5%)
	Post-LocFDR	95.6% (0.8%)	91.4% (1.4%)	3.0% (0.8%)	85.9% (2.7%)
GN2	BAUM	99.1% (0.5%)	97.4% (1.4%)	0.3% (0.3%)	95.0% (2.8%)
	LocFDR	88.6% (2.2%)	93.5% (1.3%)	13.0% (2.6%)	99.9% (0.4%)
	Post-LocFDR	95.3% (2.5%)	95.8% (1.7%)	4.8% (2.9%)	96.5% (2.5%)
RN1	BAUM	96.7% (0.6%)	91.7% (1.3%)	0.8% (0.4%)	84.1% (2.6%)
	LocFDR	36.7% (2.0%)	62.0% (1.2%)	76.0% (2.4%)	100% (0%)
	Post-LocFDR	83.7% (3.0%)	83.7% (2.8%)	16.3% (3.6%)	83.8% (2.4%)
RN2	BAUM	94.4% (1.1%)	94.2% (1.2%)	4.9% (1.4%)	93.3% (2.0%)
	LocFDR	79.9% (1.9%)	83.4% (1.5%)	32.0% (3.0%)	98.9% (0.7%)
	Post-LocFDR	85.1% (1.9%)	87.7% (1.5%)	23.8% (3.2%)	99.2% (0.6%)
(b) Matching estimation results for BAUM in different simulation scenarios.
Settings	ACC	AUC	FPR	TPR
GN1	96.4% (0.9%)	91.2% (2.0%)	2.2% (0.8%)	85.7% (4.0%)
GN2	97.8% (0.6%)	97.2% (1.1%)	2.1% (0.7%)	96.5% (2.1%)
RN1	94.2% (0.9%)	87.0% (2.1%)	0.7% (0.2%)	74.8% (4.2%)
RN2	88.6% (1.4%)	87.3% (1.5%)	4.7% (1.2%)	79.3% (2.6%)

Summary statistics mean (SD) are based on 100 replicates.

Table 2

Open in new tab

Simulation results for two generative network scenarios (GN1 and GN2) and two real-network scenarios (RN1 in (c) and RN2) on (a) metabolite inferences and (b) matching estimations

(a) Metabolite inferences in different simulation scenarios and for different methods. We compare BAUM with LocFDR and Post-LocFDR.
Settings	Methods	ACC	AUC	FPR	TPR
GN1	BAUM	95.1% (0.8%)	93.3% (1.4%)	4.3% (0.7%)	91.0% (2.7%)
	LocFDR	91.0% (1.0%)	92.9% (2.2%)	9.6% (1.2%)	95.3% (4.5%)
	Post-LocFDR	95.6% (0.8%)	91.4% (1.4%)	3.0% (0.8%)	85.9% (2.7%)
GN2	BAUM	99.1% (0.5%)	97.4% (1.4%)	0.3% (0.3%)	95.0% (2.8%)
	LocFDR	88.6% (2.2%)	93.5% (1.3%)	13.0% (2.6%)	99.9% (0.4%)
	Post-LocFDR	95.3% (2.5%)	95.8% (1.7%)	4.8% (2.9%)	96.5% (2.5%)
RN1	BAUM	96.7% (0.6%)	91.7% (1.3%)	0.8% (0.4%)	84.1% (2.6%)
	LocFDR	36.7% (2.0%)	62.0% (1.2%)	76.0% (2.4%)	100% (0%)
	Post-LocFDR	83.7% (3.0%)	83.7% (2.8%)	16.3% (3.6%)	83.8% (2.4%)
RN2	BAUM	94.4% (1.1%)	94.2% (1.2%)	4.9% (1.4%)	93.3% (2.0%)
	LocFDR	79.9% (1.9%)	83.4% (1.5%)	32.0% (3.0%)	98.9% (0.7%)
	Post-LocFDR	85.1% (1.9%)	87.7% (1.5%)	23.8% (3.2%)	99.2% (0.6%)
(b) Matching estimation results for BAUM in different simulation scenarios.
Settings	ACC	AUC	FPR	TPR
GN1	96.4% (0.9%)	91.2% (2.0%)	2.2% (0.8%)	85.7% (4.0%)
GN2	97.8% (0.6%)	97.2% (1.1%)	2.1% (0.7%)	96.5% (2.1%)
RN1	94.2% (0.9%)	87.0% (2.1%)	0.7% (0.2%)	74.8% (4.2%)
RN2	88.6% (1.4%)	87.3% (1.5%)	4.7% (1.2%)	79.3% (2.6%)

(a) Metabolite inferences in different simulation scenarios and for different methods. We compare BAUM with LocFDR and Post-LocFDR.
Settings	Methods	ACC	AUC	FPR	TPR
GN1	BAUM	95.1% (0.8%)	93.3% (1.4%)	4.3% (0.7%)	91.0% (2.7%)
	LocFDR	91.0% (1.0%)	92.9% (2.2%)	9.6% (1.2%)	95.3% (4.5%)
	Post-LocFDR	95.6% (0.8%)	91.4% (1.4%)	3.0% (0.8%)	85.9% (2.7%)
GN2	BAUM	99.1% (0.5%)	97.4% (1.4%)	0.3% (0.3%)	95.0% (2.8%)
	LocFDR	88.6% (2.2%)	93.5% (1.3%)	13.0% (2.6%)	99.9% (0.4%)
	Post-LocFDR	95.3% (2.5%)	95.8% (1.7%)	4.8% (2.9%)	96.5% (2.5%)
RN1	BAUM	96.7% (0.6%)	91.7% (1.3%)	0.8% (0.4%)	84.1% (2.6%)
	LocFDR	36.7% (2.0%)	62.0% (1.2%)	76.0% (2.4%)	100% (0%)
	Post-LocFDR	83.7% (3.0%)	83.7% (2.8%)	16.3% (3.6%)	83.8% (2.4%)
RN2	BAUM	94.4% (1.1%)	94.2% (1.2%)	4.9% (1.4%)	93.3% (2.0%)
	LocFDR	79.9% (1.9%)	83.4% (1.5%)	32.0% (3.0%)	98.9% (0.7%)
	Post-LocFDR	85.1% (1.9%)	87.7% (1.5%)	23.8% (3.2%)	99.2% (0.6%)
(b) Matching estimation results for BAUM in different simulation scenarios.
Settings	ACC	AUC	FPR	TPR
GN1	96.4% (0.9%)	91.2% (2.0%)	2.2% (0.8%)	85.7% (4.0%)
GN2	97.8% (0.6%)	97.2% (1.1%)	2.1% (0.7%)	96.5% (2.1%)
RN1	94.2% (0.9%)	87.0% (2.1%)	0.7% (0.2%)	74.8% (4.2%)
RN2	88.6% (1.4%)	87.3% (1.5%)	4.7% (1.2%)	79.3% (2.6%)

Summary statistics mean (SD) are based on 100 replicates.

As shown in Table 2(a), across all simulations, BAUM consistently achieves superior metabolite selection results. In challenging scenarios with real-world networks, BAUM maintains a good and stable performance. In some scenarios, LocFDR exhibits higher true positive rate at the expense of much higher false positive rate (FPR). Both LocFDR and Post-LocFDR display lower accuracy (ACC), lower area under the curve (AUC) and higher FPR in metabolite selections than BAUM, which shows the importance of jointly modeling matching uncertainty and metabolite behaviour in enhancing metabolite selections. Notably, Post-LocFDR shows substantial improvement over LocFDR in accuracy, AUC and FPR, indicating the informativeness and utility of our matching estimation on metabolite-level inferences. Table 2(b) shows the feature-metabolite matching estimated by BAUM accurately distinguishes the null component and the alternative component for the metabolites, maintaining good accuracy and AUC, even in scenarios with complex real-world networks. In all scenarios, matching FPR is well controlled.

RESULTS

We analyzed a COVID-19 metabolomics dataset and a mouse brain development metabo-lomics dataset using BAUM. BAUM finds pathways that are consistent with existing knowledge, as well as novel pathways that are biologically plausible. We provide hyperparameters used in each application in Supplementary Materials S2. Furthermore, sensitivity analysis showed that BAUM was very robust and selected highly consistent pathways acorss different hyperparameter settings, even with a very small sample size of 16 mice per group in the analysis of the mouse brain data (please see Supplementary Materials S3 for details).

COVID-19 metabolomics data

We analyzed the COVID-19 metabolomics data [36, 37] using BAUM. The dataset (ST001849) was downloaded from the NIH Metabolomics Workbench, which was derived from an untargeted metabolomics study of subjects who were positive of SARS-CoV-2. The purpose was to find indicators of disease severity using baseline metabolomics at admission. After removing subjects with unknown ICU admission status, a total of 269 subjects who were SARS-CoV-2 positive were studied, among which 133 were admitted to ICU. Using day-0 metabolomics data, our analysis tried to find metabolic signatures that can separate those who were admitted to ICU from the less severe cases.

The dataset contained subject-level observations of 5471 metabolic features (3819 positive-ion features and 1652 negative-ion features). After matching the features to metabolites on the KEGG network [38], we kept 1153 features that had at least one match to the human metabolic network and 1093 human metabolites. The kept metabolites are either directly linked to features or part of path between metabolites that are linked to features. Among the 1153 features we studied, 582 (50.5%) were matched to only one metabolite, hence no matching uncertainty was involved. However, 151 (13.1%) features had at least five possible matches to metabolites. From the metabolite perspective, among the 1093 metabolites, 142 (13.0%) metabolites had no match to any features, 334 (30.6%) were matched to only one feature and 217 (20.0%) had at least five feature matches.

We performed marginal distance correlation t-tests [39] to detect non-linear associations between features and the binary ICU status, and used the resulting t-statistics as the feature-specific summary statistics.

We controlled FDR based on the posterior probability of whether the metabolite is clinically relevant to the outcome by adopting the procedure described in [34]. Among the 1093 metabolites, BAUM selected 189 clinically relevant metabolites to the outcome by controlling FDR at level 0.2.

Figure 2 shows the selected sub-networks from the human metabolic network. For each sub-network, the most significant pathway(s) were found using pathways in the metapone package [24] and the hypergeometric test for over-representation [40]. The majority of the sub-networks were part of the central metabolism of amino acids and nucleotides (Figure 2A). It has been found that COVID infection alters amino acid metabolism [41], and the level of changes are linked to disease severity [42]. Besides general amino acid metabolism changes, some specific amino acids were clearly linked to the physiology of COVID infection. We found tyrosine and tryptophan metabolism to be associated with ICU admission. Other studies have found tyrosine metabolic pathway was prominently affected by COVID infection in oral secretion samples, after correcting for stress response of the immune system [43]. The imbalance of the urea cycle can cause severe inflammatory damage. It has been reported that ornithine concentration is higher in critically ill COVID patients. At the same time, arginine concentration is significantly lower, and arginine-ornithine conversion dominates the urea cycle in COVID patients [44]. The degradation of arginine leads to the accumulation of its downstream metabolites and exacerbates the inflammatory response. The increase of aspartate and its downstream product asparagine provides a favorable environment for the translation of viral mRNA [45].

$Selected subnetworks and their corresponding pathways. Green nodes: selected metabolites at FDR $\leq 0.$2; blue nodes: features matched to the metabolites; gray nodes: metabolites that are not selected, but connect selected metabolites. (A) Subnetworks associated with amino acid and nucleotide metabolism. (B) Other subnetworks.$

Figure 2

Selected subnetworks and their corresponding pathways. Green nodes: selected metabolites at FDR |$\leq 0.$|2; blue nodes: features matched to the metabolites; gray nodes: metabolites that are not selected, but connect selected metabolites. (A) Subnetworks associated with amino acid and nucleotide metabolism. (B) Other subnetworks.

Open in new tab Download slide

Another interesting pathway is the caffeine metabolism pathway (Figure 2B). Though a commonplace diet component, caffeine has been shown to be an anti-inflammatory chemical, as well as an immuno-modulator, with a specific effect on airway smooth muscle. It is believed to achieve this function by acting as a phosphodiesterase inhibitor and adenosine receptor antagonist [46]. Among the selected metabolites was caffeine (C07481) itself. In addition, 1-Methylxanthine (C16358), AFMU (C16365) and paraxanthine (C13747) were also selected. Paraxanthine is known for its attenuation effect on the formation of cholestatic liver fibrosis [47].

The linoleate metabolism pathway (Figure 2B) was selected. It has been found that increased serum linoleic acid was associated with more severe symptoms of COVID [48]. Interestingly, a recent structural study suggested that linoleic acid can bind to the spike glycoprotein of the coronavirus, potentially exerting anti-viral effect [49]. In the dataset under study, linoleic acid level is lower on average in the group with more severe symptom. Among the other selected metabolites in the linoleate metabolism pathway was arachidonic acid (C00219), which is known to be an endogenous antiviral metabolite, the lack of which can make the the person less resistant to the coronavirus [50]. The average abundance of arachidonic acid was also lower in the more severe group in the current dataset.

The vitamin B3 (nicotinate and nicotinamide) metabolism pathway (Figure 2B) was also selected by our method. It has been documented that the vitamine B3 pathway, together with tryptophan metabolism pathwasy mentioned above, is altered in severe SARS-COV2 patients [51]. A nutritional intervention with nicotinamide was beneficial when combined with other therapy for coronavirus. A clinical trial has been conducted by NIH (NCT04751604) to study the impact of vitamin B3 on the disease course of COVID-19. Interestingly, the related compound nicotine was also considered potentially beneficial in SARS-COV2 resistance, potentially through the mechanism of nicotinic acetylcholine receptor.

Mouse brain data - development and healthy aging in different brain regions

We analyzed the mouse brain atlas data (ST001637), which was downloaded from the NIH Metabolomics Workbench [52, 53]. The dataset consisted of 480 mice split into 60 groups. The groups were characterized by two genders (male and female), three age points (3, 16 and 59 weeks) and ten brain regions. Each group had eight mice subjects.

The dataset had observations of 17 032 metabolic features (10085 positive- and 6947 negative-ion features). After matching the features to known mouse metabolites, we kept 819 features that had at least one match to the mouse metabolites. We screened out 950 mice metabolites that neither matched to any features nor was along the path between two metabolites which matched to features. Among the 819 kept features, 548 (67.0%) were matched to only one features, while 46 (5.6%) has at least five feature matches. Among the 2145 mice metabolites, 1207 (56.3%) had no match to any features, 598 (27.9%) were matched to one feature and 7 (0.3%) had at least five matched features.

Given the small sample size in each age/gender/brain region combination, we focused on linear relations between metabolites and development/aging. We employed the analysis of variance (ANOVA) to detect metabolic difference among mice of different age groups, in Hippocampus, Olfactory Bulb and Thalamus, respectively. We made two age group comparisons—week 3 versus week 16 (development), and week 16 versus week 59 (healthy aging). For each of the brain region, we performed an ANOVA for each feature controlling for gender. Then we transformed the |$F$|-statistic values by normal quantile transformation, and took the transformed statistics as the feature-level summary.

After selecting significant metabolites at |$\text{FDR} \leq 0.2$| in each comparison group, again the most significant pathway(s) were found using pathways in the metapone package [24] and the hypergeometric test for over-representation. Given we are making separate comparisons for different brain regions and different age group contrasts, and given the small sample size in each comparison (16 vs 16) yielding small number of significant metabolites, we conducted the pathway analysis using all significant metabolites for each comparison. We selected pathways with |$\geq 3$| significant metabolites, as well as with |$P$|-values |$\leq 0.05$|⁠. Figure 3 shows the selected pathways connected with the six comparisons.

Figure 4 shows details of some example pathways. Some pathways are significant in multiple brain regions or age group comparisons. For such pathways, we selected a single brain region and age group comparison that involves the largest number of significant metabolites as example. Figures for all selected pathways in all age group comparisons are in the Supplementary Materials S4.

Figure 3

Mouse brain data results: significant pathways that are associated with different stages and brain regions.

Open in new tab Download slide

$Example significant pathways from the mouse brain data. Green nodes: selected metabolites at FDR $\leq 0.2$; blue nodes: features matched to the metabolites; gray nodes: metabolites that are not selected, but connect selected metabolites. (A) Pathways connected to both development and healthy aging. (B) Pathways connected to development of multiple brain regions. (C) Example pathways connected to development of certain brain regions. (D) Example pathways connected to healthy aging of certain brain regions.$

Figure 4

Example significant pathways from the mouse brain data. Green nodes: selected metabolites at FDR |$\leq 0.2$|⁠; blue nodes: features matched to the metabolites; gray nodes: metabolites that are not selected, but connect selected metabolites. (A) Pathways connected to both development and healthy aging. (B) Pathways connected to development of multiple brain regions. (C) Example pathways connected to development of certain brain regions. (D) Example pathways connected to healthy aging of certain brain regions.

Open in new tab Download slide

The most notable pathway in Figures 3 and 4(A) is the retinol metabolism pathway, which is associated with both hippocampus and thalamus, in both development and aging stages. Retinoids are well known for their important role in nervous system development, as well as the development of many other bodily structures [54]. Retinyl palmitate (C02588) is a common form of retinol derivative in the brain. In the current data, its level is substantially higher in hippocampus and thalamus in adult mice than in baby mice. Supplementation of retinyl palmitate appears to have disruptive effects in developing and adult rat brain [55].

Another group of pathways that is well-known in brain function is the fatty acid metabo-lism pathways (Figures 3 and 4A). Fatty acid derivatives influence many brain functions [56]. The pathways show strong relations with thalamus at the developmental stage, and hippocampus in the healthy aging stage. However we also noticed that some of the metabolites are also linked to the brain regions in other stages. The significant metabolites include L-Palmitoylcarnitine (C02990), Hexadecanoic acid (C00249), Palmitoleic acid (C08362), Stearic acid (C01530), Lauric acid (C02679), Tetradecanoyl-CoA (C02593), Decanoyl-CoA (C05274), etc. Among the metabolites found in this study, steric acid (C01530) was identified as a potential marker for AD and aging in a human study [57]. Lauric acid showed beneficial effects in neuronal maturation and neuroprotection against oxidative stress in cellular and animal models [58].

Two bile acid pathways are widely connected with multiple brain regions in development and aging (Figures 3 and 4A). Bile acids are cholesterol-derived steroid acids that serve as signaling molecules mostly for nutrient availability [59]. The key enzyme in the pathway, CYP7A1, also plays an important role in the clearance of brain cholesterol. Bile acids CA, DCA and CDCA are able to influence neurotransmission [60], and play important roles in gut-liver-brain axis and normal brain functions. Disruption of gut microbiome can cause neurological disorder through bile acid signaling [61]. The selected metabolites in this pathway include 3|$\alpha $|⁠,7|$\alpha $|⁠,12|$\alpha $|⁠,26-Tetrahydroxy-5|$\beta $|-cholestane (C05446), 3|$\alpha $|⁠,7|$\alpha $|⁠,12|$\alpha $|-Trihydroxy-5|$\beta $|-cholestane (C05454), 3|$\alpha $|⁠,7|$\alpha $|⁠,12|$\alpha $|-Trihydroxy-5|$\beta $|-cholestan-26-al (C01301), 3|$\alpha $|⁠,7|$\alpha $|-Dihydroxy-5|$\beta $|-cholestane (C05452), Glycocholic acid (C01921), Palmitic acid (C00249), Lithocholic acid (C03990). Currently, mechanistic studies linking the bile acids and their derivatives to brain development and aging are still scarce. Studies have found that palmitic acid has the potential to trigger neuroinflammation in the brain [62], and the serum level of lithocholic acid, as well as two other bile acids deoxycholic acid and glycoursodeoxycholic acid, are higher in Alzheimer’s patients [63].

Figure 4(B) shows two pathways connected to the development stage. As an example, purines and their derivatives are centrally involved in energy homeostasis and DNA synthesis. In addition, purinergic signalling plays critical roles in the nervous system [64]. Disruptions to purine metabolism can cause a multitude of neurological disorders [65]. The selected metabolites include Guanosine (C00387), Deoxyguanosine (C00330), Inosine (C00294), P1,P4-Bis(5’-guanosyl) tetraphosphate (C01261), P1,P4-Bis(5’-adenosyl) tetraphosphate (Ap4A; C01260) and Guanosine 3’,5’-bis(diphosphate) (C01228). Among them, guanosine is a known neuromodulator [66]. Studies have shown that inosine and Ap4A have neuroprotective effects [67].

Figure 4(C) shows pathways connected to a single brain region either in development. Besides the self-explanatory pathway neuroactive ligan receptor interaction, in which selected metatolites include adenosine (C00212), sphinganine 1-phosphate (C01120), anandamide (C11695) and tryptamine (C00398), another interesting example is the selenoamino acid metabo-lism pathway. The impact of selenoproteins and selenoamino acids has been mostly studied from the perspective of neurodegenerative diseases and food supplements [68]. It has been found that selenomethionine promotes hippocampus neurogenesis in AD [69]. In the current study, the level of selenomethionine (C05335) decreases over the development of the hippocampus.

Figure 4(D) shows pathways connected to a single brain region in healthy aging. As an example, the riboflavin metabolism pathway is associated with healthy aging of olfactory bulb. Ribovlavin is a water-soluble B vitamin that protects against oxidative stress. Its metabolism is of critical importance in brain health, the dysfunction of which can contribute to neurodegenerative disease [70].

Another well-known pathway is C21 steroid hormone biosynthesis and metabolism. Derivatives of two of the three members of the estrogen family, estrone and estradiol were found to be significant. They include Estrone-3-sulfate (E1S, C02538) and Estradiol 3-glucuronide (E2-3G, C05503). E1S is an inactive form of E1, which can be taken up by cells to synthesize E2 [71]. It is a major form of estrone involved in brain-blood efflux transport of estrogens. E2 is known to be associated with aging and menopausal transition, as well as some neurodegenerative disorders, potentially through the cholinergic and dopaminergic systems [72]. Another selected metabolite, Cholesterol sulfate (C18043), is a critical component of the cell membrane, which serves to stabilize membrane structure. It has a neuroprotective effect by reducing oxidative stress [73]. Overall, our method was able to select informative metabolites and pathways, even under very small sample size of 16 mice per group.

DISCUSSION

In this study, we designed a framework BBAUM data, which can select important metabolites that tend to be functionally consistent, make inference on the matching between metabolites and observed features, and incorporate non-linear associations between features and the clinical outcome. BAUM utilizes the existing knowledge graph of the relations between metabolites in the form of a metabolic network. A drawback is that BAUM ignores data features that do not match to known metabolites in the network. On the other hand, as most studies focus on core metabolic pathways, BAUM is powerful in that it can make statistical inference on the metabolites’ association with the outcome and feature-metabolite matching simultaneously, partially resolving the issue of multiple matching. The Bayesian framework makes BAUM very robust, which can make inferences based on small sample sizes, solving a challenge in many metabolomics studies. We used BAUM to analyze two real datasets and obtained biologically meaningful results.

BAUM and similar methods concentrate on features that are matched to KEGG metabolites, as results from such features are more interpretable. This is the route we take in the current analysis. However, it is possible to incorporate unmatched features into the analysis, by matching such features to a bigger database such as HMDB, and even creating “pseudo metabolites” for unmatched features. Such metabolites cannot borrow information from connected metabolites, but their matched features’ association with the outcome can serve as evidence for their selection. Further, it is possible to estimate statistical association networks from metabolites and add to the connections of the known metabolic network, to create a hybrid network for the analysis. This is subject to our future study.

Key Points

We develop a innovative approach for Bayesian Analysis of Untargeted Metabolomics Data (BAUM) to integrate previously separate tasks into a single framework, including matching uncertainty inference, metabolite selection, and functional analysis.
BAUM can identify subnetworks within the entire metabolic network based on feature-level summary statistics, enhancing biological interpretation.
Under the Bayesian framework, BAUM is robust and stable, and can make inferences based on small sample sizes.
Simulations show BAUM can make accurate inferences on the feature-metabolite matchings and metabolite significance.
BAUM finds pathways that conform to existing knowledge as well as novel pathways that are biologically plausible on two real-world datasets.

FUNDING

This work was partially supported by National Institutes of Health (grants R01DA048993, R01MH105561, R01GM124061), National Science Foundation (grant IIS2123777), Guangdong Talent Program (2021CX02Y145), Guangdong Provincial Key Laboratory of Big Data Computing and Shenzhen Key Laboratory of Cross-Modal Cognitive Computing (ZDSYS20230626091302006)

CODE AVAILABILITY

We provide an R package ‘BAUM for analyzing untargeted metabolomics data by our method. The R package is available at https://github.com/guoxuan-ma/BAUM.

DATA AVAILABILITY

The COVID-19 metabolomics dataset (ST001849) was downloaded from the NIH Metabolomics Workbench at https://www.metabolomicsworkbench.org/data/DRCCMetadata.php?Mode=Study&StudyID=ST001849. The mouse brain atlas data (ST001637) were downloaded from the NIH Metabolomics Workbench at https://www.metabolomicsworkbench.org/data/DRCCMetadata.php?Mode=Study&StudyID=ST001637. The preprocessed data of these two datasets are available at https://github.com/guoxuan-ma/BAUM.

Guoxuan Ma is currently a Ph.D. candidate in the Department of Biostatistics of the University of Michigan at Ann Arbor. He received his M.S. inBiostatistics from the University of Michigan at Ann Arbor in 2023 and his B.S. in Statistics from the Chinese University of Hong Kong at Shenzhen in2021. His research interests include Bayesian modeling with applications on biomedical networks and neuroimaging, and asynchronousbrain-computer interfaces.

Jian Kang received his BS in Statistics from Beijing Normal University in 2005 and his MS in Mathematics from Tsinghua University in 2007. He obtained hisPhD in Biostatistics from the University of Michigan, Ann Arbor, in 2011. From 2011 to 2015, Dr. Kang was a faculty member at Emory University. Currently, he isa Professor of Biostatistics at the University of Michigan, Ann Arbor. His main research interests include developing statistical methods and theory for the analysisof large-scale complex biomedical data, with a focus on Bayesian methods, machine learning, imaging statistics, and metabolomics.

Tianwei Yu received his Ph.D. degree in Statistics from UCLA in 2005. Before that, he received Bachelor's and Master's degree from Tsinghua University in 1997and 2000, respectively. He is currently employed as a Professor at the School of Data Science, The Chinese University of Hong Kong, Shenzhen, China. He alsoworks with the Shenzhen Research Institute of Big Data in Shenzhen. His research focuses on bioinformatics, applied statistics, and applied machine learning.

References

Liang

Moutinho

Golan

, et al.

Use of high-resolution metabolomics for the identification of metabolic signals associated with traffic-related air pollution

Environ Int

2018

;

120

145

–

Month:	Total Views:
April 2024	713
May 2024	166
June 2024	117
July 2024	114
August 2024	109
September 2024	117
October 2024	141
November 2024	147
December 2024	97
January 2025	95
February 2025	103
March 2025	160
April 2025	118

Article Contents

Bayesian functional analysis for untargeted metabolomics data with matching uncertainty and small sample sizes

Abstract

INTRODUCTION

METHODS

Overview

Model

One-hot constrained factor analysis model for matching uncertainty

Mixture model for latent metabolite scores

Weighted Potts prior for latent metabolite class labels

Posterior inferences

Equivalent model representation for latent metabolite scores

Hyperpriors and hyperparameters

Posterior sampling, parameters of interest and FDR control

Post-processing—a heuristic approach for quick estimation of metabolite abundance

SIMULATIONS

RESULTS

COVID-19 metabolomics data

Mouse brain data - development and healthy aging in different brain regions

DISCUSSION

FUNDING

CODE AVAILABILITY

DATA AVAILABILITY

References

Supplementary data

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

This Feature Is Available To Subscribers Only