UNMF: a unified nonnegative matrix factorization for multi-dimensional omics data

Abstract

Factor analysis, ranging from principal component analysis to nonnegative matrix factorization, represents a foremost approach in analyzing multi-dimensional data to extract valuable patterns, and is increasingly being applied in the context of multi-dimensional omics datasets represented in tensor form. However, traditional analytical methods are heavily dependent on the format and structure of the data itself, and if these change even slightly, the analyst must change their data analysis strategy and techniques and spend a considerable amount of time on data preprocessing. Additionally, many traditional methods cannot be applied as-is in the presence of missing values in the data. We present a new statistical framework, unified nonnegative matrix factorization (UNMF), for finding informative patterns in messy biological data sets. UNMF is designed for tidy data format and structure, making data analysis easier and simplifying the development of data analysis tools. UNMF can handle a wide range of data structures and formats, and works seamlessly with tensor data including missing observations and repeated measurements. The usefulness of UNMF is demonstrated through its application to several multi-dimensional omics data, offering user-friendly and unified features for analysis and integration. Its application holds great potential for the life science community. UNMF is implemented with R and is available from GitHub (https://github.com/abikoushi/moltenNMF).

factor analysis, nonnegative matrix factorization, metagenome, gene expression analysis, Bayesian statistics, multi-dimensional data

INTRODUCTION

The utilization of multi-dimensional omics technologies is becoming increasingly ubiquitous in all aspects of life science. As an illustration, data such as genome, epigenome, transcriptome, proteome and metabolome can be quantified under varying conditions and at different times, in order to unveil associations between genotype and environmental factors [1]. These diverse forms of biological information can be represented through the use of multi-dimensional arrays, formally referred to as tensors.

Factor analysis is a premier approach for analyzing multi-dimensional data to extract useful patterns and is increasingly being applied in the context of multiomics datasets. This model class encompasses a broad range of analytical methods including principal component analysis to nonnegative matrix factorization. Recently, it has been expanded to model various structures of datasets such as multiple data modalities, sample group structures and temporal or spatial structures [2–4]. Nevertheless, conventional methods are strongly dependent on the format and structure of the data and require modification of analytical strategies and techniques, as well as a significant amount of time spent on data cleaning and preprocessing, when any of these are altered even slightly. Furthermore, many traditional methods cannot be directly applied when the data contain missing values.

In this study, our objective is to address these issues and introduce a novel statistical methodology, named unified nonnegative matrix factorization (UNMF), to discover meaningful patterns in messy biological data sets. Our approach can be applied to multi-dimensional omics data, regardless of their format or structure, and is designed to comply with the principles of tidy data format and structure as recommended by [5]. The aim of our methodology is to simplify data analysis and streamline the development of data analysis tools that can be used cohesively.

The key contributions of our research are as follows:

(1) Our framework can handle a wide array of data structures and formats in a unified manner, facilitating data analysis and allowing for the focus on identifying specific patterns of interest rather than handling uninteresting logistics of the data.
(2) Our model seamlessly works with tensor data that encompass missing observations and repeated measurements, even sparse ones, which would pose challenges for traditional approaches.
(3) We have developed a straightforward and efficient learning procedure of UNMF using variational Bayes and derived a lower bound on marginal likelihood (evidence lower bound; ELBO).

To highlight the usefulness of UNMF, we provide examples of its application to several multi-dimensional omics data. UNMF provides user-friendly and unified features to analyze and integrate multi-dimensional omics data and its application holds great potential for the life science community.

The structure of this paper will proceed as follows. In the ‘Conceptual View’ section, we will discuss in detail why UNMF is not restricted on the formats of the data and how it can be used for predictions and explanations. The ‘Model and variational inference’ section describes the generative model and its learning procedure. The ‘Results’ section contains two sub-section. Firstly, we conduct simulation study to show the behavior of the our estimator. Next, we analyze the more realistic dataset of metagenomics and RNA sequencing.

CONCEPTUAL VIEW

Traditional factor analysis techniques such as principal component analysis [6], nonnegative matrix factorization [7] and latent Dirichlet allocation [8] seek to find matrices |$Z$| and |$W$| such that |$Y \approx ZW$|⁠, given a data matrix |$Y$|⁠. Here, each row of the matrix |$Z$| represents latent variables that describe the structure and characteristics of data, corresponding to |$k$| representative patterns inherent in the data. Additionally, each row of the matrix |$W$| describes the relationship between latent factors and observed variables through a linear mapping. Factor analysis is a powerful approach for extracting valuable patterns in multi-dimensional data, but conventional traditional factor analysis is heavily dependent on the structure and format of the data matrix, rendering them inapplicable when the matrix contains missing values or lacks a clear form.

Unlike existing factor analysis techniques for multi data, UNMF takes in tidy data [5, 9] as input, which is a data format that is coherent in structure and meaning and satisfies the following conditions:

(1) Each variable forms one column.
(2) Each observation forms one row.
(3) The type of observational unit for each observation forms one table.
(4) Each value constitutes one cell.

Figure 1 shows an example of the traditional matrix format and tidy data format using toy data. When adding additional observations to the data set, tidy data format simply requires adding columns, while the normal matrix format requires expanding the multi-dimensional array. Similarly, in the case of missing values, tidy data format only requires deleting rows, while the normal matrix format requires some form of imputation to fill in the missing values for factor analysis.

Figure 1

Visual example of the tidy format. The ‘matrix’ and the ‘tidy’ format have the same information. The gray parts of the ‘tidy’ format can be regarded as explanatory variables.

Open in new tab Download slide

UNMF performs factorization of input data in a manner akin to traditional factor analysis, decomposing them into latent factors, and embeds the samples in a lower dimensional latent space. Simultaneously, UNMF employs linear mapping to establish the relationship between the latent factors and observed variables. Given data matrices represented in tidy data format and a vector of observed data values denoted by |$y=(y_1,\ldots , y_N)^{\prime}$| and a matrix |$X=(x_{nd})$| (⁠|$x_{nd} \in \{0,1\}$|⁠) representing structure and meaning, UNMF finds a nonnegative value matrix |$V$| of |$D \times L$| such that

$$ \begin{align} y_{n} \approx \sum_{l=1}^L \prod_{d=1}^D v_{dl}^{x_{nd}}. \end{align} $$

(1)

Here |$D$| is determined by dimension (number of columns) of |$X$| and |$L$| is a user-specified parameter. A larger |$L$| makes the model more expressive but can also lead to overfitting. The matrix |$V$| describes the co-occurrence relationships of |$y_n$|⁠, and if the vectors |$v_n = (\prod _{d} v_{d1}^{x_{nd}}, \ldots , \prod _{d} v_{dL}^{x_{nd}})$| characterized by non-zero elements of the explanatory variable |$X$| are similar in direction, |$y$| will have higher values.

An example of matrix factorization in tidy data format using UNMF is shown in Figure 2. Suppose an experiment using two subjects yields observations of three features and the second subject has a missing value for the third feature which are represented by a matrix |$Y$|⁠. The tidy data format of the matrix |$Y$| can be represented using a vector |$y$| and a matrix |$X$|⁠. Here, the indicator |$i$| represents the sample, the indicator |$j$| represents the feature and the indicator |$h$| represents the number of rows in the tidy data. In the tidy data format, the index of |$y_{nk}$| is represented by the explanatory variable shown in the following equation. |$X$| represents a one-hot vector that indicates the rows and columns of matrix |$y_{nk}$|⁠. For instance, the subscripts |$(n,k)=(1,1)$| and |$(n,k)=(2,1)$| of matrix |$Y$| correspond to the first and second row vectors |$x_1=(1,0,1,0,0)$| and |$x_2=(0,1,1,0,0)$| of matrix |$X$|⁠. It should be noted here that the information content of the two data formats does not change.

Figure 2

Visual example to understand the equation 1. The gray circles indicate observed variables, and white indicates latent variable.

Open in new tab Download slide

UNMF performs factorization of input data represented in tidy data format in a manner similar to traditional factor analysis and approximates it as in the following equation:

$$ \begin{align*} y_h (\, =y_{nk}) \approx \sum_{l=1}^L \prod_{d=1}^D v_{dl}^{x_{nd}} = V_{(1)}V_{(2)}^{\prime}, \end{align*} $$

where |$V_{(1)} = (\prod _{d=1}^2 v_{dl}^{x_{nd}})$| and |$(V_{(2)} = \prod _{d=3}^4 v_{dl}^{x_{nd}})$|⁠.

UNMF also deals with cases where subjects who are observed in the first instance are not observed in the second instance, meaning that the data contain missing values. When such data are held as a tensor, they require special processing, such as imputation, in order to perform factor analysis due to the existence of missing values at the second point in time. UNMF is capable of handling such missing data in a natural manner. If the subject does not exist at the second point in time, the equivalent row does not exist in the organized data and there is no hindrance in building the model. Conversely, even if there are duplicates in the data, the same result is outputted.

MODEL AND VARIATIONAL INFERENCE

UNMF is formalized within a probabilistic framework as follows:

$$ \begin{align} y_{n} \mid v_{dl}, x_{n} & \sim \mathrm{Poisson}\left(\sum_{l=1}^{L} \left\{\prod_{d=1}^{D}v_{dl}^{x_{nd}} \right\} \right) \\ v_{dl} & \sim \mathrm{Gamma}(a,b).\nonumber \end{align} $$

(2)

The gamma prior is easy to compute, and it constrains the value of |$v_{dl}$| to be nonnegative, making it easier to interpret. Simply, higher values of |$v_{dl}$|s have highly co-occurrences.

Equation 2 is equivalent to the following:

$$ \begin{align} y_{n} \mid u_{nl} = \sum_{d=1}^{D} \sum_{l=1}^{L}u_{nl}, \quad u_{nl} \sim \mathrm{Poisson}\left(\prod_{d=1}^{D}v_{dl}^{x_{nd}} \right) \end{align} $$

(3)

Using the mean field assumption [10], we introduce variational Bayesian inference updates for the proposition model with the objective of finding the variational posterior distribution of the latent variable |$q(v_{dl})$|⁠.

For any latent variable, say |$w$|⁠, let |$\tilde w$| be the mean of |$w$| evaluated by the variational posterior. i.e.

$$ \begin{align*} \tilde{w} = \int w q(w) \, dw \end{align*} $$

Note that |$\widetilde{\log w} = \int \log w q(w) \, dw$| is not equal to |$\log (\tilde{w})$|⁠, in general. The variational posterior of |$u_{nl}$| is expressed as a multinomial distribution, and the updates for |$u_{nl}$| are defined by |$\widetilde{u_{nl}}= y_{n} r_{nl}$|⁠, where |$r_{nl}$| is denoted as

$$ \begin{align} r_{nl} =\frac{\exp(\widetilde{\log v_{dl}})}{\sum_{l=1}^{L} \exp( x_{nd} \widetilde{\log v_{dl}} )} \end{align} $$

(4)

The variational posterior |$q(v_{dl})$| is characterized as a gamma distribution with shape parameter |$\hat a_{dl}$| and rate parameter |$\hat b_{dl}$|⁠, as outlined in

$$ \begin{align} \hat a_{dl} &= \sum_{n=1}^{N} \sum_{n=1}^{N}x_{nd} \widetilde{u_{nl}} + a, \nonumber \\ \hat b_{dl} &= \sum_{n=1}^{N} x_{nd} \left( \prod_{d^{\prime} \neq d} \widetilde{v_{d^{\prime}l}}^{x_{nd^{\prime}}} \right) + b. \end{align} $$

(5)

To summarize, the following steps outline the proposed procedure:

Initialize |$v_{dl}$| through a gamma distribution.
Repeat the following two steps at a specified iteration number:
- – Update |$\widetilde{u_{nl}}$| using Equation 4.
- Update |$\widetilde{v_{dl}}$| and |$\widetilde{\log v_{dl}}$| using Equation 5.

The convergence of this algorithm is substantiated by the lower bound of the marginal likelihood (sometimes referred as evidence lower bound; ELBO), say |$\mathcal{L}(q)$|⁠, to monitor the convergence of the algorithm,

$$ \begin{align} \mathcal{L}(q) &= \int q(\theta) \log \frac{p(\boldsymbol{y},\theta|X, a, b)}{q(\theta)} d\theta \nonumber \\ &=\int q(\theta) \log p(\boldsymbol{y},\theta|X, a, b) d\theta - \int q(\theta) \log q(\theta) d\theta , \end{align} $$

(6)

where |$\theta $| is all of the variational parameters, and |$\boldsymbol{y}=(y_1, \ldots , y_N)^{\prime}$|⁠.

After some manipulations, the following expression is obtained:

$$ \begin{align*} \mathcal{L}(q) & = \sum_{n=1}^N \left( - R_n + y_n \log(R_n) - \log \Gamma(y_n+1) \right) +\\ &+ \sum_{d,l} \left((a-1) \log v_{dl} - b v_{dl} + a\log(\hat b_{dl}) - \log \Gamma(a)\right) +\\ &- \sum_{d,l} \left((\hat a_{dl}-1) \log v_{dl} - \hat b_{dl} v_{dl} + \hat a_{dl} \log(\hat b_{dl}) - \log \Gamma(\hat a_{dl}) \right), \end{align*} $$

where |$R_n= \sum _{l=1}^{L} \exp ( x_{nd}\widetilde{\log v_{dl}})$|

Our procedure of the variational inference can be regarded as finding the |$q(\theta )$| such that maximizing Eq.6.

Discrete probability distributions beyond the Poisson model

In prior sections, we delved into the Poisson distribution model; however, the multinomial distribution is also frequently employed as a fundamental discrete probability distribution. The multinomial distribution is utilized when there is an upper limit on the data’s values. Moreover, in cases of over-dispersion of the data, the negative binomial distribution is often used. Despite the fact that Poisson distribution serves as the foundation for these distributions, our model can be applied to a range of count data.

For instance, we consider the data generation process of Latent Dirichlet Allocation (LDA) [11]:

$$ \begin{align*} Y_n | M_n &\sim \mathrm{Multinomial}(M_n, W_n H)\\ M_n &\sim \mathrm{Poisson}(\lambda_n) \\ W_n &\sim \mathrm{Dirichlet}(\alpha) \\ H_l &\sim \mathrm{Dirichlet}(\beta) \end{align*} $$

The marginal distribution of |$Y_n$| is

$$ \begin{align*} &\int \mathrm{Multinomial}(Y_n |m, W_n H) \mathrm{Poisson}(m|\lambda_n) \, dm \\ &= \prod_{k=1}^K \mathrm{Poisson}(Y_n | \lambda_n W_n H_k). \end{align*} $$

Additionally, with a simple alteration to the model, the negative binomial distribution can also be employed:

$$ \begin{align*} y_n &\sim \mathrm{Poisson}\left(z_n \sum_{l=1}^L \prod_{d=1}^D v_{dl}^{x_{nd}}\right) \\ z_n &\sim \mathrm{Gamma}(\tau,\tau)\\ v_{dl} &\sim \mathrm{Gamma}(a,b). \end{align*} $$

The variational posterior of |$v_{dl}$| is

$$ \begin{align*} q(v_{dl}) = \mathrm{Gamma}(\hat \alpha_{dl}, \hat \beta_{dl} ), \end{align*} $$

where the shape and rate parameters are, respectively,

$$ \begin{align*} \hat \alpha_{dl} &= \sum_{n=1}^{N} x_{nd} \widetilde{u_{nl}} + a,\\ \hat \beta_{dl} &= \sum_{n=1}^{N} \tilde{z_n} x_{nd} \left( \prod_{d^{\prime}\neq d} (\widetilde{v_{d^{\prime}l}})^{x_{nd^{\prime}}} \right) + b. \end{align*} $$

The variational posterior of |$z_{n}$| is

$$ \begin{align*} q(z_{n}) = \mathrm{Gamma}\left(y_n + \tau, \sum_{l=1}^L \prod_{d=1}^D v_{dl}^{x_{nd}}+ \tau\right). \end{align*} $$

The hyperparameter |$\tau $| can be optimized through the application of Newton’s method [10].

RESULUTS

Simulation Study

We performed three simulations to evaluate the precision of estimation in the UNMF framework.

(1) We generated synthetic data in accordance with the model’s assumptions and contrasted the result with the expectation a posteriori (EAP). The consistency of the estimator was demonstrated through this evaluation.
(2) Synthetic data were generated and the optimal number of components, |$L$|⁠, was selected through holdout validation. The behavior of validation-loss was then analyzed.
(3) Synthetic data were produced using LDA’s model and compared with the EAP, demonstrating the robustness of the estimator.

The mean and standard deviation of estimators were assessed in the first scenario. Synthetic data were generated using |$N=100 \times 100 \times 3$| samples and observations were performed. The parameters |$v_{dl}$| were randomly drawn from a gamma distribution with shape parameter |$a=1$| and rate parameter |$b=0.1$|⁠, while the parameters |$W$| were generated from a gamma distribution with parameters |$a=1$| and rate parameter |$b=0.01$|⁠. The comparison between the true parameters and the mean and standard deviation of estimates obtained from 100 simulations is depicted in Figure 3. We observed that the points were arranged diagonally, indicating that the estimator was small biases.

Figure 3

The comparison true parameters and the mean of estimated parameters. The error bars indicates standard deviation.

Open in new tab Download slide

Subsequently, in the second scenario, we evaluate the model selection log-likelihood as calculated on the test data. The parameters and data are generated in the same manner as previously described in each simulation. From 1000 trials, it was observed that |$L = 5$| was selected 644 times (Figure 4).

Figure 4

Log-likelihood evaluated at test data. One line corresponds to one synthetic dataset.

Open in new tab Download slide

In the third scenario, a nonnegative integer matrix |$Y$| (⁠|$100 \times 100$|⁠) is generated from a multinomial distribution, while |$W_n$| and |$H_l$| are randomly drawn from a Dirichlet distribution with all parameters equal to 2. This demonstrates that the proposed method yields credible estimations (Figure 5). These simulation studies indicate that UNMF has consistency. This property helps interpretation of estimated |$V$|⁠. If the optimal points of the parameters are not substantially unique, the obtained estimates have potentially multiple interpretations. If the analyst presents only one of them, the results can be misleading. With this validation, it is feasible to apply the method to real data to determine its compatibility with existing studies.

$Estimated $v_{dl}$ versus $H_{lk}$ and $W_{nl}$ (true value). The points indicate mean, and the error bars indicate standard deviation.$

Figure 5

Estimated |$v_{dl}$| versus |$H_{lk}$| and |$W_{nl}$| (true value). The points indicate mean, and the error bars indicate standard deviation.

Open in new tab Download slide

Table 1

Open in new tab

Selected number of component |$L$| by holdout validation (true is 5)

\|$L$\|	5	6	7	8	9
	0.64	0.21	0.06	0.04	0.03

Table 1

Open in new tab

Selected number of component |$L$| by holdout validation (true is 5)

\|$L$\|	5	6	7	8	9
	0.64	0.21	0.06	0.04	0.03

Real data analysis

Metagenomic sample analysis

In this section, we conduct two metagenomic sample analysis, examining the data from David et al. [12] and Kostic et al. [13]. The analysis is performed using the abundance of genus-level taxa.

Prior to showing the analysis results, we establish certain quantity to help interpretation.

$$ \begin{align*} \mathrm{FLEX}_{jk} =\frac{1}{1/(\sum{j=1}^{D} v_{jk}) + 1/(\sum{k=1}^{K} v_{jk})} \end{align*} $$

FLEX, which stands for Frequency-Expectation [14], is defined as the harmonic mean of two probabilities, with the first term representing the probability of microbe |$j$| given component |$k$| and the second representing the exclusive presence of microbe |$j$| in component |$k$|⁠.

Now, we undertake an analysis of David’s data. David and his colleagues conducted a longitudinal study of fecal metagenomes from two donors, A and B. However, for the purposes of this analysis, we shall limit our focus solely to donor B. This decision was made as both donors are not representative samples. The data reveal that donor B had an episode of enteric infection between Days 151 and 159. The information is accessible via the R package ‘themetagenomics’ (https://github.com/EESI/themetagenomics). For donor B, 530 species of bacteria (unknown species were aggregated as ‘unknown”) were recorded at 176 time points.

In order to determine the optimal number of components |$L$|⁠, we employed a holdout validation approach and evaluated the log-likelihood at the test data. A randomly selected 10% of the data was designated as test data, with the remainder serving as training data. The values of log-likelihood are illustrated in Figure 6. Figure 7 displays the estimates of |$v_{dl}$| for each day, with a marked increase in component 4 around Days 150–160. In Figure 8, we depict the representative microbes for each component based on the top 10 FLEX values. Component 4 is characterized by the genera Eshtishia/Shigella, Salmonella and Clostridium, which are known to cause food poisoning.

Figure 6

Log-likelihood evaluated at test data. Each line corresponds one data point.

Open in new tab Download slide

$x-axis: time point, y-axis: estimates of $v_{dl}$.$

Figure 7

x-axis: time point, y-axis: estimates of |$v_{dl}$|⁠.

Open in new tab Download slide

$Relative frequency. x-axis: estimates of $v_{dl}$, y-axis: species corresponds $d$.$

Figure 8

Relative frequency. x-axis: estimates of |$v_{dl}$|⁠, y-axis: species corresponds |$d$|⁠.

Open in new tab Download slide

Subsequently, we analyze Kostic’s data. Kostic and colleagues conducted research on the correlation between type 1 diabetes (T1D) and metagenome by cohort. This information can be accessed at https://pubs.broadinstitute.org/diabimmune/t1d-cohort/resources/16s-sequence-data. The data included a total of 19 unique subjects, eight, seven and four in control, seroconverted and T1D, respectively. This study is a cohort study spanning Days 42 to 1233 of life and includes repeated measures for subjects. Since subjects are not sampled at the same age and the number of repeats is also different, these data require specific preprocessing to be handled by ordinary matrix factorization. However, UNMF does not require any specials. Our explanatory variables, denoted as |$X$|⁠, consist of age, subject id, bacterial species and T1D status (control, T1D and seroconversion). We label the model containing T1D status information as model 1, and the model without it as model 0.

As illustrated in Figure 2, the log-likelihood value assessed by holdout-validation indicates that the optimal number of components is |$L=2$|⁠. Nonetheless, null models (excluding T1D status) reveal greater likelihoods in the test data, as shown in Table 2. Consequently, the association between T1D and the enteric microbiome cannot be deemed credible. As a side note, the log-likelihood of the test data for model 1 at |$L=5$| is very small due to overfitting (Table 2).

Table 2

Open in new tab

The mean of log-likelihood evaluated at test data.

\|$L$\|	1	2	3	4	5
model 0	−821.95	−749.25	−894.72	−1221.73	−1627.17
model 1	−822.14	−785.38	−867.18	−1076.71	−385328.61

\|$L$\|	1	2	3	4	5
model 0	−821.95	−749.25	−894.72	−1221.73	−1627.17
model 1	−822.14	−785.38	−867.18	−1076.71	−385328.61

Table 2

Open in new tab

The mean of log-likelihood evaluated at test data.

\|$L$\|	1	2	3	4	5
model 0	−821.95	−749.25	−894.72	−1221.73	−1627.17
model 1	−822.14	−785.38	−867.18	−1076.71	−385328.61

\|$L$\|	1	2	3	4	5
model 0	−821.95	−749.25	−894.72	−1221.73	−1627.17
model 1	−822.14	−785.38	−867.18	−1076.71	−385328.61

The analysis should proceed using model 1, which assumes a relationship between T1D and the gut microbiome. Kostic et al. proposed that individuals with T1D exhibit decreased alpha-diversity in their gut microbiome. This finding is consistent with the entropy distribution (refer to Table 3).

Table 3

Open in new tab

Entropy when the microbe are multinomial distributed

control	seroconverted	T1D
2.49	2.24	2.35

Table 3

Open in new tab

Entropy when the microbe are multinomial distributed

control	seroconverted	T1D
2.49	2.24	2.35

Lastly, Figure 10 illustrates the estimates of |$v_{dl}$| for each age group. It indicates that component 2 experiences growth after about 1000 days, and a negative correlation exists between components 1 and 2. When utilizing model 1, the findings obtained align with the report put forth by [13], as previously described. It is important to note, however, that these results were achieved under the presumption of a relationship between T1D and the gut microbiome. As T1D information does not contribute to enhancing the prediction accuracy, we cannot make a compelling argument for the existence of said association. In general, when preprocessing involves multiple steps such as normalization, smoothing and missing data imputation, the resulting analysis possesses a significant number of degrees of freedom, rendering validation a more complex undertaking. As demonstrated in this section’s analysis, hypothesis verification is straightforward owing to the ability of UNMF to generate predictions and interpretations within a single model.

Figure 9

Log-likelihood evaluated at test data. Each line corresponds one data point.

Open in new tab Download slide

Figure 10

x-axis: age (daily), y-axis: estimates. The lines indicate mean of each 250.

Open in new tab Download slide

$x-axis: T1D status, y-axis: corresponds estimates of $v_{dl}$.$

Figure 11

x-axis: T1D status, y-axis: corresponds estimates of |$v_{dl}$|⁠.

Open in new tab Download slide

$Relative frequency. x-axis: estimates of $v_{dl}$, y-axis: species corresponds $d$.$

Figure 12

Relative frequency. x-axis: estimates of |$v_{dl}$|⁠, y-axis: species corresponds |$d$|⁠.

Open in new tab Download slide

Gene expression analysis

Revealing the gene expression responses to drugs leads to the knowledge about drug discovery. However, due to the large number of combinations of conditions such as cells, genes and drug doses, there are many missing values in the gene expression data for drug response. Therefore, missing values imputation in drug response gene expression data is needed.

An illustration of missing data completion is demonstrated through analyzing gene expression profiles (GSE92742, [15, 16]) obtained from the Gene Expression Omnibus database (GEO; [17]). The analysis encompasses ‘level 2’ data, consisting of 1246 landmark genes, with expression levels recorded at 6, 24, 96, 120, 144 and 168 hours post-treatment. Three cell types (MCF7, PC3 and VCAP) were treated with dosages of 0, 25ul, 10 and 30 um. Gene expression profiling often results in missing values, and the aim of this analysis is to predict these missing parts.

To assess the performance of missing data completion, artificial missing values were introduced to the original data. The conditions under which records were deleted were randomly selected, resulting in training data lacking observations with the same four conditions (dose, time-point, marker gene and cell) as the test data.

The prediction of missing values was compared with XGboost [18] as the baseline method. XGBoost and its analogue, ensemble models based on decision trees, have been reported to have excellent performance in predicting tabular data [19]. Both UNMF and XGBoost were applied on tidy formatted data to predict gene expressions with the same explanatory variables (above-mentioned four conditions). The comparison, presented in Figure 13, showcases the prediction against the observed (true) values, with root mean squared errors (RMSEs) of 820 (left panel) and 1300 (right panel), respectively. Note that the mean and standard deviation for the entire data are 2228.463 and 1617.54, respectively (the range is |$[1, 25465]$|⁠), so this RMSE is not too large. The GSE data were resampled to 10 million rows, with 1 million conditions designated as test data and the remaining as training data.

Figure 13

Comparison of the prediction. x-axis: predicted, y-axis: observed (true value).

Open in new tab Download slide

Table 4 illustrates that the results across varying hyperparameter choices display a consistent trend.

Table 4

Open in new tab

Comparison of the prediction. We resampled the GSE data to |$10^6$| rows, with |$10^5$| conditions as test data and the rest as training data.

hyperparameter				(XGBoost)
eta	max_depth	lambda	alpha	RMSE
0.30	6.00	1.00	0.00	1314.17
0.30	5.00	0.50	0.00	1340.58
0.50	15.00	0.50	0.10	1032.77
1.50	30.00	1.00	0.00	800.95
				(UNMF)
			L	RMSE
			2	816.67
			3	780.11
			4	762.93
			5	763.50

hyperparameter				(XGBoost)
eta	max_depth	lambda	alpha	RMSE
0.30	6.00	1.00	0.00	1314.17
0.30	5.00	0.50	0.00	1340.58
0.50	15.00	0.50	0.10	1032.77
1.50	30.00	1.00	0.00	800.95
				(UNMF)
			L	RMSE
			2	816.67
			3	780.11
			4	762.93
			5	763.50

Table 4

Open in new tab

Comparison of the prediction. We resampled the GSE data to |$10^6$| rows, with |$10^5$| conditions as test data and the rest as training data.

hyperparameter				(XGBoost)
eta	max_depth	lambda	alpha	RMSE
0.30	6.00	1.00	0.00	1314.17
0.30	5.00	0.50	0.00	1340.58
0.50	15.00	0.50	0.10	1032.77
1.50	30.00	1.00	0.00	800.95
				(UNMF)
			L	RMSE
			2	816.67
			3	780.11
			4	762.93
			5	763.50

hyperparameter				(XGBoost)
eta	max_depth	lambda	alpha	RMSE
0.30	6.00	1.00	0.00	1314.17
0.30	5.00	0.50	0.00	1340.58
0.50	15.00	0.50	0.10	1032.77
1.50	30.00	1.00	0.00	800.95
				(UNMF)
			L	RMSE
			2	816.67
			3	780.11
			4	762.93
			5	763.50

Our method, UNMF, utilizes features constructed as products of multiple factors, allowing for the expression of their interactions in a natural manner. In a Bayesian perspective, the ensemble of multiple predictors corresponds to the posterior predictive, integrated through the posterior distribution, and the regularization term corresponds to the prior distribution. In Eq.6, the first line has a larger value at better fitting, and the next two lines are penalty for departing from the prior distribution. These factors make UNMF flexible while avoiding overfitting.

CONCLUSION

In this study, we introduced a novel statistical framework, known as UNMF, to extract meaningful patterns from complex biological data sets, without being restricted by the data’s format and structure. UNMF can handle a wide range of data structures and formats, and works seamlessly with tensor data including missing observations and repeated measurements.

In the metagenomic sample analysis, we demonstrated the effectiveness of the UNMF in knowledge discovery. In case-control studies, a crucial aspect is identifying differences between cases and controls. Our results, obtained from a metagenomic example, indicated that the UNMF can effectively perform this task, even when the data have a complex structure (subject |$\times $| time |$\times $| microbe |$\times $| case-control). Furthermore, the gene expression example demonstrated that the UNMF can perform reasonable predictions while maintaining interpretability. This property comes from the inference procedure and the non-negativity, as we have discussed in the ‘Model and variational inference’ and ‘Simulation study’ sections. These features make UNMF a potentially valuable tool for a wide range of analysis.

This research holds potential for future expansion and may make significant contributions to the field. As omics data encompass more than just count data, it may be worth considering the use of alternative distributions, such as normal or log normal, within the UNMF framework.

We believe that UNMF offers a user-friendly, unified functionality for analyzing and integrating multi-dimensional and multi-dimensional omics data, and its application holds great potential for the life sciences field.

Default prior distribution

We set the gamma prior distribution weakly informative [20] as not to rule out the information contained in data. This means, even if the observation has many 0s or outlier, numerical errors are not occurred, and otherwise provide similar result to the maximum likelihood estimator.

To check the sensitivity of proposed model, we analyze the four dataset (Titanic, HairEyeColor, UCBAdmissions and Lander, 2017’s National Election Study in 1964) which have different range of values. The data of ‘Titanic’, ‘HairEyeColor’ and ‘UCBAdmissions’ are included in the statistical software R as default.

We set |$a=0.1,0,2, \ldots , 0.9$| and |$b=10^0, 10^{-1}, \ldots , 10^{-4}$| as hyperparameters, respectively. Figure 14 shows the difference between maximum likelihood estimator and EAP. In all cases, the difference is close to 0 when |$a=0.05$| and |$b=0.01$|⁠.

Figure 14

Absolute value of the difference between maximum likelihood estimator and EAP

Open in new tab Download slide

Next, Figure 15 shows regression coefficients for National Election Study data. Our proposed estimator not occurred numerical divergent for variable `Raceblack' and other variables no drastic change from MLE.

Figure 15

Comparison between maximum likelihood estimator and EAP. MLE about ‘Raceblack’ are divergent numerically.

Open in new tab Download slide

This example illustrates our procedure gives reasonable estimates.

Key Points

We develop a novel statistical framework, UNMF, to extract meaningful feature from biological data sets without being restricted by the data’s format and structure.
UNMF works seamlessly with tensor data including missing observations and repeated measurements.
UNMF achieves both predictive accuracy and model interpretability through nonnegative constraints and Bayesian procedures.

AUTHOR CONTRIBUTIONS STATEMENT

K.A. and T.S. developed the method and analyzed the results. K.A. conducted the simulation study. K.A. and T.S. wrote and reviewed the manuscript.

ACKNOWLEDGMENTS

This work was supported by grants from KAKENHI Grants-in-Aid for Young Scientists (A.K., grant number 20K19921), KAKENHI Grants-in Aid for Scientific Research (B) (T.S., grant number 20H04281), KAKENHI Grants-in-Aid for Scientific Research on Innovative Areas on Information Physics of Living Matters (T.S., grant number 22H04839), KAKENHI Grants-in-Aid for Challenging Exploratory Research (T.S., grant number 20K21832), KAKENHI Grant-in-Aid for Transformative Research Areas (platforms for Advanced Technologies and Research Resources) (T.S., grant number 22H04925), and KAKENHI Grant-in-Aid for Transformative Research Areas (A) (T.S., grant number 23H04938) from the Japan Society for the Promotion of Science (JSPS). It was also supported by RADDAR-J (T.S., grant number 22ek0109488), Project for P-PROMOTE (T.S., grant number 22ama221215 and 22ama221501), Brain/MINDS Health and Diseases (T.S., grant number 22wm0425007) from the Japan Agency for Medical Research and Development (AMED), and Moonshot R&D (T.S., grant number JPMJMS2025) from the Japan Science and Technology Agency (JST). The super-computing resource was provided by Human Genome Center (The University of Tokyo) and AI Bridging Cloud Infrastructure (ABCI) (National Institute of Advanced Industrial Science and Technology).

REFERENCES

Eckhardt

Baccarelli

Molecular mechanisms of environmental exposures and human disease

Nat Rev Genet

2023

. https://doi.org/10.1038/s41576-022-00569-3.

Epub ahead of print. PMID: 36717624

Google Scholar

OpenURL Placeholder Text

WorldCat

Argelaguet

Velten

Arnol

, et al.

Multi-omics factor analysis-a framework for unsupervised integration of multi-omics data sets

Mol Syst Biol

2018

;

(

e8124

. https://doi.org/10.15252/msb.20178124.PMID:29925568;PMCID:PMC6010767.

Argelaguet

Arnol

Bredikhin

, et al.

MOFA+: a statistical framework for comprehensive integration of multi-modal single-cell data

Genome Biol

2020

;

(

111

. https://doi.org/10.1186/s13059-020-02015-1.PMID:32393329;PMCID:PMC7212577.

Velten

Braunger

Argelaguet

, et al.

Identifying temporal and spatial patterns of variation from multimodal data using MEFISTO

Nat Methods

2022

;

(

179

–

. https://doi.org/10.1038/s41592-021-01343-9.

Epub 2022 Jan 13. PMID: 35027765; PMCID: PMC8828471

Wickham

Tidy data

J Stat Softw

2014

;

(

–

Tipping

Bishop

Probabilistic principal component analysis

J R Stat Soc Series B Stat Methodol

1999

;

(

611

–

Google Scholar

Crossref

WorldCat

Cemgil

Bayesian inference for nonnegative matrix factorisation models

Comput Intell Neurosci

2009

;

2009

. https://doi.org/10.1155/2009/785152.

Google Scholar

OpenURL Placeholder Text

WorldCat

Wang

Blei

Variational inference in nonconjugate models

J Mach Learn Res

2013

Google Scholar

OpenURL Placeholder Text

WorldCat

Wickham

Advanced R

, second edn.

Chapman and Hall/CRC

2019

10.

Jordan

Ghahramani

Jaakkola

Saul

An introduction to variational methods for graphical models

Mach Learn

1999

;

(

183

–

233

Google Scholar

Crossref

WorldCat

11.

Blei

JordanMI.

Latent dirichlet allocation

J Mach Learn Res

2003

;

(

Jan

993

–

1022

Google Scholar

OpenURL Placeholder Text

WorldCat

12.

David

Materna

Friedman

, et al.

Host lifestyle affects humanmicrobiota on daily timescales

Genome Biol

2014

;

R89

. https://doi.org/10.1186/gb2014-157-r89.

13.

Kostic

Gevers

Siljander

, et al.

The dynamics of the human infant gut microbiome in development and in progression toward type 1 diabetes

Cell Host Microbe

2015

;

(

260

–

14.

Roberts

Stewart

Airoldi

A model of text for experimentation in the social sciences

J Am Stat Assoc

2016

;

111

(

515

988

–

1003

. https://doi.org/10.1080/01621459.2016.1141684.

Google Scholar

Crossref

WorldCat

15.

Subramanian

Narayan

Corsello

, et al.

A next generation connectivity map: L1000 platform and the first 1,000,000 profiles

Cell

2017

;

171

(

1437

–

1452.e17

PMID: 29195078

16.

Seçilmiş

Hillerton

Morgan

, et al.

Uncovering cancer gene regulation by accurate regulatory network inference from uninformative data

NPJ Syst Biol Appl

2020

;

(

PMID: 33168813

17.

Barrett

Wilhite

Ledoux

, et al.

NCBI GEO: archive for functional genomics data sets–update

Nucleic Acids Res

2013

;

(

Database issue

D991

–

Google Scholar

PubMed

OpenURL Placeholder Text

WorldCat

18.

Chen

Benesty

, et al.

Xgboost: extreme gradient boosting

R package version 0.4-2,

2015

;

(

–

19.

Grinsztajn

Oyallon

, &

Varoquaux

. (

2022

Why do tree-based models still outperform deep learning on typical tabular data?

Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track

20.

Gelman

Carlin

Stern

Rubin

Bayesian data analysis

Chapman and Hall/CRC

1995

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected]

Download all slides

Month:	Total Views:
July 2023	287
August 2023	152
September 2023	146
October 2023	137
November 2023	104
December 2023	110
January 2024	102
February 2024	58
March 2024	73
April 2024	85
May 2024	87
June 2024	71
July 2024	89
August 2024	58
September 2024	74
October 2024	64
November 2024	97
December 2024	53
January 2025	87
February 2025	91
March 2025	78
April 2025	53
May 2025	12

Article Contents

UNMF: a unified nonnegative matrix factorization for multi-dimensional omics data

Abstract

INTRODUCTION

CONCEPTUAL VIEW

MODEL AND VARIATIONAL INFERENCE

Discrete probability distributions beyond the Poisson model

RESULUTS

Simulation Study

Real data analysis

Metagenomic sample analysis

Gene expression analysis

CONCLUSION

Default prior distribution

AUTHOR CONTRIBUTIONS STATEMENT

ACKNOWLEDGMENTS

REFERENCES

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

Article Contents

UNMF: a unified nonnegative matrix factorization for multi-dimensional omics data

Abstract

INTRODUCTION

CONCEPTUAL VIEW

MODEL AND VARIATIONAL INFERENCE

Discrete probability distributions beyond the Poisson model

RESULUTS

Simulation Study

Real data analysis

Metagenomic sample analysis

Gene expression analysis

CONCLUSION

Default prior distribution

AUTHOR CONTRIBUTIONS STATEMENT

ACKNOWLEDGMENTS

REFERENCES

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

This Feature Is Available To Subscribers Only