Comparison and evaluation of integrative methods for the analysis of multilevel omics data: a study based on simulated and experimental cancer data

Parameters for data simulation

Abberation level			DE features %
Effect	δ	SD	p_path	p_rand	p_DE
Weak	0.2	0.1	10.0	1.0	3.5
Medium	0.4		20.0		13.5
Strong	0.8		50.0		26.0

Abberation level			DE features %
Effect	δ	SD	p_path	p_rand	p_DE
Weak	0.2	0.1	10.0	1.0	3.5
Medium	0.4		20.0		13.5
Strong	0.8		50.0		26.0

Note: Aberration level of DE features in terms of relative effect size δ and relative SD; proportion of features p_path simulated as DE in each of 5 of 20 pathways; random p_rand and total p_DE proportion of DE features in a data set. Combinations of parameters δ and p_DE result in a total number of nine parameter sets and in 18 simulated data sets (9 GE and 9 MET).

Table 1

Parameters for data simulation

Abberation level			DE features %
Effect	δ	SD	p_path	p_rand	p_DE
Weak	0.2	0.1	10.0	1.0	3.5
Medium	0.4		20.0		13.5
Strong	0.8		50.0		26.0

Abberation level			DE features %
Effect	δ	SD	p_path	p_rand	p_DE
Weak	0.2	0.1	10.0	1.0	3.5
Medium	0.4		20.0		13.5
Strong	0.8		50.0		26.0

Note: Aberration level of DE features in terms of relative effect size δ and relative SD; proportion of features p_path simulated as DE in each of 5 of 20 pathways; random p_rand and total p_DE proportion of DE features in a data set. Combinations of parameters δ and p_DE result in a total number of nine parameter sets and in 18 simulated data sets (9 GE and 9 MET).

Comparison

Variable sets

To achieve comparable sizes of feature sets for the different methods, we predefine the number of variables to be selected by sCCA and NMF in the GE and the MET data set, respectively, and accumulate the features selected in multiple (maximum 100) runs of MALA. For synthetic data, we set the number of variables to be equal to $\sum_{i = 1}^{I} m_{i_{D E}}$ ⁠. For experimental data, we calculate log₂-fold-changes of GE and difference in means of MET M-values between groups for each feature and derive an empirical density function for each data set. We set a density-dependent threshold of ±3.5 in GE and ±2.2 in MET and estimate the number of interesting features as those located in lower and upper tails of the empirical density function.

Discrimination power

We assess the power of each method for detecting DE features in synthetic data sets by calculating the Matthews correlation coefficient (MCC, [47]), a performance metric for imbalanced data, taking values in the range of −1 to 1. In the experimental data, the truly DE features are unknown, and thus, the accuracy of sample classification is used as performance measure. We split the experimental data sets 10 times and apply each method to 80% of tumour and normal samples (training set). For each split, the resulting feature set is evaluated as follows: we construct two matrices containing the resulting feature set, one matrix comprising the samples in the training set and one matrix comprising the remaining 20% of samples (test set). On these matrices, we perform hierarchical clustering on samples using R functions dist and hclust with Euclidean distance measure and ward.D2 agglomeration method (data not shown). The resulting dendrogram is cut with the R function cutree so as to obtain two groups. The degree of consensus between the two main clusters and the biological conditions (classification accuracy) is derived. This represents the discriminating power of the feature set and is used as a performance measure.

Comparison on the feature and the annotation level

We examine the congruence of feature sets resulting from the different methods. To yet unravel biological redundancies of results, we compare them on different annotation levels. Features selected in the synthetic data are mapped to simulated pathways and pathway over-representation is assessed with one-sided Fisher’s exact test (R function fisher.test). The capability of methods to identify perturbed pathways in simulated data is measured in terms of the MCC. Features resulting from experimental data are mapped to genes and additionally to over-represented GO terms in an unconditional test, and to perturbed Reactome pathways (RP). For this, the functions hyperGTest and runSPIA from Bioconductor [48] packages GOstats [49] with annotation database packages org.Hs.eg.db (v. 3.4.1) [50] and GO.db (v. 3.4.1) [51], and graphite (v. 1.22.0) [52], which serves as an interface to the R package SPIA [53] and to Reactome, were used. For each gene, we call the function runSPIA with the log₂-fold-change of the corresponding feature in the GE data set and the largest difference in means of M-values of the corresponding features in the MET data set. The results of the comparison are visualized as Venn diagrams created with R package VennDiagram [54]. The false discovery rate of overrepre- sentation analysis is controlled at a level of 5% in synthetic data and at a level of 20% for experimental data with Benjamini–Hochberg’s [55] method.

Comparison with cancer signatures

We assess the biological relevance of results on experimental data investigating the overlap of feature sets with known cancer signatures. Signatures under consideration include a set of genes and gene products generally involved in different types of cancer (Kyoto Encyclopedia of Genes and Genomes, KEGG, Pathways in cancer; [56]), targets of cancer therapy (Tumour Alterations Relevant for GEnomics-driven Therapy (TARGET); http://archive.broadinstitute.org/cancer/cga/target, v3 14 February 2015), three gene sets for breast cancer subtype stratification: PAM (Prediction Analysis of Microarray; [57]), Sorlie [58] and Hu [59] and a collection of genes of the immune system identified in different transcriptomic studies to bear prognostic and predictive roles in breast cancer (Immune; [60] and related publications). Statistical significance of overlaps is assessed, after removal of genes not present in our data set, with one-sided Fisher’s exact test (R function fisher.test).

Computational environment

Computation of sCCA and NMF was conducted on Windows 7 64-bit Intel Core i7-4770 CPU and 16 GB RAM. Processing of the experimental datasets with TCGA Assembler and computation of MALA was accomplished on a cluster running CentOS Linux 7 with four Intel Xeon E7-4830 CPUs and 500 GB RAM. Further computations were conducted in R [38] version 3.4.1.

Results

Performance evaluation

We applied sCCA, NMF and MALA to synthetic data generated with nine sets of simulation parameters combining different aberration levels and proportions of DE features. Larger proportions of DE features and stronger DE signals result in a better agreement of simulated and predicted DE features for all methods (Figure 1). When the DE signal is weak (δ = 0.2) and the proportion of DE features is small (3.5%), all methods fail to identify DE features. When the DE signal is moderate, the correlation of simulated and predicted DE features for sCCA and NMF largely depends on the proportion of DE features. At larger proportions of DE features, sCCA and NMF are able to predict the majority of features simulated as DE correctly, and the performance of MALA starts to increase. If the DE signal is strong, sCCA and NMF are almost perfectly capable to identify DE features, and the proportion of DE features correctly identified by MALA further increases. Across all simulated data sets, sCCA performs best, NMF is almost as good as sCCA and the ability of MALA to identify DE features is modest (Figure 1).

Figure 1

MCC for methods applied to 100 synthetic data sets. Combinations of parameters listed in Table 1 result in a total number of nine parameter sets, and MCC is visualized for different DE signals (weak, moderate, strong) and proportions of DE features (3.5, 13.5 and 26%).

On experimental data, the performance of methods is represented by the classification accuracy (ACC), true positive rate (TPR) and specificity (SP) achieved with the resulting feature sets. The average ACC on test data is 97.1, 98.6 and 99.5% for sCCA, NMF and MALA, respectively, and is higher in the test set than in the training set for sCCA and NMF. Average TPR ranges from 94.3% (sCCA) to 99% (MALA), and average SP ranges from 99.5% (NMF) to 100% (sCCA, MALA) on the test data (Figure 2). All methods displayed similar performance across the three additional data sets (Supplementary Figure S1), with one exception: we note that the TPR dropped below 80% in the COAD data set.

Figure 2

Performance evaluation of methods on experimental data in terms of classification ACC, TPR and SP of methods in n = 10 random splits. Methods are applied on 80% of samples (training set), and performance of selected features is assessed on the training set and on the remaining 20% of samples (test set).

Downstream analysis of variable sets

In synthetic data, the ability of methods to identify perturbed pathways depends largely on the proportion and the aberration level of DE features. sCCA and NMF are able to correctly identify all perturbed pathways at proportions of DE features of 13.5 and 26%, if the DE signal is moderate or strong, and at a proportion of 3.5%, if the DE signal is strong. At weak DE signals, sCCA and NMF identify all perturbed pathways, as the proportion of DE features is large (26%). MALA does not reveal (reliably) the perturbed pathways in simulated data sets, except when the DE signal is strong and the proportion of DE features in perturbed pathways is large. Perturbed pathways may or may not be identified by sCCA and NMF at a weak DE signal and a medium proportion of DE features. At a moderate DE signal, sCCA is capable to identify a subset of perturbed pathways at small proportions of DE features. MALA identifies a subset of perturbed pathways at moderate DE signals and a large proportion of DE features (Supplementary Figure S2).

We quantitatively compare method results for the experimental data set and visualize overlaps on different annotation levels as Venn diagrams. Each method results in an approximate number of 500 features or 1.3% of the total number of features under study (39 553), and 78 or approximately 16% of these are identified by all three methods (Figure 3A). If features are mapped to genes, the percentage increases to almost 3% (sCCA: 2.63%, NMF: 2.68%, MALA: 2.78%) of total genes (18 216), and the overlap of all methods is with 77 or approximately 16% of genes the same as on the features level (Figure 3B). The pair-wise overlap of feature/gene sets is considerably higher. The proportion of features/genes detected by at least one of the other methods is 310/305, 305/299 and 149/149 or approximately 60% for sCCA and NMF and 30% for MALA, respectively.

Figure 3

Overlap of method results is shown on (A) feature (genes, CpG sites) and (B) on gene level and in terms of over-represented GO (C) BP, (D) MF, (E) CC and (F) perturbed RP.

Comparison of biological information in terms of over-represented GO terms for BP, molecular functions (MF) and cellular components (CC) (Figure 3C–E) reflects well the situation on feature/gene level in that terms over-represented in the results of sCCA and NMF produce in general more overlap than with those over-represented in the results of MALA. Regarding GO BP, the set of over-represented terms in the results of NMF and MALA is relatively small as compared with the set of terms over-represented in the result of sCCA, while the number of over-represented terms is more balanced in the remaining annotation categories MF and CC. Over-representation analysis of RP results in few pathways associated to each feature set (Figure 3F), and the overlap of RP produced by MALA with sCCA and NMF is particularly small.

We applied sCCA, NMF and MALA to three additional biological data sets (KIRC, LUAD and COAD) and assessed the overlap of features, genes and perturbed pathways (Figure 4 and Supplementary Figures S3–S5). Similar patterns emerge across all analysed data sets. On the feature and gene level, we observe that MALA selects the most unique features/genes followed by sCCA and NMF, which select a similar proportion of unique features/genes. When comparing the overlap between two methods, sCCA and NMF show the highest overlap followed by sCCA and MALA, and NMF and MALA that select similar proportions of overlapping features/genes in most cases. The overlap between all three methods is low in all four data sets. Comparable patterns can be observed in terms of perturbed pathways. MALA identifies most unique perturbed pathways, except RP, which were not overrepresented in the data sets with low sample size (KIRC, LUAD and COAD). The proportion of identified unique pathways is less similar among sCCA and NMF compared with the feature and gene level. We observe that the overlap between sCCA and NMF selected pathways is higher than sCCA and MALA, and NMF and MALA. The overlap between all three methods is low, with instances in which no overlap was observed. The overlap of RP is particularly low compared with the overlap on the GO level. An overview of the number of unique features/genes identified with each method in the GE, the MET and in both data sets as well as the number of GO terms and RP resulting from the over-representation analysis of these sets is given in Supplementary Tables S2–S4, S22–S24, S42–S44 and S62–S64, respectively. A list of features and mapped genes resulting from sCCA, NMF and MALA in each dataset is given in Supplementary Tables S5–S10, S25–S30, S45–S49 and S65–S69; over-represented GO terms are enlisted in Supplementary Tables S11–S19, S31–S38, S50–S58 and S70–S78; and a list of over-represented pathways is given in Supplementary Tables S20–S21, S39–S41, S59–S61 and S79–S81, respectively.

Figure 4

Overlap on the feature, gene and perturbed pathway levels between sCCA, NMF and MALA, across four data sets (KIRC, 24 complete samples; LUAD, 20 complete samples; COAD, 15 complete samples; and BRCA with 104 complete samples). Unique features were defined as the proportion of features uniquely identified by one method, relative the total number of features identified by that method. The overlap between two methods was defined as the proportion of features identified by both methods, relative the total number of features identified by the two methods together. The overlap between all three methods was defined relative the total number of features identified by all three methods.

To assess their biological relevance, we compare the results of sCCA, NMF and MALA to curated cancer signatures (Table 2). One-sided Fisher’s exact test shows significant (*0.04 < P-value < 0.05; $* *$ P-value = $5 \cdot 10^{- 5}$ ⁠) overlap of method results with gene sets PAM (sCCA, NMF, MALA), KEGG and Hu (MALA). The overlap of PAM, KEGG and Hu gene signatures with method results on the BRCA data set is visualized in Supplementary Figure S6. The overlap of results on the LUAD, the KIRC and the COAD data sets with the KEGG and TARGET gene signatures is visualized in Supplementary Figures S7 and S8, respectively.

Table 2

Overlap of method results on experimental data with curated cancer signatures

	KEGG (307)	PAM (50)	Sorlie (433)	Hu (219)	Immune (110)	TARGET (135)
sCCA	10	4*	11	5	5	4
NMF	12	8**	9	7	5	3
MALA	14*	5*	11	11*	4	5

	KEGG (307)	PAM (50)	Sorlie (433)	Hu (219)	Immune (110)	TARGET (135)
sCCA	10	4*	11	5	5	4
NMF	12	8**	9	7	5	3
MALA	14*	5*	11	11*	4	5

Note: The number in parenthesis indicates the number of genes in the signature included in the experimental data set. Significant overlap following a one-sided Fisher’s exact test is indicated by *0.04 < P-value < 0.05) and **P-value = $5 \cdot 10^{- 5}$ ⁠).

Table 2

Overlap of method results on experimental data with curated cancer signatures

	KEGG (307)	PAM (50)	Sorlie (433)	Hu (219)	Immune (110)	TARGET (135)
sCCA	10	4*	11	5	5	4
NMF	12	8**	9	7	5	3
MALA	14*	5*	11	11*	4	5

	KEGG (307)	PAM (50)	Sorlie (433)	Hu (219)	Immune (110)	TARGET (135)
sCCA	10	4*	11	5	5	4
NMF	12	8**	9	7	5	3
MALA	14*	5*	11	11*	4	5

Note: The number in parenthesis indicates the number of genes in the signature included in the experimental data set. Significant overlap following a one-sided Fisher’s exact test is indicated by *0.04 < P-value < 0.05) and **P-value = $5 \cdot 10^{- 5}$ ⁠).

Computational runtime

The computing time is mainly a consequence of the parameter settings and the dimensionality of data matrices and is shortest for sCCA. The analysis of a single data set with sCCA is completed in the order of seconds. Effectively, completion time of sCCA is extended, as multiple sCCA runs are required to find constants c₁ and c₂ resulting in a desired level of sparseness. The NMF run time accumulates to several seconds on simulated and to several hours for the experimental data. A single MALA run takes 1–2 min on the simulated data and approximately 40 min on the experimental data. We perform up to 100 MALA runs on each data set and accumulate the results (Table 3).

Table 3

Computational runtime in minutes

	Synthetic data		Experimental data (BRCA)
	1 run	Total	1 run	Total
sCCA	<1	900	<1	8250
NMF	8	2000	230	6860
MALA	2	5700^a	40	14 970^a

	Synthetic data		Experimental data (BRCA)
	1 run	Total	1 run	Total
sCCA	<1	900	<1	8250
NMF	8	2000	230	6860
MALA	2	5700^a	40	14 970^a

Note: Computing time of a single analysis run performed on a single synthetic or experimental data set; and accumulated computing time of analysis runs performed on 100 synthetic data sets and on experimental data and on 10 training data sets, respectively.

^aMultiple runs have been performed in parallel.

Table 3

Computational runtime in minutes

	Synthetic data		Experimental data (BRCA)
	1 run	Total	1 run	Total
sCCA	<1	900	<1	8250
NMF	8	2000	230	6860
MALA	2	5700^a	40	14 970^a

	Synthetic data		Experimental data (BRCA)
	1 run	Total	1 run	Total
sCCA	<1	900	<1	8250
NMF	8	2000	230	6860
MALA	2	5700^a	40	14 970^a

Note: Computing time of a single analysis run performed on a single synthetic or experimental data set; and accumulated computing time of analysis runs performed on 100 synthetic data sets and on experimental data and on 10 training data sets, respectively.

^aMultiple runs have been performed in parallel.

Discussion

The analysis of omics data aims to identify subsets of variables capable of explaining different phenotypes and to uncover functional relationships between these variables. Recently, integrative analysis approaches have emerged, combining multiple information levels provided by multiple omics technologies. The growing use of integrative methods imposes a number of fundamental questions on the validity and comparability of analysis results. However, comparisons are rare and in great demand.

In the presented work, we investigated the concordance of results from three integrative analysis methods published recently and coming with ready-to-use software implementations. Although the potential biological value of the results is evidenced by the substantial power provided to accurately classify samples, the set of features jointly identified by all methods is relatively small.

Conceptual differences between methods

As representatives of decomposition-based approaches, sCCA finds the projections of data matrices with maximum pair-wise correlation using a limited number of variables, and NMF derives most important additive components by minimizing the joint reconstruction error of data matrices. The goal of the third method MALA, which represents machine learning-based approaches, is to select a minimal set of variables capable of separating different phenotype samples.

Feature selection

In this study, we predefine the number of variables to be selected: the variables are represented by non-zero canonical weights in sCCA, by the members of md-modules in NMF, and by the feature-literals in the logic formulas resulting from MALA. In sCCA, we chose the constants c₁ and c₂ as to roughly achieve a desired level of sparseness in the canonical weight vectors; however, the R package PMA provides a function, MultiCCA.permute, to select these constants, from a supplied set of values, based on the correlation in the data. In NMF, we set the weight threshold for the incorporation of variables into an md-module, separately for the GE and the MET data set, so as to include a desired number of variables in the md-modules. However, this threshold can be set to an arbitrary constant. In MALA, the (maximum) number of features to be included in the logic formulas in one run is determined by the parameter β, and the number of variables selected in GE and the MET cannot be defined separately. Remarkably, the number of variables selected in one MALA run is relatively small. The results of MALA runs differ because of a heuristic component in FS and the fact that the classification model is derived from a random subset of samples in each run. Thus, the set of selected variables can be expanded by accumulation of results from multiple runs. In the experimental data, MALA tends to select preferentially features from the GE data set, while in the synthetic data more features from the MET data set are selected. This might indicate that the features in the GE or in the MET data set are more powerful in discriminating the samples, respectively. The unequal number of features MALA selects in the GE and the MET data set as compared with sCCA and NMF might be an important reason for the small overlap of method results at the feature/gene level as well as on the biological annotation level. Regardless of the biological question, the number of truly informative features in each data set is unknown. A major limitation of sCCA, NMF and MALA is thus that the number of variables to be selected has, in fact, to be predefined. An essential characteristic of MALA is that the number of phenotypes present and the phenotype labels of samples have to be delivered with the data.

Performance evaluation

Performance evaluation on synthetic data illustrates that all methods benefit from large proportions of DE features and stronger DE signals, and that MALA is least capable to identify DE features. In turn, the performance on experimental data of sCCA and NMF in terms of sample classification accuracy is slightly inferior to that of MALA. sCCA and NMF result in a predefined (large) number of features, which may impose conflicts on the classification of samples. In contrast to that, MALA is proposed to derive a simple classification model, comprising a small number of features, applicable for the discrimination of phenotypes. We set the number of features to build up the classification model in each MALA run to the maximum (β = 50). The large β value results in a decrease of the performance of the classification model derived in a single MALA run on the test set (overfitting). We accumulate the features selected in multiple runs to obtain a feature set specifically relevant for the analysed data set. The authors of MALA recently presented a novel method for the extraction of a rule-based classification model from RNA-seq data sets: Classifier with Alternative and MUltiple Rule-based models (CAMUR) [61]. The stated aim of CAMUR is to extract multiple equivalent classification models and therewith to identify a larger number of features related to a biological condition than is typically identified by other rule-based classifiers. The algorithm to derive the classification model used in CAMUR differs from the one used in MALA; however, CAMUR only extracted 2% of the DE genes in TCGA Breast cancer data, which is similar to the results of MALA on synthetic data in this study. This suggests that rule-based classification algorithms in general, and MALA in particular, are not designed to detect all DE features but the minimal number required for an accurate classification.

Biological findings

The feature/gene sets identified by the methods in the experimental data produce a relatively small overlap (approximately 16%). We surmised that on the GO terms and pathway level the percentage of overlap would be larger, as we expected the methods to reveal perturbed biological structures. In synthetic data, the pathway over-representation analysis demonstrates that, in some simulation scenarios, the methods are able to reveal perturbed pathways, although the MCC of identified and predicted DE features was low (Supplementary Figure S2). However, regarding experimental data, considering the differing set sizes, the overlap on the GO and pathway annotation level is in most cases even smaller as compared with the feature/gene level. This might be to some extent a consequence of the used functions HyperGTest and runSPIA delivered with the R packages GOstats and graphite, respectively, which do not take into account the feature level but are based on the mapped gene identifiers. To produce the largest possible overlap, we perform an unconditional over-representation test of GO terms where each term is assessed separately, regardless of its relation to other significant terms in the GO graph. Additionally, the purpose of over-representation analysis in this study is to reveal congruences of method results, rather than biological interpretation.

As the experimental data comprise tumour and normal samples, we expect integrative analysis to have the capability of identifying genes known to play a role in cancer. However, the results show a relatively poor overlap with generally accepted (breast) cancer signatures (Table 2 and Supplementary Figure S7). Effectively, in our experimental data set, the normal (control) condition is represented by samples from tissue adjacent to the tumour, and this may be a reason why we do not see a high overlap with known signatures.

Another reason for the poor overlap might be that gene sets PAM, Sorlie and Hu are proposed for breast cancer subtype stratification; KEGG is not specific for breast cancer but comprises general cancer pathways; Immune and TARGET comprise candidates addressed in cancer therapy and are thus not intended for the classification of samples. Notably, the overlap of NMF with PAM shows the smallest P-value. This might indicate that the md-modules, identified by NMF as important parts in the additive reconstruction of the original data sets, account for inherent structures such as tumour subtypes.

Other comparison studies

CIA, GSVD and IBC have been compared in terms of GO BP associated with experimental conditions [19]; however, no list of features to differentiate experimental conditions was provided. The proportion of commonly identified GO terms largely depends on the experimental condition, and the authors propose to take the intersection of the three results obtained for the biological interpretation. In [15], the results of CCA-EN, CIA and sPLS in terms of statistical measures and biological findings are compared. As the authors aim to discriminate more than two experimental conditions, they use the first three components of the decomposition in each method. As statistical measures, they use the amount of explained variance by each component in the decomposition of individual data sets and the correlation between the components across data set in each dimension. Additionally, they assess the overlap of genes and associated pathways corresponding to components separating the same cell lines and found the results of CCA-EN and sPLS to be congruent and rather complementary to CIA results. Both studies [15, 19] have in common the aim of discriminating subtypes of an experimental condition (life cycle stages of P. falciparum and human tumour cell lines, respectively) in contrast to the experimental data set used here where samples from breast cancer primary tumour and adjacent normal tissue, regardless of tumour subtype, stage or grade and subject age, therapeutic or prognostic status, are used. Further methods were suggested for tumour subtype discovery in multi-omics data, including ‘Group Lasso and Tight iCluster’ (GSTiC) [62] and ‘Integrative Sparse K-Means With Overlapping Group Lasso’ [63]. GSTiC is similar to sCCA and NMF, as it regularizes sparse weight vectors (like sCCA) and discovers modules corresponding to subtypes (like NMF).

Computational resources

The computing time and the required main memory resources vary enormously between methods. The quickest algorithm is sCCA. Depending on the system architecture, the computational load of sCCA is spread across multiple threads. The second fastest method is NMF. The computing time required depends largely on the dimensionality of data matrices and on the selected number of random initialisations and iterations of factor matrices. Notably, random initializations might be parallelized. The computing time of MALA is dominated by two factors: the FS step and the transposition of data matrices. MALA performs substantial numbers of read/write operations and can be significantly accelerated by operating on a solid state drive. The time dedicated to the completion of FS can be defined in the parameters of MALA. While RAM usage of sCCA and NMF is negligible, MALA requires large amounts of RAM. Multiple runs of MALA can easily be parallelized; however, the parallelization of MALA runs is limited by the amount of available RAM.

Future work

The wealth of available integrative methods calls for a systematic and comprehensive method comparison study including as many approaches as possible, representing different algorithm types. Moreover, future comparisons should aim to use well-understood experimental data sets with homogeneous biological samples under supervised clinical treatment. Additionally, data simulations with large numbers of features are needed. The potential impact of comparison studies is to provide researchers with a resource to look up the results of specific methods applied to the same simulated and experimental data sets. Furthermore, the data sets should be made available, so that new methods may be applied to them, and results can be directly compared with those of the already existing methods. Finally, extensive comparison studies will guide the researcher’s decision of which method to use for a particular study.

Conclusion

The integrative analysis of two different types of omics data with sCCA, NMF and MALA enables a direct comparison of findings and points out the main differences of these methods. sCCA and NMF are able to detect DE features in the simulated, data and the results of all three methods applied to the experimental breast cancer data from TCGA are suitable for classifying samples. The lack of extensive congruences between results suggests that the methods tend to discover different types of patterns. Ideally, the set of features characterizing a data set and accounting for the differences in experimental conditions should be small; however, the evaluation on simulated data demonstrates that the methods are not able to identify DE features when the predefined number of variables to select is small. The fact that methods might be used to extract a small set of features from large data sets in practice, while they perform best when the number of extracted variables is large, tends to be largely overlooked, but it might well explain the small overlap between the method results on experimental data. If one of the methods under comparison has to be chosen, we would recommend sCCA, as on the synthetic data, it performs best in identifying DE features, and it is most likely to reveal perturbed pathways. Additionally, sCCA is the least computational intensive method with an execution time on the order of seconds, even on large data sets.

Key Points

Comparisons of integrative analysis approaches, combining multiple information levels provided by multiple omics technologies, are rare and in great demand.
We evaluate three methods for the integrative analysis of omics data sets, sCCA, NMF and MALA each of which are based on different mathematical concepts.
In a comprehensive data simulation study, we assess the ability of sCCA, NMF and MALA to identify DE features.
The feature sets resulting from the application of methods to experimental breast cancer data are suitable for classifying samples, but the overlapping proportion of features jointly identified by all methods is only 16%.
We point out differences in the methods concepts and discuss possible reasons for the moderate overlap of identified feature sets with each other and with curated cancer signatures.

Bettina Pucher is a research associate in Gerhard Thallinger’s group at the Institute of Computational Biotechnology at Graz University of Technology and has been funded by the Omics Center Graz, which is part of the BioTechMed-Graz, Austria initiative.

Oana Zeleznik is a postdoctoral research fellow in the Channing Division of Network Medicine at Brigham and Women’s Hospital and Harvard Medical School. She is a former member of Gerhard Thallinger’s group at Graz University of Technology, Austria.

Gerhard Thallinger is a principle investigator at the Institute of Computational Biotechnology at Graz University of Technology, Austria and Deputy Speaker of the Omics Center Graz, which is part of the BioTechMed-Graz, Austria initiative.

Acknowledgements

The authors wish to thank Aedín Culhane for her valuable comments on data simulation and Emanuel Weitschek and Giovanni Felici for their kind support with MALA. The results shown here are in part based on data generated by the TCGA Research Network: http://cancergenome.nih.gov/.

Funding

This work was supported by the Austrian Ministry of Science, Research and Economy (HSRSM grant Omics Center Graz, to G.G.T.), the BioTechMed-Graz Initiative (Flagship project EPIAge to G.G.T), and by Amazon Web Services Cloud Credits for Research (grant to O.A.Z.).

References

1

Bersanelli

M

,

Mosca

E

,

Remondini

D

, .

Methods for the integration of multi-omics data: mathematical aspects

.

BMC Bioinformatics

2016

;

17

(

S2

):

15

.

2

Meng

C

,

Zeleznik

OA

,

Thallinger

GG

, et al.

Dimension reduction techniques for the integrative analysis of multi-omics data

.

Brief Bioinform

2016

;

17

(

4

):

628

–

41

.

3

Härdle

W

,

Simar

L.

Applied Multivariate Statistical Analysis

.

Berlin

:

Springer Science & Business Media

,

2007

.

4

Shen

R

,

Olshen

AB

,

Ladanyi

M.

Integrative clustering of multiple genomic data types using a joint latent variable model with application to breast and lung cancer subtype analysis

.

Bioinformatics

2009

;

25

(

22

):

2906

–

12

.

5

Shen

R

,

Wang

S

,

Mo

Q.

Sparse integrative clustering of multiple omics data sets

.

Ann Appl Stat

2013

;

7

(

1

):

269

–

94

.

6

Gusenleitner

D

,

Howe

EA

,

Bentink

S

, et al.

iBBiG: iterative binary bi-clustering of gene sets

.

Bioinformatics

2012

;

28

(

19

):

2484

–

92

.

7

Cao

H

,

Duan

J

,

Lin

D

,

Wang

YP.

Sparse representation based clustering for integrated analysis of gene copy number variation and gene expression data

.

Int J Comput Appl

2012

;

19

(

2

):

131

–

44

.

8

Breiman

L.

Random forests

.

Mach Learn

2001

;

45

(

1

):

5

–

32

.

9

Reif

DM

,

Motsinger

A

,

McKinney

B

, et al. Feature selection using a random forests classifier for the integrated analysis of multiple data types. In: Proceedings of the IEEE Symposium on Computational Intelligence and Bioinformatics and Computational Biology (CIBCB’06), Toronto, ON, Canada. IEEE, New York, NY, 2006,

1

–

8

.

10

Lê Cao

KA

,

Rossouw

D

,

Robert-Granié

C

,

Besse

P.

A sparse PLS for variable selection when integrating omics data

.

Stat Appl Genet Mol Biol

2008

;

7

(

1

):Article 35.

11

Waaijenborg

S

,

de Witt Hamer

PC

,

Zwinderman

AH.

Quantifying the association between gene expressions and DNA-markers by penalized canonical correlation analysis

.

Stat Appl Genet Mol Biol

2008

;

7

(

1

):

Article 3

.

12

Dolédec

S

,

Chessel

D.

Co-inertia analysis: an alternative method for studying species–environment relationships

.

Freshw Biol

1994

;

31

(

3

):

277

–

94

.

13

Culhane

AC

,

Perrière

G

,

Higgins

DG.

Cross-platform comparison and visualisation of gene expression data using co-inertia analysis

.

BMC Bioinformatics

2003

;

4

:

59.

14

Fagan

A

,

Culhane

AC

,

Higgins

DG.

A multivariate analysis approach to the integration of proteomic and gene expression data

.

Proteomics

2007

;

7

(

13

):

2162

–

71

.

15

Lê Cao

KA

,

Martin

PGP

,

Robert-Granié

C

,

Besse

P.

Sparse canonical methods for biological data integration: application to a cross-platform study

.

BMC Bioinformatics

2009

;

10

(

1

):

34.

16

Van Loan

CF.

Generalizing the singular value decomposition

.

SIAM J Numer Anal

1976

;

13

(

1

):

76

–

83

.

17

Alter

O

,

Brown

PO

,

Botstein

D.

Generalized singular value decomposition for comparative analysis of genome-scale expression data sets of two different organisms

.

Proc Natl Acad Sci USA

2003

;

100

(

6

):

3351

–

6

.

18

Tanay

A

,

Sharan

R

,

Kupiec

M

,

Shamir

R.

Revealing modularity and organization in the yeast molecular network by integrated analysis of highly heterogeneous genomewide data

.

Proc Natl Acad Sci USA

2004

;

101

(

9

):

2981

–

6

.

19

Tomescu

OA

,

Mattanovich

D

,

Thallinger

GG.

Integrative omics analysis. A study based on Plasmodium falciparum mRNA and protein data

.

BMC Syst Biol

2014

;

8(Suppl 2)

:

S4

.

20

Witten

DM

,

Tibshirani

RJ.

Extensions of sparse canonical correlation analysis with applications to genomic data

.

Stat Appl Genet Mol Biol

2009

;

8

(

1

):Article 28.

21

Zhang

S

,

Liu

CC

,

Li

W

, et al.

Discovery of multi-dimensional modules by integrative analysis of cancer genomic data

.

Nucleic Acids Res

2012

;

40

(

19

):

9379

–

91

.

22

Weitschek

E

,

Felici

G

,

Bertolazzi

P.

MALA: a microarray clustering and classification software. In: Proceedings of the 23rd International Workshop on Database and Expert Systems Applications (DEXA) 2012, Vienna, Austria. IEEE, New York, NY, 2012, 201–205.

23

Ashburner

M

,

Ball

CA

,

Blake

JA

, et al.

Gene Ontology: tool for the unification of biology

.

Nat Genet

2000

;

25

(

1

):

25

–

9

.

24

Milacic

M

,

Haw

R

,

Rothfels

K

, et al.

Annotating cancer variants and anti-cancer therapeutics in Reactome

.

Cancers

2012

;

4

(

4

):

1180

–

211

.

25

Croft

D

,

Mundo

AF

,

Haw

R

, et al.

The Reactome pathway knowledgebase

.

Nucleic Acids Res

2014

;

42

:

D472

–

7

.

26

Weber

M

,

Davies

JJ

,

Wittig

D

, et al.

Chromosome-wide and promoter-specific analyses identify sites of differential DNA methylation in normal and transformed human cells

.

Nat Genet

2005

;

37

(

8

):

853

–

62

.

27

Hotelling

H.

Relations between two sets of variates

.

Biometrika

1936

;

28

(

3–4

):

321

–

77

.

28

Parkhomenko

E

,

Tritchler

D

,

Beyene

J.

Sparse canonical correlation analysis with application to genomic data integration

.

Stat Appl Genet Mol Biol

2009

;

8

:

Article 1.

29

Witten

DM

,

Tibshirani

R

,

Hastie

T.

A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis

.

Biostatistics

2009

;

10

(

3

):

515

–

34

.

30

Lee

DD

,

Seung

HS.

Learning the parts of objects by non-negative matrix factorization

.

Nature

1999

;

401

(

6755

):

788

–

91

.

31

Bertolazzi

P

,

Felici

G

,

Festa

P

,

Lancia

G.

Logic classification and feature selection for biomedical data

.

Comput Math Appl

2008

;

55

(

5

):

889

–

99

.

32

Arisi

I

,

D’Onofrio

M

,

Brandi

R

, et al.

Gene expression biomarkers in the brain of a mouse model for Alzheimer’s disease: mining of microarray data by logic classification and feature selection

.

J Alzheimers Dis

2011

;

24

(

4

):

721

–

38

.

33

Weitschek

E

,

Fiscon

G

,

Felici

G

,

Bertolazzi

P.

GELA: a software tool for the analysis of gene expression data. In: Proceedings of the 26th International Workshop on Database and Expert Systems Applications (DEXA) 2015, Valencia, Spain. IEEE, New York, NY, 2015,

31

–

35

.

34

Kurgan

LA

,

Cios

KJ.

CAIM discretization algorithm

.

IEEE Trans Knowl Data Eng

2004

;

16

(

2

):

145

–

53

.

35

Resende

MG

, Greedy randomized adaptive search procedures. In:

Floudas

CA

,

Pardalos

PM

(ed).

Encyclopedia of Optimization

.

New York

:

Springer Science & Business Media

,

2009

,

1460

–

1469

.

36

Felici

G

,

Truemper

K.

The Lsquare system for mining logic data. In:

Wang

J

(ed).

Encyclopedia of Data Warehousing and Mining

, Vol. 2.

Hershey

:

IGI Global

,

2006

,

693

–

697

.

37

Truemper

K

,

Design of Logic-Based Intelligent Systems

.

Hoboken

:

John Wiley & Sons

,

2004

.

38

R Core Team

.

R: A Language and Environment for Statistical Computing

.

R Foundation for Statistical Computing

,

Vienna, Austria

,

2017

. https://www.R-project.org.

. http://cran.r-project.org/package=PMA.

39

Witten

DM

,

Tibshirani

RJ

,

Gross

S

,

Narasimhan

B.

PMA: Penalized Multivariate Analysis. R package version 1.0.9,

2013

40

Du

P

,

Zhang

X

,

Huang

CC

, et al.

Comparison of Beta-value and M-value methods for quantifying methylation levels by microarray analysis

.

BMC Bioinformatics

2010

;

11

(

1

):

587

.

41

Zhu

Y

,

Qiu

P

,

Ji

Y.

TCGA-Assembler: open-source software for retrieving and processing TCGA data

.

Nat Methods

2014

;

11

(

6

):

599

–

600

.

42

Wei

L

,

Jin

Z

,

Yang

S

, et al.

TCGA-Assembler 2: software pipeline for retrieval and processing of TCGA/CPTAC data

.

Bioinformatics

2017

, in press. doi: https://doi.org/10.1093/bioinformatics/btx812.

43

Cancer Genome Atlas Network

.

Comprehensive molecular portraits of human breast tumors

.

Nature

2012

;

490

(

7418

):

61

–

70

.

PubMed

44

Wright

GW

,

Simon

RM.

A random variance model for detection of differential gene expression in small microarray experiments

.

Bioinformatics

2003

;

19

(

18

):

2448

–

55

.

45

Li

D

,

Xie

Z

,

Le Pape

M

,

Dye

T.

An evaluation of statistical methods for DNA methylation microarray data analysis

.

BMC Bioinformatics

2015

;

16

:

217.

46

Meng

C

,

Kuster

B

,

Peters

B

, et al.

moGSA: integrative single sample gene-set analysis of multiple omics data

.

bioRxiv

2016

. doi: https://doi.org/10.1101/046904

. http://bioconductor.org/packages/org.Hs.eg.db/

47

Matthews

BW.

Comparison of the predicted and observed secondary structure of T4 phage lysozyme

.

Biochim Biophys Acta

1975

;

405

(

2

):

442

–

51

.

48

Huber

W

,

Carey

VJ

,

Gentleman

R

, et al.

Orchestrating high-throughput genomic analysis with Bioconductor

.

Nat Methods

2015

;

12

(

2

):

115

–

21

.

49

Falcon

S

,

Gentleman

R.

Using GOstats to test gene lists for GO term association

.

Bioinformatics

2007

;

23

(

2

):

257

–

8

.

50

Carlson

M.

org.Hs.eg.db: Genome wide annotation for Human. R package version 3.4.1,

2017

51

Carlson

M.

GO.db: A set of annotation maps describing the entire Gene Ontology. R package version 3.4.1,

2017

. http://bioconductor.org/packages/GO.db/

52

Sales

G

,

Calura

E

,

Romualdi

C.

graphite: GRAPH Interaction from pathway Topological Environment. R package version 1.22.0,

2017

. http://bioconductor.org/packages/graphite/

53

Tarca

AL

,

Kathri

P

,

Draghici

S.

SPIA: Signaling Pathway Impact Analysis (SPIA) using combined evidence of pathway over-representation and unusual signaling perturbations. R package version 2.28.0,

2013

. http://bioconductor.org/packages/SPIA/

54

Chen

H

,

Boutros

PC.

VennDiagram: a package for the generation of highly-customizable Venn and Euler diagrams in R

.

BMC Bioinformatics

2011

;

12

:

35.

55

Benjamini

Y

,

Hochberg

Y.

Controlling the false discovery rate: a practical and powerful approach to multiple testing

.

J R Statist Soc B

1995

;

57

(

1

):

289

–

300

.