-
PDF
- Split View
-
Views
-
Cite
Cite
Zeynab Maghsoudi, Ha Nguyen, Alireza Tavakkoli, Tin Nguyen, A comprehensive survey of the approaches for pathway analysis using multi-omics data integration, Briefings in Bioinformatics, Volume 23, Issue 6, November 2022, bbac435, https://doi.org/10.1093/bib/bbac435
- Share Icon Share
Abstract
Pathway analysis has been widely used to detect pathways and functions associated with complex disease phenotypes. The proliferation of this approach is due to better interpretability of its results and its higher statistical power compared with the gene-level statistics. A plethora of pathway analysis methods that utilize multi-omics setup, rather than just transcriptomics or proteomics, have recently been developed to discover novel pathways and biomarkers. Since multi-omics gives multiple views into the same problem, different approaches are employed in aggregating these views into a comprehensive biological context. As a result, a variety of novel hypotheses regarding disease ideation and treatment targets can be formulated. In this article, we review 32 such pathway analysis methods developed for multi-omics and multi-cohort data. We discuss their availability and implementation, assumptions, supported omics types and databases, pathway analysis techniques and integration strategies. A comprehensive assessment of each method’s practicality, and a thorough discussion of the strengths and drawbacks of each technique will be provided. The main objective of this survey is to provide a thorough examination of existing methods to assist potential users and researchers in selecting suitable tools for their data and analysis purposes, while highlighting outstanding challenges in the field that remain to be addressed for future development.
Introduction
Due to the prevalence of microarray and RNA-seq data in early stages, many pathway analysis methods have been developed for the analysis of gene expression data. The first generation are the over-representation analysis (ORA) methods [1–8]. These methods take a list of differentially expressed (DE) genes as input and identify the pathways in which the DE genes are over- or under-represented. The second generation are the Functional Class Scoring (FCS) methods [9–14]. FCS methods eliminate dependency on gene selection criteria by taking all genes into consideration. The third generation are the Topology-based (TB) methods [15–27]. These methods aim to exploit pathway topology and gene interactions that are meant to capture and describe the biological phenomenon. Yet, they still focus on gene expression analysis and are not able to analyze other types of data or integrate multiple experiments.
Integrating information from different types of data or experiments has become increasingly more essential to obtain a comprehensive overview of biological systems [28, 29]. The ability to analyze together independent studies allows researchers to increase sample size and reproducibility, while integrating different types of assays will increase the ability to uncover the complete cause of diseases and their functional consequences, which no single type of assay would be able to provide. This has led to the introduction of many techniques that utilize multi-omics and multi-cohort data to identify the pathways that underly the conditions or phenotypes of interest.
In this article, we review 32 approaches that have been designed for the purpose of integrative pathway analysis in the multi-omics and/or multi-cohort setup. Figure 1 shows the timeline of integrative pathway analysis techniques developed between 2005 and 2022. The availability of many methods and their continuous development indicate the high demand for pathway analysis using multi-omics data in the research community over the past 17 years. We note that there exist many other useful tools for pathway analysis. For example, LncSEA [30], LnCompare [31] and DIANA-miRPath [32] can perform pathway analysis using noncoding RNA data. However, these methods are not within the scope of this article because they are not designed for data integration (multi-omics and/or multi-cohort integration).

Timeline of the approaches developed for pathway analysis and multi-omics integration.
This will be the first article that provides such an in-depth review and discussion about integrative pathway analysis approaches using multi-omics and multi-cohort data. Many review articles [33–36] provide comprehensive assessments of pathway and subnetwork identification methods, but they are limited to the analysis of a single type of data, i.e. mRNA or scRNA-seq data. Other surveys [37–41] review techniques developed for integrating multi-omics data in general, without discussing how these can be applied in the context of pathway analysis. In contrast, here, we provide a comprehensive review of prominent methods capable of integrating multi-cohort and multi-omics data for the purpose of pathway analysis. Our survey is expected to benefit life scientists who wish to choose a method that is most suitable for their data. At the same time, this survey also benefits computational scientists who wish to know the limitations of existing methods in order to develop new methods and infrastructure addressing the current shortcomings.
In Section Availability and Implementation we will describe the methods’ availability, their implementation and important features. In Section Pre-analysis we will present the steps each method follows prior to pathway analysis. In Section Integrative Pathway Analysis we will describe and categorize these methods according to the techniques they employ for pathway analysis and data integration while presenting their advantages and disadvantages. In Section Method Assessment we will assess the quality of the methods based on six different metrics. Finally, in Section Summary and Discussion, we will highlight outstanding challenges in the field that remain to be addressed for further enhancements. We also provide detailed description, analysis pipeline and practicality of each method in supplementary materials (SupplementaryNote.pdf, Supplementary-Table-S1.xlsx, Supplementary-Table-S2.xlsx).
Availability and implementation
Tables 1 and 2 provide a summary of the availability and important features of each of the 32 surveyed methods. Table 1 shows available hyperlinks, platforms, availability, year, number of citations (Google Scholar) and references to their publications. Among the 32 methods, R packages and web-based tools are the most common platforms used for integrative pathway analysis. Moreover, the web interface is usually intuitive and provides options for selecting the input, analyzing the data and visualizing the output. Three of the surveyed tools, i.e. CancerMA [42], DANUBE [43] and rPAC [44], are not available. Regardless, we include these approaches in the survey because we believe that understanding these methods will be beneficial to readers.
Availability of 32 integrative pathway analysis approaches. The ✓ and ✗ in the Availability column indicate whether the methods are available at the time of writing this manuscript. The term ‘Web’ in the Platform column indicates whether the tool has a web-based graphical user interface. R, MATLAB, and Java terms indicate the programming languages used to implement the package. The three last columns, Year, Reference and Citation, indicate the year of publication, reference to the article and the number of citations
Availability of 32 integrative pathway analysis approaches. The ✓ and ✗ in the Availability column indicate whether the methods are available at the time of writing this manuscript. The term ‘Web’ in the Platform column indicates whether the tool has a web-based graphical user interface. R, MATLAB, and Java terms indicate the programming languages used to implement the package. The three last columns, Year, Reference and Citation, indicate the year of publication, reference to the article and the number of citations
Comparison of the 32 surveyed methods in terms of the supported omics types, meta-analysis, embedded pathways databases and mapping unit. The majority of these methods can integrate transcriptomic data with genomic, epigenomic, proteomic and metabolomic data. Among these, 19 methods are capable of performing meta-analysis (multi-cohort analysis). Further, 22 have embedded databases and thus allow users to analyze their data without the need of providing pathway/gene set information. Most methods also automatically map the gene/compound IDs of the expression data to the gene/compound IDs presented in the pathways. The column ‘Inter. Visual.’ indicates whether the tool has an interactive interface. Six methods provide interactive pathway diagram that allow users to interactively explore the results and inspect the statistics obtained in each omics analysis
Method . | Type of omic data . | Meta-analysis . | Built-in Database . | ID Mapping . | Inter. Visual. . | |||||
---|---|---|---|---|---|---|---|---|---|---|
. | Genome . | Transcriptome . | Epigenome . | Proteome . | Metabolome . | . | . | Manual . | Built-in . | . |
Gene-level, P-value-based integrative approaches | ||||||||||
KaPPA-View | ✓ | ✓ | KEGG, BioCyc | ✓ | ✓ | |||||
MAPE | ✓ | ✓ | ✓ | |||||||
CancerMA | ✓ | ✓ | GO | ✓ | ||||||
INMEX | ✓ | ✓ | ✓ | KEGG, GO | ✓ | |||||
3Omics | ✓ | ✓ | ✓ | KEGG, BioCyc | ✓ | ✓ | ||||
InCroMAP | ✓ | ✓ | ✓ | ✓ | ✓ | KEGG, Reactome, BioCarta | ✓ | ✓ | ||
ActivePathways | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |||
mitch | ✓ | ✓ | ✓ | ✓ | ✓ | |||||
iODA | ✓ | ✓ | ✓ | ✓ | KEGG | ✓ | ||||
Pathway-level, P-value-based integrative approaches | ||||||||||
IMPaLA | ✓ | ✓ | ✓ | KEGG, Reactome, WikiPathways, NCI-PID, Biocarta, PharmaGKBB, INOH, EHMN, NetPath, SMPDB, BioCyc | ✓ | |||||
iPEAP | ✓ | ✓ | ✓ | ✓ | KEGG | ✓ | ||||
MarVis-Pathway | ✓ | ✓ | ✓ | KEGG, BioCyc | ✓ | |||||
BLMA | ✓ | ✓ | KEGG | ✓ | ||||||
GeneTrail2 | ✓ | ✓ | ✓ | ✓ | ✓ | KEGG, Reactome, WikiPathways, BioCarta, NCI-PID, PharmGKB, SignaLink, GO, SMPDB | ✓ | |||
DANUBE | ✓ | ✓ | ✓ | |||||||
Mergeomics | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ||||
Pathview | ✓ | ✓ | ✓ | KEGG | ✓ | ✓ | ✓ | |||
PaintOmics 3 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | KEGG | ✓ | ✓ | ✓ |
ReactomeGSA | ✓ | ✓ | ✓ | Reactome | ✓ | |||||
multiGSEA | ✓ | ✓ | ✓ | KEGG, Reactome, PharmGKB, BioCyc, SMPDB, Panther, Biocarta, NCI-PID | ✓ | ✓ | ||||
pathwayMultiomics | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |||
CPA | ✓ | ✓ | KEGG, GO | ✓ | ✓ | |||||
clusterProfiler 4.0 | ✓ | ✓ | ✓ | KEGG, GO, WikiPathways | ✓ | ✓ | ||||
Graph-transformation-based approaches | ||||||||||
PARADIGM | ✓ | ✓ | ✓ | NCI-PID | ✓ | |||||
Subpathway-GM | ✓ | ✓ | KEGG | ✓ | ||||||
microGraphite | ✓ | ✓ | KEGG, Reactome, NCI-PID, BioCarta | ✓ | ||||||
mirIntegrator | ✓ | ✓ | KEGG, miRTarBase | ✓ | ||||||
MOSClip | ✓ | ✓ | ✓ | ✓ | ||||||
IMPRes-Pro | ✓ | ✓ | KEGG, STRING, TRRUST | ✓ | ✓ | |||||
rPAC | ✓ | ✓ | ✓ | |||||||
Machine-learning-based approaches | ||||||||||
GSOA | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ||||
PathwayPCA | ✓ | ✓ | ✓ | ✓ |
Method . | Type of omic data . | Meta-analysis . | Built-in Database . | ID Mapping . | Inter. Visual. . | |||||
---|---|---|---|---|---|---|---|---|---|---|
. | Genome . | Transcriptome . | Epigenome . | Proteome . | Metabolome . | . | . | Manual . | Built-in . | . |
Gene-level, P-value-based integrative approaches | ||||||||||
KaPPA-View | ✓ | ✓ | KEGG, BioCyc | ✓ | ✓ | |||||
MAPE | ✓ | ✓ | ✓ | |||||||
CancerMA | ✓ | ✓ | GO | ✓ | ||||||
INMEX | ✓ | ✓ | ✓ | KEGG, GO | ✓ | |||||
3Omics | ✓ | ✓ | ✓ | KEGG, BioCyc | ✓ | ✓ | ||||
InCroMAP | ✓ | ✓ | ✓ | ✓ | ✓ | KEGG, Reactome, BioCarta | ✓ | ✓ | ||
ActivePathways | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |||
mitch | ✓ | ✓ | ✓ | ✓ | ✓ | |||||
iODA | ✓ | ✓ | ✓ | ✓ | KEGG | ✓ | ||||
Pathway-level, P-value-based integrative approaches | ||||||||||
IMPaLA | ✓ | ✓ | ✓ | KEGG, Reactome, WikiPathways, NCI-PID, Biocarta, PharmaGKBB, INOH, EHMN, NetPath, SMPDB, BioCyc | ✓ | |||||
iPEAP | ✓ | ✓ | ✓ | ✓ | KEGG | ✓ | ||||
MarVis-Pathway | ✓ | ✓ | ✓ | KEGG, BioCyc | ✓ | |||||
BLMA | ✓ | ✓ | KEGG | ✓ | ||||||
GeneTrail2 | ✓ | ✓ | ✓ | ✓ | ✓ | KEGG, Reactome, WikiPathways, BioCarta, NCI-PID, PharmGKB, SignaLink, GO, SMPDB | ✓ | |||
DANUBE | ✓ | ✓ | ✓ | |||||||
Mergeomics | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ||||
Pathview | ✓ | ✓ | ✓ | KEGG | ✓ | ✓ | ✓ | |||
PaintOmics 3 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | KEGG | ✓ | ✓ | ✓ |
ReactomeGSA | ✓ | ✓ | ✓ | Reactome | ✓ | |||||
multiGSEA | ✓ | ✓ | ✓ | KEGG, Reactome, PharmGKB, BioCyc, SMPDB, Panther, Biocarta, NCI-PID | ✓ | ✓ | ||||
pathwayMultiomics | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |||
CPA | ✓ | ✓ | KEGG, GO | ✓ | ✓ | |||||
clusterProfiler 4.0 | ✓ | ✓ | ✓ | KEGG, GO, WikiPathways | ✓ | ✓ | ||||
Graph-transformation-based approaches | ||||||||||
PARADIGM | ✓ | ✓ | ✓ | NCI-PID | ✓ | |||||
Subpathway-GM | ✓ | ✓ | KEGG | ✓ | ||||||
microGraphite | ✓ | ✓ | KEGG, Reactome, NCI-PID, BioCarta | ✓ | ||||||
mirIntegrator | ✓ | ✓ | KEGG, miRTarBase | ✓ | ||||||
MOSClip | ✓ | ✓ | ✓ | ✓ | ||||||
IMPRes-Pro | ✓ | ✓ | KEGG, STRING, TRRUST | ✓ | ✓ | |||||
rPAC | ✓ | ✓ | ✓ | |||||||
Machine-learning-based approaches | ||||||||||
GSOA | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ||||
PathwayPCA | ✓ | ✓ | ✓ | ✓ |
Comparison of the 32 surveyed methods in terms of the supported omics types, meta-analysis, embedded pathways databases and mapping unit. The majority of these methods can integrate transcriptomic data with genomic, epigenomic, proteomic and metabolomic data. Among these, 19 methods are capable of performing meta-analysis (multi-cohort analysis). Further, 22 have embedded databases and thus allow users to analyze their data without the need of providing pathway/gene set information. Most methods also automatically map the gene/compound IDs of the expression data to the gene/compound IDs presented in the pathways. The column ‘Inter. Visual.’ indicates whether the tool has an interactive interface. Six methods provide interactive pathway diagram that allow users to interactively explore the results and inspect the statistics obtained in each omics analysis
Method . | Type of omic data . | Meta-analysis . | Built-in Database . | ID Mapping . | Inter. Visual. . | |||||
---|---|---|---|---|---|---|---|---|---|---|
. | Genome . | Transcriptome . | Epigenome . | Proteome . | Metabolome . | . | . | Manual . | Built-in . | . |
Gene-level, P-value-based integrative approaches | ||||||||||
KaPPA-View | ✓ | ✓ | KEGG, BioCyc | ✓ | ✓ | |||||
MAPE | ✓ | ✓ | ✓ | |||||||
CancerMA | ✓ | ✓ | GO | ✓ | ||||||
INMEX | ✓ | ✓ | ✓ | KEGG, GO | ✓ | |||||
3Omics | ✓ | ✓ | ✓ | KEGG, BioCyc | ✓ | ✓ | ||||
InCroMAP | ✓ | ✓ | ✓ | ✓ | ✓ | KEGG, Reactome, BioCarta | ✓ | ✓ | ||
ActivePathways | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |||
mitch | ✓ | ✓ | ✓ | ✓ | ✓ | |||||
iODA | ✓ | ✓ | ✓ | ✓ | KEGG | ✓ | ||||
Pathway-level, P-value-based integrative approaches | ||||||||||
IMPaLA | ✓ | ✓ | ✓ | KEGG, Reactome, WikiPathways, NCI-PID, Biocarta, PharmaGKBB, INOH, EHMN, NetPath, SMPDB, BioCyc | ✓ | |||||
iPEAP | ✓ | ✓ | ✓ | ✓ | KEGG | ✓ | ||||
MarVis-Pathway | ✓ | ✓ | ✓ | KEGG, BioCyc | ✓ | |||||
BLMA | ✓ | ✓ | KEGG | ✓ | ||||||
GeneTrail2 | ✓ | ✓ | ✓ | ✓ | ✓ | KEGG, Reactome, WikiPathways, BioCarta, NCI-PID, PharmGKB, SignaLink, GO, SMPDB | ✓ | |||
DANUBE | ✓ | ✓ | ✓ | |||||||
Mergeomics | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ||||
Pathview | ✓ | ✓ | ✓ | KEGG | ✓ | ✓ | ✓ | |||
PaintOmics 3 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | KEGG | ✓ | ✓ | ✓ |
ReactomeGSA | ✓ | ✓ | ✓ | Reactome | ✓ | |||||
multiGSEA | ✓ | ✓ | ✓ | KEGG, Reactome, PharmGKB, BioCyc, SMPDB, Panther, Biocarta, NCI-PID | ✓ | ✓ | ||||
pathwayMultiomics | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |||
CPA | ✓ | ✓ | KEGG, GO | ✓ | ✓ | |||||
clusterProfiler 4.0 | ✓ | ✓ | ✓ | KEGG, GO, WikiPathways | ✓ | ✓ | ||||
Graph-transformation-based approaches | ||||||||||
PARADIGM | ✓ | ✓ | ✓ | NCI-PID | ✓ | |||||
Subpathway-GM | ✓ | ✓ | KEGG | ✓ | ||||||
microGraphite | ✓ | ✓ | KEGG, Reactome, NCI-PID, BioCarta | ✓ | ||||||
mirIntegrator | ✓ | ✓ | KEGG, miRTarBase | ✓ | ||||||
MOSClip | ✓ | ✓ | ✓ | ✓ | ||||||
IMPRes-Pro | ✓ | ✓ | KEGG, STRING, TRRUST | ✓ | ✓ | |||||
rPAC | ✓ | ✓ | ✓ | |||||||
Machine-learning-based approaches | ||||||||||
GSOA | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ||||
PathwayPCA | ✓ | ✓ | ✓ | ✓ |
Method . | Type of omic data . | Meta-analysis . | Built-in Database . | ID Mapping . | Inter. Visual. . | |||||
---|---|---|---|---|---|---|---|---|---|---|
. | Genome . | Transcriptome . | Epigenome . | Proteome . | Metabolome . | . | . | Manual . | Built-in . | . |
Gene-level, P-value-based integrative approaches | ||||||||||
KaPPA-View | ✓ | ✓ | KEGG, BioCyc | ✓ | ✓ | |||||
MAPE | ✓ | ✓ | ✓ | |||||||
CancerMA | ✓ | ✓ | GO | ✓ | ||||||
INMEX | ✓ | ✓ | ✓ | KEGG, GO | ✓ | |||||
3Omics | ✓ | ✓ | ✓ | KEGG, BioCyc | ✓ | ✓ | ||||
InCroMAP | ✓ | ✓ | ✓ | ✓ | ✓ | KEGG, Reactome, BioCarta | ✓ | ✓ | ||
ActivePathways | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |||
mitch | ✓ | ✓ | ✓ | ✓ | ✓ | |||||
iODA | ✓ | ✓ | ✓ | ✓ | KEGG | ✓ | ||||
Pathway-level, P-value-based integrative approaches | ||||||||||
IMPaLA | ✓ | ✓ | ✓ | KEGG, Reactome, WikiPathways, NCI-PID, Biocarta, PharmaGKBB, INOH, EHMN, NetPath, SMPDB, BioCyc | ✓ | |||||
iPEAP | ✓ | ✓ | ✓ | ✓ | KEGG | ✓ | ||||
MarVis-Pathway | ✓ | ✓ | ✓ | KEGG, BioCyc | ✓ | |||||
BLMA | ✓ | ✓ | KEGG | ✓ | ||||||
GeneTrail2 | ✓ | ✓ | ✓ | ✓ | ✓ | KEGG, Reactome, WikiPathways, BioCarta, NCI-PID, PharmGKB, SignaLink, GO, SMPDB | ✓ | |||
DANUBE | ✓ | ✓ | ✓ | |||||||
Mergeomics | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ||||
Pathview | ✓ | ✓ | ✓ | KEGG | ✓ | ✓ | ✓ | |||
PaintOmics 3 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | KEGG | ✓ | ✓ | ✓ |
ReactomeGSA | ✓ | ✓ | ✓ | Reactome | ✓ | |||||
multiGSEA | ✓ | ✓ | ✓ | KEGG, Reactome, PharmGKB, BioCyc, SMPDB, Panther, Biocarta, NCI-PID | ✓ | ✓ | ||||
pathwayMultiomics | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |||
CPA | ✓ | ✓ | KEGG, GO | ✓ | ✓ | |||||
clusterProfiler 4.0 | ✓ | ✓ | ✓ | KEGG, GO, WikiPathways | ✓ | ✓ | ||||
Graph-transformation-based approaches | ||||||||||
PARADIGM | ✓ | ✓ | ✓ | NCI-PID | ✓ | |||||
Subpathway-GM | ✓ | ✓ | KEGG | ✓ | ||||||
microGraphite | ✓ | ✓ | KEGG, Reactome, NCI-PID, BioCarta | ✓ | ||||||
mirIntegrator | ✓ | ✓ | KEGG, miRTarBase | ✓ | ||||||
MOSClip | ✓ | ✓ | ✓ | ✓ | ||||||
IMPRes-Pro | ✓ | ✓ | KEGG, STRING, TRRUST | ✓ | ✓ | |||||
rPAC | ✓ | ✓ | ✓ | |||||||
Machine-learning-based approaches | ||||||||||
GSOA | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ||||
PathwayPCA | ✓ | ✓ | ✓ | ✓ |
Table 2 shows the features that are currently supported by the tools: supported data types, pathway databases embedded in the tools, ID mapping (built-in/user-provided) and interactive visualization. We group the methods into four categories based on their analysis techniques and purposes: (i) gene-level, P-value-based integration, (ii) pathway-level, P-value-based integration, (iii) graph transformation-based and (iv) machine-learning-based analysis. The categories are discussed in detail in Section Integrative Pathway Analysis. All of the reviewed tools are capable of analyzing transcriptomics as gene expression data are the most prevalent type of available omics data. Many of the methods are capable of integrating transcriptomics with other types of data, including genomic (e.g. SNP, copy number variation), epigenomic, proteomic and metabolomic data. In addition to multi-omics integration, there are 19 methods that allow users to integrate multiple datasets/experiments (meta-analysis). Methods with interactive visualization capability provide the diagrammatic representation of omics and interactions, as well as the quantitative analysis results on their graphical user interface. Users can interactively explore the analysis results, gene-level statistics and pathway networks, and choose the set of impacted pathways based on their domain knowledge and expert opinion.
Regarding pathway knowledge, 22 methods embed pathways from known pathway databases in their software. Embedded databases include KEGG [74], GO [75], Reactome [76], STRING [77], MirTarBase [78], NCI-PID [79], Biocarta [80], PharmGKB [81], Wikipathways [82], TRRUST [83], BioCyc [84], NetPath [85], INOH [86], EHMN [87], SMPDB [88] and SignaLink [89]. For these tools, users can conveniently use the built-in pathway information for their analysis without the need of providing pathways as part of the input. As each database uses different gene/compound IDs, it is important that the molecular data are mapped to the same IDs as the gene/compound IDs in the pathways. Such mapping can be either a user-imported mapping file or based on a built-in mapping database. Many methods also allow users to import their own pathway information (mostly in the format of gene sets). Clearly, the built-in mapping database is preferable since it relieves users from technical details and mapping knowledge.
High-level workflow
Figure 2 shows the high-level workflow and overall pipeline of integrative pathway analysis methods. These methods typically consist of two analysis phases: (i) data processing and standardization (pre-analysis), and (ii) data integration and pathway analysis (integrative pathway analysis).

The high-level workflow of the surveyed approaches for integrative pathway analysis. The modules commonly present in every analysis pipeline are depicted as solid-line boxes. The dashed boxes are optional modules. The analysis process starts with processing input data (multiple multi-omics datasets). The methods first perform an integrity check to ensure the consistency among the input matrices and then map the gene/compound IDs to supported IDs of pathway databases. After pre-analysis, the methods perform differential analysis at the gene-level and then identify significant pathways using statistical hypothesis testing. The output often includes pathway P-values, enrichment scores and visualization plots.
During the pre-analysis phase, the input data are processed to become suitable for integrative pathway analysis. Pre-analysis includes an ID mapping module that maps the input omics features to the gene/compound IDs presented in the pathways. Next, a data filtering module is designed to perform data quality control and noise reduction. Two optional modules can be also provided: data matching and pathway augmentation. In the data matching module, all expression data are matched so that the final dataset contains either common genes or samples across all original datasets. In the pathway augmentation module, validated multi-omics interactions (e.g. microRNA–gene interactions) are added to the pathway. This module is particularly important for methods that take into consideration interactions among genes and multi-omics layers.
The integrative pathway analysis phase is typically for identifying pathways that are significantly enriched. Most methods perform differential analysis for each data type or dataset and then combine them at the gene- or pathway-level to obtain an overall enrichment score for each pathway. These scores are often compared against the null distributions to obtain the P-values that represent statistical significance of the pathways.
Pre-analysis
Input data processing and filtering
The input of the surveyed methods includes: (i) omics expression matrices in .CSV format, in which rows represent genes/markers and columns represent samples/patients, and (ii) pathways matrix in the GMT format if the pathways are not embedded in the software. The analysis begins with quality control and data filtering.
The quality control step typically checks for the consistency of class labels across all datasets, the validity of gene/compound IDs and the correctness of data format. Some methods also detect outliers or perform additional checks depending on the type of omics data they support. The data filtering step aims at reducing the number of irrelevant features. To reduce the number of omics markers, most methods remove data points with low quality. The majority of methods also remove pathways that have too few genes measured. The three methods, CancerMA [42], GSOA [72] and MOSClip [70], also perform additional data filtering to suit their purpose. CancerMA only keeps cancer-related genes because it is designed for system-level analysis of 80 cancer datasets. GSOA and MOSClip filter out genes that do not belong to the pathways/conditions of interest.
Identifier (ID) mapping
At the abstract level, a pathway consists of multiple genes that work together to achieve a certain biological function. The challenge is that the same gene has different IDs in assaying platforms, pathway databases and omics layers. Crucially, ID mapping aims to integrate multi-omics data with pathway information available from different databases. Most methods perform ID mapping by either requiring users to import a mapping file, automatically mapping the IDs using a built-in mapping database, or providing both options. A complete description of supported databases and mapping IDs of all methods is provided in Supplementary Table S1.
Data matching
After importing input datasets, an optional data matching process can be employed to match or integrate the input expression matrices. Five methods implement this module in their pre-analysis phase: CancerMA [42], mitch [51], MarVis-Pathway [55], MOSClip [70], GSOA [72]. CancerMA and mitch match the gene IDs across all expression matrices, whereas MarVis-Pathway, MOSClip, and GSOA match the sample/patient ID across all input matrices.
Pathway augmentation
The ID mapping step often leads to a one-to-many scenario where multiple IDs can represent the same gene in a pathway. For example, one single gene can code for several distinct proteins due to the alternative RNA splicing process during gene expression. Similarly, an miRNA might target several genes, or multiple miRNAs target a gene. Integrative pathway analysis methods address this challenge by either: (i) treating the duplicated gene IDs as individual data points or (ii) choosing the optimal data point among all candidates. Regardless, both solutions result in substantial information loss. To alleviate this issue, mirIntegrator [69] and microGraphite [68] augment the pathways to include all entities instead of merging them to a single gene node. Specifically, they utilize validated/predicted microRNA/target interactions and add each interaction to a pathway if the microRNA targets a gene of the pathway. The augmentation process leads to a more comprehensive representation of biological pathways and interaction networks.
Multi-cohort analysis
Over the years, a research group might have collected many datasets from different sets of samples (or patient cohorts) for a specific condition/disease. As technologies change, the assaying platforms and data types of each cohort may differ. One example scenario is as follows: (i) first dataset/cohort has gene expression, (ii) second dataset has gene expression and methylation and (iii) third cohort has gene expression and copy number variation. Even for the same omic type, the data can be generated from different platforms, including microarray and sequencing technologies. Depending on the analysis purpose, users can choose an appropriate method based on the information provided in Table 2.
The gene-level integrative approaches often combine the P-values of genes obtained from multiple datasets before performing pathway analysis. These methods typically remove genes that do not appear in all experiments. In other words, these methods are suitable only when the genes/features are consistently measured in all datasets. Note that P-value is a standardized and scaled metric, i.e. P-values are uniformly distributed between 0 and 1 under the null hypothesis regardless of data scaling and assaying platforms. Therefore, these methods can integrate data of different platforms as long as the P-values are computed correctly in each dataset and that each dataset measures a similar set of markers/genes. Table 2 also shows the types of data each method was designed and tested on. In particular, MAPE and CancerMA were designed for gene expression, while INMEX, InCroMAP, ActivePathways, mitch and iODA can analyze other types of data. Note that the two methods, KaPPA-View and 3Omics, in this category are not capable of multi-cohort analysis.
When different cohorts measure different sets of markers, users can choose among the approaches that integrate the data at the pathway-level: 10 meta-analysis methods in the pathway-level integration category (MarVis-Pathway, BLMA, GeneTrail2, DANUBE, Mergeomics, PaintOmics 3, ReactomeGSA, pathwayMultiomics, CPA and clusterProfiler 4.0), plus rPAC and PathwayPCA. These methods first analyze each dataset independently to calculate the P-values of each pathway and then combine the statistics (e.g. P-values) of the pathways across multiple cohorts. As the set of pathways is the same in each analysis, the integration at the pathway-level can almost always be performed regardless of the data type and assaying platforms being used. Compared with gene-level integration, pathway-level integrative approaches are more flexible as they do not require each dataset to measure similar markers. However, as they focus on pathway-level integration, these approaches are not capable of identifying the genes/markers that are consistently differentially expressed across the cohorts. The pros and cons of each category will be further discussed in the next sections.
Integrative pathway analysis
The pre-analysis phase prepares the data for the integrative pathway analysis phase. The 32 methods utilize different integrative techniques and hypothesis testing methods to identify pathways that are significantly different between the compared phenotypes. In the following text, we will provide a comprehensive review of these methods as follows: (i) providing a high-level description of each method, (ii) categorizing them according to their analysis techniques and purposes and (iii) providing a high-level assessment of the strategic advantages and disadvantages of each category.
At the high-level, integrative pathway analysis consists of two main steps: (i) computing the summary statistics for the genes/markers, and (ii) performing hypothesis testing at the pathway-level to identify pathways that are significantly different between two phenotypes (e.g. disease versus control, or treated versus untreated). In the first step, these methods perform differential analysis to compute summary statistics for omics features in the experiments [90]. In the second step, integrative pathway analysis methods compute a summary statistic from omics scores of genes belonging to a pathway. Next, a statistical significance test is performed to assess how likely the computed statistic is observed by chance (P-value). Therefore, an enrichment score and a P-value for each pathway are obtained at this step, showing whether a pathway is associated with a disease phenotype. Depending on the assumption and strategy of each method, data integration (of multiple cohorts and multiple omics types) is performed in the first or second step.
Overall, the surveyed methods are divided into four categories based on their strategy used for analysis and data integration: (i) gene-level, P-value-based integrative approaches, (ii) pathway-level, P-value-based integrative methods, (iii) graph-transformation-based techniques and (iv) machine-learning-based approaches. Characteristics of each category will be described in the following sections.
Gene-level, P-value-based integrative approaches
Figure 3 shows the overall pipeline of these methods. Methods in this category include KaPPA-View [45], MAPE [46], CancerMA [42], INMEX [47], 3Omics [48], InCroMAP [49], ActivePathways [50], mitch [51] and iODA [52]. Consider a readout as the measurement/expression of a single omics data in a dataset (e.g. gene expression data of an experiment/dataset). The input of integrative pathway analysis approaches typically includes multiple readouts (from multiple datasets of different omics types). In each readout, rows typically represent genes/features, whereas columns represent samples. The samples are divided into two groups: disease versus control (phenotypes). Approaches in this category start by analyzing each readout independently to compute the P-value and effect size of each gene. For each gene, these methods combine the P-values and statistics across all readouts to compute a summary P-value and statistics that represent the overall difference of the gene between the two phenotypes. These summary statistics of the genes then serve as input for enrichment methods to calculate the P-values and enrichment scores of pathways.

Overall pipeline of gene-level, P-value-based integrative approaches. The input includes multi-omics and/or multi-cohort data that compare two phenotypes. Methods in this category first analyze each readout independently to obtain the gene-level P-values and effect sizes (e.g. log fold-change). For each gene, the methods combine the P-values and effect sizes across multiple readouts to obtain the summary P-value and effect size of the gene. Finally, these approaches perform functional analysis using the summary P-values and statistics to identify pathways that are significantly different between the two phenotypes. The output of these methods typically include the P-values and enrichment scores of the pathways.
There are two main techniques that can be used to combine the P-values: quantile-based combination and ranked-based combination. Given multiple P-values for a gene
KaPPA-View allows users to visualize metabolomics and transcriptomics to analyze plant metabolic pathways. The software displays quantitative data changes for individual transcripts and metabolites between different experimental conditions on the same metabolic pathway maps. Furthermore, gene-to-gene and/or metabolite-to-metabolite relationships such as co-expression correlations of genes can be displayed as edges on a metabolic pathway graph.
MAPE performs meta-analysis by combining the P-values obtained from each input mRNA datasets. The combination is performed at both the gene-level (MAPE-G) and the pathway-level (MAPE-P). MAPE-G first computes the t-statistic for each gene in each study and then calculates the P-value using permutation test. For each gene, MAPE-G then combines the individual P-values across multiple studies using the MaxP algorithm [93] to obtain one single P-value for the gene. Finally, given the combined P-values and statistics obtained from each gene, MAPE-G then performs a Kolmogorov–Smirnov (KS) test to calculate the P-value for each pathway. In contrast to MAPE-G, MAPE-P combines the P-values at the pathway-level. In each study, MAPE-P first computes the t-statistic for each gene and then calculates the P-value for the gene using the permutation test. It then calculates the P-values for the pathways using the KS-test. MAPE-P then combines the P-values of a pathway across all studies using MaxP. Finally, MAPE-I combines the obtained P-values from MAPE-G and MAPE-P modules using MinP [92].
CancerMA was designed to analyze 80 cancer-specific gene expression datasets. This method first performs differential analysis in each dataset and then combines the P-values using Stouffer’s method [96]. CancerMA also integrates the logFC-s using weighted linear combination [97]. By determining the differentially expressed (DE) genes based on these integrated statistics, CancerMA then performs an ORA [1] to calculate the P-values of GO terms.
INMEX implements differential analysis and then applies one of the five strategies to combine the computed gene P-values across readouts: combining P-values (Fisher’s and Stouffer’s methods), combining standardized differences, combining rank orders, vote-counting and direct data merge. After obtaining the combined P-values for the genes, INMEX calculates the P-values of the pathways using ORA or GSEA [98].
3Omics supports the analysis of proteomics, transcriptomics and metabolomics using five major modules: correlation analysis, co-expression analysis, pathway enrichment analysis and phenotypic analysis. Correlation and co-expression modules visualize connectivity and heatmaps of the two omics types. The pathway enrichment analysis component can be applied in two modes (normal and enriched). The normal mode displays user-provided metabolites via simple metabolite mapping to a pathway from the pathway databases. In the enriched mode, the tool compares two conditions to obtain the set of DE genes and enriched pathways using ORA.
InCroMAP first applies a threshold to identify the DE genes in each readout (using fold change or P-values). Then the intersection or union of identified DE genes are considered for pathway enrichment analysis using ORA. This tool provides two comprehensive visualizations, metabolic and pathway views. The metabolic view generates an interactive global map of cellular metabolism. The pathway view shows the integrated pathway-based data visualization from multiple omics platforms.
ActivePathways accepts a matrix of P-values with rows representing common genes across multiple datasets/cohort, and columns related to each omics dataset. Next, considering the dependency between omics, the P-values for each gene are combined across the datasets using Brown’s method [99], which is an extension of Fisher’s method (combining independent P-values). As a result, a list of combined P-values is generated, which is further used to identify DE genes. From the list of DE genes, this method performs a ranked ORA to compute the P-values for the pathways. The method also adjusts the P-values using the Bonferroni method. Pathways with adjusted P-values smaller than
The mitch method is capable of integrating multi-cohort and multi-omics data. This method allows users to import DE genes and their statistics. Otherwise, this tool computes a directional significance score (D) for each marker defined as:
iODA supports the analysis of transcriptome profiles (mRNA or miRNA expression data) and protein–DNA interactions (ChIP-Seq data). For each dataset, iODA applies six statistical methods to compute the P-values at the gene-level. These methods include Least Sum of Ordered Subset Squared (LSOSS) [103], Cancer Outlier Profile Analysis (COPA) [104], Maximum Ordered Subset T-statistics (MOST) [105], Outlier Robust T-statistics (ORT) [106], Outlier Sum (OS) [107] and the t-test. A gene is considered DE if four or more statistical methods report the gene as significant. The MACS [108] model is used by iODA to find significant peaks of the protein binding site in ChIP-seq data. The binding sites are then assigned to the target genes using the PeakAnalyzer [109] tool. iODA then uses two alternative strategies for pathway analysis: (i) intersecting the DE genes from all data types and then calculating the P-values of the pathways using ORA, or (ii) performing pathway analysis for each data type, then intersecting the significant pathways to obtain the final list.
Overall, this category provides the flexibility in adding multiple omics layers into the pathway analysis. Given that all omics markers, which match the same gene, affect the disease phenotype uniformly, the methods use well-formulated algorithms to combine the P-values and effect sizes from multiple readouts. Moreover, they are usually fast because pathway analysis is performed only once regardless of the number of inputs. However, they require the same set of genes across different omics datasets. Therefore, this approach may lose some important information, such as some genes may have crucial effects on pathway regulation, but they are excluded in one of the omics layers. Also, the gene statistics calculation in this approach fails to consider the topology among omics markers in the same omics layer other than transcriptomics.
Pathway-level, P-value-based integrative approaches
Figure 3 shows the overall workflow of pathway-level, P-value-based integrative approaches. Methods in this category include IMPaLA [53], iPEAP [54], MarVis-Pathway [55], BLMA [56], DANUBE [43], GeneTrail2 [57], Mergeomics [58], Pathview [59], PaintOmics 3 [60], ReactomeGSA [61], multiGSEA [62], pathwayMultiomics [63], CPA [64] and clusterProfiler 4.0 [65].
In contrast to gene-level integration, the methods in this category, for each readout, first perform differential analysis to obtain the gene-level statistics (P-values and effect sizes) and then perform pathway analysis using the gene-level statistics to calculate the P-values and enrichment scores for the pathways. Finally, for each pathway, these methods combine the P-values and enrichment scores from all readouts to compute a single P-value and enrichment score.

Overall pipeline of pathway-level, P-value-based integrative approaches. These methods first analyze each readout independently to calculate the P-values and statistics of each pathway in each readout. Next, the obtained P-values and statistics across all input datasets are combined to obtain the summary P-value and effect size for each pathway.
IMPaLA supports analyzing metabolomics and transcriptomics. This tool allows users to analyze each data type independently using ORA or Wilcoxon Enrichment Analysis (WEA) [110]. In the end, IMPaLA combines the two P-values into one single P-value by multiplying them.
iPEAP applies an existing pathway analysis method [1, 98, 111–113] on KEGG pathways for each dataset. This step returns multiple ranked lists of pathways. iPEAP then combines these ranked lists into one single rank list using the following techniques: RankAggreg, RobustRankAggre, min, median and mean.
MarVis-Pathway executes three types of enrichment analysis: entry-based, marker/feature-based and sample-based. The entry-based analysis is analogous to ORA. The marker/feature-based analysis ranks the features and then computes the pathway P-value using an iterative hypergeometric test [114], a rank-sum [115] or a KS test from the ranked list. Finally, sample-based analysis operates similarly to GSEA. Accordingly, the corresponding P-values for each pathway are returned. These P-values are then combined using Fisher’s [116] or Stouffer’s [96] method to produce a meta-P-value for each pathway.
BLMA was developed for the meta-analysis of multiple transcriptome datasets. The software allows users to perform pathway analysis on each dataset by multiple methods, including ORA [1], GSA [117], PADOG [118] and Impact Analysis [15]. For a chosen pathway analysis method, each pathway will therefore have multiple P-values—one per dataset. For each pathway, users can combine the P-values one of many techniques, including addCLT [56], Fisher’s [116], Stouffer’s [96], minP [92] or maxP [93] methods. The output of the software includes results of the data integration, as well as the results of each dataset.
DANUBE is based on the fact that pathway analysis often provides biased results toward well-studied conditions, e.g. cancer or Alzheimer’s disease. This meta-analysis approach attempts to correct for method bias and integration of multiple mRNA datasets. To perform pathway analysis on each dataset, DANUBE uses one of the four techniques: GSEA, GSA, IA and PADOG. It then uses the empirical distributions to correct the P-values obtained for each pathway. After bias correction, each pathway has multiple P-values—one per dataset. Finally, DANUBE combines the resulted P-values using the addCLT method for each pathway.
GeneTrail2 is a web service that allows users to perform pathway analysis on each input separately, using one of these options: weighted/unweighted KS-test, Wilcoxon test, ORA, sum, mean, median, maxmean statistics and t-test. The significant pathways and their enrichment scores can be viewed online. Furthermore, for data integration, two view modes are available. The union mode displays pathways that are significant in at least one data type. The intersection mode only displays pathways that are significant in all data types.
Mergeomics consists of two independent modules: Marker Set Enrichment Analysis (MSEA) and weighted Key Driver Analysis (wKDA). MSEA focuses on identifying enriched pathways in each dataset. This approach is similar to ORA: if there are more DE genes in a pathway than what can be expected by chance, then the pathway is likely to be enriched. Chi-square-like statistics is calculated to test this assumption, and a permutation test is performed to retrieve the P-value for each pathway. MSEA allows users to analyze a single dataset, as well as to perform meta-analysis of multiple datasets (named ‘meta-MSEA’). The meta-MSEA first calculates the P-values obtained from each dataset for each pathway and then combines these P-values using Stouffer’s method [96] to obtain a single P-value for the pathway. wKDA focuses on identifying genes that are the potential key drivers of the enriched pathways. Mergeomics outputs enriched pathways and their potential key drivers.
Pathview uses GAGE [119] (for expression data) and ORA (for compound IDs) to calculate the P-values of the pathways. This tool visualizes the pathway graphs with user data mapped in two views, the native KEGG pathway view and Graphviz [120] view. Graphviz view provides better control over node/edge attributes and a better view of graph topology.
PaintOmics 3 analyzes a range of different omics types: gene expression, metabolomics, region-based omics like ChIP-seq, DNase-seq, ATAC-seq, Methyl-seq, etc. and regulatory-based omics like miRNAs, transcription factors or other factors. This tool calculates the P-value for each pathway in each dataset using ORA and then combines the P-values for each pathway using Stouffer’s or weighted Fisher’s method. This tool provides three modules for visualizing the analysis results. In the first module, a pie chart and hierarchical structure show that KEGG pathways are organized in seven main classifications (Cellular Processes, Drug Development, Environmental Information Processing, Genetic Information Processing, Human Diseases, Metabolism and Organismal Systems). The second module is a pathways interaction network, in which the nodes represent pathways and edges indicate shared features among them. The last module allows users to explore each pathway by producing an interactive pathway diagram and a global heatmap with complementary information.
ReactomeGSA supports pathway analysis using microarray measurements, raw RNA-Seq and normalized read counts, proteomics spectral counts and intensity-based quantitative data. ReactomeGSA applies Camera [121], PADOG [118] or ssGSEA [122, 123] for pathway analysis on each data type. This tool also supports the analysis of single-cell RNA-sequencing (scRNA-seq) data. For this type of data, this tool uses either Seurat [124] or scater [125] to compute the mean expression of genes within a cluster. Then, through one of the supported analyzing methods, one pathway-level expression value per cell cluster is calculated. The obtained results are then converted to Reactome’s internal data format to visualize the results. The results from different analyses can be seen and interactively explored side by side down to a single gene or protein level.
The multiGSEA software independently executes GSEA on each input data type, which can be transcriptomics, proteomics and/or metabolomics. This package allows users to utilize the pathways definition from eight different databases, including PharmGKB [81], NCI/Nature Pathway Interaction Database [126], HumanCyc [127], SMPDB [128], Panther [129], Biocarta [80], KEGG [74, 130, 131] and Reactome [76]. For each pathway, up to three adjusted P-values and enrichment scores are returned. The method then combines the pathway’s P-values using Z-method, Stouffer’s, Fisher’s or Edgington’s method [132].
The pathwayMultiomics method as input, accepts a table of pathways’ P-values across multi-omics types. This tool, considering all combinations of pairs of P-values for each pathway, then picks the maximum P-value in each pair. Next, among the selected P-values, the minimum P-value is chosen for that specific pathway. To assess the statistical significance of chosen MinMax statistic, pathwayMultiomics assumes that all input P-values are independent, and the MinMax statistic for each pathway follows a Beta distribution for the rth order statistic.
CPA is a web-based platform that supports multiple inputs that can be analyzed with multiple methods in a single analysis [9, 14, 15, 110, 118, 133–135]. With input is a list of DE genes, ORA/WebGestalt [136] will be used. With input is a pre-ranked gene list (e.g. gene list with fold change), FGSEA [134], KS-test [133] and Wilcox-test [137] are the available methods. With input as an expression matrix, eight methods, including GSEA [9], GSA [14], FGSEA, PADOG [118], Impact Analysis (IA) [15, 16], ORA/WebGestalt, KS-test, and Wilcox-test, can be used. This tool provides an interactive visualization of the analysis results in both pathway- and gene-level graphs. At the pathway-level, a pathway is a node sliced into multiple parts corresponding to the analysis results. At the gene-level, a node is a gene with multiple colored parts representing the regulation direction of the gene in each data set.
The last method, clusterProfiler 4.0, allows users to perform both multi-omics integration (transcriptome and epigenome) and multi-cohort analysis. The input of the method includes multiple lists of genes and their statistics obtained from one or multiple transcriptome/epigenome datasets. The software provides embedded pathways information from KEGG, GO and WikiPathways, but it also allows users to input their customized pathways. For each dataset, the method performs pathway analysis using either ORA or GSEA. For each analysis, the method adjusts the P-values for multiple comparisons using one of the following methods: Holm’s, Bofferoni’s, Hochberg’s, Hommel’s, Bonferroni-Holm or Benjamini-Hochberg’s false discovery rate (FDR). In addition, the method allows users to compare and contrast the results obtained from multiple analyses using the compareCluster function. This function returns a table of pathways with multiple adjusted P-values—one for each input list. Finally, it further visualizes the results in a side-by-side graph for comparison analysis.
In general, these methods confer flexibility in combining multiple omics significant signals. As a result, the statistical power of these methods is expected to increase. However, most of the mentioned statistical tests assume independence between omics types, which contradicts reality and may negatively affect the analysis accuracy. A possible solution is considering the methods that can combine dependent P-values (e.g. Brown’s [99] and Liptak’s [138]). Another critical pitfall of these methods is that they neglect the actual expression changes, i.e. effect sizes. This might result in information loss. Although P-value is influenced by effect size, it is also greatly affected by sample size [139]. For datasets with large sample sizes, a test for differential expression will almost always result in a significant P-value, unless the effect size is zero, which is very unlikely. Simply combining the P-values would likely produce varying degrees of false discoveries.
Graph-transformation-based approaches
Methods in graph-transformation-based category include PARADIGM [66], Subpathway-GM [67], microGraphite [68], mirIntegrator [69], MOSClip [70], IMPRes-Pro [71] and rPAC [44]. Figure 5 shows the overall pipeline of these methods. Approaches in this category emphasize the point that studying affected genes and pathways separately may hinder the understanding of the whole genome-wide picture [71]. These methods utilize different strategies to construct gene/compound network/graph from existing knowledge about pathway topology. Consequently, each pathway is represented as a network of genes and products. They then perform network-based analysis to identify significantly different pathways between the two phenotypes.

Overall pipeline of graph-transformation-based integrative approaches. These methods first construct pathway networks and then analyze each readout independently to obtain the summary statistics for genes and other omics entities. Finally, they perform graph-based analysis to calculate the P-value and network score for each pathway.
PARADIGM models each gene as a factor graph of four different biological entities: copy number, gene expression state, protein level and protein activity. From the input expression data, PARADIGM computes the scores of the nodes (observed states) and estimates the probability that a node is positive/active or negative/inactive. For each pathway, PARADIGM returns a matrix of states in which columns represent samples and rows represent nodes in the pathway factor graph. All pathways are ranked based on the average number of samples in which significant activity was detected per node.
Subpathway-GM, which is now part of the iSubpathwayMiner package [140], is capable of integrating gene expression data with metabolic data to identify metabolic pathways and subgraphs that are significantly impacted. The method first maps genes and metabolites to enzyme and metabolite nodes of KEGG pathways. The genes and metabolites are then referred to as ‘signature nodes’ of the pathway graph. The method next searches for shortest paths between signature nodes and then removes nodes that do not belong to these shortest pathways. Subgraphs with a number of nodes larger than a pre-defined threshold are considered important. The method calculates the enrichment score for each subgraph using the hypergeometric test. The P-value of the underlying pathway is the smallest P-value of its subgraphs. The method repeats the above process for all metabolic pathways to obtain the P-values for all pathways. Finally, Subpathway-GM outputs the pathways, subgraphs and their P-values.
Both mirIntegrator and microGraphite extend the pathways to include microRNA–gene interactions. mirIntegrator performs pathway analysis on the extended pathways using ORA and Impact Analysis and then combines the P-values of the two types of evidence using Fisher’s method. In contrast, microGraphite decomposes the network into fully connected cliques and calculates their P-values. The method then builds a junction tree, having nodes as cliques and edges as connectivity, and computes the scores of every path in the graph. All the top-scored paths are combined to generate a meta-pathway. In the last step, the meta-pathway’s paths are analyzed and ranked according to their involvement in the phenotype. Finally, microGraphite performs a sample permutation test to estimate the significance (P-value) of the pathways, meta-pathway, paths and cliques.
MOSClip focuses on identifying pathways or modules associated with patient survival. From the pathway graph, MOSClip generates all possible cliques and performs survival analysis. For each pathway or clique, MOSClip goes through the following process: (i) data filtering to keep only the genomic features that belong to the pathway/module, (ii) dimension reduction, (iii) data concatenation across data types by patient matching and (iv) survival analysis using Cox regression [141]. MOSClip outputs a Cox P-value for each pathway that represents how likely the pathway is significantly associated with patient survival.
IMPRes-Pro implements a step-wise active pathway detection algorithm on the background network. Starting from seed nodes, the method explores all potential paths that include as many target nodes as possible. Next, IMPRes-Pro uses the shortest path algorithm [142] with a customized penalty function to achieve the optimal pathway network. The algorithm stops when each node achieves a minimum penalty score. The final active pathway network is detected by truncating and backtracking.
The rPAC software is designed to find routes of a pathway highly associated with the disease. Routes are portions of a pathway that typically involve a transcription factor. The computed scores for each route are: (i) activity scores based on the down or up-regulation of that route, (ii) the rate that a route is altered within a cohort and (iii) the average of yielded activity scores of a route within all samples in a cohort. The P-values for routes are computed by testing a two-tailed hypothesis based on the scores.
The graph-transformation-based approaches have significant advantages over enrichment methods. They account for pathway topology and multi-omics interactions by modeling pathways as networks of genes and their products. As a great result, these methods present the option to identify active subnetworks, which would have more explanatory power and decrease the domain of research to the subset of biological components. Adding more layers of omics might, however, result in a complex network that necessitates a significant amount of effort in terms of implementation and analysis.
Machine-learning-based approaches
Figure 6 shows the high-level description of machine-learning-based approaches. Methods in this category include GSOA [72] and PathwayPCA [73]. These methods apply machine learning techniques together with multi-omics data integration to identify the pathways strongly associated with the phenotype. The main idea is to classify samples using the genes that belong to each pathway. Next, the performance of the classification is assessed, and the importance of a pathway is assessed based on how well we can separate disease and healthy samples using the genes in the pathway.

Overall pipeline of machine-learning-based approaches. For a given pathway, these methods filter the multi-omics data to keep only genes belonging to the pathway. Next, these methods classify each sample using the expression data and assess the accuracy using the area under the receiver operating characteristic curve (AUC). The P-value of the pathway is calculated by comparing the obtained AUC to its empirical distribution constructed under the null.
GSOA starts with merging multi-omics data into a single data frame and then performs, by default, radial basis function support vector machine kernel (RBF SVM Kernel) [143] classification using the genes belonging to each pathway. GSOA then assesses the prediction accuracy by calculating the area under the receiver operating characteristic curve (AUC). Then, for each pathway, a P-value is calculated using the empirical null distribution of the AUC values.
PathwayPCA [73] extends two existing methods, AES-PCA [144] and SuperPCA [145], to perform pathway analysis and data integration. AES-PCA first performs dimension reduction and computes the latent variables for each data type and each pathway. For a given data type, the latent variables are evaluated against the phenotype either by using a regression model
One key advantage of machine-learning-based approaches is that users can incorporate many machine learning techniques into pathway analysis. Many of these techniques are publicly available and thus, require minimal implementation. However, the machine learning algorithms have some disadvantages in the scope of pathway analysis, including (i) the lack of means to take into consideration interactions among omics layers, (ii) the dependency on the chosen machine learning algorithm and parameters and (iii) computational burden due to repeated classifications, i.e. classification is performed for each pathway.
Method assessment
Figure 7 shows the quality assessment of the 32 surveyed methods using the following criteria: (i) validation, (ii) stability, (iii) installation, (iv) user-friendliness, (v) documentation and (vi) tutorial. For each criterion, a method is scored from one (worst) to five (best). Overall, there are 12 methods that have an average score above 4.0. These include ReactomeGSA, ActivePathways, PaintOmics 3, mitch, BLMA, PathwayPCA, Mergeomics, iODA, Subpathway-GM, multiGSEA, CPA and MAPE. Among these, five are GUI tools (ReactomeGSA, PaintOmics 3, Mergeomics, iODA and CPA), while the remaining are stand-alone packages. The details of the assigned scores for each method are provided in Supplementary Table S2.

Assessment of 32 surveyed methods in terms of the validation, stability, installation, user friendliness, documentation and tutorial. The score of each metric ranges from one () to five (
). Each metric has a different color. The methods are sorted according to their average score in an ascending order. The horizontal axis shows the average score for each method. There are 12 methods that have an average score above 4.0: ReactomeGSA, ActivePathways, PaintOmics 3, mitch, BLMA, PathwayPCA, Mergeomics, iODA, Subpathway-GM, multiGSEA, CPA and MAPE.
First, the validation metric refers to the quality of the validation reported in the original paper of each method. A method is scored five if the corresponding paper presents an in-depth analysis using at least two case studies of high-quality, real datasets. We deduct points if the reported validation has low quality or only has one case study. Further, if a method is not validated at all, the validation score is one. Note that most methods provide at least one real case study except InCroMAP, clusterProfiler4.0 and INMEX.
Second, the stability metric indicates how stable the methods are. In order words, this metric assesses how smoothly a method executes an analysis without crashing or having bugs/errors. For each method, we create datasets that are compatible with the input required by each tool. For example, an input object of multiGSEA can have up to three types of data as this software supports the integration of transcriptome, proteome and metabolome. For each method, we create 10 such datasets of different sizes (genes and samples) and execute them to quantify their stability and investigate whether they have bugs, errors or crashes. If any unexpected problem happens during running or any failure arises in different modules of a tool, we lower the score. As shown in the plot, most methods can complete their analysis without problems. For tools that are not available at the time of writing this review (e.g. PARADIGM, rPAC, DANUBE CancerMA), we consider a score of one.
Third, the installation metric refers to how straightforward it is to install a package or software. Most of the available methods have a high score in this metric. We deduct points for some methods (e.g. GSOA, microGraphite, mirIntegrator, MOSClip) because they require users to manually install many packages that can potentially cause difficulties. We give a score of one to methods that cannot be installed, or require components that are not available anymore. We also give a score of one for tools that are not available or not accessible at the time of writing this manuscript.
Fourth, the user-friendliness metric shows how the tool is well designed from users’ perspective. For web-based tools and software with a graphical user interface (GUI), this metric considers how well the architecture is planned so users can navigate through different tabs/modules to perform their analysis. Well-designed GUI tools receive the highest score (five) because they make it easy for users to analyze, visualize and interactively explore the results. For tools that do not have GUI, the highest score they can receive is four. For these methods, the score also takes into account the capability to plot the results or the efforts required from users to convert/pass data from one function to another. ReactomeGSA, PaintOmics 3, iODA, 3Omics, CPA and InCroMAP have the highest score in this metric.
Fifth, the documentation metric refers to the quality of software documentation. It takes into account how well each function and its parameters are documented for stand-alone software and how well different tabs/modules are explained for web-based tools and GUI software. ActivePathways, iODA, ReactomeGSA, PaintOmics 3, Mergeomics, PathwayPCA, clusterProfiler4.0, GeneTrail2, INMEX, mitch, InCroMAP and CPA are among the methods that have high-quality and thorough documentation. At the same time, microGraphite, pathwayMultiomics, PARADIGM, GSOA, IMPaLA and 3Omics are methods that do not present any documentation.
The last metric, tutorial, indicates whether the authors provided an elaborated, step-by-step tutorial that can be easily followed by users. ReactomeGSA, PaintOmics 3, ActivePathways, BLMA, iODA, Mergeomics, PathwayPCA, CPA, multiGSEA, clusterProfiler4.0, MOSClip, GeneTrail2, mirIntegrator and INMEX receive the highest score because they have a high-quality tutorial. In contrast, GSOA, 3Omics, pathwayMultiomics, MarVis-Pathway, IMPRes-Pro and PARADIGM receive low scores because they only provide a brief description of how to perform an analysis. Methods that are not available/accessible receive a score of one.
Summary and discussion
To provide readers with a comprehensive and compact view, we summarized the important details of the 32 surveyed methods in Figure 8. The summary figure emphasizes techniques used in three core modules for integrative pathway analysis: (i) network construction, (ii) pathway statistics computation and (iii) statistics combination. The network construction module refers to the way each method transforms existing pathways. The next module, pathway statistics computation, refers to techniques used to compute pathway statistics, including their enrichment score and P-value. The third module, statistics combination, provides the techniques used to combine the statistics across multiple readouts. The combination can be performed either at the gene-level or pathway-level. The last column of the figure represents the overall performance score for each method (in scale of one to five as reported in Section Method Assessment). Additionally, the pros and cons for each category are also provided in the figure.

Three core modules of the 32 surveyed methods, their pros and cons, and overall performance score (from one to five). The network construction module represents all activities performed by each method for expanding/transforming the pathway annotation graph. The statistics computation module includes techniques employed by each method for computing statistics at the pathway-level. The score combination module includes each method’s strategy in combining the computed statistics at the gene or pathway-level. Most methods combine the statistics at the pathway-level, except those in the gene-level integration category. Methods with an asterisk (*) also support pathway-level combination.
It is generally accepted that the emergence of a specific phenotype is not a straightforward and predetermined process going from DNA to RNA, to proteins, to downstream biological function, but rather involves complex interactions of factors playing at different levels (e.g. gene expression, variants, microRNA, long noncoding RNA and methylation). For instance, integrating miRNA and mRNA expression profiles results in a better understanding of disease phenomena, both in biomarker discovery [146–148] and pathway analysis [32, 149]. DNA methylation has also been recognized as playing a crucial role in complex diseases [150–153]. There are many other compelling pieces of evidence in mouse, cell lines and human studies demonstrating that even the integration of only two types of data (genotype and mRNA) allows one to successfully connect complex phenotypic traits to inherited and non-inherited factors [154–159]. In many cases, a multi-omics strategy is a must for a systematic understanding of affected pathways [160–162].
Compared with single-omics analysis methods, integrative pathway analysis approaches offer a number of important advantages. First, multi-omics integration allows researchers to observe the full regulatory landscape of pathways from the regulations that occur in different omics layers. Essentially, biological pathways involve a range of different biomolecules, and the flow of information as a response to a specific disease is not confined to a single-omics layer. Therefore, changes in biological pathways can only be accurately identified and comprehensively observed by analyzing multi-omics data. For example, many signaling pathways include phosphorylation reactions that are relevant for regulating the activity of proteins and are not reflected at the transcriptomics level [163]. Hence, analyzing transcriptome data alone might be insufficient. Second, multi-omics integration has the potential to reveal the key insights into underlying pathway mechanisms and uncover the cross-omics relationships, which cannot be made apparent through single-omics studies. Many integrative approaches augment the pathways to obtain a more comprehensive representation of biological pathways and multi-omics interactions. They account for pathway topology and multi-omics interactions by modeling pathways as networks of genes and their products. In this case, multi-omics integration helps researchers to better connect genotype to phenotype and provides novel scientific evidence on disease development, or treatment targets, that can be then tested in further molecular studies [164, 165]. Finally, data integration can potentially increase the statistical power and confidence level of the results [163, 165–167]. All of the integrative pathway analysis methods surveyed in this article provide case studies demonstrating the advantages of multi-omics integration over the analysis of a single omics type or experiment.
However, we would like to note that it is not always beneficial to add more omics layers to an analysis. If an omics layer is not relevant to the underlying condition, adding it to the analysis can increase noise and thus make the analysis less accurate. In an attempt to show the differences between single-omics and multi-omics analysis, Canzler et al. [163] provide an extensive survey. They used two case studies, mitochondrial response and murine hepatocyte datasets, to investigate the benefit of multi-omics integration (transcriptome, proteome and metabolome). In the mitochondrial study, the multi-omics integration produced more significant pathways that are relevant to the condition than analyzing transcriptome alone (single-omics). However, in the murine hepatocyte dataset, the analysis of transcriptome data produced more significant pathways that are related to the underlying condition. This shows that multi-omics integration is not always the best option in all conditions. Therefore, choosing the most relevant omics types to include is important before conducting any kind of integration.
Furthermore, there are some outstanding challenges in the context of integrative pathway analysis that necessarily need to be taken into account. First, it should be noted that the accuracy of all analysis methods is highly dependent on data processing and quality control. Multi-omics data are stochastic and heterogeneous in nature that can be attributed to a variety of factors, including technical variability (assays and data type), biological heterogeneity and study bias (sample collection and preparation, experimental design). These factors can greatly affect the results. Therefore, having consistency among input omics data in terms of applied assay, processing protocol and experiment design is very important. In order to ensure the accuracy of the analysis, each pathway analysis technique should evaluate the consistency of the provided data, either through an automatic process or by requiring users’ acknowledgment.
Second, analysis results also rely on the degree of effectiveness of each type of omics. It is trivial that biological markers may be affected by a disease differently, and as a result, the amount of carried data by each type of omics could vary. However, all reviewed approaches consider the same weight for all omics types. This argument can be extended further when the input sizes between omics are considerably different, i.e. the larger the dataset, the more information can be extracted. To overcome this challenge, one can examine how each omics type affects the studied condition and then, based on the analysis results, consider a degree of effectiveness or significance for each type of omics. Some statistical tests have been designed specifically for this purpose. One such test is weighted Kolmogorov–Smirnov, which allows for applying a weighting strategy to both the effect of omics types and different input sample sizes.
Third, the inter-dependency among genes and multi-omics layers is another critical factor. These dependencies explain the natural interactions between biological compounds and their omics layers. Therefore, it is important to consider the omics dependencies in designing the pathway analyzer, as it can substantially improve the quality of omics data processing. Among the discussed analysis approaches, pathway graph transformation and visualization categories cover the omics inter-dependencies and utilize these relations in their analysis. However, despite the more powerful analysis, the model becomes more complicated. Therefore, this can be considered a trade-off between the power of the analysis and its computational complexity.
Fourth, it is undeniable that all of the reviewed approaches allow for only two experimental conditions. Depending on the underlying biological process or disease, it may be required in some experiments to have multiple conditions and study them simultaneously. One solution is to analyze every combination of pairs of experimental conditions separately and then merge the obtained results. Another solution is to search for statistical approaches that allow for the analysis of multiple conditions and then extend them to be suitable for analysis.
Fifth, combining different pathway databases has still remained challenging. The existing knowledge regarding pathway annotations is scattered among pathway databases, yet none of them alone is complete. Therefore, exploiting the complementary knowledge available in multiple databases can improve the accuracy and statistical power. Some of the proposed pathway analysis methods attempt to merge multiple databases to cover more information and give users the power to benefit from a more comprehensive database in their analysis. However, the major drawback is that each database uses different identifiers for pathway annotations. Even in some databases, the same entity (gene, metabolite or protein) is annotated with different identifiers, causing redundancy in data.
PathwayCommons [168] is exemplary in this aspect, and it is one of the few comprehensive human pathway repositories. At the time of writing this review, this database has aggregated 5,772 pathway annotations and 2,424,055 interactions from 22 distinct pathway databases, including widely known databases such as KEGG, Reactome, BioCyc, miRTarBase and MSigDB. In comparison with the number of pathways in popular pathway databases such as KEGG (less than 400 pathways) and Reactome (less than 3,000), the number of pathways in PathwayCommons demonstrates how extensive this database is. Also, PathwayCommons annotates all pathways using HGNC or UniProt identifiers; thus, it unifies all the pathways’ definitions from different repositories. Several pathway analysis and visualization tools [136, 169, 170] have been developed based on pathway information from the PathwayCommons database. However, none of them are yet capable of integrating multi-omics and/or multi-cohort. Exploring this comprehensive database can benefit researchers in their future development of integrative pathway analysis approaches.
Lastly, the ability to interpret the analysis results should be taken into consideration. Methods with interactive visualization capability may provide interactive interfaces for representing the analysis result (as noted in the last column of Table 2). However, it is still difficult to follow the chains of reactions or signaling cascades in large networks of hundreds or thousands of nodes. One solution to alleviate this problem is to generate sub-networks instead of creating a vast network. Using this strategy, users can easily follow the interactions and inspect the corresponding omics layers. Another solution is to provide a three-dimensional depiction of the entire network. This three-dimensional view will enable a more immersive interaction with the network. For example, users could rotate the perspective to better view the network components or inspect each part of the network.
Conclusion
In this survey, we systematically review 32 integrative pathway analysis methods. The overall pipeline of these methods typically includes data preprocessing, ID mapping, pathway augmentation, differential analysis, pathway analysis and visualization. We categorize these multi-omics pathway analysis approaches into four different categories based on their principal concepts and techniques: (i) gene-level, P-value-based integration, (ii) pathway-level, P-value-based integration, (iii) graph-transformation-based and (iv) machine-learning-based tools. We discuss the pros and cons of each category, as well as the overall advantages multi-omics integration over single-omics analysis. We also assess the practicality of each method using six different metrics. Our main objective is to help potential users, especially life scientists, to choose a method that is most suitable for their available data and analysis purpose. Finally, we identify the shortcomings of existing approaches with the goal of helping computational scientists to develop new methods that address the current limitations.
Pathway analysis is important because it provides insights into the biology underlying phenotypes beyond the detection of differentially expressed genes or proteins.
Multi-cohort analysis (meta-analysis) increases statistical power, while multi-omics integration integrates different types of omics data to understand complex biological processes that involve multiple omics levels.
This article reviews and discusses in-depth 32 pathway analysis methods using multi-cohort and multi-omics data: their availability, supported databases and omics types, hypothesis testing techniques and integration strategies.
This article points out pros and cons of integrative pathway analysis methods, as well as assesses each method’s practicality, and discusses outstanding challenges.
This article will assist life scientists in selecting suitable methods for their analysis purposes, as well as computational scientists in identifying shortcomings of current methods.
Acknowledgments
This work was partially supported by NSF under grant numbers 2203236 and 2141660. Any opinions, findings and conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of any of the funding agencies.
Data Availability
The data used in Method Assessment can be found at Gene Expression Omnibus database (Accession IDs: GSE104749, GSE55945 and GSE6956). Other data are downloaded from software package provided by the authors of the papers in this review.
Author Biographies
Zeynab Maghsoudi is a PhD student in the Department of Computer Science and Engineering at the University of Nevada, Reno, Nevada, USA. Her research foci include pathway analysis and multi-omics data integration.
Ha Nguyen is a PhD student in the Department of Computer Science and Engineering at the University of Nevada, Reno, Nevada, USA. His research foci include single-cell data deconvolution and systems-level analysis.
Alireza Tavakkoli is an Associate Professor in the Department of Computer Science and Engineering at the University of Nevada, Reno (UNR), Nevada, USA. His research focuses on developing deep machine learning and statistical methods for a variety of domains, including medicine, biology, behavioral science and computational neuroscience.
Tin Nguyen is an Assistant Professor in the Department of Computer Science and Engineering at the University of Nevada, Reno (UNR), Nevada, USA. His research focuses on developing machine learning and statistical methods that can be applied to bioinformatics and computational biology.
Reference
Odom G, Ban J, Liu L, Wang L, and Chen S.