Jin Zhang, Yan Yang, Muheng Shang, Lei Guo, Daoqiang Zhang, Lei Du, for the Alzheimer’s Disease Neuroimaging Initiative, Mutual-assistance learning for trustworthy biomarker discovery and disease prediction, Briefings in Bioinformatics, Volume 26, Issue 2, March 2025, bbaf178, https://doi.org/10.1093/bib/bbaf178
Abstract
Integrating and analyzing multiple omics datasets, such as genomics, environmental influences, and imaging endophenotypes, has yielded an abundance of candidate biomarkers. However, translating such findings into beneficial clinical knowledge for disease prediction remains challenging. This becomes even more challenging when studying interpretable high-order feature interactions such as gene-environment interaction (G
Introduction
Human diseases, such as Alzheimer’s disease (AD) and cancer, are inherently complex and governed by a multitude of factors encompassing genetics, endophenotypes, and environmental influences [1, 2]. Genomic data reflect the innate predisposition of an individual, while endophenotypes such as neuroimaging data and proteomic expressions provide insights into the individual’s current disease status [3, 4]. In addition, environmental influences capture distinct disease information from environmental and behavioral aspects. Therefore, integrating these diverse modalities could help discern causal effects in disease pathogenesis more accurately and thereby support effective strategies to slow or halt disease progression [5].
Imaging genetics is a rapidly evolving field for exploring the genetic architecture of brain endophenotypes [6–10]. During the past decade, univariate, multivariate, and bi-multivariate techniques have been harnessed to study genetic and phenotypic correlations. However, they predominantly focus on the main effects of SNPs and thus may overlook important heritability factors [11, 12]. To address this limitation, non-linear learning methods such as kernel techniques and deep neural networks have been introduced to model feature interactions [2, 13–15]. Yet these approaches often lack interpretability and may not consistently recover the optimal feature effects and their interactions [16–18]. Hence, there is a pressing need for an efficient method to identify interpretable gene–environment interactions, enabling deeper exploration of the intricate relationships among genotype, environment, and imaging phenotypes in the context of brain disorders [19–24].
Intuitively, integrating and analyzing genomics, exposomes, and endophenotypes offers the potential to obtain better diagnostics and therapeutics. However, directly fusing these datasets and exploring gene–environment interactions on endophenotypes presents several challenges [25–32]: spurious correlations that undermine biological explanation, unstable diagnostic prediction, and the high dimensionality of genetic variations, all of which could bias subsequent healthcare decision-making [33, 34]. Variable selection is a prominent area in machine learning and biomarker discovery. Many attractive solutions exist, including Lasso, elastic net, smoothly clipped absolute deviation, and the minimax concave penalty (MCP), but they may perform inadequately when dealing with highly correlated variables and can lead to misleading biological knowledge [35–40]. In addition, causal inference, e.g. propensity score matching and confounder balancing, serves as a potent statistical tool for selecting explanatory variables [41–43]. Nevertheless, variable selection can exhibit substantial variability when confronted with high-dimensional features and limited sample sizes. Although it is promising to marry causal feature selection and disease prediction, two challenges remain. First, removing spurious correlations among genetic variants due to linkage disequilibrium (LD) for causal biomarker discovery is difficult [43, 44], because many unexpected feature correlations, especially in high-dimensional settings, hinder feature selection. Second, feature decorrelation and disease prediction are distinct tasks with differing objectives, so leveraging causal feature selection for disease prediction is also challenging [44, 45]. Thus, developing a task-oriented feature decorrelation framework is essential.
However, designing a scalable feature decorrelation method for prediction tasks remains complex due to the inherent difficulty of directly aligning feature decorrelation with prediction objectives [46, 47].
To this end, we draw on the idea of mutual-assistance (MA) learning and propose a simple yet versatile method, referred to as mutual-assistance causal biomarker discovery and stable disease prediction (MA-CBxDP), to address the aforementioned challenges (refer to Fig. 1). In particular, MA-CBxDP incorporates an interpretable bi-directional mapping structure that embeds causally driven feature interactions into the disease prediction task. Interestingly, biomarker discovery plays a pivotal role in ensuring that disease prediction models achieve robust and representative performance. Conversely, accurate disease prediction facilitates the precise identification of disease-associated biomarkers. Notably, the synergistic interplay between biomarker discovery and disease prediction has the potential to mutually enhance their effectiveness, ultimately offering novel and reliable insights into the pathogenic mechanisms underlying chronic disorders.

A schematic illustration of imaging endophenotypes-assisted gene-environment interaction analysis. First, the bi-directional mapping framework incorporates genetic variations, exposomes, and endophenotypes simultaneously. The feature-interaction module identifies biologically meaningful genetics, environment as well as G
Furthermore, considering the computational complexity of high-dimensional gene–environment interactions, we develop a fast optimization algorithm in the spirit of divide-and-conquer. Two publicly available datasets, i.e. the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset and a multimodal breast cancer (BC) dataset, were employed to demonstrate the effectiveness and superiority of MA-CBxDP.
Materials and methods
As mentioned above, we draw on the idea of MA learning and accordingly integrate the disease progression and biomarker identification tasks into a whole, which could benefit both subtasks mutually. For ease of presentation, genetic variations and environmental factors are denoted as
In our model, the first term is a bi-directional mapping framework, integrated with a feature interaction module, to extract the co-expression patterns across different modalities. It is well known that SNPs within an LD block are commonly correlated due to shared ancestry [44]. Consequently, the genotype–phenotype correlation may be the indirect result of a correlated functional variant, potentially leading to the misidentification of the actual causal variants. Hence, we design a causal decorrelation module, i.e.,
Moreover,
To sum up, our model first uses the bi-directional mapping module which makes it easier and more reasonable to incorporate genetic variations, environmental influences, and endophenotypes simultaneously. We also present a feature interaction operation, where G
Extension to chromosome-wide analysis
The direct application of MA-CBxDP to chromosome-wide or genome-wide analysis presents challenges due to the computational intensity of genomics and G
The matrix concatenation operator

Convergence analysis
The objective value of Algorithm 1 decreases monotonically in each iteration.
The optimization process reveals that the sub-objectives for
Results
Experimental setup
State-of-the-art methods: We compared MA-CBxDP with the most closely related state-of-the-art methods, including SMCCA, adaptive SMCCA, and RelPMDCCA. These techniques are emblematic of computational imaging genetics methods and can be reduced to specific genotype–phenotype analytical methods [16–18].
Evaluation criteria and parameter setting: Performance was assessed with three criteria: the identified feature subsets, the canonical correlation coefficient (CCC), and the root mean squared error (RMSE). We utilized nested five-fold cross-validation to optimize parameters within the candidate set
Application to Alzheimer’s disease
Dataset: Data used in the preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). The ADNI was launched in 2003 as a public–private partnership, led by Principal Investigator Michael W. Weiner, MD. The primary goal of ADNI has been to test whether serial magnetic resonance imaging, positron emission tomography, other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment (MCI) and early AD.
The dataset included imaging, genetics, proteomic markers, and environmental factors. In detail, gray matter volumes from regions of interest were extracted using voxel-based morphometry (VBM), while cortical thickness values were obtained via FreeSurfer; the two resulting datasets are referred to as VBM and FreeSurfer, respectively. Environmental factors, including human behavior variables (e.g. age, body mass index, visual ability, alcohol abuse, drug sensitivities, blood pressure, smoking, and stroke), were recorded. Proteomic analyses were conducted using the rules-based medicine proteomic panel, yielding 146 markers after quality control. Cognitive scores from ADAS, MMSE, and RAVLT assessments were also collected. Additionally, 10 000 SNPs were selected from the ADNI database and encoded with additive coding based on minor allele counts. To ensure data consistency and reliability, SNPs, proteomic markers, and neuroimaging data were normalized to z-scores, eliminating scale-related biases.
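The two preprocessing steps named above, additive coding and z-score normalization, can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the genotype strings, the choice of `G` as the minor allele, and the use of the population standard deviation are all assumptions made for the example.

```python
from statistics import mean, pstdev

def additive_code(genotypes, minor_allele):
    """Count copies of the minor allele (0, 1, or 2) in each genotype string."""
    return [g.count(minor_allele) for g in genotypes]

def zscore(values):
    """Standardize to zero mean and unit standard deviation.

    Assumption: the population standard deviation is used; the source does
    not say whether the population or sample version was applied.
    """
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant feature carries no scale information.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# Hypothetical genotypes for one SNP across five subjects.
genotypes = ["AA", "AG", "GG", "AG", "AA"]
counts = additive_code(genotypes, minor_allele="G")
standardized = zscore(counts)
```

After standardization every feature has the same scale, so the magnitude of a learned weight is comparable across SNPs, proteomic markers, and imaging measures.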
Improved multimodal correlation
Figure 2 presents the mean and standard deviation of the correlation (CCC) and regression error (RMSE) for all methods on four genotype–phenotype correlation tasks: SNP-VBM, SNP-FreeSurfer, SNP-Plasma, and SNP-CSF. Higher CCC and lower RMSE denote better performance. As expected, MA-CBxDP exhibited the highest CCCs and lowest RMSEs. Moreover, our method achieved smaller standard deviations than the baselines, suggesting that the causality-driven bi-directional mapping framework yields more stable predictions. Unsurprisingly, ignoring GE interactions led to sub-optimal performance, emphasizing the importance of modeling feature interactions rather than focusing solely on the main effects.

Comparison of testing (a) CCCs and (b) RMSEs for genotype–phenotype relationship. Without GE means that the GE interaction component was deleted.
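For readers unfamiliar with the two metrics, the sketch below shows one common way to compute them: the CCC as the Pearson correlation between the two projected views, and the RMSE between observed and predicted values, each summarized over folds by mean and standard deviation. The weight vectors `u` and `v`, the toy matrices, and the fold values are hypothetical; the exact evaluation code used in the study is not given in the text.

```python
import math
from statistics import mean, stdev

def project(X, w):
    """Return the canonical variable X @ w for a row-major list of rows."""
    return [sum(x * wj for x, wj in zip(row, w)) for row in X]

def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def ccc(X, u, Y, v):
    """Canonical correlation coefficient: corr(X @ u, Y @ v)."""
    return pearson(project(X, u), project(Y, v))

def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    return math.sqrt(mean([(t - p) ** 2 for t, p in zip(y_true, y_pred)]))

# Toy data: the second view is exactly twice the first SNP column.
X = [[1, 0], [0, 1], [1, 1], [2, 1]]
Y = [[2.0], [0.0], [2.0], [4.0]]
u, v = [1.0, 0.0], [1.0]
toy_ccc = ccc(X, u, Y, v)
toy_rmse = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])

# Hypothetical per-fold CCCs, summarized as (mean, standard deviation).
fold_cccs = [0.5, 0.6, 0.7]
summary = (mean(fold_cccs), stdev(fold_cccs))
```

A small standard deviation across folds is what the text refers to as stable prediction.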
Improved prediction of clinical scores and disease status classification
We further investigated the prediction and diagnosis performance using the top ten selected features, including SNPs and endophenotypes. In Fig. 3a, our method outperformed the baseline approaches in predicting clinical scores (ADAS, MMSE, and RAVLT) with linear regression, achieving the highest CCC and the lowest standard error. This highlights the effectiveness of the causality and cooperative prediction modules in delivering stable predictions. Furthermore, the LIBSVM software package was employed to classify healthy controls (HCs), MCIs, and ADs. Figure 3b displays the mean and standard deviation (std) of accuracy (ACC), where MA-CBxDP achieved the highest testing ACC and the smallest std. This highlights the superior predictive ability of the top-ranked features in MA-CBxDP. Together, these findings underscore the value of our MA framework in enhancing disease classification and cognitive prediction. In addition, the model that included GE interactions exhibited the best predictive performance, indicating that feature interaction is necessary and has a positive effect.

Comparison of testing (a) CCCs and (b) ACCs on ADNI dataset for predicting clinical scores and disease status classification. Without GE means that the GE interaction component was deleted.
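The text does not state how the top ten features were ranked. One plausible reading, sketched below purely as an assumption, is to average each feature's weight over the cross-validation folds and keep the features with the largest absolute mean weight. The feature names and weights are hypothetical.

```python
def top_k_features(fold_weights, names, k=10):
    """Rank features by the absolute value of their mean weight across folds.

    fold_weights: one weight vector per cross-validation fold.
    names: feature names, aligned with the weight vectors.
    Assumption: ranking by |mean weight| is an illustrative choice, not a
    rule confirmed by the source.
    """
    n_folds = len(fold_weights)
    mean_w = [sum(w[j] for w in fold_weights) / n_folds for j in range(len(names))]
    order = sorted(range(len(names)), key=lambda j: abs(mean_w[j]), reverse=True)
    return [names[j] for j in order[:k]]

# Hypothetical weights for three features over two folds.
names = ["snp_a", "snp_b", "snp_c"]
fold_weights = [[0.9, 0.0, -0.4], [0.7, 0.1, -0.6]]
selected = top_k_features(fold_weights, names, k=2)
```

The selected subset would then be passed to a downstream regressor or classifier, as described in the paragraph above.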
Biomarkers identification and explanation
Main and interactions effects: The heat map in Fig. 4 showed the top selected SNPs. Based on the color bar, we can determine the relative importance of the features. MA-CBxDP identified multiple AD-related risk loci, including well-known rs429358 (APOE). In addition, MA-CBxDP also identified rs56131196, rs4420638, and rs12721051 (all in APOC1 [48–50]), which was supported by

Weights (mean value) of SNPs from five-fold cross-validation. Each row is a method: (1) SMCCA; (2) Adaptive SMCCA; (3) RelPMDCCA; (4) MA-CBxDP.
In addition to the main effects, our method also uncovered significant GE interactions, with the top five interactions identified as follows: (rs75654248, smoking), (rs75654248, alcohol abuse), (rs440446, visual impairment), (rs390082, drug sensitivities), and (rs1064725, alcohol abuse). First, we observed that the interaction between rs75654248 and smoking was associated with AD; rs75654248 has previously been reported as an AD risk locus, and smoking is also an important factor that could contribute to AD. These interactions can provide promising evidence for precise diagnosis, because the co-occurrence of these abnormalities may help clinicians diagnose at-risk individuals with greater confidence. We also conducted an ablation study by removing the feature interaction module from MA-CBxDP (Table 1). Unsurprisingly, optimal performance was achieved when both the main effects and the interaction effects of genotype and environment were considered, underscoring the importance of their joint consideration. Beyond diagnosis, some of these SNP–environment interactions may help in treatment, either by slowing the disease or by counseling family members who may carry the same variant to reduce their risk.
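To make the notion of a GE interaction feature concrete, the sketch below builds pairwise SNP-by-environment product terms, which is the conventional way to encode such interactions in a linear model. Whether MA-CBxDP forms its interaction term in exactly this way cannot be confirmed from the text, so the construction, the function name, and the toy values are assumptions.

```python
def interaction_features(G, E, snp_names, env_names):
    """Build all pairwise SNP x environment product columns.

    G: subjects x SNPs matrix (e.g. additive minor allele counts).
    E: subjects x environmental-factor matrix.
    Returns the interaction column names and the subjects x (p*q) matrix.
    """
    names = [f"{s} x {e}" for s in snp_names for e in env_names]
    rows = [[g * e for g in g_row for e in e_row] for g_row, e_row in zip(G, E)]
    return names, rows

# Hypothetical data: two subjects, two SNPs, two binary exposures.
G = [[0, 2], [1, 1]]
E = [[1, 0], [0, 1]]
ge_names, GE = interaction_features(G, E, ["snp_1", "snp_2"], ["smoking", "alcohol"])
```

With p SNPs and q environmental variables this produces p*q columns, which illustrates why the interaction space grows quickly in high-dimensional genetic data.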
Ablation studies on main modules of different design choices on ADNI database
ID | PCCC | MCCC
---|---|---
(a) | 0.66 | 0.29
(b) | 0.67 | 0.31
(c) | 0.68 | 0.32
(d) | 0.70 | 0.34
(e) | 0.72 | 0.36
Phenotype feature explanation
Identification and interpretation of neuroimaging-derived phenotype markers (VBM): Identifying intermediate phenotypes affected by AD is also crucial for computer-aided diagnosis. In our study, Figs 5 and 6 present the selected neuroimaging phenotypes and follow-up analyses, while Fig. 7 illustrates the feature selection for the other phenotypes. In detail, the top neuroimaging-derived phenotype markers identified by MA-CBxDP (Figs 7a and 5a) were mapped onto the brain for visualization. Notably, MA-CBxDP identified key brain regions such as the left and right hippocampus and the right lingual area. Research has indicated the importance of the hippocampus in memory and its significance in diagnosing AD. Further experiments on various phenotypes reinforced the efficacy of our method. In contrast, the baselines could not identify reliable biomarkers, hindering their performance in AD diagnosis, as illustrated in Fig. 7a. These results demonstrate the superior capability of MA-CBxDP in identifying AD-affected phenotypes.

(a) Visualization of identified brain imaging QTs. (b) Heatmap of pairwise correlation between top ten SNPs and endophenotypes, where symbol “

Investigation of the mediating effects of endophenotypes characteristics on diagnostic phenotypes caused by genetic variants (a) rs429358 (b) rs6857. (*,

Biomarker discovery for important endophenotypes on ADNI. (a) Neuroimaging-derived phenotypes (VBM). (b) Neuroimaging-derived phenotypes (FreeSurfer). (c) Plasma-derived proteomic markers. (d) CSF-derived proteomic markers. Each row is a method: (1) SMCCA; (2) adaptive SMCCA; (3) RelPMDCCA; (4) proposed.
Identification and interpretation of neuroimaging-derived phenotype markers (FreeSurfer): The feature selection results in Fig. 7b show the neuroimaging-derived phenotypes (FreeSurfer) identified by MA-CBxDP. Notably, MA-CBxDP assigned elevated weights to a distinct subset of FreeSurfer markers, RPallvol and LEntCtx, which have previously been associated with elevated AD risk. This recurrent recognition of AD-related neuroimaging-derived biomarkers underscores the effectiveness of MA-CBxDP in precisely and reliably identifying disease-relevant markers.
Identification and interpretation of plasma-derived proteomic markers: Fig. 7c shows the weights of the plasma-derived proteomic markers. Through the causal variable decorrelation module, our approach effectively pinpointed various proteomic markers linked to AD, such as ApoE, ApoB, and CRP. Conversely, while the comparison methods also detected some AD-associated proteomic markers, they yielded many extraneous signals, impeding a coherent interpretation.
Identification and interpretation of CSF-derived proteomic markers: The heatmap displayed in Fig. 7d delineated the canonical weights assigned to cerebrospinal fluid (CSF)-derived proteomic markers. Notably, MA-CBxDP effectively recognized CSF-derived proteomic markers associated with AD, including FGF-4, ApoD, and ApoE, among others. In contrast, benchmark methods generated a multitude of markers, complicating the interpretation process. Through the amalgamation of data from both plasma and CSF datasets, we can confidently assert that MA-CBxDP surpassed benchmark methods, highlighting its superior capability in identifying reliable markers of proteomic expression.
Genotype–phenotype correlation detection
To further illustrate the biological effect, a genotype–phenotype correlation analysis of neuroimaging-derived phenotype markers (VBM) was conducted. In Fig. 5b, most genotype–phenotype correlations were significant. Of note, the association between rs429358 and the left hippocampus reached the significance level, reinforcing the genetic impact on hippocampal abnormalities. In addition, the extended experiments on other genotype–phenotype correlations also demonstrated the superiority of our framework, which could provide comprehensive insights into the potential pathological mechanisms of AD. To sum up, these analyses highlight the capability of MA-CBxDP in discerning causal genotype–phenotype relationships for AD pathogenesis.
We have independently shown the relevance of the identified SNPs and imaging QTs. Beyond asking whether these biomarkers are associated with disease, we further sought to uncover the causal relations among SNPs, imaging QTs, and diagnostic phenotypes for the detected genotype–phenotype pairs of AD (Fig. 6). We examined the first two genotype–phenotype pairs, i.e. (rs429358, hippocampus) and (rs6857, hippocampus), and performed mediation analyses to establish the causal relationships among genetic variants, endophenotypes, and diagnostic phenotypes. Mediation analysis is a statistical method used to explore and quantify the mechanisms through which a causal effect operates. This framework is grounded in causal inference and is particularly useful for understanding how or why an exposure affects an outcome. We checked whether the genetic effects on diagnosis could be explained by endophenotypes. Specifically, the genetic factor is treated as the input variable, each diagnostic phenotype serves as the outcome variable, and the mean endophenotype (hippocampus) acts as the mediator.
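The mediation setup described above can be illustrated with the classical product-of-coefficients decomposition for linear models. This is a generic sketch rather than the procedure used in the study: it treats the outcome as continuous, omits covariates and significance testing, and uses made-up values for the SNP dosage `x`, the mediator `m`, and the outcome `y`.

```python
from statistics import mean

def centered_cross(a, b):
    """Sum of products of the mean-centered sequences a and b."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b))

def mediation_effects(x, m, y):
    """Decompose the effect of x on y into direct and mediated parts.

    Fits three ordinary least squares models with intercepts:
      m ~ x       gives a (exposure to mediator)
      y ~ x       gives the total effect
      y ~ x + m   gives the direct effect and b (mediator to outcome)
    The indirect (mediated) effect is the product a * b.
    """
    sxx, smm = centered_cross(x, x), centered_cross(m, m)
    sxm, sxy, smy = centered_cross(x, m), centered_cross(x, y), centered_cross(m, y)
    a = sxm / sxx
    total = sxy / sxx
    det = sxx * smm - sxm ** 2  # zero only if x and m are perfectly collinear
    direct = (smm * sxy - sxm * smy) / det
    b = (sxx * smy - sxm * sxy) / det
    return {"a": a, "b": b, "direct": direct, "indirect": a * b, "total": total}

# Hypothetical values: SNP dosage, mean hippocampal measure, cognitive score.
x = [0, 0, 1, 1, 2, 2]
m = [5.0, 4.8, 4.1, 4.3, 3.2, 3.4]
y = [28.0, 29.0, 26.0, 27.0, 22.0, 24.0]
effects = mediation_effects(x, m, y)
```

For least squares fits the total effect equals the direct effect plus the indirect effect, so the share of the genetic effect carried by the mediator is `indirect / total`. In practice the indirect effect is usually tested with a bootstrap or a Sobel-type test, which this sketch leaves out.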
By selecting the endophenotypes (hippocampus) as mediator variables, we find a significant correlation between rs429358 and diagnostic phenotypes (
Follow-up analyses: gene-set analyses
To validate the identified genetic loci, we conducted a gene-based analysis using MAGMA, which employs a multiple linear principal component regression model. MAGMA projects the multivariate LD matrix of SNPs within a gene to extract principal components that capture genetic variation. These components are then used as predictors in a linear regression framework to assess their association with the phenotype. Fisher’s test is applied to compute
Notably, genes APOE (
Follow-up analyses: functional mapping
To validate and interpret the identified biomarkers, we employed the Functional Mapping and Annotation (FUMA) platform. This tool facilitated functional mapping, prioritization, annotation, and interpretation of results from genome-wide association studies (GWAS). Significant SNPs were identified using meta-analysis summary statistics with a significance threshold of


(a) A Manhattan plot depicting the snp-based test. (b) A separate Manhattan plot was created for the gene-based test, the x-axis represents the SNP positions on the genome, while the y-axis represents the negative base-10 logarithm of the
Follow-up analyses: gene expression analyses
To explore the biological implications of the identified loci at the gene expression level, we utilized GENE2FUNC for gene expression analysis. We investigated gene expression patterns linked to the top 50 SNPs by annotating these SNPs to their respective genes. Utilizing data from the GTEx database (version 8) covering 54 tissue types and BrainSpan RNA sequencing data across 29 developmental stages, we generated heat maps to visualize gene expression, with each map showing the average normalized expression value for its associated labels. We present mRNA expression profiles of the prioritized genes associated with the top 50 SNPs across the 54 GTEx tissue types and the BrainSpan developmental stages of the human brain. The expression levels of genes such as APOE and APOC1 across tissues are shown in Fig. 10b. These genes exhibited distinct expression patterns throughout different life stages, as reflected in the BrainSpan data in Fig. 10a. Notably, APOE consistently exhibited high expression levels across all lifespan stages, while APOC1 showed elevated expression during late prenatal and postnatal stages, potentially contributing to AD. These results highlight the efficacy of MA-CBxDP in identifying reliable genetic variations associated with various human brain tissues across different lifespan stages.

Heat maps illustrating the normalized gene expression values, obtained through zero mean normalization of log2-transformed expression, are presented for the prioritized genes associated with the 50 SNPs. The lower panel represents data from GTEx v8 RNAseq, while the upper panel showcases BrainSpan data.
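The normalization named in the caption, zero-mean normalization of log2-transformed expression, can be sketched for a single gene as below. The pseudocount of 1 (which keeps zero expression values finite) and the toy expression values are assumptions for illustration; they are not stated in the text.

```python
import math

def normalize_expression(values, pseudocount=1.0):
    """Log2-transform one gene's expression values and center them at zero.

    values: expression of a single gene across labels (tissues or stages).
    pseudocount: added before the log so that zero expression stays finite
    (an assumption; the source does not specify it).
    """
    logged = [math.log2(v + pseudocount) for v in values]
    mu = sum(logged) / len(logged)
    return [x - mu for x in logged]

# Hypothetical expression of one gene in four tissues.
normalized = normalize_expression([0.0, 1.0, 3.0, 7.0])
```

Because each gene is centered on its own mean, a heat map cell reads as higher or lower than that gene's average across labels rather than as an absolute expression level.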
Follow-up analyses for ADNI: phenome-wide association studies and enrichment analysis
To validate the phenotypic implications linked with SNPs pinpointed via MA-CBxDP, a phenotype-wide association analysis (pheWAS) was conducted utilizing data sourced from the publicly accessible GWAS Atlas32 (https://atlas.ctglab.nl). This investigation encompassed an extensive dataset comprising 4756 GWAS, potentially without an explicit focus on the SNP or gene under scrutiny.
PheWAS: First, a PheWAS was conducted on the lead SNP, rs429358, to explore its potential associations across a broad spectrum of phenotypes encompassing 28 domains. Fig. 11a reveals a significant linkage between the rs429358 locus and neurological phenotypes. In particular, significant relationships were established between rs429358 and neurological characteristics such as AD or dementia with paternal history. In addition, medication for cholesterol, blood pressure, and diabetes had significant associations with rs429358. Understanding the modifiable risk factors is paramount for designing personalized therapeutic interventions. Further, Fig. 11b summarizes the PheWAS results for another prominent lead SNP, rs56131196. Interestingly, illnesses of the father (AD/dementia), paternal history of AD, low-density lipoprotein cholesterol, and total cholesterol were significantly associated with rs56131196, providing valuable insights into its multi-functionality.

(a) The PheWAS analysis yielded results for SNP rs429358 (
In summary, the combined evidence from the pheWAS highlights the successful identification of the genetic underpinnings of neurological phenotypes through our MA-CBxDP approach.
Enrichment analysis: The top genes identified by our algorithm were examined for their biological significance. Fig. 12 displays the enrichment results for the top genes. Notably, most pathways involving these top genes are also associated with AD. For each GO term, the exact name and the corresponding GO ID as defined in the database are reported. The top GO terms are “neuron recognition,” “heart development,” “neuron projection development,” “cell junction organization,” and “regulation of neuron projection development.”

(a) Summary of enrichment analysis in GO biological processes. (b) Summary of enrichment analysis in DisGeNET.
In addition, enrichment analysis was conducted on DisGeNET ontology categories using Metascape. DisGeNET aggregates data from curated repositories, GWAS catalogs, animal models, and scientific literature to elucidate the genetic basis of human diseases. The analysis identified the top 20 significant biological processes, including “Alzheimer’s disease, focal onset,” “acute confusional senile dementia,” “behavioral variant of frontotemporal dementia,” and “memory performance.” These results offer insights into the biological mechanisms underlying AD and underscore the therapeutic potential of the identified causal gene biomarkers.
Application to breast cancer for generalization
Datasets
To evaluate the generalizability of our method, we applied MA-CBxDP to a multi-omics BC dataset [52] to predict treatment response and gain molecular insights. This study utilized clinical environment features (age, ER.status, HER2.status, Size.at.diagnosis, LN.at.diagnosis, etc), DNA data, tumor microenvironment, and pathway functional profiles of early and locally advanced BC patients, obtained from the TransNEO cohort at Cambridge University Hospitals NHS Foundation. To ensure data consistency and reliability, clinical environment features, DNA mutation, tumor microenvironment, and molecular pathway data were normalized to z-scores, eliminating scale-related biases. The goal was to predict RCB scores, which quantify residual disease after neoadjuvant (pre-surgical) therapy, using TiME and pathway activity as endophenotypes.
Multimodal correlation and predicting clinical scores: Fig. 13 depicts the mean and std of CCC for genotype–phenotype correlation and predicting clinical scores (CCC and RMSE). As expected, MA-CBxDP exhibited the highest CCCs for genotype–phenotype correlation. Furthermore, mutual assistance frameworks outperformed baselines in terms of CCCs and RMSEs when predicting clinical scores. This is attributed to the substantial enhancement achieved by integrating causality into prediction.

Comparison of testing CCCs and RMSEs (mean
Biological marker explanation
Main and interaction effects: An investigation of the biomarkers confirmed that most features were indeed relevant to BC, including TP53, PIK3CA, COL12A1, and RYR1. Notably, TP53 stands out as a critical variable for BC due to its pivotal role in cell cycle regulation and apoptosis. Beyond main effects, MA-CBxDP revealed noteworthy GE interactions, including (RYR1, age), (RYR1, LN at diagnosis), (PIK3CA, HER2 status), (RYR1, ER status), and (MT-ND5, Histology). Interestingly, both RYR1 and age were identified as important factors contributing to BC, and their interaction was also selected. Further investigation of these interactions could reveal novel clues to BC. As shown in Fig. 13, the best performance was achieved in genotype–phenotype correlation and clinical prediction when the main and interaction effects identification modules were included, highlighting the necessity of incorporating both aspects. To further demonstrate the efficacy of MA-CBxDP, post-analyses were conducted to examine the biological implications.
Follow-up analyses for breast cancer: phenome-wide association studies
An investigation into the biomarkers identified by MA-CBxDP indicated that most features were indeed relevant to BC, including TP53, PIK3CA, COL12A1, and RYR1. Notably, TP53 stood out as a critical variable for BC due to its pivotal role in cell cycle regulation and apoptosis. Beyond main effects, MA-CBxDP revealed noteworthy GE interactions, including (RYR1, age), (RYR1, LN at diagnosis), (PIK3CA, HER2 status), (RYR1, ER status), and (MT-ND5, Histology). Interestingly, RYR1 was detected in both main and interaction effects. Together, these findings provide promising biomarker tools for BC diagnosis and prediction.
This PheWAS used 17 361 dichotomous phenotypes and 1419 quantitative phenotypes from the AstraZeneca PheWAS portal database to perform the analysis at the gene level. PheWAS results can be interpreted as the association of genetically determined protein expression with specific diseases or traits. Figs 14 and 15 illustrate notable associations of TP53 and RYR1 with specific phenotypes. Specifically, TP53 was associated with neoplasm traits such as malignant neoplasms, stated or presumed to be primary, of lymphoid, hematopoietic, and related tissue. We also found significant associations between RYR1 and breast-related traits and malignant neoplasms, all linked to BC. This underscores the effectiveness of MA-CBxDP in pinpointing causal genetic variations associated with BC.

Traits significantly associated with TP53 using PheWAS portal.

Traits significantly associated with RYR1 using PheWAS portal.
Ablation study
We ran ablation studies to investigate the impact of each component on clinical prediction (PCCC) and correlation (MCCC).
Effect of sparse variable regularizer module: We first considered
Effect of feature interaction module: Then, we investigated the effect of
Effect of causal decorrelation module: In Table 1, our approach outperformed other baselines, showing superior results in terms of average PCCC and MCCC with the smallest standard error. This highlighted the advantages of integrating causal decorrelation module into predictive modeling. This can not only enhance correlation and prediction accuracy and stability but also underscore the significance of eliminating spurious correlations and identifying causal biomarkers.
Effect of cooperative prediction module: In Table 1, we also investigated the impact of
Discussion
We propose an MA learning framework, an efficient approach for jointly predicting disease and identifying disease-related causal biomarkers. To our knowledge, this is the first attempt to simultaneously address disease prediction and pathogenesis identification. Experimental results demonstrate that MA-CBxDP outperforms state-of-the-art methods. Moreover, jointly modeling disease progression and identifying pathogenesis factors is more effective than tackling these tasks independently. MA-CBxDP notably improves the prediction and identification of pathogenesis biomarkers, offering valuable insights into chronic disorders.
To validate the pathophysiological significance of the identified multimodal biomarkers, we conducted follow-up analyses, including ANOVA, gene-set analysis, functional mapping, gene expression analysis, phenome-wide association studies (PheWAS), causal mediation analysis, and enrichment analysis. These analyses confirmed the relevance of the identified risk factors. For instance, ANOVA revealed significant associations between several biomarkers and diagnostic status, while MAGMA analysis highlighted their biological significance at the gene level. Both analyses validated the effectiveness of our approach. Additionally, PheWAS explored associations between potential therapeutic targets and clinical characteristics, providing insights into their multifunctionality and mechanistic roles, which could inform future research and treatment strategies. Gene enrichment analysis further revealed the functional characteristics and biological relevance of these targets, enhancing our understanding of their roles in AD pathogenesis and treatment. In contrast, comparison methods produced numerous irrelevant signals, raising concerns for subsequent analyses. Overall, the results demonstrate that MA-CBxDP accurately and comprehensively identifies disease risk factors.
Our study has several limitations. A more comprehensive understanding of AD pathophysiology would require incorporating additional imaging modalities, which could offer a broader perspective on disease identification. In addition, prospective randomized clinical trials are necessary to evaluate the impact of integrating pathologies into clinical workflows.
Conclusion
This study proposed an interpretable framework for causal GE biomarker discovery and stable disease prediction in the spirit of MA learning, which can be applied to the diagnosis and prognosis of AD and other chronic diseases. We also extended the current task to a chromosome-wide setting and produced strong baseline results. Experimental results demonstrated that MA-CBxDP established new state-of-the-art results for AD and cancer, while maintaining exceptional interpretability, verifying its flexibility and versatility in practical applications. To translate these findings into potential clinical applications for routine diagnostics and validate the clinical significance of the identified biomarkers, we performed follow-up analyses and further examined the predictive and diagnostic performance using the top selected features, which could provide novel and reliable insights into the pathogenic mechanisms of chronic disorders.
We propose a simple but efficient disease prediction method incorporating GE interactions, i.e. MA-CBxDP, based on MA learning, which simultaneously benefits disease prediction and biomarker identification, representing the first such attempt in the brain imaging genetics field.
We integrate sparse variable selection and causal decorrelation modules into the disease prediction task. The results demonstrate that MA-CBxDP attains superior performance in terms of causal biomarker identification and disease prediction, establishing a new state-of-the-art.
Acknowledgments
Data collection and sharing for this project was funded by the Alzheimer’s Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904) and DOD ADNI (Department of Defense award number W81XWH-12-2-0012). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: AbbVie, Alzheimer’s Association; Alzheimer’s Drug Discovery Foundation; Araclon Biotech; BioClinica, Inc.; Biogen; Bristol-Myers Squibb Company; CereSpir, Inc.; Cogstate; Eisai Inc.; Elan Pharmaceuticals, Inc.; Eli Lilly and Company; EuroImmun; F. Hoffmann-La Roche Ltd and its affiliated company Genentech, Inc.; Fujirebio; GE Healthcare; IXICO Ltd; Janssen Alzheimer Immunotherapy Research & Development, LLC; Johnson & Johnson Pharmaceutical Research & Development LLC; Lumosity; Lundbeck; Merck & Co., Inc.; Meso Scale Diagnostics, LLC; NeuroRx Research; Neurotrack Technologies; Novartis Pharmaceuticals Corporation; Pfizer Inc.; Piramal Imaging; Servier; Takeda Pharmaceutical Company; and Transition Therapeutics. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health (www.fnih.org). The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer’s Therapeutic Research Institute at the University of Southern California. ADNI data are disseminated by the Laboratory for Neuro Imaging at the University of Southern California.
Author contributions
Jin Zhang (Conceptualization, Methodology, Software, Validation, Writing—original draft), Yan Yang (Software, Visualization, Investigation), Muheng Shang (Validation, Writing—review & editing), Lei Guo (Conceptualization, Data curation, Investigation), Daoqiang Zhang (Conceptualization, Supervision, Writing—review & editing), and Lei Du (Supervision, Conceptualization, Resources, Writing—review and editing).
Conflict of interest: None declared.
Funding
This work was supported in part by the MOST 2030 Brain Project Grant (No. 2022ZD0208500); National Natural Science Foundation of China (Nos 62136004, 62373306); Innovation Foundation for Doctor Dissertation (No. CX2023062); and Fundamental Research Funds for the Central Universities at Northwestern Polytechnical University.
Data availability
Data used in the preparation of this article were obtained from the ADNI database (adni.loni.usc.edu). Code can be obtained from https://github.com/ZJ-Techie/MA-CBxDP.
Author notes
Data used in preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at: http://adni.loni.usc.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf.