Mutual-assistance learning for trustworthy biomarker discovery and disease prediction

Data used in preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at: http://adni.loni.usc.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf.

Abstract

Integrating and analyzing multiple omics datasets, such as genomics, environmental influences, and imaging endophenotypes, has yielded an abundance of candidate biomarkers. However, translating such findings into beneficial clinical knowledge for disease prediction remains challenging. This becomes even more challenging when studying interpretable high-order feature interactions such as gene-environment interaction (G $\times$ E) to understand the etiology. To fill this gap, we draw on the idea of mutual-assistance (MA) learning and accordingly propose a fresh and powerful scheme, referred to as mutual-assistance causal biomarker discovery and stable disease prediction approach (MA-CBxDP). Specifically, we design an interpretable bi-directional mapping framework, integrated with a causal feature interaction module, to extract co-expression patterns across different modalities and identify trustworthy biomarkers including G $\times$ E. A cooperative prediction module is further incorporated to ensure accurate diagnosis and identification of causal effects for pathogenesis. Importantly, biomarker discovery and disease prediction can mutually reinforce each other, helping to provide novel insights into chronic diseases. Furthermore, in light of the large computational burden incurred by the high-dimensional interactions, we devise a rapid strategy and extend it to a more practical but challenging chromosome-wide setting. We conduct extensive experiments on two databases under three tasks, i.e. multimodal correlation, disease diagnosis, and trait prediction. MA-CBxDP establishes new state-of-the-art results in predicting clinical scores and disease status classification, while maintaining exceptional interpretability, verifying its flexibility and versatility in practical applications.

Keywords: multi-omics brain imaging genetics, G × E interactions, biomarker discovery

Issue Section:

Problem solving protocol

Introduction

Human diseases, such as Alzheimer’s disease (AD) and cancer, are inherently complex and governed by a multitude of factors encompassing genetics, endophenotypes, and environmental influences [1, 2]. Genomic data reflect the innate predisposition of an individual, while endophenotypes such as neuroimaging data and proteomic expressions provide insights into the individual’s current disease status [3, 4]. In addition, environmental influences capture distinct disease information from the environmental and behavioral aspects. Therefore, integrating these diverse modalities could help discern causal effects in disease pathogenesis more accurately and thereby support effective strategies to slow down or halt disease progression [5].

Imaging genetics is a rapidly evolving field for exploring the genetic architecture of brain endophenotypes [6–10]. During the past decade, univariate, multivariate, and bi-multivariate techniques have been harnessed to study genetic and phenotypic correlations. But they predominantly focus on the main effects of SNPs, and thus may overlook important heritability factors [11, 12]. To address this limitation, non-linear learning methods like kernel techniques and deep neural networks have been introduced to model feature interactions [2, 13–15]. These approaches often lack interpretability and may not consistently produce optimal feature effects and their interactions [16–18]. Hence, there is a pressing need for an efficient method to identify interpretable gene–environment interactions, enabling deeper exploration of the intricate relationships among genotype, environment, and imaging phenotypes in the context of brain disorders [19–24].

Intuitively, integrating and analyzing genomics, exposomes, and endophenotypes offers the potential to obtain better diagnostics and therapeutics. However, directly fusing these datasets and exploring gene–environment interactions on endophenotypes presents several challenges [25–32]: spurious correlations that undermine biological explanation, unstable diagnostic prediction, and the high dimensionality of genetic variations, all of which could bias subsequent healthcare decision-making [33, 34]. Variable selection is a prominent area in machine learning and biomarker discovery. Many attractive solutions exist, including Lasso, elastic net, smoothly clipped absolute deviation, and the minimax concave penalty (MCP), but they may perform poorly when dealing with highly correlated variables and can lead to misleading biological knowledge [35–40]. In addition, causal inference offers potent statistical tools for selecting explanatory variables, e.g. propensity score matching and confounder balancing [41–43]. Nevertheless, variable selection can manifest substantial variability when confronted with high-dimensional features and limited sample sizes. Although it is promising to marry causal feature selection and disease prediction, two challenges remain. First, removing spurious correlations among genetic variants caused by linkage disequilibrium (LD) is difficult in high-dimensional settings, where many unexpected feature correlations can hinder feature selection [43, 44]. Second, feature decorrelation and disease prediction are distinct tasks with differing objectives, so leveraging causal feature selection for disease prediction is also challenging [44, 45]. Developing a task-oriented feature decorrelation framework is therefore essential. However, designing a scalable feature decorrelation method for prediction tasks remains complex because of the inherent difficulty of directly aligning feature decorrelation with prediction objectives [46, 47].

To this end, we draw on the idea of mutual-assistance (MA) learning and propose a simple yet versatile method, referred to as mutual-assistance causal biomarker discovery and stable disease prediction (MA-CBxDP), to address the aforementioned challenges (refer to Fig. 1). In particular, MA-CBxDP incorporates an interpretable bi-directional mapping structure that embeds causally driven feature interactions into the disease prediction task. Interestingly, biomarker discovery plays a pivotal role in ensuring that disease prediction models achieve robust and representative performance. Conversely, accurate disease prediction facilitates the precise identification of disease-associated biomarkers. Notably, the synergistic interplay between biomarker discovery and disease prediction has the potential to mutually enhance their effectiveness, ultimately offering novel and reliable insights into the pathogenic mechanisms underlying chronic disorders.

Figure 1

A schematic illustration of imaging endophenotype-assisted gene-environment interaction analysis. First, the bi-directional mapping framework incorporates genetic variations, exposomes, and endophenotypes simultaneously. The feature-interaction module identifies biologically meaningful genetic and environmental factors as well as G $\times$ E. Moreover, we design cooperative prediction modules to ensure accurate diagnosis and identification of causal effects for pathogenesis.

Furthermore, considering the computational complexity of high-dimensional gene–environment interactions, we develop a fast optimization algorithm in the spirit of divide-and-conquer. Two publicly available datasets, i.e. Alzheimer’s disease neuroimaging initiative (ADNI) and multimodal breast cancer (BC) datasets, were employed to demonstrate the effectiveness and superiority of MA-CBxDP.

Materials and methods

As mentioned above, we draw on the idea of MA learning and integrate the disease prediction and biomarker identification tasks into a single framework, so that the two subtasks benefit each other. For ease of presentation, genetic variations and environmental exposures are denoted as $X \in R^{n \times p}$ and $E \in R^{n \times r}$, the $f$-th set of brain endophenotypes is denoted as $Y_{f} \in R^{n \times q_{f}}$, and $z$ denotes clinical scores or diagnostic status, where $n$ is the number of subjects, and $p$, $q$, and $r$ represent the numbers of SNPs, intermediate phenotypes, and environmental exposures, respectively. Then, MA-CBxDP is defined as follows:

\begin{array}{r} \begin{array}{c} \min_{u_{f}, Q, v_{f}} \sum_{f = 1}^{F} \sum_{i = 1}^{n} {DC}_{i} {‖ x_{i}^{T} u_{f} + x_{i}^{T} Q E_{i} - y_{i f}^{T} v_{f} ‖}_{2}^{2} \\ + Λ ψ (x_{i}^{T} u_{f} + x_{i}^{T} Q E_{i} + y_{i f}^{T} v_{f}; z_{i}) \\ \text{s.t.} \ {DC}_{i} = \frac{\tilde{P} (D)}{P (D)}, Ω (u_{f}) \leq C_{1}, Ω (v_{f}) \leq C_{2}, Ω (Q) \leq C_{3} . \end{array} \end{array}

(1)

$U = [u_{1}, \dots, u_{F}] \in R^{p \times F}$ carries the main effects of SNPs, $V = [v_{1}, \dots, v_{F}] \in R^{q \times F}$ carries the main effects of endophenotypes, and $Q \in R^{p \times r}$ holds the G $\times$ E interaction effects between SNPs and environmental markers. $Ω (U)$, $Ω (Q)$, and $Ω (V)$ are penalty terms for biomarker detection.

In our model, the first term is a bi-directional mapping framework, integrated with a feature interaction module, to extract the co-expression patterns across different modalities. It is well known that SNPs in LD commonly exhibit correlation due to shared ancestry [44]. Consequently, an observed genotype–phenotype correlation may be the indirect result of a correlated functional variant, potentially leading to the misidentification of the actual causal variants. Hence, we design a causal decorrelation module, i.e. ${DC}_{i} = \frac{\tilde{P} (D)}{P (D)}$, that promotes independence among variables by learning sample weights. This helps the model focus on the true connection between biomedical variables and disease outcomes.
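The weighted mapping term of Eq. (1) can be illustrated for a single endophenotype set $f$. The following is a minimal sketch, not the authors' implementation: it assumes the sample weights ${DC}_{i}$ have already been learned and are passed in as a vector `w`, and the function name `mapping_loss` is hypothetical.

```python
import numpy as np

def mapping_loss(X, E, Y, u, Q, v, w):
    """Weighted bi-directional mapping term of Eq. (1) for one endophenotype set.

    X: (n, p) SNPs, E: (n, r) exposures, Y: (n, q) endophenotypes,
    u: (p,), Q: (p, r), v: (q,), w: (n,) sample weights standing in for DC_i
    (assumed to be given; how they are learned is not shown here).
    """
    main = X @ u                                 # x_i^T u for every subject i
    inter = np.einsum("ip,pr,ir->i", X, Q, E)    # x_i^T Q E_i for every subject i
    resid = main + inter - Y @ v                 # mismatch with the phenotype side
    return float(np.sum(w * resid ** 2))
```

With uniform weights the term reduces to an ordinary squared mapping error; non-uniform weights change how much each subject contributes.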

Moreover, $ψ (X, E, Y_{f}; z)$ is the cooperative prediction module based on a metric function, either the linear regression loss $ψ_{l i n} (D_{f}, z) = \frac{1}{2} \sum_{i = 1}^{n} {(z_{i} - ⟨ D_{f_{i}}, w_{f} ⟩)}^{2}$ or the logistic regression loss $ψ_{\log} (D_{f}, z) = \frac{1}{2} \sum_{i = 1}^{n} (\log (1 + \exp ⟨ D_{f_{i}}, w_{f} ⟩) - z_{i} ⟨ D_{f_{i}}, w_{f} ⟩)$, to ensure accurate diagnosis and identification of causal effects for pathogenesis. $Λ$ is a tuning parameter that balances the contribution of the prediction term. However, with finite samples, it is difficult to learn weights that achieve perfect independence and thereby remove unstable variables. Thus, we further design a sparse variable regularizer to reduce the feature dimensionality, which facilitates decorrelation and adapts the model to large-scale causal effect exploration. We use $Ω (Q) = ‖ Q ‖_{1, 1} = \sum_{i} \sum_{j} | Q_{i, j} |$ to identify interpretable G $\times$ E effects. $Ω (U)$ and $Ω (V)$ control the sparsity of the main effects of genetics and endophenotypes. We employ the ${FGL}_{2, 1}$-norm, ${‖ U ‖}_{{FGL}_{2, 1}} = \sum_{i = 1}^{p - 1} \sqrt{{‖ u^{i} ‖}_{2}^{2} + {‖ u^{i + 1} ‖}_{2}^{2}}$, together with the $ℓ_{2, 1}$-norm and the $ℓ_{1}$-norm to identify relevant SNPs at the group and individual levels. Thus, $Ω (U) = λ_{u_{1}} ‖ U ‖_{{FGL}_{2, 1}} + λ_{u_{2}} ‖ U ‖_{2, 1} + λ_{u_{3}} ‖ U ‖_{1, 1}$. In addition, the $ℓ_{1}$-norm is introduced to identify interpretable endophenotypes, i.e. $Ω (V) = λ_{v} ‖ V ‖_{1, 1}$. Furthermore, we incorporate sparse variable selection and independence-based sample reweighting within a single predictive modeling framework. Together, these operators support the identification of diverse, interpretable, predictive, and causal biomarkers from limited data.
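The penalties and prediction losses above translate directly into a few lines of NumPy. This is an illustrative sketch under stated assumptions: the $ℓ_{2, 1}$-norm is taken as the sum of row-wise Euclidean norms (the text does not spell out its definition), `d` stands for the precomputed linear predictor $⟨ D_{f_{i}}, w_{f} ⟩$, and all function names are hypothetical.

```python
import numpy as np

def l11_norm(M):
    # ||M||_{1,1}: sum of absolute values of all entries
    return float(np.abs(M).sum())

def l21_norm(U):
    # Assumed definition: sum of Euclidean norms of the rows of U (one row per SNP)
    return float(np.linalg.norm(U, axis=1).sum())

def fgl21_norm(U):
    # Sum over adjacent row pairs (i, i+1) of sqrt(||u^i||^2 + ||u^{i+1}||^2)
    sq = (U ** 2).sum(axis=1)
    return float(np.sqrt(sq[:-1] + sq[1:]).sum())

def omega_u(U, lam1, lam2, lam3):
    # Omega(U) = lam1 * FGL_{2,1} + lam2 * l_{2,1} + lam3 * l_{1,1}
    return lam1 * fgl21_norm(U) + lam2 * l21_norm(U) + lam3 * l11_norm(U)

def psi_lin(d, z):
    # Linear regression loss on the linear predictor d
    return 0.5 * float(np.sum((z - d) ** 2))

def psi_log(d, z):
    # Logistic loss as written in the text (including the 1/2 factor);
    # log(1 + exp(d)) is evaluated stably with logaddexp
    return 0.5 * float(np.sum(np.logaddexp(0.0, d) - z * d))
```

The fused pairing of adjacent rows in `fgl21_norm` is what couples neighbouring SNPs, while `l11_norm` acts on individual entries.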

To sum up, our model first uses the bi-directional mapping module, which makes it easier and more reasonable to incorporate genetic variations, environmental influences, and endophenotypes simultaneously. We also present a feature interaction operation, in which G $\times$ E is integrated into the mapping to reveal the inherent unity across modalities. Second, to ensure stability and interpretability, we combine the decorrelation regularized modules and the sparsity regularized modules in an iterative way to select important features and eliminate biased biomarkers. Third, we design a disease diagnosis module to ensure precise diagnostic capabilities and the identification of causal effects for pathogenesis. Fortunately, MA-CBxDP is multi-convex, enabling optimization through an alternating convex search strategy. We first fix $V$ and $Q$ and solve for $U$ using gradient descent, and then iteratively update each variable while treating the others as constants. Mathematical analysis confirms that the MA-CBxDP objective is bounded below by zero, which, together with the monotone decrease of the objective, guarantees convergence to a local optimum.
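One block update of the alternating scheme can be sketched as follows. This is a simplified illustration, not the authors' Algorithm 1: it keeps only the weighted mapping term and an $ℓ_{1}$ penalty on a single vector $u$, treats the contribution of $Q$ and $v$ as a fixed `target`, and swaps plain gradient descent for proximal gradient steps (soft-thresholding) so that the non-smooth penalty is handled exactly. All names are hypothetical.

```python
import numpy as np

def soft_threshold(a, t):
    # Proximal operator of t * ||.||_1
    return np.sign(a) * np.maximum(np.abs(a) - t, 0.0)

def objective_u(X, target, w, u, lam):
    # sum_i w_i (x_i^T u - target_i)^2 + lam * ||u||_1
    return float(np.sum(w * (X @ u - target) ** 2) + lam * np.abs(u).sum())

def update_u(X, target, w, u, lam, n_steps=50):
    """Proximal-gradient updates of u with Q and v held fixed.

    target = Y v - (interaction term) is treated as a constant vector.
    """
    Xw = X * w[:, None]                              # rows of X scaled by sample weights
    L = 2.0 * np.linalg.eigvalsh(X.T @ Xw).max()     # Lipschitz constant of the gradient
    step = 1.0 / L
    for _ in range(n_steps):
        grad = 2.0 * Xw.T @ (X @ u - target)         # gradient of the smooth part
        u = soft_threshold(u - step * grad, step * lam)
    return u
```

With step size $1/L$ each proximal step cannot increase `objective_u`, which mirrors the monotone behaviour claimed for the full algorithm; the same pattern would be repeated for $Q$ and $v$ in turn.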

Extension to chromosome-wide analysis

The direct application of MA-CBxDP to chromosome-wide or genome-wide analysis presents challenges due to the computational intensity of genomics and G $\times$ E. To address this issue, we partition genome data into $K$ non-intersecting subsets, denoted as $U = \oplus_{k = 1}^{K} U^{k}$ and $Q = \oplus_{k = 1}^{K} Q^{k}$ ⁠, respectively. The choice of $K$ can be user-defined or data-driven. We employ a strategic approach to circumvent direct computations of main and interaction terms by calculating these effects within each subset and then aggregating the outcomes across all genotypes. This methodology significantly reduces computational complexity. Following a divide-and-conquer principle, we redefine MA-CBxDP as:

\begin{array}{r} \begin{array}{c} \min_{u_{f}, Q, v_{f}} \sum_{f = 1}^{F} \sum_{i = 1}^{n} {DC}_{i} {‖ x_{i}^{T} (u_{f_{1}} \oplus \dots \oplus u_{f_{K}}) + x_{i}^{T} (Q_{1} \oplus \dots \oplus Q_{K}) E_{i} - y_{i f}^{T} v_{f} ‖}_{2}^{2} \\ + Λ ψ (x_{i}^{T} (u_{f_{1}} \oplus \dots \oplus u_{f_{K}}) + x_{i}^{T} (Q_{1} \oplus \dots \oplus Q_{K}) E_{i} + y_{i f}^{T} v_{f}; z_{i}) \\ \text{s.t.} \ {DC}_{i} = \frac{\tilde{P} (D)}{P (D)}, Ω (u_{f}) \leq C_{1}, Ω (v_{f}) \leq C_{2}, Ω (Q) \leq C_{3} . \end{array} \end{array}

(2)

The matrix concatenation operator $\oplus$ merges SNPs and interaction terms and enables parallel processing, allowing independent treatment of SNPs and interaction terms. This strategy minimizes memory requirements, as fast MA-CBxDP only needs to retain small SNP matrices during iterations, which demonstrates practicality for chromosome-wide or genome-wide analyses, verifying its flexibility and versatility in practical applications.
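The divide-and-conquer idea can be checked numerically: summing main and interaction effects block by block gives the same per-subject score as the full computation. The sketch below is an illustration only; it assumes contiguous, equally sized SNP blocks (the text leaves the choice of partition open), and `blockwise_score` is a hypothetical name.

```python
import numpy as np

def blockwise_score(X, E, u, Q, K):
    """Accumulate x_i^T u + x_i^T Q E_i over K disjoint SNP blocks."""
    n, p = X.shape
    score = np.zeros(n)
    for idx in np.array_split(np.arange(p), K):
        Xk = X[:, idx]                                # only this block's SNP columns
        score += Xk @ u[idx]                          # main effects of block k
        score += ((Xk @ Q[idx, :]) * E).sum(axis=1)   # G x E effects of block k
    return score
```

Because each block touches only its own columns of $X$ and rows of $Q$, the blocks could be processed independently and their partial scores summed afterwards.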

Convergence analysis

Theorem 1.1.

The objective value of MA-CBxDP decreases monotonically in each iteration of Algorithm 1.

The optimization process reveals that the sub-objectives for $U$, $Q$, and $V$ constitute three convex sub-problems, so each update does not increase the objective when the other two variables are held fixed. In addition, the objective of MA-CBxDP is bounded from below by zero. Consequently, iteratively solving for $U$, $Q$, and $V$ yields a monotonically non-increasing, bounded sequence of objective values, and Algorithm 1 is guaranteed to converge to a local optimum.
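The monotonicity argument can be written as a chain of inequalities. This is a sketch under the assumption that each sub-problem is solved exactly or by a step that does not increase the objective; $J$ denotes the penalized objective of Eq. (1) and the superscript $t$ the iteration index, both introduced here for illustration.

```latex
\begin{array}{c}
J (U^{t + 1}, Q^{t}, V^{t}) \leq J (U^{t}, Q^{t}, V^{t}), \\
J (U^{t + 1}, Q^{t + 1}, V^{t}) \leq J (U^{t + 1}, Q^{t}, V^{t}), \\
J (U^{t + 1}, Q^{t + 1}, V^{t + 1}) \leq J (U^{t + 1}, Q^{t + 1}, V^{t}) \\
\Rightarrow \ 0 \leq J (U^{t + 1}, Q^{t + 1}, V^{t + 1}) \leq J (U^{t}, Q^{t}, V^{t}) .
\end{array}
```

A non-increasing sequence that is bounded below converges, so the objective values stabilize as the iterations proceed.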

Results

Experimental setup

State-of-the-art methods: We compared MA-CBxDP with the most closely related state-of-the-art methods, including SMCCA, adaptive SMCCA, and RelPMDCCA. These techniques are representative of computational imaging genetics methods and can be reduced to specific genotype–phenotype analytical methods [16–18].

Evaluation criteria and parameter setting: Performance was assessed with three criteria: the identified feature subsets, the canonical correlation coefficient (CCC), and the root mean squared error (RMSE). We used nested five-fold cross-validation to tune parameters over the candidate set $10^{i}$ ($i = - 4, - 3, \dots, 3, 4$), selecting the parameters that produced the highest mean testing CCCs.
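The tuning protocol can be sketched as a generic nested cross-validation loop over the grid $10^{-4}, \dots, 10^{4}$. This is an illustration only: `ridge_fit` and `corr_score` are stand-ins for fitting MA-CBxDP and computing a CCC, the inner loop selects on held-out inner folds, and every name is hypothetical.

```python
import numpy as np

GRID = [10.0 ** i for i in range(-4, 5)]   # candidate values 10^-4, ..., 10^4

def kfold_indices(n, k, rng):
    # Random split of range(n) into k (train, held-out) index pairs
    folds = np.array_split(rng.permutation(n), k)
    return [(np.concatenate(folds[:j] + folds[j + 1:]), folds[j]) for j in range(k)]

def nested_cv(X, y, fit, score, k=5, seed=0):
    rng = np.random.default_rng(seed)
    outer_scores, chosen = [], []
    for train, test in kfold_indices(len(y), k, rng):
        Xtr, ytr = X[train], y[train]
        inner = kfold_indices(len(train), k, rng)
        # Inner loop: mean validation score for every grid value
        means = [np.mean([score(fit(Xtr[a], ytr[a], lam), Xtr[b], ytr[b])
                          for a, b in inner]) for lam in GRID]
        best = GRID[int(np.argmax(means))]
        chosen.append(best)
        # Outer loop: refit on the full training part, evaluate on the test fold
        outer_scores.append(score(fit(Xtr, ytr, best), X[test], y[test]))
    return float(np.mean(outer_scores)), chosen

def ridge_fit(X, y, lam):
    # Stand-in model: closed-form ridge regression
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def corr_score(beta, X, y):
    # Stand-in score: Pearson correlation between prediction and target
    return float(np.corrcoef(X @ beta, y)[0, 1])
```

Keeping the test fold out of the inner selection is what prevents the reported score from being optimistically biased by the parameter search.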

Application to Alzheimer’s disease

Dataset: Data used in the preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). The ADNI was launched in 2003 as a public–private partnership, led by Principal Investigator Michael W. Weiner, MD. The primary goal of ADNI has been to test whether serial magnetic resonance imaging, positron emission tomography, other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment (MCI) and early AD.

The dataset included imaging, genetic, and proteomic markers as well as environmental factors. In detail, gray matter volumes from regions of interest were extracted using voxel-based morphometry (VBM), while cortical thickness values were obtained via FreeSurfer; the two datasets are referred to as VBM and FreeSurfer, respectively. Environmental factors, including human behavior variables (e.g. age, body mass index, visual ability, alcohol abuse, drug sensitivities, blood pressure, smoking, and stroke), were recorded. Proteomic analyses were conducted using the rules-based medicine proteomic panel, yielding 146 markers after quality control. Cognitive scores from ADAS, MMSE, and RAVLT assessments were also collected. Additionally, 10 000 SNPs were selected from the ADNI database and encoded additively by minor allele count. To ensure data consistency and reliability, SNPs, proteomic markers, and neuroimaging data were normalized to z-scores, eliminating scale-related biases.
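The two preprocessing steps mentioned above, additive coding and z-score normalization, can be illustrated on a toy genotype matrix. This is a minimal sketch, not the ADNI pipeline: genotypes are assumed to be two-character allele strings, the minor allele of each SNP is assumed to be known, and the function names are hypothetical.

```python
import numpy as np

def additive_coding(genotypes, minor_alleles):
    """Count minor alleles (0, 1 or 2) per subject and SNP.

    genotypes: (n, p) array of two-character strings such as "AG".
    minor_alleles: length-p sequence with the minor allele of each SNP.
    """
    n, p = genotypes.shape
    coded = np.zeros((n, p))
    for j in range(p):
        for i in range(n):
            coded[i, j] = genotypes[i, j].count(minor_alleles[j])
    return coded

def zscore(M):
    # Column-wise standardization to zero mean and unit variance
    mu, sd = M.mean(axis=0), M.std(axis=0)
    sd = np.where(sd == 0, 1.0, sd)   # constant columns stay at zero instead of dividing by 0
    return (M - mu) / sd
```

For example, with minor alleles `["G", "T"]` the genotype rows `["AA", "CT"]`, `["AG", "TT"]`, and `["GG", "CC"]` are coded as `[0, 1]`, `[1, 2]`, and `[2, 0]`.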

Improved multimodal correlation

Figure 2 presents the mean and standard deviation of the correlations (CCCs) and regression errors (RMSEs) for all methods on four genotype–phenotype correlation tasks: SNP-VBM, SNP-FreeSurfer, SNP-Plasma, and SNP-CSF. Higher CCC and lower RMSE denote better performance. As expected, MA-CBxDP exhibited the highest CCCs and lowest RMSEs. Moreover, our method yielded smaller standard deviations than the baselines, indicating stable prediction, which can be attributed to the causal bi-directional mapping framework. Unsurprisingly, the variant that ignored GE interactions performed sub-optimally, emphasizing the importance of modeling feature interactions rather than focusing solely on the main effects.
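The two metrics can be computed as shown below. The text does not state exactly how they were evaluated, so this sketch makes assumptions: the CCC is taken as the Pearson correlation between the canonical variates $X u$ and $Y v$ on held-out data, and the RMSE as the root mean squared difference between a prediction and its target. Function names are hypothetical.

```python
import numpy as np

def ccc(X, Y, u, v):
    # Pearson correlation between the canonical variates X u and Y v
    return float(np.corrcoef(X @ u, Y @ v)[0, 1])

def rmse(pred, target):
    # Root mean squared error between two vectors
    diff = np.asarray(pred, dtype=float) - np.asarray(target, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))
```

Reporting the mean and standard deviation of these values across test folds gives summaries of the kind described for Fig. 2.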

Figure 2

Comparison of testing (a) CCCs and (b) RMSEs for genotype–phenotype relationship. Without GE means that the GE interaction component was deleted.

Improved predicting clinical scores and disease status classification

We further investigated prediction and diagnosis performance using the top ten selected features, including SNPs and endophenotypes. In Fig. 3a, our method outperformed the baseline approaches in predicting clinical scores (ADAS, MMSE, and RAVLT), achieving the highest CCC and the lowest standard error using linear regression. This highlights the effectiveness of the causality and cooperative prediction modules in delivering stable predictions. Furthermore, the LIBSVM software package was employed to classify healthy controls (HCs), MCIs, and ADs. Figure 3b displays the mean and standard deviation (std) of accuracy (ACC), where MA-CBxDP achieved the highest testing ACC and the smallest std. This highlights the superior predictive ability of the top-ranked features in MA-CBxDP. All these findings underscore the value of our MA framework in enhancing disease classification and cognitive prediction. In addition, the model that considered GE interactions exhibited the best predictive performance, suggesting that feature interaction is necessary and has a positive effect.

Figure 3

Comparison of testing (a) CCCs and (b) ACCs on ADNI dataset for predicting clinical scores and disease status classification. Without GE means that the GE interaction component was deleted.

Biomarkers identification and explanation

Main and interaction effects: The heat map in Fig. 4 shows the top selected SNPs, with the color bar indicating the relative importance of the features. MA-CBxDP identified multiple AD-related risk loci, including the well-known rs429358 (APOE). In addition, MA-CBxDP identified rs56131196, rs4420638, and rs12721051 (all in APOC1 [48–50]), a grouped selection that is consistent with the ${FGL}_{2, 1}$-norm. In contrast, the competitors reported numerous irrelevant signals and thus could mislead subsequent analyses. These findings demonstrate the interpretability of MA-CBxDP for biomarker discovery.

Figure 4

Weights (mean value) of SNPs from five-fold cross-validation. Each row is a method: (1) SMCCA; (2) Adaptive SMCCA; (3) RelPMDCCA; (4) MA-CBxDP.

In addition to the main effects, our method also uncovered significant GE interactions. The top five interactions were (rs75654248, smoking), (rs75654248, alcohol abuse), (rs440446, visual impairment), (rs390082, drug sensitivities), and (rs1064725, alcohol abuse). First, the rs75654248–smoking interaction was associated with AD; rs75654248 has previously been reported as an AD risk variant, and smoking is also an important factor that can contribute to AD. These interactions can provide promising evidence for precise diagnosis, because the co-occurrence of these abnormalities may help clinicians diagnose at-risk individuals with greater confidence. We also conducted an ablation study by removing the feature interaction module from MA-CBxDP (Table 1). Unsurprisingly, optimal performance was achieved when both the main effects and the interaction effects of genotype and environment were considered, underscoring the importance of their joint consideration. Beyond diagnosis, some of these SNP–environment interactions may inform treatment, either by slowing the disease or by counseling family members to reduce the risk for relatives who may also carry the variant.

Table 1

Ablation studies on main modules of different design choices on ADNI database

id	$L_{SVR}$	$L_{FIL}$	$L_{DC}$	$L_{CP}$	PCCC $↑$	MCCC $↑$
(a)	$\times$	$\times$	$\times$	$\times$	0.66 $\pm$ 0.06	0.29 $\pm$ 0.08
(b)	$✓$	$\times$	$\times$	$\times$	0.67 $\pm$ 0.06	0.31 $\pm$ 0.07
(c)	$✓$	$✓$	$\times$	$\times$	0.68 $\pm$ 0.05	0.32 $\pm$ 0.07
(d)	$✓$	$✓$	$✓$	$\times$	0.70 $\pm$ 0.03	0.34 $\pm$ 0.04
(e)	$✓$	$✓$	$✓$	$✓$	0.72 $\pm$ 0.03	0.36 $\pm$ 0.04

“ $L_{SVR}$ ” denotes the sparse variable regularizer, “ $L_{FIL}$ ” the feature interaction module, “ $L_{DC}$ ” the causal decorrelation module, and “ $L_{CP}$ ” the cooperative prediction module. PCCC is the average CCC for predicting clinical scores, and MCCC is the average multimodal genotype–phenotype CCC. Bold indicates the best result, underlined is second best. $↑$ means higher is better.

Phenotype feature explanation

Identification and interpretation of neuroimaging-derived phenotype markers (VBM): Identifying intermediate phenotypes affected by AD is also crucial for computer-aided diagnosis. In our study, Figs 5 and 6 present the selected neuroimaging phenotypes and follow-up analyses, while Fig. 7 illustrates the feature selection for the other phenotypes. In detail, the top neuroimaging-derived phenotype markers identified by MA-CBxDP (Figs 7a and 5a) were mapped onto the brain for visualization. Notably, MA-CBxDP identified key brain regions such as the left and right hippocampus and the right lingual area. Research has indicated the importance of the hippocampus in memory and its significance in diagnosing AD. Further experiments on various phenotypes reinforced the efficacy of our method. In contrast, the baselines could not identify reliable biomarkers, which hinders their performance in AD diagnosis, as illustrated in Fig. 7a. These results demonstrate the superior capability of MA-CBxDP in identifying AD-affected phenotypes.

Figure 5

(a) Visualization of identified brain imaging QTs. (b) Heatmap of pairwise correlation between top ten SNPs and endophenotypes, where symbol “ $\times$ ” indicates the pairwise association reached the significance level (⁠ $P < .05$ ⁠).

Figure 6

Investigation of the mediating effects of endophenotype characteristics on diagnostic phenotypes caused by genetic variants (a) rs429358 (b) rs6857. (*, $P < .05$), (**, $P < .01$), (***, $P < .001$).

Figure 7

Biomarker discovery for important endophenotypes on ADNI. (a) Neuroimaging-derived phenotypes (VBM). (b) Neuroimaging-derived phenotypes (FreeSurfer). (c) Plasma-derived proteomic markers. (d) CSF-derived proteomic markers. Each row is a method: (1) SMCCA; (2) adaptive SMCCA; (3) RelPMDCCA; (4) proposed.

Identification and interpretation of neuroimaging-derived phenotype markers (FreeSurfer): The feature selection results in Fig. 7b show the neuroimaging-derived phenotypes (FreeSurfer) identified by MA-CBxDP. Notably, MA-CBxDP assigned markedly elevated weights to a distinct subset of FreeSurfer markers, namely RPallvol and LEntCtx, which have previously been reported to be associated with elevated AD risk. This recurrent recognition of AD-related neuroimaging-derived biomarkers underscores the effectiveness of MA-CBxDP in precisely and reliably identifying disease-relevant markers.

Identification and interpretation of plasma-derived proteomic markers: Fig. 7c sheds light on the significance of plasma-derived proteomic markers. Notably, with the causal variable decorrelation module, our approach pinpointed various proteomic markers linked to AD, such as ApoE, ApoB, and CRP. Conversely, while the comparison methods also detected some AD-associated proteomic markers, they yielded many extraneous signals, which impedes a coherent interpretation.

Identification and interpretation of CSF-derived proteomic markers: The heatmap displayed in Fig. 7d delineated the canonical weights assigned to cerebrospinal fluid (CSF)-derived proteomic markers. Notably, MA-CBxDP effectively recognized CSF-derived proteomic markers associated with AD, including FGF-4, ApoD, and ApoE, among others. In contrast, benchmark methods generated a multitude of markers, complicating the interpretation process. Through the amalgamation of data from both plasma and CSF datasets, we can confidently assert that MA-CBxDP surpassed benchmark methods, highlighting its superior capability in identifying reliable markers of proteomic expression.

Genotype–phenotype correlated detection

To further illustrate the biological effect, a genotype–phenotype correlation analysis of neuroimaging-derived phenotype markers (VBM) was conducted. In Fig. 5b, most genotype–phenotype correlations displayed significance. Of note, the association between rs429358 and the left hippocampus reached the significance level, reinforcing the genetic impact on hippocampal abnormalities. In addition, the extended experiments on other genotype–phenotype correlations also demonstrated the superiority of our framework, which could gain comprehensive insights into the potential pathological mechanisms of ADs. To sum up, all this analysis highlighted the capability of MA-CBxDP in discerning causal genotype–phenotype relationships for AD pathogenesis.

We have independently shown the relevance of the identified SNPs and imaging QTs. Beyond asking whether individual biomarkers are associated with disease, we used the detected genotype–phenotype pairs of AD to examine, in Fig. 6, the causal relation among SNPs, imaging QTs, and diagnostic phenotypes. We looked into the first two genotype–phenotype pairs, i.e. (rs429358, hippocampus) and (rs6857, hippocampus), and performed mediation analyses to establish the causal relationships among genetic variants, endophenotypes, and diagnostic phenotypes. Mediation analysis is a statistical method used to explore and quantify the mechanisms through which a causal effect operates. It is grounded in causal inference and is particularly useful for understanding how or why an exposure affects an outcome. We checked whether the genetic effects on diagnosis could be explained by endophenotypes. Specifically, the genetic variant was treated as the input variable, each diagnostic phenotype served as the outcome variable, and the mean hippocampal endophenotype acted as the mediator.

With the hippocampal endophenotype as the mediator, we found a significant association between rs429358 and diagnostic phenotypes ($β = -0.37$, $P < .001$). This association was mediated by the hippocampus (bootstrapped average causal mediation effect: $β = -0.07$ [–0.11, –0.03]), indicating a partial mediation effect. Similarly, the hippocampus partially mediated the effect of rs6857 on participants’ diagnostic phenotypes (bootstrapped average causal mediation effect: $β = -0.06$ [–0.11, –0.03]). Thus, the hippocampus mediates the effects of risk genetic biomarkers on the diagnostic phenotype, which confirms the physiological significance of the identified hippocampal biomarkers. In summary, these statistical results support the value of the causal genotype–phenotype relationships for AD pathogenesis, emphasize the effectiveness of MA-CBxDP in pinpointing causal genetic variation, and could deepen our understanding of the pathogenesis of AD.
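A bootstrapped mediation effect of this kind can be sketched with ordinary least squares. This is an illustration under stated assumptions, not the analysis used for Fig. 6: it uses the product-of-coefficients estimate (which coincides with the average causal mediation effect only for linear models without interaction), a percentile bootstrap interval, and hypothetical function names.

```python
import numpy as np

def ols(design, y):
    # Least-squares coefficients for y ~ design
    return np.linalg.lstsq(design, y, rcond=None)[0]

def indirect_effect(x, m, y):
    ones = np.ones(len(x))
    a = ols(np.column_stack([ones, x]), m)[1]        # SNP -> mediator
    b = ols(np.column_stack([ones, x, m]), y)[2]     # mediator -> outcome, adjusted for SNP
    return a * b

def bootstrap_ci(x, m, y, n_boot=1000, seed=0):
    """Point estimate and 95% percentile bootstrap interval of the indirect effect."""
    rng = np.random.default_rng(seed)
    n = len(x)
    draws = np.empty(n_boot)
    for k in range(n_boot):
        idx = rng.integers(0, n, size=n)             # resample subjects with replacement
        draws[k] = indirect_effect(x[idx], m[idx], y[idx])
    lo, hi = np.percentile(draws, [2.5, 97.5])
    return indirect_effect(x, m, y), (float(lo), float(hi))
```

Here `x` would hold the additive SNP coding, `m` the mean hippocampal measure, and `y` the diagnostic phenotype; an interval that excludes zero while the direct effect remains non-zero is what is usually read as partial mediation.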

Follow-up analyses: gene-set analyses

To validate the identified genetic loci, we conducted a gene-based analysis using MAGMA, which employs a multiple linear principal component regression model. MAGMA projects the SNP matrix of a gene onto its principal components, which capture the genetic variation within the gene while accounting for LD. These components are then used as predictors in a linear regression framework to assess their association with the phenotype. An F-test is applied to compute $P$-values, determining the significance of the gene–phenotype relationship [51].

Notably, APOE ($P = 9.90 \times 10^{-26}$) and APOC1 ($P = 5.72 \times 10^{-18}$) showed the most significant associations with the diagnostic phenotype. APOE, which encodes apolipoprotein E and is involved in amyloid-specific pathways including amyloid trafficking and plaque clearance, is consistently associated with AD diagnosis. This underscores MA-CBxDP’s efficacy in identifying causal genetic variations.

Follow-up analyses: functional mapping

To validate and interpret the identified biomarkers, we employed the Functional Mapping and Annotation (FUMA) platform, which supports functional mapping, prioritization, annotation, and interpretation of results from genome-wide association studies (GWAS). Significant SNPs were identified from meta-analysis summary statistics using a significance threshold of $P < 5 \times 10^{-8}$ and a 1-Mb window for independence, with the 1000 Genomes phase 3 dataset as the reference. Lead SNPs were then pinpointed from these significant SNPs, affirming their disease associations. Notably, the APOE and APOC1 loci showed substantial statistical significance, as illustrated in Figs 8 and 9b. These results align closely with our MA-CBxDP findings. Specifically, rs429358, located in APOE on chromosome 19, emerged as the most significant SNP as well as the causal genetic variant (highlighted in Figs 8 and 9a), reinforcing its pivotal role in AD. This consistency underscores the reliability and effectiveness of MA-CBxDP in identifying causal biomarkers.

Figure 8

Regional plot of the top lead SNP rs429358.


Figure 9

(a) Manhattan plot of the SNP-based test. (b) Manhattan plot of the gene-based test. The x-axis represents genomic position, while the y-axis represents the negative base-10 logarithm of the $P$-values. Higher values on the y-axis indicate stronger signals, suggesting significant associations.
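
As a small worked example of the y-axis transform, the snippet below converts the gene-level $P$-values reported above for APOE and APOC1, and the SNP-level threshold of $5 \times 10^{-8}$, to the $-\log_{10}$ scale. The variable names are illustrative only.

```python
import math

# Gene-level P-values reported in the text for the two top genes.
gene_p = {"APOE": 9.90e-26, "APOC1": 5.72e-18}
neg_log10_p = {gene: -math.log10(p) for gene, p in gene_p.items()}

# SNP-level genome-wide significance threshold on the same scale.
snp_threshold = -math.log10(5e-8)

for gene, value in neg_log10_p.items():
    print(f"{gene}: -log10(P) = {value:.2f}")
print(f"SNP-level threshold: {snp_threshold:.2f}")
```

On this scale a smaller $P$-value maps to a taller point, so the threshold appears as a horizontal line that significant signals must exceed.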


Follow-up analyses: gene expression analyses

To explore the biological implications of the identified loci at the gene expression level, we used GENE2FUNC for gene expression analysis. We investigated gene expression patterns linked to the top 50 SNPs by annotating these SNPs to their respective genes. Using data from the GTEx database (version 8) covering 54 tissue types and BrainSpan RNA sequencing data across 29 developmental stages, we generated heat maps to visualize gene expression, with each map showing the average normalized expression value for its associated labels. We present mRNA expression profiles of the prioritized genes associated with the top 50 SNPs across the 54 GTEx tissue types and the BrainSpan developmental stages. The expression levels of genes such as APOE and APOC1 across tissues are shown in Fig. 10b, and their distinct expression patterns across life stages, as reflected in the BrainSpan data, are shown in Fig. 10a. Notably, APOE consistently exhibited high expression levels across all lifespan stages, while APOC1 showed elevated expression during late prenatal and postnatal stages, potentially contributing to AD. These results highlight the efficacy of MA-CBxDP in identifying reliable genetic variations associated with various human brain tissues across different lifespan stages.

Figure 10

Heat maps of normalized gene expression values (zero-mean normalization of log2-transformed expression) for the prioritized genes associated with the top 50 SNPs. The upper panel shows BrainSpan data, while the lower panel shows GTEx v8 RNA-seq data.
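
The normalization named in the caption can be sketched as below. The pseudocount of 1 and the per-gene (row-wise) centering are assumptions made for illustration; the exact preprocessing used by the platform is not restated in this article, and `normalize_expression` is a hypothetical helper.

```python
import numpy as np

def normalize_expression(expression, pseudocount=1.0):
    """Log2-transform expression values, then center each gene (row) at zero."""
    log_expr = np.log2(expression + pseudocount)
    return log_expr - log_expr.mean(axis=1, keepdims=True)

# Toy matrix: 2 genes (rows) x 3 tissues or stages (columns).
expression = np.array([[0.0, 1.0, 3.0],
                       [7.0, 7.0, 7.0]])
normalized = normalize_expression(expression)
print(normalized)
```

After centering, positive cells mark labels where a gene is expressed above its own average, which is what the heat map colors encode.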


Follow-up analyses for ADNI: phenome-wide association studies and enrichment analysis

To validate the phenotypic implications of the SNPs pinpointed by MA-CBxDP, we conducted a phenome-wide association study (PheWAS) using data from the publicly accessible GWAS Atlas (https://atlas.ctglab.nl). This resource comprises 4756 GWAS, which were not necessarily designed with an explicit focus on the SNP or gene under scrutiny.

PheWAS: First, we conducted a PheWAS on the lead SNP, rs429358, to explore its potential associations across a broad spectrum of phenotypes encompassing 28 domains. Fig. 11a reveals a significant link between the rs429358 locus and neurological phenotypes. In particular, rs429358 was significantly associated with neurological traits such as AD or dementia with paternal history. We also found that medication for cholesterol, blood pressure, and diabetes was significantly associated with rs429358. Understanding such modifiable risk factors is paramount for designing personalized therapeutic interventions. Further, Fig. 11b summarizes the PheWAS results for the lead SNP rs56131196. Interestingly, illnesses of the father (AD/dementia), paternal history of AD, low-density lipoprotein cholesterol, and total cholesterol were significantly associated with rs56131196, providing valuable insights into its multifunctionality.

Figure 11

(a) PheWAS results for SNP rs429358 (APOE). (b) PheWAS results for SNP rs56131196 (APOC1).


In summary, the combined evidence from the pheWAS highlights the successful identification of the genetic underpinnings of neurological phenotypes through our MA-CBxDP approach.

Enrichment analysis: The top genes identified by our algorithm were examined for their biological significance. Figs 12a and 12b display the enrichment results for the top genes in GO biological processes and DisGeNET, respectively. Notably, most pathways involving these top genes are also associated with AD. For each listed GO term, we report the exact name and the corresponding GO ID as defined in the database. The top GO terms are “neuron recognition,” “heart development,” “neuron projection development,” “cell junction organization,” and “regulation of neuron projection development.”

Figure 12

(a) Summary of enrichment analysis in GO biological processes. (b) Summary of enrichment analysis in DisGeNET.


In addition, enrichment analysis was conducted on DisGeNET ontology categories using Metascape. DisGeNET aggregates data from curated repositories, GWAS catalogs, animal models, and scientific literature to elucidate the genetic basis of human diseases. The analysis identified the top 20 significant terms, including “Alzheimer’s disease, focal onset,” “acute confusional senile dementia,” “behavioral variant of frontotemporal dementia,” and “memory performance.” These results offer insights into the biological mechanisms underlying AD and underscore the therapeutic potential of the identified causal gene biomarkers.
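
Enrichment summaries of this kind typically rest on an over-representation test. The sketch below shows a standard hypergeometric version; treating it as the test behind the reported terms is an assumption, and the gene counts are invented for illustration.

```python
from scipy import stats

def enrichment_pvalue(n_background, n_term, n_query, n_overlap):
    """P(overlap >= n_overlap) when n_query genes are drawn at random
    from a background of n_background genes, n_term of which carry the term."""
    return stats.hypergeom.sf(n_overlap - 1, n_background, n_term, n_query)

# Hypothetical counts: 5 of 50 query genes fall in a 200-gene term
# drawn from a 20000-gene background.
p_enriched = enrichment_pvalue(n_background=20000, n_term=200, n_query=50, n_overlap=5)
p_trivial = enrichment_pvalue(n_background=20000, n_term=200, n_query=50, n_overlap=0)
print(p_enriched, p_trivial)
```

In practice the resulting $P$-values are corrected for the number of terms tested before terms are ranked.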

Application to breast cancer for generalization

Datasets

To evaluate the generalizability of our method, we applied MA-CBxDP to a multi-omics BC dataset [52] to predict treatment response and gain molecular insights. This dataset comprises clinical environment features (age, ER.status, HER2.status, Size.at.diagnosis, LN.at.diagnosis, etc.), DNA data, tumor microenvironment, and pathway functional profiles of early and locally advanced BC patients from the TransNEO cohort at Cambridge University Hospitals NHS Foundation Trust. To ensure data consistency and reliability, clinical environment features, DNA mutation, tumor microenvironment, and molecular pathway data were normalized to z-scores, eliminating scale-related biases. The goal was to predict residual cancer burden (RCB) scores, which quantify residual disease after neoadjuvant (pre-surgical) therapy, using the tumor immune microenvironment (TiME) and pathway activity as endophenotypes.
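
The z-score step can be sketched as follows. The helper name and the handling of constant columns are illustrative choices, not details taken from the original pipeline.

```python
import numpy as np

def zscore_columns(X, eps=1e-12):
    """Standardize each feature (column) to zero mean and unit variance.
    Constant columns are centered and left at zero instead of dividing by zero."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std = np.where(std < eps, 1.0, std)
    return (X - mean) / std

# Toy feature matrix: 3 subjects x 3 features on very different scales.
X = np.array([[1.0, 10.0, 5.0],
              [2.0, 20.0, 5.0],
              [3.0, 30.0, 5.0]])
Z = zscore_columns(X)
print(Z)
```

Putting all modalities on a common scale keeps features with large raw ranges from dominating sparse penalties.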

Multimodal correlation and predicting clinical scores: Fig. 13 depicts the mean and standard deviation of the CCC for genotype–phenotype correlation and of the CCC and RMSE for clinical score prediction. As expected, MA-CBxDP achieved the highest CCCs for genotype–phenotype correlation. Furthermore, the mutual-assistance frameworks outperformed the baselines in terms of both CCC and RMSE when predicting clinical scores. We attribute this to the substantial enhancement achieved by integrating causality into prediction.
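
For readers who want to reproduce the two metrics, here is a minimal sketch. It assumes that the CCC is computed as the Pearson correlation between two score vectors (for example, true and predicted clinical scores); that reading of CCC is an assumption here, and the helper names and toy values are hypothetical.

```python
import numpy as np

def correlation_coefficient(a, b):
    """Pearson correlation between two 1-D score vectors."""
    return float(np.corrcoef(a, b)[0, 1])

def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted scores."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy example: predictions are off by a constant 0.5.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.5, 3.5, 4.5])
cc = correlation_coefficient(y_true, y_pred)
err = rmse(y_true, y_pred)
print(cc, err)
```

The toy case shows why both metrics are reported: a constant offset leaves the correlation perfect while the RMSE still registers the error.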


Figure 13

Comparison of testing CCCs and RMSEs (mean $\pm$ std.) on BC dataset. (a) Genotype–phenotype correlation (CCC) and (b) predicting clinical scores (CCC and RMSE).


Biological marker explanation

Main and interaction effects: An investigation of the biomarkers confirmed that most identified features, including TP53, PIK3CA, COL12A1, and RYR1, were indeed relevant to BC. Notably, TP53 stands out as a critical variable for BC due to its pivotal role in cell cycle regulation and apoptosis. Beyond main effects, MA-CBxDP revealed noteworthy G $\times$ E interactions, including (RYR1, age), (RYR1, LN at diagnosis), (PIK3CA, HER2 status), (RYR1, ER status), and (MT-ND5, Histology). Interestingly, RYR1 and age were each important factors contributing to BC, and their interaction is also plausible. Further investigation of these interactions could reveal novel clues to BC. As shown in Fig. 13, the best performance in genotype–phenotype correlation and clinical prediction was achieved when both the main and interaction effect identification modules were included, highlighting the necessity of incorporating both aspects. To further demonstrate the efficacy of MA-CBxDP, we conducted post-analyses of the biological implications.
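
A common way to encode pairs such as (RYR1, age) is as element-wise products of a gene feature and an environment feature. The sketch below illustrates that encoding only; it is not the feature interaction module of MA-CBxDP, and the function name and toy values are hypothetical.

```python
import numpy as np

def gxe_features(G, E, gene_names, env_names):
    """Build all pairwise gene-by-environment product features.
    G: (n_subjects, n_genes), E: (n_subjects, n_env).
    Returns an (n_subjects, n_genes * n_env) matrix and matching (gene, env) labels."""
    products = G[:, :, None] * E[:, None, :]       # shape (n, n_genes, n_env)
    interactions = products.reshape(len(G), -1)    # gene index varies slowest
    names = [(g, e) for g in gene_names for e in env_names]
    return interactions, names

# Toy data: 2 subjects, 2 gene features, 3 environment features.
G = np.array([[1.0, 2.0],
              [3.0, 4.0]])
E = np.array([[10.0, 20.0, 30.0],
              [1.0, 1.0, 1.0]])
interactions, names = gxe_features(G, E, ["RYR1", "PIK3CA"],
                                   ["age", "ER status", "HER2 status"])
print(names[0], interactions[:, 0])
```

The number of such columns grows as the product of the two feature counts, which is why interaction screening becomes expensive in high dimensions.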

Follow-up analyses for breast cancer: phenome-wide association studies

An investigation of the biomarkers identified by MA-CBxDP indicated that most features, including TP53, PIK3CA, COL12A1, and RYR1, were indeed relevant to BC. Notably, TP53 stood out as a critical variable for BC due to its pivotal role in cell cycle regulation and apoptosis. Beyond main effects, MA-CBxDP revealed noteworthy G $\times$ E interactions, including (RYR1, age), (RYR1, LN at diagnosis), (PIK3CA, HER2 status), (RYR1, ER status), and (MT-ND5, Histology). Interestingly, RYR1 was detected in both main and interaction effects. Together, these findings provide promising biomarker tools for BC diagnosis and prediction.

This PheWAS used 17 361 dichotomous phenotypes and 1419 quantitative phenotypes from the AstraZeneca PheWAS portal database to perform gene-level PheWAS. PheWAS results can be interpreted as the association of genetically determined protein expression with specific diseases or traits. Figs 14 and 15 illustrate notable associations of TP53 and RYR1 with specific phenotypes. Specifically, TP53 was associated with neoplasm traits such as malignant neoplasms, stated or presumed to be primary, of lymphoid, hematopoietic and related tissue. We also found significant associations between RYR1 and traits related to the breast and to malignant neoplasms, all linked to BC. This highlights the effectiveness of MA-CBxDP in pinpointing causal genetic variations associated with BC.

Figure 14

Traits significantly associated with TP53 using PheWAS portal.


Figure 15

Traits significantly associated with RYR1 using PheWAS portal.


Ablation study

We ran ablation studies to investigate the impact of each component on the CCC of clinical score prediction (PCCC) and of multimodal correlation (MCCC).

Effect of sparse variable regularizer module: We first considered $L_{SVR}$. Table 1 displays the testing PCCC and MCCC for all methods. Removing the $L_{SVR}$ module led to suboptimal performance, emphasizing the necessity of integrating $L_{SVR}$ into predictive workflows.

Effect of feature interaction module: Next, we investigated the effect of $L_{FIL}$. Table 1 shows the superior performance of MA-CBxDP over the baselines, indicating that modeling feature interactions can help elucidate missing heritability. Furthermore, an additional experiment demonstrated that our divide-and-conquer strategy reduced the running time more than a hundredfold (from 6750 to 61 s) while maintaining comparable performance. This substantial gain in time efficiency enhances the practical accessibility and utility of our strategy.

Effect of causal decorrelation module: In Table 1, our approach outperformed the other baselines, showing superior average PCCC and MCCC with the smallest standard error. This highlights the advantage of integrating the causal decorrelation module into predictive modeling: it not only improves the accuracy and stability of correlation and prediction but also underscores the importance of eliminating spurious correlations and identifying causal biomarkers.

Effect of cooperative prediction module: In Table 1, we also investigated the impact of $L_{CP}$. Removing $L_{CP}$ resulted in inferior performance, highlighting the importance of incorporating $L_{CP}$ into predictive workflows to bolster MA-CBxDP’s predictive capabilities.

Discussion

We propose an MA learning framework, an efficient approach for jointly predicting disease and identifying disease-related causal biomarkers. To our knowledge, this is the first attempt to simultaneously address disease prediction and pathogenesis identification. Experimental results demonstrate that MA-CBxDP significantly outperforms state-of-the-art methods. Moreover, jointly modeling disease progression and identifying pathogenesis factors is more effective than tackling these tasks independently. MA-CBxDP notably improves the prediction and identification of pathogenesis biomarkers, offering valuable insights into chronic disorders.

To validate the pathophysiological significance of the identified multimodal biomarkers, we conducted follow-up analyses, including ANOVA, gene-set analysis, functional mapping, gene expression analysis, phenome-wide association studies (PheWAS), causal mediation analysis, and enrichment analysis. These analyses confirmed the relevance of the identified risk factors. For instance, ANOVA revealed significant associations between several biomarkers and diagnostic status, while MAGMA analysis highlighted their biological significance at the gene level. Both analyses validated the effectiveness of our approach. Additionally, PheWAS explored associations between potential therapeutic targets and clinical characteristics, providing insights into their multifunctionality and mechanistic roles, which could inform future research and treatment strategies. Gene enrichment analysis further revealed the functional characteristics and biological relevance of these targets, enhancing our understanding of their roles in AD pathogenesis and treatment. In contrast, comparison methods produced numerous irrelevant signals, raising concerns for subsequent analyses. Overall, the results demonstrate that MA-CBxDP accurately and comprehensively identifies disease risk factors.

Our study has several limitations. To achieve a more comprehensive understanding of AD pathophysiology, incorporating additional imaging modalities could offer a broader perspective on disease identification. In addition, prospective randomized clinical trials are necessary to evaluate the impact of integrating pathologies into clinical workflows.

Conclusion

This study proposed an interpretable framework for causal G $\times$ E biomarker discovery and stable disease prediction in the spirit of MA learning, which can be applied to the diagnosis and prognosis of AD and other chronic diseases. We also extended the current task to a chromosome-wide setting and produced strong baseline results. Experimental results demonstrated that MA-CBxDP established new state-of-the-art results for AD and cancer while maintaining exceptional interpretability, verifying its flexibility and versatility in practical applications. To translate these findings into potential clinical applications for routine diagnostics and to validate the clinical significance of the identified biomarkers, we performed follow-up analyses and further examined the predictive and diagnostic performance using the top selected features, which could provide novel and reliable insights into the pathogenic mechanisms of chronic disorders.

Key Points

We propose a simple but efficient disease prediction approach incorporating G $\times$ E interactions, i.e. MA-CBxDP, based on MA learning, which simultaneously benefits disease prediction and biomarker identification and represents the first such attempt in the brain imaging genetics field.
We integrate sparse variable selection and a causal decorrelation module into the disease prediction task. The results demonstrate that MA-CBxDP attains superior performance in terms of causal biomarker identification and disease prediction, establishing a new state-of-the-art.

Acknowledgments

Data collection and sharing for this project was funded by the Alzheimer’s Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904) and DOD ADNI (Department of Defense award number W81XWH-12-2-0012). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: AbbVie, Alzheimer’s Association; Alzheimer’s Drug Discovery Foundation; Araclon Biotech; BioClinica, Inc.; Biogen; Bristol-Myers Squibb Company; CereSpir, Inc.; Cogstate; Eisai Inc.; Elan Pharmaceuticals, Inc.; Eli Lilly and Company; EuroImmun; F. Hoffmann-La Roche Ltd and its affiliated company Genentech, Inc.; Fujirebio; GE Healthcare; IXICO Ltd; Janssen Alzheimer Immunotherapy Research & Development, LLC; Johnson & Johnson Pharmaceutical Research & Development LLC; Lumosity; Lundbeck; Merck & Co., Inc.; Meso Scale Diagnostics, LLC; NeuroRx Research; Neurotrack Technologies; Novartis Pharmaceuticals Corporation; Pfizer Inc.; Piramal Imaging; Servier; Takeda Pharmaceutical Company; and Transition Therapeutics. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health (www.fnih.org). The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer’s Therapeutic Research Institute at the University of Southern California. ADNI data are disseminated by the Laboratory for Neuro Imaging at the University of Southern California.

Author contributions

Jin Zhang (Conceptualization, Methodology, Software, Validation, Writing—original draft), Yan Yang (Software, Visualization, Investigation), Muheng Shang (Validation, Writing—review & editing), Lei Guo (Conceptualization, Data curation, Investigation), Daoqiang Zhang (Conceptualization, Supervision, Writing—review & editing), and Lei Du (Supervision, Conceptualization, Resources, Writing—review and editing).

Conflict of interest: None declared.

Funding

This work was supported in part by the MOST 2030 Brain Project Grant (No. 2022ZD0208500); National Natural Science Foundation of China (Nos 62136004, 62373306); Innovation Foundation for Doctor Dissertation (No. CX2023062); and Fundamental Research Funds for the Central Universities at Northwestern Polytechnical University.

Data availability

Data used in the preparation of this article were obtained from the ADNI database (adni.loni.usc.edu). Code is available at https://github.com/ZJ-Techie/MA-CBxDP.

References

1. Sims, Hill, Williams. The multiplex model of the genetics of Alzheimer’s disease. Nat Neurosci 2020;311–22.

2. Shen, Thompson. Brain imaging genomics: integrated analysis and machine learning. Proc IEEE 2019;108:125–.

3. Liu, Jiayuan, Guo, et al. Environmental neuroscience linking exposome to brain structure and function underlying cognition and behavior. Mol Psychiatry 2023. https://doi.org/10.1038/s41380-022-01669-6

4. Haotian, Eckhardt, Baccarelli. Molecular mechanisms of environmental exposures and human disease. Nat Rev Genet 2023;332–. https://doi.org/10.1038/s41576-022-00569-3

5. Westerman, Sofer. Many roads to a gene-environment interaction. Am J Hum Genet 2024;111:626–. https://doi.org/10.1016/j.ajhg.2024.03.002

6. Vogel, Corriveau-Lecavalier, Franzmeier, et al. Connectome-based modelling of neurodegenerative diseases: towards precision medicine and mechanistic insight. Nat Rev Neurosci 2023;620–. https://doi.org/10.1038/s41583-023-00731-8

7. Graham, Quoc Dang, Jahanifar, et al. One model is all you need: multi-task learning enables simultaneous histology image segmentation and classification. Med Image Anal 2023;102685. https://doi.org/10.1016/j.media.2022.102685

8. Lambert J-C, Ramirez, Grenier-Boley, et al. Step by step: towards a better understanding of the genetic architecture of Alzheimer’s disease. Mol Psychiatry 2023;2716–. https://doi.org/10.1038/s41380-023-02076-1

9. Chung, Das, Sun, et al. Genome-wide association and multi-omics studies identify MGMT as a novel risk gene for Alzheimer’s disease among women. Alzheimers Dement 2023;896–908.

10. Zhang, Wang, Zhao, et al., Lei Du, and Alzheimer’s Disease Neuroimaging Initiative. Identification of multimodal brain imaging association via a parameter decomposition based sparse multi-view canonical correlation analysis method. BMC Bioinform 2022;128. https://doi.org/10.1186/s12859-022-04669-z

11. Wei W-H, Hemani, Haley. Detecting epistasis in human complex traits. Nat Rev Genet 2014;722–.

12. Lin, Calhoun, Wang Y-P. Correspondence between fMRI and SNP data by group sparse canonical correlation analysis. Med Image Anal 2014;891–902. https://doi.org/10.1016/j.media.2013.10.010

13. Duvenaud, Nickisch, Rasmussen. Additive Gaussian processes. Advances in Neural Information Processing Systems 2011;1–9.

14. Lanckriet GRG, Cristianini, Bartlett, et al. Learning the kernel matrix with semidefinite programming. J Mach Learn Res 2004.

15. Donghuan, Popuri, Ding, et al. Multiscale deep neural network based analysis of FDG-PET images for the early diagnosis of Alzheimer’s disease. Med Image Anal 2018. https://doi.org/10.1016/j.media.2018.02.002

16. Rodosthenous, Shahrezaei, Evangelou. Integrating multi-omics data through sparse canonical correlation analysis for the prediction of complex traits: a comparison study. Bioinformatics 2020;4616–. https://doi.org/10.1093/bioinformatics/btaa530

17. Witten, Tibshirani. Extensions of sparse canonical correlation analysis with applications to genomic data. Stat Appl Genet Mol Biol 2009. https://doi.org/10.2202/1544-6115.1470

18. Wenxing, Lin, Cao, et al. Adaptive sparse multiple canonical correlation analysis with application to imaging (epi) genomics study of schizophrenia. IEEE Trans Biomed Eng 2017;390–.

19. Eichler, Flint, Gibson, et al. Missing heritability and strategies for finding the underlying causes of complex disease. Nat Rev Genet 2010;446–.

20. Manolio, Collins, Cox, et al. Finding the missing heritability of complex diseases. Nature 2009;461:747–.

21. Lei, Zhang, Liu, et al. Identifying associations among genomic, proteomic and imaging biomarkers via adaptive sparse multi-view canonical correlation analysis. Med Image Anal 2021;102003. https://doi.org/10.1016/j.media.2021.102003

22. Lei, Zhang, Zhao, et al. INMTSCCA: an integrated multi-task sparse canonical correlation analysis for multi-omic brain imaging genetics. Genom Proteom Bioinform 396–413.

23. Wang, Zhang, et al. Preventing prefrontal dysfunction by tDCS modulates stress-induced creativity impairment in women: an fNIRS study. Cereb Cortex 2023;10528–. https://doi.org/10.1093/cercor/bhad301

24. Lei, Liu, Yao, et al. Pattern discovery in brain imaging genetics via SCCA modeling with a generic non-convex penalty. Scientific Reports 2017;14052.

25. de Los Campos, Funkhouser, et al. Fine mapping and accurate prediction of complex traits using Bayesian variable selection models applied to biobank-size data. Eur J Hum Genet 2023;313–. https://doi.org/10.1038/s41431-022-01135-5

26. Zhou, Ren, et al. Sparse group variable selection for gene–environment interactions in the longitudinal study. Genet Epidemiol 2022;317–.

27. Ren, Zhang, et al. Gene–environment interaction identification via penalized robust divergence. Biom J 2022;461–. https://doi.org/10.1002/bimj.202000157

28. Wang, Liang, Zhang, et al. Replicability in cancer omics data analysis: measures and empirical explorations. Brief Bioinform 2022;bbac304.

29. Sheng, Jun Li, Xiao, et al. Discriminative multi-view subspace feature learning for action recognition. IEEE Trans Circuits Syst Video Technol 2019;4591–600. https://doi.org/10.1109/TCSVT.2019.2918591

30. Zhao, Cui, Huang, et al. Supervised brain network learning based on deep recurrent neural networks. IEEE Access 2020;69967–. https://doi.org/10.1109/ACCESS.2020.2984948

31. Zhang, Shang, Yang, et al., Lei Du, and Alzheimer’s Disease Neuroimaging Initiative. Disease progression prediction incorporating genotype-environment interactions: a longitudinal neurodegenerative disorder study. MICCAI 2024;152–162.

32. Zhang, Yang, et al., Lei Du, and Alzheimer’s Disease Neuroimaging Initiative. Modeling genotype–protein interaction and correlation for Alzheimer’s disease: a multi-omics imaging genetics study. Brief Bioinform 2024;bbae038.

33. X-a, Liu, Xie, et al. Morbigenous brain region and gene detection with a genetically evolved random neural network cluster approach in late mild cognitive impairment. Bioinformatics 2020;2561–. https://doi.org/10.1093/bioinformatics/btz967

34. Cui, Athey. Stable learning establishes some common ground between causal inference and machine learning. Nat Mach Intell 2022;110–. https://doi.org/10.1038/s42256-022-00445-z

35. Tong Tong, Chen, Hastie, et al. Genome-wide association analysis by lasso penalized logistic regression. Bioinformatics 2009;714–. https://doi.org/10.1093/bioinformatics/btp041

36. Jacob, Obozinski, Vert J-P. Group lasso with overlap and graph lasso. In: Danyluk A, Bottou L, Littman M. (eds), Proceedings of the 26th Annual International Conference on Machine Learning. Montreal, Quebec, Canada. New York, NY, USA: Association for Computing Machinery (ACM), pp. 433–, 2009.

37. Friedman, Hastie, Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics 2008;432–. https://doi.org/10.1093/biostatistics/kxm045

38. Ivanoff, Picard, Rivoirard. Adaptive lasso and group-lasso for functional Poisson regression. J Mach Learn Res 2016;1903–.

39. Fan. Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc 2001;1348–. https://doi.org/10.1198/016214501753382273

40. Jiang, Zheng, Dong. Sparse and robust estimation with ridge minimax concave penalty. Inform Sci 2021;571:154–. https://doi.org/10.1016/j.ins.2021.04.047

41. Rosenbaum, Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika 1983. https://doi.org/10.1093/biomet/70.1.41

42. Fong, Hazlett, Imai. Covariate balancing propensity score for a continuous treatment: application to the efficacy of political advertisements. Ann Appl Stat 2018;156–.

43. Yun. Matching on balanced nonlinear representations for treatment effects estimation. Adv Neural Inform Process Syst 2017;10638. https://doi.org/10.1007/978-3-319-70139-4

44. Casale, Rakitsch, Lippert, et al. Efficient set tests for the genetic analysis of correlated traits. Nat Methods 2015;755–.

45. Kuang, Cui, et al. Estimating treatment effect in the wild via differentiated confounder balancing. In: Pei J, Wang H, Wang W. (eds), Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, USA: Association for Computing Machinery (ACM), pp. 265–, 2017.

46. Solovieff, Cotsapas, Lee, et al. Pleiotropy in complex traits: challenges and strategies. Nat Rev Genet 2013;483–.

47. Liang, Zhang. Hierarchical false discovery rate control for high-dimensional survival analysis with interactions. Comput Stat Data Anal 2024;192:107906. https://doi.org/10.1016/j.csda.2023.107906

48. Gao, Cui, Shen, et al. Shared genetic etiology between type 2 diabetes and Alzheimer’s disease identified by bioinformatics analysis. J Alzheimers Dis 2016.

49. Raber, Huang, Wesson, et al. APOE genotype accounts for the vast majority of AD risk and AD pathology. Neurobiol Aging 2004;641–. https://doi.org/10.1016/j.neurobiolaging.2003.12.023

50. Liu C-C, Kanekiyo, Huaxi, et al. Apolipoprotein E and Alzheimer disease: risk, mechanisms and therapy. Nat Rev Neurol 2013;106–. https://doi.org/10.1038/nrneurol.2012.263

51. de Leeuw, Mooij, Heskes, et al. MAGMA: generalized gene-set analysis of GWAS data. PLoS Comput Biol 2015;e1004219. https://doi.org/10.1371/journal.pcbi.1004219

52. Sammut S-J, Crispin-Ortuzar, Chin S-F, et al. Multi-omic machine learning predictor of breast cancer therapy response. Nature 2022;601:623–. https://doi.org/10.1038/s41586-021-04278-5

Author notes

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
