Abstract

Integrating and analyzing multiple omics datasets, such as genomics, environmental influences, and imaging endophenotypes, has yielded an abundance of candidate biomarkers. However, translating such findings into beneficial clinical knowledge for disease prediction remains challenging. This becomes even more challenging when studying interpretable high-order feature interactions such as gene-environment interaction (G×E) to understand the etiology. To fill this gap, we draw on the idea of mutual-assistance (MA) learning and accordingly propose a fresh and powerful scheme, referred to as mutual-assistance causal biomarker discovery and stable disease prediction approach (MA-CBxDP). Specifically, we design an interpretable bi-directional mapping framework, integrated with a causal feature interaction module, to extract co-expression patterns across different modalities and identify trustworthy biomarkers including G×E. A cooperative prediction module is further incorporated to ensure accurate diagnosis and identification of causal effects for pathogenesis. Importantly, biomarker discovery and disease prediction can mutually reinforce each other, helping to provide novel insights into chronic diseases. Furthermore, in light of the large computational burden incurred by the high-dimensional interactions, we devise a rapid strategy and extend it to a more practical but challenging chromosome-wide setting. We conduct extensive experiments on two databases under three tasks, i.e. multimodal correlation, disease diagnosis, and trait prediction. MA-CBxDP establishes new state-of-the-art results in predicting clinical scores and disease status classification, while maintaining exceptional interpretability, verifying its flexibility and versatility in practical applications.

Introduction

Human diseases, such as Alzheimer’s disease (AD) and cancer, are inherently complex and governed by a multitude of factors encompassing genetics, endophenotypes, and environmental influences [1, 2]. Genomic data reflect the innate predisposition of an individual, while endophenotypes such as neuroimaging data and proteomic expression provide insights into the individual’s current disease status [3, 4]. In addition, environmental influences capture distinct disease information from environmental and behavioral aspects. Therefore, integrating these diverse modalities could help discern causal effects in disease pathogenesis and thereby inform effective strategies to slow down or halt disease progression [5].

Imaging genetics is a rapidly evolving field for exploring the genetic architecture of brain endophenotypes [6–10]. During the past decade, univariate, multivariate, and bi-multivariate techniques have been harnessed to study genetic and phenotypic correlations. But they predominantly focus on the main effects of SNPs, and thus may overlook important heritability factors [11, 12]. To address this limitation, non-linear learning methods like kernel techniques and deep neural networks have been introduced to model feature interactions [2, 13–15]. These approaches often lack interpretability and may not consistently produce optimal feature effects and their interactions [16–18]. Hence, there is a pressing need for an efficient method to identify interpretable gene–environment interactions, enabling deeper exploration of the intricate relationships among genotype, environment, and imaging phenotypes in the context of brain disorders [19–24].

Intuitively, integrating and analyzing genomics, exposomes, and endophenotypes offers the potential to obtain better diagnostics and therapeutics. However, directly fusing these datasets and exploring gene–environment interactions on endophenotypes presents several challenges [25–32]: spurious correlations that undermine biological explanation, unstable diagnostic prediction, and the high dimensionality of genetic variations, all of which could bias subsequent healthcare decision-making [33, 34]. Variable selection is a prominent area in machine learning and biomarker discovery. Many attractive solutions exist, including Lasso, elastic net, smoothly clipped absolute deviation, and the minimax concave penalty (MCP), but they may perform poorly when dealing with highly correlated variables and can lead to misleading biological knowledge [35–40]. In addition, causal inference, e.g. propensity score matching and confounder balancing, serves as a potent statistical tool for selecting explanatory variables [41–43]. Nevertheless, variable selection can manifest substantial variability when confronted with high-dimensional features and limited sample sizes. Although it is promising to marry causal feature selection and disease prediction, two challenges remain. First, removing spurious correlations among genetic variants due to linkage disequilibrium (LD) for causal biomarker discovery is difficult, because the many unexpected feature correlations that arise in high-dimensional settings can hinder feature selection [43, 44]. Second, feature decorrelation and disease prediction are distinct tasks with differing objectives, so leveraging causal feature selection for disease prediction is also challenging [44, 45]. Developing a task-oriented feature decorrelation framework is therefore essential. However, designing a scalable feature decorrelation method for prediction tasks remains complex because of the inherent difficulty of directly aligning feature decorrelation with prediction objectives [46, 47].

To this end, we draw on the idea of mutual-assistance (MA) learning and propose a simple yet versatile method, referred to as mutual-assistance causal biomarker discovery and stable disease prediction (MA-CBxDP), to address the aforementioned challenges (refer to Fig. 1). In particular, MA-CBxDP incorporates an interpretable bi-directional mapping structure that embeds causally driven feature interactions into the disease prediction task. Interestingly, biomarker discovery plays a pivotal role in ensuring that disease prediction models achieve robust and representative performance. Conversely, accurate disease prediction facilitates the precise identification of disease-associated biomarkers. Notably, the synergistic interplay between biomarker discovery and disease prediction has the potential to mutually enhance their effectiveness, ultimately offering novel and reliable insights into the pathogenic mechanisms underlying chronic disorders.

Figure 1

A schematic illustration of imaging endophenotype-assisted gene–environment interaction analysis. First, the bi-directional mapping framework incorporates genetic variations, exposomes, and endophenotypes simultaneously. The feature-interaction module identifies biologically meaningful genetic and environmental factors as well as G×E interactions. Moreover, cooperative prediction modules are designed to ensure accurate diagnosis and identification of causal effects for pathogenesis.

Furthermore, considering the computational complexity of high-dimensional gene–environment interactions, we develop a fast optimization algorithm in the spirit of divide-and-conquer. Two publicly available datasets, i.e. Alzheimer’s disease neuroimaging initiative (ADNI) and multimodal breast cancer (BC) datasets, were employed to demonstrate the effectiveness and superiority of MA-CBxDP.

Materials and methods

As mentioned above, we draw on the idea of MA learning and integrate the disease progression and biomarker identification tasks into a whole, which could benefit both subtasks mutually. For ease of presentation, genetic variations and environmental exposures are denoted as $X\in\mathbb{R}^{n\times p}$ and $E\in\mathbb{R}^{n\times r}$, the $f$-th brain endophenotype is denoted as $Y_f\in\mathbb{R}^{n\times q_f}$, and $z$ contains the clinical scores or diagnostic status, where $n$ is the number of subjects and $p$, $q$, and $r$ represent the numbers of SNPs, intermediate phenotypes, and environmental exposures, respectively. Then, MA-CBxDP is defined as follows:

(1)

$U=[u_1,\ldots,u_F]\in\mathbb{R}^{p\times F}$ carries the main effects of SNPs, $V=[v_1,\ldots,v_F]\in\mathbb{R}^{q\times F}$ carries the main effects of endophenotypes, and $Q\in\mathbb{R}^{p\times r}$ carries the G×E interaction effects between SNPs and environmental markers. $\Omega(U)$, $\Omega(Q)$, and $\Omega(V)$ are penalty terms for biomarker detection.

In our model, the first term is a bi-directional mapping framework, integrated with a feature interaction module, to extract the co-expression patterns across different modalities. It is well known that SNPs within an LD block are commonly correlated due to shared ancestry [44]. Consequently, a genotype–phenotype correlation may be the indirect result of a correlated functional variant, potentially leading to misidentification of the actual causal variants. Hence, we design a causal decorrelation module, i.e. $DC_i=\frac{\tilde{P}(D)}{P(D)}$, to promote independence among variables by learning sample weights. This helps the model focus on the true connection between biomedical variables and disease outcomes.
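To make the reweighting idea concrete, the following toy sketch shows how density-ratio sample weights can remove the correlation between two discrete features. It assumes the weights take the form of a ratio between a target distribution with independent features and the observed joint distribution; the function names (`density_ratio_weights`, `decorrelation_loss`) and the six-sample dataset are hypothetical and are not taken from MA-CBxDP, which learns its weights inside the full objective.

```python
import numpy as np

def weighted_cov(X, w):
    """Covariance of the columns of X under normalized sample weights w."""
    p = w / w.sum()
    mu = p @ X
    Xc = X - mu
    return (Xc * p[:, None]).T @ Xc

def decorrelation_loss(X, w):
    """Sum of squared off-diagonal entries of the weighted covariance."""
    C = weighted_cov(X, w)
    off = C - np.diag(np.diag(C))
    return float(np.sum(off ** 2))

def density_ratio_weights(X):
    """Weights P_indep(x) / P_joint(x) from empirical frequencies.

    Assumption: features are discrete, so frequencies can be counted directly.
    """
    n, d = X.shape
    joint = np.array([(X == x).all(axis=1).mean() for x in X])
    indep = np.ones(n)
    for j in range(d):
        indep *= np.array([(X[:, j] == v).mean() for v in X[:, j]])
    return indep / joint

# Two binary features that co-occur more often than chance (an LD-like pattern).
X = np.array([[0, 0], [0, 0], [1, 1], [1, 1], [0, 1], [1, 0]], dtype=float)
loss_uniform = decorrelation_loss(X, np.ones(len(X)))
loss_weighted = decorrelation_loss(X, density_ratio_weights(X))
```

In this toy case the reweighted covariance between the two features vanishes, whereas uniform weights leave a positive covariance. With continuous, high-dimensional data the ratio cannot be counted directly and has to be learned, which is where the sparse regularizer described next becomes relevant.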

Moreover, $\psi(X,E,Y_f;z)$ is the cooperative prediction module based on a metric function, either the linear regression loss $\psi_{\mathrm{lin}}(D_f,z)=\frac{1}{2}\sum_{i=1}^{n}(z_i-\langle D_f^i,w_f\rangle)^2$ or the logistic regression loss $\psi_{\log}(D_f,z)=\frac{1}{2}\sum_{i=1}^{n}\left(\log(1+\exp\langle D_f^i,w_f\rangle)-z_i\langle D_f^i,w_f\rangle\right)$, to ensure accurate diagnosis and identification of causal effects for pathogenesis. $\Lambda$ is a tuning parameter that balances the prediction contributions. However, with finite samples, it is difficult to learn weights that ensure perfect independence and thereby remove the unstable variables. Thus, we further design a sparse variable regularizer to reduce the feature dimensionality, facilitating decorrelation and adapting to large-scale causal effect exploration settings. We use $\Omega(Q)=\|Q\|_{1,1}=\sum_i\sum_j|Q_{i,j}|$ to identify interpretable G×E effects. $\Omega(U)$ and $\Omega(V)$ control the sparsity of the main effects of genetics and endophenotypes. We employ the FGL$_{2,1}$-norm, $\|U\|_{\mathrm{FGL}_{2,1}}=\sum_{i=1}^{p-1}\sqrt{\|u^i\|_2^2+\|u^{i+1}\|_2^2}$, the $\ell_{2,1}$-norm, and the $\ell_1$-norm to identify relevant SNPs at the group and individual levels. Thus, $\Omega(U)=\lambda_{u1}\|U\|_{\mathrm{FGL}_{2,1}}+\lambda_{u2}\|U\|_{2,1}+\lambda_{u3}\|U\|_{1,1}$. In addition, the $\ell_1$-norm is introduced to identify interpretable endophenotypes, i.e. $\Omega(V)=\lambda_v\|V\|_{1,1}$. Furthermore, we organically incorporate sparse variable selection and independence-based sample reweighting within our predictive modeling framework. Together, these operators enable diverse, interpretable, predictive, and causal biomarker identification with limited data.
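The prediction losses and sparsity penalties described above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: it assumes that $u^i$ denotes the $i$-th row of $U$, that the inner product in the losses is the product of a design matrix `D` with a weight vector `w`, and that all function names are hypothetical.

```python
import numpy as np

def l11_norm(M):
    """Entry-wise l1 norm: sum of |M[i, j]| (individual-level sparsity)."""
    return float(np.abs(M).sum())

def l21_norm(M):
    """l2,1 norm: sum of the Euclidean norms of the rows (row-level sparsity)."""
    return float(np.sqrt((M ** 2).sum(axis=1)).sum())

def fgl21_norm(M):
    """Fused group l2,1 norm: adjacent rows i and i+1 are penalized as a pair."""
    row_sq = (M ** 2).sum(axis=1)
    return float(np.sqrt(row_sq[:-1] + row_sq[1:]).sum())

def omega_u(U, lam1, lam2, lam3):
    """Combined SNP penalty: FGL2,1 + l2,1 + l1,1 with separate weights."""
    return lam1 * fgl21_norm(U) + lam2 * l21_norm(U) + lam3 * l11_norm(U)

def psi_lin(D, w, z):
    """Linear regression loss: half the sum of squared residuals."""
    r = z - D @ w
    return float(0.5 * np.sum(r ** 2))

def psi_log(D, w, z):
    """Logistic loss with labels z in {0, 1}; logaddexp avoids overflow."""
    s = D @ w
    return float(0.5 * np.sum(np.logaddexp(0.0, s) - z * s))

# Small example: three SNPs (rows) and two endophenotype tasks (columns).
U_demo = np.array([[3.0, 4.0], [0.0, 0.0], [1.0, 0.0]])
penalty_demo = omega_u(U_demo, lam1=1.0, lam2=1.0, lam3=1.0)
```

The fused term couples neighboring SNPs, so variants that sit next to each other tend to be selected or discarded together, while the $\ell_{1,1}$ term still allows individual entries to be zeroed out.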

To sum up, our model first uses the bi-directional mapping module, which makes it easier and more reasonable to incorporate genetic variations, environmental influences, and endophenotypes simultaneously. We also present a feature interaction operation, in which G×E is reintegrated to unveil the inherent unity across modalities. Second, to ensure stability and interpretability, we combine the decorrelation-regularized and sparsity-regularized modules in an iterative way to select important features and eliminate biased biomarkers. Third, we design a disease diagnosis module to ensure precise diagnosis and the identification of causal effects for pathogenesis. MA-CBxDP is multi-convex, which enables optimization through an alternating convex search strategy: we first fix V and Q and solve for U using gradient descent, and then iteratively update each variable while treating the others as constants. The objective of MA-CBxDP is bounded below by zero, which, together with the monotone decrease of the objective, guarantees convergence to a local optimum.

Extension to chromosome-wide analysis

The direct application of MA-CBxDP to chromosome-wide or genome-wide analysis is challenging because of the computational cost of the genomic main effects and G×E terms. To address this issue, we partition the genome data into K non-intersecting subsets, denoted as $U=\bigcup_{k=1}^{K}U_k$ and $Q=\bigcup_{k=1}^{K}Q_k$, respectively. The choice of K can be user-defined or data-driven. Instead of computing the main and interaction terms directly, we calculate these effects within each subset and then aggregate the outcomes across all genotypes, which significantly reduces the computational complexity. Following a divide-and-conquer principle, we redefine MA-CBxDP as:

(2)

The matrix concatenation operator merges SNPs and interaction terms and enables parallel processing, allowing the SNP blocks and their interaction terms to be treated independently. This strategy reduces memory requirements, as fast MA-CBxDP only needs to retain small SNP matrices during iterations, which makes it practical for chromosome-wide or genome-wide analyses.
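The divide-and-conquer idea can be illustrated with a short NumPy sketch. It assumes that, for one endophenotype, the genetic score of subject $i$ is the main effect $x_i^\top u$ plus the interaction $x_i^\top Q e_i$; this form and the function names are assumptions made for illustration, since the exact block-wise objective is the one given in Eq. (2).

```python
import numpy as np

def full_scores(X, E, u, Q):
    """Main effect X u plus the G×E term x_i^T Q e_i, computed in one pass."""
    main = X @ u
    inter = np.einsum("ij,jk,ik->i", X, Q, E)
    return main + inter

def blockwise_scores(X, E, u, Q, n_blocks):
    """Same scores, accumulated over disjoint SNP blocks.

    Only one block of X (and the matching rows of u and Q) is needed at a
    time, so the blocks could be streamed from disk or handled in parallel.
    """
    n, p = X.shape
    total = np.zeros(n)
    for idx in np.array_split(np.arange(p), n_blocks):
        Xk = X[:, idx]
        total += Xk @ u[idx]                            # block main effects
        total += np.sum((Xk @ Q[idx, :]) * E, axis=1)   # block G×E effects
    return total

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 10))   # 8 subjects, 10 SNPs
E = rng.standard_normal((8, 3))    # 3 environmental exposures
u = rng.standard_normal(10)
Q = rng.standard_normal((10, 3))
same = np.allclose(full_scores(X, E, u, Q), blockwise_scores(X, E, u, Q, n_blocks=4))
```

Because both terms are sums over SNPs, splitting the SNP index set changes only the order of summation, which is why the block-wise result matches the direct computation.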

Convergence analysis

 

Theorem 1.1.

The objective value of Algorithm 1 decreases monotonically in each iteration.

The optimization process reveals that the sub-objectives for U, Q, and V constitute three convex sub-problems, so iteratively solving for U, Q, and V never increases the objective. Because the objective of MA-CBxDP is bounded from below by zero, Algorithm 1 is guaranteed to converge to a local optimum.
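The convergence argument can be checked numerically on a toy problem. The sketch below applies alternating convex search to a two-block biconvex objective, $\frac{1}{2}\|M-uv^\top\|_F^2+\frac{\lambda}{2}(\|u\|_2^2+\|v\|_2^2)$, which is non-negative and convex in each block. It is a stand-in chosen for brevity, not the MA-CBxDP objective, and it uses closed-form block updates rather than the gradient descent step described for U.

```python
import numpy as np

def objective(M, u, v, lam):
    """Non-negative biconvex objective: rank-one fit plus ridge terms."""
    fit = 0.5 * np.sum((M - np.outer(u, v)) ** 2)
    return float(fit + 0.5 * lam * (u @ u + v @ v))

def alternating_search(M, lam=0.1, n_iter=20, seed=0):
    """Minimize over u with v fixed, then over v with u fixed, and repeat."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(M.shape[0])
    v = rng.standard_normal(M.shape[1])
    history = [objective(M, u, v, lam)]
    for _ in range(n_iter):
        u = M @ v / (v @ v + lam)      # exact minimizer of the convex u-subproblem
        v = M.T @ u / (u @ u + lam)    # exact minimizer of the convex v-subproblem
        history.append(objective(M, u, v, lam))
    return u, v, history

M = np.random.default_rng(1).standard_normal((6, 4))
u_hat, v_hat, history = alternating_search(M)
monotone = all(b <= a + 1e-9 for a, b in zip(history, history[1:]))
```

Each block update solves a convex sub-problem exactly, so the recorded objective values form a non-increasing sequence that is bounded below by zero, which is the same two-part argument used for Algorithm 1.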

Results

Experimental setup

State-of-the-art methods: We compared MA-CBxDP with the most closely related state-of-the-art methods, including SMCCA, adaptive SMCCA, and RelPMDCCA. These techniques are representative computational imaging genetics methods and can be reduced to specific genotype–phenotype analytical methods [16–18].

Evaluation criteria and parameter setting: Performance was assessed with three metrics: the identified feature subsets, the canonical correlation coefficient (CCC), and the root mean squared error (RMSE). We used nested five-fold cross-validation to tune the parameters over the candidate set $10^{i}$ ($i=-4,-3,-2,\ldots,0,\ldots,2,3,4$), selecting the parameters that produced the highest mean testing CCCs.
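For reference, the candidate grid and the two numerical metrics can be written down in a few lines of NumPy. This is a sketch under assumptions: CCC is taken to be the Pearson correlation between the canonical variables $Xu$ and $Yv$, RMSE is computed between observed and predicted values, and the function names are hypothetical.

```python
import numpy as np

# Candidate values 10^i for i = -4, ..., 4 used for each tuning parameter.
candidate_grid = [10.0 ** i for i in range(-4, 5)]

def ccc(X, u, Y, v):
    """Pearson correlation between the canonical variables X u and Y v."""
    a, b = X @ u, Y @ v
    return float(np.corrcoef(a, b)[0, 1])

def rmse(z_true, z_pred):
    """Root mean squared error between observed and predicted values."""
    diff = np.asarray(z_true, dtype=float) - np.asarray(z_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))
```

In a nested cross-validation, `ccc` would be evaluated on the held-out fold for every grid value, and the value with the highest mean testing CCC across inner folds would be retained.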

Application to Alzheimer’s disease

Dataset: Data used in the preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). The ADNI was launched in 2003 as a public–private partnership, led by Principal Investigator Michael W. Weiner, MD. The primary goal of ADNI has been to test whether serial magnetic resonance imaging, positron emission tomography, other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment (MCI) and early AD.

The dataset included imaging, genetic, and proteomic markers as well as environmental factors. In detail, gray matter volumes from regions of interest were extracted using voxel-based morphometry (VBM), while cortical thickness values were obtained via FreeSurfer; the two datasets are referred to as VBM and FreeSurfer, respectively. Environmental factors, including human behavior variables (e.g. age, body mass index, visual ability, alcohol abuse, drug sensitivities, blood pressure, smoking, and stroke), were recorded. Proteomic analyses were conducted using the rules-based medicine proteomic panel, yielding 146 markers after quality control. Cognitive scores from ADAS, MMSE, and RAVLT assessments were also collected. Additionally, 10 000 SNPs were selected from the ADNI database and encoded additively by minor allele count. To ensure data consistency and reliability, SNPs, proteomic markers, and neuroimaging data were normalized to z-scores, eliminating scale-related biases.
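The two preprocessing steps mentioned above, additive coding and z-score normalization, can be sketched as follows. The two-letter genotype strings, the function names, and the use of the population standard deviation are assumptions made for illustration; the actual ADNI preprocessing pipeline is not specified here.

```python
import numpy as np

def additive_code(genotypes, minor_allele):
    """Count copies of the minor allele (0, 1, or 2) in each two-letter genotype."""
    return np.array([g.count(minor_allele) for g in genotypes], dtype=float)

def zscore(X):
    """Column-wise z-score; constant columns are mapped to zero."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    sd = X.std(axis=0)
    sd = np.where(sd == 0, 1.0, sd)  # avoid division by zero
    return (X - mu) / sd

snp = additive_code(["AA", "AG", "GG", "GA"], minor_allele="G")
snp_z = zscore(snp.reshape(-1, 1))
```

After this step every feature has zero mean and unit variance, so penalty weights act on comparable scales across SNPs, proteomic markers, and imaging measures.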

Improved multimodal correlation

Figure 2 presents the mean and standard deviation of the correlation (CCC) and regression error (RMSE) for all methods on four genotype–phenotype correlation tasks: SNP-VBM, SNP-FreeSurfer, SNP-Plasma, and SNP-CSF. Higher CCC and lower RMSE denote better performance. MA-CBxDP exhibited the highest CCCs and lowest RMSEs. Moreover, our method yielded smaller standard deviations than the baselines, indicating stable prediction, which can be attributed to the causal bi-directional mapping framework. Ignoring G×E interactions led to sub-optimal performance, emphasizing the importance of modeling feature interactions rather than focusing solely on the main effects.

Figure 2

Comparison of testing (a) CCCs and (b) RMSEs for genotype–phenotype relationship. “Without GE” means that the G×E interaction component was removed.

Improved predicting clinical scores and disease status classification

We further investigated prediction and diagnosis performance using the top ten selected features, including SNPs and endophenotypes. In Fig. 3a, our method outperformed the baseline approaches in predicting clinical scores (ADAS, MMSE, and RAVLT), achieving the highest CCC and the lowest standard error using linear regression. This highlights the effectiveness of the causality and cooperative prediction modules in delivering stable predictions. Furthermore, the LIBSVM software package was employed to classify HCs, MCIs, and ADs. Figure 3b displays the mean and standard deviation of accuracy (ACC), where MA-CBxDP achieved the highest testing ACC and the smallest standard deviation, highlighting the superior predictive ability of its top-ranked features. These findings underscore the value of the MA framework in enhancing disease classification and cognitive prediction. In addition, the variant that includes G×E exhibited the best predictive performance, indicating that modeling feature interactions is necessary and beneficial.

Figure 3

Comparison of testing (a) CCCs and (b) ACCs on ADNI dataset for predicting clinical scores and disease status classification. “Without GE” means that the G×E interaction component was removed.

Biomarkers identification and explanation

Main and interaction effects: The heat map in Fig. 4 shows the top selected SNPs, and the color bar indicates the relative importance of the features. MA-CBxDP identified multiple AD-related risk loci, including the well-known rs429358 (APOE). In addition, MA-CBxDP identified rs56131196, rs4420638, and rs12721051 (all in APOC1 [48–50]), whose joint selection was supported by the FGL$_{2,1}$-norm. In contrast, the competitors reported numerous irrelevant signals, which could mislead subsequent analyses. These findings demonstrate the interpretability of MA-CBxDP for biomarker discovery.

Figure 4

Weights (mean value) of SNPs from five-fold cross-validation. Each row is a method: (1) SMCCA; (2) Adaptive SMCCA; (3) RelPMDCCA; (4) MA-CBxDP.

In addition to the main effects, our method also uncovered significant G×E interactions, with the top five interactions identified as follows: (rs75654248, smoking), (rs75654248, alcohol abuse), (rs440446, visual impairment), (rs390082, drug sensitivities), and (rs1064725, alcohol abuse). First, the rs75654248–smoking interaction was associated with AD; rs75654248 has previously been reported as an AD risk variant, and smoking is also an important factor that could contribute to AD. These interactions can provide promising evidence for precise diagnosis, because the co-occurrence of these abnormalities may help clinicians diagnose at-risk individuals with greater confidence. We also conducted an ablation study by removing the feature interaction module from MA-CBxDP (Table 1). Optimal performance was achieved when both the main effects and the interaction effects of genotype and environment were considered, underscoring the importance of their joint consideration. Beyond diagnosis, some of these SNP–environment interactions may help in treatment, either by slowing the disease or by counseling family members who may also carry the variant to reduce their risk.

Table 1

Ablation studies on main modules of different design choices on ADNI database

id  | LSVR, LFIL, LDC, LCP | PCCC        | MCCC
(a) | × × × ×              | 0.66 ± 0.06 | 0.29 ± 0.08
(b) | × × ×                | 0.67 ± 0.06 | 0.31 ± 0.07
(c) | × ×                  | 0.68 ± 0.05 | 0.32 ± 0.07
(d) | ×                    | 0.70 ± 0.03 | 0.34 ± 0.04
(e) |                      | 0.72 ± 0.03 | 0.36 ± 0.04

“LSVR” denotes the sparse variable regularizer, “LFIL” the feature interaction module, “LDC” the causal decorrelation module, and “LCP” the cooperative prediction module. PCCC is the average CCC for predicting clinical scores, and MCCC is the multimodal genotype–phenotype correlation. Bold indicates the best result, underlined is second best. Higher is better.

Phenotype feature explanation

Identification and interpretation of neuroimaging-derived phenotype markers (VBM): Identifying the intermediate phenotypes affected by AD is also crucial for computer-aided diagnosis. Figs 5 and 6 present the selected neuroimaging phenotypes and follow-up analyses, while Fig. 7 illustrates the feature selection for the other phenotypes. In detail, the top neuroimaging-derived phenotype markers identified by MA-CBxDP (Figs 7a and 5a) were selected and mapped onto the brain for visualization. Notably, MA-CBxDP identified key brain regions such as the left and right hippocampus and the right lingual area. Research has indicated the importance of the hippocampus in memory and its significance in diagnosing AD. Further experiments on various phenotypes reinforced the efficacy of our method. In contrast, the baselines could not identify reliable biomarkers, hindering their performance in AD diagnosis, as illustrated in Fig. 7a. These results demonstrate the superior capability of MA-CBxDP in identifying AD-affected phenotypes.

Figure 5

(a) Visualization of identified brain imaging QTs. (b) Heatmap of pairwise correlation between top ten SNPs and endophenotypes, where symbol “×” indicates the pairwise association reached the significance level (P<.05).

Figure 6

Investigation of the mediating effects of endophenotype characteristics on diagnostic phenotypes caused by genetic variants (a) rs429358 and (b) rs6857. (*, P < .05), (**, P < .01), (***, P < .001).

Figure 7

Biomarker discovery for important endophenotypes on ADNI. (a) Neuroimaging-derived phenotypes (VBM). (b) Neuroimaging-derived phenotypes (FreeSurfer). (c) Plasma-derived proteomic markers. (d) CSF-derived proteomic markers. Each row is a method: (1) SMCCA; (2) adaptive SMCCA; (3) RelPMDCCA; (4) proposed.

Identification and interpretation of neuroimaging-derived phenotype markers (FreeSurfer): The feature selection results in Fig. 7b show the neuroimaging-derived phenotypes (FreeSurfer) identified by MA-CBxDP. MA-CBxDP assigned markedly elevated weights to a distinct subset of FreeSurfer markers, namely RPallvol and LEntCtx, which have previously been associated with elevated AD risk. This recurrent recognition of AD-related neuroimaging-derived biomarkers underscores the effectiveness of MA-CBxDP in precisely and reliably identifying disease-relevant markers.

Identification and interpretation of plasma-derived proteomic markers: Fig. 7c highlights the important plasma-derived proteomic markers. With the causal variable decorrelation module, our approach pinpointed various proteomic markers linked to AD, such as ApoE, ApoB, and CRP. In contrast, while the comparison methods also detected some AD-associated proteomic markers, they yielded many extraneous signals, impeding a coherent interpretation.

Identification and interpretation of CSF-derived proteomic markers: The heatmap in Fig. 7d shows the canonical weights assigned to cerebrospinal fluid (CSF)-derived proteomic markers. MA-CBxDP recognized CSF-derived proteomic markers associated with AD, including FGF-4, ApoD, and ApoE, among others. In contrast, the benchmark methods selected a multitude of markers, complicating interpretation. Taken together, the plasma and CSF results indicate that MA-CBxDP surpassed the benchmark methods in identifying reliable proteomic markers.

Genotype–phenotype correlated detection

To further illustrate the biological effect, a genotype–phenotype correlation analysis of neuroimaging-derived phenotype markers (VBM) was conducted. In Fig. 5b, most genotype–phenotype correlations were significant. Of note, the association between rs429358 and the left hippocampus reached the significance level, reinforcing the genetic impact on hippocampal abnormalities. In addition, extended experiments on other genotype–phenotype correlations also demonstrated the superiority of our framework, which could provide comprehensive insights into the potential pathological mechanisms of AD. To sum up, these analyses highlight the capability of MA-CBxDP in discerning causal genotype–phenotype relationships for AD pathogenesis.

We have independently shown the relevance of the identified SNPs and imaging QTs. Beyond exploring whether individual biomarkers are associated with disease, we further investigated, among the detected genotype–phenotype pairs of AD, the causal relations between SNPs, imaging QTs, and diagnostic phenotypes (Fig. 6). We examined the top two genotype–phenotype pairs, i.e. (rs429358, hippocampus) and (rs6857, hippocampus), and performed mediation analyses to establish the causal relationships between genetic variants, endophenotypes, and diagnostic phenotypes. Mediation analysis is a statistical method used to explore and quantify the mechanisms through which a causal effect operates. This framework is grounded in causal inference and is particularly useful when trying to understand how or why an exposure affects an outcome. We checked whether the genetic effects on diagnosis could be explained by endophenotypes. Specifically, the genetic variant is treated as the input variable, each diagnostic phenotype serves as the outcome variable, and the mean endophenotype (hippocampus) acts as the mediator.

With the endophenotype (hippocampus) as the mediator, we found a significant association between rs429358 and diagnostic phenotypes (β = –0.37, P < 0.001). This association was mediated by the hippocampus (bootstrapped average causal mediation effect: β = –0.07 [–0.11, –0.03]), indicating a partial mediation effect. Similarly, the hippocampus partially mediated the effect of rs6857 on participants’ diagnostic phenotypes (bootstrapped average causal mediation effect: β = –0.06 [–0.11, –0.03]). In summary, the hippocampus mediates the effects of risk genetic biomarkers on the diagnostic phenotype, which confirms the physiological significance of the identified hippocampal biomarkers and supports the value of the causal genotype–phenotype relationships for AD pathogenesis. This underscores the effectiveness of MA-CBxDP in pinpointing causal genetic variation and could deepen our understanding of the pathogenesis of AD.
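The logic of such a mediation analysis can be illustrated with a small NumPy sketch on synthetic data. It uses the classical product-of-coefficients estimator with a percentile bootstrap, which is a simplification and not necessarily the procedure used to obtain the values reported above; the function names, effect sizes, and simulated variables are all hypothetical.

```python
import numpy as np

def ols(design, y):
    """Least-squares coefficients with an intercept in position 0."""
    A = np.column_stack([np.ones(len(y)), design])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def mediation(x, m, y):
    """Product-of-coefficients decomposition for x -> m -> y."""
    a = ols(x, m)[1]                          # effect of x on the mediator
    coef = ols(np.column_stack([x, m]), y)    # y regressed on x and m jointly
    direct, b = coef[1], coef[2]
    total = ols(x, y)[1]                      # effect of x on y without m
    return {"indirect": a * b, "direct": direct, "total": total}

def bootstrap_ci(x, m, y, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the indirect (mediated) effect."""
    rng = np.random.default_rng(seed)
    n = len(x)
    draws = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)      # resample subjects with replacement
        draws.append(mediation(x[idx], m[idx], y[idx])["indirect"])
    lo, hi = np.quantile(draws, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)

# Synthetic example: a risk allele count lowers a mediator, which affects the outcome.
rng = np.random.default_rng(0)
x = rng.integers(0, 3, size=300).astype(float)     # allele count 0/1/2
m = -0.5 * x + rng.standard_normal(300)            # mediator (e.g. a volume measure)
y = 0.8 * m - 0.3 * x + rng.standard_normal(300)   # outcome score
effects = mediation(x, m, y)
interval = bootstrap_ci(x, m, y, n_boot=200)
```

A partial mediation corresponds to an indirect effect whose bootstrap interval excludes zero while the direct effect remains non-zero. For linear models fitted on the same sample, the total effect equals the sum of the direct and indirect effects.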

Follow-up analyses: gene-set analyses

To validate the identified genetic loci, we conducted a gene-based analysis using MAGMA, which employs a multiple linear principal component regression model. MAGMA projects the multivariate LD matrix of SNPs within a gene to extract principal components that capture genetic variation. These components are then used as predictors in a linear regression framework to assess their association with the phenotype. An F-test is applied to compute P-values, determining the significance of the gene–phenotype relationship [51].
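The principal component regression idea behind the gene-based test can be sketched in NumPy as follows. This is an illustrative simplification and not MAGMA itself: the 99.9% variance cut-off, the function name, and the synthetic genotypes are assumptions, and the sketch returns only the F statistic and its degrees of freedom rather than a P-value.

```python
import numpy as np

def pc_regression_f(G, y, var_kept=0.999):
    """F statistic for regressing y on the leading principal components of G.

    G is an (n subjects x p SNPs) genotype matrix for one gene.
    """
    Gc = G - G.mean(axis=0)
    yc = y - y.mean()
    U, s, _ = np.linalg.svd(Gc, full_matrices=False)
    explained = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = min(int(np.searchsorted(explained, var_kept)) + 1, len(s))
    P = U[:, :k]                    # orthonormal principal component scores
    fitted = P @ (P.T @ yc)         # least-squares fit on the kept components
    ess = np.sum(fitted ** 2)       # explained sum of squares
    rss = np.sum((yc - fitted) ** 2)
    df1, df2 = k, len(y) - k - 1
    return float((ess / df1) / (rss / df2)), df1, df2

rng = np.random.default_rng(0)
G = rng.binomial(2, 0.3, size=(200, 5)).astype(float)   # synthetic allele counts
y = 2.0 * G[:, 0] + rng.standard_normal(200)            # phenotype driven by one SNP
F_stat, df1, df2 = pc_regression_f(G, y)
```

Working with orthogonal components sidesteps the collinearity that LD induces among neighboring SNPs, and a large F statistic relative to its degrees of freedom indicates that the gene as a whole is associated with the phenotype.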

Notably, the genes APOE ($P=9.90\times10^{-26}$) and APOC1 ($P=5.72\times10^{-18}$) showed the most significant associations with the diagnostic phenotype. APOE, which encodes apolipoprotein E and is involved in amyloid-specific pathways including amyloid trafficking and plaque clearance, is consistently associated with AD diagnosis. This underscores MA-CBxDP’s efficacy in identifying causal genetic variations.

Follow-up analyses: functional mapping

To validate and interpret the identified biomarkers, we employed the Functional Mapping and Annotation (FUMA) platform, which facilitates functional mapping, prioritization, annotation, and interpretation of results from genome-wide association studies (GWAS). Significant SNPs were identified using meta-analysis summary statistics with a significance threshold of $P<5\times10^{-8}$ and 1-Mb window independence, referencing the 1000 Genomes phase 3 dataset. Lead SNPs were then pinpointed from these significant SNPs, affirming their disease associations. Notably, the genetic loci of APOE and APOC1 showed substantial statistical significance, as illustrated in Figs 8 and 9b. These results strongly aligned with our MA-CBxDP findings. Specifically, rs429358, situated in APOE on chromosome 19, emerged as the most significant SNP as well as the causal genetic variation (highlighted in Figs 8 and 9a), reinforcing its pivotal role in AD. This consistency underscores the reliability and effectiveness of MA-CBxDP in identifying causal biomarkers.

Figure 8

Regional plot of the top lead SNP rs429358.

Figure 9

(a) Manhattan plot for the SNP-based test. (b) Manhattan plot for the gene-based test. The x-axis represents the SNP positions on the genome, while the y-axis represents the negative base-10 logarithm of the P-values. Higher values on the y-axis indicate stronger signals, suggesting significant associations.

Follow-up analyses: gene expression analyses

To explore the biological implications of the identified loci at the gene expression level, we utilized GENE2FUNC for gene expression analysis. We investigated gene expression patterns linked to the top 50 SNPs by annotating these SNPs to their respective genes. Using data from the GTEx database (version 8) covering 54 tissue types and BrainSpan RNA sequencing data across 29 developmental stages, we generated heat maps to visualize gene expression, with each map showing the average normalized expression value per label. In this way, we presented mRNA expression profiles of the prioritized genes associated with the top 50 SNPs across the 54 GTEx tissue types and across the developmental stages of the human brain. The expression levels of genes such as APOE and APOC1 are shown in Fig. 10b. These genes exhibited distinct expression patterns throughout different life stages, as reflected in the BrainSpan data in Fig. 10a. Notably, APOE consistently exhibited high expression levels across all lifespan stages, while APOC1 showed elevated expression during late prenatal and postnatal stages, potentially contributing to AD. These results highlight the efficacy of MA-CBxDP in identifying reliable genetic variations associated with various human brain tissues across different lifespan stages.

Figure 10

Heat maps illustrating the normalized gene expression values, obtained through zero mean normalization of log2-transformed expression, are presented for the prioritized genes associated with the 50 SNPs. The lower panel represents data from GTEx v8 RNAseq, while the upper panel showcases BrainSpan data.
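The normalization named in the caption (log2 transformation followed by zero-mean centering per gene) can be sketched as follows. The pseudocount of 1 and the helper name are assumptions made for illustration; the exact preprocessing applied by GENE2FUNC may differ.

```python
import math


def zero_mean_log2(expression, pseudocount=1.0):
    """Log2-transform one gene's expression values, then centre them at zero.

    expression: non-negative expression values of a single gene across labels
    (tissues or developmental stages). The pseudocount (an assumption) avoids
    taking the logarithm of zero.
    """
    logged = [math.log2(value + pseudocount) for value in expression]
    gene_mean = sum(logged) / len(logged)
    return [value - gene_mean for value in logged]


# Toy values: log2(0+1)=0, log2(1+1)=1, log2(3+1)=2, so the mean is 1.
centred = zero_mean_log2([0.0, 1.0, 3.0])
```

After centring, positive entries mark labels where the gene is expressed above its own average and negative entries mark labels below it, which is what the heat-map colours encode.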

Follow-up analyses for ADNI: phenome-wide association studies and enrichment analysis

To validate the phenotypic implications of the SNPs pinpointed by MA-CBxDP, a phenome-wide association analysis (pheWAS) was conducted using data from the publicly accessible GWAS Atlas32 (https://atlas.ctglab.nl). This investigation encompassed an extensive dataset comprising 4756 GWAS, which were potentially conducted without an explicit focus on the SNP or gene under scrutiny.

PheWAS: First, a comprehensive pheWAS was conducted on the lead SNP, rs429358, to explore its potential associations across a broad spectrum of phenotypes encompassing 28 domains. Fig. 11a reveals a significant linkage between the rs429358 locus and neurological phenotypes. In particular, significant relationships were established between rs429358 and neurological characteristics such as AD or dementia with paternal history. In addition, medication for cholesterol, blood pressure, and diabetes was significantly associated with rs429358. Understanding such modifiable risk factors is paramount for designing personalized therapeutic interventions. Further, Fig. 11b summarizes the pheWAS results for the lead SNP rs56131196. Interestingly, illnesses of the father (AD/dementia), paternal history of AD, low-density lipoprotein cholesterol, and total cholesterol were significantly associated with rs56131196, providing valuable insights into its multi-functionality.

Figure 11

(a) The PheWAS analysis yielded results for SNP rs429358 (APOE). (b) The PheWAS analysis yielded results for SNP rs56131196 (APOC1).

In summary, the combined evidence from the pheWAS highlights the successful identification of the genetic underpinnings of neurological phenotypes through our MA-CBxDP approach.

Enrichment analysis: The top genes identified by our algorithm were examined for their biological significance. Figs 12a and 12b display the Metascape and GO enrichment results for the top genes, respectively. Notably, most pathways involving these top genes are also associated with AD. For each GO term, the exact name and the corresponding GO ID as defined in the database are included. The top GO terms are “neuron recognition,” “heart development,” “neuron projection development,” “cell junction organization,” and “regulation of neuron projection development.”

Figure 12

(a) Summary of enrichment analysis in GO biological processes. (b) Summary of enrichment analysis in DisGeNET.

In addition, enrichment analysis was conducted on DisGeNET ontology categories using Metascape. DisGeNET aggregates data from curated repositories, GWAS catalogs, animal models, and scientific literature to elucidate the genetic basis of human diseases. The analysis identified the top 20 significant biological processes, including “Alzheimer’s disease, focal onset,” “acute confusional senile dementia,” “behavioral variant of frontotemporal dementia,” and “memory performance.” These results offer insights into the biological mechanisms underlying AD and underscore the therapeutic potential of the identified causal gene biomarkers.
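Enrichment tools of this kind typically score each term with a one-sided hypergeometric test on the overlap between the input gene list and the term's gene set. The sketch below shows that calculation only; whether it matches the exact settings used for the analyses above is an assumption, and the function name and toy counts are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(n_universe, n_term, n_selected, n_overlap):
    """One-sided hypergeometric P-value, P(X >= n_overlap).

    n_universe: number of background genes
    n_term:     background genes annotated to the term
    n_selected: size of the input gene list
    n_overlap:  input genes annotated to the term
    """
    total = comb(n_universe, n_selected)
    upper = min(n_term, n_selected)
    tail = sum(
        comb(n_term, k) * comb(n_universe - n_term, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Toy counts: 10 background genes, 4 in the term, 3 selected, 2 overlapping.
# P = (C(4,2)*C(6,1) + C(4,3)*C(6,0)) / C(10,3) = 40 / 120.
p_toy = hypergeom_enrichment_p(10, 4, 3, 2)
```

In practice the resulting P-values would be corrected for the number of terms tested before a "top 20" list is drawn up.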

Application to breast cancer for generalization

Datasets

To evaluate the generalizability of our method, we applied MA-CBxDP to a multi-omics BC dataset [52] to predict treatment response and gain molecular insights. This study utilized clinical environment features (age, ER.status, HER2.status, Size.at.diagnosis, LN.at.diagnosis, etc.), DNA data, tumor microenvironment, and pathway functional profiles of early and locally advanced BC patients, obtained from the TransNEO cohort at Cambridge University Hospitals NHS Foundation Trust. To ensure data consistency and reliability, the clinical environment features, DNA mutation, tumor microenvironment, and molecular pathway data were normalized to z-scores, eliminating scale-related biases. The goal was to predict residual cancer burden (RCB) scores, which quantify residual disease after neoadjuvant (pre-surgical) therapy, using TiME and pathway activity as endophenotypes.
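The z-score normalization mentioned above can be sketched per feature column as follows. This is a generic illustration: the use of the population standard deviation, the handling of constant columns, and the helper name are assumptions rather than details taken from the study.

```python
from statistics import mean, pstdev


def zscore(column):
    """Standardize one feature column to zero mean and unit variance.

    Uses the population standard deviation (an assumption). A constant
    column has no scale, so it is mapped to all zeros.
    """
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        return [0.0 for _ in column]
    return [(value - mu) / sigma for value in column]


# Toy feature column with made-up values.
standardized = zscore([1.0, 2.0, 3.0])
```

Applying the same transformation to every modality puts clinical, mutation, microenvironment, and pathway features on a common scale before they enter the model.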

Multimodal correlation and predicting clinical scores: Fig. 13 depicts the mean and standard deviation of the CCCs for genotype–phenotype correlation and of the CCCs and RMSEs for predicting clinical scores. As expected, MA-CBxDP exhibited the highest CCCs for genotype–phenotype correlation. Furthermore, the mutual-assistance frameworks outperformed the baselines in terms of CCCs and RMSEs when predicting clinical scores. We attribute this to the substantial enhancement achieved by integrating causality into prediction.

Figure 13

Comparison of testing CCCs and RMSEs (mean ± std.) on BC dataset. (a) Genotype–phenotype correlation (CCC) and (b) predicting clinical scores (CCC and RMSE).
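The quantities summarized in this figure can be computed along the following lines. The sketch assumes that a CCC is a correlation coefficient between two score vectors and uses the Pearson correlation as a stand-in; the paper's own definition of CCC takes precedence. All function names and toy numbers are hypothetical.

```python
import math
from statistics import mean, stdev


def pearson(a, b):
    """Pearson correlation between two equally long score vectors."""
    mean_a, mean_b = mean(a), mean(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted scores."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def mean_and_std(per_fold_values):
    """Summarize a metric over test folds as (mean, sample std)."""
    return mean(per_fold_values), stdev(per_fold_values)


# Toy numbers, made up for illustration only.
r_toy = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
rmse_toy = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
summary_toy = mean_and_std([0.5, 0.7])
```

Reporting the mean together with the standard deviation across folds, as the figure does, shows both the accuracy and the stability of each method.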

Biological marker explanation

Main and interaction effects: An investigation of the biomarkers confirmed that most of the identified features, including TP53, PIK3CA, COL12A1, and RYR1, were indeed relevant to BC. Notably, TP53 stands out as a critical variable for BC due to its pivotal role in cell cycle regulation and apoptosis. Beyond main effects, MA-CBxDP revealed noteworthy G×E interactions, including (RYR1, age), (RYR1, LN at diagnosis), (PIK3CA, HER2 status), (RYR1, ER status), and (MT-ND5, Histology). Interestingly, RYR1 and age were each important contributors to BC, and their interaction was also plausible. Further investigation of these interactions could reveal novel clues to BC. As shown in Fig. 13, the best performance in genotype–phenotype correlation and clinical prediction was achieved when both the main and interaction effect identification modules were included, highlighting the necessity of incorporating both aspects. To further demonstrate the efficacy of MA-CBxDP, post-analyses were conducted to examine the biological implications.
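A common way to represent a G×E interaction in a regression-style model is as the element-wise product of a genetic feature and an environmental feature. The sketch below builds such product terms for every gene-environment pair. It is a generic illustration, not the causal feature interaction module of MA-CBxDP, and the function and feature names are hypothetical.

```python
from itertools import product


def interaction_features(genes, envs):
    """Build one product column per (gene, environment) pair.

    genes, envs: dicts mapping a feature name to its per-subject values.
    Returns a dict keyed by (gene_name, env_name).
    """
    return {
        (gene_name, env_name): [g * e for g, e in zip(gene_values, env_values)]
        for (gene_name, gene_values), (env_name, env_values) in product(
            genes.items(), envs.items()
        )
    }


# Toy data for three subjects, with made-up values.
toy_genes = {"gene_1": [0, 1, 2], "gene_2": [1, 1, 0]}
toy_envs = {"env_1": [0.5, -1.0, 2.0]}
toy_interactions = interaction_features(toy_genes, toy_envs)
```

Because the number of pairs grows with the product of the two feature counts, this construction also shows why interaction screening becomes expensive in high dimensions.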

Follow-up analyses for breast cancer: phenome-wide association studies

An investigation into the biomarkers identified by MA-CBxDP indicated that most features, including TP53, PIK3CA, COL12A1, and RYR1, were indeed relevant to BC. Notably, TP53 stood out as a critical variable for BC due to its pivotal role in cell cycle regulation and apoptosis. Beyond main effects, MA-CBxDP revealed noteworthy G×E interactions, including (RYR1, age), (RYR1, LN at diagnosis), (PIK3CA, HER2 status), (RYR1, ER status), and (MT-ND5, Histology). Interestingly, RYR1 was detected in both main and interaction effects. Together, these findings provided promising biomarkers for BC diagnosis and prediction.

This pheWAS used 17 361 dichotomous phenotypes and 1419 quantitative phenotypes from the AstraZeneca PheWAS Portal database to perform gene-level pheWAS. PheWAS results can be interpreted as the association of genetically determined protein expression with specific diseases or traits. Figs 14 and 15 illustrate notable associations of TP53 and RYR1 with specific phenotypes. Specifically, TP53 was associated with neoplasm traits such as malignant neoplasms, stated or presumed to be primary, of lymphoid, hematopoietic, and related tissue. We also found significant associations between RYR1 and traits related to the breast and to malignant neoplasms, all of which are linked to BC. This emphasized the effectiveness of MA-CBxDP in pinpointing causal genetic variations associated with BC.

Figure 14

Traits significantly associated with TP53 using PheWAS portal.

Figure 15

Traits significantly associated with RYR1 using PheWAS portal.

Ablation study

We ran ablation experiments to investigate the impact of each component on clinical prediction (PCCC) and correlation (MCCC).

Effect of sparse variable regularizer module: We first considered LSVR. Table 1 displays the testing PCCC and MCCC for all methods. Remarkably, the absence of the LSVR module led to suboptimal performance, emphasizing the necessity of integrating LSVR into predictive workflows.

Effect of feature interaction module: Then, we investigated the effect of LFIL. Table 1 shows the superior performance of MA-CBxDP compared to the baselines, which can help elucidate missing heritability. Furthermore, an additional experiment demonstrated a significant time reduction achieved through our divide-and-conquer strategy, which decreased time consumption by more than a hundredfold (from 6750 to 61 s) while maintaining comparable performance. This substantial improvement in time efficiency enhances the practical accessibility and utility of our strategy.
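As a quick check of the reported figures, assuming the two run times were measured under comparable conditions:

$$\frac{6750\ \mathrm{s}}{61\ \mathrm{s}} \approx 110.7,$$

that is, a speed-up of slightly more than two orders of magnitude, consistent with the hundredfold reduction stated above.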

Effect of causal decorrelation module: In Table 1, our approach outperformed the other baselines, showing superior results in terms of average PCCC and MCCC with the smallest standard error. This highlights the advantages of integrating the causal decorrelation module into predictive modeling: it not only enhances the accuracy and stability of correlation and prediction but also underscores the significance of eliminating spurious correlations and identifying causal biomarkers.

Effect of cooperative prediction module: In Table 1, we also investigated the impact of LCP. It was evident that the absence of LCP resulted in inferior performance, highlighting the importance of incorporating LCP into predictive workflows to bolster MA-CBxDP’s predictive capabilities.

Discussion

We propose an MA learning framework, an efficient approach for jointly predicting disease and identifying disease-related causal biomarkers. To our knowledge, this is the first attempt to simultaneously address disease prediction and pathogenesis identification. Experimental results demonstrate that MA-CBxDP significantly outperforms state-of-the-art methods. Moreover, jointly modeling disease progression and identifying pathogenesis factors is more effective than tackling these tasks independently. MA-CBxDP notably improves the prediction and identification of pathogenesis biomarkers, offering valuable insights into chronic disorders.

To validate the pathophysiological significance of the identified multimodal biomarkers, we conducted follow-up analyses, including ANOVA, gene-set analysis, functional mapping, gene expression analysis, phenome-wide association studies (PheWAS), causal mediation analysis, and enrichment analysis. These analyses confirmed the relevance of the identified risk factors. For instance, ANOVA revealed significant associations between several biomarkers and diagnostic status, while MAGMA analysis highlighted their biological significance at the gene level. Both analyses validated the effectiveness of our approach. Additionally, PheWAS explored associations between potential therapeutic targets and clinical characteristics, providing insights into their multifunctionality and mechanistic roles, which could inform future research and treatment strategies. Gene enrichment analysis further revealed the functional characteristics and biological relevance of these targets, enhancing our understanding of their roles in AD pathogenesis and treatment. In contrast, comparison methods produced numerous irrelevant signals, raising concerns for subsequent analyses. Overall, the results demonstrate that MA-CBxDP accurately and comprehensively identifies disease risk factors.

Our study had several limitations. To achieve a more comprehensive understanding of AD pathophysiology, incorporating additional imaging modalities could offer a broader perspective on disease identification. In addition, prospective randomized clinical trials are necessary to evaluate the impact of integrating pathologies into clinical workflows.

Conclusion

This study proposed an interpretable framework for causal G×E biomarker discovery and stable disease prediction in the spirit of MA learning, which can be applied to the diagnosis and prognosis of AD and other chronic diseases. We also extended the current task to a chromosome-wide setting and produced strong baseline results. Experimental results demonstrated that MA-CBxDP established new state-of-the-art results for AD and cancer while maintaining exceptional interpretability, verifying its flexibility and versatility in practical applications. To translate these findings into potential clinical applications for routine diagnostics and to validate the clinical significance of the identified biomarkers, we performed follow-up analyses and further examined the predictive and diagnostic performance using the top selected features, which could provide novel and reliable insights into the pathogenic mechanisms of chronic disorders.

Key Points
  • We propose a simple but efficient disease prediction approach incorporating G×E interactions, i.e. MA-CBxDP, based on MA learning, which simultaneously benefits disease prediction and biomarker identification, representing the first such attempt in the brain imaging genetics field.

  • We integrate the sparse variable selection and causal decorrelation modules into the disease prediction task. The results demonstrate that MA-CBxDP attains superior performance in terms of causal biomarker identification and disease prediction, establishing a new state-of-the-art.

Acknowledgments

Data collection and sharing for this project was funded by the Alzheimer’s Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904) and DOD ADNI (Department of Defense award number W81XWH-12-2-0012). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: AbbVie, Alzheimer’s Association; Alzheimer’s Drug Discovery Foundation; Araclon Biotech; BioClinica, Inc.; Biogen; Bristol-Myers Squibb Company; CereSpir, Inc.; Cogstate; Eisai Inc.; Elan Pharmaceuticals, Inc.; Eli Lilly and Company; EuroImmun; F. Hoffmann-La Roche Ltd and its affiliated company Genentech, Inc.; Fujirebio; GE Healthcare; IXICO Ltd; Janssen Alzheimer Immunotherapy Research & Development, LLC; Johnson & Johnson Pharmaceutical Research & Development LLC; Lumosity; Lundbeck; Merck & Co., Inc.; Meso Scale Diagnostics, LLC; NeuroRx Research; Neurotrack Technologies; Novartis Pharmaceuticals Corporation; Pfizer Inc.; Piramal Imaging; Servier; Takeda Pharmaceutical Company; and Transition Therapeutics. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health (www.fnih.org). The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer’s Therapeutic Research Institute at the University of Southern California. ADNI data are disseminated by the Laboratory for Neuro Imaging at the University of Southern California.

Author contributions

Jin Zhang (Conceptualization, Methodology, Software, Validation, Writing—original draft), Yan Yang (Software, Visualization, Investigation), Muheng Shang (Validation, Writing—review & editing), Lei Guo (Conceptualization, Data curation, Investigation), Daoqiang Zhang (Conceptualization, Supervision, Writing—review & editing), and Lei Du (Supervision, Conceptualization, Resources, Writing—review and editing).

Conflict of interest: None declared.

Funding

This work was supported in part by the MOST 2030 Brain Project Grant (No. 2022ZD0208500); National Natural Science Foundation of China (Nos 62136004, 62373306); Innovation Foundation for Doctor Dissertation (No. CX2023062); and Fundamental Research Funds for the Central Universities at Northwestern Polytechnical University.

Data availability

Data used in the preparation of this article were obtained from the ADNI database (adni.loni.usc.edu). Code can be obtained from https://github.com/ZJ-Techie/MA-CBxDP.

References

1. Sims R, Hill M, Williams J. The multiplex model of the genetics of Alzheimer’s disease. Nat Neurosci 2020;23:311–22.

2. Shen L, Thompson PM. Brain imaging genomics: Integrated analysis and machine learning. Proc IEEE 2019;108:125–62.

3. Liu F, Jiayuan X, Guo L. et al. Environmental neuroscience linking exposome to brain structure and function underlying cognition and behavior. Mol Psychiatry 2023;28:17–27.

4. Haotian W, Eckhardt CM, Baccarelli AA. Molecular mechanisms of environmental exposures and human disease. Nat Rev Genet 2023;24:332–44.

5. Westerman KE, Sofer T. Many roads to a gene-environment interaction. Am J Hum Genet 2024;111:626–35.

6. Vogel JW, Corriveau-Lecavalier N, Franzmeier N. et al. Connectome-based modelling of neurodegenerative diseases: towards precision medicine and mechanistic insight. Nat Rev Neurosci 2023;24:620–39.

7. Graham S, Quoc Dang V, Jahanifar M. et al. One model is all you need: multi-task learning enables simultaneous histology image segmentation and classification. Med Image Anal 2023;83:102685.

8. Lambert J-C, Ramirez A, Grenier-Boley B. et al. Step by step: towards a better understanding of the genetic architecture of Alzheimer’s disease. Mol Psychiatry 2023;28:2716–27.

9. Chung J, Das A, Sun X. et al. Genome-wide association and multi-omics studies identify MGMT as a novel risk gene for Alzheimer’s disease among women. Alzheimers Dement 2023;19:896–908.

10. Zhang J, Wang H, Zhao Y. et al. Lei Du, and Alzheimer’s Disease Neuroimaging Initiative. Identification of multimodal brain imaging association via a parameter decomposition based sparse multi-view canonical correlation analysis method. BMC Bioinform 2022;23:128.

11. Wei W-H, Hemani G, Haley CS. Detecting epistasis in human complex traits. Nat Rev Genet 2014;15:722–33.

12. Lin D, Calhoun VD, Wang Y-P. Correspondence between fMRI and SNP data by group sparse canonical correlation analysis. Med Image Anal 2014;18:891–902.

13. Duvenaud DK, Nickisch H, Rasmussen C. Additive Gaussian processes. Adv Neural Inform Process Syst 2011;24:1–9.

14. Lanckriet GRG, Cristianini N, Bartlett P. et al. Learning the kernel matrix with semidefinite programming. J Mach Learn Res 2004;5:27–72.

15. Donghuan L, Popuri K, Ding GW. et al. Multiscale deep neural network based analysis of FDG-PET images for the early diagnosis of Alzheimer’s disease. Med Image Anal 2018;46:26–34.

16. Rodosthenous T, Shahrezaei V, Evangelou M. Integrating multi-omics data through sparse canonical correlation analysis for the prediction of complex traits: a comparison study. Bioinformatics 2020;36:4616–25.

17. Witten DM, Tibshirani RJ. Extensions of sparse canonical correlation analysis with applications to genomic data. Stat Appl Genet Mol Biol 2009;8:1–27.

18. Wenxing H, Lin D, Cao S. et al. Adaptive sparse multiple canonical correlation analysis with application to imaging (epi) genomics study of schizophrenia. IEEE Trans Biomed Eng 2017;65:390–9.

19. Eichler EE, Flint J, Gibson G. et al. Missing heritability and strategies for finding the underlying causes of complex disease. Nat Rev Genet 2010;11:446–50.

20. Manolio TA, Collins FS, Cox NJ. et al. Finding the missing heritability of complex diseases. Nature 2009;461:747–53.

21. Lei D, Zhang J, Liu F. et al. Identifying associations among genomic, proteomic and imaging biomarkers via adaptive sparse multi-view canonical correlation analysis. Med Image Anal 2021;70:102003.

22. Lei D, Zhang J, Zhao Y. et al. Inmtscca: an integrated multi-task sparse canonical correlation analysis for multi-omic brain imaging genetics. Genom Proteom Bioinform 21:396–413.

23. Wang Y, Zhang J, Li Y. et al. Preventing prefrontal dysfunction by tDCS modulates stress-induced creativity impairment in women: an fNIRS study. Cereb Cortex 2023;33:10528–45.

24. Lei D, Liu K, Yao X. et al. Pattern discovery in brain imaging genetics via SCCA modeling with a generic non-convex penalty. Sci Rep 2017;7:14052.

25. de Los, Campos AG, Funkhouser S. et al. Fine mapping and accurate prediction of complex traits using Bayesian variable selection models applied to biobank-size data. Eur J Hum Genet 2023;31:313–20.

26. Zhou F, Xi L, Ren J. et al. Sparse group variable selection for gene–environment interactions in the longitudinal study. Genet Epidemiol 2022;46:317–40.

27. Ren M, Zhang S, Ma S. et al. Gene–environment interaction identification via penalized robust divergence. Biom J 2022;64:461–80.

28. Wang J, Liang H, Zhang Q. et al. Replicability in cancer omics data analysis: measures and empirical explorations. Brief Bioinform 2022;23:bbac304.

29. Sheng B, Jun Li F, Xiao QL. et al. Discriminative multi-view subspace feature learning for action recognition. IEEE Trans Circuits Syst Video Technol 2019;30:4591–600.

30. Zhao S, Cui Y, Huang L. et al. Supervised brain network learning based on deep recurrent neural networks. IEEE Access 2020;8:69967–78.

31. Zhang J, Shang M, Yang Y. et al. Lei Du, and Alzheimer’s Disease Neuroimaging Initiative. Disease progression prediction incorporating genotype-environment interactions: a longitudinal neurodegenerative disorder study. MICCAI 2024;152–162:1–9.

32. Zhang J, Ma Z, Yang Y. et al. Lei Du, and Alzheimer’s Disease Neuroimaging Initiative. Modeling genotype–protein interaction and correlation for Alzheimer’s disease: a multi-omics imaging genetics study. Brief Bioinform 2024;25:bbae038.

33. Bi X-a, Liu Y, Xie Y. et al. Morbigenous brain region and gene detection with a genetically evolved random neural network cluster approach in late mild cognitive impairment. Bioinformatics 2020;36:2561–8.

34. Cui P, Athey S. Stable learning establishes some common ground between causal inference and machine learning. Nat Mach Intell 2022;4:110–5.

35. Tong Tong W, Chen YF, Hastie T. et al. Genome-wide association analysis by lasso penalized logistic regression. Bioinformatics 2009;25:714–21.

36. Jacob L, Obozinski G, Vert J-P. Group lasso with overlap and graph lasso. In: Danyluk A, Bottou L, Littman M. (eds), Proceedings of the 26th Annual International Conference on Machine Learning. Montreal, Quebec, Canada. New York, NY, USA: Association for Computing Machinery (ACM), pp. 433–40, 2009.

37. Friedman J, Hastie T, Tibshirani R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics 2008;9:432–41.

38. Ivanoff S, Picard F, Rivoirard V. Adaptive lasso and group-lasso for functional Poisson regression. J Mach Learn Res 2016;17:1903–48.

39. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc 2001;96:1348–60.

40. Jiang H, Zheng W, Dong Y. Sparse and robust estimation with ridge minimax concave penalty. Inform Sci 2021;571:154–74.

41. Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika 1983;70:41–55.

42. Fong C, Hazlett C, Imai K. Covariate balancing propensity score for a continuous treatment: application to the efficacy of political advertisements. Ann Appl Stat 2018;12:156–77.

43. Li S, Yun F. Matching on balanced nonlinear representations for treatment effects estimation. Adv Neural Inform Process Syst 2017;30:10638.

44. Casale FP, Rakitsch B, Lippert C. et al. Efficient set tests for the genetic analysis of correlated traits. Nat Methods 2015;12:755–8.

45. Kuang K, Cui P, Li B. et al. Estimating treatment effect in the wild via differentiated confounder balancing. In: Pei J, Wang H, Wang W. (eds), Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, USA: Association for Computing Machinery (ACM), pp. 265–74, 2017.

46. Solovieff N, Cotsapas C, Lee PH. et al. Pleiotropy in complex traits: challenges and strategies. Nat Rev Genet 2013;14:483–95.

47. Liang W, Zhang Q, Ma S. Hierarchical false discovery rate control for high-dimensional survival analysis with interactions. Comput Stat Data Anal 2024;192:107906.

48. Gao L, Cui Z, Shen L. et al. Shared genetic etiology between type 2 diabetes and Alzheimer’s disease identified by bioinformatics analysis. J Alzheimers Dis 2016;50:13–7.

49. Raber J, Huang Y, Wesson J. et al. ApoE genotype accounts for the vast majority of AD risk and AD pathology. Neurobiol Aging 2004;25:641–50.

50. Liu C-C, Kanekiyo T, Huaxi X. et al. Apolipoprotein E and Alzheimer disease: risk, mechanisms and therapy. Nat Rev Neurol 2013;9:106–18.

51. de Leeuw, Mooij JM, Heskes T. et al. MAGMA: generalized gene-set analysis of GWAS data. PLoS Comput Biol 2015;11:e1004219.

52. Sammut S-J, Crispin-Ortuzar M, Chin S-F. et al. Multi-omic machine learning predictor of breast cancer therapy response. Nature 2022;601:623–9.

Author notes

Data used in preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at: http://adni.loni.usc.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com