Bayesian integrative analysis and prediction with application to atherosclerosis cardiovascular disease

Table 1.

Simulation results for Scenario One (Settings 1–3): variable selection and prediction performances. FNR1: false negative rate for |$\boldsymbol{X}^1$|; similar for FNR2. FPR1: false positive rate for |$\boldsymbol{X}^1$|; similar for FPR2. F11: F-measure for |$\boldsymbol{X}^1$|; similar for F12. MSE: mean square error.

Method	Setting	FNR1	FNR2	FPR1	FPR2	F11	F12	MSE
BIPnet	S1	0.00 (0.00)	0.00 (0.00)	0.02 (0.02)	0.00 (0.00)	100.00 (0.00)	100.00 (0.00)	2.09 (0.04)
BIP	S1	0.00 (0.00)	0.00 (0.00)	0.14 (0.06)	0.16 (0.04)	99.73 (0.12)	99.68 (0.08)	2.16 (0.04)
SELPCCA	S1	0.00 (0.00)	0.00 (0.00)	0.40 (0.09)	0.50 (0.12)	99.21 (0.17)	99.02 (0.23)	2.11 (0.04)
CCAReg	S1	90.25 (1.73)	90.40 (1.87)	1.81 (3.32)	1.71 (3.35)	15.34 (1.87)	14.98 (2.01)	2.01 (0.05)
FusedCCA	S1	0.00 (0.00)	0.00 (0.00)	7.88 (0.76)	10.40 (0.78)	86.69 (1.17)	83.05 (1.08)	2.14 (0.04)
BIPnet	S2	0.00 (0.00)	0.00 (0.00)	2.81 (0.28)	3.07 (0.34)	74.73 (1.90)	73.34 (2.26)	2.24 (0.05)
BIP	S2	0.00 (0.00)	0.00 (0.00)	1.82 (0.34)	1.99 (0.36)	83.12 (2.82)	81.91 (2.92)	2.27 (0.04)
SELPCCA	S2	51.32 (9.50)	42.89 (9.42)	0.10 (0.10)	0.00 (0.00)	54.36 (8.08)	63.42 (8.15)	2.16 (0.04)
CCAReg	S2	77.63 (1.87)	78.68 (1.85)	3.12 (0.68)	2.71 (0.67)	22.80 (1.15)	23.49 (1.43)	2.01 (0.05)
FusedCCA	S2	0.00 (0.00)	0.00 (0.00)	19.51 (2.22)	19.64 (2.53)	30.39 (1.03)	30.55 (1.07)	2.58 (0.06)
BIPnet	S3	61.05 (2.82)	59.75 (3.12)	0.05 (0.03)	0.00 (0.00)	54.91 (2.78)	56.12 (3.03)	2.35 (0.21)
BIP	S3	63.00 (2.65)	60.35 (2.48)	0.04 (0.03)	0.06 (0.03)	52.89 (2.90)	55.79 (2.63)	2.64 (0.22)
SELPCCA	S3	93.40 (1.55)	93.30 (1.85)	0.06 (0.06)	0.00 (0.00)	11.60 (2.52)	11.57 (3.00)	2.72 (0.13)
CCAReg	S3	88.85 (1.81)	87.95 (1.89)	1.44 (0.52)	1.40 (0.50)	19.83 (2.30)	21.51 (2.44)	2.26 (0.14)
FusedCCA	S3	0.00 (0.00)	0.00 (0.00)	3.47 (0.81)	2.31 (0.55)	93.89 (1.37)	95.77 (0.97)	2.72 (0.11)

We next compare the proposed methods to other existing methods. For the other methods, we estimate both the first and the first four canonical variates. We report the results from the first canonical variates here, and the results for the first four canonical variates in Section C, Table C.1 of the Supplementary materials available at Biostatistics online. For feature selection, a variable is deemed selected if it has at least one nonzero coefficient in any of the four estimated components. We observed that, for SELPCCA, CCAReg, and FusedCCA in general, the first canonical variates from each data type result in lower false negatives, lower false positives, higher harmonic means (F-measures), and competitive MSEs for Settings 1, 2, 4, and 5 compared to the results from the first four canonical variates (see Section C of the Supplementary materials available at Biostatistics online). In Setting 3, where signal variables do not overlap on the four true latent components, we expect the first four canonical variates to have better feature selection and prediction results, and that is what we observe in Table 1. Across Settings 1, 2, 4, and 5, BIP and BIPnet had lower false negatives, lower or comparable false positives, higher harmonic means, and competitive MSEs when compared to two-step methods (association followed by prediction). The feature selection performance of the proposed methods was also better than that of CCAReg, a joint association and prediction method. CCAReg had lower MSEs in general than the proposed and the other methods. BIPnet and BIP had higher false negatives in Setting 3 than in the other Settings, although still lower than those of CCAReg and SELPCCA.

We performed additional simulation studies to investigate the robustness of our BIPnet model to misspecification of the network information, as biological networks are often noisy (see Section F.2 of the Supplementary materials available at Biostatistics online). Perturbing the true network has a slight negative impact on both selection and prediction performance. Nevertheless, prediction and variable selection estimates obtained with partial network information are sometimes better than those from methods that do not use any network information.
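The selection metrics discussed above can be computed from a true support and an estimated support. The following is a minimal sketch, not the authors' code: it assumes the F-measure is the usual harmonic mean of precision and recall and that rates are expressed as percentages. The function names and the toy loadings are hypothetical.

```python
def selection_metrics(truth, selected):
    """Return (FNR, FPR, F-measure) in percent for one data type.

    truth[j]    is True if variable j is a real signal.
    selected[j] is True if the method selected variable j.
    """
    tp = sum(t and s for t, s in zip(truth, selected))
    fn = sum(t and not s for t, s in zip(truth, selected))
    fp = sum((not t) and s for t, s in zip(truth, selected))
    tn = sum((not t) and (not s) for t, s in zip(truth, selected))

    fnr = 100.0 * fn / (fn + tp) if (fn + tp) else 0.0
    fpr = 100.0 * fp / (fp + tn) if (fp + tn) else 0.0
    # F-measure: harmonic mean of precision and recall, 2TP / (2TP + FP + FN).
    denom = 2 * tp + fp + fn
    f_measure = 100.0 * 2 * tp / denom if denom else 0.0
    return fnr, fpr, f_measure


def selected_from_loadings(loadings):
    """A variable counts as selected if any component has a nonzero coefficient.

    loadings[j] holds the coefficients of variable j across the components.
    """
    return [any(c != 0 for c in row) for row in loadings]


# Toy example (hypothetical numbers): 6 variables, the first 3 are signals.
truth = [True, True, True, False, False, False]
loadings = [[0.4, 0.0], [0.0, 0.2], [0.0, 0.0],
            [0.0, 0.1], [0.0, 0.0], [0.0, 0.0]]
selected = selected_from_loadings(loadings)
fnr, fpr, f_measure = selection_metrics(truth, selected)
```

In this toy case one signal is missed and one noise variable is selected, so the false negative and false positive rates are both one in three.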

These simulation results suggest that incorporating group information in joint association and prediction methods results in better performance. Although the performance of the proposed methods is superior to that of existing methods in most Settings, our findings suggest that the proposed methods tend to be better at selecting most of the true signals in settings where the latent components overlap with regard to important variables. In nonoverlapping settings, our methods tend to select only some of the true signals (resulting in higher false negatives). However, the proposed methods tend to have lower false positives, which suggests that they are better at not selecting noise variables. This feature is especially appealing for high-dimensional data problems, where there tend to be more noise variables.

5. Application to atherosclerosis cardiovascular disease

We apply the proposed methods to analyze gene expression, genetics, and clinical data from the PHI study. The main goals of our analyses are to: (i) identify genetic variants, genes, and gene pathways that are potentially predictive of 10-year atherosclerotic cardiovascular disease (ASCVD) and (ii) illustrate the use of the shared components or scores in discriminating subjects at high- vs low-risk for developing ASCVD in 10 years. There were 340 patients with gene expression and SNP data, as well as clinical covariates to calculate their ASCVD score. The following clinical covariates were added as a third data set: age, gender, BMI, systolic blood pressure, low-density lipoprotein (LDL), and triglycerides. The gene expression and SNP data were each standardized to have mean zero and standard deviation one for each feature. Details for data preprocessing and filtering are provided in Section E.1 of the Supplementary materials available at Biostatistics online.
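The per-feature standardization described above can be sketched as follows. This is an illustration only: it assumes the sample standard deviation (divisor |$n-1$|), which the text does not specify, and the function name and toy matrix are hypothetical.

```python
import statistics


def standardize_columns(rows):
    """Center each column to mean 0 and scale it to standard deviation 1."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    sds = [statistics.stdev(col) for col in columns]  # sample SD (n - 1), an assumption
    return [
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, sds)]
        for row in rows
    ]


# Toy matrix (hypothetical): 3 subjects (rows) by 2 features (columns).
expr = [[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]]
z = standardize_columns(expr)
```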

To obtain the group structure of genes, we performed a network analysis using Ingenuity Pathway Analysis (IPA), a software program that analyzes gene expression patterns using a built-in database derived from the scientific literature (IPA Ingenuity website, www.ingenuity.com). By mapping our set of genes to the IPA gene set, we identified |$p_1=561$| genes within |$K_1=25$| gene networks. To obtain the group structure of the SNPs (|$p_2=413$| SNPs), we identified genes near the SNPs on the genome using the R package biomaRt. We found |$K_2=31$| genes in total, which represent the groups for the SNP data. The list of groups for both data types is shown in Section C of the Supplementary materials available at Biostatistics online.

We divided each data type into training (|$n=272$|) and testing (|$n=68$|) sets, ran the analyses on the training data to estimate the latent scores (shared components) and loading matrices for each data type, and predicted the test outcome using the testing data. For the other methods, the training data were used to identify the optimal tuning parameters and to estimate the loading matrices. We then computed the training and test scores using the combined scores from both the gene expression and SNP data. We fit a linear regression model using the scores from the training data, predicted the test outcome using the test scores, and computed the test MSE.
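The evaluation step for the competing methods (regress the outcome on training scores, predict from test scores, compute the test MSE) can be sketched as below. This is a hedged illustration rather than the authors' pipeline: the scores are simulated stand-ins instead of estimated canonical variates, the coefficients and noise level are arbitrary, and all function names are hypothetical. Only the 272/68 split sizes come from the text.

```python
import random


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_ols(scores, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    design = [[1.0] + list(row) for row in scores]
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(p)]
    return solve(xtx, xty)


def predict(beta, scores):
    """Fitted values from an intercept-first coefficient vector."""
    return [beta[0] + sum(b * s for b, s in zip(beta[1:], row)) for row in scores]


def mse(y, y_hat):
    """Mean squared error between observed and predicted outcomes."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)


# Synthetic stand-in data (hypothetical): 340 subjects, 2 combined scores.
rng = random.Random(0)
n = 340
scores = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(n)]
y = [1.0 + 2.0 * s[0] - 0.5 * s[1] + rng.gauss(0, 0.1) for s in scores]

# Random split into 272 training and 68 test subjects, as in the text.
idx = list(range(n))
rng.shuffle(idx)
train, test = idx[:272], idx[272:]

beta = fit_ols([scores[i] for i in train], [y[i] for i in train])
test_mse = mse([y[i] for i in test], predict(beta, [scores[i] for i in test]))
```

In the actual analysis the test scores would be computed from loadings estimated on the training data; that step is omitted here.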

To reduce the variability in our findings due to the random split of the data, we also obtained twenty random splits, repeated the above process, and computed average test MSE and variables selected. We further attempted to assess the stability of the variables selected by considering variables that were selected at least 12 times (60%) out of the twenty random splits.
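The stability rule above (keep variables selected in at least 12 of the 20 random splits) amounts to a simple frequency count. A minimal sketch follows; the function name and the variable labels are hypothetical.

```python
from collections import Counter


def stable_variables(selections, min_count):
    """Variables selected in at least `min_count` of the splits."""
    # set() guards against a variable being listed twice within one split.
    counts = Counter(v for selected in selections for v in set(selected))
    return sorted(v for v, c in counts.items() if c >= min_count)


# Hypothetical selections from 20 random splits.
selections = [["geneA", "geneB"]] * 12 + [["geneA", "geneC"]] * 8
stable = stable_variables(selections, min_count=12)
```

Here `geneA` appears in all 20 splits and `geneB` in 12, so both pass the 60% threshold, while `geneC` (8 splits) does not.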

Prediction error and genes and genetic variants selected: To illustrate the benefit of including covariates, we also considered these additional models: (i) our BIPnet model, using both molecular data and clinical data (age, gender, body mass index, systolic blood pressure, low-density lipoprotein (LDL), triglycerides) (BIPnet + Cov), (ii) our BIP model, using both clinical and molecular data (BIP + Cov), and (iii) sparse PCA on stacked omics and clinical data (SPCA + Cov). For sparse PCA, we used the method of Witten and others (2009) with |$l_1$|-norm regularization. We were not able to incorporate clinical data in the integrative analysis methods under comparison since they are applicable to only two datasets. Figure 2 shows the marginal posterior probabilities (MPPs) of the four components for each data type (SNPs, mRNA, and response) for one random split of the data. For each of the four methods, only one component is active (i.e., MPP |$>0.5$|) for gene expression, and two components are active for the SNP data. No component is active for the response variable under the methods without clinical data (BIPnet and BIP). However, under the methods that incorporate covariates (BIPnet + Cov and BIP + Cov), at least one component is active for the response variable. Note that the MPPs for covariates [Figure 2(a) and (b)] are 1 because we do not perform selection on the covariates. These findings show the importance of incorporating clinical data in our approach.

Figure 2

MPPs of the four component indicators, |$\gamma^{(m)}_{l}$|, |$l=1,2,3,4$|, and |$m=$| Response (i.e., ASCVD scores), Genes, SNPs, and covariates, with respect to our four proposed methods. (a) BIPnet+Cov, (b) BIP+Cov, (c) BIPnet, and (d) BIP.
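The rule that a component is active when its MPP exceeds 0.5 can be illustrated with a small sketch. It assumes, as is standard for MCMC output, that each MPP is estimated by the proportion of retained draws in which the binary indicator equals 1; the draws below are made up and the function names are hypothetical.

```python
def marginal_posterior_probabilities(draws):
    """Average binary indicator draws column-wise to estimate each MPP."""
    n_draws = len(draws)
    return [sum(col) / n_draws for col in zip(*draws)]


def active_components(mpp, threshold=0.5):
    """1-based indices of components whose MPP exceeds the threshold."""
    return [l + 1 for l, p in enumerate(mpp) if p > threshold]


# Hypothetical retained draws of (gamma_1, ..., gamma_4) for one data type.
draws = [
    [1, 0, 0, 1],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
]
mpp = marginal_posterior_probabilities(draws)
active = active_components(mpp)
```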

Table 2 gives the average test mean squared error and the numbers of genes and genetic variants identified by the methods across twenty random splits of the data. For FusedCCA, we assumed that all genes within a group are connected (similarly for SNPs). We computed the first and the first four canonical variates for FusedCCA, SELPCCA, and CCAReg using the training data set, and predicted the test ASCVD score using the combined scores from both the gene expression and SNP data. BIPnet + Cov and BIP + Cov have better prediction performance (lower MSEs) than BIPnet, BIP, and the other methods. We also observe that BIPnet and BIPnet + Cov tend to be more stable than BIP and BIP + Cov (125 genes and 55 SNPs vs 0 genes and 54 SNPs selected at least 12 times). Together, these findings demonstrate the merit in combining molecular data, prior biological knowledge, and clinical covariates in integrative analysis and prediction methods.

Table 2.

Average test mean squared error and number of variables selected by the methods for twenty random splits of the data. Genes/SNPs selected |$\ge 12$| times are genes or SNPs selected at least 12 times out of the twenty random splits. Subscripts 1 and 4 respectively denote results from the first and the first four canonical variates. MSE is mean squared error.

Method	MSE	Average no. of genes	Average no. of SNPs	Genes selected |$\ge 12$| times	SNPs selected |$\ge 12$| times
BIPnet	1.362 (0.033)	128.35	113.65	125	55
BIPnet + Cov	0.758 (0.125)	122	95.8	125	55
BIP	1.363 (0.032)	68.7	88.8	0	54
BIP + Cov	0.766 (0.153)	54.7	82.5	0	54
SPCA|$_1$| + Cov	1.372 (0.034)	0	29.45	0	29
SPCA|$_4$| + Cov	1.390 (0.033)	22.40	134.30	0	113
SELPCCA|$_1$|	1.366 (0.034)	56.25	29.00	1	17
SELPCCA|$_4$|	1.371 (0.037)	271.45	155.45	83	81
CCAReg|$_1$|	1.371 (0.008)	48.85	43.55	0	0
CCAReg|$_4$|	1.385 (0.040)	20.55	24.75	0	9
FusedCCA|$_1$|	1.367 (0.032)	561	401.35	561	400
FusedCCA|$_4$|	1.383 (0.038)	561	407.60	561	408

We further assessed the biological implications of the groups of genes and SNPs identified by the proposed methods. In total, 125 genes and 55 SNPs were selected (MPP |$>$| 0.5) at least 12 times out of the 20 random splits for both BIPnet and BIPnet + Cov (see Figure 2, and Figures E1 and E2 of the Supplementary materials available at Biostatistics online for a random split). Networks 1 and 10 (refer to Table D2 of the Supplementary materials available at Biostatistics online for group descriptions) were identified 18 times (90%) out of the 20 random splits of the data. Networks 1 and 10 both have cancer as a common disease. Research suggests that CVD shares some molecular and genetic pathways with cancer (Masoudkabir and others, 2017). In addition, cell-to-cell signaling and interaction, which characterizes network 10, has led to novel therapeutic strategies in cardiovascular diseases (Shaw and others, 2012). Genes PSMB10 and UQCRQ, and SNPs rs1050152 and rs2631367, were respectively the two top selected genes and SNPs (highest averaged MPPs) across the 20 random splits (see Figures E3 and E4 of the Supplementary materials available at Biostatistics online for a random split). The gene PSMB10, a member of the ubiquitin-proteasome system, encodes the proteasome subunit beta type-10 protein in humans. PSMB10 is reported to play an essential role in maintaining cardiac protein homeostasis (Li and others, 2018), with a compromised proteasome likely contributing to the pathogenesis of cardiovascular diseases (Wang and Hill, 2015). The SNPs rs1050152 and rs2631367, found in genes SLC22A4 and SLC22A5, belong to the gene family SLC22A, which has been implicated in cardiovascular diseases. Other genes and SNPs with MPP |$> 0.85$|, on average, are reported in Tables C1 and C3 of the Supplementary materials available at Biostatistics online. These could be explored for their potential roles in cardiovascular (including atherosclerosis) disease risk.

Discriminating between high- vs low-risk ASCVD using the shared components: We illustrate the use of the shared components or scores in discriminating subjects at high- vs low-risk for developing ASCVD in 10 years. For this purpose, we dichotomized the response into two categories: low-risk and high-risk patients. An ASCVD score less than or equal to the median ASCVD score was considered low-risk, and an ASCVD score above the median was considered high-risk. We computed the AUC for response classification using a simple logistic regression model with the most significant component associated with the response (i.e., the one with the highest MPP for our method) as the covariate. For the other methods, the AUC was calculated using the first canonical variates from the combined scores. Figure 3 shows that the components selected by our methods applied to the omics and clinical data discriminate the two risk groups more clearly than our methods applied to the omics data only. The estimated AUCs for the omics plus clinical covariates are much higher than those obtained from the omics-only applications (e.g., the AUC is 1.00 for the training set and 0.94 for the test set using BIPnet + Cov). The estimated AUCs from our methods are also considerably higher than the AUCs from the competing methods CCAReg and SELPCCA.

We also applied BIPnet without the assumption that |$\eta_{lj}^{(M+1)}=\gamma_l^{(M+1)}=1$|, that is, without assuming that, for the clinical covariates, all components are active and all clinical covariates are relevant. Results show that (i) most of the clinical covariates were selected by the model, with LDL being the only covariate deemed irrelevant, and (ii) there is a common active component between the covariates and the response variable, which suggests that the clinical covariates are mostly associated with the response. Refer to Section E.1 and Figure E.5 of the Supplementary materials available at Biostatistics online for more details. We also fitted a binomial regression model for ASCVD risk prediction using only clinical data as covariates, and we found an AUC of 0.93 for the test set, which shows that, for this particular application, the SNP and gene expression data do not contribute substantially to the high AUC values. However, we performed simulation studies that showed the importance of incorporating omics and/or clinical covariates in BIPnet (see Section F.1 of the Supplementary materials available at Biostatistics online).
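The classification step described above (dichotomize the response at its median, fit a logistic regression on one component, compute the AUC) can be sketched as follows. This is a minimal illustration under stated assumptions: the latent score and the response are simulated, the logistic model is fitted with plain gradient ascent rather than any particular software routine, the AUC is computed on the same data used for fitting, and all names are hypothetical.

```python
import math
import random
import statistics


def fit_logistic(x, labels, steps=500, lr=0.1):
    """One-covariate logistic regression fitted by plain gradient ascent."""
    b0 = b1 = 0.0
    n = len(x)
    for _ in range(steps):
        g0 = g1 = 0.0
        for xi, yi in zip(x, labels):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            g0 += yi - p
            g1 += (yi - p) * xi
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1


def auc(scores, labels):
    """Probability that a random high-risk subject outranks a low-risk one."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Synthetic stand-in data (hypothetical): one latent score per subject and a
# noisy continuous response playing the role of the ASCVD score.
rng = random.Random(1)
u = [rng.gauss(0, 1) for _ in range(200)]
ascvd = [2.0 * ui + rng.gauss(0, 1) for ui in u]

# Dichotomize at the median: above the median is high-risk (1), else low-risk (0).
cut = statistics.median(ascvd)
risk = [1 if a > cut else 0 for a in ascvd]

b0, b1 = fit_logistic(u, risk)
prob = [1.0 / (1.0 + math.exp(-(b0 + b1 * ui))) for ui in u]
fitted_auc = auc(prob, risk)
```

Because a one-covariate logistic model is monotone in the covariate, the AUC of the fitted probabilities equals the AUC of the raw score whenever the slope is positive, so the fitting step matters mainly when several components are used.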

Figure 3

Estimated latent U scores on both training and test sets for low and high CVD risks with respect to our methods and competing methods.

6. Conclusion

We have presented a Bayesian hierarchical modeling framework that integrates multiomics data, a response variable, and clinical covariates in a unified procedure. We also extended the methods to integrate grouping information of features through prior distributions. The numerical experiments described in this manuscript show that our approach outperforms competing methods, mostly in terms of variable selection performance. The data analysis demonstrated the merit of integrating omics, clinical covariates, and prior biological knowledge while simultaneously predicting a clinical outcome. The identification of important genes or SNPs and their corresponding important groups eases the biological interpretation of our findings. Together, these findings demonstrate the importance of one-step methods like the ones we propose here, and also the merit in incorporating both clinical and molecular data in integrative analysis and prediction models.

Several extensions of our model are worth investigating. First, using latent variable formulations, we can naturally extend our approach to other types of response variables such as survival time, binary or multinomial responses or mixed response outcomes (e.g., both continuous and binary). Second, the proposed method is only applicable to complete data and does not allow for missing values. A future project could extend the current methods to the scenario where data are missing using Bayesian imputation methods or the approach proposed by Chekouo and others (2017) for missing blocks of data. Third, we assume that the omics data are continuous. A future method will extend the current work to other omics data types (e.g., binary, categorical or count).

Software

We have created an R package called BIPnet that implements the methods. It also contains functions for generating simulated data. The R and C source code, along with the PDF user manual, is available in the GitHub repository https://github.com/chekouo/BIPnet.

Supplementary material

Supplementary material is available at http://biostatistics.oxfordjournals.org.

Acknowledgments

We are grateful to the Emory Predictive Health Institute for providing us with the gene expression, SNP, and clinical data. Conflict of Interest: None declared.

Funding

National Institutes of Health (NIH) (5KL2TR002492-04 to S.S. in part); and Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery (RGPIN-2019-04810 to T.C., in part).

References

AHA (2016). Cardiovascular disease: a costly burden for America projections through 2035, accessed December 21, 2019. http://www.heart.org/idc/groups/heart-public/@wcm/@adv/documents/downloadable/ucm_491543.pdf.

Bartels, S., Franco, A. R. and Rundek, T. (2012). Carotid intima-media thickness (cIMT) and plaque from risk assessment and clinical use to genetic discoveries. Perspectives in Medicine 1, 139–145.

Chalise, P., Koestler, D. C., Bimali, M., Yu, Q. and Fridley, B. L. (2014). Integrative clustering methods for high-dimensional molecular data. Translational Cancer Research 3, 202–216.

Chekouo, T., Mohammed, S. and Rao, A. (2020). A Bayesian 2D functional linear model for gray-level co-occurrence matrices in texture analysis of lower grade gliomas. NeuroImage: Clinical 28, 102437.

Chekouo, T., Stingo, F. C., Doecke, J. D. and Do, K.-A. (2015). miRNA-target gene regulatory networks: a Bayesian integrative approach to biomarker selection with application to kidney cancer. Biometrics 71, 428–438.

Chekouo, T., Stingo, F. C., Doecke, J. D. and Do, K.-A. (2017). A Bayesian integrative approach for multi-platform genomic data: a kidney cancer case study. Biometrics 73, 615–624.

Chen, R.-B., Chu, C.-H., Yuan, S. and Wu, Y. N. (2016). Bayesian sparse group selection. Journal of Computational and Graphical Statistics 25, 665–683.

Hoeting, J. A., Madigan, D., Raftery, A. E. and Volinsky, C. T. (1999). Bayesian model averaging: a tutorial. Statistical Science 14, 382–401.

Klami, A., Virtanen, S. and Kaski, S. (2013). Bayesian canonical correlation analysis. Journal of Machine Learning Research 14, 965–1003.

Li, J., Wang, S., Bai, J., Yang, X.-L., Zhang, Y.-L., Che, Y.-L., Li, H.-H. and Yang, Y.-Z. (2018). Novel role for the immunoproteasome subunit PSMB10 in angiotensin II–induced atrial fibrillation in mice. Hypertension 71, 866–876.

Lock, E. F., Hoadley, K. A., Marron, J. S. and Nobel, A. B. (2013). Joint and individual variation explained (JIVE) for integrated analysis of multiple data types. The Annals of Applied Statistics 7, 523–542.

Luo, C., Liu, J., Dey, D. K. and Chen, K. (2016). Canonical variate regression. Biostatistics 17, 468–483.

Masoudkabir, F., Sarrafzadegan, N., Gotay, C., Ignaszewski, A., Krahn, A. D., Davis, M. K., Franco, C. and Mani, A. (2017). Cardiovascular disease and cancer: evidence for shared disease pathways and pharmacologic prevention. Atherosclerosis 263, 343–351.

Mo, Q., Shen, R., Guo, C., Vannucci, M., Chan, K. S. and Hilsenbeck, S. G. (2018). A fully Bayesian latent variable model for integrative clustering analysis of multi-type omics data. Biostatistics 19, 71–86.

Qiu, Y.-Q. (2013). KEGG Pathway Database. New York, NY: Springer New York, pp. 1068–1069.

Rockova, V. and Lesaffre, E. (2014). Incorporating grouping information in Bayesian variable selection with applications in genomics. Bayesian Analysis 9, 221–258.

Safo, S. E., Ahn, J., Jeon, Y. and Jung, S. (2018). Sparse generalized eigenvalue problem with application to canonical correlation analysis for integrative analysis of methylation and gene expression data. Biometrics 74, 1362–1371.

Safo, S. E., Li, S. and Long, Q. (2018). Integrative analysis of transcriptomic and metabolomic data via sparse canonical correlation analysis with incorporation of biological information. Biometrics 74, 300–312.

Safo, S. E., Min, E. J. and Haine, L. (2021). Sparse linear discriminant analysis for multi-view structured data. Biometrics, DOI: https://doi.org/10.1111/biom.13458.

Shaw, S. G., Abraham, D. J., Baker, D. M. and Tsui, J. (2012). Cell signalling pathways leading to novel therapeutic strategies in cardiovascular disease. Cardiology Research and Practice 2012, 475094.

Shen, R., Olshen, A. B. and Ladanyi, M. (2009). Integrative clustering of multiple genomic data types using a joint latent variable model with application to breast and lung cancer subtype analysis. Bioinformatics 25, 2906–2912.

Shen, R., Wang, S. and Mo, Q. (2013). Sparse integrative clustering of multiple omics data sets. The Annals of Applied Statistics 7, 269–294.

Stingo, F. C., Chen, Y. A., Tadesse, M. G. and Vannucci, M. (2011). Incorporating biological information into linear models: a Bayesian approach to the selection of pathways and genes. The Annals of Applied Statistics 5, 1978–2002.

van Dyk, D. A. and Park, T. (2008). Partially collapsed Gibbs samplers. Journal of the American Statistical Association 103, 790–796.