HIP: a method for high-dimensional multi-view data integration and prediction accounting for subgroup heterogeneity

Table 1

COPDGene Participant Characteristics; the measurements presented were collected at the Year 5 study visit to align with the collection of proteomic and genomic data

Variable	Males (|$N$| = 782)	Females (|$N$| = 594)	|$P$|-value
Age	68.28 (8.35)	68.03 (8.36)	0.581
BMI	28.03 (5.62)	27.69 (6.59)	0.317
FEV1 % Predicted	61.94 (22.97)	62.91 (22.59)	0.431
BODE Index	2.45 (2.45)	2.63 (2.38)	0.176
% Emphysema	11.30 (11.86)	9.39 (11.45)	0.003
Pack Years	53.05 (26.63)	47.57 (24.99)	<0.001
AWT	1.17 (0.23)	1.00 (0.21)	<0.001
Non-Hispanic White (%)	82	78	0.084
Current Smoker (%)	66	65	0.510
Diabetes (%)	17	14	0.204

BMI = Body Mass Index, FEV|$_{1}$| = Forced Expiratory Volume in 1 Second, BODE = Body mass index, airflow Obstruction, Dyspnea, and Exercise capacity.

Applying the proposed and competing methods

Blood samples collected from COPDGene participants at Phase 2 were used for proteomic profiling and RNA sequencing. Proteomics data were quantified using the SomaScan 5K technology from SomaLogic. The RNA sequencing data were obtained from Illumina. The original data set has 4979 SOMAmers, representing 4776 unique proteins, and 19263 genes. We work with SOMAmers and convert our findings to unique proteins. Please refer to the supplementary material for more details on the techniques used to generate the RNASeq and proteomics data. To reduce dimensionality, we first applied unsupervised filtering to select the 5000 genes and 2000 proteins with the largest standard deviations. For statistical rigor and to reduce false findings, we used resampling to identify genes and proteins that our method consistently selects as associated with AWT. We refer to these genes and proteins as “stable”. In particular, we generated 50 random splits of the filtered data, stratified by subgroup, such that for each split 75% of the data were used for training and 25% for testing. Within each split, we performed supervised filtering by regressing AWT on each of the genes and proteins retained by unsupervised filtering, adjusting for sex, race and pack years, and kept the genes and proteins with potential to explain the variation in AWT (uncorrected |$P$|-value |$<.05$|). Consequently, the variables entering the models could differ across splits of the data.
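As a rough illustration of the unsupervised filtering and stratified resampling steps, the following Python sketch ranks features by standard deviation and draws 50 subgroup-stratified 75/25 splits. The function names, the toy data, and the rounding used for the training-set size are assumptions made for illustration only; the covariate-adjusted supervised filtering step is not shown.

```python
import random
import statistics


def top_sd_features(columns, k):
    """Return the names of the k features with the largest standard deviation.

    columns: dict mapping feature name -> list of values (one per subject).
    """
    sds = {name: statistics.stdev(vals) for name, vals in columns.items()}
    return sorted(sds, key=sds.get, reverse=True)[:k]


def stratified_split(subgroups, train_frac=0.75, seed=0):
    """Split subject indices into train/test sets within each subgroup.

    subgroups: list of subgroup labels, one per subject.
    The training-set size per subgroup is rounded (an assumption).
    """
    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(set(subgroups)):
        idx = [i for i, g in enumerate(subgroups) if g == label]
        rng.shuffle(idx)
        n_train = round(train_frac * len(idx))
        train.extend(idx[:n_train])
        test.extend(idx[n_train:])
    return sorted(train), sorted(test)


# Toy data (hypothetical): three features measured on four subjects.
toy = {
    "g1": [1.0, 1.1, 0.9, 1.0],
    "g2": [0.0, 5.0, -5.0, 2.0],
    "g3": [2.0, 3.0, 1.0, 2.5],
}
kept = top_sd_features(toy, k=2)

# Toy subgroup labels (hypothetical): 8 males and 4 females, 50 random splits.
sex = ["M"] * 8 + ["F"] * 4
splits = [stratified_split(sex, seed=s) for s in range(50)]
```

Stratifying within each subgroup keeps the male-to-female ratio the same in the training and testing portions of every split.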

To select tuning parameters, we set the range of possible values for |$\lambda _{G}$| and |$\lambda _{\xi }$| in HIP to |$(0,2]$| as in the simulations and selected the best model using BIC. Joint Lasso used 10-fold cross-validation over the same grid values used in the simulations. CVR, Lasso, and Elastic Net used 10-fold cross-validation with default settings to select tuning parameters. HIP and CVR both require specification of a rank, i.e. the number of latent components used in the solutions. Our proposed automatic approach (threshold |$=0.25$|; refer to Section 1.1 of Supplementary Information) applied to the concatenated data suggested |$K=3$|. Interestingly, when applied to each |$\boldsymbol X^{d,s}$| separately, it suggested |$K=3$| for the gene data and |$K=1$| for the protein data. Supplementary Figure S9 shows the scree plots for both the concatenated and the separate |$\boldsymbol X^{d,s}$|. Based on these results and the robustness seen in the sensitivity analyses, we selected |$K=3$| components for HIP and CVR. For HIP, we set |$N_{top}$| to 75 for genes and 25 for proteins.

For each split of the data, we applied HIP (Grid), HIP (Random), and the subgroup versions of the competing methods used in the simulations. For Elastic Net and Lasso, we stacked the views and ran separate analyses for males and females. For Joint Lasso, we ran separate analyses for the protein and gene data. For CVR, we ran separate analyses for males and females. We used the selected tuning parameters and test datasets to predict AWT and estimate test MSEs. Since the univariate filtering could result in different variables entering our model for each data split, we combined information on how often a variable passed univariate filtering and served as input to our model with how often that variable was ranked highly as relevant by our method. In particular, we considered (i) the number of splits in which the variable was included in the |$N_{top}$| variables and (ii) the proportion of splits in which the variable was included in the |$N_{top}$| variables, i.e. the quantity in (i) divided by the number of splits in which the variable entered the model after supervised filtering. We took the product of these two terms and ranked the variables in descending order of this product. A higher product indicates that the variable was frequently selected to differentiate AWT in the supervised univariate filtering approach and was also highly ranked as being associated with AWT in our multivariate HIP method. We then selected the top 1% of genes and proteins according to the product to represent the “stable” genes and proteins. We used this cut-off point because our focus was on identifying a few molecular signatures shared by and unique to men and women and predictive of AWT.
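The stability ranking described above can be sketched as follows. This is a minimal Python illustration, not the authors' implementation: the function names, the toy counts, and the choice to round the top fraction up to at least one variable are all assumptions.

```python
import math


def stability_scores(n_entered, n_top):
    """Product of (i) the count and (ii) the proportion of top-ranked splits.

    n_entered: dict variable -> number of splits in which the variable
               entered the model after supervised filtering.
    n_top:     dict variable -> number of splits in which the variable
               was among the N_top variables.
    """
    scores = {}
    for var, entered in n_entered.items():
        top = n_top.get(var, 0)
        proportion = top / entered if entered else 0.0
        scores[var] = top * proportion
    return scores


def select_stable(scores, frac=0.01):
    """Keep the top `frac` of variables by score (rounding up is an assumption)."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    n_keep = max(1, math.ceil(frac * len(ranked)))
    return ranked[:n_keep]


# Toy counts over 50 splits (hypothetical variables A, B, C).
n_entered = {"A": 50, "B": 10, "C": 40}
n_top = {"A": 40, "B": 10, "C": 5}
scores = stability_scores(n_entered, n_top)   # A: 40 * 0.8, B: 10 * 1.0, C: 5 * 0.125
stable = select_stable(scores, frac=0.5)      # half of the toy variables, rounded up
```

In this toy example, variable B is always top-ranked when it enters the model but enters rarely, so it scores below variable A, which both enters and ranks highly in most splits.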

Real Data Results

Average MSEs, and proteins and genes selected

Supplementary Figures S10 and S11 show violin plots of the test MSEs and run times, respectively, from all 50 splits of the data. The average test MSEs across splits were slightly lower for CVR and Joint Lasso, but these methods also used many more variables (Supplementary Table S3). HIP (Random) has a computational advantage over CVR and Joint Lasso.

Supplementary Table S4 shows the number of “stable” common and subgroup-specific genes and proteins identified by each method. We note few overlaps in selected genes and proteins between HIP and existing methods (Supplementary Fig. S12). Supplementary Table S5 compares the variables selected by HIP (Random) and HIP (Grid); the selected genes and proteins are very similar, again supporting the use of the random search instead of the grid search.

Supplementary Tables S6 and S7 list the genes and Supplementary Table S8 lists the proteins identified as “stable” and important to males and females by HIP (Random). The tables include a weight for each protein and gene, calculated as the |$L_{2}$| norm of its coefficients in |$\hat{\boldsymbol B}^{d,s}$| across components (i.e. of its row) and averaged over the splits in which the variable was selected.
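To make the weight calculation concrete, a minimal Python sketch is given below. It assumes that each variable corresponds to one row of |$\hat{\boldsymbol B}^{d,s}$| with one coefficient per component; the function names and the toy coefficients are hypothetical.

```python
import math


def row_l2_norm(coefs):
    """L2 norm of one variable's coefficients across the K components."""
    return math.sqrt(sum(c * c for c in coefs))


def average_weight(rows_per_split):
    """Average the row norms over the splits in which the variable was selected.

    rows_per_split: list of coefficient rows (each of length K), one for
                    every split in which the variable was selected.
    """
    norms = [row_l2_norm(row) for row in rows_per_split]
    return sum(norms) / len(norms)


# Toy example (hypothetical): a variable selected in two splits, K = 3.
rows = [[3.0, 4.0, 0.0], [0.0, 0.0, 1.0]]
weight = average_weight(rows)  # row norms are 5 and 1, so the average is 3
```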

Proteins with large weights include NPLOC4 and SPG21 for males and SMAP1 and CDKN2D for females. Reference [24] introduced a method called SubmiRine to analyze miRNA and predict miRNA target site variants (miRNA-TSV). When this method was applied to a subset of genomic samples from patients with COPD from the Lung Genome Research Consortium (LGRC; http://www.lung-genomics.org), SPG21 was the top-scoring miRNA-TSV.

The gene with the largest weight was ADIPOR1 for males and BCL2L1 for females. In a study of 60 male COPD patients and 30 male controls, Reference [25] found that adiponectin is associated with inflammation in COPD, as evidenced by a positive correlation with IL-8 and a negative correlation with FEV|$_{1}$| %.

Pathway enrichment analysis

We performed pathway enrichment analysis using Ingenuity Pathway Analysis (IPA) [26] to test for overrepresentation of pathways among our lists of “stable” proteins and genes for males and females. The top 10 canonical gene pathways (Table 2) included both pathways common to males and females and subgroup-specific pathways. The top pathway for males is the iron homeostasis signaling pathway; this is the second-ranked pathway for females, for whom the top pathway is heme biosynthesis II. There is strong evidence that disrupted iron homeostasis is associated with the presence and severity of lung disease including COPD [27, 28]. Methylglyoxal degradation I ranks second for males and third for females. Reference [29] performed gene expression profiling on small airway epithelium samples and also found this pathway to be activated in both male and female smokers.
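For readers unfamiliar with overrepresentation testing, a generic version of such a test can be written as a one-sided hypergeometric tail probability, which is equivalent to a right-tailed Fisher's exact test. The Python sketch below is an illustration under stated assumptions and is not IPA's implementation: the function name, the reference-set size, and the pathway size are hypothetical, and the values in Table 2 depend on IPA's own knowledge base.

```python
from math import comb


def overrepresentation_pvalue(universe, pathway_size, list_size, overlap):
    """P(X >= overlap) when list_size molecules are drawn without replacement
    from a universe in which pathway_size molecules belong to the pathway."""
    total = comb(universe, list_size)
    upper = min(pathway_size, list_size)
    tail = sum(
        comb(pathway_size, k) * comb(universe - pathway_size, list_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Toy numbers (hypothetical): 10 molecules in the reference set, 3 of them in
# the pathway, and a list of 2 selected molecules of which 1 is in the pathway.
p_toy = overrepresentation_pvalue(universe=10, pathway_size=3, list_size=2, overlap=1)
```

A small value indicates that the overlap between the selected list and the pathway is larger than expected by chance; the |$P$|-values reported here are unadjusted for multiple testing.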

Table 2

Top canonical pathways from IPA enrichment analysis

View	Subgroup	Canonical pathway	Molecules	Unadjusted |$P$|-value
Genes	Males	Iron homeostasis signaling pathway	CDC34,FECH,SLC25A37	.003
		Methylglyoxal Degradation I	HAGH	.006
		Heme Biosynthesis from Uroporphyrinogen-III I	FECH	.008
		Pentose Phosphate Pathway (Non-oxidative Branch)	RPIA	.013
		Heme Biosynthesis II	FECH	.019
		Pentose Phosphate Pathway	RPIA	.023
		Erythropoietin Signaling Pathway	BCL2L1,GATA1	.054
		ID1 Signaling Pathway	BCL2L1,GSPT1	.068
		Sertoli Cell-Sertoli Cell Junction Signaling	SPTB,YBX3	.069
		Autophagy	GABARAPL2,SLC1A5	.076
	Females	Heme Biosynthesis II	ALAS2,FECH	<.001
		Iron homeostasis signaling pathway	ALAS2,CDC34,FECH,SLC25A37	<.001
		Methylglyoxal Degradation I	HAGH	.006
		Heme Biosynthesis from Uroporphyrinogen-III I	FECH	.008
		Tetrapyrrole Biosynthesis II	ALAS2	.010
		Hypoxia Signaling in the Cardiovascular System	CDC34,UBE2H	.011
		Pentose Phosphate Pathway (Non-oxidative Branch)	RPIA	.013
		Pentose Phosphate Pathway	RPIA	.023
		Erythropoietin Signaling Pathway	BCL2L1,GATA1	.054
		ID1 Signaling Pathway	BCL2L1,GSPT1	.068
Proteins	Males	Role of JAK2 in Hormone-like Cytokine Signaling	EPO,LEP,PTPN6	<.001
		White Adipose Tissue Browning Pathway	BDNF,LEP,NPPB	<.001
		Erythropoietin Signaling Pathway	EPO,LEP,PTPN6	<.001
		Serotonin Receptor Signaling	ADIPOQ,BDNF,LEP,NPPB	<.001
		AMPK Signaling	ADIPOQ,INS,LEP	.001
		Leptin Signaling in Obesity	INS,LEP	.002
		IL-3 Signaling	PPP3R1,PTPN6	.002
		Maturity Onset Diabetes of Young (MODY) Signaling	ADIPOQ,INS	.002
		Thyroid Cancer Signaling	BDNF,INS	.002
		ABRA Signaling Pathway	NPPB,PPP3R1	.003
	Females	Granulocyte Adhesion and Diapedesis	PF4,PPBP,TNFRSF1A	<.001
		Agranulocyte Adhesion and Diapedesis	PF4,PPBP,TNFRSF1A	<.001
		Wound Healing Signaling Pathway	EGF,PF4,TNFRSF1A	<.001
		Huntington’s Disease Signaling	BDNF,CPLX2,EGF	.001
		Pathogen Induced Cytokine Storm Signaling Pathway	PF4,PPBP,TNFRSF1A	.002
		Glioma Signaling	CDKN2D,EGF	.004
		Type II Diabetes Mellitus Signaling	ADIPOQ,TNFRSF1A	.005
		Axonal Guidance Signaling	BDNF,EGF,PAPPA	.005
		Tumor Microenvironment Pathway	EGF,TNFRSF1A	.007
		Regulation Of The Epithelial Mesenchymal Transition By Growth Factors Pathway	EGF,TNFRSF1A	.008

There was no overlap in the top 10 protein pathways for males and females. The top pathway for males was the role of JAK2 in hormone-like cytokine signaling. The top pathway for females was granulocyte adhesion and diapedesis, which is associated with regulation of inflammation. Reference [30] also found this to be a top pathway involving upregulated genes when comparing patients with COPD and healthy controls.

Table 3

Comparison of regression model estimates

Variable	Estimate	95% CI	|$P$|-value	|$R^{2}$|	Adjusted |$R^{2}$|
ERF				0.152	0.147
Intercept	–1.180	–1.872, –0.489	0.001
Age	0.021	–0.035, 0.077	0.461
Sex (Female)	0.006	–0.093, 0.105	0.901
Race (African American)	–0.070	–0.202, 0.062	0.297
BMI	0.352	0.299, 0.406	<0.001
Former Smoker	0.947	0.257, 1.637	0.007
Current Smoker	1.429	0.735, 2.123	<0.001
% Emphysema	0.023	–0.032, 0.079	0.414
Scanner - Philips	0.379	0.097, 0.661	0.009
Scanner - Siemens	0.130	0.025, 0.235	0.015
ERF + Common Protein Score				0.157	0.151
Intercept	–1.138	–1.828, –0.447	0.001
Age	0.032	–0.024, 0.089	0.266
Sex (Female)	0.007	–0.092, 0.105	0.897
Race (African American)	–0.049	–0.181, 0.084	0.469
BMI	0.327	0.271, 0.383	<0.001
Former Smoker	0.911	0.223, 1.600	0.010
Current Smoker	1.390	0.697, 2.083	<0.001
% Emphysema	0.028	–0.028, 0.083	0.328
Scanner - Philips	0.404	0.123, 0.686	0.005
Scanner - Siemens	0.110	0.005, 0.216	0.041
Common Protein Score	0.076	0.023, 0.130	0.005
ERF + Common Gene Score				0.153	0.147
Intercept	–1.169	–1.861, –0.477	0.001
Age	0.021	–0.035, 0.077	0.462
Sex (Female)	0.007	–0.092, 0.106	0.893
Race (African American)	–0.083	–0.216, 0.050	0.221
BMI	0.343	0.288, 0.398	<0.001
Former Smoker	0.932	0.242, 1.623	0.008
Current Smoker	1.423	0.729, 2.117	<0.001
% Emphysema	0.023	–0.033, 0.079	0.421
Scanner - Philips	0.387	0.105, 0.669	0.007
Scanner - Siemens	0.134	0.029, 0.239	0.013
Common Gene Score	0.035	–0.017, 0.086	0.185
ERF + Common Scores				0.158	0.151
Intercept	–1.128	–1.819, –0.437	0.001
Age	0.032	–0.025, 0.088	0.270
Sex (Female)	0.007	–0.092, 0.106	0.890
Race (African American)	–0.061	–0.195, 0.073	0.373
BMI	0.320	0.262, 0.377	<0.001
Former Smoker	0.899	0.210, 1.588	0.011
Current Smoker	1.385	0.692, 2.078	<0.001
% Emphysema	0.027	–0.028, 0.083	0.335
Scanner - Philips	0.411	0.129, 0.693	0.004
Scanner - Siemens	0.114	0.008, 0.220	0.035
Common Protein Score	0.074	0.021, 0.128	0.006
Common Gene Score	0.031	–0.020, 0.083	0.237
ERF + Subgroup Protein Score				0.165	0.159
Intercept	–1.071	–1.760, –0.383	0.002
Age	0.014	–0.041, 0.070	0.614
Sex (Female)	0.007	–0.091, 0.105	0.891
Race (African American)	–0.035	–0.167, 0.097	0.605
BMI	0.312	0.256, 0.368	<0.001
Former Smoker	0.860	0.174, 1.546	0.014
Current Smoker	1.335	0.645, 2.025	<0.001
% Emphysema	0.030	–0.025, 0.086	0.284
Scanner - Philips	0.418	0.137, 0.698	0.004
Scanner - Siemens	0.079	–0.027, 0.185	0.146
Subgroup Protein Score	0.126	0.072, 0.179	<0.001
ERF + Subgroup Gene Score				0.153	0.147
Intercept	–1.169	–1.861, –0.477	0.001
Age	0.021	–0.035, 0.077	0.459
Sex (Female)	0.007	–0.092, 0.106	0.893
Race (African American)	–0.083	–0.217, 0.050	0.220
BMI	0.343	0.288, 0.398	<0.001
Former Smoker	0.932	0.242, 1.622	0.008
Current Smoker	1.423	0.729, 2.117	<0.001
% Emphysema	0.023	–0.033, 0.079	0.421
Scanner - Philips	0.387	0.105, 0.669	0.007
Scanner - Siemens	0.134	0.029, 0.239	0.013
Subgroup Gene Score	0.035	–0.016, 0.087	0.180
ERF + Subgroup Scores				0.166	0.159
Intercept	–1.064	–1.752, –0.375	0.003
Age	0.015	–0.041, 0.070	0.610
Sex (Female)	0.007	–0.091, 0.106	0.885
Race (African American)	–0.045	–0.179, 0.088	0.504
BMI	0.306	0.249, 0.363	<0.001
Former Smoker	0.850	0.164, 1.536	0.015
Current Smoker	1.332	0.642, 2.022	<0.001
% Emphysema	0.030	–0.025, 0.085	0.290
Scanner - Philips	0.423	0.142, 0.704	0.003
Scanner - Siemens	0.083	–0.024, 0.189	0.129
Subgroup Protein Score	0.124	0.070, 0.177	<0.001
Subgroup Gene Score	0.027	–0.024, 0.078	0.304

Models were fit on all participants in the COPD data set (|$N=1374$|) and adjusted for age, sex, race, BMI, smoking status, percent emphysema, and scanner make. Scores were developed using the “stable” proteins and genes selected by HIP (Random). ERF = Established Risk Factors (Age, Sex, Race, BMI, and Smoking Status).

Effect of common and sex-specific genes and proteins on AWT

Finally, we created common and sex-specific protein and gene scores from the “stable” proteins and genes selected by HIP (Random) and assessed whether these scores improved the prediction of AWT beyond some established COPD risk factors. We defined the common protein score for subject |$i$| as CommonProtScore|$_{i}=\sum _{j=1}^{\# common~proteins}w_{j}x_{ij}^{1}$|, where |$x_{ij}^{1}$| is subject |$i$|’s expression value for the |$j$|th common protein (i.e. the |$ij$|th entry of the protein data, |$\boldsymbol X^{1}$|), and |$w_{j}$| is the weight for protein |$j$|. Each protein weight, |$w_{j}$|, was obtained via the bootstrap. Specifically, we drew 200 bootstrap datasets and, for each one, fit a univariate regression of AWT on each of the common proteins identified to obtain a regression coefficient and standard error. This yielded 200 regression coefficients and standard errors per protein, which we combined using a weighted mean. The subgroup-specific scores were obtained in the same fashion. Because different variables were identified for males and females, the scores were standardized to have mean 0 and variance 1 within each subgroup.
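The score construction above can be sketched in a few lines of Python. This is a minimal illustration on synthetic data, not the authors' implementation. The text does not state which weights enter the weighted mean, so the sketch assumes inverse-variance weights (1/SE²). The function names `univariate_ols`, `bootstrap_weight`, and `standardized_scores` are hypothetical, and the population standard deviation is used for standardization as a further assumption.

```python
import random
import statistics


def univariate_ols(x, y):
    """Slope and its standard error from a simple linear regression of y on x."""
    n = len(x)
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se = (rss / (n - 2) / sxx) ** 0.5
    return slope, se


def bootstrap_weight(x, y, n_boot=200, rng=None):
    """Weight for one variable: weighted mean of bootstrap slopes.

    ASSUMPTION: inverse-variance weights (1 / SE^2); the paper only says
    "weighted mean".
    """
    if rng is None:
        rng = random.Random(0)
    n = len(x)
    num = den = 0.0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample subjects with replacement
        slope, se = univariate_ols([x[i] for i in idx], [y[i] for i in idx])
        num += slope / se**2
        den += 1.0 / se**2
    return num / den


def standardized_scores(X, weights):
    """Weighted sum per subject, scaled to mean 0 and variance 1 within the group."""
    raw = [sum(w * xij for w, xij in zip(weights, row)) for row in X]
    mu, sd = statistics.fmean(raw), statistics.pstdev(raw)
    return [(r - mu) / sd for r in raw]


# Synthetic stand-in for one subgroup: 200 subjects, 3 selected proteins.
data_rng = random.Random(1)
X = [[data_rng.gauss(0, 1) for _ in range(3)] for _ in range(200)]
y = [0.8 * row[0] - 0.5 * row[1] + data_rng.gauss(0, 0.5) for row in X]

weights = [
    bootstrap_weight([row[j] for row in X], y, rng=random.Random(j))
    for j in range(3)
]
scores = standardized_scores(X, weights)
```

Running `standardized_scores` separately on the male and female subsets would mirror the within-subgroup standardization described in the text.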

Once the scores were created, we fit several multiple linear regression models on the full data: (i) Established Risk Factors (ERF) Model, (ii) ERF + Common Protein Score, (iii) ERF + Common Gene Score, (iv) ERF + Common Protein and Gene Scores, (v) ERF + Subgroup Protein Score, (vi) ERF + Subgroup Gene Score, and (vii) ERF + Subgroup Protein and Gene Scores. Table 3 shows the coefficient estimates with 95% confidence intervals and |$P$|-values. Both the common and subgroup-specific protein scores were statistically significant (|$P$| = 0.005 in model (ii) and |$P$| < 0.001 in model (v)), but neither the common nor the subgroup-specific gene score was (|$P$| = 0.185 in model (iii) and |$P$| = 0.180 in model (vi)). This could be due to including too few genes in the scores or to a large overlap between the genes selected for males and females. The “stable” proteins we identified as common and specific to males and females could be explored to further our understanding of sex differences in COPD mechanisms.
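To make the model comparison concrete, the short sketch below tabulates the gain in |$R^{2}$| and adjusted |$R^{2}$| of each model over the ERF model, using the values reported in Table 3. The helper `adjusted_r2` implements the standard adjustment formula; the sample size and predictor count passed to it in the example call are hypothetical, since the number of observations used in the fits is not restated here.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for a linear model with n observations and p predictors."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


# (R^2, adjusted R^2) for each model, copied from Table 3.
table3 = {
    "ERF": (0.152, 0.147),
    "ERF + Common Protein Score": (0.157, 0.151),
    "ERF + Common Gene Score": (0.153, 0.147),
    "ERF + Common Scores": (0.158, 0.151),
    "ERF + Subgroup Protein Score": (0.165, 0.159),
    "ERF + Subgroup Gene Score": (0.153, 0.147),
    "ERF + Subgroup Scores": (0.166, 0.159),
}

base_r2, base_adj = table3["ERF"]
gains = {
    name: (round(r2 - base_r2, 3), round(adj - base_adj, 3))
    for name, (r2, adj) in table3.items()
    if name != "ERF"
}

# Largest improvement first.
for name, (d_r2, d_adj) in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:30s} dR2={d_r2:+.3f}  d(adj R2)={d_adj:+.3f}")

# Hypothetical n and p, only to show how the penalty for extra predictors works.
example = adjusted_r2(0.50, n=101, p=10)
```

The gene scores leave adjusted |$R^{2}$| unchanged relative to ERF, which is consistent with their non-significant coefficients, while the subgroup protein score gives the largest gain.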

Conclusion

We have tackled the problem of accounting for subgroup heterogeneity in an integrative analysis framework. Motivated by the COPDGene study and a scientific need to understand sex differences in COPD, we developed appropriate statistical methods that leverage the strengths of multi-view data, account for subgroup heterogeneity, incorporate clinical covariates, and combine the association step with a clinical outcome step to guide the selection of clinically meaningful molecular signatures. Through the use of a hierarchical penalty, we identify omics signatures that are common and subgroup-specific and can predict a clinical outcome. HIP showed comparable to substantially improved prediction and variable selection performance in simulation settings when compared with existing methods.

When we applied HIP to genomic and proteomic data from COPDGene, we identified protein and gene biomarkers and pathways common and specific to males and females. When the proteins and genes were developed into scores, the common and subgroup-specific protein scores were statistically significant predictors of AWT even when including established risk factors of COPD. These findings suggest that the proteins and genes identified to be common and specific to males and females could be explored to further our understanding of sex differences in COPD mechanisms.

Recently, [31] also explored gene signatures related to AWT and found that interferon-stimulated genes were associated with AWT. We did not find these same genes in our analysis, but several differences between the analyses could explain the differing results: (i) the subset of COPDGene participants differed, as we only included participants with COPD, while [31] included participants with and without COPD, (ii) [31] looked for associations between individual genes and AWT while adjusting for covariates, whereas we selected genes based on rankings from our model that included several genes at once, (iii) we considered both gene and protein data (which also impacted which participants we could include), whereas [31] only considered genes, and (iv) we used IPA [26] to find pathways, whereas [31] used MSigDB (https://www.gsea-msigdb.org/gsea/msigdb).

HIP has some limitations that warrant further research. First, the number of variables to keep for the subset model refit must be specified. In simulations where this value is known, performance is very good, but the truth will not be known in applied settings. Users could inspect plots of the weights from |$\hat{\boldsymbol B}^{d,s}$| to see how many variables appear to have large weights. We also found that when some splits had very small training MSEs but very large test MSEs (i.e. evidence of overfitting), more variables needed to be retained. Second, the tuning range for |$\lambda _{\xi }$| and |$\lambda _{g}$| is not determined by the data, so it may need to be adjusted to attain optimal sparsity; this can be done with an optional parameter in the code. Third, the number of components, |$K$|, needs to be specified. Although the truth can never be known, we provide an automatic method to select |$K$| and discuss other options in the supplemental material. Future research should explore the possibility that |$K$| may differ by data view. Finally, HIP is limited to cross-sectional data, but future work could extend it to accommodate longitudinal data to determine whether trends in some outcome vary by subgroup. Despite these limitations, HIP advances statistical methods for joint association and prediction of multi-view data, and the encouraging simulation and real data findings motivate further applications.
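The two informal diagnostics mentioned above (inspecting the weights and checking for splits with small training but large test MSE) could be sketched as follows. This is an illustration only: it assumes |$\hat{\boldsymbol B}^{d,s}$| is stored with one row per variable and summarizes each row by its L2 norm, and the cutoff `ratio=10.0` and the function names are hypothetical choices rather than part of HIP.

```python
import math


def ranked_row_norms(B):
    """Return (variable index, L2 norm of its row) pairs, largest norm first."""
    norms = [(j, math.sqrt(sum(b * b for b in row))) for j, row in enumerate(B)]
    return sorted(norms, key=lambda t: t[1], reverse=True)


def overfit_splits(train_mse, test_mse, ratio=10.0):
    """Indices of splits whose test MSE exceeds `ratio` times the training MSE.

    ASSUMPTION: the threshold is an arbitrary illustrative value.
    """
    return [
        i for i, (tr, te) in enumerate(zip(train_mse, test_mse)) if te > ratio * tr
    ]


# Toy weight matrix: 5 variables (rows) by K = 2 components (columns).
B_hat = [
    [0.90, -0.40],
    [0.02, 0.01],
    [0.00, 0.00],
    [-0.70, 0.30],
    [0.03, -0.02],
]
ranking = ranked_row_norms(B_hat)
for j, norm in ranking:
    print(f"variable {j}: row norm {norm:.3f}")

# Toy per-split errors: the second split fits training data well but tests poorly.
flags = overfit_splits(train_mse=[0.50, 0.02, 0.45], test_mse=[0.60, 3.10, 0.55])
```

In this toy example two variables clearly dominate the ranking, and the flagged split would suggest retaining more variables in the refit.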

Key Points

Many complex diseases are characterized by subgroup differences, yet most existing integrative analysis methods ignore differential outcomes in subgroups. We have tackled the problem of accounting for subgroup heterogeneity in data integration.
We have developed HIP, a statistical method that leverages the strengths of multi-view data, accounts for subgroup heterogeneity, incorporates clinical covariates, and couples data integration with a clinical outcome prediction, yielding clinically meaningful molecular signatures and enhancing interpretability.
We have conducted rigorous simulation studies and found HIP to be competitive. The COPD data application showcases the potential application of HIP in identifying molecular signatures shared by, and unique to, males and females, and predictive of airway wall thickness.
We have developed efficient PyTorch code for users with programming expertise and a Shiny App for users without programming expertise, to promote research inclusion and accessibility.

Author contributions

S.E.S. and Q.L. conceived of the idea. S.E.S., J.B., and L.E. developed the methods. J.B. and S.E.S. developed code to implement the methods. J.B. and L.V. conducted simulations and real data analyses. J.B. and C.W. interpreted results from the real data analyses. J.B. and S.E.S. wrote a first draft of the paper. All authors read and edited the final manuscript.

Conflict of interest: The authors declare that they have no competing interests.

Funding

This work was supported by the National Center for Advancing Translational Sciences (5KL2TR002492), the National Institute of General Medical Sciences (1R35GM142695), and NHLBI U01 HL089897 and U01 HL089856. The COPDGene study (NCT00608764) is also supported by the COPD Foundation through contributions made to an Industry Advisory Committee that has included AstraZeneca, Bayer Pharmaceuticals, Boehringer-Ingelheim, GlaxoSmithKline, Genentech, Novartis, Pfizer, and Sunovion. The views expressed in this article are those of the authors and do not reflect the views of the United States Government, the Department of Veterans Affairs, the funders, the sponsors, or any of the authors’ affiliated academic institutions.

Data availability

Access to the clinical and genomic data can be requested through dbGaP (IDs: phs000951.v4.p4 and phs000179.v6.p2). The proteomic data can be requested from the COPDGene Study Group (http://www.copdgene.org/).

The Python source code for implementing the methods and generating simulated data along with README files is available on GitHub at https://github.com/lasandrall/HIP. We interface the Python code with the R programming language and provide an R-package for users without programming experience in Python. A Shiny Application of HIP for users with limited programming experience can be found at https://multi-viewlearn.shinyapps.io/HIP_ShinyApp/. We provide a video tutorial at https://youtu.be/O6E2OLmeMDo to facilitate the use of HIP.

Ethics approval and consent to participate

This research uses previously collected, de-identified data from the COPDGene Study [7], a multi-center study with 21 clinical sites each with local IRB approval (NCT00608764).

Consent for publication

Not applicable

References

1. Wheaton, Cunningham, Ford, et al. Employment and activity limitations among adults with chronic obstructive pulmonary disease—United States, 2013. MMWR Morb Mortal Wkly Rep 2015; 289.

2. GOLD. GOLD 2020 Report. https://goldcopd.org/wp-content/uploads/2019/11/GOLD-2020-REPORT-ver1.1wms.pdf, 2020 (20 May 2020, date last accessed).

3. Hardin, Silverman. Chronic obstructive pulmonary disease genetics: a review of the past and a look into the future. Chronic Obstr Pulm Dis 2014.

4. Zhou, Tian, et al. Risk of COPD from exposure to biomass smoke: a metaanalysis. Chest 2010; 138. doi: 10.1378/chest.08-2114

5. Pauwels, Buist, Calverley, et al. Global strategy for the diagnosis, management, and prevention of chronic obstructive pulmonary disease: NHLBI/WHO Global Initiative for Chronic Obstructive Lung Disease (GOLD) workshop summary. Am J Respir Crit Care Med 2001;163:1256. doi: 10.1164/ajrccm.163.5.2101039

6. Chung, Adcock. Multifaceted mechanisms in COPD: inflammation, immunity, and tissue repair and destruction. Eur Respir J 2008; 1334. doi: 10.1183/09031936.00018908

7. Regan, Hokanson, Murphy, et al. Genetic epidemiology of COPD (COPDGene) study design. COPD: J Chron Obstruct Pulmon Dis 2011. doi: 10.3109/15412550903499522

8. Barnes. Sex differences in chronic obstructive pulmonary disease mechanisms. Am J Respir Crit Care Med 2016;193:813. doi: 10.1164/rccm.201512-2379ED

9. Gan, Man, Postma, et al. Female smokers beyond the perimenopausal period are at increased risk of chronic obstructive pulmonary disease: a systematic review and meta-analysis. Respir Res 2006. doi: 10.1186/1465-9921-7-52

10. Kim Y-I, Schroeder, Lynch, et al. Gender differences of airway dimensions in anatomically matched sites on CT in smokers. COPD: J Chron Obstruct Pulmon Dis 2011; 285. doi: 10.3109/15412555.2011.586658

11. Prescott, Bjerg, Andersen, et al. Gender difference in smoking effects on lung function and risk of hospitalization for COPD: results from a Danish longitudinal population study. Eur Respir J 1997; 822. doi: 10.1183/09031936.97.10040822

12. Safo, Min, Haine. Sparse linear discriminant analysis for multiview structured data. Biometrics 2021; 612. Available: https://onlinelibrary.wiley.com/doi/abs/10.1111/biom.13458

13. Chekouo, Safo. Bayesian integrative analysis and prediction with application to atherosclerosis cardiovascular disease. Biostatistics 2020; 124. doi: 10.1093/biostatistics/kxab016

14. Luo, Liu, Dey, et al. Canonical variate regression. Biostatistics 2016; 468. doi: 10.1093/biostatistics/kxw001

15. Dondelinger, Mukherjee, The Alzheimer's Disease Neuroimaging Initiative. The joint lasso: high-dimensional regression for group structured data. Biostatistics 2018; 219. Available: https://doi.org/10.1093/biostatistics/kxy035

16. Wang, Huang C-C, et al. Meta-analysis based variable selection for gene expression data. Biometrics 2014; 872.

17. Paszke, Gross, Massa, et al. PyTorch: an imperative style, high-performance deep learning library. In: Wallach, Larochelle, Beygelzimer, d’Alché-Buc, Fox, Garnett (eds.), Advances in Neural Information Processing Systems 32. Curran Associates, Inc., 2019, 8024.

18. Gower, Dijksterhuis, et al. Procrustes Problems. Oxford University Press on Demand, 2004. doi: 10.1093/acprof:oso/9780198510581.001.0001

19. Luo, Chen. CVR: Canonical Variate Regression. 2017. R package version 0.1.1. Available: https://CRAN.R-project.org/package=CVR

20. Dondelinger, Wilkinson. Fuser: Fused Lasso for High-Dimensional Regression over Groups. 2018. R package version 1.0.1. Available: https://CRAN.R-project.org/package=fuser

21. Tibshirani. Regression shrinkage and selection via the lasso. J R Stat Soc Ser B 1996; 267. doi: 10.1111/j.2517-6161.1996.tb02080.x

22. Zou, Hastie. Regularization and variable selection via the elastic net. J R Stat Soc Ser B 2005; 301. doi: 10.1111/j.1467-9868.2005.00503.x

23. Friedman, Hastie, Tibshirani. Regularization paths for generalized linear models via coordinate descent. J Stat Softw 2010. Available: http://www.jstatsoft.org/v33/i01/. doi: 10.18637/jss.v033.i01

24. Maxwell, Campbell, Spira, et al. SubmiRine: assessing variants in microRNA targets using clinical genomic data sets. Nucleic Acids Res 2015; 3886.

25. Jaswal, Saini, Kaur, et al. Association of adiponectin with lung function impairment and disease severity in chronic obstructive pulmonary disease. Int J Appl Basic Med Res 2018. doi: 10.4103/ijabmr.IJABMR_65_17

26. Kramer, Green, et al. Causal analysis approaches in Ingenuity Pathway Analysis. Bioinformatics 2014; 523. Available: https://doi.org/10.1093/bioinformatics/btt703

27. Neves, Haider, Gassmann, et al. Iron homeostasis in the lungs—a balance between health and disease. Pharmaceuticals 2019.

28. Cloonan, Mumby, Adcock, et al. The “iron”-y of iron overload and iron deficiency in chronic obstructive pulmonary disease. Am J Respir Crit Care Med 2017;196:1103. doi: 10.1164/rccm.201702-0311PP

29. Salit, Kaner, Mezey, et al. Small airway epithelial responses associated with enhanced female susceptibility to smoking-related lung disease. American Thoracic Society 2019;199:A7096. Available: https://www.atsjournals.org/doi/abs/10.1164/ajrccm-conference.2019.199.1_MeetingAbstracts.A7096

30. Wang, Zhao, Raman, et al. Peripheral blood mononuclear cell gene expression in chronic obstructive pulmonary disease: miRNA and mRNA regulation. J Inflamm Res 2022; 2167.

31. Yun, Lee, Srinivasa, et al. An interferon-inducible signature of airway disease from blood gene expression profiling. Eur Respir J 2022:1–22. Available: https://erj.ersjournals.com/content/59/5/2100569