-
PDF
- Split View
-
Views
-
Cite
Cite
Jessica Butts, Leif Verace, Christine Wendt, Russel P Bowler, Craig P Hersh, Qi Long, Lynn Eberly, Sandra E Safo, HIP: a method for high-dimensional multi-view data integration and prediction accounting for subgroup heterogeneity, Briefings in Bioinformatics, Volume 25, Issue 6, November 2024, bbae470, https://doi.org/10.1093/bib/bbae470
- Share Icon Share
Abstract
Epidemiologic and genetic studies in many complex diseases suggest subgroup disparities (e.g. by sex, race) in disease course and patient outcomes. We consider this from the standpoint of integrative analysis where we combine information from different views (e.g. genomics, proteomics, clinical data). Existing integrative analysis methods ignore the heterogeneity in subgroups, and stacking the views and accounting for subgroup heterogeneity does not model the association among the views. We propose Heterogeneity in Integration and Prediction (HIP), a statistical approach for joint association and prediction that leverages the strengths in each view to identify molecular signatures that are shared by and specific to a subgroup. We apply HIP to proteomics and gene expression data pertaining to chronic obstructive pulmonary disease (COPD) to identify proteins and genes shared by, and unique to, males and females, contributing to the variation in COPD, measured by airway wall thickness. Our COPD findings have identified proteins, genes, and pathways that are common across and specific to males and females, some implicated in COPD, while others could lead to new insights into sex differences in COPD mechanisms. HIP accounts for subgroup heterogeneity in multi-view data, ranks variables based on importance, is applicable to univariate or multivariate continuous outcomes, and incorporates covariate adjustment. With the efficient algorithms implemented using PyTorch, this method has many potential scientific applications and could enhance multiomics research in health disparities. HIP is available at https://github.com/lasandrall/HIP, a video tutorial at https://youtu.be/O6E2OLmeMDo and a Shiny Application at https://multi-viewlearn.shinyapps.io/HIP_ShinyApp/ for users with limited programming experience.
Introduction
Epidemiologic and genetic studies suggest subgroup (e.g. sex) disparities exist for many complex diseases. Subgroups of a population can present similar symptoms but have different clinical courses and respond to therapy differently. By determining factors predictive of an outcome for each subgroup, we can better personalize treatments to improve patient outcomes.
Consider chronic obstructive pulmonary disease (COPD), a chronic progressive disease affecting more than 16 million adults, presenting a substantial and increasing economic and social burden [1]. Although tobacco smoking is the leading environmental risk factor for COPD, even in heavy smokers fewer than 50% develop COPD [2], genetics [3], environmental exposures [4], inflammation [5], and other factors [6] predispose individuals to develop COPD. The Genetic Epidemiology of COPD (COPDGene) Study [7] is one of the largest studies to investigate the underlying genetic factors of COPD to understand why certain smokers develop COPD while others do not. Research suggests that sex disparities exist in COPD mechanisms [8]. A meta-analysis of 11 studies showed that female smokers, even if smoking fewer cigarettes, had a faster annual decline in forced expiratory volume in one second (FEV|$_{1}$|) [9]. A study using COPDGene data found that women smokers tended to have higher airway wall thickness (AWT) compared with male smokers [10], likely explaining some of the sex differences in the prevalence of COPD. Women with severe COPD may be at higher risk for hospitalization and death [11]. These studies primarily used data from one source, so combining data from multiple sources has the potential to reveal new insights into sex differences in COPD mechanisms. Motivated by the crucial scientific need to understand sex differences in COPD, we leverage the strengths from multiple data views from the COPDGene Study to identify genes and proteins ’common among’ and ’specific to’ males and females contributing to variation in AWT.
Existing methods for integrating data from multiple views are inadequate for our problem as they do not account for subgroup heterogeneity. In particular, one-step methods have been proposed for joint association of data from multiple views and simultaneous prediction of an outcome [12–14]. To do this, one would build a separate integrative analysis model for each subgroup to determine the important multidimensional variables that are associated and predictive of the outcome for each subgroup. While this approach is intuitive, it is limited by the sample size for each subgroup and does not pool information across subgroups making estimation challenging. This is especially true for high-dimensional data settings where the number of variables is larger than the sample size for each subgroup. Another approach that makes use of samples in all subgroups is to apply these one-step methods on the combined subgroup data, but this precludes us from examining whether such heterogeneity exists.
The need to account for subgroup heterogeneity has been recognized and studied in the case where there is only one data view. Reference [15] proposes the Joint Lasso to jointly estimate regression coefficients for different subgroups while allowing for the identification of subgroup-specific features and also encouraging similarity between subgroup-specific coefficients. In [16], the authors proposed the meta lasso for feature selection for different studies (in our application, subgroups) that incorporates a hierarchical penalty to borrow strength across different studies while allowing for feature selection flexibility. The goal of the meta lasso is to combine data sets with the same variables measured on distinct subjects from separate studies to improve variable selection across all data sets when considering a binary outcome; it is not specifically designed to account for subgroup heterogeneity and does not consider multiple data views obtained on the same set of subjects. To use these existing methods, one would stack the different data views for each subgroup; this approach assumes the many variables across the data views are independent and ignores the overall dependency structure among the different views.
We make three main contributions in this article. First, we propose integrative analysis and prediction methods that account for subgroup heterogeneity and are appropriate for our motivating data by modifying the hierarchical penalty proposed in [16] to improve the power to identify common and subgroup-specific features (Fig. 1). Second, the methods we propose, called Heterogeneity in Integration and Prediction (HIP), allow for one or more continuous outcomes and can force specified covariates into the model giving HIP more flexibility than current methods. Third, we develop computationally efficient algorithms using PyTorch [17]. We apply the methods to our motivating data from the COPDGene Study to identify genes and proteins common across and specific to males and females and associated with AWT. We then explore enriched pathways and the ability of these omics biomarkers to predict AWT beyond some established COPD risk factors.

Simplified overview of general data integration and HIP; HIP allows to select variables unique to a particular subgroup and shared between subgroups compared withgeneral data integration methods whose variable selection approach is agnostic to heterogeneity existing in subgroups.
The remainder of the paper is structured as follows. In Methods section 2, we present the proposed methods (HIP). In Simulation section 3, we conduct simulation studies to assess the performance of HIP in comparison with existing methods. In Real data analysis section 4, we apply HIP to data from the COPDGene Study. We conclude with some brief discussion in Conclusion section 5.
Materials and Methods
Notation and problem
Suppose we have |$D$| views (e.g. genomics, proteomics, clinical) with |$p_{d}$| variables measured on the same |$N$| subjects. Each view has |$s =1,\ldots ,S$| subgroups known a priori, each with sample size |$n_{s}$|, where |$N=\sum _{s=1}^{S}n_{s}$|. For subgroup |$s$| and view |$d$|, |$\boldsymbol X^{d,s} \in \Re ^{n_{s} \times p_{d}}$| represents the data matrix. Assume we also have outcome data for each subgroup. For a continuous outcome(s) (e.g. AWT), we have matrix |$\boldsymbol Y^{s} \in \Re ^{n_{s} \times q}$|, where |$q$| is the number of outcomes. Our primary goal is to perform integrative analysis that considers the overall dependency structure among views, predicts an outcome, incorporates feature ranking, and accounts for subgroup heterogeneity to identify common and subgroup-specific variables contributing to variation in the outcome. Figure 1 is a visual representation of the proposed method, HIP, compared with general data integration methods.
Integration of multi-view data
To relate the views within each subgroup, we assume that there are subgroup-specific scores (|$\boldsymbol Z^{s}$|) that drive the dependency structure among the views. Then, each view is written as the product of the subgroup-specific scores |$\boldsymbol Z^{s} \in R^{n_{s} \times K}$| and a matrix of view and subgroup-specific loadings |$\boldsymbol B^{d,s} \in R^{p_{d} \times K}$| plus a matrix of errors: |$\boldsymbol X^{d,s} = \boldsymbol Z^{s} {\boldsymbol B^{d,s}}^{T} + \boldsymbol E^{d,s}$|. Here, |$K$| is the number of components used to approximate each view. The |$\boldsymbol Z^{s}$| incorporates the correlation across the |$D$| views for subgroup |$s$|, and |$\boldsymbol E^{d,s}$| accounts for the remaining variability unique to view |$d$| for subgroup |$s$|. In optimizing |$\boldsymbol Z^{s}$| and |${\boldsymbol B^{d,s}}$|, we want to minimize the error in reconstructing |$\boldsymbol X^{d,s}$|, i.e. |$\boldsymbol E^{d,s}$|, via the loss function |$F(\boldsymbol X^{d,s}, \boldsymbol Z^{s}, \boldsymbol B^{d,s})= \|\boldsymbol X^{d,s} - \boldsymbol Z^{s} {\boldsymbol B^{d,s}}^{T}\|_{F}^{2}$|. For a random matrix |$\boldsymbol A$|, |$\|\boldsymbol A\|_{F}^{2}$| is the square of the Frobenius norm defined as trace(|$\boldsymbol A^{T} \boldsymbol A$|).
The decomposition of |$\boldsymbol X^{d,s}$| in our approach is motivated by a principal components framework rather than a factor analytic framework as we do not impose any distribution on |$\boldsymbol Z^{s}$| or |$\boldsymbol E^{d,s}$|. Typically, we would require |${\boldsymbol B^{d,s}}^{T} {\boldsymbol B}^{d,s} = \boldsymbol I$|, and |${\boldsymbol Z^{s}}^{T} {\boldsymbol Z^{s}} = \boldsymbol I$| for uniqueness, but we do not require these constraints because we are interested in whether a variable’s estimated coefficients in |$\boldsymbol B^{d,s}$| are zero or not. Since we propose to use a penalty that encourages row-sparsity, we preserve the sparsity pattern in |$\boldsymbol B^{d,s}$| over matrix multiplication. Furthermore, we only use |$\boldsymbol Z^{s}$| to predict the clinical outcome and not to make inference on the estimates in |$\boldsymbol Z^{s}$|.
Hierarchical penalty for common and subgroup-specific feature ranking
A main goal in this paper is to identify common and subgroup-specific features associated with an outcome. Based on the hierarchical reparameterization proposed in [16], we decompose |$\boldsymbol B^{d,s}$| as the element-wise product of |$\boldsymbol G^{d}$| and |$\boldsymbol \Xi ^{d,s}$|, i.e. |$\boldsymbol B^{d,s} = \boldsymbol G^{d} \cdot \boldsymbol \Xi ^{d,s}$| for |$d=1,\ldots ,D$| and |$s=1,\ldots ,S$| to estimate effects that are common across subgroups using |$\boldsymbol G^{d} \in \Re ^{p_{d} \times K}$| while also allowing for heterogeneity in the subgroups through |$\boldsymbol \Xi ^{d,s} \in \Re ^{p_{d} \times K}$|. If there is no heterogeneity, then |$\boldsymbol \Xi ^{d,s}$| is a matrix of ones for all |$s$|, and |$\boldsymbol G^{d} = \boldsymbol B^{d}$|, i.e. the view-specific loadings are the same for all subgroups. In estimating |$\boldsymbol G^{d}$|, we borrow strength across subgroups for increased power. In this reparameterization, exact values of |$\boldsymbol G^{d}$| and |$\boldsymbol \Xi ^{d,s}$| are not identifiable but also are not directly needed for variable ranking.
We use regularization to induce sparsity by adding the block |$l_{1}/l_{2}$| penalty on |$\boldsymbol G^{d}$| and |$\boldsymbol \Xi ^{d,s}$|
Here, |$\boldsymbol g_{l}^{d}$| and |$\boldsymbol \xi _{l}^{d,s}$| are the |$l$|th rows in |$\boldsymbol G^{d}$| and |$\boldsymbol \Xi ^{d,s}$|, respectively, and are each length |$K$|. By imposing the block |$l_{1}/l_{2}$| penalty on the rows of |$\boldsymbol G^{d}$| and |$\boldsymbol \Xi ^{d,s}$|, the |$K$| components are considered as a group, encouraging variables to be selected in all |$K$| components or not to be selected. This is desirable because the selection of variables is not component-dependent and thus appropriate for variable screening. This differs from the original hierarchical penalty reparameterization proposed in [16], which imposes an |$l_{1}$| penalty. Both |$\lambda _{G}$| and |$\lambda _{\xi }$| are tuning parameters controlling feature selection. Specifically, |$\lambda _{G}$| controls feature selection for all subgroups combined and encourages removal of variables that are not important for all |$S$| subgroups. Also, |$\lambda _{\xi }$| encourages feature selection for each subgroup. Further details on the selection of |$\lambda _{G}$| and |$\lambda _{\xi }$| are given in the Supplementary Material. The |$\gamma _{d}$| is a user-specified indicator for whether the view should be penalized. This is to allow some views, such as a set of clinical covariates, to be forced into the model to guide the selection of other important variables, which can result in better prediction of the outcome.
Relating shared scores to clinical outcome(s)
Besides identifying the common and subgroup-specific features, we aim to predict a clinical outcome while allowing for heterogeneity in effects based on the subgroup and multi-view data. We assume that the outcome is related to the views only through the shared scores, i.e. |$Z^{s}$|, for each subgroup. This allows us to couple the problem of associations among different views and predicting an outcome. We relate the outcome to |$\boldsymbol Z^{s} $| by minimizing a loss function |$\sum _{s=1}^{S} F(\boldsymbol Y^{s}, \boldsymbol Z^{s}, \boldsymbol \Theta , \beta _{0})$|. For continuous outcome(s), |$F(\boldsymbol Y^{s}, \boldsymbol Z^{s}, \boldsymbol \Theta , \beta _{0}) = || \boldsymbol Y^{s} - (\beta _{0} + \boldsymbol Z^{s} \boldsymbol \Theta )||_{F}^{2}$|, where |$\boldsymbol \Theta \in \Re ^{K \times q}$| are regression coefficients.
We can impose the constraint that the columns of |$\boldsymbol Z^{s}$| are uncorrelated (|$\boldsymbol Z^{s^{T}}\boldsymbol Z^{s}=\boldsymbol I$|) so each of the |$K$| components provides unique information. Of note, our goal is not to interpret the coefficients |$\boldsymbol \Theta $| as in regression analysis; our goal is to rank the variables corresponding to those coefficients. In our applications, we standardize each column of |$\boldsymbol Y^{s}$| to have mean |$0$| and variance |$1$| at the subgroup level, but this is not necessary because of the estimation of the intercept |$\beta _{0}$|.
Our proposal to model one or more continuous outcomes in a one-step integrative analysis model that provides predictions based on rank-selected features, allowing for subgroup heterogeneity in those features, is novel and will be of use in many scientific applications.
Joint model for integration and prediction
Typical integration and prediction methods follow two steps. First, the subgroup-specific scores |$\boldsymbol Z^{s}$| and view and subgroup-specific loadings |$\boldsymbol B^{d,s}$| (hence common and subgroup-specific variables, |$\boldsymbol G^{d}$| and |$\boldsymbol \Xi ^{d,s}$|) are learned. Second, the learned |$\boldsymbol Z^{s}$| are associated with the outcome in a regression model. Since these steps are independent, the common and subgroup-specific variables identified may not be meaningfully connected to the clinical outcome. To overcome this limitation, we use the outcome to guide the selection of the common and subgroup-specific variables in a joint model. Thus, we use HIP to estimate the following: view and subgroup-specific loadings |$\boldsymbol B^{d,s}$| (|$\boldsymbol G^{d}$|, the common variables, and |$\boldsymbol \Xi ^{d,s}$|, the subgroup-specific variables), the subgroup-specific scores shared across views (|$\boldsymbol Z^{s}$|), and the regression estimates (|$\boldsymbol \Theta $|, |$\beta _{0}$|). To obtain these estimates, we combine the outcome loss function, the multi-view loss function, and the regularization penalty to minimize the following overall loss function:
Although versions of the hierarchical penalty have been used before, our paper is among the first to use this penalty in joint association and prediction studies for data from multiple views to account for common and subgroup-specific variation and to extract subgroup-specific features and/or clinical variables.
Because sparsity will not be induced directly due to numerical limitations of the automatic differentiation algorithm, we identify important variables by ranking according to the |$L_{2}$| norm of the corresponding row in |$\hat{\boldsymbol B}^{d,s}$|. The user specifies the number of variables (denote as |$N_{top}$|) they wish to keep; this value can vary across data views. Algorithm 1 (Supplementary Material) is run once on the full training data and the |$N_{top}$| variables are selected for each view and subgroup based on the estimated |$\boldsymbol B^{d,s}$|. We then run Algorithm 1 a second time including only these selected variables. This ”subset” result should be used when applying the prediction procedure.
The optimization problem in (2) is multi-convex in |$\boldsymbol B^{d,s}$|, |$\boldsymbol Z^{s}$|, and |$\boldsymbol \Theta $| each but jointly non-convex. As such, we are not guaranteed convergence to a global minimum. A local optimum can be found by iteratively minimizing over each of the optimization parameters with the rest of the optimization parameters fixed. Overall algorithm convergence is determined by the relative change in the objective function in (2) without the penalty terms. The full algorithm is summarized in the Supplementary Material. We also describe in the Supplementary Material our approach for selecting hyper-parameters.
Prediction
In order to predict an outcome on new data (say |$\boldsymbol X^{d,s}_{test}$|), we first predict the test shared component, |$\boldsymbol Z^{s}_{pred}$|, and then use this information to predict the outcome. To predict |$\boldsymbol Z^{s}_{pred}$|, we learn the model defined by (2) on the training data (i.e. |$\boldsymbol X^{d,s}_{train}$|) and obtain the learned estimates |$\hat{\boldsymbol B}^{d,s}$|, |$\hat{\boldsymbol \Theta }$|, and |$\hat{\beta }_{0}$|. Using these estimates and the testing data |$\boldsymbol X^{d,s}_{test}$|, we solve the problem in (3).
Without an orthogonality condition on |$\boldsymbol Z^{s}$|, the solution of this problem has a closed form given by |$\hat{\boldsymbol Z}^{s}_{pred} = \boldsymbol X_{cat}^{s} {\hat{\boldsymbol B}_{cat}^{s}}({\hat{\boldsymbol B}_{cat}^{s^{T}}} {\hat{\boldsymbol B}_{cat}^{s}})^{-1}$| for |$s=1,\ldots ,S$|. Here, |$\boldsymbol X_{cat}^{s}$| is an |$n_{s} \times \{p_{1} + \cdots + p_{d}\}$| matrix that concatenates all |$D$| views for subgroup |$s$|, i.e. |$X_{cat}^{s}=[ \boldsymbol X^{1,s}_{test},\cdots ,\boldsymbol X^{D,s}_{test}]$|. Similarly, |$\hat{\boldsymbol B}_{cat}= [\hat{\boldsymbol B}^{1,s},\cdots , \hat{\boldsymbol B}^{D,s}]$| and is a |$\{p_{1} + \cdots + p_{D} \} \times K$| matrix of variable coefficients. We add a small multiple of the identity matrix before taking the inverse to help with stability, although we note that the inverse is of a |$K \times K$| matrix, which is computationally inexpensive since |$K$| is typically small. With an orthogonality condition on |$\boldsymbol Z^{s}$|, the above optimization for |$\hat{\boldsymbol Z}^{s}_{pred}$| is an orthogonal Procrustes problem [18]. Let the singular value decomposition of |$\boldsymbol X_{cat}^{s} {\hat{\boldsymbol B}_{cat}^{s}}$| be |$\boldsymbol U \boldsymbol D \boldsymbol V^{T}$|. Then, |$\hat{\boldsymbol Z}^{s}_{pred} = \boldsymbol U \boldsymbol V^{T}$|. Once we have obtained |$\hat{\boldsymbol Z}^{s}_{pred}$|, we predict a continuous outcome |$\hat{\boldsymbol Y}^{s}_{pred}$| as |$\hat{\boldsymbol Y}^{s}_{pred} = \hat{\beta }_{0} + \hat{\boldsymbol Z}_{pred}^{s} \hat{\boldsymbol \Theta }$|.
Simulations
Set-up
We performed simulations for a single continuous outcome with two views and two subgroups, i.e. |$D=S=2$|. There were |$n_{1} = 250$| subjects in the first subgroup and |$n_{2} = 260$| subjects in the second. There were two different scenarios to test the ability of the algorithm to perform variable ranking and prediction: Full Overlap and Partial Overlap (Fig. 2). In the Full Overlap scenario, the signal variables for each subgroup completely overlapped, i.e. the first |$50$| variables of the |$\boldsymbol{B}^{d,s}$| matrices were important for both subgroups. We expect competing methods to perform relatively well in this scenario as there is no subgroup heterogeneity. In the Partial Overlap scenario, the first |$50$| variables are important for the first subgroup. Of these |$50$|, the last |$25$| are also important to the second subgroup in addition to the |$25$| subsequent variables. We expect some deterioration in the ability of the comparison methods to select the appropriate variables due to subgroup heterogeneity. We also considered simulation settings where (i) the sample sizes for the two groups were unbalanced and (ii) the error term for each view was non-Gaussian, to investigate the robustness of the proposed method. Please refer to the supporting information for simulation results for these scenarios.

Visual representation of variable overlap scenarios; in the Full Overlap scenario, the signal variables for each subgroup completely overlapped, i.e. the same variables were important for both subgroups, and in the Partial Overlap Scenario, half of the variables important to each subgroup are the same, and the remaining important variables are unique to each subgroup.
For each example and scenario, there were three different numbers of variables in the data sets with |$p_{d}$| indicating the number of variables in view |$d$|. In the P1 setting, |$p_{1} = 300$| and |$p_{2} = 350$|. In the P2 setting, |$p_{1} = 1000$| and |$p_{2} = 1500$|. Finally, in the P3 setting, |$p_{1} = 2000$| and |$p_{2} = 3000$|. For these simulations, |$K$| was fixed to the true value of |$2$|. We also set |$N_{top}$| to the true value of 50 for all simulations.
The data generation process is based on [14]. First, the |$\boldsymbol{B}^{d,s}$| matrices are generated according to the Full or Partial Overlap scenario. If the entry corresponds to a signal variable, it is drawn from a |$U(0.5,1)$| with the sign determined by a draw from a Bernoulli distribution with equal probability; otherwise, it is set to 0. We then orthogonalize the columns of each |$\boldsymbol{B}^{d,s}$|. Next, we generate the entries of |$\boldsymbol Z^{s} \sim N(\mu = 25.0, \sigma = 3.0)$| and |$\boldsymbol{E}^{d,s} \sim N(\mu = 0.0, \sigma = 1.0)$|. Then, the data matrix for subgroup |$s$| in view |$d$| is generated as |$\boldsymbol X^{d,s}$| = |$\boldsymbol Z^{s} \boldsymbol B^{{d,s}^{T}} + \boldsymbol E^{d,s}$|. Finally, the outcome is generated as |$\boldsymbol Y^{s} = \beta _{0} + \boldsymbol Z^{s} \boldsymbol \Theta + \boldsymbol E^{s}$|, where |$\boldsymbol E^{s} \in R^{n_{s} \times 1}$| contains entries from a standard normal distribution. The true value of |$\boldsymbol \Theta = [0.7, 0.2]^{T}$| and |$\beta _{0} = 2.0$|. In the supporting information, we consider additional simulation settings where |$\boldsymbol E^{d,s} \sim t_{\nu }$| (i.e. follows a |$t$| distribution with |$\nu $| degrees of freedom), where |$\nu = 5, 10, 25, 50, 100$|.
Comparison methods
First, we compare our proposed method (HIP) to canonical variate regression (CVR) [14] as implemented in R package CVR [19]. This is a joint association and prediction method for multiple views (though existing code only implements two views), but it does not account for subgroup heterogeneity. Thus, we implement the method in two ways: (i) all subgroups are concatenated in each view (Concatenated CVR) and (ii) a separate model is fit for each subgroup (Subgroup CVR). Second, we compare our method to the Joint Lasso [15] as implemented in R package ref20 [20]. The Joint Lasso does not perform integrative analysis but does account for subgroup heterogeneity. We implement this method on the data stacked over views (Concatenated Joint Lasso), but because Joint Lasso allows for subgroups, we also apply the method on each view separately (Dataset Joint Lasso). Third, we compare our method to the Lasso [21] and Elastic Net [22] as implemented in R package ref23 [23] using both the concatenated and separate subgroup models (Concatenated Lasso/Elastic Net and Subgroup Lasso/Elastic Net, respectively); we stack the two views in each case. For the Elastic Net, we fixed |$\alpha = 0.5$|. In fitting the models, we allowed any non-fixed tuning parameters to be chosen using the default in the corresponding R package. For Joint Lasso, there is no function for choosing the two tuning parameters in the R package, so we implemented a grid search over 55 parameter combinations. We applied each method to the training data sets and predicted the outcome on the test data sets. We do not compare with the meta lasso because it is only available for binary outcomes and thus not applicable to the motivating COPD data.
Evaluation measures
We compare HIP to existing methods in terms of variable selection and prediction. For variable selection, we estimate the true positive rate (TPR), false positive rate (FPR), and F|$1$| score. All are constrained to the range |$[0,1]$|. Note |$\text{TPR} = \frac{\text{True Positives}}{\text{True Positives + False Negatives}}$|, and |$\text{FPR} = \frac{\text{False Positives}}{\text{True Negatives + False Positives}}$|. Also, |$\text{F1} = \frac{\text{True Positives}}{\text{True Positives} + \frac{1}{2}\text{(False Positives + False Negatives)}}$|. Ideally, TPR and F|$1$| are 1 and FPR is |$0$|.
For HIP, the variables are ranked by the |$L_{2}$| norm of the rows in the estimated |$\boldsymbol B^{d,s}$| matrices. For each comparison method, the result includes some kind of regression coefficients, so variables with an estimated coefficient that has been shrunk zero are considered not selected and those with non-zero estimated coefficients are considered selected. For prediction, we estimated test mean squared error (MSE); smaller MSEs indicate better performance. We averaged results over our 20 test data sets.
Simulation Results
In the Full Overlap scenario, we compare HIP (Grid) and HIP (Random) and find similar results. This supports using HIP (Random) over HIP (Grid) as it is faster computationally (Supplementary Fig. S4). Looking at Fig. 3, we note that HIP has a TPR and F|$1$| close to |$1$| and FPR close to |$0$|. CVR is the closest competing method for F|$1$| score. For Joint Lasso, the FPR is fairly high, so it is selecting a lot of unimportant variables. Joint Lasso, Lasso, and Elastic Net have lower TPR values suggesting that these methods are missing important variables. Overall, the competing methods have worse and more variable TPR, FPR, and F|$1$| scores compared with HIP. In terms of prediction, HIP and Subgroup CVR have lower test MSEs than the other methods. Even when subgroups share the same important variables, we see advantages in taking an integrative approach and accounting for heterogeneity. The results are mostly consistent across P1, P2, and P3, although P3 does show some deterioration in performance and increased variability. The Lasso and Elastic net are the fastest in all parameter settings followed by HIP (Random). HIP (Random) shows a larger computational advantage over CVR and Joint Lasso as the number of variables increase (Supplementary Figs S4 and S5). The Partial Overlap scenario results (Supplementary Fig. S3) are similar to the Full Overlap results but show a greater advantage for HIP in variable selection performance. Partial Overlap Scenario results for non-Gaussian errors (Supplementary Fig. S6) are similar to the Partial Overlap Scenario results for Gaussian errors: the prediction performance for HIP was better than, or comparable to, the other methods.

Results for Full Overlap scenario; the first row corresponds to P1 (|$p_{1}$| = 300, |$p_{2}$| = 350), the second to P2 (|$p_{1}$| = 1000, |$p_{2}$| = 1500), and the third to P3 (|$p_{1}$| = 2000, |$p_{2}$| = 3000); for all settings, |$n_{1} = 250$| and |$n_{2} = 260$|, and the right column is test MSE, so a lower value indicates better performance; all results are based on 20 iterations.
The subgroup variable selection performance for HIP is better than the methods compared even in the highly unbalanced sample design (i.e. 9:1) (Supplementary Fig. S7). In general, we did observe a deterioration in the variable selection performance for Subgroup 2 (i.e. group with the smaller sample size) as the sample sizes became unbalanced. However, the performance of HIP was better than that of the other methods. In terms of prediction, HIP had consistent and better (or lower) test MSEs (|$\sim 0.25$|) for all sample size designs, but the results for CVR and Joint Lasso varied (Supplementary Fig. S8). HIP generally showed consistent and better subgroup prediction estimates for varying sample sizes (Supplementary Fig. S7). There was some variation in the prediction performance for Subgroup 2 for all methods. Most of the methods had poor prediction performance in Subgroup 2, except HIP (Supplementary Fig. S7).
Real data analysis
Study goals
As mentioned previously, sex disparities exist in COPD susceptibility. In this section, our goal is to use molecular data from the COPDGene Study [7] in combination with clinical data to gain new insights into the molecular architecture of COPD in males and females. Sex was self-reported. We focus on individuals with COPD (defined as GOLD stage |$\ge 1$|) at Year 5 (Phase 2 of the COPDGene) Study who had proteomics, RNA-sequencing, and AWT data available at Year 5. Of the |$N=1376$| individuals with COPD at Year 5 who had complete data, |$n_{1}=782$| were males and |$n_{2}=594$| were females. Table 1 gives some characteristics of subjects who had COPD at Year 5. We assessed for sex differences using t-tests for continuous variables and |$\chi ^{2}$| tests for categorical variables. Subjects were predominantly non-Hispanic white, but there were no sex differences. There were also not sex differences in age, BMI, systolic blood pressure, percentage of current smokers, or percentage with diabetes. Males and females differed in their mean AWT (|$P$||$<.001$|) but did not differ by lung function as measured by mean FEV|$_{1}$|% predicted. Given the available data and the sex differences in AWT, we will (i) identify genes and proteins common and specific to males and females associated with AWT, (ii) explore pathways enriched in the proteins and genes identified for males and females, and (iii) investigate the effect of these proteins and genes on AWT, adjusting for covariates.
COPDGene Participant Characteristics; the measurements presented were collected at the Year 5 study visit to align with the collection of proteomic and genomic data collection
Variable . | Males . | Females . | |$P$|-value . |
---|---|---|---|
. | |$N$| = 782 . | |$N$| = 594 . | . |
Age | 68.28 (8.35) | 68.03 (8.36) | 0.581 |
BMI | 28.03 (5.62) | 27.69 (6.59) | 0.317 |
FEV1 % Predicted | 61.94 (22.97) | 62.91 (22.59) | 0.431 |
BODE Index | 2.45 (2.45) | 2.63 (2.38) | 0.176 |
% Emphysema | 11.30 (11.86) | 9.39 (11.45) | 0.003 |
Pack Years | 53.05 (26.63) | 47.57 (24.99) | <0.001 |
AWT | 1.17 (0.23) | 1.00 (0.21) | <0.001 |
Non-Hispanic White (%) | 82 | 78 | 0.084 |
Current Smoker (%) | 66 | 65 | 0.510 |
Diabetes (%) | 17 | 14 | 0.204 |
Variable . | Males . | Females . | |$P$|-value . |
---|---|---|---|
. | |$N$| = 782 . | |$N$| = 594 . | . |
Age | 68.28 (8.35) | 68.03 (8.36) | 0.581 |
BMI | 28.03 (5.62) | 27.69 (6.59) | 0.317 |
FEV1 % Predicted | 61.94 (22.97) | 62.91 (22.59) | 0.431 |
BODE Index | 2.45 (2.45) | 2.63 (2.38) | 0.176 |
% Emphysema | 11.30 (11.86) | 9.39 (11.45) | 0.003 |
Pack Years | 53.05 (26.63) | 47.57 (24.99) | <0.001 |
AWT | 1.17 (0.23) | 1.00 (0.21) | <0.001 |
Non-Hispanic White (%) | 82 | 78 | 0.084 |
Current Smoker (%) | 66 | 65 | 0.510 |
Diabetes (%) | 17 | 14 | 0.204 |
BMI = Body Mass Index, FEV|$_{1}$| = Forced Expiratory Volume in 1 Second, BODE = Body mass index, airflow Obstruction, Dyspnea, and Exercise capacity.
COPDGene Participant Characteristics; the measurements presented were collected at the Year 5 study visit to align with the collection of proteomic and genomic data collection
Variable . | Males . | Females . | |$P$|-value . |
---|---|---|---|
. | |$N$| = 782 . | |$N$| = 594 . | . |
Age | 68.28 (8.35) | 68.03 (8.36) | 0.581 |
BMI | 28.03 (5.62) | 27.69 (6.59) | 0.317 |
FEV1 % Predicted | 61.94 (22.97) | 62.91 (22.59) | 0.431 |
BODE Index | 2.45 (2.45) | 2.63 (2.38) | 0.176 |
% Emphysema | 11.30 (11.86) | 9.39 (11.45) | 0.003 |
Pack Years | 53.05 (26.63) | 47.57 (24.99) | <0.001 |
AWT | 1.17 (0.23) | 1.00 (0.21) | <0.001 |
Non-Hispanic White (%) | 82 | 78 | 0.084 |
Current Smoker (%) | 66 | 65 | 0.510 |
Diabetes (%) | 17 | 14 | 0.204 |
Variable . | Males . | Females . | |$P$|-value . |
---|---|---|---|
. | |$N$| = 782 . | |$N$| = 594 . | . |
Age | 68.28 (8.35) | 68.03 (8.36) | 0.581 |
BMI | 28.03 (5.62) | 27.69 (6.59) | 0.317 |
FEV1 % Predicted | 61.94 (22.97) | 62.91 (22.59) | 0.431 |
BODE Index | 2.45 (2.45) | 2.63 (2.38) | 0.176 |
% Emphysema | 11.30 (11.86) | 9.39 (11.45) | 0.003 |
Pack Years | 53.05 (26.63) | 47.57 (24.99) | <0.001 |
AWT | 1.17 (0.23) | 1.00 (0.21) | <0.001 |
Non-Hispanic White (%) | 82 | 78 | 0.084 |
Current Smoker (%) | 66 | 65 | 0.510 |
Diabetes (%) | 17 | 14 | 0.204 |
BMI = Body Mass Index, FEV|$_{1}$| = Forced Expiratory Volume in 1 Second, BODE = Body mass index, airflow Obstruction, Dyspnea, and Exercise capacity.
Applying the proposed and competing methods
Blood samples from COPDGene participants at Phase 2 were quantified for proteins and RNA sequencing. Proteomics data were quantified using the SomaScan 5K technology from SomaLogic. The RNA sequencing data were obtained from Illummina. The original data set has 4979 SOMAmers, representing 4776 unique proteins and 19263 genes. We work with SOMAmers and convert our findings to unique proteins. Please refer to the supplementary material for more details on the techniques used to generate the RNASeq and proteomics data. To reduce dimensionality, we first applied unsupervised filtering to select the top 5000 genes and 2000 proteins with the largest standard deviations. In this work, for statistical rigor and to reduce false findings, we used resampling techniques to identify genes and proteins that are consistently selected by our method to be associated with AWT. We refer to these genes and proteins as “stable”. In particular, we generated 50 random splits of the filtered data, stratified by subgroup, such that for each split 75% of the data were the training data and 25% were the testing data. Within each split, we performed supervised filtering by regressing AWT on each of the genes and proteins selected by unsupervised filtering, adjusting for sex, race and pack years, and retained genes and proteins with potential to explain the variation in AWT (uncorrected |$P$|-value |$<.05$|). This means that the variables entering the models could differ for each split of the data.
To select tuning parameters, we set the range of possible values for |$\lambda _{G}$| and |$\lambda _{\xi }$| in HIP to |$(0,2]$| as in the simulations and selected the best model using BIC. Joint Lasso used 10-fold cross-validation over the same grid values used in the simulations. CVR, Lasso, and Elastic Net used 10-fold cross-validation with default settings to select tuning parameters. HIP and CVR both require specification of a rank, i.e. the number of latent components used in the solutions. Our proposed automatic approach (threshold |$=0.25$|; refer to Section 1.1 of Supplementary Information) on the concatenated data suggested |$K=3$|. Interestingly, when applied to each |$\boldsymbol X^{d,s}$| separately, it suggested |$K=3$| for the gene data and |$K=1$| for the protein data. Supplementary Figure S9 shows the scree plots for both the concatenated and separate |$\boldsymbol X^{d,s}$|. Based on these results and the robustness seen in the sensitivity analyses, we selected |$K=3$| components for HIP and CVR. For HIP, we set |$N_{top} = 75$| genes and |$25$| proteins.
For each split of the data, we applied HIP (Grid), HIP (Random), and the subgroup versions of the competing methods used in the simulations. For Elastic Net and Lasso, we stacked the views and ran separate analyses for males and females. For Joint Lasso, we ran separate analyses for the protein and gene data. For CVR, we ran separate analyses for males and females. We used the selected tuning parameters and test datasets to predict AWT and estimate test MSEs. Since the univariate filtering could result in different variables that enter our model for each data split, we combined information on how often a variable passed univariate filtering and served as input into our model and how often that variable was ranked high as relevant by our method. In particular, we considered (i) the number of splits in which the variable was included in the |$N_{top}$| variables and (ii) the proportion of splits in which the variable was included in the |$N_{top}$| variables, i.e. the number of splits in which the variable was included in the |$N_{top}$| variables divided by the number of splits in which the variable was entered into the model after supervised filtering. We took the product of these two terms and ranked the variables in descending order based on this product—the higher the better. A higher product indicates that the variable was frequently selected to differentiate AWT in the supervised univariate filtering approach and was also highly ranked as being associated with AWT in our multivariate HIP method. We then selected the top 1% of genes and proteins according to the product to represent the “stable” genes and proteins. We used this cut-off point because our focus was on identifying a few molecular signatures shared by and unique to men and women and predictive of AWT.
Real Data Results
Average MSEs, and proteins and genes selected
Supplementary Figures S10 and S11 show violin plots of the test MSEs and run times, respectively, from all 50 splits of the data. The average test MSEs from the splits were slightly lower for CVR and Joint Lasso but also used many more variables (Supplementary Table S3). HIP (Random) has a computational advantage over CVR and Joint Lasso.
Supplementary Table S4 shows the number of “stable” common and subgroup-specific genes and proteins identified by each method. We note few overlaps in selected genes and proteins between HIP and existing methods (Supplementary Fig. S12). Supplementary Table S5 compares the variables selected by HIP (Random) and HIP (Grid); the selected genes and proteins are very similar, again supporting the use of the random search instead of the grid search.
Supplementary Tables S6 and S7 list the genes and Supplementary Table S8 lists the proteins identified as “stable” and important to males and females by HIP (Random) including weights for each protein and gene calculated as the |$L_{2}$| norm of coefficients in |$\hat{\boldsymbol B}^{d,s}$| across components (i.e. rows) and averaging over the splits where the variable was selected.
Proteins with large weights include NPLOC4 and SPG21 for males and SMAP1 and CDKN2D for females. [24] introduced a novel method called SubmiRine to analyze miRNA and predict miRNA target site variants (miRNA-TSV). When this method was applied to a subset of genomic samples from patients with COPD from the Lung Genome Research Consortium (LGRC; http://www.lung-genomics.org), SPG21 was the top-scoring miRNA-TSV.
The gene with the largest weight was ADIPOR1 for males and BCL2L1 for females. In a study of 60 male COPD patients and 30 male controls, [25] found adiponectin that is associated with inflammation from COPD evidenced by a positive correlation with IL-8 and a negative correlation with FEV|$_{1}$| %.
Pathway enrichment analysis
We performed pathway enrichment analysis using Ingenuity Pathway Analysis (IPA) [26] to test for overrepresentation of pathways among our lists of “stable” proteins and genes for males and females. The top 10 canonical gene pathways (Table 2) for males and females had some common and some subgroup-specific pathways. The top pathway for males is the iron homeostasis signaling pathway; this is the second ranked pathway for females, and the top pathway for females is heme biosynthesis II. There is strong evidence that disrupted iron homeostasis is associated with the presence and severity of lung disease including COPD [27, 28]. Methylglyoxal degradation I ranks second for males and third for females. Reference [29] performed gene expression profiling on small airway epithelium samples and also found this pathway to be activated in both male and female smokers.
View . | Subgroup . | Canonical pathway . | Molecules . | Unadjusted |$P$|-value . |
---|---|---|---|---|
Genes | Males | Iron homeostasis signaling pathway | CDC34,FECH,SLC25A37 | .003 |
Methylglyoxal Degradation I | HAGH | .006 | ||
Heme Biosynthesis from Uroporphyrinogen-III I | FECH | .008 | ||
Pentose Phosphate Pathway (Non-oxidative Branch) | RPIA | .013 | ||
Heme Biosynthesis II | FECH | .019 | ||
Pentose Phosphate Pathway | RPIA | .023 | ||
Erythropoietin Signaling Pathway | BCL2L1,GATA1 | .054 | ||
ID1 Signaling Pathway | BCL2L1,GSPT1 | .068 | ||
Sertoli Cell-Sertoli Cell Junction Signaling | SPTB,YBX3 | .069 | ||
Autophagy | GABARAPL2,SLC1A5 | .076 | ||
Females | Heme Biosynthesis II | ALAS2,FECH | <.001 | |
Iron homeostasis signaling pathway | ALAS2,CDC34,FECH,SLC25A37 | <.001 | ||
Methylglyoxal Degradation I | HAGH | .006 | ||
Heme Biosynthesis from Uroporphyrinogen-III I | FECH | .008 | ||
Tetrapyrrole Biosynthesis II | ALAS2 | .010 | ||
Hypoxia Signaling in the Cardiovascular System | CDC34,UBE2H | .011 | ||
Pentose Phosphate Pathway (Non-oxidative Branch) | RPIA | .013 | ||
Pentose Phosphate Pathway | RPIA | .023 | ||
Erythropoietin Signaling Pathway | BCL2L1,GATA1 | .054 | ||
ID1 Signaling Pathway | BCL2L1,GSPT1 | .068 | ||
Proteins | Males | Role of JAK2 in Hormone-like Cytokine Signaling | EPO,LEP,PTPN6 | <.001 |
White Adipose Tissue Browning Pathway | BDNF,LEP,NPPB | <.001 | ||
Erythropoietin Signaling Pathway | EPO,LEP,PTPN6 | <.001 | ||
Serotonin Receptor Signaling | ADIPOQ,BDNF,LEP,NPPB | <.001 | ||
AMPK Signaling | ADIPOQ,INS,LEP | .001 | ||
Leptin Signaling in Obesity | INS,LEP | .002 | ||
IL-3 Signaling | PPP3R1,PTPN6 | .002 | ||
Maturity Onset Diabetes of Young (MODY) Signaling | ADIPOQ,INS | .002 | ||
Thyroid Cancer Signaling | BDNF,INS | .002 | ||
ABRA Signaling Pathway | NPPB,PPP3R1 | .003 | ||
Females | Granulocyte Adhesion and Diapedesis | PF4,PPBP,TNFRSF1A | <.001 | |
Agranulocyte Adhesion and Diapedesis | PF4,PPBP,TNFRSF1A | <.001 | ||
Wound Healing Signaling Pathway | EGF,PF4,TNFRSF1A | <.001 | ||
Huntington’s Disease Signaling | BDNF,CPLX2,EGF | .001 | ||
Pathogen Induced Cytokine Storm Signaling Pathway | PF4,PPBP,TNFRSF1A | .002 | ||
Glioma Signaling | CDKN2D,EGF | .004 | ||
Type II Diabetes Mellitus Signaling | ADIPOQ,TNFRSF1A | .005 | ||
Axonal Guidance Signaling | BDNF,EGF,PAPPA | .005 | ||
Tumor Microenvironment Pathway | EGF,TNFRSF1A | .007 | ||
Regulation Of The Epithelial Mesenchymal Transition By Growth Factors Pathway | EGF,TNFRSF1A | .008 |
View . | Subgroup . | Canonical pathway . | Molecules . | Unadjusted |$P$|-value . |
---|---|---|---|---|
Genes | Males | Iron homeostasis signaling pathway | CDC34,FECH,SLC25A37 | .003 |
Methylglyoxal Degradation I | HAGH | .006 | ||
Heme Biosynthesis from Uroporphyrinogen-III I | FECH | .008 | ||
Pentose Phosphate Pathway (Non-oxidative Branch) | RPIA | .013 | ||
Heme Biosynthesis II | FECH | .019 | ||
Pentose Phosphate Pathway | RPIA | .023 | ||
Erythropoietin Signaling Pathway | BCL2L1,GATA1 | .054 | ||
ID1 Signaling Pathway | BCL2L1,GSPT1 | .068 | ||
Sertoli Cell-Sertoli Cell Junction Signaling | SPTB,YBX3 | .069 | ||
Autophagy | GABARAPL2,SLC1A5 | .076 | ||
Females | Heme Biosynthesis II | ALAS2,FECH | <.001 | |
Iron homeostasis signaling pathway | ALAS2,CDC34,FECH,SLC25A37 | <.001 | ||
Methylglyoxal Degradation I | HAGH | .006 | ||
Heme Biosynthesis from Uroporphyrinogen-III I | FECH | .008 | ||
Tetrapyrrole Biosynthesis II | ALAS2 | .010 | ||
Hypoxia Signaling in the Cardiovascular System | CDC34,UBE2H | .011 | ||
Pentose Phosphate Pathway (Non-oxidative Branch) | RPIA | .013 | ||
Pentose Phosphate Pathway | RPIA | .023 | ||
Erythropoietin Signaling Pathway | BCL2L1,GATA1 | .054 | ||
ID1 Signaling Pathway | BCL2L1,GSPT1 | .068 | ||
Proteins | Males | Role of JAK2 in Hormone-like Cytokine Signaling | EPO,LEP,PTPN6 | <.001 |
White Adipose Tissue Browning Pathway | BDNF,LEP,NPPB | <.001 | ||
Erythropoietin Signaling Pathway | EPO,LEP,PTPN6 | <.001 | ||
Serotonin Receptor Signaling | ADIPOQ,BDNF,LEP,NPPB | <.001 | ||
AMPK Signaling | ADIPOQ,INS,LEP | .001 | ||
Leptin Signaling in Obesity | INS,LEP | .002 | ||
IL-3 Signaling | PPP3R1,PTPN6 | .002 | ||
Maturity Onset Diabetes of Young (MODY) Signaling | ADIPOQ,INS | .002 | ||
Thyroid Cancer Signaling | BDNF,INS | .002 | ||
ABRA Signaling Pathway | NPPB,PPP3R1 | .003 | ||
Females | Granulocyte Adhesion and Diapedesis | PF4,PPBP,TNFRSF1A | <.001 | |
Agranulocyte Adhesion and Diapedesis | PF4,PPBP,TNFRSF1A | <.001 | ||
Wound Healing Signaling Pathway | EGF,PF4,TNFRSF1A | <.001 | ||
Huntington’s Disease Signaling | BDNF,CPLX2,EGF | .001 | ||
Pathogen Induced Cytokine Storm Signaling Pathway | PF4,PPBP,TNFRSF1A | .002 | ||
Glioma Signaling | CDKN2D,EGF | .004 | ||
Type II Diabetes Mellitus Signaling | ADIPOQ,TNFRSF1A | .005 | ||
Axonal Guidance Signaling | BDNF,EGF,PAPPA | .005 | ||
Tumor Microenvironment Pathway | EGF,TNFRSF1A | .007 | ||
Regulation Of The Epithelial Mesenchymal Transition By Growth Factors Pathway | EGF,TNFRSF1A | .008 |
View . | Subgroup . | Canonical pathway . | Molecules . | Unadjusted |$P$|-value . |
---|---|---|---|---|
Genes | Males | Iron homeostasis signaling pathway | CDC34,FECH,SLC25A37 | .003 |
Methylglyoxal Degradation I | HAGH | .006 | ||
Heme Biosynthesis from Uroporphyrinogen-III I | FECH | .008 | ||
Pentose Phosphate Pathway (Non-oxidative Branch) | RPIA | .013 | ||
Heme Biosynthesis II | FECH | .019 | ||
Pentose Phosphate Pathway | RPIA | .023 | ||
Erythropoietin Signaling Pathway | BCL2L1,GATA1 | .054 | ||
ID1 Signaling Pathway | BCL2L1,GSPT1 | .068 | ||
Sertoli Cell-Sertoli Cell Junction Signaling | SPTB,YBX3 | .069 | ||
Autophagy | GABARAPL2,SLC1A5 | .076 | ||
Females | Heme Biosynthesis II | ALAS2,FECH | <.001 | |
Iron homeostasis signaling pathway | ALAS2,CDC34,FECH,SLC25A37 | <.001 | ||
Methylglyoxal Degradation I | HAGH | .006 | ||
Heme Biosynthesis from Uroporphyrinogen-III I | FECH | .008 | ||
Tetrapyrrole Biosynthesis II | ALAS2 | .010 | ||
Hypoxia Signaling in the Cardiovascular System | CDC34,UBE2H | .011 | ||
Pentose Phosphate Pathway (Non-oxidative Branch) | RPIA | .013 | ||
Pentose Phosphate Pathway | RPIA | .023 | ||
Erythropoietin Signaling Pathway | BCL2L1,GATA1 | .054 | ||
ID1 Signaling Pathway | BCL2L1,GSPT1 | .068 | ||
Proteins | Males | Role of JAK2 in Hormone-like Cytokine Signaling | EPO,LEP,PTPN6 | <.001 |
White Adipose Tissue Browning Pathway | BDNF,LEP,NPPB | <.001 | ||
Erythropoietin Signaling Pathway | EPO,LEP,PTPN6 | <.001 | ||
Serotonin Receptor Signaling | ADIPOQ,BDNF,LEP,NPPB | <.001 | ||
AMPK Signaling | ADIPOQ,INS,LEP | .001 | ||
Leptin Signaling in Obesity | INS,LEP | .002 | ||
IL-3 Signaling | PPP3R1,PTPN6 | .002 | ||
Maturity Onset Diabetes of Young (MODY) Signaling | ADIPOQ,INS | .002 | ||
Thyroid Cancer Signaling | BDNF,INS | .002 | ||
ABRA Signaling Pathway | NPPB,PPP3R1 | .003 | ||
Females | Granulocyte Adhesion and Diapedesis | PF4,PPBP,TNFRSF1A | <.001 | |
Agranulocyte Adhesion and Diapedesis | PF4,PPBP,TNFRSF1A | <.001 | ||
Wound Healing Signaling Pathway | EGF,PF4,TNFRSF1A | <.001 | ||
Huntington’s Disease Signaling | BDNF,CPLX2,EGF | .001 | ||
Pathogen Induced Cytokine Storm Signaling Pathway | PF4,PPBP,TNFRSF1A | .002 | ||
Glioma Signaling | CDKN2D,EGF | .004 | ||
Type II Diabetes Mellitus Signaling | ADIPOQ,TNFRSF1A | .005 | ||
Axonal Guidance Signaling | BDNF,EGF,PAPPA | .005 | ||
Tumor Microenvironment Pathway | EGF,TNFRSF1A | .007 | ||
Regulation Of The Epithelial Mesenchymal Transition By Growth Factors Pathway | EGF,TNFRSF1A | .008 |
View . | Subgroup . | Canonical pathway . | Molecules . | Unadjusted |$P$|-value . |
---|---|---|---|---|
Genes | Males | Iron homeostasis signaling pathway | CDC34,FECH,SLC25A37 | .003 |
Methylglyoxal Degradation I | HAGH | .006 | ||
Heme Biosynthesis from Uroporphyrinogen-III I | FECH | .008 | ||
Pentose Phosphate Pathway (Non-oxidative Branch) | RPIA | .013 | ||
Heme Biosynthesis II | FECH | .019 | ||
Pentose Phosphate Pathway | RPIA | .023 | ||
Erythropoietin Signaling Pathway | BCL2L1,GATA1 | .054 | ||
ID1 Signaling Pathway | BCL2L1,GSPT1 | .068 | ||
Sertoli Cell-Sertoli Cell Junction Signaling | SPTB,YBX3 | .069 | ||
Autophagy | GABARAPL2,SLC1A5 | .076 | ||
Females | Heme Biosynthesis II | ALAS2,FECH | <.001 | |
Iron homeostasis signaling pathway | ALAS2,CDC34,FECH,SLC25A37 | <.001 | ||
Methylglyoxal Degradation I | HAGH | .006 | ||
Heme Biosynthesis from Uroporphyrinogen-III I | FECH | .008 | ||
Tetrapyrrole Biosynthesis II | ALAS2 | .010 | ||
Hypoxia Signaling in the Cardiovascular System | CDC34,UBE2H | .011 | ||
Pentose Phosphate Pathway (Non-oxidative Branch) | RPIA | .013 | ||
Pentose Phosphate Pathway | RPIA | .023 | ||
Erythropoietin Signaling Pathway | BCL2L1,GATA1 | .054 | ||
ID1 Signaling Pathway | BCL2L1,GSPT1 | .068 | ||
Proteins | Males | Role of JAK2 in Hormone-like Cytokine Signaling | EPO,LEP,PTPN6 | <.001 |
White Adipose Tissue Browning Pathway | BDNF,LEP,NPPB | <.001 | ||
Erythropoietin Signaling Pathway | EPO,LEP,PTPN6 | <.001 | ||
Serotonin Receptor Signaling | ADIPOQ,BDNF,LEP,NPPB | <.001 | ||
AMPK Signaling | ADIPOQ,INS,LEP | .001 | ||
Leptin Signaling in Obesity | INS,LEP | .002 | ||
IL-3 Signaling | PPP3R1,PTPN6 | .002 | ||
Maturity Onset Diabetes of Young (MODY) Signaling | ADIPOQ,INS | .002 | ||
Thyroid Cancer Signaling | BDNF,INS | .002 | ||
ABRA Signaling Pathway | NPPB,PPP3R1 | .003 | ||
Females | Granulocyte Adhesion and Diapedesis | PF4,PPBP,TNFRSF1A | <.001 | |
Agranulocyte Adhesion and Diapedesis | PF4,PPBP,TNFRSF1A | <.001 | ||
Wound Healing Signaling Pathway | EGF,PF4,TNFRSF1A | <.001 | ||
Huntington’s Disease Signaling | BDNF,CPLX2,EGF | .001 | ||
Pathogen Induced Cytokine Storm Signaling Pathway | PF4,PPBP,TNFRSF1A | .002 | ||
Glioma Signaling | CDKN2D,EGF | .004 | ||
Type II Diabetes Mellitus Signaling | ADIPOQ,TNFRSF1A | .005 | ||
Axonal Guidance Signaling | BDNF,EGF,PAPPA | .005 | ||
Tumor Microenvironment Pathway | EGF,TNFRSF1A | .007 | ||
Regulation Of The Epithelial Mesenchymal Transition By Growth Factors Pathway | EGF,TNFRSF1A | .008 |
There was no overlap in the top 10 protein pathways for males and females. The top pathway for males was role of JAK2 in hormone-like cytokine signaling. The top pathway for females was granulocyte adhesion and diapedesis which is associated with regulation of inflammation. Reference [30] also found this to be a top pathway involving upregulated genes when comparing patients with COPD and healthy controls.
Variable . | Estimate . | 95% CI . | P-value . | |$R^{2}$| . | Adjusted |$R^{2}$| . |
---|---|---|---|---|---|
ERF | 0.152 | 0.147 | |||
Intercept | –1.180 | –1.872, –0.489 | 0.001 | ||
Age | 0.021 | –0.035, 0.077 | 0.461 | ||
Sex (Female) | 0.006 | –0.093, 0.105 | 0.901 | ||
Race (African American) | –0.070 | –0.202, 0.062 | 0.297 | ||
BMI | 0.352 | 0.299, 0.406 | <0.001 | ||
Former Smoker | 0.947 | 0.257, 1.637 | 0.007 | ||
Current Smoker | 1.429 | 0.735, 2.123 | <0.001 | ||
% Emphysema | 0.023 | –0.032, 0.079 | 0.414 | ||
Scanner - Philips | 0.379 | 0.097, 0.661 | 0.009 | ||
Scanner - Siemens | 0.130 | 0.025, 0.235 | 0.015 | ||
ERF + Common Protein Score | 0.157 | 0.151 | |||
Intercept | –1.138 | –1.828, –0.447 | 0.001 | ||
Age | 0.032 | –0.024, 0.089 | 0.266 | ||
Sex (Female) | 0.007 | –0.092, 0.105 | 0.897 | ||
Race (African American) | –0.049 | –0.181, 0.084 | 0.469 | ||
BMI | 0.327 | 0.271, 0.383 | <0.001 | ||
Former Smoker | 0.911 | 0.223, 1.600 | 0.010 | ||
Current Smoker | 1.390 | 0.697, 2.083 | <0.001 | ||
% Emphysema | 0.028 | –0.028, 0.083 | 0.328 | ||
Scanner - Philips | 0.404 | 0.123, 0.686 | 0.005 | ||
Scanner - Siemens | 0.110 | 0.005, 0.216 | 0.041 | ||
Common Protein Score | 0.076 | 0.023, 0.130 | 0.005 | ||
ERF + Common Gene Score | 0.153 | 0.147 | |||
Intercept | –1.169 | –1.861, –0.477 | 0.001 | ||
Age | 0.021 | –0.035, 0.077 | 0.462 | ||
Sex (Female) | 0.007 | –0.092, 0.106 | 0.893 | ||
Race (African American) | –0.083 | –0.216, 0.050 | 0.221 | ||
BMI | 0.343 | 0.288, 0.398 | <0.001 | ||
Former Smoker | 0.932 | 0.242, 1.623 | 0.008 | ||
Current Smoker | 1.423 | 0.729, 2.117 | <0.001 | ||
% Emphysema | 0.023 | –0.033, 0.079 | 0.421 | ||
Scanner - Philips | 0.387 | 0.105, 0.669 | 0.007 | ||
Scanner - Siemens | 0.134 | 0.029, 0.239 | 0.013 | ||
Common Gene Score | 0.035 | –0.017, 0.086 | 0.185 | ||
ERF + Common Scores | 0.158 | 0.151 | |||
Intercept | –1.128 | –1.819, –0.437 | 0.001 | ||
Age | 0.032 | –0.025, 0.088 | 0.270 | ||
Sex (Female) | 0.007 | –0.092, 0.106 | 0.890 | ||
Race (African American) | –0.061 | –0.195, 0.073 | 0.373 | ||
BMI | 0.320 | 0.262, 0.377 | <0.001 | ||
Former Smoker | 0.899 | 0.210, 1.588 | 0.011 | ||
Current Smoker | 1.385 | 0.692, 2.078 | <0.001 | ||
% Emphysema | 0.027 | –0.028, 0.083 | 0.335 | ||
Scanner - Philips | 0.411 | 0.129, 0.693 | 0.004 | ||
Scanner - Siemens | 0.114 | 0.008, 0.220 | 0.035 | ||
Common Protein Score | 0.074 | 0.021, 0.128 | 0.006 | ||
Common Gene Score | 0.031 | –0.020, 0.083 | 0.237 | ||
ERF + Subgroup Protein Score | 0.165 | 0.159 | |||
Intercept | –1.071 | –1.760, –0.383 | 0.002 | ||
Age | 0.014 | –0.041, 0.070 | 0.614 | ||
Sex (Female) | 0.007 | –0.091, 0.105 | 0.891 | ||
Race (African American) | –0.035 | –0.167, 0.097 | 0.605 | ||
BMI | 0.312 | 0.256, 0.368 | <0.001 | ||
Former Smoker | 0.860 | 0.174, 1.546 | 0.014 | ||
Current Smoker | 1.335 | 0.645, 2.025 | <0.001 | ||
% Emphysema | 0.030 | –0.025, 0.086 | 0.284 | ||
Scanner - Philips | 0.418 | 0.137, 0.698 | 0.004 | ||
Scanner - Siemens | 0.079 | –0.027, 0.185 | 0.146 | ||
Subgroup Protein Score | 0.126 | 0.072, 0.179 | <0.001 | ||
ERF + Subgroup Gene Score | 0.153 | 0.147 | |||
Intercept | –1.169 | –1.861, –0.477 | 0.001 | ||
Age | 0.021 | –0.035, 0.077 | 0.459 | ||
Sex (Female) | 0.007 | –0.092, 0.106 | 0.893 | ||
Race (African American) | –0.083 | –0.217, 0.050 | 0.220 | ||
BMI | 0.343 | 0.288, 0.398 | <0.001 | ||
Former Smoker | 0.932 | 0.242, 1.622 | 0.008 | ||
Current Smoker | 1.423 | 0.729, 2.117 | <0.001 | ||
% Emphysema | 0.023 | –0.033, 0.079 | 0.421 | ||
Scanner - Philips | 0.387 | 0.105, 0.669 | 0.007 | ||
Scanner - Siemens | 0.134 | 0.029, 0.239 | 0.013 | ||
Subgroup Gene Score | 0.035 | –0.016, 0.087 | 0.180 | ||
ERF + Subgroup Scores | 0.166 | 0.159 | |||
Intercept | –1.064 | –1.752, –0.375 | 0.003 | ||
Age | 0.015 | –0.041, 0.070 | 0.610 | ||
Sex (Female) | 0.007 | –0.091, 0.106 | 0.885 | ||
Race (African American) | –0.045 | –0.179, 0.088 | 0.504 | ||
BMI | 0.306 | 0.249, 0.363 | <0.001 | ||
Former Smoker | 0.850 | 0.164, 1.536 | 0.015 | ||
Current Smoker | 1.332 | 0.642, 2.022 | <0.001 | ||
% Emphysema | 0.030 | –0.025, 0.085 | 0.290 | ||
Scanner - Philips | 0.423 | 0.142, 0.704 | 0.003 | ||
Scanner - Siemens | 0.083 | –0.024, 0.189 | 0.129 | ||
Subgroup Protein Score | 0.124 | 0.070, 0.177 | <0.001 | ||
Subtype Gene Score | 0.027 | –0.024, 0.078 | 0.304 |
Variable . | Estimate . | 95% CI . | P-value . | |$R^{2}$| . | Adjusted |$R^{2}$| . |
---|---|---|---|---|---|
ERF | 0.152 | 0.147 | |||
Intercept | –1.180 | –1.872, –0.489 | 0.001 | ||
Age | 0.021 | –0.035, 0.077 | 0.461 | ||
Sex (Female) | 0.006 | –0.093, 0.105 | 0.901 | ||
Race (African American) | –0.070 | –0.202, 0.062 | 0.297 | ||
BMI | 0.352 | 0.299, 0.406 | <0.001 | ||
Former Smoker | 0.947 | 0.257, 1.637 | 0.007 | ||
Current Smoker | 1.429 | 0.735, 2.123 | <0.001 | ||
% Emphysema | 0.023 | –0.032, 0.079 | 0.414 | ||
Scanner - Philips | 0.379 | 0.097, 0.661 | 0.009 | ||
Scanner - Siemens | 0.130 | 0.025, 0.235 | 0.015 | ||
ERF + Common Protein Score | 0.157 | 0.151 | |||
Intercept | –1.138 | –1.828, –0.447 | 0.001 | ||
Age | 0.032 | –0.024, 0.089 | 0.266 | ||
Sex (Female) | 0.007 | –0.092, 0.105 | 0.897 | ||
Race (African American) | –0.049 | –0.181, 0.084 | 0.469 | ||
BMI | 0.327 | 0.271, 0.383 | <0.001 | ||
Former Smoker | 0.911 | 0.223, 1.600 | 0.010 | ||
Current Smoker | 1.390 | 0.697, 2.083 | <0.001 | ||
% Emphysema | 0.028 | –0.028, 0.083 | 0.328 | ||
Scanner - Philips | 0.404 | 0.123, 0.686 | 0.005 | ||
Scanner - Siemens | 0.110 | 0.005, 0.216 | 0.041 | ||
Common Protein Score | 0.076 | 0.023, 0.130 | 0.005 | ||
ERF + Common Gene Score | 0.153 | 0.147 | |||
Intercept | –1.169 | –1.861, –0.477 | 0.001 | ||
Age | 0.021 | –0.035, 0.077 | 0.462 | ||
Sex (Female) | 0.007 | –0.092, 0.106 | 0.893 | ||
Race (African American) | –0.083 | –0.216, 0.050 | 0.221 | ||
BMI | 0.343 | 0.288, 0.398 | <0.001 | ||
Former Smoker | 0.932 | 0.242, 1.623 | 0.008 | ||
Current Smoker | 1.423 | 0.729, 2.117 | <0.001 | ||
% Emphysema | 0.023 | –0.033, 0.079 | 0.421 | ||
Scanner - Philips | 0.387 | 0.105, 0.669 | 0.007 | ||
Scanner - Siemens | 0.134 | 0.029, 0.239 | 0.013 | ||
Common Gene Score | 0.035 | –0.017, 0.086 | 0.185 | ||
ERF + Common Scores | 0.158 | 0.151 | |||
Intercept | –1.128 | –1.819, –0.437 | 0.001 | ||
Age | 0.032 | –0.025, 0.088 | 0.270 | ||
Sex (Female) | 0.007 | –0.092, 0.106 | 0.890 | ||
Race (African American) | –0.061 | –0.195, 0.073 | 0.373 | ||
BMI | 0.320 | 0.262, 0.377 | <0.001 | ||
Former Smoker | 0.899 | 0.210, 1.588 | 0.011 | ||
Current Smoker | 1.385 | 0.692, 2.078 | <0.001 | ||
% Emphysema | 0.027 | –0.028, 0.083 | 0.335 | ||
Scanner - Philips | 0.411 | 0.129, 0.693 | 0.004 | ||
Scanner - Siemens | 0.114 | 0.008, 0.220 | 0.035 | ||
Common Protein Score | 0.074 | 0.021, 0.128 | 0.006 | ||
Common Gene Score | 0.031 | –0.020, 0.083 | 0.237 | ||
ERF + Subgroup Protein Score | 0.165 | 0.159 | |||
Intercept | –1.071 | –1.760, –0.383 | 0.002 | ||
Age | 0.014 | –0.041, 0.070 | 0.614 | ||
Sex (Female) | 0.007 | –0.091, 0.105 | 0.891 | ||
Race (African American) | –0.035 | –0.167, 0.097 | 0.605 | ||
BMI | 0.312 | 0.256, 0.368 | <0.001 | ||
Former Smoker | 0.860 | 0.174, 1.546 | 0.014 | ||
Current Smoker | 1.335 | 0.645, 2.025 | <0.001 | ||
% Emphysema | 0.030 | –0.025, 0.086 | 0.284 | ||
Scanner - Philips | 0.418 | 0.137, 0.698 | 0.004 | ||
Scanner - Siemens | 0.079 | –0.027, 0.185 | 0.146 | ||
Subgroup Protein Score | 0.126 | 0.072, 0.179 | <0.001 | ||
ERF + Subgroup Gene Score | 0.153 | 0.147 | |||
Intercept | –1.169 | –1.861, –0.477 | 0.001 | ||
Age | 0.021 | –0.035, 0.077 | 0.459 | ||
Sex (Female) | 0.007 | –0.092, 0.106 | 0.893 | ||
Race (African American) | –0.083 | –0.217, 0.050 | 0.220 | ||
BMI | 0.343 | 0.288, 0.398 | <0.001 | ||
Former Smoker | 0.932 | 0.242, 1.622 | 0.008 | ||
Current Smoker | 1.423 | 0.729, 2.117 | <0.001 | ||
% Emphysema | 0.023 | –0.033, 0.079 | 0.421 | ||
Scanner - Philips | 0.387 | 0.105, 0.669 | 0.007 | ||
Scanner - Siemens | 0.134 | 0.029, 0.239 | 0.013 | ||
Subgroup Gene Score | 0.035 | –0.016, 0.087 | 0.180 | ||
ERF + Subgroup Scores | 0.166 | 0.159 | |||
Intercept | –1.064 | –1.752, –0.375 | 0.003 | ||
Age | 0.015 | –0.041, 0.070 | 0.610 | ||
Sex (Female) | 0.007 | –0.091, 0.106 | 0.885 | ||
Race (African American) | –0.045 | –0.179, 0.088 | 0.504 | ||
BMI | 0.306 | 0.249, 0.363 | <0.001 | ||
Former Smoker | 0.850 | 0.164, 1.536 | 0.015 | ||
Current Smoker | 1.332 | 0.642, 2.022 | <0.001 | ||
% Emphysema | 0.030 | –0.025, 0.085 | 0.290 | ||
Scanner - Philips | 0.423 | 0.142, 0.704 | 0.003 | ||
Scanner - Siemens | 0.083 | –0.024, 0.189 | 0.129 | ||
Subgroup Protein Score | 0.124 | 0.070, 0.177 | <0.001 | ||
Subtype Gene Score | 0.027 | –0.024, 0.078 | 0.304 |
Models were fit on all participants in the COPD Data set and adjusted for age, sex, race, BMI, smoking status, percent emphysema, and scanner make (|$N=1374$|). Scores were developed using the “stable” proteins and genes selected by HIP (Random). ERF = Established Risk Factors (Age, Sex, Race, BMI, and Smoking Status).
Variable . | Estimate . | 95% CI . | P-value . | |$R^{2}$| . | Adjusted |$R^{2}$| . |
---|---|---|---|---|---|
ERF | 0.152 | 0.147 | |||
Intercept | –1.180 | –1.872, –0.489 | 0.001 | ||
Age | 0.021 | –0.035, 0.077 | 0.461 | ||
Sex (Female) | 0.006 | –0.093, 0.105 | 0.901 | ||
Race (African American) | –0.070 | –0.202, 0.062 | 0.297 | ||
BMI | 0.352 | 0.299, 0.406 | <0.001 | ||
Former Smoker | 0.947 | 0.257, 1.637 | 0.007 | ||
Current Smoker | 1.429 | 0.735, 2.123 | <0.001 | ||
% Emphysema | 0.023 | –0.032, 0.079 | 0.414 | ||
Scanner - Philips | 0.379 | 0.097, 0.661 | 0.009 | ||
Scanner - Siemens | 0.130 | 0.025, 0.235 | 0.015 | ||
ERF + Common Protein Score | 0.157 | 0.151 | |||
Intercept | –1.138 | –1.828, –0.447 | 0.001 | ||
Age | 0.032 | –0.024, 0.089 | 0.266 | ||
Sex (Female) | 0.007 | –0.092, 0.105 | 0.897 | ||
Race (African American) | –0.049 | –0.181, 0.084 | 0.469 | ||
BMI | 0.327 | 0.271, 0.383 | <0.001 | ||
Former Smoker | 0.911 | 0.223, 1.600 | 0.010 | ||
Current Smoker | 1.390 | 0.697, 2.083 | <0.001 | ||
% Emphysema | 0.028 | –0.028, 0.083 | 0.328 | ||
Scanner - Philips | 0.404 | 0.123, 0.686 | 0.005 | ||
Scanner - Siemens | 0.110 | 0.005, 0.216 | 0.041 | ||
Common Protein Score | 0.076 | 0.023, 0.130 | 0.005 | ||
ERF + Common Gene Score | 0.153 | 0.147 | |||
Intercept | –1.169 | –1.861, –0.477 | 0.001 | ||
Age | 0.021 | –0.035, 0.077 | 0.462 | ||
Sex (Female) | 0.007 | –0.092, 0.106 | 0.893 | ||
Race (African American) | –0.083 | –0.216, 0.050 | 0.221 | ||
BMI | 0.343 | 0.288, 0.398 | <0.001 | ||
Former Smoker | 0.932 | 0.242, 1.623 | 0.008 | ||
Current Smoker | 1.423 | 0.729, 2.117 | <0.001 | ||
% Emphysema | 0.023 | –0.033, 0.079 | 0.421 | ||
Scanner - Philips | 0.387 | 0.105, 0.669 | 0.007 | ||
Scanner - Siemens | 0.134 | 0.029, 0.239 | 0.013 | ||
Common Gene Score | 0.035 | –0.017, 0.086 | 0.185 | ||
ERF + Common Scores | 0.158 | 0.151 | |||
Intercept | –1.128 | –1.819, –0.437 | 0.001 | ||
Age | 0.032 | –0.025, 0.088 | 0.270 | ||
Sex (Female) | 0.007 | –0.092, 0.106 | 0.890 | ||
Race (African American) | –0.061 | –0.195, 0.073 | 0.373 | ||
BMI | 0.320 | 0.262, 0.377 | <0.001 | ||
Former Smoker | 0.899 | 0.210, 1.588 | 0.011 | ||
Current Smoker | 1.385 | 0.692, 2.078 | <0.001 | ||
% Emphysema | 0.027 | –0.028, 0.083 | 0.335 | ||
Scanner - Philips | 0.411 | 0.129, 0.693 | 0.004 | ||
Scanner - Siemens | 0.114 | 0.008, 0.220 | 0.035 | ||
Common Protein Score | 0.074 | 0.021, 0.128 | 0.006 | ||
Common Gene Score | 0.031 | –0.020, 0.083 | 0.237 | ||
ERF + Subgroup Protein Score | 0.165 | 0.159 | |||
Intercept | –1.071 | –1.760, –0.383 | 0.002 | ||
Age | 0.014 | –0.041, 0.070 | 0.614 | ||
Sex (Female) | 0.007 | –0.091, 0.105 | 0.891 | ||
Race (African American) | –0.035 | –0.167, 0.097 | 0.605 | ||
BMI | 0.312 | 0.256, 0.368 | <0.001 | ||
Former Smoker | 0.860 | 0.174, 1.546 | 0.014 | ||
Current Smoker | 1.335 | 0.645, 2.025 | <0.001 | ||
% Emphysema | 0.030 | –0.025, 0.086 | 0.284 | ||
Scanner - Philips | 0.418 | 0.137, 0.698 | 0.004 | ||
Scanner - Siemens | 0.079 | –0.027, 0.185 | 0.146 | ||
Subgroup Protein Score | 0.126 | 0.072, 0.179 | <0.001 | ||
ERF + Subgroup Gene Score | 0.153 | 0.147 | |||
Intercept | –1.169 | –1.861, –0.477 | 0.001 | ||
Age | 0.021 | –0.035, 0.077 | 0.459 | ||
Sex (Female) | 0.007 | –0.092, 0.106 | 0.893 | ||
Race (African American) | –0.083 | –0.217, 0.050 | 0.220 | ||
BMI | 0.343 | 0.288, 0.398 | <0.001 | ||
Former Smoker | 0.932 | 0.242, 1.622 | 0.008 | ||
Current Smoker | 1.423 | 0.729, 2.117 | <0.001 | ||
% Emphysema | 0.023 | –0.033, 0.079 | 0.421 | ||
Scanner - Philips | 0.387 | 0.105, 0.669 | 0.007 | ||
Scanner - Siemens | 0.134 | 0.029, 0.239 | 0.013 | ||
Subgroup Gene Score | 0.035 | –0.016, 0.087 | 0.180 | ||
ERF + Subgroup Scores | 0.166 | 0.159 | |||
Intercept | –1.064 | –1.752, –0.375 | 0.003 | ||
Age | 0.015 | –0.041, 0.070 | 0.610 | ||
Sex (Female) | 0.007 | –0.091, 0.106 | 0.885 | ||
Race (African American) | –0.045 | –0.179, 0.088 | 0.504 | ||
BMI | 0.306 | 0.249, 0.363 | <0.001 | ||
Former Smoker | 0.850 | 0.164, 1.536 | 0.015 | ||
Current Smoker | 1.332 | 0.642, 2.022 | <0.001 | ||
% Emphysema | 0.030 | –0.025, 0.085 | 0.290 | ||
Scanner - Philips | 0.423 | 0.142, 0.704 | 0.003 | ||
Scanner - Siemens | 0.083 | –0.024, 0.189 | 0.129 | ||
Subgroup Protein Score | 0.124 | 0.070, 0.177 | <0.001 | ||
Subtype Gene Score | 0.027 | –0.024, 0.078 | 0.304 |
Variable . | Estimate . | 95% CI . | P-value . | |$R^{2}$| . | Adjusted |$R^{2}$| . |
---|---|---|---|---|---|
ERF | 0.152 | 0.147 | |||
Intercept | –1.180 | –1.872, –0.489 | 0.001 | ||
Age | 0.021 | –0.035, 0.077 | 0.461 | ||
Sex (Female) | 0.006 | –0.093, 0.105 | 0.901 | ||
Race (African American) | –0.070 | –0.202, 0.062 | 0.297 | ||
BMI | 0.352 | 0.299, 0.406 | <0.001 | ||
Former Smoker | 0.947 | 0.257, 1.637 | 0.007 | ||
Current Smoker | 1.429 | 0.735, 2.123 | <0.001 | ||
% Emphysema | 0.023 | –0.032, 0.079 | 0.414 | ||
Scanner - Philips | 0.379 | 0.097, 0.661 | 0.009 | ||
Scanner - Siemens | 0.130 | 0.025, 0.235 | 0.015 | ||
ERF + Common Protein Score | 0.157 | 0.151 | |||
Intercept | –1.138 | –1.828, –0.447 | 0.001 | ||
Age | 0.032 | –0.024, 0.089 | 0.266 | ||
Sex (Female) | 0.007 | –0.092, 0.105 | 0.897 | ||
Race (African American) | –0.049 | –0.181, 0.084 | 0.469 | ||
BMI | 0.327 | 0.271, 0.383 | <0.001 | ||
Former Smoker | 0.911 | 0.223, 1.600 | 0.010 | ||
Current Smoker | 1.390 | 0.697, 2.083 | <0.001 | ||
% Emphysema | 0.028 | –0.028, 0.083 | 0.328 | ||
Scanner - Philips | 0.404 | 0.123, 0.686 | 0.005 | ||
Scanner - Siemens | 0.110 | 0.005, 0.216 | 0.041 | ||
Common Protein Score | 0.076 | 0.023, 0.130 | 0.005 | ||
ERF + Common Gene Score | 0.153 | 0.147 | |||
Intercept | –1.169 | –1.861, –0.477 | 0.001 | ||
Age | 0.021 | –0.035, 0.077 | 0.462 | ||
Sex (Female) | 0.007 | –0.092, 0.106 | 0.893 | ||
Race (African American) | –0.083 | –0.216, 0.050 | 0.221 | ||
BMI | 0.343 | 0.288, 0.398 | <0.001 | ||
Former Smoker | 0.932 | 0.242, 1.623 | 0.008 | ||
Current Smoker | 1.423 | 0.729, 2.117 | <0.001 | ||
% Emphysema | 0.023 | –0.033, 0.079 | 0.421 | ||
Scanner - Philips | 0.387 | 0.105, 0.669 | 0.007 | ||
Scanner - Siemens | 0.134 | 0.029, 0.239 | 0.013 | ||
Common Gene Score | 0.035 | –0.017, 0.086 | 0.185 | ||
ERF + Common Scores | 0.158 | 0.151 | |||
Intercept | –1.128 | –1.819, –0.437 | 0.001 | ||
Age | 0.032 | –0.025, 0.088 | 0.270 | ||
Sex (Female) | 0.007 | –0.092, 0.106 | 0.890 | ||
Race (African American) | –0.061 | –0.195, 0.073 | 0.373 | ||
BMI | 0.320 | 0.262, 0.377 | <0.001 | ||
Former Smoker | 0.899 | 0.210, 1.588 | 0.011 | ||
Current Smoker | 1.385 | 0.692, 2.078 | <0.001 | ||
% Emphysema | 0.027 | –0.028, 0.083 | 0.335 | ||
Scanner - Philips | 0.411 | 0.129, 0.693 | 0.004 | ||
Scanner - Siemens | 0.114 | 0.008, 0.220 | 0.035 | ||
Common Protein Score | 0.074 | 0.021, 0.128 | 0.006 | ||
Common Gene Score | 0.031 | –0.020, 0.083 | 0.237 | ||
ERF + Subgroup Protein Score | 0.165 | 0.159 | |||
Intercept | –1.071 | –1.760, –0.383 | 0.002 | ||
Age | 0.014 | –0.041, 0.070 | 0.614 | ||
Sex (Female) | 0.007 | –0.091, 0.105 | 0.891 | ||
Race (African American) | –0.035 | –0.167, 0.097 | 0.605 | ||
BMI | 0.312 | 0.256, 0.368 | <0.001 | ||
Former Smoker | 0.860 | 0.174, 1.546 | 0.014 | ||
Current Smoker | 1.335 | 0.645, 2.025 | <0.001 | ||
% Emphysema | 0.030 | –0.025, 0.086 | 0.284 | ||
Scanner - Philips | 0.418 | 0.137, 0.698 | 0.004 | ||
Scanner - Siemens | 0.079 | –0.027, 0.185 | 0.146 | ||
Subgroup Protein Score | 0.126 | 0.072, 0.179 | <0.001 | ||
ERF + Subgroup Gene Score | 0.153 | 0.147 | |||
Intercept | –1.169 | –1.861, –0.477 | 0.001 | ||
Age | 0.021 | –0.035, 0.077 | 0.459 | ||
Sex (Female) | 0.007 | –0.092, 0.106 | 0.893 | ||
Race (African American) | –0.083 | –0.217, 0.050 | 0.220 | ||
BMI | 0.343 | 0.288, 0.398 | <0.001 | ||
Former Smoker | 0.932 | 0.242, 1.622 | 0.008 | ||
Current Smoker | 1.423 | 0.729, 2.117 | <0.001 | ||
% Emphysema | 0.023 | –0.033, 0.079 | 0.421 | ||
Scanner - Philips | 0.387 | 0.105, 0.669 | 0.007 | ||
Scanner - Siemens | 0.134 | 0.029, 0.239 | 0.013 | ||
Subgroup Gene Score | 0.035 | –0.016, 0.087 | 0.180 | ||
ERF + Subgroup Scores | 0.166 | 0.159 | |||
Intercept | –1.064 | –1.752, –0.375 | 0.003 | ||
Age | 0.015 | –0.041, 0.070 | 0.610 | ||
Sex (Female) | 0.007 | –0.091, 0.106 | 0.885 | ||
Race (African American) | –0.045 | –0.179, 0.088 | 0.504 | ||
BMI | 0.306 | 0.249, 0.363 | <0.001 | ||
Former Smoker | 0.850 | 0.164, 1.536 | 0.015 | ||
Current Smoker | 1.332 | 0.642, 2.022 | <0.001 | ||
% Emphysema | 0.030 | –0.025, 0.085 | 0.290 | ||
Scanner - Philips | 0.423 | 0.142, 0.704 | 0.003 | ||
Scanner - Siemens | 0.083 | –0.024, 0.189 | 0.129 | ||
Subgroup Protein Score | 0.124 | 0.070, 0.177 | <0.001 | ||
Subtype Gene Score | 0.027 | –0.024, 0.078 | 0.304 |
Models were fit on all participants in the COPD Data set and adjusted for age, sex, race, BMI, smoking status, percent emphysema, and scanner make (|$N=1374$|). Scores were developed using the “stable” proteins and genes selected by HIP (Random). ERF = Established Risk Factors (Age, Sex, Race, BMI, and Smoking Status).
Effect of common and sex-specific genes and proteins on AWT
Finally, we created common and sex-specific protein and gene scores from the “stable” proteins and genes selected by HIP (Random) and assessed whether these scores improved the prediction of AWT beyond some established COPD risk factors. We created the common protein score for subject |$i$| as CommonProtScore|$_{i}=\sum _{j=1}^{\# common~proteins}w_{j}x_{ij}^{1}$|, where |$x_{ij}^{1}$| is subject |$i$|’s protein expression value for the |$j$|th common protein (i.e. the |$ij$|th entry for the protein data, |$\boldsymbol X^{1}$|), and |$w_{j}$| is the weight for protein |$j$|. Each protein weight, |$w_{j}$|, was obtained via bootstrap. Specifically, we obtained 200 bootstrap datasets, and for each bootstrap dataset, we obtained regression coefficients and standard errors from univariate regression models of AWT and each of the common proteins identified. This resulted in 200 regression coefficients and standard errors which we combined using a weighted mean. The subgroup-specific scores were also obtained in a similar fashion. The scores were standardized to have mean 0 and variance 1 in each subgroup since different variables were identified for males and females.
Once the scores were created, we fit several multiple linear regression models on the full data: (i) Established Risk Factors (ERF) Model, (ii) ERF + Common Protein Score, (iii) ERF + Common Gene Score, (iv) ERF + Common Protein and Gene Scores, (v) ERF + Subgroup Protein Score, (vi) ERF + Subgroup Gene Score, and (vii) ERF + Subgroup Protein and Gene Scores. Table 3 shows the coefficient estimates with confidence intervals and |$P$|-values. We observe both the common and subgroup-specific protein scores are statistically significant, but neither the common nor subgroup-specific gene scores were. This could be due to including too few genes in the scores or because there was a large overlap between the genes selected for males and females. The “stable” proteins we identified to be common and specific to males and females could potentially be explored to further our understanding of sex differences in COPD mechanisms.
Conclusion
We have tackled the problem of accounting for subgroup heterogeneity in an integrative analysis framework. Motivated by the COPDGene study and a scientific need to understand sex differences in COPD, we developed appropriate statistical methods that leverage the strengths of multi-view data, account for subgroup heterogeneity, incorporate clinical covariates, and combine the association step with a clinical outcome step to guide the selection of clinically meaningful molecular signatures. Through the use of a hierarchical penalty, we identify omics signatures that are common and subgroup-specific and can predict a clinical outcome. HIP showed comparable to substantially improved prediction and variable selection performance in simulation settings when compared with existing methods.
When we applied HIP to genomic and proteomic data from COPDGene, we identified protein and gene biomarkers and pathways common and specific to males and females. When the proteins and genes were developed into scores, the common and subgroup-specific protein scores were statistically significant predictors of AWT even when including established risk factors of COPD. These findings suggest that the proteins and genes identified to be common and specific to males and females could be explored to further our understanding of sex differences in COPD mechanisms.
Recently, [31] also explored gene signatures related to AWT and found that interferon stimulated genes were associated with AWT. We did not find these same genes in our analysis, but there were several differences in the analyses that could explain the differing results: (i) the subset of COPDGene participants in the two analyses was different as we only included participants with COPD, while [31] included participants with and without COPD, (ii) [31] looked for associations between individual genes and AWT while adjusting for covariates, whereas we selected genes based on rankings from our model that included several genes at once, (iii) we considered both gene and protein data (which also impacted which participants we could include), whereas [31] only considered genes, and (iv) we use IPA [26] to find pathways, whereas [31] used MSigDB (https://www.gsea-msigdb.org/gsea/msigdb).
HIP has some limitations that warrant further research. First, the number of variables to be kept for the subset model refit has to be specified. In simulations where this value is known, performance is very good, but the truth will not be known in applied settings. Users could look at plots of the weights from the |$\hat{\boldsymbol B}^{d,s}$| to see how many variables seem to have large weights. We also found that if there were some splits where the train MSEs were very small but test MSEs very large, i.e. evidence of overfitting that more variables needed to be retained. Second, the tuning range for |$\lambda _{\xi }$| and |$\lambda _{g}$| is not determined by the data, so the tuning range may need to be adjusted to attain optimal sparsity. This can be done with an optional parameter in the code. Additionally, the number of components, |$K$|, needs to be specified. Although the truth can never be known, we provide an automatic method to select |$K$| and discuss other options in the supplemental material. Future research should explore the possibility that |$K$| may differ by data view. Finally, HIP is limited to cross-sectional data, but future work could extend it to accommodate longitudinal data to determine whether trends in some outcome vary by subgroup. Despite these limitations, HIP advances statistical methods for joint association and prediction of multi-view data, and the encouraging simulation and real data findings motivate further applications.
Many complex diseases are characterized by subgroup differences, yet most existing integrative analysis methods ignore differential outcomes in subgroups. We have tackled the problem of accounting for subgroup heterogeneity in data integration.
We have developed HIP, a statistical method that leverages the strengths of multi-view data, accounts for subgroup heterogeneity, incorporates clinical covariates, and couples data integration with a clinical outcome prediction, yielding clinically meaningful molecular signatures and enhancing interpretability.
We have conducted rigorous simulation studies and found HIP to be competitive. The COPD data application showcases the potential application of HIP in identifying molecular signatures shared by, and unique to, males and females, and predictive of airway wall thickness.
We have developed efficient PyTorch codes for users with programming expertise, and a Shiny App, for users without programming expertise, to promote research inclusion and accessibility.
Author contributions
S.E.S. and Q.L. conceived of the idea. S.E.S., J.B., and L.E. developed the methods. J.B. and S.E.S. developed code to implement the methods. J.B. and L.V. conducted simulations and real data analyses. J.B. and C.W. interpreted results from the real data analyses. J.B. and S.E.S. wrote a first draft of the paper. All authors read and edited the final manuscript.
Conflict of interest: The authors declare that they have no competing interests.
Funding
This work was supported by National Center For Advancing Translational Science (5KL2TR002492), National Institute Of General Medical Sciences (1R35GM142695), and NHLBI U01 HL089897 and U01 HL089856. The COPDGene study (NCT00608764) is also supported by the COPD Foundation through contributions made to an Industry Advisory Committee that has included AstraZeneca, Bayer Pharmaceuticals, Boehringer-Ingelheim, GlaxoSmithKline, Genentech, Novartis, Pfizer, and Sunovion. The views expressed in this article are those of the authors and do not reflect the views of the United States Government, the Department of Veterans Affairs, the funders, the sponsors, or any of the authors’ affiliated academic institutions.
Data availability
Access to the clinical and genomic data can be requested through dbGaP (IDs: phs000951.v4.p4 and phs000179.v6.p2). The proteomic data can be requested from the COPDGene Study Group (http://www.copdgene.org/).
The Python source code for implementing the methods and generating simulated data along with README files is available on GitHub at https://github.com/lasandrall/HIP. We interface the Python code with the R programming language and provide an R-package for users without programming experience in Python. A Shiny Application of HIP for users with limited programming experience can be found at https://multi-viewlearn.shinyapps.io/HIP_ShinyApp/. We provide a video tutorial at https://youtu.be/O6E2OLmeMDo to facilitate the use of HIP.
Ethics approval and consent to participate
This research uses previously collected, de-identified data from the COPDGene Study [7], a multi-center study with 21 clinical sites each with local IRB approval (NCT00608764).
Consent for publication
Not applicable
References