Abstract

Epidemiologic and genetic studies in many complex diseases suggest subgroup disparities (e.g. by sex, race) in disease course and patient outcomes. We consider this from the standpoint of integrative analysis where we combine information from different views (e.g. genomics, proteomics, clinical data). Existing integrative analysis methods ignore the heterogeneity in subgroups, and stacking the views and accounting for subgroup heterogeneity does not model the association among the views. We propose Heterogeneity in Integration and Prediction (HIP), a statistical approach for joint association and prediction that leverages the strengths in each view to identify molecular signatures that are shared by and specific to a subgroup. We apply HIP to proteomics and gene expression data pertaining to chronic obstructive pulmonary disease (COPD) to identify proteins and genes shared by, and unique to, males and females, contributing to the variation in COPD, measured by airway wall thickness. Our COPD findings have identified proteins, genes, and pathways that are common across and specific to males and females, some implicated in COPD, while others could lead to new insights into sex differences in COPD mechanisms. HIP accounts for subgroup heterogeneity in multi-view data, ranks variables based on importance, is applicable to univariate or multivariate continuous outcomes, and incorporates covariate adjustment. With the efficient algorithms implemented using PyTorch, this method has many potential scientific applications and could enhance multiomics research in health disparities. HIP is available at https://github.com/lasandrall/HIP, a video tutorial at https://youtu.be/O6E2OLmeMDo and a Shiny Application at https://multi-viewlearn.shinyapps.io/HIP_ShinyApp/ for users with limited programming experience.

Introduction

Epidemiologic and genetic studies suggest subgroup (e.g. sex) disparities exist for many complex diseases. Subgroups of a population can present similar symptoms but have different clinical courses and respond to therapy differently. By determining factors predictive of an outcome for each subgroup, we can better personalize treatments to improve patient outcomes.

Consider chronic obstructive pulmonary disease (COPD), a chronic progressive disease affecting more than 16 million adults, presenting a substantial and increasing economic and social burden [1]. Although tobacco smoking is the leading environmental risk factor for COPD, even in heavy smokers fewer than 50% develop COPD [2], genetics [3], environmental exposures [4], inflammation [5], and other factors [6] predispose individuals to develop COPD. The Genetic Epidemiology of COPD (COPDGene) Study [7] is one of the largest studies to investigate the underlying genetic factors of COPD to understand why certain smokers develop COPD while others do not. Research suggests that sex disparities exist in COPD mechanisms [8]. A meta-analysis of 11 studies showed that female smokers, even if smoking fewer cigarettes, had a faster annual decline in forced expiratory volume in one second (FEV|$_{1}$|⁠) [9]. A study using COPDGene data found that women smokers tended to have higher airway wall thickness (AWT) compared with male smokers [10], likely explaining some of the sex differences in the prevalence of COPD. Women with severe COPD may be at higher risk for hospitalization and death [11]. These studies primarily used data from one source, so combining data from multiple sources has the potential to reveal new insights into sex differences in COPD mechanisms. Motivated by the crucial scientific need to understand sex differences in COPD, we leverage the strengths from multiple data views from the COPDGene Study to identify genes and proteins ’common among’ and ’specific to’ males and females contributing to variation in AWT.

Existing methods for integrating data from multiple views are inadequate for our problem as they do not account for subgroup heterogeneity. In particular, one-step methods have been proposed for joint association of data from multiple views and simultaneous prediction of an outcome [12–14]. To do this, one would build a separate integrative analysis model for each subgroup to determine the important multidimensional variables that are associated and predictive of the outcome for each subgroup. While this approach is intuitive, it is limited by the sample size for each subgroup and does not pool information across subgroups making estimation challenging. This is especially true for high-dimensional data settings where the number of variables is larger than the sample size for each subgroup. Another approach that makes use of samples in all subgroups is to apply these one-step methods on the combined subgroup data, but this precludes us from examining whether such heterogeneity exists.

The need to account for subgroup heterogeneity has been recognized and studied in the case where there is only one data view. Reference [15] proposes the Joint Lasso to jointly estimate regression coefficients for different subgroups while allowing for the identification of subgroup-specific features and also encouraging similarity between subgroup-specific coefficients. In [16], the authors proposed the meta lasso for feature selection for different studies (in our application, subgroups) that incorporates a hierarchical penalty to borrow strength across different studies while allowing for feature selection flexibility. The goal of the meta lasso is to combine data sets with the same variables measured on distinct subjects from separate studies to improve variable selection across all data sets when considering a binary outcome; it is not specifically designed to account for subgroup heterogeneity and does not consider multiple data views obtained on the same set of subjects. To use these existing methods, one would stack the different data views for each subgroup; this approach assumes the many variables across the data views are independent and ignores the overall dependency structure among the different views.

We make three main contributions in this article. First, we propose integrative analysis and prediction methods that account for subgroup heterogeneity and are appropriate for our motivating data by modifying the hierarchical penalty proposed in [16] to improve the power to identify common and subgroup-specific features (Fig. 1). Second, the methods we propose, called Heterogeneity in Integration and Prediction (HIP), allow for one or more continuous outcomes and can force specified covariates into the model giving HIP more flexibility than current methods. Third, we develop computationally efficient algorithms using PyTorch [17]. We apply the methods to our motivating data from the COPDGene Study to identify genes and proteins common across and specific to males and females and associated with AWT. We then explore enriched pathways and the ability of these omics biomarkers to predict AWT beyond some established COPD risk factors.

Simplified overview of general data integration and HIP; HIP allows to select variables unique to a particular subgroup and shared between subgroups compared withgeneral data integration methods whose variable selection approach is agnostic to heterogeneity existing in subgroups.
Figure 1

Simplified overview of general data integration and HIP; HIP allows to select variables unique to a particular subgroup and shared between subgroups compared withgeneral data integration methods whose variable selection approach is agnostic to heterogeneity existing in subgroups.

The remainder of the paper is structured as follows. In Methods section 2, we present the proposed methods (HIP). In Simulation section 3, we conduct simulation studies to assess the performance of HIP in comparison with existing methods. In Real data analysis section 4, we apply HIP to data from the COPDGene Study. We conclude with some brief discussion in Conclusion section 5.

Materials and Methods

Notation and problem

Suppose we have |$D$| views (e.g. genomics, proteomics, clinical) with |$p_{d}$| variables measured on the same |$N$| subjects. Each view has |$s =1,\ldots ,S$| subgroups known a priori, each with sample size |$n_{s}$|⁠, where |$N=\sum _{s=1}^{S}n_{s}$|⁠. For subgroup |$s$| and view |$d$|⁠, |$\boldsymbol X^{d,s} \in \Re ^{n_{s} \times p_{d}}$| represents the data matrix. Assume we also have outcome data for each subgroup. For a continuous outcome(s) (e.g. AWT), we have matrix |$\boldsymbol Y^{s} \in \Re ^{n_{s} \times q}$|⁠, where |$q$| is the number of outcomes. Our primary goal is to perform integrative analysis that considers the overall dependency structure among views, predicts an outcome, incorporates feature ranking, and accounts for subgroup heterogeneity to identify common and subgroup-specific variables contributing to variation in the outcome. Figure 1 is a visual representation of the proposed method, HIP, compared with general data integration methods.

Integration of multi-view data

To relate the views within each subgroup, we assume that there are subgroup-specific scores (⁠|$\boldsymbol Z^{s}$|⁠) that drive the dependency structure among the views. Then, each view is written as the product of the subgroup-specific scores |$\boldsymbol Z^{s} \in R^{n_{s} \times K}$| and a matrix of view and subgroup-specific loadings |$\boldsymbol B^{d,s} \in R^{p_{d} \times K}$| plus a matrix of errors: |$\boldsymbol X^{d,s} = \boldsymbol Z^{s} {\boldsymbol B^{d,s}}^{T} + \boldsymbol E^{d,s}$|⁠. Here, |$K$| is the number of components used to approximate each view. The |$\boldsymbol Z^{s}$| incorporates the correlation across the |$D$| views for subgroup |$s$|⁠, and |$\boldsymbol E^{d,s}$| accounts for the remaining variability unique to view |$d$| for subgroup |$s$|⁠. In optimizing |$\boldsymbol Z^{s}$| and |${\boldsymbol B^{d,s}}$|⁠, we want to minimize the error in reconstructing |$\boldsymbol X^{d,s}$|⁠, i.e. |$\boldsymbol E^{d,s}$|⁠, via the loss function |$F(\boldsymbol X^{d,s}, \boldsymbol Z^{s}, \boldsymbol B^{d,s})= \|\boldsymbol X^{d,s} - \boldsymbol Z^{s} {\boldsymbol B^{d,s}}^{T}\|_{F}^{2}$|⁠. For a random matrix |$\boldsymbol A$|⁠, |$\|\boldsymbol A\|_{F}^{2}$| is the square of the Frobenius norm defined as trace(⁠|$\boldsymbol A^{T} \boldsymbol A$|⁠).

The decomposition of |$\boldsymbol X^{d,s}$| in our approach is motivated by a principal components framework rather than a factor analytic framework as we do not impose any distribution on |$\boldsymbol Z^{s}$| or |$\boldsymbol E^{d,s}$|⁠. Typically, we would require |${\boldsymbol B^{d,s}}^{T} {\boldsymbol B}^{d,s} = \boldsymbol I$|⁠, and |${\boldsymbol Z^{s}}^{T} {\boldsymbol Z^{s}} = \boldsymbol I$| for uniqueness, but we do not require these constraints because we are interested in whether a variable’s estimated coefficients in |$\boldsymbol B^{d,s}$| are zero or not. Since we propose to use a penalty that encourages row-sparsity, we preserve the sparsity pattern in |$\boldsymbol B^{d,s}$| over matrix multiplication. Furthermore, we only use |$\boldsymbol Z^{s}$| to predict the clinical outcome and not to make inference on the estimates in |$\boldsymbol Z^{s}$|⁠.

Hierarchical penalty for common and subgroup-specific feature ranking

A main goal in this paper is to identify common and subgroup-specific features associated with an outcome. Based on the hierarchical reparameterization proposed in [16], we decompose |$\boldsymbol B^{d,s}$| as the element-wise product of |$\boldsymbol G^{d}$| and |$\boldsymbol \Xi ^{d,s}$|⁠, i.e. |$\boldsymbol B^{d,s} = \boldsymbol G^{d} \cdot \boldsymbol \Xi ^{d,s}$| for |$d=1,\ldots ,D$| and |$s=1,\ldots ,S$| to estimate effects that are common across subgroups using |$\boldsymbol G^{d} \in \Re ^{p_{d} \times K}$| while also allowing for heterogeneity in the subgroups through |$\boldsymbol \Xi ^{d,s} \in \Re ^{p_{d} \times K}$|⁠. If there is no heterogeneity, then |$\boldsymbol \Xi ^{d,s}$| is a matrix of ones for all |$s$|⁠, and |$\boldsymbol G^{d} = \boldsymbol B^{d}$|⁠, i.e. the view-specific loadings are the same for all subgroups. In estimating |$\boldsymbol G^{d}$|⁠, we borrow strength across subgroups for increased power. In this reparameterization, exact values of |$\boldsymbol G^{d}$| and |$\boldsymbol \Xi ^{d,s}$| are not identifiable but also are not directly needed for variable ranking.

We use regularization to induce sparsity by adding the block |$l_{1}/l_{2}$| penalty on |$\boldsymbol G^{d}$| and |$\boldsymbol \Xi ^{d,s}$|

(1)

Here, |$\boldsymbol g_{l}^{d}$| and |$\boldsymbol \xi _{l}^{d,s}$| are the |$l$|th rows in |$\boldsymbol G^{d}$| and |$\boldsymbol \Xi ^{d,s}$|⁠, respectively, and are each length |$K$|⁠. By imposing the block |$l_{1}/l_{2}$| penalty on the rows of |$\boldsymbol G^{d}$| and |$\boldsymbol \Xi ^{d,s}$|⁠, the |$K$| components are considered as a group, encouraging variables to be selected in all |$K$| components or not to be selected. This is desirable because the selection of variables is not component-dependent and thus appropriate for variable screening. This differs from the original hierarchical penalty reparameterization proposed in [16], which imposes an |$l_{1}$| penalty. Both |$\lambda _{G}$| and |$\lambda _{\xi }$| are tuning parameters controlling feature selection. Specifically, |$\lambda _{G}$| controls feature selection for all subgroups combined and encourages removal of variables that are not important for all |$S$| subgroups. Also, |$\lambda _{\xi }$| encourages feature selection for each subgroup. Further details on the selection of |$\lambda _{G}$| and |$\lambda _{\xi }$| are given in the Supplementary Material. The |$\gamma _{d}$| is a user-specified indicator for whether the view should be penalized. This is to allow some views, such as a set of clinical covariates, to be forced into the model to guide the selection of other important variables, which can result in better prediction of the outcome.

Relating shared scores to clinical outcome(s)

Besides identifying the common and subgroup-specific features, we aim to predict a clinical outcome while allowing for heterogeneity in effects based on the subgroup and multi-view data. We assume that the outcome is related to the views only through the shared scores, i.e. |$Z^{s}$|⁠, for each subgroup. This allows us to couple the problem of associations among different views and predicting an outcome. We relate the outcome to |$\boldsymbol Z^{s} $| by minimizing a loss function |$\sum _{s=1}^{S} F(\boldsymbol Y^{s}, \boldsymbol Z^{s}, \boldsymbol \Theta , \beta _{0})$|⁠. For continuous outcome(s), |$F(\boldsymbol Y^{s}, \boldsymbol Z^{s}, \boldsymbol \Theta , \beta _{0}) = || \boldsymbol Y^{s} - (\beta _{0} + \boldsymbol Z^{s} \boldsymbol \Theta )||_{F}^{2}$|⁠, where |$\boldsymbol \Theta \in \Re ^{K \times q}$| are regression coefficients.

We can impose the constraint that the columns of |$\boldsymbol Z^{s}$| are uncorrelated (⁠|$\boldsymbol Z^{s^{T}}\boldsymbol Z^{s}=\boldsymbol I$|⁠) so each of the |$K$| components provides unique information. Of note, our goal is not to interpret the coefficients |$\boldsymbol \Theta $| as in regression analysis; our goal is to rank the variables corresponding to those coefficients. In our applications, we standardize each column of |$\boldsymbol Y^{s}$| to have mean |$0$| and variance |$1$| at the subgroup level, but this is not necessary because of the estimation of the intercept |$\beta _{0}$|⁠.

Our proposal to model one or more continuous outcomes in a one-step integrative analysis model that provides predictions based on rank-selected features, allowing for subgroup heterogeneity in those features, is novel and will be of use in many scientific applications.

Joint model for integration and prediction

Typical integration and prediction methods follow two steps. First, the subgroup-specific scores |$\boldsymbol Z^{s}$| and view and subgroup-specific loadings |$\boldsymbol B^{d,s}$| (hence common and subgroup-specific variables, |$\boldsymbol G^{d}$| and |$\boldsymbol \Xi ^{d,s}$|⁠) are learned. Second, the learned |$\boldsymbol Z^{s}$| are associated with the outcome in a regression model. Since these steps are independent, the common and subgroup-specific variables identified may not be meaningfully connected to the clinical outcome. To overcome this limitation, we use the outcome to guide the selection of the common and subgroup-specific variables in a joint model. Thus, we use HIP to estimate the following: view and subgroup-specific loadings |$\boldsymbol B^{d,s}$| (⁠|$\boldsymbol G^{d}$|⁠, the common variables, and |$\boldsymbol \Xi ^{d,s}$|⁠, the subgroup-specific variables), the subgroup-specific scores shared across views (⁠|$\boldsymbol Z^{s}$|⁠), and the regression estimates (⁠|$\boldsymbol \Theta $|⁠, |$\beta _{0}$|⁠). To obtain these estimates, we combine the outcome loss function, the multi-view loss function, and the regularization penalty to minimize the following overall loss function:

(2)

Although versions of the hierarchical penalty have been used before, our paper is among the first to use this penalty in joint association and prediction studies for data from multiple views to account for common and subgroup-specific variation and to extract subgroup-specific features and/or clinical variables.

Because sparsity will not be induced directly due to numerical limitations of the automatic differentiation algorithm, we identify important variables by ranking according to the |$L_{2}$| norm of the corresponding row in |$\hat{\boldsymbol B}^{d,s}$|⁠. The user specifies the number of variables (denote as |$N_{top}$|⁠) they wish to keep; this value can vary across data views. Algorithm 1 (Supplementary Material) is run once on the full training data and the |$N_{top}$| variables are selected for each view and subgroup based on the estimated |$\boldsymbol B^{d,s}$|⁠. We then run Algorithm 1 a second time including only these selected variables. This ”subset” result should be used when applying the prediction procedure.

The optimization problem in (2) is multi-convex in |$\boldsymbol B^{d,s}$|⁠, |$\boldsymbol Z^{s}$|⁠, and |$\boldsymbol \Theta $| each but jointly non-convex. As such, we are not guaranteed convergence to a global minimum. A local optimum can be found by iteratively minimizing over each of the optimization parameters with the rest of the optimization parameters fixed. Overall algorithm convergence is determined by the relative change in the objective function in (2) without the penalty terms. The full algorithm is summarized in the Supplementary Material. We also describe in the Supplementary Material our approach for selecting hyper-parameters.

Prediction

In order to predict an outcome on new data (say |$\boldsymbol X^{d,s}_{test}$|⁠), we first predict the test shared component, |$\boldsymbol Z^{s}_{pred}$|⁠, and then use this information to predict the outcome. To predict |$\boldsymbol Z^{s}_{pred}$|⁠, we learn the model defined by (2) on the training data (i.e. |$\boldsymbol X^{d,s}_{train}$|⁠) and obtain the learned estimates |$\hat{\boldsymbol B}^{d,s}$|⁠, |$\hat{\boldsymbol \Theta }$|⁠, and |$\hat{\beta }_{0}$|⁠. Using these estimates and the testing data |$\boldsymbol X^{d,s}_{test}$|⁠, we solve the problem in (3).

(3)

Without an orthogonality condition on |$\boldsymbol Z^{s}$|⁠, the solution of this problem has a closed form given by |$\hat{\boldsymbol Z}^{s}_{pred} = \boldsymbol X_{cat}^{s} {\hat{\boldsymbol B}_{cat}^{s}}({\hat{\boldsymbol B}_{cat}^{s^{T}}} {\hat{\boldsymbol B}_{cat}^{s}})^{-1}$| for |$s=1,\ldots ,S$|⁠. Here, |$\boldsymbol X_{cat}^{s}$| is an |$n_{s} \times \{p_{1} + \cdots + p_{d}\}$| matrix that concatenates all |$D$| views for subgroup |$s$|⁠, i.e. |$X_{cat}^{s}=[ \boldsymbol X^{1,s}_{test},\cdots ,\boldsymbol X^{D,s}_{test}]$|⁠. Similarly, |$\hat{\boldsymbol B}_{cat}= [\hat{\boldsymbol B}^{1,s},\cdots , \hat{\boldsymbol B}^{D,s}]$| and is a |$\{p_{1} + \cdots + p_{D} \} \times K$| matrix of variable coefficients. We add a small multiple of the identity matrix before taking the inverse to help with stability, although we note that the inverse is of a |$K \times K$| matrix, which is computationally inexpensive since |$K$| is typically small. With an orthogonality condition on |$\boldsymbol Z^{s}$|⁠, the above optimization for |$\hat{\boldsymbol Z}^{s}_{pred}$| is an orthogonal Procrustes problem [18]. Let the singular value decomposition of |$\boldsymbol X_{cat}^{s} {\hat{\boldsymbol B}_{cat}^{s}}$| be |$\boldsymbol U \boldsymbol D \boldsymbol V^{T}$|⁠. Then, |$\hat{\boldsymbol Z}^{s}_{pred} = \boldsymbol U \boldsymbol V^{T}$|⁠. Once we have obtained |$\hat{\boldsymbol Z}^{s}_{pred}$|⁠, we predict a continuous outcome |$\hat{\boldsymbol Y}^{s}_{pred}$| as |$\hat{\boldsymbol Y}^{s}_{pred} = \hat{\beta }_{0} + \hat{\boldsymbol Z}_{pred}^{s} \hat{\boldsymbol \Theta }$|⁠.

Simulations

Set-up

We performed simulations for a single continuous outcome with two views and two subgroups, i.e. |$D=S=2$|⁠. There were |$n_{1} = 250$| subjects in the first subgroup and |$n_{2} = 260$| subjects in the second. There were two different scenarios to test the ability of the algorithm to perform variable ranking and prediction: Full Overlap and Partial Overlap (Fig. 2). In the Full Overlap scenario, the signal variables for each subgroup completely overlapped, i.e. the first |$50$| variables of the |$\boldsymbol{B}^{d,s}$| matrices were important for both subgroups. We expect competing methods to perform relatively well in this scenario as there is no subgroup heterogeneity. In the Partial Overlap scenario, the first |$50$| variables are important for the first subgroup. Of these |$50$|⁠, the last |$25$| are also important to the second subgroup in addition to the |$25$| subsequent variables. We expect some deterioration in the ability of the comparison methods to select the appropriate variables due to subgroup heterogeneity. We also considered simulation settings where (i) the sample sizes for the two groups were unbalanced and (ii) the error term for each view was non-Gaussian, to investigate the robustness of the proposed method. Please refer to the supporting information for simulation results for these scenarios.

Visual representation of variable overlap scenarios; in the Full Overlap scenario, the signal variables for each subgroup completely overlapped, i.e. the same variables were important for both subgroups, and in the Partial Overlap Scenario, half of the variables important to each subgroup are the same, and the remaining important variables are unique to each subgroup.
Figure 2

Visual representation of variable overlap scenarios; in the Full Overlap scenario, the signal variables for each subgroup completely overlapped, i.e. the same variables were important for both subgroups, and in the Partial Overlap Scenario, half of the variables important to each subgroup are the same, and the remaining important variables are unique to each subgroup.

For each example and scenario, there were three different numbers of variables in the data sets with |$p_{d}$| indicating the number of variables in view |$d$|⁠. In the P1 setting, |$p_{1} = 300$| and |$p_{2} = 350$|⁠. In the P2 setting, |$p_{1} = 1000$| and |$p_{2} = 1500$|⁠. Finally, in the P3 setting, |$p_{1} = 2000$| and |$p_{2} = 3000$|⁠. For these simulations, |$K$| was fixed to the true value of |$2$|⁠. We also set |$N_{top}$| to the true value of 50 for all simulations.

The data generation process is based on [14]. First, the |$\boldsymbol{B}^{d,s}$| matrices are generated according to the Full or Partial Overlap scenario. If the entry corresponds to a signal variable, it is drawn from a |$U(0.5,1)$| with the sign determined by a draw from a Bernoulli distribution with equal probability; otherwise, it is set to 0. We then orthogonalize the columns of each |$\boldsymbol{B}^{d,s}$|⁠. Next, we generate the entries of |$\boldsymbol Z^{s} \sim N(\mu = 25.0, \sigma = 3.0)$| and |$\boldsymbol{E}^{d,s} \sim N(\mu = 0.0, \sigma = 1.0)$|⁠. Then, the data matrix for subgroup |$s$| in view |$d$| is generated as |$\boldsymbol X^{d,s}$| = |$\boldsymbol Z^{s} \boldsymbol B^{{d,s}^{T}} + \boldsymbol E^{d,s}$|⁠. Finally, the outcome is generated as |$\boldsymbol Y^{s} = \beta _{0} + \boldsymbol Z^{s} \boldsymbol \Theta + \boldsymbol E^{s}$|⁠, where |$\boldsymbol E^{s} \in R^{n_{s} \times 1}$| contains entries from a standard normal distribution. The true value of |$\boldsymbol \Theta = [0.7, 0.2]^{T}$| and |$\beta _{0} = 2.0$|⁠. In the supporting information, we consider additional simulation settings where |$\boldsymbol E^{d,s} \sim t_{\nu }$| (i.e. follows a |$t$| distribution with |$\nu $| degrees of freedom), where |$\nu = 5, 10, 25, 50, 100$|⁠.

Comparison methods

First, we compare our proposed method (HIP) to canonical variate regression (CVR) [14] as implemented in R package CVR [19]. This is a joint association and prediction method for multiple views (though existing code only implements two views), but it does not account for subgroup heterogeneity. Thus, we implement the method in two ways: (i) all subgroups are concatenated in each view (Concatenated CVR) and (ii) a separate model is fit for each subgroup (Subgroup CVR). Second, we compare our method to the Joint Lasso [15] as implemented in R package ref20 [20]. The Joint Lasso does not perform integrative analysis but does account for subgroup heterogeneity. We implement this method on the data stacked over views (Concatenated Joint Lasso), but because Joint Lasso allows for subgroups, we also apply the method on each view separately (Dataset Joint Lasso). Third, we compare our method to the Lasso [21] and Elastic Net [22] as implemented in R package ref23 [23] using both the concatenated and separate subgroup models (Concatenated Lasso/Elastic Net and Subgroup Lasso/Elastic Net, respectively); we stack the two views in each case. For the Elastic Net, we fixed |$\alpha = 0.5$|⁠. In fitting the models, we allowed any non-fixed tuning parameters to be chosen using the default in the corresponding R package. For Joint Lasso, there is no function for choosing the two tuning parameters in the R package, so we implemented a grid search over 55 parameter combinations. We applied each method to the training data sets and predicted the outcome on the test data sets. We do not compare with the meta lasso because it is only available for binary outcomes and thus not applicable to the motivating COPD data.

Evaluation measures

We compare HIP to existing methods in terms of variable selection and prediction. For variable selection, we estimate the true positive rate (TPR), false positive rate (FPR), and F|$1$| score. All are constrained to the range |$[0,1]$|⁠. Note |$\text{TPR} = \frac{\text{True Positives}}{\text{True Positives + False Negatives}}$|⁠, and |$\text{FPR} = \frac{\text{False Positives}}{\text{True Negatives + False Positives}}$|⁠. Also, |$\text{F1} = \frac{\text{True Positives}}{\text{True Positives} + \frac{1}{2}\text{(False Positives + False Negatives)}}$|⁠. Ideally, TPR and F|$1$| are 1 and FPR is |$0$|⁠.

For HIP, the variables are ranked by the |$L_{2}$| norm of the rows in the estimated |$\boldsymbol B^{d,s}$| matrices. For each comparison method, the result includes some kind of regression coefficients, so variables with an estimated coefficient that has been shrunk zero are considered not selected and those with non-zero estimated coefficients are considered selected. For prediction, we estimated test mean squared error (MSE); smaller MSEs indicate better performance. We averaged results over our 20 test data sets.

Simulation Results

In the Full Overlap scenario, we compare HIP (Grid) and HIP (Random) and find similar results. This supports using HIP (Random) over HIP (Grid) as it is faster computationally (Supplementary Fig. S4). Looking at Fig. 3, we note that HIP has a TPR and F|$1$| close to |$1$| and FPR close to |$0$|⁠. CVR is the closest competing method for F|$1$| score. For Joint Lasso, the FPR is fairly high, so it is selecting a lot of unimportant variables. Joint Lasso, Lasso, and Elastic Net have lower TPR values suggesting that these methods are missing important variables. Overall, the competing methods have worse and more variable TPR, FPR, and F|$1$| scores compared with HIP. In terms of prediction, HIP and Subgroup CVR have lower test MSEs than the other methods. Even when subgroups share the same important variables, we see advantages in taking an integrative approach and accounting for heterogeneity. The results are mostly consistent across P1, P2, and P3, although P3 does show some deterioration in performance and increased variability. The Lasso and Elastic net are the fastest in all parameter settings followed by HIP (Random). HIP (Random) shows a larger computational advantage over CVR and Joint Lasso as the number of variables increase (Supplementary Figs S4 and S5). The Partial Overlap scenario results (Supplementary Fig. S3) are similar to the Full Overlap results but show a greater advantage for HIP in variable selection performance. Partial Overlap Scenario results for non-Gaussian errors (Supplementary Fig. S6) are similar to the Partial Overlap Scenario results for Gaussian errors: the prediction performance for HIP was better than, or comparable to, the other methods.

Results for Full Overlap scenario; the first row corresponds to P1 ($p_{1}$ = 300, $p_{2}$ = 350), the second to P2 ($p_{1}$ = 1000, $p_{2}$ = 1500), and the third to P3 ($p_{1}$ = 2000, $p_{2}$ = 3000); for all settings, $n_{1} = 250$ and $n_{2} = 260$, and the right column is test MSE, so a lower value indicates better performance; all results are based on 20 iterations.
Figure 3

Results for Full Overlap scenario; the first row corresponds to P1 (⁠|$p_{1}$| = 300, |$p_{2}$| = 350), the second to P2 (⁠|$p_{1}$| = 1000, |$p_{2}$| = 1500), and the third to P3 (⁠|$p_{1}$| = 2000, |$p_{2}$| = 3000); for all settings, |$n_{1} = 250$| and |$n_{2} = 260$|⁠, and the right column is test MSE, so a lower value indicates better performance; all results are based on 20 iterations.

The subgroup variable selection performance for HIP is better than the methods compared even in the highly unbalanced sample design (i.e. 9:1) (Supplementary Fig. S7). In general, we did observe a deterioration in the variable selection performance for Subgroup 2 (i.e. group with the smaller sample size) as the sample sizes became unbalanced. However, the performance of HIP was better than that of the other methods. In terms of prediction, HIP had consistent and better (or lower) test MSEs (⁠|$\sim 0.25$|⁠) for all sample size designs, but the results for CVR and Joint Lasso varied (Supplementary Fig. S8). HIP generally showed consistent and better subgroup prediction estimates for varying sample sizes (Supplementary Fig. S7). There was some variation in the prediction performance for Subgroup 2 for all methods. Most of the methods had poor prediction performance in Subgroup 2, except HIP (Supplementary Fig. S7).

Real data analysis

Study goals

As mentioned previously, sex disparities exist in COPD susceptibility. In this section, our goal is to use molecular data from the COPDGene Study [7] in combination with clinical data to gain new insights into the molecular architecture of COPD in males and females. Sex was self-reported. We focus on individuals with COPD (defined as GOLD stage |$\ge 1$|⁠) at Year 5 (Phase 2 of the COPDGene) Study who had proteomics, RNA-sequencing, and AWT data available at Year 5. Of the |$N=1376$| individuals with COPD at Year 5 who had complete data, |$n_{1}=782$| were males and |$n_{2}=594$| were females. Table 1 gives some characteristics of subjects who had COPD at Year 5. We assessed for sex differences using t-tests for continuous variables and |$\chi ^{2}$| tests for categorical variables. Subjects were predominantly non-Hispanic white, but there were no sex differences. There were also not sex differences in age, BMI, systolic blood pressure, percentage of current smokers, or percentage with diabetes. Males and females differed in their mean AWT (⁠|$P$||$<.001$|⁠) but did not differ by lung function as measured by mean FEV|$_{1}$|% predicted. Given the available data and the sex differences in AWT, we will (i) identify genes and proteins common and specific to males and females associated with AWT, (ii) explore pathways enriched in the proteins and genes identified for males and females, and (iii) investigate the effect of these proteins and genes on AWT, adjusting for covariates.

Table 1

COPDGene Participant Characteristics; the measurements presented were collected at the Year 5 study visit to align with the collection of proteomic and genomic data collection

VariableMalesFemales|$P$|-value
|$N$| = 782|$N$| = 594
Age68.28 (8.35)68.03 (8.36)0.581
BMI28.03 (5.62)27.69 (6.59)0.317
FEV1 % Predicted61.94 (22.97)62.91 (22.59)0.431
BODE Index2.45 (2.45)2.63 (2.38)0.176
% Emphysema11.30 (11.86)9.39 (11.45)0.003
Pack Years53.05 (26.63)47.57 (24.99)<0.001
AWT1.17 (0.23)1.00 (0.21)<0.001
Non-Hispanic White (%)82780.084
Current Smoker (%)66650.510
Diabetes (%)17140.204
VariableMalesFemales|$P$|-value
|$N$| = 782|$N$| = 594
Age68.28 (8.35)68.03 (8.36)0.581
BMI28.03 (5.62)27.69 (6.59)0.317
FEV1 % Predicted61.94 (22.97)62.91 (22.59)0.431
BODE Index2.45 (2.45)2.63 (2.38)0.176
% Emphysema11.30 (11.86)9.39 (11.45)0.003
Pack Years53.05 (26.63)47.57 (24.99)<0.001
AWT1.17 (0.23)1.00 (0.21)<0.001
Non-Hispanic White (%)82780.084
Current Smoker (%)66650.510
Diabetes (%)17140.204

BMI = Body Mass Index, FEV|$_{1}$| = Forced Expiratory Volume in 1 Second, BODE = Body mass index, airflow Obstruction, Dyspnea, and Exercise capacity.

Table 1

COPDGene Participant Characteristics; the measurements presented were collected at the Year 5 study visit to align with the collection of proteomic and genomic data collection

VariableMalesFemales|$P$|-value
|$N$| = 782|$N$| = 594
Age68.28 (8.35)68.03 (8.36)0.581
BMI28.03 (5.62)27.69 (6.59)0.317
FEV1 % Predicted61.94 (22.97)62.91 (22.59)0.431
BODE Index2.45 (2.45)2.63 (2.38)0.176
% Emphysema11.30 (11.86)9.39 (11.45)0.003
Pack Years53.05 (26.63)47.57 (24.99)<0.001
AWT1.17 (0.23)1.00 (0.21)<0.001
Non-Hispanic White (%)82780.084
Current Smoker (%)66650.510
Diabetes (%)17140.204
VariableMalesFemales|$P$|-value
|$N$| = 782|$N$| = 594
Age68.28 (8.35)68.03 (8.36)0.581
BMI28.03 (5.62)27.69 (6.59)0.317
FEV1 % Predicted61.94 (22.97)62.91 (22.59)0.431
BODE Index2.45 (2.45)2.63 (2.38)0.176
% Emphysema11.30 (11.86)9.39 (11.45)0.003
Pack Years53.05 (26.63)47.57 (24.99)<0.001
AWT1.17 (0.23)1.00 (0.21)<0.001
Non-Hispanic White (%)82780.084
Current Smoker (%)66650.510
Diabetes (%)17140.204

BMI = Body Mass Index, FEV|$_{1}$| = Forced Expiratory Volume in 1 Second, BODE = Body mass index, airflow Obstruction, Dyspnea, and Exercise capacity.

Applying the proposed and competing methods

Blood samples from COPDGene participants at Phase 2 were quantified for proteins and RNA sequencing. Proteomics data were quantified using the SomaScan 5K technology from SomaLogic. The RNA sequencing data were obtained from Illummina. The original data set has 4979 SOMAmers, representing 4776 unique proteins and 19263 genes. We work with SOMAmers and convert our findings to unique proteins. Please refer to the supplementary material for more details on the techniques used to generate the RNASeq and proteomics data. To reduce dimensionality, we first applied unsupervised filtering to select the top 5000 genes and 2000 proteins with the largest standard deviations. In this work, for statistical rigor and to reduce false findings, we used resampling techniques to identify genes and proteins that are consistently selected by our method to be associated with AWT. We refer to these genes and proteins as “stable”. In particular, we generated 50 random splits of the filtered data, stratified by subgroup, such that for each split 75% of the data were the training data and 25% were the testing data. Within each split, we performed supervised filtering by regressing AWT on each of the genes and proteins selected by unsupervised filtering, adjusting for sex, race and pack years, and retained genes and proteins with potential to explain the variation in AWT (uncorrected |$P$|-value |$<.05$|⁠). This means that the variables entering the models could differ for each split of the data.

To select tuning parameters, we set the range of possible values for |$\lambda _{G}$| and |$\lambda _{\xi }$| in HIP to |$(0,2]$| as in the simulations and selected the best model using BIC. Joint Lasso used 10-fold cross-validation over the same grid values used in the simulations. CVR, Lasso, and Elastic Net used 10-fold cross-validation with default settings to select tuning parameters. HIP and CVR both require specification of a rank, i.e. the number of latent components used in the solutions. Our proposed automatic approach (threshold |$=0.25$|⁠; refer to Section 1.1 of Supplementary Information) on the concatenated data suggested |$K=3$|⁠. Interestingly, when applied to each |$\boldsymbol X^{d,s}$| separately, it suggested |$K=3$| for the gene data and |$K=1$| for the protein data. Supplementary Figure S9 shows the scree plots for both the concatenated and separate |$\boldsymbol X^{d,s}$|⁠. Based on these results and the robustness seen in the sensitivity analyses, we selected |$K=3$| components for HIP and CVR. For HIP, we set |$N_{top} = 75$| genes and |$25$| proteins.

For each split of the data, we applied HIP (Grid), HIP (Random), and the subgroup versions of the competing methods used in the simulations. For Elastic Net and Lasso, we stacked the views and ran separate analyses for males and females. For Joint Lasso, we ran separate analyses for the protein and gene data. For CVR, we ran separate analyses for males and females. We used the selected tuning parameters and test datasets to predict AWT and estimate test MSEs. Since the univariate filtering could result in different variables that enter our model for each data split, we combined information on how often a variable passed univariate filtering and served as input into our model and how often that variable was ranked high as relevant by our method. In particular, we considered (i) the number of splits in which the variable was included in the |$N_{top}$| variables and (ii) the proportion of splits in which the variable was included in the |$N_{top}$| variables, i.e. the number of splits in which the variable was included in the |$N_{top}$| variables divided by the number of splits in which the variable was entered into the model after supervised filtering. We took the product of these two terms and ranked the variables in descending order based on this product—the higher the better. A higher product indicates that the variable was frequently selected to differentiate AWT in the supervised univariate filtering approach and was also highly ranked as being associated with AWT in our multivariate HIP method. We then selected the top 1% of genes and proteins according to the product to represent the “stable” genes and proteins. We used this cut-off point because our focus was on identifying a few molecular signatures shared by and unique to men and women and predictive of AWT.

Real Data Results

Average MSEs, and proteins and genes selected

Supplementary Figures S10 and S11 show violin plots of the test MSEs and run times, respectively, from all 50 splits of the data. The average test MSEs from the splits were slightly lower for CVR and Joint Lasso but also used many more variables (Supplementary Table S3). HIP (Random) has a computational advantage over CVR and Joint Lasso.

Supplementary Table S4 shows the number of “stable” common and subgroup-specific genes and proteins identified by each method. We note few overlaps in selected genes and proteins between HIP and existing methods (Supplementary Fig. S12). Supplementary Table S5 compares the variables selected by HIP (Random) and HIP (Grid); the selected genes and proteins are very similar, again supporting the use of the random search instead of the grid search.

Supplementary Tables S6 and S7 list the genes and Supplementary Table S8 lists the proteins identified as “stable” and important to males and females by HIP (Random) including weights for each protein and gene calculated as the |$L_{2}$| norm of coefficients in |$\hat{\boldsymbol B}^{d,s}$| across components (i.e. rows) and averaging over the splits where the variable was selected.

Proteins with large weights include NPLOC4 and SPG21 for males and SMAP1 and CDKN2D for females. [24] introduced a novel method called SubmiRine to analyze miRNA and predict miRNA target site variants (miRNA-TSV). When this method was applied to a subset of genomic samples from patients with COPD from the Lung Genome Research Consortium (LGRC; http://www.lung-genomics.org), SPG21 was the top-scoring miRNA-TSV.

The gene with the largest weight was ADIPOR1 for males and BCL2L1 for females. In a study of 60 male COPD patients and 30 male controls, [25] found adiponectin that is associated with inflammation from COPD evidenced by a positive correlation with IL-8 and a negative correlation with FEV|$_{1}$| %.

Pathway enrichment analysis

We performed pathway enrichment analysis using Ingenuity Pathway Analysis (IPA) [26] to test for overrepresentation of pathways among our lists of “stable” proteins and genes for males and females. The top 10 canonical gene pathways (Table 2) for males and females had some common and some subgroup-specific pathways. The top pathway for males is the iron homeostasis signaling pathway; this is the second ranked pathway for females, and the top pathway for females is heme biosynthesis II. There is strong evidence that disrupted iron homeostasis is associated with the presence and severity of lung disease including COPD [27, 28]. Methylglyoxal degradation I ranks second for males and third for females. Reference [29] performed gene expression profiling on small airway epithelium samples and also found this pathway to be activated in both male and female smokers.

Table 2

Top canonical pathways from IPA enrichment analysis

ViewSubgroupCanonical pathwayMoleculesUnadjusted |$P$|-value
GenesMalesIron homeostasis signaling pathwayCDC34,FECH,SLC25A37.003
Methylglyoxal Degradation IHAGH.006
Heme Biosynthesis from Uroporphyrinogen-III IFECH.008
Pentose Phosphate Pathway (Non-oxidative Branch)RPIA.013
Heme Biosynthesis IIFECH.019
Pentose Phosphate PathwayRPIA.023
Erythropoietin Signaling PathwayBCL2L1,GATA1.054
ID1 Signaling PathwayBCL2L1,GSPT1.068
Sertoli Cell-Sertoli Cell Junction SignalingSPTB,YBX3.069
AutophagyGABARAPL2,SLC1A5.076
FemalesHeme Biosynthesis IIALAS2,FECH<.001
Iron homeostasis signaling pathwayALAS2,CDC34,FECH,SLC25A37<.001
Methylglyoxal Degradation IHAGH.006
Heme Biosynthesis from Uroporphyrinogen-III IFECH.008
Tetrapyrrole Biosynthesis IIALAS2.010
Hypoxia Signaling in the Cardiovascular SystemCDC34,UBE2H.011
Pentose Phosphate Pathway (Non-oxidative Branch)RPIA.013
Pentose Phosphate PathwayRPIA.023
Erythropoietin Signaling PathwayBCL2L1,GATA1.054
ID1 Signaling PathwayBCL2L1,GSPT1.068
ProteinsMalesRole of JAK2 in Hormone-like Cytokine SignalingEPO,LEP,PTPN6<.001
White Adipose Tissue Browning PathwayBDNF,LEP,NPPB<.001
Erythropoietin Signaling PathwayEPO,LEP,PTPN6<.001
Serotonin Receptor SignalingADIPOQ,BDNF,LEP,NPPB<.001
AMPK SignalingADIPOQ,INS,LEP.001
Leptin Signaling in ObesityINS,LEP.002
IL-3 SignalingPPP3R1,PTPN6.002
Maturity Onset Diabetes of Young (MODY) SignalingADIPOQ,INS.002
Thyroid Cancer SignalingBDNF,INS.002
ABRA Signaling PathwayNPPB,PPP3R1.003
FemalesGranulocyte Adhesion and DiapedesisPF4,PPBP,TNFRSF1A<.001
Agranulocyte Adhesion and DiapedesisPF4,PPBP,TNFRSF1A<.001
Wound Healing Signaling PathwayEGF,PF4,TNFRSF1A<.001
Huntington’s Disease SignalingBDNF,CPLX2,EGF.001
Pathogen Induced Cytokine Storm Signaling PathwayPF4,PPBP,TNFRSF1A.002
Glioma SignalingCDKN2D,EGF.004
Type II Diabetes Mellitus SignalingADIPOQ,TNFRSF1A.005
Axonal Guidance SignalingBDNF,EGF,PAPPA.005
Tumor Microenvironment PathwayEGF,TNFRSF1A.007
Regulation Of The Epithelial Mesenchymal Transition By Growth Factors PathwayEGF,TNFRSF1A.008
ViewSubgroupCanonical pathwayMoleculesUnadjusted |$P$|-value
GenesMalesIron homeostasis signaling pathwayCDC34,FECH,SLC25A37.003
Methylglyoxal Degradation IHAGH.006
Heme Biosynthesis from Uroporphyrinogen-III IFECH.008
Pentose Phosphate Pathway (Non-oxidative Branch)RPIA.013
Heme Biosynthesis IIFECH.019
Pentose Phosphate PathwayRPIA.023
Erythropoietin Signaling PathwayBCL2L1,GATA1.054
ID1 Signaling PathwayBCL2L1,GSPT1.068
Sertoli Cell-Sertoli Cell Junction SignalingSPTB,YBX3.069
AutophagyGABARAPL2,SLC1A5.076
FemalesHeme Biosynthesis IIALAS2,FECH<.001
Iron homeostasis signaling pathwayALAS2,CDC34,FECH,SLC25A37<.001
Methylglyoxal Degradation IHAGH.006
Heme Biosynthesis from Uroporphyrinogen-III IFECH.008
Tetrapyrrole Biosynthesis IIALAS2.010
Hypoxia Signaling in the Cardiovascular SystemCDC34,UBE2H.011
Pentose Phosphate Pathway (Non-oxidative Branch)RPIA.013
Pentose Phosphate PathwayRPIA.023
Erythropoietin Signaling PathwayBCL2L1,GATA1.054
ID1 Signaling PathwayBCL2L1,GSPT1.068
ProteinsMalesRole of JAK2 in Hormone-like Cytokine SignalingEPO,LEP,PTPN6<.001
White Adipose Tissue Browning PathwayBDNF,LEP,NPPB<.001
Erythropoietin Signaling PathwayEPO,LEP,PTPN6<.001
Serotonin Receptor SignalingADIPOQ,BDNF,LEP,NPPB<.001
AMPK SignalingADIPOQ,INS,LEP.001
Leptin Signaling in ObesityINS,LEP.002
IL-3 SignalingPPP3R1,PTPN6.002
Maturity Onset Diabetes of Young (MODY) SignalingADIPOQ,INS.002
Thyroid Cancer SignalingBDNF,INS.002
ABRA Signaling PathwayNPPB,PPP3R1.003
FemalesGranulocyte Adhesion and DiapedesisPF4,PPBP,TNFRSF1A<.001
Agranulocyte Adhesion and DiapedesisPF4,PPBP,TNFRSF1A<.001
Wound Healing Signaling PathwayEGF,PF4,TNFRSF1A<.001
Huntington’s Disease SignalingBDNF,CPLX2,EGF.001
Pathogen Induced Cytokine Storm Signaling PathwayPF4,PPBP,TNFRSF1A.002
Glioma SignalingCDKN2D,EGF.004
Type II Diabetes Mellitus SignalingADIPOQ,TNFRSF1A.005
Axonal Guidance SignalingBDNF,EGF,PAPPA.005
Tumor Microenvironment PathwayEGF,TNFRSF1A.007
Regulation Of The Epithelial Mesenchymal Transition By Growth Factors PathwayEGF,TNFRSF1A.008
Table 2

Top canonical pathways from IPA enrichment analysis

ViewSubgroupCanonical pathwayMoleculesUnadjusted |$P$|-value
GenesMalesIron homeostasis signaling pathwayCDC34,FECH,SLC25A37.003
Methylglyoxal Degradation IHAGH.006
Heme Biosynthesis from Uroporphyrinogen-III IFECH.008
Pentose Phosphate Pathway (Non-oxidative Branch)RPIA.013
Heme Biosynthesis IIFECH.019
Pentose Phosphate PathwayRPIA.023
Erythropoietin Signaling PathwayBCL2L1,GATA1.054
ID1 Signaling PathwayBCL2L1,GSPT1.068
Sertoli Cell-Sertoli Cell Junction SignalingSPTB,YBX3.069
AutophagyGABARAPL2,SLC1A5.076
FemalesHeme Biosynthesis IIALAS2,FECH<.001
Iron homeostasis signaling pathwayALAS2,CDC34,FECH,SLC25A37<.001
Methylglyoxal Degradation IHAGH.006
Heme Biosynthesis from Uroporphyrinogen-III IFECH.008
Tetrapyrrole Biosynthesis IIALAS2.010
Hypoxia Signaling in the Cardiovascular SystemCDC34,UBE2H.011
Pentose Phosphate Pathway (Non-oxidative Branch)RPIA.013
Pentose Phosphate PathwayRPIA.023
Erythropoietin Signaling PathwayBCL2L1,GATA1.054
ID1 Signaling PathwayBCL2L1,GSPT1.068
ProteinsMalesRole of JAK2 in Hormone-like Cytokine SignalingEPO,LEP,PTPN6<.001
White Adipose Tissue Browning PathwayBDNF,LEP,NPPB<.001
Erythropoietin Signaling PathwayEPO,LEP,PTPN6<.001
Serotonin Receptor SignalingADIPOQ,BDNF,LEP,NPPB<.001
AMPK SignalingADIPOQ,INS,LEP.001
Leptin Signaling in ObesityINS,LEP.002
IL-3 SignalingPPP3R1,PTPN6.002
Maturity Onset Diabetes of Young (MODY) SignalingADIPOQ,INS.002
Thyroid Cancer SignalingBDNF,INS.002
ABRA Signaling PathwayNPPB,PPP3R1.003
FemalesGranulocyte Adhesion and DiapedesisPF4,PPBP,TNFRSF1A<.001
Agranulocyte Adhesion and DiapedesisPF4,PPBP,TNFRSF1A<.001
Wound Healing Signaling PathwayEGF,PF4,TNFRSF1A<.001
Huntington’s Disease SignalingBDNF,CPLX2,EGF.001
Pathogen Induced Cytokine Storm Signaling PathwayPF4,PPBP,TNFRSF1A.002
Glioma SignalingCDKN2D,EGF.004
Type II Diabetes Mellitus SignalingADIPOQ,TNFRSF1A.005
Axonal Guidance SignalingBDNF,EGF,PAPPA.005
Tumor Microenvironment PathwayEGF,TNFRSF1A.007
Regulation Of The Epithelial Mesenchymal Transition By Growth Factors PathwayEGF,TNFRSF1A.008
ViewSubgroupCanonical pathwayMoleculesUnadjusted |$P$|-value
GenesMalesIron homeostasis signaling pathwayCDC34,FECH,SLC25A37.003
Methylglyoxal Degradation IHAGH.006
Heme Biosynthesis from Uroporphyrinogen-III IFECH.008
Pentose Phosphate Pathway (Non-oxidative Branch)RPIA.013
Heme Biosynthesis IIFECH.019
Pentose Phosphate PathwayRPIA.023
Erythropoietin Signaling PathwayBCL2L1,GATA1.054
ID1 Signaling PathwayBCL2L1,GSPT1.068
Sertoli Cell-Sertoli Cell Junction SignalingSPTB,YBX3.069
AutophagyGABARAPL2,SLC1A5.076
FemalesHeme Biosynthesis IIALAS2,FECH<.001
Iron homeostasis signaling pathwayALAS2,CDC34,FECH,SLC25A37<.001
Methylglyoxal Degradation IHAGH.006
Heme Biosynthesis from Uroporphyrinogen-III IFECH.008
Tetrapyrrole Biosynthesis IIALAS2.010
Hypoxia Signaling in the Cardiovascular SystemCDC34,UBE2H.011
Pentose Phosphate Pathway (Non-oxidative Branch)RPIA.013
Pentose Phosphate PathwayRPIA.023
Erythropoietin Signaling PathwayBCL2L1,GATA1.054
ID1 Signaling PathwayBCL2L1,GSPT1.068
ProteinsMalesRole of JAK2 in Hormone-like Cytokine SignalingEPO,LEP,PTPN6<.001
White Adipose Tissue Browning PathwayBDNF,LEP,NPPB<.001
Erythropoietin Signaling PathwayEPO,LEP,PTPN6<.001
Serotonin Receptor SignalingADIPOQ,BDNF,LEP,NPPB<.001
AMPK SignalingADIPOQ,INS,LEP.001
Leptin Signaling in ObesityINS,LEP.002
IL-3 SignalingPPP3R1,PTPN6.002
Maturity Onset Diabetes of Young (MODY) SignalingADIPOQ,INS.002
Thyroid Cancer SignalingBDNF,INS.002
ABRA Signaling PathwayNPPB,PPP3R1.003
FemalesGranulocyte Adhesion and DiapedesisPF4,PPBP,TNFRSF1A<.001
Agranulocyte Adhesion and DiapedesisPF4,PPBP,TNFRSF1A<.001
Wound Healing Signaling PathwayEGF,PF4,TNFRSF1A<.001
Huntington’s Disease SignalingBDNF,CPLX2,EGF.001
Pathogen Induced Cytokine Storm Signaling PathwayPF4,PPBP,TNFRSF1A.002
Glioma SignalingCDKN2D,EGF.004
Type II Diabetes Mellitus SignalingADIPOQ,TNFRSF1A.005
Axonal Guidance SignalingBDNF,EGF,PAPPA.005
Tumor Microenvironment PathwayEGF,TNFRSF1A.007
Regulation Of The Epithelial Mesenchymal Transition By Growth Factors PathwayEGF,TNFRSF1A.008

There was no overlap in the top 10 protein pathways for males and females. The top pathway for males was role of JAK2 in hormone-like cytokine signaling. The top pathway for females was granulocyte adhesion and diapedesis which is associated with regulation of inflammation. Reference [30] also found this to be a top pathway involving upregulated genes when comparing patients with COPD and healthy controls.

Table 3

Comparison of regression model estimates

VariableEstimate95% CIP-value|$R^{2}$|Adjusted |$R^{2}$|
ERF0.1520.147
 Intercept–1.180–1.872, –0.4890.001
 Age0.021–0.035, 0.0770.461
 Sex (Female)0.006–0.093, 0.1050.901
 Race (African American)–0.070–0.202, 0.0620.297
 BMI0.3520.299, 0.406<0.001
 Former Smoker0.9470.257, 1.6370.007
 Current Smoker1.4290.735, 2.123<0.001
 % Emphysema0.023–0.032, 0.0790.414
 Scanner - Philips0.3790.097, 0.6610.009
 Scanner - Siemens0.1300.025, 0.2350.015
ERF + Common Protein Score0.1570.151
 Intercept–1.138–1.828, –0.4470.001
 Age0.032–0.024, 0.0890.266
 Sex (Female)0.007–0.092, 0.1050.897
 Race (African American)–0.049–0.181, 0.0840.469
 BMI0.3270.271, 0.383<0.001
 Former Smoker0.9110.223, 1.6000.010
 Current Smoker1.3900.697, 2.083<0.001
 % Emphysema0.028–0.028, 0.0830.328
 Scanner - Philips0.4040.123, 0.6860.005
 Scanner - Siemens0.1100.005, 0.2160.041
 Common Protein Score0.0760.023, 0.1300.005
ERF + Common Gene Score0.1530.147
 Intercept–1.169–1.861, –0.4770.001
 Age0.021–0.035, 0.0770.462
 Sex (Female)0.007–0.092, 0.1060.893
 Race (African American)–0.083–0.216, 0.0500.221
 BMI0.3430.288, 0.398<0.001
 Former Smoker0.9320.242, 1.6230.008
 Current Smoker1.4230.729, 2.117<0.001
 % Emphysema0.023–0.033, 0.0790.421
 Scanner - Philips0.3870.105, 0.6690.007
 Scanner - Siemens0.1340.029, 0.2390.013
 Common Gene Score0.035–0.017, 0.0860.185
ERF + Common Scores0.1580.151
 Intercept–1.128–1.819, –0.4370.001
 Age0.032–0.025, 0.0880.270
 Sex (Female)0.007–0.092, 0.1060.890
 Race (African American)–0.061–0.195, 0.0730.373
 BMI0.3200.262, 0.377<0.001
 Former Smoker0.8990.210, 1.5880.011
 Current Smoker1.3850.692, 2.078<0.001
 % Emphysema0.027–0.028, 0.0830.335
 Scanner - Philips0.4110.129, 0.6930.004
 Scanner - Siemens0.1140.008, 0.2200.035
 Common Protein Score0.0740.021, 0.1280.006
 Common Gene Score0.031–0.020, 0.0830.237
ERF + Subgroup Protein Score0.1650.159
 Intercept–1.071–1.760, –0.3830.002
 Age0.014–0.041, 0.0700.614
 Sex (Female)0.007–0.091, 0.1050.891
 Race (African American)–0.035–0.167, 0.0970.605
 BMI0.3120.256, 0.368<0.001
 Former Smoker0.8600.174, 1.5460.014
 Current Smoker1.3350.645, 2.025<0.001
 % Emphysema0.030–0.025, 0.0860.284
 Scanner - Philips0.4180.137, 0.6980.004
 Scanner - Siemens0.079–0.027, 0.1850.146
 Subgroup Protein Score0.1260.072, 0.179<0.001
ERF + Subgroup Gene Score0.1530.147
 Intercept–1.169–1.861, –0.4770.001
 Age0.021–0.035, 0.0770.459
 Sex (Female)0.007–0.092, 0.1060.893
 Race (African American)–0.083–0.217, 0.0500.220
 BMI0.3430.288, 0.398<0.001
 Former Smoker0.9320.242, 1.6220.008
 Current Smoker1.4230.729, 2.117<0.001
 % Emphysema0.023–0.033, 0.0790.421
 Scanner - Philips0.3870.105, 0.6690.007
 Scanner - Siemens0.1340.029, 0.2390.013
 Subgroup Gene Score0.035–0.016, 0.0870.180
ERF + Subgroup Scores0.1660.159
 Intercept–1.064–1.752, –0.3750.003
 Age0.015–0.041, 0.0700.610
 Sex (Female)0.007–0.091, 0.1060.885
 Race (African American)–0.045–0.179, 0.0880.504
 BMI0.3060.249, 0.363<0.001
 Former Smoker0.8500.164, 1.5360.015
 Current Smoker1.3320.642, 2.022<0.001
 % Emphysema0.030–0.025, 0.0850.290
 Scanner - Philips0.4230.142, 0.7040.003
 Scanner - Siemens0.083–0.024, 0.1890.129
 Subgroup Protein Score0.1240.070, 0.177<0.001
 Subtype Gene Score0.027–0.024, 0.0780.304
VariableEstimate95% CIP-value|$R^{2}$|Adjusted |$R^{2}$|
ERF0.1520.147
 Intercept–1.180–1.872, –0.4890.001
 Age0.021–0.035, 0.0770.461
 Sex (Female)0.006–0.093, 0.1050.901
 Race (African American)–0.070–0.202, 0.0620.297
 BMI0.3520.299, 0.406<0.001
 Former Smoker0.9470.257, 1.6370.007
 Current Smoker1.4290.735, 2.123<0.001
 % Emphysema0.023–0.032, 0.0790.414
 Scanner - Philips0.3790.097, 0.6610.009
 Scanner - Siemens0.1300.025, 0.2350.015
ERF + Common Protein Score0.1570.151
 Intercept–1.138–1.828, –0.4470.001
 Age0.032–0.024, 0.0890.266
 Sex (Female)0.007–0.092, 0.1050.897
 Race (African American)–0.049–0.181, 0.0840.469
 BMI0.3270.271, 0.383<0.001
 Former Smoker0.9110.223, 1.6000.010
 Current Smoker1.3900.697, 2.083<0.001
 % Emphysema0.028–0.028, 0.0830.328
 Scanner - Philips0.4040.123, 0.6860.005
 Scanner - Siemens0.1100.005, 0.2160.041
 Common Protein Score0.0760.023, 0.1300.005
ERF + Common Gene Score0.1530.147
 Intercept–1.169–1.861, –0.4770.001
 Age0.021–0.035, 0.0770.462
 Sex (Female)0.007–0.092, 0.1060.893
 Race (African American)–0.083–0.216, 0.0500.221
 BMI0.3430.288, 0.398<0.001
 Former Smoker0.9320.242, 1.6230.008
 Current Smoker1.4230.729, 2.117<0.001
 % Emphysema0.023–0.033, 0.0790.421
 Scanner - Philips0.3870.105, 0.6690.007
 Scanner - Siemens0.1340.029, 0.2390.013
 Common Gene Score0.035–0.017, 0.0860.185
ERF + Common Scores0.1580.151
 Intercept–1.128–1.819, –0.4370.001
 Age0.032–0.025, 0.0880.270
 Sex (Female)0.007–0.092, 0.1060.890
 Race (African American)–0.061–0.195, 0.0730.373
 BMI0.3200.262, 0.377<0.001
 Former Smoker0.8990.210, 1.5880.011
 Current Smoker1.3850.692, 2.078<0.001
 % Emphysema0.027–0.028, 0.0830.335
 Scanner - Philips0.4110.129, 0.6930.004
 Scanner - Siemens0.1140.008, 0.2200.035
 Common Protein Score0.0740.021, 0.1280.006
 Common Gene Score0.031–0.020, 0.0830.237
ERF + Subgroup Protein Score0.1650.159
 Intercept–1.071–1.760, –0.3830.002
 Age0.014–0.041, 0.0700.614
 Sex (Female)0.007–0.091, 0.1050.891
 Race (African American)–0.035–0.167, 0.0970.605
 BMI0.3120.256, 0.368<0.001
 Former Smoker0.8600.174, 1.5460.014
 Current Smoker1.3350.645, 2.025<0.001
 % Emphysema0.030–0.025, 0.0860.284
 Scanner - Philips0.4180.137, 0.6980.004
 Scanner - Siemens0.079–0.027, 0.1850.146
 Subgroup Protein Score0.1260.072, 0.179<0.001
ERF + Subgroup Gene Score0.1530.147
 Intercept–1.169–1.861, –0.4770.001
 Age0.021–0.035, 0.0770.459
 Sex (Female)0.007–0.092, 0.1060.893
 Race (African American)–0.083–0.217, 0.0500.220
 BMI0.3430.288, 0.398<0.001
 Former Smoker0.9320.242, 1.6220.008
 Current Smoker1.4230.729, 2.117<0.001
 % Emphysema0.023–0.033, 0.0790.421
 Scanner - Philips0.3870.105, 0.6690.007
 Scanner - Siemens0.1340.029, 0.2390.013
 Subgroup Gene Score0.035–0.016, 0.0870.180
ERF + Subgroup Scores0.1660.159
 Intercept–1.064–1.752, –0.3750.003
 Age0.015–0.041, 0.0700.610
 Sex (Female)0.007–0.091, 0.1060.885
 Race (African American)–0.045–0.179, 0.0880.504
 BMI0.3060.249, 0.363<0.001
 Former Smoker0.8500.164, 1.5360.015
 Current Smoker1.3320.642, 2.022<0.001
 % Emphysema0.030–0.025, 0.0850.290
 Scanner - Philips0.4230.142, 0.7040.003
 Scanner - Siemens0.083–0.024, 0.1890.129
 Subgroup Protein Score0.1240.070, 0.177<0.001
 Subtype Gene Score0.027–0.024, 0.0780.304

Models were fit on all participants in the COPD Data set and adjusted for age, sex, race, BMI, smoking status, percent emphysema, and scanner make (⁠|$N=1374$|⁠). Scores were developed using the “stable” proteins and genes selected by HIP (Random). ERF = Established Risk Factors (Age, Sex, Race, BMI, and Smoking Status).

Table 3

Comparison of regression model estimates

VariableEstimate95% CIP-value|$R^{2}$|Adjusted |$R^{2}$|
ERF0.1520.147
 Intercept–1.180–1.872, –0.4890.001
 Age0.021–0.035, 0.0770.461
 Sex (Female)0.006–0.093, 0.1050.901
 Race (African American)–0.070–0.202, 0.0620.297
 BMI0.3520.299, 0.406<0.001
 Former Smoker0.9470.257, 1.6370.007
 Current Smoker1.4290.735, 2.123<0.001
 % Emphysema0.023–0.032, 0.0790.414
 Scanner - Philips0.3790.097, 0.6610.009
 Scanner - Siemens0.1300.025, 0.2350.015
ERF + Common Protein Score0.1570.151
 Intercept–1.138–1.828, –0.4470.001
 Age0.032–0.024, 0.0890.266
 Sex (Female)0.007–0.092, 0.1050.897
 Race (African American)–0.049–0.181, 0.0840.469
 BMI0.3270.271, 0.383<0.001
 Former Smoker0.9110.223, 1.6000.010
 Current Smoker1.3900.697, 2.083<0.001
 % Emphysema0.028–0.028, 0.0830.328
 Scanner - Philips0.4040.123, 0.6860.005
 Scanner - Siemens0.1100.005, 0.2160.041
 Common Protein Score0.0760.023, 0.1300.005
ERF + Common Gene Score0.1530.147
 Intercept–1.169–1.861, –0.4770.001
 Age0.021–0.035, 0.0770.462
 Sex (Female)0.007–0.092, 0.1060.893
 Race (African American)–0.083–0.216, 0.0500.221
 BMI0.3430.288, 0.398<0.001
 Former Smoker0.9320.242, 1.6230.008
 Current Smoker1.4230.729, 2.117<0.001
 % Emphysema0.023–0.033, 0.0790.421
 Scanner - Philips0.3870.105, 0.6690.007
 Scanner - Siemens0.1340.029, 0.2390.013
 Common Gene Score0.035–0.017, 0.0860.185
ERF + Common Scores0.1580.151
 Intercept–1.128–1.819, –0.4370.001
 Age0.032–0.025, 0.0880.270
 Sex (Female)0.007–0.092, 0.1060.890
 Race (African American)–0.061–0.195, 0.0730.373
 BMI0.3200.262, 0.377<0.001
 Former Smoker0.8990.210, 1.5880.011
 Current Smoker1.3850.692, 2.078<0.001
 % Emphysema0.027–0.028, 0.0830.335
 Scanner - Philips0.4110.129, 0.6930.004
 Scanner - Siemens0.1140.008, 0.2200.035
 Common Protein Score0.0740.021, 0.1280.006
 Common Gene Score0.031–0.020, 0.0830.237
ERF + Subgroup Protein Score0.1650.159
 Intercept–1.071–1.760, –0.3830.002
 Age0.014–0.041, 0.0700.614
 Sex (Female)0.007–0.091, 0.1050.891
 Race (African American)–0.035–0.167, 0.0970.605
 BMI0.3120.256, 0.368<0.001
 Former Smoker0.8600.174, 1.5460.014
 Current Smoker1.3350.645, 2.025<0.001
 % Emphysema0.030–0.025, 0.0860.284
 Scanner - Philips0.4180.137, 0.6980.004
 Scanner - Siemens0.079–0.027, 0.1850.146
 Subgroup Protein Score0.1260.072, 0.179<0.001
ERF + Subgroup Gene Score0.1530.147
 Intercept–1.169–1.861, –0.4770.001
 Age0.021–0.035, 0.0770.459
 Sex (Female)0.007–0.092, 0.1060.893
 Race (African American)–0.083–0.217, 0.0500.220
 BMI0.3430.288, 0.398<0.001
 Former Smoker0.9320.242, 1.6220.008
 Current Smoker1.4230.729, 2.117<0.001
 % Emphysema0.023–0.033, 0.0790.421
 Scanner - Philips0.3870.105, 0.6690.007
 Scanner - Siemens0.1340.029, 0.2390.013
 Subgroup Gene Score0.035–0.016, 0.0870.180
ERF + Subgroup Scores0.1660.159
 Intercept–1.064–1.752, –0.3750.003
 Age0.015–0.041, 0.0700.610
 Sex (Female)0.007–0.091, 0.1060.885
 Race (African American)–0.045–0.179, 0.0880.504
 BMI0.3060.249, 0.363<0.001
 Former Smoker0.8500.164, 1.5360.015
 Current Smoker1.3320.642, 2.022<0.001
 % Emphysema0.030–0.025, 0.0850.290
 Scanner - Philips0.4230.142, 0.7040.003
 Scanner - Siemens0.083–0.024, 0.1890.129
 Subgroup Protein Score0.1240.070, 0.177<0.001
 Subtype Gene Score0.027–0.024, 0.0780.304
VariableEstimate95% CIP-value|$R^{2}$|Adjusted |$R^{2}$|
ERF0.1520.147
 Intercept–1.180–1.872, –0.4890.001
 Age0.021–0.035, 0.0770.461
 Sex (Female)0.006–0.093, 0.1050.901
 Race (African American)–0.070–0.202, 0.0620.297
 BMI0.3520.299, 0.406<0.001
 Former Smoker0.9470.257, 1.6370.007
 Current Smoker1.4290.735, 2.123<0.001
 % Emphysema0.023–0.032, 0.0790.414
 Scanner - Philips0.3790.097, 0.6610.009
 Scanner - Siemens0.1300.025, 0.2350.015
ERF + Common Protein Score0.1570.151
 Intercept–1.138–1.828, –0.4470.001
 Age0.032–0.024, 0.0890.266
 Sex (Female)0.007–0.092, 0.1050.897
 Race (African American)–0.049–0.181, 0.0840.469
 BMI0.3270.271, 0.383<0.001
 Former Smoker0.9110.223, 1.6000.010
 Current Smoker1.3900.697, 2.083<0.001
 % Emphysema0.028–0.028, 0.0830.328
 Scanner - Philips0.4040.123, 0.6860.005
 Scanner - Siemens0.1100.005, 0.2160.041
 Common Protein Score0.0760.023, 0.1300.005
ERF + Common Gene Score0.1530.147
 Intercept–1.169–1.861, –0.4770.001
 Age0.021–0.035, 0.0770.462
 Sex (Female)0.007–0.092, 0.1060.893
 Race (African American)–0.083–0.216, 0.0500.221
 BMI0.3430.288, 0.398<0.001
 Former Smoker0.9320.242, 1.6230.008
 Current Smoker1.4230.729, 2.117<0.001
 % Emphysema0.023–0.033, 0.0790.421
 Scanner - Philips0.3870.105, 0.6690.007
 Scanner - Siemens0.1340.029, 0.2390.013
 Common Gene Score0.035–0.017, 0.0860.185
ERF + Common Scores0.1580.151
 Intercept–1.128–1.819, –0.4370.001
 Age0.032–0.025, 0.0880.270
 Sex (Female)0.007–0.092, 0.1060.890
 Race (African American)–0.061–0.195, 0.0730.373
 BMI0.3200.262, 0.377<0.001
 Former Smoker0.8990.210, 1.5880.011
 Current Smoker1.3850.692, 2.078<0.001
 % Emphysema0.027–0.028, 0.0830.335
 Scanner - Philips0.4110.129, 0.6930.004
 Scanner - Siemens0.1140.008, 0.2200.035
 Common Protein Score0.0740.021, 0.1280.006
 Common Gene Score0.031–0.020, 0.0830.237
ERF + Subgroup Protein Score0.1650.159
 Intercept–1.071–1.760, –0.3830.002
 Age0.014–0.041, 0.0700.614
 Sex (Female)0.007–0.091, 0.1050.891
 Race (African American)–0.035–0.167, 0.0970.605
 BMI0.3120.256, 0.368<0.001
 Former Smoker0.8600.174, 1.5460.014
 Current Smoker1.3350.645, 2.025<0.001
 % Emphysema0.030–0.025, 0.0860.284
 Scanner - Philips0.4180.137, 0.6980.004
 Scanner - Siemens0.079–0.027, 0.1850.146
 Subgroup Protein Score0.1260.072, 0.179<0.001
ERF + Subgroup Gene Score0.1530.147
 Intercept–1.169–1.861, –0.4770.001
 Age0.021–0.035, 0.0770.459
 Sex (Female)0.007–0.092, 0.1060.893
 Race (African American)–0.083–0.217, 0.0500.220
 BMI0.3430.288, 0.398<0.001
 Former Smoker0.9320.242, 1.6220.008
 Current Smoker1.4230.729, 2.117<0.001
 % Emphysema0.023–0.033, 0.0790.421
 Scanner - Philips0.3870.105, 0.6690.007
 Scanner - Siemens0.1340.029, 0.2390.013
 Subgroup Gene Score0.035–0.016, 0.0870.180
ERF + Subgroup Scores0.1660.159
 Intercept–1.064–1.752, –0.3750.003
 Age0.015–0.041, 0.0700.610
 Sex (Female)0.007–0.091, 0.1060.885
 Race (African American)–0.045–0.179, 0.0880.504
 BMI0.3060.249, 0.363<0.001
 Former Smoker0.8500.164, 1.5360.015
 Current Smoker1.3320.642, 2.022<0.001
 % Emphysema0.030–0.025, 0.0850.290
 Scanner - Philips0.4230.142, 0.7040.003
 Scanner - Siemens0.083–0.024, 0.1890.129
 Subgroup Protein Score0.1240.070, 0.177<0.001
 Subtype Gene Score0.027–0.024, 0.0780.304

Models were fit on all participants in the COPD Data set and adjusted for age, sex, race, BMI, smoking status, percent emphysema, and scanner make (⁠|$N=1374$|⁠). Scores were developed using the “stable” proteins and genes selected by HIP (Random). ERF = Established Risk Factors (Age, Sex, Race, BMI, and Smoking Status).

Effect of common and sex-specific genes and proteins on AWT

Finally, we created common and sex-specific protein and gene scores from the “stable” proteins and genes selected by HIP (Random) and assessed whether these scores improved the prediction of AWT beyond some established COPD risk factors. We created the common protein score for subject |$i$| as CommonProtScore|$_{i}=\sum _{j=1}^{\# common~proteins}w_{j}x_{ij}^{1}$|⁠, where |$x_{ij}^{1}$| is subject |$i$|’s protein expression value for the |$j$|th common protein (i.e. the |$ij$|th entry for the protein data, |$\boldsymbol X^{1}$|⁠), and |$w_{j}$| is the weight for protein |$j$|⁠. Each protein weight, |$w_{j}$|⁠, was obtained via bootstrap. Specifically, we obtained 200 bootstrap datasets, and for each bootstrap dataset, we obtained regression coefficients and standard errors from univariate regression models of AWT and each of the common proteins identified. This resulted in 200 regression coefficients and standard errors which we combined using a weighted mean. The subgroup-specific scores were also obtained in a similar fashion. The scores were standardized to have mean 0 and variance 1 in each subgroup since different variables were identified for males and females.

Once the scores were created, we fit several multiple linear regression models on the full data: (i) Established Risk Factors (ERF) Model, (ii) ERF + Common Protein Score, (iii) ERF + Common Gene Score, (iv) ERF + Common Protein and Gene Scores, (v) ERF + Subgroup Protein Score, (vi) ERF + Subgroup Gene Score, and (vii) ERF + Subgroup Protein and Gene Scores. Table 3 shows the coefficient estimates with confidence intervals and |$P$|-values. We observe both the common and subgroup-specific protein scores are statistically significant, but neither the common nor subgroup-specific gene scores were. This could be due to including too few genes in the scores or because there was a large overlap between the genes selected for males and females. The “stable” proteins we identified to be common and specific to males and females could potentially be explored to further our understanding of sex differences in COPD mechanisms.

Conclusion

We have tackled the problem of accounting for subgroup heterogeneity in an integrative analysis framework. Motivated by the COPDGene study and a scientific need to understand sex differences in COPD, we developed appropriate statistical methods that leverage the strengths of multi-view data, account for subgroup heterogeneity, incorporate clinical covariates, and combine the association step with a clinical outcome step to guide the selection of clinically meaningful molecular signatures. Through the use of a hierarchical penalty, we identify omics signatures that are common and subgroup-specific and can predict a clinical outcome. HIP showed comparable to substantially improved prediction and variable selection performance in simulation settings when compared with existing methods.

When we applied HIP to genomic and proteomic data from COPDGene, we identified protein and gene biomarkers and pathways common and specific to males and females. When the proteins and genes were developed into scores, the common and subgroup-specific protein scores were statistically significant predictors of AWT even when including established risk factors of COPD. These findings suggest that the proteins and genes identified to be common and specific to males and females could be explored to further our understanding of sex differences in COPD mechanisms.

Recently, [31] also explored gene signatures related to AWT and found that interferon stimulated genes were associated with AWT. We did not find these same genes in our analysis, but there were several differences in the analyses that could explain the differing results: (i) the subset of COPDGene participants in the two analyses was different as we only included participants with COPD, while [31] included participants with and without COPD, (ii) [31] looked for associations between individual genes and AWT while adjusting for covariates, whereas we selected genes based on rankings from our model that included several genes at once, (iii) we considered both gene and protein data (which also impacted which participants we could include), whereas [31] only considered genes, and (iv) we use IPA [26] to find pathways, whereas [31] used MSigDB (https://www.gsea-msigdb.org/gsea/msigdb).

HIP has some limitations that warrant further research. First, the number of variables to be kept for the subset model refit has to be specified. In simulations where this value is known, performance is very good, but the truth will not be known in applied settings. Users could look at plots of the weights from the |$\hat{\boldsymbol B}^{d,s}$| to see how many variables seem to have large weights. We also found that if there were some splits where the train MSEs were very small but test MSEs very large, i.e. evidence of overfitting that more variables needed to be retained. Second, the tuning range for |$\lambda _{\xi }$| and |$\lambda _{g}$| is not determined by the data, so the tuning range may need to be adjusted to attain optimal sparsity. This can be done with an optional parameter in the code. Additionally, the number of components, |$K$|⁠, needs to be specified. Although the truth can never be known, we provide an automatic method to select |$K$| and discuss other options in the supplemental material. Future research should explore the possibility that |$K$| may differ by data view. Finally, HIP is limited to cross-sectional data, but future work could extend it to accommodate longitudinal data to determine whether trends in some outcome vary by subgroup. Despite these limitations, HIP advances statistical methods for joint association and prediction of multi-view data, and the encouraging simulation and real data findings motivate further applications.

Key Points
  • Many complex diseases are characterized by subgroup differences, yet most existing integrative analysis methods ignore differential outcomes in subgroups. We have tackled the problem of accounting for subgroup heterogeneity in data integration.

  • We have developed HIP, a statistical method that leverages the strengths of multi-view data, accounts for subgroup heterogeneity, incorporates clinical covariates, and couples data integration with a clinical outcome prediction, yielding clinically meaningful molecular signatures and enhancing interpretability.

  • We have conducted rigorous simulation studies and found HIP to be competitive. The COPD data application showcases the potential application of HIP in identifying molecular signatures shared by, and unique to, males and females, and predictive of airway wall thickness.

  • We have developed efficient PyTorch codes for users with programming expertise, and a Shiny App, for users without programming expertise, to promote research inclusion and accessibility.

Author contributions

S.E.S. and Q.L. conceived of the idea. S.E.S., J.B., and L.E. developed the methods. J.B. and S.E.S. developed code to implement the methods. J.B. and L.V. conducted simulations and real data analyses. J.B. and C.W. interpreted results from the real data analyses. J.B. and S.E.S. wrote a first draft of the paper. All authors read and edited the final manuscript.

Conflict of interest: The authors declare that they have no competing interests.

Funding

This work was supported by National Center For Advancing Translational Science (5KL2TR002492), National Institute Of General Medical Sciences (1R35GM142695), and NHLBI U01 HL089897 and U01 HL089856. The COPDGene study (NCT00608764) is also supported by the COPD Foundation through contributions made to an Industry Advisory Committee that has included AstraZeneca, Bayer Pharmaceuticals, Boehringer-Ingelheim, GlaxoSmithKline, Genentech, Novartis, Pfizer, and Sunovion. The views expressed in this article are those of the authors and do not reflect the views of the United States Government, the Department of Veterans Affairs, the funders, the sponsors, or any of the authors’ affiliated academic institutions.

Data availability

Access to the clinical and genomic data can be requested through dbGaP (IDs: phs000951.v4.p4 and phs000179.v6.p2). The proteomic data can be requested from the COPDGene Study Group (http://www.copdgene.org/).

The Python source code for implementing the methods and generating simulated data along with README files is available on GitHub at https://github.com/lasandrall/HIP. We interface the Python code with the R programming language and provide an R-package for users without programming experience in Python. A Shiny Application of HIP for users with limited programming experience can be found at https://multi-viewlearn.shinyapps.io/HIP_ShinyApp/. We provide a video tutorial at https://youtu.be/O6E2OLmeMDo to facilitate the use of HIP.

Ethics approval and consent to participate

This research uses previously collected, de-identified data from the COPDGene Study [7], a multi-center study with 21 clinical sites each with local IRB approval (NCT00608764).

Consent for publication

Not applicable

References

1.

Wheaton
AG
,
Cunningham
TJ
,
Ford
ES
. et al. .
Employment and activity limitations among adults with chronic obstructive pulmonary disease—United States, 2013
.
MMWR Morb Mortal Wkly Rep
2015
;
64
:
289
95
.

2.

GOLD, “GOLD 2020 Report.” https://goldcopd.org/wp-content/uploads/2019/11/GOLD-2020-REPORT-ver1.1wms.pdf, 2020 (20 May 2020, date last accessed)
.

3.

Hardin
M
,
Silverman
EK
.
Chronic obstructive pulmonary disease genetics: a review of the past and a look into the future
.
Chronic Obstr Pulm Dis
2014
;
1
:
33
46
. .

4.

Hu
G
,
Zhou
Y
,
Tian
J
. et al. .
Risk of copd from exposure to biomass smoke: a metaanalysis
.
Chest
2010
;
138
:
20
31
. .

5.

Pauwels
RA
,
Buist
AS
,
Calverley
PM
. et al. .
Global strategy for the diagnosis, management, and prevention of chronic obstructive pulmonary disease: Nhlbi/who global initiative for chronic obstructive lung disease (gold) workshop summary
.
Am J Respir Crit Care Med
2001
;
163
:
1256
76
. .

6.

Chung
K
,
Adcock
I
.
Multifaceted mechanisms in copd: Inflammation, immunity, and tissue repair and destruction
.
Eur Respir J
2008
;
31
:
1334
56
. .

7.

Regan
EA
,
Hokanson
JE
,
Murphy
JR
. et al. .
Genetic epidemiology of copd (copdgene) study design
.
COPD: J Chron Obstruct Pulmon Dis
2011
;
7
:
32
43
. .

8.

Barnes
PJ
.
Sex differences in chronic obstructive pulmonary disease mechanisms
.
Am J Respir Crit Care Med
2016
;
193
:
813
4
. https://10.1164/rccm.201512-2379ED.

9.

Gan
WQ
,
Man
SP
,
Postma
DS
. et al. .
Female smokers beyond the perimenopausal period are at increased risk of chronic obstructive pulmonary disease: a systematic review and meta-analysis
.
Respir Res
2006
;
7
:
1
9
. .

10.

Kim
Y-I
,
Schroeder
J
,
Lynch
D
. et al. .
Gender differences of airway dimensions in anatomically matched sites on ct in smokers
.
COPD: J Chron Obstruct Pulmon Dis
2011
;
8
:
285
92
. .

11.

Prescott
E
,
Bjerg
A
,
Andersen
P
. et al. .
Gender difference in smoking effects on lung function and risk of hospitalization for copd: results from a danish longitudinal population study
.
Eur Respir J
1997
;
10
:
822
7
. .

12.

Safo
SE
,
Min
EJ
,
Haine
L
.
Sparse linear discriminant analysis for multiview structured data
.
Biometrics
2021
;
78
:
612
23
. .

13.

Chekouo
T
,
Safo
SE
.
Bayesian integrative analysis and prediction with application to atherosclerosis cardiovascular disease
.
Biostatistics
2020
;
24
:
124
39
. .

14.

Luo
C
,
Liu
J
,
Dey
DK
. et al. .
Canonical variate regression
.
Biostatistics
2016
;
17
:
468
83
. .

15.

Dondelinger
F
,
Mukherjee
S
,
Initiative
TADN
.
The joint lasso: high-dimensional regression for group structured data
.
Biostatistics
2018
;
21
:
219
35
.
_eprint: https://dbpia.nl.go.kr/biostatistics/article-pdf/21/2/219/32914593/kxy035.pdf. [Online]. Available: https://doi.org/10.1093/biostatistics/kxy035
.

16.

Li
Q
,
Wang
S
,
Huang
C-C
. et al. .
Meta-analysis based variable selection for gene expression data
.
Biometrics
2014
;
70
:
872
80
. .

17.

Paszke
A
,
Gross
S
,
Massa
F
. et al. .
Pytorch: an imperative style, high-performance deep learning library
. In:
Wallach
H
,
Larochelle
H
,
Beygelzimer
A
,
d’ Alché-Buc
F
,
Fox
E
,
Garnett
R
, (eds.),
Advances in Neural Information Processing Systems 32
.
Curran Associates, Inc.
,
2019
,
8024
35
.

18.

Gower
JC
,
Dijksterhuis
GB
. et al. .
Procrustes Problems
, Vol.
30
.
Oxford University Press on Demand
,
2004
. .

19.

Luo
C
,
Chen
K
.
CVR: Canonical Variate Regression
2017
R package version 0.1.1. [Online]
. .

20.

Dondelinger
F
,
Wilkinson
O
.
Fuser: Fused Lasso for High-Dimensional Regression over Groups
.
2018
.
R package version 1.0.1. [Online]
. .

21.

Tibshirani
R
.
Regression shrinkage and selection via the lasso
.
J R Stat Soc Ser B
1994
;
58
:
267
88
. .

22.

Zou
H
,
Hastie
T
.
Regularization and variable selection via the elastic net
.
Journal of the Royal Statistical Society, Series B
2005
;
67
:
301
20
. .

23.

Friedman
J
,
Hastie
T
,
Tibshirani
R
.
Regularization paths for generalized linear models via coordinate descent
.
J Stat Softw
2010
; Volume
33
:
1
22
. . .

24.

Maxwell
EK
,
Campbell
JD
,
Spira
A
. et al. .
Submirine: Assessing variants in microrna targets using clinical genomic data sets
.
Nucleic Acids Res
2015
;
43
:
3886
98
. .

25.

Jaswal
S
,
Saini
V
,
Kaur
J
. et al. .
Association of adiponectin with lung function impairment and disease severity in chronic obstructive pulmonary disease
.
Int J Appl Basic Med Res
2018
;
8
:
14
. .

26.

Kramer
A
,
Greeen
J
,
Jr
JP
. et al. .
Causal analysis approaches in ingenuity pathway analysis
.
Bionformatics
2014
;
30
:
523
30
.
[Online]. Available: https://doi.org/10.1093/bioinformatics/btt703
.

27.

Neves
J
,
Haider
T
,
Gassmann
M
. et al. .
Iron homeostasis in the lungs—a balance between health and disease
.
Pharmaceuticals
2019
;
12
:
5
. .

28.

Cloonan
SM
,
Mumby
S
,
Adcock
IM
. et al. .
The ”iron”-y of iron overload and iron deficiency in chronic obstructive pulmonary disease
.
Am J Respir Crit Care Med
2017
;
196
:
1103
12
. .

29.

Salit
J
,
Kaner
R
,
Mezey
J
. et al. .
Small airway epithelial responses associated with enhanced female susceptibility to smoking-related lung disease
.
American Thoracic Society
2019
; Vol
199
:
A7096
6
. .

30.

Wang
L
,
Zhao
H
,
Raman
I
. et al. .
Peripheral blood mononuclear cell gene expression in chronic obstructive pulmonary disease: Mirna and mrna regulation
.
J Inflamm Res
2022
;
15
:
2167
80
. .

31.

Yun
JH
,
Lee
S
,
Srinivasa
P
. et al. .
”An interferon-inducible signature of airway disease from blood gene expression profiling
.
Eur Respir J
2022
;
59
:1–22. .

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected]