Summary

Divide-and-conquer (DAC) is a commonly used strategy to overcome the challenges of extraordinarily large data, by first breaking the dataset into a series of data blocks and then combining results from the individual blocks to obtain a final estimate. Various DAC algorithms have been proposed to fit sparse predictive regression models in the |$L_1$| regularization setting. However, many existing DAC algorithms remain computationally intensive when the sample size and the number of candidate predictors are both large. In addition, no existing DAC procedure provides inference for quantifying the accuracy of risk prediction models. In this article, we propose a screening and one-step linearization infused DAC (SOLID) algorithm to fit sparse logistic regression to massive datasets, by integrating the DAC strategy with a screening step and sequences of linearization. This enables us to maximize the likelihood with only the selected covariates and to perform penalized estimation via a fast approximation to the likelihood. To assess the accuracy of a predictive regression model, we develop a modified cross-validation (MCV) that utilizes the side products of SOLID, substantially reducing the computational burden. Compared with existing DAC methods, the MCV procedure is the first to make inference on accuracy. Extensive simulation studies suggest that the proposed SOLID and MCV procedures substantially outperform the existing methods with respect to computational speed and achieve statistical efficiency similar to that of the full sample-based estimator. We also demonstrate that the proposed inference procedure provides valid interval estimators. We apply the proposed SOLID procedure to develop and validate a classification model for disease diagnosis using narrative clinical notes based on electronic medical record data from Partners HealthCare.

1. Introduction

Large-scale healthcare data, including those from electronic medical records (EMR) and insurance claims, open tremendous opportunities for novel discoveries. For example, risk prediction models can be developed using EMR data and incorporated into the healthcare system as clinical decision support. However, developing risk prediction models using massive and rich datasets is computationally challenging since often both the sample size, |$n_0$|⁠, and the number of candidate predictors, |$p$|⁠, are large. To develop an accurate yet clinically interpretable risk model in such a high dimensional setting, we may fit regularized regression models with various penalties that can simultaneously remove non-informative predictors and estimate the effects of the informative predictors (e.g., Tibshirani, 1996; Fan and Li, 2001; Zou and Hastie, 2005; Zou, 2006). However, when |$n_0$| is extraordinarily large, directly fitting a sparse predictive model is not computationally feasible.

One strategy to overcome the computational difficulty is to employ a divide-and-conquer (DAC) scheme. Standard DAC procedures divide the full sample into |$K$| subsets, solve the optimization problem within each subset, and then combine the subset-specific estimates to form a final DAC estimate. A wide range of DAC algorithms have been proposed to fit various prediction models, but we focus our discussion on |$L_1$|-type penalized regression due to its advantage in interpretability. Chen and Xie (2014) proposed a DAC algorithm to fit penalized GLM. The algorithm obtains a sparse GLM estimate for each subset and then combines the subset-specific estimates by majority voting and averaging. Wang and others (2014) proposed a median selection subset aggregation estimator by first fitting |$L_1$| penalized linear regression in each data block to select informative features, and then performing a DAC linear regression with only the selected features. Lee and others (2017) proposed a one-shot de-biased lasso approach, obtaining the de-biased lasso estimate (Van de Geer and others, 2014) at each data block and then simply averaging the local de-biased lasso estimates. Tang and others (2016) also obtained the de-biased lasso estimate of each data block, but then used a robust meta-analysis-type approach through the confidence distribution approach (Xie and others, 2011) to combine the local estimates. He and others (2016) proposed a sparse meta-analysis approach to integrate information across sites for the low-dimensional setting with small |$p$|⁠. While the above algorithms are effective in reducing the computational burden compared to fitting a penalized regression model to the full data, they remain computationally intensive because they require |$K$| penalized estimation procedures. It is of great interest to further reduce the computational burden when fitting extraordinarily large data. Recently, building on the work of Wang and Leng (2007), Wang and others (2019) proposed an efficient DAC algorithm to fit sparse Cox regression by using a one-step linear approximation followed by a least square approximation (LSA), without requiring a penalized Cox model to be fit on all |$K$| sets. Although this approach can be adapted for logistic regression, it is not computationally efficient when |$p$| is large.

After a risk prediction model is developed, a critical follow-up step is to evaluate its prediction performance in order to determine whether it is feasible to adopt in practice. A wide range of accuracy parameters, such as the receiver operating characteristic (ROC) curve, have been proposed as performance measures (Pepe, 2003). Making inference about the accuracy parameters with an exceedingly large dataset is also computationally challenging, especially if the accuracy estimators are non-smooth functionals. Under standard settings with moderate |$n_0$| and |$p$|⁠, to make inference about the prediction accuracy of a fitted model, cross-validation (CV) is often used to correct for overfitting bias, and resampling methods are used for variance estimation (e.g., Tian and others, 2007; Uno and others, 2007). Both CV and resampling become computationally infeasible for exceedingly large datasets.

In this article, we propose a screening and one-step linearization infused DAC (SOLID) procedure for efficiently estimating a sparse prediction model. The SOLID procedure attains computational and statistical efficiency by integrating DAC with an initial screening for an active set of predictors and sequences of one-step linear approximations. The main advantage of the SOLID procedure compared with existing DAC procedures is that it substantially increases computational efficiency when |$n_0$| and |$p$| are both large. To efficiently make inference about the accuracy of the prediction model, we develop a modified cross-validation (MCV), which achieves the same asymptotic efficiency as the standard CV. The MCV procedure directly utilizes side products from the SOLID procedure to estimate the corresponding accuracy parameters without re-applying the full SOLID procedure in each random split, while still adjusting for overfitting. Specifically, during the SOLID procedure, summary statistics such as the score function and the Hessian matrix are obtained for each data block. At the |$k$|th fold of a |$K$|-fold MCV, instead of applying the full SOLID procedure to the |$K-1$| remaining data blocks (leaving out the |$k$|th block), the model estimate is approximated by combining the already available summary statistics of those |$K-1$| blocks. To the best of our knowledge, the proposed MCV procedure is the first effort to make inference on accuracy and to adjust for overfitting under the DAC framework.

The rest of the article is organized as follows. In Section 2, we detail the SOLID procedure and the MCV. We also propose an efficient standard error estimator for the non-smooth accuracy functionals. In Section 3, we present simulation results demonstrating the superiority of the proposed methods compared to existing methods. In Section 4, we apply SOLID and MCV to Partners HealthCare EMR data to assign ICD codes of depression to patient visits. We conclude with discussions in Section 5.

2. Methods

2.1. Notations and settings

Let |$Y$| denote the binary outcome, and |${\bf X}$| denote the |$(p+1)\times 1$| vector of predictors, where we include 1 as the first element of |${\bf X}$|⁠. Suppose the full data consist of |$n_0$| independent and identically distributed realizations of |${\bf D} = (Y,{\bf X}^{{\sf \scriptscriptstyle{T}}})^{{\sf \scriptscriptstyle{T}}}$|⁠, denoted by |$\mathscr{D}_{\sf \scriptscriptstyle +} = \{{\bf D}_i, i = 1,..., n_0\}$|⁠. We denote the index set for the full data by |$\Omega_{\sf \scriptscriptstyle +} = \{1,..., n_0\}$|⁠. We randomly partition |$\mathscr{D}_{\sf \scriptscriptstyle +}$| into |$K$| subsets with the |$k$|th subset denoted by |$\mathscr{D}_k = \{{\bf D}_i, i \in \Omega_k\}$|⁠. Without loss of generality, we assume that |$n = n_0/K$| is an integer and that the index set for the subset |$k$| is |$\Omega_k = \{(k-1)n+1,..., kn\}$|⁠. For any index set |$\Omega$|⁠, we denote the size of |$\Omega$| by |$n_{\Omega}$| with |$n_{\Omega}=n_0$| if |$\Omega=\Omega_{\sf \scriptscriptstyle +}$| and |$n_{\Omega}=n$| if |$\Omega=\Omega_k$|⁠.
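As a concrete illustration, a minimal R sketch of this random partition is given below; the data frame dat and the choice K = 100 are illustrative assumptions, not part of the formal setup.

    # Random partition of the full data into K equally sized blocks (assumes n0 is a multiple of K);
    # `dat` and K = 100 are illustrative.
    set.seed(1)
    n0 <- nrow(dat); K <- 100; n <- n0 / K
    perm  <- sample(n0)                          # random permutation of row indices
    Omega <- split(perm, rep(1:K, each = n))     # Omega[[k]] corresponds to the index set Omega_k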

To develop a risk prediction model for |$Y$| based on |${\bf X}$|⁠, we assume that
(2.1) |$P(Y=1\mid {\bf X}) = g(\boldsymbol{\beta}_0^{{\sf \scriptscriptstyle{T}}}{\bf X}), \qquad g(x) = e^x/(1+e^x),$|
where |$\boldsymbol{\beta}_0$| is sparse with the size of the active set |$\mathcal{A} = \{\jmath: \boldsymbol{\beta}_{0\jmath} \ne 0\}$|⁠, denoted by |$q = \mbox{Card}(\mathcal{A})$|⁠. We allow diverging |$p$| with |$\lim_{n_0\to\infty}\log(p)/\log(n_0) = \nu$| for some |$\nu \in (0,1/2)$| but assume that |$q$| is fixed for simplicity. With the full data |$\mathscr{D}_{\sf \scriptscriptstyle +}$|⁠, an efficient sparse estimator for |$\boldsymbol{\beta}_0$| under (2.1) can be obtained using the maximizer of the adaptive LASSO penalized log-likelihood (Zou, 2006),
(2.2) |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle\Omega_{\sf \scriptscriptstyle +}} = \mathop{\mbox{argmax}}_{\boldsymbol{\beta}}\left\{\widehat{\ell}_{\Omega_{\sf \scriptscriptstyle +}}(\boldsymbol{\beta}) - \lambda_{\Omega_{\sf \scriptscriptstyle +}}\sum_{\jmath=1}^p |\beta_{\jmath}|/|\widetilde{\beta}_{\Omega_{\sf \scriptscriptstyle +},\jmath}|^{\gamma}\right\},$|
where for any index set |$\Omega$|⁠,
(2.3) |$\widehat{\ell}_{\Omega}(\boldsymbol{\beta}) = n_{\Omega}^{-1}\sum_{i\in\Omega}\left[Y_i\boldsymbol{\beta}^{{\sf \scriptscriptstyle{T}}}{\bf X}_i - \log\{1+\exp(\boldsymbol{\beta}^{{\sf \scriptscriptstyle{T}}}{\bf X}_i)\}\right],$|
|$\widetilde{\boldsymbol{\beta}}_{\Omega} = (\widetilde{\beta}_{\Omega,1},..., \widetilde{\beta}_{\Omega,p})^{{\sf \scriptscriptstyle{T}}}= \mathop{\mbox{argmax}}_{\boldsymbol{\beta}}{\widehat{\ell}_{\Omega}(\boldsymbol{\beta})}$|⁠, |$\lambda_{\Omega_{\sf \scriptscriptstyle +}}\ge 0$| is a tuning parameter, and |$\gamma>2\nu/(1-\nu)$|⁠.
Although alternative estimators based on other penalty functions, such as the adaptive elastic net (Zou and Zhang, 2009) and the smoothly clipped absolute deviation (Fan and Li, 2001), can attain similar performance, we focus on the adaptive LASSO penalized estimator for simplicity. By setting the ridge penalty to 0, the adaptive elastic net reduces to the adaptive LASSO, and the results of Zou and Zhang (2009) indicate that, when |$n^{\frac{1}{2}}_0\lambda_{\Omega_{\sf \scriptscriptstyle +}} \to 0$| and |$n_0^{(1-\nu)(1+\gamma)/2} \lambda_{\Omega_{\sf \scriptscriptstyle +}} \to \infty$|⁠, |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle\Omega_{\sf \scriptscriptstyle +}}$| achieves variable selection consistency, i.e., the estimated active set |$\widehat{{\cal A}}_{\sf \scriptscriptstyle +}=\{\jmath:\widehat{\beta}_{\scriptscriptstyle\Omega_{\sf \scriptscriptstyle +},\jmath}\ne 0 \}$| satisfies |$P(\widehat{{\cal A}}_{\sf \scriptscriptstyle +} = \mathcal{A}) \to 1$|⁠, and the oracle property holds in that the asymptotic distribution of |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle\Omega_{\sf \scriptscriptstyle +}}^{\scriptscriptstyle{{\cal A}}}$| is the same as that of the maximum likelihood estimator (MLE) constrained on the parameter space |$\{\boldsymbol{\beta}: \beta_j = 0, j \notin \mathcal{A}\}$|⁠. That is, |$n^{\frac{1}{2}}_0(\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle\Omega_{\sf \scriptscriptstyle +}}^{\scriptscriptstyle{{\cal A}}} - \boldsymbol{\beta}_0^{\scriptscriptstyle{{\cal A}}})$| converges in distribution to |$\mathcal{N}\left(\textbf{0},\mathbb{A}^{\scriptscriptstyle{{\cal A}}}(\boldsymbol{\beta}_0)^{-1}\right)$|⁠, where for any set |$\mathcal{A}$|⁠, |$\mathbf{G}^{\mathcal{A}}$| denotes the subvector of |$\mathbf{G}$| corresponding to |$\mathcal{A}$| if |$\mathbf{G}$| is a vector and |$\mathbf{G}^{\mathcal{A}} = [G_{ll'}]_{l \in \mathcal{A},l' \in \mathcal{A}}$| if |$\mathbf{G}$| is a matrix, |$\mathbb{A}(\boldsymbol{\beta}) = E\left[g(\boldsymbol{\beta}^{{\sf \scriptscriptstyle{T}}}{\bf X})\{1-g(\boldsymbol{\beta}^{{\sf \scriptscriptstyle{T}}}{\bf X})\}{\bf X}{\bf X}^{{\sf \scriptscriptstyle{T}}}\right]$| is the information matrix under (2.1), and, for any index set |$\Omega$|⁠, its empirical counterpart and the corresponding score function are |$\widehat{\mathbb{A}}_{\Omega}(\boldsymbol{\beta}) = n_{\Omega}^{-1}\sum_{i\in\Omega} g(\boldsymbol{\beta}^{{\sf \scriptscriptstyle{T}}}{\bf X}_i)\{1-g(\boldsymbol{\beta}^{{\sf \scriptscriptstyle{T}}}{\bf X}_i)\}{\bf X}_i{\bf X}_i^{{\sf \scriptscriptstyle{T}}}$| and |$\widehat{{\bf U}}_{\Omega}(\boldsymbol{\beta}) = n_{\Omega}^{-1}\sum_{i\in\Omega}\{Y_i - g(\boldsymbol{\beta}^{{\sf \scriptscriptstyle{T}}}{\bf X}_i)\}{\bf X}_i$|⁠.

The oracle property holds when the asymptotic distribution of the estimator is the same as the asymptotic distribution of the MLE on the true support. When |$n_0$| is exceedingly large, obtaining |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle\Omega_{\sf \scriptscriptstyle +}}$| is not computationally feasible. Our first goal is to employ SOLID to obtain a computationally efficient estimator that achieves the same efficiency as the full sample estimator |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle\Omega_{\sf \scriptscriptstyle +}}$|⁠.
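For concreteness, the following R sketch computes a full-sample adaptive LASSO fit of the form (2.2); the use of glmnet, a ridge initial estimator, and cross-validated tuning (rather than the BIC criterion of Section 2.4) are illustrative choices and not the authors' implementation.

    library(glmnet)
    # Full-sample adaptive LASSO logistic regression in the spirit of (2.2).
    # X: n0 x p matrix of predictors (no intercept column), y: 0/1 outcome.
    init <- glmnet(X, y, family = "binomial", alpha = 0, lambda = 1e-4)  # rough initial estimate
    w    <- 1 / abs(as.numeric(coef(init))[-1])                          # adaptive weights, gamma = 1
    fit  <- glmnet(X, y, family = "binomial", alpha = 1, penalty.factor = w)
    cv   <- cv.glmnet(X, y, family = "binomial", penalty.factor = w)
    beta_full <- as.numeric(coef(fit, s = cv$lambda.min))                # (p+1)-vector, intercept first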

Our second goal is to develop procedures for making inference about the accuracy of |$\pi_{\scriptscriptstyle \widehat{\boldsymbol{\beta}}}^{\sf \scriptscriptstyle{new}} = g(\widehat{\boldsymbol{\beta}}^{{\sf \scriptscriptstyle{T}}}{\bf X}^{\sf \scriptscriptstyle{new}})$| in predicting the outcome |$Y^{\sf \scriptscriptstyle{new}}$| with covariates |${\bf X}^{\sf \scriptscriptstyle{new}}$| for a future observation. Here, we use a few commonly used accuracy measures as examples, although our procedure can be applied to a broad class of accuracy parameters. For any threshold value |$c$|⁠, the expected true positive rate (TPR) and false positive rate (FPR) of |$I(\pi_{\scriptscriptstyle \widehat{\boldsymbol{\beta}}}^{\sf \scriptscriptstyle{new}} \ge c)$| in classifying |$Y^{\sf \scriptscriptstyle{new}}$| are respectively defined as |$\text{TPR}(c)=E\{\text{TPR}(c; \widehat{\boldsymbol{\beta}})\}$| and |$\text{FPR}(c) = E\{\text{FPR}(c; \widehat{\boldsymbol{\beta}})\}$|⁠, where the expectation is taken over the distribution of |$\widehat{\boldsymbol{\beta}}$|⁠, |$\text{TPR}(c; \boldsymbol{\beta}) = P(\pi_{\scriptscriptstyle \boldsymbol{\beta}}^{\sf \scriptscriptstyle{new}} \ge c \mid Y^{\sf \scriptscriptstyle{new}} = 1)$|⁠, and |$\text{FPR}(c; \boldsymbol{\beta}) = P(\pi_{\scriptscriptstyle \boldsymbol{\beta}}^{\sf \scriptscriptstyle{new}} \ge c \mid Y^{\sf \scriptscriptstyle{new}} = 0)$|⁠.

One may summarize the trade-off between the TPR and FPR functions based on the ROC curve, |$\text{ROC}(u)=\text{TPR}\left\{\text{FPR}^{-1}(u)\right\}.$| Additionally, the area under the ROC curve, |$\text{AUC}=\int_0^1 \text{ROC}(u){\rm d}u$|⁠, summarizes the overall prediction performance of |$\pi_{\scriptscriptstyle \widehat{\boldsymbol{\beta}}}^{\sf \scriptscriptstyle{new}}$|⁠. With a selected threshold |$c$|⁠, it is also often desirable to evaluate the positive predictive value (PPV) and negative predictive value (NPV) of the binary rule |$I(\pi_{\scriptscriptstyle \widehat{\boldsymbol{\beta}}}^{\sf \scriptscriptstyle{new}}\ge c)$|⁠, defined as |$\text{PPV}(c)=E\{\text{PPV}(c; \widehat{\boldsymbol{\beta}})\}$| and |$\text{NPV}(c)=E\{\text{NPV}(c; \widehat{\boldsymbol{\beta}})\}$|⁠, where |$\text{PPV}(c; \boldsymbol{\beta}) = P(Y^{\sf \scriptscriptstyle{new}}=1|\pi_{\scriptscriptstyle \boldsymbol{\beta}}^{\sf \scriptscriptstyle{new}} \ge c)$| and |$\text{NPV}(c; \boldsymbol{\beta})=P(Y^{\sf \scriptscriptstyle{new}}=0|\pi_{\scriptscriptstyle \boldsymbol{\beta}}^{\sf \scriptscriptstyle{new}}< c).$| For simplicity, we will use |$\text{TPR}(c)$| to illustrate the proposed method in this section, but note that estimation procedures can be similarly formed for other accuracy measures. We propose an efficient MCV procedure to construct bias-corrected point and interval estimators for |$\text{TPR}$|⁠.
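To make these definitions concrete, the following minimal R sketch computes empirical counterparts of the TPR, FPR, PPV, NPV, and the AUC on a validation set; the vectors pihat and y and the coarse threshold grid are illustrative assumptions, not part of the proposed procedure.

    # Empirical accuracy measures for predicted probabilities `pihat` and 0/1 outcomes `y`
    # on a validation set (illustrative names).
    TPR <- function(c, pihat, y) mean(pihat >= c & y == 1) / mean(y == 1)
    FPR <- function(c, pihat, y) mean(pihat >= c & y == 0) / mean(y == 0)
    PPV <- function(c, pihat, y) mean(y[pihat >= c] == 1)
    NPV <- function(c, pihat, y) mean(y[pihat <  c] == 0)
    # ROC curve on a coarse threshold grid, and the AUC via the rank (Mann-Whitney) identity
    cs  <- quantile(pihat, probs = seq(0.99, 0.01, by = -0.01))
    roc <- cbind(fpr = sapply(cs, FPR, pihat, y), tpr = sapply(cs, TPR, pihat, y))
    npos <- sum(y == 1); nneg <- sum(y == 0)
    auc  <- (sum(rank(pihat)[y == 1]) - npos * (npos + 1) / 2) / (npos * nneg)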

2.2. SOLID algorithm for approximating |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle\Omega_{\sf \scriptscriptstyle +}}$|

Our SOLID algorithm is motivated by the LSA proposed in Wang and Leng (2007). We first note that |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle\Omega_{\sf \scriptscriptstyle +}}$| can be approximated by

|$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle\Omega_{\sf \scriptscriptstyle +}}^{\sf\scriptscriptstyle lin} = \mathop{\mbox{argmin}}_{\boldsymbol{\beta}}\left\{\tfrac{1}{2}(\boldsymbol{\beta}-\widetilde{\boldsymbol{\beta}}_{\Omega_{\sf \scriptscriptstyle +}})^{{\sf \scriptscriptstyle{T}}}\widehat{\mathbb{A}}_{\Omega_{\sf \scriptscriptstyle +}}(\widetilde{\boldsymbol{\beta}}_{\Omega_{\sf \scriptscriptstyle +}})(\boldsymbol{\beta}-\widetilde{\boldsymbol{\beta}}_{\Omega_{\sf \scriptscriptstyle +}}) + \lambda_{\Omega_{\sf \scriptscriptstyle +}}\sum_{\jmath=1}^p |\beta_{\jmath}|/|\widetilde{\beta}_{\Omega_{\sf \scriptscriptstyle +},\jmath}|^{\gamma}\right\}.$|

Following similar arguments given in Caner and Zhang (2014), one may show that |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle\Omega_{\sf \scriptscriptstyle +}}^{\sf\scriptscriptstyle lin}$| achieves the same variable selection consistency as |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle\Omega_{\sf \scriptscriptstyle +}}$| and that |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle\Omega_{\sf \scriptscriptstyle +}}^{\sf\scriptscriptstyle lin,{\scriptscriptstyle{\cal A}}}$| has the same limiting distribution as |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle\Omega_{\sf \scriptscriptstyle +}}^{\scriptscriptstyle{{\cal A}}}$|⁠. This suggests that an estimator can recover the distribution of |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle\Omega_{\sf \scriptscriptstyle +}}$| if we can construct accurate DAC approximations to |$\widetilde{\boldsymbol{\beta}}_{\Omega_{\sf \scriptscriptstyle +}}$| and |$\widehat{\mathbb{A}}_{\Omega_{\sf \scriptscriptstyle +}}(\widetilde{\boldsymbol{\beta}}_{\Omega_{\sf \scriptscriptstyle +}})$|⁠. To this end, we propose the SOLID procedure, which requires two main steps: (i) screening for an active set of predictors; (ii) constructing a linearized adaptive LASSO estimator restricted to the active set. In both steps, linearization via the LSA is used.
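The following R helper sketches the LSA device under the notation above: the quadratic approximation is rewritten as a penalized least squares problem on a pseudo design built from a Cholesky factor of the information matrix. The use of glmnet, the unpenalized intercept, and glmnet's lambda scale are illustrative assumptions rather than the paper's implementation.

    library(glmnet)
    # LSA helper: approximate the adaptive LASSO using only an initial estimate `btilde`
    # and an information-matrix estimate `Ahat` (both for the (p+1)-vector, intercept first).
    lsa_alasso <- function(btilde, Ahat, lambda, gamma = 1) {
      R  <- chol(Ahat)                        # Ahat = t(R) %*% R
      yp <- as.numeric(R %*% btilde)          # so ||yp - R b||^2 = (b - btilde)' Ahat (b - btilde)
      w  <- c(0, 1 / abs(btilde[-1])^gamma)   # intercept unpenalized; adaptive weights otherwise
      fit <- glmnet(R, yp, family = "gaussian", alpha = 1, lambda = lambda,
                    penalty.factor = w, intercept = FALSE, standardize = FALSE)
      as.numeric(coef(fit))[-1]               # drop glmnet's own (fixed-at-zero) intercept
    }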

The screening step consists of four sub-steps:

(1.1) use subset |$\mathscr{D}_1$| to obtain a ridge-penalized logistic regression estimator, denoted by |$\widetilde{\boldsymbol{\beta}}_{\Omega_1}^{\sf \scriptscriptstyle rid}$|⁠;
(1.2) obtain the one-step linear approximation to |$\widetilde{\boldsymbol{\beta}}_{\Omega_{\sf \scriptscriptstyle +}}$| as |$\widetilde{\boldsymbol{\beta}}_{\Omega_{\sf \scriptscriptstyle +}}^{\sf\scriptscriptstyle lin,1} = \widetilde{\boldsymbol{\beta}}_{\Omega_1}^{\sf \scriptscriptstyle rid} + \widehat{\mathbb{A}}_{\sf \scriptscriptstyle{DAC}}(\widetilde{\boldsymbol{\beta}}_{\Omega_1}^{\sf \scriptscriptstyle rid})^{-1}\widehat{{\bf U}}_{\sf \scriptscriptstyle{DAC}}(\widetilde{\boldsymbol{\beta}}_{\Omega_1}^{\sf \scriptscriptstyle rid})$|⁠, where |$\widehat{{\bf U}}_{\sf \scriptscriptstyle{DAC}}(\boldsymbol{\beta}) = K^{-1}\sum_{k=1}^K \widehat{{\bf U}}_{\Omega_k}(\boldsymbol{\beta})$| and |$\widehat{\mathbb{A}}_{\sf \scriptscriptstyle{DAC}}(\boldsymbol{\beta}) = K^{-1}\sum_{k=1}^K \widehat{\mathbb{A}}_{\Omega_k}(\boldsymbol{\beta})$|⁠;
(1.3) apply the LSA for screening, i.e., obtain the adaptive LASSO estimator |$\widehat{\boldsymbol{\beta}}_{\sf\scriptscriptstyle screen}$| based on the quadratic approximation of the log-likelihood around |$\widetilde{\boldsymbol{\beta}}_{\Omega_{\sf \scriptscriptstyle +}}^{\sf\scriptscriptstyle lin,1}$| with tuning parameter |$\lambda_{\sf\scriptscriptstyle screen}$|⁠;
(1.4) screen for an active set, |$\widehat{{\cal A}} = \{\jmath: \widehat{\beta}_{\sf\scriptscriptstyle screen,\jmath} \ne 0\}$|⁠, and obtain |$\widehat{\boldsymbol{\beta}}_{\sf\scriptscriptstyle screen}^{\odot {\scriptscriptstyle \widehat{\cal A}}} = \widehat{\boldsymbol{\beta}}_{\sf\scriptscriptstyle screen} \odot {\bf I}(\widehat{{\cal A}})$|⁠, where |${\bf I}(\widehat{{\cal A}}) = [1, I(1 \in \widehat{{\cal A}}), \cdots, I(p \in \widehat{{\cal A}})]^{{\sf \scriptscriptstyle{T}}}$| and |$\odot$| denotes the elementwise product (a schematic R sketch of steps (1.1)-(1.4) is given after this list).
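The sketch below renders the screening step schematically; the names Omega, X, y, and lambda_screen, and the reuse of the lsa_alasso() helper from the previous sketch, are assumptions.

    library(glmnet)
    expit <- function(u) 1 / (1 + exp(-u))
    Xi <- cbind(1, X)                                            # design matrix with intercept
    score_block <- function(idx, b) crossprod(Xi[idx, ], y[idx] - expit(Xi[idx, ] %*% b)) / length(idx)
    info_block  <- function(idx, b) { ph <- as.numeric(expit(Xi[idx, ] %*% b))
      crossprod(Xi[idx, ] * (ph * (1 - ph)), Xi[idx, ]) / length(idx) }
    # (1.1) ridge estimator from the first block
    b_rid <- as.numeric(coef(glmnet(X[Omega[[1]], ], y[Omega[[1]]], family = "binomial",
                                    alpha = 0, lambda = 1e-4)))
    # (1.2) one-step update combining block-wise score functions and information matrices
    U_dac <- Reduce(`+`, lapply(Omega, score_block, b = b_rid)) / length(Omega)
    A_dac <- Reduce(`+`, lapply(Omega, info_block,  b = b_rid)) / length(Omega)
    b_lin1 <- as.numeric(b_rid + solve(A_dac, U_dac))
    # (1.3)-(1.4) LSA-based adaptive LASSO around b_lin1, then keep the non-zero components
    b_screen <- lsa_alasso(b_lin1, A_dac, lambda = lambda_screen)
    A_hat <- which(b_screen[-1] != 0)                            # estimated active set (predictor indices)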

The post-screening linearized adaptive LASSO fitting step consists of two main sub-steps:

(2.1) obtain the DAC approximated initial estimator, |$\widetilde{\boldsymbol{\beta}}_{\sf \scriptscriptstyle{DAC}} = \widetilde{\boldsymbol{\beta}}_{\sf \scriptscriptstyle{DAC}}^{\scriptscriptstyle [M]}$|⁠, where |$\widetilde{\boldsymbol{\beta}}_{\sf \scriptscriptstyle{DAC}}^{\scriptscriptstyle [M]}$| is obtained iteratively by letting |$\widetilde{\boldsymbol{\beta}}_{\sf \scriptscriptstyle{DAC}}^{\scriptscriptstyle [0]} = \widehat{\boldsymbol{\beta}}_{\sf\scriptscriptstyle screen}^{\odot {\scriptscriptstyle \widehat{\cal A}}}$| and, for |$m=1,\ldots,M$|⁠, |$\widetilde{\boldsymbol{\beta}}_{\sf \scriptscriptstyle{DAC}}^{\scriptscriptstyle [m]} = \widetilde{\boldsymbol{\beta}}_{\sf \scriptscriptstyle{DAC}}^{\scriptscriptstyle [m-1]} + \widehat{\mathbb{I}}_{\sf \scriptscriptstyle{DAC}}^{\odot {\scriptscriptstyle \widehat{\cal A}}}(\widetilde{\boldsymbol{\beta}}_{\sf \scriptscriptstyle{DAC}}^{\scriptscriptstyle [m-1]})\,\widehat{{\bf U}}_{\sf \scriptscriptstyle{DAC}}(\widetilde{\boldsymbol{\beta}}_{\sf \scriptscriptstyle{DAC}}^{\scriptscriptstyle [m-1]})$|⁠, where |$\widehat{\mathbb{I}}_{\sf \scriptscriptstyle{DAC}}^{\odot {\scriptscriptstyle \widehat{\cal A}}}(\boldsymbol{\beta})$| is the |$(p+1)\times (p+1)$| matrix whose submatrix corresponding to |$\widehat{{\cal A}}$| is |$\widehat{\mathbb{A}}^{\scriptscriptstyle \widehat{\mathcal{A}}}_{\sf \scriptscriptstyle{DAC}}(\boldsymbol{\beta} )^{-1}$| and whose remaining elements are 0;
(2.2) obtain the final SOLID estimator, |$\widehat{\boldsymbol{\beta}}_{\sf \scriptscriptstyle{SOLID}}$|⁠, by applying the LSA-based adaptive LASSO to the quadratic approximation of the log-likelihood around |$\widetilde{\boldsymbol{\beta}}_{\sf \scriptscriptstyle{DAC}}$|⁠, with all elements outside |$\widehat{{\cal A}}$| constrained to 0 (a schematic R sketch of steps (2.1) and (2.2) follows the next paragraph).

The SOLID estimator always takes the value 0 for elements deemed 0 in the screening step and only requires estimating |$\boldsymbol{\beta}$| over the elements in |$\widehat{{\cal A}}$|⁠. The linearization in step (2.2) allows us to solve the penalized regression using a pseudo likelihood based on only |$\mbox{Card}(\widehat{{\cal A}})$| pseudo data points, which is substantially smaller than the subset sample size |$n$| entering each subset likelihood. Compared to solving the full-sample problem (2.2) directly, the computation cost of SOLID is therefore substantially lower when |$n_0\gg p$| and |$p$| is not small.
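Continuing the previous sketch, steps (2.1) and (2.2) may be rendered as follows; M, lambda_solid, and the reuse of the helpers defined above are assumptions, and the final refit is only a schematic version of the LSA-based adaptive LASSO restricted to the active set.

    act <- c(1, A_hat + 1)                       # positions of the intercept and the active predictors
    b_dac <- b_screen * 0; b_dac[act] <- b_screen[act]
    for (m in seq_len(M)) {                      # (2.1) Newton-type DAC updates on the active set
      U <- Reduce(`+`, lapply(Omega, score_block, b = b_dac)) / length(Omega)
      A <- Reduce(`+`, lapply(Omega, info_block,  b = b_dac)) / length(Omega)
      b_dac[act] <- b_dac[act] + solve(A[act, act], U[act])
    }
    # (2.2) refit the LSA-based adaptive LASSO on the active components only
    b_solid <- b_dac * 0
    b_solid[act] <- lsa_alasso(b_dac[act], A[act, act], lambda = lambda_solid)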

Following similar arguments as given in Wang and others (2019), Zou and Zhang (2009), and Cui and others (2017), one may show that the proposed SOLID estimator attains the same oracle properties as those of |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle\Omega_{\sf \scriptscriptstyle +}}$| under the regularity conditions given in Zou and Zhang (2009), provided that |$K = o\left(n^{\frac{1}{2}}_0\right)$|⁠. In the screening step, the one-step estimator |$\widetilde{\boldsymbol{\beta}}_{\Omega_{\sf \scriptscriptstyle +}}^{\sf\scriptscriptstyle lin,1}$| has a convergence rate of |$\|\widetilde{\boldsymbol{\beta}}_{\Omega_{\sf \scriptscriptstyle +}}^{\sf\scriptscriptstyle lin,1}-\boldsymbol{\beta}_0\|_{\infty} = O_p\{p/n_{\Omega_1}+\sqrt{\log(p)/n_0}\}$|⁠, which is slower than the rate |$\|\widetilde{\boldsymbol{\beta}}_{\Omega_{\sf \scriptscriptstyle +}}-\boldsymbol{\beta}_0\|_{\infty}=O_p\{\sqrt{\log(p)/n_0}\}$| when |$d_1\equiv\lim_{n_0\to \infty} \log(n_{\Omega_1})/\log(n_0) < (1+\nu)/2$|⁠. Nevertheless, when |$\lambda_{\sf\scriptscriptstyle screen} =O\{(n_0^{-1/2}+p/n_{\Omega_1})^{(1-\nu)(1+\gamma)}\}$|⁠, we may follow similar arguments as given in Zou and Zhang (2009) to show that |$P(\widehat{{\cal A}}=\mathcal{A}) \to 1$| as |$n_0\to \infty$| and |$\|\widehat{\boldsymbol{\beta}}_{\sf\scriptscriptstyle screen}^{\scriptscriptstyle{{\cal A}}} - \boldsymbol{\beta}_0^{\scriptscriptstyle{{\cal A}}}\|_2 = O_p(p/n_{\Omega_1}+n_0^{-1/2})$|⁠. In addition, when |$M$| is sufficiently large, |$\widehat{\boldsymbol{\beta}}_{\sf \scriptscriptstyle{SOLID}}$| attains the oracle properties in that |$P(\widehat{\cal A}_{\sf \scriptscriptstyle{SOLID}}={\cal A})\rightarrow1$| and |$\sqrt{n_0}(\widehat{\boldsymbol{\beta}}_{\sf \scriptscriptstyle{SOLID}}^{\scriptscriptstyle{{\cal A}}} - \boldsymbol{\beta}_0^{\scriptscriptstyle{{\cal A}}})$| and |$\sqrt{n_0}(\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle\Omega_{\sf \scriptscriptstyle +}}^{\mathcal{A}}- \boldsymbol{\beta}_0^{\scriptscriptstyle{{\cal A}}})$| have the same limiting distribution, where |$\widehat{{\cal A}}_{\sf \scriptscriptstyle{SOLID}} = \{j: \widehat{\beta}_{{\sf \scriptscriptstyle{SOLID}},j}\ne 0\} \subseteq \widehat{{\cal A}}$|⁠.

 
Remark 1

We use the ridge estimator in (1.1) instead of the standard MLE since |$n_{\Omega_1}$| may not be substantially larger than |$p$|⁠, in which case the ridge estimator is more stable than MLE. In practice, we find that both MLE and the ridge estimator work well when |$p \ll n_{\Omega_1}$| (e.g., |$p=50$| and |$n_{\Omega_1} = 10^4$|⁠), but the ridge estimator is much more stable otherwise (e.g., |$p=500$| and |$n_{\Omega_1} = 10^4$|⁠).

 
Remark 2

Although, theoretically, |$\widehat{{\cal A}}$| from the screening step can already consistently estimate |$\mathcal{A}$|⁠, we follow up with a refined estimate in steps (2.1) and (2.2) since |$\widehat{{\cal A}}$| is obtained based on the one-step estimator |$\widetilde{\boldsymbol{\beta}}_{\Omega_{\sf \scriptscriptstyle +}}^{\sf\scriptscriptstyle lin,1}$|⁠, which has a sub-optimal convergence rate. In practice, one may also choose a slightly conservative |$\lambda_{\sf\scriptscriptstyle screen}$| to make sure that the screening step does not remove any true signals.

 
Remark 3

Step (2.1) requires |$M$| iterations but can be carried out quickly because it is constrained to the set of predictors identified by |$\widehat{{\cal A}}$|⁠, whose size is substantially smaller than |$p$|⁠. In practice, one may choose |$M$| such that step (2.1) converges, or such that |$(p/n_{\Omega_1}+n_0^{-1/2})^{2^M} = o( n_0^{-1/2})$|⁠, i.e., |$M > \log\{1/(d_1-\nu)\}/\log(2)-1$| if |$p = c \times n_0^{\nu}$| and |$n_{\Omega_1} = c \times n_0^{d_1}$| for some constant |$c$|⁠. When either condition holds, |$\widetilde{\boldsymbol{\beta}}_{\sf \scriptscriptstyle{DAC}}^{\scriptscriptstyle{{\cal A}}}$| is asymptotically equivalent to the solution to |$\widehat{{\bf U}}_{\sf \scriptscriptstyle{DAC}}^{\scriptscriptstyle{{\cal A}}}(\boldsymbol{\beta}^{\scriptscriptstyle{{\cal A}}}) = {\bf 0}$|⁠.

2.3. Modified cross-validation for model evaluation

It is well known that the apparent accuracy estimators (APP), obtained by comparing the predicted values with the observed outcomes in the training data, are subject to overfitting bias in finite samples. CV is commonly used to reduce this bias. However, in the standard CV procedure, the regression coefficients are repeatedly re-estimated in each random split, which substantially increases the computational burden when both |$n_0$| and |$p$| are large. To reduce the time cost, we propose the following procedure to approximate the standard CV estimator without repeatedly fitting the model within each data split,

|$\widehat{\text{TPR}}_{\sf \scriptscriptstyle{MCV}}(c) = K^{-1}\sum_{k=1}^K \widehat{\text{TPR}}_{\Omega_k}(c; \widehat{\boldsymbol{\beta}}^{\scriptscriptstyle \widehat{\mathcal{A}}} _{{\sf \scriptscriptstyle{DAC}},{\sf \scriptscriptstyle{-k}}}), \quad\mbox{where}\quad \widehat{\text{TPR}}_{\Omega_k}(c; \boldsymbol{\beta}) = \frac{\sum_{i \in \Omega_k} Y_i I\{g(\boldsymbol{\beta}^{{\sf \scriptscriptstyle{T}}}{\bf X}_i)\ge c\}}{\sum_{i \in \Omega_k} Y_i}$|

is the empirical TPR evaluated on the held-out |$k$|th data block.

We obtain |$\widehat{\boldsymbol{\beta}}^{\scriptscriptstyle \widehat{\mathcal{A}}} _{{\sf \scriptscriptstyle{DAC}},{\sf \scriptscriptstyle{-k}}}$| by repeating steps (1.2) to (2.2) of the SOLID procedure while leaving out the |$k$|th data block, as follows:

(MCV 1) obtain the one-step linear approximation |$\widetilde{\boldsymbol{\beta}}_{{\sf \scriptscriptstyle{-k}}}^{\sf\scriptscriptstyle lin,1}$| as in step (1.2), with |$\widehat{{\bf U}}_{\sf \scriptscriptstyle{DAC}}$| and |$\widehat{\mathbb{A}}_{\sf \scriptscriptstyle{DAC}}$| replaced by their leave-one-block-out counterparts,
where |$ \widehat{{\bf U}}_{{\sf \scriptscriptstyle{DAC}},{\sf \scriptscriptstyle{-k}}}(\boldsymbol{\beta}) = (K-1)^{-1}\sum_{k'=1,k'\ne k}^K \widehat{{\bf U}}_{\Omega_{k'}}(\boldsymbol{\beta}) \quad \mbox{and}\quad \widehat{\mathbb{A}}_{{\sf \scriptscriptstyle{DAC}},{\sf \scriptscriptstyle{-k}}}(\boldsymbol{\beta}) = (K-1)^{-1}\sum_{k'=1,k'\ne k}^K \widehat{\mathbb{A}}_{\Omega_{k'}}(\boldsymbol{\beta})$|⁠. Note that |$\widehat{{\bf U}}_{\Omega_{k'}}$| and |$\widehat{\mathbb{A}}_{\Omega_{k'}}$| for |$k' = 1, \ldots, K$| are side products of the SOLID procedure and are readily available.
(MCV 2) apply the LSA for screening as in step (1.3), with |$\widetilde{\boldsymbol{\beta}}_{{\sf \scriptscriptstyle{-k}}}^{\sf\scriptscriptstyle lin,1}$| in place of |$\widetilde{\boldsymbol{\beta}}_{\Omega_{\sf \scriptscriptstyle +}}^{\sf\scriptscriptstyle lin,1}$|⁠, to obtain |$\widehat{\boldsymbol{\beta}}_{\sf\scriptscriptstyle screen,-k}$| and the active set |$\widehat{\cal A}_{-k}= \{\jmath: \widehat{\beta}_{\sf\scriptscriptstyle screen,-k,\jmath} \ne 0\}.$|

(MCV 3) iteratively calculate the DAC approximated initial estimator as in step (2.1): let |${\widetilde{\boldsymbol{\beta}}}_{{\sf \scriptscriptstyle{DAC}},{\sf \scriptscriptstyle{-k}}}^{\scriptscriptstyle [0]} = {\widehat{\boldsymbol{\beta}}}_{\sf\scriptscriptstyle screen,-k}^{\odot {\scriptscriptstyle \widehat{\cal A}}_{-k}}$| and update for |$M$| iterations with |$\widehat{{\bf U}}_{{\sf \scriptscriptstyle{DAC}},{\sf \scriptscriptstyle{-k}}}$| and |$\widehat{\mathbb{A}}_{{\sf \scriptscriptstyle{DAC}},{\sf \scriptscriptstyle{-k}}}$| in place of |$\widehat{{\bf U}}_{\sf \scriptscriptstyle{DAC}}$| and |$\widehat{\mathbb{A}}_{\sf \scriptscriptstyle{DAC}}$|⁠.

Let |$\widetilde{\boldsymbol{\beta}}_{{\sf \scriptscriptstyle{DAC}},{\sf \scriptscriptstyle{-k}}}^{\scriptscriptstyle \widehat{\mathcal{A}}}=\widetilde{\boldsymbol{\beta}}_{{\sf \scriptscriptstyle{DAC}},{\sf \scriptscriptstyle{-k}}}^{[M]}$|⁠.

(MCV 4) obtain the final estimator |$\widehat{\boldsymbol{\beta}}^{\scriptscriptstyle \widehat{\mathcal{A}}} _{{\sf \scriptscriptstyle{DAC}},{\sf \scriptscriptstyle{-k}}}$| as in step (2.2), using the data blocks other than |$\Omega_{k}$|⁠.

It is worth mentioning that the major time cost of the SOLID procedure lies in the calculation of |$\widetilde{\boldsymbol{\beta}}_{\Omega_1}^{\sf \scriptscriptstyle rid}$| in step (1.1) and the calculation of |$\widehat{{\bf U}}_{\Omega_{k}}$| and |$\widehat{\mathbb{A}}_{\Omega_k}$| for |$k=1, \ldots, K$| in step (1.2). In the MCV procedure, we reuse |$\widetilde{\boldsymbol{\beta}}_{\Omega_1}^{\sf \scriptscriptstyle rid}$|⁠, |$\widehat{{\bf U}}_{\Omega_{k}}$|⁠, and |$\widehat{\mathbb{A}}_{\Omega_k}$| as the side products from the SOLID procedure, thereby substantially reducing the time cost.
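The following schematic R sketch illustrates this reuse of side products; for brevity it performs a single Newton-type refinement with the leave-one-block-out summaries instead of the full repeat of steps (1.2) through (2.2), and the threshold c0 as well as the objects from the earlier sketches (Omega, Xi, y, b_dac, act, expit, score_block, info_block) are assumptions.

    U_list <- lapply(Omega, score_block, b = b_dac)
    A_list <- lapply(Omega, info_block,  b = b_dac)
    U_tot  <- Reduce(`+`, U_list);  A_tot <- Reduce(`+`, A_list)
    tpr_mcv <- mean(sapply(seq_along(Omega), function(k) {
      U_k <- (U_tot - U_list[[k]]) / (length(Omega) - 1)      # leave-block-k-out score
      A_k <- (A_tot - A_list[[k]]) / (length(Omega) - 1)      # leave-block-k-out information
      b_k <- b_dac; b_k[act] <- b_k[act] + solve(A_k[act, act], U_k[act])
      idx <- Omega[[k]]                                       # evaluate on the held-out block
      ph  <- as.numeric(expit(Xi[idx, ] %*% b_k))
      mean(ph >= c0 & y[idx] == 1) / mean(y[idx] == 1)        # empirical TPR at threshold c0
    }))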

2.4. Tuning and standard errors calculation

The tuning parameter |$\lambda_{\Omega{_{\sf \scriptscriptstyle +}}}$| is chosen by minimizing the Bayesian information criteria (BIC) of the fitted model. Specifically, for any given tuning parameter |$\lambda_{\Omega_{\sf \scriptscriptstyle +}}$| with its corresponding estimate of |$\boldsymbol{\beta}$|⁠, |$\widehat{\boldsymbol{\beta}}_{\lambda_{\Omega_{\sf \scriptscriptstyle +}}}$|⁠, the BIC is defined as
(2.4) |$\text{BIC}_{\lambda_{\Omega_{\sf \scriptscriptstyle +}}} = -2\, n_0\, \widehat{\ell}_{\Omega_{\sf \scriptscriptstyle +}}(\widehat{\boldsymbol{\beta}}_{\lambda_{\Omega_{\sf \scriptscriptstyle +}}}) + \log(n_0)\, df_{\lambda_{\Omega_{\sf \scriptscriptstyle +}}},$|
where |$df_{\lambda_{\Omega_{\sf \scriptscriptstyle +}}}=\sum_{\jmath = 1}^p I(\widehat{\beta}_{\lambda_{\Omega_{\sf \scriptscriptstyle +}}, \jmath} \ne 0)$|⁠. With the LSA, we may further approximate |$\text{BIC}_{\lambda_{\Omega_{\sf \scriptscriptstyle +}}}$| by
(2.5) |$\text{BIC}_{\lambda_{\Omega_{\sf \scriptscriptstyle +}}} \approx n_0\,(\widehat{\boldsymbol{\beta}}_{\lambda_{\Omega_{\sf \scriptscriptstyle +}}}-\widetilde{\boldsymbol{\beta}}_{\sf \scriptscriptstyle{DAC}})^{{\sf \scriptscriptstyle{T}}}\widehat{\mathbb{A}}_{\sf \scriptscriptstyle{DAC}}(\widetilde{\boldsymbol{\beta}}_{\sf \scriptscriptstyle{DAC}})(\widehat{\boldsymbol{\beta}}_{\lambda_{\Omega_{\sf \scriptscriptstyle +}}}-\widetilde{\boldsymbol{\beta}}_{\sf \scriptscriptstyle{DAC}}) + \log(n_0)\, df_{\lambda_{\Omega_{\sf \scriptscriptstyle +}}}.$|
We may thus estimate the variance–covariance matrix for |$n^{\frac{1}{2}}_0(\widehat{\boldsymbol{\beta}}^{\scriptscriptstyle \widehat{\mathcal{A}}}_{\sf \scriptscriptstyle{DAC}}-\boldsymbol{\beta}_0^{\scriptscriptstyle \widehat{\mathcal{A}}})$| using |$\widehat{\mathbb{A}}^{\scriptscriptstyle \widehat{\mathcal{A}}}(\widehat{\boldsymbol{\beta}}_{\sf \scriptscriptstyle{DAC}})^{-1}$|⁠. For |$\jmath \in \widehat{{\cal A}}$|⁠, a |$(1-\alpha)\times 100\%$| confidence interval for |$\beta_{0\jmath}$| can be calculated accordingly. However, deriving the explicit variance estimate for the accuracy parameter |$\widehat{\text{TPR}}_{\sf \scriptscriptstyle{MCV}}$| is challenging since its asymptotic variance involves unknown derivative functions. On the other hand, while resampling methods such as the bootstrap are often used to overcome such challenges, they are overly time consuming under the present setting. We therefore propose an approximate standard error for both |$\widehat{\boldsymbol{\beta}}^{\scriptscriptstyle \widehat{\mathcal{A}}}_{\sf \scriptscriptstyle{DAC}}$| and |$\widehat{\text{TPR}}_{\sf \scriptscriptstyle{MCV}}$| by leveraging many quantities already calculated for the |$K$| independent partitions. Specifically, we propose to approximate the standard error of |$\widehat{\boldsymbol{\beta}}^{\scriptscriptstyle \widehat{\mathcal{A}}}_{\sf \scriptscriptstyle{DAC}}$| by
(2.6) |$\widehat{\rm{se}}\big(\widehat{\beta}_{{\sf \scriptscriptstyle{DAC}},\jmath}\big) = \Big[\{K(K-1)\}^{-1}\textstyle\sum_{k=1}^K\big(\widehat{\beta}_{{\sf \scriptscriptstyle{DAC}},{\sf \scriptscriptstyle{k}},\jmath}-\overline{\beta}_{{\sf \scriptscriptstyle{DAC}},\jmath}\big)^2\Big]^{1/2}, \qquad \jmath \in \widehat{{\cal A}},$|
where |$\overline{\boldsymbol{\beta}}^{\scriptscriptstyle \widehat{{\cal A}}}_{\sf \scriptscriptstyle{DAC}}=K^{-1}\sum_{k=1}^K{\widehat{\boldsymbol{\beta}}}^{\scriptscriptstyle \widehat{{\cal A}}}_{{\sf \scriptscriptstyle{DAC}},{\sf \scriptscriptstyle{k}}}$| and |${\widehat{\boldsymbol{\beta}}}^{\scriptscriptstyle \widehat{{\cal A}}}_{{\sf \scriptscriptstyle{DAC}},{\sf \scriptscriptstyle{k}}}$| denotes the block-|$k$| specific estimator.
It is worth mentioning that although the apparent |$\widehat{\text{TPR}}_{\sf \scriptscriptstyle{APP}}$| is subject to overfitting, it can be shown that one may approximate the standard errors of |$\widehat{\text{TPR}}_{\sf \scriptscriptstyle{MCV}}$| by that of |$\widehat{\text{TPR}}_{\sf \scriptscriptstyle{APP}}$| following similar arguments as given in Tian and others (2007). We thus propose to approximate the standard error of |$\widehat{\text{TPR}}_{\sf \scriptscriptstyle{MCV}}$| by
(2.7) |$\widehat{\rm{se}}\big(\widehat{\text{TPR}}_{\sf \scriptscriptstyle{MCV}}\big) = \Big[\{K(K-1)\}^{-1}\textstyle\sum_{k=1}^K\big(\widehat{\text{TPR}}^{\scriptscriptstyle \widehat{{\cal A}}}_{\sf \scriptscriptstyle{APP,k}}-\overline{\text{TPR}}^{\scriptscriptstyle \widehat{{\cal A}}}_{\sf \scriptscriptstyle{APP}}\big)^2\Big]^{1/2},$|
where |$\overline{\text{TPR}}^{\scriptscriptstyle \widehat{{\cal A}}}_{\sf \scriptscriptstyle{APP}}=K^{-1}\sum_{k=1}^K\widehat{\text{TPR}}^{\scriptscriptstyle \widehat{{\cal A}}}_{\sf \scriptscriptstyle{APP,k}}$|⁠, and |$\widehat{\text{TPR}}^{\scriptscriptstyle \widehat{\mathcal{A}}}_{\sf \scriptscriptstyle{APP,k}}$| is the apparent TPR of |$\widehat{\boldsymbol{\beta}}^{\scriptscriptstyle \widehat{\mathcal{A}}}_{{\sf \scriptscriptstyle{DAC}},{\sf \scriptscriptstyle{k}}}$|⁠, built and validated using the |$k$|th data block.
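A minimal R sketch of this between-block standard error approximation is given below; beta_blocks (a K x q matrix of block-specific active-set estimates) and tpr_app_blocks (the length-K vector of block-specific apparent TPRs) are assumed inputs with illustrative names.

    K <- nrow(beta_blocks)
    beta_bar <- colMeans(beta_blocks)
    se_beta  <- sqrt(colSums(sweep(beta_blocks, 2, beta_bar)^2) / (K * (K - 1)))  # cf. (2.6)
    tpr_bar  <- mean(tpr_app_blocks)
    se_tpr   <- sqrt(sum((tpr_app_blocks - tpr_bar)^2) / (K * (K - 1)))           # cf. (2.7)
    # e.g., an approximate 95% interval: tpr_mcv +/- 1.96 * se_tpr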

3. Simulations

3.1. Simulation settings

We conducted extensive simulations to evaluate the performance of the proposed SOLID and MCV algorithms for developing and evaluating risk prediction models, and to compare them with existing approaches. Throughout, we let |$n_0 = 10^6$| and consider |$p=50$|⁠, |$500$|⁠, and |$1000$|⁠. We set |$K=100$| for |$p=50$| and |$500$|⁠, and |$K=50$| for |$p=1000$|⁠. We determine the rate of |$p$| as |$p = 2 \times n^{\nu}$|⁠, which gives |$\nu = 0.17, 0.33,$| and |$0.38$| for |$p=50, 500,$| and |$1000$|⁠, respectively. We generated |$Y$| from logistic models with three choices of |$\boldsymbol{\beta}_0$| (settings I–III) reflecting different degrees of sparsity and signal strength; the distinct coefficient values are listed in the footnote of Table 1, and an additional setting (IV) is described in the Supplementary Materials available at Biostatistics online.

We generate |${\bf X}_{-1}$|⁠, the covariate vector excluding the intercept, from |$N(\textbf{0}_p, \mathbb{V})$| with |$\mathbb{V}=0.8\,\mathbb{I}_{p\times p} + 0.2\,\mathbf{1}\mathbf{1}^{{\sf \scriptscriptstyle{T}}}$|⁠, where |$\mathbb{I}_{p\times p}$| is the identity matrix and |$\mathbf{1}$| is the |$p$|-vector of ones.
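A minimal R sketch of this data-generating mechanism follows; the coefficient vector beta0 and the reduced sample size are placeholders and do not correspond to settings I–III.

    library(MASS)
    # Covariates with exchangeable correlation 0.2 and a logistic outcome as in (2.1)
    n0 <- 1e5; p <- 50                                     # n0 reduced from 10^6 for a quick run
    V  <- 0.8 * diag(p) + 0.2                              # unit variances, 0.2 off-diagonal
    X  <- mvrnorm(n0, mu = rep(0, p), Sigma = V)
    beta0 <- c(-2, 1, 0.8, 0.4, 0.2, 0.1, rep(0, p - 5))   # placeholder: intercept first, then p coefficients
    y  <- rbinom(n0, 1, plogis(cbind(1, X) %*% beta0))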

For model estimation, we compare the SOLID estimator with four existing estimators: (i) the full sample-based adaptive LASSO estimator |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle\Omega_{\sf \scriptscriptstyle +}}$|⁠; (ii) the majority voting-based DAC scheme for GLM proposed by Chen and Xie (2014), denoted by |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle \sf Chen}$|⁠; (iii) the de-biased LASSO-based DAC scheme for GLM proposed by Tang and others (2016), denoted by |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle \sf Tang}$|⁠; (iv) the sparse meta-analysis approach proposed by He and others (2016), denoted by |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle \sf SMA}$|⁠. For all methods except SMA, we set |$\gamma=1$| for a fair comparison. When implementing SMA, we set |$\gamma =0$| so that no initial estimator is needed, due to the high computational cost of the SMA procedure. For model evaluation, we compare the MCV estimator with the external estimator (EXT), obtained by evaluating the fitted models on a large external validation set of size |$10^6$|⁠, and with the apparent estimator (APP).

For any |$\widehat{\boldsymbol{\beta}}\in\{\widehat{\boldsymbol{\beta}}_{\sf \scriptscriptstyle{DAC}}, \widehat{\boldsymbol{\beta}}_{\scriptscriptstyle\Omega_{\sf \scriptscriptstyle +}}, \widehat{\boldsymbol{\beta}}_{\scriptscriptstyle \sf Chen}, \widehat{\boldsymbol{\beta}}_{\scriptscriptstyle \sf Tang}, \widehat{\boldsymbol{\beta}}_{\scriptscriptstyle \sf SMA}\}$|⁠, we report (i) the average computation time for |$\widehat{\boldsymbol{\beta}}$|⁠; (ii) the global mean squared error (GMSE), defined as |$(\widehat{\boldsymbol{\beta}} - \boldsymbol{\beta}_0)^{{\sf \scriptscriptstyle{T}}}\mathbb{V}(\widehat{\boldsymbol{\beta}}-\boldsymbol{\beta}_0)$|⁠; (iii) the bias of each individual coefficient; (iv) the mean squared error (MSE) of each individual coefficient; (v) the empirical probability of |$\jmath\not\in\widehat{{\cal A}}$|⁠; (vi) the empirical coverage level of the 95% normal confidence interval (CI). For the model evaluation comparison, we report (i) the average computation time for calculating the ROC curve using EXT, APP, and MCV; (ii) the average difference (DIFF) between the APP and EXT accuracy estimates, and that between the MCV and EXT accuracy estimates; (iii) the empirical standard errors of the EXT, APP, and MCV accuracy estimates; (iv) the average approximated standard error (ASE) of the MCV accuracy estimate as derived in equation (2.7); (v) the coverage level of the 95% normal confidence intervals of the APP and MCV estimates. Taking the TPR as an example, we define "coverage" as the proportion of times that the confidence interval of |$\widehat{\text{TPR}}_{\sf \scriptscriptstyle{APP}}$| or |$\widehat{\text{TPR}}_{\sf \scriptscriptstyle{MCV}}$| contains |$\overline{\text{TPR}}_{\sf \scriptscriptstyle{EXT}}$|⁠, where |$\overline{\text{TPR}}_{\sf \scriptscriptstyle{EXT}}$| is defined as the average of |$\widehat{\text{TPR}}_{\sf \scriptscriptstyle{EXT}}$| taken over both the repeated samples and |$\widehat{\boldsymbol{\beta}}_{\sf \scriptscriptstyle{SOLID}}$| across the 1000 simulations. Note that |$\overline{\text{TPR}}_{\sf \scriptscriptstyle{EXT}}(c)$| is essentially a Monte Carlo estimator of the population parameter |$\text{TPR}(c) = E\{\text{TPR}(c; \widehat{\boldsymbol{\beta}})\}$|⁠, where |$\text{TPR}(c; \boldsymbol{\beta}) = P(\pi_{\scriptscriptstyle \boldsymbol{\beta}}^{\sf \scriptscriptstyle{new}} \ge c \mid Y^{\sf \scriptscriptstyle{new}} = 1) $|⁠.
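For a single simulated dataset, the GMSE and coverage summaries may be computed as in the following sketch, assuming bhat, se_bhat (e.g., from (2.6)), beta0, and V are available; the handling of the intercept is an assumption.

    d    <- bhat[-1] - beta0[-1]                    # drop the intercept so the dimension matches V
    gmse <- as.numeric(t(d) %*% V %*% d)            # (bhat - beta0)' V (bhat - beta0)
    ci_lo <- bhat - 1.96 * se_bhat
    ci_hi <- bhat + 1.96 * se_bhat
    covered <- (beta0 >= ci_lo) & (beta0 <= ci_hi)  # per-coefficient 95% CI coverage indicator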

The statistical performance is evaluated based on 1000 simulated datasets for each setting. Each simulation is performed on a single core of a cluster node with an Intel® Xeon® E5-2697 v3 processor @ 2.60 GHz.

3.2. Simulation results

We first conduct sensitivity analyses of the number of iterations |$M$| to examine its impact on the proposed estimator |$\widehat{\boldsymbol{\beta}}_{\sf \scriptscriptstyle{SOLID}}$|⁠. Across all settings, we found that it suffices to let |$M=3$| for |$p \le 500$| and |$M=6$| for |$p=1000$|⁠. Hence, we summarize below the results for |$\widehat{\boldsymbol{\beta}}_{\sf \scriptscriptstyle{SOLID}}$| with |$M=3$| for |$p=50$| and |$500$|⁠, and |$M=6$| for |$p=1000$|⁠, unless noted otherwise. These choices ensure that |$M > \log\{1/(d_1-\nu)\}/\log(2)-1$|⁠, as required by the theoretical results.

We summarize in Table 1 the performance measures of the different regression estimators for settings I–III, and in the Supplementary Materials available at Biostatistics online the performance for setting IV. There are substantial differences in computation time across methods, and the proposed SOLID algorithm is substantially faster than all competing DAC methods. The relative performance with respect to computation time is similar across the three sets of |$\boldsymbol{\beta}_0$|⁠, and the gains in computational efficiency are even larger for larger |$p$|⁠. For example, when |$p = 1000$|⁠, the SOLID algorithm is at least 10 times faster than the competing DAC methods.

Table 1.

Comparisons among the proposed SOLID algorithm, the standard adaptive LASSO (FULL), the sparse meta-analysis approach (SMA), and the algorithms proposed by Tang and others (2016) and Chen and Xie (2014), with respect to average computing time in minutes, global mean square error (GMSE) |$\times 10^5$|⁠, bias |$\times 10^3$|⁠, and coverage probability (CP) |$\times 10^2$| of each individual coefficient. For individual |$\beta_j$|'s, we only present the results for the six unique values, with |$(b_1, b_2, b_3,b_4,b_5,b_6)$| taking values |$(1,0.8,0.4,0.2,0.1,0)$| in settings (I) and (II) and |$(0.5,0.4,0.2,0.1,0.05,0)$| in setting (III).

                                       Bias                                      CP
Setting  p     Method  Time   GMSE     b1    b2    b3    b4    b5    b6          b1    b2    b3    b4    b5
I        50    FULL    8.97   5.11     2.78  2.63  2.50  2.51  2.72  0.00        94.0  94.3  95.2  95.0  93.2
I        50    SOLID   0.06   5.54     3.02  2.74  2.53  2.51  2.59  0.00        95.6  95.6  94.9  95.3  95.7
I        50    SMA     0.24   156      15.9  14.2  8.74  7.43  5.10  1.25        0.0   0.0   22.7  56.8  63.6
I        50    Tang    7.01   44.2     3.18  3.06  2.69  2.60  2.71  2.57        93.7  93.9  94.6  95.3  94.7
I        50    Chen    7.01   5.93     3.07  2.96  2.55  2.48  2.61  0.00        93.3  93.0  94.9  94.9  94.9
I        500   FULL    103    5.42     2.83  2.87  2.51  2.58  2.63  0.00        95.1  95.3  94.8  94.7  94.3
I        500   SOLID   1.66   5.33     2.78  2.66  2.51  2.54  2.61  0.00        94.6  95.6  94.3  94.5  92.5
I        500   SMA     21.8   2927     91.1  74.3  40.3  23.7  15.6  0.07        0.0   0.0   0.0   0.0   0.0
I        500   Tang    94.0   429      3.01  2.98  2.76  2.65  2.71  2.58        93.1  95.0  94.3  94.5  94.8
I        500   Chen    92.9   5.60     2.90  2.81  2.58  2.55  2.54  0.00        93.1  94.8  93.5  94.3  94.8
I        1000  FULL    539    5.11     2.87  2.73  2.30  2.32  2.39  0.00        87.6  88.0  96.8  92.2  94.0
I        1000  SOLID   9.41   5.53     2.89  2.73  2.64  2.70  2.71  0.00        94.2  94.2  93.4  95.0  94.2
I        1000  SMA     100    10365    180   145   77.0  42.8  25.6  0.01        0.0   0.0   0.0   0.0   0.0
I        1000  Tang    200    862      3.12  2.90  2.64  2.47  2.48  2.50        93.7  94.2  95.5  95.3  95.8
I        1000  Chen    206    5.34     3.01  2.77  2.53  2.37  2.36  0.00        94.7  92.6  94.5  94.7  95.0
II       50    FULL    9.41   12.8     3.30  2.79  2.79  2.86  2.79  0.00        96.0  94.4  94.8  95.1  93.9
II       50    SOLID   0.06   13.0     3.14  2.99  2.98  2.70  2.80  0.00        95.1  95.2  94.9  95.0  94.8
II       50    SMA     0.33   80.4     9.01  7.32  4.14  3.22  3.04  0.00        33.6  51.3  84.0  91.8  90.3
II       50    Tang    7.08   59.1     3.90  3.56  2.91  2.88  2.88  2.75        88.9  91.2  93.9  94.9  95.6
II       50    Chen    7.06   18.8     3.86  3.52  2.89  2.81  2.79  0.00        88.7  91.5  94.8  94.7  95.3
II       500   FULL    111    13.0     2.97  3.06  2.62  2.73  2.96  0.00        94.0  94.5  94.5  94.1  93.1
II       500   SOLID   1.82   14.4     3.03  3.05  2.88  2.73  2.88  0.00        94.4  93.3  94.8  95.5  92.8
II       500   SMA     25.8   6546     90.4  72.2  36.0  17.6  10.6  0.00        0.0   0.0   0.0   0.0   12.7
II       500   Tang    83.4   504      3.73  3.49  2.92  2.85  2.78  2.75        88.7  91.6  94.2  94.6  94.7
II       500   Chen    83.1   18.4     3.66  3.44  2.88  2.74  2.67  0.00        88.4  91.6  94.2  93.6  94.8
II       1000  FULL    303    13.3     3.30  3.21  2.66  2.44  3.09  0.00        92.2  97.7  94.8  92.8  92.8
II       1000  SOLID   8.80   14.6     3.53  3.22  2.60  2.73  2.67  0.00        94.6  95.5  95.5  94.6  91.1
II       1000  SMA     114    30202    194   156   77.5  39.3  21.2  0.00        0.0   0.0   0.0   0.0   0.0
II       1000  Tang    233    1001     3.72  3.52  2.96  2.79  2.83  2.86        89.5  92.1  93.8  92.8  95.1
II       1000  Chen    231    19.1     3.67  3.40  2.90  2.79  2.70  0.00        88.5  93.1  93.1  94.7  95.4
III      50    FULL    9.29   9.56     2.65  2.52  2.59  2.48  2.62  0.00        95.1  95.4  95.8  94.9  92.9
III      50    SOLID   0.08   9.49     2.65  2.49  2.58  2.47  2.53  0.00        95.1  95.7  95.6  94.9  93.3
III      50    SMA     0.31   21.3     4.03  3.52  2.64  2.60  2.80  0.00        81.1  86.0  93.1  93.0  93.2
III      50    Tang    7.33   44.1     2.96  2.70  2.47  2.59  2.50  2.61        93.0  94.2  94.1  94.4  95.4
III      50    Chen    7.34   476      6.62  7.24  7.59  7.87  50.0  0.00        48.7  39.0  27.1  28.5  0.0
III      500   FULL    149    10.5     2.41  2.50  2.30  2.60  3.03  0.00        95.3  94.1  95.1  95.6  92.2
III      500   SOLID   1.72   10.3     2.56  2.54  2.40  2.59  2.96  0.00        94.7  95.6  95.0  94.9  91.2
III      500   SMA     23.8   1145     37.1  29.1  14.9  7.59  6.06  0.00        0.0   0.0   0.0   35.1  68.0
III      500   Tang    89.7   423      2.88  2.76  2.56  2.71  2.46  2.61        92.9  95.1  94.7  94.5  94.2
III      500   Chen    87.9   476      6.69  7.20  7.81  7.73  50.0  0.00        43.1  38.3  25.3  33.8  0.0
III      1000  FULL    262    9.78     2.72  2.69  2.87  2.54  3.15  0.00        89.1  95.2  96.7  93.9  87.6
III      1000  SOLID   8.43   10.1     2.45  2.53  2.35  2.44  2.83  0.00        93.3  94.4  94.4  95.6  93.3
III      1000  SMA     100    5215     80.0  64.5  32.1  16.6  10.6  0.00        0.0   0.0   0.0   0.0   10.0
III      1000  Tang    222    845      2.79  2.80  2.60  2.60  2.63  2.57        93.1  93.6  93.6  94.7  95.5
III      1000  Chen    219    476      6.84  7.35  7.48  7.80  50.0  0.00        37.9  40.8  33.3  28.8  0.0

With respect to overall estimation efficiency, |$\widehat{\boldsymbol{\beta}}_{\sf \scriptscriptstyle{SOLID}}$| attains a GMSE comparable to that of |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle\Omega_{\sf \scriptscriptstyle +}}$| across all settings, while |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle \sf SMA}$|⁠, |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle \sf Chen}$|⁠, and |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle \sf Tang}$| attain generally larger, and sometimes substantially larger, GMSEs. For example, in setting (I), both |$\widehat{\boldsymbol{\beta}}_{\sf \scriptscriptstyle{SOLID}}$| and |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle \sf Chen}$| attained GMSEs comparable to that of |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle\Omega_{\sf \scriptscriptstyle +}}$|⁠. The SMA performs poorly for larger |$p$|⁠, as expected, since it was not designed for that regime. The de-biased LASSO-based DAC estimator |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle \sf Tang}$| also performs poorly due to substantially increased variability and biases for the zero coefficients, resulting in a large aggregated error in the highly sparse settings. In setting (III) with weaker signals, while |$\widehat{\boldsymbol{\beta}}_{\sf \scriptscriptstyle{SOLID}}$| continues to attain MSEs and biases comparable to those of |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle\Omega_{\sf \scriptscriptstyle +}}$|⁠, all competing DAC methods perform poorly, with drastically larger GMSEs in the high dimensional setting. The |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle \sf Tang}$| estimator again suffers from larger biases for the zero coefficients. On the other hand, |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle \sf Chen}$| can efficiently identify zero coefficients but has large biases for the non-zero coefficients. It is worth mentioning that both penalized estimators exhibit a small amount of bias; such bias in weak signals is expected for shrinkage estimators (Pavlou and others, 2016). However, it is important to note that |$\widehat{\boldsymbol{\beta}}_{\sf \scriptscriptstyle{SOLID}}$| and |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle\Omega_{\sf \scriptscriptstyle +}}$| perform nearly identically, suggesting that the SOLID procedure incurs negligible additional approximation error.

With respect to the empirical probability of |$\jmath\not\in\widehat{{\cal A}}$|⁠, all methods under comparison detect non-zero signals equally well, in that the percentage of non-zero estimates over the 1000 simulations is |$100\%$| whenever the true regression coefficient is not zero. On the other hand, |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle \sf Tang}$| has difficulty in estimating zero signals. The bootstrap was used to estimate the empirical coverage level of the 95% normal confidence interval for |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle\Omega_{\sf \scriptscriptstyle +}}$|⁠. All methods under comparison have coverage probabilities close to the 95% nominal level except for SMA, which is expected because SMA tends to produce biased estimates when |$p$| is relatively large.

We summarize in Table 2 the performance of the proposed procedures for making inference on accuracy parameters, comparing the MCV procedure with the apparent accuracy estimation. The external validation is included only as a benchmark and is not feasible in practical settings. For a |$100$|-fold CV, the computational cost of the proposed MCV procedure is slightly higher than that of the apparent estimation and external validation, but is substantially lower than that of the standard CV, confirming that leveraging pre-computed SOLID side products enables us to perform CV efficiently. In terms of accuracy estimation, the proposed MCV procedure produces substantially lower overfitting bias than the apparent estimation. The ROC curves obtained by the EXT, APP, and MCV procedures are displayed in Figure 1. As expected, the ROC curves obtained by EXT and MCV are nearly identical, while that obtained by APP is substantially different. In terms of standard error estimation, the approximate standard error is very close to the empirical standard error, suggesting that it estimates the true standard error well. With respect to coverage, the apparent estimates have coverage probabilities far from the 95% nominal level, while the coverage of the proposed MCV estimates is close to the nominal level.

Fig. 1. Point estimates and 95% confidence intervals of the top 20 non-zero beta coefficient estimates with the largest magnitudes, obtained by the proposed SOLID method and its competing methods.

Table 2.

Comparisons among the external validation (EXT), apparent validation (APP), and MCV with respect to average computing time, average difference from the external validation estimate (DIFF), empirical standard error (ESE), average approximate standard error (ASE), and coverage probability (CP) of the area under the curve (AUC) and sensitivity.

                      AUC                                        Sensitivity
Method  Time   DIFF      ESE       ASE       CP (%)      DIFF       ESE       ASE       CP (%)
EXT     7.5    --        0.00413   --        --          --         0.00320   --        --
APP     7.6    0.00927   0.00468   --        52.1        0.03055    0.01779   --        62.7
MCV     35.0   0.00006   0.00448   0.00422   92.3        -0.00122   0.01536   0.01447   92.9
Entries marked "--" are not reported for the corresponding method.

The proposed SOLID algorithm was implemented as an R software package solid, which is available at https://github.com/celehs/solid.

4. Data application

The EMR contains rich clinical information on patients, including medical history, vital signs, lab test results, and clinical notes. Accurately assigning ICD codes for individual patient visits is highly important on many levels, from ensuring the integrity of the billing process to creating an accurate record of patient medical history. However, the coding process is tedious, subjective, and currently relies largely on manual efforts from professional medical coders. The large volume of medical records makes manual ICD assignment a labor-intensive and costly process, signifying the need for methods to automate the coding process. Narrative features can be extracted from free text such as radiology reports via natural language processing (NLP). To explore the feasibility of automatic code assignment based on narrative text, we aim to develop an ICD classification model for depression based on high dimensional NLP features, and to evaluate its accuracy using EMR data from the Partners HealthCare Biobank (PHB).

The analysis data consisted of ICD codes and NLP features for |$1\,000\,000$| visits, randomly sampled from the |$2\,096\,563$| visits in PHB. The remaining |$1\,096\,563$| visits were reserved as a validation set for model evaluation. The response |$Y$| was the indicator of having at least one ICD code for depression at a visit, with a prevalence of about |$4\%$|⁠. A list of |$1410$| candidate medical concepts was extracted by performing named entity recognition on five articles related to depression from online sources, including Wikipedia, MedlinePlus, Medscape, Merck, and Mayo Clinic, following the strategy of Yu and others (2016). We then performed NLP on the EMR narrative notes to count the occurrences of these concepts in each visit. After quality control and pre-screening, we retained a total of 142 NLP features that appeared in at least 5% of the visits. Because the NLP counts tend to be zero-inflated and highly skewed, we applied the |$x \mapsto \log(x + 1)$| transformation to all features. It is worth mentioning that the visit-level observations are not independent and identically distributed, since measurements from the same patient are correlated. We therefore fitted a penalized generalized linear model with an independence working covariance for estimation and used the sandwich estimator for variance calculation. We set the effective sample size to the number of patients rather than the number of visits for tuning parameter selection, and let |$K = 100$| for the SOLID estimation. The full sample estimator could only be calculated with a very large memory capacity (120 GB) available.
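A small R sketch of this pre-processing, with illustrative object names nlp_counts and patient_id (not from the paper):

    nlp_features <- log1p(nlp_counts)               # log(1 + count) transform of the retained NLP features
    n_eff <- length(unique(patient_id))             # effective sample size: patients, not visits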

For regression coefficients, we compare |$\widehat{\boldsymbol{\beta}}_{\sf \scriptscriptstyle{SOLID},j}$|⁠, |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle\Omega_{\sf \scriptscriptstyle +}}$|⁠, |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle \sf Chen}$|⁠, |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle \sf Tang}$|⁠, and |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle \sf SMA}$| with respect to computation time, point estimates, and confidence intervals. The proposed SOLID method takes 31 s, whereas all other methods take longer (34 min, 1532 s, 1476 s, and 116 s for |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle\Omega_{\sf \scriptscriptstyle +}}$|⁠, |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle \sf Chen}$|⁠, |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle \sf Tang}$|⁠, and |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle \sf SMA}$|⁠, respectively). The bootstrap was used to calculate the standard errors of |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle\Omega_{\sf \scriptscriptstyle +}}$|⁠. We present the point estimates for all methods and the confidence intervals for |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle\Omega_{\sf \scriptscriptstyle +}}$| and |$\widehat{\boldsymbol{\beta}}_{\sf \scriptscriptstyle{SOLID},j}$| for the 20 predictors with the largest estimated effect sizes in Figure 1. Results for the remaining predictors are summarized in the Supplementary Materials available at Biostatistics online. As expected, for most of the top 20 predictors, the point estimates and confidence intervals of |$\widehat{\boldsymbol{\beta}}_{\sf \scriptscriptstyle{SOLID},j}$| are very close to those of |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle\Omega_{\sf \scriptscriptstyle +}}$|⁠. In contrast, the point estimates of |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle \sf Chen}$|⁠, |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle \sf Tang}$|⁠, and |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle \sf SMA}$| differ from those of |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle\Omega_{\sf \scriptscriptstyle +}}$|⁠.

We conduct model evaluation of the SOLID algorithm using MCV. As a benchmark, we evaluate the full-sample-based adaptive lasso estimator using external validation. We present the receiver operating characteristic curves in Figure 2. The AUCs of the proposed method and the full-sample-based adaptive lasso are 0.908 (95% CI [0.905, 0.911]) and 0.907 (95% CI [0.906, 0.909]), respectively. At FPR = 0.1, the TPRs of the proposed method and the full-sample-based adaptive lasso are 0.799 (95% CI [0.792, 0.807]) and 0.797 (95% CI [0.793, 0.800]), respectively; at FPR = 0.15, the TPRs are 0.864 (95% CI [0.858, 0.870]) and 0.860 (95% CI [0.857, 0.863]), respectively. These results suggest that the accuracy estimates of the SOLID algorithm obtained by the MCV procedure are very close to those of the full-sample-based procedure obtained by external validation.

Fig. 2. Averaged ROC curves of external validation (EXT), apparent validation (APP), and modified cross-validation (MCV) over 1000 simulations.
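As an illustration of the accuracy summaries reported above (and not of the MCV procedure itself), the AUC and the TPR at a fixed FPR can be computed from out-of-sample predicted probabilities in base R. The vectors y_val and p_hat below are hypothetical validation-set labels and predicted probabilities.

    # y_val: hypothetical 0/1 outcomes in the validation set
    # p_hat: hypothetical predicted probabilities from the fitted model

    # AUC via the rank (Mann-Whitney) formula
    auc <- function(y, p) {
      r  <- rank(p)
      n1 <- sum(y == 1)
      n0 <- length(y) - n1
      (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
    }

    # TPR at a given FPR: threshold taken as the (1 - FPR) quantile of scores among controls
    tpr_at_fpr <- function(y, p, fpr) {
      cutoff <- quantile(p[y == 0], probs = 1 - fpr)
      mean(p[y == 1] >= cutoff)
    }

    auc(y_val, p_hat)                      # area under the ROC curve
    tpr_at_fpr(y_val, p_hat, fpr = 0.10)   # TPR at FPR = 0.1
    tpr_at_fpr(y_val, p_hat, fpr = 0.15)   # TPR at FPR = 0.15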

5. Discussion

In this article, we propose a novel SOLID method for sparse risk prediction and also develop an efficient MCV procedure for model evaluation. The proposed SOLID algorithm for fitting the adaptive LASSO reduces the computational cost while maintaining the precision of estimation when |$n_0$| is extraordinarily large and |$p$| is large. The use of a screening step and one-step linearization infused DAC makes it feasible to obtain a SOLID estimator that attains precision equivalent to that of the full sample estimator with a many-fold reduction in computation time. The purpose of the screening steps (1.1)–(1.4) is to choose the active set, while that of the post-screening steps (2.1) and (2.2) is to obtain refined estimates. Specifically, although the active set estimate |$\widehat{{\cal A}}$| from the screening step consistently estimates |$\mathcal{A}$| asymptotically, it may not be sufficiently accurate since it is derived from a “rough” one-step estimator |$\widetilde{\boldsymbol{\beta}}_{\scriptscriptstyle\Omega_{\sf \scriptscriptstyle +}}^{\sf\scriptscriptstyle lin,1}$|⁠, whose convergence rate |$\|\widetilde{\boldsymbol{\beta}}_{\scriptscriptstyle\Omega_{\sf \scriptscriptstyle +}}^{\sf\scriptscriptstyle lin,1}-\boldsymbol{\beta}_0\|_2 = O_p(p/n_{\Omega_1})$| is slower than that of |$\widetilde{\boldsymbol{\beta}}_{\scriptscriptstyle\Omega_{\sf \scriptscriptstyle +}}$| when |$n_{\Omega_1}$| is not very large. However, this step is critical in improving the computational speed of SOLID, since step (2.1) can be performed much faster when constrained to the estimated active set |$\widehat{{\cal A}}$|⁠. The time costs of each step of the SOLID algorithm, compared with a direct DAC without screening, are summarized in the Supplementary Materials available at Biostatistics online. To examine the performance of variable selection in the screening procedure, we summarize the size of the estimated active set, the sensitivity (the proportion of truly active features correctly identified by the algorithm), and the specificity (the proportion of non-active features not selected by the algorithm) in the Supplementary Materials available at Biostatistics online.

The SOLID estimator substantially outperforms existing DAC procedures with respect to both computation time and estimation accuracy. One major difference between the SOLID algorithm and existing DAC algorithms is that we combine data from all subsets to calculate the information matrix |${\widehat A}_{\sf \scriptscriptstyle{DAC}}$|⁠, whereas other methods perform estimation on individual subsets and thus essentially rely on information matrices estimated from single subsets. When |$p$| is large, the approximation error of an information matrix estimated from a single subset is substantially larger than that of the aggregated counterpart. We find that the improved precision of |${\widehat A}_{\sf \scriptscriptstyle{DAC}}$| contributes substantially to the performance of the SOLID estimator, especially in settings with weaker signals. For example, Chen and Xie (2014) essentially use individual subsets to calculate the information matrix and the score and then combine estimates across the |$K$| parts via majority voting. Compared with their estimator, the SOLID estimator has a much smaller MSE when signals are weaker (simulation settings 2 and 3) and |$p$| is large.
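To make the role of the aggregated information matrix concrete, the following sketch shows how subset-level score vectors and logistic information matrices could be pooled and used in a single one-step Newton update of an initial estimator. This is only an illustration of the aggregation idea under hypothetical objects X, y, subset_id, and beta_init; it is not the implementation in the solid package.

    # pooled one-step update for logistic regression across K subsets
    expit <- function(z) 1 / (1 + exp(-z))
    p_dim <- ncol(X)
    U <- matrix(0, p_dim, 1)       # pooled score vector
    A <- matrix(0, p_dim, p_dim)   # pooled information matrix (analogue of A_DAC)
    for (k in unique(subset_id)) {
      Xk <- X[subset_id == k, , drop = FALSE]
      yk <- y[subset_id == k]
      mu <- expit(drop(Xk %*% beta_init))
      U  <- U + crossprod(Xk, yk - mu)               # X_k' (y_k - mu_k)
      A  <- A + crossprod(Xk * (mu * (1 - mu)), Xk)  # X_k' W_k X_k
    }
    beta_onestep <- beta_init + solve(A, U)          # one Newton step using pooled quantities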

For our numerical studies, we required |$K$| to be of order |$o(n_0^{1/2})$| and fixed it at |$100$| when |$p=50$| and |$p=500$|⁠, and at |$50$| when |$p=1000$|⁠. Here, we give a few practical guidelines for the choice of |$K$|⁠. The partition size |$K$| should be chosen so that the subset size |$n$| is reasonably larger than |$p$|⁠, say |$n > 5p$|⁠, while remaining large enough to benefit from distributed execution. In our numerical studies, we did not use parallel computing for calculating the subset-level statistics due to computing resource constraints. However, we highly recommend multicore and multi-node parallel computing to maximize the efficiency of the SOLID algorithm. In step (1.1) of the screening step, we used data in |$\Omega_1$| to obtain an initial estimator. When the event rate is low and |$p$| is large, one may combine several subsets to calculate the initial estimator to improve stability. When |$p$| is very large, the proposed SOLID algorithm may still face computational limitations even with |$n > p$|⁠. For such settings, additional refinements such as those adopted in sure independence screening (Fan and Lv, 2008) may be needed. Future research is required to determine what additional model assumptions could improve the screening step.
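As a rough aid for applying these guidelines, the bound on |$K$| implied by the two rules of thumb (per-subset size above |$5p$| and |$K$| below |$n_0^{1/2}$|) can be computed directly; the helper below is a heuristic sketch, not a function of the solid package.

    # heuristic upper bound on the number of subsets K:
    # keep the per-subset size n0 / K above 5 * p and K below sqrt(n0)
    choose_K_max <- function(n0, p, min_ratio = 5) {
      max(1, floor(min(n0 / (min_ratio * p), sqrt(n0))))
    }
    choose_K_max(n0 = 1e6, p = 142)  # upper bound; K = 100 used in Section 4 satisfies it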

Supplementary material

Supplementary material is available at http://biostatistics.oxfordjournals.org.

References

Caner, M. and Zhang, H. H. (2014). Adaptive elastic net for generalized methods of moments. Journal of Business & Economic Statistics 32, 30–47.

Chen, X. and Xie, M. (2014). A split-and-conquer approach for analysis of extraordinarily large data. Statistica Sinica 24, 1655–1684.

Cui, Y., Chen, X. and Yan, L. (2017). Adaptive lasso for generalized linear models with a diverging number of parameters. Communications in Statistics - Theory and Methods 46, 11826–11842.

Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96, 1348–1360.

Fan, J. and Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 70, 849–911.

He, Q., Zhang, H. H., Avery, C. L. and Lin, D. Y. (2016). Sparse meta-analysis with high-dimensional data. Biostatistics 17, 205–220.

Lee, J. D., Liu, Q., Sun, Y. and Taylor, J. E. (2017). Communication-efficient sparse regression. The Journal of Machine Learning Research 18(1), 115–144.

Pavlou, M., Ambler, G., Seaman, S., De Iorio, M. and Omar, R. Z. (2016). Review and evaluation of penalised regression methods for risk prediction in low-dimensional data with few events. Statistics in Medicine 35, 1159–1177.

Pepe, M. S. (2003). The Statistical Evaluation of Medical Tests for Classification and Prediction. Oxford: Oxford University Press.

Tang, L., Zhou, L. and Song, P. X.-K. (2016). Method of divide-and-combine in regularised generalised linear models for big data. arXiv preprint.

Tian, L., Cai, T., Goetghebeur, E. and Wei, L. J. (2007). Model evaluation based on the sampling distribution of estimated absolute prediction error. Biometrika 94, 297–311.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58, 267–288.

Uno, H., Cai, T., Tian, L. and Wei, L. J. (2007). Evaluating prediction rules for t-year survivors with censored regression models. Journal of the American Statistical Association 102, 527–537.

Van de Geer, S., Bühlmann, P., Ritov, Y., Dezeure, R. and others. (2014). On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics 42, 1166–1202.

Wang, H. and Leng, C. (2007). Unified lasso estimation by least squares approximation. Journal of the American Statistical Association 102, 1039–1048.

Wang, X., Peng, P. and Dunson, D. B. (2014). Median selection subset aggregation for parallel inference. In: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N. D. and Weinberger, K. Q. (editors), Advances in Neural Information Processing Systems. Red Hook, NY: Curran Associates, Inc., pp. 2195–2203.

Wang, Y., Palmer, N., Di, Q., Schwartz, J., Kohane, I. and Cai, T. (2021). A fast divide-and-conquer sparse Cox regression. Biostatistics 22, 381–401.

Xie, M., Singh, K. and Strawderman, W. E. (2011). Confidence distributions and a unifying framework for meta-analysis. Journal of the American Statistical Association 106(493), 320–333.

Yu, S., Chakrabortty, A., Liao, K. P., Cai, T., Ananthakrishnan, A. N., Gainer, V. S., Churchill, S. E., Szolovits, P., Murphy, S. N., Kohane, I. S. and others. (2016). Surrogate-assisted feature extraction for high-throughput phenotyping. Journal of the American Medical Informatics Association 24(e1), e143–e149.

Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association 101, 1418–1429.

Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67, 301–320.

Zou, H. and Zhang, H. H. (2009). On the adaptive elastic-net with a diverging number of parameters. Annals of Statistics 37, 1733.
