Abstract

The semiparametric Cox proportional hazards model, together with the partial likelihood principle, has been widely used to study the effects of potentially time-dependent covariates on a possibly censored event time. We propose a computationally efficient method for fitting the Cox model to big data involving millions of study subjects. Specifically, we perform maximum partial likelihood estimation on a small subset of the whole data and improve the initial estimator by incorporating the remaining data through one-step estimation with estimated efficient score functions. We show that the final estimator has the same asymptotic distribution as the conventional maximum partial likelihood estimator using the whole dataset but requires only a small fraction of computation time. We demonstrate the usefulness of the proposed method through extensive simulation studies and an application to the UK Biobank data.

1 INTRODUCTION

Large biobanks, such as the UK Biobank (Bycroft et al., 2018), provide unprecedented opportunities to explore the genetic basis of the onset and progression of complex human diseases. In particular, time-to-event analyses of the UK Biobank data have identified novel genetic variants associated with hypertension, heart disease, breast cancer, and Alzheimer’s disease (Bi et al., 2020; Dey et al., 2022). The semiparametric Cox (1972) proportional hazards model has been widely used in such time-to-event analyses. The regression parameters are commonly estimated by maximizing the partial likelihood (Cox, 1975). This maximization is time-consuming and often infeasible for big data involving a very large number of subjects, especially in the presence of time-dependent covariates.

One possible solution is to randomly divide the full dataset into several subsets, analyze each subset separately, and then combine the estimates. Two versions of this divide-and-conquer (DAC) approach have been suggested: The linearization-based DAC method calculates an initial maximum partial likelihood estimate for 1 block and iteratively updates the estimate with the score function and information matrix on each individual subset in 2 iterations (Wang et al., 2021); the weighted DAC method calculates the maximum partial likelihood estimate for each block and produces a weighted average of the estimates (Wang et al., 2022). The DAC approach is computationally efficient under the distributed computing architecture but requires separate estimation of the baseline hazard function for each subset of data, which may cause numerical instability and reduce statistical efficiency.

In this paper, we propose a computationally fast method for fitting the Cox regression model to big data without sacrificing statistical efficiency. Specifically, we divide the whole dataset into 2 blocks, with the first block being much smaller than the second block. We perform maximum partial likelihood estimation on the first block of data. We then improve this initial estimator by using one-step estimation with semiparametric efficient scores on the second block of data. Since maximum partial likelihood estimation is performed only on a small subset of the whole data, and the one-step update does not involve any iterations or calculation of the Hessian matrix, the proposed method takes only a small fraction of the computation time that is required to maximize the partial likelihood on the whole dataset. Using the counting-process martingale theory and modern empirical process theory, we show that the proposed estimator achieves the same asymptotic efficiency as the maximum partial likelihood estimator on the whole dataset.

2 METHODS

2.1 Maximum partial likelihood estimation

The Cox (1972) proportional hazards model specifies that the hazard function of an event time T conditional on a p-vector of potentially time-dependent covariates |$\boldsymbol {X}$| takes the form

$$\lambda(t \mid \boldsymbol{X}) = \lambda_0(t)\, e^{\boldsymbol{\beta}^{\mathrm{T}} \boldsymbol{X}(t)},$$
where |$\boldsymbol {\beta }$| is a p-vector of unknown regression parameters, and |$\lambda _{0}(t)$| is an arbitrary baseline hazard function. When T is subject to right censoring by C, we observe |$\boldsymbol {O}= (\widetilde{T}, \Delta , \boldsymbol {X})$|⁠, where |$\widetilde{T} = \min (T, C)$|⁠, |$\Delta = I(T \le C)$|⁠, and |$I(\cdot )$| is the indicator function. It is assumed that C is independent of T conditional on |$\boldsymbol {X}$|⁠.

We consider a random sample of n subjects and use the subscript i to indicate any variable associated with the ith subject. Given the data |$(\widetilde{T}_i, \Delta _i, \boldsymbol {X}_i)\, (i=1,\ldots ,n)$|, the partial likelihood for |$\boldsymbol {\beta }$| takes the form

$$L(\boldsymbol{\beta}) = \prod_{i=1}^{n} \left\{ \frac{e^{\boldsymbol{\beta}^{\mathrm{T}} \boldsymbol{X}_i(\widetilde{T}_i)}}{\sum_{j=1}^{n} I(\widetilde{T}_j \ge \widetilde{T}_i)\, e^{\boldsymbol{\beta}^{\mathrm{T}} \boldsymbol{X}_j(\widetilde{T}_i)}} \right\}^{\Delta_i}.$$
The corresponding score function and information matrix are

$$\boldsymbol{U}(\boldsymbol{\beta}) = \sum_{i=1}^{n} \Delta_i \left\{ \boldsymbol{X}_i(\widetilde{T}_i) - \frac{\sum_{j=1}^{n} I(\widetilde{T}_j \ge \widetilde{T}_i)\, e^{\boldsymbol{\beta}^{\mathrm{T}} \boldsymbol{X}_j(\widetilde{T}_i)}\, \boldsymbol{X}_j(\widetilde{T}_i)}{\sum_{j=1}^{n} I(\widetilde{T}_j \ge \widetilde{T}_i)\, e^{\boldsymbol{\beta}^{\mathrm{T}} \boldsymbol{X}_j(\widetilde{T}_i)}} \right\}$$

and

$$\boldsymbol{I}(\boldsymbol{\beta}) = \sum_{i=1}^{n} \Delta_i \left[ \frac{\sum_{j=1}^{n} I(\widetilde{T}_j \ge \widetilde{T}_i)\, e^{\boldsymbol{\beta}^{\mathrm{T}} \boldsymbol{X}_j(\widetilde{T}_i)}\, \boldsymbol{X}_j(\widetilde{T}_i)^{\otimes 2}}{\sum_{j=1}^{n} I(\widetilde{T}_j \ge \widetilde{T}_i)\, e^{\boldsymbol{\beta}^{\mathrm{T}} \boldsymbol{X}_j(\widetilde{T}_i)}} - \left\{ \frac{\sum_{j=1}^{n} I(\widetilde{T}_j \ge \widetilde{T}_i)\, e^{\boldsymbol{\beta}^{\mathrm{T}} \boldsymbol{X}_j(\widetilde{T}_i)}\, \boldsymbol{X}_j(\widetilde{T}_i)}{\sum_{j=1}^{n} I(\widetilde{T}_j \ge \widetilde{T}_i)\, e^{\boldsymbol{\beta}^{\mathrm{T}} \boldsymbol{X}_j(\widetilde{T}_i)}} \right\}^{\otimes 2} \right],$$
respectively. Here and in the sequel, |$\boldsymbol {a}^{\otimes 0} = 1, \boldsymbol {a}^{\otimes 1} = \boldsymbol {a}$|⁠, and |$\boldsymbol {a}^{\otimes 2} = \boldsymbol {a}\boldsymbol {a}^{{ \mathrm{\scriptscriptstyle T} }}$|⁠. The standard method is to solve the score equation |$\boldsymbol {U}(\boldsymbol {\beta })=0$| via the Newton-Raphson algorithm. The resulting estimator |$\widehat{\boldsymbol {\beta }}$| is consistent and asymptotically normal, with a covariance matrix that can be estimated by |$\boldsymbol {I}^{-1}(\widehat{\boldsymbol {\beta }})$| (Andersen and Gill, 1982).

The time required to obtain |$\widehat{\boldsymbol {\beta }}$| is determined by the evaluation of |$\boldsymbol {U}(\boldsymbol {\beta })$| and |$\boldsymbol {I}(\boldsymbol {\beta })$| during the Newton-Raphson iterations. When |$\boldsymbol {X}$| is time-independent, the summations in the numerators and denominators of |$\boldsymbol {U}(\boldsymbol {\beta })$| and |$\boldsymbol {I}(\boldsymbol {\beta })$| take the form |$\sum _{j=1}^{n}I(\widetilde{T}_{j} \ge \widetilde{T}_{i}) \boldsymbol {g}(\boldsymbol {X}_{j})$| for some function |$\boldsymbol {g}$|⁠. Once the observed event times are sorted in descending order, each summation can be calculated as a cumulative sum of |$\boldsymbol {g}(\boldsymbol {X}_{j})\, (j=1,\ldots ,n)$| in reverse order, and the summations can be performed for all |$\widetilde{T}_i$| in 1 loop. Thus, the computation time for evaluating |$\boldsymbol {U}(\boldsymbol {\beta })$| and |$\boldsymbol {I}(\boldsymbol {\beta })$| is linear in |$n p^2$|⁠, since the dimension of |$\boldsymbol {X}$| is p. Therefore, the total time complexity for the standard method is |$O(n\log n + np^2 )$|⁠, provided that the sorting algorithm has a time complexity |$O(n\log n)$|⁠. When |$\boldsymbol {X}$| is a continuous function of t, however, the summation |$\sum _{j=1}^{n}I(\widetilde{T}_{j} \ge \widetilde{T}_{i})\boldsymbol {g}\lbrace \boldsymbol {X}_{j}(\widetilde{T}_{i})\rbrace$| must be calculated over all the subjects in the risk set |$\lbrace j: \widetilde{T}_j \ge \widetilde{T}_{i}\rbrace$| at each observed event time |$\widetilde{T}_{i}$|⁠. Because each calculation is linear in n, the total time complexity for evaluating |$\boldsymbol {U}(\boldsymbol {\beta })$| and |$\boldsymbol {I}(\boldsymbol {\beta })$| becomes |$O(mnp^2)$|⁠, where m is the total number of distinct observed event times. Consequently, the computation can be formidable when n and m are very large.
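
For time-independent covariates, the cumulative-sum evaluation described above can be sketched in a few lines of R. The function below is only an illustration of the mechanism (ties are ignored, and the argument names are placeholders), not the implementation used in the paper.

# Partial-likelihood score for time-independent covariates, computed with reverse
# cumulative sums over the risk sets (ties ignored for simplicity).
cox_score <- function(beta, time, status, X) {
  ord    <- order(time, decreasing = TRUE)   # largest observed time first
  time   <- time[ord]; status <- status[ord]
  X      <- X[ord, , drop = FALSE]
  w      <- exp(drop(X %*% beta))            # exp(beta' X_j)
  S0     <- cumsum(w)                        # sum of exp(beta' X_j) over {j: T_j >= T_i}
  S1     <- apply(X * w, 2, cumsum)          # same sum with X_j exp(beta' X_j)
  colSums(status * (X - S1 / S0))            # sum over events of X_i - S1/S0
}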

2.2 Improving computational efficiency

We propose a method to reduce computation time while maintaining statistical efficiency. To this end, we randomly divide the entire dataset into 2 blocks indexed by |${\cal J}_{1}$| and |${\cal J}_{2}$|⁠, with |$n_1 (\lt n)$| observations in |${\cal J}_{1}$|⁠. We maximize the partial likelihood with the data from the first block to obtain an initial estimator of |$\boldsymbol {\beta }$|⁠, denoted by |$\widehat{\boldsymbol {\beta }}_{(1)}$|⁠. We then incorporate the data from the second block by using the one-step estimator from semiparametric efficiency theory (Bickel et al., 1993).

The efficient score function for |$\boldsymbol {\beta }$| based on a single observation |$\boldsymbol {O}=(\widetilde{T}, \Delta , \boldsymbol {X})$| is

$$\boldsymbol{S}(\boldsymbol{O}; \boldsymbol{\beta}, \Lambda_0) = \int_0^{\infty} \left[ \boldsymbol{X}(t) - E\{\boldsymbol{X}(t) \mid \widetilde{T} \ge t\} \right] dM(t),$$
where |$M(t) = N(t) - \int _{0}^t I(\widetilde{T} \ge s)e^{\boldsymbol {\beta }^{{ \mathrm{\scriptscriptstyle T} }}\boldsymbol {X}(s)}d\Lambda _{0}(s)$|⁠, |$N(t) = \Delta I(\widetilde{T}\le t)$|⁠, and |$\Lambda _{0}(t) = \int _0^t \lambda _0(s)ds$|⁠. We obtain an empirical counterpart |$\widehat{\boldsymbol {S}}(\boldsymbol {O}; \widehat{\boldsymbol {\beta }}_{(1)}, \widehat{\Lambda }_{(1)})$| by replacing |$\boldsymbol {\beta }$|⁠, |$\Lambda _0(t)$|⁠, and |$E\lbrace \boldsymbol {X}(t) \mid \widetilde{T} \ge t\rbrace$| in |$\boldsymbol {S}(\boldsymbol {O}; \boldsymbol {\beta }, \Lambda _0)$| with |$\widehat{\boldsymbol {\beta }}_{(1)}$|⁠,

$$\widehat{\Lambda}_{(1)}(t) = \sum_{i \in {\cal J}_1} \frac{\Delta_i\, I(\widetilde{T}_i \le t)}{\sum_{j \in {\cal J}_1} I(\widetilde{T}_j \ge \widetilde{T}_i)\, e^{\widehat{\boldsymbol{\beta}}_{(1)}^{\mathrm{T}} \boldsymbol{X}_j(\widetilde{T}_i)}}$$

(Breslow, 1972), and

$$\widehat{\boldsymbol{E}}(t) = \frac{\sum_{j \in {\cal J}_1} I(\widetilde{T}_j \ge t)\, e^{\widehat{\boldsymbol{\beta}}_{(1)}^{\mathrm{T}} \boldsymbol{X}_j(t)}\, \boldsymbol{X}_j(t)}{\sum_{j \in {\cal J}_1} I(\widetilde{T}_j \ge t)\, e^{\widehat{\boldsymbol{\beta}}_{(1)}^{\mathrm{T}} \boldsymbol{X}_j(t)}},$$

respectively. Our final estimator for |$\boldsymbol {\beta }$| is given by

$$\widehat{\boldsymbol{\beta}}_f = \widehat{\boldsymbol{\beta}}_{(1)} + \frac{n_1}{n}\, \widehat{\boldsymbol{I}}_{(1)}^{-1} \sum_{i \in {\cal J}_2} \widehat{\boldsymbol{S}}(\boldsymbol{O}_i; \widehat{\boldsymbol{\beta}}_{(1)}, \widehat{\Lambda}_{(1)}),$$

where |$\widehat{\boldsymbol {I}}_{(1)}$| is the information matrix for the first block. We refer to |$\widehat{\boldsymbol {\beta }}_f$| as the Cox one-step efficient score (Cox-OSES) estimator, reflecting the fact that one-step estimation with the efficient scores on the second block is used to boost the efficiency of the initial estimator |$\widehat{\boldsymbol {\beta }}_{(1)}$|. We show in the next section that the Cox-OSES estimator |$\widehat{\boldsymbol {\beta }}_{f}$| has the same asymptotic distribution as the standard estimator |$\widehat{\boldsymbol {\beta }}$|. Thus, we estimate the covariance matrix of |$\widehat{\boldsymbol {\beta }}_{f}$| by |$n_{1} \widehat{\boldsymbol {I}}_{(1)}^{-1} / n$|.
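
To make the procedure concrete, the following R sketch carries out the two-block estimation for time-independent covariates, using coxph from the survival package with the Breslow tie convention for the first block and the one-step update displayed above for the second block. The data frames block1 and block2, with columns time, status, and numeric covariate columns, are assumed inputs; tie handling and time-dependent covariates are omitted, so this is an illustrative sketch rather than the authors' implementation.

library(survival)

n1 <- nrow(block1); n2 <- nrow(block2); n <- n1 + n2

# Block 1: maximum partial likelihood estimation (time-independent covariates).
fit1  <- coxph(Surv(time, status) ~ ., data = block1, ties = "breslow")
beta1 <- coef(fit1)
p     <- length(beta1)
Iinv1 <- vcov(fit1)                        # inverse of the block-1 information matrix
X1    <- as.matrix(block1[, names(beta1)])
r1    <- exp(drop(X1 %*% beta1))

# Breslow increments and risk-weighted covariate means at block-1 event times.
tev  <- sort(unique(block1$time[block1$status == 1]))
dLam <- numeric(length(tev))
Ebar <- matrix(0, length(tev), p)
for (k in seq_along(tev)) {
  rs        <- block1$time >= tev[k]       # block-1 risk set at tev[k]
  dLam[k]   <- sum(block1$time == tev[k] & block1$status == 1) / sum(r1[rs])
  Ebar[k, ] <- colSums(X1[rs, , drop = FALSE] * r1[rs]) / sum(r1[rs])
}
cumL  <- cumsum(dLam)                      # estimated cumulative baseline hazard
cumEL <- apply(Ebar * dLam, 2, cumsum)     # cumulative sum of Ebar(t) dLambda(t)

# Block 2: estimated efficient scores, then the one-step update.
X2  <- as.matrix(block2[, names(beta1)])
r2  <- exp(drop(X2 %*% beta1))
pos <- findInterval(block2$time, tev)      # last block-1 event time <= follow-up time
S   <- matrix(0, n2, p)
d   <- block2$status == 1 & pos > 0        # observed events (counting-process part)
S[d, ] <- X2[d, , drop = FALSE] - Ebar[pos[d], , drop = FALSE]
a   <- pos > 0                             # subjects with a nonzero compensator part
S[a, ] <- S[a, ] - r2[a] * (X2[a, , drop = FALSE] * cumL[pos[a]] -
                            cumEL[pos[a], , drop = FALSE])
beta_f <- beta1 + (n1 / n) * drop(Iinv1 %*% colSums(S))
cov_f  <- (n1 / n) * Iinv1                 # covariance estimate for beta_f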

 
Remark 1

A statistically more efficient covariance matrix estimator is given by |$\boldsymbol {I}(\widehat{\boldsymbol {\beta }}_{f})^{-1}$|⁠, which makes use of the entire data. However, its computation is more demanding than scaling |$\widehat{\boldsymbol {I}}_{(1)}$|⁠. For statistical inference, only a consistent estimator of the covariance matrix is required, and estimating the covariance matrix of |$\widehat{\boldsymbol {\beta }}_{f}$| by |$n_{1} \widehat{\boldsymbol {I}}_{(1)}^{-1} / n$| is both sufficient and practical.

According to the discussion in Section 2.1, the time complexity for calculating |$(\widehat{\boldsymbol {\beta }}_{(1)}, \widehat{\Lambda }_{(1)}, \widehat{\boldsymbol {I}}_{(1)})$| is |$O(n_{1}\log n_{1} + n_{1}p^2)$| when |$\boldsymbol {X}$| is time-independent and |$O(m_{1}n_{1}p^2)$| when |$\boldsymbol {X}$| contains continuously time-dependent covariates, where |$m_{1}$| is the total number of distinct observed event times in the first block. For the second block, we need to evaluate the term |$\sum _{i\in {\cal J}_2}\widehat{\boldsymbol {S}}(\boldsymbol {O}_{i}; \widehat{\boldsymbol {\beta }}_{(1)}, \widehat{\Lambda }_{(1)})$|⁠, which is equal to

$$\sum_{i \in {\cal J}_2} \Delta_i \{\boldsymbol{X}_i(\widetilde{T}_i) - \widehat{\boldsymbol{E}}(\widetilde{T}_i)\} - \sum_{i \in {\cal J}_2} \sum_{j \in {\cal J}_1} I(\widetilde{T}_i \ge \widetilde{T}_j)\, e^{\widehat{\boldsymbol{\beta}}_{(1)}^{\mathrm{T}} \boldsymbol{X}_i(\widetilde{T}_j)} \{\boldsymbol{X}_i(\widetilde{T}_j) - \widehat{\boldsymbol{E}}(\widetilde{T}_j)\}\, d\widehat{\Lambda}_{(1)}(\widetilde{T}_j), \tag{1}$$

where |$d\widehat{\Lambda }_{(1)}(\widetilde{T}_{j}) = \Delta _j/\sum _{l \in {\cal J}_1}I(\widetilde{T}_{l} \ge \widetilde{T}_{j})e^{\widehat{\boldsymbol {\beta }}_{(1)}^{{ \mathrm{\scriptscriptstyle T} }}\boldsymbol {X}_{l}(\widetilde{T}_{j})}$|. Note that |$\widehat{\boldsymbol {E}}(t)$| is a step function with |$m_{1}$| jumps, which is obtained from the first block analysis. Thus, the time complexity for the first term in Equation 1 is |$O\lbrace \max (n_{1}, n-n_{1})p\rbrace =O\lbrace (n-n_1)p\rbrace$| after the event times in the second block are sorted. The time complexity for calculating the second term in Equation 1 depends on the type of covariates. If |$\boldsymbol {X}$| is time-independent, this term can be written as

$$\sum_{j \in {\cal J}_1} d\widehat{\Lambda}_{(1)}(\widetilde{T}_j) \sum_{i \in {\cal J}_2} I(\widetilde{T}_i \ge \widetilde{T}_j)\, e^{\widehat{\boldsymbol{\beta}}_{(1)}^{\mathrm{T}} \boldsymbol{X}_i} \{\boldsymbol{X}_i - \widehat{\boldsymbol{E}}(\widetilde{T}_j)\}.$$
Given the sorted event times in the second block, the time complexity of computing this term is |$O(np)$|⁠. When |$\boldsymbol {X}$| is continuously time-dependent, however, we need to evaluate the summation for each of the |$m_{1}$| distinct event times in the first block, with a time complexity of |$O\lbrace m_{1}(n-n_{1})p\rbrace$|⁠. Combining the calculations in the first and second blocks, we conclude that the total time complexity of the proposed method is |$O\lbrace n_{1}\log n_{1} + n_{1}p^2 + (n-n_{1})\log (n-n_{1}) + np\rbrace$| when covariates are all time-independent and |$O\lbrace m_{1}n_{1}p^2 + m_{1}(n-n_{1})p\rbrace$| when covariates are continuously time-dependent.

According to the discussion in Section 2.1, the total time complexity of the DAC methods with M blocks is |$O\lbrace n \log (n / M) + np^2\rbrace$| when covariates are all time-independent and |$O(m_{1}np^2)$| when covariates are continuously time-dependent. Here, we assume the DAC methods evenly divide the full dataset into M subsets and the block size is of the same magnitude as |$n_1$|⁠. Then the time complexity of the proposed method is of the same order as the standard method and the DAC methods when covariates are all time-independent and is much lower than both the standard method and the DAC methods when covariates are continuously time-dependent and p is large. In terms of actual computation, the proposed method is always less demanding than both the standard method and the DAC methods since the one-step estimation in the second block does not involve any iterations or calculation of the Hessian matrix.

The Cox-OSES estimator can be used for variable selection. Motivated by Wang and Leng (2007) and Zou (2006), we minimize the objective function:

where |$\lambda \ge 0$| is a tuning parameter controlling the level of penalization, and |$\gamma \gt 0$| is a pre-specified positive number. We set |$\gamma = 1$| in our simulation studies and real-data example. This optimization can be solved by using the R package glmnet (Friedman et al., 2010). Let |$\widetilde{\boldsymbol {\beta }}_{f, \lambda }$| denote the minimizer of |$Q(\boldsymbol {\beta })$|⁠. As |$n^{1 / 2}\lambda \rightarrow 0$| and |$n^{(1 + \gamma ) / 2} \lambda \rightarrow \infty$|⁠, |$\widetilde{\boldsymbol {\beta }}_{f, \lambda }$| satisfies both selection consistency and the oracle property (Wang and Leng, 2007).
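
As an illustration of how this penalized criterion can be handed to glmnet through the least-squares approximation of Wang and Leng (2007), the sketch below rewrites the quadratic form around |$\widehat{\boldsymbol {\beta }}_f$| as a squared-error loss and supplies adaptive-lasso weights through penalty.factor. The objects I1, beta_f, n, n1, and gamma are assumed to be available from the preceding steps, and the exact normalization of the objective is not reproduced here; this is a schematic illustration rather than the exact procedure used in the paper.

library(glmnet)

# Least-squares approximation: the quadratic form (beta - beta_f)' Omega (beta - beta_f),
# with Omega the inverse of the estimated covariance of beta_f, is rewritten as a
# squared-error loss so that glmnet can apply an adaptive lasso penalty.
Omega <- (n / n1) * I1                    # inverse of n1 * I1^{-1} / n
R     <- chol(Omega)                      # Omega = t(R) %*% R
ystar <- drop(R %*% beta_f)
w     <- 1 / abs(beta_f)^gamma            # adaptive weights; gamma = 1 in the paper
fit   <- glmnet(R, ystar, family = "gaussian", intercept = FALSE,
                standardize = FALSE, penalty.factor = w)
beta_path <- as.matrix(coef(fit))[-1, ]   # coefficient paths; lambda chosen separately (eg, by BIC)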

2.3 Asymptotic properties of the Cox-OSES estimator

We impose the following regularity conditions:

 
Condition 1

The vector of covariates |$\boldsymbol {X}(t)$| has finite total variation over the time interval |$[0, \tau ]$|⁠, where |$\tau \lt \infty$| denotes the end of the study.

 
Condition 2

The true baseline hazard function satisfies that |$\int _{0}^{\tau }\lambda _{0}(t)dt \lt \infty$|⁠.

 
Condition 3

The probability |$P(\widetilde{T} \ge \tau ) \gt 0$|⁠.

 
Condition 4

The matrix |$\boldsymbol {\Sigma }= \int _{0}^{\tau }\boldsymbol {v}(\boldsymbol {\beta }_{0}, t)s_{0}(\boldsymbol {\beta }_{0}, t)\lambda _{0}(t)dt$| is positive definite, where |$\boldsymbol {v}(\boldsymbol {\beta }, t) = \boldsymbol {s}_{2}(\boldsymbol {\beta }, t) / s_{0}(\boldsymbol {\beta }, t) - \lbrace \boldsymbol {s}_{1}(\boldsymbol {\beta }, t) / s_{0}(\boldsymbol {\beta }, t)\rbrace ^{\otimes 2}$|⁠, |$\boldsymbol {s}_{k}(\boldsymbol {\beta }, t) = E\lbrace I(\widetilde{T} \ge t)e^{\boldsymbol {\beta }^{{ \mathrm{\scriptscriptstyle T} }}\boldsymbol {X}(t)} \boldsymbol {X}(t)^{\otimes k}\rbrace \, (k=0,1,2)$|⁠, and |$\boldsymbol {\beta }_{0}$| is the true value of |$\boldsymbol {\beta }$|⁠.

 
Condition 5

The size of the first block |$n_1$| satisfies that |$n/n_1^2\rightarrow 0$| as |$n\rightarrow \infty$|⁠.

The first 4 conditions ensure that the maximum partial likelihood estimator is consistent, asymptotically normal, and asymptotically efficient (Andersen and Gill, 1982). The last condition allows |$n_1$| to be much smaller than n but requires it to be of larger order than |$n^{1/2}$|. We state below the asymptotic properties of the Cox-OSES estimator.

 
Theorem 1

Under Conditions 1-5, |$n^{1/2}(\widehat{\boldsymbol {\beta }}_{f} - \boldsymbol {\beta }_0)$| converges in distribution to a zero-mean normal random vector with covariance matrix |$\boldsymbol {\Sigma }^{-1}$|⁠.

The theorem is proved in Web Appendix A.

3 SIMULATION STUDIES

We conducted extensive simulation studies to compare the proposed method with the standard method and the DAC methods. We refer to the linearization-based DAC method as DAC-Linearization, and to the weighted DAC methods that use the block size or the information matrix as the weight as DAC-Size and DAC-Information, respectively. We considered one setting involving only time-independent covariates and one setting involving time-dependent covariates. In the first setting, we generated 100 covariates from the multivariate normal distribution with mean 0 and covariance matrix |$\lbrace I(i = j) + 0.2I(i \ne j)\rbrace _{i,j=1,\dots ,100}$|. We set |$\lambda _{0}(t) = t$| and |$\boldsymbol {\beta }= (-\boldsymbol {0.5}_{25}^{{ \mathrm{\scriptscriptstyle T} }}, -\boldsymbol {0.2}_{25}^{{ \mathrm{\scriptscriptstyle T} }}, \boldsymbol {0.2}_{25}^{{ \mathrm{\scriptscriptstyle T} }}, \boldsymbol {0.5}_{25}^{{ \mathrm{\scriptscriptstyle T} }})^{{ \mathrm{\scriptscriptstyle T} }}$|, and we generated C from |$\text{Exp}(1.8)$|, where |$\boldsymbol {a}_{q}$| is a |$q \times 1$| vector with all elements being a, and |$\text{Exp}(1.8)$| denotes the exponential distribution with mean |$1/1.8$|. In the second setting, we included 50 time-independent covariates |$\boldsymbol {X}_{\text{ind}}$| and 50 time-dependent covariates |$\boldsymbol {X}_{\text{dep}}$|. We considered 5 time intervals |$R_{1} = [0, 0.05)$|, |$R_{2} = [0.05, 0.1)$|, |$R_{3} = [0.1, 0.15)$|, |$R_{4} = [0.15, 0.2)$|, and |$R_{5} = [0.2, \infty )$|, where the time-dependent covariates |$\boldsymbol {X}_{\text{dep}}$| are constant within each interval but may vary between intervals. We generated |$\lbrace \boldsymbol {X}_{\text{ind}}^{{ \mathrm{\scriptscriptstyle T} }}, \boldsymbol {X}_{\text{dep}, R_{1}}^{{ \mathrm{\scriptscriptstyle T} }}, \boldsymbol {X}_{\text{dep}, R_{2}}^{{ \mathrm{\scriptscriptstyle T} }}, \boldsymbol {X}_{\text{dep}, R_{3}}^{{ \mathrm{\scriptscriptstyle T} }}, \boldsymbol {X}_{\text{dep}, R_{4}}^{{ \mathrm{\scriptscriptstyle T} }}, \boldsymbol {X}_{\text{dep}, R_{5}}^{{ \mathrm{\scriptscriptstyle T} }}\rbrace ^{{ \mathrm{\scriptscriptstyle T} }}$| from the multivariate normal distribution with mean |$\boldsymbol {0}$| and covariance matrix |$\lbrace I(i = j) + 0.2I(i \ne j)\rbrace _{i,j=1,\dots ,300}$|. The final covariates are |$(\boldsymbol {X}_{\text{ind}}^{{ \mathrm{\scriptscriptstyle T} }}, \boldsymbol {X}_{\text{dep}}^{{ \mathrm{\scriptscriptstyle T} }})$|, where |$\boldsymbol {X}_{\text{dep}} = \sum _{k=1}^{5}\boldsymbol {X}_{\text{dep}, R_{k}}I(t \in R_{k}).$| We set |$\lambda _{0}(t) = t$|, |$\boldsymbol {\beta }= (-\boldsymbol {0.5}_{25}^{{ \mathrm{\scriptscriptstyle T} }}, -\boldsymbol {0.2}_{25}^{{ \mathrm{\scriptscriptstyle T} }}, \boldsymbol {0.2}_{25}^{{ \mathrm{\scriptscriptstyle T} }}, \boldsymbol {0.5}_{25}^{{ \mathrm{\scriptscriptstyle T} }})^{{ \mathrm{\scriptscriptstyle T} }}$|, and |$C = 0.25$|. In both settings, approximately 70% of the event times were censored. A total sample size of |$10^{6}$| or |$10^{7}$| was considered for both settings.
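
As an illustration of the first setting, the R sketch below generates equicorrelated covariates and event times from the model with |$\lambda _{0}(t) = t$| by inverting the conditional survival function; the function name and the smaller sample size are for illustration only.

# Data generation for the first setting: 100 equicorrelated normal covariates
# (pairwise correlation 0.2), baseline hazard lambda_0(t) = t, and censoring times
# from the exponential distribution with mean 1/1.8.
simulate_setting1 <- function(n, p = 100, rho = 0.2) {
  beta <- c(rep(-0.5, 25), rep(-0.2, 25), rep(0.2, 25), rep(0.5, 25))
  Z0   <- rnorm(n)                                  # shared factor giving equicorrelation
  X    <- sqrt(1 - rho) * matrix(rnorm(n * p), n, p) + sqrt(rho) * Z0
  eta  <- drop(X %*% beta)
  # lambda_0(t) = t gives Lambda_0(t) = t^2 / 2, so inverting
  # S(t | x) = exp{-Lambda_0(t) exp(eta)} yields:
  T_ev  <- sqrt(2 * rexp(n) * exp(-eta))
  C_cen <- rexp(n, rate = 1.8)
  data.frame(time = pmin(T_ev, C_cen), status = as.integer(T_ev <= C_cen), X = I(X))
}
dat <- simulate_setting1(1e5)   # smaller n than in the paper, for illustration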

For the proposed method, we set |$n_{1} = kn$|⁠, where |$k=0.1$|⁠, 0.2, 0.3, or 0.4. For the DAC methods, we set |$M = 5$|⁠, 10, 20, 50, 100, 200, 500, or 1000. The convergence criterion for the Newton-Raphson algorithm was that the absolute relative change in the log-partial likelihood between 2 successive iterations is less than |$10^{-6}$|⁠. For each method, we recorded the total CPU time and calculated the squared error |$(\widehat{\boldsymbol {\beta }} - \boldsymbol {\beta })^{{ \mathrm{\scriptscriptstyle T} }}(\widehat{\boldsymbol {\beta }} - \boldsymbol {\beta })$|⁠, where |$\widehat{\boldsymbol {\beta }}$| denotes the estimate of |$\boldsymbol {\beta }$|⁠.

Figure 1 displays the simulation results based on 1000 replicates. In both settings, the Cox-OSES estimator has almost the same statistical efficiency as the standard estimator but takes much less time to compute. In the second setting with |$n=10^{7}$|⁠, the standard method took approximately 2.04 hours whereas the Cox-OSES method with |$k=0.2$| took only 26.91 minutes, with only 0.07% efficiency loss. Compared with the DAC-Size estimator and DAC-Information estimator, the Cox-OSES estimator always takes less time to compute. When the Cox-OSES estimator with some k and the DAC-Linearization estimator with some M have the same squared error, the Cox-OSES estimator takes less time to compute. When the computation time is the same between the 2 methods, the Cox-OSES estimator has a smaller squared error.

FIGURE 1

Average ratios of squared error and total CPU time for the Cox-OSES and DAC methods to the standard method. The solid curve pertains to Cox-OSES, and the numbers along it indicate the proportion of the first block to the entire dataset. The dashed curve pertains to DAC-Linearization, and the numbers along it indicate the number of blocks. The dotted-dashed and dotted curves pertain to DAC-Size and DAC-Information, respectively. The numbers along the curves indicate the corresponding numbers of blocks.

Figures S1 and S2 in Web Appendix B display the bias and standard error of the proposed estimator in the second setting with |$n = 10^{7}$|⁠. Neither the bias nor the standard error is influenced by the type of covariate, the true coefficient value, or the choice of |$n_{1}$|⁠.

We further compared the proposed method with the standard method and DAC methods in the context of rare events. We adopted the first simulation setting, but with C generated from |$\text{Un}(0, 0.045)$|⁠. Approximately 98% of the event times were censored. We considered |$n = 10^{7}$|⁠. Tables S1 and S2 in Web Appendix B present the simulation results based on 1000 replicates. The basic conclusions are the same as those of the first 2 settings.

The simulation results suggest that with |$n_{1} = 0.2 \times n$|⁠, the proposed estimator achieves the same statistical efficiency as the standard estimator while reducing the computation time by 80%. Nevertheless, the choice of |$n_{1}$| should depend on both n and the event rate. A smaller n or a lower event rate requires a larger |$n_{1}$|⁠.

We also conducted simulation studies to evaluate the variable selection performance of the regularized estimation approach with the Cox-OSES estimator. We adopted the first 2 simulation settings to generate |$\boldsymbol {X}$| and used the same |$\lambda _{0}(t)$|⁠. For each setting, we considered 2 choices of |$\boldsymbol {\beta }$| to reflect the effect sizes. When only time-independent covariates are involved, we set |$\boldsymbol {\beta }= (\boldsymbol {0.1}_{2}^{{ \mathrm{\scriptscriptstyle T} }}, \boldsymbol {0.2}_{2}^{{ \mathrm{\scriptscriptstyle T} }}, \boldsymbol {0.5}_{2}^{{ \mathrm{\scriptscriptstyle T} }}, \boldsymbol {0}_{94}^{{ \mathrm{\scriptscriptstyle T} }})^{{ \mathrm{\scriptscriptstyle T} }}$| and generated C from |$\text{Un}(0, 1.5)$| or set |$\boldsymbol {\beta }= (\boldsymbol {0.02}_{2}^{{ \mathrm{\scriptscriptstyle T} }}, \boldsymbol {0.04}_{2}^{{ \mathrm{\scriptscriptstyle T} }}, \boldsymbol {0.06}_{2}^{{ \mathrm{\scriptscriptstyle T} }}, \boldsymbol {0}_{94}^{{ \mathrm{\scriptscriptstyle T} }})^{{ \mathrm{\scriptscriptstyle T} }}$| and generated C from |$\text{Un}(0, 1.6)$|⁠. When time-dependent covariates are involved, we set |$\boldsymbol {\beta }= (0.1, 0.2, 0.5, \boldsymbol {0}_{47}^{{ \mathrm{\scriptscriptstyle T} }}, 0.1, 0.2, 0.5, \boldsymbol {0}_{47}^{{ \mathrm{\scriptscriptstyle T} }})^{{ \mathrm{\scriptscriptstyle T} }}$| and |$C = 0.75$| or set |$\boldsymbol {\beta }= (0.02, 0.04, 0.06, \boldsymbol {0}_{47}^{{ \mathrm{\scriptscriptstyle T} }}, 0.02, 0.04, 0.06, \boldsymbol {0}_{47}^{{ \mathrm{\scriptscriptstyle T} }})^{{ \mathrm{\scriptscriptstyle T} }}$| and |$C = 0.85$|⁠. In all scenarios, approximately 70% of the event times were censored. We set |$n = 10^{6}$|⁠.

For the proposed method and the DAC methods, we used the same block sizes as in the previous simulation studies. For the DAC methods, we adopted the regularized estimation approach of Wang et al. (2021, 2022). For the standard method, we employed the unified lasso estimation of Wang and Leng (2007). The tuning parameter |$\lambda$| was chosen by minimizing the Bayesian information criterion. For each method, we reported the average true discoveries and false discoveries based on 1000 replicates. As shown in Table S3 in Web Appendix B, the regularized estimation approach with the Cox-OSES estimator performs similarly to the standard method.

4 AN EXAMPLE

We considered the UK Biobank, a cohort study of approximately half a million subjects (Bycroft et al., 2018). We related the age-at-onset of circulatory-system diseases to 1100 genetic variants, of which 50 were selected from each chromosome by univariate screening, with adjustment for sex and the top 10 genetic principal components representing population substructures. To construct the time-to-event outcome, we converted the International Classification of Diseases (version 10) codes to PheWAS codes (Denny et al., 2013) and then calculated the difference between the date of diagnosis and the date of birth (Dey et al., 2022). The dataset contains a total of 487 252 subjects, with an event rate of 39%. We performed variable selection on the 1100 genetic variants using the regularized estimation approach with the Cox-OSES estimator. We randomly chose 20% of the whole dataset as the first block, and we calculated the Cox-OSES estimates and covariance matrix estimates. We repeated this process 10 times and averaged the 10 sets of estimates to adjust for the randomness introduced by data splitting. We then used the averaged estimates for variable selection among the 1100 genetic variants. The final analysis dataset contains 140 genetic variants.

We then randomly chose 10%, 20%, 30%, or 40% of the whole dataset as the first block for the proposed method and varied M from 5 to 1000 for the DAC methods. For each method, we reported the total CPU time and calculated the squared error |$(\widehat{\boldsymbol {\beta }} - \widehat{\boldsymbol {\beta }}_{\text{std}})^{{ \mathrm{\scriptscriptstyle T} }}(\widehat{\boldsymbol {\beta }} - \widehat{\boldsymbol {\beta }}_{\text{std}})$|⁠, where |$\widehat{\boldsymbol {\beta }}$| denotes a DAC estimate or the Cox-OSES estimate and |$\widehat{\boldsymbol {\beta }}_{\text{std}}$| is the standard estimate. We repeated the process one hundred times. All calculations were done on the same computer.

The results are presented in Figure 2. Compared to the standard method, the average squared errors for the Cox-OSES estimates are |$3.61 \times 10^{-5}$|, |$1.21 \times 10^{-5}$|, and |$4.93 \times 10^{-6}$| under |$n_{1} = 0.2 \times n$|, |$0.3 \times n$|, and |$0.4 \times n$|, respectively. Under |$M = 10$| and 50, the average squared errors are |$7.76 \times 10^{-7}$| and |$5.53 \times 10^{-6}$| for the DAC-Linearization estimates, |$1.99 \times 10^{-5}$| and |$1.29 \times 10^{-4}$| for the DAC-Size estimates, and |$1.22 \times 10^{-6}$| and |$1.54 \times 10^{-5}$| for the DAC-Information estimates, respectively. The standard method took about 245.45 seconds to converge. The DAC-Size and DAC-Information methods took more time to compute than the standard method for all choices of M because the Newton-Raphson algorithm sometimes required more iterations to converge on small blocks than on the entire UK Biobank dataset. The DAC-Linearization method took 150 and 128 seconds under |$M = 10$| and 50, respectively. By contrast, the Cox-OSES method took only 30, 57, and 84 seconds under |$n_{1}=0.2 \times n$|, |$0.3 \times n$|, and |$0.4 \times n$|, respectively.

FIGURE 2

Average squared errors and ratios of total CPU time for the Cox-OSES and DAC methods to the standard method. The solid curve pertains to Cox-OSES, and the numbers along it indicate the proportion of the first block to the entire dataset. The dashed curve pertains to the DAC-Linearization, and the numbers along it indicate the number of blocks. The dotted-dashed and dotted curves pertain to DAC-Size and DAC-Information, respectively. The numbers along the curves indicate the corresponding numbers of blocks.

Figure S3 displays the estimated hazard ratios and the corresponding 95% confidence intervals for the selected 140 genetic variants. The most significant genetic variant is 20:8603950:G:C with a P-value of |$1.5 \times 10^{-9}$|⁠. This variant was identified by Vuckovic et al. (2020) to be associated with white blood cell count, plateletcrit, and platelet count. A second variant, 19:11526765:G:T, has a P-value of |$1.0 \times 10^{-8}$|⁠, and it was previously linked to hypertension by Liu et al. (2016). In addition, 3:124337402:G:A, with a P-value of |$2.4 \times 10^{-5}$|⁠, was associated with mean platelet volume and platelet count by Vuckovic et al. (2020). Finally, 4:144122227:A:G has a P-value of |$1.2 \times 10^{-4}$| and was identified by Stanzick et al. (2021) as being associated with blood urea nitrogen levels and glomerular filtration rate.

We divided the study participants into 3 groups based on their polygenic risk scores (The International Schizophrenia Consortium, 2009): Group 1 consists of the lowest 1/3, Group 2 the middle 1/3, and Group 3 the highest 1/3. The polygenic risk scores were constructed as |$\boldsymbol {x}^{{ \mathrm{\scriptscriptstyle T} }} \widehat{\boldsymbol {\beta }}$|, where |$\boldsymbol {x}$| denotes the genetic variants and |$\widehat{\boldsymbol {\beta }}$| is the estimate of the regression parameters. Figure S4 presents the Kaplan-Meier estimates of the disease-free probabilities over time for the 3 groups. The 3 curves are clearly separated, indicating the predictive value of the fitted model.
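
A minimal R sketch of this grouping and of the Kaplan-Meier comparison is given below; the data frame dat, its column names, and beta_hat are placeholders for the fitted objects described above.

library(survival)

# Polygenic risk scores and Kaplan-Meier curves by risk-score tertile.
prs       <- drop(as.matrix(dat[, names(beta_hat)]) %*% beta_hat)
dat$group <- cut(prs, quantile(prs, c(0, 1/3, 2/3, 1)),
                 labels = c("Group 1", "Group 2", "Group 3"), include.lowest = TRUE)
km <- survfit(Surv(time, status) ~ group, data = dat)
plot(km, col = 1:3, xlab = "Age (years)", ylab = "Disease-free probability")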

Following Cox (1972) and Kalbfleisch and Prentice (2002), as well as the coxph function in R and phreg in SAS, we checked the proportional hazards assumption by including an interaction between variant 20:8603950:G:C and |$\log t$|. In this case, expanding the data into the counting-process format is infeasible because of the machine's memory, and the analysis would take more than 1 month for the standard method to complete. We randomly chose 10% and 20% of the whole dataset as the first block for the proposed method and set |$M = 5$| and 10 for the DAC methods to ensure that the size of the first block in the proposed method is the same as the block size in the DAC methods. The DAC-Size and DAC-Information methods took 229 and 127 hours under |$M=5$| and 10, respectively, and the DAC-Linearization method took 148 and 67 hours, respectively. By contrast, the proposed method with |$n_{1}=0.1 \times n$| and |$0.2 \times n$| took only 13 and 48 hours, respectively.
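
For reference, the interaction with |$\log t$| can be specified through the tt argument of coxph in the survival package, as sketched below on a single block of data; snp and the other names are placeholders, and the call is illustrative rather than the exact analysis performed here.

library(survival)

# Proportional hazards check for one variant via an interaction with log(t).
fit_tt <- coxph(Surv(time, status) ~ snp + tt(snp),
                data = block1,
                tt = function(x, t, ...) x * log(t))
summary(fit_tt)   # a significant tt(snp) term indicates non-proportional hazards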

We repeated the process 10 times and averaged the 10 sets of estimates to adjust for the randomness introduced by data splitting. Figure S5 in Web Appendix C displays the 10 sets of estimates for the time-dependent covariate. Figure 3 shows that Cox-OSES and DAC-Linearization yielded similar estimates and similar standard errors. Table 1 summarizes the results for the coefficient of the time-dependent covariate. The proposed method and DAC methods agreed on rejecting the proportional hazards assumption.

FIGURE 3

Parameter estimates and standard error estimates for the proposed method and DAC-Linearization methods in the analysis of the UK Biobank data with a time-dependent covariate.

TABLE 1

Estimation results for the regression parameter of the time-dependent covariate by the proposed method and DAC methods.

Method                   Estimate    Std error    P-value
Cox-OSES
  n1 = 10% × n           -0.0719     0.02880      0.013
  n1 = 20% × n           -0.0710     0.02879      0.014
DAC-Linearization
  M = 5                  -0.0690     0.02867      0.016
  M = 10                 -0.0689     0.02867      0.016
DAC-Size
  M = 5                  -0.0688     0.02870      0.017
  M = 10                 -0.0685     0.02874      0.017
DAC-Information
  M = 5                  -0.0691     0.02867      0.016
  M = 10                 -0.0691     0.02867      0.016

Abbreviation: DAC, divide-and-conquer.

We performed only 1 analysis. The gains in computational efficiency offered by the proposed method will be more important in genome-wide association studies requiring a large number of analyses.

5 REMARKS

The proposed method can incorporate more than 2 blocks of data. Suppose that there are K blocks of data indexed by |${\cal J}_{1}, \dots , {\cal J}_{K}$|, with sizes |$n_{1},\ldots , n_{K}$|, respectively. We obtain |$(\widehat{\boldsymbol {\beta }}_{(1)}, \widehat{\Lambda }_{(1)}, \widehat{\boldsymbol {I}}_{(1)})$| using the first block of data. For the remaining |$(K-1)$| blocks, we calculate

$$\widehat{\boldsymbol{\beta}}_{(k)} = \widehat{\boldsymbol{\beta}}_{(1)} + \frac{n_1}{n_k}\, \widehat{\boldsymbol{I}}_{(1)}^{-1} \sum_{i \in {\cal J}_k} \widehat{\boldsymbol{S}}(\boldsymbol{O}_i; \widehat{\boldsymbol{\beta}}_{(1)}, \widehat{\Lambda}_{(1)}), \qquad k = 2, \ldots, K.$$
We then define the Cox-OSES estimator as |$\widehat{\boldsymbol {\beta }}_{f} = \sum _{k=1}^{K} n_{k} \widehat{\boldsymbol {\beta }}_{(k)} / n$|⁠. Mathematically, |$\widehat{\boldsymbol {\beta }}_{f}$| is independent of the choice of K, provided that |$n_1$| is the same, and thus Theorem 1 holds for the Cox-OSES estimator with multiple blocks. However, there are practical advantages to using more than 2 blocks. First, when the dataset is so big that the data have to be divided and stored in multiple machines, the boosting procedure can be carried out simultaneously among the |$(K-1)$| blocks, and only the estimates |$(\widehat{\boldsymbol {\beta }}_{(1)}, \widehat{\Lambda }_{(1)}, \widehat{\boldsymbol {I}}_{(1)})$| from the first block analysis are communicated to the remaining blocks. Thus, there is great efficiency in both computation and communication. Second, because only estimates are shared among the K blocks, privacy can be preserved when combining data from different sources. Third, the proposed procedure can be used to incorporate new data in real time without revisiting previous data. By treating the new batch of data as the |$(K+1)$|th block, we can obtain |$\widehat{\boldsymbol {\beta }}_{K+1}$| and include it in the calculation of |$\widehat{\boldsymbol {\beta }}_f$|⁠.
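
Under the block-specific formulas above, the combination step reduces to a single update of the first-block estimate, as the following R sketch indicates; score_sums is assumed to be a list holding, for each remaining block, the sum of its estimated efficient scores, and block_sizes the corresponding block sizes.

# Combining K blocks: block 1 supplies (beta1, Iinv1, n1); each remaining block
# returns only the sum of its estimated efficient scores and its size.
combine_blocks <- function(beta1, Iinv1, n1, score_sums, block_sizes) {
  n <- n1 + sum(block_sizes)
  beta1 + (n1 / n) * drop(Iinv1 %*% Reduce(`+`, score_sums))
}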

When applying the proposed method to multiple data blocks, heterogeneity across blocks may emerge. We may account for heterogeneity by adopting the stratified Cox proportional hazards model. Specifically, we fit the stratified Cox model to the first data block, estimate the efficient score functions for each stratum using the remaining data, and then update the initial coefficient estimates accordingly.

There may exist outlying observations in the dataset. The proposed method is not sensitive to outlying observations due to the rank-based nature of the partial likelihood inference.

We can extend the proposed method to marginal and random-effects models for correlated event times. Suppose that there are n independent clusters, with |$m_{i}$| subjects in the ith cluster. To calculate the Cox-OSES estimator, we randomly select |$n_{1}$| clusters into |${\cal J}_{1}$|⁠. For the marginal Cox model, we replace |$\widehat{\boldsymbol {I}}_{(1)}^{-1}$| with the sandwich variance estimator |$\widehat{\boldsymbol {\Sigma }} = \widehat{\boldsymbol {I}}_{(1)}^{-1}\left[\sum _{i \in {\cal J}_{1}}\left\lbrace \sum _{j=1}^{m_{i}} \widehat{\boldsymbol {S}}(\boldsymbol {O}_{ij}; \widehat{\boldsymbol {\beta }}_{(1)}, \widehat{\Lambda }_{(1)})\right\rbrace ^{\otimes 2} \right] \widehat{\boldsymbol {I}}_{(1)}^{-1}$| in the one-step estimation, where |$\boldsymbol {O}_{ij}$| denotes |$\boldsymbol {O}$| for the jth subject of the ith cluster. The covariance matrix of the resulting estimator is estimated by |$n_{1} \widehat{\boldsymbol {\Sigma }} / n$|⁠.
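
A sketch of this sandwich computation in R is given below, assuming S1 holds the estimated efficient scores of the block-1 subjects (computed in the same way as for the second block in the earlier sketch), cluster holds the block-1 cluster identifiers, and Iinv1, n1, and n are available from that sketch; all names are placeholders.

# Sandwich variance for the marginal Cox model with clustered event times.
B         <- crossprod(rowsum(S1, cluster))   # sum over clusters of squared within-cluster score sums
Sigma_hat <- Iinv1 %*% B %*% Iinv1            # replaces Iinv1 in the one-step update
cov_f     <- (n1 / n) * Sigma_hat             # covariance estimate for the resulting estimator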

Fitting random-effects models is more challenging. Direct maximization of the nonparametric likelihood is unstable, and EM algorithms (Zeng and Lin, 2007) provide a stable solution but converge slowly. In such situations, the proposed method is even more beneficial than in the current setting because the computational burden is higher. In general, efficient score functions do not have explicit expressions but can be obtained as the numerical solution to a Fredholm-type equation (Zeng and Lin, 2007).

We can also extend the proposed approach to the situation of high-dimensional covariates. Specifically, we perform variable selection using the first block of data. After debiasing the initial estimates, we perform a one-step update on the coefficients of the selected variables using the remaining data. The main computation lies in the analysis of the first block of data, which is always less demanding than direct analysis of the whole dataset. We anticipate that uncertainty due to variable selection will not affect the asymptotic efficiency of the Cox-OSES estimator.

Kawaguchi et al. (2021) proposed a forward-backward scan algorithm, which reduces the time complexity from quadratic to linear for fitting the proportional subdistribution hazards model to competing risks data with time-independent covariates. However, their algorithm cannot be used to further improve computational efficiency in our setting because the reduction in time complexity by that algorithm results from the same mechanism of cumulative sum as the standard method for fitting the Cox proportional hazards model with time-independent covariates, which has only linear time complexity, as detailed in Section 2.1.

Acknowledgement

The authors thank the editor, associate editor, and 2 referees for their helpful comments.

FUNDING

This research was supported by the National Institutes of Health grants R01 HL149683 and GM124104.

CONFLICT OF INTEREST

None declared.

DATA AVAILABILITY

UK Biobank data are available from http://www.ukbiobank.ac.uk. A formal application to the UK Biobank is required to use the data.

References

Andersen, P. K. and Gill, R. D. (1982). Cox's regression model for counting processes: a large sample study. The Annals of Statistics, 10, 1100–1120.

Bi, W., Fritsche, L. G., Mukherjee, B., Kim, S., and Lee, S. (2020). A fast and accurate method for genome-wide time-to-event data analysis and its application to UK Biobank. The American Journal of Human Genetics, 107, 222–233.

Bickel, P. J., Klaassen, C. A. J., Ritov, Y., and Wellner, J. A. (1993). Efficient and Adaptive Estimation for Semiparametric Models. Baltimore: Johns Hopkins University Press.

Breslow, N. (1972). Discussion of the paper by D. R. Cox. Journal of the Royal Statistical Society, Series B, 34, 216–217.

Bycroft, C., Freeman, C., Petkova, D., Band, G., Elliott, L. T., Sharp, K. et al. (2018). The UK Biobank resource with deep phenotyping and genomic data. Nature, 562, 203–209.

Cox, D. R. (1972). Regression models and life-tables. Journal of the Royal Statistical Society, Series B, 34, 187–202.

Cox, D. R. (1975). Partial likelihood. Biometrika, 62, 269–276.

Denny, J. C., Bastarache, L., Ritchie, M. D., Carroll, R. J., Zink, R., Mosley, J. D. et al. (2013). Systematic comparison of phenome-wide association study of electronic medical record data and genome-wide association study data. Nature Biotechnology, 31, 1102–1111.

Dey, R., Zhou, W., Kiiskinen, T., Havulinna, A., Elliott, A., Karjalainen, J. et al. (2022). Efficient and accurate frailty model approach for genome-wide survival association analysis in large-scale biobanks. Nature Communications, 13, 5437.

Friedman, J. H., Hastie, T., and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33, 1–22.

Kalbfleisch, J. D. and Prentice, R. L. (2002). The Statistical Analysis of Failure Time Data. New York: John Wiley & Sons.

Kawaguchi, E. S., Shen, J. I., Suchard, M. A., and Li, G. (2021). Scalable algorithms for large competing risks data. Journal of Computational and Graphical Statistics, 30, 685–693.

Liu, C., Kraja, A. T., Smith, J. A., Brody, J. A., Franceschini, N., Bis, J. C. et al. (2016). Meta-analysis identifies common and rare variants influencing blood pressure and overlapping with metabolic trait loci. Nature Genetics, 48, 1162–1170.

Stanzick, K. J., Li, Y., Schlosser, P., Gorski, M., Wuttke, M., Thomas, L. F. et al. (2021). Discovery and prioritization of variants and genes for kidney function in ≳1.2 million individuals. Nature Communications, 12, 4350.

The International Schizophrenia Consortium (2009). Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature, 460, 748–752.

Vuckovic, D., Bao, E. L., Akbari, P., Lareau, C. A., Mousas, A., Jiang, T. et al. (2020). The polygenic and monogenic basis of blood traits and diseases. Cell, 182, 1214–1231.

Wang, H. and Leng, C. (2007). Unified lasso estimation by least squares approximation. Journal of the American Statistical Association, 102, 1039–1048.

Wang, W., Lu, S.-E., Cheng, J. Q., Xie, M., and Kostis, J. B. (2022). Multivariate survival analysis in big data: a divide-and-combine approach. Biometrics, 78, 852–866.

Wang, Y., Hong, C., Palmer, N., Di, Q., Schwartz, J., Kohane, I. et al. (2021). A fast divide-and-conquer sparse Cox regression. Biostatistics, 22, 381–401.

Zeng, D. and Lin, D. Y. (2007). Maximum likelihood estimation in semiparametric regression models with censored data. Journal of the Royal Statistical Society, Series B, 69, 507–564.

Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101, 1418–1429.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://dbpia.nl.go.kr/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
