Abstract

Customer churn is one of the most important concerns for large companies. Currently, massive data are often encountered in customer churn analysis, which bring new challenges for model computation. To cope with these concerns, sub-sampling methods are often used to accomplish large-scale data analysis tasks. To cover more informative samples in one sampling round, classic sub-sampling methods need to compute sampling probabilities for all data points. However, this creates a huge computational burden for large-scale data sets and is therefore not applicable in practice. In this study, we propose a sequential one-step (SOS) estimation method based on repeated sub-sampling data sets. In the SOS method, data points need to be sampled only with uniform probabilities, and the sampling step is conducted repeatedly. In each sampling step, a new estimate is computed via one-step updating based on the newly sampled data points. This leads to a sequence of estimates, of which the final SOS estimate is the average. We theoretically show that both the bias and the standard error of the SOS estimator can decrease with increasing sub-sampling sizes or sub-sampling times. The finite sample performance of SOS is assessed through simulations. Finally, we apply the SOS method to analyse a real large-scale customer churn data set from a securities company. The results show that the SOS method has good interpretability and prediction power in this real application.

1 INTRODUCTION

Customer churn is one of the most important concerns for large companies (Ascarza et al., 2018; Kayaalp, 2017). With increasingly fierce competition, it is important for companies to retain existing customers and prevent potential customer churn. Customer churn often refers to a situation in which customers no longer buy or use a company's products or services. For each customer, to churn or not to churn is a typical binary indicator. Therefore, customer churn prediction is often considered as a classification task, and several classifier models have been applied to address this issue (Ahmad et al., 2019). Among the previous models, logistic regression is widely used owing to its good interpretability (Ahn et al., 2019; Maldonado et al., 2021). Therefore, in this work, we employ logistic regression models to handle the task of customer churn analysis.

With the arrival of the ‘Big Data’ era, massive data are often encountered in customer churn analysis. For example, the real data analysed in this work comprise 12 million transaction records from 230,000 customers, which take up 300 GB in total. Such large amounts of data present great challenges for customer churn analysis. The first concern is memory constraint. The data can be too large to be stored in a computer's memory, and hence, have to remain stored on a hard drive. The second challenge is computational cost. For massive data, even a simple analysis (e.g., mean calculation) can take a long time. These challenges create large barriers for customer churn analysis with massive data.

To cope with these challenges, modern statistical analysis for massive data is developing fast. Basically, there are two streams of approaches for massive data. The first stream considers storing data in a distributed system and then applying the ‘divide-and-conquer’ strategy to accomplish data analysis tasks of huge scales. See McDonald et al. (2009), Lee et al. (2017), Battey et al. (2018), Jordan et al. (2019), Zhu et al. (2021), Wang et al. (2021), and the references therein. Another stream uses sub-sampling techniques that conduct calculations on sub-samples drawn from the whole data set to reduce both memory cost and computational cost. Important works include, but are not limited to, Dhillon et al. (2013), Ma et al. (2015), Wang et al. (2018), Quiroz et al. (2019), Ma et al. (2020), and Yu et al. (2020).

It is remarkable that there are big differences between the two streams of approaches. The distributed methods usually require dedicated equipment support. For example, a distributed system is often required, which consists of one central computer serving as the Master and all other computers serving as Workers. The goal of distributed methods is to obtain an estimator based on the whole sample. By contrast, the sub-sampling methods can be conducted on a single computer. They focus on approximating the whole sample estimator based on sub-samples with limited resources. In this customer churn application, we do not have equipment support for distributed analysis. Therefore, in this work, we focus on sub-sampling methods to handle customer churn analysis with massive data.

Classic sub-sampling methods require only one round of data sampling. To cover more informative samples in the single sampling round, non-uniform sampling probabilities are often specified for each data point, so that more informative data points can be sampled with higher probabilities. Typical approaches include leverage score-based sub-sampling (Drineas et al., 2011; Ma et al., 2015; Ma & Sun, 2015), information-based optimal sub-data selection (Wang et al., 2019), the optimal Poisson sampling method, and its distributed implementation (Wang et al., 2021), among others. These sub-sampling estimators have been proved to be consistent and asymptotically normal under appropriate regularity conditions; see Ma et al. (2020) for an important example. However, for these non-uniform sub-sampling methods, the step of evaluating non-uniform sampling probabilities for the whole data set would create a huge computational burden. For example, Wang et al. (2018) proposed optimal sub-sampling methods motivated by the A-optimality criterion (OSMAC) for large sample logistic regression. To find the optimal sub-sampling probabilities in the OSMAC method, the computational complexity is O(Nd) for a data set with N observations and d-dimensional covariates. Consequently, this optimal sub-sampling algorithm could be computationally very expensive when N is very large. In the customer churn application, the whole data size is about 12 million. It would not be computationally feasible for the previous sub-sampling methods to handle this task.

An obvious way to address this problem is to sample sub-data with uniform probabilities while performing the sub-sampling step repeatedly. By sampling with uniform probabilities, we are freed from computing probabilities for the whole data set in advance, which largely alleviates the computational burden. By sampling repeatedly, the union of the sampled data comes close to the whole data set. In an ideal situation, if we were to conduct sub-sampling without replacement, then sampling N/n times would cover the whole data set, where N and n are the sizes of the whole data and the sub-data, respectively. In this sense, repeated sub-sampling plays a role similar to storing the whole data in a distributed system.

The sub-sampling cost itself is not negligible, especially for repeated sub-sampling methods. Note that the time needed to sample one data point from the hard drive consists mainly of two parts. The first is the addressing cost, which is the time taken to locate the target data point on the hard drive. The second is the I/O cost, which is the time needed to read the target data point into the computer memory. Both are hard drive sampling costs that cannot be ignored when sub-sampling methods are applied to massive data sets. Therefore, inspired by Pan et al. (2020), we adopt the sequential addressing sampling method to reduce the hard drive sampling cost. When the data are randomly distributed on a hard drive, only one starting point needs to be selected for each sub-sample, and the remaining data points can be read sequentially from that starting point. In this way, the addressing cost of sub-sampling can be reduced substantially.

Based on the repeatedly sub-sampled data sets from the whole customer churn data, it is worthwhile to consider how to obtain a customer churn estimator that is efficient in terms of both statistical properties and computational cost. To this end, we propose a sequential one-step (SOS) method, whose estimation bias and variance can both be reduced by increasing the total number of sub-sampling times K. Specifically, in the first sub-sampling step, we obtain an estimate $\hat\beta_1$ based on the first sub-data using, for example, the traditional Newton–Raphson algorithm. Then, in the second sub-sampling step, we regard $\hat\beta_1$ as the initial value and conduct only one-step updating based on the second sub-data. This leads to $\hat\beta_2$. In the next step, the average of the first two estimates, that is, $\bar\beta_2=(\hat\beta_1+\hat\beta_2)/2$, is regarded as the initial value, and one-step updating based on the third sub-data is conducted to obtain $\hat\beta_3$. In general, in the $(k+1)$th step, the average of all previous estimates (i.e., $\bar\beta_k=k^{-1}\sum_{l=1}^{k}\hat\beta_l$) serves as the initial value, and one-step updating based on the newly sampled sub-data is conducted to obtain the estimate $\hat\beta_{k+1}$. The final estimate after K sub-sampling steps is the average of all estimates, that is, $\hat\beta_{SOS}=K^{-1}\sum_{k=1}^{K}\hat\beta_k$.

It is noteworthy that, except for the first sub-sampling step, SOS requires only one round of updating in the subsequent sub-sampling steps. Therefore, it is computationally efficient. We also establish theoretical properties of the SOS estimator. We prove that both the bias and variance of the SOS estimator can be reduced as the sampling times K increase. We conduct extensive numerical studies based on simulated data sets to verify our theoretical findings. Finally, the SOS method is applied to the customer churn data set in a securities company to demonstrate its application.

The rest of this article is organised as follows. Section 2 introduces the SOS method. Section 3 presents simulation analysis to demonstrate the finite sample performance of the SOS estimators. Section 4 presents a real application for customer churn analysis using the SOS method. Section 5 concludes with a brief discussion.

2 SOS ESTIMATOR BY SUB-SAMPLING

2.1 Basic notations

Assume the whole sample is indexed by $\mathcal{S}=\{1,2,\ldots,N\}$, where N is the whole sample size. Define $(Y_i,X_i)$ to be the observation collected from the ith ($1\le i\le N$) subject, where $Y_i\in\mathbb{R}^1$ is the response and $X_i\in\mathbb{R}^p$ is the associated predictor. Conditional on $X_i$, assume that the $Y_i$s are independently distributed with density function $f(Z_i;\beta)$, where $Z_i=(Y_i,X_i^\top)^\top$, $\beta\in\Theta$ is the unknown parameter, and $\Theta$ is an open subset of $\mathbb{R}^d$. We assume $p=d$ for convenience. To estimate the unknown parameter $\beta$, the log-likelihood function can be spelled out as

$\mathcal{L}_{\mathcal{S}}(\beta)=\sum_{i\in\mathcal{S}}\ell(Z_i;\beta),$

where $\ell(\cdot;\beta)=\log f(\cdot;\beta)$ is the log-likelihood contribution of a single observation. For convenience, we use $\ell(\beta)$ to denote $\ell(Z_i;\beta)$ hereafter.

Note that when N is quite large, the whole data set is often kept on a hard drive and cannot be read into the memory as a whole. It would then be time consuming to randomly select a sample from the hard drive into the computer memory. To save time, we apply the sequential addressing sub-sampling (SAS) method (Pan et al., 2020) to conduct the sub-sampling directly on the hard drive rather than in memory. To apply the SAS method, the whole data should be randomly distributed on the hard drive. Otherwise, a shuffle operation is needed to make the data randomly distributed. The detailed implementation of the shuffle operation can be found in Appendix A in Data S1. The sub-sampling steps can then be performed iteratively on the randomly distributed data to obtain sub-samples. Next, we describe the SAS method in detail.

First, one data point is randomly selected on the hard drive as a starting point, indexed by $m$ ($1\le m\le N-n+1$), where n is the desired sub-data size. This incurs only marginal addressing cost, as only one data point needs to be located. Second, with the starting point fixed, the sub-sample of size n is selected sequentially with indexes $\{m,m+1,\ldots,m+n-1\}\subset\mathcal{S}$. The selected sub-sample is collected as $\mathcal{T}_m=\{(X_m,Y_m),\ldots,(X_{m+n-1},Y_{m+n-1})\}$. Except for the starting point indexed by m, the remaining data points are read sequentially. Therefore, no additional addressing cost is required, and the total sampling cost can be reduced substantially.
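To make the SAS step concrete, the following is a minimal sketch in Python, assuming the (already shuffled) data are stored on disk as a fixed-width array that can be memory mapped; the file name, dimensions, and helper names are illustrative rather than the authors' implementation.

```python
# A minimal sketch of sequential addressing sub-sampling (SAS).
import numpy as np

N, p = 150_000, 6                      # rows and columns (response + covariates); toy sizes
rng = np.random.default_rng(0)

# Create a toy on-disk data set for illustration.
data = np.lib.format.open_memmap("toy_churn.npy", mode="w+",
                                 dtype=np.float64, shape=(N, p))
data[:] = rng.standard_normal((N, p))
data.flush()

def sas_subsample(path, N, n, rng):
    """Draw one SAS sub-sample: pick a single random starting index m and
    read the n consecutive rows {m, ..., m+n-1} from the memory-mapped file."""
    mm = np.load(path, mmap_mode="r")          # data stay on disk
    m = rng.integers(0, N - n + 1)             # one random addressing operation
    return np.asarray(mm[m:m + n])             # sequential read of n rows

sub = sas_subsample("toy_churn.npy", N, n=400, rng=rng)
print(sub.shape)                               # (400, p)
```

Only the starting index requires random access; the remaining n − 1 rows are read as one contiguous block, which is what keeps the addressing cost marginal.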

It is notable that, although the SAS method serves as a preparatory step for the SOS method, there are fundamental differences between SAS and SOS. The SAS method is a sub-sampling technique, which samples data directly from the hard drive and saves much of the addressing cost. Based on SAS sub-samples, Pan et al. (2020) also study the theoretical properties of some basic statistics (e.g., the sample mean). In SOS, by contrast, we focus on the estimation problem of logistic regression and discuss an updating strategy that exploits the sub-samples obtained by SAS. It is remarkable that the SAS method only serves as a tool for fast sampling, and the theoretical properties of the SOS estimator can still be guaranteed without this sampling step.

2.2 SOS estimator

Assume the whole data set is randomly distributed on the hard drive. Then, the SAS method can be applied for fast sub-sampling. Recall that the sub-sample size is n. By the SAS method, a total of $M=N-n+1$ different sequential sub-samples can be generated. Assume that the sub-sampling is repeated K times. In the kth ($1\le k\le K$) sub-sampling, the sub-sample $\mathcal{T}^{(k)}\in\{\mathcal{T}_1,\ldots,\mathcal{T}_M\}$ is selected with replacement. We further denote by $\mathcal{S}_k$ the indexes of the data points in $\mathcal{T}^{(k)}$. Based on $\mathcal{S}_k$ with $1\le k\le K$, we propose the SOS method for efficient estimation of $\beta$.

To obtain the SOS estimator, an initial estimator $\hat\beta_1$ is first calculated based on one of the SAS sub-samples, whose index set is denoted by $\mathcal{S}_1$. For example, it may be the maximum likelihood estimator (MLE), and we set $\bar\beta_1=\hat\beta_1$. Then, in the $(k+1)$th sub-sampling step with $1\le k\le K-1$, a new SAS sub-sample $\mathcal{T}^{(k+1)}$ is obtained. Based on the $(k+1)$th sub-sample, we conduct the following two steps iteratively to obtain the SOS estimator. The details of the SOS method are shown in Algorithm 1. Recall that we have assumed the whole data are already shuffled or randomly distributed before the SOS procedure begins.

Step 1: One-step update. Assume that the initial estimator in this step is $\bar\beta_k$. Then, we conduct one-step updating based on $\bar\beta_k$ to obtain the one-step updated estimator $\hat\beta_{k+1}$, that is,

$\hat\beta_{k+1}=\bar\beta_k-\{\ddot{\mathcal{L}}_{\mathcal{S}_{k+1}}(\bar\beta_k)\}^{-1}\dot{\mathcal{L}}_{\mathcal{S}_{k+1}}(\bar\beta_k), \qquad (1)$

where $\dot{\mathcal{L}}_{\mathcal{S}_{k+1}}(\bar\beta_k)$ and $\ddot{\mathcal{L}}_{\mathcal{S}_{k+1}}(\bar\beta_k)$ are the first- and second-order derivatives of the log-likelihood function based on the $(k+1)$th sub-sample, respectively.

Step 2: Average. The SOS sub-sampling estimator for the $(k+1)$th step is calculated as the average of all sub-sample estimates obtained so far, that is, $\bar\beta_{k+1}=(k+1)^{-1}\sum_{l=1}^{k+1}\hat\beta_l=\{k\bar\beta_k+\hat\beta_{k+1}\}/(k+1)$.

Conduct Steps 1 and 2 iteratively. The estimator obtained in the Kth step is the final SOS estimator, that is, $\hat\beta_{SOS}=\bar\beta_K$. See Figure 1 for an illustration of calculating the SOS estimator based on the SAS sub-sampling method.
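For illustration, the following is a minimal Python sketch of the SOS procedure for logistic regression, assuming a draw_subsample() helper that returns one SAS sub-sample (X, y); the function names are hypothetical and the sketch is not the authors' implementation.

```python
# A minimal sketch of the SOS estimator for logistic regression.
import numpy as np

def logistic_score_hessian(beta, X, y):
    """Gradient and Hessian of the logistic log-likelihood on one sub-sample."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    score = X.T @ (y - p)                        # first-order derivative
    hess = -(X * (p * (1 - p))[:, None]).T @ X   # second-order derivative
    return score, hess

def newton_mle(X, y, tol=1e-8, max_iter=100):
    """Full Newton-Raphson MLE, used only for the first sub-sample."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        score, hess = logistic_score_hessian(beta, X, y)
        step = np.linalg.solve(hess, score)
        beta = beta - step
        if np.max(np.abs(step)) < tol:
            break
    return beta

def sos_estimate(draw_subsample, K):
    """draw_subsample() returns (X, y) for one SAS sub-sample."""
    X1, y1 = draw_subsample()
    beta_hat = [newton_mle(X1, y1)]              # hat{beta}_1
    beta_bar = beta_hat[0]                       # bar{beta}_1
    for k in range(1, K):
        Xk, yk = draw_subsample()
        score, hess = logistic_score_hessian(beta_bar, Xk, yk)
        beta_new = beta_bar - np.linalg.solve(hess, score)   # one-step update (1)
        beta_hat.append(beta_new)
        beta_bar = np.mean(beta_hat, axis=0)     # running average bar{beta}_{k+1}
    return beta_bar                              # final SOS estimator
```

Only the first sub-sample is fitted to full convergence; every later sub-sample contributes a single Newton–Raphson step started from the current running average, which is what keeps the per-step cost low.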

Algorithm 1. SOS estimation algorithm

FIGURE 1 Illustration of the sequential one-step estimator based on random addressing sub-sampling

Remark: We offer two remarks about the SOS estimator. First, for each sub-sample, only the one-step update in (1) is conducted. This yields only marginal computational cost because (i) the sub-sample size n is relatively small; and (ii) no Newton–Raphson-type iterations are involved. It is notable that there is no need to run the Newton–Raphson algorithm to full convergence on each sub-sample. Once it converges, the Newton–Raphson algorithm is not affected by the initial value. Therefore, if the sub-sample estimator $\hat\beta_{k+1}$ were obtained by running the Newton–Raphson algorithm to full convergence rather than by one-step updating, the initial value $\bar\beta_k$ would not affect the resulting estimator in the $(k+1)$th sub-sample. Consequently, the SOS estimator would degenerate to the one-shot (OS) estimator, which we compare theoretically in the next sub-section. Second, each $\bar\beta_k$ can be viewed as the average of the one-step estimators of the form (1) over the first k steps. This leads to some nice properties: (i) a total of K sub-samples are used in the estimation; and (ii) the standard error of the final estimator is reduced by averaging. More weighting schemes could be considered in the SOS updating strategy; see the weighting scheme in the aggregated estimating equation (AEE) method (Lin & Xi, 2011) for an example.

It is also notable that our proposed SOS method extends the classical one-step estimator (Shao, 2003; Zou & Li, 2008) to the sub-sampling setting. For example, Zhu et al. (2021) developed a distributed least squares approximation (DLSA) method to solve regression problems on distributed systems. In the DLSA method, once the weighted least squares estimator (WLSE) is obtained by the Master, it is broadcast to all Workers. Each Worker then performs a one-step iteration using the WLSE as the initial value to obtain a new estimator. Another related work is Wang et al. (2021), who propose a one-step upgraded pilot method for non-uniformly and non-randomly distributed data. However, both methods are divide-and-conquer (DC) type methods, which are quite different from sub-sampling methods. One typical difference is that, in DC methods, the estimators from different Workers can be regarded as independent. In our SOS method, however, the estimators $\hat\beta_1,\ldots,\hat\beta_K$ are obtained sequentially, which makes them dependent on each other. This creates challenges in investigating the asymptotic theory of the SOS estimator. More differences between divide-and-conquer type methods and sub-sampling methods can be found in Appendix B in Data S1.

2.3 Theoretical properties of the SOS estimator

We further investigate the properties of the SOS estimator in this subsection. To establish the theoretical properties of the SOS estimator, assume that Θ is an open subset in the Euclidean space, and we have the following conditions.

  • (C1)

    The sub-sample size n, the whole sample size N, and the number of sub-sampling steps K satisfy that, as $n\to\infty$, $n/N\to 0$, $K\to\infty$, and $\log K=o(n)$.

  • (C2)

    Assume that the first- and second-order derivatives of the log-likelihood $\ell(\beta)$ satisfy $E\{\partial\ell(\beta)/\partial\beta_j\}=0$ and $E\{-\partial^2\ell(\beta)/\partial\beta_{j_1}\partial\beta_{j_2}\}=E[\{\partial\ell(\beta)/\partial\beta_{j_1}\}\{\partial\ell(\beta)/\partial\beta_{j_2}\}]$, for $1\le j,j_1,j_2\le p$.

  • (C3)

    Assume that $\Sigma=E[\{\partial\ell_i(\beta)/\partial\beta\}\{\partial\ell_i(\beta)/\partial\beta\}^\top]$ is finite and positive definite at $\beta=\beta_0$, where $\beta_0$ is the true parameter.

  • (C4)

    There is an open subset $\omega$ of $\Theta$ containing the true parameter $\beta_0$, such that for all $Z_i$s, the third-order derivative $\partial^3 f(Z_i;\beta)/\partial\beta_{j_1}\partial\beta_{j_2}\partial\beta_{j_3}$ exists for all $\beta\in\omega$ and $1\le j_1,j_2,j_3\le p$. Moreover, assume a function $M(\cdot)$ exists such that $E\{M(Y_i)\}<C$, where C is a constant. For $\beta\in\omega$ and $1\le j_1,j_2,j_3\le p$, we have $|\partial^3\ell(Y_i;\beta)/\partial\beta_{j_1}\partial\beta_{j_2}\partial\beta_{j_3}|\le M(Y_i)$.

  • (C5)

    Assume the covariates $X_{ij}$s independently follow Gaussian distributions.

Condition (C1) restricts the relationships between (n, K) and (n, N). Under this condition, the number of sub-sampling times K should not grow too fast, in the sense that $\log K=o(n)$, and the sub-sampling size n should not increase too fast, in the sense that $n/N\to 0$. Conditions (C2)–(C4) are standard regularity conditions. They are commonly adopted to guarantee asymptotic normality of ordinary maximum likelihood estimates; see, for example, Lehmann and Casella (1998) and Fan and Li (2001). Condition (C5) is a classical assumption on the covariates (Wang, 2009).

With these conditions satisfied, we can establish the properties of $\hat\beta_{SOS}$, which equals $\bar\beta_K$. Define $U^{(k)}=\{n^{-1}\ddot{\mathcal{L}}_{\mathcal{S}_k}(\beta_0)\}^{-1}\{n^{-1}\dot{\mathcal{L}}_{\mathcal{S}_k}(\beta_0)\}$ and $\bar U_K=K^{-1}\sum_{k=1}^{K}U^{(k)}$. Then, the following theorem holds.

Theorem 1

Assume conditions (C1)–(C5) hold. Then, we have (1) $\bar\beta_K-\beta_0=\bar U_K+\Delta$, with $E(\bar U_K)=0$, $\mathrm{var}(\bar U_K)=\{1/(nK)+1/N\}\Sigma^{-1}\{1+o(1)\}$, and $\Delta=O_p\big([(\log K/n)\{1/(nK)+1/N\}]^{1/2}\big)$; and (2) $\{1/(nK)+1/N\}^{-1/2}(\bar\beta_K-\beta_0)\to_d N(0,\Sigma^{-1})$.

The proof of Theorem 1 is in Appendix B.1. As shown in Theorem 1, the difference between the SOS estimator $\bar\beta_K$ and the true parameter $\beta_0$ is separated into two parts, a bias term and a variance term. Note that both the bias term $\Delta$ and the variance term $\mathrm{var}(\bar\beta_K)$ are determined by two main components. The first component is related to the whole sample size N; this component cannot be removed by the SOS procedure. The second component is $(nK)^{-1}$, which is driven by the SOS procedure and decreases with larger K or n. In addition, the SOS estimator is asymptotically normal with asymptotic variance $\{1/(nK)+1/N\}\Sigma^{-1}$. In particular, when nK is much larger than N, it achieves the same statistical efficiency as the global estimator. Note that the practical demand for estimation precision is usually limited, whereas the budget for sampling cost is valuable. It may therefore be appealing to sacrifice some statistical efficiency for lower sampling cost. In practice, we thus expect the SOS method to be implemented with reasonably large n and K (i.e., $nK\ll N$) as long as the desired statistical precision is achieved.

For theoretical comparison, we introduce a simple alternative method. For each sub-sample $\mathcal{S}_k$, we separately compute the MLE $\hat\beta_{k,mle}$. Then, all sub-sample estimators are simply averaged to obtain the OS estimator, denoted by $\bar\beta_K^{OS}=K^{-1}\sum_{k=1}^{K}\hat\beta_{k,mle}$. We obtain the following conclusion.
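As a quick illustration, and reusing the hypothetical newton_mle and draw_subsample helpers from the SOS sketch above, the OS estimator simply averages K independently and fully fitted sub-sample MLEs:

```python
import numpy as np

def os_estimate(draw_subsample, K):
    """One-shot (OS) estimator: average K fully converged sub-sample MLEs."""
    return np.mean([newton_mle(*draw_subsample()) for _ in range(K)], axis=0)
```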

Proposition 1

For the OS estimator, under the same conditions as in Theorem 1, we have $\bar\beta_K^{OS}-\beta_0=\bar U_K+\Delta_{os}$, where $\Delta_{os}=O_p(1/n)$. Further assume $n^2/N\to\infty$; then, we have $\{1/(nK)+1/N\}^{-1/2}(\bar\beta_K^{OS}-\beta_0)\to_d N(0,\Sigma^{-1})$.

The proof of Proposition 1 is in Appendix B.2. Comparing Theorem 1 and Proposition 1, we find that the leading terms for the variance of the SOS and OS estimators are identical. However, the bias term of the OS estimator is of order $O_p(1/n)$, which cannot be improved as K increases. By contrast, the bias of the SOS estimator is $O_p\big([(\log K/n)\{1/(nK)+1/N\}]^{1/2}\big)$, which can be reduced significantly as K increases. Therefore, compared with the SOS estimator $\bar\beta_K$, the OS estimator $\bar\beta_K^{OS}$ requires the more stringent condition $n^2/N\to\infty$ to achieve the same asymptotic normality as the global estimator (Huang & Huo, 2019; Jordan et al., 2019). This condition is not necessary for the SOS estimator.

Next, to enable automatic inference, we discuss estimation of the standard error of the SOS estimator. Specifically, based on the SOS procedure, we construct the following statistic:

(2)

The properties of $\widehat{SE}^2(\bar\beta_K)$ are presented in the following theorem.

Theorem 2

Under the same conditions as in Theorem 1, we have

The proof of Theorem 2 is in Appendix B.3. We conclude that the leading orders of $\mathrm{var}(\bar\beta_K)$ and $\widehat{SE}^2(\bar\beta_K)$ are the same. However, as an estimator of $\mathrm{var}(\bar\beta_K)$, $\widehat{SE}^2(\bar\beta_K)$ is biased, although the order of the bias is $O(n/N^2)$, which decreases as N increases or n decreases. In practice, the unknown parameter $\beta_0$ in $U^{(k)}$ can be replaced by $\bar\beta_k$ to obtain $\hat U^{(k)}$ and $\hat{\bar U}_K=K^{-1}\sum_{k=1}^{K}\hat U^{(k)}$. Then, $\widehat{SE}^2(\bar\beta_K)$ can be calculated based on $\hat U^{(k)}$ and $\hat{\bar U}_K$ in the form of (2). Note that, by Theorem 1, the leading term of the variance of $\bar\beta_K$ is $\{1/(nK)+1/N\}$. This term can be consistently estimated by the proposed estimator $\widehat{SE}(\bar\beta_K)$. Its asymptotic property is presented in the following theorem.

Theorem 3

Under the same conditions in Theorem 1, we have

(3)

The proof of Theorem 3 is in Appendix B.4. Combining the results of Theorems 1 and 3, we immediately obtain that $\{\widehat{SE}(\bar\beta_K)\}^{-1}(\bar\beta_K-\beta_0)\to_d N(0,I_p)$. As a result, both the estimator and the corresponding statistical inference can be easily and efficiently derived by our SOS procedure. We illustrate the performance of the SOS estimator and $\widehat{SE}^2(\bar\beta_K)$ in the next section.

3 SIMULATION STUDIES

3.1 Simulation design

To demonstrate the finite sample performance of the SOS estimator, we present a variety of simulation studies. Assume that the whole data set contains N = 150,000 observations. For i = 1, ..., N, we generate each observation $(X_i,Y_i)$ under a logistic regression model. We choose the logistic regression model because it is the model used for our customer churn analysis. Because the SOS method extends easily to other generalised linear models, we also use the Poisson regression model as an example to test the effectiveness of the SOS method. The specific settings for the two model examples are as follows.

Example 1

(Logistic regression)

Logistic regression is used to model binary responses. In this example, we consider p = 5 exogenous covariates $X_i=(X_{i1},X_{i2},X_{i3},X_{i4},X_{i5})^\top$, where each covariate is generated from a standard normal distribution N(0, 1). We set the coefficients for $X_i$ as $\beta=(0,0.2,0.1,0.1,0.2)^\top$. Then, the response $Y_i$ ($1\le i\le N$) is generated from a Bernoulli distribution with the success probability given as

$P(Y_i=1\mid X_i)=\exp(X_i^\top\beta)/\{1+\exp(X_i^\top\beta)\}.$

Example 2

(Poisson regression)

Poisson regression is used to deal with count responses. We also consider p = 5 exogenous covariates, all generated from the standard normal distribution. The corresponding coefficients are set as $\beta=(3,2,1,1,2)^\top$. Then, the response $Y_i$ ($1\le i\le N$) is generated from a Poisson distribution given as

$Y_i\mid X_i\sim\mathrm{Poisson}(\lambda_i)$ with $\lambda_i=\exp(X_i^\top\beta).$

After obtaining the N observations, we consider combinations of different sub-sampling sizes and different sub-sampling times. In both the logistic and Poisson regression examples, we set n = (100, 200, 400). In the logistic regression example, we consider cases of small K and set K = (10, 20, 30, 40, 50, 100). In the Poisson regression example, we consider cases of large K and set K = (100, 200, 300, 400, 500, 1000). For each combination of n and K, we repeat the experiment B = 1000 times.
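The following is a minimal sketch of the two data-generating designs in Python, assuming the standard logistic and log links; the coefficient values are taken from the text above and are used purely for illustration.

```python
# A sketch of the simulation designs in Examples 1 and 2.
import numpy as np

rng = np.random.default_rng(2024)
N, p = 150_000, 5
X = rng.standard_normal((N, p))                 # standard normal covariates

# Example 1: logistic regression
beta_logit = np.array([0.0, 0.2, 0.1, 0.1, 0.2])
prob = 1.0 / (1.0 + np.exp(-X @ beta_logit))    # standard logistic link (assumed)
y_logit = rng.binomial(1, prob)

# Example 2: Poisson regression
beta_pois = np.array([3.0, 2.0, 1.0, 1.0, 2.0]) # coefficients as printed above
lam = np.exp(X @ beta_pois)                     # log link (assumed)
y_pois = rng.poisson(lam)
```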

3.2 Comparison with repeated sub-sampling methods

In this sub-section, we compare the proposed SOS estimator with the OS estimator, which is representative of repeated sub-sampling methods. Specifically, in each simulated data set, we obtain the SOS estimator using Algorithm 1. For the OS estimator, we first randomly obtain K sub-data sets and then apply the Newton–Raphson method to each sub-data set independently. In the kth sub-data set, we set the initial value as $\beta_{ini}=(0,0,0,0,0)^\top$ and run the Newton–Raphson method to full convergence to obtain the estimate $\hat\beta_{k,mle}$. The final OS estimator is calculated as $\bar\beta_K^{OS}=K^{-1}\sum_{k=1}^{K}\hat\beta_{k,mle}$. For each method (i.e., SOS and OS), we define $\hat\beta^{(b)}=(\hat\beta_j^{(b)})_{j=1}^{p}$ as the estimator in the bth ($1\le b\le B$) replication. To evaluate the estimation efficiency of each estimator, we calculate the bias as $|\bar\beta^{*}-\beta|$, where $\bar\beta^{*}=B^{-1}\sum_b\hat\beta^{(b)}$ is the Monte Carlo average. For the SOS method, the standard error ($\widehat{SE}^{(b)}$) can be estimated based on Theorem 2, and we report the average $\widehat{SE}=B^{-1}\sum_b\widehat{SE}^{(b)}$. We then compare $\widehat{SE}$ with the Monte Carlo SD of $\hat\beta^{(b)}$, which is calculated as $SE=\{B^{-1}\sum_b(\hat\beta^{(b)}-\bar\beta^{*})^2\}^{1/2}$. Next, we construct a 95% confidence interval for $\beta$ as $CI^{(b)}=(\hat\beta^{(b)}-z_{0.975}\widehat{SE}^{(b)},\hat\beta^{(b)}+z_{0.975}\widehat{SE}^{(b)})$, where $z_\alpha$ is the $\alpha$th lower quantile of the standard normal distribution. The empirical coverage probability is then computed as $ECP=B^{-1}\sum_b I(\beta\in CI^{(b)})$, where $I(\cdot)$ is the indicator function. Last, we compare the computational efficiency of the two methods. Note that, for a fixed sample size n, the computational time consumed by each Newton–Raphson iteration is the same for the SOS and OS methods. Therefore, we use the average number of Newton–Raphson iterations to compare the computational efficiency of the SOS and OS methods.
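As a small worked example, the sketch below computes these Monte Carlo summaries for a single coefficient, assuming est[b] and se_hat[b] hold the point estimate and estimated standard error from replication b (the names are illustrative):

```python
# Monte Carlo evaluation metrics: bias, SD, average estimated SE, and coverage.
import numpy as np
from scipy.stats import norm

def evaluate(est, se_hat, beta_true, level=0.95):
    est, se_hat = np.asarray(est), np.asarray(se_hat)
    z = norm.ppf(0.5 + level / 2)                      # z_{0.975} for 95% CIs
    bias = np.abs(est.mean() - beta_true)              # |Monte Carlo mean - truth|
    sd = np.sqrt(np.mean((est - est.mean()) ** 2))     # Monte Carlo SD ("SE")
    se_bar = se_hat.mean()                             # average estimated SE
    cover = np.mean((est - z * se_hat <= beta_true) &
                    (beta_true <= est + z * se_hat))   # empirical coverage (ECP)
    return bias, sd, se_bar, cover
```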

Tables 1 and 2 present the simulation results for estimation performance under logistic regression and Poisson regression, respectively. In general, the results for the two examples are similar, and we draw the following conclusions. First, as the number of sub-sampling times K increases, the bias of the SOS estimator becomes much smaller than that of the OS estimator. This is because the bias of the SOS estimator decreases with increasing K, while the bias of the OS estimator is always $O(1/n)$. Second, the SE and $\widehat{SE}$ of both estimators decrease with increasing n or K, implying that both estimators are consistent. Third, because the bias and SE of the SOS estimator both decrease with n and K, its bias is always negligible compared with its SE. However, the bias of the OS estimator is comparable to, or even larger than, its SE; see n = 100, K = 300 in Table 2 for an example. Next, the empirical coverage probabilities of the SOS estimator are all around the nominal level of 95%, which suggests that the true SE is well approximated by the estimator derived in Theorem 2. In contrast, the empirical coverage probabilities of the OS estimator show a decreasing trend as K increases, because the bias of the OS estimator is no longer negligible when K is large. Last, regarding computational time, we compare the average number of Newton–Raphson iterations consumed by the two methods. Except for the initial sub-sample estimator, the SOS method uses only one Newton–Raphson update for each subsequent sub-sample. On average, the OS estimator requires seven or eight rounds of Newton–Raphson updates to reach convergence. Therefore, the SOS estimator is computationally more efficient than the OS estimator.

TABLE 1

Simulation results for estimation performance under logistic regression

        Bias × 100      SE × 100        SE^ × 100       ECP             ROUND
K       SOS     OS      SOS     OS      SOS     OS      SOS     OS      SOS     OS
n = 100
10      0.532   0.903   6.598   7.237   6.545   6.596   0.952   0.912   1.875   7.145
20      0.336   0.963   4.694   5.109   4.687   4.717   0.947   0.879   1.923   7.492
30      0.204   0.943   3.815   4.142   3.857   3.880   0.950   0.858   1.884   7.146
40      0.170   0.976   3.278   3.559   3.359   3.379   0.950   0.858   1.821   7.240
50      0.150   0.915   2.997   3.247   3.015   3.032   0.954   0.823   2.122   7.654
100     0.058   0.921   2.124   2.293   2.162   2.173   0.955   0.755   1.942   7.149
n = 200
10      0.291   0.545   4.709   4.923   4.539   4.554   0.946   0.927   1.873   7.364
20      0.160   0.479   3.233   3.374   3.283   3.293   0.952   0.919   1.917   7.381
30      0.098   0.420   2.682   2.793   2.707   2.714   0.950   0.912   1.921   7.399
40      0.065   0.453   2.328   2.422   2.363   2.370   0.954   0.906   1.893   7.416
50      0.043   0.438   2.080   2.161   2.125   2.130   0.951   0.904   1.834   7.433
100     0.041   0.448   1.550   1.609   1.553   1.556   0.948   0.882   1.985   7.450
n = 400
10      0.119   0.245   3.198   3.269   3.200   3.205   0.950   0.938   1.881   7.467
20      0.055   0.239   2.285   2.331   2.322   2.325   0.952   0.938   1.954   7.484
30      0.046   0.224   1.891   1.926   1.926   1.928   0.951   0.936   2.016   7.501
40      0.035   0.218   1.681   1.714   1.689   1.692   0.949   0.934   1.982   7.519
50      0.031   0.226   1.529   1.558   1.532   1.534   0.947   0.929   2.104   7.536
100     0.020   0.237   1.148   1.170   1.150   1.151   0.947   0.912   2.163   7.553

Notes: The bias, SE, SE^, and ECP are reported for the sequential one-step (SOS) and one-shot (OS) estimators, respectively. The average Newton–Raphson rounds (ROUND, representing the computational time) of the two estimators are also reported.


TABLE 2

Simulation results for estimation performance under Poisson regression

        Bias × 100      SE × 100        SE^ × 100       ECP             ROUND
K       SOS     OS      SOS     OS      SOS     OS      SOS     OS      SOS     OS
n = 100
100     0.063   3.656   3.237   3.437   3.237   3.505   0.951   0.820   1.892   8.669
200     0.042   3.649   2.329   2.478   2.365   2.555   0.951   0.711   1.934   8.673
300     0.035   3.594   1.947   2.085   1.989   2.148   0.955   0.634   1.910   8.671
400     0.033   3.599   1.720   1.839   1.770   1.910   0.959   0.571   1.907   8.671
500     0.031   3.597   1.564   1.667   1.625   1.754   0.959   0.524   1.907   8.671
1000    0.033   3.593   1.222   1.294   1.286   1.387   0.960   0.426   1.893   8.670
n = 200
100     0.039   1.275   2.096   2.138   2.103   2.167   0.950   0.912   1.884   8.326
200     0.034   1.241   1.555   1.584   1.575   1.622   0.957   0.879   1.892   8.323
300     0.030   1.234   1.315   1.337   1.353   1.393   0.958   0.856   1.927   8.322
400     0.026   1.241   1.194   1.214   1.226   1.262   0.961   0.834   1.962   8.323
500     0.026   1.241   1.115   1.135   1.143   1.176   0.960   0.816   2.043   8.322
1000    0.023   1.239   0.923   0.936   0.956   0.984   0.958   0.765   1.895   8.321
n = 400
100     0.038   0.485   1.466   1.476   1.470   1.489   0.947   0.932   1.884   8.153
200     0.035   0.485   1.131   1.137   1.145   1.160   0.956   0.932   1.892   8.153
300     0.032   0.498   1.006   1.012   1.013   1.026   0.953   0.920   1.903   8.153
400     0.034   0.485   0.935   0.939   0.940   0.952   0.951   0.915   1.906   8.153
500     0.032   0.488   0.882   0.886   0.894   0.905   0.953   0.917   1.891   8.152
1000    0.026   0.486   0.769   0.772   0.792   0.802   0.954   0.910   1.902   8.152

Notes: The bias, SE, SE^, and ECP are reported for the sequential one-step (SOS) and one-shot (OS) estimators, respectively. The average Newton–Raphson rounds (ROUND, representing the computational time) of the two estimators are also reported.


3.3 Comparison with non-uniform sub-sampling methods

In this sub-section, we compare the proposed SOS method with non-uniform sub-sampling methods. To this end, we choose the OSMAC method (Wang et al., 2018) for the comparison, which is specifically designed for large sample logistic regression. The OSMAC method applies a two-step algorithm for model estimation. In the first step, a pilot sample of size $r_0$ is randomly chosen to obtain a pilot estimate. The pilot estimate is then used to compute the optimal sub-sampling probabilities for the whole data set. In the second step, a new sub-sample of size r is chosen based on these optimal sub-sampling probabilities, and the final OSMAC estimate is obtained using all $r_0+r$ samples. We compare the two methods under the logistic regression example. To mimic a large data set, we consider whole sample sizes $N=(1,2,4,8)\times 10^5$. For fixed N, the whole data set is generated under the logistic regression model following the procedures in Section 3.1.
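To make the computational pattern concrete, below is a schematic Python sketch of a two-step OSMAC-style estimator for logistic regression. It uses the simpler "mVc" sub-sampling probabilities, proportional to |y_i − p_i| times the covariate norm, rather than the exact A-optimality weights; the helper names are illustrative, and the precise criterion and weighted estimator are given in Wang et al. (2018).

```python
# A schematic two-step OSMAC-style estimator (simplified mVc probabilities).
import numpy as np

def weighted_logistic_newton(X, y, w, max_iter=100, tol=1e-8):
    """Weighted logistic MLE via Newton-Raphson."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        score = X.T @ (w * (y - p))
        hess = -(X * (w * p * (1 - p))[:, None]).T @ X
        step = np.linalg.solve(hess, score)
        beta = beta - step
        if np.max(np.abs(step)) < tol:
            break
    return beta

def osmac_like(X, y, r0, r, rng):
    N = X.shape[0]
    # Step 1: uniform pilot sample and pilot estimate.
    idx0 = rng.choice(N, size=r0, replace=False)
    beta_pilot = weighted_logistic_newton(X[idx0], y[idx0], np.ones(r0))
    # Computing non-uniform probabilities touches ALL N points -- the O(Nd)
    # cost discussed in the text.
    p_all = 1.0 / (1.0 + np.exp(-X @ beta_pilot))
    pi = np.abs(y - p_all) * np.linalg.norm(X, axis=1)
    pi = pi / pi.sum()
    # Step 2: draw r points with these probabilities and re-estimate with
    # inverse-probability weights.
    idx1 = rng.choice(N, size=r, replace=True, p=pi)
    return weighted_logistic_newton(X[idx1], y[idx1], 1.0 / (N * pi[idx1]))
```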

Below, we compare the SOS and OSMAC methods under one specific situation: the computer memory is limited and can only support fitting a logistic regression model on a sample of size n = 400 at a time. In this situation, the sub-sample size used in SOS is fixed at n = 400, and we vary the sub-sampling times as K = (1, 5, 10, 20, 40). For the OSMAC method, we set $r_0=200$ and r = 400. The OSMAC method is implemented using the corresponding R package provided by Wang et al. (2018).

For a reliable comparison, the experiment is randomly replicated B = 200 times for each experimental setup. We use the mean squared error (MSE) to evaluate the statistical efficiency of the two methods. Specifically, for each method (i.e., SOS and OSMAC), we define $\hat\beta^{(b)}=(\hat\beta_j^{(b)})_{j=1}^{p}$ as the estimator in the bth ($1\le b\le B$) replication. The MSE of each estimator is then computed as $B^{-1}\sum_{b=1}^{B}\sum_{j=1}^{p}(\hat\beta_j^{(b)}-\beta_j)^2$. We also compare the computational time of the two methods. All experiments are conducted on a server with 12× Xeon Gold 6271 CPU and 64 GB RAM. The total time costs consumed by the different methods under each experimental setup are averaged over the B = 200 random replications. The detailed results are displayed in Table 3.

TABLE 3

The mean squared error (MSE) and time costs (in seconds) are obtained by the sequential one-step (SOS) and optimal sub-sampling methods motivated by the A-optimality criterion (OSMAC) methods for different sample sizes N under the logistic regression model.

                N = 100,000       N = 200,000       N = 400,000       N = 800,000
Method          MSE      Time     MSE      Time     MSE      Time     MSE      Time
OSMAC           0.0460   0.0213   0.0460   0.0410   0.0452   0.0875   0.0476   0.1679
SOS (K = 1)     0.2346   0.0058   0.2320   0.0070   0.2300   0.0064   0.2274   0.0086
SOS (K = 5)     0.0415   0.0101   0.0419   0.0090   0.0414   0.0124   0.0418   0.0150
SOS (K = 10)    0.0410   0.0171   0.0405   0.0168   0.0419   0.0175   0.0406   0.0256
SOS (K = 20)    0.0404   0.0318   0.0409   0.0322   0.0405   0.0378   0.0403   0.0437
SOS (K = 40)    0.0401   0.0574   0.0398   0.0618   0.0400   0.0740   0.0400   0.0804

Based on the results presented in Table 3, we draw the following conclusions. First, the OSMAC method achieves better estimation performance than the SOS method with K = 1, because the OSMAC method selects an optimal sub-sample while the SOS method selects its sub-sample at random. Second, as the sub-sampling times K increase, the estimation performance of the SOS method improves, with smaller MSE values. This finding is consistent with our theoretical results in Theorem 1. It is also notable that a relatively small K (e.g., K = 5) already allows the SOS method to achieve better estimation performance than the OSMAC method. Third, the computational time consumed by the OSMAC method increases rapidly with the whole sample size N, because the OSMAC method computes the sub-sampling probabilities for all N samples, which results in a large computational cost. Meanwhile, the computational cost of the SOS method mainly results from the repeated sub-sampling strategy, so the computational time consumed by SOS grows with K. However, with appropriately chosen K, the SOS method can achieve both better estimation performance and lower computational cost than the OSMAC method.

3.4 Comparison with all-sample methods

Finally, to complete our empirical comparison, we compare the SOS method with methods that use the whole sample. We first compare the SOS method with DC methods, which are also commonly used to accomplish data analysis tasks of huge scale. The key idea of DC methods is to divide a large-scale data set into multiple sub-data sets, each of which is estimated separately to obtain a local estimate. All local estimates are then assembled to obtain the final estimate. Different from sub-sampling methods, DC methods exploit the whole data information. Therefore, they often have good statistical efficiency but high computational cost. In this regard, we take the AEE method (Lin & Xi, 2011) as a representative example. Another method to consider is the mini-batch gradient descent (MGD) estimation method (Duchi et al., 2011). The MGD method splits the whole data set into several mini-batches, each of which is read into memory and processed sequentially. Different from the SOS method, MGD is not a Newton–Raphson-type method; instead, it applies a stochastic gradient descent strategy for parameter estimation.
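For concreteness, a minimal sketch of such an MGD baseline for logistic regression is given below, assuming the shuffled data are split into J mini-batches processed with a fixed learning rate (the function and parameter names are illustrative):

```python
# A minimal mini-batch gradient descent (MGD) baseline for logistic regression.
import numpy as np

def mgd_logistic(X, y, J=100, lr=0.2, epochs=1):
    N, p = X.shape
    beta = np.zeros(p)
    batches = np.array_split(np.arange(N), J)           # J sequential mini-batches
    for _ in range(epochs):
        for idx in batches:
            prob = 1.0 / (1.0 + np.exp(-X[idx] @ beta))
            grad = X[idx].T @ (y[idx] - prob) / len(idx) # average log-likelihood gradient
            beta += lr * grad                            # gradient ascent step
    return beta
```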

To undertake a comprehensive evaluation, we consider whole sample sizes $N=(10,12,14,16,18,20)\times 10^4$. For fixed N, we generate the data set under the logistic regression model following the procedures described in Section 3.1. For the AEE method, we assume there are a total of J = 100 Workers. For the MGD method, we assume the total number of mini-batches is also J = 100. The whole data set is then randomly and evenly divided into J = 100 sub-data sets, each with n = N/J observations. For comparison, the sub-sample size in the SOS method is fixed at n = N/J. For all experiments, we fix the sub-sampling times at K = 50. Theoretically, the information exploited by the SOS method is less than that exploited by the DC methods. We repeat the experiment B = 200 times under each experimental setup. The statistical efficiencies of the three methods are evaluated by MSE, and we also compare their computational efficiency. The detailed results are displayed in Figure 2.

FIGURE 2 The mean squared error and time costs (in logarithm) obtained by the sequential one-step, mini-batch gradient descent, and aggregated estimating equation methods for different sample sizes N under the logistic regression model. (a) Estimation performance; (b) Computation performance

As shown in Figure 2a, compared with AEE and MGD, the SOS method is statistically less efficient and attains the largest MSE values. This finding is expected, because the AEE and MGD methods exploit the full data information, so in theory both estimators are $\sqrt{N}$-consistent. However, as suggested by Theorem 2, the variance of the SOS estimator is of order $\{1/(nK)+1/N\}$, so the SOS estimator should be statistically less efficient when nK is smaller than N. Although the SOS estimator has the worst statistical efficiency, its MSE already reaches the order of $10^{-4}$, indicating satisfactory estimation precision in practice.

We then compare the computational efficiency of the three methods. We fix the learning rate at 0.2 in the MGD method. All experiments are conducted on a server with 12× Xeon Gold 6271 CPU and 64 GB RAM. The total time costs (in seconds) consumed by the different methods under different sample sizes are averaged over B = 200 random replications, and the averaged time costs are plotted in Figure 2b on a log scale. As shown, the MGD method takes the most computational time, while the AEE and SOS methods are much more computationally efficient. Furthermore, the SOS method takes less computational time than the AEE method; in general, the time costs of the SOS method are only about half those of the AEE method. These empirical findings confirm that the SOS method is computationally more efficient than the AEE and MGD methods.

4 APPLICATION TO CUSTOMER CHURN ANALYSIS

4.1 Data description and pre-processing

We apply the SOS method to a large-scale customer churn data set provided by a well-known securities company in China. The original data set contains 12 million transaction records from 230,000 customers, covering September to December 2020. The data set originates from 10 files directly exported from the company's database system, which record different aspects of the customers. Specifically, the 10 files cover: basic user information, behaviour information on the company's app, daily asset information, daily market information, inflow-outflow information, debt information, trading information, fare information, and information on holdings of financial products and service products. In total, the 10 files contain 398 variables describing the asset and non-asset information of a specific customer on a specific trading day. The asset information contains 325 variables related to customer transactions, such as assets, stock value, trading volume, profit and debt. The non-asset information contains 73 variables with detailed customer information, such as customer ID, gender, age, education and login behaviour. The whole data set takes up about 300 GB on a hard drive.

The research goal of this study is to provide an early warning of customer churn status, which may help the securities company retain customers. According to the common practice of the securities company, a customer is defined as lost when the following two criteria are both met: (1) the customer holds less than 20,000 in floating assets over 20 trading days; and (2) the customer logs in fewer than three times over 20 trading days. Based on this definition, a new binary variable Churn is used to indicate whether the customer is lost (Churn = 1) or not (Churn = 0). Given that the response is binary (churn or not churn), a logistic regression model can be built to support customer churn prediction. To predict the customer churn status, we compute both asset-related and non-asset-related covariates for each customer using transaction information from 20 trading days earlier. In other words, all covariates used can help forecast the churn status of customers 20 trading days in advance.
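As an illustration of this labelling rule, the sketch below derives the Churn indicator from a per-customer, per-day table; the column names ('customer_id', 'floating_assets', 'n_logins') are hypothetical, and the aggregation reflects one natural reading of the two criteria.

```python
# A sketch of the churn label construction over a 20-trading-day window.
import pandas as pd

def label_churn(window: pd.DataFrame) -> pd.Series:
    """Churn = 1 if, over the window, floating assets stay below 20,000
    AND the customer logs in fewer than three times in total."""
    agg = window.groupby("customer_id").agg(
        max_assets=("floating_assets", "max"),
        total_logins=("n_logins", "sum"),
    )
    return ((agg["max_assets"] < 20_000) & (agg["total_logins"] < 3)).astype(int)
```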

Before model building, we conduct several steps to preprocess the original data set. First, we check the missing value proportion of every variable and discard variables whose missing value proportions are larger than 80%. Second, basic summary statistics (e.g., mean and SD) are computed for each variable to help detect potential outliers. Third, we check the stability of each variable to detect potential changepoints. Fortunately, we find that the daily basic statistics for most variables are stable from September to December. Therefore, all daily observations are pooled together in the subsequent analysis.

Preliminary analysis shows that strong correlations exist among most variables in the original data set. Therefore, we design a practical procedure for variable selection, borrowing ideas from the independence screening method (Fan & Song, 2010). Specifically, we first classify all covariates into 10 groups based on their source files. Then a logistic regression model with each single covariate is fitted for variable selection. Note that the prediction performance of the customer churn model should be evaluated on a test data set. We therefore first order all observations by time and then split the whole data set into a training data set (the first 70% of observations) and a test data set (the last 30% of observations). We build a logistic regression model with each single covariate on the training data set, and the resulting AUC value is recorded to measure the prediction ability of that covariate. Next, in each variable group, the covariate with the largest AUC value is chosen. This leads to the final predictor set consisting of 10 covariates. Table 4 shows the detailed information about the response and the 10 selected covariates.
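The sketch below illustrates this group-wise screening step with scikit-learn, assuming numeric covariate matrices and a 'groups' mapping from column index to source-file group; the AUC is computed on the held-out split here purely for illustration, and all names are hypothetical.

```python
# Group-wise screening: keep the single covariate with the largest AUC per group.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def screen_by_auc(X_train, y_train, X_test, y_test, groups):
    """groups maps column index -> source-file group name."""
    auc = {}
    for j in range(X_train.shape[1]):
        model = LogisticRegression().fit(X_train[:, [j]], y_train)
        auc[j] = roc_auc_score(y_test, model.predict_proba(X_test[:, [j]])[:, 1])
    selected = {}
    for j, g in groups.items():                  # best column within each group
        if g not in selected or auc[j] > auc[selected[g]]:
            selected[g] = j
    return sorted(selected.values())
```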

We then explore the relationship between the response and the covariates. For illustration, we take MaxFA and WheL as examples of continuous and categorical covariates, respectively. In the whole data set, the percentage of churned customers is 13.7%. Because the continuous variable MaxFA has a highly right-skewed distribution, a logarithmic transformation is applied. Figure 3a presents the boxplot of MaxFA (in log-scale) under different levels of Churn. As shown, customers with smaller maximum floating assets are more likely to churn. As for the categorical variable WheL, Figure 3b presents the spinogram of this variable under different levels of Churn. As shown, customers who do not log into the system are more likely to churn.

FIGURE 3 The boxplot of MaxFA (in log-scale) (a) and the spinogram of WheL (b) under the response Churn = 0 (non-churn status) and Churn = 1 (churn status)

TABLE 4

The detailed information about responses and covariates

Category    Variable  Source file  Description
Response    Churn     Asset        Whether the customers churn or not. Yes: 13.7%; No: 86.3%.
Asset       MaxMVS    Market       The maximum market value of stock.
            StdTF     Fare         The SD of total fare.
            MaxFA     Asset        The maximum floating assets.
            StdTD     Debt         The SD of total debt.
            WheTVAM   Trading      Whether the trading volume of A-shares is missing or not. Yes: 69.1%; No: 30.9%.
            WheIFM    Inflow       Whether the inflow of funds is missing or not. Yes: 81.3%; No: 18.7%.
Non-asset   Age       Basic        The age of customers (4 levels). <40: 20.6%; 40–50: 25.8%; 50–60: 29.3%; >60: 24.3%.
            WheHFP    FProduct     Whether the customers hold financial products or not. Yes: 71.5%; No: 28.5%.
            WheHSP    SProduct     Whether the customers hold service products or not. Yes: 86.0%; No: 14%.
            WheL      Behaviour    Whether the customers login or not. Yes: 16.8%; No: 83.2%.

Note: "FProduct" and "SProduct" represent source files of holding financial products or service products.


4.2 Customer churn prediction using SOS

We build a logistic regression model to investigate influential factors for the churn status of customers. The whole data set is too large to be analysed directly in the computer's memory. To handle this huge data set, we do not consider DC methods, because we do not have a distributed system at hand. We also do not consider classic sub-sampling methods: non-informative (uniform) one-round sub-sampling is usually statistically less efficient than the SOS method, while informative sub-sampling requires computing the optimal sampling probabilities for the entire data set, which would incur a high computational cost. Based on these considerations, the SOS method is applied to estimate the logistic regression model. To evaluate the predictive ability of the SOS method, we order all observations by time and then split the whole data set into training data (the first 70% of observations) and test data (the last 30% of observations). Below, we build the logistic regression model on the training data set and then evaluate its prediction performance on the test data set.
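To make the evaluation protocol concrete, the following minimal Python sketch illustrates a time-ordered 70/30 split of the kind described above. The file name and column names (obs_time, churn) are hypothetical placeholders rather than the fields of the confidential data set.

```python
import pandas as pd

# Hypothetical file and column names; the real covariates are listed in Table 4.
df = pd.read_csv("customer_churn.csv", parse_dates=["obs_time"])

# Order observations by time so that the test set strictly follows the training set.
df = df.sort_values("obs_time").reset_index(drop=True)

# First 70% of observations for training, last 30% for testing.
cut = int(0.7 * len(df))
train, test = df.iloc[:cut], df.iloc[cut:]

print(f"train: {len(train)} rows, churn rate {train['churn'].mean():.3f}")
print(f"test:  {len(test)} rows, churn rate {test['churn'].mean():.3f}")
```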

Before applying the SOS method to the training data set, we need to set the sub-sampling size n and the sub-sampling times K. To balance K and n, we first select the sub-sampling size n and then determine the total sub-sampling times K based on n. The sub-sampling size n is mainly determined by the available computation resources. In this real application, we use a server with 12× Xeon Gold 6271 CPUs and 64 GB RAM for computation. In addition, the securities company requires fast computation so that the customer churn model can be conveniently updated day by day in the future. Based on the limited computation resources and this requirement for fast computation, we fix n = 10,000.

For the sub-sampling times K, we apply an iterative strategy to select an appropriate value. We define $\bar{\beta}_K$ as the SOS estimate obtained after K rounds of sub-sampling. Then, with increasing K, we compare $\bar{\beta}_{K-1}$ with $\bar{\beta}_K$. An appropriate K is chosen once the $\ell_2$-norm $\|\bar{\beta}_K - \bar{\beta}_{K-1}\|_2$ falls below a pre-defined threshold. Other selection methods could also be considered, such as five-fold cross-validation; specifically, one could monitor the out-of-sample prediction accuracy under each K and then select a value that balances prediction accuracy against computational time. In this work, we vary K from 10 to 100 in steps of 10 and use the iterative strategy to select K. Under each K, we compute the $\ell_2$-norm $\|\bar{\beta}_K - \bar{\beta}_{K-1}\|_2$; the corresponding values are plotted in Figure 4. Based on some preliminary analysis, we find that the estimated coefficients are not very small. Therefore, we set a threshold of $10^{-4}$ to identify stable coefficient estimates, which leads to the choice K = 20. In addition, as shown in Figure 4, K = 20 is also the point at which $\|\bar{\beta}_K - \bar{\beta}_{K-1}\|_2$ declines fastest. Based on these considerations, we finally choose K = 20.
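The stopping rule above is straightforward to implement. The sketch below shows one plausible realisation in Python: each round draws a uniform sub-sample of size n, performs a single Newton-type one-step update of the logistic log-likelihood starting from the current averaged estimate, and averages the per-round estimates until the average stabilises. The function names and the exact form of the update are illustrative assumptions based on the method's description, not the authors' code.

```python
import numpy as np

def logistic_one_step(beta, X, y):
    """One Newton step for the logistic log-likelihood on a sub-sample."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    grad = X.T @ (y - p) / len(y)                         # normalised score
    hess = (X * (p * (1 - p))[:, None]).T @ X / len(y)    # normalised information
    return beta + np.linalg.solve(hess, grad)

def sos_estimate(X, y, n=10_000, K_max=100, tol=1e-4, seed=None):
    """Hypothetical SOS-style loop: uniform sub-sampling, one-step updating,
    and averaging, stopped once the averaged estimate stabilises."""
    rng = np.random.default_rng(seed)
    beta_bar = np.zeros(X.shape[1])
    estimates = []
    for k in range(1, K_max + 1):
        idx = rng.choice(len(y), size=n, replace=False)    # uniform sub-sample
        estimates.append(logistic_one_step(beta_bar, X[idx], y[idx]))
        new_bar = np.mean(estimates, axis=0)               # running SOS average
        if k > 1 and np.linalg.norm(new_bar - beta_bar) < tol:
            return new_bar, k                              # l2-norm stopping rule
        beta_bar = new_bar
    return beta_bar, K_max
```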

FIGURE 4  The value of the $\ell_2$-norm $\|\bar{\beta}_K - \bar{\beta}_{K-1}\|_2$ under different $K$

Table 5 presents the detailed regression results for the SOS method on the training data set. For comparison purposes, we also report the regression results for the full data set in Table 5. In general, the regression results for the training data and the full data are similar. Specifically, the variable MaxFA plays a significantly negative role in the churn status, which implies that the smaller the maximum floating assets, the more likely a customer is to churn. This is in accordance with the descriptive results shown in Figure 3a. The variable WheTVAM plays a significantly positive role in the churn status, which implies that customers with no A-share trading volume are more likely to churn. Similarly, customers with no inflow of funds are more likely to churn. As for the non-asset-related variables, the variable WheHSP plays a significantly negative role in the churn status, indicating that customers who do not hold service products are more likely to churn. In addition, the variable Age has a significant influence on the churn status for some age groups: customers in the age groups 50–60 and >60 are more likely to churn than those in the age group <40. Finally, the variable WheL has a significantly negative effect, indicating that customers who did not log in to the system during the past 20 trading days are more likely to churn. As a robustness check, we also conduct the same experiment on four daily data sets, each randomly selected from one of the months from September to December. The detailed results are shown in Appendix D in Data S1 and suggest stable coefficient estimates across time.
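As an informal illustration of effect size (our reading, not part of the original analysis), the usual odds-ratio interpretation of logistic regression applied to the printed training-data coefficient of WheIFM gives

$$\exp(0.529) \approx 1.70,$$

that is, other covariates held fixed, customers whose fund inflow is missing have churn odds roughly 1.7 times those of customers with a recorded inflow.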

TABLE 5

The estimation results for logistic regression using the sequential one-step method on the training data set and the full data set, respectively

                     Training data                          Full data
Variable         | Est.    | SE    | p-Value | Sig. | Est.    | SE    | p-Value | Sig.
Intercept        | 23.753  | 0.351 | <0.001  | ***  | 23.482  | 0.331 | <0.001  | ***
MaxMVS           | 11.736  | 0.405 | <0.001  | ***  | 11.646  | 0.336 | <0.001  | ***
StdTF            |  2.759  | 0.313 | <0.001  | ***  |  2.829  | 0.261 | <0.001  | ***
StdTD            |  3.568  | 0.316 | <0.001  | ***  |  3.614  | 0.263 | <0.001  | ***
MaxFA            | 27.699  | 0.708 | <0.001  | ***  | 27.604  | 0.588 | <0.001  | ***
WheTVAM: Yes     |  1.293  | 0.306 | <0.001  | ***  |  1.333  | 0.255 | <0.001  | ***
WheIFM: Yes      |  0.529  | 0.253 | 0.037   | *    |  0.544  | 0.255 | 0.033   | *
Age: 40–50       |  0.426  | 0.306 | 0.164   |      |  0.450  | 0.255 | 0.077   |
Age: 50–60       |  0.732  | 0.282 | 0.009   | **   |  0.693  | 0.255 | 0.007   | **
Age: >60         |  1.131  | 0.308 | <0.001  | ***  |  1.155  | 0.257 | <0.001  | ***
WheL: Yes        |  2.700  | 0.312 | <0.001  | ***  |  2.622  | 0.259 | <0.001  | ***
WheHFP: Yes      |  0.476  | 0.274 | 0.083   |      |  0.451  | 0.254 | 0.076   |
WheHSP: Yes      |  0.468  | 0.226 | 0.039   | *    |  0.523  | 0.255 | 0.041   | *

Notes: We report the estimated coefficient $\bar{\beta}_K$, its standard error $\widehat{\mathrm{SE}}(\bar{\beta}_K)$, and the p-value for each variable. The symbols *, ** and *** represent a significant influence under the significance levels 5%, 1% and 0.1%, respectively.
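The reported p-values are Wald-type and can be reproduced directly from the estimates and standard errors in Table 5; the snippet below does so for the WheIFM row of the training-data columns and recovers the tabulated value of 0.037.

```python
from scipy.stats import norm

est, se = 0.529, 0.253                 # WheIFM (Yes), training data, Table 5
z = est / se                           # Wald statistic
p_value = 2 * (1 - norm.cdf(abs(z)))   # two-sided normal-approximation p-value
print(round(z, 3), round(p_value, 3))  # approximately 2.091 and 0.037
```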


Next, we evaluate the predictive ability of the model. Above, we obtained the logistic regression model on the training data set. The estimated model is then used to predict the churn status of the customers in the test data set. We use the receiver operating characteristic (ROC) curve together with the area-under-curve (AUC) value to measure predictive accuracy, as shown in Figure 5. The AUC value for this data split is 0.946, which suggests that the proposed model has very good ability to classify customers as churn or non-churn. For comparison purposes, we also apply the OS method with K = 20 and n = 10,000 on the training data set. The AUC value of the OS method on the test data is 0.938, which is smaller than that of the SOS method. We also compare the computational time of the two methods. On a server with 12× Xeon Gold 6271 CPUs and 64 GB RAM, the computational times for the SOS and OS methods are 3.02 and 10.01 s, respectively. The SOS method is thus clearly more computationally efficient than the OS method.
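For readers who want to reproduce this evaluation step, the following self-contained sketch computes an ROC curve and AUC with scikit-learn on small synthetic data (the real data are confidential), using an ordinary logistic fit as a stand-in for the SOS estimate; it illustrates the computation behind Figure 5 rather than the paper's exact pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve

# Small synthetic stand-in for the confidential churn data.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 4))
logit = X @ np.array([1.0, -0.5, 0.8, 0.0]) - 1.8
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))
X_train, X_test, y_train, y_test = X[:3500], X[3500:], y[:3500], y[3500:]

model = LogisticRegression().fit(X_train, y_train)   # stand-in for the SOS fit
p_hat = model.predict_proba(X_test)[:, 1]            # predicted churn probabilities

auc = roc_auc_score(y_test, p_hat)                   # area under the ROC curve
fpr, tpr, _ = roc_curve(y_test, p_hat)               # points tracing the ROC curve
print(f"test AUC: {auc:.3f}")
```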

FIGURE 5  The receiver operating characteristic curve of the logistic regression using the sequential one-step method on the test data

Finally, we present a practical customer recovery strategy based on the established model in Table 5. First, we sort all customers in the test data set by their predicted churn probabilities under the model. Then, we divide the customers into 10 groups of equal size: group 1 consists of the customers with the highest predicted churn probabilities, which we refer to as the high-risk group, and group 10 contains the customers with the lowest predicted churn probabilities, which is regarded as the low-risk group. In each of the 10 groups, we calculate the ratio of customers who truly churned. As shown in Figure 6, the churn ratio among all customers is only 13.7%, but the churn ratio in group 1 is as high as 81.5%. This result verifies the predictive power of the established model: customers with high predicted churn probabilities do tend to churn in reality. This finding suggests that the company should pay more attention to this group of customers and employ active retention strategies, such as face-to-face visits, reduced commissions, and exclusive services. It is also notable that group 2 shows a higher churn ratio (31.0%) than that of all customers (13.7%). Therefore, group 2 also requires close attention and continuous monitoring.
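The grouping procedure described here is a standard decile (lift) analysis. A minimal sketch is given below on synthetic stand-ins for the test-set predictions, so the printed ratios are purely illustrative and will not match the 81.5% and 31.0% reported for the real data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins; in practice use the fitted model's test-set predictions.
rng = np.random.default_rng(1)
p_hat = rng.uniform(size=20_000)                      # predicted churn probabilities
y_test = rng.binomial(1, 0.3 * p_hat)                 # observed churn indicators

groups = pd.qcut(-p_hat, q=10, labels=range(1, 11))   # group 1 = highest predicted risk
lift = (
    pd.DataFrame({"group": groups, "churned": y_test})
    .groupby("group", observed=True)["churned"]
    .mean()
)
print(lift)                            # churn ratio within each risk group (cf. Figure 6)
print("overall churn ratio:", y_test.mean())
```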

FIGURE 6  The churn ratios of the 10 groups divided by their predicted churn probabilities under the sequential one-step method

5 CONCLUDING REMARKS

In this work, we propose a sampling-based method for customer churn analysis with massive data sets. Classic sub-sampling methods require only one round of sub-sampling, but they must calculate non-uniform sampling probabilities for all data points, which often makes them computationally inefficient. To address this issue, we propose the SOS method, which samples data points with uniform probabilities but repeats the sub-sampling step. In this way, the sub-sampling cost can be largely reduced. Based on the SOS method, a sequence of estimators is computed, each obtained by one-step updating based on the newly sampled sub-data. The final SOS estimate is the average of all these estimators. We establish the theoretical properties of the SOS estimator: both its bias and its SE can be reduced by increasing the sub-sampling times or the sub-sample size. The finite sample performance of the SOS estimator is demonstrated through simulation studies. Finally, we apply the SOS method to a real customer churn data set from a securities company. Using the SOS method, we can handle the large-scale data set, identify useful factors that influence customers' churn status, and achieve high prediction accuracy for latent churn customers. Remarkably, although the proposed SOS method is designed for the estimation of logistic regression, it can easily be extended to other generalised regression models.

We conclude with several directions for future study. First, the SOS estimator still depends on multiple rounds of sub-sampling. New sub-sampling schemes could be designed to reduce the number of sub-sampling rounds, which would further lower the computational cost. Second, in the final SOS step all previous estimators receive equal weights. In practice, one could assign larger weights to estimators from later steps, because they tend to perform better. Finally, when dynamic massive data are available, extending the SOS method to data streams is a promising topic for future study.

6 FUNDING INFORMATION

This work was supported by the National Natural Science Foundation of China (grant numbers 72001205, 72171229, 11971504, 12071477, 11701560 and 11831008), the fund for building world-class universities (disciplines) of Renmin University of China, the Chinese National Statistical Science Research Project (2022LD06), and a Foundation from the Ministry of Education of China (20JZD023). This research was also supported by the Public Computing Cloud, Renmin University of China.

DATA AVAILABILITY STATEMENT

The data cannot be made public, because the cooperation with the company is based on the confidentiality of the original data.

REFERENCES

Ahmad, A.K., Jafar, A. & Aljoumaa, K. (2019) Customer churn prediction in telecom using machine learning in big data platform. Journal of Big Data, 6, 1–24.

Ahn, Y., Kim, D. & Lee, D.-J. (2019) Customer attrition analysis in the securities industry: a large-scale field study in Korea. International Journal of Bank Marketing, 38(3), 561–577.

Ascarza, E., Neslin, S.A., Netzer, O., Anderson, Z., Fader, P.S., Gupta, S. et al. (2018) In pursuit of enhanced customer retention management: review, key issues, and future directions. Customer Needs and Solutions, 5, 65–81.

Battey, H., Fan, J., Liu, H., Lu, J. & Zhu, Z. (2018) Distributed testing and estimation under sparse high dimensional models. The Annals of Statistics, 46, 1352–1382.

Dhillon, P., Lu, Y., Foster, D.P. & Ungar, L. (2013) New subsampling algorithms for fast least squares regression. Proceedings of the International Conference on Neural Information Processing Systems.

Drineas, P., Magdon-Ismail, M., Mahoney, M.W. & Woodruff, D.P. (2011) Fast approximation of matrix coherence and statistical leverage. Journal of Machine Learning Research, 13, 3475–3506.

Duchi, J., Hazan, E. & Singer, Y. (2011) Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12, 257–269.

Fan, J. & Li, R. (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96, 1348–1360.

Fan, J. & Song, R. (2010) Sure independence screening in generalized linear models with NP-dimensionality. Annals of Statistics, 38, 3567–3604.

Huang, C. & Huo, X. (2019) A distributed one-step estimator. Mathematical Programming, 174, 41–76.

Jordan, M.I., Lee, J.D. & Yang, Y. (2019) Communication-efficient distributed statistical inference. Journal of the American Statistical Association, 114, 668–681.

Kayaalp, F. (2017) Review of customer churn analysis studies in telecommunications industry. Karaelmas Science and Engineering Journal, 7, 696–705.

Lee, J.D., Liu, Q., Sun, Y. & Taylor, J.E. (2017) Communication-efficient sparse regression. The Journal of Machine Learning Research, 18, 115–144.

Lehmann, E. & Casella, G. (1998) Theory of point estimation, 2nd edition. New York: Springer-Verlag.

Lin, N. & Xi, R. (2011) Aggregated estimating equation estimation. Statistics and Its Interface, 1, 73–83.

Ma, P., Mahoney, M.W. & Yu, B. (2015) A statistical perspective on algorithmic leveraging. Journal of Machine Learning Research, 16, 861–919.

Ma, P. & Sun, X. (2015) Leveraging for big data regression. Wiley Interdisciplinary Reviews: Computational Statistics, 7, 70–76.

Ma, P., Zhang, X., Xing, X., Ma, J. & Mahoney, M.W. (2020) Asymptotic analysis of sampling estimators for randomized numerical linear algebra algorithms. AISTATS, 108, 1026–1035.

Maldonado, S., Domínguez, G., Olaya, D. & Verbeke, W. (2021) Profit-driven churn prediction for the mutual fund industry: a multisegment approach. Omega, 100, 102380.

McDonald, R., Mohri, M., Silberman, N., Walker, D. & Mann, G.S. (2009) Efficient large-scale distributed training of conditional maximum entropy models. Advances in Neural Information Processing Systems, 22, 1231–1239.

Pan, R., Zhu, Y., Guo, B., Zhu, X. & Wang, H. (2020) A sequential addressing subsampling method for massive data analysis under memory constraint. https://doi.org/10.48550/arXiv.2110.00936

Quiroz, M., Kohn, R., Villani, M. & Tran, M.-N. (2019) Speeding up MCMC by efficient data subsampling. Journal of the American Statistical Association, 114, 831–843.

Saulis, L. & Statulevičius, V. (2012) Limit theorems for large deviations. New York: Springer Science & Business Media.

Shao, J. (2003) Mathematical statistics. Springer Texts in Statistics. New York: Springer.

Wang, F., Zhu, Y., Huang, D., Qi, H. & Wang, H. (2021) Distributed one-step upgraded estimation for non-uniformly and non-randomly distributed data. Computational Statistics & Data Analysis, 162, 107265.

Wang, H. (2009) Forward regression for ultra-high dimensional variable screening. Journal of the American Statistical Association, 104(488), 1512–1524.

Wang, H., Zhu, R. & Ma, P. (2018) Optimal subsampling for large sample logistic regression. Journal of the American Statistical Association, 113, 829–844.

Wang, H.Y., Yang, M. & Stufken, J. (2019) Information-based optimal subdata selection for big data linear regression. Journal of the American Statistical Association, 114, 393–405.

Yu, J., Wang, H., Ai, M. & Zhang, H. (2020) Optimal distributed subsampling for maximum quasi-likelihood estimators with massive data. Journal of the American Statistical Association, 117(537), 265–276.

Zhu, X., Li, F. & Wang, H. (2021) Least squares approximation for a distributed system. Journal of Computational and Graphical Statistics, 30(4), 1004–1018.

Zou, H. & Li, R. (2008) One-step sparse estimates in nonconcave penalized likelihood models. Annals of Statistics, 36(4), 1509–1533.

APPENDIX A. USEFUL LEMMAS

In this section, we prove some useful lemmas.

Lemma 1

Considering the convergence rate of $\bar{\beta}_1$, we have $\|\bar{\beta}_1 - \beta_0\| = O_p(n^{-1/2})$.

For generalised linear models, under conditions (C2) and (C3), the objective function $\ell_{\mathcal{S}_1}(\beta)$ is strictly concave in a neighbourhood of $\beta_0$. As a result, to verify that $\bar{\beta}_1 = \hat{\beta}_1$ is $\sqrt{n}$-consistent, it suffices to follow the technique of Fan and Li (2001) and show that, for any $\epsilon > 0$, there exists a finite constant $C > 0$ such that,

$$P\Big\{\sup_{\|u\|=1} \ell_{\mathcal{S}_1}\big(\beta_0 + C n^{-1/2} u\big) < \ell_{\mathcal{S}_1}(\beta_0)\Big\} \geq 1 - \epsilon. \tag{A1}$$

To this end, we define $\beta_u = \beta_0 + C u/\sqrt{n}$, where $C > 0$ is a fixed constant and $u \in \mathbb{R}^p$ is a $p$-dimensional vector with unit length (i.e., $\|u\| = 1$). Then, we apply the Taylor expansion and obtain

(A2)

where 1=˙𝒮1(β0)/n and 2=¨𝒮1(β0)/n.

Next, we compute E(1) and var(1) as follows. First, we consider E(1).

where 𝒯={(X1,Y1),(X2,Y2),,(XN,YN)}, and Si(β0) is the score function of the ith observation for 1iN.

Second, we compute var(1). It can be calculated that

(A3)

where ^1=N1i=1NSi(β0)Si(β0) and 1=E(^1)=E{Si(β0)Si(β0)}. Note that under condition (C1), n/N0, then (A3) converges to 1 as n. This suggests that 1 is an Op(1). By a similar technique, it can be verified that 2p1. Consequently, as long as C is sufficiently large, the quadratic term in (A2) dominates its linear term. Therefore, 𝒮1(β0+Cn1/2u)𝒮1(β0)<0 with probability tending to 1 as n. This suggests that with probability tending to 1, a local optimiser (i.e., β^1=β1) exists, such that β1β0=Op(n1/2). The optimiser satisfies ˙𝒮1(β1)=0. The conclusion is thus proved.

Lemma 2

Denote the $m$th ($1 \le m \le M$) sequential sub-sample as $\mathcal{T}_m = \{(X_m, Y_m), \ldots, (X_{m+n-1}, Y_{m+n-1})\}$. Thus, given $\beta_0$, define $U_m = \{n^{-1}\ddot{\ell}_{\mathcal{T}_m}(\beta_0)\}^{-1}\{n^{-1}\dot{\ell}_{\mathcal{T}_m}(\beta_0)\}$ and $\bar{U} = M^{-1}\sum_{m=1}^{M} U_m$. The expectation and variance of $\bar{U}$ are $E(\bar{U}) = 0$ and

(A4)

Furthermore, recall that $U^{(k)} = \{n^{-1}\ddot{\ell}_{\mathcal{S}_k}(\beta_0)\}^{-1}\{n^{-1}\dot{\ell}_{\mathcal{S}_k}(\beta_0)\}$ and $\bar{U}_k = k^{-1}\sum_{k'=1}^{k} U^{(k')}$. For any $2 \le k \le K$, we have

(A5)

Proof

The lemma is proved in the following two steps. In the first step, we verify E(Ū)=0 and Equation (A4). In the second step, we prove Equation (A5).

Step 1. For the expectation, we have $E(U_m) = E\{E(U_m \mid \mathbb{X}_m)\} = 0$, where $\mathbb{X}_m = \{X_m, \ldots, X_{m+n-1}\}$, and

(A6)

Next, we consider the variance of $\bar{U}$. We have $\mathrm{var}(\bar{U}) = E(\bar{U}\bar{U}^\top) - E(\bar{U})E(\bar{U})^\top = E(\bar{U}\bar{U}^\top)$, as $E(\bar{U}) = 0$. Furthermore,

It can be verified that $\sum_{m=1}^{M} E(U_m U_m^\top) = M n^{-1}$. Next, we focus on the calculation of $M^{-2}\sum_{m_1 \ne m_2} E(U_{m_1} U_{m_2}^\top)$. We assume that $n/N \to 0$ and that $M = N - n + 1$ is much larger than the sub-sample size $n$. We now discuss two cases.

First, when $|m_1 - m_2| \ge n$, we have $E(U_{m_1} U_{m_2}^\top) = 0$. There are $(M-n)(M-n+1)$ pairs of $m_1$ and $m_2$ in this case. Second, let $|m_1 - m_2| = m$; when $0 < m < n$, we have $E(U_{m_1} U_{m_2}^\top) = n^{-1}(n-m)$, and there are $2(M-m)$ pairs of $m_1$ and $m_2$ in this case. As a result, $\sum_{0 < m < n} E(U_{m_1} U_{m_2}^\top) = 2 n^{-1} c_1$, where $c_1 = n(n-1)(3M-n-1)/6$. We can derive that $E(\bar{U}\bar{U}^\top) = (nM)^{-2}(nM + 2c_1)$. Then, we have

This finishes the first step.
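As a check on the counting argument above (a direct summation added here for clarity, not part of the original proof), the stated closed form of $c_1$ follows from

$$c_1 = \sum_{m=1}^{n-1} (M-m)(n-m) = Mn(n-1) - (M+n)\frac{n(n-1)}{2} + \frac{n(n-1)(2n-1)}{6} = \frac{n(n-1)(3M-n-1)}{6}.$$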

Step 2. By a similar proof technique to that for Step 1, we immediately have E(Ūk)=0. We then focus on the computations of var(Ūk). To study the variance of Ūk, define E(·) and var(·) as the conditional expectation and conditional variance, respectively, given ={U1,U2,,UM}. We know that var(Ūk)=E{var(Ūk)}+var{E(Ūk)}. Then, we study E{var(Ūk)} and var{E(Ūk)} separately.

First, we compute E{var(Ūk)}, which is

Second, we consider var{E(Ūk)}. We have

Then, var{E(Ūk)}=var(Ū)=E(ŪŪ). Thus,

(A7)

This completes the proof.

Lemma 3

DefineR1kandR2kas follows,

HereΔk,j=(Δk,j,i1i2)p×pfor1jp, andΔk,j,i1i2=˙j,𝒮k+1(β)/βi1βi2|β=β˜k, ˙j,𝒮k+1(β)is the j th element of˙𝒮k+1(β)andβ˜k=ηβ^k+(1η)β0for some0η1. In particular, R11={n1¨𝒮1(β0)}1[(β1β0){n1Δ0,j}(β1β0)],withΔ0,j,i1,i2=˙j,𝒮1(β)/βi1βi2|β=β˜1andR21=0.Then, ifβkβ0κ1{1/(nk)+1/N}{1+op(1)}, then for anyk2, we haveΔk=k1k=1k(R1k+R2k)κ2[(logk/n)1/2{1/(nk)+1/N}1/2]{1+op(1)}. Here, κ1,κ2are some positive constants.

Proof

By condition (C4), it can be calculated that

(A8)

Here, λmax(M) denotes the largest absolute eigenvalue of M, ^k={n1¨𝒮k(βk1)}1, k={n1¨𝒮k(β0)}1, and we define β0=β1. To analyse Equation (A8), we then study the two terms on the right of the equation separately in the following two steps.

Step 1. First, we are going to show that max1kkλmax(^k) is an Op(1). It is sufficient to show that with probability tending to 1, we have

(A9)

for some positive constants τmin<τmax<. By condition (C3), we can find two positive constants, such that 2τminλmin()λmax()21τmax, for two positive constants τmin<τmax<. Thus, we know immediately that 2τmininfr=1rrsupr=1rr21τmax for a p-dimensional vector r. Thus, the desired conclusion (A9) is implied by

(A10)

where ϵ>0 is an arbitrary positive number. Then, the left-hand side of (A10) is bounded by 1kkP(supr=1|r(^k)r|>ϵ). Note that for any k, we have

Consequently, the left-hand side of (A10) can be further bounded by

(A11)

Next, under condition (C5), it can be proved that $P(|\hat{\sigma}_{j_1 j_2, k} - \sigma_{j_1 j_2}| > \epsilon) \le C_1 \exp(-C_2 n \epsilon^2)$ for two positive constants $C_1$ and $C_2$, by Theorem 3.2 on p. 45 of Saulis and Statulevičius (2012) and the proof technique of Wang (2009). Thus, (A11) can be bounded by

(A12)

As under condition (C1), log(K)=o(n), the right-hand side of (A14) converges to 0 as n. This implies that max1kkλmax(^k) is an Op(1).

Step 2. In the second step, we investigate max1kkn1˙𝒮k(β0). Similar to Step 1, it can be proved that P(max1kkn1˙𝒮k(β0)>ϵ) can be bounded by

(A13)
(A14)

If we replace $\epsilon$ by $\gamma_n \epsilon$, then we can verify that if $\gamma_n = \sqrt{\log(k)/n}$, there exists an $\epsilon$ such that the right-hand side of (A14) can be made arbitrarily small. This leads to $\max_{1 \le k' \le k} \|n^{-1}\dot{\ell}_{\mathcal{S}_{k'}}(\beta_0)\| = O_p(\sqrt{\log(k)/n})$.

As a result, the right-hand side of (A8) is further bounded by

This completes the proof.

APPENDIX B. PROOF OF THE THEOREMS

In this section, we provide the detailed proof of the theorems and proposition to establish the theoretical properties of the proposed estimator.

B.1 Proof of Theorem 1

This theorem is to be proved in two parts. In the first part, we prove Theorem 1 (1). In the second part, we verify the asymptotic normality of the SOS estimator.

Part 1. We prove Theorem 1 (1) in the following two steps. In the first step, we show that for any k2,βk=β0+k1k=1k(R1k+R2k)k1k=1kU(k), where R1k and R2k are defined in Lemma 3. In the second step, denote Δ=K1k=1K(R1k+R2k), and recall that U(k)={n1¨𝒮k(β0)}1{n1˙𝒮k(β0)}, ŪK=K1k=1KU(k), we then verify that EŪK=0, var(βk)=var(ŪK){1+o(1)}={1/(nK)+1/N}{1+o(1)} and Δ=Op[(logk/n){1/(nk)+1/N}].

Step 1. By the Taylor expansion, we have

(B1)

Thus, based on (1), we have

(B2)

By (B3), we can rewrite βk as

(B4)

Step 2. This step is decomposed into two sub-steps. In Step 2.1, we verify that βkβ0κ1(1/(nk)+1/N){1+op(1)} for some constant κ1 in a deductive way. In Step 2.2, we prove the remaining results.

Step 2.1. First, we consider k=2. By (B4), it can be verified that β2=β0+21({n1¨𝒮2(β1)}1[(β1β0){n1Δ1,j}(β1β0)]+{n1¨𝒮1(β0)}1[(β1β0){n1Δ0,j}$β1β0)]+ {n1¨𝒮2(β0)}1{n1¨𝒮2(β1)}1n1˙𝒮2(β0))Ū2. From Lemma 2, we have E(Ū2)=0 and var(Ū2)=(2n)1+(11/2)N1{1+o(1)}. Consequently, Ū222{(2n)1+(11/2)N1}{1+op(1)}. Furthermore, by Lemma 1, we have 21{R11+R12}=Op(1/n) and 21{R21+R22}=Op(1/n). As a result, 21k=12(R1k+R2k)=op(Ū2).

Next, we assume that for any 2kk1, we have Ūk22{(kn)1+(11/k)N1}{1+op(1)} and k1k˜=1k(R1k˜+R2k˜)=op(Ūk). This suggests that βkβ0κ1(1/nk+1/N){1+op(1)} for some κ1>0. By Lemma 2, we know that E(Ūk)=0 and var(Ūk)=1kn+(11/k)1N{1+o(1)}. Furthermore, by Lemma 3, we have 1kk=1k(R1k+R2k)=Op[(logk/n){1/(nk)+1/N}]=op(1/(nk)+1/N)=op(Ūk). The penultimate equality holds, as logk=o(n). As a result, we have proved βkβ0κ1(1/(nk)+1/N){1+op(1)}.

Step 2.2. Finally, by (B4), we know

by the results of Step 2.1 and Lemma 3, we have K1k=1K(R1k+R2k)=Op[(logK/n){1/(nK)+1/N}]1/2. In addition, by Lemma 2, we have EŪK=0, and var(ŪK)={1/(nK)+N1(11/K)}{1+o(1)}. This accomplishes the proof of Step 2.2. Combining the results of Steps 1 and 2, we finish the first part.

Part 2. In the second part, we verify the asymptotic normality of the SOS estimator. As from Part 1, we have verified that K1k=1K(R1k+R2k)=op(ŪK), it suffices to study ŪK. To this end, we decompose ŪK into ŪK=ŪK(1)+ŪK(2) with ŪK(1)=K1K=1K{n1˙𝒮k(β0)} and ŪK(2)=K1K=1K[{n1¨𝒮k(β0)}1]{n1˙𝒮k(β0)}. We then calculate the two terms separately.

We first consider ŪK(2). By the same analysis as in Lemma 2, it can be proved that E(ŪK(2))=0 and var(ŪK(2))={(nK)1+N1(11/K)}o(1)=o(1/N) when nK/N→∞ and K→∞. This suggests that ŪK(2)=op(1/N). Next, we prove the normality of √N·ŪK(1). To this end, it suffices to show that its characteristic function f(t)=E[exp(itK1K=1K{n1˙𝒮k(β0)}]→exp{t1t/2}. Denote L=M1m=1M{n1˙𝒯m(β0)}. Then, we have

where Z1=(nK)1/2k=1Ki𝒮k{˙i(β0)L} and Z2=NL. Subsequently, we consider the following three cases to prove the convergence of f(t).

Case 1. We first consider the case of N/(nK)0. Then, we have τ2nK=(1+nK/N). Note that by a similar analysis technique to that in Lemma 2, we have E(Z1)=0 and var(Z1)=O(1). Consequently, we have Z1=Op(1). This leads to exp{it(τnK)1Z1}p1. Accordingly, f(t) shares the same asymptotic limit with E[exp{it(τN)1Z2}]. Note that E[exp{it(τN)1Z2}]exp(t2/2) due to the following two reasons: (1) τ2N=(N/nK+1)1; and (2) Z2=N(Mn)1i=1Mn˙i(β0){1+op(1)}=N1/2i=1N˙i(β0){1+op(1)}dN(0,1) by the central limit theorem, as n/N0. This finishes the proof of Case 1.

Case 2. We next consider the case of N/(nK). Because τ2N=(N/nK+1)0, and Z2dN(0,), we should have (τN)1Z2p0 and it leads to E[exp{it(τN)1Z2}]1. Then, by the dominated convergence theorem, f(t) has the same asymptotic limit as E[exp{it(τnK)1Z1}]. This limit term could be verified to be exp(t2/2), due to the following two reasons: (1) τ2nK=(1+nK/N)1; (2) Z1dN(0,1) by the Central Limit Theorem. This finishes the proof of Case 2.

Case 3. We finally consider the case that N/(nK)κ for some constant κ>0. We then decompose f(t) into f1(t)+f(t)f1(t) with f1(t)=E[Δ˜2exp{it(τN)1Z2}] and f(t)f1(t)=E[(Δ˜1Δ˜2)exp{it(τN)1Z2}], where Δ˜1=E[exp{it(τnK )1Z1|𝒯}] and Δ˜2=expt1t/(2nKτ2). Since Z1dN(0,1), we then have Δ˜1Δ˜2p0 conditional on 𝒯. Thus by the dominated convergence theorem, we have f(t)f1(t)0. Consequently, f(t) shares the same asymptotical limit with f1(t). This implies that it suffices to verify that f1(t)exp(t1t/2). Note that τ2nK=1+1/κ. Then, we have Δ˜2exp[t1t/{2(1+1/C)}]. Meanwhile, as τ2N1+κ and Z2N(0,1), we then should have E[exp{it(τN)1Z2}]exp{t1t/2(1+κ)}. It can be verified that f1(t)exp(t1t/{2(1+κ)}t1t/[2{1+1/κ}])=exp(t1t/2). This completes the proof of Case 3 and Part 2. Combining the results of Part 1 and Part 2, we accomplish the whole theorem proof.

B.2 Proof of Proposition 1

To prove Proposition 1, we first verify that $\bar{\beta}_K^{\mathrm{OS}} - \beta_0 = \Delta_{\mathrm{os}} - \bar{U}_K$ with $\Delta_{\mathrm{os}} = O_p(1/n)$. Subsequently, by Theorem 1, this immediately leads to the asymptotic normality of $\bar{\beta}_K^{\mathrm{OS}}$.

Recall that $\bar{\beta}_K^{\mathrm{OS}} = K^{-1}\sum_{k=1}^{K} \hat{\beta}_{k,\mathrm{mle}}$. By Taylor's expansion, we know

Here Δos(k)={n1¨𝒮k(β0)}1{(β^k,mleβ0)𝒮k(θ0)(β^k,mleβ0)}, and {(β^k,mleβ0)𝒮k(θ0)(β^k,mleβ0)} is defined similarly to that in Equation (B1). Note that by Lemma 1, we have β^k,mleβ0=Op(1/n), and by Condition (C4), we know that Δos(k)=Op(1/n) is the bias term. Then, it holds that

where Δos=K1k=1KΔos(k). Furthermore, if one requires n2/N, then we have Δos=op(1/N)=op{1/(nK)+1/N}. Because we have verified that {1/(nK)+1/N}1/2ŪKdN(0,) in Appendix B.1, this accomplishes the whole proof.

B.3 Proof of Theorem 2

First, we consider the expectation of SE^2(βK). We have

where A1={k=1K(U(k)Ū)(U(k)Ū)} and A2=K{(ŪŪK)(ŪŪK)}, and c0=n{(nK)1+N1}. We next consider E(A1) and E(A2) separately. Given , all U(k)s can be seen as independent. Then, we derive the following:

(B5)

and we also have

(B6)

Combining the above, we have

(B7)

Second, we consider the bias of SE^2(βK). Together with (A7) and (B7), we have

By a similar proof technique to that for Lemma 3, we can conclude that R1+R2 is sufficiently small compared with ŪK. Thus, the desired result can be obtained. This completes the proof.

B.4 Proof of Theorem 3

The purpose of this proof is to verify (3). Recall that SE^2(βK)=n(K1)1(1/(nK)+1/N)k=1K(Û(k)Ū^k)(Û(k)Ū^k). Then we have

where B1=K1k=1K(Û(k)Ū)(Û(k)Ū), and B2=(Ū^kŪ)(Ū^kŪ). To verify Equation (3), it suffices to prove that nB1p, and nB2p0 since K/(K1)1. Then we consider analysing B1 and B2 separately.

It then could be verified that

where B11=K1k=1K(Û(k)U(k))(Û(k)U(k)), B12=K1k=1K(U(k)Ū)(U(k)Ū), and 𝒪=2K1k=1K(Û(k)U(k))(U(k)Ū) is the cross term. Next, we are going to investigate the three terms in the following two steps. First, we verify that B12 is the leading term with nB12p. In addition, we prove that B11 and 𝒪 are ignorable terms, more precisely, they are both of the order op(1/n).

Step 1. We first show that nB12p. Then the consistency can be verified in (1) E(nB12) and (2) var(nB12)0. We next calculate the expectation and variance separately.

(1) Proof ofE(nB12): Note that, B12=A1/K, where A1 is defined in Appendix B.3. Then by (B5), we have E(B12)=n1+O(1/N)=n1{1+o(1)}.

(2) Proof ofvar(nB12)0: Because var(B12)=var{E(B12)}+E{var(B12)}, we then investigate the two terms respectively. We compute var{E(B12)} first. It could be verified that

(B8)

Take variance on both sides of (B8), by the same technique used in Step (1) of Lemma 2, we have

(B9)

Here, the first inequality holds because there are $2\sum_{m=1}^{n-1}(M-m)$ pairs of $m_1$ and $m_2$ for which the covariance is not equal to zero. Furthermore, it can be verified that $\|E\{(U_m - \bar{U})(U_m - \bar{U})^\top\}\|^2 \le 8\|E\{U_m U_m^\top\}\|^2 = O(1/n^2)$. By a similar analysis to (A6), (B9) can be rewritten as $\mathrm{var}\{E(B_{12})\} = O(n/M)\, O(1/n^2) = o(1/n^2)$.

We next calculate E{var(B12)}. Because var(B12)=K1var{(U(k)Ū)(U(k)Ū)}, it could be proved that

Combining the above results, we have verified that var(nB12)0.

Step 2. We next show that nB11p0.

By definition of Û(k), we have Û(k)=[{n1¨𝒮k(βk)}1{n1¨𝒮k(β0)}1]{n1˙𝒮k(βk)}+{n1¨𝒮k(β0)}1{n1˙𝒮k(βk)}. Then it could be shown that

where $U^{(k1)} = [\{n^{-1}\ddot{\ell}_{\mathcal{S}_k}(\bar{\beta}_k)\}^{-1} - \{n^{-1}\ddot{\ell}_{\mathcal{S}_k}(\beta_0)\}^{-1}]\{n^{-1}\dot{\ell}_{\mathcal{S}_k}(\bar{\beta}_k)\}$ and $U^{(k2)} = \{n^{-1}\ddot{\ell}_{\mathcal{S}_k}(\beta_0)\}^{-1}\{n^{-1}\dot{\ell}_{\mathcal{S}_k}(\bar{\beta}_k) - n^{-1}\dot{\ell}_{\mathcal{S}_k}(\beta_0)\}$. Consequently, to verify $nB_{11} \to_p 0$, it suffices to prove that $nK^{-1}\sum_{k=1}^{K} U^{(k1)} U^{(k1)\top} \to_p 0$ and $nK^{-1}\sum_{k=1}^{K} U^{(k2)} U^{(k2)\top} \to_p 0$. We then analyse the two terms separately.

(1) Proof ofnK1k=1KU(k1)U(k1)p0: Note that, U(k1) could be further re-written as U(k1)=[{n1¨𝒮k(βk)}1{n1¨𝒮k(β0)}1]{n1˙𝒮k(βk)n1˙𝒮k(β0)}+[{n1¨𝒮k(βk)}1{n1¨𝒮k(β0)}1]n1˙𝒮k(β0). By condition (C4), we have U(k1)Cmaxkλmax(^k)βkβ02+Cn1˙𝒮k(β0)βkβ0. Furthermore, by the same analytical skills used in Lemma 3, it could be proved that K1k=1KU(k1)U(k1)=Op(n1{(nK)1+N1}logK)=op(1/n). This infers nK1k=1KU(k1)U(k1)p0.

(2) Proof ofnK1k=1KU(k2)U(k2)p0: Similarly, because Uk2{maxkλmax(^k)}2βkβ0, we could verify that k=1KU(k2)U(k2)=Op(n1{(nK)1+N1}logK)=op(1/n).

Combining the two above proofs, we then finish Step 2. Subsequently, by the Cauchy–Schwarz inequality, we have n𝒪p0. This completes the proof of (K1)1(nK)B1p.

Finally, we calculate $B_2$. By a proof technique similar to that used in analysing $B_1$, we can conclude that $B_2 = (\bar{U}_K - \bar{U})(\bar{U}_K - \bar{U})^\top + o_p(1/n)$. Then, by (B6), we have $E\{(\bar{U}_K - \bar{U})(\bar{U}_K - \bar{U})^\top\} = (nK)^{-1}\{1 + o(1)\}$. As a consequence, it can be shown that $E\{n(\bar{U}_K - \bar{U})(\bar{U}_K - \bar{U})^\top\} \to 0$, which leads to $nB_2 \to_p 0$. The desired results are thus obtained. This completes the proof of the theorem.

Author notes

Danyang Huang and Shuyuan Wu have contributed equally to this study.

