Abstract

Customer churn is one of the most important concerns for large companies. Currently, massive data are often encountered in customer churn analysis, which bring new challenges for model computation. To cope with these concerns, sub-sampling methods are often used to accomplish large-scale data analysis tasks. To cover more informative samples in one sampling round, classic sub-sampling methods need to compute sampling probabilities for all data points. However, this creates a huge computational burden for large-scale data sets and is therefore not applicable in practice. In this study, we propose a sequential one-step (SOS) estimation method based on repeated sub-sampling data sets. In the SOS method, data points need to be sampled only with uniform probabilities, and the sampling step is conducted repeatedly. In each sampling step, a new estimate is computed via one-step updating based on the newly sampled data points. This leads to a sequence of estimates, of which the final SOS estimate is the average. We theoretically show that both the bias and the standard error of the SOS estimator can decrease with increasing sub-sampling sizes or sub-sampling times. The finite sample performance of SOS is assessed through simulations. Finally, we apply the SOS method to analyse a real large-scale customer churn data set from a securities company. The results show that the SOS method has good interpretability and prediction power in this real application.

1 INTRODUCTION

Customer churn is one of the most important concerns for large companies (Ascarza et al., 2018; Kayaalp, 2017). With increasingly fierce competition, it is important for companies to retain existing customers and prevent potential customer churn. Customer churn often refers to a situation in which customers no longer buy or use a company's products or services. For each customer, to churn or not to churn is a typical binary indicator. Therefore, customer churn prediction is often considered as a classification task, and several classifier models have been applied to address this issue (Ahmad et al., 2019). Among the previous models, logistic regression is widely used owing to its good interpretability (Ahn et al., 2019; Maldonado et al., 2021). Therefore, in this work, we employ logistic regression models to handle the task of customer churn analysis.

With the arrival of the ‘Big Data’ era, massive data are often encountered in customer churn analysis. For example, the real data analysed in this work comprise 12 million transaction records from 230,000 customers, which take up 300 GB in total. Such large amounts of data present great challenges for customer churn analysis. The first concern is memory constraint. The data can be too large to be stored in a computer's memory, and hence, have to remain stored on a hard drive. The second challenge is computational cost. For massive data, even a simple analysis (e.g., mean calculation) can take a long time. These challenges create large barriers for customer churn analysis with massive data.

To cope with these challenges, modern statistical analysis for massive data is developing fast. Basically, there are two streams of approaches for massive data. The first stream considers storing data in a distributed system and then applying the ‘divide-and-conquer’ strategy to accomplish data analysis tasks of huge scales. See McDonald et al. (2009), Lee et al. (2017), Battey et al. (2018), Jordan et al. (2019), Zhu et al. (2021), Wang et al. (2021), and the references therein. Another stream uses sub-sampling techniques that conduct calculations on sub-samples drawn from the whole data set to reduce both memory cost and computational cost. Important works include, but are not limited to, Dhillon et al. (2013), Ma et al. (2015), Wang et al. (2018), Quiroz et al. (2019), Ma et al. (2020), and Yu et al. (2020).

It is remarkable that there are big differences between the two streams of approaches. The distributed methods usually require dedicated equipment support. For example, a distributed system is often required, which consists of one central computer serving as the Master and all other computers serving as Workers. The goal of distributed methods is to obtain an estimator based on the whole sample. By contrast, the sub-sampling methods can be conducted on a single computer. They focus on approximating the whole sample estimator based on sub-samples with limited resources. In this customer churn application, we do not have equipment support for distributed analysis. Therefore, in this work, we focus on sub-sampling methods to handle customer churn analysis with massive data.

Classic sub-sampling methods require only one round of data sampling. To cover more informative samples in the single sampling round, non-uniform sampling probabilities are often specified for each data point, so that more informative data points can be sampled with higher probabilities. Typical approaches include leverage score-based sub-sampling (Drineas et al., 2011; Ma et al., 2015; Ma & Sun, 2015), information-based optimal sub-data selection (Wang et al., 2019), the optimal Poisson sampling method, and its distributed implementation (Wang et al., 2021), among others. These sub-sampling estimators have been proved to be consistent and asymptotically normal under appropriate regularity conditions; see Ma et al. (2020) for an important example. However, for these non-uniform sub-sampling methods, the step of evaluating non-uniform sampling probabilities for the whole data set would create a huge computational burden. For example, Wang et al. (2018) proposed optimal sub-sampling methods motivated by the A-optimality criterion (OSMAC) for large sample logistic regression. To find the optimal sub-sampling probabilities in the OSMAC method, the computational complexity is O(Nd) for a data set with N observations and d-dimensional covariates. Consequently, this optimal sub-sampling algorithm could be computationally very expensive when N is very large. In the customer churn application, the whole data size is about 12 million. It would not be computationally feasible for the previous sub-sampling methods to handle this task.

An obvious way to address this problem is to sample sub-data with uniform probabilities while performing the sub-sampling step repeatedly. By sampling with uniform probabilities, we are freed from computing probabilities for the whole data set in advance, which largely alleviates the computational burden. By sampling repeatedly, the union of the sampled data comes close to the whole data set. In an ideal situation, if we were to conduct sub-sampling without replacement, then sampling N/n times would cover the whole data set, where N and n are the sizes of the whole data and the sub-data, respectively. In this sense, repeated sub-sampling plays a role similar to storing the whole data in a distributed system.

The sub-sampling cost itself is not negligible, especially for repeated sub-sampling methods. Note that the time needed to sample one data point from the hard drive consists mainly of two parts. The first is the addressing cost, which is the time taken to locate the target data point on the hard drive. The second is the I/O cost, which is the time needed to read the target data point into the computer memory. Both are hard drive sampling costs that cannot be ignored when sub-sampling methods are applied to massive data sets. Therefore, inspired by Pan et al. (2020), we adopt the sequential addressing sampling method to reduce the hard drive sampling cost. When the data are randomly distributed on a hard drive, only one starting point needs to be selected for each sub-sample, and the remaining data points can be read sequentially from that starting point. In this way, the addressing cost of sub-sampling can be reduced substantially.

Based on the repeatedly sub-sampled data sets from the whole customer churn data, it is worthwhile to consider how to obtain a customer churn estimator that is efficient in terms of both statistical properties and computational cost. To this end, we propose a sequential one-step (SOS) method, whose estimation bias and variance can both be reduced by increasing the total number of sub-sampling times K. Specifically, in the first sub-sampling step, we obtain an estimate $\hat\beta_1$ based on the first sub-data using, for example, the traditional Newton–Raphson algorithm. Then, in the second sub-sampling step, we regard $\hat\beta_1$ as the initial value and conduct only one-step updating based on the second sub-data. This leads to $\hat\beta_2$. In the next step, the average of the first two estimates, that is, $\bar\beta_2=(\hat\beta_1+\hat\beta_2)/2$, is regarded as the initial value, and one-step updating based on the third sub-data is conducted to obtain $\hat\beta_3$. In general, in the $(k+1)$th step, the average of all previous estimates (i.e., $\bar\beta_k=k^{-1}\sum_{l=1}^{k}\hat\beta_l$) serves as the initial value, and one-step updating based on the newly sampled sub-data is conducted to obtain the estimate $\hat\beta_{k+1}$. The final estimate after K sub-sampling steps is the average of all estimates, that is, $\hat\beta_{SOS}=K^{-1}\sum_{k=1}^{K}\hat\beta_k$.

It is noteworthy that, except for the first sub-sampling step, SOS requires only one round of updating in the subsequent sub-sampling steps. Therefore, it is computationally efficient. We also establish theoretical properties of the SOS estimator. We prove that both the bias and variance of the SOS estimator can be reduced as the sampling times K increase. We conduct extensive numerical studies based on simulated data sets to verify our theoretical findings. Finally, the SOS method is applied to the customer churn data set in a securities company to demonstrate its application.

The rest of this article is organised as follows. Section 2 introduces the SOS method. Section 3 presents simulation analysis to demonstrate the finite sample performance of the SOS estimators. Section 4 presents a real application for customer churn analysis using the SOS method. Section 5 concludes with a brief discussion.

2 SOS ESTIMATOR BY SUB-SAMPLING

2.1 Basic notations

Assume the whole sample is indexed by $\mathcal{S}=\{1,2,\ldots,N\}$, where N is the whole sample size. Define $(Y_i,X_i)$ to be the observation collected from the ith ($1\le i\le N$) subject, where $Y_i\in\mathbb{R}^1$ is the response and $X_i\in\mathbb{R}^p$ is the associated predictor. Conditional on $X_i$, assume that the $Y_i$s are independently distributed with density function $f(Z_i;\beta)$, where $Z_i=(Y_i,X_i^\top)^\top$, $\beta\in\Theta$ is the unknown parameter, and $\Theta$ is an open subset of $\mathbb{R}^d$. We assume $p=d$ for convenience. To estimate the unknown parameter $\beta$, the log-likelihood function can be spelled out as

$\mathcal{L}_{\mathcal{S}}(\beta)=\sum_{i\in\mathcal{S}}\ell(Z_i;\beta),$

where $\ell(\cdot;\beta)=\log f(\cdot;\beta)$ is the log-likelihood contribution of a single observation. For convenience, we use $\ell(\beta)$ to denote $\ell(Z_i;\beta)$ hereafter.

Note that when N is quite large, the whole data set is often kept on a hard drive and cannot be read into the memory as a whole. It would then be time consuming to randomly select a sample from the hard drive into the computer memory. To save time, we apply the sequential addressing sub-sampling (SAS) method (Pan et al., 2020) to conduct the sub-sampling directly on the hard drive rather than in memory. To apply the SAS method, the whole data should be randomly distributed on the hard drive. Otherwise, a shuffle operation is needed to make the data randomly distributed. The detailed implementation of the shuffle operation can be found in Appendix A in Data S1. The sub-sampling steps can then be performed iteratively on the randomly distributed data to obtain sub-samples. Next, we describe the SAS method in detail.

First, one data point is randomly selected on the hard drive as a starting point, indexed by $m$ ($1\le m\le N-n+1$), where n is the desired sub-data size. This incurs only marginal addressing cost, as only one data point needs to be located. Second, with the starting point fixed, the sub-sample of size n is selected sequentially with indexes $\{m,m+1,\ldots,m+n-1\}\subset\mathcal{S}$. The selected sub-sample is collected as $\mathcal{T}_m=\{(X_m,Y_m),\ldots,(X_{m+n-1},Y_{m+n-1})\}$. Except for the starting point indexed by m, the remaining data points are read sequentially. Therefore, no additional addressing cost is required, and the total sampling cost can be reduced substantially.
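To make the SAS step concrete, the following is a minimal sketch in Python, assuming the (already shuffled) data are stored on disk as a fixed-width array that can be memory mapped; the file name, dimensions, and helper names are illustrative rather than the authors' implementation.

```python
# A minimal sketch of sequential addressing sub-sampling (SAS).
import numpy as np

N, p = 150_000, 6                      # rows and columns (response + covariates); toy sizes
rng = np.random.default_rng(0)

# Create a toy on-disk data set for illustration.
data = np.lib.format.open_memmap("toy_churn.npy", mode="w+",
                                 dtype=np.float64, shape=(N, p))
data[:] = rng.standard_normal((N, p))
data.flush()

def sas_subsample(path, N, n, rng):
    """Draw one SAS sub-sample: pick a single random starting index m and
    read the n consecutive rows {m, ..., m+n-1} from the memory-mapped file."""
    mm = np.load(path, mmap_mode="r")          # data stay on disk
    m = rng.integers(0, N - n + 1)             # one random addressing operation
    return np.asarray(mm[m:m + n])             # sequential read of n rows

sub = sas_subsample("toy_churn.npy", N, n=400, rng=rng)
print(sub.shape)                               # (400, p)
```

Only the starting index requires random access; the remaining n − 1 rows are read as one contiguous block, which is what keeps the addressing cost marginal.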

It is notable that, although the SAS method serves as a preparatory step for the SOS method, there are fundamental differences between SAS and SOS. The SAS method is a sub-sampling technique, which samples data directly from the hard drive and saves much of the addressing cost. Based on SAS sub-samples, Pan et al. (2020) also study the theoretical properties of some basic statistics (e.g., the sample mean). In SOS, by contrast, we focus on the estimation problem of logistic regression and discuss an updating strategy that exploits the sub-samples obtained by SAS. It is remarkable that the SAS method only serves as a tool for fast sampling, and the theoretical properties of the SOS estimator can still be guaranteed without this sampling step.

2.2 SOS estimator

Assume the whole data set is randomly distributed on the hard drive. Then, the SAS method can be applied for fast sub-sampling. Recall that the sub-sample size is n. By the SAS method, a total of $M=N-n+1$ different sequential sub-samples can be generated. Assume that the sub-sampling is repeated K times. In the kth ($1\le k\le K$) sub-sampling, the sub-sample $\mathcal{T}^{(k)}\in\{\mathcal{T}_1,\ldots,\mathcal{T}_M\}$ is selected with replacement. We further denote by $\mathcal{S}_k$ the indexes of the data points in $\mathcal{T}^{(k)}$. Based on $\mathcal{S}_k$ with $1\le k\le K$, we propose the SOS method for efficient estimation of $\beta$.

To obtain the SOS estimator, an initial estimator $\hat\beta_1$ is first calculated based on one of the SAS sub-samples, whose index set is denoted by $\mathcal{S}_1$. For example, it may be the maximum likelihood estimator (MLE), and we set $\bar\beta_1=\hat\beta_1$. Then, in the $(k+1)$th sub-sampling step with $1\le k\le K-1$, a new SAS sub-sample $\mathcal{T}^{(k+1)}$ is obtained. Based on the $(k+1)$th sub-sample, we conduct the following two steps iteratively to obtain the SOS estimator. The details of the SOS method are shown in Algorithm 1. Recall that we have assumed the whole data are already shuffled or randomly distributed before the SOS procedure begins.

Step 1: One-step update. Assume that the initial estimator in this step is $\bar\beta_k$. Then, we conduct one-step updating based on $\bar\beta_k$ to obtain the one-step updated estimator $\hat\beta_{k+1}$, that is,

$\hat\beta_{k+1}=\bar\beta_k-\{\ddot{\mathcal{L}}_{\mathcal{S}_{k+1}}(\bar\beta_k)\}^{-1}\dot{\mathcal{L}}_{\mathcal{S}_{k+1}}(\bar\beta_k), \qquad (1)$

where $\dot{\mathcal{L}}_{\mathcal{S}_{k+1}}(\bar\beta_k)$ and $\ddot{\mathcal{L}}_{\mathcal{S}_{k+1}}(\bar\beta_k)$ are the first- and second-order derivatives of the log-likelihood function based on the $(k+1)$th sub-sample, respectively.

Step 2: Average. The SOS sub-sampling estimator for the $(k+1)$th step is calculated as the average of all sub-sample estimates obtained so far, that is, $\bar\beta_{k+1}=(k+1)^{-1}\sum_{l=1}^{k+1}\hat\beta_l=\{k\bar\beta_k+\hat\beta_{k+1}\}/(k+1)$.

Conduct Steps 1 and 2 iteratively. The estimator obtained in the Kth step is the final SOS estimator, that is, $\hat\beta_{SOS}=\bar\beta_K$. See Figure 1 for an illustration of calculating the SOS estimator based on the SAS sub-sampling method.
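For illustration, the following is a minimal Python sketch of the SOS procedure for logistic regression, assuming a draw_subsample() helper that returns one SAS sub-sample (X, y); the function names are hypothetical and the sketch is not the authors' implementation.

```python
# A minimal sketch of the SOS estimator for logistic regression.
import numpy as np

def logistic_score_hessian(beta, X, y):
    """Gradient and Hessian of the logistic log-likelihood on one sub-sample."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    score = X.T @ (y - p)                        # first-order derivative
    hess = -(X * (p * (1 - p))[:, None]).T @ X   # second-order derivative
    return score, hess

def newton_mle(X, y, tol=1e-8, max_iter=100):
    """Full Newton-Raphson MLE, used only for the first sub-sample."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        score, hess = logistic_score_hessian(beta, X, y)
        step = np.linalg.solve(hess, score)
        beta = beta - step
        if np.max(np.abs(step)) < tol:
            break
    return beta

def sos_estimate(draw_subsample, K):
    """draw_subsample() returns (X, y) for one SAS sub-sample."""
    X1, y1 = draw_subsample()
    beta_hat = [newton_mle(X1, y1)]              # hat{beta}_1
    beta_bar = beta_hat[0]                       # bar{beta}_1
    for k in range(1, K):
        Xk, yk = draw_subsample()
        score, hess = logistic_score_hessian(beta_bar, Xk, yk)
        beta_new = beta_bar - np.linalg.solve(hess, score)   # one-step update (1)
        beta_hat.append(beta_new)
        beta_bar = np.mean(beta_hat, axis=0)     # running average bar{beta}_{k+1}
    return beta_bar                              # final SOS estimator
```

Only the first sub-sample is fitted to full convergence; every later sub-sample contributes a single Newton–Raphson step started from the current running average, which is what keeps the per-step cost low.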

Algorithm 1. SOS estimation algorithm

FIGURE 1 Illustration of the sequential one-step estimator based on random addressing sub-sampling

Remark: We offer two remarks about the SOS estimator. First, for each sub-sample, only the one-step update in (1) is conducted. This yields only marginal computational cost because (i) the sub-sample size n is relatively small; and (ii) no Newton–Raphson-type iterations are involved. It is notable that there is no need to run the Newton–Raphson algorithm to full convergence on each sub-sample. Once it converges, the Newton–Raphson algorithm is not affected by the initial value. Therefore, if the sub-sample estimator $\hat\beta_{k+1}$ were obtained by running the Newton–Raphson algorithm to full convergence rather than by one-step updating, the initial value $\bar\beta_k$ would not affect the resulting estimator in the $(k+1)$th sub-sample. Consequently, the SOS estimator would degenerate to the one-shot (OS) estimator, which we compare theoretically in the next sub-section. Second, each $\bar\beta_k$ can be viewed as the average of the one-step estimators of the form (1) over the first k steps. This leads to some nice properties: (i) a total of K sub-samples are used in the estimation; and (ii) the standard error of the final estimator is reduced by averaging. More weighting schemes could be considered in the SOS updating strategy; see the weighting scheme in the aggregated estimating equation (AEE) method (Lin & Xi, 2011) for an example.

It is also notable that our proposed SOS method extends the classical one-step estimator (Shao, 2003; Zou & Li, 2008) to the sub-sampling setting. For example, Zhu et al. (2021) developed a distributed least squares approximation (DLSA) method to solve regression problems on distributed systems. In the DLSA method, once the weighted least squares estimator (WLSE) is obtained by the Master, it is broadcast to all Workers. Each Worker then performs a one-step iteration using the WLSE as the initial value to obtain a new estimator. Another related work is Wang et al. (2021), who propose a one-step upgraded pilot method for non-uniformly and non-randomly distributed data. However, both methods are divide-and-conquer (DC) type methods, which are quite different from sub-sampling methods. One typical difference is that, in DC methods, the estimators from different Workers can be regarded as independent. In our SOS method, however, the estimators $\hat\beta_1,\ldots,\hat\beta_K$ are obtained sequentially, which makes them dependent on each other. This creates challenges in investigating the asymptotic theory of the SOS estimator. More differences between divide-and-conquer type methods and sub-sampling methods can be found in Appendix B in Data S1.

2.3 Theoretical properties of the SOS estimator

We further investigate the properties of the SOS estimator in this subsection. To establish the theoretical properties of the SOS estimator, assume that Θ is an open subset in the Euclidean space, and we have the following conditions.

  • (C1)

    The sub-sample size n, the whole sample size N, and the number of sub-sampling steps K satisfy that, as $n\to\infty$, $n/N\to 0$, $K\to\infty$, and $\log K=o(n)$.

  • (C2)

    Assume that the first- and second-order derivatives of the log-likelihood $\ell(\beta)$ satisfy $E\{\partial\ell(\beta)/\partial\beta_j\}=0$ and $E\{-\partial^2\ell(\beta)/\partial\beta_{j_1}\partial\beta_{j_2}\}=E[\{\partial\ell(\beta)/\partial\beta_{j_1}\}\{\partial\ell(\beta)/\partial\beta_{j_2}\}]$, for $1\le j,j_1,j_2\le p$.

  • (C3)

    Assume that $\Sigma=E[\{\partial\ell_i(\beta)/\partial\beta\}\{\partial\ell_i(\beta)/\partial\beta\}^\top]$ is finite and positive definite at $\beta=\beta_0$, where $\beta_0$ is the true parameter.

  • (C4)

    There is an open subset $\omega$ of $\Theta$ containing the true parameter $\beta_0$, such that for all $Z_i$s, the third-order derivative $\partial^3 f(Z_i;\beta)/\partial\beta_{j_1}\partial\beta_{j_2}\partial\beta_{j_3}$ exists for all $\beta\in\omega$ and $1\le j_1,j_2,j_3\le p$. Moreover, assume a function $M(\cdot)$ exists such that $E\{M(Y_i)\}<C$, where C is a constant. For $\beta\in\omega$ and $1\le j_1,j_2,j_3\le p$, we have $|\partial^3\ell(Y_i;\beta)/\partial\beta_{j_1}\partial\beta_{j_2}\partial\beta_{j_3}|\le M(Y_i)$.

  • (C5)

    Assume the covariates $X_{ij}$s independently follow Gaussian distributions.

Condition (C1) restricts the relationships between (n, K) and (n, N). Under this condition, the number of sub-sampling times K should not grow too fast, in the sense that $\log K=o(n)$, and the sub-sampling size n should not increase too fast, in the sense that $n/N\to 0$. Conditions (C2)–(C4) are standard regularity conditions. They are commonly adopted to guarantee asymptotic normality of ordinary maximum likelihood estimates; see, for example, Lehmann and Casella (1998) and Fan and Li (2001). Condition (C5) is a classical assumption on the covariates (Wang, 2009).

With these conditions satisfied, we can establish the properties of $\hat\beta_{SOS}$, which equals $\bar\beta_K$. Define $U^{(k)}=\{n^{-1}\ddot{\mathcal{L}}_{\mathcal{S}_k}(\beta_0)\}^{-1}\{n^{-1}\dot{\mathcal{L}}_{\mathcal{S}_k}(\beta_0)\}$ and $\bar U_K=K^{-1}\sum_{k=1}^{K}U^{(k)}$. Then, the following theorem holds.

Theorem 1

Assume conditions (C1)–(C5) hold. Then, we have (1) $\bar\beta_K-\beta_0=\bar U_K+\Delta$, with $E(\bar U_K)=0$, $\mathrm{var}(\bar U_K)=\{1/(nK)+1/N\}\Sigma^{-1}\{1+o(1)\}$, and $\Delta=O_p\big([(\log K/n)\{1/(nK)+1/N\}]^{1/2}\big)$; and (2) $\{1/(nK)+1/N\}^{-1/2}(\bar\beta_K-\beta_0)\to_d N(0,\Sigma^{-1})$.

The proof of Theorem 1 is in Appendix B.1. As shown in Theorem 1, the difference between the SOS estimator $\bar\beta_K$ and the true parameter $\beta_0$ is separated into two parts, a bias term and a variance term. Note that both the bias term $\Delta$ and the variance term $\mathrm{var}(\bar\beta_K)$ are determined by two main components. The first component is related to the whole sample size N; this component cannot be removed by the SOS procedure. The second component is $(nK)^{-1}$, which is driven by the SOS procedure and decreases with larger K or n. In addition, the SOS estimator is asymptotically normal with asymptotic variance $\{1/(nK)+1/N\}\Sigma^{-1}$. In particular, when nK is much larger than N, it achieves the same statistical efficiency as the global estimator. Note that the practical demand for estimation precision is usually limited, whereas the budget for sampling cost is valuable. It may therefore be appealing to sacrifice some statistical efficiency for lower sampling cost. In practice, we thus expect the SOS method to be implemented with reasonably large n and K (i.e., $nK\ll N$) as long as the desired statistical precision is achieved.

For theoretical comparison, we introduce a simple alternative method. For each sub-sample $\mathcal{S}_k$, we separately compute the MLE $\hat\beta_{k,mle}$. Then, all sub-sample estimators are simply averaged to obtain the OS estimator, denoted by $\bar\beta_K^{OS}=K^{-1}\sum_{k=1}^{K}\hat\beta_{k,mle}$. We obtain the following conclusion.
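As a quick illustration, and reusing the hypothetical newton_mle and draw_subsample helpers from the SOS sketch above, the OS estimator simply averages K independently and fully fitted sub-sample MLEs:

```python
import numpy as np

def os_estimate(draw_subsample, K):
    """One-shot (OS) estimator: average K fully converged sub-sample MLEs."""
    return np.mean([newton_mle(*draw_subsample()) for _ in range(K)], axis=0)
```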

Proposition 1

For the OS estimator, under the same conditions as in Theorem 1, we have $\bar\beta_K^{OS}-\beta_0=\bar U_K+\Delta_{os}$, where $\Delta_{os}=O_p(1/n)$. Further assume $n^2/N\to\infty$; then, we have $\{1/(nK)+1/N\}^{-1/2}(\bar\beta_K^{OS}-\beta_0)\to_d N(0,\Sigma^{-1})$.

The proof of Proposition 1 is in Appendix B.2. Comparing Theorem 1 and Proposition 1, we find that the leading terms for the variance of the SOS and OS estimators are identical. However, the bias term of the OS estimator is of order $O_p(1/n)$, which cannot be improved as K increases. By contrast, the bias of the SOS estimator is $O_p\big([(\log K/n)\{1/(nK)+1/N\}]^{1/2}\big)$, which can be reduced significantly as K increases. Therefore, compared with the SOS estimator $\bar\beta_K$, the OS estimator $\bar\beta_K^{OS}$ requires the more stringent condition $n^2/N\to\infty$ to achieve the same asymptotic normality as the global estimator (Huang & Huo, 2019; Jordan et al., 2019). This condition is not necessary for the SOS estimator.

Next, to enable automatic inference, we discuss estimation of the standard error of the SOS estimator. Specifically, based on the SOS procedure, we construct the following statistic:

(2)

The properties of $\widehat{SE}^2(\bar\beta_K)$ are presented in the following theorem.

Theorem 2

Under the same conditions as in Theorem 1, we have

The proof of Theorem 2 is in Appendix B.3. We conclude that the leading orders of $\mathrm{var}(\bar\beta_K)$ and $\widehat{SE}^2(\bar\beta_K)$ are the same. However, as an estimator of $\mathrm{var}(\bar\beta_K)$, $\widehat{SE}^2(\bar\beta_K)$ is biased, although the order of the bias is $O(n/N^2)$, which decreases as N increases or n decreases. In practice, the unknown parameter $\beta_0$ in $U^{(k)}$ can be replaced by $\bar\beta_k$ to obtain $\hat U^{(k)}$ and $\hat{\bar U}_K=K^{-1}\sum_{k=1}^{K}\hat U^{(k)}$. Then, $\widehat{SE}^2(\bar\beta_K)$ can be calculated based on $\hat U^{(k)}$ and $\hat{\bar U}_K$ in the form of (2). Note that, by Theorem 1, the leading term of the variance of $\bar\beta_K$ is $\{1/(nK)+1/N\}$. This term can be consistently estimated by the proposed estimator $\widehat{SE}(\bar\beta_K)$. Its asymptotic property is presented in the following theorem.

Theorem 3

Under the same conditions in Theorem 1, we have

(3)

The proof of Theorem 3 is in Appendix B.4. Combining the results of Theorems 1 and 3, we immediately obtain that $\{\widehat{SE}(\bar\beta_K)\}^{-1}(\bar\beta_K-\beta_0)\to_d N(0,I_p)$. As a result, both the estimator and the corresponding statistical inference can be easily and efficiently derived by our SOS procedure. We illustrate the performance of the SOS estimator and $\widehat{SE}^2(\bar\beta_K)$ in the next section.

3 SIMULATION STUDIES

3.1 Simulation design

To demonstrate the finite sample performance of the SOS estimator, we present a variety of simulation studies. Assume that the whole data set contains N = 150,000 observations. For i = 1, ..., N, we generate each observation $(X_i,Y_i)$ under a logistic regression model. We choose the logistic regression model because it is the model used for our customer churn analysis. Because the SOS method extends easily to other generalised linear models, we also use the Poisson regression model as an example to test the effectiveness of the SOS method. The specific settings for the two model examples are as follows.

Example 1

(Logistic regression)

Logistic regression is used to model binary responses. In this example, we consider p = 5 exogenous covariates $X_i=(X_{i1},X_{i2},X_{i3},X_{i4},X_{i5})^\top$, where each covariate is generated from a standard normal distribution N(0, 1). We set the coefficients for $X_i$ as $\beta=(0,0.2,0.1,0.1,0.2)^\top$. Then, the response $Y_i$ ($1\le i\le N$) is generated from a Bernoulli distribution with the success probability given as

$P(Y_i=1\mid X_i)=\exp(X_i^\top\beta)/\{1+\exp(X_i^\top\beta)\}.$

Example 2

(Poisson regression)

Poisson regression is used to deal with count responses. We also consider p = 5 exogenous covariates, all generated from the standard normal distribution. The corresponding coefficients are set as $\beta=(3,2,1,1,2)^\top$. Then, the response $Y_i$ ($1\le i\le N$) is generated from a Poisson distribution given as

$Y_i\mid X_i\sim\mathrm{Poisson}(\lambda_i)$ with $\lambda_i=\exp(X_i^\top\beta).$

After obtaining the N observations, we consider combinations of different sub-sampling sizes and different sub-sampling times. In both the logistic and Poisson regression examples, we set n = (100, 200, 400). In the logistic regression example, we consider cases of small K and set K = (10, 20, 30, 40, 50, 100). In the Poisson regression example, we consider cases of large K and set K = (100, 200, 300, 400, 500, 1000). For each combination of n and K, we repeat the experiment B = 1000 times.
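The following is a minimal sketch of the two data-generating designs in Python, assuming the standard logistic and log links; the coefficient values are taken from the text above and are used purely for illustration.

```python
# A sketch of the simulation designs in Examples 1 and 2.
import numpy as np

rng = np.random.default_rng(2024)
N, p = 150_000, 5
X = rng.standard_normal((N, p))                 # standard normal covariates

# Example 1: logistic regression
beta_logit = np.array([0.0, 0.2, 0.1, 0.1, 0.2])
prob = 1.0 / (1.0 + np.exp(-X @ beta_logit))    # standard logistic link (assumed)
y_logit = rng.binomial(1, prob)

# Example 2: Poisson regression
beta_pois = np.array([3.0, 2.0, 1.0, 1.0, 2.0]) # coefficients as printed above
lam = np.exp(X @ beta_pois)                     # log link (assumed)
y_pois = rng.poisson(lam)
```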

3.2 Comparison with repeated sub-sampling methods

In this sub-section, we compare the proposed SOS estimator with the OS estimator, which is representative of repeated sub-sampling methods. Specifically, in each simulated data set, we obtain the SOS estimator using Algorithm 1. For the OS estimator, we first randomly obtain K sub-data sets and then apply the Newton–Raphson method to each sub-data set independently. In the kth sub-data set, we set the initial value as $\beta_{ini}=(0,0,0,0,0)^\top$ and run the Newton–Raphson method to full convergence to obtain the estimate $\hat\beta_{k,mle}$. The final OS estimator is calculated as $\bar\beta_K^{OS}=K^{-1}\sum_{k=1}^{K}\hat\beta_{k,mle}$. For each method (i.e., SOS and OS), we define $\hat\beta^{(b)}=(\hat\beta_j^{(b)})_{j=1}^{p}$ as the estimator in the bth ($1\le b\le B$) replication. To evaluate the estimation efficiency of each estimator, we calculate the bias as $|\bar\beta^{*}-\beta|$, where $\bar\beta^{*}=B^{-1}\sum_b\hat\beta^{(b)}$ is the Monte Carlo average. For the SOS method, the standard error ($\widehat{SE}^{(b)}$) can be estimated based on Theorem 2, and we report the average $\widehat{SE}=B^{-1}\sum_b\widehat{SE}^{(b)}$. We then compare $\widehat{SE}$ with the Monte Carlo SD of $\hat\beta^{(b)}$, which is calculated as $SE=\{B^{-1}\sum_b(\hat\beta^{(b)}-\bar\beta^{*})^2\}^{1/2}$. Next, we construct a 95% confidence interval for $\beta$ as $CI^{(b)}=(\hat\beta^{(b)}-z_{0.975}\widehat{SE}^{(b)},\hat\beta^{(b)}+z_{0.975}\widehat{SE}^{(b)})$, where $z_\alpha$ is the $\alpha$th lower quantile of the standard normal distribution. The empirical coverage probability is then computed as $ECP=B^{-1}\sum_b I(\beta\in CI^{(b)})$, where $I(\cdot)$ is the indicator function. Last, we compare the computational efficiency of the two methods. Note that, for a fixed sample size n, the computational time consumed by each Newton–Raphson iteration is the same for the SOS and OS methods. Therefore, we use the average number of Newton–Raphson iterations to compare the computational efficiency of the SOS and OS methods.
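As a small worked example, the sketch below computes these Monte Carlo summaries for a single coefficient, assuming est[b] and se_hat[b] hold the point estimate and estimated standard error from replication b (the names are illustrative):

```python
# Monte Carlo evaluation metrics: bias, SD, average estimated SE, and coverage.
import numpy as np
from scipy.stats import norm

def evaluate(est, se_hat, beta_true, level=0.95):
    est, se_hat = np.asarray(est), np.asarray(se_hat)
    z = norm.ppf(0.5 + level / 2)                      # z_{0.975} for 95% CIs
    bias = np.abs(est.mean() - beta_true)              # |Monte Carlo mean - truth|
    sd = np.sqrt(np.mean((est - est.mean()) ** 2))     # Monte Carlo SD ("SE")
    se_bar = se_hat.mean()                             # average estimated SE
    cover = np.mean((est - z * se_hat <= beta_true) &
                    (beta_true <= est + z * se_hat))   # empirical coverage (ECP)
    return bias, sd, se_bar, cover
```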

Tables 1 and 2 present the simulation results for estimation performance under logistic regression and Poisson regression, respectively. In general, the results for the two examples are similar, and we draw the following conclusions. First, as the number of sub-sampling times K increases, the bias of the SOS estimator becomes much smaller than that of the OS estimator. This is because the bias of the SOS estimator decreases with increasing K, while the bias of the OS estimator is always $O(1/n)$. Second, the SE and $\widehat{SE}$ of both estimators decrease with increasing n or K, implying that both estimators are consistent. Third, because the bias and SE of the SOS estimator both decrease with n and K, its bias is always negligible compared with its SE. However, the bias of the OS estimator is comparable to, or even larger than, its SE; see n = 100, K = 300 in Table 2 for an example. Next, the empirical coverage probabilities of the SOS estimator are all around the nominal level of 95%, which suggests that the true SE is well approximated by the estimator derived in Theorem 2. In contrast, the empirical coverage probabilities of the OS estimator show a decreasing trend as K increases, because the bias of the OS estimator is no longer negligible when K is large. Last, regarding computational time, we compare the average number of Newton–Raphson iterations consumed by the two methods. Except for the initial sub-sample estimator, the SOS method uses only one Newton–Raphson update for each subsequent sub-sample. On average, the OS estimator requires seven or eight rounds of Newton–Raphson updates to reach convergence. Therefore, the SOS estimator is computationally more efficient than the OS estimator.

TABLE 1

Simulation results for estimation performance under logistic regression

        Bias × 100      SE × 100        SE^ × 100       ECP             ROUND
K       SOS     OS      SOS     OS      SOS     OS      SOS     OS      SOS     OS
n = 100
10      0.532   0.903   6.598   7.237   6.545   6.596   0.952   0.912   1.875   7.145
20      0.336   0.963   4.694   5.109   4.687   4.717   0.947   0.879   1.923   7.492
30      0.204   0.943   3.815   4.142   3.857   3.880   0.950   0.858   1.884   7.146
40      0.170   0.976   3.278   3.559   3.359   3.379   0.950   0.858   1.821   7.240
50      0.150   0.915   2.997   3.247   3.015   3.032   0.954   0.823   2.122   7.654
100     0.058   0.921   2.124   2.293   2.162   2.173   0.955   0.755   1.942   7.149
n = 200
10      0.291   0.545   4.709   4.923   4.539   4.554   0.946   0.927   1.873   7.364
20      0.160   0.479   3.233   3.374   3.283   3.293   0.952   0.919   1.917   7.381
30      0.098   0.420   2.682   2.793   2.707   2.714   0.950   0.912   1.921   7.399
40      0.065   0.453   2.328   2.422   2.363   2.370   0.954   0.906   1.893   7.416
50      0.043   0.438   2.080   2.161   2.125   2.130   0.951   0.904   1.834   7.433
100     0.041   0.448   1.550   1.609   1.553   1.556   0.948   0.882   1.985   7.450
n = 400
10      0.119   0.245   3.198   3.269   3.200   3.205   0.950   0.938   1.881   7.467
20      0.055   0.239   2.285   2.331   2.322   2.325   0.952   0.938   1.954   7.484
30      0.046   0.224   1.891   1.926   1.926   1.928   0.951   0.936   2.016   7.501
40      0.035   0.218   1.681   1.714   1.689   1.692   0.949   0.934   1.982   7.519
50      0.031   0.226   1.529   1.558   1.532   1.534   0.947   0.929   2.104   7.536
100     0.020   0.237   1.148   1.170   1.150   1.151   0.947   0.912   2.163   7.553

Notes: The bias, SE, SE^, and ECP are reported for the sequential one-step (SOS) and one-shot (OS) estimators, respectively. The average Newton–Raphson rounds (ROUND, representing the computational time) of the two estimators are also reported.


TABLE 2

Simulation results for estimation performance under Poisson regression

        Bias × 100      SE × 100        SE^ × 100       ECP             ROUND
K       SOS     OS      SOS     OS      SOS     OS      SOS     OS      SOS     OS
n = 100
100     0.063   3.656   3.237   3.437   3.237   3.505   0.951   0.820   1.892   8.669
200     0.042   3.649   2.329   2.478   2.365   2.555   0.951   0.711   1.934   8.673
300     0.035   3.594   1.947   2.085   1.989   2.148   0.955   0.634   1.910   8.671
400     0.033   3.599   1.720   1.839   1.770   1.910   0.959   0.571   1.907   8.671
500     0.031   3.597   1.564   1.667   1.625   1.754   0.959   0.524   1.907   8.671
1000    0.033   3.593   1.222   1.294   1.286   1.387   0.960   0.426   1.893   8.670
n = 200
100     0.039   1.275   2.096   2.138   2.103   2.167   0.950   0.912   1.884   8.326
200     0.034   1.241   1.555   1.584   1.575   1.622   0.957   0.879   1.892   8.323
300     0.030   1.234   1.315   1.337   1.353   1.393   0.958   0.856   1.927   8.322
400     0.026   1.241   1.194   1.214   1.226   1.262   0.961   0.834   1.962   8.323
500     0.026   1.241   1.115   1.135   1.143   1.176   0.960   0.816   2.043   8.322
1000    0.023   1.239   0.923   0.936   0.956   0.984   0.958   0.765   1.895   8.321
n = 400
100     0.038   0.485   1.466   1.476   1.470   1.489   0.947   0.932   1.884   8.153
200     0.035   0.485   1.131   1.137   1.145   1.160   0.956   0.932   1.892   8.153
300     0.032   0.498   1.006   1.012   1.013   1.026   0.953   0.920   1.903   8.153
400     0.034   0.485   0.935   0.939   0.940   0.952   0.951   0.915   1.906   8.153
500     0.032   0.488   0.882   0.886   0.894   0.905   0.953   0.917   1.891   8.152
1000    0.026   0.486   0.769   0.772   0.792   0.802   0.954   0.910   1.902   8.152

Notes: The bias, SE, SE^, and ECP are reported for the sequential one-step (SOS) and one-shot (OS) estimators, respectively. The average Newton–Raphson rounds (ROUND, representing the computational time) of the two estimators are also reported.


3.3 Comparison with non-uniform sub-sampling methods

In this sub-section, we compare the proposed SOS method with non-uniform sub-sampling methods. To this end, we choose the OSMAC method (Wang et al., 2018) for the comparison, which is specifically designed for large sample logistic regression. The OSMAC method applies a two-step algorithm for model estimation. In the first step, a pilot sample of size $r_0$ is randomly chosen to obtain a pilot estimate. The pilot estimate is then used to compute the optimal sub-sampling probabilities for the whole data set. In the second step, a new sub-sample of size r is chosen based on these optimal sub-sampling probabilities, and the final OSMAC estimate is obtained using all $r_0+r$ samples. We compare the two methods under the logistic regression example. To mimic a large data set, we consider whole sample sizes $N=(1,2,4,8)\times 10^5$. For fixed N, the whole data set is generated under the logistic regression model following the procedures in Section 3.1.
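To make the computational pattern concrete, below is a schematic Python sketch of a two-step OSMAC-style estimator for logistic regression. It uses the simpler "mVc" sub-sampling probabilities, proportional to |y_i − p_i| times the covariate norm, rather than the exact A-optimality weights; the helper names are illustrative, and the precise criterion and weighted estimator are given in Wang et al. (2018).

```python
# A schematic two-step OSMAC-style estimator (simplified mVc probabilities).
import numpy as np

def weighted_logistic_newton(X, y, w, max_iter=100, tol=1e-8):
    """Weighted logistic MLE via Newton-Raphson."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        score = X.T @ (w * (y - p))
        hess = -(X * (w * p * (1 - p))[:, None]).T @ X
        step = np.linalg.solve(hess, score)
        beta = beta - step
        if np.max(np.abs(step)) < tol:
            break
    return beta

def osmac_like(X, y, r0, r, rng):
    N = X.shape[0]
    # Step 1: uniform pilot sample and pilot estimate.
    idx0 = rng.choice(N, size=r0, replace=False)
    beta_pilot = weighted_logistic_newton(X[idx0], y[idx0], np.ones(r0))
    # Computing non-uniform probabilities touches ALL N points -- the O(Nd)
    # cost discussed in the text.
    p_all = 1.0 / (1.0 + np.exp(-X @ beta_pilot))
    pi = np.abs(y - p_all) * np.linalg.norm(X, axis=1)
    pi = pi / pi.sum()
    # Step 2: draw r points with these probabilities and re-estimate with
    # inverse-probability weights.
    idx1 = rng.choice(N, size=r, replace=True, p=pi)
    return weighted_logistic_newton(X[idx1], y[idx1], 1.0 / (N * pi[idx1]))
```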

Below, we compare the SOS and OSMAC methods under one specific situation: the computer memory is limited and can only support fitting a logistic regression model on a sample of size n = 400 at a time. In this situation, the sub-sample size used in SOS is fixed at n = 400, and we vary the sub-sampling times as K = (1, 5, 10, 20, 40). For the OSMAC method, we set $r_0=200$ and r = 400. The OSMAC method is implemented using the corresponding R package provided by Wang et al. (2018).

For a reliable comparison, the experiment is randomly replicated B = 200 times for each experimental setup. We use the mean squared error (MSE) to evaluate the statistical efficiency of the two methods. Specifically, for each method (i.e., SOS and OSMAC), we define $\hat\beta^{(b)}=(\hat\beta_j^{(b)})_{j=1}^{p}$ as the estimator in the bth ($1\le b\le B$) replication. The MSE of each estimator is then computed as $B^{-1}\sum_{b=1}^{B}\sum_{j=1}^{p}(\hat\beta_j^{(b)}-\beta_j)^2$. We also compare the computational time of the two methods. All experiments are conducted on a server with 12× Xeon Gold 6271 CPU and 64 GB RAM. The total time costs consumed by the different methods under each experimental setup are averaged over the B = 200 random replications. The detailed results are displayed in Table 3.

TABLE 3

The mean squared error (MSE) and time costs (in seconds) are obtained by the sequential one-step (SOS) and optimal sub-sampling methods motivated by the A-optimality criterion (OSMAC) methods for different sample sizes N under the logistic regression model.

                N = 100,000       N = 200,000       N = 400,000       N = 800,000
Method          MSE      Time     MSE      Time     MSE      Time     MSE      Time
OSMAC           0.0460   0.0213   0.0460   0.0410   0.0452   0.0875   0.0476   0.1679
SOS (K = 1)     0.2346   0.0058   0.2320   0.0070   0.2300   0.0064   0.2274   0.0086
SOS (K = 5)     0.0415   0.0101   0.0419   0.0090   0.0414   0.0124   0.0418   0.0150
SOS (K = 10)    0.0410   0.0171   0.0405   0.0168   0.0419   0.0175   0.0406   0.0256
SOS (K = 20)    0.0404   0.0318   0.0409   0.0322   0.0405   0.0378   0.0403   0.0437
SOS (K = 40)    0.0401   0.0574   0.0398   0.0618   0.0400   0.0740   0.0400   0.0804

Based on the results presented in Table 3, we draw the following conclusions. First, the OSMAC method achieves better estimation performance than the SOS method with K = 1, because the OSMAC method selects an optimal sub-sample while the SOS method selects its sub-sample at random. Second, as the sub-sampling times K increase, the estimation performance of the SOS method improves, with smaller MSE values. This finding is consistent with our theoretical results in Theorem 1. It is also notable that a relatively small K (e.g., K = 5) already allows the SOS method to achieve better estimation performance than the OSMAC method. Third, the computational time consumed by the OSMAC method increases rapidly with the whole sample size N, because the OSMAC method computes the sub-sampling probabilities for all N samples, which results in a large computational cost. Meanwhile, the computational cost of the SOS method mainly results from the repeated sub-sampling strategy, so the computational time consumed by SOS grows with K. However, with appropriately chosen K, the SOS method can achieve both better estimation performance and lower computational cost than the OSMAC method.

3.4 Comparison with all-sample methods

Finally, to complete our empirical comparison, we compare the SOS method with methods that use the whole sample. We first compare the SOS method with DC methods, which are also commonly used to accomplish data analysis tasks of huge scale. The key idea of DC methods is to divide a large-scale data set into multiple sub-data sets, each of which is estimated separately to obtain a local estimate. All local estimates are then assembled to obtain the final estimate. Different from sub-sampling methods, DC methods exploit the whole data information. Therefore, they often have good statistical efficiency but high computational cost. In this regard, we take the AEE method (Lin & Xi, 2011) as a representative example. Another method to consider is the mini-batch gradient descent (MGD) estimation method (Duchi et al., 2011). The MGD method splits the whole data set into several mini-batches, each of which is read into memory and processed sequentially. Different from the SOS method, MGD is not a Newton–Raphson-type method; instead, it applies a stochastic gradient descent strategy for parameter estimation.
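For concreteness, a minimal sketch of such an MGD baseline for logistic regression is given below, assuming the shuffled data are split into J mini-batches processed with a fixed learning rate (the function and parameter names are illustrative):

```python
# A minimal mini-batch gradient descent (MGD) baseline for logistic regression.
import numpy as np

def mgd_logistic(X, y, J=100, lr=0.2, epochs=1):
    N, p = X.shape
    beta = np.zeros(p)
    batches = np.array_split(np.arange(N), J)           # J sequential mini-batches
    for _ in range(epochs):
        for idx in batches:
            prob = 1.0 / (1.0 + np.exp(-X[idx] @ beta))
            grad = X[idx].T @ (y[idx] - prob) / len(idx) # average log-likelihood gradient
            beta += lr * grad                            # gradient ascent step
    return beta
```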

To undertake a comprehensive evaluation, we consider whole sample sizes $N=(10,12,14,16,18,20)\times 10^4$. For fixed N, we generate the data set under the logistic regression model following the procedures described in Section 3.1. For the AEE method, we assume there are a total of J = 100 Workers. For the MGD method, we assume the total number of mini-batches is also J = 100. The whole data set is then randomly and evenly divided into J = 100 sub-data sets, each with n = N/J observations. For comparison, the sub-sample size in the SOS method is fixed at n = N/J. For all experiments, we fix the sub-sampling times at K = 50. Theoretically, the information exploited by the SOS method is less than that exploited by the DC methods. We repeat the experiment B = 200 times under each experimental setup. The statistical efficiencies of the three methods are evaluated by MSE, and we also compare their computational efficiency. The detailed results are displayed in Figure 2.

FIGURE 2 The mean squared error and time costs (in logarithm) obtained by the sequential one-step, mini-batch gradient descent, and aggregated estimating equation methods for different sample sizes N under the logistic regression model. (a) Estimation performance; (b) Computation performance

As shown in Figure 2a, compared with AEE and MGD, the SOS method is statistically less efficient and attains the largest MSE values. This finding is expected, because the AEE and MGD methods exploit the full data information, so in theory both estimators are $\sqrt{N}$-consistent. However, as suggested by Theorem 2, the variance of the SOS estimator is of order $\{1/(nK)+1/N\}$, so the SOS estimator should be statistically less efficient when nK is smaller than N. Although the SOS estimator has the worst statistical efficiency, its MSE already reaches the order of $10^{-4}$, indicating satisfactory estimation precision in practice.

We then compare the computational efficiency of the three methods. We fix the learning rate at 0.2 in the MGD method. All experiments are conducted on a server with 12× Xeon Gold 6271 CPU and 64 GB RAM. The total time costs (in seconds) consumed by the different methods under different sample sizes are averaged over B = 200 random replications, and the averaged time costs are plotted in Figure 2b on a log scale. As shown, the MGD method takes the most computational time, while the AEE and SOS methods are much more computationally efficient. Furthermore, the SOS method takes less computational time than the AEE method; in general, the time costs of the SOS method are only about half those of the AEE method. These empirical findings confirm that the SOS method is computationally more efficient than the AEE and MGD methods.

4 APPLICATION TO CUSTOMER CHURN ANALYSIS

4.1 Data description and pre-processing

We apply the SOS method to a large-scale customer churn data set provided by a well-known securities company in China. The original data set contains 12 million transaction records from 230,000 customers, covering September to December 2020. The data set originates from 10 files directly exported from the company's database system, which record different aspects of the customers. Specifically, the 10 files cover: basic user information, behaviour information on the company's app, daily asset information, daily market information, inflow-outflow information, debt information, trading information, fare information, and information on holdings of financial products and service products. In total, the 10 files contain 398 variables describing the asset and non-asset information of a specific customer on a specific trading day. The asset information contains 325 variables related to customer transactions, such as assets, stock value, trading volume, profit and debt. The non-asset information contains 73 variables with detailed customer information, such as customer ID, gender, age, education and login behaviour. The whole data set takes up about 300 GB on a hard drive.

The research goal of this study is to provide an early warning of customer churn status, which may help the securities company retain customers. According to the common practice of the securities company, a customer is defined as lost when the following two criteria are both met: (1) the customer holds less than 20,000 in floating assets over 20 trading days; and (2) the customer logs in fewer than three times over 20 trading days. Based on this definition, a new binary variable Churn is used to indicate whether the customer is lost (Churn = 1) or not (Churn = 0). Given that the response is binary (churn or not churn), a logistic regression model can be built to support customer churn prediction. To predict the customer churn status, we compute both asset-related and non-asset-related covariates for each customer using transaction information from 20 trading days earlier. In other words, all covariates used can help forecast the churn status of customers 20 trading days in advance.
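As an illustration of this labelling rule, the sketch below derives the Churn indicator from a per-customer, per-day table; the column names ('customer_id', 'floating_assets', 'n_logins') are hypothetical, and the aggregation reflects one natural reading of the two criteria.

```python
# A sketch of the churn label construction over a 20-trading-day window.
import pandas as pd

def label_churn(window: pd.DataFrame) -> pd.Series:
    """Churn = 1 if, over the window, floating assets stay below 20,000
    AND the customer logs in fewer than three times in total."""
    agg = window.groupby("customer_id").agg(
        max_assets=("floating_assets", "max"),
        total_logins=("n_logins", "sum"),
    )
    return ((agg["max_assets"] < 20_000) & (agg["total_logins"] < 3)).astype(int)
```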

Before model building, we conduct several steps to preprocess the original data set. First, we check the missing value proportion of every variable and discard variables whose missing value proportions are larger than 80%. Second, basic summary statistics (e.g., mean and SD) are computed for each variable to help detect potential outliers. Third, we check the stability of each variable to detect potential changepoints. Fortunately, we find that the daily basic statistics for most variables are stable from September to December. Therefore, all daily observations are pooled together in the subsequent analysis.

Preliminary analysis shows that strong correlations exist among most variables in the original data set. Therefore, we design a practical procedure for variable selection, borrowing ideas from the independence screening method (Fan & Song, 2010). Specifically, we first classify all covariates into 10 groups based on their source files. Then a logistic regression model with each single covariate is fitted for variable selection. Note that the prediction performance of the customer churn model should be evaluated on a test data set. We therefore first order all observations by time and then split the whole data set into a training data set (the first 70% of observations) and a test data set (the last 30% of observations). We build a logistic regression model with each single covariate on the training data set, and the resulting AUC value is recorded to measure the prediction ability of that covariate. Next, in each variable group, the covariate with the largest AUC value is chosen. This leads to the final predictor set consisting of 10 covariates. Table 4 shows the detailed information about the response and the 10 selected covariates.
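The sketch below illustrates this group-wise screening step with scikit-learn, assuming numeric covariate matrices and a 'groups' mapping from column index to source-file group; the AUC is computed on the held-out split here purely for illustration, and all names are hypothetical.

```python
# Group-wise screening: keep the single covariate with the largest AUC per group.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def screen_by_auc(X_train, y_train, X_test, y_test, groups):
    """groups maps column index -> source-file group name."""
    auc = {}
    for j in range(X_train.shape[1]):
        model = LogisticRegression().fit(X_train[:, [j]], y_train)
        auc[j] = roc_auc_score(y_test, model.predict_proba(X_test[:, [j]])[:, 1])
    selected = {}
    for j, g in groups.items():                  # best column within each group
        if g not in selected or auc[j] > auc[selected[g]]:
            selected[g] = j
    return sorted(selected.values())
```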

We then explore the relationship between the response and the covariates. For illustration, we take MaxFA and WheL as examples of continuous and categorical covariates, respectively. In the whole data set, the percentage of churned customers is 13.7%. Because the continuous variable MaxFA has a highly right-skewed distribution, a logarithmic transformation is applied. Figure 3a presents the boxplot of MaxFA (in log-scale) under different levels of Churn. As shown, customers with smaller maximum floating assets are more likely to churn. As for the categorical variable WheL, Figure 3b presents the spinogram of this variable under different levels of Churn. As shown, customers who do not log into the system are more likely to churn.

FIGURE 3 The boxplot of MaxFA (in log-scale) (a) and the spinogram of WheL (b) under the response Churn = 0 (non-churn status) and Churn = 1 (churn status)

TABLE 4

The detailed information about responses and covariates

Category    Variable  Source file  Description
Response    Churn     Asset        Whether the customers churn or not. Yes: 13.7%; No: 86.3%.
Asset       MaxMVS    Market       The maximum market value of stock.
            StdTF     Fare         The SD of total fare.
            MaxFA     Asset        The maximum floating assets.
            StdTD     Debt         The SD of total debt.
            WheTVAM   Trading      Whether the trading volume of A-shares is missing or not. Yes: 69.1%; No: 30.9%.
            WheIFM    Inflow       Whether the inflow of funds is missing or not. Yes: 81.3%; No: 18.7%.
Non-asset   Age       Basic        The age of customers (4 levels). <40: 20.6%; 40–50: 25.8%; 50–60: 29.3%; >60: 24.3%.
            WheHFP    FProduct     Whether the customers hold financial products or not. Yes: 71.5%; No: 28.5%.
            WheHSP    SProduct     Whether the customers hold service products or not. Yes: 86.0%; No: 14%.
            WheL      Behaviour    Whether the customers login or not. Yes: 16.8%; No: 83.2%.

Note: "FProduct" and "SProduct" represent source files of holding financial products or service products.


4.2 Customer churn prediction using SOS

We build a logistic regression model to investigate influential factors for the churn status of customers. The whole data set is too large to be analysed directly in the computer's memory. To handle this huge data set, we do not consider DC methods, because we do not have a distributed system at hand. We also do not consider classic sub-sampling methods: non-informative (uniform) one-round sub-sampling is usually statistically less efficient than the SOS method, while informative sub-sampling requires computing the optimal sampling probabilities for the entire data set, which would incur a high computational cost. Based on these considerations, the SOS method is applied to estimate the logistic regression model. To evaluate the predictive ability of the SOS method, we order all observations by time and then split the whole data set into training data (the first 70% of observations) and test data (the last 30% of observations). Below, we build the logistic regression model on the training data set and then evaluate its prediction performance on the test data set.
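To make the evaluation protocol concrete, the following minimal Python sketch illustrates a time-ordered 70/30 split of the kind described above. The file name and column names (obs_time, churn) are hypothetical placeholders rather than the fields of the confidential data set.

```python
import pandas as pd

# Hypothetical file and column names; the real covariates are listed in Table 4.
df = pd.read_csv("customer_churn.csv", parse_dates=["obs_time"])

# Order observations by time so that the test set strictly follows the training set.
df = df.sort_values("obs_time").reset_index(drop=True)

# First 70% of observations for training, last 30% for testing.
cut = int(0.7 * len(df))
train, test = df.iloc[:cut], df.iloc[cut:]

print(f"train: {len(train)} rows, churn rate {train['churn'].mean():.3f}")
print(f"test:  {len(test)} rows, churn rate {test['churn'].mean():.3f}")
```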

Before applying the SOS method to the training data set, we need to set the sub-sampling size n and the sub-sampling times K. To balance K and n, we first select the sub-sampling size n and then determine the total sub-sampling times K based on n. The sub-sampling size n is mainly determined by the available computation resources. In this real application, we use a server with 12× Xeon Gold 6271 CPUs and 64 GB RAM for computation. In addition, the securities company requires fast computation so that the customer churn model can be conveniently updated day by day in the future. Based on the limited computation resources and this requirement for fast computation, we fix n = 10,000.

For the sub-sampling times K, we apply an iterative strategy to select an appropriate value. We define $\bar{\beta}_K$ as the SOS estimate obtained after K rounds of sub-sampling. Then, with increasing K, we compare $\bar{\beta}_{K-1}$ with $\bar{\beta}_K$. An appropriate K is chosen once the $\ell_2$-norm $\|\bar{\beta}_K - \bar{\beta}_{K-1}\|_2$ falls below a pre-defined threshold. Other selection methods could also be considered, such as five-fold cross-validation; specifically, one could monitor the out-of-sample prediction accuracy under each K and then select a value that balances prediction accuracy against computational time. In this work, we vary K from 10 to 100 in steps of 10 and use the iterative strategy to select K. Under each K, we compute the $\ell_2$-norm $\|\bar{\beta}_K - \bar{\beta}_{K-1}\|_2$; the corresponding values are plotted in Figure 4. Based on some preliminary analysis, we find that the estimated coefficients are not very small. Therefore, we set a threshold of $10^{-4}$ to identify stable coefficient estimates, which leads to the choice K = 20. In addition, as shown in Figure 4, K = 20 is also the point at which $\|\bar{\beta}_K - \bar{\beta}_{K-1}\|_2$ declines fastest. Based on these considerations, we finally choose K = 20.
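The stopping rule above is straightforward to implement. The sketch below shows one plausible realisation in Python: each round draws a uniform sub-sample of size n, performs a single Newton-type one-step update of the logistic log-likelihood starting from the current averaged estimate, and averages the per-round estimates until the average stabilises. The function names and the exact form of the update are illustrative assumptions based on the method's description, not the authors' code.

```python
import numpy as np

def logistic_one_step(beta, X, y):
    """One Newton step for the logistic log-likelihood on a sub-sample."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    grad = X.T @ (y - p) / len(y)                         # normalised score
    hess = (X * (p * (1 - p))[:, None]).T @ X / len(y)    # normalised information
    return beta + np.linalg.solve(hess, grad)

def sos_estimate(X, y, n=10_000, K_max=100, tol=1e-4, seed=None):
    """Hypothetical SOS-style loop: uniform sub-sampling, one-step updating,
    and averaging, stopped once the averaged estimate stabilises."""
    rng = np.random.default_rng(seed)
    beta_bar = np.zeros(X.shape[1])
    estimates = []
    for k in range(1, K_max + 1):
        idx = rng.choice(len(y), size=n, replace=False)    # uniform sub-sample
        estimates.append(logistic_one_step(beta_bar, X[idx], y[idx]))
        new_bar = np.mean(estimates, axis=0)               # running SOS average
        if k > 1 and np.linalg.norm(new_bar - beta_bar) < tol:
            return new_bar, k                              # l2-norm stopping rule
        beta_bar = new_bar
    return beta_bar, K_max
```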

FIGURE 4  The value of the $\ell_2$-norm $\|\bar{\beta}_K - \bar{\beta}_{K-1}\|_2$ under different $K$

Table 5 presents the detailed regression results for the SOS method on the training data set. For comparison purposes, we also report the regression results for the full data set in Table 5. In general, the regression results for the training data and the full data are similar. Specifically, the variable MaxFA plays a significantly negative role in the churn status, which implies that the smaller the maximum floating assets, the more likely a customer is to churn. This is in accordance with the descriptive results shown in Figure 3a. The variable WheTVAM plays a significantly positive role in the churn status, which implies that customers with no A-share trading volume are more likely to churn. Similarly, customers with no inflow of funds are more likely to churn. As for the non-asset-related variables, the variable WheHSP plays a significantly negative role in the churn status, indicating that customers who do not hold service products are more likely to churn. In addition, the variable Age has a significant influence on the churn status for some age groups: customers in the age groups 50–60 and >60 are more likely to churn than those in the age group <40. Finally, the variable WheL has a significantly negative effect, indicating that customers who did not log in to the system during the past 20 trading days are more likely to churn. As a robustness check, we also conduct the same experiment on four daily data sets, each randomly selected from one of the months from September to December. The detailed results are shown in Appendix D in Data S1 and suggest stable coefficient estimates across time.
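As an informal illustration of effect size (our reading, not part of the original analysis), the usual odds-ratio interpretation of logistic regression applied to the printed training-data coefficient of WheIFM gives

$$\exp(0.529) \approx 1.70,$$

that is, other covariates held fixed, customers whose fund inflow is missing have churn odds roughly 1.7 times those of customers with a recorded inflow.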

TABLE 5

The estimation results for logistic regression using the sequential one-step method on the training data set and the full data set, respectively

                     Training data                          Full data
Variable         | Est.    | SE    | p-Value | Sig. | Est.    | SE    | p-Value | Sig.
Intercept        | 23.753  | 0.351 | <0.001  | ***  | 23.482  | 0.331 | <0.001  | ***
MaxMVS           | 11.736  | 0.405 | <0.001  | ***  | 11.646  | 0.336 | <0.001  | ***
StdTF            |  2.759  | 0.313 | <0.001  | ***  |  2.829  | 0.261 | <0.001  | ***
StdTD            |  3.568  | 0.316 | <0.001  | ***  |  3.614  | 0.263 | <0.001  | ***
MaxFA            | 27.699  | 0.708 | <0.001  | ***  | 27.604  | 0.588 | <0.001  | ***
WheTVAM: Yes     |  1.293  | 0.306 | <0.001  | ***  |  1.333  | 0.255 | <0.001  | ***
WheIFM: Yes      |  0.529  | 0.253 | 0.037   | *    |  0.544  | 0.255 | 0.033   | *
Age: 40–50       |  0.426  | 0.306 | 0.164   |      |  0.450  | 0.255 | 0.077   |
Age: 50–60       |  0.732  | 0.282 | 0.009   | **   |  0.693  | 0.255 | 0.007   | **
Age: >60         |  1.131  | 0.308 | <0.001  | ***  |  1.155  | 0.257 | <0.001  | ***
WheL: Yes        |  2.700  | 0.312 | <0.001  | ***  |  2.622  | 0.259 | <0.001  | ***
WheHFP: Yes      |  0.476  | 0.274 | 0.083   |      |  0.451  | 0.254 | 0.076   |
WheHSP: Yes      |  0.468  | 0.226 | 0.039   | *    |  0.523  | 0.255 | 0.041   | *

Notes: We report the estimated coefficient $\bar{\beta}_K$, its standard error $\widehat{\mathrm{SE}}(\bar{\beta}_K)$, and the p-value for each variable. The symbols *, ** and *** represent a significant influence under the significance levels 5%, 1% and 0.1%, respectively.
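The reported p-values are Wald-type and can be reproduced directly from the estimates and standard errors in Table 5; the snippet below does so for the WheIFM row of the training-data columns and recovers the tabulated value of 0.037.

```python
from scipy.stats import norm

est, se = 0.529, 0.253                 # WheIFM (Yes), training data, Table 5
z = est / se                           # Wald statistic
p_value = 2 * (1 - norm.cdf(abs(z)))   # two-sided normal-approximation p-value
print(round(z, 3), round(p_value, 3))  # approximately 2.091 and 0.037
```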


Next, we evaluate the predictive ability of the model. Above, we obtained the logistic regression model on the training data set. The estimated model is then used to predict the churn status of the customers in the test data set. We use the receiver operating characteristic (ROC) curve together with the area-under-curve (AUC) value to measure predictive accuracy, as shown in Figure 5. The AUC value for this data split is 0.946, which suggests that the proposed model has very good ability to classify customers as churn or non-churn. For comparison purposes, we also apply the OS method with K = 20 and n = 10,000 on the training data set. The AUC value of the OS method on the test data is 0.938, which is smaller than that of the SOS method. We also compare the computational time of the two methods. On a server with 12× Xeon Gold 6271 CPUs and 64 GB RAM, the computational times for the SOS and OS methods are 3.02 and 10.01 s, respectively. The SOS method is thus clearly more computationally efficient than the OS method.
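For readers who want to reproduce this evaluation step, the following self-contained sketch computes an ROC curve and AUC with scikit-learn on small synthetic data (the real data are confidential), using an ordinary logistic fit as a stand-in for the SOS estimate; it illustrates the computation behind Figure 5 rather than the paper's exact pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve

# Small synthetic stand-in for the confidential churn data.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 4))
logit = X @ np.array([1.0, -0.5, 0.8, 0.0]) - 1.8
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))
X_train, X_test, y_train, y_test = X[:3500], X[3500:], y[:3500], y[3500:]

model = LogisticRegression().fit(X_train, y_train)   # stand-in for the SOS fit
p_hat = model.predict_proba(X_test)[:, 1]            # predicted churn probabilities

auc = roc_auc_score(y_test, p_hat)                   # area under the ROC curve
fpr, tpr, _ = roc_curve(y_test, p_hat)               # points tracing the ROC curve
print(f"test AUC: {auc:.3f}")
```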

FIGURE 5  The receiver operating characteristic curve of the logistic regression using the sequential one-step method on the test data

Finally, we present a practical customer recovery strategy based on the established model in Table 5. First, we sort all customers in the test data set by their predicted churn probabilities under the model. Then, we divide the customers into 10 groups of equal size: group 1 consists of the customers with the highest predicted churn probabilities, which we refer to as the high-risk group, and group 10 contains the customers with the lowest predicted churn probabilities, which is regarded as the low-risk group. In each of the 10 groups, we calculate the ratio of customers who truly churned. As shown in Figure 6, the churn ratio among all customers is only 13.7%, but the churn ratio in group 1 is as high as 81.5%. This result verifies the predictive power of the established model: customers with high predicted churn probabilities do tend to churn in reality. This finding suggests that the company should pay more attention to this group of customers and employ active retention strategies, such as face-to-face visits, reduced commissions, and exclusive services. It is also notable that group 2 shows a higher churn ratio (31.0%) than that of all customers (13.7%). Therefore, group 2 also requires close attention and continuous monitoring.
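The grouping procedure described here is a standard decile (lift) analysis. A minimal sketch is given below on synthetic stand-ins for the test-set predictions, so the printed ratios are purely illustrative and will not match the 81.5% and 31.0% reported for the real data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins; in practice use the fitted model's test-set predictions.
rng = np.random.default_rng(1)
p_hat = rng.uniform(size=20_000)                      # predicted churn probabilities
y_test = rng.binomial(1, 0.3 * p_hat)                 # observed churn indicators

groups = pd.qcut(-p_hat, q=10, labels=range(1, 11))   # group 1 = highest predicted risk
lift = (
    pd.DataFrame({"group": groups, "churned": y_test})
    .groupby("group", observed=True)["churned"]
    .mean()
)
print(lift)                            # churn ratio within each risk group (cf. Figure 6)
print("overall churn ratio:", y_test.mean())
```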

FIGURE 6  The churn ratios of the 10 groups divided by their predicted churn probabilities under the sequential one-step method

5 CONCLUDING REMARKS

In this work, we propose a sampling-based method for customer churn analysis with massive data sets. Classic sub-sampling methods require only one round of sub-sampling, but they must calculate non-uniform sampling probabilities for all data points, which often makes them computationally inefficient. To address this issue, we propose the SOS method, which samples data points with uniform probabilities but repeats the sub-sampling step. In this way, the sub-sampling cost can be largely reduced. Based on the SOS method, a sequence of estimators is computed, each obtained by one-step updating based on the newly sampled sub-data. The final SOS estimate is the average of all these estimators. We establish the theoretical properties of the SOS estimator: both its bias and its SE can be reduced by increasing the sub-sampling times or the sub-sample size. The finite sample performance of the SOS estimator is demonstrated through simulation studies. Finally, we apply the SOS method to a real customer churn data set from a securities company. Using the SOS method, we can handle the large-scale data set, identify useful factors that influence customers' churn status, and achieve high prediction accuracy for latent churn customers. Remarkably, although the proposed SOS method is designed for the estimation of logistic regression, it can easily be extended to other generalised regression models.

We conclude with several directions for future study. First, the SOS estimator still depends on multiple rounds of sub-sampling. New sub-sampling schemes could be designed to reduce the number of sub-sampling rounds, which would further lower the computational cost. Second, in the final SOS step all previous estimators receive equal weights. In practice, one could assign larger weights to estimators from later steps, because they tend to perform better. Finally, when dynamic massive data are available, extending the SOS method to data streams is a promising topic for future study.

6 FUNDING INFORMATION

This work was supported by the National Natural Science Foundation of China (grant numbers 72001205, 72171229, 11971504, 12071477, 11701560 and 11831008), the fund for building world-class universities (disciplines) of Renmin University of China, the Chinese National Statistical Science Research Project (2022LD06), and a Foundation from the Ministry of Education of China (20JZD023). This research was also supported by the Public Computing Cloud, Renmin University of China.

DATA AVAILABILITY STATEMENT

The data cannot be made public, because the cooperation with the company is based on the confidentiality of the original data.

REFERENCES

Ahmad, A.K., Jafar, A. & Aljoumaa, K. (2019) Customer churn prediction in telecom using machine learning in big data platform. Journal of Big Data, 6, 1–24.

Ahn, Y., Kim, D. & Lee, D.-J. (2019) Customer attrition analysis in the securities industry: a large-scale field study in Korea. International Journal of Bank Marketing, 38(3), 561–577.

Ascarza, E., Neslin, S.A., Netzer, O., Anderson, Z., Fader, P.S., Gupta, S. et al. (2018) In pursuit of enhanced customer retention management: review, key issues, and future directions. Customer Needs and Solutions, 5, 65–81.

Battey, H., Fan, J., Liu, H., Lu, J. & Zhu, Z. (2018) Distributed testing and estimation under sparse high dimensional models. The Annals of Statistics, 46, 1352–1382.

Dhillon, P., Lu, Y., Foster, D.P. & Ungar, L. (2013) New subsampling algorithms for fast least squares regression. Proceedings of the International Conference on Neural Information Processing Systems.

Drineas, P., Magdon-Ismail, M., Mahoney, M.W. & Woodruff, D.P. (2011) Fast approximation of matrix coherence and statistical leverage. Journal of Machine Learning Research, 13, 3475–3506.

Duchi, J., Hazan, E. & Singer, Y. (2011) Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12, 257–269.

Fan, J. & Li, R. (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96, 1348–1360.

Fan, J. & Song, R. (2010) Sure independence screening in generalized linear models with NP-dimensionality. Annals of Statistics, 38, 3567–3604.

Huang, C. & Huo, X. (2019) A distributed one-step estimator. Mathematical Programming, 174, 41–76.

Jordan, M.I., Lee, J.D. & Yang, Y. (2019) Communication-efficient distributed statistical inference. Journal of the American Statistical Association, 114, 668–681.

Kayaalp, F. (2017) Review of customer churn analysis studies in telecommunications industry. Karaelmas Science and Engineering Journal, 7, 696–705.

Lee, J.D., Liu, Q., Sun, Y. & Taylor, J.E. (2017) Communication-efficient sparse regression. The Journal of Machine Learning Research, 18, 115–144.

Lehmann, E. & Casella, G. (1998) Theory of point estimation, 2nd edition. New York: Springer-Verlag.

Lin, N. & Xi, R. (2011) Aggregated estimating equation estimation. Statistics and Its Interface, 1, 73–83.

Ma, P., Mahoney, M.W. & Yu, B. (2015) A statistical perspective on algorithmic leveraging. Journal of Machine Learning Research, 16, 861–919.

Ma, P. & Sun, X. (2015) Leveraging for big data regression. Wiley Interdisciplinary Reviews: Computational Statistics, 7, 70–76.

Ma, P., Zhang, X., Xing, X., Ma, J. & Mahoney, M.W. (2020) Asymptotic analysis of sampling estimators for randomized numerical linear algebra algorithms. AISTATS, 108, 1026–1035.

Maldonado, S., Domínguez, G., Olaya, D. & Verbeke, W. (2021) Profit-driven churn prediction for the mutual fund industry: a multisegment approach. Omega, 100, 102380.

McDonald, R., Mohri, M., Silberman, N., Walker, D. & Mann, G.S. (2009) Efficient large-scale distributed training of conditional maximum entropy models. Advances in Neural Information Processing Systems, 22, 1231–1239.

Pan, R., Zhu, Y., Guo, B., Zhu, X. & Wang, H. (2020) A sequential addressing subsampling method for massive data analysis under memory constraint. https://doi.org/10.48550/arXiv.2110.00936

Quiroz, M., Kohn, R., Villani, M. & Tran, M.-N. (2019) Speeding up MCMC by efficient data subsampling. Journal of the American Statistical Association, 114, 831–843.

Saulis, L. & Statulevičius, V. (2012) Limit theorems for large deviations. New York: Springer Science & Business Media.

Shao, J. (2003) Mathematical statistics. Springer Texts in Statistics. New York: Springer.

Wang, F., Zhu, Y., Huang, D., Qi, H. & Wang, H. (2021) Distributed one-step upgraded estimation for non-uniformly and non-randomly distributed data. Computational Statistics & Data Analysis, 162, 107265.

Wang, H. (2009) Forward regression for ultra-high dimensional variable screening. Journal of the American Statistical Association, 104(488), 1512–1524.

Wang, H., Zhu, R. & Ma, P. (2018) Optimal subsampling for large sample logistic regression. Journal of the American Statistical Association, 113, 829–844.

Wang, H.Y., Yang, M. & Stufken, J. (2019) Information-based optimal subdata selection for big data linear regression. Journal of the American Statistical Association, 114, 393–405.

Yu, J., Wang, H., Ai, M. & Zhang, H. (2020) Optimal distributed subsampling for maximum quasi-likelihood estimators with massive data. Journal of the American Statistical Association, 117(537), 265–276.

Zhu, X., Li, F. & Wang, H. (2021) Least squares approximation for a distributed system. Journal of Computational and Graphical Statistics, 30(4), 1004–1018.

Zou, H. & Li, R. (2008) One-step sparse estimates in nonconcave penalized likelihood models. Annals of Statistics, 36(4), 1509–1533.

APPENDIX A. USEFUL LEMMAS

In this section, we prove some useful lemmas.

Lemma 1

Considering the convergence rate of $\bar{\beta}_1$, we have $\|\bar{\beta}_1 - \beta_0\| = O_p(n^{-1/2})$.

For generalised linear models, under conditions (C2) and (C3), the objective function $\ell_{\mathcal{S}_1}(\beta)$ is strictly concave in a neighbourhood of $\beta_0$. As a result, to verify that $\bar{\beta}_1 = \hat{\beta}_1$ is $\sqrt{n}$-consistent, it suffices to follow the technique of Fan and Li (2001) and show that, for any $\epsilon > 0$, there exists a finite constant $C > 0$ such that,

$$P\Big\{\sup_{\|u\|=1} \ell_{\mathcal{S}_1}\big(\beta_0 + C n^{-1/2} u\big) < \ell_{\mathcal{S}_1}(\beta_0)\Big\} \geq 1 - \epsilon. \tag{A1}$$

To this end, we define $\beta_u = \beta_0 + C u/\sqrt{n}$, where $C > 0$ is a fixed constant and $u \in \mathbb{R}^p$ is a $p$-dimensional vector with unit length (i.e., $\|u\| = 1$). Then, we apply the Taylor expansion and obtain

(A2)

where 1=˙𝒮1(β0)/n and 2=¨𝒮1(β0)/n.

Next, we compute E(1) and var(1) as follows. First, we consider E(1).

where 𝒯={(X1,Y1),(X2,Y2),,(XN,YN)}, and Si(β0) is the score function of the ith observation for 1iN.

Second, we compute var(1). It can be calculated that

(A3)

where ^1=N1i=1NSi(β0)Si(β0) and 1=E(^1)=E{Si(β0)Si(β0)}. Note that under condition (C1), n/N0, then (A3) converges to 1 as n. This suggests that 1 is an Op(1). By a similar technique, it can be verified that 2p1. Consequently, as long as C is sufficiently large, the quadratic term in (A2) dominates its linear term. Therefore, 𝒮1(β0+Cn1/2u)𝒮1(β0)<0 with probability tending to 1 as n. This suggests that with probability tending to 1, a local optimiser (i.e., β^1=β1) exists, such that β1β0=Op(n1/2). The optimiser satisfies ˙𝒮1(β1)=0. The conclusion is thus proved.

Lemma 2

Denote the $m$th ($1 \le m \le M$) sequential sub-sample as $\mathcal{T}_m = \{(X_m, Y_m), \ldots, (X_{m+n-1}, Y_{m+n-1})\}$. Thus, given $\beta_0$, define $U_m = \{n^{-1}\ddot{\ell}_{\mathcal{T}_m}(\beta_0)\}^{-1}\{n^{-1}\dot{\ell}_{\mathcal{T}_m}(\beta_0)\}$ and $\bar{U} = M^{-1}\sum_{m=1}^{M} U_m$. The expectation and variance of $\bar{U}$ are $E(\bar{U}) = 0$ and

(A4)

Furthermore, recall that $U^{(k)} = \{n^{-1}\ddot{\ell}_{\mathcal{S}_k}(\beta_0)\}^{-1}\{n^{-1}\dot{\ell}_{\mathcal{S}_k}(\beta_0)\}$ and $\bar{U}_k = k^{-1}\sum_{k'=1}^{k} U^{(k')}$. For any $2 \le k \le K$, we have

(A5)

Proof

The lemma is proved in the following two steps. In the first step, we verify E(Ū)=0 and Equation (A4). In the second step, we prove Equation (A5).

Step 1. For the expectation, we have $E(U_m) = E\{E(U_m \mid \mathbb{X}_m)\} = 0$, where $\mathbb{X}_m = \{X_m, \ldots, X_{m+n-1}\}$, and

(A6)

Next, we consider the variance of $\bar{U}$. We have $\mathrm{var}(\bar{U}) = E(\bar{U}\bar{U}^\top) - E(\bar{U})E(\bar{U})^\top = E(\bar{U}\bar{U}^\top)$, as $E(\bar{U}) = 0$. Furthermore,

It can be verified that $\sum_{m=1}^{M} E(U_m U_m^\top) = M n^{-1}$. Next, we focus on the calculation of $M^{-2}\sum_{m_1 \ne m_2} E(U_{m_1} U_{m_2}^\top)$. We assume that $n/N \to 0$ and that $M = N - n + 1$ is much larger than the sub-sample size $n$. We now discuss two cases.

First, when $|m_1 - m_2| \ge n$, we have $E(U_{m_1} U_{m_2}^\top) = 0$. There are $(M-n)(M-n+1)$ pairs of $m_1$ and $m_2$ in this case. Second, let $|m_1 - m_2| = m$; when $0 < m < n$, we have $E(U_{m_1} U_{m_2}^\top) = n^{-1}(n-m)$, and there are $2(M-m)$ pairs of $m_1$ and $m_2$ in this case. As a result, $\sum_{0 < m < n} E(U_{m_1} U_{m_2}^\top) = 2 n^{-1} c_1$, where $c_1 = n(n-1)(3M-n-1)/6$. We can derive that $E(\bar{U}\bar{U}^\top) = (nM)^{-2}(nM + 2c_1)$. Then, we have

This finishes the first step.
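As a check on the counting argument above (a direct summation added here for clarity, not part of the original proof), the stated closed form of $c_1$ follows from

$$c_1 = \sum_{m=1}^{n-1} (M-m)(n-m) = Mn(n-1) - (M+n)\frac{n(n-1)}{2} + \frac{n(n-1)(2n-1)}{6} = \frac{n(n-1)(3M-n-1)}{6}.$$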

Step 2. By a similar proof technique to that for Step 1, we immediately have E(Ūk)=0. We then focus on the computations of var(Ūk). To study the variance of Ūk, define E(·) and var(·) as the conditional expectation and conditional variance, respectively, given ={U1,U2,,UM}. We know that var(Ūk)=E{var(Ūk)}+var{E(Ūk)}. Then, we study E{var(Ūk)} and var{E(Ūk)} separately.

First, we compute E{var(Ūk)}, which is

Second, we consider var{E(Ūk)}. We have

Then, var{E(Ūk)}=var(Ū)=E(ŪŪ). Thus,

(A7)

This completes the proof.

Lemma 3

DefineR1kandR2kas follows,

HereΔk,j=(Δk,j,i1i2)p×pfor1jp, andΔk,j,i1i2=˙j,𝒮k+1(β)/βi1βi2|β=β˜k, ˙j,𝒮k+1(β)is the j th element of˙𝒮k+1(β)andβ˜k=ηβ^k+(1η)β0for some0η1. In particular, R11={n1¨𝒮1(β0)}1[(β1β0){n1Δ0,j}(β1β0)],withΔ0,j,i1,i2=˙j,𝒮1(β)/βi1βi2|β=β˜1andR21=0.Then, ifβkβ0κ1{1/(nk)+1/N}{1+op(1)}, then for anyk2, we haveΔk=k1k=1k(R1k+R2k)κ2[(logk/n)1/2{1/(nk)+1/N}1/2]{1+op(1)}. Here, κ1,κ2are some positive constants.

Proof

By condition (C4), it can be calculated that

(A8)

Here, λmax(M) denotes the largest absolute eigenvalue of M, ^k={n1¨𝒮k(βk1)}1, k={n1¨𝒮k(β0)}1, and we define β0=β1. To analyse Equation (A8), we then study the two terms on the right of the equation separately in the following two steps.

Step 1. First, we are going to show that max1kkλmax(^k) is an Op(1). It is sufficient to show that with probability tending to 1, we have

(A9)

for some positive constants τmin<τmax<. By condition (C3), we can find two positive constants, such that 2τminλmin()λmax()21τmax, for two positive constants τmin<τmax<. Thus, we know immediately that 2τmininfr=1rrsupr=1rr21τmax for a p-dimensional vector r. Thus, the desired conclusion (A9) is implied by

(A10)

where ϵ>0 is an arbitrary positive number. Then, the left-hand side of (A10) is bounded by 1kkP(supr=1|r(^k)r|>ϵ). Note that for any k, we have

Consequently, the left-hand side of (A10) can be further bounded by

(A11)

Next, under condition (C5), it can be proved that $P(|\hat{\sigma}_{j_1 j_2, k} - \sigma_{j_1 j_2}| > \epsilon) \le C_1 \exp(-C_2 n \epsilon^2)$ for two positive constants $C_1$ and $C_2$, by Theorem 3.2 on p. 45 of Saulis and Statulevičius (2012) and the proof technique of Wang (2009). Thus, (A11) can be bounded by

(A12)

As under condition (C1), log(K)=o(n), the right-hand side of (A14) converges to 0 as n. This implies that max1kkλmax(^k) is an Op(1).

Step 2. In the second step, we investigate max1kkn1˙𝒮k(β0). Similar to Step 1, it can be proved that P(max1kkn1˙𝒮k(β0)>ϵ) can be bounded by

(A13)
(A14)

If we replace $\epsilon$ by $\gamma_n \epsilon$, then we can verify that if $\gamma_n = \sqrt{\log(k)/n}$, there exists an $\epsilon$ such that the right-hand side of (A14) can be made arbitrarily small. This leads to $\max_{1 \le k' \le k} \|n^{-1}\dot{\ell}_{\mathcal{S}_{k'}}(\beta_0)\| = O_p(\sqrt{\log(k)/n})$.

As a result, the right-hand side of (A8) is further bounded by

This completes the proof.

APPENDIX B. PROOF OF THE THEOREMS

In this section, we provide the detailed proof of the theorems and proposition to establish the theoretical properties of the proposed estimator.

B.1 Proof of Theorem 1

This theorem is to be proved in two parts. In the first part, we prove Theorem 1 (1). In the second part, we verify the asymptotic normality of the SOS estimator.

Part 1. We prove Theorem 1 (1) in the following two steps. In the first step, we show that for any k2,βk=β0+k1k=1k(R1k+R2k)k1k=1kU(k), where R1k and R2k are defined in Lemma 3. In the second step, denote Δ=K1k=1K(R1k+R2k), and recall that U(k)={n1¨𝒮k(β0)}1{n1˙𝒮k(β0)}, ŪK=K1k=1KU(k), we then verify that EŪK=0, var(βk)=var(ŪK){1+o(1)}={1/(nK)+1/N}{1+o(1)} and Δ=Op[(logk/n){1/(nk)+1/N}].

Step 1. By the Taylor expansion, we have

(B1)

Thus, based on (1), we have

(B2)

By (B3), we can rewrite βk as

(B4)

Step 2. This step is decomposed into two sub-steps. In Step 2.1, we verify that βkβ0κ1(1/(nk)+1/N){1+op(1)} for some constant κ1 in a deductive way. In Step 2.2, we prove the remaining results.

Step 2.1. First, we consider k=2. By (B4), it can be verified that β2=β0+21({n1¨𝒮2(β1)}1[(β1β0){n1Δ1,j}(β1β0)]+{n1¨𝒮1(β0)}1[(β1β0){n1Δ0,j}$β1β0)]+ {n1¨𝒮2(β0)}1{n1¨𝒮2(β1)}1n1˙𝒮2(β0))Ū2. From Lemma 2, we have E(Ū2)=0 and var(Ū2)=(2n)1+(11/2)N1{1+o(1)}. Consequently, Ū222{(2n)1+(11/2)N1}{1+op(1)}. Furthermore, by Lemma 1, we have 21{R11+R12}=Op(1/n) and 21{R21+R22}=Op(1/n). As a result, 21k=12(R1k+R2k)=op(Ū2).

Next, we assume that for any 2kk1, we have Ūk22{(kn)1+(11/k)N1}{1+op(1)} and k1k˜=1k(R1k˜+R2k˜)=op(Ūk). This suggests that βkβ0κ1(1/nk+1/N){1+op(1)} for some κ1>0. By Lemma 2, we know that E(Ūk)=0 and var(Ūk)=1kn+(11/k)1N{1+o(1)}. Furthermore, by Lemma 3, we have 1kk=1k(R1k+R2k)=Op[(logk/n){1/(nk)+1/N}]=op(1/(nk)+1/N)=op(Ūk). The penultimate equality holds, as logk=o(n). As a result, we have proved βkβ0κ1(1/(nk)+1/N){1+op(1)}.

Step 2.2. Finally, by (B4), we know

by the results of Step 2.1 and Lemma 3, we have K1k=1K(R1k+R2k)=Op[(logK/n){1/(nK)+1/N}]1/2. In addition, by Lemma 2, we have EŪK=0, and var(ŪK)={1/(nK)+N1(11/K)}{1+o(1)}. This accomplishes the proof of Step 2.2. Combining the results of Steps 1 and 2, we finish the first part.

Part 2. In the second part, we verify the asymptotic normality of the SOS estimator. As from Part 1, we have verified that K1k=1K(R1k+R2k)=op(ŪK), it suffices to study ŪK. To this end, we decompose ŪK into ŪK=ŪK(1)+ŪK(2) with ŪK(1)=K1K=1K{n1˙𝒮k(β0)} and ŪK(2)=K1K=1K[{n1¨𝒮k(β0)}1]{n1˙𝒮k(β0)}. We then calculate the two terms separately.

We first consider ŪK(2). By the same analysis as in Lemma 2, it can be proved that E(ŪK(2))=0 and var(ŪK(2))={(nK)1+N1(11/K)}o(1)=o(1/N) when nK/N→∞ and K→∞. This suggests that ŪK(2)=op(1/N). Next, we prove the normality of √N·ŪK(1). To this end, it suffices to show that its characteristic function f(t)=E[exp(itK1K=1K{n1˙𝒮k(β0)}]→exp{t1t/2}. Denote L=M1m=1M{n1˙𝒯m(β0)}. Then, we have

where Z1=(nK)1/2k=1Ki𝒮k{˙i(β0)L} and Z2=NL. Subsequently, we consider the following three cases to prove the convergence of f(t).

Case 1. We first consider the case of N/(nK)0. Then, we have τ2nK=(1+nK/N). Note that by a similar analysis technique to that in Lemma 2, we have E(Z1)=0 and var(Z1)=O(1). Consequently, we have Z1=Op(1). This leads to exp{it(τnK)1Z1}p1. Accordingly, f(t) shares the same asymptotic limit with E[exp{it(τN)1Z2}]. Note that E[exp{it(τN)1Z2}]exp(t2/2) due to the following two reasons: (1) τ2N=(N/nK+1)1; and (2) Z2=N(Mn)1i=1Mn˙i(β0){1+op(1)}=N1/2i=1N˙i(β0){1+op(1)}dN(0,1) by the central limit theorem, as n/N0. This finishes the proof of Case 1.

Case 2. We next consider the case of N/(nK). Because τ2N=(N/nK+1)0, and Z2dN(0,), we should have (τN)1Z2p0 and it leads to E[exp{it(τN)1Z2}]1. Then, by the dominated convergence theorem, f(t) has the same asymptotic limit as E[exp{it(τnK)1Z1}]. This limit term could be verified to be exp(t2/2), due to the following two reasons: (1) τ2nK=(1+nK/N)1; (2) Z1dN(0,1) by the Central Limit Theorem. This finishes the proof of Case 2.

Case 3. We finally consider the case that N/(nK)κ for some constant κ>0. We then decompose f(t) into f1(t)+f(t)f1(t) with f1(t)=E[Δ˜2exp{it(τN)1Z2}] and f(t)f1(t)=E[(Δ˜1Δ˜2)exp{it(τN)1Z2}], where Δ˜1=E[exp{it(τnK )1Z1|𝒯}] and Δ˜2=expt1t/(2nKτ2). Since Z1dN(0,1), we then have Δ˜1Δ˜2p0 conditional on 𝒯. Thus by the dominated convergence theorem, we have f(t)f1(t)0. Consequently, f(t) shares the same asymptotical limit with f1(t). This implies that it suffices to verify that f1(t)exp(t1t/2). Note that τ2nK=1+1/κ. Then, we have Δ˜2exp[t1t/{2(1+1/C)}]. Meanwhile, as τ2N1+κ and Z2N(0,1), we then should have E[exp{it(τN)1Z2}]exp{t1t/2(1+κ)}. It can be verified that f1(t)exp(t1t/{2(1+κ)}t1t/[2{1+1/κ}])=exp(t1t/2). This completes the proof of Case 3 and Part 2. Combining the results of Part 1 and Part 2, we accomplish the whole theorem proof.

B.2 Proof of Proposition 1

To prove Proposition 1, we first verify that $\bar{\beta}_K^{\mathrm{OS}} - \beta_0 = \Delta_{\mathrm{os}} - \bar{U}_K$ with $\Delta_{\mathrm{os}} = O_p(1/n)$. Subsequently, by Theorem 1, this immediately leads to the asymptotic normality of $\bar{\beta}_K^{\mathrm{OS}}$.

Recall that $\bar{\beta}_K^{\mathrm{OS}} = K^{-1}\sum_{k=1}^{K} \hat{\beta}_{k,\mathrm{mle}}$. By Taylor's expansion, we know

Here Δos(k)={n1¨𝒮k(β0)}1{(β^k,mleβ0)𝒮k(θ0)(β^k,mleβ0)}, and {(β^k,mleβ0)𝒮k(θ0)(β^k,mleβ0)} is defined similarly to that in Equation (B1). Note that by Lemma 1, we have β^k,mleβ0=Op(1/n), and by Condition (C4), we know that Δos(k)=Op(1/n) is the bias term. Then, it holds that

where Δos=K1k=1KΔos(k). Furthermore, if one requires n2/N, then we have Δos=op(1/N)=op{1/(nK)+1/N}. Because we have verified that {1/(nK)+1/N}1/2ŪKdN(0,) in Appendix B.1, this accomplishes the whole proof.

B.3 Proof of Theorem 2

First, we consider the expectation of SE^2(βK). We have

where A1={k=1K(U(k)Ū)(U(k)Ū)} and A2=K{(ŪŪK)(ŪŪK)}, and c0=n{(nK)1+N1}. We next consider E(A1) and E(A2) separately. Given , all U(k)s can be seen as independent. Then, we derive the following:

(B5)

and we also have

(B6)

Combining the above, we have

(B7)

Second, we consider the bias of SE^2(βK). Together with (A7) and (B7), we have

By a similar proof technique to that for Lemma 3, we can conclude that R1+R2 is sufficiently small compared with ŪK. Thus, the desired result can be obtained. This completes the proof.

B.4 Proof of Theorem 3

The purpose of this proof is to verify (3). Recall that SE^2(βK)=n(K1)1(1/(nK)+1/N)k=1K(Û(k)Ū^k)(Û(k)Ū^k). Then we have

where B1=K1k=1K(Û(k)Ū)(Û(k)Ū), and B2=(Ū^kŪ)(Ū^kŪ). To verify Equation (3), it suffices to prove that nB1p, and nB2p0 since K/(K1)1. Then we consider analysing B1 and B2 separately.

It then could be verified that

where B11=K1k=1K(Û(k)U(k))(Û(k)U(k)), B12=K1k=1K(U(k)Ū)(U(k)Ū), and 𝒪=2K1k=1K(Û(k)U(k))(U(k)Ū) is the cross term. Next, we are going to investigate the three terms in the following two steps. First, we verify that B12 is the leading term with nB12p. In addition, we prove that B11 and 𝒪 are ignorable terms, more precisely, they are both of the order op(1/n).

Step 1. We first show that nB12p. Then the consistency can be verified in (1) E(nB12) and (2) var(nB12)0. We next calculate the expectation and variance separately.

(1) Proof ofE(nB12): Note that, B12=A1/K, where A1 is defined in Appendix B.3. Then by (B5), we have E(B12)=n1+O(1/N)=n1{1+o(1)}.

(2) Proof ofvar(nB12)0: Because var(B12)=var{E(B12)}+E{var(B12)}, we then investigate the two terms respectively. We compute var{E(B12)} first. It could be verified that

(B8)

Take variance on both sides of (B8), by the same technique used in Step (1) of Lemma 2, we have

(B9)

Here, the first inequality holds because there are $2\sum_{m=1}^{n-1}(M-m)$ pairs of $m_1$ and $m_2$ for which the covariance is not equal to zero. Furthermore, it can be verified that $\|E\{(U_m - \bar{U})(U_m - \bar{U})^\top\}\|^2 \le 8\|E\{U_m U_m^\top\}\|^2 = O(1/n^2)$. By a similar analysis to (A6), (B9) can be rewritten as $\mathrm{var}\{E(B_{12})\} = O(n/M)\, O(1/n^2) = o(1/n^2)$.

We next calculate E{var(B12)}. Because var(B12)=K1var{(U(k)Ū)(U(k)Ū)}, it could be proved that

Combining the above results, we have verified that var(nB12)0.

Step 2. We next show that nB11p0.

By definition of Û(k), we have Û(k)=[{n1¨𝒮k(βk)}1{n1¨𝒮k(β0)}1]{n1˙𝒮k(βk)}+{n1¨𝒮k(β0)}1{n1˙𝒮k(βk)}. Then it could be shown that

where $U^{(k1)} = [\{n^{-1}\ddot{\ell}_{\mathcal{S}_k}(\bar{\beta}_k)\}^{-1} - \{n^{-1}\ddot{\ell}_{\mathcal{S}_k}(\beta_0)\}^{-1}]\{n^{-1}\dot{\ell}_{\mathcal{S}_k}(\bar{\beta}_k)\}$ and $U^{(k2)} = \{n^{-1}\ddot{\ell}_{\mathcal{S}_k}(\beta_0)\}^{-1}\{n^{-1}\dot{\ell}_{\mathcal{S}_k}(\bar{\beta}_k) - n^{-1}\dot{\ell}_{\mathcal{S}_k}(\beta_0)\}$. Consequently, to verify $nB_{11} \to_p 0$, it suffices to prove that $nK^{-1}\sum_{k=1}^{K} U^{(k1)} U^{(k1)\top} \to_p 0$ and $nK^{-1}\sum_{k=1}^{K} U^{(k2)} U^{(k2)\top} \to_p 0$. We then analyse the two terms separately.

(1) Proof ofnK1k=1KU(k1)U(k1)p0: Note that, U(k1) could be further re-written as U(k1)=[{n1¨𝒮k(βk)}1{n1¨𝒮k(β0)}1]{n1˙𝒮k(βk)n1˙𝒮k(β0)}+[{n1¨𝒮k(βk)}1{n1¨𝒮k(β0)}1]n1˙𝒮k(β0). By condition (C4), we have U(k1)Cmaxkλmax(^k)βkβ02+Cn1˙𝒮k(β0)βkβ0. Furthermore, by the same analytical skills used in Lemma 3, it could be proved that K1k=1KU(k1)U(k1)=Op(n1{(nK)1+N1}logK)=op(1/n). This infers nK1k=1KU(k1)U(k1)p0.

(2) Proof ofnK1k=1KU(k2)U(k2)p0: Similarly, because Uk2{maxkλmax(^k)}2βkβ0, we could verify that k=1KU(k2)U(k2)=Op(n1{(nK)1+N1}logK)=op(1/n).

Combining the two above proofs, we then finish Step 2. Subsequently, by the Cauchy–Schwarz inequality, we have n𝒪p0. This completes the proof of (K1)1(nK)B1p.

Finally, we calculate $B_2$. By a proof technique similar to that used in analysing $B_1$, we can conclude that $B_2 = (\bar{U}_K - \bar{U})(\bar{U}_K - \bar{U})^\top + o_p(1/n)$. Then, by (B6), we have $E\{(\bar{U}_K - \bar{U})(\bar{U}_K - \bar{U})^\top\} = (nK)^{-1}\{1 + o(1)\}$. As a consequence, it can be shown that $E\{n(\bar{U}_K - \bar{U})(\bar{U}_K - \bar{U})^\top\} \to 0$, which leads to $nB_2 \to_p 0$. The desired results are thus obtained. This completes the proof of the theorem.

Author notes

Danyang Huang and Shuyuan Wu have contributed equally to this study.

