Abstract

In this paper, we analyze length-biased and partly interval-censored data, whose challenges primarily come from biased sampling and interference induced by interval censoring. Unlike existing methods that focus on low-dimensional data and assume the covariates to be precisely measured, researchers may encounter high-dimensional data subject to measurement error, which is ubiquitous in applications and makes estimation unreliable. To address those challenges, we explore a valid inference method for handling high-dimensional length-biased and interval-censored survival data with measurement error in covariates under the accelerated failure time model. We primarily employ the SIMEX method to correct for measurement error effects and propose a boosting procedure to perform variable selection and estimation. The proposed method is able to handle the case where the dimension of covariates is larger than the sample size, and it enjoys the appealing feature that the distributions of the covariates are left unspecified.

1 Introduction

Interval-censored data, in which the failure time is only known to lie in some time interval, arise commonly in lifetime data analysis (e.g., Lawless, 2003, Section 2.3.1). That is, instead of observing the failure time, one usually collects censoring intervals in datasets. Unlike interval-censored data, in which only censoring intervals can be observed, another general and attractive data structure is called partly interval-censored (PIC) data, in which some of the failure times are exactly observed, while other failure times are subject to interval-censoring (e.g., Gao et al., 2017). In the past literature, many methods have been developed to handle these two types of data structures. To name a few, for interval-censored data, Cai & Betensky (2003) proposed a penalized spline method to construct semiparametric regression models. Yavuz & Lambert (2011) developed a Bayesian penalized B-splines method to estimate survival functions. Wang et al. (2016) proposed a monotone spline representation to fit the Cox model. Fu & Simonoff (2017) adopted the survival tree method to analyze interval-censored data. Yao et al. (2019) employed an ensemble learning method for interval-censored time-to-event data. A thorough overview of interval-censored data analysis can be found in Du & Sun (2021a). Unlike estimation methods for interval-censored data that simply adopt interval-censoring times to construct estimating functions, the analysis of PIC data requires adjusting for interval-censoring effects while taking the observed failure times into account. For example, Huang (1999) proposed a nonparametric approach and developed its asymptotic properties; Zhao et al. (2008) proposed a generalized log-rank test to nonparametrically estimate survivor functions. With covariates accommodated, Kim (2003) proposed nonparametric maximum likelihood estimation for the Cox model. Gao et al.
(2017) considered the accelerated failure time (AFT) model and employed the Buckley–James formulation to construct an estimating function. Pan et al. (2020) adopted a Bayesian approach to fit the Cox model to PIC data.

With a complex sampling design, we usually face a complicated data structure. Specifically, length-biased sampling has been an important issue in lifetime data analysis and usually occurs under prevalent cohort sampling (e.g., Lawless, 2003, Section 2.4). In the presence of length-biased sampling, individuals with shorter survival times are less likely to be recruited into the study, thus resulting in a biased sample. In the past literature, a large body of methods has been developed to deal with such a data structure. For example, Gao & Chan (2019) considered the Cox model and proposed nonparametric maximum likelihood estimation. Wang et al. (2021) proposed a pairwise pseudo-likelihood approach for the Cox model. Pan & Chappell (2002) explored left-truncated and interval-censored data under the Cox model. Shen et al. (2022) considered length-biased and interval-censored data with a nonsusceptible fraction. However, those existing methods focus on interval-censored data instead of PIC data.

In addition to complex sampling designs, other challenges come from the covariates. The first feature is the involvement of irrelevant covariates. To address variable selection, several regularization strategies have been developed. In recent years, boosting methods have also been adopted for variable selection, including Wolfson (2011) and Brown et al. (2017). For interval-censored survival data without length-biased sampling, existing regularization methods have been applied to the Cox model (e.g., Du et al., 2021; Zhao et al., 2020), the transformation model (e.g., Scolas et al., 2016), and the conditional cumulative hazard function (e.g., Sun et al., 2021). More comprehensive discussions can be found in Du & Sun (2021b). However, those methods cannot handle PIC data. Moreover, to the best of our knowledge, there is no valid method for variable selection with length-biased and interval-censored data.

The second feature of covariates is measurement error, which is ubiquitous in data collection. It can occur due to incorrect records by investigators or imprecise instruments. In early developments, some methods were explored for interval-censored data under the proportional odds model (e.g., Wen & Chen, 2014), the Cox model (e.g., Song & Ma, 2008), and the linear transformation model (e.g., Mandal et al., 2019). However, those methods focus on continuous covariates and do not explore length-biased sampling or PIC data. Within the framework of biased sampling induced by the prevalent cohort mechanism, several methods have been proposed for different models, such as the simulation and extrapolation (SIMEX) method (e.g., Chen, 2019, 2020) and the insertion method (e.g., Chen & Yi, 2021a). Moreover, relevant methods have also been extended to deal with variable selection, including the regularization method for the additive hazards model (e.g., Chen, 2021) and the focused information criterion for the Cox model (e.g., Chen & Yi, 2020). However, those strategies are based on right-censored data, and the PIC data structure has not been carefully explored.

In this paper, we aim to fill the research gap of dealing with measurement error and variable selection simultaneously for length-biased and PIC data. Specifically, we consider the AFT model and adopt the Buckley–James formulation to develop an estimating function for length-biased and PIC data. After that, to correct for measurement error effects and address variable selection, we propose the SIMEXBoost method, which incorporates the SIMEX approach into the boosting algorithm. Our strategy has several advantages over, and differences from, existing approaches. First, unlike most developments that primarily focus on right-censored survival data, we develop a new approach to handle variable selection and measurement error for PIC data, which has rarely been explored in the past literature. Second, unlike conventional regularization methods that aim to optimize penalized estimating functions, the SIMEXBoost method does not require nondifferentiable penalty functions; on the contrary, it enables us to deal with a general estimating function and obtain informative covariates as well as the corresponding estimator via finitely many iterations. Regarding measurement error correction, the proposed method can correct for measurement error effects in both error-prone binary and continuous covariates.

The remainder of this paper is organized as follows. In Section 2, we introduce the data structure and regression models. In Section 3, we propose the SIMEXBoost method, which adopts the SIMEX method to correct for measurement error effects and applies the boosting method to perform variable selection. In Section 4, we apply the proposed method to analyze a real dataset. Finally, a general discussion is presented in Section 5. Simulation studies and the corresponding numerical results are placed in the supporting information due to limited space in the main text.

2 Notation and Models

2.1 Length-biased and Partly Interval-censored Data

Let formula denote the p-dimensional vector of covariates. Let u be the initial event and let v denote the terminal event that is of primary interest. Let formula be the failure time, defined as the length of time from the initial event u to the terminal event v. In the process of collecting data, it is possible that individuals have experienced the terminal event before being recruited into the study. Specifically, let ξ represent the calendar time of recruitment and define formula as the truncation time. Individuals cannot be recruited if formula. On the contrary, the observable failure time and truncation time, denoted T and A, respectively, can be collected if formula. Consequently, we define formula, and this sampling scheme causes sampling bias.

Moreover, to describe formula, we follow the scenario in Chen & Yi (2021a) that the incidence of disease onset follows a stationary Poisson process, so that the truncation time follows a uniform distribution on the interval [0, τ], where τ is the maximum support of formula. Under this situation, such a sampling scheme is called length-biased sampling.
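To illustrate how this recruitment scheme biases the observed failure times, the following minimal sketch simulates a population with exponential failure times and uniform truncation times on [0, τ]; the distributions and parameter values are illustrative assumptions, not those of the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
tau = 10.0          # maximum support of the truncation time
n_pop = 100_000     # incident population before recruitment

# Illustrative failure times, and uniform truncation times on [0, tau]
# mimicking onsets from a stationary Poisson process.
t_star = rng.exponential(scale=2.0, size=n_pop)
a_star = rng.uniform(0.0, tau, size=n_pop)

# Only subjects whose failure time exceeds the truncation time are recruited.
t_obs = t_star[t_star >= a_star]

# The recruited sample over-represents long survivors (length bias):
# t_obs.mean() exceeds t_star.mean().
```

Comparing the two sample means makes the bias of the recruited sample visible without any analytical derivation.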

In addition to biased sampling, the failure time T can be incomplete due to the occurrence of censoring. In this paper, we primarily focus on partly interval-censoring, which means that some of the failure times are exactly observed, while others are only known to lie within certain intervals (e.g., Gao et al., 2017). Specifically, let δ denote the binary indicator with formula indicating that T is observed. When formula, there exists a sequence of examination times formula such that the censoring interval is given by formula, where formula and formula. Consistent with Gao et al. (2017) and Gao & Chan (2019), formula is assumed to be independent of T given formula.

Finally, for a sample of size n, let formula denote independent copies of formula. In particular, if formula for all formula, then formula reduces to length-biased and interval-censored data.

2.2 Accelerated Failure Time Models

In survival analysis, the primary interest is to characterize the relationship between the failure time and the covariates. In this paper, we mainly consider the well-known AFT model:

(1)

where β is a p-dimensional vector of parameters of primary interest, and ε is the noise term with mean zero, whose density function formula can be known or unknown.

If formula were complete and fully observed, then the estimator of β could be naturally obtained by the least-squares method. However, in the presence of length-biased sampling and interval-censoring, some necessary adjustments are required. To this end, we first adjust for the effects induced by length-biased sampling, and then deal with interval-censoring.

Following a similar discussion in Chen & Yi (2021a), the joint density function of T and A given formula, denoted formula, is formulated as

(2)

where formula is the indicator function and formula with formula being the conditional density function of formula given formula. By Equation (2), the observed failure time T has a length-biased conditional density function

(3)

for formula. Moreover, under Equation (1), the conditional density function formula is specified as

(4)

for formula. Therefore, for formula, combining Equations (3) and (4) yields that

(5)

We further define formula. Then given formula, the expectation of formula based on Equation (5) yields that

(6)

where the second step is obtained by the change of variable formula, and the last equality holds due to the zero expectation of ε. Consequently, for a sample of size n, Equation (6) suggests an unbiased estimating function that accounts for length-biased sampling only: formula.

In addition, in the presence of interval-censoring, formula is only observed when formula; when formula, the failure time formula is unobservable. To address the interval-censoring effect for subjects with formula, we employ the Buckley–James method (e.g., Buckley & James, 1979), whose key strategy is to compute the conditional expectation of formula given formula and use it to adjust for the censoring effects. Specifically, the idea of the Buckley–James method is to construct a pseudo-response with the conditional expectation accommodated:

(7)

where formula = exp(formula) and formula = exp(formula). In the case of formula, we further express the following conditional expectation in Equation (7):

(8)

where formula is the cumulative distribution function of formula. Therefore, together with Equations (7) and (8), a new estimating function for length-biased and PIC data is given by

(9)

with formula = exp(formula) and formula = exp(formula).

On the other hand, we observe from Equation (9) that formula is a nuisance function and is implicitly affected by formula. If formula is known, then formula is simply the integral of formula. If formula and formula are unknown, then we may use the observed data to estimate them. Let formula denote the cumulative distribution function of formula. With β fixed at formula, we define formula as the estimator of formula that satisfies the following self-consistency equation (e.g., Huang, 1999; Gao et al., 2017):

(10)

where formula. In addition, according to the discussion in Ning et al. (2011), formula can be expressed as

(11)

With the estimate formula, the estimator of formula, denoted by formula, can be obtained by Equation (11) with formula replaced by formula. Therefore, replacing formula in Equation (9) by formula yields a workable estimating function:

(12)

With formula fixed, one can solve the estimating equation formula with respect to β, where formula represents the p-dimensional zero vector. Finally, iteratively updating Equations (10) and (12) until convergence gives the solution formula, and the resulting estimator is given by formula. The detailed implementation, as well as the procedure for solving Equation (12), is deferred to Section 3.

2.3 Measurement Error Models

In applications, covariates are often subject to measurement error. Let formula denote the unobserved covariates, which can be decomposed as formula, where formula is the formula-dimensional vector of continuous covariates and formula is the formula-dimensional vector of discrete covariates. Let formula denote the observed version of formula, which can be expressed as formula with formula and formula being the observed versions of formula and formula, respectively.

To describe the relationship between formula and formula, we use the factorization

(13)

where formula represents the conditional distribution for the random variables indicated by the arguments. Making the independence assumptions in formula and formula allows Equation (13) to be expressed as formula, suggesting that one can characterize formula and formula separately.

To characterize formula, we consider the classical additive measurement error model (Carroll et al., 2006)

(14)

where formula follows a normal distribution with mean zero and covariance formula. We assume that formula is independent of formula and ε in Equation (1). On the other hand, for discrete covariates, we first define possible values of formula as formula, and let formula denote the (mis)classification probability for formula (Chen & Yi, 2021b). We define formula as a formula (mis)classification matrix. In addition, by the law of total probability, we have that for formula, formula. Thus, we can model formula as (e.g., Chen & Yi, 2021b)

(15)

Therefore, the surrogate vector formula can be characterized by the true covariate vector formula through Equation (15). To ease notation, we let formula denote the misclassification operator indicated by Equation (15) and notationally write Equation (15) as formula. Such a misclassification operator was also used by Carroll et al. (2006, p. 125) for a misclassified binary variable. In addition, as discussed in Carroll et al. (2006, p. 125), we assume that Π has the decomposition formula, where formula is the diagonal matrix with diagonal elements being the eigenvalues of Π, and Ω is the corresponding matrix of eigenvectors.
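The eigendecomposition stated above is what makes fractional powers of Π well defined, which the working-data construction in Section 3 relies on. A minimal sketch, assuming a 2 × 2 misclassification matrix with real positive eigenvalues (function and variable names are illustrative):

```python
import numpy as np

def misclass_power(Pi, zeta):
    """Fractional power of a (mis)classification matrix through the
    decomposition Pi = Omega Lambda Omega^{-1}: raise the eigenvalues
    (assumed real and positive) to the power zeta."""
    lam, omega = np.linalg.eig(Pi)
    return omega @ np.diag(lam ** zeta) @ np.linalg.inv(omega)

pi = 0.05
Pi = np.array([[1 - pi, pi], [pi, 1 - pi]])
# Pi^(1 + zeta) injects an extra degree zeta of misclassification,
# which is the operation used on the discrete error-prone covariates.
Pi_working = misclass_power(Pi, 1.0 + 0.5)
```

Sanity checks: the power 1 recovers Π itself and the power 0 recovers the identity matrix.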

In this paper, to highlight the key idea of measurement error effects as well as measurement error correction, we assume for now that formula and Π are known. Note that when, in applications, formula and/or Π are unknown, one may require additional data, such as repeated measurements or a validation sample, to estimate the unknown parameters (e.g., Carroll et al., 2006, Section 2.3). In implementations without auxiliary information, one may conduct sensitivity analyses, whose purpose is to specify a range of reasonable values for formula and Π and examine the impact of different magnitudes of measurement error. A demonstration of sensitivity analyses is given in Section 4; detailed discussions of the estimation of unknown formula and/or Π can be found in Section 5, and additional numerical results are placed in the supporting information.

3 Methodology

3.1 SIMEXBoost

In this section, we propose the SIMEXBoost method, a combination of the SIMEX method and the boosting procedure, to correct for measurement error effects and perform variable selection simultaneously. Detailed descriptions are given below, and pseudo-code of the algorithm is summarized in Appendix A of the supporting information.

  • Stage 1:

    Setup

    Let B be a given positive integer and let formula be a sequence of pre-specified values with formula, where M is a positive integer and formula is a pre-specified positive number such as formula or 2. While B and formula are not uniquely specified, B is commonly set to a value between 100 and 500, and formula is taken as a collection of M points that equally partition the interval [0,1] or [0,2], with M set to 5 or 10 (e.g., Carroll et al., 2006, p. 106).

    For a given subject i with formula as well as formula and formula, we generate formula from formula. Then for a vector formula, we define formula. In addition, for the discrete error-prone covariates formula, we generate formula by
    (16)

    where formula, and formula is derived from formula by replacing its diagonal elements, say formula, with formula.

    We call
    (17)

    the working data for formula, formula and formula.

  • Stage 2:

    Estimation

    In this stage, we run the boosting algorithm with measurement error correction accommodated.

    For given b and ζ, we first define formula as the initial value and define formula by computing Equation (10) with formula replaced by formula. Let I denote the total number of iterations. For the formulath iteration with formula as well as fixed formula and formula, let formula denote Equation (12) with formula replaced by the working data (17).

    Define formula at the formulath iteration for formula. After that, we take formula as signals and define an active set
    (18)
    where formula is the jth component of formula and formula is a thresholding constant. We then update values by the steepest descent
    (19)

    for all formula and formula, where κ is a positive increment. Moreover, we further derive an updated function formula that is obtained by solving Equation (10) with given formula.

    Finally, continuing this procedure based on the boosting algorithm until the last iteration I, we obtain formula for formula and formula. For any fixed formula, taking an average of formula with respect to b yields
    (20)
  • Stage 3:

    Extrapolation

    For the sequence formula obtained from (20), we fit a regression model to the sequence formula, where formula is the extrapolation function approximated by user-specified regression functions, Γ is the associated parameter, and η is the noise term. The parameter Γ can be estimated by the least-squares method; we let formula denote the resulting estimate of Γ.

    Finally, we calculate the predicted value formula and take formula as the final estimator.
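The simulation and estimation stages above can be sketched as follows. For each (b, ζ), the code regenerates working covariates with added noise of variance ζΣ (Stage 1), runs a thresholded sign-based boosting loop in the spirit of (18) and (19) (Stage 2), and averages over b as in (20). Because the paper's estimating function (12) is not reproduced here, a least-squares score is used as a stand-in, and all tuning values are illustrative assumptions.

```python
import numpy as np

def simex_boost_path(y, X_err, Sigma, zetas, B=50, iters=200,
                     kappa=0.01, rho=0.9, rng=None):
    """Sketch of Stages 1-2.  For each (b, zeta): regenerate working
    covariates W = X* + sqrt(zeta) * e with e ~ N(0, Sigma) (Stage 1),
    then run a thresholded L1-steepest-descent boosting loop (Stage 2).
    A least-squares score replaces the paper's estimating function."""
    rng = rng if rng is not None else np.random.default_rng(0)
    n, p = X_err.shape
    L = np.linalg.cholesky(Sigma)
    beta_by_zeta = []
    for zeta in zetas:
        betas = np.zeros((B, p))
        for b in range(B):
            W = X_err + np.sqrt(zeta) * rng.standard_normal((n, p)) @ L.T
            beta = np.zeros(p)
            for _ in range(iters):
                U = W.T @ (y - W @ beta) / n                  # stand-in score
                active = np.abs(U) >= rho * np.abs(U).max()   # active set, cf. (18)
                beta[active] += kappa * np.sign(U[active])    # update, cf. (19)
            betas[b] = beta
        beta_by_zeta.append(betas.mean(axis=0))               # average over b, cf. (20)
    return np.array(beta_by_zeta)                             # one row per zeta
```

Each returned row plays the role of (20) for one ζ and feeds the Stage 3 extrapolation.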

The key idea of the proposed three-stage procedure is to artificially simulate surrogate measurements with varying magnitudes of mismeasurement in order to delineate the effect of different degrees of mismeasurement on inference results. The first and third stages generalize the SIMEX method (Carroll et al., 2006, Chapter 5), which is applicable to error-contaminated continuous and binary covariates. In addition, similar to the discussion in Brown et al. (2017), taking the indices corresponding to the thresholded values in the second stage performs the selection of important covariates under different magnitudes of mismeasurement. Variable selection in this stage is similar to the development in Chen & Yi (2021b), who proposed the adaptive least absolute shrinkage and selection operator (LASSO) method, and thus it ensures that important covariates can be detected when the working data are used to adjust for measurement error effects.

To see why the measurement error correction is valid, note that Stage 1 generates a sequence of surrogate covariates, say Equation (17), whose value of ζ reflects the degree of mismeasurement in the artificially generated surrogates formula. With a positive and increasing ζ, formula incurs an increasing amount of mismeasurement, whose effects on the inferential procedure are recorded in Stage 2. When formula, formula reduces to formula, the ideal situation without measurement error. Using the patterns obtained from Stage 2 under different degrees of mismeasurement, in Stage 3 we employ a regression model to obtain an estimator corresponding to the error-free scenario (i.e., formula).
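The extrapolation step can be sketched as follows, assuming the quadratic extrapolant commonly used in SIMEX: fit each averaged coefficient path against ζ by least squares and evaluate the fit at ζ = -1, the value conventionally mapped to the error-free case.

```python
import numpy as np

def simex_extrapolate(zetas, estimates):
    """Stage-3 sketch: fit the quadratic extrapolant
    g(zeta) = a + b*zeta + c*zeta**2 by least squares and evaluate it
    at zeta = -1, conventionally associated with the error-free case."""
    z = np.asarray(zetas, dtype=float)
    Z = np.column_stack([np.ones_like(z), z, z ** 2])
    gamma, *_ = np.linalg.lstsq(Z, np.asarray(estimates, dtype=float),
                                rcond=None)
    return np.array([1.0, -1.0, 1.0]) @ gamma   # g(-1) = a - b + c
```

If the recorded estimates follow an exactly quadratic trend in ζ, the fit is exact and the extrapolated value is recovered without error.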

We further comment on some parameters in Stage 2. First of all, ϱ in Equation (18) controls the number of selected covariates. If formula, all covariates are included and variable selection is ignored. If formula, then the whole algorithm reduces to the generic procedure that selects only one covariate in each iteration (e.g., Wolfson, 2011), and it has been shown by Wolfson (2011) that the resulting update path in Stage 2 is (approximately) equivalent to the LASSO method. However, updating one component per iteration may make computation cumbersome. Hence, since our goal is variable selection, to strike a balance between reducing the number of iterations and selecting informative predictors at each iteration, one can choose a value of ϱ close to 1; our numerical examinations find that formula gives satisfactory results.

Next, regarding the update in Equation (19), our approach follows the steepest descent for the L1-norm, which takes the sign of the estimating function as the increment. In addition, similar to the discussion in Brown et al. (2017), the LASSO is approximately equivalent to Stage 2 with formula as formula and formula, which suggests that the learning rate κ should be a small value. Finally, while a large value of I ensures a precise estimate, it may cause over-fitting at the same time. As a result, early stopping of the iteration can be implemented. One commonly used criterion is to stop the iteration at step formula if formula is satisfied for some tolerance constant formula and given formula and formula, where formula is the infinity norm of a vector formula.
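The early-stopping rule described above amounts to a sup-norm check between successive iterates; a one-line sketch (the function name and default tolerance are illustrative):

```python
import numpy as np

def should_stop(beta_new, beta_old, tol=1e-6):
    """Early-stopping check: stop once the infinity-norm change between
    successive boosting iterates falls below the tolerance."""
    return np.max(np.abs(beta_new - beta_old)) < tol
```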

In summary, the proposed SIMEXBoost method has several advantages. First, unlike Brown et al. (2017), whose algorithm requires the assumption that covariates are precisely measured, the SIMEXBoost method provides a flexible strategy to handle measurement error and variable selection simultaneously. Moreover, our setting explores biased and incomplete data induced by truncation and censoring, which is not studied in Brown et al. (2017). Second, since estimation of β in the AFT model (1) is based on an estimating function whose solution is difficult to obtain directly due to possible discontinuity of the estimating function (e.g., Gao et al., 2017), Stage 2 of the SIMEXBoost method is able to handle variable selection while solving the constructed estimating function. In addition, unlike regularization methods (e.g., LASSO) that must handle nondifferentiable penalty functions, Stage 2 simply adopts iterations to derive estimators with informative covariates detected.

3.2 SIMEXBoost with Collinearity in Covariates

When constructing regression models, a crucial concern about covariates is collinearity, which refers to high correlations among covariates. It is known that collinearity may produce misleading results, such as falsely concluding that significant covariates are insignificant.

To deal with variable selection and collinearity based on the SIMEXBoost method, we follow the idea of the elastic net method (e.g., Zou & Hastie, 2005) that incorporates L1- and L2-norm penalty functions. Specifically, for formula and formula, we suggest replacing formula in Equations (18) and (19) by

(21)

where λ is a tuning parameter. The optimal tuning parameter can be chosen by cross-validation. To see why the proposed approach works, we observe that the elastic-net penalized likelihood function can be re-expressed as the LASSO method applied to a ridge regression function, and the derivative of the ridge regression function is equivalent to the sum of the estimating function and formula. To connect this with our approach, one can regard Equation (21) as the derivative of the likelihood function with the L2-norm penalty function, while the whole algorithm plays the role of the LASSO approach for variable selection.
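Under the reading above, in which Equation (21) subtracts the derivative of an L2 penalty from the estimating function, the modification is a one-line change to the boosting score. This is a hedged sketch of that interpretation, with illustrative names:

```python
import numpy as np

def ridge_adjusted_score(U, beta, lam):
    """Elastic-net-style modification of the boosting score: subtract
    the derivative of the L2 penalty (lam * beta), so thresholding
    supplies the L1 part while the ridge term handles collinearity."""
    return U - lam * beta
```

The thresholding and sign updates of Stage 2 are then applied to this adjusted score instead of the raw one.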

4 Application to the Signal Tandmobiel Study

In this section, we apply the SIMEXBoost method to analyze the tooth data from the Signal Tandmobiel® study, a longitudinal prospective oral health study conducted in the Flanders region of Belgium from 1996 to 2001. In this dataset, a cohort of 4468 randomly sampled children who attended the first year of basic school at the beginning of the study was dentally examined annually by trained dentists. Among them, we primarily consider the sample of size 4430, because the remaining 38 sampled children did not attend any of the designed dental examinations.

In this paper, we are interested in the emergence times of 28 permanent teeth, divided into seven types with four teeth each: permanent central incisors (labels 11, 21, 31, 41), permanent lateral incisors (labels 12, 22, 32, 42), permanent canines (labels 13, 23, 33, 43), permanent first premolars (labels 14, 24, 34, 44), permanent second premolars (labels 15, 25, 35, 45), permanent first molars (labels 16, 26, 36, 46), and permanent second molars (labels 17, 27, 37, 47). We sort these labels as 1, 2, …, 28 in the following analysis. Each of the tooth emergence times can be taken as a response variable. However, as recorded in the dataset, the emergence times are subject to interval-censoring. Moreover, for the 38 excluded children, the tooth emergence times cannot be observed due to unavailability of the designed dental examinations, yielding a biased sample. The dataset contains 41 covariates: gender (gender, 0 for boy and 1 for girl), province (province, factor with code 0 for Antwerpen, 1 for Vlaams Brabant, 2 for Limburg, 3 for Oost Vlaanderen, 4 for West Vlaanderen), evidence of fluoride intake (fluor, binary with 0 for no and 1 for yes), type of educational system (educ, factor with code 0 for Free, 1 for Community school, 2 for Province/council school), starting age of brushing teeth (startbr, factor with code 1 for [0, 1] years, 2 for (1, 2] years, 3 for (2, 3] years, 4 for (3, 4] years, 5 for (4, 5] years, 6 for later than at the age of 5), an indicator of whether the deciduous tooth with label xx was decayed, missing due to caries, or filled (Txx.DMF, binary with 0 for no and 1 for yes), an indicator of whether the deciduous tooth with label xx was removed for orthodontic reasons (BAD.xx, binary with 0 for no and 1 for yes), and an indicator of whether the deciduous tooth with label xx was removed for orthodontic reasons or decayed at, at the latest, the last examination before the first examination at which the emergence of the permanent successor was recorded (Txx.CAR, binary with 0 for no and 1 for yes), with xx being 53, 63, 73, 83 (deciduous canines), 54, 64, 74, 84 (deciduous first molars), and 55, 65, 75, 85 (deciduous second molars). Except for the covariates gender, province, educ, and startbr, however, the other binary covariates may be contaminated with measurement error (e.g., Küchenhoff et al., 2007). While this dataset has been discussed by several authors (e.g., Fu & Simonoff, 2017; Yao et al., 2019), their approaches ignored the possible impact of biased sampling and assumed the covariates to be precisely measured. To address those concerns, we implement the SIMEXBoost method to analyze this dataset.

Since the dataset has no additional information, such as repeated measurements or validation data, to quantify the degree of measurement error, we conduct sensitivity analyses to investigate how the analysis results are affected by different magnitudes of measurement error. Since the error-prone covariates are binary random variables, we adopt Equation (16) with formula specified as the 2 × 2 matrix formula, with π specified as 0.005 or 0.05, to characterize each error-prone covariate.

To assess the performance of estimation and prediction for each tooth, we adopt H-fold cross-validation. Specifically, for a given tooth j with formula, we randomly split the original data into H roughly equal-sized subsets. For formula, we take the hth subset as the validation data and the remaining formula pooled subsets as the training data; let formula and formula represent the classes of subject indexes for the jth tooth under the hth validation and training datasets, respectively. In our study, we take formula.

For formula and formula, we adopt the estimating function (21) to fit the training data formula, due to potential collinearity in records of different deciduous teeth (e.g., Küchenhoff et al., 2007) and possibly strong effects between deciduous teeth (e.g., Fu & Simonoff, 2017; Aktan et al., 2012). To implement the SIMEXBoost method, we specify formula and formula in Stage 1, and formula, formula, and formula in Stage 2. The corresponding estimator of β under the training data formula is denoted formula. With the estimate formula and the covariates in formula, we obtain the predicted failure time formula for formula. Repeating the procedure H times gives predicted failure times formula for the whole dataset.

For formula, let formula represent the number of interval-censored observations for the jth tooth. Let formula denote the interval-censoring times recorded in the dataset for formula and formula. To assess prediction performance, we follow an idea in Yao et al. (2019) and consider the predicted emergence times that do not fall within the observed emergence interval: we calculate the proportion of predicted emergence times lying outside the censoring intervals, formula, and the average absolute prediction distance, formula, for formula, where formula with formula being the set of subjects whose predicted failure times lie outside the censoring intervals. As commented by Yao et al. (2019), smaller values of formula indicate better prediction of emergence times. In addition, smaller values of formula reflect that more predicted failure times fall within the censoring intervals.
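The two criteria can be computed as in the sketch below. The exact distance definition (distance from a prediction to the nearest endpoint of its censoring interval, averaged over the predictions falling outside) is our assumption based on the description above, since the displayed formulas only define it implicitly.

```python
import numpy as np

def prediction_metrics(pred, left, right):
    """Proportion of predicted times falling outside their censoring
    intervals (L, R], and, among those outside, the average absolute
    distance to the nearest interval endpoint (assumed definition)."""
    pred, left, right = (np.asarray(a, dtype=float)
                         for a in (pred, left, right))
    outside = (pred <= left) | (pred > right)
    prop = outside.mean()
    if outside.any():
        dist = np.minimum(np.abs(pred - left), np.abs(pred - right))
        avg_dist = dist[outside].mean()
    else:
        avg_dist = 0.0
    return prop, avg_dist
```

For example, with intervals (1, 2] and (2, 4], a prediction of 1.5 falls inside the first interval while a prediction of 5.0 falls outside the second, at distance 1 from its nearest endpoint.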

In addition to implementing the proposed method, for comparison we also examine other approaches, including the boosting estimation in Section 3.1 without measurement error correction (naive), SIMEX estimation without variable selection (SIMEX(π) with formula or 0.05), estimation from Equation (9) without variable selection or measurement error correction (LBPIC), and the estimation proposed by Gao et al. (2017) without consideration of length-biased sampling (PIC(Gao et al.)). Moreover, we also examine two interval-censoring-based methods, denoted IC_Bayesian and IC_Par, which can be implemented with the R package icenReg. Through these comparisons, we wish to explore the impacts of length-biased sampling, variable selection, and measurement error correction on the analysis. Numerical results for the 28 teeth are summarized in Table 1. Owing to the smallest values of formula and formula being derived by the SIMEXBoost method, we find that the prediction obtained by the proposed method is generally more precise than that of the other methods, which do not take noisy features into account. Among all teeth, the numerical results look similar, and the values of formula and formula are comparable regardless of the magnitude of measurement error, suggesting that the proposed method provides stable results. On the other hand, among the approaches that ignore variable selection or measurement error, we find that SIMEX(π) sometimes produces larger values of formula and formula than the naive method, suggesting that ignoring variable selection may yield worse prediction than ignoring measurement error correction. Finally, comparing LBPIC with IC_Bayesian, IC_Par, and PIC(Gao et al.), we find that the values of formula and formula derived by IC_Bayesian, IC_Par, and PIC(Gao et al.) are clearly larger than those from LBPIC. This indicates that, when variable selection and measurement error are not accommodated, length-biased effects indeed affect the estimation results.

Table 1

Evaluation on the Signal Tandmobiel® study datasets. Each cell reports the proportion of predicted emergence times lying outside the censoring intervals / the average absolute prediction distance; π1 and π2 denote the two values of π considered for SIMEXBoost and SIMEX (see the note below the table).

Tooth | Naive | SIMEXBoost(π1) | SIMEXBoost(π2) | SIMEX(π1) | SIMEX(π2) | LBPIC | PIC(Gao et al.) | IC_Bayesian | IC_Par
11 | 0.950/0.455 | 0.939/0.451 | 0.942/0.452 | 0.972/0.511 | 0.971/0.513 | 0.969/0.513 | 0.983/0.528 | 0.997/0.809 | 0.998/0.807
21 | 0.948/0.462 | 0.947/0.452 | 0.943/0.451 | 0.967/0.503 | 0.971/0.506 | 0.973/0.513 | 0.983/0.521 | 0.999/0.831 | 0.999/0.830
31 | 0.951/0.435 | 0.937/0.423 | 0.935/0.432 | 0.960/0.498 | 0.954/0.501 | 0.958/0.509 | 0.989/0.542 | 0.997/38.340 | 1.000/32.808
41 | 0.941/0.442 | 0.937/0.445 | 0.946/0.435 | 0.967/0.506 | 0.969/0.514 | 0.955/0.506 | 0.973/0.579 | 0.990/418.480 | 0.990/418.484
12 | 0.961/0.479 | 0.959/0.467 | 0.960/0.467 | 0.969/0.521 | 0.967/0.525 | 0.969/0.520 | 0.974/0.526 | 1.000/0.809 | 1.000/0.809
22 | 0.962/0.470 | 0.960/0.462 | 0.963/0.466 | 0.971/0.523 | 0.970/0.524 | 0.971/0.528 | 0.975/0.552 | 0.999/0.785 | 0.999/0.785
32 | 0.951/0.462 | 0.952/0.459 | 0.943/0.464 | 0.966/0.511 | 0.968/0.512 | 0.976/0.513 | 0.984/0.549 | 1.000/0.873 | 1.000/0.873
42 | 0.953/0.470 | 0.947/0.464 | 0.944/0.463 | 0.969/0.519 | 0.968/0.521 | 0.972/0.517 | 0.975/0.556 | 0.999/0.825 | 0.999/0.825
13 | 0.972/0.521 | 0.971/0.507 | 0.967/0.509 | 0.971/0.509 | 0.969/0.511 | 0.978/0.519 | 0.982/0.535 | 1.000/0.877 | 1.000/0.877
23 | 0.972/0.497 | 0.969/0.494 | 0.970/0.496 | 0.972/0.507 | 0.971/0.516 | 0.963/0.515 | 0.975/0.515 | 1.000/0.873 | 1.000/0.873
33 | 0.964/0.519 | 0.962/0.506 | 0.963/0.509 | 0.966/0.513 | 0.973/0.516 | 0.970/0.509 | 0.979/0.549 | 1.000/0.820 | 1.000/0.820
43 | 0.965/0.499 | 0.963/0.499 | 0.963/0.494 | 0.964/0.505 | 0.967/0.505 | 0.969/0.510 | 0.972/0.772 | 0.990/0.762 | 0.990/0.762
14 | 0.969/0.509 | 0.968/0.490 | 0.963/0.494 | 0.969/0.502 | 0.966/0.504 | 0.978/0.505 | 0.981/0.691 | 0.999/0.829 | 0.999/0.829
24 | 0.968/0.503 | 0.954/0.494 | 0.954/0.495 | 0.962/0.503 | 0.963/0.519 | 0.980/0.541 | 0.984/0.644 | 1.000/0.811 | 1.000/0.810
34 | 0.965/0.514 | 0.961/0.511 | 0.962/0.508 | 0.966/0.506 | 0.966/0.514 | 0.961/0.551 | 0.993/0.756 | 1.000/0.868 | 1.000/0.868
44 | 0.966/0.512 | 0.961/0.497 | 0.960/0.501 | 0.969/0.507 | 0.968/0.509 | 0.964/0.506 | 0.970/0.796 | 1.000/0.871 | 1.000/0.871
15 | 0.971/0.515 | 0.960/0.497 | 0.960/0.502 | 0.965/0.506 | 0.967/0.489 | 0.970/0.553 | 0.976/0.731 | 1.000/0.860 | 1.000/0.860
25 | 0.965/0.520 | 0.962/0.500 | 0.960/0.494 | 0.965/0.522 | 0.964/0.521 | 0.965/0.512 | 0.973/0.566 | 1.000/0.814 | 1.000/0.816
35 | 0.974/0.516 | 0.962/0.495 | 0.962/0.501 | 0.974/0.512 | 0.976/0.513 | 0.962/0.503 | 0.985/0.756 | 1.000/0.877 | 1.000/0.877
45 | 0.973/0.513 | 0.963/0.492 | 0.961/0.488 | 0.970/0.515 | 0.971/0.517 | 0.967/0.517 | 0.975/0.740 | 1.000/0.880 | 1.000/0.879
16 | 0.958/0.452 | 0.947/0.448 | 0.943/0.439 | 0.960/0.503 | 0.965/0.503 | 0.962/0.504 | 0.965/0.537 | 0.987/0.884 | 0.987/0.880
26 | 0.941/0.465 | 0.938/0.458 | 0.940/0.462 | 0.964/0.502 | 0.966/0.504 | 0.969/0.502 | 0.974/0.554 | 0.994/1.650 | 0.992/1.649
36 | 0.953/0.453 | 0.943/0.457 | 0.945/0.460 | 0.966/0.503 | 0.961/0.510 | 0.965/0.508 | 0.966/0.508 | 0.994/0.894 | 0.994/0.894
46 | 0.950/0.459 | 0.950/0.459 | 0.948/0.457 | 0.968/0.513 | 0.963/0.521 | 0.972/0.523 | 0.973/0.618 | 0.996/0.819 | 0.998/0.811
17 | 0.980/0.517 | 0.960/0.506 | 0.957/0.495 | 0.963/0.505 | 0.963/0.504 | 0.963/0.509 | 0.980/0.538 | 1.000/0.910 | 1.000/0.910
27 | 0.975/0.518 | 0.952/0.506 | 0.950/0.497 | 0.965/0.511 | 0.963/0.513 | 0.979/0.522 | 0.983/0.528 | 1.000/0.896 | 1.000/0.896
37 | 0.977/0.519 | 0.973/0.502 | 0.975/0.511 | 0.975/0.503 | 0.979/0.504 | 0.979/0.509 | 0.983/0.540 | 1.000/0.835 | 1.000/0.839
47 | 0.980/0.516 | 0.967/0.505 | 0.966/0.490 | 0.971/0.505 | 0.976/0.503 | 0.980/0.515 | 0.985/0.532 | 1.000/0.879 | 1.000/0.879

Note: “Naive” is the naive method adopting the algorithm in Section 3.1 without measurement error correction. “SIMEXBoost(π)” is the proposed SIMEXBoost method with formula or 0.005. “SIMEX(π)” is the SIMEX method with formula or 0.005, without the boosting method for variable selection. “LBPIC” is the implementation of Equation (12) without measurement error correction and variable selection. “PIC(Gao et al.)” gives the estimation without consideration of length-biased sampling, measurement error correction, and variable selection. “IC_Bayesian” and “IC_Par” are the IC-based methods implemented in the R package icenReg.


Based on the predicted failure times formula, we estimate the survivor curves formula for each tooth formula. Owing to limited space, we show the estimated survivor curves for label 11 based on the proposed method and its competitors in Figure 1; the other figures can be found in the supporting information. In general, with measurement error correction, the proposed method produces close curves regardless of the magnitude of measurement error effects, which suggests that it provides robust results. We observe that the estimated survivor function based on the proposed method shows a slightly higher probability of permanent tooth emergence than that based on the naive method without measurement error correction. In addition, the estimated curves obtained by the proposed method lie above the curves with all covariates accommodated, which might be caused by the involvement of non-informative covariates. It is also interesting that the curve determined by PIC(Gao et al.) is noticeably higher than the others, while the two IC-based methods (IC_Bayesian and IC_Par) give the lowest curves, which might reflect the impacts of ignoring biased sampling and of the different methodologies.
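The survivor curves here are estimated from the predicted failure times. A minimal empirical version, which ignores the paper's model-based refinements and is purely illustrative, simply evaluates the fraction of predicted times exceeding each point on a time grid:

```python
import numpy as np

def empirical_survivor(pred_times, grid):
    """Empirical survivor estimate S(t) = proportion of predicted
    failure times exceeding t, evaluated on each grid point.
    (A simplified stand-in for the paper's survivor estimator.)"""
    pred = np.asarray(pred_times, dtype=float)
    return np.array([(pred > t).mean() for t in grid])
```

For instance, predicted times (1, 2, 3, 4) evaluated at t = 0, 2.5, 5 yield S = 1.0, 0.5, 0.0, a step function decreasing from one to zero.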

Figure 1

Survivor curves for permanent incisors with label 11. This figure appears in color in the electronic version of this paper, and any mention of color refers to that version.

Finally, to assess the computational complexity of the estimation methods, we examine the computation time using the R function proc.time() to record the CPU time (in seconds). The runtimes are based on an Intel(R) Core(TM) i5-8250U CPU @ 1.60 GHz. Taking tooth label 11 as an example, with the SIMEX procedure accommodated, SIMEXBoost(π) and SIMEX(π) require approximately 5382.533 and 5400.742 s under formula, or 5356.53 and 5485.201 s under formula, to derive the estimators, respectively. Unsurprisingly, due to the involvement of B and formula in Stage 2, the computation time of measurement error correction by the SIMEX method is longer than that of methods without measurement error correction, such as naive (59.519 s), LBPIC (60.094 s), PIC(Gao et al.) (439.732 s), IC_Bayesian (178.810 s), and IC_Par (76.900 s). Consequently, we comment that the SIMEXBoost method provides a precise estimator and outperforms the other methods, but longer computation time is the price one should pay.

5 Summary and Further Remarks

In this paper, we analyze survival data with irrelevant and error-prone covariates accommodated, where responses are biased and incomplete due to length-biased sampling and partly interval-censoring. To tackle these challenges, we first adopt the Buckley–James formulation to develop the estimating function based on the AFT model, addressing length-biased sampling as well as PIC. To deal with variable selection, we develop the SIMEXBoost algorithm, which employs the SIMEX strategy and the boosting procedure to correct for measurement error and perform variable selection simultaneously. Through the simulation studies in the supporting information, we find that the proposed method successfully eliminates measurement error effects and correctly retains informative covariates. To examine the robustness of the proposed method and the validity of Equation (7) for adjusting the censoring effects, we also examine different percentages of censoring. Numerical results in the supporting information show that biases, variances, and mean-squared errors are stable and do not differ significantly across censoring rates. In comparison with the proposed method, we find that, without measurement error correction, simply employing the boosting method may falsely retain irrelevant covariates. Compared with Gao et al. (2017), who considered AFT models for PIC data, and with the IC_Bayesian and IC_Par methods, which treat PIC data as IC data, we observe that ignoring the length-biased sampling effect may induce biases in the estimators. To assess the performance of the estimator with the availability of auxiliary information, we conduct a series of simulation settings and, owing to the limited space of the main text, summarize the numerical results in Appendix B.5 of the supporting information. In general, under estimation from repeated measurements or validation data, the proposed method still produces satisfactory estimators.
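To make the boosting step concrete, the following sketch follows the EEBoost idea of Wolfson (2011), with an ordinary least-squares score standing in for the paper's estimating function for length-biased PIC data; the step size, iteration count, and example data are our illustrative choices.

```python
import numpy as np

def eeboost(X, y, n_iter=200, eps=0.01):
    """EEBoost-style updates: at each iteration, nudge the single
    coefficient whose estimating-function component is largest in
    magnitude. Coefficients never touched stay exactly zero, which
    is what yields variable selection."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        U = X.T @ (y - X @ beta) / n    # least-squares estimating function
        j = np.argmax(np.abs(U))        # most informative coordinate
        beta[j] += eps * np.sign(U[j])  # small step in its direction
    return beta
```

On data generated as y = 2 x1 with the remaining covariates irrelevant, only the first coefficient moves appreciably while the rest remain at or near zero.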
Finally, to show the flexibility of the proposed method, we examine the scenario formula, which corresponds to conventional interval-censored data. According to the findings in Appendix B.6 of the supporting information, the proposed method remains valid for measurement error correction as well as variable selection. Although the proposed SIMEXBoost method performs best among the competing methods considered in this paper, as summarized in Section 4 and in the simulation studies in the supporting information, SIMEXBoost requires longer computation time, owing to the number of iterations I of the boosting procedure as well as the value B and the sequence formula of the SIMEX method. To reduce the computation time, one may require more powerful computational tools or more efficient implementations. While we have not derived theoretical results in the current development, we examine the normality of the estimators. In the simulation studies in Appendix B.2 of the supporting information, we apply the Shapiro–Wilk normality test as well as QQ plots to the proposed estimator under repeated simulations and find that the proposed estimator formula appears to follow a normal distribution. A detailed description can be found in Appendix B.2 of the supporting information.
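The roles of B and the ζ-grid in the SIMEX step can be illustrated on a toy linear model with additive measurement error; the quadratic extrapolant and all numbers below are illustrative choices for this sketch, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setup: y depends on the true covariate X, but only the error-prone
# surrogate W = X + e (with known error scale sigma_e) is observed.
n, beta_true, sigma_e = 2000, 1.5, 0.8
X = rng.normal(size=n)
y = beta_true * X + rng.normal(scale=0.3, size=n)
W = X + rng.normal(scale=sigma_e, size=n)

def ols_slope(w, y):
    return np.cov(w, y, bias=True)[0, 1] / np.var(w)

# SIMEX: for each zeta, add extra noise of variance zeta * sigma_e^2,
# average the estimate over B replicates, then extrapolate the fitted
# trend back to zeta = -1 (the error-free case).
zetas = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
B = 200
est = [np.mean([ols_slope(W + np.sqrt(z) * rng.normal(scale=sigma_e, size=n), y)
                for _ in range(B)]) for z in zetas]
coef = np.polyfit(zetas, est, 2)      # quadratic extrapolant in zeta
beta_simex = np.polyval(coef, -1.0)   # SIMEX estimate at zeta = -1
```

The naive slope (ζ = 0) is attenuated toward zero by the measurement error, while the extrapolated value moves back toward the true coefficient, which is the effect the SIMEX stage exploits.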

The current development in this paper has some potential extensions. First, we primarily focus on continuous and binary covariates, since they can be modeled by the measurement error models (14) and (15) in the current development, which are also frequently considered in measurement error analysis. The proposed method can be extended to handle error-prone ordinal or count data, with the mismeasurement adjusted by the same strategy as the generation of the working data (17), provided that the corresponding measurement error models are well established. This issue should be carefully explored in our future work. Second, we primarily consider length-biased sampling in this paper, which specifies the truncation time to follow a uniform distribution. As commented by a referee, testing the length-biased assumption and checking relevant conditions are crucial issues. For example, one can check whether the incidence of disease onset follows a stationary Poisson process and whether the truncation time follows other distributions. In addition, one can test whether the truncation time is independent of the failure time (e.g., Tsai, 1990). While those discussions are interesting and important, they require careful exploration in our setting because of the involvement of measurement error and the interval-censoring mechanism. These extensions can be our future research topics.

Data Availability Statement

The dataset of the Signal Tandmobiel Study, named tandmob2.RData, is available in the corresponding author's GitHub repository; the link is provided in the supporting information.

Acknowledgments

The authors would like to thank the editorial team for useful comments that improved the initial manuscript. The authors especially thank Dr. Zhong-Lin Tsai, a dentist at the Department of Dentistry, Wan Fang Hospital, Taipei Medical University, for sharing dentistry knowledge that enhanced the background of the real-data analysis. Chen's research was supported by the National Science and Technology Council (grant 110-2118-M-004-006-MY2).

References

Aktan, A.M., Kara, I., Sener, I., Bereket, C., Celik, S., Kirtay, M., Ciftci, M.E. & Arici, N. (2012) An evaluation of factors associated with persistent primary teeth. European Journal of Orthodontics, 34, 208–212.

Brown, B., Miller, C.J. & Wolfson, J. (2017) ThrEEBoost: thresholded boosting for variable selection and prediction via estimating equations. Journal of Computational and Graphical Statistics, 26, 579–588.

Buckley, J. & James, I. (1979) Linear regression with censored data. Biometrika, 66, 429–436.

Cai, T. & Betensky, R.A. (2003) Hazard regression for interval-censored data with penalized spline. Biometrics, 59, 570–579.

Carroll, R.J., Ruppert, D., Stefanski, L.A. & Crainiceanu, C.M. (2006) Measurement error in nonlinear models. Boca Raton, FL: Chapman and Hall.

Chen, L.-P. (2019) Semiparametric estimation for cure survival model with left-truncated and right-censored data and covariate measurement error. Statistics and Probability Letters, 154, 108547.

Chen, L.-P. (2020) Semiparametric estimation for the transformation model with length-biased data and covariate measurement error. Journal of Statistical Computation and Simulation, 90, 420–442.

Chen, L.-P. (2021) Variable selection and estimation for the additive hazards model subject to left-truncation, right-censoring and measurement error in covariates. Journal of Statistical Computation and Simulation, 90, 3261–3300.

Chen, L.-P. & Yi, G.Y. (2020) Model selection and model averaging for analysis of truncated and censored data with measurement error. Electronic Journal of Statistics, 14, 4054–4109.

Chen, L.-P. & Yi, G.Y. (2021a) Semiparametric methods for left-truncated and right-censored survival data with covariate measurement error. Annals of the Institute of Statistical Mathematics, 73, 481–517.

Chen, L.-P. & Yi, G.Y. (2021b) Analysis of noisy survival data with graphical proportional hazards measurement error models. Biometrics, 77, 956–969.

Du, M. & Sun, J. (2021a) Statistical analysis of interval-censored failure time data. Chinese Journal of Applied Probability and Statistics, 37, 627–654.

Du, M. & Sun, J. (2021b) Variable selection for interval-censored failure time data. International Statistical Review, 1–23.

Du, M., Zhao, H. & Sun, J. (2021) A unified approach to variable selection for Cox's proportional hazards model with interval-censored failure time data. Statistical Methods in Medical Research, 30, 1833–1849.

Fu, W. & Simonoff, J.S. (2017) Survival trees for interval-censored survival data. Statistics in Medicine, 36, 4831–4842.

Gao, F., Zeng, D. & Lin, D.Y. (2017) Semiparametric estimation of the accelerated failure time model with partly interval-censored data. Biometrics, 73, 1161–1168.

Gao, F. & Chan, K.C.G. (2019) Semiparametric regression analysis of length-biased interval-censored data. Biometrics, 75, 121–132.

Huang, J. (1999) Asymptotic properties of nonparametric estimation based on partly interval-censored data. Statistica Sinica, 9, 501–519.

Kim, J.S. (2003) Maximum likelihood estimation for the proportional hazards model with partly interval-censored data. Journal of the Royal Statistical Society, Series B, 65, 489–502.

Küchenhoff, H., Lederer, W. & Lesaffre, E. (2007) Asymptotic variance estimation for the misclassification SIMEX. Computational Statistics & Data Analysis, 51, 6197–6211.

Lawless, J.F. (2003) Statistical models and methods for lifetime data. New York: Wiley.

Mandal, S., Wang, S. & Sinha, S. (2019) Analysis of linear transformation models with covariate measurement error and interval censoring. Statistics in Medicine, 38, 4642–4655.

Ning, J., Qin, J. & Shen, Y. (2011) Buckley–James-type estimator with right-censored and length-biased data. Biometrics, 67, 1369–1378.

Pan, C., Cai, B. & Wang, L. (2020) A Bayesian approach for analyzing partly interval-censored data under the proportional hazards model. Statistical Methods in Medical Research, 29, 3192–3204.

Pan, W. & Chappell, R. (2002) Estimation in the Cox proportional hazards model with left-truncated and interval-censored data. Biometrics, 58, 64–70.

Scolas, S., Ghouch, A.E., Legrand, C. & Oulhaj, A. (2016) Variable selection in a flexible parametric mixture cure model with interval-censored data. Statistics in Medicine, 35, 1210–1225.

Shen, P.-S., Peng, Y., Chen, H.-J. & Chen, C.-M. (2022) Maximum likelihood estimation for length-biased and interval-censored data with a nonsusceptible fraction. Lifetime Data Analysis, 28, 68–88.

Song, X. & Ma, S. (2008) Multiple augmentation for interval-censored data with measurement error. Statistics in Medicine, 27, 3178–3190.

Sun, L., Li, S., Wang, L. & Song, X. (2021) Simultaneous variable selection in regression analysis of multivariate interval-censored data. Biometrics, 1–12.

Tsai, W.-Y. (1990) Testing the assumption of independence of truncation time and failure time. Biometrika, 77, 169–177.

Wang, L., McMahan, C.S., Hudgens, M.G. & Qureshi, Z.P. (2016) A flexible, computationally efficient method for fitting the proportional hazards model to interval-censored data. Biometrics, 72, 222–231.

Wang, P., Li, D. & Sun, J. (2021) A pairwise pseudo-likelihood approach for left-truncated and interval-censored data under the Cox model. Biometrics, 77, 1303–1314.

Wen, C.-C. & Chen, Y.-H. (2014) Functional inference for interval-censored data in proportional odds model with covariate measurement error. Statistica Sinica, 24, 1301–1317.

Wolfson, J. (2011) EEBoost: a general method for prediction and variable selection based on estimating equations. Journal of the American Statistical Association, 106, 296–305.

Yao, W., Frydman, H. & Simonoff, J.S. (2019) An ensemble method for interval-censored time-to-event data. Biostatistics, 22, 198–213.

Yavuz, A.Ç. & Lambert, P. (2011) Smooth estimation of survival functions and hazard ratios from interval-censored data using Bayesian penalized B-splines. Statistics in Medicine, 30, 75–90.

Zhao, H., Wu, Q., Li, G. & Sun, J. (2020) Simultaneous estimation and variable selection for interval-censored data with broken adaptive ridge regression. Journal of the American Statistical Association, 115, 204–216.

Zhao, X., Zhao, Q., Sun, J. & Kim, J.S. (2008) Generalized log-rank tests for partly interval-censored failure time data. Biometrical Journal, 50, 375–385.

Zou, H. & Hastie, T. (2005) Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B, 67, 301–320.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://dbpia.nl.go.kr/journals/pages/open_access/funder_policies/chorus/standard_publication_model)

Supplementary data