Abstract

Statistical analysis of longitudinal data often involves modeling treatment effects on clinically relevant longitudinal biomarkers since an initial event (the time origin). In some studies including preventive HIV vaccine efficacy trials, some participants have biomarkers measured starting at the time origin, whereas others have biomarkers measured starting later with the time origin unknown. The semiparametric additive time-varying coefficient model is investigated where the effects of some covariates vary nonparametrically with time while the effects of others remain constant. Weighted profile least squares estimators coupled with kernel smoothing are developed. The method uses the expectation maximization approach to deal with the censored time origin. The Kaplan–Meier estimator and other failure time regression models such as the Cox model can be utilized to estimate the distribution and the conditional distribution of left censored event time related to the censored time origin. Asymptotic properties of the parametric and nonparametric estimators and consistent asymptotic variance estimators are derived. A two-stage estimation procedure for choosing weight is proposed to improve estimation efficiency. Numerical simulations are conducted to examine finite sample properties of the proposed estimators. The simulation results show that the theory and methods work well. The efficiency gain of the two-stage estimation procedure depends on the distribution of the longitudinal error processes. The method is applied to analyze data from the Merck 023/HVTN 502 Step HIV vaccine study.

1 Introduction

In preventive HIV vaccine efficacy trials, thousands of HIV uninfected volunteers are randomized to receive vaccine or placebo, and are monitored for HIV infection. Participants diagnosed with HIV infection have various endpoints measured longitudinally starting at the date of diagnosis; these endpoints include viral loads and CD4 cell counts as markers of HIV disease progression and secondary transmission. An objective of such trials is to assess the vaccine effect on the biomarkers, and all previous analyses assessed the biomarkers based on the time from HIV diagnosis (Fitzgerald et al., 2011; Rerks-Ngarm et al., 2013; Janes et al., 2015). However, it is more biologically meaningful to assess whether vaccination modifies the biomarkers over time since actual HIV acquisition. This assessment is challenging because exact times of HIV acquisition are generally unobtainable; rather data are available only on bounds formula and formula between which the true time origin must lie formula, where, as shown in Figure 1, for example, formula is the date of the last antibody (Ab)-based HIV negative diagnostic test result and formula is the date of the first antibody (Ab)-based HIV positive test result. Given details of the HIV testing algorithm described in the application, some participants have formula, at least approximately, such that formula is considered to be directly observed, whereas other participants have formula and formula is unknown. This set-up occurs in other multi-stage longitudinal studies, as depicted in Figure 1.

Figure 1

Censored time origin formula in the study of a longitudinal response. Based on the HIV testing algorithm, each infected participant was classified into one of two groups, defined by whether the earliest HIV positive sample was (Ab+, PCR+) or (Ab-, PCR+). (a) For participants with the earliest HIV positive sample (Ab+, PCR+), formula is left censored by formula; (b) For participants with the earliest HIV positive sample (Ab-, PCR+), formula and formula is observed

In Figure 1, for each participant formula, formula is the gap time between the true time origin formula and the time formula when longitudinal markers begin to be measured. The time lapse between formula and formula is formula. For participant i, the longitudinal markers are measured at times formula, where formula is the time between formula and the jth marker measurement, for formula. In the HIV vaccine study, most participants have first measurement time formula at formula; some participants have formula and others have formula. Formally, we write formula and formula such that formula is left censored by formula with censoring indicator formula; formula is observed if formula and formula if formula. The time from formula to the jth sampling time is formula. The time origin is considered censored because formula is left censored.

Semiparametric regression models for longitudinal data have been intensively studied; see Lin and Ying (2001), Hu et al. (2004), Qu and Li (2006), Fan et al. (2007) and Sun et al. (2013), among others. However, to the best of our knowledge, none of these methods address the problem in which the time origin may be censored. In this paper, we study the semiparametric additive model with time-varying effects for longitudinal data with censored time origin. Weighted profile least squares estimators are developed for the unknown parameters as well as for the nonparametric coefficient functions. The expectation maximization approach is utilized to deal with the censored time origin. The proposed method does not assume any specific model for the sampling times, and thus avoids misspecification of the sampling models. The method is applied to investigate the effect of an HIV vaccine on viral load over time since actual HIV acquisition in the Merck 023/HVTN 502 Step study. Asymptotic properties of the parametric and nonparametric estimators and consistent asymptotic variance estimators are derived. Our numerical study shows that the proposed methods work well, with satisfactory finite sample properties.

The rest of the paper is organized as follows. Section 2 introduces the semiparametric additive model and develops the estimation method. Preliminaries on the data structure and model assumptions are given in Section 2.1. The weighted profile least squares estimators coupled with kernel smoothing and the EM algorithm are developed in Section 2.2. Computational issues concerning estimation at the boundaries and weight selection are discussed in Section 2.3. In Section 3, we establish asymptotic properties of the nonparametric and parametric estimators, and in Section 4 we study their finite sample performance in simulations. The proposed method is applied to the Step trial in Section 5. Concluding remarks are given in Section 6. The proofs of the asymptotic results, additional simulations and data analysis, and the discussion of bandwidth choice are placed in Web Appendices available in the Supporting Information of this article.

2 Profile Weighted Least Squares Estimation through EM Algorithm

2.1 Preliminaries

Suppose that there is a random sample of n participants. For participant i, let formula be the response process and let formula and formula be possibly time-dependent covariates of dimensions formula and formula, respectively, where formula is the time since the actual time origin, and τ is the study duration. We consider the following semiparametric additive time-varying coefficients model

(1)

where formula is an unspecified formula vector of smooth regression functions, γ is a formula dimensional vector of parameters, and formula is a mean-zero process. The notation formula represents the transpose of a vector or matrix x. Specifying the first component of formula as 1 gives a model with a nonparametric baseline response process. The effect of formula is time-varying and modeled nonparametrically, while the effect of formula is time-independent and modeled parametrically.

The observations of formula are taken at time points formula, where formula is the total number of observations from the ith subject. The formula can be written as the sum of two parts formula as shown in Figure 1, where formula is the time from the actual time origin formula of the participant to left censoring by formula, and formula is the time from the right edge formula of the interval for formula to the jth visit for the ith subject, where visit 1 is at formula. Let formula be the end of follow-up time or censoring time for the ith participant since formula. The censoring time formula is allowed to depend on the covariates formula and formula. The responses for the ith participant can only be observed at the time points before formula since the actual time origin.

The sampling times formula may vary among participants. The number of observations taken from the ith participant by time t is formula, where formula is the indicator function. Let formula be the conditional mean rate of the sampling times for participant i at time t defined by formula, formula. The mean rate function is assumed to only depend on formula, which is the part of the covariates formula that affects the potential sampling times. We denote formula, where formula is an unspecified nonnegative smooth function.

Let formula, formula, formula, formula, formula, formula, formula, where formula is a collection of possible auxiliary variables that are not of interest in the modeling of formula but may be useful in predicting the distribution of formula. For the censored time origin, left-censored data formula and formula are observed. The observed data for participant i can be expressed as formula. The observation is formula if formula and formula if formula, where formula. Although exact times formula may be unobtainable, the values formula, formula and formula at formula are known. Assume that formula are independent identically distributed (iid). The observed data are denoted by formula.

We assume that formula and formula are independent conditional on formula, and that the censoring time formula is noninformative in that formula  formula, formula and formula, formula. Let formula. Assume formula, formula  formula, formula, formula, formula, formula. Assume also that formula and formula are independent conditional on formula, formula, and formula. This assumption implies that, conditional on covariate processes, sampling times are noninformative for the longitudinal response.

2.2 Estimation Procedures

When all formula's are observed, estimation procedures such as those in Sun and Wu (2005) and Sun et al. (2013) can be used to analyze model (1). If the unobserved or censored formula's are treated as missing, then formula is not missing at random. The inverse probability weighting of complete-cases method and the augmented inverse probability weighted complete-case method of Robins et al. (1994), which have been successfully adapted in Sun and Gilbert (2012), Sun et al. (2017), Yang et al. (2017) and by many other authors, will not work in this situation. We propose an estimation procedure based on the missing-data principle using the EM algorithm.

The conditional distribution formula of formula equals formula  formula for formula and 1 for formula. Let formula be the estimated conditional distribution of formula. Let formula. The estimation of model (1) can be based on minimizing the following objective function:

(2)

where formula is a nonnegative weight function, and formula is the estimate of the conditional expectation formula, which can be obtained through estimation of formula as we show below.

For ease of presentation, we adopt the notation formula  formula  formula for formula  formula, formula, formula, where formula is a smooth function of formula. The above objective function can be written as

(3)

Note that formula, formula and formula are observed. The conditional expectation formula for participant i with formula equals

(4)

where the last equality holds because formula for formula. Since formula, estimating formula by formula for formula, we have

(5)

The basic idea of estimating the conditional distribution formula is to transform the data set from left censored to right censored. Assume that formula is bounded by a predetermined constant L. This is reasonable since, for the application concerned here, formula is less than the time interval between two consecutive testing times, which is usually between 3 and 6 months. The distribution of formula based on the left censored data can be estimated by the methods developed for right censored data through the transformation formula. The Kaplan–Meier estimator can be used to estimate the distribution of formula when formula is independent of formula. Otherwise, a failure time regression model such as the Cox model can be used to estimate the conditional distribution formula.
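The reflection idea can be sketched as follows, under the assumption that the transformation subtracts each value from the upper bound L, which turns left censoring into right censoring. The helper names `kaplan_meier` and `left_censored_cdf` are hypothetical, and the product-limit estimator is written out by hand so that no particular survival library is assumed.

```python
import numpy as np

def kaplan_meier(times, events):
    """Product-limit estimator: distinct event times and S(t) just after each."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=bool)
    event_times = np.unique(times[events])
    surv, s = [], 1.0
    for u in event_times:
        at_risk = np.sum(times >= u)              # still under observation at u
        deaths = np.sum((times == u) & events)    # events exactly at u
        s *= 1.0 - deaths / at_risk
        surv.append(s)
    return event_times, np.array(surv)

def left_censored_cdf(x, observed, upper, t):
    """Estimate P(T <= t) from left-censored data by reflecting about `upper`.

    x:        observed values (the true value when observed, else the censoring value)
    observed: True when the true value is seen, False when only T <= x is known
    upper:    assumed known bound L with T <= L
    """
    reflected = upper - np.asarray(x, dtype=float)   # left censoring becomes right censoring
    event_times, surv = kaplan_meier(reflected, observed)
    # P(T <= t) = P(upper - T >= upper - t), the left limit of S* at upper - t.
    idx = np.searchsorted(event_times, upper - t, side="left")
    return 1.0 if idx == 0 else float(surv[idx - 1])

# Without censoring the estimate reduces to the empirical distribution function.
x = np.array([0.1, 0.2, 0.3, 0.4])
observed = np.array([True, True, True, True])
p = left_censored_cdf(x, observed, upper=1.0, t=0.25)
```

When the gap time depends on covariates, the hand-written product-limit step would be replaced by a regression fit such as the Cox model, as the text describes.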

Next, we present an estimation procedure that estimates the nonparametric component formula with the kernel smoothing and the parametric component γ with the profile weighted least squares method. Let formula be a symmetric kernel function with compact support. For fixed γ and at time t, we estimate formula by minimizing the following objective function with respect to β:

(6)

where formula and h is the bandwidth depending on n.

Taking the derivative of formula with respect to β for a fixed γ yields

(7)

which leads to the following estimating function

(8)

Let formula  formula  formula  formula, and define formula and formula similarly by replacing formula with formula, and formula with formula, respectively. Solving formula for fixed γ and t yields formula, where formula and formula.
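The closed-form local solution for fixed γ can be illustrated with a simplified sketch. It assumes that all time origins are observed and that the weight is 1, so the conditional expectation step is omitted and the estimator reduces to kernel-weighted least squares on the partial residuals. The names `epanechnikov` and `local_constant_beta` and the toy data are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel, supported on [-1, 1]."""
    u = np.asarray(u, dtype=float)
    return 0.75 * (1.0 - u**2) * (np.abs(u) <= 1.0)

def local_constant_beta(t, times, y, x, z, gamma, h):
    """Local constant estimate of beta(t) for a fixed gamma.

    times: pooled measurement times since the time origin, shape (m,)
    y:     responses, shape (m,)
    x, z:  covariates with time-varying and constant effects, shapes (m, p) and (m, q)
    h:     bandwidth
    """
    weights = epanechnikov((times - t) / h) / h     # K_h(t_ij - t)
    partial_resid = y - z @ gamma                   # remove the parametric part
    lhs = x.T @ (weights[:, None] * x)              # kernel-weighted Gram matrix
    rhs = x.T @ (weights * partial_resid)
    return np.linalg.solve(lhs, rhs)

# Noise-free toy data with constant coefficients (1, 2) and gamma = 0.5.
rng = np.random.default_rng(0)
m = 400
times = rng.uniform(0.0, 1.0, m)
x = np.column_stack([np.ones(m), rng.uniform(0.0, 1.0, m)])
z = rng.integers(0, 2, m).astype(float)[:, None]
gamma = np.array([0.5])
y = x @ np.array([1.0, 2.0]) + z @ gamma
beta_hat = local_constant_beta(0.5, times, y, x, z, gamma, h=0.3)
```

Because the toy data are noise-free with constant coefficients, the local fit recovers them up to rounding error; with censored time origins, each term would be replaced by its estimated conditional expectation.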

Replacing formula by formula in (3) and taking derivative with respect to γ, we obtain the profile estimating function for γ:

(9)

where formula is taken as a subinterval of [0, τ] to avoid boundary problems in the theoretical justifications. In practice, formula can be taken to be very close to [0, τ]. Equation formula can be solved explicitly yielding the weighted profile estimator formula, where

(10)

The local profile estimator of formula is given by formula.

2.3 Computational Issues and the Weight Selection

Our estimation procedure uses local smoothing or the local constant method for estimating formula. It is known that the local linear estimation technique, cf. Fan and Gijbels (1996), can improve the performance of estimation at boundaries. For the estimation at the inner points, the local linear and local constant estimators are equivalent, with the same asymptotic distributions. As shown in Fan and Gijbels (1996), the boundary effects from the local constant estimator can be reduced by applying the equivalent kernel of the local linear approach.

Following Fan and Gijbels (1996, sections 2.3.1 and 3.2.2), to reduce the estimation bias for formula at boundary points, for example, formula, we replace formula by the equivalent kernel to the local linear fit modified for the time-varying coefficient model for longitudinal data, defined by formula, where formula. The equivalent kernel formula is a kernel up to a normalizing constant satisfying the finite sample condition formula. This feature reduces bias at boundary points in the same way that a symmetric kernel does at interior points. Because kernel smoothing is simpler and faster, we suggest estimating formula using the equivalent kernel for the boundary time points, while using the standard kernel for the interior time points. Our simulations show that this adjustment works well.

We proposed a weighted profile least squares estimation method coupled with the EM approach for the semiparametric additive time-varying coefficient model for longitudinal data starting from a possibly censored relevant event. The proposed estimators are consistent and asymptotically normal as long as the weight process formula converges in probability to a deterministic function formula. The weight can be selected to improve estimation efficiency, although it is often conveniently taken to be 1. Lin and Carroll (2000) showed that the most efficient estimation of the nonparametric component formula can be achieved by ignoring the within-subject correlation. However, more efficient estimation for the parametric component γ is obtained by using the inverse of true covariance matrix of the longitudinal responses (Lin and Carroll, 2001; Wang et al., 2005). In a simplified situation where the error processes formula are uncorrelated at different times, the optimal weight is inversely proportional to the conditional variance formula of the error process formula (Bickel et al., 1993; Sun et al., 2013; Qi et al., 2017).

We investigated a two-stage estimation procedure for choosing the weight within the framework of the marginal approach that ignores the within-subject correlation. In the first stage, the unit weight function is used to obtain formula and formula. If formula does not depend on formula, then formula can be consistently estimated by formulaformula  formula  formula  formula  formula  formula, where formula is the residual of the first stage estimation. In the second stage, the updated estimators formula and formula are obtained by choosing the weight formula. When formula depends on formula, the optimal weight formula can be estimated using a multivariate kernel estimator (Qi et al., 2017, Web Appendix C). A simulation study is conducted in Section 4.2 to investigate the efficiency of the two-stage estimation procedure.
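The second-stage weight can be sketched as follows. The exact variance estimator is not reproduced in the text above, so this sketch assumes a Nadaraya-Watson kernel smoother of the squared first-stage residuals over time and sets the weight to the inverse of the smoothed variance. The function names and the bandwidth are hypothetical.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel, supported on [-1, 1]."""
    u = np.asarray(u, dtype=float)
    return 0.75 * (1.0 - u**2) * (np.abs(u) <= 1.0)

def smoothed_error_variance(t, times, residuals, h):
    """Kernel-smoothed squared residuals at time t (assumed estimator)."""
    w = epanechnikov((times - t) / h)
    total = w.sum()
    if total <= 0:
        raise ValueError("no observations within one bandwidth of t")
    return float(np.sum(w * residuals**2) / total)

def second_stage_weights(times, residuals, h):
    """Inverse-variance weights evaluated at each measurement time."""
    return np.array(
        [1.0 / smoothed_error_variance(t, times, residuals, h) for t in times]
    )

# Constant residuals of size 2 imply variance 4 and weight 0.25 everywhere.
times = np.linspace(0.0, 1.0, 11)
residuals = np.full(11, 2.0)
weights = second_stage_weights(times, residuals, h=0.3)
```

The resulting weights would then be used in place of the unit weight when the profile estimators are recomputed in the second stage.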

3 Asymptotic Properties

In this section, we present the asymptotic properties of the proposed estimators. Define formula  formula, formula, and formula  formula, where formula. Let formula and formula. Let γ0 and formula be the true values of γ and formula under model (1), respectively. In addition to the conditional independence assumptions and noninformative censoring assumptions stated in Section 2.1, more regularity conditions are given in Condition A in Web Appendix A.

Ying (1989) showed that the consistency and weak convergence of the Kaplan–Meier estimator for the distribution of formula can be extended to the whole line under Condition (A.7). By Lemma 2 presented in Web Appendix B, formula  formula uniformly in formula. Similar asymptotic results hold for formula and formula. It follows that formula and formula converge to formula and formula uniformly in formula, respectively. These results are the basic building blocks for proving the asymptotic results for formula and formula.

Note that formula is the minimizer of formula. In Part (a) of the proof of Theorem 1, we show that formula converges uniformly to a deterministic function of γ that is minimized at formula. The consistency of formula follows by Theorem 5.7 of van der Vaart (1998). By the first-order Taylor expansion of formula at γ0, we have

(11)

where formula is on the line segment between formula and γ0. To prove the asymptotic normality of formula, it is sufficient to prove that formula converges in probability to a nonsingular matrix, and that formula converges in distribution. The convergence of the information matrix can be obtained by applying Lemma 1 in Web Appendix B. We show in Part (b) of the proof of Theorem 1 that

(12)

where henceforth we adopt the notation formula.

The asymptotic properties of formula and formula are summarized as Theorems 1 and 2. Theorems 1 and 2 are proved in Web Appendix B assuming that formula and formula are independent conditional on formula and both are independent of formula. Web Appendix C provides an outline of the proofs when the conditional hazard function of formula depends on covariates through the Cox model. This model assumption is for the convenience of theoretical development using the existing large sample results for the Cox model with right censored data (Andersen and Gill, 1982). Web Appendix C also discusses the possibility of using other failure time regression models to estimate the conditional distribution of the left censored event time.

Theorem 1. Under Condition A, we have

  (a) formula as formula;

  (b) formula  formula as formula, where formula  formula  formula  formula, formula.

The asymptotic variance of formula can be consistently estimated by formula, where

(13)

and formula.

Next we present the asymptotic results for formula. First, we introduce a few quantities to be used in the expression of the asymptotic variance of formula. We define formula to be the expected number of sampling points by time t under possibly censored time origin, denoted by formula. Let formula  formula be the natural filtration. Then the intensity formula of formula is given by formula. Hence formula is a formula martingale, with predictable variation process formula.

Theorem 2. Under Condition A, we have

  (a) formula converges to formula in probability uniformly in formula as formula;

  (b) For each formula, formula as formula, where formula, formula  formula. Here formula, formula.

The covariance matrix of formula can be estimated by formula, where formula. However, consider the approximation

(14)

A more refined variance estimation for formula with higher order accuracy can be based on formula, where

(15)

This variance estimator is used in the simulations and in the real data application.

4 Simulation Study

We conduct a numerical study to examine the finite sample performance of the proposed methods. Data are simulated using the following semiparametric additive model formula:

(16)

where formula, formula, formula, formula is uniformly distributed on [0,1], and formula is a Bernoulli random variable with formula. The error process formula has a normal distribution with mean formula and variance 1 for participant i where formula follows a standard normal distribution.

For participant i, formula is generated from a uniform distribution on (0,0.8). The left censoring time formula is generated from a uniform distribution on formula with a and b adjusted to yield a desired percentage formula of left censoring for formula. The first sampling point is set as formula, and the rest of the formula's are generated from a Poisson process formula with intensity rate formula, where formula, formula, and formula. Let formula be the responses formula at time points formula following model (16). The right censoring time formula is exponentially distributed with the rate parameter adjusted to give a prespecified percentage formula of right censoring (drop-out or administrative censoring at τ) in the time interval [0, τ], which is the probability of formula. We set formula for the time interval formula. The average number of observations in the interval formula per participant is about 4.7. We use the Epanechnikov kernel formula, and take formula for the estimating function (9). The local constant kernel formula is used for time points in formula while the equivalent kernel to local linear smoothing is applied for the boundary points in formula. For added protection against boundary bias effects, we consider formula as boundary points in the calculation instead of formula.

The performance of the estimator for γ is measured through the bias (Bias), the sample standard error of the estimates (SSE), the estimated standard error of formula (ESE), and the coverage probability (CP) of a 95% confidence interval for γ. The performance of the estimator for the jth component formula on the interval formula is evaluated by the square root of average integrated squared error (RASE) defined by

(17)

where formula is the kth estimate of formula for formula and N is the number of simulations.
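The summary measures can be computed along the following lines. This is a minimal sketch with hypothetical function names. It assumes a normal-approximation 95% confidence interval for the coverage probability and approximates the integral in the RASE by the trapezoidal rule on a common time grid without normalizing by the interval length; the exact form of (17) is not reproduced above, so that choice is an assumption.

```python
import numpy as np

def summarize_gamma(estimates, std_errors, truth, z=1.96):
    """Bias, SSE, ESE and CP of a scalar parameter across simulation runs."""
    estimates = np.asarray(estimates, dtype=float)
    std_errors = np.asarray(std_errors, dtype=float)
    covered = np.abs(estimates - truth) <= z * std_errors
    return {
        "Bias": float(estimates.mean() - truth),
        "SSE": float(estimates.std(ddof=1)),   # sample standard error of the estimates
        "ESE": float(std_errors.mean()),       # average estimated standard error
        "CP": float(covered.mean()),           # empirical coverage of the interval
    }

def rase(beta_hat, beta_true, grid):
    """Square root of the average integrated squared error.

    beta_hat:  array of shape (N, len(grid)), one estimated curve per simulation
    beta_true: true curve evaluated on the grid
    """
    sq_err = (np.asarray(beta_hat, dtype=float) - np.asarray(beta_true, dtype=float)) ** 2
    widths = np.diff(grid)
    # Trapezoidal rule applied to each simulated curve.
    ise = np.sum(0.5 * (sq_err[:, 1:] + sq_err[:, :-1]) * widths, axis=1)
    return float(np.sqrt(ise.mean()))

summary = summarize_gamma([1.0, 1.2, 0.8], [0.2, 0.2, 0.2], truth=1.0)
grid = np.linspace(0.0, 1.0, 51)
curve = np.sin(grid)
error = rase(np.tile(curve + 0.1, (5, 1)), curve, grid)
```

In the toy call every estimated curve is offset from the truth by 0.1 on the unit interval, so the integrated squared error of each run is 0.01.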

4.1 Simulation Study Using Unit Weight

First, we consider four simulation settings to demonstrate the validity and advantage of the proposed method in handling the censored time origin using unit weight formula. The first three settings show the performance of the proposed estimators with 0%, 20%, and 50% left censoring for formula. The fourth setting compares the performance of the proposed estimators with the naive version of the approach that ignores the censored time origin (or formula) by mistreating formula as the measurement times since the actual time origin and formula as the response at formula.

For sample sizes n = 200, 300, and 500, and bandwidths h = 0.3, 0.4, and 0.5, Table 1 shows results based on 500 simulations. The biases of formula are small, and the sample standard errors of formula are close to the estimated standard errors. Both standard errors decrease as the sample size increases, and they tend to increase slightly with the left censoring percentage formula. The coverage probabilities of formula are close to their 0.95 nominal level. The formula, formula, decrease as sample size increases. The RASEs for formula, formula, also increase as formula increases. The values of SSE and ESE for formula are similar for the three different bandwidth choices, but formula and formula become smaller as the bandwidth increases.

TABLE 1

Summary statistics for the estimators formula and formula under model (16) with formula left censoring and 30% right censoring. Each entry is based on 500 simulations

formula   n     h     Bias      SSE      ESE      CP      RASE formula   RASE formula
0%        200   0.3    0.0075   0.1963   0.1886   0.938   0.3798         0.6312
0%        200   0.4    0.0078   0.1958   0.1893   0.940   0.3444         0.5742
0%        200   0.5    0.0080   0.1952   0.1900   0.938   0.3194         0.5370
0%        300   0.3   −0.0030   0.1640   0.1552   0.940   0.3076         0.5157
0%        300   0.4   −0.0030   0.1637   0.1556   0.938   0.2798         0.4688
0%        300   0.5   −0.0028   0.1636   0.1559   0.940   0.2588         0.4359
0%        500   0.3    0.0002   0.1184   0.1212   0.950   0.2370         0.4048
0%        500   0.4    0.0000   0.1181   0.1214   0.950   0.2160         0.3738
0%        500   0.5    0.0000   0.1181   0.1216   0.952   0.2005         0.3571
20%       200   0.3    0.0003   0.1990   0.1891   0.930   0.3961         0.6778
20%       200   0.4    0.0005   0.1979   0.1899   0.932   0.3559         0.6091
20%       200   0.5    0.0005   0.1973   0.1903   0.938   0.3295         0.5670
20%       300   0.3   −0.0049   0.1650   0.1554   0.926   0.3242         0.5436
20%       300   0.4   −0.0048   0.1648   0.1558   0.928   0.2928         0.4918
20%       300   0.5   −0.0046   0.1648   0.1562   0.930   0.2713         0.4587
20%       500   0.3    0.0010   0.1164   0.1212   0.958   0.2480         0.4293
20%       500   0.4    0.0013   0.1164   0.1215   0.960   0.2234         0.3922
20%       500   0.5    0.0013   0.1164   0.1217   0.960   0.2069         0.3713
50%       200   0.3    0.0013   0.2005   0.1899   0.934   0.4474         0.7841
50%       200   0.4    0.0010   0.1996   0.1911   0.936   0.3928         0.6921
50%       200   0.5    0.0011   0.1992   0.1917   0.936   0.3584         0.6315
50%       300   0.3   −0.0050   0.1663   0.1563   0.922   0.3707         0.6415
50%       300   0.4   −0.0045   0.1663   0.1571   0.926   0.3265         0.5677
50%       300   0.5   −0.0042   0.1666   0.1575   0.926   0.2975         0.5157
50%       500   0.3    0.0021   0.1170   0.1222   0.962   0.2820         0.5166
50%       500   0.4    0.0023   0.1168   0.1226   0.964   0.2486         0.4630
50%       500   0.5    0.0025   0.1167   0.1228   0.960   0.2273         0.4237

Table 2 presents results for estimation of γ and formula with the approach that ignores the censored time origin. It shows that formula, formula, increase dramatically compared to the corresponding results of the proposed estimators in Table 1 that account for the censored time origin. Table 2 also shows that there is little bias in the estimation of the constant effect γ when the censored time origin issue is ignored. Table 3 gives a side-by-side comparison of the proposed estimator for formula versus the approach that misplaces the time origin under 50% left censoring in the presence (formula) and absence (formula) of right censoring.

TABLE 2

Summary statistics of estimation of γ and formula under model (16) with misplaced time origin under 50% left censoring and 30% right censoring. Each entry is based on 500 simulations

formula   n     h     Bias      SSE      ESE      CP      RASE formula   RASE formula
50%       200   0.3   −0.0001   0.2341   0.2231   0.932   0.5742         1.5619
50%       200   0.4   −0.0007   0.2339   0.2240   0.934   0.5477         1.5296
50%       200   0.5   −0.0009   0.2336   0.2246   0.936   0.5265         1.4946
50%       300   0.3   −0.0055   0.2015   0.1847   0.920   0.5355         1.5214
50%       300   0.4   −0.0056   0.2014   0.1853   0.926   0.5171         1.4991
50%       300   0.5   −0.0058   0.2018   0.1856   0.924   0.5007         1.4689
50%       500   0.3   −0.0013   0.1469   0.1444   0.936   0.4809         1.4537
50%       500   0.4   −0.0013   0.1467   0.1446   0.934   0.4695         1.4396
50%       500   0.5   −0.0014   0.1465   0.1448   0.940   0.4579         1.4168

TABLE 3

Side-by-side comparison of the proposed estimator for formula under model (16) with the approach that misplaces the time origin under 50% left censoring and formula right censoring. Each entry is based on 500 simulations

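| formula | n | h | RASE(formula): proposed method | RASE(formula): misplaced origin | RASE(formula): proposed method | RASE(formula): misplaced origin |
| --- | --- | --- | --- | --- | --- | --- |
| 0% | 200 | 0.3 | 0.4078 | 0.5410 | 0.7111 | 1.5325 |
| 0% | 200 | 0.4 | 0.3557 | 0.5207 | 0.6246 | 1.5081 |
| 0% | 200 | 0.5 | 0.3231 | 0.5033 | 0.5683 | 1.4757 |
| 0% | 300 | 0.3 | 0.3302 | 0.5084 | 0.5762 | 1.4941 |
| 0% | 300 | 0.4 | 0.2903 | 0.4951 | 0.5100 | 1.4791 |
| 0% | 300 | 0.5 | 0.2654 | 0.4818 | 0.4650 | 1.4528 |
| 0% | 500 | 0.3 | 0.2528 | 0.4686 | 0.4642 | 1.4379 |
| 0% | 500 | 0.4 | 0.2230 | 0.4605 | 0.4170 | 1.4294 |
| 0% | 500 | 0.5 | 0.2042 | 0.4505 | 0.3829 | 1.4094 |
| 30% | 200 | 0.3 | 0.4474 | 0.5742 | 0.7841 | 1.5619 |
| 30% | 200 | 0.4 | 0.3928 | 0.5477 | 0.6921 | 1.5296 |
| 30% | 200 | 0.5 | 0.3584 | 0.5265 | 0.6315 | 1.4946 |
| 30% | 300 | 0.3 | 0.3707 | 0.5355 | 0.6415 | 1.5214 |
| 30% | 300 | 0.4 | 0.3265 | 0.5171 | 0.5677 | 1.4991 |
| 30% | 300 | 0.5 | 0.2975 | 0.5007 | 0.5157 | 1.4689 |
| 30% | 500 | 0.3 | 0.2820 | 0.4809 | 0.5166 | 1.4537 |
| 30% | 500 | 0.4 | 0.2486 | 0.4695 | 0.4630 | 1.4396 |
| 30% | 500 | 0.5 | 0.2273 | 0.4579 | 0.4237 | 1.4168 |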

Figure 2 shows the average estimates of formula based on 500 simulations under the four simulation settings described above. Figure 2(a)–(c) plots average estimates based on the proposed method corresponding to 0%, 20%, and 50% left censoring for formula, and Figure 2(d) corresponds to the fourth case. Figure 2(a)–(c) shows that the biases are small, so the estimated curves fit the true curve quite well. In contrast, there are large biases and an obvious time shift in the estimated covariate effect for formula in Figure 2(d).

Figure 2

Averages of the estimates for formula and formula for formula, formula and 30% right censoring based on 500 simulations. The solid black lines are for formula and the dashed black lines are for formula. (a)–(c) The biases in the cases of 0%, 20%, and 50% left censoring rate of formula, respectively. (d) The results in the case of misplaced time origin by ignoring formula

Figure 3 shows standard errors of formula and formula based on 500 simulations, with panels (a)–(d) corresponding to the four simulation settings. In all four plots, the sample standard error curves are quite close to the estimated standard error curves. In the first three cases, the large variation for time near zero is typical of the local linear approach near the boundaries; see page 73 of Fan and Gijbels (1996). The fourth case in Figure 3(d) does not have large variation near zero as the new time zero is shifted from a time point that is of duration formula after the actual time origin for the ith subject, formula.

Figure 3

Sample and estimated standard errors of the estimates for formula and formula for formula, formula and 30% right censoring based on 500 simulations. The solid lines are for formula and the dashed lines are for formula. The gray lines are the estimated standard error and the black ones are the sample standard error. (a)–(c) The results in the cases of 0%, 20%, and 50% left censoring rate of formula, respectively. (d) The results in the case of misplaced time origin by ignoring formula

Figure 4 shows the coverage probabilities of 95% pointwise confidence intervals for formula and formula for formula based on 500 simulations under the four simulation settings. Figure 4(a)–(c) shows that the proposed estimators have accurate coverage probabilities close to the 0.95 nominal level except for time near zero, while Figure 4(d) shows very poor coverage probabilities for both formula and formula for the approach that ignores the censored time origin.

Figure 4

Coverage probabilities of 95% pointwise confidence intervals for formula and formula for formula, formula and 30% right censoring based on 500 simulations. The solid lines are the coverage probabilities for formula and the dashed lines are for formula. (a)–(c) The coverage probabilities of 95% pointwise confidence intervals in the cases of 0%, 20% and 50% left censoring rate of formula, respectively. (d) The results in the case of misplaced time origin by ignoring formula
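As an illustration of how summary statistics of this kind are usually computed, the following minimal sketch evaluates Bias, SSE, ESE, and CP over simulation replicates. It assumes the conventional definitions (mean estimate minus truth, sample standard deviation of the estimates, mean of the estimated standard errors, and the fraction of Wald-type 95% intervals covering the truth); the function name `summarize` and the synthetic normal replicates are hypothetical and are not taken from the authors' code.

```python
import numpy as np

def summarize(estimates, std_errors, truth, z=1.96):
    """Summarize replicates of one scalar estimator (assumed definitions)."""
    estimates = np.asarray(estimates, dtype=float)
    std_errors = np.asarray(std_errors, dtype=float)
    bias = estimates.mean() - truth          # Bias: mean estimate minus truth
    sse = estimates.std(ddof=1)              # SSE: sample SD of the estimates
    ese = std_errors.mean()                  # ESE: mean estimated standard error
    lower = estimates - z * std_errors       # Wald-type interval limits
    upper = estimates + z * std_errors
    cp = np.mean((lower <= truth) & (truth <= upper))  # empirical coverage
    return {"Bias": bias, "SSE": sse, "ESE": ese, "CP": cp}

# Synthetic stand-in for 500 simulation replicates (illustrative only).
rng = np.random.default_rng(0)
truth = 1.0
est = truth + 0.2 * rng.standard_normal(500)
se = np.full(500, 0.2)
summary = summarize(est, se, truth)
print(summary)
```

For the pointwise curves in Figure 4, the same coverage calculation would be repeated at each grid point t.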

An additional simulation study is conducted in Web Appendix D when the conditional hazard function of formula depends on the baseline covariates through the Cox model. The simulation results presented in Web Tables 1–3 and Web Figures 1–3 show that the theory and methods work well for the Cox model.

4.2 Simulation Study Using the Estimated Weight

A simulation study is conducted to evaluate the efficiency gain of the two-stage estimation procedure. We consider four different error models for formula. In Error Model I, formula has a normal distribution with mean formula and variance 1 conditional on the ith subject, and formula is N(0, 1). In Error Model II, formula has a normal distribution with mean formula and variance formula conditional on the ith subject, and formula is N(0, 0.5²). Error Model III is the same as Error Model II except that formula is from N(0, 0.3²). In Error Model IV, formula is a Gaussian process with mean 0 and variance formula, and formula and formula are independent for formula. Errors formula and formula for formula are dependent under Error Models I, II, and III, but independent under Error Model IV. The variance formula of the error formula is time-varying under Error Models II–IV, but constant under Error Model I. The second-stage estimator is obtained by using the estimated weight formula where formula is given in Section 2.3.

Define the empirical relative efficiency (Eff) of the weighted estimator formula to formula as formula. The efficiency gain depends on the distribution of the error process formula. Table 4 shows the simulation results on the performance of the second-stage estimator and its empirical relative efficiency. Overall, the empirical relative efficiency ranges from about 1 to 1.35. No efficiency gain is observed when the variance of formula is constant. The second-stage estimator is most efficient when the errors formula at different time points are uncorrelated. The efficiency gain is less pronounced when the errors formula are correlated. The simulation study also shows that there is no clear efficiency gain of formula over formula for any of the error models.
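The reported efficiencies can be related to the standard errors in Table 4 with a short calculation. The sketch below assumes that Eff(γ) is the ratio of the sample standard error (SSE) under unit weight to the SSE under the estimated weight; this reading is an assumption, not the stated definition, although it reproduces the tabulated Eff(γ) values to roughly three decimals (small gaps are expected because the tabulated SSEs are rounded). The names `table4` and `relative_efficiency` are hypothetical.

```python
# (error model, n) -> (SSE unit weight, SSE estimated weight, reported Eff),
# transcribed from Table 4.
table4 = {
    ("I", 200): (0.1979, 0.1984, 0.9972),
    ("I", 300): (0.1648, 0.1650, 0.9988),
    ("I", 500): (0.1164, 0.1162, 1.0017),
    ("II", 200): (0.1282, 0.1197, 1.0707),
    ("II", 300): (0.1070, 0.0992, 1.0789),
    ("II", 500): (0.0774, 0.0706, 1.0959),
    ("III", 200): (0.1066, 0.0918, 1.1612),
    ("III", 300): (0.0901, 0.0773, 1.1665),
    ("III", 500): (0.0658, 0.0550, 1.1970),
    ("IV", 200): (0.0929, 0.0709, 1.3100),
    ("IV", 300): (0.0794, 0.0602, 1.3187),
    ("IV", 500): (0.0597, 0.0443, 1.3481),
}

def relative_efficiency(sse_unit, sse_weighted):
    """Assumed definition: ratio of sample standard errors (unit / weighted)."""
    return sse_unit / sse_weighted

ratios = {key: relative_efficiency(u, w) for key, (u, w, _) in table4.items()}

# Largest absolute gap between the recomputed ratio and the reported Eff.
max_gap = max(abs(ratios[key] - table4[key][2]) for key in table4)

for (model, n), ratio in ratios.items():
    print(f"Error Model {model}, n={n}: ratio={ratio:.4f}, "
          f"reported={table4[(model, n)][2]:.4f}")
print(f"largest gap: {max_gap:.4f}")
```

Under this reading, values above 1 indicate that the estimated weight reduces the sampling variability of the estimator of γ.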

TABLE 4

The empirical relative efficiency Eff(γ) of the two-stage estimator using the estimated weight to the estimator using unit weight under model (16) with 20% left censoring and 30% right censoring, n = 200, 300, 500 and formula. Each entry is based on 500 simulations

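| Error model | n | Unit weight: Bias | Unit weight: SSE | Unit weight: ESE | Unit weight: CP | Estimated weight: Bias | Estimated weight: SSE | Estimated weight: ESE | Estimated weight: CP | Eff(γ) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| I | 200 | 0.0005 | 0.1979 | 0.1899 | 0.932 | 0.0006 | 0.1984 | 0.1875 | 0.932 | 0.9972 |
| I | 300 | −0.0048 | 0.1648 | 0.1558 | 0.928 | −0.0050 | 0.1650 | 0.1545 | 0.928 | 0.9988 |
| I | 500 | 0.0013 | 0.1164 | 0.1215 | 0.960 | 0.0012 | 0.1162 | 0.1208 | 0.960 | 1.0017 |
| II | 200 | 0.0001 | 0.1282 | 0.1263 | 0.952 | 0.0006 | 0.1197 | 0.1143 | 0.940 | 1.0707 |
| II | 300 | −0.0041 | 0.1070 | 0.1035 | 0.940 | −0.0021 | 0.0992 | 0.0939 | 0.928 | 1.0789 |
| II | 500 | 0.0000 | 0.0774 | 0.0803 | 0.954 | 0.0009 | 0.0706 | 0.0733 | 0.964 | 1.0959 |
| III | 200 | −0.0007 | 0.1066 | 0.1071 | 0.958 | 0.0000 | 0.0918 | 0.0896 | 0.956 | 1.1612 |
| III | 300 | −0.0036 | 0.0901 | 0.0876 | 0.940 | −0.0013 | 0.0773 | 0.0732 | 0.924 | 1.1665 |
| III | 500 | −0.0003 | 0.0658 | 0.0679 | 0.956 | 0.0007 | 0.0550 | 0.0570 | 0.964 | 1.1970 |
| IV | 200 | −0.0018 | 0.0929 | 0.0954 | 0.954 | −0.0009 | 0.0709 | 0.0715 | 0.950 | 1.3100 |
| IV | 300 | −0.0028 | 0.0794 | 0.0779 | 0.948 | −0.0006 | 0.0602 | 0.0580 | 0.932 | 1.3187 |
| IV | 500 | −0.0009 | 0.0597 | 0.0603 | 0.950 | 0.0003 | 0.0443 | 0.0451 | 0.958 | 1.3481 |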

5 Analysis of Step Study

The Step study was a multicenter, double-blind, randomized, placebo-controlled preventive HIV vaccine efficacy trial conducted in North America, the Caribbean, South America, and Australia from 2004 to 2009 (Buchbinder et al., 2008; Fitzgerald et al., 2011; Duerr et al., 2012). A co-primary objective of the study was to determine whether the MRKAd5 HIV gag/pol/nef vaccine, which elicits T cell immune responses to HIV proteins through delivery of the gag, pol, and nef HIV genes to the immune system by the adenovirus type 5 (Ad5) common cold vector, is capable of controlling HIV replication among participants who acquired HIV infection after vaccination. Three thousand HIV-1 negative participants at high risk of HIV infection and with ages between 18 and 45 were enrolled and randomly assigned to receive vaccine or placebo in 1:1 allocation, stratified by sex, study site, and anti-Ad5 neutralizing antibody titer at baseline.

Of the 3000 participants, 174 acquired HIV infection during the trial, 159 of whom were male and 15 female. As females comprised less than 10% of the sample, we analyze the males only. Study participants received antibody (Ab)-based HIV diagnostic ELISA tests at periodic study visits at Weeks 12, 52, and every six months thereafter through 5 years. Participants with a positive ELISA test had HIV infection confirmed by an antigen-based HIV-specific RNA PCR test. Moreover, for all confirmed HIV infected participants, a “look-back” procedure was applied wherein all earlier available blood samples going back in time were tested for HIV infection using the more sensitive RNA PCR test. The antibody tests used in Step have near-perfect sensitivity to detect HIV infections starting 4 weeks after HIV acquisition, but miss HIV infections before that, whereas the RNA PCR tests have near-perfect sensitivity starting about a week after HIV acquisition. Let formula be the time of the latest negative Ab- test result, formula the time of the first positive Ab+ test result, and formula the actual HIV acquisition time. Based on the HIV testing algorithm, each infected participant was classified into one of two groups, defined by whether the earliest HIV positive sample was (Ab+, PCR+) or (Ab-, PCR+). The (Ab+, PCR+) group has formula such that formula and formula is left censored (Figure 1a), and the (Ab-, PCR+) group has formula and formula such that formula is observed (Figure 1b). The left censoring rate for formula is 70.4%.

At the time of a participant's first antibody-based positive HIV infection diagnosis (formula), 18 post-infection visits were scheduled at weeks 0, 1, 2, 8, 12, 26 and every 26 weeks thereafter through week 338. However, the actual dates of visits vary. We define formula as the time from formula to the jth visit for the ith infected participant. At each study visit, HIV viral load formula was measured. A participant is considered censored once he starts antiretroviral therapy (ART), which interferes with the assessment of vaccine effects on viral load and other biomarkers of interest. The censoring time formula is the time from formula to ART initiation, study drop-out, or the end of follow-up, whichever comes first. The right censoring rate of formula was 69.8%.

The data for analysis include 159 participants (97 vaccine group, 62 placebo group) with 785 visits from HIV infection diagnosis and prior to ART initiation. One hundred and twenty-two participants were in North America or Australia and the rest were in the Caribbean or South America. Information was available on whether the participants were fully adherent to vaccinations and on baseline anti-Ad5 neutralizing antibody titer (Ad5 titer), each of which affects the T cell response to the vaccine and hence could be associated with the viral load response. A spaghetti plot of the raw viral load data with one line for each participant by vaccination status and the study region is given in Web Figure 4 in Web Appendix F. We investigate the effect of MRKAd5 HIV-1 gag/pol/nef vaccine versus placebo on longitudinal HIV viral loads among the 159 HIV infected men. With formula, the logarithm (log10) of HIV viral load at time t in years, we analyzed the data with the following model:

(18)

where formula is the treatment indicator (formula if participant i was assigned vaccine and 0 if placebo), formula is the study site indicator (formula if North America or Australia and 0 otherwise), formula is the natural logarithm of baseline Ad5 titer, and formula is the per-protocol indicator (formula if participant i was fully adherent to vaccinations and 0 otherwise).

We choose formula years since there are very few observations after formula and formula. The Kaplan–Meier estimator is used to estimate the distribution of formula because none of the covariates in model (18) is significant (all p-values are large) when the Cox model is fitted. Figure 5 plots the Kaplan–Meier estimator formula. We let formula for formula because the smallest uncensored formula is 0.1397 and the smallest value of formula is 0.0493.

Figure 5

Kaplan–Meier estimator of the distribution function of the time from actual HIV acquisition to the first positive ELISA confirmed by Western Blot or RNA for male HIV infected cases in the Step trial
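For readers unfamiliar with the product-limit construction, the following minimal sketch computes a Kaplan–Meier survival estimate for right-censored data. It is a generic illustration, not the authors' implementation: the function name `kaplan_meier` and the toy data are hypothetical. Left-censored times such as those considered here are commonly handled by reversing the time axis so that left censoring becomes right censoring; whether the analysis above proceeds exactly this way is not stated in this section.

```python
import numpy as np

def kaplan_meier(times, events):
    """Product-limit estimate of S(t) at each distinct observed event time.

    times  : observed times (event or censoring)
    events : 1 if the time is an observed event, 0 if right censored
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=bool)
    event_times = np.unique(times[events])   # sorted distinct event times
    surv = []
    s = 1.0
    for t in event_times:
        at_risk = np.sum(times >= t)              # still under observation at t
        deaths = np.sum((times == t) & events)    # events occurring exactly at t
        s *= 1.0 - deaths / at_risk               # product-limit update
        surv.append(s)
    return event_times, np.array(surv)

# Toy data: five subjects, two of them right censored.
t, s = kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0])
print(t, s)
```

The estimated distribution function is then one minus the survival estimate at each event time.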

We select the bandwidth using the formula formula, where formula. Here formula is the sample variance of formula, for formula and each participant i, and formula is the sample variance of formula. Bandwidth choice is discussed further in Web Appendix E. With formula, the bandwidth choice is formula. The estimates of γ1, γ2, and γ3 using unit weight are 0.0241, −0.0127, and −0.0124, with standard errors 0.0406, 0.1935, and 0.1573, respectively. The results indicate no significant associations of baseline Ad5 titer, study site, or per-protocol status with the HIV viral load level. The estimates of time-varying effects with 95% pointwise confidence intervals are shown in Figure 6. Figure 6(b) suggests that the vaccine effect on reducing viral load after infection is not significant. However, the vaccine group tends to have lower viral load than the placebo group and the benefit increases over time.

Figure 6

Analysis of HIV viral load level for the male HIV infected cases in Step trial with model (18). (a) The estimate of the intercept formula and its 95% pointwise confidence interval. (b) The estimated effect formula of the vaccine and its 95% pointwise confidence interval. The solid curves are the point estimates and the dashed curves are the confidence intervals
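The conclusion that the constant coefficients are not significant can be checked directly from the reported estimates and standard errors. The sketch below assumes a two-sided Wald test based on the normal approximation, which is one natural use of the asymptotic normality results; the helper `wald_test` and the dictionary keys are hypothetical names introduced for illustration.

```python
import math

# Reported unit-weight estimates and standard errors for gamma_1..gamma_3.
estimates = {"gamma1": 0.0241, "gamma2": -0.0127, "gamma3": -0.0124}
std_errors = {"gamma1": 0.0406, "gamma2": 0.1935, "gamma3": 0.1573}

def wald_test(est, se):
    """Return the z statistic and two-sided p-value under a normal approximation."""
    z = est / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return z, p

results = {name: wald_test(estimates[name], std_errors[name]) for name in estimates}

for name, (z, p) in results.items():
    print(f"{name}: z = {z:.3f}, two-sided p = {p:.3f}")
```

All three z statistics are well below 1.96 in absolute value, in line with the statement that none of the associations is significant at the 5% level.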

The two-stage estimation procedure with the estimated weight is also applied to the analysis of the Step study. The results using the estimated weight are similar and are presented in Web Appendix F.

6 Concluding Remarks

This paper investigated the semiparametric additive time-varying coefficient model for longitudinal data starting from a possibly censored relevant event. The developed method can be applied to assess the HIV vaccine effect on the post HIV infection longitudinal biomarkers such as HIV viral load level since the time of infection. However, the exact time of infection can be censored. The censorship status can be determined by an HIV antibody test followed by a more sensitive antigen-based HIV-specific PCR assay, or by an expanded set of HIV diagnostic assays with varying sensitivity properties (Grebe et al., 2019) and/or sequence diversification molecular clock models (Giorgi et al., 2010; Rossenkhan et al., 2019). The time from HIV acquisition to first HIV antibody positive test is subject to left censoring and its distribution can be estimated based on the Kaplan–Meier estimator or a failure time regression model such as the Cox model. We used the expectation maximization approach to deal with the censored time origin and developed the weighted profile least squares estimation procedure. The nonparametric kernel smoothing method was employed to estimate time-varying covariate effects. We investigated the asymptotic properties of both the parametric and nonparametric estimators and proposed consistent variance estimators.

Our simulation study showed that the proposed estimators work well, with small biases and satisfactory coverage probabilities for different levels of left censoring and moderate sample sizes, whereas substantial bias was incurred in estimating the time-varying effects when the time origin was misplaced by ignoring left censoring. The method was applied to analyze the male HIV infected cases from the Step study, suggesting a borderline significant vaccine effect to lower viral load after infection by about a half to three-quarters of a log (base 10) during the period beyond 1.5 years post HIV acquisition. This result should be interpreted with caution given the potential for post-randomization selection bias, which could occur if there are baseline factors not adjusted for in the analysis that predict both viral load and the vaccine effect on HIV acquisition (Shepherd et al., 2006).

As with most statistical modeling, the application of the proposed method should be accompanied by model assessment. Web Figure 5 shows a scatter plot of residuals of the participants with formula for the Step study application. The residuals are approximately centered around 0 and lie mostly between −2 and 2, which indicates that the model fits the data reasonably well. However, it is more desirable to develop formal goodness-of-fit tests to examine the model fit. The test statistics can be constructed based on certain test processes that are functionals of residual processes. The critical values of the test statistics can be estimated similarly to the technique of Sun et al. (2019). Further work is warranted to understand the empirical and theoretical properties of the tests. Although the development for the censored time origin is based on the semiparametric additive time-varying coefficient model, extensions to the generalized semiparametric regression models of Sun et al. (2013) are possible. The extension can be carried out by applying the EM approach to the expositions of Sun et al. (2013). It would also be interesting to investigate the random mixed-effects model with censored time origin that can be used to estimate subject-specific effects of covariates.

Data Availability Statement

The data that support the findings in this paper are available in the Supporting Information of this article.

Supporting Information

Web Appendices A–F, Tables 1–3, and Figures 1–5 referenced in Sections 2–5, together with the data and computer code, are posted online with this paper at the Biometrics website on Wiley Online Library.

Acknowledgments

The authors thank an associate editor and two referees for their valuable comments and suggestions that have greatly improved the paper. This research was partially supported by the National Institutes of Health NIAID [grant number R37 AI054165]. The research of Yanqing Sun was also partially supported by the National Science Foundation [grant numbers DMS-1208978 and DMS-1915829] and the Reassignment of Duties fund provided by the University of North Carolina at Charlotte. We thank the HIV Vaccine Trials Network (HVTN) and Merck for providing the data analyzed in this article. The HVTN is supported through a cooperative agreement with the National Institutes of Health Division of AIDS, grant 5 U01 AI068635. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

References

Andersen, P.K. & Gill, R.D. (1982) Cox's regression model for counting processes: A large sample study. Annals of Statistics, 10, 1100–1120.

Bickel, P.J., Klaassen, C.A.J., Ritov, Y. & Wellner, J.A. (1993) Efficient and adaptive estimation for semiparametric models. Baltimore, MD: Johns Hopkins University Press.

Buchbinder, S., Mehrotra, D., Duerr, A., Fitzgerald, D., Mogg, R., Li, D. et al. (2008) Efficacy assessment of a cell-mediated immunity HIV-1 vaccine (the Step Study): a double-blind, randomised, placebo-controlled, test-of-concept trial. Lancet, 372, 1881–1893.

Duerr, A., Huang, Y., Buchbinder, S., Coombs, R.W., Sanchez, J., del Rio, C. et al. (2012) Extended follow-up confirms early vaccine-enhanced risk of HIV acquisition and demonstrates waning effect over time among participants in a randomized trial of recombinant adenovirus HIV vaccine (Step study). Journal of Infectious Diseases, 206, 258–266.

Fan, J. & Gijbels, I. (1996) Local polynomial modelling and its applications: Monographs on statistics and applied probability. Boca Raton, FL: CRC Press.

Fan, J., Huang, T. & Li, R. (2007) Analysis of longitudinal data with semiparametric estimation of covariance function. Journal of the American Statistical Association, 102, 632–641.

Fitzgerald, D., Janes, H., Robertson, M., Coombs, R., Frank, I., Gilbert, P. et al. (2011) An Ad5-vectored HIV-1 vaccine elicits cell-mediated immunity but does not affect disease progression in HIV-1-infected male subjects: results from a randomized placebo-controlled trial (the Step study). Journal of Infectious Diseases, 203, 765–772.

Giorgi, E., Funkhouser, B., Athreya, G., Perelson, A., Korber, B. & Bhattacharya, T. (2010) Estimating time since infection in early homogeneous HIV-1 samples using a Poisson model. BMC Bioinformatics, 11, 532.

Grebe, E., Facente, S.N., Bingham, J., Pilcher, C.D., Powrie, A., Gerber, J. et al. (2019) Interpreting diagnostic histories into HIV infection time estimates: analytical framework and online tool. BMC Infectious Diseases, 19, 894.

Hu, Z., Wang, N. & Carroll, R.J. (2004) Profile-kernel versus backfitting in the partially linear models for longitudinal/clustered data. Biometrika, 91, 251–262.

Janes, H., Herbeck, J.T., Tovanabutra, S., Thomas, R., Frahm, N., Duerr, A. et al. (2015) HIV-1 infections with multiple founders are associated with higher viral loads than infections with single founders. Nature Medicine, 21, 1139.

Lin, X. & Carroll, R.J. (2000) Nonparametric function estimation for clustered data when the predictor is measured without/with error. Journal of the American Statistical Association, 95, 520–534.

Lin, X. & Carroll, R.J. (2001) Semiparametric regression for clustered data using generalized estimating equations. Journal of the American Statistical Association, 96, 1045–1056.

Lin, D.Y. & Ying, Z. (2001) Semiparametric and nonparametric regression analysis of longitudinal data (with discussion). Journal of the American Statistical Association, 96, 103–113.

Qi, L., Sun, Y. & Gilbert, P. (2017) Generalized semiparametric varying-coefficient model for longitudinal data with applications to adaptive treatment randomizations. Biometrics, 73, 441–451.

Qu, A. & Li, R. (2006) Quadratic inference functions for varying-coefficient models with longitudinal data. Biometrics, 62, 379–391.

Rerks-Ngarm, S., Paris, R., Chunsutthiwat, S., Premsri, N., Namwat, C., Bowonwatanuwong, C. et al. (2013) Extended evaluation of the virologic, immunologic, and clinical course of volunteers who acquired HIV-1 infection in a phase III vaccine trial of ALVAC-HIV and AIDSVAX B/E. Journal of Infectious Diseases, 207(8), 1195–1205.

Robins, J., Rotnitzky, A. & Zhao, L. (1994) Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89, 846–866.

Rossenkhan, R., Rolland, M., Labuschagne, J., Ferreira, R., Magaret, C., Carpp, L. et al. (2019) Combining viral genetics and statistical modeling to improve HIV-1 time-of-infection estimation towards enhanced vaccine efficacy assessment. Viruses, 11, 607.

Shepherd, B., Gilbert, P.B., Jemiai, Y. & Rotnitzky, A. (2006) Sensitivity analyses comparing outcomes only existing in a subset selected post-randomization, conditional on covariates, with application to HIV vaccine trials. Biometrics, 62, 332–342.

Sun, Y. & Gilbert, P. (2012) Estimation of stratified mark-specific proportional hazards models with missing marks. Scandinavian Journal of Statistics, 39, 34–52. PMCID: PMC3601495.

Sun, Y., Qi, L., Heng, F. & Gilbert, P.B. (2019) Analysis of generalized semiparametric mixed varying-coefficients models for longitudinal data. Canadian Journal of Statistics, 47, 352–373.

Sun, Y., Qian, X., Shou, Q. & Gilbert, P. (2017) Analysis of two-phase sampling data with semiparametric additive hazards models. Lifetime Data Analysis, 23, 377–399.

Sun, Y., Sun, L. & Zhou, J. (2013) Profile local linear estimation of generalized semiparametric regression model for longitudinal data. Lifetime Data Analysis, 19, 317–349.

Sun, Y. & Wu, H. (2005) Semiparametric time-varying coefficients regression model for longitudinal data. Scandinavian Journal of Statistics, 32, 21–48.

van der Vaart, A.W. (1998) Asymptotic statistics. Cambridge: Cambridge University Press.

Wang, N., Carroll, R.J. & Lin, X. (2005) Efficient semiparametric marginal estimation for longitudinal/clustered data. Journal of the American Statistical Association, 100, 147–157.

Yang, G., Sun, Y., Qi, L. & Gilbert, P. (2017) Estimation of stratified mark-specific proportional hazards models under two-phase sampling with application to HIV vaccine efficacy trials. Statistics in Biosciences, 9, 259–283.

Ying, Z. (1989) A note on the asymptotic properties of the product-limit estimator on the whole line. Statistics & Probability Letters, 7, 311–314.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://dbpia.nl.go.kr/journals/pages/open_access/funder_policies/chorus/standard_publication_model)

Supplementary data