Abstract

Leveraging information in aggregate data from external sources to improve estimation efficiency and prediction accuracy with smaller scale studies has drawn a great deal of attention in recent years. Yet, conventional methods often either ignore uncertainty in the external information or fail to account for the heterogeneity between internal and external studies. This article proposes an empirical likelihood-based framework to improve the estimation of the semiparametric transformation models by incorporating information about the t-year subgroup survival probability from external sources. The proposed estimation procedure incorporates an additional likelihood component to account for uncertainty in the external information and employs a density ratio model to characterize population heterogeneity. We establish the consistency and asymptotic normality of the proposed estimator and show that it is more efficient than the conventional pseudopartial likelihood estimator without combining information. Simulation studies show that the proposed estimator yields little bias and outperforms the conventional approach even in the presence of information uncertainty and heterogeneity. The proposed methodologies are illustrated with an analysis of a pancreatic cancer study.

1 Introduction

The advent of evidence-based medicine has generated considerable interest in developing methods that can better synthesize information from different sources to infer treatment effects and identify prognostic/predictive factors (Guyatt et al., 1992). Meta-analysis, a quantitative procedure for combining results from multiple relevant clinical studies, is a powerful tool to produce empirical evidence to guide clinical practice (Sutton et al., 2000; Whitehead, 2002). It conventionally refers to methods combining study-level results but has evolved to encompass methods based on individual participant data (IPD). IPD meta-analysis enjoys clear advantages over the conventional aggregate data (AD) meta-analysis because it allows standardization of the endpoint definition, covariates, and analytical methods; it also allows examination of treatment-by-covariate interactions or subgroup analyses. Despite its known advantages, however, IPD meta-analysis is less common in practice because it is more costly and time-consuming; moreover, access to IPD may be a challenge due to privacy concerns and/or administrative problems.

This research is motivated by the growing interest in developing efficient and flexible meta-analysis procedures to integrate IPD and AD (Chatterjee et al., 2016; Chen et al., 2021; Gao & Chan, 2022; Huang et al., 2016; Han & Lawless, 2019; Liu et al., 2014; Zhang et al., 2020; Zheng et al., 2022). When combining information from different sources, challenges arise because AD may be given in different forms and with different degrees of uncertainty. As an example, in a multivariate regression analysis of data from 209 consecutive patients who underwent pancreatectomy at the Johns Hopkins Hospital between 1998 and 2007 to identify prognostic factors for pancreatic cancer survival, the effect of lymph node status, an important prognostic factor, did not reach statistical significance. To improve estimation efficiency, we seek to incorporate the information in the 3-year survival probabilities estimated using 116 patients with different node statuses (Ahmad et al., 2001). Because the sample size of the external study is not large, the uncertainty in the external information should not be ignored in the inference procedure. Moreover, a careful examination of the covariate summary statistics revealed that the proportions of margin-positive and node-positive patients in the external study were much lower than those in the internal study, suggesting the presence of heterogeneity in the covariate distribution between the internal and external studies. Our goal is to develop a unified framework that can account for uncertainty in the external study and heterogeneity across different studies simultaneously.

In this paper, we propose an empirical likelihood-based framework for integrating IPD and AD under the semiparametric transformation model. The empirical likelihood method, originally developed for constructing confidence regions (Owen, 1988; Thomas & Grunkemeier, 1975), was later extended to combine auxiliary information given in the form of moment estimating equations (Qin & Lawless, 1994; Qin, 2000). We aim to exploit t-year survival probabilities, a common form of summary statistics in the context of survival analysis, to improve estimation efficiency and prediction accuracy. Specifically, following Huang et al. (2016), we derive moment constraints by reexpressing the t-year survival probabilities in the form of estimating equations under the semiparametric transformation model. Next, to account for uncertainty in the reported t-year survival probabilities, we exploit the asymptotic normality of summary statistics by treating the reported values as the realization of a normal random vector. This way, the contribution of auxiliary information to the likelihood can be captured by adding a normal density term (Imbens & Lancaster, 1994; Zhang et al., 2020). This augmented empirical likelihood is then maximized subject to the moment constraints derived from the reported t-year survival probabilities to estimate the regression parameters in the semiparametric transformation model. It is worthwhile to point out that, instead of adding a normal density term, a direct extension of the adjusted variance method was proposed by Sheng et al. (2021); however, the latter cannot easily handle multiple external studies.

It is known that ignoring important differences between studies can invalidate meta-analysis. In this article, we assume that the covariate effects follow the same semiparametric transformation model but allow distributions of covariates to vary across different studies because they may be conducted in different patient populations with different study designs. To account for the differences in the covariate distribution, which is analogous to the concept of “covariate shift” in transfer learning (Shimodaira, 2000), we employ a density ratio model to characterize population heterogeneity between internal and external studies. To perform empirical likelihood estimation, we reevaluate the moment constraints derived from the summary statistics under the density ratio models of the marginal covariate distributions, in addition to the semiparametric transformation model of covariate effects. Hence, maximizing the augmented empirical likelihood subject to the reevaluated moment constraints can simultaneously account for uncertainty in the reported t-year survival probabilities and population heterogeneity across studies. Of note, the efficiency loss resulting from estimating an additional set of parameters in the density ratio model can be compensated by including additional constraints based on the marginal covariate distribution.

This article is organized as follows. In Section 2, we propose an empirical likelihood method for integrating IPD and the reported t-year survival probabilities under the semiparametric transformation model, where an empirical likelihood is constructed based on a compromise between the pseudopartial and nonparametric likelihoods. In Section 3, we extend the proposed empirical likelihood method to account for the uncertainty in the reported t-year survival probabilities and exploit the semiparametric density ratio model to allow for population heterogeneity between the internal and external studies simultaneously. The results of simulation studies are provided in Section 4, and the proposed approaches are illustrated with pancreatic cancer data in Section 5. Finally, concluding remarks and potential future work are discussed in Section 6.

2 Empirical Likelihood Estimation

2.1 A Brief Review of the Semiparametric Transformation Models

Let T denote the time to a failure event of interest in the internal study. We assume that, conditional on a p-dimensional vector of covariates X, the survival time T follows a semiparametric transformation model

(1)

where formula is an unspecified monotone function with formula, β is a p-dimensional vector of regression parameters, and ε is a random error with a known cumulative hazard function formula independent of X. Hence, the cumulative hazard function of T given X is formula, with formula and formula. The semiparametric transformation models encompass the Cox model and the proportional odds model as special cases, where the corresponding random error ε follows the extreme-value distribution and the standard logistic distribution, respectively.
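As a concrete illustration of the two special cases, the following minimal Python sketch evaluates the conditional survival function under the Cox and proportional odds models. It assumes the common parametrization in which the cumulative hazard of T given X equals the cumulative hazard of ε evaluated at H(t) + β'X, where H denotes the unspecified transformation; the function names and the numerical values of H(t) and β'X are hypothetical and are not taken from the paper.

```python
import math

def cumhaz_extreme_value(s):
    """Cumulative hazard of an extreme-value error (Cox model): exp(s)."""
    return math.exp(s)

def cumhaz_logistic(s):
    """Cumulative hazard of a standard logistic error (proportional odds): log(1 + exp(s))."""
    return math.log1p(math.exp(s))

def survival(h_t, lin_pred, cumhaz):
    """S(t | X) = exp(-cumhaz(H(t) + beta'X)) under the assumed parametrization."""
    return math.exp(-cumhaz(h_t + lin_pred))

h_t, lin_pred = -0.5, 0.3   # hypothetical values of H(t) and beta'X
s_cox = survival(h_t, lin_pred, cumhaz_extreme_value)
s_po = survival(h_t, lin_pred, cumhaz_logistic)
```

Under the logistic error, the failure odds (1 − S)/S equal exp{H(t) + β'X}, which is the defining property of the proportional odds model.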

In this article, we impose the usual independent censoring assumption that the time to censoring, denoted by C, is conditionally independent of T given X. Define Y = min(T, C) and Δ = I(T ≤ C), so that Y gives the observed failure time and Δ is the failure event indicator. The observed data (Yi, Δi, Xi), i = 1, …, n, are assumed to be independent and identically distributed realizations of (Y, Δ, X). Denote by formula the jump of formula at time y. Under model (1), the log conditional likelihood is

(2)

with formula. As pointed out by Zeng and Lin (2006), direct maximization of the conditional likelihood formula is challenging because it involves the nonparametric component formula in a complicated way. Alternatively, an estimator of formula can be constructed by solving the martingale estimating equation

(3)

where formula is the counting process of the observed failure events and formula is the at-risk process. Specifically, given β, the solution of the martingale estimating equation, denoted by formula, satisfies

(4)

Replacing formula with formula in formula and ignoring a constant term yields the log pseudopartial likelihood (Zucker, 2005)

(5)

where formula.

Define formula, formula, with formula, formula, and formula for any vector a. Taking the derivative of formula with respect to β yields the pseudopartial likelihood score function

(6)

where formula. As a result, the maximum pseudopartial likelihood estimator formula can be obtained by solving formula for zero. Denote by β0 and formula the true values of β and formula, respectively. Zucker (2005) showed that formula is asymptotically normally distributed with mean zero and variance–covariance matrix formula, where Γ is the negative expectation of the second derivative of the pseudopartial log-likelihood with respect to β and Q is a positive definite matrix resulting from the variation of formula. Moreover, as n → ∞, formula converges in distribution to a zero-mean normal distribution with covariance matrix formula. Note that in the special case of the Cox model, formula only involves β and thus Q = 0. As a result, the asymptotic variance of formula reduces to formula. The explicit forms of Γ and Q are given in the Supporting Information.

2.2 An Empirical Likelihood Estimator for Synthesizing External Information

Our goal is to obtain an improved estimation of the semiparametric transformation model by incorporating external information on t-year survival probabilities in different subgroups. We begin by assuming that the uncertainty in the external information is negligible and that subjects in the internal and external studies were random samples from the same population. The two assumptions will be relaxed later in Sections 3.1 and 3.2.

Let formula denote the kth subgroup whose survival probability at the time point formula is available from an external study. Let formula denote random variables in the external study. So, the external information can be expressed as formula, formula, where formula is the survival probability at time formula in the kth subgroup. By double expectation, we have

(7)
(8)

Note that the second equality holds in the absence of heterogeneity between internal and external studies; that is, the conditional distribution of T given X and the marginal distribution of X are equivalent to their counterparts in the external study. Following Huang et al. (2016), we reexpress the subgroup survival information as a population estimating equation

(9)

where formula and formula.
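To make the moment condition concrete, the sketch below simulates covariates, computes a subgroup survival probability, and checks that the estimating function has a sample mean close to zero at the true parameter values. It assumes that the estimating function takes the form I(X ∈ Ω){S(t | X) − φ}, in the spirit of Huang et al. (2016), and uses a Cox model with hypothetical coefficients, a hypothetical baseline cumulative hazard value, and a hypothetical subgroup; none of these numerical choices come from the paper.

```python
import math
import random

def surv_cox(cum_base_haz, beta, x):
    """Cox survival probability at a fixed time: exp(-Lambda0(t) * exp(beta'x))."""
    lin = sum(b * xj for b, xj in zip(beta, x))
    return math.exp(-cum_base_haz * math.exp(lin))

def draw_x(rng):
    """Hypothetical covariates: X1 ~ N(0, 1), X2 ~ Bernoulli(0.5)."""
    return (rng.gauss(0.0, 1.0), 1.0 if rng.random() < 0.5 else 0.0)

def in_subgroup(x):
    """Hypothetical subgroup Omega: subjects with X2 = 1."""
    return x[1] == 1.0

beta, cum_base_haz = (-0.5, 0.5), 0.3   # hypothetical true values
rng = random.Random(2024)

# Population value phi = E[S(t | X) | X in Omega], approximated by Monte Carlo.
ext = [x for x in (draw_x(rng) for _ in range(200_000)) if in_subgroup(x)]
phi = sum(surv_cox(cum_base_haz, beta, x) for x in ext) / len(ext)

# Independent internal sample from the same population:
# sample mean of g(X) = I(X in Omega) * {S(t | X) - phi}.
internal = [draw_x(rng) for _ in range(200_000)]
g_bar = sum(
    (surv_cox(cum_base_haz, beta, x) - phi) if in_subgroup(x) else 0.0
    for x in internal
) / len(internal)
```

When the internal and external populations coincide, g_bar differs from zero only through Monte Carlo error; a covariate shift between the two samples would generally move it away from zero.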

In this paper, we apply the empirical likelihood method to integrate information from IPD and the t-year survival probabilities under the semiparametric transformation model. Denote by formula the marginal distribution function of X, and by formula the jump size of formula at the observed data point Xi. We construct the empirical likelihood by multiplying the pseudopartial likelihood and the nonparametric marginal likelihood of X, and then maximize the resulting log-likelihood

(10)

subject to the constraints

(11)

Note that the constraints were derived from the external information on subgroup survival. Write formula and formula. Applying the classic empirical likelihood argument (Qin & Lawless, 1994; Qin, 2000), we have formula and the constrained log likelihood, up to a constant,

(12)

where formula are the Lagrange multipliers determined by formula. Hence, we estimate formula by solving the following empirical score functions for zero:

(13)
(14)
(15)
(16)

We denote the solution by formula. The asymptotic properties of the proposed estimator formula are summarized in Theorem 1, with the proof given in the Supporting Information.
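The Lagrange multiplier step can be illustrated in the simplest setting of a single scalar constraint. The sketch below, which is not the authors' implementation, computes the classic empirical likelihood weights p_i = 1/[n{1 + λ g_i}] of Qin and Lawless (1994) for fixed values g_i of an estimating function, solving for λ by bisection rather than Newton–Raphson; the function name and the toy values are hypothetical.

```python
def el_weights(g, iters=200):
    """Empirical likelihood weights for a scalar constraint sum_i p_i * g_i = 0.

    Requires min(g) < 0 < max(g), so that zero lies inside the convex hull.
    """
    n = len(g)
    g_min, g_max = min(g), max(g)
    if not (g_min < 0.0 < g_max):
        raise ValueError("zero must lie strictly between min(g) and max(g)")

    def score(lam):
        # Stationarity condition for lam; strictly decreasing in lam.
        return sum(gi / (1.0 + lam * gi) for gi in g)

    # All weights are positive only for lam in (-1/g_max, -1/g_min).
    lo, hi = -1.0 / g_max, -1.0 / g_min
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if score(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return lam, [1.0 / (n * (1.0 + lam * gi)) for gi in g]

g_values = [-1.0, 0.5, 2.0, -0.3]   # toy values of the estimating function
lam, p = el_weights(g_values)
```

The resulting weights are positive, sum to one, and satisfy the constraint; in the full procedure the same multiplier equation is solved jointly with the score equations for the model parameters.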

 
Theorem 1.

Under regularity conditions for pseudopartial likelihood estimators (Zucker, 2005, p. 1273) and Conditions (S1)–(S4) stated in the Supporting Information, as n → ∞, (i) formula converges in probability to β0, and (ii) formula converges in distribution to a zero-mean multivariate normal distribution with variance–covariance matrix formula, where formula, formula, formula, and formula.

Note that formula is positive semidefinite because formula is idempotent. As a result, the proposed estimator formula, which combines information from the external study, is asymptotically at least as efficient as the conventional pseudopartial likelihood estimator formula obtained using only the internal study data.

3 Proposed Methods

In practice, the degree of uncertainty in the auxiliary information may not be negligible because the sample size in the external study is not large enough. Moreover, the population and research design usually differ across studies, leading to heterogeneity between internal and external data. This section extends the empirical likelihood method to deal with uncertainty and heterogeneity in the reported t-year survival probabilities.

3.1 Synthesizing External Information with Uncertainty

Suppose that the reported t-year survival probabilities formula are estimates of the population parameters formula and were obtained from an external study of N participants. Assume that the asymptotic normality assumption holds for formula, that is, formula approximately follows a multivariate normal distribution with mean zero and variance–covariance matrix V0. To account for uncertainty in the reported t-year survival probabilities, we adopt the augmentation approach proposed in Zhang et al. (2020) by adding a normal density term to the empirical likelihood to characterize the contribution of formula and formulating the constraints using the population parameter formula directly. The augmented log empirical likelihood, up to a constant, is given by

(17)

where the last term formula reflects variability in the external information. To estimate β, we maximize (17) subject to the constraints

(18)

Unlike the empirical likelihood method described in Section 2.2, the constraints are formulated using the population parameter ϕ instead of the value of the AD formula.
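Under the stated normal approximation, the log density of the reported values contributes, up to a constant, a quadratic term of the form −(N/2) d' V0⁻¹ d, where d is the difference between the reported values and ϕ. The following sketch evaluates that term for the two-subgroup case. It is an illustration under this assumption only; the function name and all numerical inputs are hypothetical rather than taken from the paper.

```python
def normal_penalty(phi_reported, phi, v0, n_ext):
    """Log normal density term, up to a constant, for two subgroups.

    Assumes sqrt(N) * (phi_reported - phi) is approximately N(0, V0),
    with V0 a symmetric positive definite 2 x 2 matrix (nested lists).
    """
    d0 = phi_reported[0] - phi[0]
    d1 = phi_reported[1] - phi[1]
    (a, b), (c, d) = v0
    det = a * d - b * c
    if det <= 0.0 or a <= 0.0:
        raise ValueError("V0 must be positive definite")
    # Quadratic form d' V0^{-1} d using the closed-form 2 x 2 inverse.
    quad = (d * d0 * d0 - (b + c) * d0 * d1 + a * d1 * d1) / det
    return -0.5 * n_ext * quad

v0 = [[0.2, 0.0], [0.0, 0.1]]   # hypothetical asymptotic covariance matrix
small = normal_penalty([0.70, 0.85], [0.68, 0.84], v0, n_ext=400)
large = normal_penalty([0.70, 0.85], [0.68, 0.84], v0, n_ext=10_000)
```

For the same discrepancy, a larger external sample size N makes the term more negative, so the maximizer keeps ϕ closer to the reported values; with a small N, the data are allowed to move ϕ further away.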

By a standard empirical likelihood argument, as in Section 2.2, we can estimate formula by maximizing the objective function

(19)

where ξ is the Lagrange multiplier satisfying formula. Let formula be the derivative of formula with respect to formula. The maximizer, denoted by formula, can be obtained by solving formula using the Newton–Raphson algorithm. The asymptotic properties of formula are summarized in Theorem 2, with the proof given in the Supporting Information.

 
Theorem 2.

Under regularity conditions for pseudopartial likelihood estimators (Zucker, 2005, p. 1273) and Conditions (S1)–(S4) stated in the Supporting Information, assume that there exists a constant formula so that formula. Then, as n → ∞, (i) formula converges in probability to β0 and (ii) formula converges in distribution to a zero-mean multivariate normal distribution with variance–covariance matrix formula, where formula, formula, and formula.

Arguing as in Section 2.2, one can show that formula is positive semidefinite and hence formula is asymptotically at least as efficient as formula. When formula, that is, the uncertainty in the external information is negligible, it follows from formula that formula, and thus, the asymptotic variance of formula is close to that of formula. On the other hand, when formula, we have formula, and thus, there is almost no efficiency gain when compared with formula.

When V0 is not available from the external source, a consistent estimator of V0 can be obtained using data from the internal study. Since formula is asymptotically negligible, the asymptotic variance of the proposed estimator of β remains the same if V0 is replaced by its consistent estimator. However, it is worthwhile to point out that the variance–covariance matrix V0 involves the external censoring time distribution, which matters when the censoring time distribution differs between internal and external studies. Without assuming the same censoring time distribution, the proposed method requires that V0 be available from the external source.

3.2 Synthesizing External Information in the Presence of Population Heterogeneity

We now consider the situation where the distribution of covariates in the internal study differs from that in the external study. Denote by formula the density function of X* in the external study and by formula the density function of X in the internal study. To characterize the differences between formula and formula, we employ a semiparametric density ratio model

(20)

where formula is a prespecified q-dimensional function of X, γ is a q-dimensional vector of parameters, and fX(x) is left unspecified. Interestingly, the semiparametric model specified in (20) is equivalent to imposing a (parametric) logistic regression model for membership in the internal (vs. external) study given X. In practice, the selection of covariates involved in D(X) can be informed by comparing the summary statistics of covariates, such as means and variances, which are typically available in the medical reports. For example, if the mean of formula and the variance of X1 (but not X2) are found to be different between internal and external studies, one may specify formula. The parameter γ in (20) characterizes the degree of heterogeneity in the covariate distribution between studies, with formula implying no population heterogeneity.

By employing Model (20) to account for the heterogeneity, we can derive a new set of weighted estimating equations

(21)

where the weight formula reflects the magnitude of heterogeneity. In the absence of heterogeneity, that is, formula, Equation (21) reduces to Equation (9). It is worthwhile to point out that imposing the semiparametric density ratio model (20) introduces extra parameters; thus, a direct application of the estimation procedure proposed in the previous sections may encounter identifiability problems. To circumvent this challenge, we exploit information in the covariate summary statistics to construct an additional set of constraints to improve model identification. Based on the summary statistics formula, an additional set of estimating equations for γ can be derived as formula and formula, where the latter reflects formula. Collectively, we have formula, where formula with
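The role of the weight can be seen in a small exact calculation: reweighting internal expectations by the density ratio recovers expectations under the external covariate distribution. The sketch below assumes a single binary covariate with hypothetical prevalences in the two studies, for which the density ratio has the exponential form exp(γ0 + γ1 x); the prevalences, the function h, and all names in the code are illustrative assumptions, not quantities from the paper.

```python
import math

P_INT, P_EXT = 0.5, 0.3   # hypothetical P(X = 1) in the internal and external studies

# Density ratio f_ext(x) / f_int(x) = exp(GAMMA0 + GAMMA1 * x) for binary x.
GAMMA0 = math.log((1.0 - P_EXT) / (1.0 - P_INT))
GAMMA1 = math.log(P_EXT / P_INT) - GAMMA0

def weight(x):
    """Density ratio weight evaluated at covariate value x."""
    return math.exp(GAMMA0 + GAMMA1 * x)

def h(x):
    """Hypothetical function of interest, e.g. a survival probability given x."""
    return 0.8 if x == 0 else 0.6

def mean_internal(fn):
    """Expectation of fn(X) under the internal distribution of the binary covariate."""
    return (1.0 - P_INT) * fn(0) + P_INT * fn(1)

ext_mean = (1.0 - P_EXT) * h(0) + P_EXT * h(1)          # target: external expectation
reweighted = mean_internal(lambda x: weight(x) * h(x))  # weighted internal expectation
total_weight = mean_internal(weight)                    # weights average to one
```

Here the weighted internal expectation coincides with the external one, and the weights have internal mean one, which is the kind of normalization an additional estimating equation for γ can encode.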

(22)
(23)

In formula, formula can be different from formula as long as formula satisfies the regularity conditions in the Appendix. Moreover, the number of additional estimating equations for γ can be greater than q.

We propose to estimate β by maximizing the augmented log empirical likelihood function

(24)

subject to the constraints

(25)

where the third constraint is the empirical version of the population estimating equation formula.

Arguing as in Section 2.2, we can estimate formula by maximizing the objective function

(26)
(27)

where ξ is the Lagrange multiplier satisfying formula. Let formula be the derivative of formula with respect to formula. The maximizer, denoted by formula, can be obtained by solving formula using the Newton–Raphson algorithm. The asymptotic properties of formula are summarized in Theorem 3, with the proof given in the Supporting Information.

 
Theorem 3.

Under regularity conditions for pseudopartial likelihood estimators (Zucker, 2005, p. 1273) and Conditions (A1)–(A4) stated in the Appendix, assume that there exists a constant formula so that formula. Then, as n → ∞, (i) formula converges in probability to β0 and (ii) formula converges in distribution to a zero-mean multivariate normal distribution with covariance matrix formula, where formula, formula, formula, formula, formula, formula, formula, and formula.

Because formula is positive semidefinite, the proposed estimator formula is asymptotically at least as efficient as the conventional pseudopartial likelihood estimator formula. When formula, we have formulaformula and thus formula, with formula. In this case, the asymptotic variance–covariance matrix of formula is formula and is free of V0 as the uncertainty in the external information is negligible. On the other hand, when formula, we have formula, and thus, there is no efficiency gain when compared with formula. Note that formula can be more efficient than formula in the absence of heterogeneity. This is because the efficiency loss resulting from estimating an extra set of parameters in the density ratio model can be compensated by including additional constraints based on the marginal covariate distribution.

4 Numerical Simulations

Simulation studies were conducted to evaluate the performance of the proposed methods under two special cases of the semiparametric transformation model, namely, the Cox model and the proportional odds model. For the internal study, we independently generated X1 from the standard normal distribution N(0, 1) and X2 from the Bernoulli distribution with formula. Given formula, the survival time T has a cumulative hazard function formula. We considered two sets of model specifications: (I) formula with formula and (II) formula with formula. The follow-up time C was generated from a uniform distribution so that the censoring rate was approximately 30%. In all simulations, 1000 internal study datasets were generated, each with n = 400.

On the other hand, the external study data were generated with or without the homogeneity assumption on the covariate distribution. Specifically, for the homogeneity case, the simulation setting was identical to that of the internal study. For the heterogeneity case, formula was generated from the normal distribution with mean 0.2 and variance 0.49, whereas formula was generated from the Bernoulli distribution with formula. Hence, the density ratio model formula, with formula, characterized the difference in the covariate distribution between internal and external studies. All other simulation settings were the same as the internal study. The external study sample size was set to be formula and 400 for different degrees of uncertainty in the external information.
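For the continuous covariate, the exponential form of the density ratio can be verified directly. The sketch below covers only the X1 component, taking the internal distribution N(0, 1) and the external distribution N(0.2, 0.49) stated in the text, and checks numerically that the ratio of the two normal densities equals exp(g0 + g1 x + g2 x²) with closed-form coefficients. The coefficient names and the evaluation points are ours; the binary covariate would contribute an additional term that is linear in its value, which is omitted here.

```python
import math

def normal_pdf(x, mean, var):
    """Density of N(mean, var) at x."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

MU, VAR = 0.2, 0.49   # external distribution of X1 stated in the text

# log{f_ext(x) / f_int(x)} = g0 + g1 * x + g2 * x**2 when f_int is N(0, 1).
g0 = -0.5 * math.log(VAR) - MU**2 / (2.0 * VAR)
g1 = MU / VAR
g2 = 0.5 - 1.0 / (2.0 * VAR)

def ratio_closed_form(x):
    return math.exp(g0 + g1 * x + g2 * x * x)

def ratio_direct(x):
    return normal_pdf(x, MU, VAR) / normal_pdf(x, 0.0, 1.0)

max_gap = max(
    abs(ratio_closed_form(x) - ratio_direct(x)) for x in (-2.0, -0.5, 0.0, 0.7, 1.5)
)
```

The coefficients are approximately g1 ≈ 0.408 and g2 ≈ −0.520, so both the mean shift and the smaller variance of the external covariate enter through a D(X) that contains X1 and its square.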

We considered external information in the form of survival probability at formula for the two subgroups: formula and formula. For the homogeneity case, the true survival probabilities of subgroups Ω1 and Ω2 are 0.68 and 0.84 under the Cox model and 0.72 and 0.85 under the proportional odds model; for the heterogeneity case, the subgroup survival probabilities are 0.72 and 0.83 under the Cox model and are 0.75 and 0.84 under the proportional odds model. In each simulation, the subgroup survival probabilities formula, formula, were estimated using the Kaplan–Meier method with the external study data, and the variance–covariance matrix V0 was calculated using Greenwood's formula. Finally, when adjusting for heterogeneity via the density ratio model, we also incorporated summary statistics of the covariates in the external study, given in the form of formula with formula, to improve model estimation.
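The way a reported survival probability and its variance arise can be sketched with a bare-bones Kaplan–Meier computation. The function below is a minimal illustration, not the code used in the simulations; its name and the toy data are hypothetical. It returns the Kaplan–Meier estimate at a time point together with the Greenwood variance estimate.

```python
def km_greenwood(times, events, t):
    """Kaplan-Meier survival estimate at time t and its Greenwood variance.

    times: observed times; events: 1 for failure, 0 for censoring.
    """
    surv, gw_sum = 1.0, 0.0
    failure_times = sorted({ti for ti, ei in zip(times, events) if ei == 1 and ti <= t})
    for u in failure_times:
        at_risk = sum(1 for ti in times if ti >= u)
        deaths = sum(1 for ti, ei in zip(times, events) if ti == u and ei == 1)
        surv *= 1.0 - deaths / at_risk
        # Skip the Greenwood increment when everyone at risk fails (surv is 0).
        if at_risk > deaths:
            gw_sum += deaths / (at_risk * (at_risk - deaths))
    return surv, surv * surv * gw_sum

s_hat, var_hat = km_greenwood([1, 2, 3, 4, 5], [1, 0, 1, 1, 0], t=3.5)
```

Since V0 is defined as the variance–covariance matrix of the estimates scaled by the square root of N, our reading is that a diagonal entry of V0 would correspond to N times the Greenwood variance computed within the relevant subgroup.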

Tables 1 and 2 summarize the performance of the proposed methods under the homogeneity assumption on the covariate distribution, whereas Tables 3 and 4 summarize their performance in the heterogeneity case. We examined the biases, asymptotic standard errors, and empirical standard deviations of the conventional pseudopartial likelihood estimator (formula), the maximum likelihood estimator (MLE, formula) implemented using the R package transmdl, the empirical likelihood estimator without accounting for uncertainty (formula), the augmented empirical likelihood estimator accounting for uncertainty but not population heterogeneity (formula), and the estimator accounting for both uncertainty and heterogeneity by employing a density ratio model (formula).

Table 1

Summary of simulation results for the Cox model assuming formula

Scenario (I)

| Estimator | N | β1 Bias | β1 ESD (ASE) | β1 CP | β1 RE | β2 Bias | β2 ESD (ASE) | β2 CP | β2 RE |
|---|---|---|---|---|---|---|---|---|---|
| Pseudopartial likelihood | | −7 | 64 (66) | 95.4 | | 6 | 124 (123) | 94.7 | |
| MLE | | −7 | 64 (66) | 95.4 | 1.00 | 6 | 124 (123) | 94.7 | 1.00 |
| Empirical likelihood (EL) | | −6 | 26 (25) | 94.6 | 6.68 | 5 | 124 (123) | 94.2 | 1.01 |
| Augmented EL, uncertainty only | 400 | −8 | 62 (63) | 94.8 | 1.07 | 6 | 124 (123) | 94.8 | 1.00 |
| Augmented EL, uncertainty and heterogeneity | 400 | −8 | 62 (63) | 95.3 | 1.07 | 6 | 124 (123) | 94.8 | 1.00 |
| Augmented EL, uncertainty only | 10,000 | −7 | 41 (40) | 95.6 | 2.44 | 5 | 124 (123) | 94.6 | 1.00 |
| Augmented EL, uncertainty and heterogeneity | 10,000 | −5 | 39 (39) | 94.4 | 2.66 | 5 | 124 (123) | 94.7 | 1.00 |

Scenario (II)

| Estimator | N | β1 Bias | β1 ESD (ASE) | β1 CP | β1 RE | β2 Bias | β2 ESD (ASE) | β2 CP | β2 RE | β3 Bias | β3 ESD (ASE) | β3 CP | β3 RE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Pseudopartial likelihood | | −9 | 97 (95) | 94.3 | | 10 | 129 (129) | 95.6 | | −2 | 132 (129) | 93.8 | |
| MLE | | −9 | 97 (95) | 94.3 | 1.00 | 10 | 129 (129) | 95.6 | 1.00 | −2 | 132 (129) | 93.8 | 1.00 |
| Empirical likelihood (EL) | | −6 | 27 (27) | 94.6 | 12.46 | 9 | 127 (127) | 94.9 | 1.03 | −3 | 99 (98) | 93.9 | 1.79 |
| Augmented EL, uncertainty only | 400 | −10 | 90 (87) | 94.0 | 1.15 | 10 | 129 (128) | 95.4 | 1.00 | 0 | 129 (124) | 94.0 | 1.06 |
| Augmented EL, uncertainty and heterogeneity | 400 | −10 | 90 (87) | 94.1 | 1.15 | 10 | 129 (128) | 95.4 | 1.00 | 0 | 128 (124) | 93.8 | 1.06 |
| Augmented EL, uncertainty only | 10,000 | −7 | 46 (45) | 95.0 | 4.33 | 9 | 128 (127) | 94.1 | 1.02 | −2 | 105 (103) | 94.7 | 1.60 |
| Augmented EL, uncertainty and heterogeneity | 10,000 | −6 | 44 (43) | 94.1 | 4.85 | 9 | 127 (127) | 95.4 | 1.03 | −3 | 104 (103) | 95.2 | 1.63 |

Note: The estimators are the pseudopartial likelihood estimator; the maximum likelihood estimator (MLE); the empirical likelihood (EL) estimator; the proposed augmented EL estimator accounting for uncertainty in auxiliary information; and the proposed augmented EL estimator accounting for population heterogeneity and uncertainty in auxiliary information. N is the external study sample size. Bias, ESD, ASE, and CP are the empirical bias (× 1000), the empirical standard deviation (× 1000), the average of the estimated asymptotic standard error (× 1000) over 1000 simulated datasets, and the 95% coverage probability (%). RE is the empirical variance of the pseudopartial likelihood estimator divided by that of the estimator in each row.

Table 2

Summary of simulation results for the proportional odds model assuming formula

Scenario (I)

| Estimator | N | β1 Bias | β1 ESD (ASE) | β1 CP | β1 RE | β2 Bias | β2 ESD (ASE) | β2 CP | β2 RE |
|---|---|---|---|---|---|---|---|---|---|
| Pseudopartial likelihood | | −5 | 95 (97) | 95.3 | | 8 | 188 (188) | 94.8 | |
| MLE | | −6 | 94 (96) | 95.5 | 1.01 | −3 | 187 (188) | 95.6 | 1.01 |
| Empirical likelihood (EL) | | −6 | 28 (27) | 95.1 | 13.19 | 9 | 188 (188) | 94.8 | 1.01 |
| Augmented EL, uncertainty only | 400 | −9 | 89 (90) | 95.7 | 1.14 | 10 | 189 (188) | 94.9 | 0.99 |
| Augmented EL, uncertainty and heterogeneity | 400 | −9 | 89 (90) | 95.8 | 1.15 | 10 | 189 (188) | 94.9 | 0.99 |
| Augmented EL, uncertainty only | 10,000 | −5 | 49 (49) | 94.8 | 3.80 | 9 | 189 (188) | 94.8 | 1.00 |
| Augmented EL, uncertainty and heterogeneity | 10,000 | −3 | 47 (47) | 94.8 | 4.21 | 9 | 188 (188) | 94.8 | 1.00 |

Scenario (II)

| Estimator | N | β1 Bias | β1 ESD (ASE) | β1 CP | β1 RE | β2 Bias | β2 ESD (ASE) | β2 CP | β2 RE | β3 Bias | β3 ESD (ASE) | β3 CP | β3 RE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Pseudopartial likelihood | | −1 | 139 (137) | 94.0 | | 9 | 193 (192) | 94.3 | | −9 | 193 (191) | 94.9 | |
| MLE | | −4 | 137 (137) | 94.6 | 1.02 | 7 | 193 (192) | 94.7 | 1.01 | 5 | 193 (191) | 95.2 | 1.01 |
| Empirical likelihood (EL) | | −6 | 28 (27) | 94.7 | 24.07 | 9 | 193 (191) | 94.4 | 1.01 | −10 | 145 (142) | 95.0 | 1.79 |
| Augmented EL, uncertainty only | 400 | −7 | 122 (120) | 95.0 | 1.31 | 13 | 194 (192) | 94.1 | 0.99 | −10 | 184 (180) | 94.7 | 1.11 |
| Augmented EL, uncertainty and heterogeneity | 400 | −7 | 121 (120) | 95.1 | 1.31 | 13 | 194 (192) | 94.2 | 0.99 | −10 | 184 (180) | 94.6 | 1.11 |
| Augmented EL, uncertainty only | 10,000 | −4 | 53 (52) | 94.5 | 6.91 | 11 | 193 (191) | 94.4 | 1.01 | −12 | 151 (148) | 94.6 | 1.64 |
| Augmented EL, uncertainty and heterogeneity | 10,000 | −3 | 50 (50) | 94.6 | 7.66 | 11 | 193 (191) | 94.4 | 1.01 | −13 | 151 (147) | 94.4 | 1.64 |

Note: The estimators are the pseudopartial likelihood estimator; the maximum likelihood estimator (MLE); the empirical likelihood (EL) estimator; the proposed augmented EL estimator accounting for uncertainty in auxiliary information; and the proposed augmented EL estimator accounting for population heterogeneity and uncertainty in auxiliary information. N is the external study sample size. Bias, ESD, ASE, and CP are the empirical bias (× 1000), the empirical standard deviation (× 1000), the average of the estimated asymptotic standard error (× 1000) over 1000 simulated datasets, and the 95% coverage probability (%). RE is the empirical variance of the pseudopartial likelihood estimator divided by that of the estimator in each row.

Table 3

Summary of simulation results for the Cox model assuming formula

| Estimator | β1 Bias | β1 ESD (ASE) | β1 CP | β1 RE | β2 Bias | β2 ESD (ASE) | β2 CP | β2 RE | β3 Bias | β3 ESD (ASE) | β3 CP | β3 RE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Scenario (I): formula | | | | | | | | | | | | |
| formula | −7 | 64 (66) | 95.4 | | 6 | 124 (123) | 94.7 | | | | | |
| formula | −7 | 64 (66) | 95.4 | 1.00 | 6 | 124 (123) | 94.7 | 1.00 | | | | |
| formula | 128 | 20 (19) | 0 | 10.01 | −15 | 120 (123) | 95.5 | 1.07 | | | | |
| N = 400 | | | | | | | | | | | | |
| formula | 8 | 62 (62) | 94.7 | 1.08 | 4 | 124 (123) | 94.8 | 1.01 | | | | |
| formula | −6 | 62 (64) | 95.3 | 1.07 | 6 | 124 (123) | 94.8 | 1.00 | | | | |
| N = 10,000 | | | | | | | | | | | | |
| formula | 99 | 38 (36) | 24.2 | 2.85 | −10 | 120 (123) | 95.6 | 1.07 | | | | |
| formula | −2 | 42 (42) | 92.7 | 2.37 | 5 | 124 (123) | 94.6 | 1.01 | | | | |
| Scenario (II): formula | | | | | | | | | | | | |
| formula | −9 | 97 (95) | 94.3 | | 10 | 129 (129) | 95.6 | | −2 | 132 (129) | 93.8 | |
| formula | −9 | 97 (95) | 94.3 | 1.00 | 10 | 129 (129) | 95.6 | 1.00 | −2 | 132 (129) | 93.8 | 1.00 |
| formula | 134 | 20 (19) | 0 | 23.34 | −14 | 125 (127) | 95.3 | 1.06 | −132 | 96 (96) | 73.4 | 1.90 |
| N = 400 | | | | | | | | | | | | |
| formula | 20 | 89 (85) | 92.7 | 1.19 | 4 | 128 (128) | 95.3 | 1.02 | −27 | 127 (123) | 93.0 | 1.08 |
| formula | −9 | 93 (89) | 93.8 | 1.09 | 10 | 129 (128) | 95.2 | 1.01 | −2 | 130 (125) | 94.4 | 1.04 |
| N = 10,000 | | | | | | | | | | | | |
| formula | 119 | 40 (39) | 15.0 | 5.81 | −12 | 125 (127) | 95.7 | 1.06 | −118 | 101 (101) | 80.1 | 1.73 |
| formula | −4 | 49 (48) | 94.6 | 3.85 | 9 | 127 (127) | 95.4 | 1.03 | −5 | 105 (105) | 94.5 | 1.60 |

Note: formula, the pseudopartial likelihood estimator; formula, the maximum likelihood estimator; formula, the empirical likelihood estimator; formula, the proposed estimator accounting for uncertainty in auxiliary information; formula, the proposed estimator accounting for population heterogeneity and uncertainty in auxiliary information. Bias, ESD, ASE, and CP are empirical bias (× 1000), empirical standard deviation (× 1000), the average of the estimated asymptotic standard error (× 1000) over 1000 simulated datasets, and the 95% coverage probability. RE, the empirical variance of formula divided by that of the proposed estimators.

Table 4

Summary of simulation results for the proportional odds model assuming formula

| Estimator | β1 Bias | β1 ESD (ASE) | β1 CP | β1 RE | β2 Bias | β2 ESD (ASE) | β2 CP | β2 RE | β3 Bias | β3 ESD (ASE) | β3 CP | β3 RE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Scenario (I): formula | | | | | | | | | | | | |
| formula | −5 | 95 (97) | 95.3 | | 8 | 188 (188) | 94.8 | | | | | |
| formula | −6 | 94 (96) | 95.5 | 1.01 | −3 | 187 (188) | 95.6 | 1.01 | | | | |
| formula | 138 | 20 (19) | 0 | 22.31 | 1 | 185 (188) | 95.2 | 1.03 | | | | |
| N = 400 | | | | | | | | | | | | |
| formula | 16 | 89 (89) | 93.8 | 1.15 | 8 | 188 (188) | 94.7 | 1.00 | | | | |
| formula | −7 | 91 (92) | 93.9 | 1.09 | 10 | 189 (188) | 94.8 | 0.99 | | | | |
| N = 10,000 | | | | | | | | | | | | |
| formula | 117 | 44 (43) | 24.1 | 4.68 | 2 | 186 (188) | 95.0 | 1.03 | | | | |
| formula | −3 | 54 (53) | 95.0 | 3.17 | 9 | 188 (188) | 94.9 | 1.00 | | | | |
| Scenario (II): formula | | | | | | | | | | | | |
| formula | −1 | 139 (137) | 94.0 | | 9 | 193 (192) | 94.3 | | −9 | 193 (191) | 94.9 | |
| formula | −4 | 137 (137) | 94.6 | 1.02 | 7 | 193 (192) | 94.7 | 1.01 | 5 | 193 (191) | 95.2 | 1.01 |
| formula | 141 | 20 (19) | 0 | 48.02 | 0 | 191 (191) | 94.0 | 1.02 | −152 | 143 (140) | 81.2 | 1.84 |
| N = 400 | | | | | | | | | | | | |
| formula | 36 | 119 (117) | 93.7 | 1.36 | 10 | 194 (192) | 94.3 | 1.00 | −52 | 181 (178) | 94.0 | 1.14 |
| formula | −3 | 127 (125) | 95.1 | 1.19 | 13 | 194 (192) | 94.2 | 0.99 | −14 | 186 (183) | 94.7 | 1.08 |
| N = 10,000 | | | | | | | | | | | | |
| formula | 132 | 46 (45) | 18.3 | 9.21 | 1 | 191 (191) | 94.2 | 1.02 | −143 | 148 (146) | 83.8 | 1.71 |
| formula | 1 | 57 (58) | 95.3 | 5.88 | 11 | 193 (191) | 94.2 | 1.00 | −17 | 153 (150) | 94.3 | 1.60 |

Note: formula, the pseudopartial likelihood estimator; formula, the maximum likelihood estimator; formula, the empirical likelihood estimator; formula, the proposed estimator accounting for uncertainty in auxiliary information; formula, the proposed estimator accounting for population heterogeneity and uncertainty in auxiliary information. Bias, ESD, ASE, and CP are empirical bias (× 1000), empirical standard deviation (× 1000), the average of the estimated asymptotic standard error (× 1000) over 1000 simulated datasets, and the 95% coverage probability. RE, the empirical variance of formula divided by that of the proposed estimators.

In the absence of population heterogeneity, all methods yield small biases in parameter estimation, and the coverage rates of the 95% confidence intervals based on the estimated asymptotic standard errors are very close to the nominal level (0.95). Compared with formula, the proposed methods enjoy efficiency gains in estimating β1 in Scenario (I) and both β1 and β3 in Scenario (II), with the upper bound given by that of formula, but not in estimating β2 in either case. Note that the external information consists of two mutually exclusive subgroups that differ in their values of X1 but not X2. Thus, the efficiency gain is mainly observed for effects involving X1. On the other hand, formula is slightly more efficient than formula, with the relative efficiency ranging from 1.01 to 1.02. Yet, it is computationally more costly than its competitors: the computational burden of the MLE is about 52 times that of formula (26,382 s vs. 509 s for analyzing 10 datasets with formula and formula).
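To make the summary measures used in the simulation tables concrete, the following minimal sketch computes Bias, ESD, ASE, CP, and RE from a set of replicated estimates. The function names `summarize` and `relative_efficiency`, the toy normal draws standing in for simulated estimators, and the Wald-type interval with critical value 1.96 are illustrative assumptions; this is not the simulation code behind the reported results.

```python
import random
import statistics

def summarize(estimates, std_errors, truth, z=1.96):
    """Bias, ESD, ASE (each x 1000) and CP (%) for one parameter."""
    bias = statistics.fmean(estimates) - truth
    esd = statistics.stdev(estimates)       # empirical standard deviation
    ase = statistics.fmean(std_errors)      # average estimated standard error
    # A replicate "covers" when the truth lies inside estimate +/- z * SE.
    covered = [abs(b - truth) <= z * se for b, se in zip(estimates, std_errors)]
    cp = 100 * sum(covered) / len(covered)
    return {"Bias": 1000 * bias, "ESD": 1000 * esd, "ASE": 1000 * ase, "CP": cp}

def relative_efficiency(reference, proposed):
    """Empirical variance of the reference estimator over that of the proposed one."""
    return statistics.variance(reference) / statistics.variance(proposed)

# Toy stand-in (assumption): two unbiased normal "estimators" with different spreads.
rng = random.Random(2023)
truth = 0.5
ref = [rng.gauss(truth, 0.10) for _ in range(1000)]
prop = [rng.gauss(truth, 0.05) for _ in range(1000)]

summary = summarize(prop, [0.05] * 1000, truth)
re = relative_efficiency(ref, prop)
print(summary, round(re, 2))
```

In this toy setting the second estimator has half the standard deviation of the first, so the relative efficiency should be close to 4, which mirrors how RE values above 1 in the tables indicate a gain over the reference estimator.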

When heterogeneity between internal and external studies is present, Tables 3 and 4 show that formula and formula, the augmented empirical likelihood estimators that do not account for heterogeneity, can yield large biases. When a density ratio model is employed to characterize population heterogeneity, formula yields small biases while enjoying efficiency gains under all scenarios. When information from a large external study with formula is exploited, the relative efficiency in estimating β1 under Scenario (I) is 2.19 and 3.28 for the Cox model and the proportional odds model, respectively. On the other hand, the relative efficiency in Scenario (II) can be as high as 3.86 and 1.59 in estimating β1 and β3 under the Cox model, and 5.87 and 1.60 under the proportional odds model.
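A rough normal approximation helps explain why the coverage probabilities in Tables 3 and 4 collapse when heterogeneity is ignored: if an estimator is approximately normal with a given bias and standard error, the Wald interval is centered away from the truth by bias/SE standard errors. The sketch below is only a back-of-the-envelope check; the helper `approx_coverage` is hypothetical, the inputs are bias and ASE values (× 1000) read from Table 3, and the approximation ignores variability in the estimated standard errors.

```python
from math import erf, sqrt

def normal_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def approx_coverage(bias, se, z=1.96):
    """Coverage of estimate +/- z*se when estimate ~ N(truth + bias, se^2)."""
    shift = bias / se
    # The truth is covered when the standardized error lies in [-z - shift, z - shift].
    return normal_cdf(z - shift) - normal_cdf(-z - shift)

no_bias = approx_coverage(0, 19)
large_bias = approx_coverage(128, 19)    # Table 3, Scenario (I): bias 128, ASE 19
moderate_bias = approx_coverage(99, 36)  # Table 3, Scenario (I), N = 10,000: bias 99, ASE 36
print(round(no_bias, 3), large_bias, round(moderate_bias, 3))
```

Under this approximation, a bias of several standard errors drives the coverage to essentially zero, and a bias of between two and three standard errors leaves coverage well below the nominal level, which is qualitatively in line with the reported CP values of 0 and 24.2.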

To investigate the robustness of the proposed methods against model misspecification, we carried out additional simulation studies with incorrect choices of formula in the semiparametric density ratio model. The results are presented in Tables S1–S2 of the Supporting Information. In the case where formula fails to include X2, the bias in estimating β remains negligible, and the efficiency gain is similar to that under the correctly specified model. On the other hand, failing to include X1 or formula in formula leads to larger biases in parameter estimation. The results can be explained by the fact that the external information consists of two mutually exclusive subgroups that differ in their values of X1 but not X2. Thus, model misspecification has a minor impact on the estimation of β when X2 is not included in formula.

Following the suggestions of the reviewers, we expanded the simulation studies by including the smaller internal sample sizes formula and 200, varying the external sample size N from 200 to 10,000, and considering different censoring rates. The results show that the proposed methods perform well in all situations. Details of the additional simulation studies are reported in Section S2 of the Supporting Information. Moreover, since formula may not be available in practice, we studied the asymptotic properties and investigated the numerical performance of the proposed method when its estimate formula is employed instead. As expected, replacing formula with its estimate yields a larger asymptotic variance. Interestingly, simulation studies show that the two estimators perform similarly in estimating β but not γ. Details can be found in Section 3 of the Supporting Information.

5 Pancreatic Cancer Data Analysis

Pancreatic cancer is a highly aggressive disease. According to Global Cancer Statistics 2020 (Sung et al., 2021), pancreatic cancer is the seventh leading cause of cancer death in the world, and its incidence rate is on the rise. Late diagnosis, early metastasis, and lack of effective therapy have contributed to the dismal overall prognosis, with only 6% of the patients surviving more than 5 years after diagnosis. Despite recent advances in cancer diagnosis and treatment, surgical resection remains the only possible curative option for pancreatic cancer. However, less than 20% of the patients are eligible for resection when diagnosed, as they often present at advanced stages. Moreover, most patients with resectable pancreatic cancer have an unfavorable outcome due to recurrent disease within a few years. The 5-year survival probability after resection is reported to be around 34.5% (Yamamoto et al., 2015). Hence, it is crucial to identify prognostic factors for pancreatic cancer patients to improve disease management.

In the Johns Hopkins Hospital pancreatic cancer study described in Section 1, patients' demographic information, treatments, and clinical and pathological examination results were collected via a retrospective chart review. All-cause and cancer-specific deaths were determined by a combined review of clinical information, the Social Security Death Index, and the National Cancer Database. Prognostic factors of interest in our analysis included resection margin status, lymph node involvement, invasion of the surrounding nerves, age, and gender. After excluding patients with missing data, the mean age at the time of surgery among the 204 remaining patients was 64.2 years. About half of the patients were male (51.9%), and about half had positive resection margins (48.0%). The majority of the patients had perineural invasion (PNI) (94.1%) and lymph node involvement (86.2%).

To evaluate the effects of prognostic factors on survival after pancreatectomy, we fit two special cases of the semiparametric transformation model: the Cox model and the proportional odds model. As reported in Table 5, both analyses concluded that female sex, older age, PNI, node positivity, and margin positivity were associated with poorer prognosis. However, the effect of node status, a known prognostic factor, did not reach statistical significance in either model, most likely due to the small number of patients without lymph node involvement.

Table 5

Estimated regression coefficients of the Cox model and the proportional odds model for the pancreatic cancer study

| Model / estimator | Node positive, Est (SE) | Margin positive, Est (SE) | PNI, Est (SE) | >65 years, Est (SE) | Male, Est (SE) | γ0, Est (SE) | γ1, Est (SE) | γ2, Est (SE) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Cox model | | | | | | | | |
| formula | 0.363 (0.226) | 0.407 (0.153) | 1.124 (0.373) | 0.282 (0.153) | −0.295 (0.154) | | | |
| formula | 0.509 (0.158) | 0.393 (0.153) | 1.128 (0.373) | 0.275 (0.153) | −0.291 (0.154) | 1.141 (0.226) | −0.867 (0.269) | −1.098 (0.290) |
| Proportional odds model | | | | | | | | |
| formula | 0.654 (0.373) | 0.916 (0.256) | 2.096 (0.744) | 0.385 (0.246) | −0.341 (0.248) | | | |
| formula | 0.877 (0.282) | 0.899 (0.255) | 2.089 (0.743) | 0.377 (0.245) | −0.340 (0.249) | 1.143 (0.226) | −0.867 (0.269) | −1.100 (0.290) |

Note: formula, the conventional pseudopartial likelihood estimator without synthesizing auxiliary information; formula, the proposed estimator accounting for uncertainty and heterogeneity in the external information. Est denotes the estimate; SE denotes the standard error, calculated as the square root of the asymptotic variance. formula, and γ2, the regression parameters of the density ratio model corresponding to intercept, margin positivity, and node positivity, respectively.

To improve estimation efficiency, we seek to incorporate information from Ahmad et al. (2001), which reported 3-year survival probabilities with respect to lymph node status based on data from 116 patients. From this study, we exploited two sets of 3-year subgroup survival probabilities: 14% for patients with positive node status and 38% for patients with negative node status. An examination of the available covariate summary statistics revealed that the proportions of margin-positive and node-positive patients in the external study were only 24% and 62%, respectively, which were much lower than those in the internal study (48% and 86%, respectively), indicating strong heterogeneity in the covariate distributions between the internal and external studies. As discussed in Section 3.2, we constructed a density ratio model with margin status and node status to account for this heterogeneity. Note that we applied the estimator described in Section 3 of the Supporting Information to account for variability in the covariate summary statistics.
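As a rough plausibility check of the density ratio adjustment, the sketch below reweights the internal covariate proportions by an exponential tilt and compares the implied proportions with those reported for the external study. Several ingredients are assumptions made only for illustration: the tilt is taken to be proportional to exp(γ1 × margin + γ2 × node), margin and node status are treated as independent in the internal sample, and the coefficients are the rounded Cox-model values from Table 5. The function `tilted_proportions` is hypothetical.

```python
from itertools import product
from math import exp

# Quoted in the text and Table 5 (Cox fit); tilt form and independence are assumptions.
p_margin_int, p_node_int = 0.48, 0.86      # internal proportions
gamma_margin, gamma_node = -0.867, -1.098  # density ratio coefficients

def tilted_proportions(p1, p2, g1, g2):
    """Reweight a 2x2 covariate table by exp(g1*x1 + g2*x2) and renormalize."""
    cells = {}
    for x1, x2 in product((0, 1), repeat=2):
        prob = (p1 if x1 else 1 - p1) * (p2 if x2 else 1 - p2)  # independence assumed
        cells[(x1, x2)] = prob * exp(g1 * x1 + g2 * x2)
    total = sum(cells.values())  # self-normalization, so the intercept is not needed
    margin = sum(w for (x1, _), w in cells.items() if x1) / total
    node = sum(w for (_, x2), w in cells.items() if x2) / total
    return margin, node

implied_margin, implied_node = tilted_proportions(
    p_margin_int, p_node_int, gamma_margin, gamma_node
)
print(round(implied_margin, 3), round(implied_node, 3))
```

Under these simplifying assumptions, the implied proportions are about 0.28 and 0.67, that is, shifted from the internal values (0.48 and 0.86) toward the external ones (0.24 and 0.62). Exact agreement is not expected, because the fitted model also draws on the survival information and the actual joint covariate distribution.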

Table 5 summarizes the fitted Cox model and the fitted proportional odds model using the proposed methods. We extracted the standard errors of the reported 3-year survival probabilities from Ahmad et al. (2001) to evaluate the degree of uncertainty in the external information. In particular, a substantial efficiency gain is observed in estimating the effect of node positivity, which determines the two subgroups in the external information. The estimated coefficients of margin positivity and node positivity in the density ratio model are −0.867 (95% CI, [formula0.340]) and −1.098 (95% CI, [formula0.529]), respectively, under the Cox model, and −0.867 (95% CI, [formula0.339]) and −1.100 (95% CI, [formula0.531]), respectively, under the proportional odds model, indicating significant heterogeneity between the internal and external studies. Interestingly, the efficiency loss due to estimating an additional set of parameters in the density ratio model is minimal. As a result, compared with formula, the relative efficiency in estimating the effect of node positivity is 2.046 under the Cox model and 1.750 under the proportional odds model. Notably, the effect of lymph node involvement reaches statistical significance after incorporating the external information. Finally, following Zeng and Lin (2007), we selected the final model based on the Akaike information criterion (AIC). In this data example, the proportional odds model fits slightly better than the Cox model (AIC 1623 vs. 1625). Hence, we conclude that node-positive patients had 2.404 (formula) times the odds of dying before any given time t compared with node-negative patients.
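The derived quantities quoted above can be checked against the rounded entries of Table 5. The sketch below assumes Wald-type intervals (estimate ± 1.96 × SE), relative efficiency computed as the squared ratio of standard errors, and an odds ratio obtained by exponentiating the proportional odds coefficient; the helper names are hypothetical, and small discrepancies in the last digit are expected because the inputs are rounded.

```python
from math import exp

def wald_ci(est, se, z=1.96):
    """Wald-type confidence interval: estimate +/- z * SE."""
    return est - z * se, est + z * se

def relative_efficiency(se_ref, se_new):
    """Ratio of variances implied by two standard errors."""
    return (se_ref / se_new) ** 2

# Density ratio coefficients under the Cox model (Table 5).
ci_margin_cox = wald_ci(-0.867, 0.269)
ci_node_cox = wald_ci(-1.098, 0.290)

# Node positivity: conventional versus proposed standard errors (Table 5).
re_node_cox = relative_efficiency(0.226, 0.158)
re_node_po = relative_efficiency(0.373, 0.282)

# Odds ratio for node positivity under the proportional odds model.
odds_ratio_node_po = exp(0.877)

print(ci_margin_cox, ci_node_cox)
print(round(re_node_cox, 3), round(re_node_po, 3), round(odds_ratio_node_po, 3))
```

Both intervals lie entirely below zero, which is what supports the statement that the covariate distributions differ between the two studies, and the relative efficiencies and odds ratio reproduce the reported 2.046, 1.750, and 2.404 up to rounding.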

6 Discussion

In this article, we propose empirical likelihood-based methods to improve the estimation efficiency of the semiparametric transformation model by incorporating three types of external information: t-year subgroup survival probabilities, variance–covariance matrix of the estimated survival probabilities, and covariate summary statistics. With externally reported t-year subgroup survival probabilities and a consistently estimated variance–covariance matrix, information synthesis can be performed under a homogeneity assumption between internal and external studies. When the homogeneity condition fails to hold, a density ratio model is used to account for population heterogeneity. Therefore, additional information on the distribution of the external covariates is needed to estimate the extra set of parameters in the density ratio model.

When IPD formula from external sources is available, a pooled IPD analysis can be performed to combine information from the internal and external studies. However, when formula, the pooled analysis may not yield better efficiency than properly combining information from the subgroup survival probabilities evaluated in the entire external cohort. On the other hand, when t-year subgroup survival probabilities and a random sample of the covariates formula are available from external sources, the population estimating equation (2) can be approximated using the available external data. Specifically, let formula denote the jump size of the marginal distribution of X* at the data point formula. One can maximize the empirical likelihood obtained by multiplying the pseudopartial likelihood and the marginal likelihood of X* subject to the constraints:

(28)

It is worth pointing out that this approach gives valid inference even when the covariate distribution differs between internal and external studies. Moreover, as argued in Chatterjee et al. (2016) and Han and Lawless (2019), when formula, it can be shown that this approach is asymptotically as or more efficient than formula under a homogeneity assumption between internal and external studies.

In practice, adding more covariates to the density ratio model usually increases the computational burden and instability of the empirical likelihood estimation. Alternatively, one may consider a two-step procedure as follows. In the first step, estimate γ by solving formula when the number of estimating equations constructed from the covariate summary statistics equals q, that is, the dimension of γ. When the number of estimating equations is greater than q, the generalized method of moments (GMM) can be applied to estimate γ. In the second step, the parameter β of interest can be obtained by using the estimating procedure proposed in (17) with the constraint formula replaced by formula, where formula is the solution obtained in Step 1. In our simulation experience, this approach is computationally faster, and its results are similar to those of the empirical likelihood estimation.
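The first step of such a two-step procedure can be sketched for the just-identified case. Because the estimating equation itself is not reproduced here, the sketch assumes a common form: an exponential-tilt density ratio exp(γ0 + γ1ᵀx) whose parameters are chosen so that the reweighted internal covariate means match the external summary statistics. The function `fit_density_ratio`, the damped Newton solver, and the simulated binary covariates are all illustrative assumptions; only the target proportions (0.24 and 0.62) and the internal proportions (0.48 and 0.86) come from the data analysis. The second step is not shown.

```python
import numpy as np

def fit_density_ratio(X, ext_means, n_iter=100, tol=1e-9):
    """Solve (1/n) sum_i exp(z_i' gamma) z_i = (1, ext_means), where z_i = (1, x_i)."""
    Z = np.column_stack([np.ones(len(X)), X])
    target = np.concatenate([[1.0], ext_means])

    def objective(gam):
        # Convex function whose gradient is the estimating function below.
        return np.mean(np.exp(Z @ gam)) - target @ gam

    gamma = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        w = np.exp(Z @ gamma)
        g = Z.T @ w / len(Z) - target            # estimating function
        if np.max(np.abs(g)) < tol:
            break
        J = (Z * w[:, None]).T @ Z / len(Z)      # Jacobian of g
        step = np.linalg.solve(J, g)
        t = 1.0
        while objective(gamma - t * step) > objective(gamma) and t > 1e-8:
            t *= 0.5                             # step-halving keeps Newton stable
        gamma = gamma - t * step
    return gamma

# Simulated internal covariates (assumption): margin and node status indicators.
rng = np.random.default_rng(0)
X = rng.binomial(1, [0.48, 0.86], size=(2000, 2)).astype(float)
ext_means = np.array([0.24, 0.62])               # external proportions from the text

gamma_hat = fit_density_ratio(X, ext_means)
weights = np.exp(np.column_stack([np.ones(len(X)), X]) @ gamma_hat)
reweighted_means = (weights[:, None] * X).mean(axis=0)
print(gamma_hat, reweighted_means)
```

With q = 3 parameters and three moment conditions, the system is just-identified, so the reweighted means reproduce the external proportions exactly at the solution. With more moment conditions than parameters, a GMM criterion would replace the root-finding step.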

It is worthwhile to point out that our use of the density ratio model to account for population heterogeneity differs from what has been proposed in the literature. For example, Huang et al. (2016) required the covariate distributions of the internal and external studies to be the same but allowed the baseline hazard function in the external study to be proportional, rather than identical, to that in the internal study. Recently, Zheng et al. (2022) proposed a calibration weighting method to reduce the bias in individual risk prediction in the presence of population heterogeneity. In contrast to the density ratio model, the calibration weights formula, with ξ being the Lagrange multiplier, also reflect the difference in the covariate distributions between internal and external studies. Our conjecture is that the calibration weight adjustment can be less efficient than the density ratio weight adjustment when important covariates are included in the density ratio model. Further research is warranted and will be pursued in our future work.

Data Availability Statement

The application study data in this paper are not publicly available due to patient privacy and confidentiality issues.

Open Research Badges

This article has earned an Open Materials badge for making publicly available the components of the research methodology needed to reproduce the reported procedure and analysis. All materials are available at the Biometrics website on Wiley Online Library https://github.com/kgolmakani/CumulativeRisk.git.

Acknowledgments

This work was partially supported by Taiwan Ministry of Science and Technology MOST 110-2628-M-007-003-MY2 (Cheng and Liu), MOST 110-2811-M-007-560 (Tsai), and National Institutes of Health grant R01CA193888 (Huang). The authors thank Dr. Lei Zheng for kindly sharing the Johns Hopkins Pancreatic Cancer Study data. They also thank Dr. Ying Sheng for discussion and computing assistance.

References

Ahmad, N.A., Lewis, J.D., Ginsberg, G.G., Haller, D.G., Morris, J.B., Williams, N.N., Rosato, E.F. & Kochman, M.L. (2001) Long term survival after pancreatic resection for pancreatic adenocarcinoma. The American Journal of Gastroenterology, 96(9), 2609–2615.

Chatterjee, N., Chen, Y.H., Maas, P. & Carroll, R.J. (2016) Constrained maximum likelihood estimation for model calibration using summary-level information from external big data sources. Journal of the American Statistical Association, 111(513), 107–117.

Chen, Z., Ning, J., Shen, Y. & Qin, J. (2021) Combining primary cohort data with external aggregate information without assuming comparability. Biometrics, 77(3), 1024–1036.

Gao, F. & Chan, K. (2022) Noniterative adjustment to regression estimators with population-based auxiliary information for semiparametric models. Biometrics, to appear.

Guyatt, G., Cairns, J., Churchill, D., Cook, D., Haynes, B., Hirsh, J., Irvine, J., Levine, M., Levine, M., Nishikawa, J. et al. (1992) Evidence-based medicine: a new approach to teaching the practice of medicine. Journal of the American Medical Association, 268(17), 2420–2425.

Han, P. & Lawless, J.F. (2019) Empirical likelihood estimation using auxiliary summary information with different covariate distributions. Statistica Sinica, 29(3), 1321–1342.

Huang, C.Y., Qin, J. & Tsai, H.T. (2016) Efficient estimation of the Cox model with auxiliary subgroup survival information. Journal of the American Statistical Association, 111(514), 787–799.

Imbens, G.W. & Lancaster, T. (1994) Combining micro and macro data in microeconometric models. The Review of Economic Studies, 61(4), 655–680.

Liu, D., Zheng, Y., Prentice, R.L. & Hsu, L. (2014) Estimating risk with time-to-event data: an application to the women's health initiative. Journal of the American Statistical Association, 109(506), 514–524.

Owen, A.B. (1988) Empirical likelihood ratio confidence intervals for a single functional. Biometrika, 75(2), 237–249.

Qin, J. (2000) Combining parametric and empirical likelihoods. Biometrika, 87(2), 484–490.

Qin, J. & Lawless, J. (1994) Empirical likelihood and general estimating equations. The Annals of Statistics, 22(1), 300–325.

Sheng, Y., Sun, Y., Huang, C.Y. & Kim, M.O. (2021) Synthesizing external aggregated information in the penalized Cox regression under population heterogeneity. Statistics in Medicine, 40(23), 4915–4930.

Shimodaira, H. (2000) Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2), 227–244.

Sung, H., Ferlay, J., Siegel, R.L., Laversanne, M., Soerjomataram, I., Jemal, A. & Bray, F. (2021) Global cancer statistics 2020: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA: A Cancer Journal for Clinicians, 71(3), 209–249.

Sutton, A.J., Abrams, K.R., Jones, D.R., Sheldon, T.A. & Song, F. (2000) Methods for meta-analysis in medical research. Chichester: Wiley.

Thomas, D.R. & Grunkemeier, G.L. (1975) Confidence interval estimation of survival probabilities for censored data. Journal of the American Statistical Association, 70(352), 865–871.

Whitehead, A. (2002) Meta-analysis of controlled clinical trials. Chichester: John Wiley & Sons.

Yamamoto, T., Yagi, S., Kinoshita, H., Sakamoto, Y., Okada, K., Uryuhara, K., Morimoto, T., Kaihara, S. & Hosotani, R. (2015) Long-term survival after resection of pancreatic cancer: a single-center retrospective analysis. World Journal of Gastroenterology, 21(1), 262–268.

Zeng, D. & Lin, D.Y. (2006) Efficient estimation of semiparametric transformation models for counting processes. Biometrika, 93(3), 627–640.

Zeng, D. & Lin, D.Y. (2007) Maximum likelihood estimation in semiparametric regression models with censored data. Journal of the Royal Statistical Society: Series B, 69(4), 507–564.

Zhang, H., Deng, L., Schiffman, M., Qin, J. & Yu, K. (2020) Generalized integration model for improved statistical inference by leveraging external summary data. Biometrika, 107(3), 689–703.

Zheng, J., Zheng, Y. & Hsu, L. (2022) Risk projection for time-to-event outcome leveraging summary statistics with source individual-level data. Journal of the American Statistical Association, to appear.

Zucker, D.M. (2005) A pseudo–partial likelihood method for semiparametric survival regression with covariate errors. Journal of the American Statistical Association, 100(472), 1264–1277.

Appendix A

Below are the regularity conditions for the method of empirical likelihood in Theorem 3. Let formula and formula.

  • (A1) formula has a unique solution at μ0.

  • (A2) The covariate vector X is bounded with probability 1, and the true regression parameter β0 lies in a compact subset of formula.

  • (A3) formula is positive definite; the functions formula and formula are continuous in a neighborhood of the true value μ0.

  • (A4) formula and formula are bounded by some integrable function in the same neighborhood of μ0 as in (A3).

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://dbpia.nl.go.kr/journals/pages/open_access/funder_policies/chorus/standard_publication_model)

Supplementary data