Abstract

We consider identification and inference about mean functionals of observed covariates and an outcome variable subject to non-ignorable missingness. By leveraging a shadow variable, we establish a necessary and sufficient condition for identification of the mean functional even if the full data distribution is not identified. We further characterize a necessary condition for n-estimability of the mean functional. This condition naturally strengthens the identifying condition, and it requires the existence of a function as a solution to a representer equation that connects the shadow variable to the mean functional. Solutions to the representer equation may not be unique, which presents substantial challenges for non-parametric estimation, and standard theories for non-parametric sieve estimators are not applicable here. We construct a consistent estimator of the solution set and then adapt the theory of extremum estimators to find from the estimated set a consistent estimator of an appropriately chosen solution. The estimator is asymptotically normal, locally efficient and attains the semi-parametric efficiency bound under certain regularity conditions. We illustrate the proposed approach via simulations and a real data application on home pricing.

1 Introduction

Non-response is frequently encountered in social science and biomedical studies, due to reluctance to answer sensitive survey questions. Certain characteristics about the missing data mechanism have been previously used to define a taxonomy to describe the missingness process (Little & Rubin, 2002; Rubin, 1976). Data are said to be missing at random (MAR) if the propensity of missingness conditional on all study variables is unrelated to the missing values. Otherwise, it is called missing not at random (MNAR) or non-ignorable. MAR has been commonly used for statistical analysis in the presence of missing data; however, in many fields of study, suspicion that the missing data mechanism may be non-ignorable is often warranted (Ibrahim et al., 1999; Robins et al., 2000; Rotnitzky & Robins, 1997; Rotnitzky et al., 1998; Scharfstein et al., 1999). For example, non-response rates in surveys about income tend to be higher for low socio-economic groups (Kim & Yu, 2011). In another example, efforts to estimate HIV prevalence in developing countries via household HIV survey and testing such as the well-known Demographic and Health Survey, are likewise subject to non-ignorable missing data on participants’ HIV status due to highly selective non-participation in the HIV testing component of the survey study (Tchetgen Tchetgen & Wirth, 2017). There currently exist a variety of methods for the analysis of MAR data, such as likelihood-based inference, multiple imputation, inverse probability weighting, and doubly robust methods. However, these methods can result in severe bias and invalid inference in the presence of non-ignorable missing data.

In this paper, we focus on identification and estimation of mean functionals of observed covariates and an outcome variable subject to non-ignorable missingness. Estimating mean functionals is a goal common to many scientific areas, e.g., sampling survey and causal inference, and thus is of significant practical importance. However, there are several difficulties for analysis of non-ignorable missing data. The first challenge is identification, which means that the parameter of interest is uniquely determined from observed data distribution. Identification is straightforward under MAR as the conditional outcome distribution in complete cases equals that in incomplete cases given fully observed covariates, whereas it becomes difficult under MNAR because the selection bias due to missing values is no longer negligible. Even if stringent fully parametric models are imposed on both the propensity score and the outcome regression, identification may not be achieved; for counterexamples, see Wang et al. (2014) and Miao et al. (2016). To resolve the identification difficulty, previous researchers (Kim & Yu, 2011; Robins et al., 2000) have assumed that the selection bias is known or can be estimated from external studies, but this approach should be used rather as a sensitivity analysis, until the validity of the selection bias assumption is assessed. Without knowing the selection bias, identification can be achieved by leveraging fully observed auxiliary variables that are available in many empirical studies. For instance, instrumental variables, which are related to the non-response propensity but not related to the outcome given covariates, have been used in missing data analysis since Heckman (1979). The corresponding semi-parametric theory and inference are recently established by Liu et al. (2020), Sun et al. (2018), Tchetgen Tchetgen and Wirth (2017), Das et al. (2003). Recently, an alternative approach called the shadow variable approach has grown in popularity in sampling survey and missing data analysis. In contrast to the instrumental variable, this approach entails a shadow variable that is associated with the outcome but independent of the missingness process given covariates and the outcome itself. Shadow variable is available in many applications (Kott, 2014). For example, Zahner et al. (1992) and Ibrahim et al. (2001) considered a study of the children’s mental health evaluated through their teachers’ assessments in Connecticut. However, the data for the teachers’ assessments are subject to non-ignorable missingness. As a proxy of the teacher’s assessment, a separate parent report is available for all children in this study. The parent report is likely to be correlated with the teacher’s assessment, but is unlikely to be related to the teacher’s response rate given the teacher’s assessment and fully observed covariates. Hence, the parental assessment is regarded as a shadow variable in this study. The shadow variable design is quite general. In health and social sciences, an accurately measured outcome is routinely available only for a subset of patients, but one or more surrogates may be fully observed. For instance, Robins et al. (1994) considered a cardiovascular disease setting where, due to high cost of laboratory analyses, and to the small amount of stored serum per subject (about 2% of study subjects had stored serum thawed and assayed for antioxidants serum vitamin A and vitamin E), error-prone surrogate measurements of the biomarkers derived from self-reported dietary questionnaire were obtained for all subjects. Other examples include the semi-supervised setting in comparative effectiveness research where the true outcome is measured only for a small fraction of the data, e.g., diagnosis requiring a costly panel of physicians while surrogates are obtained from databases including ICD-9 codes for certain comorbidities. Instead of assuming MAR conditional on the surrogates, the shadow variable assumption may be more appropriate in presence of informative selection bias. By leveraging a shadow variable, D’Haultfœuille (2010) and Miao et al. (2015) established identification results for non-parametric models under completeness conditions. In related work, Wang et al. (2014), Shao and Wang (2016), Tang et al. (2003), Zhao and Shao (2015), and Morikawa and Kim (2021) proposed identification conditions for a suite of parametric and semi-parametric models that require either the propensity score or the outcome regression, or both to be parametric. All existing shadow variable approaches need to further impose sufficiently strong conditions to identify the full data distribution, although in practice one may only be interested in a parameter or a mean functional which may be identifiable even if the full data distribution is not.

The second challenge for MNAR data is model misspecification in estimation, after identification is established. Likelihood-based inference (Greenlees et al., 1982; Tang et al., 2003, 2014), imputation-based methods (Kim & Yu, 2011), inverse probability weighting (Wang et al., 2014) have been developed for analysis of MNAR data. These estimation methods require correct model specification of either the propensity score or the outcome regression, or both. However, bias may arise due to specification error of parametric models as they have limited flexibility, and moreover, model misspecification is more likely to appear in the presence of missing values. Vansteelandt et al. (2007), Miao and Tchetgen Tchetgen (2016), Miao et al. (2015), and Liu et al. (2020) proposed doubly robust methods for estimation with MNAR data, which affords double protection against model misspecification; however, these previous proposals require an odds ratio model characterizing the degree of non-ignorable missingness to be either completely known or correctly specified.

In this paper, we develop a novel strategy to non-parametrically identify and estimate a generic mean functional. In contrast to previous approaches in the literature, this work has the following distinctive features and makes several contributions to non-ignorable missing data literature. First, given a shadow variable, we directly work on the identification of the mean functional without identifying the full data distribution, whereas previous proposals typically have to first ensure identification of the full data distribution. In particular, we establish a necessary and sufficient condition for identification of the mean functional, which is weaker than the commonly used completeness condition for non-parametric identification. For estimation, we propose a representer assumption which is necessary for n-estimability of the mean functional. The representer assumption naturally strengthens the identifying condition, and it involves the existence of a function as a solution to a representer equation that relates the mean functional with the shadow variable. Second, under the representer assumption, we propose non-parametric estimation that no longer involves parametrically modelling the propensity score or the outcome regression. Non-parametric estimation has been largely studied in the literature and has received broad acceptance with machine learning tools in many applications (Chen & Pouzo, 2012; Kennedy et al., 2017; Newey & Powell, 2003). Because the solution to the representer equation may not be unique, we first construct a consistent estimator of the solution set. We use the method of sieves to approximate unknown smooth functions as possible solutions and estimate corresponding coefficients by applying a minimum distance procedure, which has been routinely used in semi-parametric and non-parametric econometrics literature (Ai & Chen, 2003; Chen & Pouzo, 2012; Newey & Powell, 2003; Santos, 2011). We then adapt the theory of extremum estimators to find from the estimated set a consistent estimator of an appropriately chosen solution. Based on such an estimator, we propose a representer-based estimator of the mean functional. Under certain regularity conditions, we establish consistency and asymptotic normality for the proposed estimator. Due to the non-uniqueness of solutions to the representer equation, the asymptotic results do not simply follow from the standard non-parametric theory and additional techniques are required. Besides, the proposed estimator is locally efficient for the mean functional under our shadow variable model, if the completeness condition is further imposed.

The rest of this paper proceeds as follows. In Section 2, we provide a necessary and sufficient identifying condition for a generic mean functional within the shadow variable framework, and also introduce a representer assumption that is necessary for n-estimability of the mean functional. In Section 3, we develop a model-free estimator of the mean functional, establish the asymptotic theory, discuss its semi-parametric efficiency, and provide computational details. In Section 4, we study the finite-sample performance of the proposed approach via both simulation studies and a real data example about home pricing. We conclude with a discussion in Section 5 and relegate proofs to the supplementary material.

2 Identification

In this section, we discuss identification of a generic mean functional in the shadow variable framework regardless of the identifiability of the full data distribution. In particular, we provide a necessary and sufficient condition for identifying the mean functional, which is weaker than the commonly used completeness conditions. We also introduce a representer assumption by naturally strengthening the identifying condition, because this assumption is necessary for n-estimability of the mean functional.

Let X denote a vector of fully observed covariates, Y the outcome variable that is subject to missingness, and R the missingness indicator with R=1 if Y is observed and R=0 otherwise. The missingness process may depend on the missing values. We let f() denote the probability density or mass function of a random variable (vector). The observed data contain n independent and identically distributed realizations of (R,X,Y,Z) with the values of Y missing for R=0. We are interested in the identification and statistical inference about the mean functional μ=E{τ(X,Y)} for a generic function τ(). In particular, when τ(X,Y)=Y, the parameter μ=E(Y) corresponds to the population outcome mean that is of particular interest in sampling survey and causal inference. This setup also covers the ordinary least squares regression problem if we choose μ=E(XjY) for any component Xj of X; and further generalizes to arbitrary functionals of the full data that can be defined implicitly as solution to a moment equation. Suppose we observe an additional shadow variable Z that meets the following requirements.

Shadow variable

 
Assumption 1

(i) Z⨿;R(X,Y); (ii) Z⨿̸;YX,R=1.

Assumption 1 reveals that the shadow variable does not affect the missingness process given the covariates and outcome, and it is associated with the outcome given the covariates. There are various causal diagrams satisfying Assumption 1, and we list three such examples in Figure 1. In particular, Figure 1b considers the scenario where the missingness does not depend directly on the outcome but correlates with it via their unmeasured common causes. Besides, Figure 1b allows X and Y to precede Z. Figure 1c is more complicated by allowing for both the direct dependence of R on Y and the existence of unmeasured common causes of X and R. In practice, domain-specific knowledge is required to justify whether such graphical models and Assumption 1 are reasonable, and they need to be investigated on a case-by-case basis. Assumption 1 has been used for adjustment of selection bias in sampling surveys (Kott, 2014) and in missing data literature (D’Haultfœuille, 2010; Miao & Tchetgen Tchetgen, 2016; Wang et al., 2014). Examples and extensive discussions about the assumption can be found in Zahner et al. (1992), Ibrahim et al. (2001), Miao et al. (2015), and Miao and Tchetgen Tchetgen (2016). Assumption 1 is important for identification of μ in non-ignorable missing data analysis. If Z affects the missingness indicator after conditioning on (X,Y), then μ may not be identifiable even in fully parametric models (Miao et al., 2016; Wang et al., 2014). Under Assumption 1, Miao et al. (2015) showed that

(1)

where OR~(X,Y)=OR(X,Y)/E{OR(X,Y)R=1,X}, and OR(X,Y)={f(R=0X,Y)f(R=1X,Y=0)}/{f(R=0X,Y=0)f(R=1X,Y)} is the odds ratio function relating R to Y conditional on X. Under an additional completeness condition that for any square-integrable function ξ, E{ξ(X,Y)R=1,X,Z}=0 almost surely if and only if ξ(X,Y)=0 almost surely, Miao et al. (2015) showed identifiability of OR(X,Y) and further established identifiability of the full data distribution.

Causal graphical examples (a)-(c) illustrating Assumption 1 with the shadow variable Z, missingness indicator R, covariates X and outcome Y; variables in circles are unmeasured.
Figure 1.

Causal graphical examples (a)-(c) illustrating Assumption 1 with the shadow variable Z, missingness indicator R, covariates X and outcome Y; variables in circles are unmeasured.

We note that the equation in (1) can be simplified as follows:

(2)

where

To illustrate how the shadow variable Z is leveraged for identification, we consider the special case where both Y and Z are binary. By setting Z=0,1, we obtain two equations for the unknown functions γ(X,1) and γ(X,0) from (2). Then these two unknown functions can be uniquely solved from the system of the two equations as long as Z⨿̸;YR=1,X. This implies that γ(X,Y), and therefore, the full data distribution are identifiable. In general, without further assumptions such as completeness conditions, γ(X,Y) is not identifiable from (2). Nevertheless, by noting that

the expectation functional μ can be identified under conditions possibly weaker than those required for identification of γ(X,Y) itself. The intuition is as follows. According to the above equation, the identification of μ depends on identification of the linear functional E{Rγ(X,Y)τ(X,Y)} or E{γ(X,Y)τ(X,Y)R=1}. If we can decompose γ(X,Y) into some identifiable part γ1(X,Y) and non-identifiable part γ2(X,Y), i.e., γ(X,Y)=γ1(X,Y)+γ2(X,Y) with E{γ2(X,Y)τ(X,Y)R=1}=0, then we have E{γ(X,Y)τ(X,Y)R=1}=E{γ1(X,Y)τ(X,Y)R=1}, and hence the linear functional E{Rγ(X,Y)τ(X,Y)} is identifiable. This in fact can be achieved with γ1(X,Y) being the orthogonal projection of γ(X,Y) onto an appropriate chosen space so that γ1(X,Y) is guaranteed to be identifiable; accordingly, γ2(X,Y) is the orthogonal projection of γ(X,Y) onto the orthogonal complement of the chosen space.

The formal discussions on how to choose the projection space are characterized in terms of operators on Hilbert spaces. Let L2{f(X,Y,Z)} denote the set of real valued functions of (X,Y,Z) that are square integrable with respect to the conditional distribution of (X,Y,Z) given R=1. Let T:L2{f(X,Y)}L2{f(X,Z)} be the linear operator given by T(ξ)=E{ξ(X,Y)R=1,X,Z}. Its adjoint operator T:L2{f(X,Z)}L2{f(X,Y)} is the linear map T(η)=E{η(X,Z)R=1,X,Y}. The range and null space of T is denoted by R(T) and N(T), respectively. The orthogonal complement of a set A is denoted by A and its closure in the norm topology is cl(A). Let PA denote the orthogonal projection onto the set A. If the projection space discussed above is N(T), then γ1(X,Y)=PN(T)γ(X,Y) is identifiable and the reasons are given as follows. Noting that the equation in (2) can be rewritten as Tγ(X,Y)=β(X,Z) using the operator T, then from the decomposition γ(X,Y)=PN(T)γ(X,Y)+PN(T)γ(X,Y), it follows that (T|N(T))PN(T)γ(X,Y)=β(X,Z), where T|N(T) is the restriction of T to N(T). Since N(T|N(T))={0}, the quantity PN(T)γ(X,Y) is identifiable.

 
Theorem 1

Under Assumption 1, μ is identifiable if and only if τ(X,Y)N(T).

Since the definition of the operator T involves only observed data, the identifying condition could be justified in principle without extra model assumptions on the missing data distribution. In contrast to previous approaches that have identified the full data distribution under varying completeness conditions, our identification strategy allows for a larger class of models where the parameter of interest is uniquely identified even though the full data law may not be. For example, Miao et al. (2015) established identification of the full data distribution under the completeness condition within the shadow variable framework. Since we may only be interested in a mean functional of the data distribution in practice, identification of such a parameter requires possibly weaker assumptions. In fact, Theorem 1 has provided a necessary and sufficient condition for identification of the mean functional μ given a valid shadow variable. When the completeness condition holds, then N(T)={0}, and the functional μ is identifiable because τ(X,Y)N(T)=L2{f(X,Y)}. Even if the completeness condition does not hold, we can still identify μ as long as τ(X,Y)N(T). An example is given at the end of this section. In other words, the completeness condition is sufficient but not necessary for identifying the mean functional μ. Although the proposed condition τ(X,Y)N(T) guarantees identifiability of μ, it does not suffice for its n-estimability, particularly for functionals that lie in the boundary of N(T). Following Lemma 4.1 in Severini and Tripathi (2012), one can show that the following assumption is necessary for n-estimability of μ under some regularity conditions.

Representer

 
Assumption 2
τ(X,Y)R(T), i.e., there exists a function δ0(X,Z) such that
(3)

Since N(T)=cl{R(T)}, the representer assumption naturally strengthens the identifying condition τ(X,Y)N(T). Assumption 2 requires the existence of a solution to the representer equation in (3), which essentially implies Assumption 1(ii). Thus, if Assumption 2 holds, then Assumption 1(ii) is satisfied. Equation (3) relates the function τ(x,y) with the shadow variable via the representer function δ0(x,z). This equation has some connections to confounding bridge functions in negative control literature for identification of average causal effects (Cui et al., 2020; Miao et al., 2018), as both of them have defined their separate inverse problems formally known as the Fredholm integral equation of the first kind. It is also important to note that the representer equation does not follow directly from the confounding bridge function. As discussed above, the representer assumption is motivated by the necessary and sufficient identifying condition given in Theorem 1. The requirement for existence of solutions to (3) in Assumption 2 is mild, and this assumption is nearly necessary for the completeness condition. Specifically, if the completeness condition holds, then the solution to (3) exists under some regularity conditions on the singular value decomposition of a conditional mean operator given in Conditions S1–S2 of online supplementary materials. Similar conditions for existence of a solution to the Fredholm integral equation of the first kind are discussed by Carrasco et al. (2007), Miao et al. (2018), and Cui et al. (2020).

We give some toy examples where Assumption 2 holds. If there exists some transformation of Z such that E{λ(Z)R=1,X,Y}=α(X)+β(X)τ(X,Y) and β(x)0, then Assumption 2 is met with δ0(X,Z)={λ(Z)α(X)}/β(X). As a special case, λ(Z)=Z when E(ZR=1,X,Y) is linear in τ(X,Y). Note that Assumption 2 only requires the existence of solutions to equation (3), but not uniqueness. For instance, if both Z and Y are binary, then δ0(X,Z) is unique and

where fy(X)=f(Z=1R=1,X,Y=y) for y=0,1. However, if Z has more levels than Y, then δ0(X,Z) may not be unique. For simplicity, we may drop the arguments in δ0(X,Z) and directly use δ0 in what follows, and notation for other functions are treated in a similar way.

 
Corollary 1
Under Assumptions 1 and 2, μ is identifiable, and

From Corollary 1, even if δ0 is not uniquely determined, all solutions to Assumption 2 must result in an identical value of μ. Moreover, this identification result does not require identification of the full data distribution f(R,X,Y,Z). In fact, identification of f(R,X,Y,Z) is not ensured under Assumptions 1 and 2 only; see Example 1. To our knowledge, the identifying Assumptions 12 are so far the weakest for the shadow variable approach. Besides that, Assumption 2 is also necessary for n-estimability of μ within the shadow variable framework. We further illustrate Assumption 2 with the following example.

 
Example 1

Consider the following two data generating processes (DGPs):

DGP 1: YUniform(0,1), ZyBernoulli(y), and f(R=1y,z)=4y2(1y).

DGP 2: YBeta(2,2), ZyBernoulli(y), and f(R=1y,z)=2y/3.

Suppose we are interested in estimating the outcome mean μ=E(Y). It is easy to verify that the above two DGPs satisfy Assumption 2 by choosing δ0(X,Z)=Z. These two DGPs imply the same outcome mean E(Y)=1/2 and the same observed data distribution, because f(R=1,y,z)=4y2(1y){zy+(1z)(1y)} and f(z)=1/2 in these two DGPs. However, the full data distributions of these two DGPs are different.

3 Estimation, inference, and computation

In this section, we provide an estimation method without modelling the propensity score or outcome regression. Previous approaches often require fully or partially parametric models for at least one of them. For example, Qin et al. (2002) and Wang et al. (2014) assumed a fully parametric model for the propensity score; Kim and Yu (2011) and Shao and Wang (2016) relaxed their assumption and considered a semi-parametric exponential tilt model for the propensity; Miao and Tchetgen Tchetgen (2016) proposed doubly robust estimation methods by either requiring a parametric propensity score or an outcome regression to be correctly specified. Our approach aims to avoiding (i) point identification of the full data law under more stringent conditions and (ii) parametric assumptions either for identification or for estimation.

As implied by Corollary 1, any solution to (3) provides a valid δ0 for identifying the parameter μ. Suppose that all such solutions belong to a set Δ of smooth functions, with specific requirements for smooth functions given in Definition 1. Then the set of solutions to (3) is

(4)

For estimation and inference about μ, we need to construct a consistent estimator of some fixed element δ0Δ0. If Δ0 were known, then one would simply select one element δ0 from the set and use this element to estimate μ. Without knowing Δ0, the lack of identification of δ0 presents important technical challenges. In particular, common estimators of the non-parametric nuisance function δ0 may be unstable and may not necessarily converge to any fixed limit. The resulting estimator of the mean functional μ using these nuisance estimates would then have intractable asymptotic behaviour. However, the solution set Δ0 is identified and can be estimated. We thus aim to obtain an estimator δ^0 in the following two steps: first, construct a consistent estimator Δ^0 of the set Δ0; second, carefully select δ^0Δ^0 such that it is a consistent estimator of a fixed element δ0Δ0.

3.1 Estimation of the solution set Δ0

Define the criterion function

Then the solution set Δ0 in (4) is equal to the set of zeros of Q(δ), i.e.,

and hence, estimation of Δ0 is equivalent to estimation of zeros of Q(δ). This can be accomplished with the approximate minimizers of a sample analogue of Q(δ) (Chernozhukov et al., 2007).

We adopt the sieve approach to construct a sample analogue function Qn(δ) for Q(δ) and a corresponding approximation Δn for Δ. Let {ψq(x,z)}q=1 denote a sequence of known approximating functions of x and z, and

(5)

for some known qn and unknown parameters {βq}q=1qn. The construction of Qn involves non-parametric estimation of conditional expectations. Let {ϕk(x,y)}k=1 be a sequence of known approximating functions of x and y. Denote the vector of the first kn terms of the basis functions by

and let

For a generic random variable B=B(X,Y,Z) with realizations {Bi=B(Xi,Yi,Zi)}i=1n, the non-parametric sieve estimator of E(BR=1,x,y) is obtained by the linear regression of B on the vector ϕ(X,Y) in complete cases, i.e.,

(6)

Then the sample analogue Qn of Q is

(7)

with

(8)

where the explicit expression of e^(Xi,Yi,δ) is obtained from (6). Finally, the proposed estimator of Δ0 is

(9)

where Δn and Qn(δ) are given in (5) and (7), respectively, and {cn}n=1 is a sequence of small positive numbers converging to zero at an appropriate rate. The requirement on the rate of cn will be discussed later for theoretical analysis.

3.2 Set consistency of Δ^0

We establish the set consistency of Δ^0 for Δ0 in terms of Hausdorff distances. For a given norm , the Hausdorff distance between two sets Δ1,Δ2Δ is

where d(Δ1,Δ2)=supδ1Δ1infδ2Δ2δ1δ2 and d(Δ2,Δ1) is defined analogously. Thus, Δ^0 is consistent under the Hausdorff distance if both the maximal approximation error of Δ^0 by Δ0 and of Δ0 by Δ^0 converge to zero in probability.

We consider two different norms for the Hausdorff distance: the pseudo-norm w defined by

and the supremum norm defined by

From the representer equation (3), we have that for any δ^0Δ^0 and δ0,δΔ0, δ^0δ0w=δ^0δw. Hence,

(10)

This result implies that we can obtain the convergence rate of δ^0δ0w by deriving that of d(Δ^0,Δ0,w). However, the identified set Δ0 is an equivalence class under the pseudo-norm, and the convergence under dH(Δ^0,Δ0,w) does not suffice to consistently estimating a given element δ0Δ0. Whereas the supremum norm is able to differentiate between elements in Δ0, and dH(Δ^0,Δ0,)=op(1) under certain regularity condition as we will show later.

We make the following assumptions to achieve consistency of Δ^0 under the metric dH(Δ^0,Δ0,) and to obtain the rate of convergence for Δ^0 under the weaker metric dH(Δ^0,Δ0,w).

 
Assumption 3

The vector of covariates XRd has support [0,1]d, and the outcome YR and the shadow variable ZR have compact supports.

Assumption 3 requires (X,Y,Z) to have compact supports, and without loss of generality, we assume that X has been transformed such that the support is [0,1]d. These conditions are fairly standard in semi-parametric literature. Although Y and Z are also required to have compact support, the proposed approach may still be applied if the supports are infinite with sufficiently thin tails. For instance, in our simulation studies where the variables Y and Z are drawn from a normal distribution in Section 4, the proposed approach continues to perform quite well.

We next impose restrictions on the smoothness of functions in the set Δ, characterized by the following Sobolev norm.

 
Definition 1
For a generic function ρ(w) defined on wRd, we define
where λ is a d-dimensional vector of non-negative integers, |λ|=i=1dλi, α_ denotes the largest integer smaller than α, Dλρ(w)=|λ|ρ(w)/w1λ1wdλd, and D0ρ(w)=ρ(w).

A function ρ with ρ,α< has uniformly bounded partial derivatives up to order α_; besides, the α_th partial derivative of this function is Lipschitz of order αα_.

 
Assumption 4

The following conditions hold:

  • supδΔδ,α< for some α>(d+1)/2; in addition, Δ0, ΔnΔ, and both Δn and Δ are closed;

  • There is an operator Πn such that for every δΔ, ΠnδΔn and supδΔδΠnδ=O(ηn) for some ηn=o(1).

Assumption 4(i) requires that each function δΔ is sufficiently smooth and bounded. The closedness condition in this assumption and Assumption 3 together imply that Δ is compact under . It is well known that solving integral equations as in (3) is ill-posed. The ill-posedness due to non-continuity of the solution and difficulty of computation can have a severe impact on the consistency and convergence rates of estimators. The compactness condition ensures that the consistency of the proposed estimator under is not affected by the ill-posedness. Such a compactness condition is commonly made in the non-parametric and semi-parametric literature; see, e.g., Newey and Powell (2003), Ai and Chen (2003), and Chen and Pouzo (2012). Alternatively, it is possible to address the ill-posed problem by employing a regularization approach as in Horowitz (2009) and Darolles et al. (2011).

Assumption 4(ii) quantifies the approximation error of functions in Δ by the sieve space Δn. This condition is satisfied by many commonly used function spaces (e.g., Hölder space), whose elements are sufficiently smooth, and by popular sieves (e.g., power series, splines). For example, consider the function set Δ with supδΔδ,α<. If the sieve functions {ψq(x,z)}q=1 are polynomials or tensor product univariate splines, then uniformly on δΔ, the approximation error of δ by functions of the form q=1qnβqψq(x,z)Δn under is of the order O{qnα/(d+1)}. Thus, Assumption 4(ii) is met with ηn=qnα/(d+1); see Chen (2007) for further discussion.

 
Assumption 5

The following conditions hold:

  • the smallest and largest eigenvalues of E{Rϕ(X,Y)ϕ(X,Y)T} are bounded above and away from zero for all kn;

  • for every δΔ, there is a πn(δ)Rkn such that
  • ξn2kn=o(n), where ξn=supx,yϕ(x,y)2.

Assumption 5 bounds the second moment matrix of the approximating functions away from singularity, presents a uniform approximation error of the series estimator to the conditional mean function, and restricts the magnitude of the series terms. These conditions are standard for series estimation of conditional mean functions; see, e.g., Newey (1997), Ai and Chen (2003), and Huang (2003). Primitive conditions are discussed below so that the rate requirements in this assumption hold. Consider any δΔ satisfying Assumption 4, i.e., supδΔδ,α<. If the partial derivatives of f(zr=1,x,y) with respect to (x,y) are continuously differentiable up to order α_+1, then under Assumption 3, we have supδE{δ(X,Z)R=1,x,y},α<. In addition, if the sieve functions {ϕk(x,y)}k=1kn are polynomials or tensor product univariate splines, then by similar arguments after Assumption 4, we conclude that the approximation error under is of the order O{knα/(d+1)} uniformly on δΔ. Verifying Assumption 5(iii) depends on the relationship between ξn and kn. For example, if {ϕk(x,y)}k=1kn are tensor product univariate splines, then ξn=O{kn(d+1)/2}.

Write cn in (9) by bn/an with appropriate sequences an and bn, and define λn=kn/n+kn2α/(d+1)+ηn2.

 
Theorem 2
Suppose that Assumptions 35 hold. If an=O(λn1), bn and bn=o(an), Then

Theorem 2 shows the consistency of Δ^0 under the supremum-norm metric dH(Δ^0,Δ0,) and establishes the rate of convergence of Δ^0 under the weaker pseudo-norm metric dH(Δ^0,Δ0,w). In particular, if we let kn3=o(n), kn3α/(d+1)=o(n1), and ηn=o(n1/3) as imposed in Assumption 7 in the next subsection, then λn=o(n2/3) or λn1n2/3. We take an=λn1/2n1/3 and bn=an1/2/n1/3. Thus, an=λn1(λnn2/3)1/2=o(λn1), bn=(λn1n2/3)1/4, and bn=anan1/2n1/3=o(an). In fact, under such rate requirements, we have n2/3bn=o(an) and dH(Δ^0,Δ0,w)=op(n1/4), which are sufficient to establish the asymptotic normality of the proposed estimator given in Subsection 3.4.

3.3 A representer-based estimator

In this subsection, we aim to provide a consistent estimator of the mean functional μ identified in Corollary 1 and discuss its asymptotic properties. Since we have obtained a consistent estimator Δ^0 of the solution set Δ0 according to Theorem 2, it remains to select an estimator from Δ^0 such that it converges to a unique element belonging to Δ0. Although any element of the set Δ^0 is valid for constructing a consistent estimator of μ, such an estimator may not be asymptotically normal if the element δ^0Δ^0 is not carefully chosen. We adapt the theory of extremum estimators to choose the element δ^0 (Kosorok, 2008). Let M:ΔR be a population criterion functional that attains a unique minimum δ0 on Δ0 and Mn(δ) be its sample analogue. We then choose the minimizer of Mn(δ) over the estimated solution set Δ^0, denoted by

(11)

which is expected to converge to the unique minimum δ0 of M(δ) on Δ0.

 
Assumption 6

The function set Δ is convex; the functional M:ΔR is strictly convex and attains a unique minimum at δ0 on Δ0; its sample analogue Mn:ΔR is continuous and supδΔ|Mn(δ)M(δ)|=op(1).

One simple example of the criterion functional M satisfying Assumption 6 is

(12)

This is a convex functional with respect to δ. Its sample analogue is

(13)

Under Assumptions 34, we have shown in supplementary materials that the function class {(1R)δ2:δΔ} is a Glivenko-Cantelli class, and thus supδΔ|Mn(δ)M(δ)|=op(1).

 
Theorem 3
Suppose that Assumptions 36 hold. Then
where δ^0 is defined through (11) and δ0 is defined in Assumption 6. In addition, if an=O(λn1), bn and bn=o(an), we then have

Theorem 3 implies that by choosing an appropriate functional M(δ) satisfying Assumption 6, it is possible to construct a consistent estimator δ^0 for some unique element δ0Δ0 in terms of supremum norm and further obtain its rate of convergence under the weaker pseudo-norm w.

Based on the estimator δ^0 given in (11), we obtain the following representer-based estimator μ^rep of μ:

(14)

The estimator μ^rep depends on the criterion functional M through δ^0. However, since Theorem 3 has shown that δ^0δ0=op(1) for any M satisfying Assumption 6, it then follows from the law of large numbers and Corollary 1 that μ^rep is always consistent for μ irrespective of the choice of M.

Below we discuss the asymptotic expansion of the estimator μ^rep. Let H¯ be the closure of the linear span of Δ under w, which is a Hilbert space with inner product:

for any δ1,δ2H¯.

 
Assumption 7

The following conditions hold:

  • there exists a function h0Δ such that
  • ηn=o(n1/3), kn3α/(d+1)=o(n1), kn3=o(n), ξn2kn2=o(n), and ξn2kn2α/(d+1)=o(1).

Note that the linear functional δE{(1R)δ(X,Z)} is continuous under w. Hence, by the Riesz representation theorem, there exists a unique h0H¯ (up to an equivalence class in w) such that h0,δw=E{(1R)δ(X,Z)} for all δH¯. However, Assumption 7(i) further requires that this equivalence class must contain at least one element that falls in Δ. This can be guaranteed by choosing Δ to be a closed subspace of L2{f(X,Z)}, because H¯=Δ in this case. Assumption 4 has required Δ to be closed, bounded, and sufficiently smooth. An example of Δ satisfying all these requirements is a closed subspace of the Hölder space, e.g., a space of polynomials with degree no larger than some fixed integer. For a more general function set Δ, a primitive condition for Assumption 7(i) is that the inverse probability weight also has a smooth representer: if

(15)

then h0 satisfies Assumption 7(i). Assumption 7(ii) imposes some rate requirements on the approximation error ηn of functions in Δ by the sieve space Δn and the dimension kn of the approximation sieve functions {ϕk(x,y)}k=1kn. For bounded approximation sieves ϕk(x,y), we have ξn=O(kn1/2) according to the definition. Then Assumption 7(ii) requires that knnβ, where (d+1)/(3α)<β<1/3 and 2α>d+1. These requirements can be satisfied as long as the function classes being approximated in Assumptions 4 and 5 are sufficiently smooth.

 
Theorem 4
Suppose that Assumptions 37 hold. We have that
with
(16)
where E^() and e^() are defined in (6) and (8), respectively.

Theorem 4 reveals an asymptotic expansion of μ^rep. The estimator μ^rep is not necessarily asymptotically normal as the bias term rn(δ^0) may not be asymptotically negligible. Nevertheless, if the bias term rn(δ^0) can be estimated with sufficiently fast convergence rates, then by subtracting the estimated bias term from the representer-based estimator μ^rep, we can obtain a debiased estimator that is asymptotically normal.

3.4 A debiased semi-parametric locally efficient estimator

In this subsection, we propose a debiased representer-based estimator which is regular and asymptotically normal. Since the estimator is based on the chosen element δ^0Δ^0 through some criterion functional M in Assumption 6, the asymptotic variance may depend on the choice of M. If further a key completeness condition holds, the solution set Δ0 contains a single element at the true data generating mechanism and any choice of M leads to an estimator with the same asymptotic variance. Under such circumstance, we further establish that the estimator is semi-parametric locally efficient. To derive the debiased estimator, we need to estimate the bias term rn(δ^0). Note that only Πnh0 is unknown in the expression of rn(δ^0) given in (16). We propose to construct an estimator of Πnh0 and define the following criterion function:

and its sample analogue,

Since E{(1R)δ(X,Z)}=h0,δw by Assumption 7, it follows that C(δ)=δh0w2h0w2. Thus, h0 is the unique minimizer of δC(δ) up to the equivalence class in w. In addition, since h0 and Πnh0 are close under the metric by Assumption 4(ii), we then define the estimator of Πnh0 by:

(17)

Given the estimator h^, an approximation to the term rn(δ^0) is given by

(18)
 
Lemma 1
Suppose that Assumptions 35 and 7 hold. Then it follows that

This lemma establishes the rate of convergence of r^n(δ^0) to rn(δ^0) uniformly on Δ^0. If cn converges to zero sufficiently fast, then supδ^0Δ^0n|r^n(δ^0)rn(δ^0)|=op(1). The rate conditions imposed in Assumption 7(ii) guarantee that such a choice of cn is feasible. As a result, Theorem 7 and Lemma 1 together imply that a debiased estimator that is n-consistent and asymptotically normal can be constructed by subtracting the estimated bias r^n(δ^0) from μ^rep:

(19)
 
Theorem 5
Suppose that Assumptions 37 hold. If an=O(λn1), bn and n2/3bn=o(an), then n(μ^repdbμ) converges in distribution to N(0,σ2), where σ2 is the variance of
(20)

Theorem 5 shows that although the debiased estimator μrepdb depends on the criterion functional M, such an estimator is always n-consistent and asymptotically normal as long as the choice of M satisfies Assumption 6. However, because δ0 appearing in (20) is the minimizer of M on Δ0, the asymptotic variance of μ^repdb may depend on the choice of M. As such, sensitivity analysis may be useful for choosing M. In particular, given a variety of candidate functionals M satisfying Assumption 6, one may obtain a series of corresponding debiased estimators, from which, an estimator with minimal variance can be selected. However, since there may exist infinite choices, it is in general unable to select an estimator that achieves the semi-parametric efficiency bound under the shadow variable model. Based on (20), one can easily construct an estimator of the asymptotic variance as follows:

 
Proposition 1

Under Assumptions 37, the estimator σ^2 converges in probability to σ2 as n goes to infinity.

Proposition 1 implies that given α(0,1), an asymptotic 100(1α)% confidence interval is [μ^repdbzα/2σ^/n,μ^repdb+zα/2σ^/n], where zα/2=Φ1(1α/2). The formula in (20) presents the influence function for μ^repdb. Under the following conditions, the influence function is locally efficient in the sense that it attains the semi-parametric efficiency bound for the mean functional in the semi-parametric model Mnp defined by (2).

 
Assumption 8

The following conditions hold:

  • Completeness: (a) for any square-integrable function ξ(x,y), E{ξ(X,Y)R=1,X,Z}=0 almost surely if and only if ξ(X,Y)=0 almost surely; (b) for any square-integrable function η(x,z), E{η(X,Z)R=1,X,Y}=0 almost surely if and only if η(X,Z)=0 almost surely.

  • Denote Ω(x,z)=E[{γ(X,Y)β(X,Z)}2R=1,X=x,Z=z]. Suppose that 0<infx,zΩ(x,z)supx,zΩ(x,z)<.

Completeness in Assumption 8(i) is equivalent to the injectivity of the conditional expectation operator (D’Haultfoeuille, 2011). Under the completeness condition in Assumption 8(i), γ(X,Y) is identifiable, δ0(X,Z) and h0(X,Z) that respectively solve equations (3) and (15) are also uniquely determined. Completeness can be checked in specific models; for instance, one can check whether the covariance matrix is of full rank in the joint normal model. However, without restrictions on the distribution, it is generally impossible to provide empirical evidence in favor of the completeness condition (Canay et al., 2013). Assumption 8(ii) bounds the conditional variance Ω(x,z) of γ(X,Y) away from zero and infinity.

 
Theorem 6

The influence function in (20) attains the semi-parametric efficiency bound of μ in Mnp at the submodel where h0(x,z) solves equation (15) and Assumptions 18 hold.

Given the completeness condition in Assumption 8(i), the solution set Δ0 contains a single element at the true data generating mechanism. Then the nuisance function δ0 appearing in (20), which is the minimizer of a convex functional M on Δ0, should be unique irrespective of the choice of M. Theorem 6 implies that any choice of M leads to an asymptotically equivalent estimator, which also attains the semi-parametric efficiency bound at a key submodel where the representer is uniquely identified.

3.5 Computation

In this subsection, we discuss computations of δ^0 in (11) and h^ in (17) that are both required for μ^repdb. For the computation of δ^0, we aim to solve the following constrained optimization problem:

(21)

The constrained function Qn in (7) is quadratic in δ and the objective function Mn(δ) is required to be convex. It remains to impose some constraints on Δn so that we could have a tractable optimization problem.

A computationally simple choice for Δn is linear sieves as defined in (5); that is, for any δΔn, we have δ=ψTθ, where θ=(θ1,,θqn)T and ψ(x,z)={ψ1(x,z),,ψqn(x,z)}T. For this choice of Δn, if we define Δ={δ:δ,αK} for some K>0 as required by Assumption 4, then the constraint ψTθ,αK can be highly non-linear in θ. We thus follow Newey and Powell (2003) by defining Δ as the closure of the set {δ:δ2,α0K} under 2,α0, where α0>α+(d+1)/2 and

In the above equation, Z denotes the support of Z, λ is a (d+1)-dimensional vector of non-negative integers, |λ|=i=1d+1λi, and the operator Dλ() is defined in Definition 1. Then the constraint ψTθΔn now turns to θTHnθK2, where

Based on the above discussions, the constrained optimization problem in (21) becomes:

(22)

and the estimate of δ0 is δ^0=ψTθ^0. The computation of h^ is similar to that of δ^0. Specifically, we first solve

and then let h^=ψTθ^h. In the above optimization problems, cn and K are tuning parameters. In principle, one can set these parameters to arbitrary values as long as the rate requirements of the asymptotic results are satisfied. The theory, however, offers little guidance for how to select their values. Several data-driven methods for selecting tuning parameters in series estimation have been discussed by Li and Racine (2007). Here we suggest using κ-fold cross-validation to select the tuning parameters. Specifically, we split the data into κ balanced groups. Of the κ subsamples, a single subsample is retained as the test data and the remaining κ1 subsamples are used as the training data. We obtain an estimate δ^0 by solving the constrained optimization problem in (22) based on the training data, and then evaluate it using the criterion function (7) on the test data. We repeat the above process κ times, with each of the κ subsamples used exactly once as the test data. The cross-validation error is defined as an average of the κ evaluation results; the tuning parameters cn and K are chosen by minimizing the cross validation error. In practice, we tend to choose large values for K and small positive values for cn. In the simulation studies and real application, candidate values for cn and K are chosen as follows: cn=c×Qn(ψTθ^), where θ^ denotes the minimizer of Qn without any constraints, c{1,2,5,10}, and K{103,105}.

4 Numerical studies

4.1 Simulation

In this subsection, we conduct simulation studies to evaluate the performance of the proposed estimators in finite samples. We consider two different cases. In case (I), the data are generated under models where the full data distribution is identified. In case (II), the full data distribution is not identified but Assumption 2 holds.

For case (I), we generate four covariates X=(X1,X2,X3,X4)T according to XjUniform(0,1) for j=1,,4. We consider four data generating settings, including combinations of two choices of outcome models and two choices of propensity score models.

The missingness indicator R does not depend on Z given (X,Y), and hence the shadow variable assumption holds. We generate Y and Z from a normal distribution here to evaluate the proposed approach. Although Y and Z are both required to have compact support in Assumption 3, the proposed approach may still be applied in this situation. The missing data proportion in each of these settings is about 50%. For each setting, we replicate 1,000 simulations at sample sizes 500 and 1,000. We apply the proposed debiased estimator μ^repdb to estimate the population outcome mean μ=E(Y). To implement the proposed estimator, we choose polynomials of order 3 as sieve functions in both simulation studies and the empirical example, and choose the quadratic criterion functionals M and Mn given in (12) and (13), respectively. We denote the resultant estimator by qd-REP-DB. Quartic functionals for M and Mn are also considered in the current simulation setting, and the corresponding estimator is denoted by qr-REP-DB. For comparison, we also use the following methods to estimate μ: (i) ln-mnarIPW, an inverse probability weighted estimator with a logistic linear propensity score model assuming MNAR; (ii) py-mnarIPW, an inverse probability weighted estimator with a logistic polynomial-sieves propensity score assuming MNAR; (iii) marREG, a regression-based estimator assuming MAR.

Simulation results are reported in Figure 2. The proposed estimators qd-REP-DB and qr-REP-DB have negligible biases and their performances are almost the same in all four settings. In contrast, the estimator ln-mnarIPW with a logistic linear propensity score model can have comparable bias with ours only when the propensity score model is correctly specified; see settings (a) and (c). If the propensity score model is incorrectly specified as in settings (b) and (d), the estimator ln-mnarIPW exhibits an obvious downward bias and does not vanish when the sample size increases. Somewhat surprisingly, the estimator py-mnarIPW with a more complex propensity score model performs even worse than ln-mnarIPW with a simple logistic linear model in almost all settings. The reasons may be as follows. Note that the inverse probability weighted estimator under MNAR is obtained by solving estimating equations (Wang et al., 2014). When we use a complex propensity score model for the missingness mechanism, we need to solve a large system of equations with many unknown parameters, which may yield unstable solutions with less accuracy. Finally, since the missingness mechanism is MNAR, the estimator marREG assuming MAR is expected to have non-negligible bias in all settings. We also conduct simulation studies when the shadow variable assumption does not hold, but MAR does, and details about simulation settings and results can be found in the supplementary materials.

Comparisons in case (I) between the proposed two estimators (qd-REP-DB with quadratic functional M and qr-REP-DB with quartic functional M) and existing estimators (ln-mnarIPW with logistic linear propensity score, py-mnarIPW with logistic polynomial-sieves propensity score, and marREG) under sample sizes n=500 and n=1,000. The abbreviation LL stands for linear-logistic propensity score model with linear outcome model, and the other three scenarios are analogously defined. The horizontal line marks the true value of the outcome mean. (a) LL. (b) NL. (c) LN, and (d) NN.
Figure 2.

Comparisons in case (I) between the proposed two estimators (qd-REP-DB with quadratic functional M and qr-REP-DB with quartic functional M) and existing estimators (ln-mnarIPW with logistic linear propensity score, py-mnarIPW with logistic polynomial-sieves propensity score, and marREG) under sample sizes n=500 and n=1,000. The abbreviation LL stands for linear-logistic propensity score model with linear outcome model, and the other three scenarios are analogously defined. The horizontal line marks the true value of the outcome mean. (a) LL. (b) NL. (c) LN, and (d) NN.

We calculate the 95% confidence interval based on the proposed estimator μ^repdb and the inverse probability weighted estimator assuming MNAR. Since the estimators qd-REP-DB and qr-REP-DB behave similarly and the estimator ln-mnarIPW performs better than py-mnarIPW, we thus focus only on qd-REP-DB and ln-mnarIPW. For simplicity, we denote them by REP-DB and mnarIPW, respectively. Coverage probabilities (CPs) of these two approaches are shown in Table 1. The estimator REP-DB based confidence intervals have CPs close to the nominal level of 0.95 in all scenarios even under small sample size n=500. In contrast, the estimator mnarIPW based confidence intervals have CPs well below the nominal value if the propensity score model is incorrectly specified.

Table 1.

Coverage probability of the 95% confidence interval for the estimators REP-DB and mnarIPW in case (I) under n=500 and n=1,000

nMethodsLLNLLNNN
500REP-DB0.9390.9380.9450.943
mnarIPW0.9300.6350.9280.491
1,000REP-DB0.9470.9390.9480.948
mnarIPW0.9430.3810.9510.177
nMethodsLLNLLNNN
500REP-DB0.9390.9380.9450.943
mnarIPW0.9300.6350.9280.491
1,000REP-DB0.9470.9390.9480.948
mnarIPW0.9430.3810.9510.177
Table 1.

Coverage probability of the 95% confidence interval for the estimators REP-DB and mnarIPW in case (I) under n=500 and n=1,000

nMethodsLLNLLNNN
500REP-DB0.9390.9380.9450.943
mnarIPW0.9300.6350.9280.491
1,000REP-DB0.9470.9390.9480.948
mnarIPW0.9430.3810.9510.177
nMethodsLLNLLNNN
500REP-DB0.9390.9380.9450.943
mnarIPW0.9300.6350.9280.491
1,000REP-DB0.9470.9390.9480.948
mnarIPW0.9430.3810.9510.177

For case (II), we generate data according to Model 1 in Example 1. As with case (I), we consider two different sample sizes n=500 and n=1,000. We calculate the bias (Bias), Monte Carlo standard deviation (SD) and 95% CPs based on 1,000 replications in each setting. For comparison, we also apply the estimator mnarIPW with a correct propensity score model to estimate μ. As mentioned earlier, the estimator mnarIPW is obtained through solving estimating equations, and its performance may depend on initial values during the optimization process, especially in this case where the full data distribution is not identified. We consider two different settings of initial values for optimization parameters: true values and random values from Uniform(0,1). The results are summarized in Table 2.

Table 2.

Comparisons in case (II) between the estimators REP-DB and mnarIPW under n=500 and n=1,000

REP-DBmnarIPW-truemnarIPW-uniform
nBiasSDCPBiasSDCPBiasSDCP
5000.0040.0310.9050.0030.0500.9230.1130.2060.709
1,0000.0010.0230.9320.0040.0350.9330.1340.2160.667
REP-DBmnarIPW-truemnarIPW-uniform
nBiasSDCPBiasSDCPBiasSDCP
5000.0040.0310.9050.0030.0500.9230.1130.2060.709
1,0000.0010.0230.9320.0040.0350.9330.1340.2160.667
Table 2.

Comparisons in case (II) between the estimators REP-DB and mnarIPW under n=500 and n=1,000

REP-DBmnarIPW-truemnarIPW-uniform
nBiasSDCPBiasSDCPBiasSDCP
5000.0040.0310.9050.0030.0500.9230.1130.2060.709
1,0000.0010.0230.9320.0040.0350.9330.1340.2160.667
REP-DBmnarIPW-truemnarIPW-uniform
nBiasSDCPBiasSDCPBiasSDCP
5000.0040.0310.9050.0030.0500.9230.1130.2060.709
1,0000.0010.0230.9320.0040.0350.9330.1340.2160.667

We observe from Table 2 that the proposed estimator REP-DB has negligible bias, small standard deviation and satisfactory CP even under sample size n=500. As sample size increases to n=1,000, the 95% CP is close to the nominal level. For the estimator mnarIPW, only when the initial values for optimization parameters are set to be true values, it has comparable performance with REP-DB. However, if the initial values are randomly drawn from Uniform(0,1), the estimator mnarIPW has non-negligible bias, large standard deviation and low CP. As sample size increases, the situation becomes worse. We also calculate the estimator mnarIPW when initial values are drawn from other distributions, e.g., standard normal distribution. The performance is even worse and we do not report the results here. The simulations in this case demonstrate the superiority of the proposed estimator over existing estimators which require identifiability of the full data distribution.

4.2 Empirical example

We apply the proposed approach to the China Family Panel Studies, which was previously analysed in Miao et al. (2015). The data set includes 3,126 households in China. The outcome Y is the log of current home price (in 104 RMB yuan), and it has missing values due to the non-response of house owner and the non-availability from the real estate market. The missingness process of home price is likely to be not at random, because subjects having expensive houses may be less likely to disclose their home prices. The missing data rate of current home price is 21.8%. The completely observed covariates X includes 5 continuous variables: travel time to the nearest business center, house building area, family size, house story height, log of family income, and 3 discrete variables: province, urban (1 for urban househould, 0 rural), refurbish status. The shadow variable Z is chosen as the construction price of a house, which is also completely observed. The construction price is related to the current price of a house, and the shadow variable assumption that the non-response is independent of the construction price conditional on the current price and fully observed covariates is reasonable as the construction price can be viewed as an error-prone proxy for the current home value. For the representer assumption, we have discussed in Section 2 that it can be justified in principle without extra model assumptions on the missing data distribution, as the assumption involves only observed data. In particular, there are some simple examples where this assumption is guaranteed to hold. For instance, if E(ZR=1,X,Y) is linear in Y, then this assumption holds immediately according to the discussions below the representer assumption. Motivated by this fact, we fit a linear regression of the shadow variable Z on the covariates X and outcome Y using observed data. We test for significance by performing a t-test for the coefficient of Y, and the corresponding p-value is approximately zero, which indicates that there is a statistically significant relationship between Y and Z. Residual plots in Figure S2 of online supplementary materials show that the residuals are approximately randomly scattered, which one may interpret as evidence supporting the linearity assumption in this example. To further detect the linear structure in this multidimensional regression example, we also make a component-plus-residual plot given in Figure S3 of online supplementary materials. The scatter and the slope of the line in these plots appear approximately linear, as no obvious non-linear patterns emerge. All these empirical results provide no compelling evidence of violation of a linear model for E(ZR=1,X,Y), and hence the representer assumption seems reasonable in the real data example. Further discussions on the required assumptions in this example are given in supplementary materials.

We apply the proposed estimator REP-DB to estimate the outcome mean and the 95% confidence interval. We also use the competing estimator mnarIPW and two estimators assuming MAR (marREG and marIPW) for comparison. The results are shown in Table 3. We observe that the results from the proposed estimator are similar to those from the estimator mnarIPW, both yielding slightly lower estimates of home price on the log scale than those obtained from the standard MAR estimators. However, when the data are transformed back to the original scale, the deviations are notable and amount to approximately 1.13×104 RMB yuan. These analysis results are generally consistent with those in Miao et al. (2015).

Table 3.

Point estimates and 95% confidence intervals of the outcome mean for the home pricing example

MethodsEstimate95% confidence interval
REP-DB2.591(2.520, 2.661)
mnarIPW2.611(2.544, 2.678)
marREG2.714(2.661, 2.766)
marIPW2.715(2.659, 2.772)
MethodsEstimate95% confidence interval
REP-DB2.591(2.520, 2.661)
mnarIPW2.611(2.544, 2.678)
marREG2.714(2.661, 2.766)
marIPW2.715(2.659, 2.772)
Table 3.

Point estimates and 95% confidence intervals of the outcome mean for the home pricing example

MethodsEstimate95% confidence interval
REP-DB2.591(2.520, 2.661)
mnarIPW2.611(2.544, 2.678)
marREG2.714(2.661, 2.766)
marIPW2.715(2.659, 2.772)
MethodsEstimate95% confidence interval
REP-DB2.591(2.520, 2.661)
mnarIPW2.611(2.544, 2.678)
marREG2.714(2.661, 2.766)
marIPW2.715(2.659, 2.772)

5 Discussion

With the aid of a shadow variable, we have established a necessary and sufficient condition for non-parametric identification of mean functionals of non-ignorable missing data even if the joint distribution is not identified. Then we naturally strengthen this condition by imposing a representer assumption that is shown to be necessary for n-estimability of the mean functional. The assumption involves the existence of solutions to a representer equation, which is a Fredholm integral equation of the first kind and can be satisfied under mild requirements. Based on the representer equation, we propose a sieve-based estimator of the mean functional, which bypasses the difficulties of correctly specifying and estimating the unknown missingness mechanism and the outcome regression. Although the joint distribution is not identifiable, the proposed estimator is shown to be consistent for the mean functional. In addition, we establish conditions under which the proposed estimator is asymptotically normal. Since the solution to the representer equation is not uniquely determined, one cannot simply apply standard theories for non-parametric sieve estimators to derive the above asymptotic results. In fact, we need to firstly construct a consistent estimator of the solution set, and then find from the estimated set a consistent estimator of an appropriately chosen solution. We finally show that the proposed estimator attains the semi-parametric efficiency bound for the shadow variable model at a key submodel where the representer is uniquely identified.

The availability of a valid shadow variable is crucial for the proposed approach to adjust for non-ignorable missing data. An interesting feature of this assumption is that it has some observable implications and it can in principle be falsified with observed data, contrary to the usual ignorability assumption (D’Haultfœuille, 2010; Molenberghs et al., 2008). In particular, under the shadow variable assumption, the equation in π(): E{R/π(X,Y)X,Z}=1 has a solution that is a positive probability, i.e., π()(0,1], and hence, this assumption can be rejected if such a positive solution does not exist. Although refutable, the validity of the shadow variable assumption generally requires domain-specific knowledge of experts and needs to be investigated on a case-by-case basis. The existence of a shadow variable is practically reasonable in the empirical example presented in this paper and similar situations where one or more proxies or surrogates of a variable prone to missing data may be available. In fact, it is not uncommon in survey studies and/or cohort studies in the health and social sciences, that certain outcomes may be sensitive and/or expensive to measure accurately, so that a gold standard measurement is obtained only for a select subset of the sample, while one or more proxies or surrogate measures may be available for the remaining sample. Instead of a standard measurement error model often used in such settings which requires stringent identifying conditions, the more flexible shadow variable approach proposed in this paper provides a more robust alternative to incorporate surrogate measurement in a non-parametric framework, under minimal identification conditions. Besides, as advocated by Robins et al. (2000), in principle, one can also conduct sensitivity analysis to assess how results would change if the shadow variable assumption were violated by some pre-specified amount. For example, suppose that the parameter of interest μ is the population outcome mean. If the shadow variable assumption does not hold, then E(ZR,X,Y) should depend on R, e.g., E(ZR,X,Y)=α0+ρR+α1X+α2Y. Under this model, the representer assumption holds with δ0(X,Z)=(Zα0ρα1X)/α2. Following the proof of Corollary 1, one can verify that μ=E{RY+(1R)δ0(X,Z)}+ρ/α2*E(1R). To conduct sensitivity analysis, we first use the proposed approach to estimate δ0(X,Z) and α2 based on observed data. Then for any given value of the sensitivity parameter ρ (ρ=0 when the shadow variable assumption holds), one can estimate μ using sample analogues of the above identification formula.

This paper may be improved or extended in several directions. Firstly, the proposed identification and estimation framework may be extended to handle non-ignorable missing outcome regression or missing covariate problems. Secondly, one can use modern machine learning techniques to solve the representer equation so that an improved estimator may be achieved that adapts to sparsity structures in the data. Thirdly, it is of great interest to extend our results to handling other problems of coarsened data, for instance, unmeasured confounding problems in causal inference. Proximal causal inference has been recently developed to adjust for unmeasured confounders by using two suitable negative control variables (Cui et al., 2020; Miao et al., 2018), which is different from the current non-ignorable missing setting that requires only a valid shadow variable. Since the identification of causal effects in negative control literature relies on certain confounding bridge functions that are similar to the representer equation, the proposed approach is a promising tool for relaxing the requirement of unique solutions in proximal causal inference. We plan to pursue these and other related issues in future research.

Acknowledgments

We would like to thank the editor, associate editor, and two anonymous reviewers for their very insightful and helpful comments, which led to a significant improvement of our paper.

Funding

W.L.'s and W.M.'s research is supported by the National Natural Science Foundation of China (12101607,12071015), the National Key R&D Program of China (2022YFA1008100), Beijing Natural Science Foundation (1232008), and the National Statistical Science Research Project (2022LZ13).

Data availability

The R code for the implementation of this paper and the real data set are available at https://gitlab.com/weylpku/nonignorablemissingcode.

Supplementary material

Supplementary material is available online at Journal of the Royal Statistical Society: Series B.

References

Ai
C.
, &
Chen
X.
(
2003
).
Efficient estimation of models with conditional moment restrictions containing unknown functions
.
Econometrica
,
71
(
6
),
1795
1843
. https://doi.org/10.1111/1468-0262.00470

Canay
I. A.
,
Santos
A.
, &
Shaikh
A. M.
(
2013
).
On the testability of identification in some nonparametric models with endogeneity
.
Econometrica
,
81
(
6
),
2535
2559
. https://doi.org/10.3982/ECTA10851

Carrasco
M.
,
Florens
J.-P.
, &
Renault
E.
(
2007
).
Linear inverse problems in structural econometrics estimation based on spectral decomposition and regularization
.
Handbook of Econometrics
,
6(Part B)
,
5633
5751
. https://doi.org/10.1016/S1573-4412(07)06077-1

Chen
X.
(
2007
).
Large sample sieve estimation of semi-nonparametric models
.
Handbook of Econometrics
,
6(Part B)
,
5549
5632
. https://doi.org/10.1016/S1573-4412(07)06076-X

Chen
X.
, &
Pouzo
D.
(
2012
).
Estimation of nonparametric conditional moment models with possibly nonsmooth generalized residuals
.
Econometrica
,
80
(
1
),
277
321
. https://doi.org/10.3982/ECTA7888

Chernozhukov
V.
,
Hong
H.
, &
Tamer
E.
(
2007
).
Estimation and confidence regions for parameter sets in econometric models
.
Econometrica
,
75
(
5
),
1243
1284
. https://doi.org/10.1111/j.1468-0262.2007.00794.x

Cui
Y.
,
Pu
H.
,
Shi
X.
,
Miao
W.
, &
Tchetgen Tchetgen
E. J.
(
2020
).
‘Semiparametric proximal causal inference’, arXiv, arXiv:2011.08411, preprint: not peer reviewed
.

Darolles
S.
,
Fan
Y.
,
Florens
J.-P.
, &
Renault
E.
(
2011
).
Nonparametric instrumental regression
.
Econometrica
,
79
(
5
),
1541
1565
. https://doi.org/10.3982/ECTA6539

Das
M.
,
Newey
W. K.
, &
Vella
F.
(
2003
).
Nonparametric estimation of sample selection models
.
The Review of Economic Studies
,
70
(
1
),
33
58
. https://doi.org/10.1111/1467-937X.00236

D’Haultfoeuille
X.
(
2011
).
On the completeness condition in nonparametric instrumental problems
.
Econometric Theory
,
27
(
3
),
460
471
. https://doi.org/10.1017/S0266466610000368

D’Haultfœuille
X.
(
2010
).
A new instrumental method for dealing with endogenous selection
.
Journal of Econometrics
,
154
(
1
),
1
15
. https://doi.org/10.1016/j.jeconom.2009.06.005

Greenlees
J. S.
,
Reece
W. S.
, &
Zieschang
K. D.
(
1982
).
Imputation of missing values when the probability of response depends on the variable being imputed
.
Journal of the American Statistical Association
,
77
(
378
),
251
261
. https://doi.org/10.1080/01621459.1982.10477793

Heckman
J. J.
(
1979
).
Sample selection bias as a specification error
.
Econometrica
,
47
(
1
),
153
161
. https://doi.org/10.2307/1912352

Horowitz
J. L.
(
2009
).
Semiparametric and nonparametric methods in econometrics
(Vol.
12
).
Springer
.

Huang
J. Z.
(
2003
).
Local asymptotics for polynomial spline regression
.
Annals of Statistics
,
31
(
5
),
1600
1635
. https://doi.org/10.1214/aos/1065705120

Ibrahim
J. G.
,
Lipsitz
S. R.
, &
Chen
M. -H.
(
1999
).
Missing covariates in generalized linear models when the missing data mechanism is non-ignorable
.
Journal of the Royal Statistical Society: Series B (Statistical Methodology)
,
61
(
1
),
173
190
. https://doi.org/10.1111/1467-9868.00170

Ibrahim
J. G.
,
Lipsitz
S. R.
, &
Horton
N.
(
2001
).
Using auxiliary data for parameter estimation with non-ignorably missing outcomes
.
Journal of the Royal Statistical Society: Series C (Applied Statistics)
,
50
(
3
),
361
373
. https://doi.org/10.1111/1467-9876.00240

Kennedy
E. H.
,
Ma
Z.
,
McHugh
M. D.
, &
Small
D. S.
(
2017
).
Non-parametric methods for doubly robust estimation of continuous treatment effects
.
Journal of the Royal Statistical Society: Series B (Statistical Methodology)
,
79
(
4
),
1229
1245
. https://doi.org/10.1111/rssb.12212

Kim
J. K.
, &
Yu
C. L.
(
2011
).
A semiparametric estimation of mean functionals with nonignorable missing data
.
Journal of the American Statistical Association
,
106
(
493
),
157
165
. https://doi.org/10.1198/jasa.2011.tm10104

Kosorok
M. R.
(
2008
).
Introduction to empirical processes and semiparametric inference
.
Springer
.

Kott
P. S.
(
2014
).
Calibration weighting when model and calibration variables can differ. In F. Mecatti, L. P. Conti, & G. M. Ranalli (Eds.), Contributions to sampling statistics (pp. 1–18). Springer
.

Li
Q.
, &
Racine
J. S.
(
2007
).
Nonparametric econometrics: Theory and practice
.
Princeton University Press
.

Little
R. J.
, &
Rubin
D. B.
(
2002
).
Statistical analysis with missing data
.
Wiley-Interscience
.

Liu
L.
,
Miao
W.
,
Sun
B.
,
Robins
J.
, &
Tchetgen Tchetgen
E. J.
(
2020
).
Identification and inference for marginal average treatment effect on the treated with an instrumental variable
.
Statistica Sinica
,
30(3)
,
1517
1541
. https://doi.org/10.5705/ss.202017.0196

Miao
W.
,
Ding
P.
, &
Geng
Z.
(
2016
).
Identifiability of normal and normal mixture models with nonignorable missing data
.
Journal of the American Statistical Association
,
111
(
516
),
1673
1683
. https://doi.org/10.1080/01621459.2015.1105808

Miao
W.
,
Geng
Z.
, &
Tchetgen Tchetgen
E. J.
(
2018
).
Identifying causal effects with proxy variables of an unmeasured confounder
.
Biometrika
,
105
(
4
),
987
993
. https://doi.org/10.1093/biomet/asy038

Miao
W.
,
Liu
L.
,
Tchetgen Tchetgen
E. J.
, &
Geng
Z.
(
2015
).
‘Identification, doubly robust estimation, and semiparametric efficiency theory of nonignorable missing data with a shadow variable’, arXiv, arXiv:1509.02556, preprint: not peer reviewed
.

Miao
W.
, &
Tchetgen Tchetgen
E. J.
(
2016
).
On varieties of doubly robust estimators under missingness not at random with a shadow variable
.
Biometrika
,
103
(
2
),
475
482
. https://doi.org/10.1093/biomet/asw016

Molenberghs
G.
,
Beunckens
C.
,
Sotto
C.
, &
Kenward
M. G.
(
2008
).
Every missingness not at random model has a missingness at random counterpart with equal fit
.
Journal of the Royal Statistical Society: Series B (Statistical Methodology)
,
70
(
2
),
371
388
. https://doi.org/10.1111/j.1467-9868.2007.00640.x

Morikawa
K.
, &
Kim
J. K.
(
2021
).
Semiparametric optimal estimation with nonignorable nonresponse data
.
The Annals of Statistics
,
49
(
5
),
2991
3014
. https://doi.org/10.1214/21-AOS2070

Newey
W. K.
(
1997
).
Convergence rates and asymptotic normality for series estimators
.
Journal of Econometrics
,
79
(
1
),
147
168
. https://doi.org/10.1016/S0304-4076(97)00011-0

Newey
W. K.
, &
Powell
J. L.
(
2003
).
Instrumental variable estimation of nonparametric models
.
Econometrica
,
71
(
5
),
1565
1578
. https://doi.org/10.1111/1468-0262.00459

Qin
J.
,
Leung
D.
, &
Shao
J.
(
2002
).
Estimation with survey data under nonignorable nonresponse or informative sampling
.
Journal of the American Statistical Association
,
97
(
457
),
193
200
. https://doi.org/10.1198/016214502753479338

Robins
J. M.
,
Rotnitzky
A.
, &
Scharfstein
D. O.
(
2000
).
Sensitivity analysis for selection bias and unmeasured confounding in missing data and causal inference models. In M. E. Halloran & D. Berry (Eds.), Statistical models in epidemiology, the environment, and clinical trials (pp. 1–94). Springer
.

Robins
J. M.
,
Rotnitzky
A.
, &
Zhao
L. P.
(
1994
).
Estimation of regression coefficients when some regressors are not always observed
.
Journal of the American Statistical Association
,
89
(
427
),
846
866
. https://doi.org/10.1080/01621459.1994.10476818

Rotnitzky
A.
, &
Robins
J.
(
1997
).
Analysis of semi-parametric regression models with non-ignorable non-response
.
Statistics in Medicine
,
16
(1),
81
102
. https://doi.org/10.1002/(SICI)1097-0258(19970115)16:1<81::AID-SIM473>3.0.CO;2-0

Rotnitzky
A.
,
Robins
J. M.
, &
Scharfstein
D. O.
(
1998
).
Semiparametric regression for repeated outcomes with nonignorable nonresponse
.
Journal of the American Statistical Association
,
93
(
444
),
1321
1339
. https://doi.org/10.1080/01621459.1998.10473795

Rubin
D. B.
(
1976
).
Inference and missing data (with discussion)
.
Biometrika
,
63
(
3
),
581
592
. https://doi.org/10.1093/biomet/63.3.581

Santos
A.
(
2011
).
Instrumental variable methods for recovering continuous linear functionals
.
Journal of Econometrics
,
161
(
2
),
129
146
. https://doi.org/10.1016/j.jeconom.2010.11.014

Scharfstein
D. O.
,
Rotnitzky
A.
, &
Robins
J. M.
(
1999
).
Adjusting for nonignorable drop-out using semiparametric nonresponse models
.
Journal of the American Statistical Association
,
94
(
448
),
1096
1120
. https://doi.org/10.1080/01621459.1999.10473862

Severini
T. A.
, &
Tripathi
G.
(
2012
).
Efficiency bounds for estimating linear functionals of nonparametric regression models with endogenous regressors
.
Journal of Econometrics
,
170
(
2
),
491
498
. https://doi.org/10.1016/j.jeconom.2012.05.018

Shao
J.
, &
Wang
L.
(
2016
).
Semiparametric inverse propensity weighting for nonignorable missing data
.
Biometrika
,
103
(
1
),
175
187
. https://doi.org/10.1093/biomet/asv071

Sun
B.
,
Liu
L.
,
Miao
W.
,
Wirth
K.
,
Robins
J.
, &
Tchetgen Tchetgen
E. J.
(
2018
).
Semiparametric estimation with data missing not at random using an instrumental variable
.
Statistica Sinica
,
28(4)
,
1965
1983
. https://doi.org/10.5705/ss.202016.0324

Tang
G.
,
Little
R. J.
, &
Raghunathan
T. E.
(
2003
).
Analysis of multivariate missing data with nonignorable nonresponse
.
Biometrika
,
90
(
4
),
747
764
. https://doi.org/10.1093/biomet/90.4.747

Tang
N.
,
Zhao
P.
, &
Zhu
H.
(
2014
).
Empirical likelihood for estimating equations with nonignorably missing data
.
Statistica Sinica
,
24(2)
,
723
-747. https://doi.org/10.5705/ss.2012.254

Tchetgen Tchetgen
E. J.
, &
Wirth
K. E.
(
2017
).
A general instrumental variable framework for regression analysis with outcome missing not at random
.
Biometrics
,
73
(
4
),
1123
1131
. https://doi.org/10.1111/biom.12670

Vansteelandt
S.
,
Rotnitzky
A.
, &
Robins
J.
(
2007
).
Estimation of regression models for the mean of repeated outcomes under nonignorable nonmonotone nonresponse
.
Biometrika
,
94
(
4
),
841
860
. https://doi.org/10.1093/biomet/asm070

Wang
S.
,
Shao
J.
, &
Kim
J. K.
(
2014
).
An instrumental variable approach for identification and estimation with nonignorable nonresponse
.
Statistica Sinica
,
24
(3),
1097
1116
. https://doi.org/10.5705/ss.2012.074

Zahner
G. E.
,
Pawelkiewicz
W.
,
DeFrancesco
J. J.
, &
Adnopoz
J.
(
1992
).
Children’s mental health service needs and utilization patterns in an urban community: An epidemiological assessment
.
Journal of the American Academy of Child & Adolescent Psychiatry
,
31
(
5
),
951
960
. https://doi.org/10.1097/00004583-199209000-00025

Zhao
J.
, &
Shao
J.
(
2015
).
Semiparametric pseudo-likelihoods in generalized linear models with nonignorable missing data
.
Journal of the American Statistical Association
,
110
(
512
),
1577
1590
. https://doi.org/10.1080/01621459.2014.983234

Author notes

Conflicts of interest: None declared.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://dbpia.nl.go.kr/pages/standard-publication-reuse-rights)

Supplementary data