Causal Isotonic Regression

Westling, Ted; Gilbert, Peter; Carone, Marco

doi:10.1111/rssb.12372

Summary

In observational studies, potential confounders may distort the causal relationship between an exposure and an outcome. However, under some conditions, a causal dose–response curve can be recovered by using the G-computation formula. Most classical methods for estimating such curves when the exposure is continuous rely on restrictive parametric assumptions, which carry significant risk of model misspecification. Non-parametric estimation in this context is challenging because in a non-parametric model these curves cannot be estimated at regular rates. Many available non-parametric estimators are sensitive to the selection of certain tuning parameters, and performing valid inference with such estimators can be difficult. We propose a non-parametric estimator of a causal dose–response curve known to be monotone. We show that our proposed estimation procedure generalizes the classical least squares isotonic regression estimator of a monotone regression function. Specifically, it does not involve tuning parameters and is invariant to strictly monotone transformations of the exposure variable. We describe theoretical properties of our proposed estimator, including its irregular limit distribution and the potential for doubly robust inference. Furthermore, we illustrate its performance via numerical studies and use it to assess the relationship between body mass index and immune response in human immunodeficiency virus vaccine trials.

Backdoor formula, Cross-fitting, Dose–response, Grenander-type estimator, Monotonicity, Shape-constrained estimation

1 Introduction

1.1 Motivation and literature review

Questions regarding the causal effect of an exposure on an outcome are ubiquitous in science. If investigators can carry out an experimental study in which they randomly assign a level of exposure to each participant and then measure the outcome of interest, estimating a causal effect is generally straightforward. However, such studies are often not feasible, and data from observational studies must be relied on instead. Assessing causality is then more difficult, in large part because of potential confounding of the relationship between exposure and outcome. Many non-parametric methods have been proposed for drawing inference about a causal effect by using observational data when the exposure of interest is either binary or categorical—these include, among others, inverse-probability-weighted estimators (Horvitz and Thompson, 1952), augmented inverse-probability-weighted (AIPW) estimators (Scharfstein et al., 1999; Bang and Robins, 2005) and targeted minimum-loss-based estimators (van der Laan and Rose, 2011).

In practice, many exposures are continuous, in the sense that they may take any value in an interval. A common approach to dealing with such exposures is simply to discretize the interval into two or more regions, thus returning to the categorical exposure setting. However, it is frequently of scientific interest to learn the causal dose–response curve, which describes the causal relationship between the exposure and outcome across a continuum of the exposure. Much less attention has been paid to continuous exposures. Robins (2000) and Zhang et al. (2016) studied this problem by using parametric models, and Neugebauer and van der Laan (2007) considered inference on parameters obtained by projecting a causal dose–response curve onto a parametric working model. Other researchers have taken a non-parametric approach instead. Rubin and van der Laan (2006) and Díaz and van der Laan (2011) discussed non-parametric estimation using flexible data-adaptive algorithms. Kennedy et al. (2017) proposed an estimator based on local linear smoothing. Finally, van der Laan et al. (2018) have recently presented a general framework for inference on parameters that fail to be sufficiently smooth as a function of the data-generating distribution and for which regular root-n-estimation theory is therefore not available. This is indeed the case for the causal dose–response curve, and van der Laan et al. (2018) discussed inference on such a parameter as a particular example.

Despite a growing body of literature on non-parametric estimation of causal dose–response curves, to the best of our knowledge, existing methods do not permit valid large sample inference and may be sensitive to the selection of certain tuning parameters. For instance, smoothing-based methods are often sensitive to the choice of a kernel function and bandwidth, and these estimators typically have non-negligible asymptotic bias, which complicates the task of performing valid inference.

In many settings, it may be known that the causal dose–response curve is monotone in the exposure. For instance, exposures such as daily exercise performed, cigarettes smoked per week and air pollutant levels are all known to have monotone relationships with various health outcomes. In such cases, an extensive literature suggests that monotonicity may be leveraged to derive estimators with desirable properties—Groeneboom and Jongbloed (2014) provides a comprehensive overview. For example, in the absence of confounding, isotonic regression may be employed to estimate the causal dose–response curve (Barlow et al., 1972). The isotonic regression estimator does not require selection of a kernel function or bandwidth, is invariant to strictly increasing transformations of the exposure and on centring and scaling by n^−1/3, converges in law pointwise to a symmetric limit distribution with mean 0 (Brunk, 1970). The convergence property is useful since it facilitates asymptotically valid pointwise inference.

Non-parametric inference on a monotone dose–response curve when the exposure–outcome relationship is confounded is more difficult to tackle and is the focus of this paper. To the best of our knowledge, this problem has not been comprehensively studied before.

1.2 Parameter of interest and its causal interpretation

The prototypical data unit that we consider is O = (Y, A, W), where Y is a response, A a continuous exposure and W a vector of covariates. The support of the true data-generating distribution $P_{0}$ is denoted by $O = Y \times A \times W$ ⁠, where $Y \subseteq R$ ⁠, $A \subseteq R$ is an interval and $W \subseteq R^{p}$ ⁠. Throughout, the use of subscript 0 refers to evaluation at or under $P_{0}$ ⁠. For example, we write $θ_{0}$ and F₀ to denote $θ_{P_{0}}$ and $F_{P_{0}}$ respectively, and E₀ to denote expectation under $P_{0}$ ⁠.

Our parameter of interest is the so-called G-computed regression function from

A

to

R

⁠, defined as

a \mapsto θ_{0} (a) : = E_{0} {E_{0} (Y | A = a, W)},

where the outer expectation is with respect to the marginal distribution Q₀ of W. In some scientific contexts,

θ_{0} (a)

may have a causal interpretation. Adopting the Neyman–Rubin potential outcomes framework, for each

a \in A

⁠, we denote by Y(a) a unit’s potential outcome under exposure level A = a. The causal parameter m₀(a): = E₀{Y(a)} corresponds to the average outcome under assignment of the entire population to exposure level A = a. The resulting curve

m_{0} : A \to R

is what we formally define as the causal dose–response curve. Under varying sets of causal conditions, m₀(a) may be identified with functionals of the observed data distribution, such as the unadjusted regression function

r_{0} (a) : = E_{0} (Y | A = a)

or the G-computed regression function

θ_{0} (a)

⁠.

Suppose that

(a)
each unit’s potential outcomes are independent of all other units’ exposures, and
(b)
the observed outcome Y equals the potential outcome Y(A) corresponding to the exposure level A that is actually received.

Identification of m₀(a) further depends on the relationship between A and Y(a). If assumptions (a) and (b) hold and, in addition,

(a)
A and Y(a) are independent, and
(b)
the marginal density of A is positive at a,

then m₀(a) = r₀(a). Condition (c) typically holds in experimental studies only (e.g. randomized trials). In observational studies there are often common causes of A and Y(a)—so-called confounders of the exposure–outcome relationship—that induce dependence. In such cases, m₀(a) and r₀(a) do not generally coincide. However, if W contains a sufficiently rich collection of confounders, it may still be possible to identify m₀(a) from the observed data. If assumptions (a) and (b) hold and, in addition,

(a)
A and Y(a) are conditionally independent given W, and
(b)
the conditional density of A given W is almost surely positive at A = a,

then $m_{0} (a) = θ_{0} (a)$ ⁠. This is a fundamental result in causal inference (Robins, 1986; Gill and Robins, 2001). Whenever $m_{0} (a) = θ_{0} (a)$ ⁠, our methods can be interpreted as drawing inference on the causal dose–response parameter m₀(a).

We note that the definition of the counterfactual outcome Y(a) presupposes that the intervention setting A = a is uniquely defined. In many situations, this stipulation requires careful thought. For example, in Section 6 we consider an application in which body mass index BMI is the exposure of interest. There is an on-going scientific debate about whether such an exposure leads to a meaningful causal interpretation, since it is not clear what it means to intervene on BMI.

Even if the identifiability conditions that were stipulated above do not strictly hold or the scientific question is not causal in nature, when W is associated with both A and Y, $θ_{0} (a)$ often has a more appealing interpretation than the unadjusted regression function r₀(a). Specifically, $θ_{0} (a)$ may be interpreted as the average value of Y in a population with exposure fixed at A = a but otherwise characteristic of the study population with respect to W. Because $θ_{0} (a)$ involves both adjustment for W and marginalization with respect to a single reference population that does not depend on the value a, the comparison of $θ_{0} (a)$ over different values of a is generally more meaningful than for r₀(a).

When $P_{0} (A = a) = 0$ ⁠, the parameter P↦θ_P(a) is not pathwise differentiable at $P_{0}$ with respect to the non-parametric model (Díaz and van der Laan, 2011). Heuristically, because of the continuous nature of A, θ_P(a) corresponds to a local feature of P. As a result, regular root-n-rate estimators cannot be expected, and standard methods for constructing efficient estimators of pathwise differentiable parameters in non-parametric and semiparametric models (e.g. estimating equations, one-step estimation or targeted minimum loss-based estimation) cannot be used directly to target and obtain inference on $θ_{0} (a)$ ⁠.

1.3 Contribution and organization of the paper

We denote by $F_{P} : A \to R$ the distribution function of A under P, by $F_{θ}$ the class of non-decreasing real-valued functions on $A$ and by $F_{F}$ the class of strictly increasing and continuous distribution functions supported on $A$ ⁠. The statistical model that we shall work in is $M : = {P : θ_{P} \in F_{θ}, F_{P} \in F_{F}}$ ⁠, which consists of the collection of distributions for which θ_P is non-decreasing over $A$ and the marginal distribution of A is continuous with positive Lebesgue density over $A$ ⁠.

In this paper, we study non-parametric estimation and inference on the G-computed regression function $a \mapsto θ_{0} (a) = E_{0} {E_{0} (Y | A = a, W)}$ for use when A is a continuous exposure and $θ_{0}$ is known to be monotone. Specifically, our goal is to make inference about $θ_{0} (a)$ for $a \in A$ by using independent observations O₁, O₂, …, O_n drawn from $P_{0} \in M$ ⁠. This problem is an extension of classical isotonic regression to the setting in which the exposure–outcome relationship is confounded by recorded covariates—this is why we refer to the method proposed as causal isotonic regression. As mentioned above, to the best of our knowledge, non-parametric estimation and inference on a monotone G-computed regression function have not been comprehensively studied before. In what follows, we

(a)
show that our proposed estimator generalizes the unadjusted isotonic regression estimator to the more realistic scenario in which there is confounding by recorded covariates,
(b)
investigate finite sample and asymptotic properties of the estimator proposed, including invariance to strictly increasing transformations of the exposure, doubly robust consistency and doubly robust convergence in distribution to a non-degenerate limit,
(c)
derive practical methods for constructing pointwise confidence intervals, including intervals that have valid doubly robust calibration and
(d)
illustrate numerically the practical performance of the estimator.

We note that, in Westling and Carone (2019), we studied estimation of $θ_{0}$ as one of several examples of a general approach to monotonicity-constrained inference. Here, we provide a comprehensive examination of estimation of a monotone dose–response curve. In particular, we establish novel theory and methods that have important practical implications. First, we provide conditions under which the estimator converges in distribution even when one of the nuisance estimators that are involved in the problem is inconsistent. This contrasts with the results in Westling and Carone (2019), which required that both nuisance parameters be estimated consistently. We also propose two estimators of the scale parameter arising in the limit distribution, one of which requires both nuisance estimators to be consistent, and the other of which does not. Second, we demonstrate that our estimator is invariant to strictly monotone transformations of the exposure. Third, we study the joint convergence of our proposed estimator at two points and use this result to construct confidence intervals for causal effects. Fourth, we study the behaviour of our estimator in the context of discrete exposures. Fifth, we propose an alternative estimator based on cross-fitting the nuisance estimators and demonstrate that this strategy removes the need for empirical process conditions that were required in Westling and Carone (2019). Finally, we investigate the behaviour of our estimator in comprehensive numerical studies and compare its behaviour with that of the local linear estimator of Kennedy et al. (2017).

The remainder of the paper is organized as follows. In Section 2, we concretely define the estimator proposed. In Section 3, we study theoretical properties of the estimator. In Section 4, we propose methods for pointwise inference. In Section 5, we perform numerical studies to assess the performance of the estimator and, in Section 6, we use this procedure to investigate the relationship between BMI and immune response to human immunodeficiency virus (HIV) vaccines by using data from several randomized trials. Finally, we provide concluding remarks in Section 7. Proofs of all theorems are provided in on-line supplementary material.

Some example data and the programs that were used to analyse them can be obtained from

https://dbpia.nl.go.kr/jrsssb/issue/.

2 Proposed approach

2.1 Review of isotonic regression

Since the proposed estimator of

θ_{0} (a)

builds on isotonic regression, we briefly review the classical least squares isotonic regression estimator of r₀(a). The isotonic regression r_n of Y₁, Y₂, …, Y_n on A₁, A₂, …, A_n is the minimizer in r of

Σ_{i = 1}^{n} {Y_{i} - r (A_{i})}^{2}

over all monotone non-decreasing functions. This minimizer can be obtained via the pool adjacent violators algorithm (Ayer et al., 1955; Barlow et al., 1972) and can also be represented in terms of greatest convex minorants (GCMs). The GCM of a bounded function f on an interval [a, b] is defined as the supremum over all convex functions g such that g ⩽ f. Letting F_n be the empirical distribution function of A₁, A₂, …, A_n, r_n(a) can be shown to equal the left derivative, evaluated at F_n(a), of the GCM over the interval [0, 1] of the linear interpolation of the so-called cumulative sum diagram

\{\frac{1}{n} (i, \sum_{j = 0}^{i} Y_{(i)}^{*}) : i = 0, 1, \dots, n\},

where

Y_{(0)}^{*} : = 0

and

Y_{(i)}^{*}

is the value of Y corresponding to the observation with ith smallest value of A.

The isotonic regression estimator r_n has many attractive properties. First, unlike smoothing-based estimators, isotonic regression does not require the choice of a kernel function, bandwidth or any other tuning parameter. Second, it is invariant to strictly increasing transformations of A. Specifically, if $H : A \to R$ is a strictly increasing function, and $r_{n}^{*}$ is the isotonic regression of Y₁, Y₂, …, Y_n on H(A₁), H(A₂), …, H(A_n), it follows that $r_{n}^{*} = r_{n} \circ H^{- 1}$ ⁠. Third, r_n is uniformly consistent on any strict subinterval of $A$ ⁠. Fourth, n^1/3{r_n(a) − r₀(a)} converges in distribution to ${4 r_{0}^{'} (a) σ_{0}^{2} (a) / f_{0} (a)}^{1 / 3} W$ for any interior point a of $A$ at which $r_{0}^{'} (a)$ ⁠, $f_{0} (a) : = F_{0}^{'} (a)$ and $σ_{0}^{2} (a) : = E_{0} [{Y - r_{0} (a)}^{2} | A = a]$ exist and are positive and continuous in a neighbourhood of a. Here, $W : = arg {max}_{u \in R} {Z_{0} (u) - u^{2}}$ ⁠, where Z₀ denotes a two-sided Brownian motion originating from zero, and is said to follow Chernoff’s distribution. Chernoff’s distribution has been extensively studied: among other properties, it is a log-concave and symmetric law centred at zero, has moments of all orders and can be approximated by an N(0, 0.52) distribution (Chernoff, 1964; Groeneboom and Wellner, 2001). It appears often in the limit distribution of monotonicity-constrained estimators.

2.2 Definition of proposed estimator

For any given

P \in M

⁠, we define the outcome regression pointwise as

μ_{P} (a, w) : = E_{P} (Y | A = a, W = w)

⁠, and the normalized exposure density as

g_{P} (a, w) : = π_{P} (a | w) / f_{P} (a)

⁠, where

π_{P} (a | w)

is the evaluation at a of the conditional density function of A given W = w and f_P is the marginal density function of A under P. Additionally, we define the pseudo-outcome ξ_{μ, g, Q}(y, a, w) as

ξ_{μ, g, Q} (y, a, w) : = \frac{y - μ (a, w)}{g (a, w)} + \int μ (a, z) Q (d z) .

As noted by Kennedy et al. (2017),

E_{0} {ξ_{μ, g, Q_{0}} (Y, A, W) | A = a} = θ_{0} (a)

if eitherμ = μ₀org = g₀. They used this fact to motivate an estimator θ_{n, h}(a) of

θ_{0} (a)

⁠, defined as the local linear regression with bandwidth h > 0 of the pseudo-outcomes

ξ_{μ_{n}, g_{n}, Q_{n}} (Y_{1}, A_{1}, W_{1}), ξ_{μ_{n}, g_{n}, Q_{n}} (Y_{2}, A_{2}, W_{2}), \dots, ξ_{μ_{n}, g_{n}, Q_{n}} (Y_{n}, A_{n}, W_{n})

on A₁, A₂, …, A_n, where

μ_{n}

is an estimator of μ₀, g_n is an estimator of g₀ and Q_n is the empirical distribution function based on W₁, W₂, …, W_n. The study of this non-parametric regression problem is not standard because these pseudo-outcomes are dependent when the nuisance function estimators

μ_{n}

and g_n are estimated from the data. Nevertheless, Kennedy et al. (2017) showed that their estimator is consistent if either

μ_{n}

or g_n is consistent. Additionally, under regularity conditions, they showed that, if both nuisance estimators converge sufficiently fast and the bandwidth

h_{n}^{*}

tends to 0 at rate n^−1/5, then

n^{2 / 5} {θ_{n, h_{n}^{*}} (a) - θ_{0} (a)} \overset{d}{\to} N {b_{0} (a), v_{0} (a)},

where b₀(a) is an asymptotic bias depending on the second derivative of

θ_{0}

⁠, and v₀(a) is an asymptotic variance.

In our setting, $θ_{0}$ is known to be monotone. Therefore, instead of using a local linear regression to estimate the conditional mean of the pseudo-outcomes, it is natural to consider as an estimator the isotonic regression of the pseudo-outcomes on A₁, A₂, …, A_n. Using the GCM representation of isotonic regression that was stated in the previous section, we can summarize our estimation procedure as follows.

Step 1: construct estimators $μ_{n}$ and g_n of μ₀ and g₀ respectively.
Step 2: for each a in the unique values of A₁, A₂, …, A_n, compute and set
$Γ_{n} (a) : = \frac{1}{n} \sum_{i = 1}^{n} I_{(- \infty, a]} (A_{i}) \frac{Y_{i} - μ_{n} (A_{i}, W_{i})}{g_{n} (A_{i}, W_{i})} + \frac{1}{n^{2}} \sum_{i = 1}^{n} \sum_{j = 1}^{n} I_{(- \infty, a]} (A_{i}) μ_{n} (A_{i}, W_{j}) .$
(1)
Step 3: compute the GCM ${\bar{Ψ}}_{n}$ of the set of points {(0, 0)}∪{(F_n(A_i), Γ_n(A_i)):i = 1, 2, …, n} over [0, 1].
Step 4: define θ_n(a) as the left derivative of ${\bar{Ψ}}_{n}$ evaluated at F_n(a).

As in the work of Kennedy et al. (2017), although the proposed estimator θ_n can be defined as an isotonic regression, the asymptotic properties of our estimator do not appear to follow simply from classical results for isotonic regression because the pseudo-outcomes depend on the estimators

μ_{n}

⁠, g_n and Q_n, which themselves depend on all the observations. However, θ_n is of generalized Grenander type, and thus the asymptotic results of Westling and Carone (2019) can be used to study its asymptotic properties. To see that θ_n is a generalized Grenander-type estimator, we define

ψ_{P} : = θ_{P} \circ F_{P}^{- 1}

and note that, since θ_P and

F_{P}^{- 1}

are increasing, so is ψ_P. Therefore, the primitive function

Ψ_{P} (t) : = \int_{0}^{t} ψ_{P} (u) d u = \int_{- \infty}^{F_{P}^{- 1} (t)} θ_{P} (v) F_{P} (d v)

is convex. Next, we define Γ_P:=Ψ_P∘F_P, so that

Γ_{P} (a) = \int_{- \infty}^{a} θ_{P} (u) F_{P} (d u) = \int \int_{- \infty}^{a} μ_{P} (u, w) F_{P} (d u) Q_{P} (d w) .

The parameter

Γ_{P} (a_{0})

is pathwise differentiable at P in

M

for each

a_{0}

⁠, and its non-parametric efficient influence function

ϕ_{μ_{P}, g_{P}, F_{P}, Q_{P}, a_{0}}^{*}

can be computed to be

(y, a, w) \mapsto I_{(- \infty, a_{0}]} (a) \frac{y - μ_{P} (a, w)}{g_{P} (a, w)} + \int_{- \infty}^{a_{0}} μ_{P} (u, w) F_{P} (d u) + I_{(- \infty, a_{0}]} (a) θ_{P} (a) - 2 Γ_{P} (a_{0}) .

Denoting by P_n any estimator of

P_{0}

that is compatible with estimators

μ_{n}

⁠, g_n, F_n and Q_n of μ₀, g₀, F₀ and Q₀ respectively, the one-step estimator of

Γ_{0} (a)

is given by

Γ_{n} (a) : = Γ_{μ_{n}, F_{n}, Q_{n}} (a) + \frac{1}{n} \sum_{i = 1}^{n} ϕ_{μ_{n}, g_{n}, F_{n}, Q_{n}, a}^{*} (O_{i}),

where we define

Γ_{μ_{n}, F_{n}, Q_{n}} (a) : = \int \int_{- \infty}^{a} μ_{n} (u, w) F_{n} (d u) Q_{n} (d w) .

This one-step estimator is equivalent to that defined in expression (1). We then define

Ψ_{n} : = Γ_{n} \circ F_{n}^{-}

for

F_{n}^{-}

the empirical quantile function of A as our estimator of Ψ₀, and ψ_n as the left derivative of the GCM of Ψ_n. Thus, we find that θ_n = ψ_n∘F_n is the estimator that is defined in steps 1–4. This form of the estimator was described in Westling and Carone (2019), where it was briefly discussed as one of several examples of a general strategy for non-parametric monotone inference.

If $θ_{0} (a)$ were only known to be monotone on a fixed subinterval $A_{0} \subset A$ ⁠, we would define $F_{P} (a) : = P (A ⩽ a | A \in A_{0})$ as the marginal distribution function restricted to $A_{0}$ ⁠, and F_n as its empirical counterpart. Similarly, $I_{(- \infty, a]} (A_{i})$ in expression (1) would be replaced with $I_{(- \infty, a] \cap A_{0}} (A_{i})$ ⁠. In all other respects, our estimation procedure would remain the same.

Finally, as alluded to earlier, we observe that the proposed estimator generalizes classical isotonic regression in a way that we now make precise. If it is known that A is independent of W (condition 1), so that g₀(a, w) = 1 for all supported (a, w), we may take g_n = 1. If, furthermore, it is known that Y is independent of W given A (condition 2), then we may construct

μ_{n}

such that

μ_{n} (a, w) = μ_{n} (a)

for all supported (a, w). Inserting g_n = 1 and any such

μ_{n}

into expression (1), we obtain that

Γ_{n} (a) = \frac{1}{n} \sum_{i = 1}^{n} I_{(- \infty, a]} (A_{i}) Y_{i}

and thus that θ_n(a) = r_n(a) for each a. Hence, in this case, our estimator reduces to least squares isotonic regression.

3 Theoretical properties

3.1 Invariance to strictly increasing exposure transformations

An important feature of the estimator proposed is that, as with the isotonic regression estimator, it is invariant to any strictly increasing transformation of A. This is a desirable property because the scale of a continuous exposure is often arbitrary from a statistical perspective. For instance, if A is temperature, whether A is measured in degrees Fahrenheit or Celsius or in kelvins does not change the information that is available. In particular, if the parameters $θ_{0}$ and $θ_{0}^{*}$ correspond to using as exposure A and H(A) respectively for H some strictly increasing transformation, then $θ_{0}$ and $θ_{0}^{*}$ encode exactly the same information about the effect of A on Y after adjusting for W. It is therefore natural to expect any sensible estimator to be invariant to the scale on which the exposure is measured.

Setting X: = H(A) for a strictly increasing function

H : A \to R

⁠, we first note that the function

θ_{0}^{*} : x \mapsto E_{0} {E_{0} (Y | X = x, W)} = θ_{0} \circ H^{- 1} (x)

is non-decreasing. Next, we define

μ_{0}^{*} (x, w) : = E_{0} (Y | X = x, W = w)

and

g_{0}^{*} (x, w) = π_{0}^{*} (x | w) / f_{0}^{*} (x)

⁠, where

π_{0}^{*} (x | w)

is the evaluation at x of the conditional density function of X given W = w and

f_{0}^{*}

is the marginal density function of X under

P_{0}

⁠, and we denote by

μ_{n}^{*}

and

g_{n}^{*}

estimators of

μ_{0}^{*}

and

g_{0}^{*}

respectively. The estimation procedure that was defined in the previous section but using exposure X instead of A then leads to estimator

θ_{n}^{*} (x) : = ψ_{n}^{*} \circ F_{n}^{*} (x)

⁠, where

F_{n}^{*} : = F_{n} \circ H^{- 1}

is the empirical distribution function based on X₁, X₂, …, X_n, and

ψ_{n}^{*}

is the left derivative of the GCM of

Ψ_{n}^{*} : = Γ_{n}^{*} \circ F_{n}^{* -}

for

\begin{matrix} Γ_{n}^{*} (x) & : = \frac{1}{n} \sum_{i = 1}^{n} \{I_{(- \infty, x]} (X_{i}) \frac{Y_{i} - μ_{n}^{*} (X_{i}, W_{i})}{g_{n}^{*} (X_{i}, W_{i})} + \int_{- \infty}^{x} μ_{n}^{*} (x, W_{i}) F_{n}^{*} (d x)\} \\ = \frac{1}{n} \sum_{i = 1}^{n} [I_{(- \infty, H^{- 1} (x)]} (A_{i}) \frac{Y_{i} - μ_{n}^{*} {H (A_{i}), W_{i}}}{g_{n}^{*} {H (A_{i}), W_{i}}} + \int_{- \infty}^{H^{- 1} (x)} μ_{n}^{*} {H (a), W_{i}} F_{n} (d a)] . \end{matrix}

If

μ_{n}^{*} {H (a), w} = μ_{n} (a, w)

and

g_{n}^{*} {H (a), w} = g_{n} (a, w)

⁠, implying that nuisance estimators

μ_{n}

and g_n are themselves invariant to a strictly increasing transformation of A, then we have that

Γ_{n}^{*} = Γ_{n} \circ H^{- 1}

⁠, and so

Ψ_{n}^{*} = Γ_{n} \circ H^{- 1} \circ H \circ F_{n} = Ψ_{n}

⁠. It follows then that

θ_{n}^{*} = θ_{n} \circ H^{- 1}

⁠. In other words, the proposed estimator θ_n of

θ_{0}

is invariant to any strictly increasing transformation of the exposure variable.

We note that it is easy to ensure that $μ_{n}^{*} {H (a), w} = μ_{n} (a, w)$ and $g_{n}^{*} {H (a), w} = g_{n} (a, w)$ ⁠. Set U: = F_n(A), which is also equal to $F_{n}^{*} (X)$ ⁠, and let ${\bar{μ}}_{n} (u, w)$ be an estimator of the conditional mean of Y given (U, W) = (u, w). Then, taking $μ_{n} (a, w) : = {\bar{μ}}_{n} {F_{n} (a), w}$ ⁠, we have that $μ_{n}^{*} (x, w) : = {\bar{μ}}_{n} {F_{n}^{*} (x), w}$ satisfies the desired property. Similarly, letting ${\bar{g}}_{n} (u, w)$ be an estimator of the conditional density of U = u given W = w and, setting $g_{n} (a, w) : = {\bar{g}}_{n} {F_{n} (a), w}$ ⁠, we may take $g_{n}^{*} (x, w) : = {\bar{g}}_{n} {F_{n}^{*} (x), w}$ ⁠.

3.2 Consistency

We now provide sufficient conditions under which consistency of θ_n is guaranteed. Our conditions require controlling the uniform entropy of certain classes of functions. For a uniformly bounded class of functions $F$ ⁠, a finite discrete probability measure Q, and any ɛ > 0, the ɛ-covering-number $N {ε, F, L_{2} (Q)}$ of $F$ relative to the L₂(Q) metric is the smallest number of L₂(Q) balls of radius less than or equal to ɛ needed to cover $F$ ⁠. The uniform ɛ-entropy of $F$ is then defined as $\log [{sup}_{Q} N {ε, F, L_{2} (Q)}]$ ⁠, where the supremum is taken over all finite discrete probability measures. For a thorough treatment of covering numbers and their role in empirical process theory, we refer readers to van der Vaart and Wellner (1996).

Below, we state three sufficient conditions that we shall refer to in the following theorem.

Condition 1
There exist constants $C, δ, K_{0}, K_{1}, K_{2} \in (0, \infty)$ and V ∈ [0, 2) such that, almost surely as n → ∞, $μ_{n}$ and g_n are contained in classes of functions $F_{0}$ and $F_{1}$ respectively, satisfying
(a)
$| μ | ⩽ K_{0}$ for all $μ \in F_{0}$ ⁠, and K₁ ⩽ g ⩽ K₂ for all $g \in F_{1}$ ⁠, and
(b)
$\log [{sup}_{Q} N {ε, F_{0}, L_{2} (Q)}] ⩽ C ε^{- V / 2}$ and $\log [{sup}_{Q} N {ε, F_{1}, L_{2} (Q)}] ⩽ C ε^{- V}$ for all ɛ ⩽ δ.

Condition 2
There exist $μ_{\infty} \in F_{0}$ and $g_{\infty} \in F_{1}$ such that $P_{0} {(μ_{n} - μ_{\infty})}^{2} \to^{p} 0$ and $P_{0} {(g_{n} - g_{\infty})}^{2} \to^{p} 0$ ⁠.

Condition 3
There exist subsets $S_{1}, S_{2}$ and $S_{3}$ of $A \times W$ such that $P_{0} (S_{1} \cup S_{2} \cup S_{3}) = 1$ and
(a)
$μ_{\infty} (a, w) = μ_{0} (a, w)$ for all $(a, w) \in S_{1}$ ⁠,
(b)
$g_{\infty} (a, w) = g_{0} (a, w)$ for all $(a, w) \in S_{2}$ and
(c)
$μ_{\infty} (a, w) = μ_{0} (a, w)$ and $g_{\infty} (a, w) = g_{0} (a, w)$ for all $(a, w) \in S_{3}$ ⁠.

Under these three conditions, we have the following result.

Theorem 1
(consistency)
If conditions 1–3 hold, then $θ_{n} (a) \to^{p} θ_{0} (a)$ for any value $a \in A$ such that $F_{0} (a) \in (0, 1)$ ⁠, $θ_{0}$ is continuous at a and F₀ is strictly increasing in a neighbourhood of a. If $θ_{0}$ is uniformly continuous and F₀ is strictly increasing on $A$ ⁠, then ${sup}_{a \in A_{0}} | θ_{n} (a) - θ_{0} (a) | \to^{p} 0$ for any bounded strict subinterval $A_{0} ⊊ A$ ⁠.

We note that, in the pointwise statement of theorem 1, F₀(a) is required to be in the interior of [0, 1] and, similarly, the uniform statement of theorem 1 covers only strict subintervals of $A$ ⁠. This is due to the well-known boundary issues with Grenander-type estimators. Various remedies have been proposed in particular settings, and it would be interesting to consider these in future work (see, for example, Woodroofe and Sun (1993), Balabdaoui et al. (2011) and Kulikov and Lopuhaä (2006)).

Condition 1 requires that $μ_{n}$ and g_n eventually be contained in uniformly bounded function classes that are sufficiently small for certain empirical process terms to be controlled. This condition is easily satisfied if, for instance, $F_{0}$ and $F_{1}$ are parametric classes. It is also satisfied for many infinite dimensional function classes. Uniform entropy bounds for many such classes may be found in chapter 2.6 of van der Vaart and Wellner (1996). We note that there is an asymmetry between the entropy requirements for $F_{0}$ and $F_{1}$ in part (b) of condition 1. This is due to the term $\int \int_{- \infty}^{a} μ_{n} (u, w) F_{n} (d u) Q_{n} (d w)$ appearing in Γ_n(a). To control this term, we use an upper bound of the form $\int_{0}^{1} \log [{sup}_{Q} N {ε, F_{0}, L_{2} (Q)}] d ε$ from the theory of empirical U-processes (Nolan and Pollard, 1987)—this contrasts with the uniform entropy integral $\int_{0}^{1} {(\log [{sup}_{Q} N {ε, F, L_{2} (Q)}])}^{1 / 2} d ε$ that bounds ordinary empirical processes indexed by a uniformly bounded class $F$ ⁠. In Section 3.7, we consider the use of cross-fitting to avoid the entropy conditions in condition 1.

Condition 2 requires that $μ_{n}$ and g_n tend to limit functions $μ_{\infty}$ and $g_{\infty}$ ⁠, and condition 3 requires that either $μ_{\infty} (a, w) = μ_{0} (a, w)$ or $g_{\infty} (a, w) = g_{0} (a, w)$ for (F₀ × Q₀) almost every (a, w). If either

(a)
$S_{1}$ and $S_{3}$ are null sets or
(b)
$S_{2}$ and $S_{3}$ are null sets,

then condition 3 is known simply as double robustness of the estimator θ_n relative to the nuisance functions μ₀ and g₀: θ_n is consistent as long as $μ_{\infty} = μ_{0}$ or $g_{\infty} = g_{0}$ ⁠. Doubly robust estimators are at this point a mainstay of causal inference and have been studied for over two decades (see, for example, Robins et al. (1994), Rotnitzky et al. (1998), Scharfstein et al. (1999), van der Laan and Robins (2003), Neugebauer and van der Laan (2005) and Bang and Robins (2005)). However, condition 3 is more general than classical double robustness, as it allows neither $μ_{n}$ nor g_n to tend to their true counterparts over the whole domain, as long as at least one of $μ_{n}$ or g_n tends to the truth for almost every point in the domain.

3.3 Convergence in distribution

We now study the convergence in distribution of

n^{1 / 3} {θ_{n} (a) - θ_{0} (a)}

for fixed a. We first define for any square integrable functions

h_{1}, h_{2} : A \times W \to R

⁠, ɛ > 0 and

S \subseteq A \times W

the pseudodistance

d (h_{1}, h_{2}; a, ε, S) : = {(sup_{| u - a | ⩽ ε} E_{0} [I_{S} (u, W) {h_{1} (u, W) - h_{2} (u, W)}^{2}])}^{1 / 2} .

(2)

We also denote by

σ_{0}^{2} (a, w)

the conditional variance

E_{0} [{Y - μ_{0} (A, W)}^{2} | A = a, W = w]

of Y given A = a and W = w under

P_{0}

⁠. Below, we shall refer to these two additional conditions.

Condition 4
There exists ɛ₀ > 0 such that
(a)
$max {d (μ_{n}, μ_{\infty}; a, ε_{0}, S_{1}), d (g_{n}, g_{\infty}; a, ε_{0}, S_{2})} = o_{P} (n^{- 1 / 3})$ ⁠,
(b)
$max {d (μ_{n}, μ_{\infty}; a, ε_{0}, S_{2}), d (g_{n}, g_{\infty}; a, ε_{0}, S_{1})} = o_{P} (1)$ and
(c)
$d (μ_{n}, μ_{\infty}; a, ε_{0}, S_{3}) d (g_{n}, g_{\infty}; a, ε_{0}, S_{3}) = o_{P} (n^{- 1 / 3})$ ⁠.

Condition 5
$F_{0}, μ_{0}, μ_{\infty}, g_{0}, g_{\infty}$ and $σ_{0}^{2}$ are continuously differentiable in a neighbourhood of a uniformly over $w \in W$ ⁠.

Under conditions that have been introduced so far, we have the following distributional result.

Theorem 2
(convergence in distribution)
If conditions 1–5 hold, then
$n^{1 / 3} {θ_{n} (a) - θ_{0} (a)} \overset{d}{\to} {\{\frac{4 θ_{0}^{'} (a) κ_{0} (a)}{f_{0} (a)}\}}^{1 / 3} W .$
for any $a \in A$ such that $F_{0} (a) \in (0, 1)$ ⁠, where $W$ follows the standard Chernoff distribution and
$κ_{0} (a) : = E_{0} (E_{0} [{\{\frac{Y - μ_{\infty} (a, W)}{g_{\infty} (a, W)} + θ_{\infty} (a) - θ_{0} (a)\}}^{2} | A = a, W] g_{0} (a, W))$
with $θ_{\infty} (a)$ denoting $\int μ_{\infty} (a, w) Q_{0} (d w)$ ⁠.

We note that the limit distribution in theorem 2 is the same as that of the standard isotonic regression estimator up to a scale factor. As noted above, when either

(a)
Y and W are independent given A or
(b)
A is independent of W,

the functions $θ_{0}$ and r₀ coincide. As such, we can directly compare the respective limit distributions of $n^{1 / 3} {θ_{n} (a) - θ_{0} (a)}$ and n^1/3{r_n(a) − r₀(a)} under these conditions. When both $μ_{\infty} = μ_{0}$ and $g_{\infty} = g_{0}$ ⁠, r_n(a) is asymptotically more concentrated than θ_n(a) in scenario (a), and less concentrated in scenario (b). This is analogous to findings in linear regression, where including a covariate that is uncorrelated with the outcome inflates the standard error of the estimator of the coefficient corresponding to the exposure, whereas including a covariate that is correlated with the outcome but uncorrelated with the exposure deflates its standard error.

Condition 4 requires that, on the set $S_{1}$ where $μ_{n}$ is consistent but g_n is not, $μ_{n}$ converges faster than n^−1/3 uniformly in a neighbourhood of a, and similarly for g_n on the set $S_{2}$ ⁠. On the set $S_{3}$ where both $μ_{n}$ and g_n are consistent, only the product of their rates of convergence must be faster than n^−1/3. Hence, a non-degenerate limit theory is available as long as at least one of the nuisance estimators is consistent at a rate faster than n^−1/3, even if the other nuisance estimator is inconsistent. This suggests the possibility of performing doubly robust inference for $θ_{0} (a)$ ⁠, i.e. of constructing confidence intervals and tests based on θ_n(a) with valid calibration even when one of μ₀ and g₀ is inconsistently estimated. This is explored in Section 4. Finally, as in theorem 1, we allow that neither $μ_{n}$ nor g_n be consistent everywhere, as long as for (F₀ × Q₀) almost every (a, w) at least one of $μ_{n}$ or g_n is consistent.

We remark that, if it is known that $μ_{n} (a, \cdot)$ is consistent for μ₀(a, ·) in an L₂(Q₀) sense at rate faster than n^−1/3, the isotonic regression of the plug-in estimator $θ_{μ_{n}} (a) : = \int μ_{n} (a, w) Q_{n} (d w)$ —which can be equivalently obtained by setting $g_{n} (a, \cdot) = \infty$ in the construction of θ_n(a)—achieves a faster rate of convergence to $θ_{0} (a)$ than does θ_n(a). This might motivate an analyst to use $θ_{μ_{n}} (a)$ rather than θ_n(a) in such a scenario. However, the consistency of $θ_{μ_{n}} (a)$ hinges entirely on the fact that $μ_{\infty} = μ_{0}$ and, in particular, $θ_{μ_{n}} (a)$ will be inconsistent if $μ_{\infty} \neq μ_{0}$ ⁠, even if $g_{\infty} = g_{0}$ ⁠. Additionally, the estimator $θ_{μ_{n}} (a)$ may not generally admit a tractable limit theory on which to base the construction of valid confidence intervals, particularly when machine learning methods are used to build $μ_{n}$ ⁠.

3.4 Grenander-type estimation without domain transformation

As indicated earlier, the isotonic regression estimator based on estimated pseudo-outcomes coincides with a generalized Grenander-type estimator for which the marginal exposure empirical distribution function is used as domain transformation. An alternative estimator could be constructed via Grenander-type estimation without the use of any domain transformation. Specifically, we let

a_{-}, a_{+} \in R

be fixed, and we define

Θ_{0} (a) = \int_{a_{-}}^{a} θ_{0} (u) d u

⁠. Under regularity conditions, for a ⩽ a₊, the one-step estimator of

Θ_{0} (a)

given by

Θ_{n} (a) : = \frac{1}{n} \sum_{i = 1}^{n} \{I_{(a_{-}, a]} (A_{i}) \frac{Y_{i} - μ_{n} (A_{i}, W_{i})}{π_{n} (A_{i}, W_{i})} + \int_{a_{-}}^{a} μ_{n} (u, W_{i}) d u\}

is asymptotically efficient, where π_n is an estimator of π₀, the conditional density of A given W under

P_{0}

⁠. The left derivative of the GCM of Θ_n over

[a_{-}, a_{+}]

defines an alternative estimator

{\bar{θ}}_{n} (a)

⁠.

It is natural to ask how

{\bar{θ}}_{n}

compares with the estimator θ_n that we have studied thus far. First, we note that, unlike θ_n,

{\bar{θ}}_{n}

neither generalizes the classical isotonic regression estimator nor is invariant to strictly increasing transformations of A. Additionally, utilizing the transformation F₀ fixes [0, 1] as the interval over which the GCM should be performed. If

A

is known to be a bounded set,

[a_{-}, a_{+}]

can be taken as the end points of

A

⁠, but otherwise the domain

[a_{-}, a_{+}]

must be chosen in defining

{\bar{θ}}_{n}

⁠. Turning to an asymptotic analysis, using the results of Westling and Carone (2019), it is possible to establish conditions akin to conditions 1–5 under which

n^{1 / 3} {{\bar{θ}}_{n} (a) - θ_{0} (a)} \overset{d}{\to} {4 θ_{0}^{'} (a) {\bar{κ}}_{0} (a)}^{1 / 3} W

with scale parameter

{\bar{κ}}_{0} (a) : = E_{0} (E_{0} [{\{\frac{Y - μ_{\infty} (A, W)}{π_{\infty} (A | W)}\}}^{2} | A = a, W] π_{0} (a | W)),

where

π_{\infty}

is the limit of π_n in probability. We denote by {4τ₀(a)}^1/3 and

{4 {\bar{τ}}_{0} (a)}^{1 / 3}

the limit scaling factors of

n^{1 / 3} {θ_{n} (a) - θ_{0} (a)}

and

n^{1 / 3} {{\bar{θ}}_{n} (a) - θ_{0} (a)}

respectively. If

g_{\infty} = π_{\infty} / f_{0}

and

μ_{\infty} = μ_{0}

⁠, then

τ_{0} (a) = {\bar{τ}}_{0} (a)

⁠, and

n^{1 / 3} {θ_{n} (a) - θ_{0} (a)}

and

n^{1 / 3} {{\bar{θ}}_{n} (a) - θ_{0} (a)}

have the same limit distribution. If instead

g_{\infty} = π_{\infty} / f_{0} = g_{0}

but

μ_{\infty} \neq μ_{0}

⁠, this is no longer so. In fact, we can show that

\begin{matrix} τ_{0} (a) & = θ_{0}^{'} (a) E_{0} (\frac{E_{0} [{Y - μ_{\infty} (a, W)}^{2} | A = a, W]}{π_{0} (a | W)}) - θ_{0}^{'} (a) \frac{{θ_{\infty} (a) - θ_{0} (a)}^{2}}{f_{0} (a)} \\ ⩽ θ_{0}^{'} (a) E_{0} (\frac{E_{0} [{Y - μ_{\infty} (a, W)}^{2} | A = a, W]}{π_{0} (a | W)}) = \bar{τ} (a) . \end{matrix}

Hence, when the outcome regression estimator

μ_{n}

is inconsistent, gains in efficiency are achieved by utilizing the transformation, and the relative gain in efficiency is directly related to the amount of asymptotic bias in the estimation of μ₀.

3.5 Discrete domains

In some circumstances, the exposure A is discrete rather than continuous. Our estimator works equally well in these cases, since, as we highlight below, it turns out then to be asymptotically equivalent to the well-studied AIPW estimator. As a result, the large sample properties of our estimator can be derived from the large sample properties of the AIPW estimator, and asymptotically valid inference can be obtained by using standard influence-function-based techniques.

Suppose that

A = {a_{1} < a_{2} < \dots < a_{m}}

and

f_{0, j} : = P_{0} (A = a_{j}) > 0

for all j ∈ {1, 2, …, m} and

Σ_{j = 1}^{m} f_{0, j} = 1

⁠. Our estimation procedure remains the same with one exception: in defining g₀: = π₀/f₀, we now take π₀ to be the conditional probability

π_{0} (a_{j} | w) : = P_{0} (A = a_{j} | W = w)

rather than the corresponding conditional density, and we take f₀ as the marginal probability

f_{0} (a_{j}) : = P_{0} (A = a_{j}) = f_{0, j}

rather than the corresponding marginal density. We then set g_n: = π_n/f_n as the estimator of g₀, where π_n is any estimator of π₀ and f_n(a_j): = n_j/n for

n_{j} : = Σ_{i = 1}^{n} I (A_{i} = a_{j})

⁠. In all other respects, our estimation procedure is identical to that defined previously. With these definitions, we denote by ξ_{n, i} the estimated pseudo-outcome for observation i. Our estimator is then the isotonic regression of ξ_{n, 1}, ξ_{n, 2}, …, ξ_{n, n} on A₁, A₂, …, A_n. However, since for each i there is a unique j such that A_i = a_j, this is equivalent to performing isotonic regression of

θ_{n}^{†} (a_{1}), θ_{n}^{†} (a_{2}), \dots, θ_{n}^{†} (a_{m})

on a₁, a₂, …, a_m, where

θ_{n}^{†} (a_{j}) : = n_{j}^{- 1} Σ_{i = 1}^{n} I_{{a_{j}}} (A_{i}) ξ_{n, i}

⁠. It is straightforward to see that

θ_{n}^{†} (a_{j}) = \frac{1}{n} \sum_{i = 1}^{n} \{I_{{a_{j}}} (A_{i}) \frac{Y_{i} - μ_{n} (a_{j}, W_{i})}{π_{n} (a_{j} | W_{i})} + μ_{n} (a_{j}, W_{i})\},

which is exactly the AIPW estimator of

θ_{0} (a_{j})

⁠. Therefore, in this case, our estimator reduces to the isotonic regression of the classical AIPW estimator constructed separately for each element of the exposure domain.

The large sample properties of $θ_{n}^{†}$ ⁠, including doubly robust consistency and convergence in distribution at the regular parametric rate n^−1/2, are well established (Robins et al., 1994). Therefore, many properties of θ_n in this case can be determined by using the results of Westling et al. (2018), who studied the behaviour of the isotonic correction of an initial estimator. In particular, ${max}_{a \in A} | θ_{n} (a) - θ_{0} (a) | ⩽ {max}_{a \in A} | θ_{n}^{†} (a) - θ_{0} (a) |$ as long as $θ_{0}$ is non-decreasing on $A$ ⁠. Uniform consistency of $θ_{n}^{†}$ over $A$ thus implies uniform consistency of θ_n. Furthermore, if $θ_{0}$ is strictly increasing on $A$ and ${n^{1 / 2} [θ_{n}^{†} (a) - θ_{0} (a)] : a \in A}$ converges in distribution, then ${max}_{a \in A} | θ_{n} (a) - θ_{n}^{†} (a) | = o_{P} (n^{- 1 / 2})$ ⁠, so large sample standard errors for $θ_{n}^{†}$ are also valid for θ_n. If $θ_{0}$ is not strictly increasing on $A$ but instead has flat regions, then θ_n is more efficient than $θ_{0}$ on these regions, and confidence intervals centred near θ_n but based on the limit theory for $θ_{n}^{†}$ will be conservative.

3.6 Large sample results for causal effects

In many applications, in addition to the causal dose–response curve a↦m₀(a) itself, causal effects of the form (a₁, a₂)↦m₀(a₁) − m₀(a₂) are of scientific interest as well. Under the identification conditions that were discussed in Section 1.2 applied to each of a₁ and a₂, such causal effects are identified with the observed data parameter $θ_{0} (a_{1}) - θ_{0} (a_{2})$ ⁠. A natural estimator for such a causal effect in our setting is θ_n(a₁) − θ_n(a₂). If the conditions of theorem 1 hold for both a₁ and a₂, then the continuous mapping theorem implies that $θ_{n} (a_{1}) - θ_{n} (a_{2}) \to^{p} θ_{0} (a_{1}) - θ_{0} (a_{2})$ ⁠. However, since theorem 2 provides only marginal distributional results, and thus does not describe the joint convergence of $Z_{n} (a_{1}, a_{2}) : = (n^{1 / 3} {θ_{n} (a_{1}) - θ_{0} (a_{1})}, n^{1 / 3} {θ_{n} (a_{2}) - θ_{0} (a_{2})})$ ⁠, it cannot be used to determine the large sample behaviour of $n^{1 / 3} [{θ_{n} (a_{1}) - θ_{n} (a_{2})} - {θ_{0} (a_{1}) - θ_{0} (a_{2})}]$ ⁠. The following result demonstrates that such joint convergence can be expected under the aforementioned conditions, and that the bivariate limit distribution of Z_n(a₁, a₂) has independent components.

Theorem 3
(joint convergence in distribution)
If conditions 1–5 hold for $a \in {a_{1}, a_{2}} \subset A$ and $F_{0} (a_{1}), F_{0} (a_{2}) \in (0, 1)$ ⁠, then Z_n(a₁, a₂) converges in distribution to $({4 τ_{0} (a_{1})}^{1 / 3} W_{1}, {4 τ_{0} (a_{2})}^{1 / 3} W_{2})$ ⁠, where $W_{1}$ and $W_{2}$ are independent standard Chernoff distributions and the scale parameter τ₀ is as defined in theorem 2.

Theorem 3 implies that, under the conditions stated, $n^{1 / 3} [{θ_{n} (a_{1}) - θ_{n} (a_{2})} - {θ_{0} (a_{1}) - θ_{0} (a_{2})}]$ converges in distribution to ${4 τ_{0} (a_{1})}^{1 / 3} W_{1} - {4 τ_{0} (a_{2})}^{1 / 3} W_{2}$ ⁠.

3.7 Use of cross-fitting to avoid empirical process conditions

Theorems 1 and 2 reveal that the statistical properties of θ_n depend on the nuisance estimators $μ_{n}$ and g_n in two important ways. First, we require in condition 1 that $μ_{n}$ or g_n fall in sufficiently small classes of functions, as measured by metric entropy, to control certain empirical process remainder terms. Second, we require in conditions 2 and 3 that at least one of $μ_{n}$ or g_n be consistent almost everywhere (for consistency), and in condition 4 that the product of their rates of convergence be faster than n^−1/3 (for convergence in distribution). In observational studies, researchers can rarely specify a priori correct parametric models for μ₀ and g₀. This motivates use of data-adaptive estimators of these nuisance functions to meet the second requirement. However, such estimators often lead to violations of the first requirement, or it may be onerous to determine that they do not. Thus, because it may be difficult to find nuisance estimators that are both sufficiently data adaptive to meet required rates of convergence and fall in sufficiently small function classes to make empirical process terms negligible, simultaneously satisfying these two requirements can be challenging in practice.

In the context of asymptotically linear estimators, it has been noted that cross-fitting nuisance estimators can resolve this challenge by eliminating empirical process conditions (Zheng and van der Laan, 2011; Belloni et al., 2018; Kennedy, 2019). We therefore propose to employ cross-fitting of

μ_{n}

and g_n in the estimation of

Γ_{0}

to avoid entropy conditions in theorems 1 and 2. Specifically, we fix V ∈ {2, 3, …, n/2} and suppose that the indices {1, 2, …, n} are randomly partitioned into V sets

V_{n, 1}, V_{n, 2}, \dots, V_{n, V}

⁠. We assume for convenience that N: = n/V is an integer and that

| V_{n, v} | = N

for each v, but all our results hold as long as

{max}_{v} n / | V_{n, v} | = O_{P} (1)

⁠. For each v ∈ {1, 2, …, V}, we define

T_{n, v} : = {O_{i} : i \notin V_{n, v}}

as the training set for fold v and denote by μ_{n, v} and g_{n, v} the nuisance estimators that are constructed by using only the observations from

T_{n, v}

⁠. We then define pointwise the cross-fitted estimator

Γ_{n}^{\circ}

of

Γ_{0}

as

Γ_{n}^{\circ} (a) : = \frac{1}{V} \sum_{v = 1}^{V} \{\frac{1}{N} \sum_{i \in V_{n, v}} I_{(- \infty, a]} (A_{i}) \frac{Y_{i} - μ_{n, v} (A_{i}, W_{i})}{g_{n, v} (A_{i}, W_{i})} + \frac{1}{N^{2}} \sum_{i, j \in V_{n, v}} I_{(- \infty, a]} (A_{i}) μ_{n, v} (A_{i}, W_{j})\} .

(3)

Finally, the cross-fitted estimator

θ_{n}^{\circ}

of

θ_{0}

is constructed by using steps 1–4 outlined in Section 2.2, with Γ_n replaced by

Γ_{n}^{\circ}

⁠.

As we now demonstrate, utilizing the cross-fitted estimator $θ_{n}^{\circ}$ enables us to avoid the empirical process condition 1. We first introduce the following two conditions, which are analogues of conditions 1 and 2.

Condition 6
There exist constants $C^{'}, δ^{'}, K_{0}^{'}, K_{1}^{'}, K_{2}^{'}, K_{3}^{'} \in (0, \infty)$ such that, almost surely as n → ∞ and for all v, μ_{n, v} and g_{n, v} are contained in classes of functions $F_{0}^{'}$ and $F_{1}^{'}$ respectively, satisfying
(a)
$| μ | ⩽ K_{0}^{'}$ for all $μ \in F_{0}^{'}$ ⁠, and $K_{1}^{'} ⩽ g ⩽ K_{2}^{'}$ for all $g \in F_{1}^{'}$ ⁠, and
(b)
$σ_{0}^{2} (a, w) ⩽ K_{3}^{'}$ for almost all a and w.

Condition 7
There exist $μ_{\infty} \in F_{0}^{'}$ and $g_{\infty} \in F_{1}^{'}$ such that ${max}_{v} P_{0} {(μ_{n, v} - μ_{\infty})}^{2} \to^{p} 0$ and ${max}_{v} P_{0} {(g_{n, v} - g_{\infty})}^{2} \to^{p} 0$ ⁠.

We then have the following analogue of theorem 1 establishing consistency of the cross-fitted estimator $θ_{n}^{\circ}$ ⁠.

Theorem 4
(consistency of the cross-fitted estimator)
If conditions 6 and 7 and 3 hold, then $θ_{n}^{\circ} (a) \to^{p} θ_{0} (a)$ for any $a \in A$ such that $F_{0} (a) \in (0, 1)$ ⁠, $θ_{0}$ is continuous at a, and F₀ is strictly increasing in a neighbourhood of a. If $θ_{0}$ is uniformly continuous and F₀ is strictly increasing on $A$ ⁠, then ${sup}_{a \in A_{0}} | θ_{n}^{\circ} (a) - θ_{0} (a) | \to^{p} 0$ for any bounded strict subinterval $A_{0} ⊊ A$ ⁠.

For convergence in distribution, we introduce the following analogue of condition 4.

Condition 8
There exists ɛ₀ > 0 such that
(a)
${max}_{v} max {d (μ_{n, v}, μ_{\infty}; a, ε_{0}, S_{1}), d (g_{n, v}, g_{\infty}; a, ε_{0}, S_{2})} = o_{P} (n^{- 1 / 3})$ ⁠,
(b)
${max}_{v} max {d (μ_{n, v}, μ_{\infty}; a, ε_{0}, S_{2}), d (g_{n, v}, g_{\infty}; a, ε_{0}, S_{1})} = o_{P} (1)$ and
(c)
${max}_{v} d (μ_{n, v}, μ_{\infty}; a, ε_{0}, S_{3}) d (g_{n, v}, g_{\infty}; a, ε_{0}, S_{3}) = o_{P} (n^{- 1 / 3})$ ⁠.

We then have the following analogue of theorem 2 for the cross-fitted estimator $θ_{n}^{\circ}$ ⁠.

Theorem 5
(convergence in distribution for the cross-fitted estimator)
If conditions 6, 7, 3, 8 and 5 hold, then $n^{1 / 3} {θ_{n}^{\circ} (a) - θ_{0} (a)} \overset{d}{\to} {4 τ_{0} (a)}^{1 / 3} W$ for any $a \in A$ such that $F_{0} (a) \in (0, 1)$ ⁠.

The conditions of theorems 4 and 5 are analogous to those of theorems 1 and 2, with the important exception that the entropy condition 1(b) is no longer required. Therefore, the estimators μ_{n, v} and g_{n, v} may be as data adaptive as one desires without concern for empirical process terms, as long as they satisfy the boundedness conditions that are stated in condition 6.

4 Construction of confidence intervals

4.1 Wald-type confidence intervals

The distributional results of theorem 2 can be used to construct a confidence interval for

θ_{0} (a)

⁠. Since the limit distribution of

n^{1 / 3} {θ_{n} (a) - θ_{0} (a)}

is symmetric around zero, a Wald-type construction seems appropriate. Specifically, writing

τ_{0} (a) : = θ_{0}^{'} (a) κ_{0} (a) / f_{0} (a)

and denoting by τ_n(a) any consistent estimator of τ₀(a), a Wald-type (1 − α)-level asymptotic confidence interval for

θ_{0} (a)

is given by

[θ_{n} (a) - {\{\frac{4 τ_{n} (a)}{n}\}}^{1 / 3} q_{1 - α / 2}, θ_{n} (a) + {\{\frac{4 τ_{n} (a)}{n}\}}^{1 / 3} q_{1 - α / 2}],

where q_p denotes the pth quantile of

W

⁠. Quantiles of the standard Chernoff distribution have been numerically computed and tabulated on a fine grid (Groeneboom and Wellner, 2001) and are readily available in the statistical programming language R. Estimation of τ₀(a) involves, either directly or indirectly, estimation of

θ_{0}^{'} (a) / f_{0} (a)

and

κ_{0} (a)

⁠. We focus first on the former.

We note that $θ_{0}^{'} (a) / f_{0} (a) = ψ_{0}^{'} {F_{0} (a)}$ with $ψ_{0} : = θ_{0} \circ F_{0}^{- 1}$ ⁠. This suggests that we could either estimate $θ_{0}^{'}$ and f₀ separately and consider the ratio of these estimators, or that we could instead estimate $ψ_{0}^{'}$ directly and compose it with the estimator of F₀ that is already available. The latter approach has the desirable property that the resulting scale estimator is invariant to strictly monotone transformations of the exposure. As such, this is the strategy that we favour. To estimate $ψ_{0}^{'}$ ⁠, we recall that the estimator ψ_n from Section 2 is a step function and is therefore not differentiable. A natural solution consists of computing the derivative of a smoothed version of ψ_n. We have found local quadratic kernel smoothing of points {(u_j, ψ_n(u_j)):j = 1, 2, …, K}, for u_j the midpoints of the jump points of ψ_n, to work well in practice.

Theorem 3 can be used to construct Wald-type confidence intervals for causal effects of the form $θ_{0} (a_{1}) - θ_{0} (a_{2})$ ⁠. We first construct estimates τ_n(a₁) and τ_n(a₂) of the scale parameters τ₀(a₁) and τ₀(a₂) respectively, and then compute an approximation ${\bar{q}}_{n, 1 - α / 2}$ of the (1 − α/2)-quantile of ${4 τ_{n} (a_{1})}^{1 / 3} W_{1} - {4 τ_{n} (a_{2})}^{1 / 3} W_{2}$ ⁠, where $W_{1}$ and $W_{2}$ are independent Chernoff distributions, using Monte Carlo simulations, for example. An asymptotic (1 − α)-level Wald-type confidence interval for $θ_{0} (a_{1}) - θ_{0} (a_{2})$ is then $θ_{n} (a_{1}) - θ_{n} (a_{2}) \pm {\bar{q}}_{n, 1 - α / 2} n^{- 1 / 3}$ ⁠.

In the next two subsections, we discuss different strategies for estimating the scale factor $κ_{0} (a)$ ⁠.

4.2 Scale estimation relying on consistent nuisance estimation

We first consider settings in which both

μ_{n}

and g_n are consistent estimators, i.e.

g_{\infty} = g_{0}

and

μ_{\infty} = μ_{0}

⁠. In such cases, we have that

κ_{0} (a) = E_{0} {σ_{0}^{2} (a, W) / g_{0} (a, W)}

with

σ_{0}^{2} (a, w)

denoting the conditional variance

E_{0} [{Y - μ_{0} (a, W)}^{2} | A = a, W = w]

⁠. Any regression technique could be used to estimate the conditional expectation of

Z_{n} : = {Y - μ_{n} (A, W)}^{2}

given A and W, yielding an estimator

σ_{n}^{2} (a, w)

of

σ_{0}^{2} (a, w)

⁠. A plug-in estimator of

κ_{0} (a)

is then given by

κ_{n} (a) : = \frac{1}{n} \sum_{i = 1}^{n} \frac{σ_{n}^{2} (a, W_{i})}{g_{n} (a, W_{i})} .

Provided that

μ_{n}

⁠, g_n and

σ_{n}^{2}

are consistent estimators of μ₀, g₀ and

σ_{0}^{2}

respectively, κ_n(a) is a consistent estimator of

κ_{0} (a)

⁠. We note that, in the special case of a binary outcome, the fact that

σ_{0}^{2} (a, w) = μ_{0} (a, w) {1 - μ_{0} (a, w)}

motivates the use of

μ_{n} (a, w) {1 - μ_{n} (a, w)}

as estimator

σ_{n}^{2} (a, w)

and thus eliminates the need for further regression beyond the construction of

μ_{n}

and g_n. In practice, we typically recommend the use of an ensemble method—e.g. Super Learner (van der Laan et al., 2007)—to combine a variety of regression techniques, including machine learning techniques, to minimize the risk of inconsistency of

μ_{n}

⁠, g_n and

σ_{n}^{2}

⁠.

4.3 Doubly robust scale estimation

As noted above, theorem 2 provides the limit distribution of $n^{1 / 3} {θ_{n} (a) - θ_{0} (a)}$ even if one of the nuisance estimators is inconsistent, as long as the consistent nuisance estimator converges sufficiently fast. We now show how we may capitalize on this result to provide a doubly robust estimator of $κ_{0} (a)$ ⁠. Since ψ_n is itself a doubly robust estimator of ψ₀, so will be the proposed estimator $ψ_{n}^{'}$ of $ψ_{0}^{'}$ and hence also of the resulting estimator τ_n(a) of τ₀(a). This contrasts with the estimator of $κ_{0} (a)$ that was described in the previous section, which required the consistency of both $μ_{n}$ and g_n.

To construct an estimator of

κ_{0} (a)

that is consistent even if either

μ_{\infty} \neq μ_{0}

or

g_{\infty} \neq g_{0}

⁠, we begin by noting that

κ_{0} (a) = {lim}_{h ↓ 0} E_{0} [K_{h} {F_{0} (A) - F_{0} (a)} η_{\infty} (Y, A, W)]

⁠, where K_h:u↦h⁻¹K(uh⁻¹) for some bounded density function K with bounded support, and we have defined

η_{\infty} : (y, a, w) \mapsto {\{\frac{y - μ_{\infty} (a, w)}{g_{\infty} (a, w)} + θ_{\infty} (a) - θ_{0} (a)\}}^{2} .

Setting

θ_{μ_{n}} (a) : = \int μ_{n} (a, w) Q_{n} (d w)

with Q_n the empirical distribution based on W₁, W₂, …, W_n, we define

κ_{n, h}^{*} (a) : = \frac{1}{n} \sum_{i = 1}^{n} K_{h} {F_{n} (A_{i}) - F_{n} (a)} η_{n} (Y_{i}, A_{i}, W_{i})

with η_n obtained by substituting

μ_{\infty}

⁠,

g_{\infty}

⁠,

θ_{\infty}

and

θ_{0}

by

μ_{n}

⁠, g_n,

θ_{μ_{n}}

and θ_n respectively in the definition of

η_{\infty}

⁠. Under conditions 1–5, it can be shown that

κ_{n, h_{n}}^{*} (a) \to^{p} κ_{0} (a)

by standard kernel smoothing arguments for any sequence h_n→0. In particular,

κ_{n, h_{n}}^{*} (a)

is consistent under the general form of doubly robustness specified by condition 3.

To determine an appropriate value of the bandwidth h in practice, we propose the following empirical criterion. We first define the integrated scale $γ_{0} : = \int κ_{0} (a) F_{0} (d a)$ and construct the estimator $γ_{n} (h) : = \int κ_{n, h} (a) F_{n} (d a)$ for any candidate h > 0. Furthermore, we observe that $γ_{0} = E_{0} {η_{\infty} (Y, A, W)}$ ⁠, which suggests the use of the empirical estimator ${\bar{η}}_{n} : = (1 / n) Σ_{i = 1}^{n} η_{n} (Y_{i}, A_{i}, W_{i})$ ⁠. This motivates us to define $h_{n}^{*} : = arg {min}_{h} {γ_{n} (h) - {\bar{η}}_{n}}^{2}$ ⁠, i.e. the value of h that makes γ_n(h) and ${\bar{η}}_{n}$ closest. The proposed doubly robust estimator of $κ_{0} (a)$ is thus $κ_{n, DR} (a) : = κ_{n, h_{n}^{*}} (a)$ ⁠.

We make two final remarks regarding this doubly robust estimator of $κ_{0} (a)$ ⁠. First, we note that this estimator depends only on A and a through the ranks F_n(A) and F_n(a). Hence, as before, our estimator is invariant to strictly monotone transformations of the exposure A. Second, we note that, if $μ_{n} (a, w) = μ_{n} (a)$ does not depend on w and g_n = 1, κ_n,DR(a) tends to the conditional variance ${var}_{0} (Y | A = a)$ ⁠, which is precisely the scale parameter appearing in standard isotonic regression.

4.4 Confidence intervals via sample splitting

As an alternative, we note here that the sample splitting method that was recently proposed by Banerjee et al. (2019) could also be used to perform inference. Specifically, to implement their approach in our context, we randomly split the sample into m subsets of roughly equal size, perform our estimation procedure on each subset to form subset-specific estimates θ_{n, 1}, θ_{n, 2}, …, θ_{n, m}, and then define

{\bar{θ}}_{n, m} (a) : = (1 / m) Σ_{j = 1}^{m} θ_{n, j} (a)

⁠. Banerjee et al. (2019) demonstrated that, if m > 1 is fixed, then under mild conditions

{\bar{θ}}_{n, m} (a)

has strictly better asymptotic mean-squared error than θ_n(a), and that, for moderate m,

[{\bar{θ}}_{n, m} (a) - \frac{σ_{n, m} (a)}{\sqrt m n^{1 / 3}} t_{1 - α / 2, m - 1}, {\bar{θ}}_{n, m} (a) + \frac{σ_{n, m} (z)}{\sqrt m n^{1 / 3}} t_{1 - α / 2, m - 1}]

forms an asymptotic (1 − α)-level confidence interval for

θ_{0} (a)

⁠, where

σ_{n, m}^{2} (a) : = \frac{1}{m - 1} \sum_{j = 1}^{m} {θ_{n, j} (a) - {\bar{θ}}_{n, m} (a)}^{2}

and t_{1 − α/2, m − 1} is the (1 − α/2)-quantile of the t-distribution with m − 1 degrees of freedom.

5 Numerical studies

In this section, we perform numerical experiments to assess the performance of the proposed estimators of $θ_{0} (a)$ and of the three approaches for constructing confidence intervals, which we also compare with that of the local linear estimator and associated confidence intervals that were proposed in Kennedy et al. (2017).

In our experiments, we simulate data as follows. First, we generate $W \in R^{4}$ as a vector of four independent standard normal variates. A natural next step would be to generate A given W. However, since our estimation procedures requires estimating the conditional density of U: = F₀(A) given W, we instead generate U given W, and then transform U to obtain A. This strategy makes it easier to construct correctly specified parametric nuisance estimators in the context of these simulations. Given W = w, we generate U from the distribution with conditional density function ${\bar{g}}_{0} (u | w) = I_{[0, 1]} (u) [λ (w) + 2 u {1 - λ (w)}]$ for $λ (w) : = 0.1 + 1.8 expit (β^{T} w)$ ⁠. We note that ${\bar{g}}_{0} (u | w) ⩾ 0.1$ for all u ∈ [0, 1] and $w \in R^{4}$ ⁠, and, also, that $\int {\bar{g}}_{0} (u | w) Q_{0} (d w) = I_{[0, 1]} (u)$ ⁠, so that U is marginally uniform. We then take A to be the evaluation at U of the quantile function of an equal weight mixture of two normal distributions with means −2 and 2 and standard deviation 1, which implies that A is marginally distributed according to this bimodal normal mixture. Finally, conditionally on A = a and W = w, we simulate Y as a Bernoulli random variate with conditional mean function given by $μ_{0} (a, w) : = expit (γ_{1}^{T} \underset{̲}{w} + γ_{2}^{T} \underset{̲}{w} a + γ_{3} a^{2})$ ⁠, where $\underset{̲}{w}$ denotes (1, w). We set $β = {(- 1, - 1, 1, 1)}^{T}$ ⁠, γ₁ = (−1, −1, −1, 1, 1)^T, γ₂ = (3, −1, −1, 1, 1)^T and γ₃ = 3 in the experiments that we report on.

We estimate the true confounder-adjusted dose–response curve $θ_{0}$ by using the causal isotonic regression estimator θ_n, the local linear estimator of Kennedy et al. (2017) and the sample splitting version of θ_n that was proposed by Banerjee et al. (2019) with m = 5 splits. For the local linear estimator, we use the data-driven bandwidth selection procedure that was proposed in section 3.5 of Kennedy et al. (2017). We consider three settings in which either both $μ_{n}$ and g_n are consistent, only $μ_{n}$ is consistent or only g_n consistent. To construct a consistent estimator $μ_{n}$ ⁠, we use a correctly specified logistic regression model, whereas, to construct a consistent estimator g_n, we use a maximum likelihood estimator based on a correctly specified parametric model. To construct an inconsistent estimator $μ_{n}$ ⁠, we still use a logistic regression model but omit covariates W₃ and W₄ and all interactions. To construct an inconsistent estimator g_n, we posit the same parametric model as before but omit W₃ and W₄. We construct pointwise confidence intervals for $θ_{0}$ in each setting, using the Wald-type construction described in Section 4 with both the plug-in and the doubly robust estimators of $κ_{0} (a)$ ⁠. We expect intervals based on the doubly robust estimator of $κ_{0} (a)$ to provide asymptotically correct coverage rates for $θ_{0} (a)$ for each of the three settings, but we expect only asymptotically correct coverage rates in the first setting when the plug-in estimator of $κ_{0} (a)$ is used. We construct pointwise confidence intervals for the local linear estimator by using the procedure that was proposed in Kennedy et al. (2017), and for the sample splitting procedure by using the procedure that was discussed in Section 4.4. We consider the performance of these inferential procedures for values of a between −3 and 3.

Fig. 1(a) shows a single sample path of the causal isotonic regression estimator based on a sample of size n = 5000 and consistent estimators $μ_{n}$ and g_n. Also included in Fig. 1(a) are asymptotic 95% pointwise confidence intervals constructed by using the doubly robust estimator of $κ_{0} (a)$ ⁠. Fig. 1(b) shows the unadjusted isotonic regression estimate based on the same data and corresponding 95% asymptotic confidence intervals. The true causal and unadjusted regression curves are the smooth curves. We note that $θ_{0} (a) \neq r_{0} (a)$ for a ≠ 0, since the relationship between Y and A is confounded by W, and indeed the unadjusted regression curve does not have a causal interpretation. Therefore, the marginal isotonic regression estimator will not be consistent for the true causal parameter. In this data-generating setting, the causal effect of A on Y is larger than the marginal effect of A on Y in the sense that $θ_{0} (a)$ has greater variation over values of a than does r₀(a).

Fig. 1

(a) Causal isotonic regression estimate using consistent nuisance estimators $μ_{n}$ and g_n, and (b) regular isotonic regression estimate: , pointwise 95% confidence intervals constructed by using the doubly robust estimator; , true functions

Open in new tab Download slide

We perform 1000 simulations, each with n ∈ {500, 1000, 2500, 5000} observations. Fig. 2 displays the empirical standard error of the three estimators considered over these 1000 simulated data sets as a function of a and for each value of n. We first note that the standard error of the local linear estimator is smaller than that of θ_n, which is expected because of the faster rate of convergence of the local linear estimator. The sample splitting procedure also reduces the standard error of θ_n. Furthermore, the standard deviation of the local linear estimator appears to decrease faster than n^−1/3, whereas the standard deviation of the estimators based on θ_n do not, in line with the theoretical rates of convergence of these estimators. We also note that inconsistent estimation of the propensity has little effect on the standard errors of any of the estimators, but inconsistent estimation of the outcome regression results in slightly larger standard errors.

Fig. 2

Standard error of the three estimators scaled by n^1/3 as a function of n for various values of a and in contexts in which $μ_{n}$ and g_n are either consistent or inconsistent, computed empirically over 1000 simulated data sets of different sizes (, sample size 500; , sample size 1000; , sample size 2500; , sample size 5000): (a) correct outcome regression, correct propensity; (b) correct outcome regression, incorrect propensity; (c) incorrect outcome regression, correct propensity

Open in new tab Download slide

Fig. 3 displays the absolute bias of the three estimators. For most values of a, the estimator θ_n that is proposed here has smaller absolute bias than the local linear estimator, and its absolute bias decreases faster than n^−1/3. The absolute bias of the local linear estimator depends strongly on a, and in particular is largest where the second derivative of $θ_{0}$ is large in absolute value, agreeing with the large sample theory that was described in Kennedy et al. (2017). The sample splitting estimator has larger absolute bias than does θ_n because it inherits the bias of θ_n/m. The bias is especially large for values of a in the tails of the marginal distribution of A.

Fig. 3

Absolute bias of the three estimators scaled by n^1/3 as a function of n for various values of a and in contexts in which $μ_{n}$ and g_n are either consistent or inconsistent, computed empirically over 1000 simulated data sets of different sizes (, sample size 500; , sample size 1000; , sample size 2500; , sample size 5000): (a) correct outcome regression, correct propensity; (b) correct outcome regression, incorrect propensity; (c) incorrect outcome regression, correction propensity

Open in new tab Download slide

Fig. 4 shows the empirical coverage of nominal 95% pointwise confidence intervals for a range of values of a for the four methods that were considered. For both the plug-in and the doubly robust intervals centred near θ_n, the coverage improves as n grows, especially for values of a in the tails of the marginal distribution of A. Under correct specification of outcome and propensity regression models, the plug-in method attains close to nominal coverage rates for a between −3 and 3 by n = 1000. When the propensity estimator is inconsistent, the plug-in method still performs well in this example, although we do not expect this to be the case always. However, when $μ_{n}$ is inconsistent, the plug-in method is very conservative for positive values of a. The doubly robust method attains close to nominal coverage for large samples as long as one of g_n or $μ_{n}$ is consistent. Compared with the plug-in method, the doubly robust method requires larger sample sizes to achieve good coverage, especially for extreme values of a. This is because the doubly robust estimator of $κ_{0} (a)$ has a slower rate of convergence than does the plug-in estimator, as demonstrated by box plots of these estimators that are provided in the on-line supplementary material.

Fig. 4

Observed coverage of pointwise 95% confidence intervals by using θ_n and the plug-in method (top row), θ_n and the doubly robust method (second row), the local linear estimator and associated intervals (third row) and the sample splitting estimator (bottom row), considered for various values of a and computed empirically over 1000 simulated data sets of different sizes (, nominal coverage rate; , sample size 500; sample size 1000; , sample size 2500; , sample size 5000): (a) correct outcome regression, correct propensity; (b) correct outcome regression, incorrect propensity; (c) incorrect outcome regression, correct propensity

Open in new tab Download slide

The confidence intervals that are associated with the local linear estimator have poor coverage for values of a where the bias of the estimator is large, which, as mentioned above, occurs when the second derivative of $θ_{0}$ is large in absolute value. Overall, the sample splitting estimator has excellent coverage, except perhaps for values of a in the tails of the marginal distribution of A when n is small or moderate, in which case the coverage is near 90%.

We also conducted a small simulation study to illustrate the performance of the proposed procedures when machine learning techniques are used to construct $μ_{n}$ and g_n. To estimate μ₀ consistently, we used Super Learner (van der Laan et al., 2007) with a library consisting of generalized linear models, multivariate adaptive regression splines and generalized additive models. To estimate g₀ consistently, we used the method that was proposed by Díaz and van der Laan (2011) with covariate vector (W₁, W₂, W₃, W₄). To produce inconsistent estimators $μ_{n}$ or g_n, we used the same estimators but omitted covariates W₁ and W₂. We also considered the estimator $θ_{n}^{\circ}$ that was obtained via cross-fitting these nuisance parameters, as discussed in Section 3.7, as well as the local linear estimator. Because of computational limitations, we performed 1000 simulations at sample size n = 1000 only. Fig. 5 shows the coverage of nominal 95% confidence intervals. The plug-in intervals achieve very close to nominal coverage under consistent estimation of both nuisances, and they also achieve surprisingly good coverage rates when the propensity is inconsistently estimated. The plug-in intervals are somewhat conservative when the outcome regression is inconsistently estimated. The doubly robust method is anti-conservative under inconsistent estimation of both nuisances and also when the propensity is inconsistently estimated, with coverage rates mostly between 90% and 95%. Good coverage rates are also achieved when the outcome regression is inconsistently estimated. These results suggest that the doubly robust intervals may require larger sample sizes to achieve good coverage, particularly when machine learning estimators are used for $μ_{n}$ and g_n. The plug-in intervals appear to be relatively robust to moderate misspecification of models for the nuisance parameters in smaller samples. Histograms of the estimators of $κ_{0} (a)$ and $ψ_{0}^{'} (a)$ are provided in the on-line supplementary material. Confidence intervals based on the local linear estimator show a similar pattern to that in the previous simulation study, undercovering where the second derivative of the true function is large in absolute value. Cross-fitting had little effect on coverage.

Fig. 5

Observed coverage of pointwise 95% doubly robust and plug-in confidence intervals by using machine learning estimators based on simulated data including n = 1000 observations (, nominal coverage rate; , causal isotonic plus plug in; , causal isotonic plus doubly robust; , cross-fitted causal isotonic plus plug in; , cross-fitted causal isotonic plus doubly robust; , local linear): (a) correct outcome regression, correct propensity; (b) correct outcome regression, incorrect propensity; (c) incorrect outcome regression, correct propensity

Open in new tab Download slide

As noted above, we found in our numerical experiments that the plug-in estimator of the scale parameter was surprisingly robust to inconsistent estimation of the nuisance parameters, whereas its doubly robust estimator was anticonservative even when the nuisance parameters were estimated consistently. This can be explained in terms of the bias and variance of the two scale estimators proposed. On one hand, under inconsistent estimation of any nuisance function, the plug-in estimator of the scale parameter is biased, even in large samples. However, its variance decreases relatively quickly with sample size, since it is a simple empirical average of estimated functions. On the other hand, the doubly robust estimator is asymptotically unbiased, but its variance decreases much slower with sample size. These trends can be observed in the figures that are provided in the on-line supplementary material. In sufficiently large samples, the doubly robust estimator is expected to outperform the plug-in estimator in terms of mean-squared error when one of the nuisances is inconsistently estimated. However, the sample size that is required for this trade-off to affect confidence interval coverage significantly depends on the degree of inconsistency. Although we did not see this trade-off at the sample sizes that were used in our numerical experiments, we expect the benefits of the doubly robust confidence interval construction to become apparent in smaller samples in other settings.

6 Body mass index and T-cell response in human immunodeficiency virus vaccine studies

The scientific literature indicates that, for several vaccines, obesity or BMI is inversely associated with immune responses to vaccination (see, for example, Sheridan et al. (2012), Young et al. (2013), Jin et al. (2015), Painter et al. (2015) and Liu et al. (2017)). Some of this literature has investigated potential mechanisms of how obesity or higher BMI might lead to impaired immune responses. For example, Painter et al. (2015) concluded that obesity may alter cellular immune responses, especially in adipose tissue, which varies with BMI. Sheridan et al. (2012) found that obesity is associated with decreased CD8+ T-cell activation and decreased expression of functional proteins in the context of influenza vaccines. Liu et al. (2017) found that obesity reduced hepatitis B immune responses through ‘leptin-induced systemic and B cell intrinsic inflammation, impaired T cell responses and lymphocyte division and proliferation’. Given this evidence of a monotone effect of BMI on immune responses, we used the methods that are presented in this paper to assess the covariate-adjusted relationship between BMI and CD4+ T-cell responses using data from a collection of clinical trials of candidate HIV vaccines. We present the results of our analyses here.

Jin et al. (2015) compared the rate of CD4+ T cell response with HIV peptide pools among low (BMI <25) medium (25 ⩽ BMI <30) and high (BMI ⩾ 30) BMI participants, and they found that low BMI participants had a statistically significantly greater response rate than high BMI participants by using Fisher’s exact test. However, such a marginal assessment of the relationship between BMI and immune response can be misleading because there are known common causes, such as age and sex, of both BMI and immune response. For this reason, Jin et al. (2015) also performed a logistic regression of the binary CD4+ responses against sex, age, BMI (not discretized), vaccination dose and number of vaccinations. In this adjusted analysis, they found a significant association between BMI and CD4+ cell response rate after adjusting for all other covariates (odd ratio 0.92; 95% confidence interval 0.86, 0.98; p = 0.007). However, such an adjusted odds ratio has a formal causal interpretation only under strong parametric assumptions. As discussed in Section 1.2, the covariate-adjusted dose–response function $θ_{0}$ is identified with the causal dose–response curve without making parametric assumptions and is therefore of interest for understanding the continuous covariate-adjusted relationship between BMI and immune responses.

We note that there is some debate in the causal inference literature about whether exposures such as BMI have a meaningful interpretation in formal causal modelling. In particular, some researchers suggest that causal models should always be tied to hypothetical randomized experiments (see, for example Bind and Rubin (2019)), and it is difficult to imagine a hypothetical randomized experiment that would assign participants to levels of BMI. From this perspective, it may therefore not be sensible to interpret $θ_{0} (a)$ in a causal manner in the context of this example. Nevertheless, as discussed in Section 1, we contend that $θ_{0} (a)$ is still of interest. In particular, it provides a meaningful summary of the relationship between BMI and immune response accounting for measured potential confounders. In this case, we interpret $θ_{0} (a)$ as the probability of immune response in a population of participants with BMI-value a but sex, age, vaccination dose, number of vaccinations and study with a similar distribution to that of the entire study population.

We pooled data from the vaccine arms of 11 phase I–II clinical trials, all conducted through the HIV Vaccine Trials Network. 10 of these trials were previously studied in the analysis that was presented in Jin et al. (2015), and a detailed description of the trials are contained therein. The final trial in our pooled analysis is HVTN 100, in which 210 participants were randomized to receive four doses of the ALVAC-HIV vaccine (vCP1521). The ALVAC-HIV vaccine, in combination with an AIDSVAX boost, was found to have statistically significant vaccine efficacy against HIV type 1 in the RV-144 trial that was conducted in Thailand (Rerks-Ngarm et al., 2009). CD4+ and CD8+ T-cell responses to HIV peptide pools were measured in all 11 trials by using validated intracellular cytokine staining at HIV Vaccine Trials Network laboratories. These continuous responses were converted to binary indicators of whether there was a significant change from baseline by using the method described in Jin et al. (2015). We analysed these binary responses at the first visit following administration of the last vaccine dose—either 2 or 4 weeks after the final vaccination depending on the trial. After accounting for missing responses from a small number of participants, our analysis data sets consisted of a total of n = 439 participants for the analysis of CD4+ cell responses and n = 462 participants for CD8+ cell responses. Here, we focus on analysing CD4+ cell responses; we present the analysis of CD8+ cell responses in the on-line supplementary material.

We assessed the relationship between BMI and T-cell response by estimating the covariate-adjusted dose–response function $θ_{0}$ using our cross-fitted estimator $θ_{n}^{\circ}$ ⁠, the local linear estimator and the sample splitting version of our estimator with m = 5 splits. We adjusted for sex, age, vaccination dose, number of vaccinations and study. We estimated μ₀ and g₀ as in the machine-learning-based simulation study that was described in Section 5, and we constructed confidence intervals for our estimator by using both the plug-in and the doubly robust estimators that were described above.

Fig. 6 presents the estimated probability of a positive CD4+ T-cell response as a function of BMI for BMI-values between the 0.05- and 0.95-quantile of the marginal empirical distribution of BMI by using our estimator (Fig. 6(a)), the local linear estimator (Fig. 6(b)) and the sample-splitting estimator (Fig. 6(c)). Pointwise 95% confidence intervals are shown as broken or dotted curves. The three methods found qualitatively similar results. We found that the change in probability of CD4+ cell response appears to be largest for BMI <20 and BMI >30. We estimated the probability of having a positive CD4+ T-cell response, after adjusting for potential confounders, to be 0.52 (95% doubly robust confidence interval 0.44–0.59) for a BMI of 20, 0.47 (0.42–0.52) for a BMI of 25, 0.47 (0.32–0.62) for a BMI of 30 and 0.29 (0.12–0.47) for a BMI of 35. We estimated the difference between these probabilities for BMIs of 20 and 35 to be 0.22 (confidence interval 0.03–0.41).

Fig. 6

Estimated probabilities of CD4+ T-cell response and 95% pointwise confidence intervals as a function of BMI, adjusted for sex, age, number of vaccinations received, vaccine dose and study: (a) estimator proposed here; , confidence intervals based on the plug-in estimator of the scale parameter; , confidence interval based on the doubly robust estimator of the scale parameter; (b) local linear estimator of Kennedy et al. (2017); (c) sample splitting version of our estimator with m = 5 splits

Open in new tab Download slide

7 Concluding remarks

The work that we have presented in this paper lies at the interface of causal inference and shape-constrained non-parametric inference, and there are natural future directions building on developments in either of these areas. Inference on a monotone causal dose–response curve when outcome data are only observed subject to potential coarsening, such as censoring, truncation or missingness, is needed to increase the applicability of our proposed method. To tackle such cases, it appears most fruitful to follow the general primitive strategy that was described in Westling and Carone (2019) based on a revised causal identification formula allowing such coarsening.

It would be useful to develop tests of the monotonicity assumption, like Durot (2003) did for regression functions. Such a test could probably be developed by studying the large sample behaviour of $| {\bar{Ψ}}_{n} - Ψ_{n} |_{p}$ under the null hypothesis that $θ_{0}$ is monotone, where Ψ_n and ${\bar{Ψ}}_{n}$ are the primitive estimator and its greatest convex minorant as defined in Section 2.2. Such a result would probably permit testing with a given asymptotic size when $θ_{0}$ is strictly increasing, and asymptotically conservative inference otherwise. It would also be useful to develop methods for uniform inference. Uniform inference is difficult in this setting because ${n^{1 / 3} {θ_{n} (a) - θ_{0} (a)} : a \in A}$ does not convergence weakly as a process in the space $l^{\infty} (A)$ of bounded functions on $A$ to a tight limit process. Indeed, theorem 3 indicates that ${n^{1 / 3} {θ_{n} (a) - θ_{0} (a)} : a \in A}$ converges to an independent white noise process, which is not tight, so this convergence is not useful for constructing uniform confidence bands. Instead, it may be possible to extend the work of Durot et al. (2012) to our setting (and other generalized Grenander-type estimators) by demonstrating that $\log (n) [{n / \log (n)}^{1 / 3} {sup}_{a \in A_{n}} | θ_{n} (a) - θ_{0} (a) | / α_{0} - c_{n}]$ converges in distribution to a non-degenerate limit for some constant $α_{0}$ depending on $P_{0}$ ⁠, a deterministic sequence c_n and a suitable sequence of subsets $A_{n}$ increasing to $A$ ⁠. Developing procedures for uniform inference and tests of the monotonicity assumption are important areas for future research.

An alternative approach to estimating a causal dose–response curve is to use local linear regression, like Kennedy et al. (2017) did. As is true in the context of estimating classical univariate functions such as density, hazard and regression functions, there are certain trade-offs between local linear smoothing and monotonicity-based methods. On the one hand, local linear regression estimators exhibit a faster n^−2/5-rate of convergence whenever optimal tuning rates are used and the true function has two continuous derivatives. However, the limit distribution involves an asymptotic bias term depending on the second derivative of the true function, so confidence intervals based on optimally chosen tuning parameters provide asymptotically correct coverage only for a smoothed parameter rather than the true parameter of interest. In contrast, monotonicity-constrained estimators such as the estimator that is proposed here exhibit a rate of convergence of n^−1/3 whenever the true function is strictly monotone and has one continuous derivative, do not require choosing a tuning parameter, are invariant to strictly increasing transformations of the exposure and their limit theory does not include any asymptotic bias (as illustrated by theorem 2). We note that both estimators achieve the optimal rate of convergence for pointwise estimation of a univariate function under their respective smoothness constraints. In our view, the ability to perform asymptotically valid inference by using a monotonicity-constrained estimator is one of the most important benefits of leveraging the monotonicity assumption rather than using smoothing methods. This advantage was evident in our numerical studies when comparing the isotonic estimator that is proposed here and the local linear method of Kennedy et al. (2017). Undersmoothing can be used to construct calibrated confidence intervals by using kernel smoothing estimators, but performing adequate undersmoothing in practice is challenging.

The two methods for pointwise asymptotic inference that we presented require estimation of the derivative $θ_{0}^{'} (a)$ and the scale parameter $κ_{0} (a)$ ⁠. We found that the plug-in estimator of $κ_{0} (a)$ had low variance but possibly large bias depending on the levels of inconsistency of $μ_{n}$ and g_n, and that its doubly robust estimator instead had high variance but low bias as long as either $μ_{n}$ or g_n is consistent. In practice, we found that the low variance of the plug-in estimator often outweighed its bias, resulting in better coverage rates for intervals based on the plug-in estimator of $κ_{0} (a)$ ⁠, especially in samples of small and moderate sizes. Whether a doubly robust estimator of $κ_{0} (a)$ with smaller variance can be constructed is an important question to be addressed in future work. We found that sample splitting with as few as m = 5 splits provided doubly robust coverage, and the sample splitting estimator also had smaller variance than the original estimator, at the expense of some additional bias.

It would be even more desirable to have inferential methods that do not require estimation of additional nuisance parameters or sample splitting. Unfortunately, the standard non-parametric bootstrap is not generally consistent in Grenander-type estimation settings and, although alternative bootstrap methods have been proposed, to our knowledge, all such proposals require the selection of critical tuning parameters (Kosorok, 2008; Sen et al., 2010). Likelihood-ratio-based inference for Grenander-type estimators has proven fruitful in a variety of contexts (see, for example, Banerjee and Wellner (2001) and Groeneboom and Jongbloed (2015)), and extending such methods to our context is also an area of significant interest in future work.

Supporting information

Additional ‘supporting information’ may be found in the on-line version of this article:

‘Supplementary material for: Causal isotonic regression’.

Acknowledgements

Research reported in this publication was supported by the National Institute of Allergy and Infectious Diseases of the National Institutes of Health grant AI068635 (PBG) and by the National Lung, Heart, and Blood Institute of the National Institutes of Health grant R0IHL137808 (MC). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. The authors are also grateful to the members of the University of Pennsylvania Center for Causal Inference working group for helpful feedback on an early version of this work, and to two reviewers and an Associate Editor for their detailed and insightful comments. The authors also thank the study participants and investigators of the HIV Vaccine Trials Network studies 044, 052, 060, 063, 064, 068, 069, 070, 080, 100 and 204.

References

Ayer

,

M.

,

Brunk

,

H. D.

,

Ewing

,

G. M.

,

Reid

,

W. T.

and

Silverman

,

E.

(

1955

)

An empirical distribution function for sampling with incomplete information

.

Ann. Math. Statist.

,

26

,

641

–

647

.

Google Scholar

Crossref

WorldCat

Balabdaoui

,

F.

,

Jankowski

,

H.

,

Pavlides

,

M.

,

Seregin

,

A.

and

Wellner

,

J.

(

2011

)

On the Grenander estimator at zero

.

Statist. Sin.

,

21

,

873

–

899

.

Google Scholar

Crossref

WorldCat

Banerjee

,

M.

,

Durot

,

C.

and

Sen

,

B.

(

2019

)

Divide and conquer in nonstandard problems and the super-efficiency phenomenon

.

Ann. Statist.

,

47

,

720

–

757

.

Google Scholar

Crossref

WorldCat

Banerjee

,

M.

and

Wellner

,

J. A.

(

2001

)

Likelihood ratio tests for monotone functions

.

Ann. Statist.

29

,

1699

–

1731

.

Google Scholar

Crossref

WorldCat

Bang

,

H.

and

Robins

,

J. M.

(

2005

)

Doubly robust estimation in missing data and causal inference models

.

Biometrics

,

61

,

962

–

973

.

Barlow

,

R. E.

,

Bartholomew

,

D. J.

,

Bremner

,

J. M.

and

Brunk

,

H. D.

(

1972

)

Statistical Inference under Order Restrictions: the Theory and Application of Isotonic Regression

.

New York

:

Wiley

.

Google Scholar

Google Preview

OpenURL Placeholder Text

WorldCat

Belloni

,

A.

,

Chernozhukov

,

V.

,

Chetverikov

,

D.

and

Wei

,

Y.

(

2018

)

Uniformly valid post-regularization confidence regions for many functional parameters in Z-estimation framework

.

Ann. Statist.

,

46

, no. 6B,

3643

–

3675

.

Google Scholar

Crossref

WorldCat

Bind

,

M.-A. C.

and

Rubin

,

D. B.

(

2019

)

Bridging observational studies and randomized experiments by embedding the former in the latter

.

Statist. Meth. Med. Res.

,

28

,

1958

–

1978

.

Google Scholar

Crossref

WorldCat

Brunk

,

H. D.

(

1970

) Estimation of isotonic regression. In

Nonparametric Techniques in Statistical Inference

(ed.

M. L.

Puri

), pp.

177

–

197

.

London

:

Cambridge University Press

.

Google Scholar

Google Preview

OpenURL Placeholder Text

WorldCat

Chernoff

,

H.

(

1964

)

Estimation of the mode

.

Ann. Inst. Statist. Math.

,

16

,

31

–

41

.

Google Scholar

Crossref

WorldCat

Díaz

,

I.

and

van der Laan

,

M. J.

(

2011

)

Super learner based conditional density estimation with application to marginal structural models

.

Int. J. Biostatist.

,

7

,

1

–

20

.

Google Scholar

Crossref

WorldCat

Durot

,

C.

(

2003

)

A Kolmogorov-type test for monotonicity of regression

.

Statist. Probab. Lett.

,

63

,

425

–

433

.

Google Scholar

Crossref

WorldCat

Durot

,

C.

,

Kulikov

,

V. N.

and

Lopuhaä

,

H. P.

(

2012

)

The limit distribution of the

L_{\infty}

-error of Grenander-type estimators

.

Ann. Statist.

,

40

,

1578

–

1608

.

Google Scholar

Crossref

WorldCat

Gill

,

R. D.

and

Robins

,

J. M.

(

2001

)

Causal inference for complex longitudinal data: the continuous case

.

Ann. Statist.

,

29

,

1785

–

1811

.

Google Scholar

Crossref

WorldCat

Groeneboom

,

P.

and

Jongbloed

,

G.

(

2014

)

Nonparametric Estimation under Shape Constraints

.

New York

:

Cambridge University Press

.

Groeneboom

,

P.

and

Jongbloed

,

G.

(

2015

)

Nonparametric confidence intervals for monotone functions

.

Ann. Statist.

,

43

,

2019

–

2054

.

Google Scholar

Crossref

WorldCat

Groeneboom

,

P.

and

Wellner

,

J. A.

(

2001

)

Computing Chernoff’s distribution

.

J. Computnl Graph. Statist.

,

10

,

388

–

400

.

Google Scholar

Crossref

WorldCat

Horvitz

,

D. G.

and

Thompson

,

D. J.

(

1952

)

A generalization of sampling without replacement from a finite universe

.

J. Am. Statist. Ass.

,

47

,

663

–

685

.

Google Scholar

Crossref

WorldCat

Jin

,

X.

,

Morgan

,

C.

,

Yu

,

X.

,

DeRosa

,

S.

,

Tomaras

,

G. D.

,

Montefiori

,

D. C.

,

Kublin

,

J.

,

Corey

,

L.

,

Keefer

,

M. C.

and

NIAID HIV Vaccine Trials Network

(

2015

)

Multiple factors affect immunogenicity of DNA plasmid HIV vaccines in human clinical trials

.

Vaccine

,

33

,

2347

–

2353

.

Kennedy

,

E. H.

(

2019

)

Nonparametric causal effects based on incremental propensity score interventions

.

J. Am. Statist. Ass.

,

114

,

645

–

656

.

Google Scholar

Crossref

WorldCat

Kennedy

,

E. H.

,

Ma

,

Z.

,

McHugh

,

M. D.

and

Small

,

D. S.

(

2017

)

Non-parametric methods for doubly robust estimation of continuous treatment effects

.

J. R. Statist. Soc.

B,

79

,

1229

–

1245

.

Google Scholar

Crossref

WorldCat

Kosorok

,

M. R.

(

2008

) Bootstrapping the Grenander estimator. In

Beyond Parametrics in Interdisciplinary Research: Festschrift in Honor of Professor Pranab K. Sen

(eds

N.

Balakrishnan

,

E. A.

Peñ

and

M. J.

Silvapulle

), vol. 1, pp.

282

–

292

.

Hayward

:

Institute of Mathematical Statistics

.

Google Scholar

Google Preview

OpenURL Placeholder Text

WorldCat

Kulikov

,

V. N.

and

Lopuhaä

,

H. P.

(

2006

)

The behavior of the NPMLE of a decreasing density near the boundaries of the support

.

Ann. Statist.

,

34

,

742

–

768

.

Google Scholar

Crossref

WorldCat

van der Laan

,

M. J.

,

Bibaut

,

A.

and

Luedtke

,

A. R.

(

2018

) CV-TMLE for nonpathwise differentiable target parameters. In

Targeted Learning in Data Science: Causal Inference for Complex Longitudinal Studies

(eds

M. J.

van der Laan

and

S.

Rose

), pp.

455

–

481

.

New York

:

Springer

.

van der Laan

,

M. J.

,

Polley

,

E. C.

and

Hubbard

,

A. E.

(

2007

)

Super Learner.

Statist. Appl. Genet. Molec. Biol.

,

6

,

no. 1

.

Google Scholar

OpenURL Placeholder Text

WorldCat

van der Laan

,

M. J.

and

Robins

,

J. M.

(

2003

)

Unified Methods for Censored Longitudinal Data and Causality

.

New York

:

Springer Science and Business Media

.

van der Laan

,

M. J.

and

Rose

,

S.

(

2011

)

Targeted Learning: Causal Inference for Observational and Experimental Data

.

New York

:

Springer

.

Liu

,

F.

,

Guo

,

Z.

and

Dong

,

C.

(

2017

)

Influences of obesity on the immunogenicity of hepatitis b vaccine

.

Hum. Vacc. Immuntherp.

,

13

,

1014

–

1017

.

Google Scholar

Crossref

WorldCat

Neugebauer

,

R.

and

van der Laan

,

M.

(

2005

)

Why prefer double robust estimators in causal inference?

J. Statist. Planng Inf.

,

129

,

405

–

426

.

Google Scholar

Crossref

WorldCat

Neugebauer

,

R.

and

van der Laan

,

M. J.

(

2007

)

Nonparametric causal effects based on marginal structural models

.

J. Statist. Planng Inf.

,

137

,

419

–

434

.

Google Scholar

Crossref

WorldCat

Nolan

,

D.

and

Pollard

,

D.

(

1987

)

U-processes: rates of convergence

.

Ann. Statist.

,

15

,

780

–

799

.

Google Scholar

Crossref

WorldCat

Painter

,

S. D.

,

Ovsyannikova

,

I. G.

and

Poland

,

G. A.

(

2015

)

The weight of obesity on the human immune response to vaccination

.

Vaccine

,

33

,

4422

–

4429

.

Rerks-Ngarm

,

S.

,

Pitisuttithum

,

P.

,

Nitayaphan

,

S.

,

Kaewkungwal

,

J.

,

Chiu

,

J.

,

Paris

,

R.

,

Premsri

,

N.

,

Namwat

,

C.

,

de Souza

,

M.

,

Adams

,

E.

,

Benenson

,

M.

,

Gurunathan

,

S.

,

Tartaglia

,

J.

,

McNeil

,

J. G.

,

Francis

,

D. P.

,

Stablein

,

D.

,

Birx

,

D. L.

,

Chunsuttiwat

,

S.

,

Khambounruang

,

C.

,

Thongcharoen

,

P.

,

Robb

,

M. L.

,

Michael

,

N. I.

,

Kunasol

,

P.

and

Kim

,

J. H.

(

2009

)

Vaccination with ALVAC and AIDSVAX to prevent HIV-1 infection in Thailand

.

New Engl. J. Med.

,

361

,

2209

–

2220

.

Google Scholar

Crossref

WorldCat

Robins

,

J.

(

1986

)

A new approach to causal inference in mortality studies with a sustained exposure period—application to control of the healthy worker survivor effect

.

Math. Modllng

,

7

,

1393

–

1512

.

Google Scholar

Crossref

WorldCat

Robins

,

J. M.

(

2000

) Marginal structural models versus structural nested models as tools for causal inference. In

Statistical Models in Epidemiology, the Environment, and Clinical Trials

(eds

M. E.

Halloran

and

D.

Berry

), pp.

95

–

133

.

New York

:

Springer

.

Robins

,

J. M.

,

Rotnitzky

,

A.

and

Zhao

,

L. P.

(

1994

)

Estimation of regression coefficients when some regressors are not always observed

.

J. Am. Statist. Ass.

,

89

,

846

–

866

.

Google Scholar

Crossref

WorldCat

Rotnitzky

,

A.

,

Robins

,

J. M.

and

Scharfstein

,

D. O.

(

1998

)

Semiparametric regression for repeated outcomes with nonignorable nonresponse

.

J. Am. Statist. Ass.

,

93

,

1321

–

1339

.

Google Scholar

Crossref

WorldCat

Rubin

,

D.

and

van der Laan

,

M. J.

(

2006

)

Extending marginal structural models through local, penalized, and additive learning. Working Paper 212.

Division of Biostatistics, University of California at Berkeley

,

Berkeley

.

Scharfstein

,

D. O.

,

Rotnitzky

,

A.

and

Robins

,

J. M.

(

1999

)

Adjusting for nonignorable drop-out using semiparametric nonresponse models

.

J. Am. Statist. Ass.

,

94

,

1096

–

1120

.

Google Scholar

Crossref

WorldCat

Sen

,

B.

,

Banerjee

,

M.

and

Woodroofe

,

M.

(

2010

)

Inconsistency of the bootstrap: the Grenander estimator

.

Ann. Statist.

,

38

,

1953

–

1977

.

Google Scholar

Crossref

WorldCat

Sheridan

,

P. A.

,

Paich

,

H. A.

,

Handy

,

J.

,

Karlsson

,

E. A.

,

Hudgens

,

M. G.

,

Sammon

,

A. B.

,

Holland

,

L. A.

,

Weir

,

S.

,

Noah

,

T. L.

and

Beck

,

M. A.

(

2012

)

Obesity is associated with impaired immune response to influenza vaccination in humans

.

Int. J. Obesty

,

36

, no. 8,

1072

–

1077

.

Google Scholar

Crossref

WorldCat

van der Vaart

,

A. W.

and

Wellner

,

J. A.

(

1996

)

Weak Convergence and Empirical Processes

.

New York

:

Springer

.

Westling

,

T.

and

Carone

,

M.

(

2019

)

A unified study of nonparametric inference for monotone functions

.

Ann. Statist.

to be published

.

Google Scholar

OpenURL Placeholder Text

WorldCat

Westling

,

T.

,

van der Laan

,

M. J.

and

Carone

,

M.

(

2018

)

Correcting an estimator of a multivariate monotone function with isotonic regression

.

Preprint arXiv:1810.09022

.

University of Massachusetts Amherst

,

Amherst

.

Woodroofe

,

M.

and

Sun

,

J.

(

1993

)

A penalized maximum likelihood estimate of f(0+) when f is nonincreasing

.

Statist. Sin.

,

3

,

501

–

515

.

Google Scholar

OpenURL Placeholder Text

WorldCat

Young

,

K. M.

,

Gray

,

C. M.

and

Bekker

,

L.-G.

(

2013

)

Is obesity a risk factor for vaccine non-responsiveness?

PLOS One

,

8

,

no. 12

,

article e82779

.

Google Scholar

OpenURL Placeholder Text

WorldCat

Zhang

,

Z.

,

Zhou

,

J.

,

Cao

,

W.

and

Zhang

,

J.

(

2016

)

Causal inference with a quantitative exposure

.

Statist. Meth. Med. Res.

,

25

,

315

–

335

.

Google Scholar

Crossref

WorldCat

Zheng

,

W.

and

van der Laan

,

M. J.

(

2011

) Cross-validated targeted minimum loss based estimation. In

Targeted Learning: Causal Inference for Observational and Experimental Data

(eds

M. J.

van der Laan

and

S.

Rose

), ch. 27, pp.

459

–

473

.

New York

:

Springer

.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://dbpia.nl.go.kr/journals/pages/open_access/funder_policies/chorus/standard_publication_model)

Download all slides

Month:	Total Views:
February 2023	18
March 2023	51
April 2023	24
May 2023	31
June 2023	41
July 2023	33
August 2023	75
September 2023	51
October 2023	52
November 2023	32
December 2023	54
January 2024	51
February 2024	69
March 2024	91
April 2024	88
May 2024	102
June 2024	70
July 2024	47
August 2024	32
September 2024	43
October 2024	84
November 2024	88
December 2024	45
January 2025	49
February 2025	40
March 2025	71
April 2025	50
May 2025	11

Article Contents

Causal Isotonic Regression

Summary

1 Introduction

1.1 Motivation and literature review

1.2 Parameter of interest and its causal interpretation

1.3 Contribution and organization of the paper

2 Proposed approach

2.1 Review of isotonic regression

2.2 Definition of proposed estimator

3 Theoretical properties

3.1 Invariance to strictly increasing exposure transformations

3.2 Consistency

3.3 Convergence in distribution

3.4 Grenander-type estimation without domain transformation

3.5 Discrete domains

3.6 Large sample results for causal effects

3.7 Use of cross-fitting to avoid empirical process conditions

4 Construction of confidence intervals

4.1 Wald-type confidence intervals

4.2 Scale estimation relying on consistent nuisance estimation

4.3 Doubly robust scale estimation

4.4 Confidence intervals via sample splitting

5 Numerical studies

6 Body mass index and T-cell response in human immunodeficiency virus vaccine studies

7 Concluding remarks

Supporting information

Acknowledgements

References

Supplementary data

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

Article Contents

Causal Isotonic Regression

Summary

1 Introduction

1.1 Motivation and literature review

1.2 Parameter of interest and its causal interpretation

1.3 Contribution and organization of the paper

2 Proposed approach

2.1 Review of isotonic regression

2.2 Definition of proposed estimator

3 Theoretical properties

3.1 Invariance to strictly increasing exposure transformations

3.2 Consistency

3.3 Convergence in distribution

3.4 Grenander-type estimation without domain transformation

3.5 Discrete domains

3.6 Large sample results for causal effects

3.7 Use of cross-fitting to avoid empirical process conditions

4 Construction of confidence intervals

4.1 Wald-type confidence intervals

4.2 Scale estimation relying on consistent nuisance estimation

4.3 Doubly robust scale estimation

4.4 Confidence intervals via sample splitting

5 Numerical studies

6 Body mass index and T-cell response in human immunodeficiency virus vaccine studies

7 Concluding remarks

Supporting information

Acknowledgements

References

Supplementary data

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

This Feature Is Available To Subscribers Only