Abstract

Estimating dynamic treatment regimes (DTRs) from retrospective observational data is challenging as some degree of unmeasured confounding is often expected. In this work, we develop a framework of estimating properly defined ‘optimal’ DTRs with a time-varying instrumental variable (IV) when unmeasured covariates confound the treatment and outcome, rendering the potential outcome distributions only partially identified. We derive a novel Bellman equation under partial identification, use it to define a generic class of estimands (termed IV-optimal DTRs) and study the associated estimation problem. We then extend the IV-optimality framework to tackle the policy improvement problem, delivering IV-improved DTRs that are guaranteed to perform no worse and potentially better than a prespecified baseline DTR. Importantly, this IV-improvement framework opens up the possibility of strictly improving upon DTRs that are optimal under the no unmeasured confounding assumption (NUCA). We demonstrate via extensive simulations the superior performance of IV-optimal and IV-improved DTRs over the DTRs that are optimal only under the NUCA. In a real data example, we embed retrospective observational registry data into a natural, two-stage experiment with noncompliance using a differential-distance-based, time-varying IV and estimate useful IV-optimal DTRs that assign mothers to a high-level or low-level neonatal intensive care unit based on their prognostic variables.

1 Introduction

1.1 Dynamic treatment regime and instrumental variable

Estimating single-stage individualized treatment rules (ITRs) and the more general multiple-stage dynamic treatment regimes (DTRs) has attracted a lot of interest from diverse disciplines. Many estimation strategies have been proposed in the literature for data derived from randomized controlled trials or observational databases. Two main strands of work are Q-learning based on positing regression models for the outcome of interest at each stage based on study participants’ information and A-learning based on modelling the contrasts among treatments. Both methods can further be combined with classification tools to yield more robust and interpretable decision rules. Important examples of these methodologies include Murphy et al. (2001), Murphy (2003), Robins (2004), Chakraborty et al. (2010), Rubin and van der Laan (2012), Zhao et al. (2012), Zhao et al. (2015), Laber and Zhao (2015), Zhou et al. (2017), Tao et al. (2018), Zhang et al. (2018), Zhang and Zhang (2018), and Athey and Wager (2020), among many others. For a more comprehensive review of various methods, see, e.g., Moodie et al. (2007), Chakraborty and Murphy (2014), and Schulte et al. (2014).

Estimating optimal policies (ITRs or DTRs) can be challenging when data comes from observational databases (e.g., Medicare and Medicaid claims database, disease registry, etc) where some degree of unmeasured confounding is always expected. In these scenarios, an instrumental variable (IV) is a useful tool to infer the treatment effect (Angrist et al., 1996). An instrumental variable is valid if it is associated with the treatment (IV relevance), affects the outcome only through its association with the treatment (exclusion restriction) and is independent of the unobserved treatment-outcome confounding variables, possibly conditional on a rich set of observed covariates (IV unconfoundedness). We will rigorously define these IV identification assumptions in Section 2.

One subtlety in IV-based analyses is that even a valid IV cannot always identify the mean potential outcomes; rather, a valid IV along with appropriate, application-specific identification assumptions places certain restrictions on the potential outcome distributions, and this line of research is known as partial identification of probability distributions (Manski, 2003). This subtlety is inherited by the policy estimation problem using an IV. In particular, when the conditional average treatment effect (CATE) is not point identified from data, the optimal policy that maximizes the value function cannot be identified either, necessitating researchers to target alternative optimality criteria. While such criteria have been proposed in the single-stage setting from different perspectives (Cui & Tchetgen Tchetgen, 2020, 2021; Pu & Zhang, 2021), literature on the more general, multiple-stage setting is scarce.

Our first contribution in this article is to generalize the optimality criteria in the single-stage setting and develop a framework that is tailored to general sequential decision-making problems and incorporates rich information contained in a time-varying IV. This optimality criterion, termed IV-optimality, is based on a carefully weighted version of the partially identified Q-function and value function subject to the distributional constraints imposed by the IV. This criterion is distinct from the framework of Han (2019) that endows the collection of partially identified DTRs with a partial order and characterizes the set of maximal elements (see Zhang et al., 2020 for similar ideas). It is also distinct from Liao et al. (2021) that models the transition dynamics and assumes that the unmeasured treatment-outcome confounder is additive and not an effect modifier itself at any stage, or Shi et al. (2022) that utilizes Pearl’s (2009) ‘front-door criterion’ to aid point identification of optimal DTR in the presence of unmeasured treatment-outcome confounding. In the current work, we do not impose, a priori, assumptions that induce point identification. Importantly, our proposed set of optimality criteria collapse to that targeted by Liao et al. (2021) if the time-varying IV point identifies the treatment effect under an additional ‘additive unmeasured confounding’ assumption studied in Liao et al. (2021) and more general point-identification assumptions proposed and reviewed in Frangakis and Rubin (1999), Hernán and Robins (2006), Wang and Tchetgen Tchetgen (2018), and Swanson et al. (2018, Section 5). We take a hybrid approach of Q-learning (Schulte et al., 2014; Watkins & Dayan, 1992) and weighted classification (Zhang et al., 2012; Zhao et al., 2012) to target this optimality criterion and establish nonasymptotic rate of convergence of the proposed estimators.

The IV-optimality framework also motivates a conceptually simple yet informative variant framework that allows researchers to leverage a time-varying IV to improve upon a baseline DTR. The policy improvement problem was first considered in a series of papers by Kallus et al. (2019) and Kallus and Zhou (2020a, 2020b) under a ‘Rosenbaum-bounds-type’ sensitivity analysis model (Rosenbaum, 2002b). Despite its novelty and usefulness in a range of application scenarios, a sensitivity-analysis-based policy improvement framework does have a few limitations. In particular, each improved policy is indexed by a sensitivity parameter Γ that controls the degree of unmeasured confounding. Since the sensitivity parameter is not identified from the observed data, it is often unclear which improved policy best serves the purpose. Moreover, the ‘Rosenbaum-bounds-type’ sensitivity analysis model does not fully take into account unmeasured confounding heterogeneity (Bonvini & Kennedy, 2021; Heng & Small, 2020). In contrast, an IV-based policy improvement framework outputs data-driven partial identification intervals that are not indexed by sensitivity parameters. Additional information contained in the IV also opens up the possibility of strictly improving upon a policy that is optimal under the no unmeasured confounding assumption (NUCA) (Robins, 1992; Rosenbaum & Rubin, 1983), when the NUCA is in fact violated.

Our proposed method has two important applications. First, it is applicable when empirical researchers want to estimate useful DTRs from noisy observational data and have access to a time-varying IV at each stage, e.g., daily precipitation, differential distance, etc. Second, the framework described here is central to many sequential multiple assignment randomized trial (SMART) designs (Murphy, 2005) where interest lies in comparing adaptive interventions, and patients assigned to a particular treatment at each stage may not adhere to it. Noncompliance makes the treatment assigned at every stage a valid time-varying IV, and our proposed method provides a principled way to account for systematic noncompliance in SMART designs.

The rest of the article is organized as follows. We describe a real data application in Section 1.2. Section 2 reviews optimality criteria in the single-stage setting and provides a general IV-optimality framework that is amenable to being extended to the multiple-stage setting. Section 3 considers improving upon a baseline ITR with an IV. Building upon Sections 2 and 3, we describe an IV-optimality framework for policy estimation in the multiple-stage setting and extend this framework to tackle the policy improvement problem in Section 4. Section 5 studies the theoretical properties of the proposed methods. We report simulation results in Section 6 and revisit the application in Section 7. Section 8 concludes with a brief discussion. For brevity, all proofs are deferred to the Online Supplemental Materials S2 and S3.

1.2 Application: a natural, two-stage experiment derived from registry data

Lorch et al. (2012) conducted a retrospective cohort study to investigate the effect of delivery hospital on premature babies (gestational age between 23 and 37 weeks) and found a significant benefit to neonatal outcomes when premature babies were delivered at hospitals with high-level neonatal intensive care units (NICUs) compared to those without high-level NICUs. Lorch et al. (2012) used the differential travel time to the nearest high-level versus low-level NICU as an IV so that the outcome analysis is less confounded by mothers’ self-selection into hospitals with high-level NICUs. In their original study design, Lorch et al. (2012) further controlled for mothers’ neighbourhood characteristics, as well as their demographic information and variables related to delivery, in their matching-based study design so that the differential travel time was more likely to be a valid IV conditional on these variables. Put another way, the differential travel time created a natural experiment with noncompliance: Similar mothers who lived relatively close to a hospital with high-level NICUs were encouraged (but not forced) to deliver at a high-level NICU. More recently, Michael et al. (2020) considered mothers who delivered exactly two babies from 1996 to 2005 in Pennsylvania and investigated the cumulative effect of delivering at a high-level NICU on neonatal survival probability using the same differential travel time IV.

Currently, there is still limited capacity at high-level NICUs, so it is not practical to direct all mothers to these high-technology, high-volume hospitals. Understanding which mothers would most significantly benefit from delivering at a high-level NICU helps design optimal perinatal regionalization systems that designate hospitals by the scope of perinatal service provided and designate where infants are born or transferred according to the level of care they need at birth (Kroelinger et al., 2018; Lasswell et al., 2010). Indeed, previous studies seemed to suggest that although high-level NICUs significantly reduced deaths for babies of small gestational age, they made little difference for nearly mature babies, such as those delivered at 37 weeks (Yang et al., 2014).

Mothers who happened to relocate between two consecutive deliveries constitute a natural two-stage, two-arm randomized controlled trial with noncompliance and present a unique opportunity to investigate an optimal DTR. See Figure 1 for the directed acyclic graph (DAG) illustrating this application. We will revisit the application after developing a methodology precisely suited for this purpose.

Figure 1. A DAG illustrating the NICU application. Observed baseline covariates like mothers’ demographics, neighbourhood characteristics, etc., are omitted for clearer presentation.

2 Estimating individualized treatment rules with an instrumental variable

2.1 Optimal and NUCA-optimal ITRs

We first define the estimand of interest, an optimal ITR, under the potential outcome framework (Neyman, 1923; Rubin, 1974) and briefly review how to estimate the optimal ITR under the NUCA.

Consider a single-stage decision problem where one observes the covariates X = x ∈ 𝒳 ⊆ Rd, takes a binary action A = a ∈ {±1} and receives the response Y(a) ∈ 𝒴 ⊆ R. Here, Y(a) denotes the potential outcome under action a. This decision-making process is formalized by an individualized treatment rule (or a single-stage policy), which is a map π from prognostic variables x to a treatment decision a = π(x). Denote by pa*(⋅|x) the conditional law of Y(a) given the realization of X = x. The quality of π is quantified by its value:

Vπ = E[ EY∼pπ(X)*(⋅|X)[Y] ],   (1)

where the outer expectation is taken with respect to the law of X and the inner expectation with respect to the potential outcome distribution pa*(⋅|X), with the action a set to π(X). Intuitively, Vπ measures the expected value of the response if the population were to follow π. Let V* = maxπ∈Π Vπ denote the maximal value of an ITR when restricted to a policy class Π. An ITR is said to be optimal with respect to Π if it achieves V*.

A well-known result of Zhang et al. (2012) (see also Zhao et al., 2012) asserts the duality between value maximization and risk minimization, in the sense that any optimal ITR that maximizes the value Vπ also minimizes the following risk (and vice versa):

Rπ = E[ |C*(X)| ⋅ 1{π(X) ≠ sgn(C*(X))} ],   (2)

where C*(x) = EY∼p+1*(⋅|x)[Y] − EY∼p−1*(⋅|x)[Y] is the CATE. It is not hard to show that the sign of the CATE, sgn(C*(x)), is the Bayes ITR (i.e., the optimal ITR when Π consists of all Boolean functions), and therefore the risk Rπ admits a natural interpretation as a weighted misclassification error.

Suppose we have a dataset consisting of independent and identically distributed (i.i.d.) samples from the law of the triplet (Xobs, Aobs, Yobs). In parallel to the potential outcome distribution pa*(⋅|x), let paobs(⋅|x) denote the conditional law of {Yobs | Aobs = a, Xobs = x}. One can then define counterparts of the value and the risk as in (1) and (2), but with pa*(⋅|x) replaced by paobs(⋅|x). We define

Robs,π = E[ |Cobs(X)| ⋅ 1{π(X) ≠ sgn(Cobs(X))} ],   (3)

where Cobs(x) = EY∼p+1obs(⋅|x)[Y] − EY∼p−1obs(⋅|x)[Y]. The distribution paobs(⋅|x) and hence Robs,π are always identified from the observed data, thus rendering it feasible to minimize Robs,π using the observed data. Under a version of the no unmeasured confounding assumption, the two distributions pa*(⋅|X) and paobs(⋅|X) agree, and hence a minimizer of Robs,π also minimizes Rπ and is an optimal ITR. To make the distinction clear, we refer to any minimizer of Robs,π as a NUCA-optimal ITR (i.e., only optimal in the conventional sense under the NUCA) and denote it as πNUCA.
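To fix ideas, the sketch below (a minimal Python illustration using simulated data and scikit-learn regressors; the data-generating process and all variable names are ours and purely hypothetical) fits one outcome regression per arm, forms a plug-in estimate of Cobs(x), and takes its sign as a NUCA-optimal ITR.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Toy observational data (X, A, Y); unmeasured confounding is absent here,
# so the NUCA holds and sgn(C_obs) is a genuinely optimal ITR.
n = 2000
X = rng.normal(size=(n, 3))
A = np.where(rng.uniform(size=n) < 1 / (1 + np.exp(-X[:, 0])), 1, -1)
Y = X[:, 1] + A * (0.5 * X[:, 0] - X[:, 2]) + rng.normal(scale=0.5, size=n)

# Fit E[Y | A = a, X = x] separately for a = +1 and a = -1.
mu_plus = RandomForestRegressor(n_estimators=200, random_state=0).fit(X[A == 1], Y[A == 1])
mu_minus = RandomForestRegressor(n_estimators=200, random_state=0).fit(X[A == -1], Y[A == -1])

# Plug-in estimate of C_obs(x) and the NUCA-optimal ITR pi_NUCA(x) = sgn{C_obs(x)}.
C_obs = mu_plus.predict(X) - mu_minus.predict(X)
pi_NUCA = np.sign(C_obs)
```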

2.2 Instrumental variables and IV-optimal ITRs

The NUCA is often a heroic assumption for observational data. In the classical causal inference literature, an instrumental variable is a widely used tool to infer the causal effect from noisy observational data (Angrist et al., 1996; Hernán & Robins, 2006; Imbens, 2004; Rosenbaum, 2002a). Let Zi ∈ {−1, 1} denote a binary IV of unit i, Ai(Zi) the potential treatment received under IV assignment Zi, and Yi(Ai(Zi), Zi) the potential outcome under IV assignment Zi and treatment Ai. This definition implicitly assumes the Stable Unit Treatment Value Assumption, which states that a unit’s potential treatment status depends on its own IV assignment, the potential outcome status depends on its own IV and treatment assignments, and there are not multiple versions of the same IV or treatment. Other core IV assumptions include IV relevance: P(A = 1 | Z = 1, X) > P(A = 1 | Z = −1, X), exclusion restriction: Yi(z, a) = Yi(a) for all z, a and i, and IV unconfoundedness (possibly conditional on X): Z ⫫ {A(z), Y(z, a)} | X for z = ±1 and a = ±1. A variable Z is said to be a valid IV if it satisfies these core IV assumptions (Angrist et al., 1996; Baiocchi et al., 2014). There are many versions of core IV assumptions; in particular, the IV unconfoundedness assumption may be relaxed and replaced by (marginal, partial joint, or full joint) exchangeability assumptions (Swanson et al., 2018, Section 2). A valid IV satisfying core IV assumptions cannot point identify the mean conditional potential outcomes E[Y(±1) | X = x] or the CATE C*(x) (Manski, 2003; Robins & Greenland, 1996; Swanson et al., 2018); hence, neither the value nor the risk of an ITR is point identified even with a valid IV.

Although a valid IV may not point identify the causal effects, it may still be useful because even under minimal identification assumptions, a valid IV can yield meaningful partial identification intervals or bounds for the CATE. That is, one can construct two functions L(x) and U(x), both of which are functionals of the observed data distribution (and thus estimable from the data), such that the CATE satisfies C*(x) ∈ [L(x), U(x)] almost surely. Important examples include the natural bounds (Manski, 1990; Robins, 1989), Balke-Pearl bounds (Balke & Pearl, 1997), and Manski-Pepper bounds (Manski & Pepper, 2000). More recently, Finkelstein and Shpitser (2020) and Duarte et al. (2021) further developed optimization-based algorithms that output sharp partial identification bounds for IV models with observed covariates.
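For concreteness, the sketch below implements Manski-type natural bounds for an outcome known to lie in [0, 1] (e.g., a binary outcome), intersected over the two IV arms. The conditional quantities would in practice be replaced by fitted regression estimates at a covariate value x; all numbers shown are hypothetical.

```python
def natural_bounds(p_y1_and_a, p_a, lo=0.0, hi=1.0):
    """Manski-type natural bounds on E[Y(a) | X = x] for an outcome bounded
    in [lo, hi], intersected over the two IV arms z = -1, +1.

    p_y1_and_a[z] : estimate of E[Y * 1{A = a} | Z = z, X = x]
    p_a[z]        : estimate of P(A = a | Z = z, X = x)
    Both are dictionaries keyed by z in {-1, +1}.
    """
    lowers, uppers = [], []
    for z in (-1, +1):
        lowers.append(p_y1_and_a[z] + lo * (1.0 - p_a[z]))
        uppers.append(p_y1_and_a[z] + hi * (1.0 - p_a[z]))
    return max(lowers), min(uppers)  # intersection over z

# Hypothetical plug-in estimates at a fixed covariate value x:
Q_lo_plus, Q_hi_plus = natural_bounds({-1: 0.20, +1: 0.45}, {-1: 0.35, +1: 0.70})
Q_lo_minus, Q_hi_minus = natural_bounds({-1: 0.30, +1: 0.10}, {-1: 0.65, +1: 0.30})

# Bounds on the CATE C*(x) = E[Y(+1) | x] - E[Y(-1) | x]:
L_x = Q_lo_plus - Q_hi_minus   # -0.20
U_x = Q_hi_plus - Q_lo_minus   #  0.45
```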

Given the partial identification interval [L(x),U(x)], the risk of an ITR π defined in (2) is bounded between

and

(4)

Pu and Zhang (2021) argued that a sensible criterion is to minimize the expected worst-case risk R¯π, and the resulting minimizer is termed an IV-optimal ITR, as this ITR is ‘worst-case risk-optimal’ with respect to the partial identification interval induced by an IV and its associated identification assumptions; see also (Ben-Michael et al., 2021). Note that an IV-optimal ITR is not an optimal ITR without further assumptions.

Cui and Tchetgen Tchetgen (2021) proposed an alternative set of optimality criteria from the perspective of the partially identified value function. Instead of constructing two functions L(x), U(x) that bound the CATE, one may alternatively construct functions Q_(x,a), Q¯(x,a) that sandwich the conditional mean potential outcome E[Y(a) | X = x], and the value defined in (1) satisfies E[Q_(X, π(X))] ≤ Vπ ≤ E[Q¯(X, π(X))]. Cui and Tchetgen Tchetgen (2021) advocated maximizing some carefully chosen middle ground between the lower and upper bounds of the value:

(5)

where λ(x,a) is a prespecified function that captures a second-level individualism, i.e., how optimistic/pessimistic individuals with covariates X=x are.

Cui and Tchetgen Tchetgen (2021) pointed out that minimizing the maximum risk R¯π is not equivalent to maximizing the minimum value Vλ=0π, even when the bounds of the CATE are obtained via bounds on the Q-functions, i.e., L(x) = Q_(x,+1) − Q¯(x,−1) and U(x) = Q¯(x,+1) − Q_(x,−1). Rather, minimizing R¯π is equivalent to maximizing the midpoint of the minimum and maximum values, namely Vλ=1/2π.
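In the unrestricted policy class, minimizing the worst-case risk therefore amounts to taking the sign of the interval midpoint (ties broken arbitrarily). A minimal sketch, continuing the hypothetical CATE bounds from the previous snippet:

```python
import numpy as np

def iv_optimal_minmax(L, U, default=+1):
    """Min-max-risk (lambda = 1/2) ITR in the unrestricted class:
    take the sign of the interval midpoint; break exact ties with `default`."""
    mid = 0.5 * (np.asarray(L, dtype=float) + np.asarray(U, dtype=float))
    action = np.sign(mid)
    return np.where(action == 0, default, action).astype(int)

# With the hypothetical CATE bounds above, L = -0.20 and U = 0.45, the
# midpoint is positive, so the min-max rule assigns a = +1.
print(iv_optimal_minmax([-0.20], [0.45]))  # [1]
```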

2.3 A general IV-optimality framework

Conceptually, a valid IV and the associated identification assumptions impose distributional constraints on the potential outcome distributions pa*(⋅|x). Under assumptions in Cui and Tchetgen Tchetgen (2020) and Qiu et al. (2020), pa*(⋅|x) can be expressed as functionals of the observed data distribution. Alternatively, if weaker assumptions are imposed and preclude point identification, the partial identification results assert that pa*(⋅|x) is ‘weakly bounded’ in the sense that for a sufficiently regular function f, we can find two functions Q_(x,a;f) and Q¯(x,a;f) such that

Q_(x,a;f) ≤ EY∼pa*(⋅|x)[f(Y)] ≤ Q¯(x,a;f).   (6)

If f is the identity function, then the above display is precisely the partial identification intervals of the mean conditional potential outcome E[Y(a)|X=x] in (5). In both cases, a valid IV with identification assumptions specifies a collection of distributions Px,a such that pa*(⋅|x) ∈ Px,a. In the point-identification case, the set Px,a is a singleton consisting only of the ground truth potential outcome distribution pa*(⋅|x); in the partial identification case, Px,a consists of all distributions that are weakly bounded in the sense of (6). In other words, Px,a is the collection of all possible potential outcome distributions pa*(⋅|x) that are compatible with the putative IV and associated IV identification assumptions, and we refer to it as an IV-constrained set. We may further equip an IV-constrained set Px,a with a prior distribution Px,a. Here, Px,a is a probability distribution on Px,a, the latter of which is itself a collection of probability distributions. To have a fully rigorous treatment, we equip Px,a with a metric (e.g., Wasserstein metric) and work with the induced Borel sigma algebra (Parthasarathy, 2005).

Given the prior distributions {Px,a : x ∈ 𝒳, a = ±1}, we can define the IV-constrained value of a policy π as

(7)

where we write pπ(x) = pπ(x)(⋅|x) for notational simplicity. From now on, we will refer to the collection of criteria given by maximizing VPπ as IV-optimality. An ITR is said to be IV-optimal with respect to the prior distributions {Px,a : x ∈ 𝒳, a = ±1} and a policy class Π if it maximizes VPπ among all π ∈ Π. The flow chart in Figure 2 summarizes this conceptual framework.

Figure 2. A schematic plot of IV-optimality. Prior distributions {Px,a : x ∈ Rd, a = ±1} imposed on the IV-constrained sets {Px,a : x ∈ Rd, a = ±1} give rise to the IV-constrained value VPπ. An IV-optimal ITR maximizes VPπ over a policy class Π.

The formulation in (5) can be recovered by carefully choosing the prior distributions. In particular, let p_a(⋅|x) and p¯a(⋅|x) be distributions that witness the partial identification bounds Q_(x,a) and Q¯(x,a), respectively:

EY∼p_a(⋅|x)[Y] = Q_(x,a)   and   EY∼p¯a(⋅|x)[Y] = Q¯(x,a).   (8)

Then, criterion (5) is recovered by considering the following two-point priors:

where δp is a point mass at p. As discussed near the end of Section 2.2, setting λ(x,a) uniformly equal to 1/2 recovers the original IV-optimality criterion considered in Pu and Zhang (2021) that minimizes the worst-case risk (4). In fact, under certain regularity conditions, one can show that the reverse is also true: the formulation (7) for a specified collection of prior distributions can also be recovered from (5) by a careful choice of λ. In view of such an equivalence, criterion (7) should not be regarded as a generalization of (5). Rather, it is a convenient tool amenable to being generalized to the multiple-stage setting. A proof of this equivalence statement will appear in Section 4.
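The role of the two-point prior can be checked numerically: placing prior mass w on the lower-bound-witnessing distribution and 1 − w on the upper-bound-witnessing one makes the prior-averaged mean equal the corresponding convex combination of Q_(x,a) and Q¯(x,a). A toy Monte Carlo verification (all distributions and numbers below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)

# Two discrete distributions on the same support, standing in for the
# bound-witnessing distributions in (8).
support = np.array([0.0, 1.0, 2.0])
p_lower = np.array([0.6, 0.3, 0.1])   # mean 0.5, witnesses the lower bound
p_upper = np.array([0.1, 0.3, 0.6])   # mean 1.5, witnesses the upper bound
w = 0.7                               # prior mass on the lower-bound distribution

# Monte Carlo evaluation of the prior-averaged mean: first draw a distribution
# from the two-point prior, then draw an outcome from that distribution.
draws = []
for _ in range(100_000):
    p = p_lower if rng.uniform() < w else p_upper
    draws.append(rng.choice(support, p=p))
mc_mean = np.mean(draws)

# Analytically, the prior-averaged mean equals the convex combination of the
# bounds, which is how a two-point prior recovers the weighted criterion in (5).
print(mc_mean, w * (support @ p_lower) + (1 - w) * (support @ p_upper))  # both ~0.8
```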

3 Improving individualized treatment rules with an instrumental variable

Compared to estimating an optimal ITR, a less ambitious goal is to improve upon a baseline ITR πb, so that the improved ITR is no worse and potentially better than πb. In this section, we show how to achieve this goal with an IV in the single-stage setup.

The goal of ‘never being worse’ is reminiscent of the min-max risk criterion in (4), and it is natural to consider minimizing the maximum excess risk with respect to the baseline ITR πb, subject to the IV-informed partial identification constraints. Definition 1 summarizes this strategy.

Risk-based IV-improved ITR

 
Definition 1

Let πb denote a baseline ITR, Π a policy class, and [L(x),U(x)] an IV-informed partial identification interval of the CATE C*(x). A risk-based IV-improved ITR is any solution to the optimization problem below:

(9)

When Π consists of all Boolean functions, (9) admits the following explicit solution.

Formula for risk-based IV-improved ITR

 
Proposition 1

Define

πR(x) = +1 if L(x) > 0;   πR(x) = −1 if U(x) < 0;   πR(x) = πb(x) otherwise.   (10)

Then, πR is a risk-based IV-improved ITR when Π consists of all Boolean functions.

The ITR in (10) admits an intuitive explanation: it takes action +1 when the partial identification interval is positive (i.e., 0 < L(x) ≤ U(x)), takes action −1 when the interval is negative (i.e., L(x) ≤ U(x) < 0), and otherwise follows the baseline ITR πb.
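This explicit rule is straightforward to implement once interval estimates are available; a minimal sketch is given below (the value-based rule of Proposition 2 further down takes the same form with LV and UV in place of L and U). The interval estimates and baseline rule are hypothetical.

```python
import numpy as np

def iv_improved_itr(L, U, pi_baseline):
    """Risk-based IV-improved ITR of Proposition 1 (unrestricted policy class):
    follow the interval when it excludes zero, otherwise defer to the baseline."""
    L, U, pi_baseline = map(np.asarray, (L, U, pi_baseline))
    action = np.where(L > 0, 1, np.where(U < 0, -1, pi_baseline))
    return action.astype(int)

# Hypothetical interval estimates for three covariate values and a baseline
# rule that always treats (+1):
print(iv_improved_itr(L=[0.2, -0.5, -0.1], U=[0.6, -0.1, 0.3], pi_baseline=[1, 1, 1]))
# -> [ 1 -1  1]
```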

A second criterion closely related to that in Definition 1 maximizes the minimum excess value with respect to πb, subject to the distributional constraints imposed by the putative IV and its associated identification assumptions.

Value-based IV-improved ITR

 
Definition 2

Let πb denote a baseline ITR, Π a policy class, and {Px,a : x ∈ Rd, a = ±1} a collection of IV-constrained sets. A value-based IV-improved ITR is any solution to the following optimization problem:

(11)

The above formulation is referred to as distributionally robust optimization in the optimization literature (Delage & Ye, 2010). When the IV-constrained sets are derived from partial identification intervals [Q_(x,a),Q¯(x,a)] and Π consists of all Boolean functions, the optimization problem (11) admits the explicit solution below, analogous to the one given in Proposition 1.

Formula for value-based IV-improved ITR

 
Proposition 2

Assume p_a(⋅|x) and p¯a(⋅|x), the two distributions defined in (8) that witness the partial identification bounds Q_(x,a), Q¯(x,a), are both inside Px,a. Define

πV(x) = +1 if LV(x) > 0;   πV(x) = −1 if UV(x) < 0;   πV(x) = πb(x) otherwise,   (12)

where LV(x) = Q_(x,+1) − Q¯(x,−1) and UV(x) = Q¯(x,+1) − Q_(x,−1). Then, πV is a value-based IV-improved policy when Π consists of all Boolean functions.

Propositions 1 and 2 together reveal an interesting duality between worst-case excess risk minimization and worst-case excess value maximization. When the partial identification interval for the CATE is derived directly from the partial identification intervals for the mean conditional potential outcomes, so that LV=L and UV=U, then the two ITRs defined in (10) and (12) agree, and they simultaneously satisfy the two IV-improvement criteria presented in Definitions 1 and 2. Hence, we will not distinguish between two types of IV-improved ITRs henceforth.

Recall that a similar duality exists in the classical policy estimation problem under the NUCA, but disappears when extended to the partial identification settings. It is thus curious to see such a duality restored in the policy improvement problem.

4 Estimating dynamic treatment regimes with an instrumental variable

4.1 Optimal and SRA-optimal DTRs

We now consider a general K-stage decision-making problem (Murphy, 2003; Schulte et al., 2014). At the first stage k = 1, we are given baseline covariates X1 = x1 ∈ 𝒳, make a decision a1 = π1(x1), and observe a reward R1(a1) = r1. At stage k = 2, …, K, let āk−1 = (a1, …, ak−1) denote the treatment decisions from stage 1 up to k − 1, and R̄k−1(āk−1) = (R1(a1), R2(ā2), …, Rk−1(āk−1)) the rewards from stage 1 to k − 1. Meanwhile, denote by Xk(āk−1) new covariate information (e.g., time-varying covariates, auxiliary outcomes, etc.) that would arise after stage k − 1 but before stage k had the participant followed the treatment decisions āk−1, and X̄k(āk−1) = (X1, X2(a1), …, Xk(āk−1)) the entire covariate history up to stage k. Given R̄k−1 = r̄k−1 and X̄k = x̄k, a data-driven decision ak = πk(āk−1, r̄k−1, x̄k) is then made and we observe reward Rk(āk) = rk. Our goal is to estimate a DTR π = {πk}k=1,…,K such that the expected value of the cumulative reward r1 + ⋯ + rK is maximized if the population were to follow π.

To simplify the notations, the historical information available for making the decision at stage k is denoted by H1 = X1 for k = 1 and Hk = (āk−1, R̄k−1(āk−1), X̄k(āk−1)) for any 2 ≤ k ≤ K. The information Hk consists of Hk−1 and new information: Hk−1 is the information available at the previous stage, and (ak−1, Rk−1(āk−1), Xk(āk−1)) is the new information generated after decision ak−1 is made. Let paK*(⋅|hK) be the conditional law of RK(āK) given a specific realization of the historical information HK = hK. We define the following action-value function (or Q-function) at stage K:

QK(hK, aK) = ERK∼paK*(⋅|hK)[RK],   (13)

where we emphasize again that the expectation is taken over the potential outcome distribution paK*(⋅|hK). For a policy π, its value function at stage K is then defined as VKπ(hK) = QK(hK, πK(hK)), which depends on π only through πK.

Next, we define Q-functions and value functions at a generic stage k ∈ [K] recursively (denote [K] = {1, …, K}). In particular, for k = K−1, …, 1, we let pak*(⋅,⋅|hk) denote the joint law of the reward Rk(āk) and new covariate information Xk+1(āk) observed immediately after decision ak has been made, conditional on Hk = hk. The Q-function of π at stage k is then defined as:

Qkπ(hk, ak) = E(Rk, Xk+1)∼pak*(⋅,⋅|hk)[ Rk + Vk+1π(Hk+1) ],   (14)

where Hk+1 = (hk, ak, Rk, Xk+1) is a function of Rk and Xk+1. The corresponding value function of π is taken to be Vkπ(hk) = Qkπ(hk, πk(hk)). Similar to (13), the expectation is taken over the potential outcome distribution pak*(⋅,⋅|hk). We interpret the Q-function Qkπ(hk, ak) as the cumulative rewards collected by executing ak at stage k and then following π from stage k+1 onwards, and the value function Vkπ(hk) as the cumulative rewards collected by executing π from stage k onwards. Therefore, Qkπ depends on π only via π(k+1):K = (πk+1, …, πK) and Vkπ depends on π only via πk:K = (πk, …, πK). For notational simplicity, we interpret quantities whose subscripts do not make sense (e.g., a0, r0, and xK+1) as ‘null’ quantities, and their occurrences in mathematical expressions will be disregarded.

With a slight abuse of notation, we let Π = Π1 × ⋯ × ΠK be a policy class. A DTR is said to be optimal with respect to Π if it maximizes V1π(x1) for all fixed x1 (and hence E[V1π(X1)] for an arbitrary covariate distribution) over the policy class Π.

Assume for now that Π consists of all DTRs (i.e., each Πk consists of all Boolean functions). A celebrated result from control theory states that the dynamic programming approach below, also known as backward induction, yields an optimal DTR π* (Murphy, 2003; Sutton & Barto, 2018):

πK*(hK) ∈ argmaxaK∈{±1} QK(hK, aK);   πk*(hk) ∈ argmaxak∈{±1} Qkπ*(hk, ak),   k = K−1, …, 1.   (15)

In other words, π* satisfies Vkπ*(hk) = maxπ Vkπ(hk) for any k ∈ [K] and historical information hk, and is well defined as Qkπ* depends on π* only through π(k+1):K*.

The foregoing discussion is based on the potential outcome distributions. Suppose that we have collected i.i.d. data from the law of the random trajectory (Xkobs, Akobs, Rkobs), k = 1, …, K. Let {pakobs(⋅,⋅|hk)}, k = 1, …, K−1, be the conditional laws of the observed rewards and new covariate information identified from the observed data:

Here, Hkobs denotes the observed historical information up to stage k. Suppose that we obtain a DTR using the dynamic programming approach described in (15) with pakobs in place of pak*. Under a version of the sequential randomization assumption (SRA) (Robins, 1998) and with additional assumptions of consistency and positivity, we have pak* = pakobs, from which it follows that the DTR is optimal (Murphy, 2003; Schulte et al., 2014). We will refer to this policy as an SRA-optimal DTR and denote it as πSRA.
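For later comparison, the sketch below carries out this classical backward induction with regression-based estimates of pakobs for K = 2; it is a plain Q-learning routine under the SRA, with a simulated data set and model choices that are purely illustrative.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def fit_q(H, A, target):
    """Regress the target on (history, action) to approximate a Q-function."""
    model = GradientBoostingRegressor(random_state=0)
    model.fit(np.column_stack([H, A]), target)
    return model

def predict_q(model, H, a):
    return model.predict(np.column_stack([H, np.full(len(H), a)]))

# Toy two-stage observational data under the SRA: H1 = X1, H2 = (X1, A1, R1, X2).
rng = np.random.default_rng(0)
n = 3000
X1 = rng.normal(size=(n, 2))
A1 = rng.choice([-1, 1], size=n)
R1 = 0.3 * X1[:, 0] + 0.5 * A1 * X1[:, 1] + rng.normal(scale=0.3, size=n)
X2 = X1[:, [0]] + 0.2 * A1[:, None] + rng.normal(scale=0.3, size=(n, 1))
A2 = rng.choice([-1, 1], size=n)
R2 = 0.4 * X2[:, 0] + 0.6 * A2 * X2[:, 0] + rng.normal(scale=0.3, size=n)

H1 = X1
H2 = np.column_stack([X1, A1, R1, X2])

# Stage 2: fit Q2(h2, a2) on R2 and take the greedy action.
q2 = fit_q(H2, A2, R2)
V2 = np.maximum(predict_q(q2, H2, +1), predict_q(q2, H2, -1))
pi2 = np.where(predict_q(q2, H2, +1) >= predict_q(q2, H2, -1), 1, -1)

# Stage 1: fit Q1(h1, a1) on the pseudo-outcome R1 + V2 and take the greedy action.
q1 = fit_q(H1, A1, R1 + V2)
pi1 = np.where(predict_q(q1, H1, +1) >= predict_q(q1, H1, -1), 1, -1)
```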

4.2 Identification with a time-varying instrumental variable

Suppose that, in addition to (Xkobs, Akobs, Rkobs), k = 1, …, K, we also observe a binary, time-varying instrumental variable {Zk}, k = 1, …, K, at each stage k and write Z̄k = (Z1, …, Zk). The time-varying IV {Zk} is said to be valid if it satisfies the longitudinal generalizations of the core IV assumptions, including IV relevance: P(Ak = 1 | Zk = 1, Hkobs) > P(Ak = 1 | Zk = −1, Hkobs) for all k, exclusion restriction: Rk(āk, z̄k) = Rk(āk) for all k and āk, and IV unconfoundedness: Zk ⫫ {ĀK(z̄K), R̄K(āK, z̄K)}. The IV unconfoundedness assumption holds by design in sequential randomized trials and is more likely to hold conditional on historical information Hkobs for IV studies using observational data. The IV unconfoundedness assumption may also be replaced by weaker exchangeability assumptions discussed in Swanson et al. (2018, Section 2).

It is essential to have a time-varying IV (e.g., daily precipitation) rather than a time-independent IV (e.g., a genetic variant like the sickle cell trait) in this setting. To illustrate this point, Figure 3 exhibits a causal directed acyclic graph (DAG) for K = 2: Z1 and Z2 are time-varying IVs, A1 and A2 the treatments received, R1 and R2 the outcomes, and U1, U2, and U unmeasured confounders. We omit any observed covariates for clearer presentation. Figure 3 explicitly differentiates between two types of unmeasured confounding in the DAG: U1 and U2 represent unmeasured confounding specific to the first and second stages, and U represents unmeasured confounding shared between both stages. When studying the effect of A2 on R2, Z2 is a valid IV for A2; on the contrary, Z1 is not a valid IV for A2, because conditioning on A1 induces association between Z1 and U (represented by the dashed line in the DAG), thus violating the IV unconfoundedness assumption (Hernán et al., 2004), and not conditioning on A1 (i.e., marginalizing over A1) violates the exclusion restriction assumption; see, e.g., Verma and Pearl (2022).

Figure 3. Directed acyclic graph (DAG) when K = 2 with time-varying IVs. U1 (and U2) represents stage k = 1 (and stage k = 2) unmeasured confounding, and U shared unmeasured confounding. Z1 is an IV for A1 and Z2 an IV for A2. Z1 is not a valid IV for A2 because conditioning on A1 induces association between Z1 and U, and not conditioning on A1 violates the exclusion restriction.

Point identification of Ak’s effect on Rk conditional on historical information Hkobs is possible under some version of a homogeneity or no-interaction assumption. Michael et al. (2020, Section 3) describe sufficient conditions for point identification under a latent variable formulation of an IV model. Their conditions extend those in Wang and Tchetgen Tchetgen (2018) from static to longitudinal settings and generalize the ‘additive unmeasured confounder’ assumption studied in Liao et al. (2021). Michael et al.’s (2020) proposed point identification assumptions essentially entail the further restriction that Zk cannot have an additive interaction with any component of unmeasured confounding in affecting the actual treatment received. For instance, in Figure 3, Z2 cannot interact with U2 or U and have an additive effect on A2 conditional on H2obs, so that all systematic differences in a person’s compliance type at each stage k are completely accounted for by the historical information up to stage k.

In the rest of the article, we focus on developing a framework that accommodates both point and partial identification. In the context of estimating an ITR, Zhang and Pu (2020) and Pu and Zhang (2021) discussed some desirable features of a partial identification strategy in IV studies. Partial identification allows an IV-optimal ITR to be well defined under minimal IV identification assumptions and to be amenable to different choices of an IV in empirical studies. Moreover, the quality of the estimated IV-optimal ITR adapts to the quality of the selected IV (Pu & Zhang, 2021, Remarks 5 and 6). These considerations remain relevant in estimating a DTR. In a multiple-stage, IV-assisted decision problem, point identification requires positing no-interaction-type assumptions at each stage k. It is likely that the no-interaction-type assumption is deemed reasonable towards later stages, after rich covariate history data have been collected, but not at initial stages, further necessitating a comprehensive framework covering both modes of identification.

4.3 IV-optimality for DTRs and dynamic programming under partial identification

Similar to the single-stage setting in Section 2.3, the IV at each stage, along with its associated identification assumptions, imposes distributional constraints on the potential outcome distributions pak*(⋅,⋅|hk). That is, we can specify, for each action ak and historical information hk at stage k, an IV-constrained set Phk,ak that contains the ground truth potential outcome distribution pak*(⋅,⋅|hk). We treat Phk,ak as a generic set of distributions compatible with the IV and identification assumptions for now; concrete examples will be given when we formally describe estimation procedures in Section 4.5.

Given the IV-constrained set Phk,ak, we impose a prior distribution Phk,ak on it. This notion generalizes the single-stage setting in Section 2.3. We use pak(⋅,⋅|hk) ∼ Phk,ak to denote sampling a distribution pak(⋅,⋅|hk) ∈ Phk,ak from the prior Phk,ak. We will use the shorthand pak for pak(⋅,⋅|hk) where there is no ambiguity.

Under these notations, we introduce IV-constrained counterparts of the conventional Q- and value functions defined in (13) and (14).

IV-constrained Q- and value function

 
Definition 3

For each stage k ∈ [K], each action ak ∈ {±1}, and each configuration of the historical information Hk = hk, let Phk,ak be a prior distribution on the IV-constrained set Phk,ak. The IV-constrained Q-function and the corresponding value function of a DTR π with respect to the collection of prior distributions {Phk,ak} at stage K are

respectively. Recursively, at stage k = K−1, …, 1, the IV-constrained Q-function and the corresponding value function are

respectively, where Hk+1=(hk,ak,Rk,Xk+1) is a function of Rk and Xk+1.

Similar to the conventional Q- and value functions, the IV-constrained Q-function QP,kπ depends on π only through π(k+1):K and the IV-constrained value function VP,kπ depends on π only through πk:K.

Definition 4 follows from Definition 3 and defines IV-optimality for DTRs.

IV-optimal DTR

 
Definition 4

A DTR π* is said to be IV-optimal with respect to the collection of prior distributions {Phk,ak} and a policy class Π if it satisfies

VP,1π*(h1) = maxπ∈Π VP,1π(h1)   (16)

for every fixed h1 = x1 ∈ 𝒳.

Theorem 1 proves that an IV-optimal policy π* that maximizes the IV-constrained value function for every x1 exists and can be obtained via a modified dynamic programming algorithm when Π consists of all policies.

Dynamic programming for the IV-optimal DTR

 
Theorem 1

Let π* be recursively defined as follows:

Then, the DTR π* satisfies

VP,kπ*(hk) = maxπ VP,kπ(hk)   (17)

for any stage k and any configuration of the historical information hk, where the maximization is taken over all DTRs.

Theorem 1 generalizes the classical dynamic programming algorithm under a partial identification scheme. The policy π* in Theorem 1 is always well defined, as QP,kπ* depends on π* only through π(k+1):K*.

4.4 An alternative characterization of IV-optimal DTRs

Estimation of an IV-optimal DTR based on Theorem 1 is difficult, as the definitions of the IV-constrained Q- and value functions involve integration over the prior distributions {Phk,ak}, which may be difficult to specify. Recall that in the single-stage setting, there is an equivalence between the IV-constrained value (7) and the convex combination of the lower and upper bounds of value (5). The convex combination is easier to compute provided that the weights are specified. We now extend this equivalence relationship to the multiple-stage setting to get rid of the intractable integration over prior distributions and facilitate a practical estimation strategy.

We start by defining the worst-case and best-case IV-constrained Q- and value functions as well as their weighted versions as follows.

Worst-case, best-case, and weighted Q- and value functions

 
Definition 5

Let the prior distributions {Phk,ak} be specified and {λk(hk,ak)} be a collection of weighting functions taking values in [0,1]. The worst-case, best-case, and weighted Q-functions at stage K with respect to {Phk,ak} and {λk(hk,ak)} are defined as

(18)
(19)

The corresponding worst-case, best-case, and weighted value functions, denoted as V_Kπ(hK), V¯Kπ(hK), and Vλ,Kπ(hK), respectively, are obtained by setting aK = πK(hK) in the above displays. Recursively, at stage k = K−1, …, 1, we define the worst-case, best-case, and weighted Q-functions as

(20)
(21)

respectively. Again, replacing ak in the above Q-functions with πk(hk) yields their corresponding value functions V_λ,kπ,V¯λ,kπ, and Vλ,kπ.

Definition 5 generalizes criterion (5). According to Definition 5, the worst-case and best-case Q-functions Q_λ,kπ and Q¯λ,kπ depend on the weighting functions only through λt(ht,at) for k+1 ≤ t ≤ K, whereas the corresponding value functions depend on the weighting functions only through λt(ht,at) for k ≤ t ≤ K. For notational simplicity, we will add the superscript π and subscript λ to Q-functions at stage K (e.g., we write Q_K = Q_λ,Kπ), although they have no dependence on π or the weighting functions.

Proposition 3 establishes the equivalence between the weighted Q-functions (resp. value functions) and their IV-constrained counterparts in Definition 3.

Equivalence between weighted and IV-constrained Q- and value functions

 
Proposition 3

The following two statements hold:

  • Fix any collection of weighting functions {λk(hk,ak)}. Assume that in Definition 5, the infimums and supremums when defining weighted Q- and value functions are all attained. Then, there exists a collection of prior distributions {Phk,ak} such that

  • Conversely, if any collection of prior distributions {Phk,ak} is fixed, then there exists a collection of weighting functions {λk(hk,ak)} such that the above display holds true.

Proposition 3 effectively translates the problem of specifying a collection of prior distributions into that of specifying a collection of weighting functions, thus allowing one to bypass integration over prior distributions. Combining Proposition 3 with Theorem 1, we obtain the following alternative characterization of the IV-optimal DTR.

Alternative characterization of the IV-optimal DTR

 
Corollary 1

Under the setting in Part 1 of Proposition 3, if we recursively define π* as

(22)

where the two quantities

depend on π* only through π(k+1):K* and are hence well-defined, then π* satisfies (17) for any stage k and any configuration of the historical information hk. In particular, π* is an IV-optimal DTR in the sense of Definition 4 when Π consists of all Boolean functions.

As an alternative to Theorem 1, Corollary 1 gives an analytic and constructive characterization of the IV-optimal DTR.

 
Example 1

Figure 4 illustrates the decision process when K=2. At the second stage, given h2+ = (h1, a1 = +1, R1(+1) = r1+, X2(+1) = x2+) and λ2(h2+, ±1), an IV-optimal action is made based on comparing the two weighted Q-functions Qλ,2(h2+, +1) = 0.8 and Qλ,2(h2+, −1) = 0.5, and we have π2*(h2+) = +1. Analogously, we have π2*(h2−) = −1.

Given the knowledge of π2*, we can compute the weighted value function Vλ,2π*(⋅) of π* at the second stage. This allows us to construct a ‘pseudo-outcome’ at the first stage, denoted as PO1(r1, x2 | h1, a1) = r1 + Vλ,2π*(h1, a1, r1, x2). Importantly, PO1 depends on π2*, as it is the cumulative rewards if we observe h1 at the first stage, take an immediate action a1, and then act according to the IV-optimal decision π2* at the second stage.

The partial identification interval for the expected value of PO1(R1(a1), X2(a1) | h1, a1), where the expectation is taken over the potential outcome distribution of R1(a1) and X2(a1), is precisely [Q_λ,1π*(h1, a1), Q¯λ,1π*(h1, a1)], the interval formed by the worst-case and best-case Q-functions. We can then compute the weighted Q-function Qλ,1π*(h1, a1) given λ1(h1, a1), from which we can decide π1*. Reading the numbers off Figure 4, we conclude that π1*(h1) = −1.
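The arithmetic in Example 1 can be traced in a few lines; the stage-1 numbers below are hypothetical stand-ins for the values displayed in Figure 4.

```python
# Stage 2, history h2+ : weighted Q-functions from Example 1.
Q2_plus, Q2_minus = 0.8, 0.5
pi2_h2_plus = 1 if Q2_plus >= Q2_minus else -1       # = +1, as in the example
V2_h2_plus = max(Q2_plus, Q2_minus)                  # weighted value under pi2*

# Stage 1: the pseudo-outcome adds the stage-2 weighted value to the immediate
# reward, PO1 = r1 + V2*(h1, a1, r1, x2), and is then bounded and weighted.
r1 = 0.3                                             # hypothetical immediate reward
PO1 = r1 + V2_h2_plus

# Hypothetical weighted stage-1 Q-functions (Figure 4 supplies the actual values):
Q1_plus, Q1_minus = 0.9, 1.1
pi1_h1 = 1 if Q1_plus >= Q1_minus else -1            # = -1, matching the example
```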

Figure 4. An illustrative example of an IV-optimal DTR when K = 2. According to the values of the weighted Q-functions at k = 2, we have π2*(h2+) = +1 and π2*(h2−) = −1. Given π2*, the weighted Q-functions at k = 1 can be computed, and the IV-optimal action is π1*(h1) = −1.

Weighting functions

 
Remark 1

The specification of the weighting functions reveals one’s level of optimism. Fix all future weighting functions {λt : t ≥ k+1}. At stage k, if one adopts a worst-case perspective and would like to maximize the worst-case gain at this stage, then it suffices to compare the two worst-case Q-functions at stage k, namely Q_λ,kπ*(hk, +1) and Q_λ,kπ*(hk, −1). This pessimistic action equals the sign of the difference of the two worst-case Q-functions and corresponds to taking λk(hk, ±1) = 1 (see Figure 5a). Alternatively, if one adopts a best-case perspective at stage k and would like to maximize the best-case gain at this stage, then one compares the two best-case Q-functions, namely Q¯λ,kπ*(hk, +1) and Q¯λ,kπ*(hk, −1). This optimistic action equals the sign of the difference between the two best-case Q-functions and corresponds to taking λk(hk, ±1) = 0 (see Figure 5b). Recall that in the single-stage setting, the min-max risk criterion (4) corresponds to maximizing the weighted value (5) with weights set to be 1/2. This criterion can be seamlessly generalized to the current multiple-stage setting by setting λk(hk, ±1) = 1/2 (see Figure 5c). Other choices of weighting functions can also be made to incorporate domain knowledge and reflect individual-level preferences, as suggested by Cui and Tchetgen Tchetgen (2021).
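A minimal sketch of the three canonical choices, following Remark 1's convention that λk = 1 corresponds to the worst case (so the weighted Q-function blends the bounds as λQ_ + (1 − λ)Q¯); the bounds used are hypothetical:

```python
def iv_action(q_lo, q_hi, lam):
    """Pick the stage-k action by comparing weighted Q-functions
    Q_lambda(h, a) = lam * Q_lower(h, a) + (1 - lam) * Q_upper(h, a),
    with lam = 1 (worst case), 0 (best case), or 1/2 (min-max), as in Remark 1.
    q_lo, q_hi: dicts mapping actions +1/-1 to lower/upper Q-values at h_k."""
    weighted = {a: lam * q_lo[a] + (1 - lam) * q_hi[a] for a in (+1, -1)}
    return 1 if weighted[+1] >= weighted[-1] else -1

# Hypothetical bounds where the pessimistic and optimistic recommendations differ:
q_lo = {+1: 0.1, -1: 0.3}
q_hi = {+1: 0.9, -1: 0.4}
print(iv_action(q_lo, q_hi, lam=1.0))   # worst case  -> -1
print(iv_action(q_lo, q_hi, lam=0.0))   # best case   -> +1
print(iv_action(q_lo, q_hi, lam=0.5))   # min-max     -> +1 (0.50 vs 0.35)
```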

Figure 5. IV-optimal DTR at stage k with different choices of weighting functions. The weighted Q-functions are marked by red dots, and the corresponding IV-optimal action is coloured in red. Panels (a)–(c) correspond to the worst-case, best-case, and min-max perspectives, respectively. In each case, the IV-optimal action is sgn{Qλ,kπ*(hk, +1) − Qλ,kπ*(hk, −1)}. (a) Worst: λk(hk, ±1) = 1. (b) Best: λk(hk, ±1) = 0. (c) Min-max: λk(hk, ±1) = 1/2.

4.5 Estimating IV-optimal DTRs

According to Corollary 1, it suffices to estimate the weighted Q-functions {Qλ,kπ*(hk,±1)} and hence contrast functions {Cλ,kπ*(hk)} defined in (22) in order to estimate the IV-optimal DTR π*. We first consider a Q-learning approach to the estimation problem (Schulte et al., 2014; Watkins & Dayan, 1992). As a concrete example, we estimate the worst-case and best-case Q-functions using Manski and Pepper’s ‘minimal-assumptions’ bounds that depend only on core IV assumptions (Manski, 1997; Manski & Pepper, 2000). These core IV assumptions are automatically satisfied, for instance, in a SMART design with noncompliance. In the Online Supplementary Material S5, we review other useful partial identification bounds, including those depending on the monotone instrumental variable assumption, monotone treatment selection assumption (MTS), and monotone treatment response assumption (MTR).

At Stage K, define

(23)

Manski-Pepper bounds state that if RK ∈ [C_K, C¯K] almost surely, then its conditional mean potential outcome ERK∼paK*(⋅|hK)[RK] is lower bounded by

and upper bounded by

Hence, we set Q_K(hK, aK) = LBK, Q¯K(hK, aK) = UBK, and construct the stage-K weighted Q-function Qλ,K(hK, aK) given the values of λK(hK, aK). In practice, we may estimate all relevant quantities from the observed data by fitting parametric models (e.g., linear models) or flexible machine learning models (e.g., regression trees and random forests) and then invoke the plug-in principle.
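A sketch of this stage-K step is given below. It uses the standard natural-bound form for an outcome known to lie in [c_lo, c_hi] as a stand-in for the Manski-Pepper expressions referenced above, intersects the bounds over the two IV arms, and blends them with λK (weight on the lower bound, following Remark 1's worst-case convention). The plug-in estimates are hypothetical numbers that would in practice come from fitted regression models.

```python
def stage_K_weighted_q(m_rz, p_az, c_lo, c_hi, lam):
    """Weighted stage-K Q-function from Manski-type bounds, intersected over z.

    m_rz[z] : estimate of E[R_K * 1{A_K = a_K} | Z_K = z, H_K = h_K]
    p_az[z] : estimate of P(A_K = a_K | Z_K = z, H_K = h_K)
    c_lo, c_hi : known bounds on R_K; lam : weight on the lower bound.
    """
    lb = max(m_rz[z] + c_lo * (1.0 - p_az[z]) for z in (-1, +1))
    ub = min(m_rz[z] + c_hi * (1.0 - p_az[z]) for z in (-1, +1))
    return lam * lb + (1.0 - lam) * ub

# Hypothetical plug-in estimates at a fixed h_K, with R_K in [0, 1] and lam = 1/2:
q_plus = stage_K_weighted_q({-1: 0.25, +1: 0.50}, {-1: 0.40, +1: 0.75}, 0.0, 1.0, lam=0.5)
q_minus = stage_K_weighted_q({-1: 0.30, +1: 0.10}, {-1: 0.60, +1: 0.25}, 0.0, 1.0, lam=0.5)
pi_K = 1 if q_plus - q_minus >= 0 else -1   # estimated stage-K IV-optimal action
```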

Given an estimate Q^λ,K* of Qλ,K, we can then estimate the contrast function Cλ,K(hK) by C^λ,K*(hK) = Q^λ,K*(hK, +1) − Q^λ,K*(hK, −1), and the stage-K IV-optimal DTR πK*(hK) can be estimated by π^KQ(hK) = sgn{C^λ,K*(hK)} according to Corollary 1. Moreover, the weighted value function of π* at stage K, Vλ,Kπ*(hK), can be estimated by V^λ,K*(hK) = Q^λ,K*(hK, π^KQ(hK)).

Now, assume that for every stage t ≥ k+1, the stage-t weighted value function Vλ,tπ*(ht) has been estimated by V^λ,t*(ht), and that Rt ∈ [C_t, C¯t] almost surely. At stage k, define the pseudo-outcome POk(rk, xk+1 | hk, ak) = rk + Vλ,k+1π*(hk, ak, rk, xk+1). By construction, we have POk(Rk, Xk+1 | hk, ak) ∈ [∑t≥k C_t, ∑t≥k C¯t] almost surely for (Rk, Xk+1) ∼ pak*(⋅,⋅|hk). Thus, we can apply Manski-Pepper bounds (or other partial identification bounds) one more time to bound the expected value of POk(Rk, Xk+1 | hk, ak). Alternatively, the two terms involved in the pseudo-outcome may be bounded separately using some stylized bounds based on firm domain knowledge; see Online Supplementary Material S5 for details. In this way, we obtain an estimate Q^λ,k* of the weighted Q-function of π* at stage k; see Online Supplemental Material S4 for detailed expressions. One nuance is that the pseudo-outcome POk(rk, xk+1 | hk, ak) depends on the unknown quantity Vλ,k+1π*. At stage k, we have already obtained an estimate V^λ,k+1* of Vλ,k+1π*. Thus, the pseudo-outcome can be estimated by PO^k(rk, xk+1 | hk, ak) = rk + V^λ,k+1*(hk, ak, rk, xk+1).

Finally, the contrast function Cλ,kπ*(hk) is estimated by C^λ,k*(hk) = Q^λ,k*(hk, +1) − Q^λ,k*(hk, −1), the stage-k IV-optimal DTR by π^kQ(hk) = sgn{C^λ,k*(hk)}, and the stage-k weighted value function by V^λ,k*(hk) = Q^λ,k*(hk, π^kQ(hk)). In this way, we recursively estimate all contrast functions and obtain an estimated IV-optimal DTR π^Q.

Algorithm 1:

Estimation of the IV-Optimal DTR

Input: Trajectories and instrumental variables
{(xk,i, ak,i, rk,i, zk,i) : k ∈ [K], i ∈ [n]}, weighting functions {λk(hk, ak)}, policy class Π, forms of partial identification intervals.
Output: Estimated IV-optimal DTR π^*.
# Step I: Q-learning
Obtain an estimate Q^λ,K* of Qλ,K using (hK,i, rK,i, zK,i), i ∈ [n];
Estimate Cλ,K by C^λ,K*(hK) = Q^λ,K*(hK, +1) − Q^λ,K*(hK, −1);
Set π^KQ(hK) = sgn(C^λ,K*(hK)) and V^λ,K*(hK) = Q^λ,K*(hK, π^KQ(hK));
for k = K−1, …, 1 do
  For each i ∈ [n], construct PO^k,i = rk,i + V^λ,k+1*(hk+1,i);
  Obtain an estimate Q^λ,k* of Qλ,kπ* using (hk,i, PO^k,i, zk,i), i ∈ [n];
  Estimate Cλ,kπ* by C^λ,k*(hk) = Q^λ,k*(hk, +1) − Q^λ,k*(hk, −1);
  Set π^kQ(hk) = sgn(C^λ,k*(hk)) and V^λ,k*(hk) = Q^λ,k*(hk, π^kQ(hk));
end
# Step II: Weighted classification
for k = K, …, 1 do
  Solve the weighted classification problem in (24) to obtain π^k*;
end
return π^*

In many applications, it may be desirable to obtain parsimonious and interpretable DTRs inside some prespecified function class Π = Π1 × ⋯ × ΠK. To achieve this, we may ‘project’ π^Q, which is obtained by Q-learning and is thus not necessarily inside Π, onto the function class Π. With some algebra, one readily checks that π* is a solution to the following weighted classification problem:

(24)

This representation illuminates a rich class of strategies to search for a DTR π within a desired, possibly parsimonious function class Π, via sequentially solving the following weighted classification problem:

where we recall that C^λ,k* is an estimate of Cλ,kπ* obtained via Q-learning, and {hk,i : i ∈ [n]} is the historical information at stage k in our dataset. This strategy is similar to the proposal in Zhao et al. (2015). Algorithm 1 summarizes the procedure.
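In the spirit of the weighted misclassification representation in Section 2.1, Step II can be implemented at each stage as an ordinary weighted classification: the label is the sign of the estimated contrast and the weight its magnitude. A minimal sketch using a depth-2 decision tree as the parsimonious class Πk (the tree depth and the simulated data are illustrative choices, not the paper's):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def project_policy(H_k, C_hat, max_depth=2):
    """Project the Q-learning rule sgn(C_hat) onto shallow decision trees by
    solving a weighted classification problem in the spirit of (24):
    labels sgn(C_hat), sample weights |C_hat|."""
    labels = np.where(C_hat >= 0, 1, -1)
    tree = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
    tree.fit(H_k, labels, sample_weight=np.abs(C_hat))
    return tree

# Hypothetical stage-k histories and estimated contrasts:
rng = np.random.default_rng(0)
H_k = rng.normal(size=(1000, 4))
C_hat = 0.8 * H_k[:, 0] - 0.5 * H_k[:, 2] + rng.normal(scale=0.2, size=1000)
pi_k_hat = project_policy(H_k, C_hat)           # fitted stage-k policy
actions = pi_k_hat.predict(H_k)                 # values in {-1, +1}
```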

4.6 Improving dynamic treatment regimes with an instrumental variable

The IV-optimality framework developed in Sections 4.3–4.5 can be modified to tackle the policy improvement problem under the multiple-stage setup. Let πb be a baseline DTR to be improved. Some of the most important baseline DTRs include the standard-of-care DTR (i.e., πb,k = 1 for every k) and the SRA-optimal DTR defined in Section 4.1. The goal of policy improvement, as discussed in Section 3, is to obtain a DTR π such that π is no worse and potentially better than πb.

Similar to the material in Sections 4.3–4.5, one can leverage the additional information encoded in the collection of IV-constrained sets {Phk,ak}. Briefly speaking, instead of starting from the usual value function, one now starts from the ‘relative’ value function, which characterizes the excess value collected by a certain policy compared to the baseline policy πb. One can then define the notion of IV-improved DTRs using constructions similar to those in Sections 4.3–4.5. We refer the readers to Online Supplemental Material S1 for a detailed exposition.

5 Theoretical properties

In this section, we prove nonasymptotic bounds on the discrepancy between the estimated IV-optimal DTR and its population counterpart. The theoretical guarantees for the IV-improved DTR are deferred to Online Supplementary Material S1.2.

We need several standard assumptions, which we detail below. First of all, we need a proper control on the complexity of the policy class Π = Π1 × ⋯ × ΠK. Note that any πk ∈ Πk is a Boolean function that sends a specific configuration of the historical information hk to a binary decision πk(hk). Let Hk be the collection of all possible hk’s. A canonical measure of complexity of Boolean functions is the Vapnik–Chervonenkis (VC) dimension (Vapnik & Chervonenkis, 1968). The VC dimension of Πk, denoted as vc(Πk), is the largest positive integer d such that there exists a set of d points {hk(1), …, hk(d)} ⊆ Hk shattered by Πk, in the sense that for any binary vector v ∈ {±1}d, there exists πk(v) ∈ Πk such that πk(v)(hk(j)) = vj for j ∈ [d]. If no such d exists, then vc(Πk) = ∞.

In practice, a DTR is most useful when it is parsimonious. For this reason, the class of linear decision rules and the class of decision trees with a fixed depth are popular in various application fields (Speth et al., 2020; Tao et al., 2018). It is well known that when the domain is a subset of Rd, the VC dimension of the class of linear decision rules is at most d + 1 [see, e.g., Example 4.21 of Wainwright (2019)] and the VC dimension of the class of decision trees with L leaves is O(L log(Ld)) (Leboeuf et al., 2020).

Note that a DTR π1:(k−1), along with the distributions {pat(⋅,⋅|ht) : at ∈ {±1}, ht ∈ Ht}, t = 1, …, k−1, induces a probability distribution on Hk, the set of stage-k historical information. Let Hk denote the set of all such probability distributions when we vary π1:(k−1) ∈ Π1:(k−1) and pat(⋅,⋅|ht) ∈ Pht,at for t ∈ [k−1]. An element of Hk will be denoted as qk, and the law of Hkobs will be denoted as qkobs.

Bounded concentration coefficients

 
Assumption 1

Suppose there exist positive constants {ck}k=1K such that

Assumption 1 is often made in the reinforcement learning literature (see, e.g., Chen & Jiang, 2019; Munos, 2003; Szepesvári & Munos, 2005) and is closely related to the ‘overlap’ assumption in causal inference. A sufficient condition for the above assumption to hold is that the probability of seeing any historical information (according to the observed data distribution or any distribution in Hk) is strictly bounded away from 0 and 1.

Recall that our proposed algorithm for estimating an IV-optimal DTR is a two-step procedure, where in Step I, the contrast functions are estimated via Q-learning, and in Step II, a parsimonious DTR is obtained via weighted classification. As mentioned in Section 4.5, in the first step, there is a lot of flexibility in choosing the specific models and algorithms for estimating the contrast functions, and a fine-grained understanding of this step would require a case-by-case analysis. In order not to over-complicate the discussion, we assume the existence of a ‘contrast estimation oracle’.

Existence of the contrast estimation oracle

 
Assumption 2

Suppose in Q-learning, we use an algorithm that takes n i.i.d. samples and outputs estimates {Ĉ*_{λ,k}}_{k=1}^K that satisfy

(25)

for any k ∈ [K] and δ > 0, where the probability is taken over the randomness in the n samples and the expectation is taken over the randomness in a fresh trajectory H_k^obs ∼ q_k^obs.

Assumption 2 is usually a relatively mild assumption. The ground-truth contrast functions C^{π*}_{λ,k} are superpositions of several conditional expectations of the observed data distribution, and it is reasonable to assume that they can be estimated at a vanishing rate as the sample size tends to infinity. While this assumption simplifies the analysis by bypassing a case-by-case analysis of Q-learning, the analysis of the weighted classification step remains nontrivial.
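As a point of reference, an oracle condition of the kind stated in equation (25) typically takes the following form (an illustrative sketch in the notation above; the precise norm and rate function in the paper may differ):

    P( E_{H_k^obs ∼ q_k^obs}[ ( Ĉ*_{λ,k}(H_k^obs) − C^{π*}_{λ,k}(H_k^obs) )^2 ] ≤ ε_n(δ) ) ≥ 1 − δ,   k ∈ [K],

where ε_n(δ) → 0 as n → ∞ for every fixed δ > 0. For instance, with correctly specified parametric working models for the contrasts, one would expect ε_n(δ) to scale like log(1/δ)/n.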

To proceed further, we adopt a form of sample splitting called cross-fitting (Athey & Wager, 2020; Chernozhukov et al., 2018). Specifically, we split the n samples into m equally sized batches, [n] = ∪_{j ∈ [m]} B_j, where m ≥ 2 and m ≍ 1 (i.e., m is of constant order). Each batch has size |B_j| = n_j ≍ n/m ≍ n. For each i ∈ [n], let j_i ∈ [m] be the index of its batch, so that i ∈ B_{j_i}. For the ith sample, we apply the contrast estimation oracle in Assumption 2 to the out-of-batch samples B_{−j_i} = [n] ∖ B_{j_i} to obtain an estimate Ĉ*_{λ,k}(h_k; B_{−j_i}) of C^{π*}_{λ,k}(h_k). We then solve the following optimization problem:

(26)

We emphasize that cross-fitting is mostly for theoretical convenience. The performance of our algorithm with and without cross-fitting is similar.
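The cross-fitted two-step procedure just described can be sketched in pseudocode as follows. This is only a schematic illustration, not the implementation used in the paper: the helper estimate_contrast stands in for whatever contrast-estimation oracle is chosen in Step I (in our framework, a contrast built from the IV-constrained bounds), and Step II carries out the weighted classification with an off-the-shelf depth-limited classification tree, mirroring the parsimonious policy classes discussed above.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def cross_fitted_policy(H, estimate_contrast, n_folds=5, max_depth=2, seed=0):
        """Schematic two-step estimator for a single stage.

        H is an (n, p) array of stage-k histories h_k; estimate_contrast(train_idx)
        is a user-supplied oracle (in the spirit of Assumption 2) that is fit on the
        indexed subsample and returns a function mapping histories to contrast estimates.
        """
        n = H.shape[0]
        rng = np.random.default_rng(seed)
        fold = rng.integers(low=0, high=n_folds, size=n)  # random batch assignment

        # Step I: cross-fitted contrast estimates; subject i uses the oracle
        # trained on the out-of-batch samples only.
        contrast = np.empty(n)
        for j in range(n_folds):
            in_batch = (fold == j)
            oracle_j = estimate_contrast(np.where(~in_batch)[0])
            contrast[in_batch] = oracle_j(H[in_batch])

        # Step II: weighted classification. The sign of the estimated contrast is the
        # pseudo-label and its magnitude is the sample weight, so the fitted tree
        # approximately maximizes the estimated weighted value over depth-limited rules.
        labels = np.where(contrast >= 0, 1, -1)
        weights = np.abs(contrast)
        policy = DecisionTreeClassifier(max_depth=max_depth, random_state=seed)
        policy.fit(H, labels, sample_weight=weights)
        return policy  # policy.predict(h) in {-1, +1} is the estimated decision rule

Backward induction over stages k = K, …, 1 would apply this routine stage by stage, plugging the estimated later-stage rules into the stage-k contrast estimation.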

In practice, given a limited computational budget, we may only solve the optimization problem (26) up to a certain precision. Our analysis is therefore conducted for an approximate minimizer π̂*_k that satisfies

(27)

where ε_k^opt is the optimization error incurred when solving (26).

To quantify the loss of information from restriction to the policy class Π, we define

Note that according to Corollary 1, when Π_k is the set of all Boolean functions, we have π̃*_k = π*_k. Otherwise, the difference between π̃*_k and π*_k measures the approximation error incurred when we restrict ourselves to Π_k instead of the set of all Boolean functions in the policy estimation (resp. policy improvement) problem.
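To make this concrete, the restricted target π̃*_k is typically the best-in-class rule for the weighted classification criterion, for example (an illustrative sketch; the exact weighting and reference distribution in the paper's display may differ):

    π̃*_k ∈ argmax_{π_k ∈ Π_k} E[ |C^{π*}_{λ,k}(H_k)| · 1{ π_k(H_k) = sign( C^{π*}_{λ,k}(H_k) ) } ],

with the expectation taken over an appropriate distribution of the stage-k history. In particular, π̃*_k coincides with the unrestricted rule sign(C^{π*}_{λ,k}(·)) whenever the latter belongs to Π_k.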

We are now ready to present the main result of this section.

Performance of the estimated IV-optimal DTR

 
Theorem 2

Let Assumptions 1 and 2 hold. Fix any δ ∈ (0, 1). Let the optimization error be E_opt = ∑_{k=1}^K c_k ε_k^opt. Let the approximation error be

where π_k ∘ π_{(k+1):K} denotes the DTR that acts according to π_k at stage k and follows π from stage k + 1 to K. Finally, define the generalization error

Then, there exists an absolute constant C > 0 such that, with probability at least 1 − δ, we have

(28)

In the above display, the expectations are taken over a fresh sample of the observed first-stage covariates X_1^obs.

Theorem 2 shows that the weighted value function of π̂* at the first stage converges to that of the IV-optimal DTR up to three sources of error: the optimization error that stems from approximately solving the weighted classification problem, the approximation error that results from the restriction to a parsimonious policy class Π, and a vanishing generalization error. We defer the proof to Online Supplemental Material S3.
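Schematically, writing V_1(·) for the first-stage weighted value function, bounds of the form (28) read

    V_1(π̃*) − V_1(π̂*) ≤ C ( E_opt + E_approx + E_gen ),

an illustrative shape only: the constants, the exact dependence of E_gen on vc(Π_k), n, δ, the concentration coefficients c_k, and the oracle rates of Assumption 2 are deferred to the formal statement. In typical analyses of this kind, E_gen scales like a sum over stages of c_k √((vc(Π_k) + log(K/δ))/n) plus the contrast-estimation error.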

6 Simulation studies

6.1 Comparing IV and non-IV approaches

We considered a data-generating process with two time-independent covariates X_1, X_2 ∼ Bernoulli(1/2). At stage one, there exist an unmeasured confounder U_1 ∼ Rademacher(1/2), an instrumental variable Z_1 ∼ Rademacher(1/2), a treatment assignment A_1 that equals 1 with probability expit{−3 + C(1 + Z_1)/2 + ξ(1 + U_1)/2} and −1 otherwise, and a reward R_1 that is Bernoulli with probability expit{0.2X_1 + 0.05X_2(1 + A_1) − ξ(1 + U_1)/2}, where expit(x) = 1/(1 + e^{−x}). According to this data-generating process, treatment A_1 has no causal effect on R_1 for those with X_2 = 0 but a modest effect for those with X_2 = 1. The unmeasured confounder U_1 represents some unmeasured disease severity, so that patients with U_1 = 1 are more likely to receive treatment and have a smaller reward R_1. This mimics the NICU application, where sicker mothers are more likely to deliver in high-level NICUs and these sicker mothers and their newborns are at higher risk. Because of the unmeasured confounder U_1, a naive non-IV analysis based on the observed data (X_1, X_2, A_1, R_1) underestimates the treatment effect of A_1 on R_1. At the second stage, there exist a second unmeasured confounder U_2 ∼ Rademacher(1/2), a second instrumental variable Z_2 ∼ Rademacher(1/2), a second treatment assignment A_2 that equals 1 with probability expit{−3 + C(1 + Z_2)/2 + ξ(1 + U_2)/2} and −1 otherwise, and a second reward R_2 that is Bernoulli with probability expit{0.2X_1 + 0.1R_1(1 + A_2)/2 − ξ(1 + U_2)/2}. Similar to U_1, U_2 represents unmeasured risk factors, and a naive, non-IV analysis ignoring U_2 underestimates the treatment effect of A_2 on R_2. The parameter C in the above data-generating process controls the strength of the IV-treatment association, and ξ controls the level of unmeasured confounding. Both C and ξ will be varied later.
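For readers who wish to replicate this setting, the data-generating process can be simulated directly. The sketch below follows the formulas above, with the sign conventions as reconstructed there (in particular, the −3 intercept in the treatment-assignment model and the negative sign on the confounding term in the reward models); it is an illustration, not the code used in the paper.

    import numpy as np

    def expit(x):
        return 1.0 / (1.0 + np.exp(-x))

    def simulate_two_stage(n, C=4.0, xi=0.5, seed=0):
        """Draw n trajectories from the two-stage simulation design of Section 6.1."""
        rng = np.random.default_rng(seed)
        rademacher = lambda size: 2 * rng.binomial(1, 0.5, size) - 1  # values in {-1, +1}

        X1 = rng.binomial(1, 0.5, n)
        X2 = rng.binomial(1, 0.5, n)

        # Stage one: unmeasured confounder, IV, treatment, reward.
        U1 = rademacher(n)
        Z1 = rademacher(n)
        p_A1 = expit(-3 + C * (1 + Z1) / 2 + xi * (1 + U1) / 2)
        A1 = 2 * rng.binomial(1, p_A1) - 1
        R1 = rng.binomial(1, expit(0.2 * X1 + 0.05 * X2 * (1 + A1) - xi * (1 + U1) / 2))

        # Stage two: same structure with a fresh confounder and IV.
        U2 = rademacher(n)
        Z2 = rademacher(n)
        p_A2 = expit(-3 + C * (1 + Z2) / 2 + xi * (1 + U2) / 2)
        A2 = 2 * rng.binomial(1, p_A2) - 1
        R2 = rng.binomial(1, expit(0.2 * X1 + 0.1 * R1 * (1 + A2) / 2 - xi * (1 + U2) / 2))

        return dict(X1=X1, X2=X2, Z1=Z1, A1=A1, R1=R1, Z2=Z2, A2=A2, R2=R2)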

An 'SRA-optimal' DTR π_SRA does not leverage the IV and is based solely on the observed data D^obs = {(X_{1i}, X_{2i}, A_{1i}, R_{1i}, A_{2i}, R_{2i}), i = 1, …, n}, falsely assuming the SRA. To estimate π_SRA, we estimated the CATE using a robust stabilized augmented inverse probability weighting (SAIPW) estimator and then applied a weighted classification routine. This is known as C-learning in the literature (Zhang & Zhang, 2018), and it can also be viewed as a variant of the backward outcome weighted learning procedure (Zhao et al., 2015). We then estimated the IV-optimal DTR π_IV,1/2 corresponding to λ_k(h_k, ±1) = 0.5 for k = 1, 2. When constructing the IV-optimal DTRs, we used 'minimal-assumptions' bounds (Balke–Pearl bounds for a binary outcome and Manski–Pepper bounds otherwise). All classification problems involved in estimating π_SRA and π_IV,1/2 were solved using a classification tree with a maximum depth of 2, which is meant to replicate the real data application, where parsimonious rules are more useful and may deliver the most insight (Laber & Zhao, 2015; Speth et al., 2020). Finally, we evaluated each estimated regime by generating 1,000,000 fresh samples of (X_1, X_2) and integrating out the unmeasured confounders U_1 and U_2; a sketch of this evaluation step is given below.
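The evaluation step can be sketched as follows: for each fresh covariate draw we follow the candidate regime through both stages of the data-generating process described above and average the total reward, integrating out the unmeasured confounders by Monte Carlo. This is a schematic illustration under the same reconstructed sign conventions as the simulation sketch, taking the value of a regime to be the expected total reward E[R_1 + R_2]; policy1 and policy2 are assumed to map histories to decisions in {−1, +1}.

    import numpy as np

    def expit(x):
        return 1.0 / (1.0 + np.exp(-x))

    def evaluate_regime(policy1, policy2, xi=0.5, n_eval=1_000_000, seed=1):
        """Monte Carlo estimate of the value E[R1 + R2] of a two-stage regime."""
        rng = np.random.default_rng(seed)
        rademacher = lambda size: 2 * rng.binomial(1, 0.5, size) - 1

        X1 = rng.binomial(1, 0.5, n_eval)
        X2 = rng.binomial(1, 0.5, n_eval)

        # Stage one: the regime (not the confounded assignment mechanism) picks A1.
        U1 = rademacher(n_eval)
        A1 = policy1(X1, X2)
        R1 = rng.binomial(1, expit(0.2 * X1 + 0.05 * X2 * (1 + A1) - xi * (1 + U1) / 2))

        # Stage two: the regime picks A2 given the accumulated history.
        U2 = rademacher(n_eval)
        A2 = policy2(X1, X2, A1, R1)
        R2 = rng.binomial(1, expit(0.2 * X1 + 0.1 * R1 * (1 + A2) / 2 - xi * (1 + U2) / 2))

        return (R1 + R2).mean()

    # Example: value of the regime that treats everyone at both stages.
    # evaluate_regime(lambda x1, x2: np.ones_like(x1),
    #                 lambda x1, x2, a1, r1: np.ones_like(x1))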

Figure 6 summarizes the cumulative distribution functions of the value of the estimated SRA-optimal policy π_SRA and the IV-optimal policy π_IV,1/2 when C = 4, for assorted training sample sizes n and levels of confounding ξ. It is transparent that the IV-based approach has superior performance compared to the non-IV approach in all of these cases; moreover, the improvement becomes more conspicuous as the sample size increases. Table 1 further puts the percentage of improvement in the context of the maximal possible improvement (comparing an optimal policy to a baseline policy that treats no one). For instance, when n = 5,000 and ξ = 0.5, the IV-based DTR π_IV,1/2 attained a median of 77.7% of the maximal possible improvement, while the non-IV approach attained a median of only 13.4%. What explains the performance gap between IV and non-IV methods? The SAIPW estimates obtained from a naive non-IV approach tended to underestimate the treatment effect and to indicate a spurious, negative treatment effect at either stage as a result of omitting the unmeasured confounders. Therefore, the estimated π_SRA often recommended withholding the treatment even when the treatment was in fact beneficial. On the other hand, the IV-based interval estimates of the CATEs more faithfully tracked the genuine treatment effect. This difference in estimating the CATE then translated into the performance gap between IV and non-IV DTRs. In Online Supplementary Material S6, we further report simulation results corresponding to different levels of IV-treatment association as controlled by the parameter C. We found two consistent trends. First, π_IV,1/2 achieved better performance as the IV-treatment association increased (that is, as C grew while holding ξ fixed). Second, the IV-optimal DTR π_IV,1/2 consistently outperformed π_SRA in these simulation settings, though the degree of improvement depended on the specifics of the data-generating process.

Figure 6. Cumulative distribution functions of the value of the estimated SRA-optimal policy π_SRA and IV-optimal policy π_IV,1/2 (min-max) for various n and ξ values and C = 4.

Table 1.

Performance of IV and non-IV approaches for various training sample sizes n and levels of unmeasured confounding ξ when C = 4

            Non-IV method: π_SRA                        IV method: π_IV,1/2
       % Improvement          % Correct           % Improvement          % Correct
  ξ    Median  IQR            Stage I  Stage II   Median  IQR            Stage I  Stage II

  n = 1000
  0.2  12.4    [10.2, 14.9]   0.0      0.9        51.0    [35.9, 68.3]   45.0     58.9
  0.3  14.5    [11.8, 17.2]   0.1      1.4        58.9    [45.9, 76.1]   47.6     61.7
  0.4  12.8    [10.3, 15.3]   0.1      1.8        58.3    [46.4, 76.9]   51.0     62.4
  0.5  13.6    [10.3, 16.5]   0.2      3.1        61.5    [50.4, 81.6]   51.2     63.5

  n = 2000
  0.2  12.3    [10.1, 14.6]   0.0      0.0        54.2    [45.5, 70.1]   49.0     64.4
  0.3  14.6    [11.8, 17.2]   0.0      0.1        62.3    [51.2, 79.0]   51.1     67.0
  0.4  12.6    [10.1, 15.0]   0.0      0.1        59.8    [49.9, 76.5]   50.6     67.2
  0.5  13.1    [10.3, 16.1]   0.0      0.1        64.6    [53.5, 82.5]   53.5     66.7

  n = 5000
  0.2  12.4    [10.1, 14.6]   0.0      0.0        65.7    [50.1, 74.5]   57.0     74.4
  0.3  14.3    [11.6, 17.1]   0.0      0.0        74.0    [55.2, 83.3]   60.0     73.7
  0.4  12.7    [10.0, 15.5]   0.0      0.0        73.4    [55.8, 86.2]   62.8     76.8
  0.5  13.4    [10.4, 16.2]   0.0      0.0        77.7    [58.1, 92.6]   60.1     77.9

Note. The metric % Improvement is defined as the improvement in the value as a fraction of the maximal possible improvement (comparing an optimal policy to a baseline policy that treats no one). The metric % Correct at Stage I is defined as the percentage of subjects with X_2 = 1 who are assigned A_1 = 1 by the estimated DTR. The metric % Correct at Stage II is defined as the percentage of subjects with R_1 = 1 who are assigned A_2 = 1 by the estimated DTR.

6.2 Additional simulation results

In Online Supplementary Material S7, we verified that IV-improved DTRs had superior performance compared to their corresponding baseline DTRs, including an SRA-optimal DTR. In Online Supplementary Material S8.1, we discussed the primary difference between two IV-based approaches: a Wald-estimator-based, point-identification approach π_IV,Wald and the more generic, partial identification approach. Without additional identification assumptions, a conditional Wald estimator identifies the conditional local average treatment effect, that is, the average treatment effect among a latent subgroup known as compliers, conditional on observed covariates. When there is treatment effect heterogeneity among different latent subgroups, e.g., when the treatment effect among compliers differs from that among the rest of the population, a Wald-estimator-based approach may mistakenly assign study participants to treatment even when they are harmed by the treatment on average conditional on their covariate history. Simulation results confirming this can be found in Online Supplementary Materials S8.2 and S8.3.
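For completeness, the conditional Wald estimand referred to above is the usual ratio of conditional intention-to-treat effects. Writing it in a single-stage form with a binary IV Z ∈ {0, 1}, a binary treatment A, and outcome R (a standard textbook expression, not a display taken from the paper),

    CLATE(x) = { E(R | Z = 1, X = x) − E(R | Z = 0, X = x) } / { P(A = 1 | Z = 1, X = x) − P(A = 1 | Z = 0, X = x) },

which, under IV validity together with monotonicity (no defiers), equals the conditional average treatment effect among compliers with X = x. A DTR built by thresholding CLATE(x) at zero therefore targets compliers only; when the treatment effect among always-takers and never-takers has the opposite sign, such a rule can be suboptimal for the population as a whole, which is exactly the phenomenon examined in Online Supplementary Materials S8.2 and S8.3.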

7 Application

We considered a total of 183,487 mothers who delivered exactly two births between 1995 and 2009 in the Commonwealth of Pennsylvania and relocated before their second deliveries, so that their 'excess-travel-time' IVs at the two deliveries were different. We followed the study design in Baiocchi et al. (2010) and Lorch et al. (2012) and controlled for covariates that measured mothers' neighbourhood circumstances (including poverty rate, median income, etc.), mothers' demographic information (including race (white or not), age, years of education, etc.), and variables related to delivery (including gestational age in weeks, length of prenatal care in months, and eight congenital diseases); see Online Supplemental Material S9 for details. The 'excess-travel-time' IVs in both stages were then dichotomized: 1 if above the median and 0 otherwise. Mothers' treatment choices and their babies' mortality status at the first delivery were included as covariates for studying the second delivery. We used the multiple imputation by chained equations method (Van Buuren & Groothuis-Oudshoorn, 2010), implemented in the R package mice, to impute missing covariate data and repeated the analysis on five imputed datasets.

We assumed that high-level NICUs do no harm compared to low-level NICUs; therefore, all partial identification intervals in this application were estimated under the MTR assumption (Manski, 2003, Chapter 8). We considered estimating an IV-optimal DTR that minimizes the maximum risk at each delivery (see Section 4.4) and explored the trade-off between minimizing the maximum risk and the cost/capacity constraint by adding a generic penalty to the value function; a sketch of this penalized criterion is given after this paragraph. We performed weighted classification using a classification tree with maximum depth equal to 3 so that the resulting DTR is interpretable. Figure 7 plots three estimated DTRs corresponding to no penalty, a moderate penalty, and a large penalty attached to attending a high-level NICU. When there is no penalty, all mothers are assigned to a high-level NICU. As we increase the penalty, fewer mothers (namely, those who benefit most from attending a high-level NICU) are assigned to a high-level NICU. For instance, Figure 7b corresponds to sending 68% of mothers to a high-level NICU at their first deliveries and 59.9% at their second deliveries. Mothers assigned to a high-level NICU according to this DTR either belong to racial and ethnic minority groups or are older and have premature gestational age. Similarly, Figure 7c plots a regime in which less than 10% of mothers are assigned to a high-level NICU. Mothers who are assigned to a high-level NICU according to this DTR belong to racial and ethnic minority groups and have premature births. Our analysis here seems to suggest that, in general, race and ethnicity, mother's age, and gestational age are the most promising effect modifiers. Gestational age has long been hypothesized to be an effect modifier; see Lorch et al. (2012), Yang et al. (2014), and Michael et al. (2020); more recently, Yannekis et al. (2020) found a differential effect across racial and ethnic groups. On the other hand, mother's age as an effect modifier appears to be a new finding that is worth looking into. Overall, our method both complemented previously published results and generated new insights.
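One simple way to encode such a cost/capacity penalty, shown here only as an illustrative sketch (the penalty parameter c and its exact placement in the paper's criterion may differ), is to subtract a per-assignment cost from the criterion being maximized:

    V_c(π) = V(π) − c · E[ 1{ π(H) = 1 } ],

where V(π) denotes the worst-case value criterion of Section 4.4, 1{π(H) = 1} indicates assignment to a high-level NICU, and larger values of c push the estimated rule to reserve high-level NICUs for mothers with the largest estimated benefit, producing the progression from Figure 7a to Figure 7c.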

Figure 7. Estimated IV-optimal DTRs using the NICU data. GA is gestational age. (a) No penalty for attending a high-level NICU: all mothers are assigned to high-level NICUs. (b) Moderate penalty for attending a high-level NICU: 68.3% and 59.9% of all mothers are assigned to high-level NICUs for their first and second deliveries, respectively. (c) Large penalty for attending a high-level NICU: 4.53% and 8.56% of all mothers are assigned to high-level NICUs for their first and second deliveries, respectively.

8 Discussion

We systematically studied the problem of estimating a DTR from retrospective observational data using a time-varying instrumental variable. We formulated the problem under a generic partial identification framework, derived a counterpart of the classical Q-learning and Bellman equation under partial identification, and used it as the basis for generalizing the notion of IV-optimality to DTRs. One important variant of the developed framework is a strategy to improve upon a baseline DTR. As demonstrated via extensive simulations, IV-improved DTRs indeed have favourable performance compared to the baseline DTRs, including baseline DTRs that are optimal under the NUCA.

With the increasing availability of administrative databases that keep track of clinical data, it is tempting to estimate useful, real-world-evidence-based DTRs from such retrospective data. When making causal or treatment effect statements from non-RCT data, an instrumental variable analysis is often better received by clinicians. Fortunately, many reasonably good instrumental variables are available, e.g., daily precipitation, geographic distances, and service providers' preferences. Many of these IVs are intrinsically time-varying and could be leveraged to estimate a DTR using the framework proposed in this article. In practice, to deliver the most useful policy intervention, it is important to take into account various practical constraints, e.g., those arising from limited facility capacity or increased cost. Our framework can be readily extended to incorporate such constraints.

To keep the theoretical analyses versatile, we have imposed Assumptions 1 and 2 as well as the finiteness of the VC dimension of the policy classes in the current analysis. An alternative route that could potentially yield sharper bounds would be to assume margin-type conditions so that the Q-function exhibits a separation between the optimal and nonoptimal treatments. See, e.g., Qian and Murphy (2011) for an application of this strategy to the single-stage setting, and Luedtke and Van Der Laan (2016) and Shi et al. (2020) for extensions to a multiple-stage, non-IV setting.

Acknowledgments

The authors would like to thank Dr. Zongming Ma, Dr. Dylan S. Small, three reviewers and editors for constructive feedback and comments. The authors would like to acknowledge Dr. Scott A. Lorch for the use of the data from the NICU study.

Data availability

The participants of this study did not give written consent for their data to be shared publicly; therefore, due to the sensitive nature of the research, supporting data are not available. The authors received no financial support for the research, authorship, and publication of this article.

Supplementary material

Supplementary material is available at Journal of the Royal Statistical Society: Series B online.

References

Angrist, J. D., Imbens, G. W., & Rubin, D. B. (1996). Identification of causal effects using instrumental variables. Journal of the American Statistical Association, 91(434), 444–455. https://doi.org/10.1080/01621459.1996.10476902
Athey, S., & Wager, S. (2020). Policy learning with observational data. Econometrica, 89(1), 133–161.
Baiocchi, M., Cheng, J., & Small, D. S. (2014). Instrumental variable methods for causal inference. Statistics in Medicine, 33(13), 2297–2340. https://doi.org/10.1002/sim.6128
Baiocchi, M., Small, D. S., Lorch, S., & Rosenbaum, P. R. (2010). Building a stronger instrument in an observational study of perinatal care for premature infants. Journal of the American Statistical Association, 105(492), 1285–1296. https://doi.org/10.1198/jasa.2010.ap09490
Balke, A., & Pearl, J. (1997). Bounds on treatment effects from studies with imperfect compliance. Journal of the American Statistical Association, 92(439), 1171–1176. https://doi.org/10.1080/01621459.1997.10474074
Ben-Michael, E., Greiner, D. J., Imai, K., & Jiang, Z. (2021). 'Safe policy learning through extrapolation: Application to pre-trial risk assessment', arXiv, arXiv:2109.11679, preprint: not peer reviewed.
Bonvini, M., & Kennedy, E. H. (2022). Sensitivity analysis via the proportion of unmeasured confounding. Journal of the American Statistical Association, 117(539), 1540–1550. https://doi.org/10.1080/01621459.2020.1864382
Chakraborty, B., Murphy, S., & Strecher, V. (2010). Inference for non-regular parameters in optimal dynamic treatment regimes. Statistical Methods in Medical Research, 19(3), 317–343. https://doi.org/10.1177/0962280209105013
Chakraborty, B., & Murphy, S. A. (2014). Dynamic treatment regimes. Annual Review of Statistics and Its Application, 1(1), 447–464. https://doi.org/10.1146/annurev-statistics-022513-115553
Chen, J., & Jiang, N. (2019). Information-theoretic considerations in batch reinforcement learning. In International conference on machine learning (pp. 1042–1051). PMLR.
Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., & Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1), C1–C68. https://doi.org/10.1111/ectj.12097
Cui, Y., & Tchetgen Tchetgen, E. (2020). A semiparametric instrumental variable approach to optimal treatment regimes under endogeneity. Journal of the American Statistical Association, 116(533), 162–173. https://doi.org/10.1080/01621459.2020.1783272
Cui, Y., & Tchetgen Tchetgen, E. (2021). Machine intelligence for individualized decision making under a counterfactual world: A rejoinder. Journal of the American Statistical Association, 116(533), 200–206. https://doi.org/10.1080/01621459.2021.1872580
Delage, E., & Ye, Y. (2010). Distributionally robust optimization under moment uncertainty with application to data-driven problems. Operations Research, 58(3), 595–612. https://doi.org/10.1287/opre.1090.0741
Duarte, G., Finkelstein, N., Knox, D., Mummolo, J., & Shpitser, I. (2021). 'An automated approach to causal inference in discrete settings', arXiv, arXiv:2109.13471, preprint: not peer reviewed.
Finkelstein, N., & Shpitser, I. (2020). Deriving bounds and inequality constraints using logical relations among counterfactuals. In Conference on uncertainty in artificial intelligence (pp. 1348–1357). PMLR.
Frangakis, C. E., & Rubin, D. B. (1999). Addressing complications of intention-to-treat analysis in the combined presence of all-or-none treatment-noncompliance and subsequent missing outcomes. Biometrika, 86(2), 365–379. https://doi.org/10.1093/biomet/86.2.365
Han, S. (2019). 'Optimal dynamic treatment regimes and partial welfare ordering', arXiv, arXiv:1912.10014, preprint: not peer reviewed.
Heng, S., & Small, D. S. (2021). Sharpening the Rosenbaum sensitivity bounds to address concerns about interactions between observed and unobserved covariates. Statistica Sinica, 31, 2331–2353. https://doi.org/10.5705/ss.202020.0395
Hernán, M. A., Hernández-Díaz, S., & Robins, J. M. (2004). A structural approach to selection bias. Epidemiology, 15(5), 615–625. https://doi.org/10.1097/01.ede.0000135174.63482.43
Hernán, M. A., & Robins, J. M. (2006). Instruments for causal inference: An epidemiologist's dream? Epidemiology, 17(4), 360–372. https://doi.org/10.1097/01.ede.0000222409.00878.37
Imbens, G. W. (2004). Nonparametric estimation of average treatment effects under exogeneity: A review. Review of Economics and Statistics, 86(1), 4–29. https://doi.org/10.1162/003465304323023651
Kallus, N., Mao, X., & Zhou, A. (2019). Interval estimation of individual-level causal effects under unobserved confounding. In The 22nd international conference on artificial intelligence and statistics (pp. 2281–2290). PMLR.
Kallus, N., & Zhou, A. (2020a). Confounding-robust policy evaluation in infinite-horizon reinforcement learning. In Advances in neural information processing systems (Vol. 33, pp. 22293–22304).
Kallus, N., & Zhou, A. (2020b). Minimax-optimal policy learning under unobserved confounding. Management Science, 67(5), 2870–2890. https://doi.org/10.1287/mnsc.2020.3699
Kroelinger, C. D., Okoroh, E. M., Goodman, D. A., Lasswell, S. M., & Barfield, W. D. (2018). Comparison of state risk-appropriate neonatal care policies with the 2012 AAP policy statement. Journal of Perinatology, 38(4), 411–420. https://doi.org/10.1038/s41372-017-0006-6
Laber, E. B., & Zhao, Y.-Q. (2015). Tree-based methods for individualized treatment regimes. Biometrika, 102(3), 501–514. https://doi.org/10.1093/biomet/asv028
Lasswell, S. M., Barfield, W. D., Rochat, R. W., & Blackmon, L. (2010). Perinatal regionalization for very low-birth-weight and very preterm infants: A meta-analysis. JAMA, 304(9), 992–1000. https://doi.org/10.1001/jama.2010.1226
Leboeuf, J.-S., LeBlanc, F., & Marchand, M. (2020). Decision trees as partitioning machines to characterize their generalization properties. In Advances in neural information processing systems (Vol. 33, pp. 18135–18145).
Liao, L., Fu, Z., Yang, Z., Kolar, M., & Wang, Z. (2021). 'Instrumental variable value iteration for causal offline reinforcement learning', arXiv, arXiv:2102.09907, preprint: not peer reviewed.
Lorch, S. A., Baiocchi, M., Ahlberg, C. E., & Small, D. S. (2012). The differential impact of delivery hospital on the outcomes of premature infants. Pediatrics, 130(2), 270–278. https://doi.org/10.1542/peds.2011-2820
Luedtke, A. R., & Van Der Laan, M. J. (2016). Statistical inference for the mean outcome under a possibly non-unique optimal treatment strategy. Annals of Statistics, 44(2), 713. https://doi.org/10.1214/15-AOS1384
Manski, C. F. (1990). Nonparametric bounds on treatment effects. The American Economic Review, 80, 319–323. https://www.jstor.org/stable/2006592
Manski, C. F. (1997). Monotone treatment response. Econometrica, 65(6), 1311–1334. https://doi.org/10.2307/2171738
Manski, C. F. (2003). Partial identification of probability distributions. Springer Science & Business Media.
Manski, C. F., & Pepper, J. V. (2000). Monotone instrumental variables: With an application to the returns to schooling. Econometrica, 68(4), 997–1010. https://doi.org/10.1111/1468-0262.00144
Michael, H., Cui, Y., Lorch, S., & Tchetgen, E. T. (2023). Instrumental variable estimation of marginal structural mean models for time-varying treatment. Journal of the American Statistical Association (just accepted). https://doi.org/10.1080/01621459.2023.2183131
Moodie, E. E., Richardson, T. S., & Stephens, D. A. (2007). Demystifying optimal dynamic treatment regimes. Biometrics, 63(2), 447–455. https://doi.org/10.1111/j.1541-0420.2006.00686.x
Munos, R. (2003). Error bounds for approximate policy iteration. In International conference on machine learning (Vol. 3, pp. 560–567).
Murphy, S. A. (2003). Optimal dynamic treatment regimes. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65(2), 331–355. https://doi.org/10.1111/1467-9868.00389
Murphy, S. A. (2005). An experimental design for the development of adaptive treatment strategies. Statistics in Medicine, 24(10), 1455–1481. https://doi.org/10.1002/sim.2022
Murphy, S. A., van der Laan, M. J., Robins, J. M., & Conduct Problems Prevention Research Group (2001). Marginal mean models for dynamic regimes. Journal of the American Statistical Association, 96(456), 1410–1423. https://doi.org/10.1198/016214501753382327
Neyman, J. S. (1923). On the application of probability theory to agricultural experiments. Essay on principles. Section 9 (Translated and edited by D. M. Dabrowska and T. P. Speed, Statistical Science (1990), 5, 465–480). Annals of Agricultural Sciences, 10, 1–51.
Parthasarathy, K. R. (2005). Probability measures on metric spaces (Vol. 352). American Mathematical Society.
Pearl, J. (2009). Causality. Cambridge University Press.
Pu, H., & Zhang, B. (2021). Estimating optimal treatment rules with an instrumental variable: A partial identification learning approach. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 83(2), 318–345. https://doi.org/10.1111/rssb.12413
Qian, M., & Murphy, S. A. (2011). Performance guarantees for individualized treatment rules. The Annals of Statistics, 39(2), 1180. https://doi.org/10.1214/10-AOS864
Qiu, H., Carone, M., Sadikova, E., Petukhova, M., Kessler, R. C., & Luedtke, A. (2020). Optimal individualized decision rules using instrumental variable methods. Journal of the American Statistical Association, 1–46.
Robins, J. M. (1989). The analysis of randomized and non-randomized AIDS treatment trials using a new approach to causal inference in longitudinal studies. Health Service Research Methodology: A Focus on AIDS, 113–159.
Robins, J. M. (1992). Estimation of the time-dependent accelerated failure time model in the presence of confounding factors. Biometrika, 79(2), 321–334. https://doi.org/10.1093/biomet/79.2.321
Robins, J. M. (1998). Marginal structural models. In 1997 Proceedings of the section on Bayesian Statistical Science (pp. 1–10). American Statistical Association.
Robins, J. M. (2004). Optimal structural nested models for optimal sequential decisions. In Proceedings of the second Seattle symposium in biostatistics (pp. 189–326). Springer.
Robins, J. M., & Greenland, S. (1996). Identification of causal effects using instrumental variables: Comment. Journal of the American Statistical Association, 91, 456–458.
Rosenbaum, P. R. (2002a). Covariance adjustment in randomized experiments and observational studies. Statistical Science, 17(3), 286–327. https://doi.org/10.1214/ss/1042727942
Rosenbaum, P. R. (2002b). Observational studies. Springer.
Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1), 41–55. https://doi.org/10.1093/biomet/70.1.41
Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5), 688–701. https://doi.org/10.1037/h0037350
Rubin, D. B., & van der Laan, M. J. (2012). Statistical issues and limitations in personalized medicine research with clinical trials. The International Journal of Biostatistics, 8(1), 18. https://doi.org/10.1515/1557-4679.1423
Schulte, P. J., Tsiatis, A. A., Laber, E. B., & Davidian, M. (2014). Q- and A-learning methods for estimating optimal dynamic treatment regimes. Statistical Science, 29(4), 640. https://doi.org/10.1214/13-STS450
Shi, C., Lu, W., & Song, R. (2020). Breaking the curse of nonregularity with subagging: Inference of the mean outcome under optimal treatment regimes. Journal of Machine Learning Research, 21(176), 1–67. http://jmlr.org/papers/v21/20-066.html
Shi, C., Zhu, J., Ye, S., Luo, S., Zhu, H., & Song, R. (2022). Off-policy confidence interval estimation with confounded Markov decision process. Journal of the American Statistical Association (just accepted). https://doi.org/10.1080/01621459.2022.2110878
Speth, K. A., Yoon, A. P., Wang, L., & Chung, K. C. (2020). Assessment of tree-based statistical learning to estimate optimal personalized treatment decision rules for traumatic finger amputations. JAMA Network Open, 3(2), e1921626. https://doi.org/10.1001/jamanetworkopen.2019.21626
Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. MIT Press.
Swanson, S. A., Hernán, M. A., Miller, M., Robins, J. M., & Richardson, T. S. (2018). Partial identification of the average treatment effect using instrumental variables: Review of methods for binary instruments, treatments, and outcomes. Journal of the American Statistical Association, 113(522), 933–947. https://doi.org/10.1080/01621459.2018.1434530
Szepesvári, C., & Munos, R. (2005). Finite time bounds for sampling based fitted value iteration. In Proceedings of the 22nd international conference on machine learning (pp. 880–887).
Tao, Y., Wang, L., & Almirall, D. (2018). Tree-based reinforcement learning for estimating optimal dynamic treatment regimes. The Annals of Applied Statistics, 12(3), 1914. https://doi.org/10.1214/18-AOAS1137
Van Buuren, S., & Groothuis-Oudshoorn, K. (2010). mice: Multivariate imputation by chained equations in R. Journal of Statistical Software, 45(3), 1–67. https://doi.org/10.18637/jss.v045.i03
Vapnik, V. N., & Chervonenkis, A. Y. (1968). The uniform convergence of frequencies of the appearance of events to their probabilities. In Doklady Akademii Nauk (Vol. 181, pp. 781–783). Russian Academy of Sciences.
Verma, T. S., & Pearl, J. (2022). Equivalence and synthesis of causal models. In Probabilistic and causal inference: The works of Judea Pearl. Association for Computing Machinery.
Wainwright, M. J. (2019). High-dimensional statistics: A non-asymptotic viewpoint (Vol. 48). Cambridge University Press.
Wang, L., & Tchetgen Tchetgen, E. (2018). Bounded, efficient and multiply robust estimation of average treatment effects using instrumental variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(3), 531–550. https://doi.org/10.1111/rssb.12262
Watkins, C. J., & Dayan, P. (1992). Q-learning. Machine Learning, 8, 279–292. https://doi.org/10.1007/BF00992698
Yang, F., Lorch, S. A., & Small, D. S. (2014). Estimation of causal effects using instrumental variables with nonignorable missing covariates: Application to effect of type of delivery NICU on premature infants. Annals of Applied Statistics, 8(1), 48–73. https://doi.org/10.1214/13-AOAS699
Yannekis, G., Passarella, M., & Lorch, S. (2020). Differential effects of delivery hospital on mortality and morbidity in minority premature and low birth weight neonates. Journal of Perinatology, 40(3), 404–411. https://doi.org/10.1038/s41372-019-0423-9
Zhang, B., & Pu, H. (2020). Discussion of Cui and Tchetgen Tchetgen (2020) and Qiu et al. (2020). Journal of the American Statistical Association, 116(533), 196–199. https://doi.org/10.1080/01621459.2020.1832500
Zhang, B., Tsiatis, A. A., Davidian, M., Zhang, M., & Laber, E. (2012). Estimating optimal treatment regimes from a classification perspective. Stat, 1(1), 103–114. https://doi.org/10.1002/sta.411
Zhang, B., Weiss, J., Small, D. S., & Zhao, Q. (2020). Selecting and ranking individualized treatment rules with unmeasured confounding. Journal of the American Statistical Association, 116(533), 295–308. https://doi.org/10.1080/01621459.2020.1736083
Zhang, B., & Zhang, M. (2018). C-learning: A new classification framework to estimate optimal dynamic treatment regimes. Biometrics, 74(3), 891–899. https://doi.org/10.1111/biom.12836
Zhang, Y., Laber, E. B., Davidian, M., & Tsiatis, A. A. (2018). Interpretable dynamic treatment regimes. Journal of the American Statistical Association, 113(524), 1541–1549. https://doi.org/10.1080/01621459.2017.1345743
Zhao, Y., Zeng, D., Rush, A. J., & Kosorok, M. R. (2012). Estimating individualized treatment rules using outcome weighted learning. Journal of the American Statistical Association, 107(499), 1106–1118. https://doi.org/10.1080/01621459.2012.695674
Zhao, Y.-Q., Zeng, D., Laber, E. B., & Kosorok, M. R. (2015). New statistical learning methods for estimating optimal dynamic treatment regimes. Journal of the American Statistical Association, 110(510), 583–598. https://doi.org/10.1080/01621459.2014.937488
Zhou, X., Mayer-Hamblett, N., Khan, U., & Kosorok, M. R. (2017). Residual weighted learning for estimating individualized treatment rules. Journal of the American Statistical Association, 112(517), 169–187. https://doi.org/10.1080/01621459.2015.1093947

Author notes

Conflict of interest None declared.

