Melody Y. Huang, Sensitivity analysis for the generalization of experimental results, Journal of the Royal Statistical Society Series A: Statistics in Society, Volume 187, Issue 4, October 2024, Pages 900–918, https://doi.org/10.1093/jrsssa/qnae012
Abstract
Randomized controlled trials (RCTs) allow researchers to estimate causal effects in an experimental sample with minimal identifying assumptions. However, to generalize or transport a causal effect from an RCT to a target population, researchers must adjust for a set of treatment effect moderators. In practice, it is impossible to know whether the set of moderators has been properly accounted for. I propose a two-parameter sensitivity analysis for generalizing or transporting experimental results using weighted estimators. The contributions in the article are threefold. First, I show that the sensitivity parameters are scale-invariant and standardized, and introduce an estimation approach for researchers to account for both bias in their estimates from omitting a moderator, as well as potential changes to their inference. Second, I propose several tools researchers can use to perform sensitivity analysis: (1) numerical measures to summarize the uncertainty in an estimated effect to omitted moderators; (2) graphical summary tools to visualize the sensitivity in estimated effects; and (3) a formal benchmarking approach for researchers to estimate potential sensitivity parameter values using existing data. Finally, I demonstrate that the proposed framework can be easily extended to the class of doubly robust, augmented weighted estimators.
1 Introduction
Randomized controlled trials (RCTs) provide researchers with a rich understanding of the treatment effect within an experimental sample. Because researchers have the ability to eliminate confounding by randomly assigning treatment in a controlled environment, experiments have a high degree of internal validity. However, a causal effect estimated from an RCT may not directly generalize or transport to populations of interest when the experimental sample is not representative of the larger population. One prominent source of bias arises from distributional differences in treatment effect moderators—i.e. covariates that drive propensity of selection into the experimental sample, as well as treatment effect heterogeneity—between the experimental sample and the population (e.g. Cole & Stuart, 2010; Imai et al., 2008; Olsen et al., 2013; see Egami & Hartman, 2022 for discussion on alternative sources of bias). To properly generalize or transport the results from an experiment to a target population, researchers must either re-weight the experimental sample to be representative of the target population, or successfully model the treatment effect heterogeneity (Kern et al., 2016; Stuart et al., 2011).
In practice, it is impossible to know whether the set of treatment effect moderators has been correctly identified. Researchers rely on the measured variables that are available in the sample and the population and often assume that the observed covariates sufficiently capture the confounding effect. However, when moderators are omitted from estimation, the resulting estimates will be biased. Recently, different sensitivity analyses have been proposed for assessing robustness to omitted moderators when generalizing (or transporting) causal effects; however, challenges remain. In particular, many of these existing approaches require researchers to justify sensitivity parameters that may be arbitrarily large or small (e.g. Dahabreh et al., 2022; Nguyen et al., 2017; Nie et al., 2021), and/or invoke parametric assumptions to model the estimated bias from moderators (e.g. Dahabreh et al., 2019; Duong et al., 2023).
In the following paper, we introduce a sensitivity analysis framework for unobserved moderators when using a weighted estimator for generalizing or transporting a causal effect. The proposed framework builds on the sensitivity analysis literature from observational studies (Cinelli & Hazlett, 2020; Hong et al., 2021; Shen et al., 2011), as well as existing sensitivity analysis approaches for generalizing or transporting an estimated treatment effect (Nguyen et al., 2017; Nie et al., 2021) and provides several key contributions.
First, we demonstrate that the bias of a weighted estimator may be decomposed into three bounded components. We show that two of the components are standardized representations of the omitted variable’s relationship with (1) the individual-level treatment effect and (2) the selection mechanism. These two components serve as the sensitivity parameters in the proposed framework. The last component is a scaling factor that measures how much inherent treatment effect heterogeneity and imbalance there is in the data and can be directly upper bounded using observed data. To help researchers account for estimation uncertainty, we introduce a bootstrapping-based approach for researchers to simultaneously consider not only potential bias that would occur from omitting a variable but also changes in statistical inference. Notably, the proposed sensitivity framework does not require any additional assumptions on the distributional or functional form of the observed data generating process (such as in Ding & VanderWeele, 2016 and Imbens, 2003), or additional parametric assumptions for modelling the bias (such as in Dahabreh et al., 2019 and Duong et al., 2023), thereby providing a large degree of flexibility for when the framework can be applied.
Second, we introduce several sensitivity tools to help researchers conduct their sensitivity analysis in transparent and interpretable ways. While sensitivity tools have become more prevalent in outcome-based sensitivity analyses (e.g. Cinelli & Hazlett, 2020; Zheng et al., 2021), such approaches do not currently exist for weighted estimators and are important in helping researchers reason about the potential bias from omitting a variable. The first approach is a graphical summary of sensitivity in the form of bias contour plots. The second is a numerical summary of sensitivity, and extends the robustness value from Cinelli and Hazlett (2020) for the weighted estimator setting. The robustness value summarizes the uncertainty in an estimate due to confounding from selection. Finally, we propose a formal benchmarking procedure that leverages observed covariates to posit plausible parameter values and allows researchers to incorporate their substantive knowledge for the relative strength of moderators. We provide extensions of the sensitivity analysis and sensitivity tools for the class of augmented weighted estimators.
The article is organized as follows. Section 2 introduces the notational framework, identifying assumptions, related literature, and the running example. Section 3 formalizes the proposed sensitivity analysis framework. In Section 4, we discuss three different tools that researchers can use to conduct the sensitivity analysis. Section 5 extends the framework for the class of augmented weighted estimators. Section 6 concludes. Proofs and extensions are provided in the Appendix of the online supplementary material.
2 Background
2.1 Notation and set-up
To begin, we define an infinite super-population from which the target population and the experimental sample are drawn. We define the target population as a sample of $N$ units, drawn i.i.d. from the infinite super-population. Following Buchanan et al. (2018), we define the experimental sample of $n$ units as a potentially biased i.i.d. sample from the infinite super-population. Define $S_i \in \{0, 1\}$ as an indicator for whether the unit is in the experimental sample (i.e. $S_i = 1$ when unit $i$ is in the experiment, and $S_i = 0$ otherwise), and let $\mathcal{S}$ denote the set of indices for units included in the experimental sample.
Let $T_i$ be a binary treatment assignment variable, where $T_i = 1$ for units assigned to treatment, and $T_i = 0$ for control. We assume full compliance, such that treatment assigned implies treatment received, and following the potential outcomes framework, define $Y_i(t)$ to be the potential outcome when unit $i$ receives treatment $t$, where $t \in \{0, 1\}$, with observed outcome $Y_i = Y_i(T_i)$ (Neyman, 1923; Rubin, 1974). Throughout the article, we make the standard assumptions of no interference and that treatments are identically administered across all units (i.e. SUTVA, as defined in Rubin, 1980). We assume a set of pre-treatment covariates $\mathbf{X}_i$ exists across both the experimental sample and the target population. Finally, we define the individual-level treatment effect as the difference between the potential outcomes of unit $i$ (i.e. $\tau_i = Y_i(1) - Y_i(0)$). Because we can never observe both potential outcomes of a specific unit, the individual-level treatment effect is unidentifiable (Holland, 1986).
To formalize, we assume both the experimental sample and the target population are drawn i.i.d. from an infinite super-population. When the experimental sample is a biased sample from the super-population, the sampling distributions for the experimental sample and the target population will not be the same (i.e. $P(\mathbf{X}_i \mid S_i = 1) \neq P(\mathbf{X}_i)$).
The sample average treatment effect (SATE) is defined as the average treatment effect across the experimental sample (i.e. $\text{SATE} = \frac{1}{n} \sum_{i \in \mathcal{S}} \tau_i$). Assuming random treatment assignment, a simple difference-in-means estimator can be used to estimate the SATE:

$$\hat{\tau}_{\text{DiM}} = \frac{1}{n_1} \sum_{i \in \mathcal{S}} T_i Y_i - \frac{1}{n_0} \sum_{i \in \mathcal{S}} (1 - T_i) Y_i,$$

where $\mathcal{S}$ represents the set of indices that correspond to units in the experimental sample (i.e. $\mathcal{S} = \{i : S_i = 1\}$). The population (or target) average treatment effect (PATE) is the causal quantity of interest, formally defined as:

$$\text{PATE} = \mathbb{E}[\tau_i],$$

where the expectation is taken over the realized, finite-sample target population. Researchers may instead treat the estimand of interest as the average treatment effect across the infinite super-population, rather than the realized population. Furthermore, researchers may be interested in settings in which the target population and the experimental sample are drawn from different super-populations (i.e. the transportability setting). The proposed sensitivity analysis extends easily to both of these settings. See Appendix A of the online supplementary material and Huang et al. (2023) for a more thorough discussion.
If the experimental sample is randomly drawn from the super-population, then $\hat{\tau}_{\text{DiM}}$ is an unbiased estimator for the PATE. However, in most settings, the experimental sample is not representative of the target population, and experimental results cannot be directly extrapolated to the population (Cole & Stuart, 2010; Nguyen et al., 2017; Olsen et al., 2013). In these settings, an additional identifying assumption is necessary to recover the PATE from the experimental sample.
Assumption 1 (Conditional Ignorability of Sampling). $\tau_i \perp\!\!\!\perp S_i \mid \mathbf{X}_i$.
Assumption 1 states that, conditioned on the set $\mathbf{X}_i$, the distribution of the individual-level treatment effects in the sample will be equivalent to the distribution of individual-level treatment effects in the population (Kern et al., 2016). Following Egami and Hartman (2019), we refer to the set of covariates that allow the sampling mechanism to be conditionally independent from the individual-level treatment effect as the separating set. Conditional ignorability can be relaxed for a weaker assumption of mean exchangeability (i.e. $\mathbb{E}[\tau_i \mid S_i = 1, \mathbf{X}_i] = \mathbb{E}[\tau_i \mid \mathbf{X}_i]$); importantly, mean exchangeability will still rely on researchers fully measuring the separating set (Dahabreh et al., 2019; Degtiar & Rose, 2023). Throughout the article, we focus on Assumption 1, as it is commonly invoked in the generalizability literature, but note that the same framework could be applied in settings when researchers are leveraging mean exchangeability. See Appendix A.2 of the online supplementary material for more discussion.
In addition to Assumption 1, we assume positivity (Rosenbaum & Rubin, 1983).

Assumption 2 (Positivity). $P(S_i = 1 \mid \mathbf{X}_i = x) > 0$ for all $x$ in the support of the target population.
Violations of the positivity assumption result in attempting to generalize beyond the support of the data (see Stuart et al., 2011 and Tipton, 2014 for more discussion).
The most common approach to estimating the PATE is through a weighted estimator, where the observations in the experimental sample are re-weighted to resemble the target population (Olsen et al., 2013; Stuart et al., 2011):

$$\hat{\tau}_W = \frac{1}{n_1} \sum_{i \in \mathcal{S}} w_i T_i Y_i - \frac{1}{n_0} \sum_{i \in \mathcal{S}} w_i (1 - T_i) Y_i,$$

where the weights $w_i$ are defined as the sampling weights (i.e. $w_i \propto 1 / P(S_i = 1 \mid \mathbf{X}_i)$), and $n_1$ and $n_0$ are the number of units in the treatment and control groups, respectively. Weights are often estimated using logistic regression (Buchanan et al., 2018; Cole & Stuart, 2010; Stuart et al., 2011). Recently, alternative weighting methods have been proposed, including more general balancing methods, such as entropy balancing, which adjust for distributional differences between the experimental sample and population observations without explicitly modelling the underlying probability function (Hainmueller, 2012; Josey et al., 2021; Lu et al., 2021; Särndal et al., 2003; see Ben-Michael et al., 2020 for more discussion). Importantly, the proposed sensitivity analysis applies not only to inverse propensity score weights, but also to a large class of balancing weights.
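To fix ideas, the following is a minimal Python sketch of a stabilized (Hájek-style) version of this weighted estimator, assuming the weights have already been estimated; the function and variable names are ours, not from the paper or its accompanying software.

```python
import numpy as np

def weighted_pate_estimate(y, t, w):
    """Stabilized (Hajek-style) weighted difference in means: the experimental
    sample is re-weighted by w so that it resembles the target population."""
    treated, control = (t == 1), (t == 0)
    mu1 = np.sum(w[treated] * y[treated]) / np.sum(w[treated])
    mu0 = np.sum(w[control] * y[control]) / np.sum(w[control])
    return mu1 - mu0
```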
In practice, researchers estimate the PATE under the assumption that they have correctly identified the full separating set (e.g. Egami & Hartman, 2022; Kern et al., 2016). When Assumption 1 holds, the weighted estimators will be consistent estimators for PATE. However, violations of this assumption can result in biased estimation. The goal of this article is to formalize a framework for assessing the sensitivity of PATE estimates to omitting a moderator from the separating set (i.e. a variable missing from the separating set necessary for Assumption 1 to hold). Throughout the article, we will focus our discussion on the setting in which researchers are using weighted estimators but provide extensions for augmented weighted estimators in Section 5.
2.2 Running example: Jobs Training Partnership Act
To enrich our discussion of the sensitivity analysis, we will use a set of experiments conducted on the Jobs Training Partnership Act (JTPA) as a running example throughout the article. The national JTPA study ran from 1987 to 1989 and assessed the effectiveness of the jobs training programs in helping individuals in the study find employment and increase their earnings. The original study was conducted across 16 different experimental sites. Individuals were first interviewed to determine whether or not they were eligible for JTPA services; those deemed eligible were assigned randomly to treatment and control using a 2:1 ratio. Individuals assigned to treatment were given access to JTPA services, while those assigned to control were told they were ineligible for the program. Following treatment assignment, a follow-up survey was conducted 18 months later, in which individuals were asked about their earnings (Bloom et al., 1993). We focus our analysis on the subset of adult women, the largest target group within the JTPA study.
We leverage the nature of the original multi-site experiment to perform an empirical validation exercise for the sensitivity analysis. More specifically, we pick one of the 16 experimental sites and generalize the estimated effect of JTPA access on earnings from this site to the remaining 15 sites. The target PATE is defined as the average treatment effect across the units in the other 15 experimental sites. This allows us to evaluate the actual error that is incurred from generalizing. To estimate the sample selection weights, we use entropy balancing across a set of pre-treatment covariates measured in the baseline survey (Hainmueller, 2012; Josey et al., 2021). Entropy balancing directly optimizes balance on moment conditions specified by the researcher (i.e. the average covariate value in the experimental sample, vs. the average covariate value in the target population) to estimate the weights, instead of first estimating the probabilities of selection into sample. The moment conditions typically include marginal moments of the distribution but can also include moment conditions defined by the joint distributions of covariates (e.g. Huang et al., 2022). We weight on previous earnings, age, hourly wage, years of education, whether or not the individual graduated high school (or has a GED), whether or not the individual is married, and indicators for whether the individual is black or Hispanic.
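As a rough sketch of the entropy balancing step (not the estimation code used in the study), the snippet below solves the standard dual problem from Hainmueller (2012): weights proportional to $\exp(\lambda^\top x_i)$, with $\lambda$ chosen so that the weighted sample moments match the target population moments.

```python
import numpy as np
from scipy.optimize import minimize

def entropy_balance(X_sample, target_means):
    """Entropy balancing: find weights w_i proportional to exp(lambda'x_i)
    such that the weighted covariate means in the sample equal the target
    population means. Minimizes the log-sum-exp dual objective, whose
    gradient is exactly the residual moment imbalance."""
    Xc = X_sample - target_means  # center covariates at the target moments

    def dual(lam):
        return np.log(np.sum(np.exp(Xc @ lam)))

    res = minimize(dual, x0=np.zeros(X_sample.shape[1]), method="BFGS")
    w = np.exp(Xc @ res.x)
    return w / w.mean()  # normalize so the weights average to one
```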
To illustrate the sensitivity analysis, we examine the site of Coosa Valley, Georgia, which consists of 788 individuals, 519 of whom were assigned to treatment, with the remainder in control. The target population (i.e. the other 15 experimental sites) consists of 5,314 individuals. To showcase the performance of the sensitivity analysis across alternative experimental sites, we also conduct the sensitivity analysis on the other 15 experimental sites from JTPA. The results are provided in Appendix D of the online supplementary material.
The within-site estimated impact of JTPA access on earnings in Coosa Valley, Georgia is $1,630. After weighting, the estimated impact of JTPA access on earnings is $2,810 (Table 1). In the following sections, we will introduce a sensitivity framework that allows researchers to assess how robust the weighted estimate is to unobserved confounders.
Table 1. Estimates of impact of JTPA access on earnings, generalizing the estimated effect from the site of Coosa Valley, Georgia to the other 15 experimental sites

| | Unweighted | Weighted |
|---|---|---|
| Impact of JTPA access on earnings* | 1.63 (0.95) | 2.81 (1.21) |

*Estimates reported in thousands of USD.
Note. Standard errors are reported in parentheses.
3 Sensitivity analysis for weighted estimators
In the following section, we will introduce a sensitivity analysis for weighted estimators when omitting a confounder from the weight estimation.
3.1 Bias of a weighted estimator when omitting a moderator
We consider the sensitivity of a weighted estimator to a treatment effect moderator that has been omitted in the estimation of the weights. To begin, we formally define the minimum separating set as $\{\mathbf{X}_i, V_i\}$, where $\mathbf{X}_i$ is observed, and $V_i$ is not. In other words, for the weighted estimator to be unbiased, both $\mathbf{X}_i$ and $V_i$ would have to be included in the weights; however, we omit $V_i$. We define the weights that include only $\mathbf{X}_i$ as $w_i$, and define the ideal weights $w_i^*$ as weights that include both $\mathbf{X}_i$ and $V_i$:

$$w_i \propto \frac{1}{P(S_i = 1 \mid \mathbf{X}_i)}, \qquad w_i^* \propto \frac{1}{P(S_i = 1 \mid \mathbf{X}_i, V_i)}. \qquad (5)$$
As defined in equation (5), both $w_i$ and $w_i^*$ represent the large-sample limits of inverse propensity score weights. As such, estimation error in the weights is not explicitly accounted for in this framework, though misspecification concerns can be addressed with the sensitivity analysis if researchers can write the error as an omitted variable problem. For example, if a linear probability model is used, $V_i$ can include non-linear functions of $\mathbf{X}_i$ that matter for modelling selection. Furthermore, in defining the estimated and ideal weights as such, the proposed sensitivity framework will not account for settings in which researchers naively use uniform weights (i.e. $w_i = 1$ for all units), or settings in which the ideal weights are uniform (i.e. $w_i^* = 1$ for all units). Finally, we define $\varepsilon_i$ as the linear error in the weights from omitting $V_i$ (i.e. $\varepsilon_i = w_i - w_i^*$).
Because the proposed framework accounts for the error in the large-sample limits of the inverse propensity score weights, the sensitivity analysis may be applied not only in settings when researchers are estimating inverse propensity score weights but also for a wide class of balancing weights that asymptotically converge to the same large-sample limits. For example, the running example utilizes entropy balancing, which Zhao and Percival (2016) proved implicitly estimates propensity score weights using a modified loss function. We refer readers to Wang and Zubizarreta (2020), Soriano et al. (2021), and Ben-Michael et al. (2020) for more discussion on the connection between balancing weights and inverse-propensity score weighting, and Appendix A.3 of the online supplementary material for more discussion on applying the sensitivity framework for balancing weights.
Throughout, consistent with Shen et al. (2011) and Hong et al. (2021), we will refer to bias as the expectation of the estimator minus the true value (i.e. true statistical bias). The bias of a weighted estimator from omitting a moderator is a function of $\varepsilon_i$ and the degree to which this error term is related to treatment effect heterogeneity and the sampling process. The following theorem formalizes this.
Theorem 3.1 (Bias of a Weighted Estimator from Omitting a Moderator). Define $R^2_\varepsilon := \mathrm{var}(\varepsilon_i) / \mathrm{var}(w_i^*)$ and $\rho_{\varepsilon,\tau} := \mathrm{cor}(\varepsilon_i, \tau_i)$. Then, for $R^2_\varepsilon < 1$:

$$\text{Bias}(\hat{\tau}_W) = \rho_{\varepsilon,\tau} \cdot \sqrt{\frac{R^2_\varepsilon}{1 - R^2_\varepsilon}} \cdot \sqrt{\mathrm{var}(w_i)\,\mathrm{var}(\tau_i)}. \qquad (6)$$
Theorem 3.1 identifies the three drivers of bias in a weighted estimator when a treatment effect moderator is omitted in the weight estimation: (1) the remaining imbalance in the omitted moderator (i.e. $R^2_\varepsilon$), (2) the correlation between $\varepsilon_i$ and the individual-level treatment effect (i.e. $\rho_{\varepsilon,\tau}$), and (3) a scaling factor, which is a function of the variance in the estimated weights and the amount of treatment effect heterogeneity (i.e. $\sqrt{\mathrm{var}(w_i)\,\mathrm{var}(\tau_i)}$). Theorem 3.1 provides a natural foundation for a sensitivity analysis. In particular, $R^2_\varepsilon$ and $\rho_{\varepsilon,\tau}$ will serve as our sensitivity parameters, while the scaling factor can be conservatively bounded using observed data. We show in Section 5 that a similar bias decomposition holds for augmented weighted estimators. Finally, we note that the derived bias expression will be the exact bias when researchers are using a Horvitz–Thompson style weighted estimator. In cases when researchers are using a stabilized weighted estimator, there will be finite-sample bias of order $O(1/n)$. However, the finite-sample bias will be dominated by the bias incurred from omitting a moderator from the weights (see Lunceford & Davidian, 2004; Miratrix et al., 2013; Rosenbaum, 2010 for more discussion).
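For a hypothesized pair of sensitivity parameters, the bias in equation (6) is a one-line computation; the sketch below (with our naming) also indicates how a bias-adjusted estimate would be formed.

```python
import numpy as np

def omitted_moderator_bias(r2_eps, rho, var_w, var_tau):
    """Bias of the weighted estimator from Theorem 3.1:
    rho * sqrt(R2 / (1 - R2)) * sqrt(var(w) * var(tau)), for R2 < 1."""
    return rho * np.sqrt(r2_eps / (1.0 - r2_eps)) * np.sqrt(var_w * var_tau)

# Bias-adjusted estimate for a hypothesized (R2, rho) pair, e.g.:
# tau_adjusted = tau_hat - omitted_moderator_bias(0.1, 0.3, var_w_hat, var_tau_bound)
```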
For the setting in which $R^2_\varepsilon = 1$, the bias decomposition in equation (6) will be undefined, and researchers will have to use an alternative decomposition, given in equation (7) (i.e. $\text{Bias}(\hat{\tau}_W) = \rho_{\varepsilon,\tau} \sqrt{\mathrm{var}(\varepsilon_i)\,\mathrm{var}(\tau_i)}$). However, we note that in order for the parameter $R^2_\varepsilon$ to be equal to 1, researchers would have to include covariates that are exactly orthogonal to the moderator $V_i$, and completely unrelated to the selection process. Thus, while it is mathematically possible to be in this setting, it is practically implausible. An alternative setting in which $R^2_\varepsilon$ could equal 1 is if researchers posit naive, uniform weights. However, our definition for the estimated and ideal weights rules out this scenario.
3.2 Interpreting the parameters
In the following subsection, we discuss the interpretation of each of the sensitivity parameters. Instead of relying on unbounded sensitivity parameters (e.g. Hong et al., 2021; Shen et al., 2011), the proposed sensitivity analysis uses a correlation value and an $R^2$ measure to represent how related the moderator is to the individual-level treatment effect and the selection mechanism. Both of these parameters are standardized and scale-invariant; as such, they are not dependent on the distribution of the underlying outcomes. This is in contrast to alternative sensitivity approaches (e.g. Nie et al., 2021; Rosenbaum, 1987; Zhao et al., 2019), which rely on an unbounded sensitivity parameter, one that can be arbitrarily large or small, to constrain a worst-case bound on the error from omitting a variable. Recent work by Huang and Pimentel (2022) and Jin et al. (2022) demonstrates that the magnitude of the underlying sensitivity parameter in these aforementioned frameworks is susceptible to the scale and distribution of both the outcomes and the underlying omitted variable, which can make it difficult to reason about in practice. Thus, our proposed sensitivity framework can make it easier for researchers to reason about plausible sensitivity parameters, especially when paired with the sensitivity tools introduced in Section 4.
3.2.1 Variation in ideal weights explained by the error term ($R^2_\varepsilon$)

The term $R^2_\varepsilon$ is defined as the ratio of variances between the error term $\varepsilon_i$ and the ideal weights $w_i^*$. In the following lemma, we show that the variation in the true weights can be decomposed into two components: the variation explained by the estimated weights, and the variation explained by the error term $\varepsilon_i$; therefore, $R^2_\varepsilon$ is bounded on the interval of 0 and 1. We can interpret $R^2_\varepsilon$ as the proportion of variation in the true weights explained by the error term $\varepsilon_i$.
Lemma 3.1 (Variance Decomposition of $w_i^*$). $\mathrm{var}(w_i^*) = \mathrm{var}(w_i) + \mathrm{var}(\varepsilon_i)$.
The results of Lemma 3.1 follow from the fact that we may recover the estimated weights from projecting the ideal weights onto the space of the observed covariates $\mathbf{X}_i$. This is a general property of inverse propensity score weights.

Lemma 3.1 highlights that $R^2_\varepsilon$ measures how much residual imbalance remains in the omitted confounder, after controlling for the observed covariates (i.e. imbalance in $V_i$, conditional on $\mathbf{X}_i$). If the residual imbalance of the omitted moderator is relatively small, then the estimated weights $w_i$ will be close to the true weights $w_i^*$. As a result, $R^2_\varepsilon$ will be close to 0. In contrast, if the residual imbalance of the omitted moderator is large, then we expect $w_i$ to be very different from $w_i^*$, and $R^2_\varepsilon$ will be large, approaching 1. Consider the running example. The original study cited the latent variable of motivation as a potential moderator (Bloom et al., 1993). While we cannot include motivation directly in the weights, we have included variables such as educational attainment and previous earnings, which are likely correlated with motivation. If, by controlling for variables such as educational attainment and previous earnings, we have accounted for much of the imbalance in motivation, then including motivation in the weight estimation should result in weights similar to the estimated weights $w_i$, and $\mathrm{var}(\varepsilon_i)$ will be relatively small (i.e. $R^2_\varepsilon$ is close to zero).
3.2.2 Correlation between $\varepsilon_i$ and $\tau_i$ ($\rho_{\varepsilon,\tau}$)

The correlation between $\varepsilon_i$ and the individual-level treatment effect $\tau_i$ is a standardized measure for how much of the treatment effect heterogeneity $\varepsilon_i$ explains. When $\rho_{\varepsilon,\tau}$ is very high (i.e. $\rho_{\varepsilon,\tau} \to 1$), then units with a large $\tau_i$ are overweighted (a large $\varepsilon_i$ corresponds to a large $\tau_i$). Thus, in these settings, there will be positive bias. Conversely, if $\rho_{\varepsilon,\tau} \to -1$, the opposite would be true—we underweight units with a large individual-level treatment effect, which results in a negatively biased estimated PATE. If the correlation between the error term and the individual-level treatment effect were close to zero, then the imbalance in the omitted moderator is not related to treatment effect heterogeneity, and as such, omitting $V_i$ would not result in much bias.
While $\rho_{\varepsilon,\tau}$ is inherently bounded on the interval $[-1, 1]$, we can restrict the set of feasible correlation values to a tighter range.
Lemma 3.2 (Correlation Decomposition). $|\rho_{\varepsilon,\tau}| \leq \sqrt{1 - \mathrm{cor}(w_i, \tau_i)^2}$.

Lemma 3.2 demonstrates that $\rho_{\varepsilon,\tau}$ will be bounded between $\pm\sqrt{1 - \mathrm{cor}(w_i, \tau_i)^2}$. If the estimated weights can explain most of the variation in treatment effect heterogeneity, the additional variation that can be explained by adding in the omitted moderator must be small.
The correlation between the estimated weights and $\tau_i$ will take on large values when (1) the covariates contained in $\mathbf{X}_i$ explain much of the treatment effect heterogeneity and (2) the covariates that explain the treatment effect heterogeneity are imbalanced across the population and the experimental sample. To help provide intuition for this, consider the running example. If access to JTPA services was only effective for women who graduated high school, and educational attainment were imbalanced across the experimental sample and the population, then estimating weights on educational attainment would result in a large $\mathrm{cor}(w_i, \tau_i)$ value. However, if educational attainment were not very imbalanced across the experimental sample and population, even though educational attainment explains much of the variation in the treatment effect heterogeneity, $\mathrm{cor}(w_i, \tau_i)$ would be low. In such a scenario, the true $\rho_{\varepsilon,\tau}$ value should also be small; however, this would not be reflected in the bound.
Remark on estimating $\mathrm{cor}(w_i, \tau_i)$: In practice, it is not possible to directly calculate $\mathrm{cor}(w_i, \tau_i)$, since $\tau_i$ is unidentified. Researchers may conservatively estimate the correlation of $w_i$ and $\tau_i$ by using $\widehat{\mathrm{cov}}(w_i, \tau_i)$, which is identified by randomization, and an upper bound $\sigma^2_{\max}$ for $\mathrm{var}(\tau_i)$:

$$\widehat{\mathrm{cor}}(w_i, \tau_i) = \frac{\widehat{\mathrm{cov}}(w_i, \tau_i)}{\sqrt{\widehat{\mathrm{var}}(w_i) \cdot \sigma^2_{\max}}}.$$

Because $\mathrm{cor}(w_i, \tau_i)$ is a function of the variation in the individual-level treatment effect (i.e. $\mathrm{var}(\tau_i)$), if researchers use a more conservative estimate of $\mathrm{var}(\tau_i)$, this will subsequently lead to a more conservative estimate of $\mathrm{cor}(w_i, \tau_i)$, and by extension, a more conservative estimate for the bounds on $\rho_{\varepsilon,\tau}$. See Section 3.2.3 for details on specifying $\sigma^2_{\max}$.
3.2.3 Scaling factor ($\sqrt{\mathrm{var}(w_i)\,\mathrm{var}(\tau_i)}$)

The last term in the bias decomposition is a scaling factor, made up of the product of the variance of the estimated weights and the variance in the individual-level treatment effect (i.e. $\sqrt{\mathrm{var}(w_i)\,\mathrm{var}(\tau_i)}$). This term is not related to the moderator and is, instead, intrinsic to the underlying data generating process. However, it does increase or decrease our exposure to bias from omitting a moderator.

We consider both terms in the scaling factor. The first term, $\mathrm{var}(w_i)$, corresponds to how much inherent imbalance there is in the observed covariates between the experimental sample and the target population. As the variance of the estimated weights increases, the weights are accounting for larger distributional differences between the experimental sample and the target population, and the potential for bias also increases.

The second term in the scaling factor is the magnitude of the treatment effect heterogeneity ($\mathrm{var}(\tau_i)$). This is related to Meng (2018)'s 'problem difficulty.' More specifically, when there exists a large degree of treatment effect heterogeneity, the task of recovering the PATE becomes harder, and even small imbalances in the moderators can result in a large degree of bias. When there is less treatment effect heterogeneity, we have more leeway to mis-specify the weights without incurring large amounts of bias. In the most extreme case of no treatment effect heterogeneity, we need not adjust for any moderators to obtain unbiased estimation.

Because treatment effect heterogeneity is inherent to the underlying data generating process, $\mathrm{var}(\tau_i)$ is fixed, regardless of what variables are included in the weights. We apply the results from Ding et al. (2019) to show that $\mathrm{var}(\tau_i)$, while unidentifiable, can be bounded from the observed data using Fréchet–Hoeffding bounds (Fréchet, 1951; Hoeffding, 1941), with opportunities for tighter bounds in cases when researchers are willing to invoke additional assumptions about the potential outcomes (see Appendix A.4 of the online supplementary material for more details).

In general, to estimate a conservative upper bound for the scaling factor, researchers can directly estimate $\mathrm{var}(w_i)$ and an upper bound for $\mathrm{var}(\tau_i)$ (denoted as $\sigma^2_{\max}$).
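The sketch below implements one simple, conservative route to $\sigma^2_{\max}$: setting the unidentified covariance of the potential outcomes to its most negative admissible value. This is a Cauchy–Schwarz relaxation of the Fréchet–Hoeffding argument; Ding et al. (2019) derive tighter bounds.

```python
import numpy as np

def var_tau_upper_bound(y, t):
    """Conservative upper bound on var(tau_i). Since
    var(tau) = var(Y1) + var(Y0) - 2*cov(Y1, Y0), and cov(Y1, Y0) is
    unidentified but at least -sd(Y1)*sd(Y0), the bound is
    (sd(Y1) + sd(Y0))^2."""
    sd1 = np.std(y[t == 1], ddof=1)
    sd0 = np.std(y[t == 0], ddof=1)
    return (sd1 + sd0) ** 2
```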
3.3 Accounting for changes in inference
In practice, researchers are concerned about not only the resulting bias from omitting a moderator, but also potential changes to their inference. To account for changes in inference, we leverage the results from Huang and Pimentel (2022) to estimate confidence intervals for a specified set of sensitivity parameters using a percentile bootstrap. More formally, for any set of $\{R^2_\varepsilon, \rho_{\varepsilon,\tau}\}$ values, researchers can compute the associated confidence intervals of the adjusted point estimate. This approach allows researchers to simultaneously account for the bias that occurs from omitting a moderator, as well as the changes in uncertainty, without introducing additional sensitivity parameters. We provide details in Appendix A.5 of the online supplementary material. Researchers can compute the confidence intervals for the adjusted point estimates across a grid of $\{R^2_\varepsilon, \rho_{\varepsilon,\tau}\}$ values. Then, using the estimated confidence intervals, researchers can find the minimum bias that can occur before the intervals around the adjusted point estimate contain the null estimate, which would imply that omitting a moderator resulted in a change in the statistical significance of the estimated effect.
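A minimal sketch of the bootstrap step, reusing the helper functions from the earlier sketches; resampling the experimental sample and re-estimating the weights in each iteration is a simplification of the procedure detailed in Appendix A.5.

```python
import numpy as np

def adjusted_ci(y, t, X, target_means, r2_eps, rho, var_tau_bound,
                n_boot=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the bias-adjusted PATE
    estimate at a fixed (R2, rho) pair."""
    rng = np.random.default_rng(seed)
    n, draws = len(y), []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)  # resample the experimental sample
        w_b = entropy_balance(X[idx], target_means)
        tau_b = weighted_pate_estimate(y[idx], t[idx], w_b)
        bias_b = omitted_moderator_bias(r2_eps, rho, np.var(w_b), var_tau_bound)
        draws.append(tau_b - bias_b)
    return tuple(np.percentile(draws, [100 * alpha / 2, 100 * (1 - alpha / 2)]))
```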
3.4 Summary of the sensitivity framework
To summarize the sensitivity analysis framework thus far, we have parameterized the bias of a weighted estimator when omitting a moderator in the estimation of the weights with the following components: (1) an $R^2$ measure that is bounded between 0 and 1 (i.e. $R^2_\varepsilon$), (2) the correlation between the error term and the individual-level treatment effect (i.e. $\rho_{\varepsilon,\tau}$), and (3) a scaling factor that can be upper bounded. We summarize this below.

Step 1. Estimate an upper bound for $\mathrm{var}(\tau_i)$ (i.e. $\sigma^2_{\max}$).

Step 2. Using $\sigma^2_{\max}$, estimate $\sqrt{1 - \widehat{\mathrm{cor}}(w_i, \tau_i)^2}$ as a bound for $\rho_{\varepsilon,\tau}$ (Lemma 3.2).

Step 3. Vary $R^2_\varepsilon$ from 0 to 1.

Step 4. Vary $\rho_{\varepsilon,\tau}$ across the range $[-\sqrt{1 - \widehat{\mathrm{cor}}(w_i, \tau_i)^2},\ \sqrt{1 - \widehat{\mathrm{cor}}(w_i, \tau_i)^2}]$.

Step 5. Evaluate the bias (Theorem 3.1) and uncertainty (Table 4 in Appendix A.5 of the online supplementary material).
4 Tools for sensitivity analysis
In the following section, we provide different tools that researchers can use to help understand the degree of sensitivity associated with their estimated effects. We introduce two summary measures: (1) a graphical representation of sensitivity, in the form of bias contour plots and (2) a numerical measure, referred to as a robustness value, which summarizes how much confounding must be present for an omitted moderator to change the estimated effect. To assess the plausibility of parameter values, we introduce a formal benchmarking approach that allows researchers to use observed covariates to calibrate their understanding of potential sensitivity parameters. In Appendix A.6 of the online supplementary material, we provide an extreme scenario analysis that evaluates an upper bound for the bias in the extreme case that $\varepsilon_i$ is maximally correlated with the individual-level treatment effect.
4.1 Summary measures of sensitivity
4.1.1 Graphical summary: contour plots
A simple way to summarize and visualize the sensitivity of the point estimates is through the use of bias contour plots (see Figure 1). To generate the plots, the y-axis represents values that the correlation term $\rho_{\varepsilon,\tau}$ can take on (i.e. the estimated range from Lemma 3.2), and the x-axis represents values of $R^2_\varepsilon$ across the interval $[0, 1)$.
Figure 1. Bias contour plot for Coosa Valley, Georgia. The inner region, with a solid boundary line, represents the region for which the estimated effect will be equal to zero, or become negative. The outer region, with a dashed boundary line, represents the part of the plot in which the estimated effect will no longer be statistically significant. To aid in our discussion, we use formal benchmarking (introduced in Section 4.2) to estimate the parameter values for an omitted moderator with similar confounding strength as an observed covariate.
Furthermore, we recommend researchers shade in the 'killer confounder' region. With some abuse of terminology, we use the term killer confounder instead of killer moderator to highlight that we are concerned about the confounding effects from the omitted moderator. The killer confounder region thus represents the set of all $\{R^2_\varepsilon, \rho_{\varepsilon,\tau}\}$ values where the bias from omitting a moderator is large enough to substantively alter the estimated effect. Throughout the article, we consider two different types of killer confounders: (1) an omitted moderator that is strong enough to result in a change in the directional sign of a treatment effect, or bring the treatment effect to zero; and (2) an omitted moderator that has sufficient confounding strength to alter the statistical significance of the estimated effect. If the killer confounder region is large, then there is greater sensitivity to potential violations of the conditional ignorability assumption. If the region is small, there is less sensitivity.
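The sketch below illustrates how the contours and the killer confounder region might be drawn with matplotlib, reusing the bias formula from Theorem 3.1; the plotting choices are ours and purely cosmetic.

```python
import numpy as np
import matplotlib.pyplot as plt

def bias_contour_plot(tau_hat, var_w, var_tau_bound, rho_max):
    """Contours of the bias over (R2, rho), shading the region where the
    bias is large enough to push a positive estimate to zero or below."""
    r2 = np.linspace(0.001, 0.95, 200)           # x-axis: R2 of the error
    rho = np.linspace(-rho_max, rho_max, 200)    # y-axis: correlation
    R2, RHO = np.meshgrid(r2, rho)
    BIAS = RHO * np.sqrt(R2 / (1 - R2)) * np.sqrt(var_w * var_tau_bound)

    contours = plt.contour(R2, RHO, BIAS, colors="grey")
    plt.clabel(contours, inline=True, fontsize=8)
    plt.contourf(R2, RHO, (BIAS >= tau_hat).astype(float),
                 levels=[0.5, 1.0], alpha=0.3)   # killer confounder region
    plt.xlabel(r"$R^2_\varepsilon$")
    plt.ylabel(r"$\rho_{\varepsilon,\tau}$")
    plt.show()
```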
4.1.2 Numerical summary: robustness value
In practice, justifying whether the killer confounder region is large or small can be challenging. As such, we propose the robustness value as a standardized, numerical summary of how sensitive an estimated effect is to an omitted moderator that would change the substantive interpretation of the estimated treatment effect. This extends the robustness value proposed by Cinelli and Hazlett (2020) to weighted estimators.
The robustness value measures how strong a confounder must be in order for the bias to equal $q \times 100\%$ of the estimated effect:

$$\mathrm{RV}_q = \frac{1}{2}\left(\sqrt{a_q^4 + 4 a_q^2} - a_q^2\right), \qquad a_q = \frac{q \cdot |\hat{\tau}_W|}{\sqrt{\widehat{\mathrm{var}}(w_i) \cdot \sigma^2_{\max}}}.$$

Evaluating the robustness value at $q = 1$ provides a measure for the minimum confounding strength in order for the bias to equal the point estimate, which would result in the point estimate being equal to zero. $\mathrm{RV}_q$ is interpreted as the minimum amount of variation in treatment effect heterogeneity $\tau_i$ and the true sample selection weights $w_i^*$ that the error term $\varepsilon_i$ must explain (i.e. $R^2_\varepsilon = \rho^2_{\varepsilon,\tau} = \mathrm{RV}_q$) for the bias to be $q \times 100\%$ that of the point estimate. Similarly, we may evaluate the robustness value associated with the minimum confounding strength of a confounder that results in the estimated effect changing its statistical significance. We denote this as $\mathrm{RV}_{q,\alpha}$. More details and derivations are provided in Appendix C of the online supplementary material.
A key property of the robustness value is that it exists on a scale from 0 to 1. When the robustness value is close to 1, this implies that $\varepsilon_i$ must explain close to 100% of the variation in both $\tau_i$ and $w_i^*$ for the estimated effect to be reduced to zero. In contrast, if the robustness value is close to zero, then even if $\varepsilon_i$ explains only a small amount of the variation in both $\tau_i$ and $w_i^*$, the error from omitting a moderator will be strong enough to reduce the estimated effect to zero. While the robustness value cannot rule out the possibility of a killer confounder, it can help researchers discuss the plausibility of such a variable. Like the standard error, which summarizes our uncertainty due to sampling error, the robustness value serves as a summary measure of our uncertainty due to systematic bias.
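Setting $R^2_\varepsilon = \rho^2_{\varepsilon,\tau} = \mathrm{RV}_q$ in equation (6) and solving the resulting quadratic yields the closed form sketched below; this mirrors the derivation style of Cinelli and Hazlett (2020), with our variable names.

```python
import numpy as np

def robustness_value(tau_hat, var_w, var_tau_bound, q=1.0):
    """Solve RV / sqrt(1 - RV) = a for RV, where a is the target bias
    (q * |estimate|) divided by the scaling factor sqrt(var(w) * var(tau)).
    Substituting R2 = rho^2 = RV into the bias formula (6) gives
    bias = RV / sqrt(1 - RV) * scaling factor."""
    a = q * abs(tau_hat) / np.sqrt(var_w * var_tau_bound)
    return 0.5 * (np.sqrt(a**4 + 4 * a**2) - a**2)
```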
Geometric connection to bias contour plots
The robustness value is connected to the boundary of the killer confounder region. For example, if researchers are considering just changes to their point estimates, the killer confounder region would be defined by the part of the plot in which the bias is large enough to reduce the estimate to zero. The point on the boundary for which $R^2_\varepsilon = \rho^2_{\varepsilon,\tau}$ is representative of the robustness value $\mathrm{RV}_{q=1}$. The same interpretation applies if researchers define the killer confounder region with respect to the minimum bias associated with a change in the statistical significance of their estimated effect. The boundary of the killer confounder region represents the set of all potential parameter values associated with a killer confounder. As such, we recommend researchers report both the robustness value and the bias contour plots when performing sensitivity analysis.
4.1.3 Example: sensitivity summary measures in JTPA
We illustrate the proposed sensitivity summary measures in our running example. To conduct the sensitivity analysis, we use an estimated bound of $\sigma^2_{\max} = 29.01$ for $\mathrm{var}(\tau_i)$. (Details on how $\sigma^2_{\max}$ was chosen are provided in Appendix D of the online supplementary material.) Table 2 provides the different sensitivity statistics:
Table 2. Estimates and sensitivity statistics for the impact of JTPA access on earnings, Coosa Valley, Georgia

| | Unweighted | Weighted | $\mathrm{RV}_{q=1}$ | $\mathrm{RV}_{q=1,\alpha=0.05}$ |
|---|---|---|---|---|
| Impact of JTPA access on earnings* | 1.63 (0.95) | 2.81 (1.21) | 0.56 | 0.08 |

*Estimates reported in thousands of USD.
Note. Standard errors are reported in parentheses.
We see that the estimated robustness value is $\mathrm{RV}_{q=1} = 0.56$, which implies that the error in the weights from omitting a confounder must explain 56% of the variation in the individual-level treatment effect, as well as 56% of the variation in the ideal weights, in order for the treatment effect to be brought down to 0. Whether the robustness value is large or small depends on whether researchers believe that it is plausible for the error from omitting a confounder to explain 56% of the variation in both the ideal weights and the treatment effect heterogeneity. The estimated robustness value for a confounder that alters the statistical significance of the estimated effect is $\mathrm{RV}_{q=1,\alpha=0.05} = 0.08$, which is much lower. As such, if the error from omitting a confounder explains 8% of the variation in the ideal weights and the treatment effect heterogeneity, then the estimated effect will no longer be statistically significant.
We also examine a bias contour plot, in which we shade in blue the part of the plot for which the bias is large enough to reduce the estimated impact of JTPA access on earnings to zero or below (see Figure 1). The boundary of this region visualizes the full set of $\{R^2_\varepsilon, \rho_{\varepsilon,\tau}\}$ values that correspond to a confounder strong enough to reduce the estimated treatment effect to zero. For example, an omitted moderator that results in an error term that explains less than half the variation in the ideal weights (i.e. $R^2_\varepsilon < 0.5$) but is strongly correlated with the individual-level treatment effect (i.e. a $\rho_{\varepsilon,\tau}$ value near its upper bound) would reduce our estimate to zero. Similarly, a confounder that results in an error term that explains a large amount of the variation in the ideal weights (i.e. a large $R^2_\varepsilon$), but a smaller portion of the variation in the individual-level treatment effect, would also reduce our estimate to zero.

We additionally shade in light grey the region of the plot in which the point estimate will still be positive, but the estimated effect will no longer be statistically significant. We see that this dominates a much larger part of the plot, as we are now incorporating uncertainty.
4.2 Formal benchmarking to infer reasonable parameters
A challenge in sensitivity analysis is positing reasonable values for the sensitivity parameters to take on. Furthermore, justifying whether the killer confounder region of a bias contour plot, or the robustness value, is large or small can be challenging in practice. In the following subsection, we introduce a formal benchmarking approach for researchers to use observed covariates to calibrate their understanding of plausible parameter values using relative strength.
To begin, let $X_j$ be an observed covariate in $\mathbf{X}_i$. Define $\hat{\varepsilon}_i^{(j)}$ as the error term that compares the weights estimated using all covariates with the weights estimated using all the covariates, except for $X_j$:

$$\hat{\varepsilon}_i^{(j)} = w_i^{-(j)} - w_i,$$

where $w_i^{-(j)}$ is the set of weights estimated using all the covariates, except for $X_j$, and $w_i$ is the set of weights estimated using all available covariates $\mathbf{X}_i$. We define the amount of confounding strength an omitted moderator has by how much variation $\varepsilon_i$ explains in the ideal weights $w_i^*$ and the individual-level treatment effect $\tau_i$. Thus, to obtain formal benchmarks, we posit the amount of variation explained in $w_i^*$ and $\tau_i$ by $\varepsilon_i$, in comparison to $\hat{\varepsilon}_i^{(j)}$. More formally, define:

$$k_{R^2} = \frac{R^2_\varepsilon}{\hat{R}^2_{(j)}}, \qquad k_\rho = \frac{\rho_{\varepsilon,\tau}}{\hat{\rho}_{(j)}},$$

where the numerators (i.e. $R^2_\varepsilon$ and $\rho_{\varepsilon,\tau}$) correspond to the sensitivity parameters introduced in Section 3, and the denominators are the analogous quantities computed with $\hat{\varepsilon}_i^{(j)}$ in place of $\varepsilon_i$. $k_{R^2}$ represents how much relative variation in the true sample selection weights the residual imbalance in $V_i$ (i.e. $\varepsilon_i$) explains, relative to the residual imbalance in $X_j$ (i.e. $\hat{\varepsilon}_i^{(j)}$). If the residual imbalance in the omitted moderator is greater than the observed residual imbalance in the covariate $X_j$, then we expect $k_{R^2} > 1$. $k_\rho$ represents how correlated the individual-level treatment effect and the error term $\varepsilon_i$ are, relative to $\hat{\varepsilon}_i^{(j)}$. $k_{R^2}$ and $k_\rho$ intuitively represent the relative confounding strength of an observed covariate. When $k_{R^2} = k_\rho = 1$, then we say that an omitted moderator has equivalent confounding strength to the observed covariate $X_j$.
With a researcher-specified $k_{R^2}$ and $k_\rho$, we obtain the formally benchmarked sensitivity parameters. Theorem 4.1 formalizes this.

Theorem 4.1 (Formal Benchmarking for Sensitivity Parameters). Given relative strength parameters $k_{R^2}$ and $k_\rho$, the benchmarked sensitivity parameters are given by $R^{2,\text{bench}}_\varepsilon = k_{R^2} \cdot \hat{R}^2_{(j)}$ and $\rho^{\text{bench}}_{\varepsilon,\tau} = k_\rho \cdot \hat{\rho}_{(j)}$, where $\hat{R}^2_{(j)}$ and $\hat{\rho}_{(j)}$ are the analogues of $R^2_\varepsilon$ and $\rho_{\varepsilon,\tau}$, estimated from the observed data using $\hat{\varepsilon}_i^{(j)}$.
Theorem 4.1 allows researchers to estimate parameter values for an omitted moderator with a specified relative confounding strength to an observed covariate. There are several key points to highlight. First, the magnitude of the benchmarked parameter values is determined by the residual confounding from omitting a covariate that cannot already be explained by the other observed covariates. Thus, in interpreting the benchmarking results, we are considering the marginal contribution of an omitted moderator, after accounting for the observed covariates. Second, because both $R^2_\varepsilon$ and $\rho_{\varepsilon,\tau}$ are inherently bounded, $k_{R^2}$ and $k_\rho$ will also be bounded. As such, researchers can estimate the maximum confounding strength of an omitted moderator, relative to an observed covariate. Finally, Theorem 4.1 can be extended to a subset of covariates. This is helpful in cases when researchers are concerned about collinearity, or in settings when a subset of observed covariates (or interactions) is substantively known to be important. As such, researchers may estimate the effect of omitting a moderator with similar strength to the entire group of covariates.
While benchmarking can never be used to eliminate the existence of a killer confounder, benchmarking can help researchers reason about the plausibility that an omitted moderator, having already conditioned on the existing covariates, can reduce the estimated effect to zero. We elaborate on this point in the following subsection.
4.2.1 Using benchmarking to understand killer confounders
We will now detail how benchmarking can be employed to help researchers reason about the plausibility of a killer confounder. We propose two different approaches. The first approach compares the benchmarked bias with either the point estimate (to assess sensitivity to a confounder reducing the estimated effect to zero), or the minimum estimated bias that can result in a statistically insignificant effect. The second approach is to compare the benchmarking results with the robustness value (either $\mathrm{RV}_{q=1}$ or $\mathrm{RV}_{q=1,\alpha=0.05}$).

Minimum relative confounding strength. Benchmarking the sensitivity parameters allows researchers to estimate the resulting bias from omitting a confounder with fixed confounding strength relative to a covariate. We propose a natural summary measure, referred to as the minimum relative confounding strength (MRCS), for how much relative confounding strength an omitted variable must have to result in a killer confounder. If researchers define a killer confounder as a confounder strong enough to reduce their estimated effect to zero, the MRCS can be solved simply by dividing the point estimate by the estimated bias when $k_{R^2} = k_\rho = 1$:

$$\text{MRCS} = \frac{\hat{\tau}_W}{\widehat{\text{Bias}}(k_{R^2} = 1, k_\rho = 1)}. \qquad (11)$$

Similarly, if researchers are interested in killer confounders that would alter the statistical significance of their estimated effects, they can evaluate equation (11) using the estimated minimum bias threshold instead of the point estimate.

If the estimated MRCS is small (i.e. MRCS $< 1$), then this implies that an omitted moderator with weak confounding strength, relative to the covariate $X_j$, could lead to a killer confounder. If the MRCS is large (i.e. MRCS $> 1$), then this indicates that an omitted moderator must be stronger than the observed covariate to result in a killer confounder. The MRCS is an especially helpful measure when researchers have strong substantive priors for what may be important covariates.
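The sketch below illustrates the mechanics for a single observed covariate at equal relative strength ($k_{R^2} = k_\rho = 1$), reusing the earlier helper functions. The plug-in quantities, in particular the estimated individual-level treatment effects tau_i_hat, are our simplified stand-ins for the exact formulas in Theorem 4.1.

```python
import numpy as np

def benchmark_covariate(X, j, target_means, tau_i_hat, tau_hat):
    """Leave-one-out benchmarking for observed covariate j, at equal
    relative strength (k_R2 = k_rho = 1). tau_i_hat is a plug-in estimate
    of the individual-level treatment effects (itself an approximation)."""
    w_full = entropy_balance(X, target_means)
    keep = [k for k in range(X.shape[1]) if k != j]
    w_minus_j = entropy_balance(X[:, keep], target_means[keep])
    eps_j = w_minus_j - w_full                 # error from omitting X_j

    r2_bench = np.var(eps_j) / np.var(w_full)  # share of weight variation
    rho_bench = np.corrcoef(eps_j, tau_i_hat)[0, 1]
    bias_bench = omitted_moderator_bias(
        r2_bench, rho_bench, np.var(w_minus_j), np.var(tau_i_hat))
    mrcs = tau_hat / bias_bench                # equation (11)
    return r2_bench, rho_bench, bias_bench, mrcs
```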
Comparing benchmarking results with the robustness value

From benchmarking, researchers can estimate the necessary $k_{R^2}$ and $k_\rho$ values in order for the benchmarked parameters to reach $\mathrm{RV}_{q=1}$ (or $\mathrm{RV}_{q=1,\alpha=0.05}$). We denote these values as $k_{R^2}^{RV}$ and $k_\rho^{RV}$. The interpretation of $k_{R^2}^{RV}$ and $k_\rho^{RV}$ is similar to that of the MRCS; however, researchers can now look at the drivers of bias with respect to the confounder's relationship to the sample selection process and treatment effect heterogeneity separately.
4.2.2 Example: applying formal benchmarking in JTPA
To help assess plausible sensitivity parameters in the JTPA application, we perform formal benchmarking. Table 3 presents the results. For each of the covariates included in the weights, we estimate the benchmarked parameter values $\hat{R}^2_{(j)}$ and $\hat{\rho}_{(j)}$, the implied bias, the MRCS, and $k_{R^2}^{RV}$ and $k_\rho^{RV}$. To account for estimation uncertainty in our benchmarking results, we perform benchmarking across repeated bootstrap iterations, and estimate the percentage of benchmarked results that result in enough bias to either (1) reduce the estimated effect to zero or (2) change the statistical significance of the estimated effect. We provide the results in Appendix D.2 of the online supplementary material.
Table 3. Formal benchmarking results for generalizing from Coosa Valley, Georgia. The first set of MRCS, $k_{R^2}^{RV}$, and $k_\rho^{RV}$ columns corresponds to a killer confounder that reduces the estimated effect to zero; the second set corresponds to a change in statistical significance.

| Variable | $\hat{R}^2_{(j)}$ | $\hat{\rho}_{(j)}$ | Bias | MRCS (0) | $k_{R^2}^{RV}$ (0) | $k_\rho^{RV}$ (0) | MRCS (sig.) | $k_{R^2}^{RV}$ (sig.) | $k_\rho^{RV}$ (sig.) |
|---|---|---|---|---|---|---|---|---|---|
| Prev. earnings | 0.01 | 0.41 | 0.12 | 23.4 | 72.9 | 1.8 | 2.4 | 10.7 | 0.7 |
| Age | 0.00 | 0.04 | 0.00 | — | — | 18.2 | — | — | 7.0 |
| Married | 0.05 | 0.19 | 0.14 | 20.7 | 12.3 | 4.0 | 2.1 | 1.8 | 1.5 |
| Hourly wage | 0.03 | 0.24 | 0.14 | 20.6 | 20.2 | 3.1 | 2.1 | 3.0 | 1.2 |
| Black | 0.17 | 0.11 | 0.16 | 17.7 | 3.4 | 7.1 | 1.8 | 0.5 | 2.7 |
| Hispanic | 0.24 | 0.14 | 0.26 | 10.8 | 2.3 | 5.5 | 1.1 | 0.3 | 2.1 |
| HS/GED | 0.07 | 0.04 | 0.04 | 76.1 | 8.1 | 18.5 | 7.7 | 1.2 | 7.1 |
| Education | 0.07 | 0.10 | 0.09 | 30.0 | 7.5 | 7.6 | 3.0 | 1.1 | 2.9 |

Note. Point estimate ($\hat{\tau}_W$): 2.81; $\sigma^2_{\max} = 29.01$. The estimated bias is reported in thousands of USD.
From the benchmarking results, we see that omitting a confounder with equivalent confounding strength to the covariates previous earnings, whether or not the individual is Hispanic or Black, or hourly wage will result in the largest amounts of bias. This is consistent with the substantive findings from the original study, which reported strong subgroup effects when looking at race and previous earnings (Bloom et al., 1993). However, the magnitude of the biases from omitting these covariates is relatively low, with most ranging from 0.12 to 0.26 (in thousands of USD).

There are several takeaways to highlight from formal benchmarking. First, we see that considering both dimensions associated with an omitted moderator matters—i.e. its relationship with the individual-level treatment effect, and its relationship with the selection mechanism. In particular, omitting a confounder like previous earnings results in a relatively 'large' benchmarked correlation value of 0.41; however, the benchmarked $R^2$ value associated with previous earnings is relatively low, at 0.01. As such, the overall bias from omitting a variable like previous earnings is relatively low, at 0.12. In contrast, omitting a covariate like whether or not an individual is Black results in a relatively large benchmarked $R^2$ value of 0.17, but a smaller benchmarked correlation value of 0.11. As a result, the bias from omitting a variable like whether or not an individual is Black is also relatively low, at 0.16. By looking at both the $R^2$ and $\rho$ measures, we are able to obtain a more holistic view of the types of confounders that may lead to potential changes in our analysis.
Second, we see that there is a large degree of robustness to an omitted moderator being strong enough to reduce the estimated effect to zero. In particular, a confounder would have to be 10 to 20 times stronger than an observed covariate to reduce the estimated effect to zero. However, when accounting for uncertainty, we see that an omitted moderator 1.1 times as strong as whether or not an individual is Hispanic, or twice as strong as hourly wage would be strong enough to result in a statistically insignificant effect. As such, we conclude that while there is a large degree of robustness to a confounder reducing the point estimate to 0, there is some sensitivity to potential changes in the statistical significance of our estimated effect. Finally, we highlight that while the running example throughout this article focused on one experimental site in the JTPA study, we provide an illustration of the sensitivity analysis for all 16 experimental sites in Appendix D of the online supplementary material.
4.3 Practical considerations
It is important to emphasize the practical limits of the sensitivity tools. The proposed sensitivity framework provides researchers with different quantitative and graphical measures to assess the degree of robustness that is present in their point estimate. However, these tools cannot be used to rule out the existence of killer confounders. Akin to Cinelli and Hazlett (2020), we do not provide cutoffs for measures such as the robustness value or the minimum relative confounding strength. The value of the sensitivity summary measures and benchmarking is maximized in settings when researchers have a strong substantive understanding of their study and have collected explanatory covariates. In the absence of either, being able to reason about the plausibility of killer confounders will be more difficult. As such, we caution researchers against using these tools without also considering substantive judgment. The sensitivity framework provides a strong foundation for researchers to discuss the plausibility of killer confounders but should not be used in lieu of substantive understanding of the underlying covariates and context.
5 Sensitivity analysis for augmented weighted estimators
In the following section, we extend the proposed sensitivity analysis to the class of augmented weighted, doubly robust estimators. Doubly robust estimators are a popular approach used to help improve the robustness of estimators to potential misspecifications (Bang & Robins, 2005; Dahabreh et al., 2019; Tan, 2007). There are many different doubly robust estimators (Kang & Schafer, 2007), but we will focus on the augmented weighted estimator:

$$\hat{\tau}_{\text{aug}} = \frac{1}{N} \sum_{i \in \mathcal{P}} \hat{\tau}(\mathbf{X}_i) + \left[ \frac{1}{n_1} \sum_{i \in \mathcal{S}} w_i T_i \big(Y_i - \hat{Y}_i(1)\big) - \frac{1}{n_0} \sum_{i \in \mathcal{S}} w_i (1 - T_i) \big(Y_i - \hat{Y}_i(0)\big) \right],$$

where $\mathcal{P}$ represents the set of indices of the units in the target population, $\hat{\tau}(\mathbf{X}_i)$ is the estimated individual-level treatment effect, $\hat{Y}_i(t)$ denotes the fitted outcome prediction, and the weights are defined in the same manner as before. The augmented weighted estimator is semi-parametrically efficient and allows practitioners to model both the probability of sample selection and the treatment effect heterogeneity simultaneously (Robins et al., 1994). When at least one of these processes is specified correctly, the estimator will be unbiased and asymptotically consistent.
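A minimal sketch of an augmented estimator of this form is shown below, written with fitted outcome predictions m1(x) and m0(x) (our notation; the paper's exact normalization may differ).

```python
import numpy as np

def augmented_pate_estimate(y, t, w, m1_s, m0_s, m1_p, m0_p):
    """Augmented weighted estimator: the modelled effect averaged over the
    target population, plus a weighted correction built from the
    experimental sample's residuals. m1_s/m0_s are outcome-model
    predictions for the sample; m1_p/m0_p for the target population."""
    model_term = np.mean(m1_p - m0_p)          # outcome-model component
    treated, control = (t == 1), (t == 0)
    corr1 = np.sum(w[treated] * (y[treated] - m1_s[treated])) / np.sum(w[treated])
    corr0 = np.sum(w[control] * (y[control] - m0_s[control])) / np.sum(w[control])
    return model_term + corr1 - corr0
```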
In the following section, we introduce a sensitivity analysis for the augmented weighted estimator when omitting a confounder from the minimum separating set. We show that there are strong parallels between the sensitivity analysis for the augmented weighted estimator and the sensitivity analysis for the weighted estimator.
5.1 Bias formula
To begin, we show that the bias of an augmented weighted estimator when omitting a variable from the minimum separating set can be written as a function of three components.
Theorem 5.1 (Bias of Augmented Weighted Estimator). Let $\tilde{\tau}_i := \tau_i - \hat{\tau}(\mathbf{X}_i)$ denote the residual treatment effect heterogeneity. Then:

$$\text{Bias}(\hat{\tau}_{\text{aug}}) = \rho_{\varepsilon, \tilde{\tau}} \cdot \sqrt{\frac{R^2_\varepsilon}{1 - R^2_\varepsilon}} \cdot \sqrt{\mathrm{var}(w_i)\,\mathrm{var}(\tilde{\tau}_i)}.$$

There are several key takeaways from Theorem 5.1. First, the double robustness of the augmented weighted estimator is apparent from Theorem 5.1 by noting that if there is no error in the estimated weights (i.e. $\varepsilon_i = 0$), or there is no error in estimating the treatment effect heterogeneity (i.e. $\hat{\tau}(\mathbf{X}_i)$ is a consistent model for $\tau_i$), then $\tilde{\tau}_i$ will be made up of random noise, and the correlation between $\varepsilon_i$ and $\tilde{\tau}_i$ will be zero (i.e. $\rho_{\varepsilon, \tilde{\tau}} = 0$). Second, Theorem 5.1 highlights that the bias of an augmented weighted estimator from omitting a confounder is very similar to the bias of a weighted estimator (i.e. equation (6)). The primary difference is that instead of the individual-level treatment effect $\tau_i$, we are interested in $\tilde{\tau}_i$, which is the residual component of $\tau_i$ that cannot be explained by $\hat{\tau}(\mathbf{X}_i)$.
Researchers can adapt Theorem 5.1 to the case where they are not re-weighting the data at hand and are focused solely on modelling the individual-level treatment effect (e.g. g-formula estimators). This provides a helpful theoretical connection between the proposed framework and existing sensitivity analyses for outcome modelling approaches. For example, if we assume that the individual-level treatment effect follows a linear model, then we recover the results from Nguyen et al. (2017) (see Appendix A.7 of the online supplementary material for more details). Similarly, Theorem 5.1 can be thought of as a parameter amplification for previously proposed frameworks that utilize a single sensitivity parameter (e.g. Dahabreh et al., 2022, which proposes a sensitivity parameter as a function of an exponential tilting model). The bias decomposition introduced in Theorem 5.1 is not unique, and alternative parameterizations could be used. However, the combination of simple sensitivity parameters and the proposed sensitivity tools provides an interpretable and transparent way to conduct sensitivity analysis flexibly, free from parametric assumptions.
5.2 Sensitivity analysis for augmented weighted estimators
In the previous subsection, we showed that the primary differentiation between the bias formula for the augmented weighted estimator and the weighted estimator is $\tilde{\tau}_i$ (i.e. the residuals in the treatment effect model). This results in two new components in the augmented weighted estimator setting: $\rho_{\varepsilon, \tilde{\tau}}$ and $\mathrm{var}(\tilde{\tau}_i)$. The third component in the bias decomposition is $R^2_\varepsilon$, which is identical in both the weighted and augmented weighted estimator settings. We show in Appendix A.7 of the online supplementary material that bounds similar to the ones derived in Section 3 apply to this setting. As such, after estimating an adequate upper bound for $\mathrm{var}(\tilde{\tau}_i)$, researchers may vary both $R^2_\varepsilon$ and $\rho_{\varepsilon, \tilde{\tau}}$ across bounded ranges to assess the sensitivity of an augmented weighted estimator to omitted moderators. Similarly, the sensitivity tools in Section 4 can also be extended to the augmented weighted estimator case. Details are provided in Appendix A.7, and Appendix D.3 of the online supplementary material illustrates the sensitivity analysis using JTPA.
6 Conclusion
Generalizing or transporting causal effects from an experiment to a different, or larger, population requires researchers to correctly identify a separating set of pre-treatment covariates that allow the confounding effect of sample selection to be conditionally ignorable. When this separating set is not correctly identified, PATE estimation will be biased.
In this article, we formalize a sensitivity analysis framework for weighted estimators in the generalization or transportability setting, with extensions for augmented weighted estimators. We provide two key contributions in the article. First, we show that the bias from omitting a moderator can be decomposed into two sensitivity parameters and a scaling factor that can be bounded using the observed data. The two parameters are standardized measures of the omitted moderator’s relationship with (1) the selection process and (2) the treatment effect heterogeneity. Importantly, the framework allows researchers to simultaneously consider bias and changes in inference from omitting a variable. Second, we introduce a set of sensitivity tools that allow researchers to transparently summarize the uncertainty in their estimates from potential unobserved confounding. In particular, we propose both a visual summary tool in the form of bias contour plots and a numerical summary tool, in the form of a robustness value. Furthermore, we introduce benchmarking, which allows researchers to use observed covariates to reason about the plausibility of parameter values. The proposed framework and sensitivity tools collectively allow researchers to encode their substantive knowledge to quantitatively reason about sensitivity in their estimated effects.
There are several natural avenues of future research. First, the proposed sensitivity framework assumes that there are no overlap violations between the experimental sample and the target population. However, a practical concern that often arises is when there are certain subgroups in the target populations that are completely omitted from the experimental sample. As such, an important next step is to additionally consider potential violations to overlap when generalizing or transporting causal effects. Second, recent work in observational studies introduced the notion of design sensitivity to help researchers plan their analyses for robustness to omitted confounding (e.g. Howard & Pimentel, 2021; Huang et al., 2023; Rosenbaum, 2010). Thus, an interesting avenue of future work could introduce the notion of design sensitivity for external validity, which would allow researchers to examine how certain design decisions impact robustness to confounding when transporting or generalizing experiments to target populations of interest.
Acknowledgments
I would like to thank Erin Hartman, Chad Hazlett, Leonard Wainstein, Eli Ben-Michael, Dan Soriano, Avi Feller, Peng Ding, Sam Pimentel, and Carlos Cinelli for their helpful feedback. Furthermore, I would also like to thank the participants in the UCLA Causal Inference reading group and the Berkeley Causal Inference group.
Funding
Part of this work was conducted with support by the National Science Foundation Graduate Research Fellowship under Grant No. 2146752. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
Data availability statement
The data underlying this article are publicly available through the W.E. Upjohn Institute for Employment Research. All of the code used to generate the analysis in the article can be found at the following Github repository github.com/melodyyhuang/senseweight.
Supplementary material
Supplementary material is available online at Journal of the Royal Statistical Society: Series A.
Author notes
Conflict of interest: None declared.