Abstract

We consider the problem of designing a prospective randomized trial in which the outcome data will be self-reported and will involve sensitive topics. Our interest is in how a researcher can adequately power her study when some respondents misreport the binary outcome of interest. To correct the power calculations, we first obtain expressions for the bias and variance induced by misreporting. We model the problem by assuming each individual in our study is a member of one “reporting class”: a true-reporter, false-reporter, never-reporter, or always-reporter. We show that the joint distribution of reporting classes and “response classes” (characterizing individuals’ response to the treatment) will exactly define the error terms for our causal estimate. We propose a novel procedure for determining adequate sample sizes under the worst-case power corresponding to a given level of misreporting. Our problem is motivated by prior experience implementing a randomized controlled trial of a sexual-violence prevention program among adolescent girls in Kenya.

In a wide swath of social science and public health research, outcome data are drawn from self-reported survey responses (3). Accurate responses are paramount to the integrity of such research, and, as a result, there is a vast literature discussing best practices for surveying sensitive topics, including specially designed survey instruments (4). Nonetheless, there are myriad examples of surveys whose conclusions were skewed by subpopulations of misreporters, especially when respondents were adolescents (1).

Statistical corrections have been developed for a variety of problems involving mismeasured data (5). Modern methods move beyond the classical assumption of zero correlation between errors and outcomes, seeking instead to adjust model-based procedures to mitigate bias and properly account for uncertainty (6). In causal inference, there has been substantial research interest in the issue of treatment compliance (7, 8), and the related issue of measurement error in treatment assignment variables (2, 9). Measurement error in outcome variables has been comparatively understudied.

The motivation for this work arises from prior experience designing and analyzing a randomized trial of a sexual-violence prevention program targeting adolescent girls in Kenya (10). In that study, girls reported key outcomes via surveys. Inaccurate reports represented a threat to causal inference.

Our goals in this paper are threefold. First, we define a mathematical framework for reasoning about the error induced in causal estimation due to measurement error in binary outcome variables. We show that the joint distribution of “reporting classes” (defining how individuals’ reported outcomes differ from their true outcomes) and “response classes” (defining how individuals respond to treatment) fully characterizes the bias and variance of the causal estimator. This yields insights into how to mitigate misreporting error in the design phase of the experiment. Second, we describe a method we have developed for determining adequate sample size to achieve a desired detection power, in the presence of misreporting. Third, we discuss a special case in which an analyst can account for misreporting directly by including a bias correction.

The remainder of this paper proceeds as follows. Related Work reviews prior literature on measurement error and response bias. Definitions and Examples introduces our definition of reporting classes and provides motivating examples. Theoretical Results details our main theoretical results, including a novel procedure for updating sample size calculations in the presence of misreporting. Methods in Context discusses how these results can be used by practitioners, and includes a simulation study as well as bias correction results. We then offer concluding remarks.

RELATED WORK

Our work builds on themes from the psychometrics literature, wherein researchers design survey measurements and analysis practices to more accurately measure sensitive data (11). Some survey respondents are hesitant to report sensitive data, or could be harmed if their individual responses were observed. Droitcour and Larson (4) address this problem with the three-card method, which protects individual privacy while allowing for unbiased population estimates. Cimpian and Timmer (1) address bias in estimates due not to privacy concerns but rather to mischievous responders. Their work addresses students who falsely self-report as lesbian, gay, bisexual, transgender, queer, or questioning (LGBTQ) and correspondingly skew estimates in studies that aim to understand and serve this community. Instead of changing the survey instrument, the authors leverage statistical techniques, such as boosted regressions with outlier removal, to address bias in the responses.

Regarding measurement error in treatments, Flegel et al. (12) show how errors in exposure measurement can yield differential misclassification, wherein misclassification probabilities for disease prediction differ by disease status. They focus on nondifferential measurement error, meaning the errors themselves do not differ by study arm (as opposed to differential errors, where measurement error can differ in the treatment and control groups). As a solution, Richardson et al. (13) suggest finding an unexposed reference group and estimating a disease score that quantifies unintended exposure effects. This can confer conditional independence between the potential outcomes under no exposure and the measured covariates. In another approach, Cole et al. (14) address measurement error in exposures for longitudinal observational studies, suggesting either multiple imputation or Bayesian methods.

Our work considers measurement error in outcomes. Tennekoon and Rosenman (15) provide a parametric model for misclassified binary outcomes that vary with covariates, showing how to find consistent estimates for model parameters and individual misclassification probabilities. VanderWeele and Li (16) propose a sensitivity analysis to determine how strong differential measurement error in either the treatment or the outcome would need to be to nullify the entire result. Their method compares the true effect of the exposure with the association between the exposure and the mismeasured outcome, on scales that depend on risk ratios of the true outcome probabilities conditioned on the exposure and the mismeasured probabilities, respectively.

Tools from causal inference have been useful in recent literature on differential measurement error. Imai and Yamamoto (17) derive bounds of the average treatment effect under differential measurement error in a binary treatment variable and introduce a sensitivity analysis to help the researcher understand the estimate’s robustness to such errors. Edwards et al. (18) also leverage insights from causal inference to discuss measurement error; our work is most similar in spirit to their research. The authors use potential outcomes to formalize a discussion of “missing data” due to measurement error in exposures. We build on this by introducing a framework for misreporting behavior, and deriving the statistical properties of treatment effect estimates under different configurations.

Our construction of reporting class is parallel to the compliance classes laid out in the work of Angrist et al. (19). The definition of response classes is drawn from Hernán and Robins (20).

DEFINITIONS AND EXAMPLES

In this section, we work through an illustrative example and introduce terminology.

Outcome definitions

We review several versions of the (binary) outcome that will be necessary for our analysis.

We operate in the potential outcomes framework (21). Each individual i in the trial has 2 associated outcomes: $Y_i(1) \in \{0,1\}$, the outcome if treated, and $Y_i(0) \in \{0,1\}$, the outcome if given the control. We stipulate that the stable unit treatment value assumption (SUTVA) holds, meaning the outcome for each individual is unaffected by the assignment of treatments to other individuals (22).

Next, we denote as $Y_i^{(t)} \in \{0,1\}$ the true, realized outcome experienced by individual i. If we define an indicator $W_i \in \{0,1\}$, such that $W_i = 1$ for treated units and 0 otherwise, then the potential outcomes relate to the true realized outcome via the simple relationship

$$Y_i^{(t)} = W_i\,Y_i(1) + \left(1 - W_i\right)Y_i(0).$$
Last, we define $Y_i^{(r)} \in \{0,1\}$ as the outcome reported by individual i. In our setting, it is possible that $Y_i^{(r)} \ne Y_i^{(t)}$.

Reporting classes

Consider a randomized trial in which students are randomly assigned to receive either a violence prevention program (“intervention”) or an unrelated training (“control”). The intervention’s goal is to reduce students’ experience of violence, so the outcome is binary: “yes, I experienced violence in the prior 12 months” ($= 1$) or “no, I did not experience violence in the prior 12 months” ($= 0$).

Reporting classes are defined by the relationship between $Y_i^{(t)}$ and $Y_i^{(r)}$. The classes are summarized in Table 1. A true-reporter behaves as expected: If they experienced violence, they report having experienced violence, and if they did not experience violence, they report not having experienced violence. A false-reporter reports the opposite of their realized outcome. A never-reporter reports not having experienced violence regardless of their actual experience of violence. An always-reporter reports experiencing violence regardless of their actual experience.

Table 1

Reporting Behavior for Each of the 4 Reporting Classesa

Reporting Class    $Y_i^{(r)}$ When $Y_i^{(t)} = 0$    $Y_i^{(r)}$ When $Y_i^{(t)} = 1$
True               0                                   1
False              1                                   0
Never              0                                   0
Always             1                                   1

a This table summarizes the reporting class categories on the basis of the combination of experience (yes or no) and reported outcome (yes or no).


We can hypothesize causes for misreporting behavior. We focus on never-reporters, the most plausible class of misreporters in the violence prevention setting. Suppose there are 3 causes of failing to report: 1) Students fear negative consequences if they report a violent event; 2) students feel shame for having been a victim of violence; or 3) students are confused about what constitutes violence.

Careful consideration of such hypotheses may help researchers to modify a prospective study to mitigate misreporting bias. For example: 1) To address concerns of negative consequences of the report, the researchers may use a technique like “randomized response,” which offers a higher degree of anonymity; 2) to reduce feelings of shame, researchers may word the question to avoid triggers of shame; and 3) to clarify what constitutes “experiencing violence,” researchers may use detailed descriptive scenarios, name specific physical and verbal acts, or use common slang terms more familiar to students. Further discussion can be found in Web Appendix 1 (available at https://doi.org/10.1093/aje/kwad027).

Response classes

We now turn to a second type of grouping, response classes. Unlike reporting classes, response classes are defined by the relationship between the potential outcomes $Y_i(1)$ and $Y_i(0)$. Definitions are provided in Table 2. For a useful prior discussion of this concept, see Hernán and Robins (20).

Note that our naming convention for response classes departs from the one used by Hernán and Robins. We seek to avoid attaching valences to the outcome directions (e.g., instead of “Helped,” we use “Decrease”).

Interaction between class types

The joint distribution of response and reporting classes characterizes the bias and variance of the difference-in-means causal estimator.

To understand why, consider how response and reporting classes jointly vary in the violence prevention example. Among students in the Decrease response class, for whom $\left(Y_i(0) = 1, Y_i(1) = 0\right)$, there may be individuals who feel that they can modify their chances of experiencing violence and hence experience shame if they are unable to prevent violence. Contrast that group with the Predisposed group $\left(Y_i(0) = Y_i(1) = 1\right)$. These individuals may feel violence is inevitable and therefore feel more comfortable reporting their experiences truthfully. In this scenario, never-reporting would be more prevalent among those responsive to the treatment. This would bias estimates of the treatment effect toward 0, much more so than if never-reporting behavior were equally probable among the Decrease and Predisposed classes.

Table 2

Potential Outcomes for Each of the 4 Response Classesa

Response Class    $Y_i(0)$    $Y_i(1)$
Decrease          1           0
Increase          0           1
Unsusceptible     0           0
Predisposed       1           1

a The response classes are defined by the relationship between the potential outcomes $Y_i(1)$ and $Y_i(0)$. For a useful prior discussion of this concept, see Hernán and Robins (20).


Some misreporting behavior can be mitigated in the design phase of the experiment. In particular, how measurements are obtained matters greatly: aspects of the measurement tool, such as question wording and response type (e.g., multiple choice, open response), as well as the context in which it is asked (e.g., computer-based survey versus one-on-one interview). Methods geared toward establishing trust with respondents may engender more truthful responses.

However, once the measurement procedures are determined, we take the reporting class to be a fixed feature of the individual. Defining the reporting class in this way precludes any type of measurement error that arises differentially by treatment arm. For example, we do not allow for the possibility that the violence prevention program induces individuals to underreport experiences of violence more than they might have otherwise. Such “demand effects” are outside the scope of this work.

A remark about false-reporters

We have not offered an example of false-reporters. Like “defiers” in the instrumental variables literature, we suspect they are uncommon. The motivation to avoid telling the truth—in fact, being willing to report either outcome level, just not the true one—is a more complicated dynamic than the motivations discussed above. We are aware of anecdotal examples of quarrelsome study participants who might function as false-reporters, but we assume they are absent in the analyses to follow.

THEORETICAL RESULTS

In this section, we formalize the insights described previously, and introduce our main theoretical results.

Preliminaries

Model, assumptions, and notation.

We consider a superpopulation of $i = 1, \dots, N_{\mathrm{sp}}$ units, of whom $2n \ll N_{\mathrm{sp}}$ will be sampled for a completely randomized experiment. Once the experimental units are drawn, n of the units are selected via simple random sampling to receive the treatment, while the remaining n units are assigned to control. Our target of estimation is the superpopulation treatment effect,

$$\tau = \frac{1}{N_{\mathrm{sp}}}\sum_{i=1}^{N_{\mathrm{sp}}}\left(Y_i(1) - Y_i(0)\right).$$
Denote by $R_i \in \{0,1\}$ the indicator of being sampled into the experiment and $W_i \in \{0,1\}$ the indicator of receiving the treatment. The 2 processes are assumed independent, such that $R_i \perp\!\!\!\perp W_j$ for all $i, j$. We denote as $\mathbb{E}_R(\cdot)$ expectation with respect to the $R_i$ and $\mathbb{E}_W(\cdot)$ expectation with respect to the $W_i$. For any estimator $\phi$, define

$$\mathbb{E}(\phi) = \mathbb{E}_R\left(\mathbb{E}_W(\phi)\right).$$

We associate with each unit 2 additional fixed binary constants: $N_i \in \{0,1\}$, equal to 1 if and only if unit i is a never-reporter, and $A_i \in \{0,1\}$, equal to 1 if and only if unit i is an always-reporter.

Per the discussion in the prior section, we make the following assumption.  

Assumption 1.

Every unit in the superpopulation is a never-reporter, an always-reporter, or a true-reporter. There are no false-reporters.

Under assumption 1, $N_i A_i = 0$ for all units, and $N_i = A_i = 0$ implies a unit is a true-reporter. We refer collectively to the group of never-reporters and always-reporters as “misreporters.”

Table 3

Population Proportions Across Response and Reporting Classesa

Response Class    Y(0)    Y(1)    True-Reporter      Always-Reporter    Never-Reporter     False-Reporter    Total
Decrease          1       0       $\overline{TD}$    $\overline{AD}$    $\overline{ND}$    x                 $\overline{D}$
Increase          0       1       $\overline{TI}$    $\overline{AI}$    $\overline{NI}$    x                 $\overline{I}$
Unsusceptible     0       0       $\overline{TU}$    $\overline{AU}$    $\overline{NU}$    x                 $\overline{U}$
Predisposed       1       1       $\overline{TP}$    $\overline{AP}$    $\overline{NP}$    x                 $\overline{P}$
Total                             $\overline{T}$     $\overline{A}$     $\overline{N}$

a This contingency table summarizes the structure of the superpopulation, defining the proportions of the population that jointly fall into each reporting and response class. The final row defines the marginal frequencies of each reporting class, while the final column defines the marginal frequencies of each response class. Boxes marked with an “x” are precluded by our assumptions.


We can express the reported outcome as

$$Y_i^{(r)} = A_i + \left(1 - A_i\right)\left(1 - N_i\right)Y_i^{(t)}.$$

By inspection, the above definition yields the expected reporting behavior: Regardless of the value of $Y_i^{(t)}$, the reported outcome is 0 if $N_i = 1$ and 1 if $A_i = 1$. Only if $A_i = N_i = 0$ do we have $Y_i^{(r)} = Y_i^{(t)}$. We also define $\overline{N}$ and $\overline{A}$ as the superpopulation averages of the indicators $N_i$ and $A_i$.
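As a quick sanity check, the following Python snippet (our illustration, not part of the original analysis) verifies that this rule reproduces the reporting behavior summarized in Table 1:

```python
# Verify the reporting rule Y_r = A + (1 - A) * (1 - N) * Y_t against Table 1.
for label, A_i, N_i in [("true", 0, 0), ("always", 1, 0), ("never", 0, 1)]:
    for Y_t in (0, 1):
        Y_r = A_i + (1 - A_i) * (1 - N_i) * Y_t
        print(f"{label}-reporter: Y_t = {Y_t} -> Y_r = {Y_r}")
```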

Response classes.

We define binary indicators $D_i, I_i, U_i, P_i \in \{0,1\}$ reflecting whether each individual i falls into the Decrease, Increase, Unsusceptible, or Predisposed response classes defined in the Definitions and Examples section. Every individual belongs to exactly one class, so

$$D_i + I_i + U_i + P_i = 1.$$

Using these definitions, we summarize our superpopulation via Table 3.

Boxes marked with an “x” are precluded by our assumptions. In the remaining boxes, the given symbol denotes the population frequency of units that fall into the given category. Marginal probabilities are given in the bottom row and rightmost column.

Eagle-eyed readers will note a point of ambiguity in these definitions. If individual i is in the Unsusceptible response class, then she will provide the exact same responses whether she is a true-reporter or a never-reporter: Namely, she will report $Y_i^{(r)} = 0$ irrespective of treatment level. Conversely, a Predisposed individual will identically report $Y_i^{(r)} = 1$ whether she is a true- or always-reporter, irrespective of treatment.

The choice of how to designate such individuals is essentially academic: Whether or not we classify them as true-reporters, we obtain the same values for the bias and variance of the causal estimator. However, the sensitivity model to be introduced later on will be simpler to conceptualize if we do not enforce $\overline{NU} = \overline{AP} = 0$. Hence, we will proceed with the population structure as described in Table 3, although we emphasize that readers can consider the classes $\overline{TU}$ and $\overline{NU}$ as comprising a single “type” of study participant, and the classes $\overline{TP}$ and $\overline{AP}$ as comprising another type.

Bias results

To update power calculations, we first must characterize the bias and variance resulting from misreporting. We consider the bias of the finite-sample difference-in-means estimator,

$$\hat{\tau} = \frac{1}{n}\sum_{i : R_i = 1} W_i\,Y_i^{(r)} \;-\; \frac{1}{n}\sum_{i : R_i = 1} \left(1 - W_i\right)Y_i^{(r)}.$$
Theorem 1.
Bias of difference-in-means estimator. Define $\tau_i = Y_i(1) - Y_i(0)$ for $i = 1, 2, \dots, N_{\mathrm{sp}}$ such that

$$\tau = \frac{1}{N_{\mathrm{sp}}}\sum_{i=1}^{N_{\mathrm{sp}}} \tau_i.$$

The bias of the difference-in-means estimator in estimating $\tau$ is given by

$$\mathbb{E}\left(\hat{\tau}\right) - \tau = -\left(\overline{A} + \overline{N}\right)\tau - \mathrm{Cov}\left(A, \tau\right) - \mathrm{Cov}\left(N, \tau\right),$$

where $\mathrm{Cov}(A, \tau)$ is the superpopulation covariance between $A_i$ and $\tau_i$, and $\mathrm{Cov}(N, \tau)$ is defined analogously.

For a proof, see Web Appendix 2.
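The claim in Theorem 1 is also easy to check numerically. The sketch below constructs a synthetic superpopulation (the class frequencies and outcome rates are arbitrary choices of ours) and compares the bias formula against the expectation of the difference-in-means estimator, which, averaged over both randomizations, equals the superpopulation mean of reported treated potential outcomes minus that of reported control potential outcomes:

```python
# Numerical check of the Theorem 1 bias expression on a synthetic superpopulation.
import numpy as np

rng = np.random.default_rng(0)
N_sp = 1_000_000
A = rng.binomial(1, 0.05, N_sp)                  # always-reporter indicators
N = (1 - A) * rng.binomial(1, 0.10, N_sp)        # never-reporters; enforces A_i * N_i = 0
Y0 = rng.binomial(1, 0.07, N_sp)                 # control potential outcomes
Y1 = np.where(Y0 == 1, rng.binomial(1, 0.5, N_sp), 0)  # treatment never harms
tau_i = Y1 - Y0
tau = tau_i.mean()                               # superpopulation treatment effect

def report(Y):
    # Reporting rule: Y_r = A + (1 - A) * (1 - N) * Y_t.
    return A + (1 - A) * (1 - N) * Y

bias_direct = report(Y1).mean() - report(Y0).mean() - tau
cov = lambda u, v: (u * v).mean() - u.mean() * v.mean()
bias_formula = -(A.mean() + N.mean()) * tau - cov(A, tau_i) - cov(N, tau_i)
print(bias_direct, bias_formula)                 # identical up to floating point
```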

Power under independence

Theorem 1 shows there are several factors at play in determining the bias due to misreporting behavior. If, in the superpopulation, we have $A_i, N_i \perp\!\!\!\perp I_i, D_i$, then our estimate will be shrunk toward 0 by a multiplicative factor:

$$\mathbb{E}\left(\hat{\tau}\right) = \left(1 - \overline{A} - \overline{N}\right)\tau.$$

These results can be extended to a power analysis. We consider Neyman’s null hypothesis versus a one-directional alternative,

$$H_0: \tau = 0 \quad \text{versus} \quad H_1: \tau < 0.$$

Under independence, the estimate is increasingly attenuated as misreporter incidence grows. This yields a strict decline in power, as formalized in Theorem 2.

Theorem 2.

Suppose $\tau < 0$ and, in the superpopulation, $A_i, N_i \perp\!\!\!\perp I_i, D_i$. Then, the detection power is a strictly decreasing function of $\overline{A}$ and $\overline{N}$.

For a proof, see Web Appendix 3.
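To make the attenuation concrete, the sketch below computes the required per-arm sample size under the independence condition, assuming (our choice, for illustration) a one-sided two-proportion test at $\alpha = 0.05$ with 80% power. Misreporting enters only through the reported-outcome means of the 2 arms:

```python
# Required per-arm sample size when reporting and response classes are independent.
import numpy as np
from scipy.stats import norm

def n_per_arm(p0, p1, A_bar=0.0, N_bar=0.0, alpha=0.05, beta=0.20):
    # Reported means: always-reporters contribute 1, never-reporters 0,
    # and true-reporters contribute the true outcome rates.
    m1 = A_bar + (1 - A_bar - N_bar) * p1        # treated arm
    m0 = A_bar + (1 - A_bar - N_bar) * p0        # control arm
    delta = abs(m1 - m0) / np.sqrt(m1 * (1 - m1) + m0 * (1 - m0))
    return ((norm.ppf(1 - alpha) + norm.ppf(1 - beta)) / delta) ** 2

print(n_per_arm(0.07, 0.035))                    # no misreporting: about 499
print(n_per_arm(0.07, 0.035, N_bar=0.20))        # 20% never-reporters: about 632
```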

In the absence of the independence condition, detection powers can rise or fall with misreporter incidence. In the next section, we introduce the concept of worst-case power under a sensitivity model.

Sensitivity model

In many practical settings, it is unrealistic to expect exact independence between reporting and response classes (i.e., that the proportions of true-reporters, always-reporters, and never-reporters are identical across all response classes). However, deviations from independence may be relatively small in magnitude. To quantify these deviations, we introduce the bounding quantity $\Gamma$.

Definition 1.
Under the sensitivity model indexed by $\Gamma \ge 1$, every subgroup proportion in Table 3 differs by a factor no greater than $\Gamma$ from the proportion under row-column independence. For example,

$$\frac{1}{\Gamma}\,\overline{T}\cdot\overline{D} \;\le\; \overline{TD} \;\le\; \Gamma\,\overline{T}\cdot\overline{D},$$

with analogous bounds for the other nonzero entries in Table 3.

Under definition 1, a $\Gamma = 1$ bound enforces that the reporting and response classes be independent. A looser bound of $\Gamma = 2$ indicates that true-reporters, never-reporters, and always-reporters can be no more than twice but no less than half as frequent within any of the 4 response classes than they are on average across those response classes.

Suppose we fix values for both the reporting class frequencies ($\overline{T}, \overline{A}, \overline{N}$) and the response class frequencies $\left(\overline{D}, \overline{I}, \overline{U}, \overline{P}\right)$. Then, larger values of $\Gamma$ allow for increasingly adversarial configurations of the joint distributions of the reporting classes and response classes, such that detection of a treatment effect can be made increasingly difficult.
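The feasibility check implied by definition 1 is straightforward to code. The helper below (a sketch; all names are ours) tests whether a candidate joint table of proportions satisfies the $\Gamma$ bounds:

```python
# Check a candidate 3x4 joint table (rows: T, A, N; columns: D, I, U, P)
# against the Gamma bounds of Definition 1.
import numpy as np

def satisfies_gamma(delta, gamma):
    pi_rep = delta.sum(axis=1)                   # marginal reporting frequencies
    pi_res = delta.sum(axis=0)                   # marginal response frequencies
    indep = np.outer(pi_rep, pi_res)             # table under row-column independence
    ratio = np.ones_like(delta)
    mask = indep > 0
    ratio[mask] = delta[mask] / indep[mask]
    if np.any(~mask & (delta > 0)):              # mass where independence has none
        return False
    return bool(np.all((ratio <= gamma) & (ratio >= 1.0 / gamma)))

pi_rep = np.array([0.80, 0.00, 0.20])
pi_res = np.array([0.035, 0.00, 0.930, 0.035])
print(satisfies_gamma(np.outer(pi_rep, pi_res), gamma=1.0))  # True: exact independence
```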

Power and sample size calculations

Using our sensitivity model, we turn to our central task: computing adequate sample size for a prospective randomized controlled trial.

Suppose an analyst is designing an experiment for a treatment intended to have a preventive effect on an outcome of interest. The analyst has a desired type I error probability no greater than $\alpha \in (0,1)$ and can tolerate a type II error probability no greater than $\beta \in (0,1)$ (so her desired power is $1 - \beta$).

To proceed, the analyst must provide prospective values of the reporting class frequencies ($\overline{T}, \overline{A}, \overline{N}$) and the response class frequencies $\left(\overline{D}, \overline{I}, \overline{U}, \overline{P}\right)$, as well as a value for $\Gamma$. As in traditional power calculations, we suggest using empirical estimates from previous research to guide the selection of these parameters; further discussion can be found in the Validation Studies subsection. Define fixed vectors,

$$\pi_{\mathrm{rep}} = \left(\overline{T}, \overline{A}, \overline{N}\right), \qquad \pi_{\mathrm{res}} = \left(\overline{D}, \overline{I}, \overline{U}, \overline{P}\right),$$

and optimization variables,

$$\delta = \left(\overline{TD}, \overline{TI}, \overline{TU}, \overline{TP}, \overline{AD}, \overline{AI}, \overline{AU}, \overline{AP}, \overline{ND}, \overline{NI}, \overline{NU}, \overline{NP}\right),$$

the vector of joint proportions in Table 3. The expected mean outcomes among treated and control units are functions of these values:

$$\mu_t(\delta) = \overline{TI} + \overline{TP} + \overline{AD} + \overline{AI} + \overline{AU} + \overline{AP}, \qquad \mu_c(\delta) = \overline{TD} + \overline{TP} + \overline{AD} + \overline{AI} + \overline{AU} + \overline{AP}.$$

Next, we express the standardized expectation of the difference-in-means estimator as

$$\Delta(\delta) = \frac{\mu_t(\delta) - \mu_c(\delta)}{\sqrt{\mu_t(\delta)\left(1 - \mu_t(\delta)\right) + \mu_c(\delta)\left(1 - \mu_c(\delta)\right)}}.$$

Finally, we can pose the sample size computation as an optimization problem. We want the minimum treatment and control sample size n such that we still achieve the desired power level under all allowable values of $\delta$. This is directly encoded in Optimization Problem 1:

$$\begin{aligned} n^\ast \;=\; \max_{\delta}\;\; & \left(\frac{\Phi^{-1}(1-\alpha) + \Phi^{-1}(1-\beta)}{\Delta(\delta)}\right)^{2} \\ \text{subject to}\;\; & \delta \ge 0, \quad \mathbf{1}_c^{\top}\delta = 1, \\ & \text{the row and column sums of } \delta \text{ equal } \pi_{\mathrm{rep}} \text{ and } \pi_{\mathrm{res}}, \\ & \delta \text{ satisfies the } \Gamma \text{ bounds of definition 1}, \end{aligned} \tag{1}$$

where $\Phi(\cdot)$ is the CDF of a standard normal (so that $\Phi^{-1}(1-\alpha)$ and $\Phi^{-1}(1-\beta)$ are the usual critical values), and $\mathbf{1}_c$ is the length-c vector containing all ones. Minimizing $\Delta(\delta)^2$ over the constraint set is a quadratic fractional program and can be solved efficiently using Dinkelbach’s method (23, 24). For more details, see Web Appendix 4.
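The sketch below implements one version of this computation, assuming a one-sided two-proportion test at level $\alpha$ with power $1 - \beta$. It searches for the worst-case table with scipy’s general-purpose SLSQP solver rather than Dinkelbach’s method, so it should be read as an illustration of the problem structure rather than as the implementation in Web Appendix 6; all function and variable names are ours:

```python
# Worst-case total sample size over joint tables satisfying the Gamma bounds.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def worst_case_total_n(pi_rep, pi_res, gamma, alpha=0.05, beta=0.20):
    # pi_rep = (T, A, N) marginals; pi_res = (D, I, U, P) marginals.
    indep = np.outer(pi_rep, pi_res)             # starting point: independence

    def sq_std_effect(x):
        d = x.reshape(3, 4)
        # Units report 1 if always-reporters, or true-reporters whose realized
        # outcome is 1 (classes I, P if treated; classes D, P if control).
        mu_t = d[1].sum() + d[0, 1] + d[0, 3]
        mu_c = d[1].sum() + d[0, 0] + d[0, 3]
        var = mu_t * (1 - mu_t) + mu_c * (1 - mu_c)
        return (mu_t - mu_c) ** 2 / var          # minimized => worst case

    cons = [{"type": "eq", "fun": lambda x, j=j: x.reshape(3, 4)[j].sum() - pi_rep[j]}
            for j in range(3)]
    cons += [{"type": "eq", "fun": lambda x, k=k: x.reshape(3, 4)[:, k].sum() - pi_res[k]}
             for k in range(4)]
    bounds = list(zip((indep / gamma).ravel(), (indep * gamma).ravel()))
    res = minimize(sq_std_effect, indep.ravel(), method="SLSQP",
                   bounds=bounds, constraints=cons)
    n_arm = ((norm.ppf(1 - alpha) + norm.ppf(1 - beta)) ** 2) / res.fun
    return 2 * int(np.ceil(n_arm))
```

On the inputs of the next section, this sketch should reproduce the qualitative behavior of Figure 1, with totals matching the quoted figures up to rounding conventions and solver tolerance.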

METHODS IN CONTEXT

Example: sexual-violence prevention study

We consider an example based on prior work evaluating a sexual-violence prevention program for adolescent girls in Kenya. From a pilot study (25), we estimated the annual baseline rate of sexual violence in this population to be approximately 7%, and we estimated that the intervention reduces the incidence of sexual violence by 50%. Our goal is to identify the required sample size for a larger, follow-up study.

A standard power calculation, ignoring misreporting, yields a minimum sample size of 998 units. We consider how the presence of misreporters might necessitate larger samples. We expect never-reporters to exist within the study population, because some girls may be disinclined to disclose sexual violence, due to feelings of shame or an inability to understand their experiences as violent (26, 27). Always-reporters and false-reporters are assumed absent.

In the ideal case, the pilot would provide guidance on the values of $\overline{N}$ (the frequency of never-reporters) and $\Gamma$ (the bound on deviations from independence). Here, we consider a grid of possible values for both parameters, where $0 \le \overline{N} \le 0.20$ and $1 \le \Gamma \le 2$. We must also posit values for the response class frequencies, $\overline{I}$, $\overline{D}$, $\overline{P}$, and $\overline{U}$. Under the assumption that $\overline{I} = 0$ (i.e., no one is harmed by the treatment), our choices for the other frequencies are fully defined by our assumed baseline rate and treatment effect: The baseline rate gives $\overline{D} + \overline{P} = 0.07$ and the halved treated rate gives $\overline{I} + \overline{P} = 0.035$, so $\overline{D} = 0.035$, $\overline{P} = 0.035$, and $\overline{U} = 0.930$.

To compute the required sample sizes, we solve Optimization Problem 1 repeatedly at each possible pair of values of $\overline{N}$ and $\Gamma$. Results are plotted in Figure 1. Recall that these are worst-case sample sizes, assuming the incidence of never-reporters is as adversarial as possible under the $\Gamma$ constraint. Specifically, the algorithm allocates as much of the population as possible to the $\overline{ND}$ subgroup subject to the constraints, yielding the smallest possible causal estimate and the largest required sample size.
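The grid itself can be produced by repeated calls to the sketch above (values follow the ranges in the text; totals may differ slightly from Figure 1 due to rounding and solver tolerance):

```python
# Worst-case totals over the grid of never-reporter frequency and Gamma,
# reusing worst_case_total_n from the earlier sketch.
import numpy as np

pi_res = np.array([0.035, 0.0, 0.930, 0.035])    # (D, I, U, P) from the text
for N_bar in np.linspace(0.0, 0.20, 5):
    pi_rep = np.array([1.0 - N_bar, 0.0, N_bar])  # no always-reporters
    sizes = [worst_case_total_n(pi_rep, pi_res, g) for g in (1.0, 1.5, 2.0)]
    print(f"N_bar = {N_bar:.2f}: {sizes}")
```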

Figure 1

Required sample size for the sexual violence prevention program study (in Kenya), at different values of never-reporter frequency. Note these are worst-case sample sizes, assuming the incidence of never-reporters is as adversarial as possible under the $\Gamma$ constraint.

If $\Gamma = 1$ (i.e., response and reporting classes assumed independent), the worst-case sample size grows approximately linearly with never-reporter frequency, and the growth rate is slow. Even if 20% of study participants were never-reporters, the required sample size would grow to only 1,260 participants. However, as we increase $\Gamma$, the required sample size grows much more quickly with $\overline{N}$. At $\Gamma = 2$, the required sample size grows to over 2,200 participants when 20% of the population is composed of never-reporters.

Figure 1 demonstrates the risks of an underpowered study due to the presence of never-reporters—especially if never-reporters are overrepresented among those responsive to the treatment. These insights allow practitioners to quantify how much larger a study must be in order to detect an effect.

Further simulations can be found in Web Appendix 5. Corresponding to these simulations, figures analogous to Figure 1 can be found in Web Figure 1 and Web Figure 2. Additionally, details on the code to implement these simulations can be found in Web Appendix 6 (the “Code for Practitioners” repository).

Validation studies

We next consider how to estimate key parameters for use in Optimization Problem 1.

In the ideal case, we would conduct a smaller pilot study prior to the design of the main randomized controlled trial. In this study—which we refer to as a “pre-study” or a “validation study,” because its goals involve comparing the proposed measurement against a gold standard—a sample of individuals would be recruited from the intended study population, with some individuals randomized to the treatment and others to the standard of care. We focus our attention exclusively on validation studies that are under the researchers’ control, such that treatment can indeed be randomized and reliably recorded.

At follow-up, participants would first be asked to provide their responses using the proposed measurement instrument (e.g., a survey). These results would then be validated via a gold-standard measurement, one that is essentially error-free but is usually costly or labor-intensive (such as an in-person interview). At the completion of this validation study, researchers would have access to the triplet $\big(Y_i^{(t)}, Y_i^{(r)}, W_i\big)$ for a sample of potential participants. We denote as $m_t = \sum_i W_i$ the number of treated units in the validation study, and $m_c = \sum_i \left(1 - W_i\right)$ the number of control units.

These data can be used to improve the quality of the subsequent trial in several ways. The pre-study affords the research team an opportunity to evaluate the trade-offs of using the gold-standard versus the proposed measure during the randomized trial.

Moreover, the pre-study can inform parameter choices when evaluating the necessary sample size for the experiment. If we impose assumption 1 (under which there are no false-reporters), then we can directly estimate a number of population quantities by relating them to empirical frequencies observed in the pre-study. To see how, we first define

$$Y_i^{(t)}(w), \qquad w \in \{0,1\},$$

the true outcome individual i would experience under treatment assignment w, as well as

$$Y_i^{(r)}(w) = A_i + \left(1 - A_i\right)\left(1 - N_i\right)Y_i^{(t)}(w),$$

the corresponding reported outcome.

We note that $Y_i^{(t)}(1) = Y_i(1)$ and $Y_i^{(t)}(0) = Y_i(0)$ (i.e., these are observations of the underlying potential outcomes), but we find this notation more intuitive.

Each unit in the pre-study sample will receive either the treated or control condition, so we observe either $\big(Y_i^{(t)}(1), Y_i^{(r)}(1)\big)$ or $\big(Y_i^{(t)}(0), Y_i^{(r)}(0)\big)$ for each unit. Hence, we can simply count the empirical frequencies of ones and zeros among true and reported outcomes to obtain unbiased estimates of key superpopulation frequencies. For example, if we obtain the proportion of treated units for whom $Y_i^{(t)}(1) = Y_i^{(r)}(1) = 1$ in the pre-study, then we observe

$$\mathbb{E}\left(\frac{1}{m_t}\sum_{i : W_i = 1} Y_i^{(t)}(1)\,Y_i^{(r)}(1)\right) = \overline{TI} + \overline{TP} + \overline{AI} + \overline{AP}.$$

Similarly, if we obtain the proportion of control units for whom $Y_i^{(t)}(0) = 1, Y_i^{(r)}(0) = 0$, then

$$\mathbb{E}\left(\frac{1}{m_c}\sum_{i : W_i = 0} Y_i^{(t)}(0)\left(1 - Y_i^{(r)}(0)\right)\right) = \overline{ND} + \overline{NP}.$$

In Table 4 and Table 5, we summarize each of these relationships. Under assumption 1, every cell in these tables can be estimated unbiasedly by computing the empirical frequency of the row and column values observed in the pre-study.
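Tabulating the cells of Tables 4 and 5 from pre-study data is then a simple counting exercise. The sketch below uses hypothetical placeholder arrays for the observed triplets:

```python
# Empirical cell frequencies for Tables 4 and 5 from pre-study triplets.
import numpy as np

Y_t = np.array([1, 0, 0, 1, 0, 1, 0, 0])   # gold-standard outcomes (hypothetical)
Y_r = np.array([1, 0, 0, 0, 0, 1, 1, 0])   # instrument-reported outcomes
W   = np.array([1, 1, 0, 0, 1, 0, 0, 1])   # treatment indicators

for w, label in ((1, "treated (Table 4)"), (0, "control (Table 5)")):
    arm = W == w
    print(label)
    for yt in (1, 0):
        for yr in (1, 0):
            freq = np.mean((Y_t[arm] == yt) & (Y_r[arm] == yr))
            print(f"  share with Y_t = {yt}, Y_r = {yr}: {freq:.3f}")
```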

Table 4

Joint Reporting and Response Classes for Treated Potential Outcomesa

                      $Y_i^{(r)}(1) = 1$                                           $Y_i^{(r)}(1) = 0$
$Y_i^{(t)}(1) = 1$    $\overline{TI}+\overline{TP}+\overline{AI}+\overline{AP}$    $\overline{NI}+\overline{NP}$
$Y_i^{(t)}(1) = 0$    $\overline{AD}+\overline{AU}$                                $\overline{TD}+\overline{TU}+\overline{ND}+\overline{NU}$

a The cells in this table represent the empirical response frequencies among treated individuals, under both the proposed measurement instrument and a gold-standard measure. In each cell, we show the sum of superpopulation proportions that is equal in expectation to the empirical response proportion represented by the cell.


Table 5

Joint Reporting and Response Classes for Untreated Potential Outcomesa

                      $Y_i^{(r)}(0) = 1$                                           $Y_i^{(r)}(0) = 0$
$Y_i^{(t)}(0) = 1$    $\overline{TD}+\overline{TP}+\overline{AD}+\overline{AP}$    $\overline{ND}+\overline{NP}$
$Y_i^{(t)}(0) = 0$    $\overline{AI}+\overline{AU}$                                $\overline{TI}+\overline{TU}+\overline{NI}+\overline{NU}$

a The cells in this table represent the empirical response frequencies among control individuals, under both the proposed measurement instrument and a gold-standard measure. In each cell, we show the sum of superpopulation proportions that is equal in expectation to the empirical response proportion represented by the cell.


Using the results summarized in Tables 4 and 5, we can estimate 8 different linear combinations of the population proportions described in Table 3. Seven of these linear combinations are independent, so we do not have sufficient information to obtain individual estimates of each of the 12 different population proportions in Table 3. However, we can arrive at such estimates by imposing additional constraints.

First, we can impose an assumption about the relative proportions of the Increase and Decrease response classes among each reporting class. Most typically, we would simply assume a unidirectional treatment effect, such that no individuals are assumed to fall in the Increase class or no individuals are assumed to fall in the Decrease class. This is common in practice, as we might expect that any treatment would help some individuals in the trial but not harm any other individuals in the trial. In other settings, it might be more realistic to instead assume a relative frequency of the Increase and Decrease classes within each reporting class—supposing, for example, that the proportions $\overline{TI}$, $\overline{AI}$, and $\overline{NI}$ are twice as large as $\overline{TD}$, $\overline{AD}$, and $\overline{ND}$, respectively. With any 3 linearly independent constraints of this type, we can estimate almost every nonzero proportion in Table 3.

The sole exception (per our discussion in the Theoretical Results section) is that we can identify only the sums $\overline{TU} + \overline{NU}$ and $\overline{TP} + \overline{AP}$, rather than the individual proportions themselves. We can use a simple heuristic to allocate these 2 sums to their constituent proportions. We advocate allocating them such that $\overline{NU} = \overline{N}\cdot\overline{U}$ and $\overline{AP} = \overline{A}\cdot\overline{P}$, with all of the remaining mass allocated to $\overline{TU}$ and $\overline{TP}$, respectively.
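A minimal sketch of this allocation heuristic (the function and argument names are ours):

```python
# Split the identified sums TU + NU and TP + AP via the marginal products.
def allocate(sum_TU_NU, sum_TP_AP, N_bar, A_bar, U_bar, P_bar):
    NU = N_bar * U_bar              # set NU = N-bar * U-bar
    AP = A_bar * P_bar              # set AP = A-bar * P-bar
    TU = sum_TU_NU - NU             # remaining mass goes to true-reporters
    TP = sum_TP_AP - AP
    return TU, NU, TP, AP
```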

After these steps, researchers will have access to a full set of pilot estimates for entries in Table 3. Taking row and column sums of these pilot estimates will provide the marginal probabilities that define $\pi = \left(\pi_{\mathrm{rep}}, \pi_{\mathrm{res}}\right)$, an input to Optimization Problem 1. The other input, $\Gamma$, can also be estimated by computing the maximal deviation from row-column independence among the pilot estimates. In practice, we suggest considering several different values of $\Gamma$, informed also by subject matter knowledge, in order to see how the required sample size evolves with this parameter.

Special case: bias correction

Typically, validation studies will be useful for obtaining pilot estimates for the parameters in Optimization Problem 1. If the validation study is of sufficient size and quality, however, it could be used to directly estimate a bias correction for the causal estimates in our larger study. In particular, observe that the estimator

$$\hat{B} = \left(\frac{1}{m_t}\sum_{i : W_i = 1} Y_i^{(r)}(1) - \frac{1}{m_c}\sum_{i : W_i = 0} Y_i^{(r)}(0)\right) - \left(\frac{1}{m_t}\sum_{i : W_i = 1} Y_i^{(t)}(1) - \frac{1}{m_c}\sum_{i : W_i = 0} Y_i^{(t)}(0)\right)$$

satisfies

$$\mathbb{E}\left(\hat{B}\right) = -\left(\overline{A} + \overline{N}\right)\tau - \mathrm{Cov}\left(A, \tau\right) - \mathrm{Cov}\left(N, \tau\right),$$

the bias of the difference-in-means estimator as defined in Theorem 1.

In other words, with no further assumptions, we can estimate the bias in the usual difference-in-means estimator when using the proposed measurement instrument in an experiment. In typical cases, this quantity can be used as a “gut check” to ensure that the bias is not intolerably large for the main randomized trial.

However, with a sufficiently large and high-quality validation study, $\hat{B}$ will have low variance, such that we may treat it as a precise estimate of the true bias. In this case, we can use $\hat{B}$ as a plug-in estimator and subtract this quantity from the causal estimate obtained from our larger randomized trial. This would yield a bias-corrected estimator, directly mitigating the error due to reporting bias.
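The correction itself is a one-line subtraction once $\hat{B}$ is in hand. A sketch with hypothetical inputs:

```python
# Plug-in bias correction using validation-study data.
import numpy as np

def bias_estimate(Y_t, Y_r, W):
    # Difference-in-means on reported outcomes, minus the same on true outcomes.
    t, c = W == 1, W == 0
    return (Y_r[t].mean() - Y_r[c].mean()) - (Y_t[t].mean() - Y_t[c].mean())

# B_hat = bias_estimate(Y_t_pre, Y_r_pre, W_pre)   # from the validation study
# tau_corrected = tau_hat_reported - B_hat         # corrected trial estimate
```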

CONCLUDING REMARKS

The formal treatment of reporting bias is a somewhat recent development in the causal inference literature, especially when considering measurement errors correlated with potential outcomes (2). This problem poses a threat to valid causal inference—especially when considering randomized trials in which the outcomes include sensitive topics. The problem can be partially mitigated through careful survey design and methods to ensure participant privacy. But there are many practical settings in which all participants cannot reasonably be expected to disclose data about an outcome of interest.

In this paper, we have considered the problem from the perspective of a researcher designing a prospective randomized trial. Our contributions are 3-fold. First, we defined a set of “reporting classes” to characterize individual reporting behavior, and we demonstrated that the joint distribution of these reporting classes—along with “response classes” reflecting how individuals respond to treatment—determines exactly how much bias and variance will be induced in our causal estimate. Second, we proposed a method for practitioners to adequately power their analyses in the presence of misreporting, assuming a worst-case deviation from independence of the reporting and response classes. Third, we described a special case in which an analyst can directly correct for misreporting biases.

Future opportunities in this area are myriad. Recent papers have considered the case in which misreporting behavior is differential (16), meaning individuals may misreport only when receiving the treatment or only when receiving the control. Accounting for such behavior represents an important extension to this research.

ACKNOWLEDGMENTS

Author affiliations: Harvard Data Science Initiative, Harvard University, Cambridge, Massachusetts, United States (Evan T. R. Rosenman); LinkedIn Data Science and Applied Research, Sunnyvale, California, United States (Rina Friedberg); and Department of Epidemiology and Population Health, Stanford University School of Medicine, Stanford, California, United States (Mike Baiocchi).

We are grateful to Google and to the Marjorie Lozoff Fund, provided by the Michelle R. Clayman Institute for Gender Research, for financial support.

No original data were used in the content of this manuscript. Reproduction code for the central algorithm is included in the Web Material.

We thank Dr. Luke Miratrix, Dr. Sophie Litschwartz, Dr. Kristen Hunter, and Dr. Julia Simard for their useful comments and feedback.

Presented at the 2022 Joint Statistical Meetings, August 6–11, 2022, Washington, DC; and at the 2022 Society for Research on Educational Effectiveness Conference, September 21–24, 2022, Arlington, Virginia.

A preprint of this article has been published online. Rosenman ETR, Friedberg R, Baiocchi M. Robust designs for prospective randomized trials surveying sensitive topics. arXiv. 2021. (https://doi.org/10.48550/arXiv.2108.0894). We caution that the terminology in the paper has changed significantly in the revision process.

This research was conducted while R.F. was a student at Stanford, and is not affiliated with LinkedIn.

Conflict of interest: none declared.

REFERENCES

1. Cimpian JR, Timmer JD. Mischievous responders and sexual minority youth survey data: a brief history, recent methodological advances, and implications for research and practice. Arch Sex Behav. 2020;49(4):1097–1102.

2. Imai K, Yamamoto T. Causal inference with differential measurement error: nonparametric identification and sensitivity analysis. Am J Polit Sci. 2010;54(2):543–560.

3. Zaller J, Feldman S. A simple theory of the survey response: answering questions versus revealing preferences. Am J Polit Sci. 1992;36(3):579–616.

4. Droitcour JA, Larson EM. An innovative technique for asking sensitive questions: the three-card method. Bull Methodol Sociol. 2002;75(1):5–23.

5. Carroll RJ, Ruppert D, Stefanski LA, et al. Measurement Error in Nonlinear Models. 2nd ed. Boca Raton, FL: Chapman and Hall/CRC; 2006.

6. Hausman J. Mismeasured variables in econometric analysis: problems from the right and problems from the left. J Econ Perspect. 2001;15(4):57–67.

7. Frangakis CE, Rubin DB. Principal stratification in causal inference. Biometrics. 2002;58(1):21–29.

8. Balke A, Pearl J. Bounds on treatment effects from studies with imperfect compliance. J Am Stat Assoc. 1997;92(439):1171–1176.

9. Lewbel A. Estimation of average treatment effects with misclassification. Econometrica. 2007;75(2):537–551.

10. Rosenman ETR, Sarnquist C, Friedberg R, et al. Empirical insights for improving sexual assault prevention: evidence from baseline data for a cluster-randomized trial of IMpower and Sources of Strength. Violence Against Women. 2020;26(15–16):1855–1875.

11. Vitoratou S, Pickles A. A note on contemporary psychometrics. J Ment Health. 2017;26(6):486–488.

12. Flegel KM, Keyl PM, Nieto JF. Differential misclassification arising from nondifferential errors in exposure measurement. Am J Epidemiol. 1991;134(10):1233–1246.

13. Richardson DB, Keil AP, Cole SR, et al. Reducing bias due to exposure measurement error using disease risk scores. Am J Epidemiol. 2021;190(4):621–629.

14. Cole SR, Jacobson LP, Tien PC, et al. Using marginal structural measurement-error models to estimate the long-term effect of antiretroviral therapy on incident AIDS or death. Am J Epidemiol. 2010;171(1):113–122.

15. Tennekoon V, Rosenman R. Systematically misclassified binary dependent variables. Commun Stat Theory Methods. 2016;45(9):2538–2555.

16. VanderWeele TJ, Li Y. Simple sensitivity analysis for differential measurement error. Am J Epidemiol. 2019;188(10):1823–1829.

17. Imai K, Yamamoto T. Causal inference with differential measurement error: nonparametric identification and sensitivity analysis. Am J Polit Sci. 2010;54(2):543–560.

18. Edwards JK, Cole SR, Westreich D. All your data are always missing: incorporating bias due to measurement error into the potential outcomes framework. Int J Epidemiol. 2015;44(4):1452–1459.

19. Angrist JD, Imbens GW, Rubin DB. Identification of causal effects using instrumental variables. J Am Stat Assoc. 1996;91(434):444–455.

20. Hernán MA, Robins JM. Causal Inference: What If? Boca Raton, FL: Chapman & Hall/CRC; 2010.

21. Rubin DB. Estimating causal effects of treatments in randomized and nonrandomized studies. J Educ Psychol. 1974;66(5):688–701.

22. Imbens GW, Rubin DB. Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. New York, NY: Cambridge University Press; 2015.

23. Dinkelbach W. On nonlinear fractional programming. Management Science. 1967;13(7):492–498.

24. Phillips AT. Quadratic fractional programming: Dinkelbach’s method. In: Floudas CA, Pardalos PM, eds. Encyclopedia of Optimization. 2nd ed. Boston, MA: Springer US; 2001:2107–2110.

25. Baiocchi M, Friedberg R, Rosenman ETR, et al. Prevalence and risk factors for sexual assault among class 6 female students in unplanned settlements of Nairobi, Kenya: baseline analysis from the ImPower & Sources of Strength cluster randomized controlled trial. PLoS One. 2019;14(6):e0213359.

26. Cook SL, Gidycz CA, Koss MP, et al. Emerging issues in the measurement of rape victimization. Violence Against Women. 2011;17(2):201–218.

27. Fisher BS, Cullen FT. Measuring the sexual victimization of women: evolution, current controversies, and future research. Criminal Justice. 2000;4:317–390.
