Abstract

We consider the problem of designing a prospective randomized trial in which the outcome data will be self-reported and will involve sensitive topics. Our interest is in how a researcher can adequately power her study when some respondents misreport the binary outcome of interest. To correct the power calculations, we first obtain expressions for the bias and variance induced by misreporting. We model the problem by assuming each individual in our study is a member of one “reporting class”: a true-reporter, false-reporter, never-reporter, or always-reporter. We show that the joint distribution of reporting classes and “response classes” (characterizing individuals’ response to the treatment) will exactly define the error terms for our causal estimate. We propose a novel procedure for determining adequate sample sizes under the worst-case power corresponding to a given level of misreporting. Our problem is motivated by prior experience implementing a randomized controlled trial of a sexual-violence prevention program among adolescent girls in Kenya.

In a wide swath of social science and public health research, outcome data are drawn from self-reported survey responses (3). Accurate responses are paramount to the integrity of such research, and, as a result, there is a vast literature discussing best practices for surveying sensitive topics, including specially designed survey instruments (4). Nonetheless, there are myriad examples of surveys whose conclusions were skewed by subpopulations of misreporters, especially when respondents were adolescents (1).

Statistical corrections have been developed for a variety of problems involving mismeasured data (5). Modern methods move beyond the classical assumption of zero correlation between errors and outcomes, seeking instead to adjust model-based procedures to mitigate bias and properly account for uncertainty (6). In causal inference, there has been substantial research interest in the issue of treatment compliance (7, 8), and the related issue of measurement error in treatment assignment variables (2, 9). Measurement error in outcome variables has been comparatively understudied.

The motivation for this work arises from prior experience designing and analyzing a randomized trial of a sexual-violence prevention program targeting adolescent girls in Kenya (10). In that study, girls reported key outcomes via surveys. Inaccurate reports represented a threat to causal inference.

Our goals in this paper are threefold. First, we define a mathematical framework for reasoning about the error induced in causal estimation due to measurement error in binary outcome variables. We show that the joint distribution of “reporting classes” (defining how individuals’ reported outcomes differ from their true outcomes) and “response classes” (defining how individuals respond to treatment) fully characterizes the bias and variance of the causal estimator. This yields insights into how to mitigate misreporting error in the design phase of the experiment. Second, we describe a method we have developed for determining adequate sample size to achieve a desired detection power, in the presence of misreporting. Third, we discuss a special case in which an analyst can account for misreporting directly by including a bias correction.

The remainder of this paper proceeds as follows. Related Work reviews prior literature on measurement error and response bias. Definitions and Examples introduces our definition of reporting classes and provides motivating examples. Theoretical Results details our main theoretical results, including a novel procedure for updating sample size calculations in the presence of misreporting. Methods in Context discusses how these results can be used by practitioners, and includes a simulation study as well as bias correction results. We then offer concluding remarks.

RELATED WORK

Our work builds on themes from the psychometrics literature, wherein researchers design survey measurements and analysis practices to more accurately measure sensitive data (11). Some survey respondents are hesitant to report sensitive data, or could be harmed if their individual responses were observed. Droitcour and Larson (4) address this problem with the three-card method, which protects individual privacy while allowing for unbiased population estimates. Cimpian and Timmer (1) address bias in estimates due not to privacy concerns but rather to mischievous responders. Their work addresses students who falsely self-report as lesbian, gay, bisexual, transgender, queer, or questioning (LGBTQ) and correspondingly skew estimates in studies that aim to understand and serve this community. Instead of changing the survey instrument, the authors leverage statistical techniques, such as boosted regressions with outlier removal, to address bias in the responses.

Regarding measurement error in treatments, Flegel et al. (12) show how errors in exposure measurement can yield differential misclassification, wherein misclassification probabilities for disease prediction differ by disease status. They focus on nondifferential measurement error, meaning the errors themselves do not differ by study arm (as opposed to differential errors, where measurement error can differ in the treatment and control groups). As a solution, Richardson et al. (13) suggest finding an unexposed reference group and estimating a disease score that quantifies unintended exposure effects. This can confer conditional independence between the potential outcomes under no exposure and the measured covariates. In another approach, Cole et al. (14) address measurement error in exposures for longitudinal observational studies, suggesting either multiple imputation or Bayesian methods.

Our work considers measurement error in outcomes. Tennekoon and Rosenman (15) provide a parametric model for misclassified binary outcomes that vary with covariates, showing how to find consistent estimates for model parameters and individual misclassification probabilities. VanderWeele and Li (16) propose a sensitivity analysis to determine how strong differential measurement error in either the treatment or the outcome would need to be to nullify the entire result. Their method compares the true effect of the exposure with the association between the exposure and the mismeasured outcome, on scales that depend on risk ratios of the true outcome probabilities conditioned on the exposure and the mismeasured probabilities, respectively.

Tools from causal inference have been useful in recent literature on differential measurement error. Imai and Yamamoto (17) derive bounds of the average treatment effect under differential measurement error in a binary treatment variable and introduce a sensitivity analysis to help the researcher understand the estimate’s robustness to such errors. Edwards et al. (18) also leverage insights from causal inference to discuss measurement error; our work is most similar in spirit to their research. The authors use potential outcomes to formalize a discussion of “missing data” due to measurement error in exposures. We build on this by introducing a framework for misreporting behavior, and deriving the statistical properties of treatment effect estimates under different configurations.

Our construction of reporting class is parallel to the compliance classes laid out in the work of Angrist et al. (19). The definition of response classes is drawn from Hernán and Robins (20).

DEFINITIONS AND EXAMPLES

In this section, we work through an illustrative example and introduce terminology.

Outcome definitions

We review several versions of the (binary) outcome that will be necessary for our analysis.

We operate in the potential outcomes framework (21). Each individual i in the trial has 2 associated outcomes: $Y_i(1) \in \{0,1\}$, the outcome if treated, and $Y_i(0) \in \{0,1\}$, the outcome if given the control. We stipulate that the stable unit treatment value assumption (SUTVA) holds, meaning the outcome for each individual is unaffected by the assignment of treatments to other individuals (22).

Next, we denote as $Y_i^{(t)} \in \{0,1\}$ the true, realized outcome experienced by individual i. If we define an indicator $W_i \in \{0,1\}$, such that $W_i = 1$ for treated units and 0 otherwise, then the potential outcomes relate to the true realized outcome via the simple relationship

$$Y_i^{(t)} = W_i\,Y_i(1) + \left(1 - W_i\right)Y_i(0).$$
Last, we define $Y_i^{(r)} \in \{0,1\}$ as the outcome reported by individual i. In our setting, it is possible that $Y_i^{(r)} \ne Y_i^{(t)}$.

Reporting classes

Consider a randomized trial in which students are randomly assigned to receive either a violence prevention program (“intervention”) or an unrelated training (“control”). The intervention’s goal is to reduce students’ experience of violence, so the outcome is binary: “yes, I experienced violence in the prior 12 months” ($= 1$) or “no, I did not experience violence in the prior 12 months” ($= 0$).

Reporting classes are defined by the relationship between $Y_i^{(t)}$ and $Y_i^{(r)}$. The classes are summarized in Table 1. A true-reporter behaves as expected: If they experienced violence, they report having experienced violence, and if they did not experience violence, they report not having experienced violence. A false-reporter reports the opposite of their realized outcome. A never-reporter reports not having experienced violence regardless of their actual experience of violence. An always-reporter reports experiencing violence regardless of their actual experience.

Table 1

Reporting Behavior for Each of the 4 Reporting Classesa

Reporting Class    $Y_i^{(r)}$ When $Y_i^{(t)} = 0$    $Y_i^{(r)}$ When $Y_i^{(t)} = 1$
True               0                                   1
False              1                                   0
Never              0                                   0
Always             1                                   1

a This table summarizes the reporting class categories on the basis of the combination of experience (yes or no) and reported outcome (yes or no).


We can hypothesize causes for misreporting behavior. We focus on never-reporters, the most plausible class of misreporters in the violence prevention setting. Suppose there are 3 causes of failing to report: 1) Students fear negative consequences if they report a violent event; 2) students feel shame for having been a victim of violence; or 3) students are confused about what constitutes violence.

Careful consideration of such hypotheses may help researchers to modify a prospective study to mitigate misreporting bias. For example: 1) To address concerns of negative consequences of the report, the researchers may use a technique like “randomized response,” which offers a higher degree of anonymity; 2) to reduce feelings of shame, researchers may word the question to avoid triggers of shame; and 3) to clarify what constitutes “experiencing violence,” researchers may use detailed descriptive scenarios, name specific physical and verbal acts, or use common slang terms more familiar to students. Further discussion can be found in Web Appendix 1 (available at https://doi.org/10.1093/aje/kwad027).

Response classes

We now turn to a second type of grouping, response classes. Unlike reporting classes, response classes are defined by the relationship between the potential outcomes $Y_i(1)$ and $Y_i(0)$. Definitions are provided in Table 2. For a useful prior discussion of this concept, see Hernán and Robins (20).

Note that our naming convention for response classes departs from the one used by Hernán and Robins. We seek to avoid attaching valences to the outcome directions (e.g., instead of “Helped,” we use “Decrease”).

Interaction between class types

The joint distribution of response and reporting classes characterizes the bias and variance of the difference-in-means causal estimator.

To understand why, consider how response and reporting classes jointly vary in the violence prevention example. Among students in the Decrease response class, for whom $\left(Y_i(0) = 1, Y_i(1) = 0\right)$, there may be individuals who feel that they can modify their chances of experiencing violence and hence experience shame if they are unable to prevent violence. Contrast that group with the Predisposed group $\left(Y_i(0) = Y_i(1) = 1\right)$. These individuals may feel violence is inevitable and therefore feel more comfortable reporting their experiences truthfully. In this scenario, never-reporting would be more prevalent among those responsive to the treatment. This would bias estimates of the treatment effect toward 0, much more so than if never-reporting behavior were equally probable among the Decrease and Predisposed classes.

Table 2

Potential Outcomes for Each of the 4 Response Classesa

Response Class    $Y_i(0)$    $Y_i(1)$
Decrease          1           0
Increase          0           1
Unsusceptible     0           0
Predisposed       1           1

a The response classes are defined by the relationship between the potential outcomes $Y_i(1)$ and $Y_i(0)$. For a useful prior discussion of this concept, see Hernán and Robins (20).


Some misreporting behavior can be mitigated in the design phase of the experiment. In particular, how measurements are obtained matters greatly: aspects of the measurement tool, such as question wording and response type (e.g., multiple choice, open response), as well as the context in which it is asked (e.g., computer-based survey versus one-on-one interview). Methods geared toward establishing trust with respondents may engender more truthful responses.

However, once the measurement procedures are determined, we take the reporting class to be a fixed feature of the individual. Defining the reporting class in this way precludes any type of measurement error that arises differentially by treatment arm. For example, we do not allow for the possibility that the violence prevention program induces individuals to underreport experiences of violence more than they might have otherwise. Such “demand effects” are outside the scope of this work.

A remark about false-reporters

We have not offered an example of false-reporters. Like “defiers” in the instrumental variables literature, we suspect they are uncommon. The motivation to avoid telling the truth—in fact, being willing to report either outcome level, just not the true one—is a more complicated dynamic than the motivations discussed above. We are aware of anecdotal examples of quarrelsome study participants who might function as false-reporters, but we assume they are absent in the analyses to follow.

THEORETICAL RESULTS

In this section, we formalize the insights described previously, and introduce our main theoretical results.

Preliminaries

Model, assumptions, and notation.

We consider a superpopulation of $i = 1, \dots, N_{\mathrm{sp}}$ units, of whom $2n \ll N_{\mathrm{sp}}$ will be sampled for a completely randomized experiment. Once the experimental units are drawn, n of the units are selected via simple random sampling to receive the treatment, while the remaining n units are assigned to control. Our target of estimation is the superpopulation treatment effect,

$$\tau = \frac{1}{N_{\mathrm{sp}}}\sum_{i=1}^{N_{\mathrm{sp}}}\left(Y_i(1) - Y_i(0)\right).$$
Denote by $R_i \in \{0,1\}$ the indicator of being sampled into the experiment and $W_i \in \{0,1\}$ the indicator of receiving the treatment. The 2 processes are assumed independent, such that $R_i \perp\!\!\!\perp W_j$ for all $i, j$. We denote as $\mathbb{E}_R(\cdot)$ expectation with respect to the $R_i$ and $\mathbb{E}_W(\cdot)$ expectation with respect to the $W_i$. For any estimator $\phi$, define

$$\mathbb{E}(\phi) = \mathbb{E}_R\left(\mathbb{E}_W(\phi)\right).$$

We associate with each unit 2 additional fixed binary constants: $N_i \in \{0,1\}$, equal to 1 if and only if unit i is a never-reporter, and $A_i \in \{0,1\}$, equal to 1 if and only if unit i is an always-reporter.

Per the discussion in the prior section, we make the following assumption.  

Assumption 1.

Every unit in the superpopulation is a never-reporter, an always-reporter, or a true-reporter. There are no false-reporters.

Under assumption 1, $N_i A_i = 0$ for all units, and $N_i = A_i = 0$ implies a unit is a true-reporter. We refer collectively to the group of never-reporters and always-reporters as “misreporters.”

Table 3

Population Proportions Across Response and Reporting Classesa

Response Class    Y(0)    Y(1)    True-Reporter      Always-Reporter    Never-Reporter     False-Reporter    Total
Decrease          1       0       $\overline{TD}$    $\overline{AD}$    $\overline{ND}$    x                 $\overline{D}$
Increase          0       1       $\overline{TI}$    $\overline{AI}$    $\overline{NI}$    x                 $\overline{I}$
Unsusceptible     0       0       $\overline{TU}$    $\overline{AU}$    $\overline{NU}$    x                 $\overline{U}$
Predisposed       1       1       $\overline{TP}$    $\overline{AP}$    $\overline{NP}$    x                 $\overline{P}$
Total                             $\overline{T}$     $\overline{A}$     $\overline{N}$

a This contingency table summarizes the structure of the superpopulation, defining the proportions of the population that jointly fall into each reporting and response class. The final row defines the marginal frequencies of each reporting class, while the final column defines the marginal frequencies of each response class. Boxes marked with an “x” are precluded by our assumptions.


We can express the reported outcome as

$$Y_i^{(r)} = A_i + \left(1 - A_i\right)\left(1 - N_i\right)Y_i^{(t)}.$$

By inspection, the above definition yields the expected reporting behavior: Regardless of the value of $Y_i^{(t)}$, the reported outcome is 0 if $N_i = 1$ and 1 if $A_i = 1$. Only if $A_i = N_i = 0$ do we have $Y_i^{(r)} = Y_i^{(t)}$. We also define $\overline{N}$ and $\overline{A}$ as the superpopulation averages of the indicators $N_i$ and $A_i$.
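As a quick sanity check, the following Python snippet (our illustration, not part of the original analysis) verifies that this rule reproduces the reporting behavior summarized in Table 1:

```python
# Verify the reporting rule Y_r = A + (1 - A) * (1 - N) * Y_t against Table 1.
for label, A_i, N_i in [("true", 0, 0), ("always", 1, 0), ("never", 0, 1)]:
    for Y_t in (0, 1):
        Y_r = A_i + (1 - A_i) * (1 - N_i) * Y_t
        print(f"{label}-reporter: Y_t = {Y_t} -> Y_r = {Y_r}")
```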

Response classes.

We define binary indicators $D_i, I_i, U_i, P_i \in \{0,1\}$ reflecting whether each individual i falls into the Decrease, Increase, Unsusceptible, or Predisposed response classes defined in the Definitions and Examples section. Every individual belongs to exactly one class, so

$$D_i + I_i + U_i + P_i = 1.$$

Using these definitions, we summarize our superpopulation via Table 3.

Boxes marked with an “x” are precluded by our assumptions. In the remaining boxes, the given symbol denotes the population frequency of units that fall into the given category. Marginal probabilities are given in the bottom row and rightmost column.

Eagle-eyed readers will note a point of ambiguity in these definitions. If individual i is in the Unsusceptible response class, then she will provide the exact same responses whether she is a true-reporter or a never-reporter: Namely, she will report $Y_i^{(r)} = 0$ irrespective of treatment level. Conversely, a Predisposed individual will identically report $Y_i^{(r)} = 1$ whether she is a true- or always-reporter, irrespective of treatment.

The choice of how to designate such individuals is essentially academic: Whether or not we classify them as true-reporters, we obtain the same values for the bias and variance of the causal estimator. However, the sensitivity model to be introduced later on will be simpler to conceptualize if we do not enforce $\overline{NU} = \overline{AP} = 0$. Hence, we will proceed with the population structure as described in Table 3, although we emphasize that readers can consider the classes $\overline{TU}$ and $\overline{NU}$ as comprising a single “type” of study participant, and the classes $\overline{TP}$ and $\overline{AP}$ as comprising another type.

Bias results

To update power calculations, we first must characterize the bias and variance resulting from misreporting. We consider the bias of the finite-sample difference-in-means estimator,

$$\hat{\tau} = \frac{1}{n}\sum_{i : R_i = 1} W_i\,Y_i^{(r)} \;-\; \frac{1}{n}\sum_{i : R_i = 1} \left(1 - W_i\right)Y_i^{(r)}.$$
Theorem 1.
Bias of difference-in-means estimator. Define $\tau_i = Y_i(1) - Y_i(0)$ for $i = 1, 2, \dots, N_{\mathrm{sp}}$ such that

$$\tau = \frac{1}{N_{\mathrm{sp}}}\sum_{i=1}^{N_{\mathrm{sp}}} \tau_i.$$

The bias of the difference-in-means estimator in estimating $\tau$ is given by

$$\mathbb{E}\left(\hat{\tau}\right) - \tau = -\left(\overline{A} + \overline{N}\right)\tau - \mathrm{Cov}\left(A, \tau\right) - \mathrm{Cov}\left(N, \tau\right),$$

where $\mathrm{Cov}(A, \tau)$ is the superpopulation covariance between $A_i$ and $\tau_i$, and $\mathrm{Cov}(N, \tau)$ is defined analogously.

For a proof, see Web Appendix 2.
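The claim in Theorem 1 is also easy to check numerically. The sketch below constructs a synthetic superpopulation (the class frequencies and outcome rates are arbitrary choices of ours) and compares the bias formula against the expectation of the difference-in-means estimator, which, averaged over both randomizations, equals the superpopulation mean of reported treated potential outcomes minus that of reported control potential outcomes:

```python
# Numerical check of the Theorem 1 bias expression on a synthetic superpopulation.
import numpy as np

rng = np.random.default_rng(0)
N_sp = 1_000_000
A = rng.binomial(1, 0.05, N_sp)                  # always-reporter indicators
N = (1 - A) * rng.binomial(1, 0.10, N_sp)        # never-reporters; enforces A_i * N_i = 0
Y0 = rng.binomial(1, 0.07, N_sp)                 # control potential outcomes
Y1 = np.where(Y0 == 1, rng.binomial(1, 0.5, N_sp), 0)  # treatment never harms
tau_i = Y1 - Y0
tau = tau_i.mean()                               # superpopulation treatment effect

def report(Y):
    # Reporting rule: Y_r = A + (1 - A) * (1 - N) * Y_t.
    return A + (1 - A) * (1 - N) * Y

bias_direct = report(Y1).mean() - report(Y0).mean() - tau
cov = lambda u, v: (u * v).mean() - u.mean() * v.mean()
bias_formula = -(A.mean() + N.mean()) * tau - cov(A, tau_i) - cov(N, tau_i)
print(bias_direct, bias_formula)                 # identical up to floating point
```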

Power under independence

Theorem 1 shows there are several factors at play in determining the bias due to misreporting behavior. If, in the superpopulation, we have $A_i, N_i \perp\!\!\!\perp I_i, D_i$, then our estimate will be shrunk toward 0 by a multiplicative factor:

$$\mathbb{E}\left(\hat{\tau}\right) = \left(1 - \overline{A} - \overline{N}\right)\tau.$$

These results can be extended to a power analysis. We consider Neyman’s null hypothesis versus a one-directional alternative,

$$H_0: \tau = 0 \quad \text{versus} \quad H_1: \tau < 0.$$

Under independence, the estimate is increasingly attenuated as misreporter incidence grows. This yields a strict decline in power, as formalized in Theorem 2.

Theorem 2.

Suppose $\tau < 0$ and, in the superpopulation, $A_i, N_i \perp\!\!\!\perp I_i, D_i$. Then, the detection power is a strictly decreasing function of $\overline{A}$ and $\overline{N}$.

For a proof, see Web Appendix 3.
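To make the attenuation concrete, the sketch below computes the required per-arm sample size under the independence condition, assuming (our choice, for illustration) a one-sided two-proportion test at $\alpha = 0.05$ with 80% power. Misreporting enters only through the reported-outcome means of the 2 arms:

```python
# Required per-arm sample size when reporting and response classes are independent.
import numpy as np
from scipy.stats import norm

def n_per_arm(p0, p1, A_bar=0.0, N_bar=0.0, alpha=0.05, beta=0.20):
    # Reported means: always-reporters contribute 1, never-reporters 0,
    # and true-reporters contribute the true outcome rates.
    m1 = A_bar + (1 - A_bar - N_bar) * p1        # treated arm
    m0 = A_bar + (1 - A_bar - N_bar) * p0        # control arm
    delta = abs(m1 - m0) / np.sqrt(m1 * (1 - m1) + m0 * (1 - m0))
    return ((norm.ppf(1 - alpha) + norm.ppf(1 - beta)) / delta) ** 2

print(n_per_arm(0.07, 0.035))                    # no misreporting: about 499
print(n_per_arm(0.07, 0.035, N_bar=0.20))        # 20% never-reporters: about 632
```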

In the absence of the independence condition, detection powers can rise or fall with misreporter incidence. In the next section, we introduce the concept of worst-case power under a sensitivity model.

Sensitivity model

In many practical settings, it is unrealistic to expect exact independence between reporting and response classes (i.e., that the proportions of true-reporters, always-reporters, and never-reporters are identical across all response classes). However, deviations from independence may be relatively small in magnitude. To quantify these deviations, we introduce the bounding quantity $\Gamma$.

Definition 1.
Under the sensitivity model indexed by $\Gamma \ge 1$, every subgroup proportion in Table 3 differs by a factor no greater than $\Gamma$ from the proportion under row-column independence. For example,

$$\frac{1}{\Gamma}\,\overline{T}\cdot\overline{D} \;\le\; \overline{TD} \;\le\; \Gamma\,\overline{T}\cdot\overline{D},$$

with analogous bounds for the other nonzero entries in Table 3.

Under definition 1, a $\Gamma = 1$ bound enforces that the reporting and response classes be independent. A looser bound of $\Gamma = 2$ indicates that true-reporters, never-reporters, and always-reporters can be no more than twice but no less than half as frequent within any of the 4 response classes than they are on average across those response classes.

Suppose we fix values for both the reporting class frequencies ($\overline{T}, \overline{A}, \overline{N}$) and the response class frequencies $\left(\overline{D}, \overline{I}, \overline{U}, \overline{P}\right)$. Then, larger values of $\Gamma$ allow for increasingly adversarial configurations of the joint distributions of the reporting classes and response classes, such that detection of a treatment effect can be made increasingly difficult.
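The feasibility check implied by definition 1 is straightforward to code. The helper below (a sketch; all names are ours) tests whether a candidate joint table of proportions satisfies the $\Gamma$ bounds:

```python
# Check a candidate 3x4 joint table (rows: T, A, N; columns: D, I, U, P)
# against the Gamma bounds of Definition 1.
import numpy as np

def satisfies_gamma(delta, gamma):
    pi_rep = delta.sum(axis=1)                   # marginal reporting frequencies
    pi_res = delta.sum(axis=0)                   # marginal response frequencies
    indep = np.outer(pi_rep, pi_res)             # table under row-column independence
    ratio = np.ones_like(delta)
    mask = indep > 0
    ratio[mask] = delta[mask] / indep[mask]
    if np.any(~mask & (delta > 0)):              # mass where independence has none
        return False
    return bool(np.all((ratio <= gamma) & (ratio >= 1.0 / gamma)))

pi_rep = np.array([0.80, 0.00, 0.20])
pi_res = np.array([0.035, 0.00, 0.930, 0.035])
print(satisfies_gamma(np.outer(pi_rep, pi_res), gamma=1.0))  # True: exact independence
```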

Power and sample size calculations

Using our sensitivity model, we turn to our central task: computing adequate sample size for a prospective randomized controlled trial.

Suppose an analyst is designing an experiment for a treatment intended to have a preventive effect on an outcome of interest. The analyst has a desired type I error probability no greater than $\alpha \in (0,1)$ and can tolerate a type II error probability no greater than $\beta \in (0,1)$ (so her desired power is $1 - \beta$).

To proceed, the analyst must provide prospective values of the reporting class frequencies ($\overline{T}, \overline{A}, \overline{N}$) and the response class frequencies $\left(\overline{D}, \overline{I}, \overline{U}, \overline{P}\right)$, as well as a value for $\Gamma$. As in traditional power calculations, we suggest using empirical estimates from previous research to guide the selection of these parameters; further discussion can be found in the Validation Studies subsection. Define fixed vectors,

$$\pi_{\mathrm{rep}} = \left(\overline{T}, \overline{A}, \overline{N}\right), \qquad \pi_{\mathrm{res}} = \left(\overline{D}, \overline{I}, \overline{U}, \overline{P}\right),$$

and optimization variables,

$$\delta = \left(\overline{TD}, \overline{TI}, \overline{TU}, \overline{TP}, \overline{AD}, \overline{AI}, \overline{AU}, \overline{AP}, \overline{ND}, \overline{NI}, \overline{NU}, \overline{NP}\right),$$

the vector of joint proportions in Table 3. The expected mean outcomes among treated and control units are functions of these values:

$$\mu_t(\delta) = \overline{TI} + \overline{TP} + \overline{AD} + \overline{AI} + \overline{AU} + \overline{AP}, \qquad \mu_c(\delta) = \overline{TD} + \overline{TP} + \overline{AD} + \overline{AI} + \overline{AU} + \overline{AP}.$$

Next, we express the standardized expectation of the difference-in-means estimator as

$$\Delta(\delta) = \frac{\mu_t(\delta) - \mu_c(\delta)}{\sqrt{\mu_t(\delta)\left(1 - \mu_t(\delta)\right) + \mu_c(\delta)\left(1 - \mu_c(\delta)\right)}}.$$

Finally, we can pose the sample size computation as an optimization problem. We want the minimum treatment and control sample size n such that we still achieve the desired power level under all allowable values of $\delta$. This is directly encoded in Optimization Problem 1:

$$\begin{aligned} n^\ast \;=\; \max_{\delta}\;\; & \left(\frac{\Phi^{-1}(1-\alpha) + \Phi^{-1}(1-\beta)}{\Delta(\delta)}\right)^{2} \\ \text{subject to}\;\; & \delta \ge 0, \quad \mathbf{1}_c^{\top}\delta = 1, \\ & \text{the row and column sums of } \delta \text{ equal } \pi_{\mathrm{rep}} \text{ and } \pi_{\mathrm{res}}, \\ & \delta \text{ satisfies the } \Gamma \text{ bounds of definition 1}, \end{aligned} \tag{1}$$

where $\Phi(\cdot)$ is the CDF of a standard normal (so that $\Phi^{-1}(1-\alpha)$ and $\Phi^{-1}(1-\beta)$ are the usual critical values), and $\mathbf{1}_c$ is the length-c vector containing all ones. Minimizing $\Delta(\delta)^2$ over the constraint set is a quadratic fractional program and can be solved efficiently using Dinkelbach’s method (23, 24). For more details, see Web Appendix 4.
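The sketch below implements one version of this computation, assuming a one-sided two-proportion test at level $\alpha$ with power $1 - \beta$. It searches for the worst-case table with scipy’s general-purpose SLSQP solver rather than Dinkelbach’s method, so it should be read as an illustration of the problem structure rather than as the implementation in Web Appendix 6; all function and variable names are ours:

```python
# Worst-case total sample size over joint tables satisfying the Gamma bounds.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def worst_case_total_n(pi_rep, pi_res, gamma, alpha=0.05, beta=0.20):
    # pi_rep = (T, A, N) marginals; pi_res = (D, I, U, P) marginals.
    indep = np.outer(pi_rep, pi_res)             # starting point: independence

    def sq_std_effect(x):
        d = x.reshape(3, 4)
        # Units report 1 if always-reporters, or true-reporters whose realized
        # outcome is 1 (classes I, P if treated; classes D, P if control).
        mu_t = d[1].sum() + d[0, 1] + d[0, 3]
        mu_c = d[1].sum() + d[0, 0] + d[0, 3]
        var = mu_t * (1 - mu_t) + mu_c * (1 - mu_c)
        return (mu_t - mu_c) ** 2 / var          # minimized => worst case

    cons = [{"type": "eq", "fun": lambda x, j=j: x.reshape(3, 4)[j].sum() - pi_rep[j]}
            for j in range(3)]
    cons += [{"type": "eq", "fun": lambda x, k=k: x.reshape(3, 4)[:, k].sum() - pi_res[k]}
             for k in range(4)]
    bounds = list(zip((indep / gamma).ravel(), (indep * gamma).ravel()))
    res = minimize(sq_std_effect, indep.ravel(), method="SLSQP",
                   bounds=bounds, constraints=cons)
    n_arm = ((norm.ppf(1 - alpha) + norm.ppf(1 - beta)) ** 2) / res.fun
    return 2 * int(np.ceil(n_arm))
```

On the inputs of the next section, this sketch should reproduce the qualitative behavior of Figure 1, with totals matching the quoted figures up to rounding conventions and solver tolerance.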

METHODS IN CONTEXT

Example: sexual-violence prevention study

We consider an example based on prior work evaluating a sexual-violence prevention program for adolescent girls in Kenya. From a pilot study (25), we estimated the annual baseline rate of sexual violence in this population to be approximately 7%, and we estimated that the intervention reduces the incidence of sexual violence by 50%. Our goal is to identify the required sample size for a larger, follow-up study.

A standard power calculation, ignoring misreporting, yields a minimum sample size of 998 units. We consider how the presence of misreporters might necessitate larger samples. We expect never-reporters to exist within the study population, because some girls may be disinclined to disclose sexual violence, due to feelings of shame or an inability to understand their experiences as violent (26, 27). Always-reporters and false-reporters are assumed absent.

In the ideal case, the pilot would provide guidance on the values of $\overline{N}$ (the frequency of never-reporters) and $\Gamma$ (the bound on deviations from independence). Here, we consider a grid of possible values for both parameters, where $0 \le \overline{N} \le 0.20$ and $1 \le \Gamma \le 2$. We must also posit values for the response class frequencies, $\overline{I}$, $\overline{D}$, $\overline{P}$, and $\overline{U}$. Under the assumption that $\overline{I} = 0$ (i.e., no one is harmed by the treatment), our choices for the other frequencies are fully defined by our assumed baseline rate and treatment effect: The baseline rate gives $\overline{D} + \overline{P} = 0.07$ and the halved treated rate gives $\overline{I} + \overline{P} = 0.035$, so $\overline{D} = 0.035$, $\overline{P} = 0.035$, and $\overline{U} = 0.930$.

To compute the required sample sizes, we solve Optimization Problem 1 repeatedly at each possible pair of values of $\overline{N}$ and $\Gamma$. Results are plotted in Figure 1. Recall that these are worst-case sample sizes, assuming the incidence of never-reporters is as adversarial as possible under the $\Gamma$ constraint. Specifically, the algorithm allocates as much of the population as possible to the $\overline{ND}$ subgroup subject to the constraints, yielding the smallest possible causal estimate and the largest required sample size.
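The grid itself can be produced by repeated calls to the sketch above (values follow the ranges in the text; totals may differ slightly from Figure 1 due to rounding and solver tolerance):

```python
# Worst-case totals over the grid of never-reporter frequency and Gamma,
# reusing worst_case_total_n from the earlier sketch.
import numpy as np

pi_res = np.array([0.035, 0.0, 0.930, 0.035])    # (D, I, U, P) from the text
for N_bar in np.linspace(0.0, 0.20, 5):
    pi_rep = np.array([1.0 - N_bar, 0.0, N_bar])  # no always-reporters
    sizes = [worst_case_total_n(pi_rep, pi_res, g) for g in (1.0, 1.5, 2.0)]
    print(f"N_bar = {N_bar:.2f}: {sizes}")
```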

Figure 1

Required sample size for the sexual violence prevention program study (in Kenya), at different values of never-reporter frequency. Note these are worst-case sample sizes, assuming the incidence of never-reporters is as adversarial as possible under the $\Gamma$ constraint.

If $\Gamma = 1$ (i.e., response and reporting classes assumed independent), the worst-case sample size grows approximately linearly with never-reporter frequency, and the growth rate is slow. Even if 20% of study participants were never-reporters, the required sample size would grow to only 1,260 participants. However, as we increase $\Gamma$, the required sample size grows much more quickly with $\overline{N}$. At $\Gamma = 2$, the required sample size grows to over 2,200 participants when 20% of the population is composed of never-reporters.

Figure 1 demonstrates the risks of an underpowered study due to the presence of never-reporters—especially if never-reporters are overrepresented among those responsive to the treatment. These insights allow practitioners to quantify how much larger a study must be in order to detect an effect.

Further simulations can be found in Web Appendix 5. Corresponding to these simulations, figures analogous to Figure 1 can be found in Web Figure 1 and Web Figure 2. Additionally, details on the code to implement these simulations can be found in Web Appendix 6 (the “Code for Practitioners” repository).

Validation studies

We next consider how to estimate key parameters for use in Optimization Problem 1.

In the ideal case, we would conduct a smaller pilot study prior to the design of the main randomized controlled trial. In this study—which we refer to as a “pre-study” or a “validation study,” because its goals involve comparing the proposed measurement against a gold standard—a sample of individuals would be recruited from the intended study population, with some individuals randomized to the treatment and others to the standard of care. We focus our attention exclusively on validation studies that are under the researchers’ control, such that treatment can indeed be randomized and reliably recorded.

At follow-up, participants would first be asked to provide their responses using the proposed measurement instrument (e.g., a survey). These results would then be validated via a gold-standard measurement, one that is essentially error-free but is usually costly or labor-intensive (such as an in-person interview). At the completion of this validation study, researchers would have access to the triplet $\big(Y_i^{(t)}, Y_i^{(r)}, W_i\big)$ for a sample of potential participants. We denote as $m_t = \sum_i W_i$ the number of treated units in the validation study, and $m_c = \sum_i \left(1 - W_i\right)$ the number of control units.

These data can be used to improve the quality of the subsequent trial in several ways. The pre-study affords the research team an opportunity to evaluate the trade-offs of using the gold-standard versus the proposed measure during the randomized trial.

Moreover, the pre-study can inform parameter choices when evaluating the necessary sample size for the experiment. If we impose assumption 1 (under which there are no false-reporters), then we can directly estimate a number of population quantities by relating them to empirical frequencies observed in the pre-study. To see how, we first define

$$Y_i^{(t)}(w), \qquad w \in \{0,1\},$$

the true outcome individual i would experience under treatment assignment w, as well as

$$Y_i^{(r)}(w) = A_i + \left(1 - A_i\right)\left(1 - N_i\right)Y_i^{(t)}(w),$$

the corresponding reported outcome.

We note that $Y_i^{(t)}(1) = Y_i(1)$ and $Y_i^{(t)}(0) = Y_i(0)$ (i.e., these are observations of the underlying potential outcomes), but we find this notation more intuitive.

Each unit in the pre-study sample will receive either the treated or control condition, so we observe either $\big(Y_i^{(t)}(1), Y_i^{(r)}(1)\big)$ or $\big(Y_i^{(t)}(0), Y_i^{(r)}(0)\big)$ for each unit. Hence, we can simply count the empirical frequencies of ones and zeros among true and reported outcomes to obtain unbiased estimates of key superpopulation frequencies. For example, if we obtain the proportion of treated units for whom $Y_i^{(t)}(1) = Y_i^{(r)}(1) = 1$ in the pre-study, then we observe

$$\mathbb{E}\left(\frac{1}{m_t}\sum_{i : W_i = 1} Y_i^{(t)}(1)\,Y_i^{(r)}(1)\right) = \overline{TI} + \overline{TP} + \overline{AI} + \overline{AP}.$$

Similarly, if we obtain the proportion of control units for whom $Y_i^{(t)}(0) = 1, Y_i^{(r)}(0) = 0$, then

$$\mathbb{E}\left(\frac{1}{m_c}\sum_{i : W_i = 0} Y_i^{(t)}(0)\left(1 - Y_i^{(r)}(0)\right)\right) = \overline{ND} + \overline{NP}.$$

In Table 4 and Table 5, we summarize each of these relationships. Under assumption 1, every cell in these tables can be estimated unbiasedly by computing the empirical frequency of the row and column values observed in the pre-study.
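Tabulating the cells of Tables 4 and 5 from pre-study data is then a simple counting exercise. The sketch below uses hypothetical placeholder arrays for the observed triplets:

```python
# Empirical cell frequencies for Tables 4 and 5 from pre-study triplets.
import numpy as np

Y_t = np.array([1, 0, 0, 1, 0, 1, 0, 0])   # gold-standard outcomes (hypothetical)
Y_r = np.array([1, 0, 0, 0, 0, 1, 1, 0])   # instrument-reported outcomes
W   = np.array([1, 1, 0, 0, 1, 0, 0, 1])   # treatment indicators

for w, label in ((1, "treated (Table 4)"), (0, "control (Table 5)")):
    arm = W == w
    print(label)
    for yt in (1, 0):
        for yr in (1, 0):
            freq = np.mean((Y_t[arm] == yt) & (Y_r[arm] == yr))
            print(f"  share with Y_t = {yt}, Y_r = {yr}: {freq:.3f}")
```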

Table 4

Joint Reporting and Response Classes for Treated Potential Outcomesa

                      $Y_i^{(r)}(1) = 1$                                           $Y_i^{(r)}(1) = 0$
$Y_i^{(t)}(1) = 1$    $\overline{TI}+\overline{TP}+\overline{AI}+\overline{AP}$    $\overline{NI}+\overline{NP}$
$Y_i^{(t)}(1) = 0$    $\overline{AD}+\overline{AU}$                                $\overline{TD}+\overline{TU}+\overline{ND}+\overline{NU}$

a The cells in this table represent the empirical response frequencies among treated individuals, under both the proposed measurement instrument and a gold-standard measure. In each cell, we show the sum of superpopulation proportions that is equal in expectation to the empirical response proportion represented by the cell.


Table 5

Joint Reporting and Response Classes for Untreated Potential Outcomesa

                      $Y_i^{(r)}(0) = 1$                                           $Y_i^{(r)}(0) = 0$
$Y_i^{(t)}(0) = 1$    $\overline{TD}+\overline{TP}+\overline{AD}+\overline{AP}$    $\overline{ND}+\overline{NP}$
$Y_i^{(t)}(0) = 0$    $\overline{AI}+\overline{AU}$                                $\overline{TI}+\overline{TU}+\overline{NI}+\overline{NU}$

a The cells in this table represent the empirical response frequencies among control individuals, under both the proposed measurement instrument and a gold-standard measure. In each cell, we show the sum of superpopulation proportions that is equal in expectation to the empirical response proportion represented by the cell.


Using the results summarized in Tables 4 and 5, we can estimate 8 different linear combinations of the population proportions described in Table 3. Seven of these linear combinations are independent, so we do not have sufficient information to obtain individual estimates of each of the 12 different population proportions in Table 3. However, we can arrive at such estimates by imposing additional constraints.

First, we can impose an assumption about the relative proportions of the Increase and Decrease response classes among each reporting class. Most typically, we would simply assume a unidirectional treatment effect, such that no individuals are assumed to fall in the Increase class or no individuals are assumed to fall in the Decrease class. This is common in practice, as we might expect that any treatment would help some individuals in the trial but not harm any other individuals in the trial. In other settings, it might be more realistic to instead assume a relative frequency of the Increase and Decrease classes within each reporting class—supposing, for example, that the proportions $\overline{TI}$, $\overline{AI}$, and $\overline{NI}$ are twice as large as $\overline{TD}$, $\overline{AD}$, and $\overline{ND}$, respectively. With any 3 linearly independent constraints of this type, we can estimate almost every nonzero proportion in Table 3.

The sole exception (per our discussion in the Theoretical Results section) is that we can identify only the sums $\overline{TU} + \overline{NU}$ and $\overline{TP} + \overline{AP}$, rather than the individual proportions themselves. We can use a simple heuristic to allocate these 2 sums to their constituent proportions. We advocate allocating them such that $\overline{NU} = \overline{N}\cdot\overline{U}$ and $\overline{AP} = \overline{A}\cdot\overline{P}$, with all of the remaining mass allocated to $\overline{TU}$ and $\overline{TP}$, respectively.
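A minimal sketch of this allocation heuristic (the function and argument names are ours):

```python
# Split the identified sums TU + NU and TP + AP via the marginal products.
def allocate(sum_TU_NU, sum_TP_AP, N_bar, A_bar, U_bar, P_bar):
    NU = N_bar * U_bar              # set NU = N-bar * U-bar
    AP = A_bar * P_bar              # set AP = A-bar * P-bar
    TU = sum_TU_NU - NU             # remaining mass goes to true-reporters
    TP = sum_TP_AP - AP
    return TU, NU, TP, AP
```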

After these steps, researchers will have access to a full set of pilot estimates for entries in Table 3. Taking row and column sums of these pilot estimates will provide the marginal probabilities that define $\pi = \left(\pi_{\mathrm{rep}}, \pi_{\mathrm{res}}\right)$, an input to Optimization Problem 1. The other input, $\Gamma$, can also be estimated by computing the maximal deviation from row-column independence among the pilot estimates. In practice, we suggest considering several different values of $\Gamma$, informed also by subject matter knowledge, in order to see how the required sample size evolves with this parameter.

Special case: bias correction

Typically, validation studies will be useful for obtaining pilot estimates for the parameters in Optimization Problem 1. If the validation study is of sufficient size and quality, however, it could be used to directly estimate a bias correction for the causal estimates in our larger study. In particular, observe that the estimator

$$\hat{B} = \left(\frac{1}{m_t}\sum_{i : W_i = 1} Y_i^{(r)}(1) - \frac{1}{m_c}\sum_{i : W_i = 0} Y_i^{(r)}(0)\right) - \left(\frac{1}{m_t}\sum_{i : W_i = 1} Y_i^{(t)}(1) - \frac{1}{m_c}\sum_{i : W_i = 0} Y_i^{(t)}(0)\right)$$

satisfies

$$\mathbb{E}\left(\hat{B}\right) = -\left(\overline{A} + \overline{N}\right)\tau - \mathrm{Cov}\left(A, \tau\right) - \mathrm{Cov}\left(N, \tau\right),$$

the bias of the difference-in-means estimator as defined in Theorem 1.

In other words, with no further assumptions, we can estimate the bias in the usual difference-in-means estimator when using the proposed measurement instrument in an experiment. In typical cases, this quantity can be used as a “gut check” to ensure that the bias is not intolerably large for the main randomized trial.

However, with a sufficiently large and high-quality validation study, $\hat{B}$ will have low variance, such that we may treat it as a precise estimate of the true bias. In this case, we can use $\hat{B}$ as a plug-in estimator and subtract this quantity from the causal estimate obtained from our larger randomized trial. This would yield a bias-corrected estimator, directly mitigating the error due to reporting bias.
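The correction itself is a one-line subtraction once $\hat{B}$ is in hand. A sketch with hypothetical inputs:

```python
# Plug-in bias correction using validation-study data.
import numpy as np

def bias_estimate(Y_t, Y_r, W):
    # Difference-in-means on reported outcomes, minus the same on true outcomes.
    t, c = W == 1, W == 0
    return (Y_r[t].mean() - Y_r[c].mean()) - (Y_t[t].mean() - Y_t[c].mean())

# B_hat = bias_estimate(Y_t_pre, Y_r_pre, W_pre)   # from the validation study
# tau_corrected = tau_hat_reported - B_hat         # corrected trial estimate
```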

CONCLUDING REMARKS

The formal treatment of reporting bias is a somewhat recent development in the causal inference literature, especially when considering measurement errors correlated with potential outcomes (2). This problem poses a threat to valid causal inference—especially when considering randomized trials in which the outcomes include sensitive topics. The problem can be partially mitigated through careful survey design and methods to ensure participant privacy. But there are many practical settings in which all participants cannot reasonably be expected to disclose data about an outcome of interest.

In this paper, we have considered the problem from the perspective of a researcher designing a prospective randomized trial. Our contributions are 3-fold. First, we defined a set of “reporting classes” to characterize individual reporting behavior, and we demonstrated that the joint distribution of these reporting classes—along with “response classes” reflecting how individuals respond to treatment—determines exactly how much bias and variance will be induced in our causal estimate. Second, we proposed a method for practitioners to adequately power their analyses in the presence of misreporting, assuming a worst-case deviation from independence of the reporting and response classes. Third, we described a special case in which an analyst can directly correct for misreporting biases.

Future opportunities in this area are myriad. Recent papers have considered the case in which misreporting behavior is differential (16), meaning individuals may misreport only when receiving the treatment or only when receiving the control. Accounting for such behavior represents an important extension to this research.

ACKNOWLEDGMENTS

Author affiliations: Harvard Data Science Initiative, Harvard University, Cambridge, Massachusetts, United States (Evan T. R. Rosenman); LinkedIn Data Science and Applied Research, Sunnyvale, California, United States (Rina Friedberg); and Department of Epidemiology and Population Health, Stanford University School of Medicine, Stanford, California, United States (Mike Baiocchi).

We are grateful to Google and to the Marjorie Lozoff Fund, provided by the Michelle R. Clayman Institute for Gender Research, for financial support.

No original data were used in the content of this manuscript. Reproduction code for the central algorithm is included in the Web Material.

We thank Dr. Luke Miratrix, Dr. Sophie Litschwartz, Dr. Kristen Hunter, and Dr. Julia Simard for their useful comments and feedback.

Presented at the 2022 Joint Statistical Meetings, August 6–11, 2022, Washington, DC; and at the 2022 Society for Research on Educational Effectiveness Conference, September 21–24, 2022, Arlington, Virginia.

A preprint of this article has been published online. Rosenman ETR, Friedberg R, Baiocchi M. Robust designs for prospective randomized trials surveying sensitive topics. arXiv. 2021. (https://doi.org/10.48550/arXiv.2108.0894). We caution that the terminology in the paper has changed significantly in the revision process.

This research was conducted while R.F. was a student at Stanford, and is not affiliated with LinkedIn.

Conflict of interest: none declared.

REFERENCES

1. Cimpian JR, Timmer JD. Mischievous responders and sexual minority youth survey data: a brief history, recent methodological advances, and implications for research and practice. Arch Sex Behav. 2020;49(4):1097–1102.

2. Imai K, Yamamoto T. Causal inference with differential measurement error: nonparametric identification and sensitivity analysis. Am J Polit Sci. 2010;54(2):543–560.

3. Zaller J, Feldman S. A simple theory of the survey response: answering questions versus revealing preferences. Am J Polit Sci. 1992;36(3):579–616.

4. Droitcour JA, Larson EM. An innovative technique for asking sensitive questions: the three-card method. Bull Methodol Sociol. 2002;75(1):5–23.

5. Carroll RJ, Ruppert D, Stefanski LA, et al. Measurement Error in Nonlinear Models. 2nd ed. Boca Raton, FL: Chapman and Hall/CRC; 2006.

6. Hausman J. Mismeasured variables in econometric analysis: problems from the right and problems from the left. J Econ Perspect. 2001;15(4):57–67.

7. Frangakis CE, Rubin DB. Principal stratification in causal inference. Biometrics. 2002;58(1):21–29.

8. Balke A, Pearl J. Bounds on treatment effects from studies with imperfect compliance. J Am Stat Assoc. 1997;92(439):1171–1176.

9. Lewbel A. Estimation of average treatment effects with misclassification. Econometrica. 2007;75(2):537–551.

10. Rosenman ETR, Sarnquist C, Friedberg R, et al. Empirical insights for improving sexual assault prevention: evidence from baseline data for a cluster-randomized trial of IMpower and Sources of Strength. Violence Against Women. 2020;26(15–16):1855–1875.

11. Vitoratou S, Pickles A. A note on contemporary psychometrics. J Ment Health. 2017;26(6):486–488.

12. Flegel KM, Keyl PM, Nieto JF. Differential misclassification arising from nondifferential errors in exposure measurement. Am J Epidemiol. 1991;134(10):1233–1246.

13. Richardson DB, Keil AP, Cole SR, et al. Reducing bias due to exposure measurement error using disease risk scores. Am J Epidemiol. 2021;190(4):621–629.

14. Cole SR, Jacobson LP, Tien PC, et al. Using marginal structural measurement-error models to estimate the long-term effect of antiretroviral therapy on incident AIDS or death. Am J Epidemiol. 2010;171(1):113–122.

15. Tennekoon V, Rosenman R. Systematically misclassified binary dependent variables. Commun Stat Theory Methods. 2016;45(9):2538–2555.

16. VanderWeele TJ, Li Y. Simple sensitivity analysis for differential measurement error. Am J Epidemiol. 2019;188(10):1823–1829.

17. Imai K, Yamamoto T. Causal inference with differential measurement error: nonparametric identification and sensitivity analysis. Am J Polit Sci. 2010;54(2):543–560.

18. Edwards JK, Cole SR, Westreich D. All your data are always missing: incorporating bias due to measurement error into the potential outcomes framework. Int J Epidemiol. 2015;44(4):1452–1459.

19. Angrist JD, Imbens GW, Rubin DB. Identification of causal effects using instrumental variables. J Am Stat Assoc. 1996;91(434):444–455.

20. Hernán MA, Robins JM. Causal Inference: What If? Boca Raton, FL: Chapman & Hall/CRC; 2010.

21. Rubin DB. Estimating causal effects of treatments in randomized and nonrandomized studies. J Educ Psychol. 1974;66(5):688–701.

22. Imbens GW, Rubin DB. Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. New York, NY: Cambridge University Press; 2015.

23. Dinkelbach W. On nonlinear fractional programming. Management Science. 1967;13(7):492–498.

24. Phillips AT. Quadratic fractional programming: Dinkelbach’s method. In: Floudas CA, Pardalos PM, eds. Encyclopedia of Optimization. 2nd ed. Boston, MA: Springer US; 2001:2107–2110.

25. Baiocchi M, Friedberg R, Rosenman ETR, et al. Prevalence and risk factors for sexual assault among class 6 female students in unplanned settlements of Nairobi, Kenya: baseline analysis from the ImPower & Sources of Strength cluster randomized controlled trial. PLoS One. 2019;14(6):e0213359.

26. Cook SL, Gidycz CA, Koss MP, et al. Emerging issues in the measurement of rape victimization. Violence Against Women. 2011;17(2):201–218.

27. Fisher BS, Cullen FT. Measuring the sexual victimization of women: evolution, current controversies, and future research. Criminal Justice. 2000;4:317–390.
