Adaptive Design in Surveys and Clinical Trials: Similarities, Differences and Opportunities for Cross-fertilization
Michael Rosenblum, Peter Miller, Benjamin Reist, Elizabeth A. Stuart, Michael Thieme and Thomas A. Louis
Journal of the Royal Statistical Society Series A: Statistics in Society, Volume 182, Issue 3, June 2019, Pages 963–982, https://doi.org/10.1111/rssa.12438
Summary
Adaptive designs involve preplanned rules for modifying an on-going study based on accruing data. We compare the goals and methods of adaptation for trials and surveys, identify similarities and differences, and make recommendations for what types of adaptive approaches from one domain have high potential to be useful in the other. For example, clinical trials could benefit from recently developed survey methods for monitoring which groups have low response rates and intervening to fix this. Clinical trials may also benefit from more formal identification of the target population, and from using paradata (contextual information collected before or during the collection of actual outcomes) to predict participant compliance and retention and then to intervene to improve these. Surveys could benefit from stopping rules based on information monitoring, applying techniques from sequential multiple-assignment randomized trial designs to improve response rates, prespecifying a formal adaptation protocol and including a data monitoring committee. We conclude with a discussion of the additional information, infrastructure and statistical analysis methods that are needed when conducting adaptive designs, as well as benefits and risks of adaptation.
1. Introduction
On their surface, clinical trials and sample surveys seem to have relatively little in common. Clinical trials aim to evaluate the effect of a treatment, e.g. a new drug or medical device, efficiently and ethically. Sample surveys aim to collect information efficiently to estimate characteristics of a population accurately. However, as has been increasingly recognized (Mercer et al., 2017), the two areas have much to learn from one another. We focus on the use of adaptive designs in trials and sample surveys, and how concepts and methods for adaptation in each field can cross-fertilize the other. When used appropriately, adaptive designs have the potential to produce improved estimates and/or to lower cost.
By ‘adaptation’ we mean protocol-driven modification of an initial design on the basis of accruing information. A key aspect of adaptation is that it should be via a prespecified protocol, with the goal of learning from accruing information to improve the study's performance, while preserving the study's integrity and validity. In trials, modifications can be made, for example, to enrolment criteria, sample size, follow-up time, randomization probabilities and end points. In surveys, changes can be made, for example, to the number and timing of contacts, data collection mode (the web, mail, interview via telephone or face to face), the incentives provided to encourage participation and the decision to halt data collection.
We discuss the principal goals and methods of adaptive designs in surveys and clinical trials, and highlight key similarities and differences. We then present opportunities for technology transfer between the domains and for research that may benefit both. Although adaptation in surveys was inspired by experience in trials, the two fields have largely developed their adaptive designs independently, leading to some overlap, but also a lack of awareness of what advances in one field may contribute to the other. We aim to bridge this gap, to make readers aware of some of the connections between the fields, and to make researchers in each field aware of the key adaptive designs in the other. In so doing, researchers in each field can learn from the other, aiming towards more effective clinical trial and survey designs.
Given the translational aspects of this work and our desire to communicate across fields that may not be familiar with one another, Sections 2 and 3 provide an overview of adaptive designs in clinical trials and sample surveys respectively. Readers who are already familiar with the background information in Sections 2 and 3 can skip to Section 4, which presents examples of lessons and methods from the survey world that may be fruitful to apply in the clinical trials context, and vice versa. Examples of cross-fertilization from adaptive surveys to trials include systematically monitoring representativeness and improving it by targeted enrolment or follow-up, and collection and use of paradata. In the opposite direction, we consider how clinical trial stopping rules, sequential multiple-assignment randomized trial (SMART) designs and data monitoring committees could be usefully applied in the adaptive survey context. Section 5 describes the additional information, infrastructure and statistical analysis methods that are needed when conducting adaptive designs, as well as benefits and risks of adaptation.
2. Adaptive designs in clinical trials
2.1. Types of adaptation in clinical trials
A clinical trial involves recruiting participants from a target population (usually implicitly defined by eligibility criteria), randomly assigning them to experimental conditions (e.g. treatment and placebo), administering treatments and assessing outcomes. Our focus is on phase 2 and 3 randomized controlled trials, in which the target of estimation is the effect of treatment on efficacy and safety outcomes (Friedman et al., 2015). The main goal is to inform a decision on licensing, use or further study of a treatment. Though confirmatory clinical studies focus on one or two primary hypotheses, generally there are multiple secondary research questions including other end points and subgroups.
According to guidelines from the European Medicines Agency (EMA) (2007) and US Food and Drug Administration (FDA) (2010, 2016), adaptations in clinical trials should be protocol based, i.e. based on a preplanned rule. Different adaptation types are listed in Table 1 and discussed below.
Table 1. Types of adaptation in clinical trials and surveys

In clinical trials
(a) Early stopping for efficacy, futility or harm (i.e. group sequential designs)
(b) Modifying enrolment criteria, dose, sample size, follow-up time, randomization probabilities or end points
(c) Rerandomizing participants with poor outcomes to another treatment (SMART designs)

In surveys
(a) Changing modes of contact or data collection
(b) Modifying timing or frequency of contact attempts
(c) Changing incentives for responding
(d) Deciding when to stop data collection
The FDA draft guidance on adaptive designs for drugs and biologics (Food and Drug Administration, 2010) categorizes adaptations as ‘well understood’ and ‘less well understood’. The former are
‘well-established clinical study designs with planned modifications based on an interim study result analysis … that either need no statistical correction … or properly account for the analysis-related multiplicity of choices’.
The latter are
‘study designs with which there is relatively little regulatory experience and whose properties are not fully understood at this time’.
Examples of ‘well-understood’ adaptations include group sequential designs (i.e. designs with preplanned interim analyses where the trial can be stopped early for efficacy, futility or harm), designs that adapt the sample size on the basis of blinded analyses (described below) and designs that adapt the eligibility criteria by using only pretreatment data. The other clinical trial adaptations in Table 1 are ‘less well understood’.
The use of different adaptations in practice is summarized in Hatfield et al. (2016). They reported that the most common adaptation types in phase 2 trials (and combined phase 2–3 trials) are group sequential designs and dose selection designs. In phase 3, the most common adaptation is group sequential designs. Adaptive designs were most commonly used in oncology trials. Mistry et al. (2017), who reviewed the use of adaptive designs in oncology trials, found that group sequential designs were the most common type of adaptation. The review of adaptive designs by Bothwell et al. (2018) had similar results to those in Hatfield et al. (2016), except that biomarker adaptive designs (where design modifications are based on treatment effect estimates in groups defined by biomarkers such as baseline disease severity or a genetic marker) were added to the list of most common adaptations. These reviews are consistent with the earlier review by Morgan et al. (2014).
Lin et al. (2016) summarized the adaptive trial proposals that were received by the Center for Biologics Evaluation and Research at the FDA from 2008 to 2013. This centre regulates vaccines, blood products, allergenic products, cellular, tissue and gene therapies, and related medical and diagnostic devices. The most common adaptations were group sequential designs and sample size adaptation. The latter adaptation type was also found to be the most common in a survey (which excluded group sequential designs) of scientific advice letters from the EMA from January 2007 to May 2012 (Elsäßer et al., 2014).
For each type of adaptation in Table 1, there are multiple approaches for developing the adaptation rule and statistical analysis plan (which needs to account for the adaptation rule). For example, Berry et al. (2010) described Bayesian methods (which are often used in oncology and in medical device trials) and Wassmer and Brannath (2016) presented frequentist methods. The primary concerns of FDA and EMA regulators, as described in European Medicines Agency (2007) and Food and Drug Administration (2010, 2016), include that adaptive designs need to have correct type I error, lead to clearly interpretable trial results and minimize bias (in estimating treatment effects) caused by the adaptations.
2.2. Group sequential designs
As noted above, group sequential designs are one of the most commonly used and best understood types of adaptive clinical trial designs. We focus on them in this section because of their potential relevance for surveys, and then we briefly discuss other types of adaptations below.
Many phase 3 clinical trials are group sequential, i.e. they have (typically 3–5) preplanned interim analyses where a decision is made regarding stopping the trial early for efficacy, futility or harm (O’Brien and Fleming, 1979; Jennison and Turnbull, 1999). Specifically, the protocol can set rules for halting the study if the current set of observations clearly demonstrates a substantial treatment benefit (efficacy), if there is little chance that the trial will eventually show a statistically significant benefit of treatment (futility) or if an adverse event rate is unacceptable (harm). Group sequential designs are adaptive in that the decision to stop the trial is based on accrued data, typically through treatment effect estimates.
Some group sequential designs set the interim analysis timing on the basis of preplanned calendar times (e.g. analyses take place at 2, 3 and 4 years after study initiation), whereas others base this timing on the accrued data. In the latter case, interim analysis times are often based on blinded analyses of the data, i.e. analyses based on trial data that exclude all information about who is assigned to which study arm (Morgan et al., 2014). This is often used in trials with time-to-event outcomes (e.g. the time to infection with the human immunodeficiency virus or time to death), where analyses occur when the total number of events (pooling across study arms) exceeds preplanned thresholds. The analogue for continuous outcomes is to base the interim analysis timing on the variance of the treatment effect estimator (which decreases as more participants have their outcomes observed). This is called information monitoring (Scharfstein et al., 1997; Jennison and Turnbull, 1999), since the statistical information is defined as the reciprocal of the variance of the treatment effect estimator. For time-to-event outcomes, the statistical information is approximately proportional to the total number of events (pooling across study arms) under certain assumptions such as the number of events being relatively small compared with the number of participants in each arm (Jennison and Turnbull, 1999).
An advantage of using information monitoring to determine analysis timing is that it does not require statistical adjustments in testing and estimation of a key quantity of primary interest in confirmatory trials, i.e. the average treatment effect. That is, one can analyse the data as if the analysis timing had been set in advance, and the key statistical properties (such as type I error, power and estimator bias) are not impacted (asymptotically). We discuss possibilities for transferring information monitoring methods from the clinical trial context to the sample survey context in Section 4.3.
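As a concrete illustration, the following minimal sketch (Python, with simulated outcome data; the maximum information target and interim fractions are assumed design values, not taken from any cited trial) computes the accrued statistical information for a difference-in-means estimator, i.e. the reciprocal of its estimated variance, and checks which preplanned information fractions have been reached.

```python
# Minimal sketch of information monitoring for a two-arm trial with a
# continuous outcome. All data and design constants below are simulated or
# assumed for illustration.
import numpy as np

def statistical_information(treatment_outcomes, control_outcomes):
    """Information = 1 / Var(difference-in-means estimator)."""
    var_diff = (np.var(treatment_outcomes, ddof=1) / len(treatment_outcomes)
                + np.var(control_outcomes, ddof=1) / len(control_outcomes))
    return 1.0 / var_diff

MAX_INFORMATION = 400.0                      # assumed design target (sets power)
INTERIM_FRACTIONS = [0.25, 0.5, 0.75, 1.0]   # assumed interim analysis schedule

rng = np.random.default_rng(0)
treatment = rng.normal(0.3, 1.0, size=300)   # simulated accrued outcomes
control = rng.normal(0.0, 1.0, size=300)

info = statistical_information(treatment, control)
fraction = info / MAX_INFORMATION
reached = [f for f in INTERIM_FRACTIONS if fraction >= f]
print(f"information = {info:.1f}, fraction of maximum = {fraction:.2f}, "
      f"interim fractions reached: {reached}")
```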
Monitoring statistical information can be used not only to determine when to stop a trial, but also to modify the length of follow-up for all trial participants. For example, if the overall event rate (pooling across arms) is projected to lead to insufficient power, then the follow-up time (as well as sample size) can be increased. This is a well-understood adaptation as long as the decision rule is based only on blinded data (Food and Drug Administration, 2010). The aforementioned FDA and EMA guidance documents do not discuss adapting the follow-up time or increasing follow-up intensity for a targeted subset of participants (e.g. those in an underrepresented group or a group with high dropout). We consider such adaptations in Section 4.1, motivated by analogous adaptations in survey sampling.
2.3. Modifying enrolment criteria, dose, sample size, follow-up time, randomization probabilities or end points
We next consider adaptations that are classified as less well understood in the aforementioned FDA draft guidance (Food and Drug Administration, 2010). These include modifying treatment dose, randomization probabilities, enrolment criteria and/or follow-up time (based on unblinded data), sample size (other than group sequential designs or based on blinded analyses) or end points. Lin et al. (2016), whose authors are all from the Center for Biologics Evaluation and Research at the FDA, clarified that less-well-understood methods can still be used in confirmatory trials for FDA approval; however, they generally require additional justification and discussion with the FDA before formal submission. It is beyond the scope of our paper to present examples of each type of less-well-understood adaptation; instead, we give an example of one such adaptation type that has shown promise in oncology trials.
Barker et al. (2009) and Rugo et al. (2011) described adaptive phase 2 trials for treating breast cancer and Kim et al. (2011) did the same for lung cancer. Each trial evaluates multiple treatments and adapts the randomization probabilities to different study arms. These adaptive trials aim to determine which patient subpopulations, which are defined by biomarkers measured before randomization, benefit from which treatments. The ultimate goal is to select promising treatment–population pairs for confirmatory assessment in future phase 3 trials. As described by Rugo et al. (2016),
‘Regimens move on from phase 2 if and when they have a high Bayesian predictive probability of success in a subsequent phase 3 neoadjuvant trial within the biomarker signature in which they performed well’.
2.4. Sequential multiple-assignment randomized trials
We next describe SMART designs, since in Section 4.4 we shall discuss potential applications to adaptive surveys. SMART designs involve individual level adaptation of the treatment based on the response to treatments previously given (Murphy et al., 2006, 2007). For participants who do not respond to the first treatment that they are randomized to, a different treatment is randomly assigned to them. This process may be repeated. According to Murphy et al. (2006),
‘the SMART design provides data that can be used both to assess the efficacy of each treatment within a sequence and to compare the effectiveness of strategies as a whole’.
This type of adaptation is fundamentally different from the other types that we discussed, in that changes are made to each participant's treatment regimen on the basis of accruing data for that participant. In contrast, the other adaptations are at the trial design level (rather than the individual level) and involve changes based on data from previous participants that impact later participants; for example, in dose finding trials study arms with ineffective doses may be dropped early, which means that later participants are only assigned to the remaining dose arms.
The ‘STAR*D’ trial is an example of a SMART design of multiple treatments for depression (Rush et al., 2004). It enrolled 4041 participants and evaluated not only first-line treatments but also second-line treatments for participants who did not respond to the first-line treatment.
3. Adaptive designs in sample surveys
Our focus is on surveys that identify prospective respondents through probability sampling from frames that are designed to represent the target population. The principal goal is to obtain accurate and precise estimates of target population characteristics. In household surveys, such characteristics include population frequencies or proportions, such as the number of people with age greater than 65 years, the average annual household income or the proportion of households living in rental housing units. In surveys of establishments (e.g. businesses, schools and farms) characteristics might include revenue, school enrolments and crop yields. Surveys also can be designed to estimate associations between different characteristics, such as the correlation between annual household income and housing characteristics.
We next define several terms that are used below. In surveys, ‘sample’ refers to the set of units (e.g. households, businesses or individuals) drawn from a sampling frame (a list of units, e.g. of physical addresses, telephone numbers or e-mail addresses) for interview attempts. A ‘case’ is another term for a sampled unit. Cases are contacted during the data collection period; as data collection progresses, the cases become classified as respondents or non-respondents, depending on the success or failure of contact and interviewing attempts. The response rate is the fraction of the sample who have responded.
Adaptive survey designs are a relatively recent innovation (Wagner, 2008). These designs depart from traditional survey data collection procedures, which employ a single protocol for all cases (e.g. households or establishments) for the duration of the data collection period, with the aim of achieving the highest response rate within the budget. Response rates for surveys have fallen in recent years and costs of data collection have risen correspondingly. At the same time, the goal of maximizing the survey response rate has been called into question. Groves and Peytcheva (2008) showed that the (overall) response rate may not be a good indicator of non-response bias in survey estimates. These factors have motivated experimentation with adaptive data collection designs that have the goal of interviewing representative or balanced samples (see Section 3.3), rather than trying simply to achieve the highest overall response rate.
3.1. Types of adaptation in surveys
Schouten et al. (2013, 2017), Tourangeau et al. (2017) and Chun et al. (2018) have discussed how surveys can be adapted as data accumulate with the aim of reducing non-response bias and containing costs. The following four types of survey adaptation are listed in the bottom half of Table 1: modifying mode of contact, timing or frequency of contact, incentives and when to stop data collection. Adaptation rules take as input the quality of the current sample, the projected cost of collection efforts and estimates of the likelihood of capturing remaining (non-responsive) cases. Surveys compute these quantities with the help of ‘auxiliary data’ (information about each sample unit that is known to the survey staff before initiating contact attempts) and ‘paradata’ (contextual information about each sample unit that is captured during attempts to recruit respondents). These data types are further described in Section 3.2.
In a recent interview, former Census Director Robert Groves outlined the history of adaptive survey designs (Habermann et al., 2017). A key tenet of adaptive (sometimes referred to as ‘responsive’) designs is that different design features may be effective in obtaining responses from different members of the population. Examples of adaptive surveys that illustrate these features include Groves and Heeringa (2006), Peytchev et al. (2010) and Coffey et al. (2018). These studies involved changing elements of the data collection approach (subsampling of non-respondents, altering incentive amounts, interview mode changes and contact effort changes) in response to observed sample quality and the response propensity for cases that have not yet been interviewed. In establishment surveys, these and other sorts of adaptation have been explored (Thompson and Kaputa, 2017; McCarthy et al., 2017). Särndal and Lundquist (2014) suggested that adjusting for differential recruitment needs and response behaviours during or between rounds of data collection may be more effective than not making any adaptations and instead relying on post-survey statistical adjustment to account for non-response.
Similarly to clinical trials, sample surveys can determine adaptively when to stop data collection. Rao et al. (2008) and Wagner and Raghunathan (2010) proposed and evaluated stopping rules; these are discussed in Section 4.3.
3.2. Auxiliary data and paradata
Auxiliary data can be derived directly from a population register (which is a standard approach for household surveys in Sweden and the Netherlands) or from another survey of the same population (see the discussion of the National Survey of College Graduates (NSCG), section 3.3). Auxiliary data can also be derived from other sources (e.g. commercial data, voting records, medical records, business registers in establishment surveys and unstructured ‘big’ data) and appended to the sampling frame. The investment in constructing auxiliary information allows investigators to track during fieldwork the match between the sample and the subset who respond. This assessment is a key component of decisions to adapt data collection procedures, as we describe in Section 3.3.
Survey paradata are contextual information that is collected before or during the collection of actual outcomes. Taylor (2006) stated that
‘analysis of paradata can be used to explore case characteristics and allow for a better understanding of respondents and the interviewer-respondent dynamic’.
Paradata are by-products of the efforts that are made to contact respondents or to interview them. They include information that is recorded automatically when interviewers dial telephone numbers or attempt to conduct face-to-face interviews. A record of the number of telephone contact attempts by day and time and the disposition of calls (e.g. ‘ring no answer’, ‘answering machine’ and ‘respondent refusal’) may give insight into the likelihood of eventually achieving an interview. Similarly, the number of face-to-face contact attempts by day and time, observations of the household (e.g. is there a wheelchair ramp or children's toys in the yard?) and characteristics of any initial interaction with householders (refusal or appointment for another time) provide information for judging response propensity. Paradata also include information that is gathered during computer-based surveys (e.g. response time, key strokes, mouse hovering over different multiple choice answers before deciding on the final response or corrections to original responses) that can inform survey researchers about difficulties in the interview process and consequent data quality; see, for example, Olson and Parkhurst (2013) and Yan and Olson (2013). We discuss applications of paradata in adaptive surveys and how they can be transported to adaptive trials in Section 4.2.
3.3. Measuring and improving sample balance related to non-response: R-indicators and balance indicators
A type of adaptive design in surveys that has potential relevance for clinical trials involves measuring how similar the set of respondents is to the overall sample in terms of variables that are known before data collection begins, and then adapting data collection to try to fix discrepancies between these. The hope is that this will reduce non-response bias. We use the terms ‘balance’ and ‘representativeness’ interchangeably to denote the similarity between the distribution of auxiliary variables in the sample and in the subset of respondents. There are different ways to define and measure this similarity including balance indicators (Särndal, 2008, 2011; Särndal and Lundström, 2010; Lundquist and Särndal, 2013) and R-indicators (Schouten et al., 2009, 2011). As shown by Särndal and Lundquist (2014) and Schouten et al. (2016), if the auxiliary variables are highly correlated with outcomes of interest, increased sample balance has potential to reduce non-response bias in the survey estimates of these outcomes, especially if the missing data process is missingness at random (Rubin, 1976).
Balance indicators are functions of the distances between the means of auxiliary variables comparing the sample with the subset of respondents. The R-indicator is based on the variance of predicted response propensities across the sample by using auxiliary variables as predictors. Although R-indicators and balance indicators involve different measures of distance between the respondents and the full sample in terms of auxiliary variable distributions, there are connections between them. For instance, one of the balance indicators in Särndal (2011) is equivalent to the R-indicator if the response propensities that are used in the R-indicator are estimated by using a linear regression model instead of the standard logistic regression model. Some of the other balance indicators differ from R-indicators through using different transformations of the variance of response propensities to the [0, 1] scale or differ from R-indicators by a multiplicative factor that is equal to the response rate (Schouten and Cobben, 2012; Schouten et al., 2016).
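To make this concrete, the sketch below (Python; the data are hypothetical and survey design weights are omitted) estimates response propensities from auxiliary frame variables with logistic regression and computes the sample-based R-indicator, R = 1 − 2S(ρ̂), where S(ρ̂) is the standard deviation of the estimated propensities (Schouten et al., 2009).

```python
# Minimal sketch of the sample-based R-indicator, R = 1 - 2*S(rho_hat).
# Data and auxiliary variables are simulated; a production version would
# incorporate the survey's design weights.
import numpy as np
from sklearn.linear_model import LogisticRegression

def r_indicator(aux_vars, responded):
    """aux_vars: (n, p) auxiliary variables for the full sample.
    responded: length-n 0/1 indicator of response."""
    model = LogisticRegression(max_iter=1000).fit(aux_vars, responded)
    propensities = model.predict_proba(aux_vars)[:, 1]
    return 1.0 - 2.0 * propensities.std(ddof=1)

rng = np.random.default_rng(1)
n = 2000
age_group = rng.integers(0, 4, size=n)        # four age categories
income_proxy = rng.normal(size=n)
X = np.column_stack([age_group, income_proxy])
true_p = 1 / (1 + np.exp(-(-0.5 + 0.4 * age_group + 0.3 * income_proxy)))
responded = rng.binomial(1, true_p)

R = r_indicator(X, responded)
print(f"R-indicator = {R:.3f} (1 = fully balanced response; lower values "
      f"indicate a less representative respondent set)")
```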
Reist and Coffey (2014), Finamore et al. (2015) and Coffey et al. (2018) have used R-indicators in the NSCG with the goal of monitoring sample balance and then adapting data collection to improve it. Specifically, they monitored the following three types of R-indicator:
full sample, an overall measure of sample balance;
variable level unconditional partial, which identifies which variables contribute most to sample imbalance;
category level unconditional partial, the direction of misrepresentation (over or under) for each category of an auxiliary variable (Schouten et al., 2011).
Unconditional partial R-indicators were used in the NSCG because of their ease of interpretation compared with their conditional counterparts. The following steps are used to identify cases that are candidates for adapting data collection; a schematic sketch of this drill-down follows the steps.
Step 1: use the full sample R-indicator to evaluate whether the current set of respondents is sufficiently representative of the full sample.
Step 2: if not, then use the variable level R-indicator to identify the variable(s) that drive the imbalance, e.g. age.
Step 3: for the variable(s) that was identified in step 2, use the categorical level R-indicators to identify the categories that are the most overrepresented and underrepresented, e.g. individuals 18–24 years of age.
Step 4: the sample cases in categories that are identified in step 3 are candidates for adaptation, such as modified contact mode and frequency.
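The schematic sketch below walks through these four steps; it uses between-category summaries of estimated response propensities as a simplified stand-in for the variable- and category-level unconditional partial R-indicators (the exact formulas in Schouten et al. (2011) include design weights and scaling that are omitted here), and the threshold, variable names and data are assumptions for illustration only.

```python
# Schematic sketch of the NSCG-style drill-down (steps 1-4). Between-category
# summaries of estimated propensities stand in for the partial R-indicators;
# thresholds and data are hypothetical.
import numpy as np
import pandas as pd

def full_sample_R(rho):
    return 1.0 - 2.0 * rho.std(ddof=1)

def variable_level_imbalance(frame, rho, variable):
    """Between-category spread of mean propensities (proxy for the
    variable-level unconditional partial R-indicator)."""
    return frame.assign(rho=rho).groupby(variable)["rho"].mean().std(ddof=1)

def category_level_direction(frame, rho, variable):
    """Signed deviation of each category's mean propensity from the overall
    mean; negative values flag under-represented categories."""
    return frame.assign(rho=rho).groupby(variable)["rho"].mean() - rho.mean()

# One row per sampled case, with auxiliary variables and estimated response
# propensities from a model refitted during data collection.
rng = np.random.default_rng(2)
frame = pd.DataFrame({
    "age_group": rng.choice(["18-24", "25-44", "45-64", "65+"], size=1000),
    "region": rng.choice(["NE", "MW", "S", "W"], size=1000),
})
rho = pd.Series(rng.normal(0.55, 0.15, size=1000)
                - 0.2 * (frame["age_group"] == "18-24").to_numpy()).clip(0.01, 0.99)

# Step 1: is the current respondent set sufficiently representative?
if full_sample_R(rho) < 0.80:                          # assumed threshold
    # Step 2: which variable drives the imbalance?
    worst = max(["age_group", "region"],
                key=lambda v: variable_level_imbalance(frame, rho, v))
    # Step 3: which of its categories are most under-represented?
    deviations = category_level_direction(frame, rho, worst).sort_values()
    most_under = deviations.index[0]
    # Step 4: cases in that category become candidates for adaptation
    # (e.g. a different contact mode or extra contact attempts).
    candidates = frame.index[frame[worst] == most_under]
    print(f"target {worst} = {most_under}: {len(candidates)} candidate cases")
```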
Coffey et al. (2018) have demonstrated that their interventions to reallocate effort selectively across non-respondents by using this approach can improve data quality measures, in this case R-indicators, without causing a lower overall response rate, even when some interventions reduced the effort that is applied to sample members. The interventions led to lower median costs compared with non-adaptive effort allocation.
In a different application, Luiten and Schouten (2013) used R-indicators in a study of consumer satisfaction in which fieldwork was adapted on the basis of the response propensities of cases. These studies involved changing elements of the data collection approach (incentive amounts, interview mode changes and contact effort) on the basis of the observed sample quality and the response propensity for non-respondents. In a simulation study, Särndal and Lundquist (2014) examined the effect of varying contact effort for non-respondents with different levels of response propensity during the course of data collection.
R-indicators and balance indicators have limitations. The utility of these indicators depends directly on how predictive the covariates used to construct them are of the outcomes of interest. Additionally, unlike response rates, R-indicators can only be compared across surveys if a common set of covariates is used to construct the indicator. There is no set threshold for what is a ‘good’ R-indicator value; according to Beaumont et al. (2014),
‘if no explanatory variable is included in the model, the R-indicator is equal to 1, which is the best value it can reach’.
R-indicators have also been criticized for being sensitive to sample size or sampling variance and to response rates. Adjustments to the R-indicator have been proposed for small samples (Shlomo et al., 2012) and for low response rates, i.e. the coefficient of variation of the balancing propensities (Schouten et al., 2016). These balance indicators do not assess whether data are missing not at random (Nishimura et al., 2016).
If the model that is used to predict the propensities for the R-indicator is incorrectly specified, then the R-indicator will not perform optimally. For example, if the model is missing an interaction which is an important predictor of an outcome of interest, this may lead to interventions that exacerbate imbalance corresponding to this missing term. This is a common issue across balancing metrics and other monitoring metrics that have been proposed. For example, if a balance indicator does not include an important interaction as one of the quantities to balance on, then a similar problem could occur.
Alternative indicators for monitoring potential non-response bias have been proposed; see Wagner (2012) and Nishimura et al. (2016) for overviews of different methods. One is based on the fraction of missing information (FMI) (Rubin, 1987). Wagner (2010) proposed monitoring the FMI during data collection. Methods using pattern proxy mixture models have been proposed to evaluate during data collection how sensitive an imputation model is to violations of the missingness at random assumption (Andridge and Little, 2011).
Monitoring the FMI also has drawbacks in terms of its potential usefulness in guiding adaptation in surveys. The FMI is outcome variable specific, thus requiring a measure for every outcome of interest. Additionally, FMI monitoring assumes that a multiple-imputation or a likelihood-based approach to account for incomplete data is being used to adjust for non-response statistically. However, for many surveys weighting is the preferred approach for non-response adjustment. Unlike R-indicators, FMI monitoring does not provide information about which cases should be targeted to improve the measure.
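For context, the minimal sketch below shows how the FMI for a single survey mean can be approximated from m multiply imputed data sets by using Rubin's (1987) combining rules; the imputation step is a deliberately crude stand-in, the missingness is simulated as ignorable and the small-sample degrees-of-freedom correction is omitted.

```python
# Minimal sketch: approximate the fraction of missing information (FMI)
# for one survey mean from m multiply imputed data sets (Rubin, 1987).
# The imputation used here is intentionally crude and for illustration only.
import numpy as np

rng = np.random.default_rng(3)
n, m = 500, 20
y = rng.normal(50, 10, size=n)
missing = rng.random(n) < 0.3            # 30% nonresponse, assumed ignorable
y_obs = y[~missing]

estimates, variances = [], []
for _ in range(m):
    y_imp = y.copy()
    # Crude hot-deck-style imputation: draw from the observed distribution.
    y_imp[missing] = rng.choice(y_obs, size=missing.sum(), replace=True)
    estimates.append(y_imp.mean())
    variances.append(y_imp.var(ddof=1) / n)

W = np.mean(variances)                   # within-imputation variance
B = np.var(estimates, ddof=1)            # between-imputation variance
T = W + (1 + 1 / m) * B                  # total variance (Rubin's rules)
fmi = (1 + 1 / m) * B / T
print(f"estimated FMI = {fmi:.3f}")
```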
Bayesian methods are being applied to adaptive survey designs, in particular through the Bayesian Adaptive Survey Design Network (Bayesian Adaptive Survey Design Network, 2018). Of particular relevance to the above discussion is the Network's work on specifying prior distributions for response propensity based on historical data (Schouten et al. (2017), chapter 8.4).
3.4. Similarities and differences between trial and survey adaptations
There is relatively little overlap between the adaptations for trials and surveys in Table 1. The only common adaptations involve modifying the follow-up time and deciding when to stop data collection altogether. Because of the many differences between adaptation types in the two domains and the shared goals of minimizing costs while making valid statistical inferences in the presence of non-response or missed visits, we think that there is ample opportunity for cross-fertilization as described next.
4. Opportunities for cross-fertilization
Adaptive methods from surveys may be useful in clinical trials, and vice versa. In this section we discuss opportunities with high potential for added value. A common theme is that adaptive methods that are developed for surveys may be applied to improve external validity (and sometimes also internal validity) in trials, and adaptive methods from trials may be used to improve internal validity and efficiency in surveys. Each section title below indicates the direction of technology transfer, e.g. transporting a survey method approach to clinical trials in Section 4.1.
4.1. Survey → clinical trial: systematically monitoring representativeness and improving it by targeted enrolment or follow-up
A strategy that is used in adaptive surveys (see Section 3.3), but to our knowledge not yet applied to clinical trials, is to monitor systematically how similar the set of respondents is to the overall sample on key baseline variables and then to adapt to improve the measure. This idea could be used to help to improve internal validity in trials, by selectively increasing follow-up efforts on participants who miss study visits to increase balance or representativeness in each study arm. Analogously, this approach could be applied to help to improve external validity in trials by monitoring how representative the enrolled participants are of the target population and selectively increasing efforts to enrol underrepresented groups.
For example, by monitoring R-indicators that measure differences between the set of currently enrolled participants and the target population, a researcher could be alerted early during enrolment that the trial is gathering individuals who have lower severity of disease than the target population; clinics serving higher disease severity patients could be added. An analogous approach could be used when collecting outcome data on trial participants, in which extra follow-up attempts could be targeted at participants with high severity of disease who dropped out of the trial; this is called double sampling (Cochran, 1963). In the above examples, it would be preferable to measure sample balance formally and to use a preplanned procedure to adapt to improve it.
Intensified follow-up efforts for participants who miss visits (or who have a high predicted probability of missing a future visit) in clinical trials could include, for example, more phone calls to schedule or confirm upcoming visits, arranging and paying for transportation to study visits, offering home visits instead of in-clinic visits or increasing monetary incentives for visit attendance. These are analogous to some of the survey adaptations in Table 1. Though some of the above intensified follow-up techniques are mentioned in the National Research Council report on the prevention and treatment of missing data in clinical trials (National Research Council, 2010), there is not yet a recommendation to use a formal metric (such as an R-indicator) for monitoring loss to follow-up and adapting follow-up efforts to improve that measure. Special care must be taken and institutional review board preapproval must be obtained for the above adaptations, especially those involving monetary incentives, because of ethical considerations in clinical trials (Friedman et al., 2015).
Adaptively modifying follow-up as described above may decrease the need for statistical adjustment after a trial and may mitigate the bias that is caused by informative dropout. Auxiliary data that are collected to help to assess and to predict dropout may also be useful in exploratory analyses that aim to extrapolate treatment effect estimates to populations that have different distributions of these auxiliary variables (Cole and Stuart, 2010; Keiding and Louis, 2016; Kern et al., 2016; Pearl and Bareinboim, 2014; Najafzadeh and Schneeweiss, 2017).
There are several potential advantages of R-indicators over the simpler approach of monitoring enrolment and follow-up within predefined strata of baseline variables. R-indicators account not only for different response rates but also for the relative sizes of different strata. This is important because in small strata a high or low enrolment or dropout rate may not have a large effect on the overall treatment effect estimate. Second, R-indicators provide an overall measure of the imbalance across important subgroups, which could be useful for detecting the need for major adaptations such as adding sites or for modifying study protocols to reduce dropout overall. Finally, variable level R-indicators enable us to identify which variables are contributing most to imbalance (Schouten et al., 2011). This can help in targeting cases for increased or decreased effort in enrolment or follow-up.
To monitor how well the enrolled participants represent the target population, the target population and the variables that are used for measuring representativeness need to be clearly defined. This is analogous to defining the sample frame and key auxiliary variables to be used in measuring balance in survey sampling. In trials, the eligibility criteria define the target population and key baseline variables can include demographic characteristics as well as disease-specific measures. The target population distribution of these variables could be estimated by using electronic health records or data from registries.
4.2. Survey → clinical trial: collection and use of paradata
Another potentially useful technology transfer from surveys to clinical trials could involve the clever use of paradata (defined in Section 3.2) to improve operations adaptively. We next present two examples of the use of paradata in adaptive surveys and then discuss how similar ideas could be applied in a trial context.
The NSCG collected paradata that were used in deciding how much effort to employ in contacting outstanding cases and what modes of contact (web, mail or telephone) to use. This information, coupled with data on the representativeness of the sample achieved, enabled researchers to focus contact efforts on underrepresented outstanding cases (Coffey et al., 2018).
As another example, the National Survey of Family Growth collected data, including paradata, that were used to guide fieldwork and to control costs that are related to contacting non-respondents (Rothwell and Ventura, 2009). These data included key subgroup response rates, daily costs and interviewer productivity rates. This information was used to adapt data collection, e.g. shifting interviewer resources to low response areas, reducing work for poorly performing interviewers and stopping work early to align with predefined survey cost or quality goals.
In the clinical trial context, similar sorts of paradata could inform efforts to improve participant recruitment, retention and compliance with protocols. The level of effort that is expended in encouraging participants to attend visits (e.g. the number of phone call attempts and outcomes) could inform estimates of the likelihood of recruitment or continued participation. This information could lead to intensified (and possibly more costly) recruitment approaches. Similarly, observations of clinic visits, e.g. how late or early participants arrive, the amount of time participants are made to wait before being seen, the number of questions that are answered and the time spent on each question in computer-assisted self-interviews (which are often used in trials to ask sensitive questions about participant behaviour) and clinician observations on participant satisfaction or frustration with the study experience could be used to predict participant retention and protocol compliance. Such predictions could help to identify whom to target with interventions that are aimed at encouraging participation, or help to select participants who are likely to have high compliance.
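As one concrete illustration, the minimal sketch below (Python; all paradata features, data and the risk threshold are hypothetical) fits a simple logistic model that predicts dropout from visit-level paradata and flags participants whose predicted risk exceeds an assumed threshold for intensified retention efforts.

```python
# Minimal sketch: use trial paradata to predict dropout risk and flag
# participants for intensified retention efforts. Features, data and the
# threshold are hypothetical. In practice the model would be fitted on
# earlier participants and applied prospectively; here, for brevity, it is
# fitted and applied to the same simulated data.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
n = 600
paradata = pd.DataFrame({
    "minutes_late": rng.exponential(10, size=n),
    "calls_to_schedule": rng.poisson(2, size=n),
    "waiting_time_min": rng.normal(25, 10, size=n),
})
# Simulated outcome: hard-to-schedule, frequently late participants drop out more.
p_drop = 1 / (1 + np.exp(-(-2.0 + 0.4 * paradata["calls_to_schedule"]
                           + 0.03 * paradata["minutes_late"])))
dropped_out = rng.binomial(1, p_drop)

model = LogisticRegression(max_iter=1000).fit(paradata, dropped_out)
risk = model.predict_proba(paradata)[:, 1]

RISK_THRESHOLD = 0.30                    # assumed trigger for extra effort
flagged = paradata.index[risk > RISK_THRESHOLD]
print(f"{len(flagged)} participants flagged for intensified follow-up")
```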
Paradata could also be included in statistical adjustments for non-compliance. If paradata or electronic health records are to be used in a trial, this must be included in participant consent forms and approved by the institutional review board with special attention to protecting privacy of participant information.
4.3. Clinical trial → survey: stopping rules
We now turn to some ideas from adaptive clinical trials that could be used in the survey context. The first idea is to consider potential applications of information monitoring (described in Section 2.2 for trials) to adaptive surveys. We consider a sequence of waves of data collection. After each wave, the estimator variance (i.e. the inverse of the information) for each key survey question is computed. When the maximum variance across key survey questions falls below a given threshold, the data collection stops. Monitoring information is similar to monitoring the effective sample size (Rao and Scott, 1992; Kish, 1995); the reason for the focus on effective sample size rather than actual sample size is that the former is directly related to the key statistical properties of estimator variance and confidence interval width.
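The minimal sketch below (Python; the key questions, threshold and the simple-random-sampling variance formula are illustrative assumptions, and a real survey would use design-based variance estimators) implements this rule: after each wave, compute the estimated variance for each key estimate and stop when the largest variance falls below the preplanned threshold.

```python
# Minimal sketch of an information-based stopping rule for a survey.
# Data, the variance threshold and the SRS variance formula are assumptions.
import numpy as np

VARIANCE_THRESHOLD = 0.10                # assumed design target

def key_estimate_variances(responses_by_question):
    """Variance of the mean for each key question (simple random sampling)."""
    return {q: np.var(vals, ddof=1) / len(vals)
            for q, vals in responses_by_question.items()}

rng = np.random.default_rng(5)
accumulated = {"income": [], "hours_worked": []}

for wave in range(1, 7):
    # New responses collected in this wave (simulated).
    accumulated["income"].extend(rng.normal(60, 6, size=80))
    accumulated["hours_worked"].extend(rng.normal(38, 4, size=80))

    worst = max(key_estimate_variances(accumulated).values())
    print(f"wave {wave}: maximum variance across key questions = {worst:.3f}")
    if worst < VARIANCE_THRESHOLD:
        print("stopping rule met; end data collection")
        break
```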
Rao et al. (2008) and Wagner and Raghunathan (2010) developed stopping rules for adaptive surveys that differ from information monitoring. Their approaches are based on assessing changes (or predicting future changes) in key estimates across waves of data collection, where the estimators use multiple imputation to adjust for missing data. Wagner and Raghunathan (2010) augmented the approach of Rao et al. (2008) by employing frame data on all cases in their imputation models. If the next wave of data collection has a low predicted probability of changing the current estimates by at least a minimum threshold, then data collection is stopped.
Our proposed cross-fertilization idea is to compare information monitoring with the approaches of Rao et al. (2008) and Wagner and Raghunathan (2010) in simulation studies. The simulation scenarios could be the same as those in Rao et al. (2008) and Wagner and Raghunathan (2010). The performance criteria for such a comparison could include the following: the expected number of contact attempts, duration and cost of the survey; bias, variance and mean-squared error of estimators of key survey (outcome) variables; the confidence interval coverage probability. Each type of stopping rule depends on preplanned thresholds; by varying the thresholds, we can generate trade-off curves, e.g. cost versus mean-squared error, for each type of stopping rule. These curves can then be compared across different rule types. Since the main goal of an information-based stopping rule is to achieve a desired variance whereas the main goal of the rules of Rao et al. (2008) and Wagner and Raghunathan (2010) is to reduce non-response bias, we expect that these methods will have different strengths and weaknesses.
A challenge to implementing information monitoring in surveys, just as in clinical trials, is that variances are generally not known in advance and so must be estimated. This problem may be somewhat ameliorated in surveys that are similar to past surveys for which reliable variance estimates are available. Otherwise, a minimum sample size would generally be required before information monitoring for a given survey question could be started. If estimates for special subpopulations are of key importance, then this minimum sample size should be applied to each such subpopulation. Variability in the variance estimator should be taken into account, especially when it is suspected that distributions may have large skew or heavy tails. For a comparison of variance estimation methods in the context of survey sampling, see Wolter (2007), pages 354–366. For logistical reasons, it may be preferable to evaluate the variance at preset intervals rather than in real time. Typically, confirmatory (phase 3) randomized trials involve five or fewer analysis times.
Adaptive survey stopping rules, such as those mentioned above, require prespecification of what constitutes a wave of data collection. For example, the first wave could involve mailing a survey to everyone in the sample and waiting a fixed period of time for responses; the subsequent waves could consist of new mailings (or other contact modes) to non-respondents in the previous waves. The set of non-respondents to target in each wave could be adaptively selected, e.g. using R-indicators as described in Section 3.3, to improve balance at the minimum cost.
4.4. Clinical trial → survey: sequential multiple-assignment randomized trials designs in surveys
Analogues of SMART designs could be useful in the survey context. In each wave, non-respondents would be randomized to different contact modes, intensities or incentives to respond. The goal is to learn which sequences are most effective in achieving sample balance, decreasing cost and decreasing the length of data collection. In determining the sample size for such surveys, we recommend considering the effective sample size, since that is what determines the operating characteristics of the survey, similarly to the discussion in Section 4.3.
We know of one example of SMART design methods that were applied to a sample survey. Dworak and Chang (2015) randomized non-respondents in the Health and Retirement Survey to receive different sequences of monetary incentives and persuasive messages aimed at increasing the response rate. They aimed to determine the best first- and second-line approaches. Though there was not a statistically significant effect on response rates comparing the persuasive message and monetary incentive (first-line treatments), the optimal approach in stage 2 was to intensify the treatment that was given in the first stage. They also found that interviewer behaviour moderates treatment impact.
We propose a possible extension of the SMART design of Dworak and Chang (2015). Their design focused on increasing the overall response rate. However, as discussed in Section 3.3, it may be more effective to increase sample representativeness (e.g. as measured by an R-indicator) than to increase the overall response rate, when the goal is reducing non-response bias. For this, data from the SMART design of Dworak and Chang (2015) could be reanalysed by using the statistical methods of Murphy (2003), Robins (2004) and van der Laan and Luedtke (2015) to estimate the optimal dynamic regime (sequence of treatments) within strata of auxiliary variables. Knowledge of stratum-specific optimal treatment rules could then be applied in similar surveys to target non-respondents who are most likely to increase representativeness at the lowest cost.
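The sketch below is a deliberately simplified, hypothetical illustration of the underlying idea, estimating stratum-specific optimal treatment sequences by backward induction from SMART-style survey data; it is not the estimation machinery of Murphy (2003), Robins (2004) or van der Laan and Luedtke (2015), and the strata, treatments and data are simulated.

```python
# Highly simplified, hypothetical sketch: estimate stratum-specific optimal
# treatment sequences from SMART-style survey data by backward induction.
# The two binary "treatments" stand in for, e.g., a monetary incentive (1)
# versus a persuasive message (0); the outcome is eventual response. For
# brevity every case has a stage-2 assignment, whereas in a real SMART only
# stage-1 non-respondents would be re-randomized.
import numpy as np
import pandas as pd

rng = np.random.default_rng(6)
n = 4000
data = pd.DataFrame({
    "stratum": rng.choice(["young", "old"], size=n),   # auxiliary stratum
    "a1": rng.integers(0, 2, size=n),                  # stage-1 treatment
    "a2": rng.integers(0, 2, size=n),                  # stage-2 treatment
})
# Simulated truth: repeating the stage-1 treatment works best for "young";
# the stage-2 incentive works best for "old".
p_respond = (0.2
             + 0.15 * (data["a1"] == data["a2"]) * (data["stratum"] == "young")
             + 0.10 * data["a2"] * (data["stratum"] == "old"))
data["responded"] = rng.binomial(1, p_respond)

# Backward induction: within each stratum, find the best stage-2 treatment
# for each stage-1 treatment, then the stage-1 treatment whose optimal
# continuation has the highest response rate.
rates = data.groupby(["stratum", "a1", "a2"])["responded"].mean()
for stratum in ["young", "old"]:
    best = {}
    for a1 in (0, 1):
        stage2 = rates.loc[(stratum, a1)]
        best[a1] = (stage2.idxmax(), stage2.max())
    opt_a1 = max(best, key=lambda a: best[a][1])
    opt_a2, value = best[opt_a1]
    print(f"{stratum}: start with treatment {opt_a1}, then give treatment "
          f"{opt_a2} to non-respondents (estimated value {value:.2f})")
```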
A limitation of the approach proposed is that the sampling frames would need to be similar for the optimal rule in one survey to be optimal in the other. Another limitation is that estimation of optimal regimes within strata may require the use of statistical models (depending on the number of strata and the number of cases), and model misspecification can lead to suboptimal treatment regimes.
4.5. Clinical trial → survey: formal adaptation protocol and a data monitoring committee
Our last proposal is to consider whether adaptive surveys may benefit from prospectively planned adaptation protocols and a data monitoring committee (DMC), which are common in adaptive trials. As stated in Food and Drug Administration (2006),
‘A clinical trial DMC is a group of individuals with pertinent expertise that reviews on a regular basis accumulating data from one or more ongoing clinical trials. The DMC advises the sponsor regarding the continuing safety of trial subjects and those yet to be recruited to the trial, as well as the continuing validity and scientific merit of the trial.’
Friedman et al. (2015) gave an overview of DMCs. Arms-length boards of this type are primarily responsible for patient safety where stopping or modification of trials becomes necessary for ethical or health-related reasons, but roles also include protocol review and advice on management issues. Membership of the committee usually includes clinicians, statisticians, field or laboratory staff (if relevant) and, sometimes, an ethicist. A walled-off statistical centre periodically prepares reports and presents them to the DMC, with a subsequent report given to the Principal Investigator and other study leaders with recommendations for continuing, modifying or stopping the study. There is on-going discussion on how best to structure DMCs for adaptive designs since they are responsible not only for patients’ safety but also for implementing the adaptations according to the study protocol (Chow et al., 2012; Bothwell et al., 2018).
Formal planning and monitoring committees may be useful when surveys use adaptive designs. The committee could monitor survey quality measures, implementation of adaptive decisions about the timing, frequency or mode of contact for non-respondents, and respondent burden from multiple contacts. Although research on adaptive survey designs has explored elements that are needed for preplanned adaptation, formal planning should be enhanced. For example, sample representativeness monitoring methods, planned adaptations, observed effects of these interventions and survey stopping rules (see Section 4.3) should be detailed in a study protocol. Transparency in the data collection process, which is a key requisite for the credibility of survey findings, would be greatly enhanced by such documentation. The protocol, in turn, may benefit from oversight by a DMC.
A survey DMC could periodically weigh in on survey design issues, quality measures, adaptive treatment decisions and questions of respondent burden. This arrangement would remove from the survey director the responsibility of handling data collection decisions and ethical issues autonomously. To be effective and credible, a DMC needs to be an arms-length committee, comprised of experts who can bring a disinterested perspective to bear on judgements about the survey conduct. To be practical, a DMC needs to work in a timely fashion so as not to delay data collection. A DMC comprised of members of a survey organization in which an adaptive survey is being planned, but who have no stake in the adaptive survey themselves, may satisfy these criteria. The potential benefits of a DMC for adaptive surveys would need to be weighed against the additional resources that it would require.
5. Key issues in using adaptive designs
In both clinical trials and survey domains there are a variety of opinions on the goals, desirability and practicality of using adaptive designs. See, for example, Särndal and Lundquist (2014) and Beaumont et al. (2014) regarding surveys and Cook and Demets (2010), Morgan et al. (2014) and Hey and Kimmelman (2015) for clinical trials. In both domains, implementing an adaptive design requires additional information and infrastructure relative to a fixed design. We highlight key requirements and special concerns arising in adaptive designs.
5.1. Necessary data and other information
In both trials and surveys, valid adaptation is only possible when timely and accurate information is available. Database, computing and communications infrastructures must be in place, and clinical or field staff must be ready to adapt. Communication systems must be up to the task. For clinical trials, end point measurement must be sufficiently rapid so as to take advantage of accruing information before enrolment is complete. Data cleaning and verification must be done quickly and accurately to avoid making incorrect decisions.
As noted above, adaptive surveys are greatly facilitated by auxiliary data linked to the sampling frame. The connection may be achieved directly, as when information from administrative records or a business register is linked to individual households or firms, or it may be achieved statistically, as when ‘big’ data—e.g. traffic sensor information—are linked to responding units. We do not know of examples of the latter form of linkage for adaptive surveys, but much attention is now directed in the research community at the methods and technology for making such ties. Adaptive survey design also benefits from paradata collected to record the outcomes of interview contact attempts and interviewer observations on characteristics of households to be interviewed. Bates et al. (2008) illustrated use of a ‘contact history instrument’ to predict the likelihood of obtaining interviews at contacted cases in the National Health Interview Survey. Kirgis and Lepkowski (2013) described how interviewer observations of households and data collected in contacts with potential respondents can be employed in directing subsequent field efforts.
5.2. Implementation infrastructure
Rapid advances in technology over the last decade in realtime communications, electronic data collection, remote monitoring and analysis, and use of paradata have mitigated logistical challenges in implementing adaptations. The legacy data collection systems that were in place in survey organizations were not designed with adaptive interventions in mind and therefore required cumbersome and complicated solutions for adaptive implementations. Often these overly complex legacy systems lacked sufficient standardization and integration: a phenomenon which is also closely associated with unintended and undetected survey error (see Thalji et al. (2013)). The new requirements of adaptive design cast a revealing light on these weaknesses. A growing number of national statistical organizations, academic statistical institutes and private statistical firms are launching major information system initiatives that are designed to capitalize on the growth of data and processing power with adaptive design as a central goal (see Thalji et al. (2013) and related papers in the same issue).
There are a substantial number of software packages for adaptive designs in clinical trials; see, for example, Wassmer and Brannath (2016), pages 277–280, and MD Anderson Cancer Center (2018). Phase 3 clinical trials are likely to present the greatest opportunity for cross-fertilization, because they deal with a relatively large group of patients, making them more comparable in number of participants with statistical surveys.
Finally, review and comparison of standards in both domains may be instructive. In the survey domain, there are data and architecture standards such as the ‘Common statistical production architecture’, the ‘Generic statistical business process model’, the ‘Generic statistical information model’ (United Nations Economic Commission for Europe, 2013, 2017), the ‘Statistical data and metadata exchange’ (Nelson et al., 2012) and the ‘Data documentation initiative’ (Thomas et al., 2014). These are relatively new and are experiencing limited but increasing uptake in the statistical community (Thieme and Mathur, 2014). They provide a common conceptual framework within which survey organizations can build shared systems that handle not only the varying adaptive requirements for data collection but also the data discovery and data sharing requirements between and among statistical organizations.
In the clinical trial domain, examples of standards include the Clinical Data Interchange Standards Consortium (2017) and Health Level 7 (2017a), as well as the reference information model (Health Level 7, 2017b). Another, the Biomedical Research Integrated Domain Group (2017), combined common elements of various standards and enables tracking and reporting of study conduct, protocol representation and adverse events. Looking across these conceptual frameworks designed to enable common use of data and to encourage communication and collaboration between previously separate fields (such as surveys and clinical trials) has the potential to facilitate technology transfer.
5.3. Issues in adaptation
Although adaptive methods in surveys and clinical studies can be effective, care is needed in developing and implementing them because they have a strong influence on the eventual data set. Adaptation based on what ends up not being the primary end point in a clinical trial can degrade information for the eventual primary end point and, by inappropriate adjustments in allocation percentages, may compromise ethics. Similarly, adaptation based on inaccurate or incomplete auxiliary data and paradata can lead to errors in survey monitoring and ineffective interventions. The overarching issue is that although researchers can disagree on how to analyse a data set, adaptive methods generate the data set and there is no turning back. Therefore, it is important to ensure that methods are robust to model misspecification and other violations of assumptions.
We recommend conducting simulation studies to assess the effect of measurement error and model misspecification before conducting an adaptive design or survey; this is also recommended in Food and Drug Administration (2010, 2016). In the survey domain, it is important to do more research examining trade-offs between different types of error (e.g. non-response and measurement) and between, for example, mean-squared error and cost (see Section 4.3 for a proposal to evaluate such trade-offs between stopping rules).
5.4. Population inference challenges
Valid analyses of data sets that are generated by adaptive methods require more care and sophistication than those generated from a fixed plan approach. Moreover, many analysis issues are common to surveys and clinical studies. These include attention to the adaptation plan and robustness to modelling assumptions. In general, the analysis must take adaptation, which is an aspect of the sampling plan, into account. For example, in surveys, an emphasis on gathering ‘influential cases’ (those that are important to sample balance) changes response propensities and must be taken into account. These changes in response propensities are incorporated in the R-indicator after each new wave of data collection by refitting the propensity model. Adaptations also need to be accounted for in the non-response adjustments that are made after data collection has concluded. One approach to accomplish this is to include all variables that are used by the R-indicator in a calibration estimator, such as a generalized regression estimator. Another approach is to use indicators for each intervention (e.g. a change to contact mode) as variables in a propensity model to be used in a propensity-based non-response adjustment.
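A minimal sketch of the second approach follows (Python; variables and data are hypothetical, and calibration, design weights and weight trimming are omitted): an indicator for each adaptive intervention enters the response propensity model, and inverse fitted propensities serve as non-response adjustment weights for respondents.

```python
# Minimal sketch: include an adaptive-intervention indicator (here, a
# hypothetical mode switch) in the response propensity model and use the
# inverse fitted propensities as nonresponse adjustment weights.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
n = 3000
sample = pd.DataFrame({
    "age_group": rng.integers(0, 4, size=n),
    "urban": rng.integers(0, 2, size=n),
    "got_mode_switch": rng.integers(0, 2, size=n),  # adaptive intervention flag
})
true_p = 1 / (1 + np.exp(-(-0.3 + 0.2 * sample["age_group"]
                           + 0.5 * sample["got_mode_switch"])))
sample["responded"] = rng.binomial(1, true_p)

X = sample[["age_group", "urban", "got_mode_switch"]]
model = LogisticRegression(max_iter=1000).fit(X, sample["responded"])
sample["propensity"] = model.predict_proba(X)[:, 1]

# Nonresponse-adjusted weights for the respondents only.
respondents = sample[sample["responded"] == 1].copy()
respondents["nr_weight"] = 1.0 / respondents["propensity"]
print(respondents["nr_weight"].describe())
```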
Similarly, focusing on cases with documented high response propensities changes the sampling plan. In surveys, the mode of data collection (e.g. web based, hard copy, telephone or in person) can generate ‘mode effects.’ An adaptive design can employ several modes in different sequence, so the effect of a single mode can depend on the full mode sequence.
Acknowledgements
T. A. Louis was supported by an interagency personnel agreement with Johns Hopkins University while serving as Associate Director for Research and Methodology at the US Census Bureau, M. Rosenblum by the Patient-Centered Outcomes Research Institute (grant ME-1306-03198) and the US Food and Drug Administration (grant HHSF223201400113C) and E. A. Stuart by the National Science Foundation (grant DRL-1335843) and the Institute of Education Sciences (grant R305D150001). This research was supported in part by the Intramural Research Program of the US Department of Agriculture National Agricultural Statistics Service. This work is solely the responsibility of the authors and does not represent the views of these agencies; the findings and conclusions in this publication have not been formally disseminated by any of the above agencies and should not be construed to represent any agency determination or policy.