Carina Cornesse, Annelies G Blom, David Dutwin, Jon A Krosnick, Edith D De Leeuw, Stéphane Legleye, Josh Pasek, Darren Pennay, Benjamin Phillips, Joseph W Sakshaug, Bella Struminskaya, Alexander Wenz, A Review of Conceptual Approaches and Empirical Evidence on Probability and Nonprobability Sample Survey Research, Journal of Survey Statistics and Methodology, Volume 8, Issue 1, February 2020, Pages 4–36, https://doi.org/10.1093/jssam/smz041
Abstract
There is an ongoing debate in the survey research literature about whether and when probability and nonprobability sample surveys produce accurate estimates of a larger population. Statistical theory provides a justification for confidence in probability sampling as a function of the survey design, whereas inferences based on nonprobability sampling are entirely dependent on models for validity. This article reviews the current debate about probability and nonprobability sample surveys. We describe the conditions under which nonprobability sample surveys may provide accurate results in theory and discuss empirical evidence on which types of samples produce the highest accuracy in practice. From these theoretical and empirical considerations, we derive best-practice recommendations and outline paths for future research.
1. INTRODUCTION
In recent years, several cases of mispredicted election outcomes have made the news across the world. Prominent examples include the 2015 Israeli parliamentary election, where the majority of polls predicted that Benjamin Netanyahu would lose the premiership; the 2016 Brexit referendum, where the majority of polls predicted that Britain would vote to remain in the European Union; and the 2016 US presidential election, where the majority of polls predicted that Hillary Clinton would defeat Donald Trump. When investigating potential reasons for these and other polling failures, researchers have pointed to the fact that election polls based on probability samples usually produced more accurate predictions than election polls based on nonprobability samples (Sohlberg, Gilljam, and Martinsson 2017; Sturgis, Kuha, Baker, Callegaro, and Fisher 2018).
The finding that election polls based on probability samples usually yield better predictions than election polls based on nonprobability samples is not new. As early as the 1920s and 1930s, the scientific community debated which sampling design was better: probability sampling, as initially introduced by Arthur L. Bowley in 1906, or nonprobability sampling, as initially introduced by Anders N. Kiaer in 1895 (Lessler and Kalsbeek 1992; Bethlehem 2009). After several dramatic cases of mispredicted election outcomes in the United States (the Literary Digest poll of 1936, see Crossley 1937; Gallup, Crossley, and Roper in the polling debacle of 1948, see Converse 1987), nonprobability sampling was identified as a principal cause of prediction inaccuracy and was replaced by probability sampling in most high-quality social research.
With the rise of the internet in the late 20th century, however, nonprobability sampling regained popularity as a fast and cheap method for recruiting online panels (Göritz, Reinhold, and Batinic 2000). Yet nonprobability online panels face a number of challenges, such as noncoverage of people without internet access and selection bias due to the reliance on convenience samples of volunteers who might participate in multiple online panels (Bethlehem 2017). Despite these challenges, a vast number of opinion polls today are conducted using nonprobability online panels (Callegaro, Villar, Yeager, and Krosnick 2014a; Callegaro, Baker, Bethlehem, Göritz, and Krosnick 2014b). In addition, nonpanel-based recruitment of online respondents, for example through river sampling (American Association for Public Opinion Research 2013), has been on the rise. As a result, the majority of survey data collected online around the world today rely on nonprobability samples (Callegaro et al. 2014a, 2014b).
Regardless of whether they are designed to predict election outcomes or to measure public opinion, and regardless of whether they are conducted online or offline, probability and nonprobability sample surveys used for research and polling purposes share a common goal: to efficiently estimate the characteristics of a large population based on measurements of a small subset of that population. Therefore, both probability and nonprobability sample surveys require that (i) the sampled units are exchangeable with nonsampled units that share the same measured characteristics, (ii) no parts of the population are systematically excluded entirely from the sample, and (iii) the composition of the sampled units with respect to observed characteristics either matches or can be adjusted to match the composition of the larger population (Mercer, Kreuter, Keeter, and Stuart 2017).
Despite their shared objective of providing accurate insights into a population of interest, probability and nonprobability sample surveys differ in a critical aspect. The key difference between probability and nonprobability sample surveys lies in the type and strength of the justification for why each approach should achieve accuracy. In the case of probability sample surveys, the justification is probability sampling theory, which is based on a set of established mathematical principles (Fisher 1925; Neyman 1934; Kish 1965). This sound theoretical basis makes it possible to compute the accuracy of estimates (e.g., in the form of confidence intervals or margins of error) and gives a universal validity to the estimation method. Furthermore, because providers of probability sample surveys routinely describe the details of the data-generating process, researchers are able to make adjustments that account for potential coverage, sampling, and nonresponse biases (e.g., Harter, Battaglia, Buskirk, Dillman, and English 2016; Blumberg and Luke 2017).
For nonprobability sample surveys, the justification for expecting accurate measurements rests on untested modeling assumptions that are based on a researcher’s beliefs about the characteristics that make the sample different from the rest of the population and how those characteristics relate to the research topic. These assumptions can take different forms, such as quasi-randomization or superpopulation modeling (Deville 1991; Elliott and Valliant 2017). However, there is no general statistical theory of nonprobability sampling that justifies when and why accurate inferences can be expected: the validity is topic and survey dependent. Furthermore, online nonprobability sample providers often consider their data collection procedures to be proprietary, thus making it difficult or impossible to know what factors to include in any model aimed at correcting for selection bias in key estimates (Mercer et al. 2017).
This article is intended to move the debate about probability and nonprobability sample surveys forward: We first describe the assumptions that must be made in order to expect nonprobability samples to yield accurate results (section 2). We then summarize the empirical evidence on the accuracy of probability and nonprobability sample surveys to date (section 3). Finally, we conclude our review with practical recommendations and paths for future research (section 4).
2. CONCEPTUALIZATION OF NONPROBABILITY SAMPLING APPROACHES
The Total Survey Error (TSE; Groves and Lyberg 2010) framework that forms the bedrock of quality assessments for probability samples does not cleanly translate to the world of nonprobability sample data collection. Nonprobability samples do not involve a series of well-controlled and well-understood departures from a perfect sampling frame. Instead, most such samples rely on a collection of convenience samples that are aggregated and/or adjusted, with the goal of reducing the final difference between sample and population. Nonprobability samples cannot be evaluated by quantifying and summing the errors that occur at each stage of the sampling process. Instead, nonprobability samples can only be evaluated by assessing how closely the final modeled sample compares to the population in terms of various characteristics.
Although little has been proposed by way of formal statistical theory justifying the use of nonprobability samples for population inferences, the methods adopted for engaging with these samples suggest that a few combinations of assumptions could justify such an approach. In general, justifications can stem from four basic types of claims: (i) that any sample examining a particular question will yield the same inferences, (ii) that the specific design of the sample, as related to the questions at hand, will produce conclusions that mirror the population of interest, (iii) that a series of analytical steps will account for any differences between the sample and the population, and (iv) that the particular combination of sample and/or analytic approaches will produce accurate population estimates. Hence, the suggestion that any particular method, when used for a specific research question, is appropriate depends on underlying claims about the question of interest, the sample, and any adjustment procedures used.
2.1 Design Ignorability Due to the Question of Interest
In some cases, the method of sampling may be unrelated to the phenomena of interest. For instance, researchers may be trying to understand some process that occurs for all individuals, as in most physiological and some psychological studies. Under these circumstances, it may be reasonable to presume that any given group of individuals would behave like any other group of individuals unless there are known confounding factors. Researchers may claim that nonprobability sampling is irrelevant when answering particular questions for a few different reasons. They may believe that the process they are investigating is universal and, thus, that all people would behave similarly. On a more limited scope, they might contend that the particular phenomenon they are studying is appropriately distributed in any broad population sample and that the specific composition of that sample is unlikely to influence their conclusions.
The suggestion that some particular inference is unrelated to the sample being drawn could derive either from theoretical expectations of orthogonality or prior empirical evidence of orthogonality. Of particular note, there are some theoretical and empirical reasons to believe that certain classes of inference may be more or less susceptible to sample imbalance. Some scholars have argued that trends over time in attitudes and behaviors should be less strongly dependent on sample composition than estimates of the distributions of those attitudes and behaviors (Page and Shapiro 1992). Similarly, a few empirical studies have found that relations between variables were more similar across probability and nonprobability samples than other types of estimates (Berrens, Bohara, Jenkins-Smith, Silva, and Weimer 2003; Pasek 2016). The claim that some kinds of inferences may be made equivalently well regardless of sampling strategy may sometimes be correct. The challenge, however, is determining when this might be the case.
2.2 Fit for Purpose Designs
Researchers do not need to establish that the question they are studying is impervious to sampling strategies to make the case that a nonprobability sample is appropriate for their inferences. Instead, they can assert that the design employed mitigates whatever biases might have emerged in the sampling process. The classic example of this type of argument stems from the use of quota samples. Quota samples are typically designed to ensure that the set of respondents matches the population on certain key demographic parameters. The idea underlying this approach is that the demographic parameters that form the basis for the quotas capture the sources of bias for researchers’ inferences. To the extent that this is true, inferences made from quota samples will be accurate because all potential confounds are neutralized by the design. That is, the remaining error induced from the sampling process would be orthogonal to the questions of interest.
Notably, demographic quotas are not the only way to select respondents such that they reflect the population across key confounds. Scholars have proposed techniques ranging from matching individuals in nonprobability samples with individuals from probability samples as a means to recruit respondents to surveys (Rivers 2007) to blending together samples drawn from sources that have known opposing biases (Comer 2019). For any of these processes, if a researcher can be confident that the sample selection strategy eliminates all potential confounds for their particular question of interest, then the use of that sampling strategy is not only justifiable but will yield accurate inferences.
The challenge with these sorts of approaches is that the accuracy of critical assumptions can only really be established empirically. It is also unclear what to make of evidence that a particular conclusion is robust to a particular sampling decision. It may be the case that the nature of the question and/or type of inference renders that conclusion accurate for any similar question on a similarly derived sample or it might be that the particularities of a single analysis happen to have yielded identical conclusions by mere chance. There is no obvious way to establish which of these is the case, though claims of empirical robustness are strengthened by a clear theoretical rationale.
2.3 Global Adjustment Approaches
In this section and the next, we describe modeling approaches that have been used to improve the accuracy of nonprobability sample data. Researchers have long known that even probability samples are sometimes inaccurate, either by chance or due to variations in the likelihood that certain subgroups of the population will participate in a survey. For this reason, many statistical adjustment procedures originally developed to correct for systematic biases in probability samples have been adopted to address selection biases in nonprobability samples. These approaches can be divided into two types: global adjustments and outcome-specific adjustments. Global adjustments use a model to create a single adjustment that can be applied in any subsequent analysis, regardless of the outcome of interest. Outcome-specific adjustments tailor the adjustment model to a specific outcome of interest.
Regarding global adjustments, one commonly used approach is calibration weighting (Deville and Särndal 1992; Roshwalb, Lewis, and Petrin 2016; Santoso, Stein, and Stevenson 2016). Calibration weighting involves weighting the respondent pool such that the weighted sample totals of a certain characteristic correspond to known population totals of that same characteristic. The known population totals might come from census data, official statistics, or other data sources assumed to be of high quality. The procedure produces a global adjustment weight that can be applied to the analysis of any outcome variable. Using such weights amounts to making the assumption that once the known sources of deviation are accounted for in the adjustment procedure, the remaining errors will be unrelated to the likelihood that a particular unit in the population participated in the survey. This strategy presumes that the sampled units within specified population subgroups will be roughly equivalent to the nonsampled units within those subgroups with respect to any inferences that will be made with the data.
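To illustrate the mechanics of calibration weighting, the following sketch rakes a hypothetical respondent data frame to known population margins on two variables. The column names, margin values, and convergence settings are assumptions made for illustration; this is not the procedure of any study cited here, and production implementations would add safeguards such as weight trimming.

```python
# A minimal raking (iterative proportional fitting) sketch; column names and
# margins are hypothetical and for illustration only.
import pandas as pd

def rake(df, margins, max_iter=50, tol=1e-6):
    """Return weights so that weighted sample margins match the population margins."""
    w = pd.Series(1.0, index=df.index)
    for _ in range(max_iter):
        max_change = 0.0
        for var, target in margins.items():
            current = w.groupby(df[var]).sum() / w.sum()   # weighted sample shares
            adj = df[var].map(lambda c: target[c] / current[c])
            w = w * adj                                    # rescale toward the target
            max_change = max(max_change, float((adj - 1).abs().max()))
        if max_change < tol:
            break
    return w * len(df) / w.sum()                           # normalize to sum to n

# Hypothetical opt-in sample and census-style population margins
sample = pd.DataFrame({
    "age_group": ["18-34", "18-34", "35-64", "65+", "35-64", "18-34"],
    "gender":    ["f", "m", "f", "m", "m", "f"],
})
population_margins = {
    "age_group": {"18-34": 0.30, "35-64": 0.50, "65+": 0.20},
    "gender":    {"f": 0.51, "m": 0.49},
}
sample["weight"] = rake(sample, population_margins)
print(sample)
```

Because the resulting weight column is attached to the respondents rather than to any particular outcome, it can be reused for any subsequent analysis, which is what makes the adjustment "global."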
Although calibration weighting only requires access to population-level benchmark data, alternative global adjustment procedures make use of unit-level reference data to improve the accuracy of nonprobability sample estimates. One approach, known as sample matching, attempts to compose a balanced nonprobability sample by selecting units from a very large frame, such as a list of registered members of an opt-in panel, based on an array of auxiliary characteristics (often demographic) that closely match the characteristics of units from a reference probability sample (Rivers 2007; Vavreck and Rivers 2008; Bethlehem 2016). The matching procedure, which may be performed before any units are invited to the nonprobability survey, relies on a distance metric (e.g., Euclidean distance) to identify the closest match between pairs of units based on the set of common auxiliary characteristics.
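The following sketch illustrates the matching idea under simplifying assumptions: both the reference probability sample and the opt-in panel frame are represented by simulated, standardized auxiliary variables, and a greedy one-to-one nearest-neighbor rule on Euclidean distance stands in for the more elaborate (and often proprietary) matching algorithms used in practice.

```python
# Greedy one-to-one sample matching on Euclidean distance; the data are
# simulated and the auxiliary variables are hypothetical.
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(42)
reference = rng.normal(size=(100, 3))     # probability-sample units, 3 covariates
panel_frame = rng.normal(size=(5000, 3))  # opt-in panel members, same covariates

dist = cdist(reference, panel_frame, metric="euclidean")

selected = set()
for i in np.argsort(dist.min(axis=1)):    # match the easiest reference units first
    for j in np.argsort(dist[i]):
        if j not in selected:             # enforce one-to-one matching
            selected.add(int(j))
            break

matched_sample = panel_frame[sorted(selected)]
print(f"{len(matched_sample)} panel members selected to mirror the reference sample.")
```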
Another approach that uses unit-level reference data is propensity score weighting. This approach is performed after the survey data have been collected from units in a nonprobability sample. The basic procedure is to vertically concatenate the nonprobability sample survey data with a reference dataset, typically a large probability sample survey. Then a model (e.g., logit or probit) is fitted using variables measured in both datasets to predict the probability that a particular unit belongs to the nonprobability sample (Rosenbaum and Rubin 1983, 1984; Lee 2006; Dever, Rafferty, and Valliant 2008; Valliant and Dever 2011). A weight is then constructed based on the inverse of this estimated inclusion probability and used in any subsequent analysis of the nonprobability survey data. Like calibration weighting, propensity score weighting only works if two conditions are satisfied: (i) the weighting variables and the propensity of response in the sample are correlated; and (ii) the weighting variables are correlated with the outcome variables of interest. A related approach is to use the concatenated dataset to fit a prediction model using variables measured in both datasets, which is then used to impute the values of variables for units in the nonprobability sample that were only observed for units in the reference probability sample (Raghunathan 2015).
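A bare-bones version of this procedure is sketched below on simulated data. The covariates are hypothetical, a plain logistic regression serves as the propensity model, and the reference sample's own design weights are ignored for brevity, although a full implementation would incorporate them.

```python
# Propensity score weighting sketch: concatenate the two samples, model
# membership in the nonprobability sample, and weight by the inverse propensity.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_nonprob = rng.normal(loc=0.3, size=(800, 4))   # covariates, nonprobability sample
X_prob = rng.normal(loc=0.0, size=(1200, 4))     # covariates, reference probability sample

X = np.vstack([X_nonprob, X_prob])
z = np.concatenate([np.ones(len(X_nonprob)), np.zeros(len(X_prob))])  # 1 = nonprob

propensity_model = LogisticRegression().fit(X, z)
propensity = propensity_model.predict_proba(X_nonprob)[:, 1]  # P(nonprob member | x)

weights = 1.0 / propensity                        # inverse-propensity adjustment
weights *= len(X_nonprob) / weights.sum()         # normalize to sum to n
print(f"Weight range: {weights.min():.2f} to {weights.max():.2f}")
```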
The previously described adjustment approaches use the reference data only for the weighting, matching, or imputation steps; the reference data are then discarded during the analysis of the nonprobability survey data. Alternative approaches combine both data sources and analyze them jointly. One such approach is pseudo design-based estimation (Elliott 2009; Elliott and Valliant 2017), where pseudo-inclusion probabilities are estimated for the nonprobability sample units based on a set of variables common to both nonprobability and probability samples. Different techniques may be used to estimate the inclusion probabilities. For example, one could concatenate the nonprobability and probability datasets and predict the probability of participating in the nonprobability survey, similar to the aforementioned propensity score weighting procedure. Alternatively, one could employ sample matching with both surveys and donate a probability sample unit’s inclusion probability to the closest recipient match in the nonprobability sample. Once the pseudo-inclusion probabilities have been assigned to all nonprobability sample units, these units can be treated as if they were selected using the same underlying sampling mechanism as the probability sample units. The datasets may then be combined and analyzed jointly using the actual and pseudo weights. For variance estimation, Elliott and Valliant (2017) recommend design-based resampling approaches, such as the bootstrap or jackknife, to account for variability in both the pseudo weights and the target quantity of interest. For nonprobability samples that have an underlying cluster structure (e.g., different types of persons recruited from different web sites), cluster resampling approaches should be used.
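The variance estimation step can be sketched as follows on simulated data. The outcome and pseudo weights are hypothetical, and, for brevity, the pseudo weights are held fixed across bootstrap replicates; Elliott and Valliant (2017) recommend re-estimating them within each replicate so that their variability is fully reflected.

```python
# Bootstrap standard error for a pseudo-weighted mean; data and weights are simulated.
import numpy as np

rng = np.random.default_rng(1)
n = 800
y = rng.binomial(1, 0.4, size=n).astype(float)   # hypothetical binary outcome
w = rng.gamma(shape=2.0, scale=0.5, size=n)      # hypothetical pseudo weights

def weighted_mean(y, w):
    return np.sum(w * y) / np.sum(w)

point_estimate = weighted_mean(y, w)

boot = []
for _ in range(2000):
    idx = rng.integers(0, n, size=n)             # resample whole units with replacement
    boot.append(weighted_mean(y[idx], w[idx]))

se = np.std(boot, ddof=1)
print(f"Estimate: {point_estimate:.3f}, bootstrap SE: {se:.3f}")
```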
Another approach to combining and analyzing probability and nonprobability samples jointly is blended calibration (DiSogra, Cobb, Chan, and Dennis 2011; Fahimi, Barlas, Thomas, and Buttermore 2015). Blended calibration is a form of calibration weighting that combines a weighted probability sample with an unweighted nonprobability sample and calibrates the combined sample to benchmark values measured on units from the weighted probability sample survey. The combined sample is then analyzed using the calibrated weights.
In summary, each of the previously described approaches produces a single global adjustment that can be applied to any analysis regardless of the outcome variable of interest. These methods entail the assumption that the selection mechanism of the nonprobability sample is ignorable conditional on the variables used in the adjustment method. For example, selection bias is assumed to be negligible or nonexistent within subgroups used in the calibration and propensity scoring procedures. It is further assumed that the adjustment variables in the reference data source (census or probability sample survey) are measured without error and are highly correlated with the target analysis variables and correlated with the probability to participate in the nonprobability sample survey. A very large and diverse reference data source with an extensive set of common variables is therefore needed to maximize the validity of these strong assumptions. One should always keep in mind that global adjustment procedures may improve the situation for some estimates but not others, and there is no guarantee that biases in the nonprobability data will be removed completely. In practice, one never knows with certainty that ignorability assumptions hold, although in some cases it may be possible to place bounds on the potential magnitude of bias through sensitivity analysis (Little, West, Boonstra, and Hu 2019).
2.4 Outcome-Specific Adjustment Approaches
Outcome-specific adjustment approaches, in contrast, utilize adjustment models that are tailored to a specific outcome variable. That is, they adjust for the selection mechanism into a nonprobability sample with respect to a given outcome variable, Y, which is of interest to the researcher. Such approaches attempt to control for variables that govern the selection process and are correlated with the target outcome variable. One example of such a framework is the notion that probability and nonprobability samples both constitute draws from a hypothetical infinite “superpopulation” (Deville 1991). The goal of the analyst is then to model the data-generating process of the underlying superpopulation by accounting for all relevant variables in the analysis model. In practice, this means that a researcher may fit a prediction model for some analysis variable Y based on the sample at hand, which is then used to predict the Y values for the nonsampled units. The sampled and nonsampled units are then combined to estimate the quantity of interest (e.g., mean, total, regression coefficient) for Y in the total population (Elliott and Valliant 2017). The key assumptions of this approach are that the analysis variable, Y, is explained through a common model for the sampled and nonsampled units and that all parameters governing the superpopulation model are controlled for in the analysis model. The approach also requires the availability of auxiliary data on the population to make predictions for the nonsampled units. Variance estimation for the predictions can be implemented using a variety of frequentist methods, including jackknife and bootstrap replication estimators as described in Valliant, Dorfman, and Royall (2000).
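The prediction logic can be illustrated with a small simulation in which the auxiliary variables are known for the whole (artificial) population and selection into the sample depends only on those variables, so the key ignorability assumption holds by construction. The linear model and all parameter values are assumptions made for illustration.

```python
# Model-based prediction sketch: fit a model on the sample, predict y for the
# nonsampled units, and combine observed and predicted values. Simulated data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)
N = 10_000                                       # artificial population size
X_pop = rng.normal(size=(N, 3))                  # auxiliary data known for everyone
y_pop = X_pop @ np.array([1.0, -0.5, 0.2]) + rng.normal(scale=1.0, size=N)

# Self-selection depends on the auxiliary variables only (ignorable given X)
p_select = 1.0 / (1.0 + np.exp(-(X_pop[:, 0] - 1.5)))
sampled = np.flatnonzero(rng.random(N) < p_select)
nonsampled = np.setdiff1d(np.arange(N), sampled)

model = LinearRegression().fit(X_pop[sampled], y_pop[sampled])
y_hat = model.predict(X_pop[nonsampled])

# Combine observed values for sampled units with predictions for the rest
pop_mean_estimate = (y_pop[sampled].sum() + y_hat.sum()) / N
print(f"Estimated population mean: {pop_mean_estimate:.3f}")
print(f"True population mean:      {y_pop.mean():.3f}")
```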
Another model-based approach, which can be applied in a superpopulation framework, is model-assisted calibration. This approach involves constructing calibrated weights by using a model to predict the values for a given analysis variable (Wu and Sitter 2001). The calibrated weights are generated based on constraints placed on the population size and population total of the predicted values. Various model selection approaches (e.g., LASSO) have been proposed for parsimonious modeling of the analysis variable (Chen, Valliant, and Elliott 2018). A key assumption of the method is that the model is correctly specified and capable of making reliable predictions across different samples of the population. It is also assumed that all relevant superpopulation parameters are included in the model if the method is implemented in a superpopulation framework.
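A compact numerical sketch of the calibration step, in the spirit of the approach described above, follows. Equal base weights stand in for design or pseudo-design weights, a simple linear working model replaces model selection procedures such as the LASSO, and the chi-square distance yields the familiar closed-form linear adjustment; all data are simulated.

```python
# Model-assisted calibration sketch: weights are adjusted so that the weighted
# totals of (1, yhat) match their population counterparts. Simulated data.
import numpy as np

rng = np.random.default_rng(3)
N, n = 20_000, 400
x_pop = rng.normal(size=N)                         # auxiliary variable, known for all
y_pop = 2.0 + 1.5 * x_pop + rng.normal(scale=0.8, size=N)
idx = rng.choice(N, size=n, replace=False)         # placeholder for the observed sample

d = np.full(n, N / n)                              # base weights
beta = np.polyfit(x_pop[idx], y_pop[idx], deg=1)   # working model fitted on the sample
yhat_sample = np.polyval(beta, x_pop[idx])
yhat_pop_total = np.polyval(beta, x_pop).sum()

z = np.column_stack([np.ones(n), yhat_sample])     # calibration variables
T = np.array([N, yhat_pop_total])                  # population benchmark totals
lam = np.linalg.solve((d[:, None] * z).T @ z, T - d @ z)
w = d * (1 + z @ lam)                              # calibrated weights

print(f"Calibrated weight total (should equal N): {w.sum():.1f}")
print(f"Estimated population mean of y: {(w @ y_pop[idx]) / N:.3f}")
```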
Multilevel regression and poststratification is another approach used to estimate a specific outcome of interest from a nonprobability sample (Wang, Rothschild, Goel, and Gelman 2015; Downes, Gurrin, English, Pirkis, and Currier 2018). The basic idea is to fit a multilevel regression model predicting an outcome given a set of covariates. The use of a multilevel model makes it possible to incorporate a large number of covariates or high-order interactions into the prediction model (Ghitza and Gelman 2013). The model is then used to estimate the mean value for a large number of poststratification cells defined by the cross-classification of all variables used in the regression model. It is necessary that the relative size of each cell in the population is known or can be reliably estimated from external data sources such as a census or population registry. Population quantities are estimated by aggregating these predicted cell means, with each cell weighted proportionally to its share of the population. The multilevel regression model allows cell-level estimates to be generated even when few units exist within the sample cells. The method can also be implemented in a Bayesian hierarchical modeling framework (Park, Gelman, and Bafumi 2004). Like all of the previously mentioned model-based approaches, the model is assumed to control for all variables that affect the probability of inclusion in the nonprobability sample. The method also requires good model fit, and large cell sizes are preferable for generating robust cell-level estimates. For additional hierarchical and Bayesian modeling approaches that have been proposed to estimate outcomes from nonprobability samples, we refer to Ganesh, Pineau, Chakraborty, and Dennis (2017), Pfeffermann (2017), and Pfeffermann, Eltinge, and Brown (2015).
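The poststratification step is illustrated below with simulated data. The covariates, cell counts, and outcome model are all hypothetical, and a plain logistic regression stands in for the multilevel (often Bayesian) model that a full implementation would use to stabilize estimates in sparse cells.

```python
# Poststratification sketch: predict a mean for every population cell and
# aggregate the predictions with known cell shares. Simulated data; a simple
# logit replaces the multilevel model of a full MRP implementation.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
ages = ["18-34", "35-64", "65+"]
regions = ["north", "south"]

# Hypothetical nonprobability sample with a binary outcome
sample = pd.DataFrame({
    "age_group": rng.choice(ages, size=600, p=[0.5, 0.35, 0.15]),
    "region": rng.choice(regions, size=600),
})
probs = 0.3 + 0.2 * (sample["age_group"] == "65+").to_numpy()
sample["y"] = rng.binomial(1, probs)

# Known population counts for each cell of the cross-classification (e.g., census)
cells = pd.DataFrame(
    [(a, r) for a in ages for r in regions], columns=["age_group", "region"]
)
cells["pop_count"] = [9000, 8000, 12000, 11000, 7000, 6000]

model = smf.logit("y ~ C(age_group) + C(region)", data=sample).fit(disp=False)
cells["pred"] = model.predict(cells)               # predicted cell means

# Aggregate cell predictions, weighting each cell by its population share
estimate = np.average(cells["pred"], weights=cells["pop_count"])
print(f"Poststratified estimate of the population proportion: {estimate:.3f}")
```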
Collectively, these approaches to using nonprobability sample data to reach population-level conclusions depend on combinations of assumptions about ignorable errors and the availability of information about the sources of nonrandomness in respondent selection that can be used to adjust for any errors that are not ignorable. Although these dependencies are not fundamentally different from the assumptions underlying the use of probability samples, they are more difficult to rely on as we know little about the factors that lead individuals to become members of nonprobability samples.
3. THE ACCURACY OF PROBABILITY AND NONPROBABILITY SAMPLE SURVEYS
Several studies have empirically assessed the accuracy of probability and nonprobability sample surveys by comparing survey outcomes to external population benchmarks (see table 1). The vast majority of these studies concluded that probability sample surveys have a significantly higher accuracy than nonprobability sample surveys. Only a few studies have found that probability sample surveys do not generally have a significantly higher accuracy than nonprobability sample surveys.
Table 1. Studies on the Accuracy of Probability (PS) and Nonprobability (NPS) Sample Surveys

| Study | Country | Benchmark | PS modes studied | PS more accurate than NPS?^a | Did (re)weighting sufficiently reduce NPS bias?^b |
|---|---|---|---|---|---|
| Blom et al. (2018) | Germany | Census data; election data | F2F, web | Yes (univariate) | No (raking) |
| MacInnis et al. (2018) | USA | High-quality PS | Phone, web | Yes (univariate) | No (poststratification) |
| Dassonneville et al. (2018) | Belgium | Census data; election outcome | F2F | Yes (univariate); No (multivariate) | No (unspecified approach) – univariate; N/A – multivariate |
| Legleye et al. (2018) | France | Census data | Phone | Yes (univariate) | N/A |
| Pennay et al. (2018) | Australia | High-quality PS | Phone | Yes (univariate) | No (poststratification) |
| Sturgis et al. (2018) | UK | Election outcome | F2F | Yes (univariate) | No (raking, propensity weighting, matching) |
| Dutwin and Buskirk (2017) | USA | High-quality PS | F2F, phone | Yes (univariate) | No (propensity weighting, matching, raking) |
| Sohlberg et al. (2017) | Sweden | Election outcome | Phone | Yes (univariate) | N/A |
| Brüggen et al. (2016) | Netherlands | Population register | F2F, web | Yes (univariate); Yes (bivariate) | No (Generalized Regression Estimation) – univariate, bivariate |
| Kennedy et al. (2016) | USA | High-quality PS | Web, mail | No (univariate); No (multivariate) | N/A |
| Pasek (2016) | USA | High-quality PS | Phone | Yes (univariate); No (bivariate); Yes (longitudinal) | No (raking, propensity weighting) – univariate, bivariate, longitudinal |
| Gittelman et al. (2015) | USA | High-quality PS | Phone | No (univariate) | No (poststratification) |
| Ansolabehere and Schaffner (2014) | USA | High-quality PS; election outcome | Phone, mail | No (univariate); No (multivariate) | N/A |
| Erens et al. (2014) | UK | High-quality PS | CASI | Yes (univariate) | N/A |
| Steinmetz et al. (2014) | Netherlands | High-quality PS | Web | No (univariate); No (bivariate) | Yes (propensity weighting) – univariate, bivariate |
| Ansolabehere and Rivers (2013) | USA | High-quality PS; election outcome | F2F | No (univariate) | N/A |
| Szolnoki and Hoffmann (2013) | Germany | High-quality PS | F2F, phone | Yes (univariate) | N/A |
| Chan and Ambrose (2011) | Canada | Unspecified | Web | No (univariate) | N/A |
| Scherpenzeel and Bethlehem (2011) | Netherlands | Election outcome; unspecified | F2F, web | Yes (univariate) | N/A |
| Yeager et al. (2011) | USA | High-quality PS | Phone, web | Yes (univariate) | No (poststratification) |
| Chang and Krosnick (2009) | USA | High-quality PS | Phone, web | Yes (univariate) | No (raking) |
| Walker, Pettit, and Rubinson (2009)^c | USA | Unspecified | Phone, mail | Yes (univariate) | N/A |
| Loosveldt and Sonck (2008) | Belgium | Census data | F2F | No (univariate) | No (poststratification, propensity weighting) |
| Malhotra and Krosnick (2007) | USA | High-quality PS | F2F | Yes (univariate) | No (unspecified) |
| Berrens et al. (2003) | USA | High-quality PS | Web, phone | No (univariate); No (multivariate) | Yes (raking, propensity weighting) – univariate, multivariate |
a. The study results reported here are based on initial comparisons of accuracy as reported by the authors. Some studies use raw (unweighted) data in their initial comparisons, while others use data that are already weighted, either with weights that the authors calculated themselves or with weights that a survey vendor delivered with the data.

b. We report whether (re)weighting sufficiently reduced bias based on the authors' own judgments as reported in their conclusions. This column includes studies that compare raw (unweighted) data in their initial comparison of accuracy with the data after a weighting procedure was performed, as well as studies that used weighted data in their initial comparison of accuracy and reweighted the data in subsequent comparisons. Studies that only reported raw data or only weighted data, without using any reweighting approaches, are labelled "not applicable" (N/A) in this column.

c. As reported by Callegaro et al. (2014a, 2014b).
Table 1 provides a list of the studies that are included in our overview. A key inclusion requirement is that the studies contain comparisons of probability and nonprobability sample surveys with external population benchmarks. We exclude studies published in languages other than English, studies that contain sample accuracy assessments of probability and nonprobability sample surveys only as a minor byproduct, and studies that compare probability sample surveys with nonsurvey convenience samples, such as Amazon Mechanical Turk (e.g., Coppock and McClellan 2019).
3.1 Initial Accuracy Comparisons for Probability and Nonprobability Sample Surveys
As table 1 shows, a number of studies have demonstrated that probability sample surveys have a higher accuracy than nonprobability sample surveys. The higher accuracy of probability sample surveys has been demonstrated across various topics, such as voting behavior (Malhotra and Krosnick 2007; Chang and Krosnick 2009; Sturgis et al. 2018), health behavior (Yeager, Krosnick, Chang, Javitz, and Levendusky 2011), consumption behavior (Szolnoki and Hoffmann 2013), sexual behavior and attitudes (Erens, Burkill, Couper, Conrad, and Clifton 2014; Legleye, Charrance, Razafindratsima, Bajos, and Bohet 2018), and socio-demographics (Malhotra and Krosnick 2007; Chang and Krosnick 2009; Yeager et al. 2011; Szolnoki and Hoffmann 2013; Erens et al. 2014; Dutwin and Buskirk 2017; MacInnis, Krosnick, Ho, and Cho 2018). In addition, the higher accuracy of probability sample surveys has been found across a number of countries, such as Australia (Pennay, Neiger, Lavrakas, and Borg 2018), France (Legleye et al. 2018), Germany (Szolnoki and Hoffmann 2013; Blom, Ackermann-Piek, Helmschrott, Cornesse, Bruch, and Sakshaug 2018), the Netherlands (Scherpenzeel and Bethlehem 2011; Brüggen, van den Brakel, and Krosnick 2016), Sweden (Sohlberg et al. 2017), the United Kingdom (Sturgis et al. 2018), and the United States (Malhotra and Krosnick 2007; Chang and Krosnick 2009; Yeager et al. 2011; Dutwin and Buskirk 2017; MacInnis et al. 2018). Furthermore, the higher accuracy of probability sample surveys has been shown over time, from the earliest study demonstrating it in 2007 (Malhotra and Krosnick 2007) to the most recent ones in 2018 (Blom et al. 2018; Legleye et al. 2018; MacInnis et al. 2018; Sturgis et al. 2018). All of these studies from different times and countries, focusing on different topics, reached the conclusion that probability sample surveys produced more accurate estimates than nonprobability sample surveys.
A prominent study from this line of research is Yeager et al. (2011). In a series of analyses of surveys conducted between 2004 and 2008 in the United States, the authors found that probability sample surveys were consistently more accurate than nonprobability sample surveys across many benchmark variables (primary demographics such as age, gender, and education; secondary demographics such as marital status and homeownership; and nondemographics such as health ratings and possession of a driver’s license), even after poststratification weighting.
Another influential study, based on a particularly rich database, was conducted with four probability sample face-to-face surveys, one probability sample online survey, and eighteen nonprobability sample online surveys (the so-called NOPVO [National Dutch Online Panel Comparison Study] project) from 2006 to 2008 in the Netherlands (Brüggen et al. 2016). In line with the findings from Yeager et al. (2011), the authors found that the probability face-to-face and internet surveys were consistently more accurate than the nonprobability internet surveys across a variety of sociodemographic variables and variables on health and life satisfaction.
A recent example that speaks to the current debate about why polls mispredict election outcomes was published by Sturgis et al. (2018). Investigating why British polls mispredicted the outcome of the 2015 UK general election, the authors examined data from twelve British pre-election polls and assessed a number of potential reasons for the polling debacle, such as whether voters changed their mind at the last minute (i.e., "late swing"), whether the mode in which the surveys were conducted (telephone or online) played a role, or whether polls failed to apply proper turnout weighting to correct for the common problem of overestimating voter turnout. The authors found that sample inaccuracy due to nonprobability sample selection likely had the biggest impact on the mispredictions.
Although most studies have demonstrated that accuracy is higher in probability sample surveys than in nonprobability sample surveys, two studies have yielded mixed findings depending on the type of estimate examined (Pasek 2016; Dassonneville, Blais, Hooghe, and Deschouwer 2018). Both studies show that accuracy is higher in probability sample surveys for univariate estimates. Pasek (2016) also reports higher accuracy of probability sample surveys for longitudinal analyses. However, these studies found no difference in accuracy regarding bivariate (Pasek 2016) and multivariate (Dassonneville et al. 2018) estimates.
Several studies have found no consistent superiority in accuracy of probability or nonprobability sample surveys over one another. These studies generally yielded mixed findings: a probability sample survey was found to be more accurate than some but not all nonprobability sample surveys examined (Kennedy, Mercer, Keeter, Hatley, McGeeney, and Gimenez 2016); or probability sample surveys were shown to be more accurate than nonprobability sample surveys on some variables while nonprobability sample surveys were more accurate than probability sample surveys on other variables (Loosveldt and Sonck 2008; Chan and Ambrose 2011; Steinmetz, Bianchi, Tijdens, and Biffignandi 2014). In some of the studies, the authors speculated that it might be survey mode rather than the sampling design that led to comparable accuracy (Berrens et al. 2003; Ansolabehere and Schaffner 2014; Gittelman, Thomas, Lavrakas, and Lange 2015).
In general, it should be noted that sample accuracy assessments face some challenges. One common challenge of such studies is to disentangle mode effects (i.e., measurement bias) from sampling effects (i.e., selection bias). This challenge occurs because probability sample surveys are usually conducted offline (e.g., via face-to-face or telephone interviews), whereas nonprobability sample surveys are usually conducted online via nonprobability online panels. However, several studies show that it is possible to disentangle the mode effect from the sampling effect by comparing offline probability sample surveys with online probability sample surveys (mode effect) and comparing online probability sample surveys to online nonprobability sample surveys (sampling effect). The majority of these studies conclude that both offline and online probability sample surveys are more accurate than nonprobability online sample surveys (Chang and Krosnick 2009; Scherpenzeel and Bethlehem 2011; Yeager et al. 2011; Brüggen et al. 2016; Dutwin and Buskirk 2017; Blom et al. 2018; MacInnis et al. 2018).
A related challenge that sample accuracy assessments of probability and nonprobability sample surveys face is the question of how to measure accuracy in a way that accounts for both sampling variability and systematic bias. This challenge occurs because often there is only one probability sample survey and one nonprobability sample survey available for sample accuracy assessment. However, several large-scale studies that compare a larger number of probability sample surveys with a larger number of nonprobability sample surveys have found that probability sample surveys are consistently more accurate than nonprobability sample surveys (Yeager et al. 2011; Brüggen et al. 2016; Blom et al. 2018; MacInnis et al. 2018; Sturgis et al. 2018). This suggests that although surveys vary on a number of design factors other than their sampling design (e.g., incentive schemes, contact frequency) and might sometimes be more or less accurate by chance, samples are generally more likely to have higher accuracy if they are based on probability sampling procedures rather than nonprobability sampling procedures.
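As a concrete illustration of how accuracy is typically summarized in this literature, the short sketch below computes the average absolute error of survey estimates against benchmark values. All numbers are invented for illustration; studies with several surveys per design additionally examine the spread of errors across surveys to separate systematic bias from chance variation.

```python
# Average absolute error against external benchmarks; all values are invented.
import numpy as np

benchmarks = np.array([48.0, 62.5, 13.2, 71.8])      # e.g., census or register values (%)
prob_survey = np.array([47.1, 63.0, 14.0, 70.9])     # probability sample estimates (%)
nonprob_survey = np.array([44.3, 66.8, 10.1, 75.2])  # nonprobability sample estimates (%)

def avg_abs_error(estimates, benchmarks):
    return float(np.mean(np.abs(estimates - benchmarks)))

print(f"Average absolute error, probability sample:    {avg_abs_error(prob_survey, benchmarks):.2f} points")
print(f"Average absolute error, nonprobability sample: {avg_abs_error(nonprob_survey, benchmarks):.2f} points")
```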
Another common challenge that sample accuracy assessments face is the availability of appropriate gold-standard benchmarks. In the current literature, the most commonly used benchmarks are large-scale, high-quality probability sample surveys. A typical example of such a benchmark is the US American Community Survey (Yeager et al. 2011; MacInnis et al. 2018). Other benchmarks used in the literature are population census data (Legleye et al. 2018), election outcomes (Sohlberg et al. 2017), and population register data (Brüggen et al. 2016).
All of these benchmarks have advantages and disadvantages. A key advantage of using a large-scale, high-quality probability sample survey is that the set of variables available for comparisons usually includes not only sociodemographic variables but also substantive variables on attitudes and behavior. A disadvantage is that large-scale, high-quality probability sample surveys are surveys themselves and might therefore contain typical survey errors, such as coverage, sampling, and nonresponse errors (Groves and Lyberg 2010).
Census data and population register data have the advantage of not suffering from survey errors. However, such data are often not available for the current year and might therefore be outdated at the time of the study. Population register data might also be outdated, for example, if immigration, emigration, births, and deaths are not captured in a timely manner. In addition, census data and population register data are typically limited to a small set of sociodemographic characteristics. With regard to election outcomes, an advantage is that they are key variables of substantive interest to many social scientists. However, if survey data fail to accurately predict election outcomes, there are many potential explanations for this besides the sampling approach; see Sturgis et al. (2018) for a list of reasonable explanations tested after the British polling disaster of 2015.
3.2 Weighting Approaches to Reduce Bias in Nonprobability Sample Surveys
Many studies examining the accuracy of probability and nonprobability sample surveys have attempted to eliminate biases in nonprobability sample surveys. The majority of these studies found that weighting did not sufficiently reduce bias in nonprobability sample surveys (see table 1). Generally speaking, probability sample surveys were found to be more accurate than nonprobability sample surveys even after (re)weighting.
Although the majority of studies found that weighting did not reduce the bias in nonprobability sample surveys sufficiently (table 1), some studies showed that weighting did reduce the bias somewhat. However, whether researchers considered the bias to be sufficiently reduced by weighting varied from study to study. For example, Berrens et al. (2003) considered the bias in a nonprobability sample survey sufficiently reduced even though an estimate of mean household income deviated from the benchmark by between 4.8 percentage points (after propensity weighting) and 11.9 percentage points (after raking). A study concluding that weighting approaches sufficiently reduced bias in nonprobability sample surveys also reported that weighting increased the variance of the estimates significantly (Steinmetz et al. 2014).
Most of the studies listed in table 1 focus on the success of weighting procedures to reduce bias in nonprobability sample surveys. Only a few studies listed in table 1 have also assessed the success of weighting procedures in reducing bias in probability sample surveys. For example, MacInnis et al. (2018) reported that weighting reliably eliminated the small biases present in unweighted probability sample survey data. This is in line with research by Gittelman et al. (2015), who showed that poststratification weighting successfully reduced biases in a probability sample survey but was less successful across a number of nonprobability survey samples and even increased bias in one instance.
The studies listed in table 1 used one or more common weighting procedures, such as raking, poststratification, and propensity weighting, to improve the accuracy of nonprobability sample survey measurements. Several other studies in the literature have investigated the effectiveness of various weighting procedures in reducing bias in nonprobability sample surveys, without examining the accuracy of probability sample surveys. Table 2 provides an overview of these studies. A key inclusion requirement is that studies assess whether weighting nonprobability sample surveys reduced bias as compared with unweighted estimates and if so, to what extent. We again exclude all studies published in languages other than English and studies that contain assessments of weighting procedures for nonprobability sample surveys only as a minor byproduct or that examine nonsurvey data.
Table 2. Studies Exclusively Investigating Weighting Procedures to Reduce Bias in Nonprobability Sample Surveys

| Study | Benchmark | Does weighting sufficiently reduce bias in NPS?^a |
|---|---|---|
| Mercer et al. (2018) | High-quality PS | No (raking, propensity weighting, matching) |
| Smyk, Tyrowicz, and Van der Velde (2018) | High-quality PS | No (propensity weighting) |
| Gelman et al. (2017) | High-quality PS; election outcome | No (raking); Yes (multilevel regression and poststratification) |
| Goel et al. (2015) | High-quality PS | No (raking); Yes (model-based poststratification) |
| Wang et al. (2015) | Election outcome | Yes (multilevel regression and poststratification) |
| Lee and Valliant (2009) | High-quality PS | Yes (propensity weighting, calibration) |
| Schonlau, van Soest, Kapteyn, and Couper (2009) | High-quality PS | No (propensity weighting, matching) |
| Schonlau, van Soest, and Kapteyn (2007) | PS | No (propensity weighting) |
| Lee (2006) | High-quality PS | No (propensity weighting) |
| Duffy, Smith, Terhanian, and Bremer (2005) | High-quality PS | No (raking, propensity weighting) |
| Schonlau, Zapert, Simon, Sanstad, and Marcus (2004) | PS | No (poststratification, propensity weighting) |
| Taylor (2000) | High-quality PS; election outcome | Yes (raking, propensity weighting) |
Note: PS = probability sample survey; NPS = nonprobability sample survey. We report whether weighting sufficiently reduced bias based on the authors’ own judgments as reported in their conclusions.
In general, the majority of the studies that investigated the effectiveness of various weighting procedures in reducing bias in nonprobability sample surveys (table 2) reached the same conclusion as the studies that assessed weighting approaches in both probability and nonprobability sample surveys (table 1): weighting does not sufficiently reduce bias in nonprobability sample surveys. Only a few studies found that weighting could sufficiently reduce bias in nonprobability sample surveys.
As table 2 shows, two of the studies that documented a sufficient reduction in the bias of nonprobability sample surveys applied multilevel regression and poststratification weights (Wang et al. 2015; Gelman, Goel, Rothschild, and Wang 2017), one study applied model-based poststratification (Goel, Obeng, and Rothschild 2015), and one study applied propensity weighting and calibration (Lee and Valliant 2009). Two of these studies used weighting to adjust nonprobability sample survey data to accurately predict election outcomes after the actual election outcomes were already known, which reduces confidence in the conclusions (Wang et al. 2015; Gelman et al. 2017).
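To illustrate the logic shared by these poststratification-based approaches, the sketch below fits an outcome model in a hypothetical nonprobability sample and then averages the cell-level predictions using hypothetical population cell shares. It uses an ordinary logistic regression in place of the multilevel models employed by Wang et al. (2015) and Gelman et al. (2017), so it should be read as a simplified stand-in, not a reproduction of their methods.

```python
# A minimal sketch of model-based poststratification on hypothetical data:
# model the outcome, predict for every poststratification cell, and reweight
# the cell predictions by known population cell shares.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)

# Hypothetical nonprobability sample with a binary outcome (e.g., vote intention).
n = 2000
sample = pd.DataFrame({
    "age_group": rng.choice(["18-34", "35-54", "55+"], size=n, p=[0.5, 0.3, 0.2]),
    "educ": rng.choice(["low", "high"], size=n, p=[0.3, 0.7]),
})
logit = -0.4 + 0.8 * (sample["age_group"] == "55+") + 0.5 * (sample["educ"] == "low")
sample["y"] = rng.binomial(1, 1 / (1 + np.exp(-logit)))

# Step 1: model the outcome as a function of the poststratification variables.
X = pd.get_dummies(sample[["age_group", "educ"]], drop_first=True)
model = LogisticRegression().fit(X, sample["y"])

# Step 2: enumerate the poststratification cells with hypothetical population shares
# (in practice these would come from a census table or high-quality probability sample).
cells = pd.DataFrame(
    [(a, e) for a in ["18-34", "35-54", "55+"] for e in ["low", "high"]],
    columns=["age_group", "educ"],
)
cells["pop_share"] = [0.18, 0.12, 0.20, 0.15, 0.22, 0.13]  # must sum to 1

# Step 3: predict the outcome in each cell and average using the population shares.
X_cells = pd.get_dummies(cells[["age_group", "educ"]], drop_first=True).reindex(
    columns=X.columns, fill_value=0
)
cells["y_hat"] = model.predict_proba(X_cells)[:, 1]
estimate = (cells["y_hat"] * cells["pop_share"]).sum()
print(f"Poststratified estimate: {estimate:.3f}")
```

The quality of such an estimate hinges entirely on whether the model and the poststratification variables capture the selection into the sample, which is the assumption the studies above evaluate.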
In sum, the majority of the research on weighting and accuracy finds that the inaccuracy of nonprobability samples cannot be reliably remedied by weighting procedures. Some authors conducting such studies also offer explanations for why their attempts to obtain accurate estimates from weighted nonprobability samples were unsuccessful. Mercer, Lau, and Kennedy (2018), for instance, show that complex weighting procedures outperform basic ones. They further show that, for weighting to yield accurate estimates from nonprobability sample surveys, the availability of variables that predict the outcome of interest is more important than the choice of statistical method.
4. CLOSING REMARKS
In this article, we have reviewed conceptual approaches and empirical evidence on probability and nonprobability sample surveys. Probability sampling theory is well established and rests on sound mathematical principles, whereas no comparable theory exists for nonprobability sampling. Although there are potential justifications for drawing inferences from nonprobability samples, the rationale for many studies remains unarticulated, and inferences from nonprobability sample surveys generally require stronger modeling assumptions than are necessary for probability samples. The basic problem with these modeling assumptions is that they cannot be tested. We have therefore proposed a conceptual framework for nonprobability sample surveys that makes these modeling assumptions explicit, including practical suggestions about when it might be justified to make them (section 2).
In addition, we have summarized the empirical evidence on the accuracy of probability and nonprobability sample surveys (section 3). Our literature overview shows that, even in the age of declining response rates, probability sample surveys are generally more accurate than nonprobability sample surveys. There is no empirical support for the claim that switching to nonprobability sample surveys is advisable because steadily declining response rates across the globe compromise probability sample survey data quality. Based on the accumulated empirical evidence, our key recommendation is to continue to rely on probability sample surveys.
If only nonprobability sample survey data are available, we recommend choosing carefully among the various modeling approaches based on their underlying assumptions. So that researchers can justify the modeling approaches they use, we recommend obtaining as much information as possible about the data-generating process (see the transparency recommendations in the appendix).
Apart from these key recommendations, this report also shows that there are gaps in the existing literature. To evaluate whether and when nonprobability sample surveys can serve as an alternative to probability sample surveys, we need more insight into how well nonprobability sample surveys produce accurate estimates in bivariate and multivariate analyses, longitudinal analyses, and experimental research settings. In addition, we need more research into variance estimation and advanced weighting techniques, in particular with regard to collecting and utilizing weighting variables that are correlated with key survey variables and the data-generating process.
Finally, this report shows that there is great variability across nonprobability sample surveys. Therefore, we would like to end this report with a general call for more transparency in the survey business. For users, researchers, and clients, it can be difficult to decide which vendors to trust with data collection. As long as many vendors are unwilling to disclose necessary information about data collection and processing, researchers will remain unable to make informed decisions about vendors and will lack the information necessary to understand the limitations of their collected data. The availability of research reports that outline the methodology used for data collection and manipulation is therefore of utmost importance (Bethlehem 2017, Chapter 11).
As clients, we can reward vendors who are willing to provide more methodological information to us. As a practical matter, this requires being explicit about our needs prior to contracting and when reaching out to a broad set of nonprobability sample providers. Another step we can take as clients is to ensure that vendors who belong to an organization with relevant standards, such as ESOMAR, or that hold ISO 20252 or 26362 certification actually disclose the information required by the respective code or certification (see the appendix and Bethlehem [2017, Chapter 12] for more information on relevant transparency guidelines and standards).
Acknowledgement
The authors would like to thank the Collaborative Research Center (SFB) 884 “Political Economy of Reforms” (projects A8 and Z1), funded by the German Research Foundation (DFG), for organizing the Workshop on Probability-based and Nonprobability Survey Research in July 2018 at the University of Mannheim. In addition, the authors would like to thank Melvin John, Sabrina Seidl, and Nourhan Elsayed for their assistance with manuscript preparation. The authors are grateful to Andrew Mercer for valuable input and feedback on earlier versions of this manuscript. The order of authors is alphabetical after the lead author.
Footnotes
https://edition.cnn.com/2015/03/18/middleeast/israel-election-polls/index.html, accessed on November 30, 2019.
https://www.theguardian.com/politics/2016/jun/24/how-eu-referendum-pollsters-wrong-opinion-predict-close, accessed on September 30, 2019.
https://www.forbes.com/sites/startswithabang/2016/11/09/the-science-of-error-how-polling-botched-the-2016-election/#babf86437959, accessed on September 30, 2019.
Appendix: Transparency Guidelines
Various codes of ethics and guidelines address the disclosure of methodological information for online panels (e.g., Arbeitskreis Deutscher Markt- und Sozialforschungsinstitute eV 2001; Interactive Marketing Research Organization [IMRO] 2015; International Organization for Standardization [ISO] 2009, 2012; ESOMAR 2012, 2014, 2015a, 2015b; American Association for Public Opinion Research 2015). The lack of information available from some online panel vendors can make it impossible for researchers to comply with their own codes or certifications (e.g., ISO 2012, §4.5.1.4; American Association for Public Opinion Research 2015, §III.A.5–8, 12, 14). This unwillingness of some vendors to disclose necessary information is unfortunate for all concerned. Consumers of research are denied information needed to form an opinion about the quality of the research. Researchers are unable to make informed decisions about vendors and lack the information needed to understand the limitations of their data. Further, vendors themselves are unable to benefit from the methodological advances that would follow from greater availability of information on panel operations.
What can be done? Given the abundance of guidelines, several of which are particularly helpful (IMRO 2015; ESOMAR 2015a), there is little need for another extensive set of recommendations. We therefore summarize common reporting recommendations in table A1 and, for the most part, refer the reader to the existing guidelines addressing these points. At times, we make additional recommendations that may go beyond existing reporting requirements; these contain the words “we recommend.”
In table A2, we also list relevant case-level data that researchers may wish to ensure the vendor can provide. Such data will likely be most useful when the researcher requests them for all cases invited to participate in the survey, not only for those that completed the survey or were not removed for data quality reasons.
Table A1. Common Reporting Recommendations

| Construct | Description | References |
|---|---|---|
| Universe | What universe does the panel represent? How well does the panel cover that universe? | American Association for Public Opinion Research (2015, §III.A.3); ESOMAR (2014, §5.2); ISO (2009, §4.4.1, §4.7); ISO (2012, §4.5.1.4, §7.2.e, §7.2.f.2) |
| Sampling frame | What method or methods are used to recruit panel members? We recommend that the percentage of the sample recruited from different sources be included (e.g., online advertisement, email list, aggregator website). | American Association for Public Opinion Research (2015, §III.A.5-8); ESOMAR (2012, §2, §5); ESOMAR (2014, §5.2); ESOMAR (2015a, §3.7); ESOMAR (2015b, §5.2, §6, §6.1); IMRO (2015, §4); ISO (2012, §4.5.1.4, §7.2.f) |
| Panel size | How large is the panel? What is the percentage of active members? How is “active” defined? | IMRO (2015, §14, §17); ISO (2009, §4.4.2) |
| Replenishment and retirement | How frequently are new panel members added? Are panel members retired from the panel by the vendor? If so, what criteria are used to determine retirement? What is the panel’s annual “churn rate” (the percentage of all panelists who are voluntarily or involuntarily retired each year)? | IMRO (2015, §18, §19); ISO (2009, §4.5.1) |
| Aggregator sites | Does the panel draw on aggregator sites, which are sites that allow respondents to select from multiple surveys for which they may qualify (see IMRO 2015)? | IMRO (2015, §6) |
| Sample from other panels | Was sample from other providers used in this study? | ESOMAR (2012, §4) |
| Blending | If there is more than one sample source, how are these blended together? | ESOMAR (2012, §3); ESOMAR (2015b, §6.1) |
| Sample drawn from frame | How was the sample selected from the panel? Any quotas or other specifications used in selection of the sample should be specified. How many units were drawn? | ESOMAR (2012, §7); ESOMAR (2015a, §3.7); ESOMAR (2015b, §6.2); IMRO (2015, §23); ISO (2009, §4.6.1); ISO (2012, §4.5.1.4) |
| Routers | Was a router used? If so, how did the vendor determine which surveys are considered for a participant? What is the priority basis for allocating participants to surveys? How is potential bias from the use of a router addressed? | American Association for Public Opinion Research (2015, §III.A.14); ESOMAR (2012, §8-§11); ESOMAR (2015a, §3.7); ESOMAR (2015b, §6.2) |
| Identity validation | How does the vendor confirm participants’ identities? How does the vendor detect fraudulent participants? | American Association for Public Opinion Research (2015, §III.A.17); ESOMAR (2012, §22); ESOMAR (2015a, §3.8); ESOMAR (2015b, §6.1); IMRO (2015, §24); ISO (2009, §4.3.4.2, §4.3.4.3) |
| Within-panel deduplication | What procedures does the panel employ to ensure that no research participant can complete the survey more than once? | American Association for Public Opinion Research (2015, §III.A.17); ESOMAR (2015a, §3.2, §3.8); IMRO (2015, §25) |
| Cross-panel deduplication | If multiple sample sources are used, how does the panel ensure that no research participant can complete the survey more than once? | American Association for Public Opinion Research (2015, §III.A.17); ESOMAR (2012, §3); ESOMAR (2015a, §3.8); IMRO (2015, §29) |
| Category exclusion | How are panel members who have recently completed a survey on the same topic treated? If they are excluded, for how long or using what rules? | IMRO (2015, §15); ISO (2009, §4.6.2) |
| Participation limits | How often can a panel member be contacted to take part in a survey in a specified period? | ESOMAR (2012, §19, §20); ESOMAR (2015a, §3.8); IMRO (2015, §16); ISO (2009, §4.5.1, §4.6.2) |
| Other data quality checks | What checks are made for satisficing and fraud? These might include checks for speeding and patterned responses, trap questions, counts of missing or nonsubstantive responses (e.g., don’t know), device fingerprinting, or logic checks. | American Association for Public Opinion Research (2015, §III.A.17); ESOMAR (2012, §18); ESOMAR (2015a, §3.3, §3.8); ESOMAR (2015b, §6, §6.1, §6.4); IMRO (2015, §30); ISO (2009, §4.6.6, §4.7); ISO (2012, §7.2.k) |
| Panel profile | How often are profiles updated? | IMRO (2015, §21) |
| Fieldwork dates | When was the survey fielded? If there were multiple sample sources, did the field period vary between sources? | American Association for Public Opinion Research (2015, §III.A.4); ESOMAR (2014, §5.2); ESOMAR (2015a, §3.8); ESOMAR (2015b, §6.3); ISO (2009, §4.7); ISO (2012, §4.8.3, §7.2.g) |
| Invitations and reminders | What text is used for invitations and any reminders? We recommend disclosing the maximum number of reminders sent. | American Association for Public Opinion Research (2015, §III.A.16); ESOMAR (2012, §13); ISO (2009, §4.7) |
| Incentives | What incentives, if any, are offered to panel members? | American Association for Public Opinion Research (2015, §III.A.16); ESOMAR (2012, §14); ESOMAR (2015a, §3.7); ESOMAR (2015b, §6.1-2); IMRO (2015, §20); ISO (2009, §4.5.2); ISO (2012, §7.2.i) |
| Questionnaire | In the event that the researcher did not create the questionnaire, the questionnaire should be available, including any screening questions. As web surveys are an intensely visual medium (e.g., Couper, Tourangeau, and Kenyon 2004; Couper, Tourangeau, Conrad, and Crawford 2004), we recommend that screenshots be provided as part of reporting. | American Association for Public Opinion Research (2015, §III.A.2, §III.A.14-15); ESOMAR (2014, §5.2); ESOMAR (2015a, §3.8); ESOMAR (2015b, §5.2, §6.3); ISO (2009, §4.7); ISO (2012, §7.2.l) |
| Final dispositions and outcome rates | How many panel members were invited to complete the survey? What is the break-off rate? How are these calculated? We recommend that the definitions of outcome rates developed by Callegaro and DiSogra (2008) and DiSogra and Callegaro (2016) be used; these are largely incorporated in American Association for Public Opinion Research (2015). | American Association for Public Opinion Research (2015, §III.A.18); ESOMAR (2015a, §3.7-8); ESOMAR (2015b, §6.1); IMRO (2015, §26); ISO (2009, §4.7); ISO (2012, §7.2.f.4) |
| Handling of mobile devices/smartphones | We recommend documenting whether the questionnaire layout is adapted/optimized for smartphone completion and, if so, how it is adapted or optimized. There is an extensive literature on methods for formatting web surveys on mobile devices and their implications for data quality (e.g., Peytchev and Hill 2010; Chrzan and Saunders 2012; de Bruijne and Wijnant 2013a, 2013b; Link, Murphy, Schober, Buskirk, and Hunter Childs 2014; Mavletova and Couper 2014, 2016; Arn, Klug, and Kolodziejski 2015; Struminskaya, Weyandt, and Bosnjak 2015; Lugtig and Toepoel 2016; Revilla, Toninelli, and Ochoa 2016; Couper, Antoun, and Mavletova 2017; Peterson, Griffin, LaFrance, and Li 2017; Antoun, Katz, Argueta, and Wang 2018). Although not incorporated into other guidelines, it is important that clients are sufficiently informed, as the handling of mobile devices may have consequences for break-off and measurement. | Not referenced in any guideline |
Table A2. Relevant Case-Level Data

| Construct | Notes |
|---|---|
| Sample source | Sample source (e.g., list, web advertisement) |
| Within-survey paradata | Date-time survey started and completed; duration between survey start and completion; frequency of invitations to surveys (e.g., each week); absence of invitations to concurrent surveys during a certain time (e.g., two consecutive weeks after the first invitation to participate in the survey); device information provided for each session for surveys completed in multiple sessions: user agent string or equivalent information (e.g., operating system and version number, browser and version number), whether JavaScript is enabled, whether AJAX is enabled, and screen resolution |
| Cross-survey paradata | Date joined panel (if relevant); number of surveys the panel member has been invited to complete; number of surveys the panel member completed; individual-level completion rate (number of surveys completed / number of surveys invited) |
| Profile data | Profile information relevant to any quotas or other selection mechanisms; date panel profile data were last updated |
| Panelist recruitment | Recruitment origin of the selected panelists (respondents and nonrespondents), such as list, telephone, web, or advertisement; membership in other panels (if possible) |
| History of contacts | Number of reminders (if any); regular or extra incentive; whether the panelist was invited to participate in other surveys during the research |
| Quality data | If all sample cases (not only completes) are requested: the status of any data quality checks (e.g., duplicate responses, speeding, straight-lining, trap questions) and, for each case removed from the data, the reasons why the case was removed. We recommend that all sample cases be requested from the panel vendor. |
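As a practical illustration of how the case-level data in table A2 might be used once obtained from a vendor, the sketch below computes the individual-level completion rate defined above and a simple speeding flag. The field names, values, and threshold are hypothetical and do not reflect any vendor's actual data schema.

```python
# A minimal sketch of processing hypothetical case-level panel data of the kind
# listed in table A2: individual-level completion rate and a simple speeding flag.
import pandas as pd

cases = pd.DataFrame({
    "panelist_id": [101, 102, 103, 104],
    "surveys_invited": [40, 12, 25, 8],
    "surveys_completed": [38, 3, 20, 8],
    "start_time": pd.to_datetime(["2019-05-01 10:00", "2019-05-01 10:05",
                                  "2019-05-01 10:07", "2019-05-01 10:12"]),
    "end_time": pd.to_datetime(["2019-05-01 10:14", "2019-05-01 10:07",
                                "2019-05-01 10:25", "2019-05-01 10:26"]),
})

# Individual-level completion rate: number of surveys completed / number of surveys invited.
cases["completion_rate"] = cases["surveys_completed"] / cases["surveys_invited"]

# Duration between survey start and completion, and a simple speeding flag
# (here: less than a third of the median duration; any real threshold should
# be justified for the specific questionnaire).
cases["duration_min"] = (cases["end_time"] - cases["start_time"]).dt.total_seconds() / 60
cases["speeder"] = cases["duration_min"] < cases["duration_min"].median() / 3

print(cases[["panelist_id", "completion_rate", "duration_min", "speeder"]])
```

Requesting such fields for all invited cases, not only completes, allows researchers to run these checks themselves rather than relying solely on the vendor's internal quality procedures.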
References
American Association for Public Opinion Research (
American Association for Public Opinion Research (
Arbeitskreis Deutscher Markt- und Sozialforschungsinstitute e.V. (
Callegaro, M., and C. DiSogra (2008), “Computing Response Metrics for Online Panels,” Public Opinion Quarterly, 72, 1008–1032.
ESOMAR (
ESOMAR (
ESOMAR (
ESOMAR (
Interactive Marketing Research Organization (IMRO) (
International Organization for Standardization (
International Organization for Standardization (
Klein, R. A., M. Vianello, F. Hasselman, B. G. Adams, R. B. Adams Jr., S. Alper, M. Aveyard, et al. (2018), “Many Labs 2: Investigating Variation in Replicability across Samples and Settings,” Advances in Methods and Practices in Psychological Science, 1, 443–490.
Revilla, M., C. Ochoa, and D. Toninelli (