Don Jang, Ana Lucía Córdova Cazar, Big Data Meets Survey Science, Journal of the Royal Statistical Society Series A: Statistics in Society, Volume 185, Issue Supplement_2, December 2022, Pages S167–S169, https://doi.org/10.1111/rssa.12967
Surveys have long been the primary source of data collection about people’s attitudes, beliefs and opinions. They are useful for measuring specific information about individuals, as well as understanding public opinions and creating accurate and precise official statistics. However, survey data collection has been changing disruptively with the advancement of technology, the availability of new and different data sources, and the growing demand for timely dissemination of reliable information. This special issue was planned as a part of the BigSurv20 conference (https://www.bigsurv20.org) and is dedicated to solutions to these challenges, through innovative methodological developments and applications. We were encouraged by the many submissions that covered various topics involving disparate data. We are very pleased to have selected 12 papers and to have them published in this special issue.
Data sources considered in this issue’s manuscripts show a great deal of diversity, spanning from survey and Census data to data from administrative sources, web scraping, seismometers, satellite imagery, photographs, digital-trace data, big open-source databases, smartphones, Twitter and Facebook advertising. Readers will see the different purposes these data may serve, different estimation methods that enable the use of rich information in disparate data sources, and methodological solutions for data integration and estimation, such as machine learning and natural language processing. The use of a new data source comes with challenges: assessing total data error, acquiring consent from survey participants for linkage to other data, or tracking their web-browsing behaviours. In what follows, we briefly describe the key contributions of each paper.
Rich information in administrative data has been sought in an effort to address potential non-response bias stemming from high non-response rates. In this issue, Küfner et al. compare response propensity models based on several machine learning algorithms to utilise the many variables available in administrative data for establishment surveys for non-response bias adjustments. Their empirical results add evidence that the expanded use of administrative data could reduce non-response bias if variables therein are correlated with outcome and response variables, indicating the importance of data in context. The utility of administrative data can go beyond reducing non-response bias. For example, Shook-Sa et al. used state administrative data to estimate a rare population quantity, the number of persons with HIV in North Carolina jails, by linking them to roster data on incarcerated persons collected via web scraping. This paper shows promising use of web-scraped data linked with good-quality administrative data by using outcome regression models and weighting calibration methods. Such a value-added application of incomplete but crucial data is also presented in Koebe et al.’s article. They brought in satellite imagery as an additional source to derive annual small-area updates of multi-dimensional poverty indicators for an at-risk population, filling the information gaps between decennial censuses.
A couple of papers examine the use of open-source data for the prediction of economic productivity or interest rates, ideally on a near-real-time basis. Pezzoli and Tosetti look for a way to utilise seismic data to forecast regional industrial production, which shows potential for monitoring the economy at a highly granular level in a timely manner. Such prediction based on open-source data collected over time is also researched by Consoli et al., who employed economic news within a neural network framework to forecast the Italian 10-year interest rate spread. They used a big, open-source database known as the Global Database of Events, Language and Tone to extract topical and emotional news content linked to bond market dynamics and then deployed this information within a probabilistic forecasting framework with autoregressive recurrent networks.
As technology advances, artefacts of our increasingly digital lives have offered additional, broader information about our behaviours (e.g., personal interests captured through internet browsing) in the form of ‘big data’. Surveys and such digital-trace data have great potential to complement one another and to allow scientists to better understand people and the world in which we live, for instance by combining the low cost per data point of big data (offsetting the rising costs of survey-based data collection) with the ability of survey data to collect very specific information that addresses research questions. However, the representativeness and quality of big data can be questionable, posing challenges for the use of such data in social science contexts. Bach et al. discuss the promises and challenges of augmenting survey data with digital traces. They demonstrate how to obtain measurements of news media consumption from survey respondents’ web-browsing data using BERT, a natural language-processing algorithm that estimates contextual word embeddings from text data. Ilic et al. conducted an experimental survey asking respondents of the Dutch LISS panel about their dwelling conditions. Depending on the experimental condition, respondents were asked either to take several photos of their house or to answer a set of survey questions about the same topics. This paper documents the feasibility of collecting pictures instead of answers in a web survey and studies its consequences for components of total survey error.
Social scientists increasingly use social media and digital-trace data as a supplement to or replacement for survey data. The utility of such data depends on the quality of the data and research objectives. However, because these data are not usually gathered for research purposes, little is known about their quality or fitness for this purpose. Grow et al. have looked at the utility of Facebook’s advertising platform by comparing self-reported and Facebook-classified demographic information (sex, age and region of residence) in a large-scale, cross-national online survey. Bosch and Revilla propose a total error framework for digital-trace or ‘metered’ data (TEM), mirroring the total survey error framework. The authors present the TEM framework by describing the different data generation and analysis process for metered data as opposed to survey data. The paper discusses how this framework can help improve the quality of both stand-alone metered data research projects, as well as foster understanding of how and when survey and metered data can be combined.
While bringing in digital-trace data such as passively collected smartphone data to augment survey responses can provide rich information on individual and social behaviours, it also encounters hurdles in acquiring data-sharing consent from respondents. Agreeing to this novel form of data collection requires multiple consent steps, and little is known about the effects of non-participation. Silber et al. shed light on key factors that influence the data-sharing behaviour of survey respondents for different types of digital-trace data: Facebook, Twitter, Spotify and health-tracking apps. They showed that data-sharing rates vary based on four factors: the data-sharing method, respondent characteristics, sample composition and incentives. This paper provides some practical recommendations that can be considered in research designs that link surveys and digital-trace data. Keusch et al. attempted to understand the variation of consent rates for mobile data collection across demographic groups and investigated the impact of low consent rates on coverage errors in the German Panel Study Labour Market and Social Security. Mneimneh’s work adds to this line of research examining the variation in willingness to share data. In particular, the author investigates various factors that could affect respondents’ decisions to link their survey data with their public Twitter data. Privacy concerns, the respondents’ social media engagement and consent request placement were all found to be related to the consent to link. These findings have important implications for designing future studies aimed at linking social media data with survey data.
In summary, this special issue tackles various research questions using disparate data and new methods. We hope that it will stimulate survey researchers, data scientists and social science researchers to continue to meet the challenges of the demand for rapid-cycle but high-quality information by developing new methods and innovative solutions.