ODISSEI Conference for Social Science in the Netherlands 2022

12 September 2022

The ODISSEI Conference for Social Science in the Netherlands seeks to bring together a community of computational social scientists to discuss data, methods, infrastructure, ethics and theoretical work related to digital and computational approaches in social science research. ODISSEI, the research infrastructure for social science in the Netherlands, connects researchers with the necessary data, expertise and resources to conduct ground-breaking research and embrace the computational turn in social enquiry.

Conference registration: Registration for this conference has closed.
Conference date: 3 November 2022
Location: Media Plaza (Jaarbeurs), Utrecht, the Netherlands
Contact: communications@odissei-data.nl
Please note: registration is free, but as there is a limit to the location’s capacity, please let us know as soon as possible if you have to cancel your registration.

Livestream

A livestream of the plenary room (‘Progress’) can be found here.

Programme

Find the abstracts and full overview in the programme below.
A PDF of the programme overview can be found here.

Floor plan

You can download the floor plan of the location here.


Professor Frauke Kreuter is co-director of the Social Data Science Center and faculty member in the Joint Program in Survey Methodology (JPSM) at the University of Maryland, USA; Professor of Statistics and Data Science at the Ludwig-Maximilians-University of Munich, Germany; and head of the statistical methods group at the Institute for Employment Research (IAB) in Nuremberg, Germany. She is co-editor of Big Data and Social Science: Data Science Methods and Tools for Research and Practice (CRC Press, 2nd edition, 2021).

Room: Progress

Opportunities and challenges involved in combining (big and small) data sources
Frauke Kreuter, University of Maryland, University of Munich

New (often big) data sources offer enormous potential for exploring and solving complex societal challenges. Many agencies hope to use administrative data to optimize bureaucratic processes and reduce errors in human decision-making. Others hope to use digital data traces, derived from smartphones or IoT devices, to learn more about human behavior and interactions without increasing response burden. Unfortunately, the data-generating processes are embedded in a social and economic context, which is often ignored when data are collected, shared or used downstream. Among other issues, there is growing concern about a lack of fairness, equity and diversity. This talk outlines recent developments in the use of different data products in economic and social research. Frauke Kreuter explains the shortcomings in their application and how science can come to grips with issues of data quality, ethics and privacy without compromising the ability to reproduce and reuse the data. The talk also outlines the essential conditions for a successful and fair use of AI.

Professor Frauke Kreuter holds the LMU chair of Statistics and Data Science in the Social Sciences and Humanities and is co-director of the Data Science Centers at the University of Maryland and the University of Munich.

  • Chair: Boukje Cuelenaere, Centerdata
  • Predicting Attrition of LISS Panel Members using Machine Learning and Survey Responsiveness Data
    Isabel van den Heuvel, Eindhoven University of Technology (TU/e); Zhuozhao Zhan, TU/e; Seyit Höcük, Centerdata; Edwin van den Heuvel, TU/e; Joris Mulder, Centerdata
  • Data science challenges in smart surveys. Case studies on consumption, physical activity and travel
    Barry Schouten, Statistics Netherlands (CBS)
  • Effects of Survey Design on Response Rate in Crime Surveys
    Jonas Klingwort, Statistics Netherlands (CBS)
  • Willingness and nonparticipation biases in data donation
    Bella Struminskaya, Utrecht University

Predicting Attrition of LISS Panel Members using Machine Learning and Survey Responsiveness Data
Isabel van den Heuvel, Eindhoven University of Technology (TU/e); Zhuozhao Zhan, TU/e; Seyit Höcük, Centerdata; Edwin van den Heuvel, TU/e; Joris Mulder, Centerdata
Background: Panel members who have been recruited for the LISS panel may stop responding to survey requests. This phenomenon is known as panel attrition. The attrition of panel members can lead to an imbalance in subgroup characteristics, making the panel non-representative and population estimates potentially biased. When the attrition moments of panel members can be predicted accurately, those members can be approached and motivated to stay active in the panel. Previous studies have demonstrated that attrition is associated with various factors, but the prediction of attrition using survey responsiveness variables (i.e., paradata) has not been thoroughly investigated.
Methods: Attrition was predicted for LISS panel members who were active between 2007 and 2019, using both socio-demographic variables and paradata. Statistical analyses were conducted with a Cox proportional hazards model (to screen variables associated with attrition), a random forest (a static prediction model using all variables), and the landmarking approach (dynamic prediction using survey responsiveness patterns).
Results: The percentage of attrition over the full period was 68.5% [67.8; 69.2]. Many well-known socio-demographic variables were associated with attrition (e.g., sex, age, household size, income, occupation). The random forest demonstrated good performance (AUC: 83.9%; C-index: 79.1%) and showed that the paradata were more important in predicting attrition than the socio-demographic variables. Prediction performance dropped substantially without paradata (AUC: 61.2%; C-index: 57.1%). Using only the survey responsiveness patterns of six-month windows, landmarking predicted attrition well (AUC: 76.0%).
Conclusions: Our analysis shows that the use of paradata, and in particular the survey responsiveness patterns of panel members, in combination with machine learning techniques could predict the attrition of panel members accurately. Landmarking can be further optimized for the LISS panel to help retain panel members.
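A minimal Python sketch of the static prediction step described above, on simulated data: the column names, the data-generating process, and the model settings are invented for illustration and do not reproduce the authors’ pipeline (which also includes a Cox model and landmarking).

```python
# Illustrative sketch only: a random forest predicting attrition from
# socio-demographics plus paradata, evaluated by AUC. All columns and the
# data-generating process are invented.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 4000
panel = pd.DataFrame({
    "age": rng.integers(18, 90, n),            # socio-demographics
    "household_size": rng.integers(1, 6, n),
    "response_rate_6m": rng.uniform(0, 1, n),  # paradata: share of invitations answered
    "n_reminders_6m": rng.integers(0, 10, n),  # paradata: reminders received
})
# here attrition is driven mostly by low responsiveness (the paradata signal)
p_attrit = 1 / (1 + np.exp(4 * panel["response_rate_6m"] - 1.5))
panel["attrited"] = rng.binomial(1, p_attrit)

X_train, X_test, y_train, y_test = train_test_split(
    panel.drop(columns="attrited"), panel["attrited"],
    test_size=0.3, random_state=1)
model = RandomForestClassifier(n_estimators=300, random_state=1)
model.fit(X_train, y_train)
print("AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
# Refitting after dropping the paradata columns mimics the paper's
# with/without-paradata comparison.
```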

Data science challenges in smart surveys. Case studies on consumption, physical activity and travel
Barry Schouten, Statistics Netherlands (CBS)
Smart surveys add features of smart devices to surveys such as in-device processing, use of mobile device sensors, linkage to external sensor systems and data donation. They do so with the aim to ease respondent burden, to improve data quality and/or to enrich survey data. Smart features are especially promising for survey topics that are cognitively demanding, require detailed knowledge and recall, or for which questions provide weak proxies to the concepts of interest.
While smart surveys may be promising from a survey error perspective, their design and implementation pose new challenges. Respondents need to be engaged and need to trust statistical institutes to handle their data carefully. Respondents need to understand and be able to perform the survey tasks. They also need to provide context to the data being measured. The survey back-office IT and logistics become more demanding. And last, but not least, there is a strong reliance on advanced data extraction and machine learning methods to transform new forms of data into official statistics. The latter implies trade-offs in active and online learning and in the role of respondents.
In the paper, we explain how computational science methods come into play and what choices need to be made, on the basis of three case studies. In household budget surveys, text extraction and machine learning are used to classify scanned and/or digital receipts, and possibly donated bank transaction data. In physical activity surveys, trackers with motion sensors are used to predict the type and intensity of activity. In travel surveys, location data, possibly enriched with open points-of-interest data, are employed to derive travelled distances; here, stop detection, travel mode prediction and even prediction of travel purpose may come in. In all cases, the data show deficiencies that automated procedures cannot overcome completely, and respondents will need to assist.
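As a concrete, hedged illustration of the receipt-classification step, the sketch below trains a character n-gram text classifier on a few invented receipt lines; the labels and training data are hypothetical, and this is not the CBS production pipeline.

```python
# Toy receipt-line classifier: TF-IDF character n-grams plus logistic
# regression. Example lines and category labels are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

lines = ["semi-skimmed milk 1L", "day return ticket", "paracetamol 500mg"]
labels = ["food", "transport", "health"]  # stand-ins for COICOP-like categories

clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
clf.fit(lines, labels)
print(clf.predict(["single bus ticket"]))  # likely ['transport'], via shared n-grams
```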

Effects of Survey Design on Response Rate in Crime Surveys
Jonas Klingwort, Statistics Netherlands (CBS)
The number of population surveys conducted is enormous and increasing, but response rates are declining across all data collection modes. With increasing nonresponse, the risk of nonresponse bias grows, resulting in biased survey estimates. The surest way to avoid the missing-data problem is not to have any missing data, and a well-designed survey and professional administration are required to approach this ideal.
This work aims to quantify the effects of survey design features on the response rate in surveys. This is demonstrated using German crime surveys. Individual and independent studies dominate criminological survey research in Germany. This circumstance allows systematically studying the effects of different survey design features and their impact on the response rate in crime surveys.
A systematic literature review of German crime surveys conducted between 2000 and 2022 identified 138 surveys, of which 80 were eligible for analysis. For each survey, the design features study year (2000-2022), target population (general vs. non-general), coverage area (local, regional, national), and data collection mode (CATI, CAWI, F2F, PAPI, other) were recorded. A meta-regression model was fitted to quantify the influence of these design features on the response rate.
Preliminary results show significant regression coefficients for most of the included design features, indicating a linear relationship between the predictors and the response rate.
Such a model can support decision-making when (re-)designing a population survey, indicating which design features to adjust if a high response rate is desired.
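A hedged sketch of such a meta-regression in Python: the data frame below is invented, and a real analysis would use all 80 surveys and weight by the precision of each response-rate estimate.

```python
# Toy meta-regression of survey response rates on design features.
import pandas as pd
import statsmodels.formula.api as smf

surveys = pd.DataFrame({
    "response_rate": [0.62, 0.55, 0.48, 0.41, 0.39, 0.33, 0.30, 0.24],
    "year":          [2001, 2006, 2004, 2010, 2012, 2016, 2018, 2021],
    "mode": ["F2F", "F2F", "PAPI", "PAPI", "CATI", "CATI", "CAWI", "CAWI"],
    "n":    [1500, 1200,  900, 1100, 2000, 2400, 3000, 2800],  # crude precision weight
})

meta = smf.wls("response_rate ~ year + C(mode)",
               data=surveys, weights=surveys["n"]).fit()
print(meta.params)  # year trend and mode effects on the response rate
```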

Willingness and nonparticipation biases in data donation
Bella Struminskaya, Utrecht University
Recent technological advances and the integration of technology into people’s lives result in the continuous collection of data by organisations. Current European legislation (the GDPR) allows individuals to request information about themselves from the organisations gathering it and to share that information with researchers. Such rich data provide ample opportunities to study human behavior. For example, donated geolocation histories make it possible to study human mobility at unprecedentedly granular levels; donated social media data give insights into individuals’ social networks; donated fitness-tracking data give insights into physical activity. Donated data are less susceptible to social desirability and recall biases than self-reports and when combined with in-the-moment questi

  • Chair: Chang Sun, Maastricht University
  • Guess what I am doing: Identifying Physical Activities from Accelerometer data through Machine Learning and Deep Learning
    Joris Mulder, Centerdata – Tilburg University
  • Who suffers most from problematic debts? Evidence from health insurance defaults
    Mark Kattenberg, Anne-Fleur Roos, Jurre Thiel, Centraal Planbureau
  • Harnessing heterogeneity in behavioural research using computational social science
    Giuseppe Alessandro Veltri, University of Trento
  • Latent class analysis with distal outcomes: Two modified three-step methods using propensity scores
    Tra Le, Felix Clouth, Jeroen Vermunt, Tilburg University

Guess what I am doing: Identifying Physical Activities from Accelerometer data through Machine Learning and Deep Learning
Joris Mulder, Centerdata – Tilburg University
Nowadays, accelerometers and actigraphs are used in many research projects, providing highly detailed, objectively measured sensor data on physical activity. Where self-reported data might miss everyday activities (e.g. walking to the shop, climbing stairs), accelerometer data provide a more complete picture of physical activity. The primary objective of this research is to identify specific activity patterns in the accelerometer data using machine learning and deep learning techniques. The secondary objective is to improve the accuracy of this identification by validating the activities against time-use data and survey data.
Activity data were collected through a large-scale accelerometer study in the probability-based LISS panel, which consists of approximately 7,500 panel members from 5,000 households. A total of 1,200 respondents participated in the study and wore an accelerometer for eight days, measuring physical activity 24/7. A diverse group of 20 people labeled specific activity patterns in a controlled setting by wearing the device while performing the activities. The labeled data were used to train supervised machine learning models, and a deep learning model was trained to enhance the detection of the activities. Moreover, 450 respondents from the accelerometer study also participated in a time-use study in the LISS panel, reporting their daily activities on a smartphone using a time-use app. The reported time-use activities were used to validate the activities detected by the deep learning model.
We show that machine learning and deep learning models can successfully identify specific types of activity from an accelerometer signal and that the results can be validated with time-use data. Patterns of specific activities (sleeping, sitting, walking, cycling, jogging, driving) were successfully identified. The deep learning model increased the predictive power to better distinguish between specific activities. The time-use data proved useful for further validating certain hard-to-identify activities (e.g. cycling).
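A toy version of the supervised pipeline in Python, assuming simulated accelerometer signals: slide fixed-length windows over the signal, extract simple magnitude features, and train a classifier. Real pipelines use richer features or deep networks on the raw signal.

```python
# Toy activity recognition from simulated tri-axial accelerometer windows.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
FS, WIN = 50, 50 * 5  # 50 Hz sampling, 5-second windows

def features(window):
    # window: (WIN, 3) array of x/y/z acceleration
    mag = np.linalg.norm(window, axis=1)
    return [mag.mean(), mag.std(), mag.max() - mag.min()]

# simulated labelled recordings: low-variance "sitting" vs high-variance "walking"
windows = [rng.normal(1.0, s, size=(WIN, 3)) for s in [0.02] * 30 + [0.4] * 30]
labels = ["sitting"] * 30 + ["walking"] * 30

clf = RandomForestClassifier(random_state=0)
clf.fit([features(w) for w in windows], labels)
print(clf.predict([features(rng.normal(1.0, 0.4, size=(WIN, 3)))]))  # likely ['walking']
```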

Who suffers most from problematic debts? Evidence from health insurance defaults
Mark Kattenberg, Anne-Fleur Roos, Jurre Thiel, Centraal Planbureau
Qualitative evidence suggests that financial troubles cause or deepen mental health problems, but quantitative evidence is scarce. The available evidence suggests that problematic debts increase mental health problems (Roos et al., 2021), but we do not know whether people are equally affected by problematic debts or whether some suffer more than others. Following Roos et al. (2021), we use nationwide individual-level panel data from the Netherlands for the years 2011-2015 to study the relationship between problematic debts and mental health, and we use a difference-in-differences approach with individual fixed effects for identification. To detect heterogeneity in the effect of problematic debts on mental health, we modify the causal forest algorithm developed by Athey et al. (2019) to incorporate time and individual fixed effects. Our results help policymakers identify which people suffer most mentally from problematic debts, which is important information when designing preventive policies aimed at reducing health care use or policies that should prevent debts from becoming problematic.
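A hedged sketch of the idea: demean out individual and year fixed effects, then fit a causal forest on the residualised data. It uses the econml package as a stand-in (the authors modify the Athey et al. algorithm directly), and all data below are simulated.

```python
# Within-transformation plus causal forest on simulated person-year data.
import numpy as np
import pandas as pd
from econml.dml import CausalForestDML

rng = np.random.default_rng(0)
n_persons, n_years = 600, 5
df = pd.DataFrame({
    "person_id": np.repeat(np.arange(n_persons), n_years),
    "year": np.tile(np.arange(2011, 2011 + n_years), n_persons),
    "age": np.repeat(rng.integers(20, 70, n_persons), n_years),
})
person_fe = np.repeat(rng.normal(0, 1, n_persons), n_years)
df["debt"] = rng.binomial(1, 0.3, len(df)).astype(float)
true_effect = 0.5 + 0.02 * (df["age"] - 45)  # debt hurts older people more
df["mh_problems"] = (person_fe + 0.1 * (df["year"] - 2011)
                     + true_effect * df["debt"] + rng.normal(0, 1, len(df)))

def demean(col):
    # two-way within transformation: remove person and year fixed effects
    s = df[col]
    return (s - df.groupby("person_id")[col].transform("mean")
              - df.groupby("year")[col].transform("mean") + s.mean())

cf = CausalForestDML(discrete_treatment=False, random_state=0)
cf.fit(demean("mh_problems"), demean("debt"), X=df[["age"]])
print(cf.effect(df[["age"]]).mean())  # roughly the average effect of debt
```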

Harnessing heterogeneity in behavioural research using computational social science
Giuseppe Alessandro Veltri, University of Trento
Digital online platforms have extended experiments to large national and international samples, thus increasing the potential heterogeneity present in responses to the examined treatments. Identifying and studying such heterogeneity is therefore crucial in online behavioural experiments, and new analytical techniques have emerged in computational social science to achieve this goal. We illustrate this with a study conducted in the context of the COVID-19 pandemic, which applies model-based recursive partitioning to data from an online experiment aimed at increasing vaccine willingness in eight European countries. Another valuable output of this approach is the identification of particular segments of the sample that might merit further investigation. Identifying ‘local’ models of the population is not just a matter of chance: when applied to independent variables involving socioeconomic and behavioural measures, the technique allows us to detect subgroups characterised by a particular socioeconomic or cognitive pattern shared by that group. Such a group could very well cut across traditional sociodemographic categories.
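Model-based recursive partitioning is typically run with R’s partykit (mob/glmtree); as a deliberately simplified Python stand-in on simulated data, the sketch below searches for the single covariate split after which separate local treatment-effect models fit best.

```python
# Crude one-split illustration of 'local' models (not the full MOB algorithm,
# which uses parameter-stability tests and recursive splitting).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 2000
treat = rng.integers(0, 2, n)
age = rng.uniform(18, 80, n)
# simulated outcome: the treatment raises willingness only under age 40
willing = 0.2 + 0.3 * treat * (age < 40) + rng.normal(0, 0.1, n)

def sse(mask):
    # residual sum of squares of a local model: willing ~ treatment
    return sm.OLS(willing[mask], sm.add_constant(treat[mask])).fit().ssr

best = min((sse(age < c) + sse(age >= c), c) for c in range(25, 75, 5))
print("best split: age <", best[1])  # recovers the split near age 40
```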

Latent class analysis with distal outcomes: Two modified three-step methods using propensity scores
Tra Le, Felix Clouth, Jeroen Vermunt, Tilburg University
Bias-adjusted three-step latent class (LC) analysis has become a popular technique to estimate the relationship between class membership and a distal outcome. Because the classes are latent, class memberships are not random assignments; confounding therefore needs to be accounted for to draw causal conclusions. Modern causal inference techniques using propensity scores have become increasingly popular. We propose two novel strategies that use propensity scores to estimate the causal effect of LC membership on an outcome variable, aiming to tackle the limitations faced by existing methods.
Both strategies modify the bias-adjusted three-step approach by using propensity scores in the last step to control for confounding. The first strategy includes inverse propensity weights (IPW) as fixed weights (the IPW strategy), whereas the second includes the propensity scores as control variables (the propensity-scores-as-covariates strategy). To avoid misspecifying the relationship between the propensity scores and the outcome variable, we use a flexible regression model with quadratic terms and interactions. In both strategies, classification errors are accounted for using the BCH method.
A simulation study compared their performance with three existing approaches. We varied the sample size, effect size, confounding strength, and class separation for binary and continuous outcome variables, with 500 replications per condition. The results showed that our IPW strategy, albeit the most logical one, suffered from non-convergence (due to extreme weights for the binary outcome variable) and had the lowest efficiency. The propensity-scores-as-covariates strategy performed best: it estimated the causal effect with the lowest bias and was relatively efficient. We use data from Wave 14 of the LISS panel to demonstrate our methods. Specifically, using the Family and Household module, we investigate how different types of parent-child relationships affect perceived relationship quality, controlling for sociodemographic variables.
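A minimal sketch of the weighting idea behind the IPW strategy, on simulated data with an observed binary “class” and a single confounder; the LC measurement model and the BCH correction for classification error are omitted.

```python
# IPW for the effect of (here, perfectly observed) class membership on a
# distal outcome, with one confounder. The true effect is 1.0.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 5000
conf = rng.normal(size=n)                        # confounder
cls = rng.binomial(1, 1 / (1 + np.exp(-conf)))   # "class membership"
y = 1.0 * cls + 0.5 * conf + rng.normal(size=n)  # distal outcome

ps = sm.Logit(cls, sm.add_constant(conf)).fit(disp=0).predict()
w = np.where(cls == 1, 1 / ps, 1 / (1 - ps))     # inverse propensity weights

ipw = sm.WLS(y, sm.add_constant(cls), weights=w).fit()
print(ipw.params[1])  # close to 1.0; unweighted OLS would be biased upward
```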

  • Chair: Laura Boeschoten, Utrecht University
  • Mobile-tracked Visits to Millions of Places Reveal Socioeconomic Inequality of Daily Consumption in the United States
    Yuanmo He, Milena Tsvetkova, The London School of Economics and Political Science
  • Realtime User Ratings as a Strategy for Combatting Misinformation: An Experimental Study
    Jonas Stein, University of Groningen
  • Online housing search and residential mobility
    Joep Steegmans, Leiden University

Mobile-tracked Visits to Millions of Places Reveal Socioeconomic Inequality of Daily Consumption in the United States
Yuanmo He, Milena Tsvetkova, The London School of Economics and Political Science
An important aspect of socioeconomic inequality is the difference in daily consumption practices by socioeconomic status (SES). This difference is not only a manifestation of inequality but also a trigger for further inequality in other life outcomes. For example, constrained by availability and price, people with low SES tend to shop at low-price supermarkets and consume unhealthy food and beverages, which can contribute to later health problems. Differential daily consumption patterns result from both economic constraints and social processes. The sociologists Veblen and Bourdieu suggest that people use different consumption behaviour to distinguish their SES, and that people of similar SES tend to have similar consumption preferences. Empirical evidence also shows that lifestyle choices can become correlated with demographic characteristics through homophily and social influence. We therefore hypothesize that SES is associated with different preferences for consumer brands, and that these preferences do not necessarily correspond to the economic constraints implied by product prices.
To test these hypotheses, we combine data from SafeGraph, Yelp, and the US Census. By linking SafeGraph and Census data, we obtain the distribution of brand visitors’ incomes from the median income of their home census block groups. We also use Yelp’s dollar signs as an indicator of the brands’ price levels. By comparing the brands’ price levels with the income distribution of their visitors, we can identify outliers that indicate unexpected lifestyle correlations. Based on the existing literature, we expect to identify lifestyle groups that exhibit patterns of conspicuous consumption (low-SES people visiting high-SES brands), inconspicuous consumption (high-SES people visiting low- or middle-SES brands), and omnivorousness (high-SES people having more diverse consumption practices than low-SES people). The study adds valuable descriptive detail to our understanding of socioeconomic inequality in daily consumption and provides behavioral evidence for the arbitrary correlation of consumer and cultural preferences.
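A hedged pandas sketch of the linkage logic, with invented records and column names: join visits to visitors’ home-area median income, then compare each brand’s visitor income with its Yelp price level.

```python
# Toy stand-ins for the three sources (all values invented).
import pandas as pd

visits = pd.DataFrame({"brand": ["LuxMart", "LuxMart", "BudgetMart", "BudgetMart"],
                       "home_cbg": ["A", "B", "B", "C"]})
census = pd.DataFrame({"home_cbg": ["A", "B", "C"],
                       "median_income": [95_000, 60_000, 32_000]})
yelp = pd.DataFrame({"brand": ["LuxMart", "BudgetMart"],
                     "price_level": [4, 1]})  # number of Yelp dollar signs

visitor_income = (visits.merge(census, on="home_cbg")
                        .groupby("brand")["median_income"].median()
                        .rename("visitor_income").reset_index())
profile = yelp.merge(visitor_income, on="brand")

# large gaps between price rank and visitor-income rank flag candidate cases
# of conspicuous or inconspicuous consumption
profile["gap"] = (profile["visitor_income"].rank(pct=True)
                  - profile["price_level"].rank(pct=True))
print(profile)
```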

Realtime User Ratings as a Strategy for Combatting Misinformation: An Experimental Study
Jonas Stein, University of Groningen
Fact-checking takes time. As a consequence, verdicts are usually reached after a message has already gone viral, and late-stage interventions can have only limited effect. An emergent approach (e.g. Twitter’s Birdwatch) is to harness the wisdom of the crowd by enabling recipients of an online message on a social media platform to attach veracity assessments to it, with the intention that poor initial crowd reception will temper belief in, and further spread of, misinformation. We study this approach by letting 4,000 subjects in 80 experimental bipartisan communities sequentially rate the veracity of informational messages. We find that in well-mixed communities, the public display of earlier veracity ratings indeed enhances the correct classification of true and false messages by subsequent users. But when false information is sequentially rated in strongly segregated communities, crowd intelligence backfires: early raters’ ideological bias, when aligned with a message, pulls later raters’ assessments away from the truth. These results identify an important problem for community misinformation detection systems and suggest that platforms must somehow compensate for the deleterious effects of echo chambers in their design.
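A toy simulation of the mechanism (not the authors’ experimental design): raters sequentially judge a false message, mixing a noisy private signal with the mean of the displayed earlier ratings; in the segregated condition the early raters share a bias aligned with the message.

```python
# Sequential veracity ratings with social influence; higher means "more true".
import numpy as np

rng = np.random.default_rng(3)

def mean_rating(segregated, n_raters=200, social_weight=0.5):
    ratings = []
    for i in range(n_raters):
        bias = 0.4 if (segregated and i < n_raters // 2) else 0.0
        signal = 0.3 + bias + rng.normal(0, 0.2)  # ground truth: message is false
        prior = np.mean(ratings) if ratings else signal
        ratings.append((1 - social_weight) * signal + social_weight * prior)
    return np.mean(ratings)

print("mixed:     ", mean_rating(False))  # stays near the private signal
print("segregated:", mean_rating(True))   # early bias drags later ratings up
```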

Online housing search and residential mobility
Joep Steegmans, Leiden University
In recent years, the internet has come to play an important role in housing search. This has generated novel user-generated data that can be used to investigate housing search behaviour. This project uses data from the largest digital housing platform in the Netherlands, Funda. The user-generated data include registered mouse clicks, pages opened, and so on. These novel data provide detailed information on housing search that until recently remained unobserved.
The study analyses municipal flows of mouse clicks made by individual housing platform users. In order to study buyer search, 10 terabytes of data are processed and analysed, which creates substantial computational challenges. More particularly, the study investigates the relationship between online search and real behaviour in the housing market by empirically testing whether online search data can be used to predict real residential mobility flows between municipalities. The first hypothesis is that virtual search flows between municipalities precede real residential moves. The second hypothesis is that the effect increases with the seriousness of the online platform users.
The research project provides important new insights into buyer search dynamics and decision making in the housing market. The study contributes to a better understanding of the role of housing platforms in housing search. The study’s findings are valuable with respect to policy design and spatial planning. Apart from that, the project stimulates the use of novel data sources for both academics and policy makers.
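A pandas sketch of the aggregation step, with invented records and column names: collapse clicks into municipality-to-municipality search flows and match them to registered moves one month later.

```python
# Toy click and move records (hypothetical structure).
import pandas as pd

clicks = pd.DataFrame({"home_mun": ["Utrecht", "Utrecht", "Leiden"],
                       "listing_mun": ["Amersfoort", "Amersfoort", "Den Haag"],
                       "month": [1, 1, 1]})
moves = pd.DataFrame({"origin_mun": ["Utrecht", "Leiden"],
                      "dest_mun": ["Amersfoort", "Den Haag"],
                      "month": [2, 2], "n_moves": [40, 25]})

flows = (clicks.groupby(["home_mun", "listing_mun", "month"])
               .size().rename("n_clicks").reset_index())
flows["month"] += 1  # test whether search precedes moves by one month

merged = flows.merge(moves, left_on=["home_mun", "listing_mun", "month"],
                     right_on=["origin_mun", "dest_mun", "month"])
print(merged[["home_mun", "listing_mun", "n_clicks", "n_moves"]])
```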

  • Chair: Boukje Cuelenaere, Centerdata
  • Predicting Attrition of LISS Panel Members using Machine Learning and Survey Responsiveness Data
    Isabel van den Heuvel, Eindhoven University of Technology (TU/e); Zhuozhao Zhan, TU/e; Seyit Höcük, Centerdata; Edwin van den Heuvel, TU/e; Joris Mulder, Centerdata
  • Data science challenges in smart surveys. Case studies on consumption, physical activity and travel
    Barry Schouten, Statistics Netherlands (CBS)
  • Effects of Survey Design on Response rate in Crime Surveys
    Jonas Klingwort, Statistics Netherlands (CBS)
  • Willingness and nonparticipation biases in data donation
    Bella Struminskaya, Utrecht University

Predicting Attrition of LISS Panel Members using Machine Learning and Survey Responsiveness Data
Isabel van den Heuvel, Eindhoven University of Technology (TU/e); Zhuozhao Zhan, TU/e; Seyit Höcük, Centerdata; Edwin van den Heuvel, TU/e; Joris Mulder, Centerdata
Background: Panel members that have been recruited for the LISS panel may stop responding to survey requests. This phenomenon is known as panel attrition. The attrition of panel members could lead to an imbalance in subgroup characteristics, making the panel non-representative and population estimates potentially biased. When the attrition moments of panel members can be predicted accurately, they can be approached and motivated to stay active in the panel. Previous studies have demonstrated that attrition is associated with various factors, but the prediction of attrition using survey responsiveness variables (i.e., paradata) has not been thoroughly investigated.
Methods: Attrition is being predicted for the LISS panel members who were active in the period from 2007 to 2019 using both socio-demographic variables and paradata. Statistical analysis was conducted with Cox proportional hazard model (screening variables associated with attrition), Random Forest (static prediction model using all variables), and the landmarking approach (dynamic prediction using survey responsiveness patterns).
Results: The percentage of attrition over the full period was determined at 68.5% [67.8; 69.2]. Many well-known socio-demographic variables were associated with attrition (e.g., sex, age, household size, income, occupation). The random forest data analysis demonstrated good performance (AUC: 83.9%; C-index 79.1%) and showed that the paradata was more important in predicting attrition than the socio-demographic variables. Prediction performance reduced substantially without paradata (AUC: 61.2%; C-index: 57.1%). Using only the survey responsiveness patterns of six-month windows, landmarking had a good prediction of attrition (AUC: 76.0%).
Conclusions: Our analysis shows that the use of paradata, and in particular the survey responsiveness patterns of panel members, in combination with machine learning techniques could predict the attrition of panel members accurately. Landmarking can be further optimized for the LISS panel to help retain panel members.

Data science challenges in smart surveys. Case studies on consumption, physical activity and travel
Barry Schouten, Statistics Netherlands (CBS)
Smart surveys add features of smart devices to surveys such as in-device processing, use of mobile device sensors, linkage to external sensor systems and data donation. They do so with the aim to ease respondent burden, to improve data quality and/or to enrich survey data. Smart features are especially promising for survey topics that are cognitively demanding, require detailed knowledge and recall, or for which questions provide weak proxies to the concepts of interest.
While smart surveys may be promising from a survey error perspective, their design and implementation pose new challenges. Respondents need to be engaged and need to trust statistical institutes in carefully handling data. Respondents need to understand and be able to perform the survey tasks. They also need to provide context to data being measured. The survey back-office IT and logistics become more demanding. And last, but not least, there is a strong reliance on advanced data extraction and machine learning methods to transform new forms of data to official statistics. The latter imply trade-offs in active and online learning and the role of respondents.
In the paper, we explain how computational science methods come in and what choices need to be made at the hand of three case studies. In household budget surveys text extraction and machine learning are used to classify scanned and/or digital receipts, and possibly donated bank transactions data. In physical activity survey trackers with motion sensors are used to predict type and intensity of activity. In travel surveys location data, possibly enriched with open points-of-interest data, are employed to derive travelled distances. Here, stop-detection, travel mode prediction and even prediction of travel purpose may come in. In all cases, data show deficiencies that automated procedures cannot overcome completely and respondents will need to assist.

Effects of Survey Design on Response rate in Crime Surveys
Jonas Klingwort, Statistics Netherlands (CBS)
The number of population surveys conducted is enormous and increasing, but response rates are declining across all data collection modes. With increasing nonresponse, the risk of a nonresponse bias increases resulting in biased survey estimates. The solution to avoid the missing data problem is not to have any. A well-designed survey and professional administration are required to approach this solution.
This work aims to quantify the effects of survey design features on the response rate in surveys. This is demonstrated using German crime surveys. Individual and independent studies dominate criminological survey research in Germany. This circumstance allows systematically studying the effects of different survey design features and their impact on the response rate in crime surveys.
A systematic literature review of German crime surveys between 2000-2022 was conducted, and 138 surveys were identified. Of those, 80 qualified to be eligible for analysis. Furthermore, the survey design features study year (2000-2022), target population (general and non-general), coverage area (local, national, regional), and data collection mode (CATI, CAWI, F2F, PAPI, Other) were collected. A meta-regression model was fitted to quantify the influence of the design features on the response rate.
Preliminary results show significant regression coefficients for most of the included design features, which indicate a linear relationship between predictor and response rate.
Such a model can be used for decision-making when (re-) designing a population survey and which design features to adjust if a high response rate is desired.

Willingness and nonparticipation biases in data donation
Bella Struminskaya, Utrecht University
Recent technological advances and technologies’ integration into people’s lives result in the continuous collection of data by organisations. The current European legislature (GDPR) allows individuals to request information about themselves from the gathering organisations and share it with researchers. Such rich data provides ample opportunities to study human behavior. For example, donation of geolocation history allows to study human mobility at unprecedently granular levels; donation social media data allows insights into individuals’ social networks; donation of fitness-tracking data allows insights into physical activity. The donated data is less susceptible to social desirability and recall biases than self-report and when combined with in-the-moment questi

  • Chair: Chang Sun, Maastricht University
  • Guess what I am doing: Identifying Physical Activities from Accelerometer data through Machine Learning and Deep Learning
    Joris Mulder, Centerdata – Tilburg University
  • Who suffers most from problematic debts? Evidence from health insurance defaults
    Mark Kattenberg, Anne-Fleur Roos, Jurre Thiel, Centraal Planbureau
  • Harnessing heterogeneity in behavioural research using computational social science
    Giuseppe Alessandro Veltri, University of Trento
  • Latent class analysis with distal outcomes: Two modified three-step methods using propensity scores
    Tra Le, Felix Clouth, Jeroen Vermunt, Tilburg University

Guess what I am doing: Identifying Physical Activities from Accelerometer data through Machine Learning and Deep Learning
Joris Mulder, Centerdata – Tilburg University
Accelerometers and actigraphs are now used in many research projects, providing highly detailed, objectively measured sensor data on physical activity. Where self-reported data might miss everyday activities (e.g. walking to the shop, climbing stairs), accelerometer data provide a more complete picture of physical activity. The primary objective of this research is to identify specific activity patterns in the accelerometer data using machine learning and deep learning techniques. The secondary objective is to improve the accuracy of this identification by validating the detected activities against time-use and survey data.
Activity data were collected through a large-scale accelerometer study in the probability-based LISS panel, which consists of approximately 7,500 panel members from 5,000 households. In total, 1,200 respondents participated in the study and wore an accelerometer for eight days, measuring physical activity around the clock. A diverse group of 20 people labeled specific activity patterns in a controlled setting by wearing the device and performing the activities. The labeled data were used to train supervised machine-learning models, and a deep learning model was trained to enhance the detection of the activities. Moreover, 450 respondents from the accelerometer study also participated in a time-use study in the LISS panel, reporting their daily activities on a smartphone using a time-use app. The reported time-use activities were used to validate the activities detected by the deep learning model.
We show that machine learning and deep learning models can successfully identify specific types of activity from an accelerometer signal and that the results can be validated with time-use data. Patterns of specific activities (i.e. sleeping, sitting, walking, cycling, jogging, driving) were successfully identified, and the deep learning model increased the predictive power to better distinguish between specific activities. The time-use data proved useful for further validating certain hard-to-identify activities (i.e. cycling).
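
The abstract describes the pipeline only at a high level; the minimal sketch below illustrates the general supervised approach it outlines: segment the tri-axial signal into fixed windows, extract summary features per window, and train a classifier on the labeled windows. The synthetic data and all names are illustrative assumptions, not the LISS study's actual pipeline or feature set.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for labeled tri-axial accelerometer data:
# 1,000 windows of 5 s at 50 Hz (250 samples), one activity label each.
n_windows, n_samples = 1000, 250
signal = rng.normal(size=(n_windows, n_samples, 3))
labels = rng.choice(["sitting", "walking", "cycling"], size=n_windows)

def window_features(w):
    """Per-axis summary statistics for one window."""
    return np.concatenate([
        w.mean(axis=0),                            # mean acceleration
        w.std(axis=0),                             # variability
        np.abs(np.diff(w, axis=0)).mean(axis=0),   # mean absolute jerk
    ])

X = np.array([window_features(w) for w in signal])
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.25,
                                          random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))
```

A deep learning model, as used in the study, would typically replace the hand-crafted features with a 1D convolutional network over the raw windows; the feature-based random forest above is the simpler baseline.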

Who suffers most from problematic debts? Evidence from health insurance defaults
Mark Kattenberg, Anne-Fleur Roos, Jurre Thiel, Centraal Planbureau
Qualitative evidence suggests that financial troubles cause or deepen mental health problems, but quantitative evidence is scarce. The available evidence suggests that problematic debts worsen mental health (Roos et al., 2021), but we do not know whether people are equally affected by problematic debts or whether some suffer more than others. Following Roos et al. (2021), we use nationwide individual-level panel data from the Netherlands for the years 2011-2015 to study the relationship between problematic debts and mental health, using a difference-in-differences approach with individual fixed effects for identification. To detect heterogeneity in the effect of problematic debts on mental health, we modify the causal forest algorithm developed by Athey et al. (2019) to incorporate time and individual fixed effects. Our results help policymakers identify which people suffer most mentally from problematic debts, which is important information when designing preventive policies aimed at reducing health care use or policies that should prevent debts from becoming problematic.
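
The authors' modified causal forest is not publicly documented, so no example of it is attempted here. The baseline design it builds on, a difference-in-differences regression with individual and time fixed effects, can be sketched as follows; the simulated panel, the variable names, and the effect size are illustrative assumptions only. The sketch uses the linearmodels package.

```python
import numpy as np
import pandas as pd
from linearmodels.panel import PanelOLS

rng = np.random.default_rng(1)

# Simulated panel: 200 individuals observed yearly, 2011-2015.
ids = np.repeat(np.arange(200), 5)
years = np.tile(np.arange(2011, 2016), 200)
ever_treated = rng.random(200) < 0.3                      # 30% ever have debts
debt = (ever_treated[ids] & (years >= 2013)).astype(int)  # debt onset in 2013

alpha = rng.normal(size=200)                              # individual effects
mhealth = alpha[ids] - 0.5 * debt + rng.normal(scale=1.0, size=ids.size)

df = pd.DataFrame({"pid": ids, "year": years,
                   "debt": debt, "mhealth": mhealth}).set_index(["pid", "year"])

# Two-way fixed-effects difference-in-differences regression.
res = PanelOLS.from_formula(
    "mhealth ~ debt + EntityEffects + TimeEffects", data=df
).fit(cov_type="clustered", cluster_entity=True)
print(res.params["debt"], res.std_errors["debt"])  # should recover about -0.5
```

A causal forest for heterogeneity would then be fitted on top of such a design, splitting on covariates to find subgroups with stronger or weaker debt effects.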

Harnessing heterogeneity in behavioural research using computational social science
Giuseppe Alessandro Veltri, University of Trento
Digital online platforms have extended experiments to large national and international samples, increasing the potential heterogeneity in responses to the treatments examined. Identifying and studying such heterogeneity is therefore crucial in online behavioural experiments, and new analytical techniques have emerged in computational social science to achieve this goal. We illustrate this with a study conducted during the COVID-19 pandemic that applies model-based recursive partitioning to data from an online experiment aimed at increasing vaccine willingness in eight European countries. A further valuable output of this approach is the identification of particular segments of the sample that merit closer examination. Identifying such 'local' models of the population is not a matter of chance: applied to independent variables covering socioeconomic and behavioural measures, the technique allows us to detect subgroups characterised by a shared socioeconomic or cognitive pattern, and such groups may well cut across traditional sociodemographic categories.
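
Model-based recursive partitioning is implemented in R (for example partykit's mob and glmtree); there is no canonical Python implementation. The sketch below is a deliberately simplified Python analogue under stated assumptions: it greedily searches for the covariate threshold at which separate within-child treatment-effect regressions fit best, whereas true model-based recursive partitioning uses formal parameter-instability tests. All data and names are synthetic and illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)

# Synthetic experiment: the treatment works only for high-"trust" respondents.
n = 2000
treat = rng.integers(0, 2, n).astype(float)
trust = rng.random(n)
age = rng.integers(18, 80, n).astype(float)
y = 0.4 * treat * (trust > 0.5) + rng.normal(scale=0.5, size=n)

def sse(t, y):
    """Residual sum of squares of a within-node model y ~ treatment."""
    m = LinearRegression().fit(t.reshape(-1, 1), y)
    r = y - m.predict(t.reshape(-1, 1))
    return float(r @ r)

def best_split(x, t, y, min_leaf=100):
    """Greedy search for the threshold whose child models jointly fit best."""
    best_sse, best_thr = np.inf, None
    for thr in np.quantile(x, np.linspace(0.1, 0.9, 17)):
        left = x <= thr
        if left.sum() < min_leaf or (~left).sum() < min_leaf:
            continue
        total = sse(t[left], y[left]) + sse(t[~left], y[~left])
        if total < best_sse:
            best_sse, best_thr = total, thr
    return best_sse, best_thr

print("root SSE:", round(sse(treat, y), 1))
for name, x in [("trust", trust), ("age", age)]:
    total, thr = best_split(x, treat, y)
    print(f"split on {name} at {thr:.2f}: child SSE {total:.1f}")
```

In this toy setup the split on trust reduces the fit criterion far more than the split on age, mirroring how the method surfaces behavioural rather than purely sociodemographic subgroups.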

Latent class analysis with distal outcomes: Two modified three-step methods using propensity scores
Tra Le, Felix Clouth, Jeroen Vermunt, Tilburg University
Bias-adjusted three-step latent class (LC) analysis has become a popular technique for estimating the relationship between class membership and a distal outcome. Because the classes are latent, class memberships are not random assignments, so confounding needs to be accounted for before drawing causal conclusions. Modern causal inference techniques using propensity scores have become increasingly popular. We propose two novel strategies that use propensity scores to estimate the causal effect of LC membership on an outcome variable, aiming to tackle the limitations of existing methods.
Both strategies modify the bias-adjusted three-step approach by using propensity scores in the last step to control for confounding. The first strategy includes inverse propensity weights (IPW) as fixed weights (the IPW strategy), whereas the second includes the propensity scores as control variables (the propensity-scores-as-covariates strategy). To avoid misspecifying the relationship between the propensity scores and the outcome variable, we use a flexible regression model with quadratic terms and interactions. In both strategies, classification errors are accounted for using the BCH method.
A simulation study was used to compare their performance with that of three existing approaches. We varied the sample size, effect size, confounding strength, and class separation for binary and continuous outcome variables, with 500 replications per condition. The results showed that our IPW strategy, although the most intuitive one, suffered from non-convergence (due to extreme weights for the binary outcome variable) and had the lowest efficiency. The propensity-scores-as-covariates strategy performed best: it estimated the causal effect with the lowest bias and was relatively efficient. We use data from Wave 14 of the LISS panel to demonstrate our methods. Specifically, using the Family and Household module, we investigate how different types of parent-child relationships affect perceived relationship quality, controlling for sociodemographic variables.
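
For readers unfamiliar with the two strategies, the sketch below shows the generic propensity-score machinery they modify: estimate scores with a logistic regression, then either use inverse propensity weights in a weighted outcome regression or enter the scores (with quadratic and interaction terms) as covariates. The BCH classification-error correction is omitted and an observed group stands in for the latent class, so this is only the final-step logic under simplifying assumptions; all names and data are illustrative.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)

# Synthetic data; "grp" is an observed stand-in for the latent class,
# confounded by x. The BCH classification-error step is omitted.
n = 5000
x = rng.normal(size=n)
grp = (rng.random(n) < 1 / (1 + np.exp(-x))).astype(int)
y = 1.0 * grp + 0.8 * x + rng.normal(size=n)   # true effect of grp is 1.0

# Propensity of group membership given the confounder.
ps = LogisticRegression().fit(x.reshape(-1, 1), grp).predict_proba(
    x.reshape(-1, 1))[:, 1]

# Strategy 1 (IPW): inverse propensity weights in a weighted regression.
w = np.where(grp == 1, 1 / ps, 1 / (1 - ps))
X1 = sm.add_constant(grp.astype(float))
print("IPW estimate:", sm.WLS(y, X1, weights=w).fit().params[1])

# Strategy 2 (scores as covariates): flexible regression with the score,
# its square, and its interaction with group membership.
X2 = sm.add_constant(np.column_stack([grp, ps, ps**2, grp * ps]))
fit2 = sm.OLS(y, X2).fit()
ate = fit2.params[1] + fit2.params[4] * ps.mean()   # average effect of grp
print("Covariate-adjustment estimate:", ate)
```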

  • Chair: Laura Boeschoten, Utrecht University
  • Mobile-tracked Visits to Millions of Places Reveal Socioeconomic Inequality of Daily Consumption in the United States
    Yuanmo He, Milena Tsvetkova, the London School of Economics and Political Science
  • Realtime User Ratings as a Strategy for Combatting Misinformation: An Experimental Study
    Jonas Stein, University of Groningen
  • Online housing search and residential mobility
    Joep Steegmans, Leiden University

Mobile-tracked Visits to Millions of Places Reveal Socioeconomic Inequality of Daily Consumption in the United States
Yuanmo He, Milena Tsvetkova, the London School of Economics and Political Science
An important aspect of socioeconomic inequality is the difference in daily consumption practices by socioeconomic status (SES). This difference is not only a manifestation of inequality but also a trigger for further inequality in other life outcomes. For example, constrained by availability and price, people of low SES tend to shop at low-price supermarkets and consume unhealthy food and beverages, which can contribute to later health problems. Differential daily consumption patterns result from both economic constraints and social processes. The sociologists Veblen and Bourdieu suggest that people use consumption behaviour to mark their SES, and that people of similar SES tend to have similar consumption preferences. Empirical evidence also shows that lifestyle choices can become correlated with demographic characteristics through homophily and social influence. We therefore hypothesize that SES is associated with distinct consumption preferences for consumer brands, but that these preferences do not necessarily correspond to the economic constraints implied by product prices.
To test these hypotheses, we combine data from SafeGraph, Yelp, and the US Census. Linking SafeGraph and Census data, we obtain the income distribution of each brand's visitors from the median income of their home census block groups. We use Yelp's dollar-sign ratings as an indicator of the brands' price levels. Comparing a brand's price level with the income distribution of its visitors, we can identify outliers that indicate unexpected lifestyle correlations. Based on the existing literature, we expect to identify lifestyle groups that exhibit patterns of conspicuous consumption (low-SES people visiting high-SES brands), inconspicuous consumption (high-SES people visiting low- or middle-SES brands), and omnivorousness (high-SES people having more diverse consumption practices than low-SES people). The study adds valuable descriptive detail to our understanding of socioeconomic inequality in daily consumption and provides behavioral evidence for the arbitrary correlation of consumer and cultural preferences.
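
As a minimal illustration of the linkage logic, assuming toy stand-ins for the three sources (the schemas below are invented, not SafeGraph's, Yelp's, or the Census Bureau's actual ones), one can compute a visit-weighted visitor income per brand and contrast it with the brand's price level:

```python
import pandas as pd

# Toy stand-ins for the three linked sources (schemas invented).
visits = pd.DataFrame({
    "brand": ["LuxMart", "LuxMart", "BudgetMart", "BudgetMart", "BudgetMart"],
    "home_cbg": ["A", "B", "A", "C", "C"],      # visitors' home block groups
    "n_visits": [120, 40, 300, 500, 200],
})
census = pd.DataFrame({"home_cbg": ["A", "B", "C"],
                       "median_income": [95_000, 60_000, 32_000]})
yelp = pd.DataFrame({"brand": ["LuxMart", "BudgetMart"],
                     "price_level": [3, 1]})    # Yelp dollar signs

# Visit-weighted median income of each brand's visitors.
df = visits.merge(census, on="home_cbg")
df["wx"] = df["n_visits"] * df["median_income"]
g = df.groupby("brand")[["wx", "n_visits"]].sum().reset_index()
g["visitor_income"] = g["wx"] / g["n_visits"]

# Contrast visitor income with the brand's price level.
out = g.merge(yelp, on="brand")
out["income_rank"] = out["visitor_income"].rank()
out["price_rank"] = out["price_level"].rank()
print(out[["brand", "visitor_income", "price_level",
           "income_rank", "price_rank"]])
```

Brands whose visitor-income rank and price rank disagree strongly would be the candidate cases of conspicuous or inconspicuous consumption described above.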

Realtime User Ratings as a Strategy for Combatting Misinformation: An Experimental Study
Jonas Stein, University of Groningen
Fact-checking takes time. As a consequence, verdicts are usually reached after a message has already gone viral, and late-stage interventions can have only limited effect. An emergent approach (e.g. Twitter's Birdwatch) is to harness the wisdom of the crowd by enabling recipients of an online message to attach veracity assessments to it, with the intention that poor initial crowd reception tempers belief in, and further spread of, misinformation. We study this approach by letting 4,000 subjects in 80 experimental bipartisan communities sequentially rate the veracity of informational messages. We find that in well-mixed communities, the public display of earlier veracity ratings indeed enhances the correct classification of true and false messages by subsequent users. But when false information is sequentially rated in strongly segregated communities, crowd intelligence backfires: early raters' ideological bias, when aligned with a message, pulls later raters' assessments away from the truth. These results identify an important problem for community misinformation-detection systems and suggest that platforms must somehow compensate for the deleterious effect of echo chambers in their design.
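
The mechanism described, early partisan ratings pulling later assessments away from the truth, can be illustrated with a toy simulation. This is not the authors' experimental design; the parameter values, the linear mixing of private signals with the running mean of earlier ratings, and the ordering used to represent segregation are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

def final_rating(order, truth=-1.0, bias=1.0, w=0.6, noise=0.5):
    """Sequentially rate one false message (true value -1 on a [-1, 1] scale).

    order: rater ideologies in rating order (+1 aligned with the message's
    slant, -1 opposed). Each rating mixes a private, ideologically biased
    signal with the mean of the earlier, publicly displayed ratings.
    """
    ratings = []
    for party in order:
        signal = truth + bias * party + rng.normal(scale=noise)
        rating = signal if not ratings else (1 - w) * signal + w * np.mean(ratings)
        ratings.append(rating)
    return np.mean(ratings)

n, reps = 200, 300
parties = np.array([1] * (n // 2) + [-1] * (n // 2))
mixed = np.mean([final_rating(rng.permutation(parties)) for _ in range(reps)])
segregated = np.mean([final_rating(np.sort(parties)[::-1]) for _ in range(reps)])
print(f"mean rating, well-mixed order: {mixed:.2f}")          # near the truth
print(f"mean rating, aligned raters first: {segregated:.2f}")  # pulled upward
```

In this toy model the well-mixed ordering yields mean ratings near the true value, while letting the aligned group rate first drags the community average toward the early raters' bias, mirroring the backfire effect reported in the abstract.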

Online housing search and residential mobility
Joep Steegmans, Leiden University
In recent years, the internet has come to play an important role in housing search, generating novel user-generated data that can be used to investigate search behaviour. This project uses data from the largest digital housing platform in the Netherlands: Funda. The user-generated data include registered mouse clicks, page views, and similar interactions, providing detailed information on housing search that until recently remained unobserved.
The study analyses municipal flows of mouse clicks made by individual housing platform users. To study buyer search, 10 terabytes of data are processed and analysed, which creates substantial computational challenges. More specifically, the study investigates the relationship between online search and real behaviour in the housing market by empirically testing whether online search data can predict real residential mobility flows between municipalities. The first hypothesis is that virtual search flows between municipalities precede real residential moves. The second hypothesis is that the effect is stronger for more serious platform users.
The research project provides important new insights into buyer search dynamics and decision making in the housing market. The study contributes to a better understanding of the role of housing platforms in housing search. The study’s findings are valuable with respect to policy design and spatial planning. Apart from that, the project stimulates the use of novel data sources for both academics and policy makers.
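
The first hypothesis could, in its most minimal form, be tested by regressing realized moves between municipality pairs on lagged click flows. The sketch below does this with a Poisson regression on invented flow data; the variable names, the lag structure, and the data are illustrative assumptions, not the project's actual 10-terabyte pipeline.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(5)

# Toy flow table: one row per (origin, destination) municipality pair.
n_pairs = 400
clicks_lag = rng.poisson(50, n_pairs)                 # search flow at t-1
moves = rng.poisson(np.exp(0.5 + 0.02 * clicks_lag))  # realized moves at t
df = pd.DataFrame({"clicks_lag": clicks_lag, "moves": moves})

# Poisson regression of realized moves on lagged click flows; a positive
# clicks_lag coefficient is consistent with the first hypothesis.
X = sm.add_constant(df["clicks_lag"])
fit = sm.GLM(df["moves"], X, family=sm.families.Poisson()).fit()
print(fit.params)
```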

  • Chair: Boukje Cuelenaere, Centerdata
  • Predicting Attrition of LISS Panel Members using Machine Learning and Survey Responsiveness Data
    Isabel van den Heuvel, Eindhoven University of Technology (TU/e); Zhuozhao Zhan, TU/e; Seyit Höcük, Centerdata; Edwin van den Heuvel, TU/e; Joris Mulder, Centerdata
  • Data science challenges in smart surveys. Case studies on consumption, physical activity and travel
    Barry Schouten, Statistics Netherlands (CBS)
  • Effects of Survey Design on Response rate in Crime Surveys
    Jonas Klingwort, Statistics Netherlands (CBS)
  • Willingness and nonparticipation biases in data donation
    Bella Struminskaya, Utrecht University

Predicting Attrition of LISS Panel Members using Machine Learning and Survey Responsiveness Data
Isabel van den Heuvel, Eindhoven University of Technology (TU/e); Zhuozhao Zhan, TU/e; Seyit Höcük, Centerdata; Edwin van den Heuvel, TU/e; Joris Mulder, Centerdata
Background: Panel members who have been recruited for the LISS panel may stop responding to survey requests, a phenomenon known as panel attrition. Attrition can lead to an imbalance in subgroup characteristics, making the panel non-representative and population estimates potentially biased. If the moments at which panel members are likely to drop out can be predicted accurately, those members can be approached and motivated to stay active in the panel. Previous studies have shown that attrition is associated with various factors, but prediction of attrition from survey responsiveness variables (i.e., paradata) has not been thoroughly investigated.
Methods: Attrition is predicted for LISS panel members who were active between 2007 and 2019, using both socio-demographic variables and paradata. Statistical analyses were conducted with a Cox proportional hazards model (screening variables associated with attrition), a random forest (a static prediction model using all variables), and the landmarking approach (dynamic prediction using survey responsiveness patterns).
Results: Attrition over the full period was 68.5% [67.8; 69.2]. Many well-known socio-demographic variables were associated with attrition (e.g., sex, age, household size, income, occupation). The random forest demonstrated good performance (AUC: 83.9%; C-index: 79.1%) and showed that the paradata were more important for predicting attrition than the socio-demographic variables; prediction performance dropped substantially without paradata (AUC: 61.2%; C-index: 57.1%). Using only the survey responsiveness patterns of six-month windows, landmarking also predicted attrition well (AUC: 76.0%).
Conclusions: Our analysis shows that paradata, and in particular the survey responsiveness patterns of panel members, combined with machine learning techniques, can predict the attrition of panel members accurately. Landmarking can be further optimized for the LISS panel to help retain panel members.
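
The static random-forest step can be sketched as follows; the synthetic members, the paradata-style features, and the dependence of attrition on responsiveness are illustrative assumptions chosen to mimic the pattern reported above, not the LISS data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(6)

# Synthetic panel members: socio-demographics plus paradata-style features.
n = 5000
age = rng.integers(18, 85, n)
hh_size = rng.integers(1, 6, n)
resp_rate_6m = rng.beta(5, 2, n)          # share of invitations answered
days_since_last = rng.exponential(20, n)  # days since last completed survey

# Attrition made to depend mostly on responsiveness, as the study found.
logit = -2 + 3 * (1 - resp_rate_6m) + 0.02 * days_since_last
attrit = rng.random(n) < 1 / (1 + np.exp(-logit))

X = np.column_stack([age, hh_size, resp_rate_6m, days_since_last])
X_tr, X_te, y_tr, y_te = train_test_split(X, attrit, random_state=0)

clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
print("AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
print("importances (age, hh_size, resp_rate_6m, days_since_last):",
      clf.feature_importances_.round(2))
```

The landmarking approach would instead refit such a model at successive time points, each time using only the responsiveness pattern observed in the preceding six-month window.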

Data science challenges in smart surveys. Case studies on consumption, physical activity and travel
Barry Schouten, Statistics Netherlands (CBS)
Smart surveys add smart-device features to surveys, such as in-device processing, use of mobile device sensors, linkage to external sensor systems, and data donation. They do so with the aim of easing respondent burden, improving data quality and/or enriching survey data. Smart features are especially promising for survey topics that are cognitively demanding, require detailed knowledge and recall, or for which questions provide only weak proxies of the concepts of interest.
While smart surveys may be promising from a survey-error perspective, their design and implementation pose new challenges. Respondents need to be engaged and need to trust statistical institutes to handle their data carefully. They need to understand and be able to perform the survey tasks, and they also need to provide context to the data being measured. The survey back-office IT and logistics become more demanding. And last but not least, there is a strong reliance on advanced data extraction and machine learning methods to transform new forms of data into official statistics, which implies trade-offs in active and online learning and in the role given to respondents.
In the paper, we explain how computational science methods come in and what choices need to be made, on the basis of three case studies. In household budget surveys, text extraction and machine learning are used to classify scanned and/or digital receipts, and possibly donated bank transaction data. In physical activity surveys, trackers with motion sensors are used to predict the type and intensity of activity. In travel surveys, location data, possibly enriched with open points-of-interest data, are employed to derive travelled distances; here, stop detection, travel-mode prediction and even prediction of travel purpose come in. In all cases, the data show deficiencies that automated procedures cannot completely overcome, and respondents will need to assist.
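
As one concrete example of the travel-survey processing mentioned above, a naive stop detector can be written in a few lines: a stop is a maximal run of location points that stays within a small radius for a minimum duration. The thresholds and the toy trace below are illustrative assumptions, not CBS's production method.

```python
import numpy as np

def detect_stops(lat, lon, t, max_radius_m=150, min_duration_s=300):
    """Naive stop detection: a stop is a maximal run of points staying
    within max_radius_m of its first point for at least min_duration_s."""
    R = 6_371_000  # Earth radius in metres

    def dist(i, j):  # equirectangular approximation, fine at city scale
        x = np.radians(lon[j] - lon[i]) * np.cos(np.radians((lat[i] + lat[j]) / 2))
        y = np.radians(lat[j] - lat[i])
        return R * np.hypot(x, y)

    stops, i = [], 0
    while i < len(t):
        j = i
        while j + 1 < len(t) and dist(i, j + 1) <= max_radius_m:
            j += 1
        if t[j] - t[i] >= min_duration_s:
            stops.append((t[i], t[j], lat[i], lon[i]))
        i = j + 1
    return stops

# Tiny synthetic trace: 10 minutes stationary, then steady movement.
t = np.arange(0, 1200, 60.0)
lat = np.concatenate([np.full(10, 52.0900),
                      52.0900 + np.cumsum(np.full(10, 0.003))])
lon = np.full(20, 5.1200)
print(detect_stops(lat, lon, t))
```

Detected stops would then feed travel-mode and purpose prediction, with respondents asked to confirm or correct ambiguous segments, as the abstract argues.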

Effects of Survey Design on Response rate in Crime Surveys
Jonas Klingwort, Statistics Netherlands (CBS)
The number of population surveys conducted is enormous and increasing, but response rates are declining across all data collection modes. With increasing nonresponse, the risk of nonresponse bias grows, resulting in biased survey estimates. The ideal solution to the missing-data problem is not to have any missing data; a well-designed survey and professional administration are required to approach this ideal.
This work aims to quantify the effects of survey design features on the response rate, demonstrated using German crime surveys. Criminological survey research in Germany is dominated by individual, independent studies, a circumstance that makes it possible to study systematically how different survey design features affect the response rate in crime surveys.
A systematic literature review of German crime surveys conducted between 2000 and 2022 identified 138 surveys, of which 80 were eligible for analysis. For each survey, the design features study year (2000-2022), target population (general versus non-general), coverage area (local, regional, national), and data collection mode (CATI, CAWI, F2F, PAPI, other) were recorded. A meta-regression model was fitted to quantify the influence of these design features on the response rate.
Preliminary results show significant regression coefficients for most of the included design features, indicating a linear relationship between these predictors and the response rate.
Such a model can support decision-making when (re-)designing a population survey by indicating which design features to adjust if a high response rate is desired.
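
A meta-regression of this kind can be sketched as follows: each survey contributes one observation, and response rates are regressed on the design features, weighted so that larger (more precisely estimated) surveys count more. The toy data, the use of sample size as a stand-in for inverse-variance weights, and all names are illustrative assumptions.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)

# Toy stand-in for the 80 eligible surveys (values illustrative only).
n = 80
df = pd.DataFrame({
    "response_rate": rng.uniform(0.2, 0.8, n),
    "year": rng.integers(2000, 2023, n),
    "mode": rng.choice(["CATI", "CAWI", "F2F", "PAPI"], n),
    "coverage": rng.choice(["local", "regional", "national"], n),
    "n_sampled": rng.integers(500, 10_000, n),
})

# Weighted meta-regression: larger surveys yield more precise response-rate
# estimates, so weight by sample size as a simple precision proxy.
fit = smf.wls("response_rate ~ year + C(mode) + C(coverage)",
              data=df, weights=df["n_sampled"]).fit()
print(fit.summary().tables[1])
```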

Willingness and nonparticipation biases in data donation
Bella Struminskaya, Utrecht University
Recent technological advances and the integration of technology into people's lives result in the continuous collection of data by organisations. Current European legislation (the GDPR) allows individuals to request information about themselves from the organisations that gather it and to share that information with researchers. Such rich data provide ample opportunities to study human behavior. For example, donated geolocation histories make it possible to study human mobility at unprecedentedly granular levels; donated social media data give insights into individuals' social networks; and donated fitness-tracking data give insights into physical activity. Donated data are less susceptible to social desirability and recall biases than self-reports and can be combined with in-the-moment questions.