ODISSEI Conference for Social Science in the Netherlands 2022

This conference seeks to bring together a community of computational social scientists to discuss data, methods, infrastructure, ethics and theoretical work related to digital and computational approaches in social science research. ODISSEI, the research infrastructure for social science in the Netherlands, connects researchers with the necessary data, expertise and resources to conduct ground-breaking research and embrace the computational turn in social enquiry.

Conference registration: Registration for the conference has closed.
Conference date: 3 November 2022
Location: Media Plaza (Jaarbeurs), Utrecht, the Netherlands
Contact: communications@odissei-data.nl
Please note: registration is free, but as there is a limit to the location’s capacity, please let us know as soon as possible if you have to cancel your registration.

Livestream

You can find a livestream of the plenary room (‘Progress’) here.

Programme

Find the abstracts and full overview in the programme below.
A PDF with an overview of the programme can be found here.

Floor plan

You can download the floor plan of the location here.


Professor Frauke Kreuter is co-director of the Social Data Science Center and faculty member in the Joint Program in Survey Methodology (JPSM) at the University of Maryland, USA; Professor of Statistics and Data Science at the Ludwig-Maximilians-University of Munich, Germany; and head of the statistical methods group at the Institute for Employment Research (IAB) in Nuremberg, Germany. She is co-editor of Big Data and Social Science: Data Science Methods and Tools for Research and Practice (CRC Press, Second Edition 2021).

Chair: Tom Emery (ODISSEI)

Room: Progress

Abstract

Opportunities and challenges involved in combining (big and small) data sources
Frauke Kreuter, University of Maryland, University of Munich

New (often big) data sources offer enormous potential for exploring and solving complex societal challenges. Many agencies hope to use administrative data to optimize bureaucratic processes and reduce errors in human decision-making. Others hope to use digital data traces, derived from smartphones or IoT devices, to learn more about human behavior and interactions without increasing response burden. Unfortunately, the data-generating processes are embedded in a social and economic context, which is often ignored when data are collected, shared or used downstream. Among other issues, there is growing concern about the lack of fairness, equity and diversity. This talk outlines recent developments in the use of different data products in economic and social research. Frauke Kreuter explains the shortcomings in their application and how science can come to grips with issues of data quality, ethics and privacy without compromising the ability to reproduce and reuse the data. The talk also outlines the essential conditions for a successful and fair use of AI.

Professor Frauke Kreuter holds the LMU chair of Statistics and Data Science in the Social Sciences and Humanities and is co-director of the Data Science Centers at the University of Maryland and the University of Munich.

10.30-12.00 – Parallel Session 1

  • Chair: Chang Sun, Maastricht University
  • Guess what I am doing: Identifying Physical Activities from Accelerometer data through Machine Learning and Deep Learning
    Joris Mulder, Centerdata – Tilburg University
  • Who suffers most from problematic debts? Evidence from health insurance defaults
    Mark Kattenberg, Anne-Fleur Roos, Jurre Thiel, Centraal Planbureau
  • Harnessing heterogeneity in behavioural research using computational social science
    Giuseppe Alessandro Veltri, University of Trento
  • Latent class analysis with distal outcomes: Two modified three-step methods using propensity scores
    Tra Le, Felix Clouth, Jeroen Vermunt, Tilburg University

Guess what I am doing: Identifying Physical Activities from Accelerometer data through Machine Learning and Deep Learning
Joris Mulder, Centerdata – Tilburg University
Nowadays, accelerometers or actigraphs are used in many research projects, providing highly detailed, objectively measured sensory data of physical activity. Where self-reported data might miss everyday life activities (e.g. walking to the shop, climbing stairs), accelerometer data provides a more complete picture of physical activity. The primary objective of this research is identifying specific activity patterns from the accelerometer data using machine learning and deep learning techniques. The secondary objective is improving the accuracy of identifying the specific activity patterns by validating activities through time-use data and survey data.
Activity data was collected through a large-scale accelerometer study in the probability-based LISS panel, consisting of approximately 7500 panel members from 5000 households. 1200 respondents participated in the study and wore an accelerometer for 8 days, measuring physical activity 24/7. A diverse group of 20 people labeled specific activity patterns in a controlled setting by wearing the device and performing the activities. The labeled data were used to train supervised machine-learning models. A deep learning model was trained to enhance the detection of the activities. Moreover, 450 respondents from the accelerometer study also participated in a time-use study in the LISS panel. Respondents reported their daily activities on a smartphone, using a time-use app. The reported time-use activities were used to validate the detected activities by the deep learning model.
We show that machine learning and deep learning models can successfully be used to identify specific types of activity from an accelerometer signal and can be validated with time-use data. Patterns of specific activities (i.e. sleeping, sitting, walking, cycling, jogging, driving) were successfully identified. The deep learning model increased the predictive power to better distinguish between specific activities. The time-use data proved useful for further validating certain hard-to-identify activities (i.e. cycling).
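For readers unfamiliar with this type of pipeline, a minimal Python sketch of the general approach (windowed summary features plus a supervised classifier) is given below. The dummy signal, window size and label set are illustrative assumptions, not the study's actual data or models.

```python
# Sketch: segment a tri-axial accelerometer signal into fixed-size windows,
# compute summary features per window, and train a classifier on labelled windows.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
signal = rng.normal(size=(60_000, 3))          # 3-axis signal (dummy data)
labels = rng.integers(0, 4, size=600)          # one activity label per window

def window_features(sig, window=100):
    """Mean, std and mean magnitude per axis for each window."""
    windows = sig.reshape(-1, window, sig.shape[1])
    magnitude = np.linalg.norm(windows, axis=2, keepdims=True)
    feats = np.concatenate([windows, magnitude], axis=2)
    return np.concatenate([feats.mean(axis=1), feats.std(axis=1)], axis=1)

X = window_features(signal)
X_train, X_test, y_train, y_test = train_test_split(X, labels, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)
print(f"held-out accuracy: {clf.score(X_test, y_test):.2f}")
```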

Who suffers most from problematic debts? Evidence from health insurance defaults
Mark Kattenberg, Anne-Fleur Roos, Jurre Thiel, Centraal Planbureau
Qualitative evidence suggests that financial troubles cause or deepen mental health problems, but quantitative evidence is scarce. The available evidence suggests that problematic debts increase mental health problems (Roos et al., 2021), but we do not know whether people are equally affected by problematic debts or whether some suffer more than others. Following Roos et al. (2021), we use nationwide individual-level panel data from the Netherlands for the years 2011-2015 to study the relationship between problematic debts and mental health, and use a difference-in-differences approach with individual fixed effects for identification. To detect heterogeneity in the effect of problematic debts on mental health, we modify the causal forest algorithm developed by Athey et al. (2019) to incorporate time and individual fixed effects. Our results help policymakers to identify which people suffer mentally the most from problematic debts, which is important information when designing preventive policies aimed at reducing health care costs or policies that should prevent debts from becoming problematic.
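The authors modify the causal forest of Athey et al. (2019); as a rough illustration of the general idea only (not their actual estimator), one could within-transform outcome and treatment to absorb individual and year fixed effects and then fit an off-the-shelf causal forest, for instance econml's CausalForestDML. All variable names, the simulated data and the demeaning shortcut are assumptions for the sketch.

```python
# Sketch: within-transformation to absorb fixed effects, then a causal forest
# to recover heterogeneous effects of problematic debt on mental health.
import numpy as np
import pandas as pd
from econml.dml import CausalForestDML
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
n = 5_000
df = pd.DataFrame({
    "person":  rng.integers(0, 500, n),
    "year":    rng.integers(2011, 2016, n),
    "debt":    rng.integers(0, 2, n).astype(float),    # problematic debt (0/1)
    "mhealth": rng.normal(size=n),                     # mental health outcome
    "age":     rng.integers(18, 80, n).astype(float),  # candidate effect modifier
})

def demean(s, by):
    # Subtract group means, keep the grand mean (within-transformation).
    return s - s.groupby(by).transform("mean") + s.mean()

for col in ["debt", "mhealth"]:  # absorb individual and year fixed effects
    df[col] = demean(df[col], df["person"])
    df[col] = demean(df[col], df["year"])

cf = CausalForestDML(model_y=RandomForestRegressor(), model_t=RandomForestRegressor())
cf.fit(Y=df["mhealth"], T=df["debt"], X=df[["age"]])
print(cf.effect(np.array([[25.0], [60.0]])))  # heterogeneous effects by age
```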

Harnessing heterogeneity in behavioural research using computational social science
Giuseppe Alessandro Veltri, University of Trento
Digital online platforms have extended experiments to large national and international samples, thus increasing the potential heterogeneity present in responses to the examined treatments. Therefore, identifying and studying such heterogeneity is crucial in online behavioural experiments. New analytical techniques have emerged in computational social science to achieve this goal. We will illustrate an example from a study conducted in the context of the COVID-19 pandemic, which applies model-based recursive partitioning to data from an online experiment aimed at increasing vaccine willingness in eight European countries. Another valuable result of this approach is the identification of particular segments of the sample under investigation that might merit further study. Identifying ‘local’ models of the population is not just a matter of chance. When applied to independent variables involving socioeconomic and behavioural measures, this technique allows us to detect subgroups characterised by a particular socioeconomic or cognitive pattern shared by that group. Such a group could very well cut across traditional sociodemographic categories.

Latent class analysis with distal outcomes: Two modified three-step methods using propensity scores
Tra Le, Felix Clouth, Jeroen Vermunt, Tilburg University
Bias-adjusted three-step latent class (LC) analysis has become a popular technique to estimate the relationship between class membership and a distal outcome. The classes are latent, so class memberships are not random assignments. Thus, confounding needs to be accounted for to draw causal conclusions. Modern causal inference techniques using propensity scores have become increasingly popular. We propose two novel strategies that make use of propensity scores to estimate the causal effect of LC membership on an outcome variable. They aim to tackle the limitations faced by existing methods.
Both strategies modify the bias-adjusted three-step approach by using propensity scores in the last step to control for confounding. The first strategy includes the IPW (inverse propensity weighting) as fixed weights (called IPW strategy) whereas the second strategy includes the propensity scores as control variables (called propensity scores as covariates strategy). To avoid misspecifying the relationship between the propensity scores and the outcome variable, we used a flexible regression model with quadratic terms and interactions. In both strategies, classification errors are accounted for using the BCH method.
A simulation study was used to compare their performance with three existing approaches. We varied the sample size, effect size, confounding strength, and class separation for binary and continuous outcome variables, with 500 replications per condition. The results showed that our IPW strategy, albeit the most logical one, had a non-convergence issue (due to extreme weights for the binary outcome variable) and the lowest efficiency. The propensity-scores-as-covariates strategy had the best performance: it estimated the causal effect with the lowest bias and was relatively efficient. We use data from Wave 14 of the LISS panel to demonstrate our methods. Specifically, using the module Family and Household, we investigate how different types of parent-child relationships affect perceived relationship quality, controlling for some sociodemographic variables.
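A stripped-down sketch of the final step of the two strategies, ignoring the latent-class machinery entirely (no three-step estimation, no BCH correction), might look as follows; the simulated variables are illustrative.

```python
# Sketch: estimate propensity scores, then either weight the outcome model by
# the inverse propensity (IPW strategy) or enter the score flexibly as a
# control (propensity-scores-as-covariates strategy).
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 2_000
confounder = rng.normal(size=n)
treated = rng.binomial(1, 1 / (1 + np.exp(-confounder)))  # stands in for class membership
outcome = 0.5 * treated + confounder + rng.normal(size=n)

ps = LogisticRegression().fit(confounder[:, None], treated).predict_proba(
    confounder[:, None])[:, 1]

# Strategy 1: inverse propensity weights as fixed weights in the outcome model.
w = np.where(treated == 1, 1 / ps, 1 / (1 - ps))
ipw_fit = sm.WLS(outcome, sm.add_constant(treated), weights=w).fit()

# Strategy 2: the propensity score entered flexibly as a control variable
# (the study additionally includes interactions with the treatment).
X = sm.add_constant(np.column_stack([treated, ps, ps ** 2]))
cov_fit = sm.OLS(outcome, X).fit()

print(ipw_fit.params[1], cov_fit.params[1])  # both near the true effect 0.5
```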

  • Chair: Laura Boeschoten, Utrecht University
  • Mobile-tracked Visits to Millions of Places Reveal Socioeconomic Inequality of Daily Consumption in the United States
    Yuanmo He, Milena Tsvetkova, the London School of Economics and Political Science
  • Realtime User Ratings as a Strategy for Combatting Misinformation: An Experimental Study
    Jonas Stein, University of Groningen
  • Online housing search and residential mobility
    Joep Steegmans, Leiden University

Mobile-tracked Visits to Millions of Places Reveal Socioeconomic Inequality of Daily Consumption in the United States
Yuanmo He, Milena Tsvetkova, the London School of Economics and Political Science
An important aspect of socioeconomic inequality is the difference in daily consumption practices by socioeconomic status (SES). This difference is not only a manifestation of inequality but also a trigger for further inequality in other life outcomes. For example, constrained by availability and price, people of low SES tend to go to low-price supermarkets and consume unhealthy food and beverages, which could contribute to later health problems. Differential daily consumption patterns result from both economic constraints and social processes. Sociologists Veblen and Bourdieu suggest that people use different consumption behaviour to distinguish their SES, and that people of similar SES tend to have similar consumption preferences. Empirical evidence also shows that lifestyle choices can become correlated with demographic characteristics due to homophily and social influence. Therefore, we hypothesize that SES is associated with different consumption preferences for consumer brands, but that these preferences do not necessarily correspond to economic constraints driven by product prices.
To test the hypotheses, we combine data from SafeGraph, Yelp, and US Census. Linking SafeGraph and Census data, we can obtain the distribution of brand visitors’ income from the median income of their home census block groups. We can also use Yelp’s dollar sign as an indicator of the brands’ price levels. Comparing the brands’ price levels with the income distribution of visitors, we can identify outliers that indicate unexpected lifestyle correlations. Based on existing literature, we expect to identify lifestyle groups that exhibit patterns of conspicuous consumption (low-SES people visiting high-SES brands), inconspicuous consumption (high-SES people visiting low- or middle-SES brands), and omnivorousness (high-SES people tend to have more diverse consumption practices than low-SES people). The study adds valuable descriptive detail to our understanding of the socioeconomic inequality in daily consumption and provides behavioral evidence for the arbitrary correlation of consumer and cultural preferences.

Realtime User Ratings as a Strategy for Combatting Misinformation: An Experimental Study
Jonas Stein, University of Groningen
Fact-checking takes time. As a consequence, verdicts are usually reached after a message has already gone viral, and late-stage interventions can have only limited effect. An emergent approach (e.g. Twitter’s Birdwatch) is to harness the wisdom of the crowd by enabling recipients of an online message on a social media platform to attach veracity assessments to it, with the intention of allowing poor initial crowd reception to temper belief in and further spread of misinformation. We study this approach by letting 4,000 subjects in 80 experimental bipartisan communities sequentially rate the veracity of informational messages. We find that in well-mixed communities, the public display of earlier veracity ratings indeed enhances the correct classification of true and false messages by subsequent users. But when false information is sequentially rated in strongly segregated communities, crowd intelligence backfires. This happens because early raters’ ideological bias, when aligned with a message, pulls later raters’ assessments away from the truth. These results identify an important problem for community misinformation detection systems and suggest that platforms must somehow compensate for the deleterious effects of echo chambers in their design.

Online housing search and residential mobility
Joep Steegmans, Leiden University
In recent years, the internet has come to play an important role in housing search. This has led to novel user-generated data that can be used to investigate housing search behaviour. This project uses data from the largest digital housing platform in the Netherlands: Funda. The user-generated data include registered mouse clicks, webpages being opened, etc. The novel data provide detailed information on housing search that until recently remained unobserved.
The study analyses municipal flows of mouse clicks made by individual housing platform users. In order to study buyer search, 10 terabytes of data are processed and analysed, which creates important computational challenges. More specifically, the study investigates the relationship between online search and real behaviour in the housing market by empirically testing whether online search data can be used to predict real residential mobility flows between municipalities. The first hypothesis is that virtual search flows between municipalities precede real residential moves. The second hypothesis is that the effect increases with the seriousness of the online platform users.
The research project provides important new insights into buyer search dynamics and decision making in the housing market. The study contributes to a better understanding of the role of housing platforms in housing search. The study’s findings are valuable with respect to policy design and spatial planning. Apart from that, the project stimulates the use of novel data sources for both academics and policy makers.

  • Chair: Boukje Cuelenaere, Centerdata
  • Predicting Attrition of LISS Panel Members using Machine Learning and Survey Responsiveness Data
    Isabel van den Heuvel, Eindhoven University of Technology (TU/e); Zhuozhao Zhan, TU/e; Seyit Höcük, Centerdata; Edwin van den Heuvel, TU/e; Joris Mulder, Centerdata
  • Data science challenges in smart surveys. Case studies on consumption, physical activity and travel
    Barry Schouten, Statistics Netherlands (CBS)
  • Effects of Survey Design on Response rate in Crime Surveys
    Jonas Klingwort, Statistics Netherlands (CBS)
  • Willingness and nonparticipation biases in data donation
    Bella Struminskaya, Utrecht University

Predicting Attrition of LISS Panel Members using Machine Learning and Survey Responsiveness Data
Isabel van den Heuvel, Eindhoven University of Technology (TU/e); Zhuozhao Zhan, TU/e; Seyit Höcük, Centerdata; Edwin van den Heuvel, TU/e; Joris Mulder, Centerdata
Background: Panel members that have been recruited for the LISS panel may stop responding to survey requests. This phenomenon is known as panel attrition. The attrition of panel members could lead to an imbalance in subgroup characteristics, making the panel non-representative and population estimates potentially biased. When the attrition moments of panel members can be predicted accurately, they can be approached and motivated to stay active in the panel. Previous studies have demonstrated that attrition is associated with various factors, but the prediction of attrition using survey responsiveness variables (i.e., paradata) has not been thoroughly investigated.
Methods: Attrition is being predicted for the LISS panel members who were active in the period from 2007 to 2019 using both socio-demographic variables and paradata. Statistical analysis was conducted with Cox proportional hazard model (screening variables associated with attrition), Random Forest (static prediction model using all variables), and the landmarking approach (dynamic prediction using survey responsiveness patterns).
Results: The percentage of attrition over the full period was determined at 68.5% [67.8; 69.2]. Many well-known socio-demographic variables were associated with attrition (e.g., sex, age, household size, income, occupation). The random forest data analysis demonstrated good performance (AUC: 83.9%; C-index 79.1%) and showed that the paradata was more important in predicting attrition than the socio-demographic variables. Prediction performance reduced substantially without paradata (AUC: 61.2%; C-index: 57.1%). Using only the survey responsiveness patterns of six-month windows, landmarking had a good prediction of attrition (AUC: 76.0%).
Conclusions: Our analysis shows that the use of paradata, and in particular the survey responsiveness patterns of panel members, in combination with machine learning techniques could predict the attrition of panel members accurately. Landmarking can be further optimized for the LISS panel to help retain panel members.
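For orientation, a minimal sketch of the screening step with a Cox proportional hazards model, using the lifelines library on invented panel data; the covariate names are placeholders for the study's socio-demographics and paradata.

```python
# Sketch: time-to-attrition model screening covariates associated with attrition.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(3)
n = 1_000
panel = pd.DataFrame({
    "months_active":    rng.exponential(36, n).round() + 1,  # time in the panel
    "attrited":         rng.integers(0, 2, n),               # 1 = left the panel
    "age":              rng.integers(18, 90, n),
    "nonresponse_rate": rng.uniform(0, 1, n),                # paradata covariate
})

cph = CoxPHFitter()
cph.fit(panel, duration_col="months_active", event_col="attrited")
cph.print_summary()  # hazard ratios screen covariates associated with attrition
```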

Data science challenges in smart surveys. Case studies on consumption, physical activity and travel
Barry Schouten, Statistics Netherlands (CBS)
Smart surveys add features of smart devices, such as in-device processing, use of mobile device sensors, linkage to external sensor systems and data donation, to surveys. They do so with the aim of easing respondent burden, improving data quality and/or enriching survey data. Smart features are especially promising for survey topics that are cognitively demanding, require detailed knowledge and recall, or for which questions provide weak proxies to the concepts of interest.
While smart surveys may be promising from a survey error perspective, their design and implementation pose new challenges. Respondents need to be engaged and need to trust statistical institutes in carefully handling data. Respondents need to understand and be able to perform the survey tasks. They also need to provide context to data being measured. The survey back-office IT and logistics become more demanding. And last, but not least, there is a strong reliance on advanced data extraction and machine learning methods to transform new forms of data to official statistics. The latter imply trade-offs in active and online learning and the role of respondents.
In the paper, we explain, on the basis of three case studies, how computational science methods come in and what choices need to be made. In household budget surveys, text extraction and machine learning are used to classify scanned and/or digital receipts, and possibly donated bank transaction data. In physical activity surveys, trackers with motion sensors are used to predict the type and intensity of activity. In travel surveys, location data, possibly enriched with open points-of-interest data, are employed to derive travelled distances. Here, stop detection, travel mode prediction and even prediction of travel purpose may come in. In all cases, the data show deficiencies that automated procedures cannot overcome completely, and respondents will need to assist.
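As one concrete example of the data science steps named above, a naive stop-detection routine over location fixes could look like the sketch below; the thresholds, coordinate system and column names are assumptions, and CBS's production pipeline is certainly more elaborate.

```python
# Sketch: flag a "stop" when consecutive location fixes stay within a small
# radius for a minimum dwell time.
import numpy as np
import pandas as pd

def detect_stops(fixes: pd.DataFrame, radius_m=100, min_dwell_s=300):
    """fixes: columns x, y (metres, e.g. RD coordinates) and t (seconds)."""
    stops, anchor = [], 0
    for i in range(1, len(fixes)):
        dist = np.hypot(fixes.x[i] - fixes.x[anchor], fixes.y[i] - fixes.y[anchor])
        if dist > radius_m:                      # left the candidate stop area
            dwell = fixes.t[i - 1] - fixes.t[anchor]
            if dwell >= min_dwell_s:
                stops.append((fixes.t[anchor], fixes.t[i - 1]))
            anchor = i
    return stops

fixes = pd.DataFrame({"x": [0] * 10 + [500] * 5, "y": [0] * 15,
                      "t": np.arange(15) * 60})
print(detect_stops(fixes))   # one nine-minute stop at the origin
```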

Effects of Survey Design on Response rate in Crime Surveys
Jonas Klingwort, Statistics Netherlands (CBS)
The number of population surveys conducted is enormous and increasing, but response rates are declining across all data collection modes. With increasing nonresponse, the risk of nonresponse bias increases, resulting in biased survey estimates. The solution to avoid the missing data problem is not to have any. A well-designed survey and professional administration are required to approach this solution.
This work aims to quantify the effects of survey design features on the response rate in surveys. This is demonstrated using German crime surveys. Individual and independent studies dominate criminological survey research in Germany. This circumstance allows systematically studying the effects of different survey design features and their impact on the response rate in crime surveys.
A systematic literature review of German crime surveys between 2000-2022 was conducted, and 138 surveys were identified. Of those, 80 were eligible for analysis. Furthermore, the survey design features study year (2000-2022), target population (general and non-general), coverage area (local, national, regional), and data collection mode (CATI, CAWI, F2F, PAPI, Other) were collected. A meta-regression model was fitted to quantify the influence of the design features on the response rate.
Preliminary results show significant regression coefficients for most of the included design features, which indicate a linear relationship between predictor and response rate.
Such a model can be used for decision-making when (re)designing a population survey, indicating which design features to adjust if a high response rate is desired.
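To make the method concrete, below is a toy meta-regression of response rates on design features, fitted as weighted least squares with weights proportional to survey sample size; the weighting choice and all data are illustrative assumptions, not the paper's exact specification.

```python
# Sketch: meta-regression of response rates on survey design features.
import pandas as pd
import statsmodels.formula.api as smf

surveys = pd.DataFrame({
    "response_rate": [0.42, 0.35, 0.61, 0.28, 0.55, 0.47, 0.33, 0.50],
    "year":          [2001, 2004, 2008, 2011, 2014, 2017, 2020, 2022],
    "mode":          ["PAPI", "CATI", "F2F", "CAWI", "F2F", "PAPI", "CAWI", "CATI"],
    "n":             [1200, 800, 2500, 1500, 3000, 900, 2000, 1100],  # precision weights
})

model = smf.wls("response_rate ~ year + C(mode)", data=surveys,
                weights=surveys["n"]).fit()
print(model.params)  # coefficients quantify each design feature's influence
```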

Willingness and nonparticipation biases in data donation
Bella Struminskaya, Utrecht University
Recent technological advances and technologies’ integration into people’s lives result in the continuous collection of data by organizations. Current European legislation (the GDPR) allows individuals to request information about themselves from the gathering organizations and share it with researchers. Such rich data provide ample opportunities to study human behavior. For example, donation of geolocation history allows us to study human mobility at unprecedentedly granular levels; donation of social media data allows insights into individuals’ social networks; donation of fitness-tracking data allows insights into physical activity. Donated data are less susceptible to social desirability and recall biases than self-reports and, when combined with in-the-moment questionnaires, allow addressing research questions about behavior and attitudes. However, critical conceptual challenges remain. Data sharing might be burdensome for participants (e.g., due to privacy concerns), potentially introducing selection bias. If those who donate data differ from those who do not in critical study outcomes, the research conclusions can be biased. We implemented a randomized experiment (2x2x2 design) in a Dutch online panel (CentERpanel) to study the mechanisms of willingness and consent to donate Google location history data. Smartphone owners were randomly assigned to the following conditions: (1) showing a visualization of data similar to what participants are asked to donate, to increase understanding of the process, vs. no visualization; (2) varying the incentive amount for donating the data; (3) checking the understanding of the donation request vs. no questions about the understanding of what is asked from the panelists. In this talk, we focus on the willingness to donate and the actual upload of the data extracted from the Google location history data donation package as outcomes. In addition, we focus on selection biases by comparing the characteristics of those who donate and those who do not, focusing on demographics, technological skills, and travel behavior.

  • Chair: Anne Kroon, University of Amsterdam
  • Towards Enforcement of the EU Digital Services Act on Content Monetization in Social Media: a Data-Driven Approach
    Moira Loechteken, Zain Ahmad, Thomas Übellacker, Ilinca Dumitras, Adriana Iamnitchi, Maastricht University
  • Criticizing the Executive after Media Capture: Evidence from Egypt
    Christopher Barrie, University of Edinburgh
  • Campaigning by Tweet: Evidence from Ireland
    Martijn Schoonvelde, University of Groningen
  • Open data in communication science: towards a successful implementation of the FAIR principles
    Rens Vliegenthart, Wageningen University & Research

Towards Enforcement of the EU Digital Services Act on Content Monetization in Social Media: a Data-Driven Approach
Moira Loechteken, Zain Ahmad, Thomas Übellacker, Ilinca Dumitras, Adriana Iamnitchi, Maastricht University
With the evolving nature of digital services and the growth of social influencers, the European Commission implemented the Digital Services Act to promote law enforcement on digital media. This includes the aim for transparency in content monetization to ensure that users properly disclose ads and that ads do not violate European advertising rules. However, how to enforce such regulations is not yet determined.
This work investigates the presence of undisclosed ads on two popular platforms for content monetization, Instagram and TikTok. Our analysis of over 50,000 Instagram posts and 26,000 TikTok posts reveals a large percentage of improperly disclosed and undisclosed ads on both platforms. We found several features that are predictive of ads. For example, posts including certain words related to product promotion, such as “code” or “bio”, are more likely to be ads. Improperly disclosed ads can be flagged by simple rules that the platforms can easily implement.
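A hedged sketch of the kind of simple, platform-implementable rule mentioned above; the keyword and hashtag lists are illustrative, not the study's feature set.

```python
# Sketch: flag posts whose captions contain promotion-related keywords but no
# disclosure hashtag.
import re

PROMO_WORDS = {"code", "bio", "discount", "link"}
DISCLOSURE = re.compile(r"#(ad|advertisement|sponsored|reclame)\b", re.I)

def flag_undisclosed_ad(caption: str) -> bool:
    tokens = set(re.findall(r"[a-z]+", caption.lower()))
    looks_like_ad = bool(tokens & PROMO_WORDS)
    return looks_like_ad and not DISCLOSURE.search(caption)

print(flag_undisclosed_ad("Use my code SAVE20, link in bio!"))  # True
print(flag_undisclosed_ad("Use my code SAVE20 #ad"))            # False
```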

Criticizing the Executive after Media Capture: Evidence from Egypt
Christopher Barrie, University of Edinburgh
What happens to media criticism after media capture? Developing real-time measures of the criticality of language in the largest existing dataset of Arabic news, we evaluate how political reporting changed in the aftermath of media capture in post-coup Egypt. Using an ALC word-embedding approach, we find that political reporting was demonstrably less critical of the political executive in the immediate aftermath of the coup. By capturing temporally granular change in political reportage following democratic reversal, our work points to new possibilities in the monitoring and measurement of media capture in authoritarian settings.

Campaigning by Tweet: Evidence from Ireland
Martijn Schoonvelde, University of Groningen
Twitter has become an essential tool for candidates seeking election. It provides candidates with a direct line of communication to outside audiences and allows them to build a public profile by posting strategically selected content. Taking the 2020 Irish General Election as our case, this study seeks to examine how Twitter is used by candidates to signal campaign effort and policy priorities. In a series of experiments, we first demonstrate that a supervised machine-learning approach that relies on transformer-based sentence embeddings using transfer learning can successfully capture differences in how candidates present themselves online. After classifying our full set of 122,383 tweets, we then test a series of pre-registered hypotheses. We show that more competitive candidates put more emphasis on policy content. Compared with candidates who ran previously, less experienced candidates prioritise electioneering tweets. Against our expectations, we fail to detect substantively meaningful differences between male and female candidates. Our results show how candidates employ multi-modal online communication strategies to build a public persona during electoral campaigns.
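A compact sketch of this classification approach: encode tweets with a pre-trained sentence-transformer and fit a simple classifier on the embeddings. The model name, example tweets and two-label scheme are illustrative assumptions; the study's architecture and label scheme may differ.

```python
# Sketch: sentence embeddings via transfer learning, then supervised classification.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

tweets = ["Canvassing in Dublin Bay North all afternoon!",   # electioneering
          "We will abolish the USC for low earners.",        # policy
          "Great turnout at the doors today, thank you!",    # electioneering
          "Our housing plan delivers 20,000 social homes."]  # policy
labels = ["electioneering", "policy", "electioneering", "policy"]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
X = encoder.encode(tweets)
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict(encoder.encode(["Knocking on doors in Cork South!"])))
```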

Open data in communication science: towards a successful implementation of the FAIR principles
Rens Vliegenthart, Wageningen University & Research
The communication science discipline has been steadily, yet slowly, starting to adopt open science practices. A recent overview by Bakker and colleagues (2021) demonstrates that there is a general belief among communication scholars that open research practices, including pre-registration, replication, open data and open access, are important for research quality. Still, they are far from widely adopted. While the majority of scholars participating in the survey reported having shared data at least once (64%), the modal response to the question of how frequently this is done is ‘occasionally’. In this paper, we present a systematic assessment of content analytical studies on Dutch (social) media in the past three decades. We find that very few studies make their data openly available. We contend that there are three related reasons that underlie this relatively infrequent use of what should be a common practice. First, the lack of a data sharing culture in the discipline offers relatively few incentives to individual scholars to make use of open data. Second, the nature of the data, and in particular the wide variety of different types of data and datasets (stemming from content analyses, surveys, but also qualitative research), makes it difficult to decide what needs to be open (and what is ethically acceptable to open). This problem has increased even further with the rise of computational communication science, where the tools and algorithms used to process and analyze data also require attention. And third, a lack of clear standards of what open data entails keeps scholars from making their data available to the wider research community. In this paper, we discuss ways to overcome those issues and in particular elaborate on how to implement FAIR (Findable, Accessible, Interoperable and Reusable) ways of making data publicly available.

  • Chair: Angelica Maineri, ODISSEI
  • The FAIR Expertise Hub: enacting the FAIR principles in the social sciences
    Angelica Maineri, ODISSEI, Erasmus University Rotterdam; Shuai Wang, VU Amsterdam; Elena Beretta, VU Amsterdam
  • Discovering Data Using the ODISSEI Portal
    Ricarda Braukmann, DANS-KNAW; the ODISSEI Portal Team
  • Total Error Sheets for Datasets – Interdisciplinary, community-centered development of CSS data documentation templates
    Leon Fröhling, GESIS
  • Research data acquisition in the social sciences: Opportunities, challenges, and solutions
    Marco Stam, Leiden University

The FAIR Expertise Hub: enacting the FAIR principles in the social sciences
Angelica Maineri, ODISSEI, Erasmus University Rotterdam; Shuai Wang, VU Amsterdam; Elena Beretta, VU Amsterdam
The FAIR Expertise Hub for the Social Sciences is being established to support data communities in the social sciences with improving their compliance with the FAIR data principles (Findable, Accessible, Interoperable, and Reusable). Since their first formal publication (Wilkinson et al., 2016), FAIR data principles have been widely adopted, globally and across different disciplines, becoming the gold standard for good data management practices. Although FAIR is flexible enough to allow each stakeholder/discipline to adapt the implementation strategy to their needs, there is still a significant amount of resources that are not FAIR, limiting interoperability and reusability within and across disciplines. This calls for a guideline with validated decisions and plans to achieve FAIRness, which is what our FAIR Expertise Hub is about.
In the presentation, we review the current status of FAIR in the Dutch social sciences by drawing on the literature and on insights gathered in conversations with the communities. We then outline how the FAIR Expertise Hub plans to tackle the problems and bridge the gaps. In particular, we will discuss the FAIR Implementation Profile (FIP), a tool to document and support convergence in FAIR implementation within and across disciplines. We argue that improving implementation strategies to strengthen the provision of FAIR data contributes to realising an open social science research infrastructure such as the one envisioned by ODISSEI.
References: Wilkinson, M., Dumontier, M., Aalbersberg, I. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data 3, 160018 (2016). https://doi.org/10.1038/sdata.2016.18

Discovering Data Using the ODISSEI Portal
Ricarda Braukmann, DANS-KNAW; the ODISSEI Portal team
Datasets are often scattered across different institutional, national, and international repositories. In the case of CBS microdata, the only public information about available datasets is documented in PDF files, organised thematically, accessible only via the CBS website. This fragmented landscape limits the findability of new data sources a researcher might not be accustomed to, and makes the process of selecting an appropriate data source time-consuming and inefficient.
The goal of the ODISSEI Portal is to enable and facilitate the discovery of social sciences datasets across various data providers in the Netherlands within a single interface. The Portal currently collects metadata from social science datasets available at DANS (the Dutch national centre of expertise and repository for research data), from the LISS archive, and from the microdata catalogue available at Statistics Netherlands (CBS). Metadata of these collections is harmonised and enriched through a knowledge graph, while the user interface allows users to search across all datasets available in these collections to find data relevant for their research.
The ODISSEI Portal, although still in an early development stage, has already been made publicly available (https://portal.odissei.nl/) in order to gather feedback from the community on its design features and functionalities. During the presentation, after a brief introduction, we will show the current version of the Portal to the ODISSEI community. A virtual “suggestion box” will be made available to all participants in the session.

Total Error Sheets for Datasets – Interdisciplinary, community-centered development of CSS data documentation templates
Leon Fröhling, GESIS
As researchers increasingly turn towards novel types of digital behavioral data collected from online platforms to study social phenomena, their characteristics, limitations, and potential biases need to be well understood and documented. While the research processes in the traditional social sciences are well established and researchers may rely on error frameworks and protocols for further guidance, this support is yet to be developed for the Computational Social Sciences.
We propose the Total Error Sheet for Datasets (TES-D) as a combination of the checklist-like templates for dataset documentation that exist in machine learning and the error frameworks that are available to survey researchers. The contribution of TES-D to the quality of CSS research is therefore twofold. First, the resulting documentation template and its questions invite researchers to critically reflect on their data collection process: if consulted before the actual data collection, TES-D helps to increase awareness of how design decisions in the research process could potentially distort the results, and if used afterwards, TES-D guides researchers through a critical reflection of the process and helps to identify biases that might exist in the dataset. The second contribution is the systematic documentation of errors and biases, allowing for their transparent communication to the research community and better-informed decisions on the re-use of TES-D-documented datasets.
The presentation will focus on the interdisciplinarity of our community-centered development process for the TES-D approach. TES-D combines the established best practices of two different disciplines into a single approach useful for a third: the CSS research community. We included different groups of users in the development and evaluation of TES-D, collecting valuable observations of their engagement with TES-D, as well as soliciting their feedback, to better tailor the approach to their needs. Further presentations and discussions of TES-D help us in this endeavor.

Research data acquisition in the social sciences: Opportunities, challenges, and solutions
Marco Stam, Leiden University
The evaluation of causal effects of social policies has greatly advanced due to the advent of computational and econometric techniques to exploit natural experiments. Yet evidence often remains scarce due to the many empirical challenges of such approaches. In addition to computational burdens, the largest hurdles are often posed by data demands. These demands can be amplified in interdisciplinary research, where highly detailed data are often required from institutions across multiple domains. Statistics Netherlands has facilitated a great leap forward in meeting such demands, yet many challenges remain. To advance research on these topics, we identify various opportunities, challenges, and solutions through which ODISSEI can expand upon existing research provisions.

  • Using OpenStreetMap and Survey Data to Study Interethnic Group Relations in Belgium: A Machine Learning Approach
    Daria Dementeva, KU Leuven
  • Perfectionistic Self Presentation in tweets
    Reshmi Gopalakrishna Pillai, University of Amsterdam
  • A Timely Matter: Using wearable devices to study time use in low-income contexts
    Pradeep Kumar, Centerdata
  • 4P’s and Creativity: Exploring Creativity Measurement Approaches using Computational Techniques
    Ranran Li, VU Amsterdam
  • Do all words mean the same? Exploring differences in inter- and intra-group representation of a far-right community online
    Maria Lucia Miotto, Tilburg University
  • Psychological Barriers to Take-Up of Healthcare and Child Support Benefits in the Netherlands
    Olaf Simonse, Leiden University
  • From the dark to the surface web: Scouting eBay for counterfeits
    Felix Soldner, GESIS – Leibniz Institute for the Social Science; Fabian Plum, Imperial College London; Bennett Kleinberg, University College London – Utrecht University; Shane Johnson, University College London
  • COVID-19 Conspiracy Theories and Lockdown Measures – A Multi-Country Analysis of Twitter Data
    Maria Stielow, Emily Straube, Joost Keiren, Maastricht University
  • A look at researcher’s degrees of freedom in text preprocessing
    Samareen Zubair, Tilburg University
  • “Nein” to joint European Debt. Oder Doch?
    Maksim Zubok, Nuffield College, University of Oxford

12.45-13.45 – Parallel Session 2

  • Chair: Lucas van der Meer, ODISSEI
  • Inference for data collected by citizen scientists
    Peter Lugtig, Utrecht University (UU); Erik-Jan van Kesteren, UU; Annemarie Timmers, CITO
  • Digital trace data collection through data donation at Centerdata
    Laura Boeschoten, Utrecht University
  • Mass online interactions: an oTree negotiation game in the LISS panel
    Marije Oudejans, Centerdata

Inference for data collected by citizen scientists
Peter Lugtig, Utrecht University (UU); Erik-Jan van Kesteren, UU; Annemarie Timmers, CITO
Citizen science projects in which volunteers collect data are increasingly popular due to their ability to engage the public with a scientific question. However, while large amounts of data can be collected by volunteers in a short amount of time, the scientific value of these data for inference about research questions is often hampered by several biases. In particular, volunteers may self-select the locations where they take observations, which results in highly selective geo-spatial observations in citizen science data. In this paper, we deal with geospatial sampling bias by enriching the volunteer-collected data with geographical covariates from registers.
As an example, we correct inferences in a large citizen science project for measuring the brightness of the night sky. We show that night sky brightness estimates change substantially after correction, and that the corrected inferences better represent an external satellite-derived measure of skyglow. We conclude that geospatial bias correction can greatly increase the scientific value of citizen science projects.
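A compact illustration of the correction logic: model the outcome on geographic covariates available for both the volunteer sites and the full territory, then average predictions over the full territory rather than over the self-selected sites. The data, covariates and selection mechanism below are invented.

```python
# Sketch: correcting geospatial self-selection with a register covariate.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
# Full grid of locations with a register covariate (e.g. population density);
# volunteers oversample dense (bright) areas.
density_grid = rng.uniform(0, 1, 10_000)
p_observed = 0.05 + 0.9 * density_grid                # self-selection probability
observed = rng.uniform(size=10_000) < p_observed
brightness = 2.0 + 3.0 * density_grid[observed] + rng.normal(0, 0.3, observed.sum())

naive = brightness.mean()                             # biased upward
model = LinearRegression().fit(density_grid[observed][:, None], brightness)
corrected = model.predict(density_grid[:, None]).mean()
print(f"naive {naive:.2f} vs corrected {corrected:.2f} (truth 3.50)")
```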

Digital trace data collection through data donation at Centerdata
Laura Boeschoten, Utrecht University
Digital traces left by citizens during the natural course of modern life hold an enormous potential for social-scientific discoveries, because they can measure aspects of our social life that are difficult or impossible to measure by more traditional means.
As of May 2018, any entity, public or private, that processes personal data of citizens of the European Union is legally obligated by the EU General Data Protection Regulation to provide that data to the data subject upon request, and in digital format. Most major private data processing entities, comprising social media platforms as well as smartphone systems, search engines, photo storage, e-mail, banks, energy providers, and online shops comply with this right to data access by providing so-called ‘Data Download Packages’ (DDPs) to the data subjects.
A workflow has been introduced that allows researchers to collect and analyse digital traces collected as Data Download Packages (DDPs). This workflow consists of the following steps: First, data subjects are recruited as respondents using standard survey sampling techniques. Next, respondents request their DDP, storing it locally on their own device. Stored DDPs can then be locally processed to extract relevant research variables, after which consent is requested from the respondent to send these derived variables to the researcher for analysis.
Recently, we developed software that allows the local extraction step of this workflow to take place. This software allows for the collection of digital trace data (e.g. Google Location Data, WhatsApp data) and is now integrated in the online CentERpanel and LISS panel. Because of this integration, we can control the sample of collected digital traces, and we can supplement the digital traces with questionnaires, helping us to investigate the reliability of those traces as well as to differentiate between participants who are willing to share their digital traces and those who are not.
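A minimal sketch of what the local extraction step can look like: open a DDP (a zip file on the respondent's own device), parse one file inside it, and retain only aggregate derived variables. The file name and JSON layout below are illustrative, not the exact structure of any platform's DDP.

```python
# Sketch: local, privacy-preserving extraction of derived variables from a DDP.
import json
import zipfile
from collections import Counter

def extract_visit_counts(ddp_path: str) -> dict:
    """Count location records per month; only this aggregate leaves the device."""
    with zipfile.ZipFile(ddp_path) as ddp:
        with ddp.open("Location History/Records.json") as f:
            records = json.load(f)
    months = Counter(r["timestamp"][:7] for r in records.get("locations", []))
    return dict(months)

# After the respondent reviews and consents, only the derived variables
# (e.g. {'2022-01': 310, '2022-02': 280}) are sent to the researcher.
```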

Mass online interactions: an oTree negotiation game in the LISS panel
Marije Oudejans, Centerdata
oTree is an open-source Python-based software platform used for online interactive behavioral research and experiments. An oTree game allows players to interact or negotiate with each other in an online laboratory environment. A representative panel offers a multitude of advantages over a (small-scale) laboratory or an online crowdsourcing platform (e.g. MTurk). For one, the LISS panel is a probability-based online panel representative of the Dutch household population, and the LISS Data Archive comprises an extensive database of background characteristics, which can be merged for more in-depth analyses or used to select specific population groups. Furthermore, panel members can be instructed before starting the experiment and receive technical support if needed.
Researchers from Centerdata and Tilburg University implemented a game in the LISS panel for a large-scale experiment on human behavior regarding coalition formation. Compose 2.0 aims to understand how human behavior affects scaling up horizontal collaboration in supply chains. For this project, a game was designed in which 600 LISS participants each represent a transport company. After oTree automatically forms 200 groups of three respondents, they need to collaborate to transport their load from A to B and negotiate on the distribution of the profit (which is actually paid out).
The main study was preceded by a pilot study. 744 panel members were invited to participate in the negotiation game. A third of the respondents committed to taking part in the experiment. About 75% of them actually started participating, and roughly three quarters of those completed the game. The percentage of players that could be grouped and formed a coalition was 60%. The rest either dropped out or were stranded because one of the other players departed while playing the game. The pilot study was successfully completed, and valuable lessons learned from it were used to optimize the main study in the LISS panel.
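For orientation, a minimal sketch of what an oTree app for a three-player negotiation game looks like (oTree 5 syntax); all class and field names are illustrative, not Compose 2.0's actual implementation.

```python
# Sketch: skeleton of an oTree app with automatic three-player grouping.
from otree.api import (
    BaseConstants, BaseSubsession, BaseGroup, BasePlayer, models, Page,
)

class C(BaseConstants):
    NAME_IN_URL = 'negotiation'
    PLAYERS_PER_GROUP = 3          # oTree forms the coalitions automatically
    NUM_ROUNDS = 1
    TOTAL_PROFIT = 90

class Subsession(BaseSubsession):
    pass

class Group(BaseGroup):
    agreed_split = models.IntegerField()   # agreed profit share

class Player(BasePlayer):
    offer = models.IntegerField(min=0, max=C.TOTAL_PROFIT)

class Negotiate(Page):
    form_model = 'player'
    form_fields = ['offer']

page_sequence = [Negotiate]
```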

  • Chair: Ran van den Boom, Statistics Netherlands, CBS
  • Putting the ‘long’ in longitudinal. Developing social science research infrastructure to add multigenerational family insights to contemporary research
    Richard Zijdeman, Rick J. Mourits, International Institute of Social History
  • Reflecting on the use of register data to explore multigenerational staying and migration histories: Experiences from Sweden
    Jonne Thomassen, University of Groningen
  • The Effect of Unemployment on Interregional Migration in The Netherlands
    Cindy Biesenbeek, De Nederlandsche Bank

Putting the ‘long’ in longitudinal. Developing social science research infrastructure to add multigenerational family insights to contemporary research
Richard Zijdeman, Rick J. Mourits, International Institute of Social History
Contrary to insights that have been defended for decades, sociological and historical studies increasingly show that health and resources are transmitted over not one but multiple generations. Frontier studies have shown that these effects go at least three generations back, and possibly further. By looking at life courses across multiple generations, we can answer questions such as: What is the effect of (grand)mothers’ labour market activity on women’s labour market outcomes today? What (branches of) families were over time able to acquire more social, economic and health resources than others? How does the spatial clustering or, conversely, dispersal of families relate to this?
We propose to push this research line even further by combining two large national databases, currently used in isolation, to study the causes and consequences of inequality. We research the feasibility of linking the contemporary System of Social-statistical Datasets (SSD) to historical demographic data made available by the HSNDB. In theory the possibilities are endless, as the Netherlands registered births, marriages, and deaths nationwide in the civil registry from 1811 onwards, and registered where and with whom persons lived in the population registers. To test whether this wealth of historical data can be made available to researchers, we match a sample of personal cards issued between 1880 and 1922, using information on an individual’s own birth date as well as those of his/her parents, plus available information on the marriage date and date of death.
Having tested a variety of linkage strategies, we find a matching strategy that proves to be robust against overlinking, while attaining a high retrieval rate (ca. 80%). This leads us to conclude that linkage of SSD and HSNDB data is feasible, and we believe that investment in the digitization and indexation of the personal cards allows for groundbreaking multi-generational research on the Netherlands.
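A toy illustration of a conservative linkage rule of this kind: link only on exact agreement of the key fields and discard cards with multiple register candidates, guarding against overlinking. Field names and records are invented.

```python
# Sketch: conservative record linkage with an anti-overlinking filter.
import pandas as pd

cards = pd.DataFrame({"birth_date": ["1885-03-02", "1890-07-15"],
                      "father": ["Jan de Vries", "Pieter Bakker"],
                      "mother": ["Anna Smit", "Maria Visser"]})
register = pd.DataFrame({"birth_date": ["1885-03-02", "1885-03-02", "1890-07-15"],
                         "father": ["Jan de Vries", "Jan de Vries", "Pieter Bakker"],
                         "mother": ["Anna Smit", "Anna Smit", "Maria Visser"],
                         "person_id": [101, 102, 103]})

keys = ["birth_date", "father", "mother"]
links = cards.merge(register, on=keys, how="left")
# Drop cards that match more than one register entry (potential overlinks).
unique = links.groupby(keys).filter(lambda g: len(g) == 1)
print(unique[keys + ["person_id"]])   # only the unambiguous link survives
```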

Reflecting on the use of register data to explore multigenerational staying and migration histories: Experiences from Sweden
Jonne Thomassen, University of Groningen
Administrative data – such as the population registers from Northern European countries and the Netherlands – are an extremely rich source of information for the social sciences. Using information about vital events and kin identifiers from population registers, a growing body of literature has started to investigate the linked life courses of family members. Many of these studies focus on recent cohorts and their inter- or intra-generational family ties, but few studies take a multigenerational approach that extends to three or more generations. We reflect on our previous experiences using Swedish register data to explore the role of multigenerational family roots – i.e. when ancestors have lived in the same geographic location generation upon generation – in the staying or migration behaviour of young adults today. Our work is one example of the increasing possibilities for studying complex interdependencies in people’s life courses due to the rapidly expanding availability of appropriate data. Nevertheless, we also experienced some limitations. By means of a reflection on these experiences, we aim to inform future research and data innovations that combine historical information with contemporary register data for the purpose of multigenerational scholarly research.

The Effect of Unemployment on Interregional Migration in The Netherlands
Cindy Biesenbeek, De Nederlandsche Bank
Using administrative data between 2006 and 2020, I analyze interregional migration in the Netherlands. In theory, individuals move out of regions with high unemployment rates, but most empirical research does not strongly support this prediction. Likewise, I only find a small effect of regional unemployment on interregional migration. Furthermore, I find that the unemployed are more mobile during the first three months of unemployment. In addition, my results suggest that renters in the private sector are much more mobile than homeowners or renters in the social housing sector. Finally, I find that commuters are much more likely to migrate, despite good infrastructure and relatively short distances in the Netherlands.

  • Chair: Paulina Pankowska, Utrecht University
  • Legal perspective on possible fairness measures – A legal discussion using the example of hiring decisions
    Maryam Amir Haeri, University of Twente
  • Change Detection of Land Use with Siamese Convolutional Neural Networks
    Marc Ponsen, Statistics Netherlands (CBS)
  • Can a generative language model predict what participants think in the future?
    Bennett Kleinberg, Tilburg University; Maximilian Mozes, University College London; Isabelle van der Vegt, Utrecht University

Legal perspective on possible fairness measures – A legal discussion using the example of hiring decisions
Maryam Amir Haeri, University of Twente
With the increasing use of AI in algorithmic decision-making (e.g. based on neural networks), the question arises how bias can be excluded or mitigated. There are some promising approaches, but many of them are based on a “fair” ground truth, while others are based on a subjective goal to be reached, which leads to the usual problem of how to define and compute “fairness”. The different functioning of algorithmic decision-making, in contrast to human decision-making, leads to a shift from a process-oriented to a result-oriented discrimination assessment. We argue that with such a shift, society needs to determine which kind of fairness is the right one to choose for which scenario. To understand the implications of such a determination, we explain the different kinds of fairness concepts that might be applicable to the specific application of hiring decisions, analyze their pros and cons with regard to the respective fairness interpretation, and evaluate them from a legal perspective (based on EU law).
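For concreteness, here is a small sketch of two result-oriented fairness measures often discussed in this setting, computed for a hypothetical hiring classifier on invented data: demographic parity (equal hiring rates across groups) and equality of opportunity (equal true positive rates).

```python
# Sketch: two result-oriented fairness measures for a hiring decision.
import numpy as np

group  = np.array([0, 0, 0, 0, 1, 1, 1, 1])   # protected attribute
hired  = np.array([1, 1, 0, 0, 1, 0, 0, 0])   # classifier decision
suited = np.array([1, 1, 0, 0, 1, 1, 0, 0])   # "ground truth" qualification

def demographic_parity_gap(hired, group):
    return abs(hired[group == 0].mean() - hired[group == 1].mean())

def equal_opportunity_gap(hired, suited, group):
    tpr = [hired[(group == g) & (suited == 1)].mean() for g in (0, 1)]
    return abs(tpr[0] - tpr[1])

print(demographic_parity_gap(hired, group))          # 0.25
print(equal_opportunity_gap(hired, suited, group))   # 0.5
```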

Change Detection of Land Use with Siamese Convolutional Neural Networks
Marc Ponsen, Statistics Netherlands (CBS)
Every few years Statistics Netherlands (SN) publishes the so-called Bestand Bodemgebruik (BBG). This is a map containing digital geometries with information about land use in the Netherlands. It contains polygons, each labelled with the most common type of land use (such as forest, road, or residential area). Creating this map requires a lot of time, as data from many different sources are studied by humans to infer the correct land use. Examples of sources are aerial images, registers available at SN, and the official land registry.
This task could potentially be accelerated by adding another source of information, namely changes of land use predicted by a Siamese Convolutional Neural Network. This network architecture was chosen because Convolutional Neural Networks have proven to be state-of-the-art with respect to image classification. Additionally, since we chose to detect changes (rather than directly classifying images), a Siamese Neural Network structure was used.
Our network will be trained on aerial images of the Netherlands labeled with classifications in previous BBGs. The trained model can then be applied to unseen data (i.e., new aerial images of the Netherlands) and can point the creators of the BBG towards areas that are highly likely to have changed, which would allow them to work faster. We will present preliminary evidence that such networks can detect the most common changes with reasonable to high accuracy.
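A conceptual sketch of such an architecture in PyTorch: a weight-sharing convolutional encoder applied to the "before" and "after" image of the same area, with a small head producing a change logit. Layer sizes are illustrative, not CBS's actual network.

```python
# Sketch: Siamese CNN for land-use change detection between two aerial images.
import torch
import torch.nn as nn

class SiameseChangeDetector(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(          # shared weights for both images
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.head = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, img_before, img_after):
        z = torch.cat([self.encoder(img_before), self.encoder(img_after)], dim=1)
        return self.head(z)                    # logit for "land use changed"

model = SiameseChangeDetector()
before, after = torch.randn(8, 3, 64, 64), torch.randn(8, 3, 64, 64)
print(model(before, after).shape)              # torch.Size([8, 1])
```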

Can a generative language model predict what participants think in the future?
Bennett Kleinberg, Tilburg University; Maximilian Mozes, University College London; Isabelle van der Vegt, Utrecht University
CSS research has an increasing appetite to use “text as data”, and techniques from natural language processing are promising in making inferences from text data about human behaviour. Among the milestones of recent NLP research are generative language models that are trained on (very) large datasets of human language and can be prompted to create human-level text data (e.g., writing novels, answering questions, completing texts). While the release of models such as GPT-3 has been controversial, they offer an exciting opportunity to study human behaviour. This talk investigates how we can harness generative language models for CSS research. Specifically, we use a dataset of texts that n=1152 participants wrote about their coping with the pandemic, measured repeatedly in 2020, 2021 and 2022. We then prompted GPT-3 with each participant’s texts from 2020 and 2021 to generate the text from 2022. This design allowed us to compare each participant’s “true” text to GPT-3’s generated text. Using a range of similarity metrics, we find that GPT-3 can generate texts that are semantically close to the true ones. This surprising finding led us to examine how exactly GPT-3 generated good predictions for what participants would write one year later. In subsequent analyses, we i) assessed whether GPT-3 fails where we would expect it to (e.g. when a participant experienced dramatic life events), ii) examined statistically how computer-generated texts differ from the true human-written ones, and iii) assessed whether the deviation of the predicted text can serve as a meaningful psychological variable. We close the talk with an outlook on what generative text models can do for CSS research and how behavioural science methods can help us understand generative language models.
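A sketch of the comparison step only: score semantic closeness between a participant's true text and a generated candidate with embedding cosine similarity. The generation step (prompting GPT-3 with the 2020 and 2021 texts) is omitted, and the encoder model is an illustrative choice rather than the talk's actual metric suite.

```python
# Sketch: semantic similarity between a true and a model-generated text.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")
true_2022 = "I have mostly returned to normal life, though I still avoid crowds."
generated = "Life feels close to normal again, but large gatherings worry me."

emb = encoder.encode([true_2022, generated], convert_to_tensor=True)
print(float(util.cos_sim(emb[0], emb[1])))   # semantic similarity in [-1, 1]
```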

  • Chair: Giovanni Cassani, Tilburg University
  • Using topic modelling to explore the interaction between spiritual and scientific narratives in an online microdosing community
    Erwin Gielens, Tilburg University
  • Urban Neighborhoods through the Lens of Social Media: A Text Analytic Exploration
    John D. Boy, Leiden University; Jay (Ju-Sung) Lee, Erasmus University Rotterdam
  • Understanding narratives in fertility intention: a neural topic modeling approach
    Xiao Xu, NIDI, University of Groningen

Using topic modelling to explore the interaction between spiritual and scientific narratives in an online microdosing community
Erwin Gielens, Tilburg University
During the COVID-19 pandemic, online communities emerged around “microdosing”: using sub-perceptual doses of psychedelic drugs in an effort to self-treat mental problems or to enhance cognitive functioning and mood in everyday life. The growing broader interest in medical applications for psychedelics has led to vigorous debate in the psychedelics community on its political and discursive implications. Some argue that the medical interest in psychedelics works as a “doorway” that weakens the boundaries between spiritual and scientific thought (Corbin 2010:4). Others worry that spiritual and ritualistic narratives are being displaced as a consequence of the “pharmaceuticalization” of psychedelics (Noorani 2020:34).
The emerging microdosing community on the online blog platform Reddit presents us with an exceptional case to explore how a lay audience deals with the distinction between the scientific and spiritual narratives surrounding psychedelics. We estimate a structural topic model to explore the prevalence of these narratives, and the relation between them. The contribution of such an effort is twofold. On the one hand, we want to contribute to the discussion on the consequences of the “pharmaceuticalization” of psychedelics, i.e. whether the scientific interest is embracing or displacing spiritual narratives. On the other hand, we want to illustrate how topic models can be used to study the de- and re-construction of boundaries (Lamont & Molnar 2002:177; Gieryn 1983).
Preliminary results suggest that while the spiritual topic is a relatively minor part of the microdosing discourse, it is consistently discussed in relation to more scientific topics such as cognitive enhancement, physiological effects and scientific research. Thus, even though scientific discourse is dominating the microdosing community, there does appear to be a “doorway” through which the spiritual narrative enters the arena of science.
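The model used here is a structural topic model (commonly estimated with R’s stm package); purely as a simplified illustration of the general topic-modelling workflow, a plain LDA model on toy posts could look like this:

    from gensim import corpora
    from gensim.models import LdaModel

    posts = [
        "microdosing improved my focus and mood at work",
        "the ritual and spiritual side of psychedelics matters to me",
        "clinical research on psychedelics and depression is growing",
        "spiritual intention and scientific dosing schedules can go together",
    ]
    tokenized = [p.lower().split() for p in posts]
    dictionary = corpora.Dictionary(tokenized)
    corpus = [dictionary.doc2bow(doc) for doc in tokenized]

    lda = LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=0)
    for doc in corpus:
        print(lda.get_document_topics(doc))  # topic shares per post

Which topics co-occur within the same posts is the kind of evidence that speaks to whether spiritual and scientific narratives are discussed in relation to each other.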

Urban Neighborhoods through the Lens of Social Media: A Text Analytic Exploration
John D. Boy, Leiden University; Jay (Ju-Sung) Lee, Erasmus University Rotterdam
Operationalizing a framework initially developed through mixed-methods research centered on an economically-deprived neighborhood in The Hague, we employ computational methods to study digital representations of four urban neighborhoods in The Hague circulating on Twitter and Airbnb: Binckhorst, Moerwijk, Schilderswijk, and Statenkwartier. In The Hague as in cities around the world, uneven capital investment and power imbalances culminate in status differentiation. Some neighborhoods are able to “upgrade” symbolically, while others remain stigmatized and marginalized. Digitization impacts how such symbolic trajectories are shaped, and in particular, neighborhoods today rise and decline with the symbolic investments made in them through digital channels such as social media. What room for maneuver do resident initiatives seeking to shape the image of their neighborhood using digital platforms have? By analyzing and visualizing symbolic trajectories over time, we seek to gain insight into opportunities for resident initiatives — as well as pitfalls of a mode of placemaking dependent on a platform ecology that is often hostile to anyone and anything that is not aligned with status aspiration and dominant aesthetics.
Taking a longitudinal and comparative perspective, we gauge the degree to which different kinds of representations (including injurious and promotional ones) are produced, circulated and amplified on social media. Through text analytic methods (including topic modeling and semantic networks), we map who is producing what kinds of representations and whose representations get the most traction. We also relate these various kinds of representations to each other by comparing trajectories of different neighborhoods, as well as to material conditions in the neighborhoods. We visualize and analyze these trajectories in a multidimensional space, showing for instance that, while the representations of some neighborhoods are gradually remade, the case of Schilderswijk shows a durable pattern of stigmatization, suggesting that injurious racialized representations of this disenfranchised neighborhood are further reinforced.

Understanding narratives in fertility intention: a neural topic modeling approach
Xiao Xu, NIDI, University of Groningen
The traditional explanations for fertility intentions, based only on objective measures, have been insufficient for understanding contemporary fertility trends in European countries. Subjective narratives, proposed by Vignoli et al., opened up a new channel to factors behind fertility decision-making. Open-ended questions (OEQs) give respondents the opportunity to “expand on” their narratives about fertility plans, and Natural Language Processing (NLP) techniques enable us to delve deeper into the underlying reasoning behind responses with much less human effort. In this study, using automatic neural topic modeling methods, we identify and interpret topics and the logic behind the narratives on fertility intentions of women in the Netherlands. We used Contextualized Topic Models (CTM), a neural topic model using pre-trained representations of language, to conduct our analysis. Our results help reveal factors determining (un)certainty in fertility decisions across different groups, and lay the foundation for more comprehensive analysis.

  • Chair: Javier Garcia-Bernardo, Utrecht University
  • Communal data harmonization services for CBS microdata-users
    Bastian Ravesteijn and Mirthe Hendriks, Erasmus University Rotterdam, Erasmus School of Economics
  • SyGNet: Synthetic Data for the Social Sciences using Deep Learning
    Thomas S. Robinson, Durham University; Maksim Zubok, Oxford University; Artem Nesterov, Durham University
  • Powerful and safe computing for social sciences with the ODISSEI Secure Supercomputer
    Marco Verdicchio, SURF

Communal data harmonization services for CBS microdata-users
Bastian Ravesteijn and Mirthe Hendriks, Erasmus University Rotterdam, Erasmus School of Economics
For users of CBS microdata, data harmonization is an unavoidable but time-consuming part of the data analysis process. Data harmonization refers to the effort of combining data from different sources with varying file locations, file formats, and naming conventions, and transforming it into a single cohesive dataset. Our aim is to provide communal services for CBS microdata users by making data harmonization scripts openly and easily accessible.
While the CBS microdata infrastructure facilitates ground-breaking research, it remains a challenge for researchers to manage the vast number of data files and documentation, and to link the data. Why is data harmonization challenging? The CBS microdata come in different file formats (i.e. -bus and -tab files with observations by year, month or period), file paths (which change when files are moved or new versions are published), subject areas with data in multiple (sub)folders, and naming conventions. Currently, most CBS microdata users spend a considerable amount of time harmonizing the same data, reinventing the wheel.
In the project Children and (future) Parents, supported by Prediction and Professionals in Prevention to improve Opportunity, researchers work with CBS microdata on child health and development, demographic variables and parental characteristics. We have harmonized microdata on eleven data topics and have made the R scripts openly accessible via GitHub. The intention is that other CBS microdata users can use these ‘harmonization scripts’: run the scripts and easily extract the harmonized data of interest. Moreover, we will encourage other users of CBS microdata to publish harmonization scripts in this GitHub repository.
This communal data harmonization initiative for users of CBS microdata provides a range of benefits: increasing the visibility of harmonization efforts, improving reproducibility, and reducing time-consuming duplicated work. Not only does this initiative promote the implementation of Open Science and FAIR principles, it may also stimulate users of CBS microdata to cooperate.
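The published scripts are written in R; purely to illustrate what a typical harmonization step involves, a hypothetical sketch (here in Python, with invented file paths and variable mappings) resolves the year-specific file version, renames source variables onto one schema, and stacks the years:

    import pandas as pd

    # Hypothetical year -> file mapping; real CBS paths and versions vary by release
    FILES = {2019: "G:/Spolis/WAGES2019V2.csv", 2020: "G:/Spolis/WAGES2020V1.csv"}
    RENAME = {"RINPERSOON": "person_id", "SBASISLOON": "base_wage"}  # hypothetical names

    def harmonize_wages(year: int) -> pd.DataFrame:
        df = pd.read_csv(FILES[year], usecols=list(RENAME))
        df = df.rename(columns=RENAME)
        df["year"] = year  # so the harmonized years can be stacked
        return df

    panel = pd.concat([harmonize_wages(y) for y in FILES], ignore_index=True)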

SyGNet: Synthetic Data for the Social Sciences using Deep Learning
Thomas S. Robinson, Durham University; Maksim Zubok, Oxford University; Artem Nesterov, Durham University
At the forefront of social science research, novel techniques are being developed that allow researchers to make robust inferences from complex data. These tools and methods rest on proof of their performance, which in turn relies on using the right kind of data to test them. These tests are hard to conduct well because real social science data is so complex: parametric tests may not comport well with actual social science applications. Conversely, benchmarking on well-known studies leaves researchers unable to determine true performance, since the population parameters are unknown. In this paper we introduce a new solution using synthetic data: a strategy in which the underlying relationships between variables in real-world data are learned through deep, generative neural networks, from which an arbitrary number of entirely new but realistic observations can be generated. We demonstrate our method by synthesising realistic-looking, but entirely novel, survey data. We discuss how this synthetic data can be used to benchmark statistical designs and methods, and contribute new open-source software for researchers to use.
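As a generic illustration of the underlying idea, not the SyGNet implementation itself, a minimal generative adversarial setup for (standardized, continuous) tabular data can be sketched as follows: a generator maps noise to synthetic rows and a discriminator supervises it.

    import torch
    import torch.nn as nn

    n_features, noise_dim, batch = 5, 8, 256
    real_rows = torch.randn(batch, n_features)  # stand-in for standardized survey data

    G = nn.Sequential(nn.Linear(noise_dim, 32), nn.ReLU(), nn.Linear(32, n_features))
    D = nn.Sequential(nn.Linear(n_features, 32), nn.ReLU(), nn.Linear(32, 1))
    opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
    opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
    bce = nn.BCEWithLogitsLoss()

    for step in range(500):
        fake_rows = G(torch.randn(batch, noise_dim))
        # Discriminator learns to tell real rows (label 1) from synthetic rows (label 0)
        loss_d = (bce(D(real_rows), torch.ones(batch, 1))
                  + bce(D(fake_rows.detach()), torch.zeros(batch, 1)))
        opt_d.zero_grad(); loss_d.backward(); opt_d.step()
        # Generator learns to make the discriminator label its rows as real
        loss_g = bce(D(fake_rows), torch.ones(batch, 1))
        opt_g.zero_grad(); loss_g.backward(); opt_g.step()

    synthetic = G(torch.randn(1000, noise_dim)).detach()  # any number of new rows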

Powerful and safe computing for social sciences with the ODISSEI Secure Supercomputer
Marco Verdicchio, SURF
Statistics Netherlands (CBS) collects data from the Dutch population for statistical purposes. For more than 15 years, these data have been made accessible for scientific research under strictly controlled conditions and with strict privacy protection. Due to the ever-increasing availability of data, the need for computing power keeps growing as well.
Until recently, researchers could only work with CBS microdata in the CBS Remote Access environment, which offers limited compute power. CBS, ODISSEI and SURF joined forces to facilitate researchers working with large and complex sensitive data. As a result of this collaboration, SURF developed a new service: the ODISSEI Secure Supercomputer (OSSC). Via a secure network connection, the CBS environment is connected to a fully isolated environment in the Dutch national supercomputer. The OSSC can also be considered an enclave of CBS within the domain of SURF.
OSSC is a “Platform as a Service” infrastructure that provides researchers access to a secure remote environment where they can process highly sensitive data. This allows researchers to exploit the computational power of modern high-performance computing systems and scale up their research in order to process large datasets and reduce the compute time of their analyses. Last year the OSSC was migrated to the new Dutch supercomputer Snellius. Users have access only via the CBS portal.
Worldwide, the combination of well-annotated, long-running population data with the possibility to conduct research on a supercomputer is unique. In this presentation, we give an overview of the OSSC environment and show how the security of the system is guaranteed. We also explain how to decide whether a project is suited for analysis on Snellius and how you can make use of the service.

13.50-14.50 – Parallel Session 3

  • Chair: Fatima El-Messlaki, Statistics Netherlands – CBS
  • Links to the future: towards integrated data infrastructures in work & inequality research using administrative microdata
    Zoltan Lippenyi, University of Groningen
  • The Existence of Economies of Scope in Data Aggregation: A big data study merging longitudinal LISS panel data and CBS microdata
    Seyit Höcük, Centerdata
  • System-to-System Data Collection in business surveys applied to an agricultural survey: a Proof of Concept
    Ger Snijkers, Statistics Netherlands (CBS)

Abstracts session 3.1

Links to the future: towards integrated data infrastructures in work & inequality research using administrative microdata
Zoltan Lippenyi, University of Groningen
The growing volume of population microdata on individuals and work organizations from administrative registers provides unique opportunities to extend research lines and experiment with novel computational approaches in research on work and inequality. However, there are unique challenges with these data that standard practices for social surveys do not remedy. For example, linking multi-source, multi-level, and longitudinal administrative datasets raises issues of representativity, and the administrative categories of data-providing agencies pose challenges of measurement validity. There are scattered approaches to solving these common issues, but we lack an overview and clear guidelines on how to build data infrastructures with administrative microdata for work & inequality research. The paper fills this gap by reviewing the state of the art in linked employer-employee data practices in the field of work and inequality research, identifying areas where the field should progress, and outlining a roadmap towards integrated knowledge infrastructures that reduce the high start-up costs of research projects using these data sources.

The Existence of Economies of Scope in Data Aggregation: A big data study merging longitudinal LISS panel data and CBS microdata
Seyit Höcük, Centerdata
Do predictive models always benefit from more (complementary) data? Economies of scale in data aggregation is a well-studied concept: it motivates exploring how a model’s predictive power changes as the dataset grows in the number of observations. Economies of scope, on the other hand, is still largely under debate. The question we aimed to answer in this research is whether adding relevant, complementary, but independent variables consistently leads to better models, as measured by their predictive power. If so, this can be an argument for implementing policy towards the disclosure and sharing of large databases.
For the Joint Research Center (JRC) of the European Commission, we linked several longitudinal datasets from the LISS core questionnaires to registry data (microdata) of Statistics Netherlands (CBS). Two machine learning models, Random Forest and Logistic Regression, were built to study the impact of data aggregation from different but related data sources on predictive models. Through an extensive parameter study, we proved the existence of Economies of Scope in Data Aggregation (ESDA) for a use case of health and health-related data in combination with background characteristics of people.
The findings of this big data study argue for opening health data silos and merging them with socioeconomic data sources in large data pools to deliver better predictive and preventive care. A working paper is available online on the JRC website, and an international research paper will be published soon.
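The core comparison can be illustrated with a small simulated sketch: does adding complementary variables from a second source raise out-of-sample predictive power? (All data below is simulated; variable names are illustrative only.)

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    n = 2000
    background = rng.normal(size=(n, 5))   # e.g. survey background variables
    registry = rng.normal(size=(n, 5))     # e.g. linked register variables
    y = (background[:, 0] + registry[:, 0] + rng.normal(size=n) > 0).astype(int)

    model = RandomForestClassifier(n_estimators=200, random_state=0)
    auc_base = cross_val_score(model, background, y, cv=5, scoring="roc_auc").mean()
    auc_both = cross_val_score(model, np.hstack([background, registry]), y, cv=5,
                               scoring="roc_auc").mean()
    print(f"background only: {auc_base:.3f} | background + registry: {auc_both:.3f}")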

System-to-System Data Collection in business surveys applied to an agricultural survey: a Proof of Concept
Ger Snijkers, Statistics Netherlands (CBS)
In the 20th century, sample surveys proved to be a cost-efficient method to produce accurate statistics, although they come at a high cost both for National Statistical Institutes (NSIs) and for businesses, which may experience a high response burden. Nowadays, in the information age, there are many new digital data sources in smart industries, such as precision farming. In some cases, these data sources allow for data communication with other computer systems without human intervention, via Application Programming Interfaces (APIs). Based on these software interfaces, we developed a system-to-system data collection methodology that reduces response burden by automating the business’s internal data retrieval process. Applied to the official Crop Yield Survey, a software prototype was developed based on this methodology.
We will present the IT architecture we developed, showing how data capture and processing can be automated. We will discuss the automated pre-filling of the electronic Crop Yield Survey questionnaire using an API provided by a smart farming machine manufacturer, John Deere: the MyJohnDeere API. In a first Proof of Concept, it was applied to data from a virtual farm, showing that the approach works and that the farmer’s workload to complete a questionnaire can be kept to a minimum. Our next step is to conduct a small-scale field test with a small number of farmers to study the method in practice. We believe that this system-to-system method can be applied to business surveys in general and will, in the future, replace the manual completion of business survey questionnaires, including the manual retrieval and re-keying of data. Eventually, it may replace the use of questionnaires altogether.
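Schematically, and with a hypothetical endpoint, token, and field names standing in for the real MyJohnDeere API, the pre-filling step amounts to fetching machine-recorded data over an API and aggregating it into the totals the questionnaire asks for:

    import requests

    API_URL = "https://api.example-smart-farming.com/v1/fields"  # hypothetical
    headers = {"Authorization": "Bearer <access-token>"}

    resp = requests.get(API_URL, headers=headers, timeout=30)
    resp.raise_for_status()

    # Aggregate machine data into the totals the survey asks for
    questionnaire = {"crop": "wheat", "harvested_ha": 0.0, "yield_kg": 0.0}
    for field in resp.json():
        questionnaire["harvested_ha"] += field["area_ha"]
        questionnaire["yield_kg"] += field["harvest_kg"]

    print(questionnaire)  # pre-filled answers, to be confirmed by the farmer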

  • Chair: Marcel Das, Centerdata
  • Lessons from enriching the German Socio-Economic Panel (SOEP) with genetic data
    Richard Karlsson Linnér, Leiden University
  • 15 years of Lifelines – pioneering into the future
    Trynke de Jong, Lifelines
  • Pushing boundaries in behavior genetics with the OSSC
    Camiel van der Laan, VU Amsterdam

Lessons from enriching the German Socio-Economic Panel (SOEP) with genetic data
Richard Karlsson Linnér, Leiden University
The German Socio-Economic Panel (SOEP) provides the global research community with rich longitudinal household data of importance to social-science and humanities (SSH) research. Genetic array data are now among the most cost-effective and non-intrusive biological assays available, offering unique opportunities to address broad scientific challenges, e.g., to establish causal relationships or to disentangle environmental processes underlying health or income inequalities. We recently embarked on a data collection effort (called Gene-SOEP) that enriched the existing panel with genetic data from 2,598 respondents. To support the research community, we provide research infrastructure by generating standardized genetic research variables (polygenic indexes) using state-of-the-art techniques. Polygenic indexes have the advantage of being accessible to social scientists, requiring only a basic understanding of genetics. This presentation will first briefly summarize the data collection process and then focus on key lessons from our experiences that may be valuable to future data collection efforts (e.g., on deciding between competing technologies, or on interviewer effects and how to avoid them). The presentation will conclude with an empirical application in which the collected genetic data were used to explain documented changes over time in the German population’s height and BMI as the result of environmental rather than genetic processes, demonstrating some of the many advantages of enriching existing panels with genetic data.
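Concretely, a polygenic index is typically a weighted sum over measured genetic variants (a textbook form, independent of the exact pipeline used for Gene-SOEP):

    \mathrm{PGI}_i = \sum_{j=1}^{M} \hat{\beta}_j \, g_{ij}

where g_{ij} ∈ {0, 1, 2} counts the effect alleles of respondent i at variant j and the weights \hat{\beta}_j come from an earlier genome-wide association study.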

15 years of Lifelines – pioneering into the future
Trynke de Jong, Lifelines
Starting in 2007, Lifelines has established the largest prospective Cohort Study and Biobank in the Netherlands, collecting a comprehensive set of longitudinal health-related data and samples from ~167,000 inhabitants of the three northern provinces (including children and the elderly).
Lifelines is unique in many ways: in size, in ambition, in independence, in the wide variety of data and samples we collect, and in our primary aim to share these with researchers all over the world. Our uniqueness has compelled us to pioneer and innovate in many different areas.
We custom-built our own IT infrastructure in order to efficiently manage our participants, our questionnaires and measurements, our samples, our many subcohorts, the FAIR-ness of our datasets, and our network of multidisciplinary users. We initiated and developed successful linkages with socio-economic, pharmaceutical, environmental and clinical databases. We formulated advanced governance in order to balance efficiency, scientific merit, ethical considerations and the privacy of our participants. And as pioneers do, we also struggled, failed, and found some dead ends along the way.
While we are finishing up our 3rd general assessment and preparing to start the 4th in 2024, we would like to look back on our main successes and pitfalls, discuss our current challenges (e.g. to retain our participants and to find new users), and present our many plans and predictions for the future – including the exciting scientific possibilities of collaborating with ODISSEI.

Pushing boundaries in behavior genetics with the OSSC
Camiel van der Laan, VU Amsterdam
The field of behavioral epidemiology can be broadly divided into two main approaches: twin and family studies, and molecular genetics. Twin and family studies are the basis of behavior genetic research. These studies make use of differences in genetic resemblance between different family members to study the relative importance of genetic and environmental influences. The strongest design makes use of monozygotic and dizygotic twins, because they are similar in every regard except their genetic resemblance. Differences in phenotypic resemblance between mono- and dizygotic twins are therefore attributed to differences in genetic resemblance. We are working on exploring and developing procedures that allow researchers in the Netherlands to leverage the possibilities of twin and family approaches in CBS microdata. Currently, this is hampered by the unknown zygosity of same-sex twins and by difficulties in identifying and demarcating (biological) relatives.
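In the classical twin design, this logic is often summarised by Falconer’s formulas, which decompose phenotypic variance from the MZ and DZ twin correlations r_MZ and r_DZ (a textbook simplification, not necessarily the exact procedure used here):

    h^2 = 2(r_{MZ} - r_{DZ}), \quad c^2 = 2 r_{DZ} - r_{MZ}, \quad e^2 = 1 - r_{MZ}

where h^2 is the additive genetic, c^2 the shared environmental, and e^2 the unique environmental share of the variance.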
In molecular genetics, associations between specific genetic variants and traits are investigated. In genome-wide association studies (GWAS), this means testing associations for up to 10 million genetic variants (SNPs). In general, and especially for behavioral traits, many of these variants affect a wide variety of traits, each with a very small effect size. Effective GWAS are therefore currently limited to traits that are studied by multiple research groups, so that large sample sizes can be achieved through meta-analysis. Such linkage has already been demonstrated successfully in a study of health-care costs by De Zeeuw (2021), whose results were used to predict health-care costs in Finnish population registry data. But the opportunities the OSSC offers are broader than this. In general, the focus is on additive genetic effects, i.e. the main effects of genetic variants. Interactions between genetic variants have remained largely unstudied because of their computational demands.

  • Chair: Marco Stam, Leiden University
  • The Efficacy of Energy Efficiency: Measuring the Returns to Home Insulation
    Linde Kattenberg, Maastricht University
  • Breaking the barrier: the effectiveness of an extended school day program on primary school advices
    Gijs Custers, Erasmus University Rotterdam

The Efficacy of Energy Efficiency: Measuring the Returns to Home Insulation
Linde Kattenberg, Maastricht University
Energy efficiency in the housing market is often considered a panacea for reducing carbon emissions and enhancing energy independence. Among the many interventions to improve the energy efficiency of homes, insulation of roofs, walls, and floors plays an important role. However, the impact of these insulation measures on actual gas consumption is typically based on engineering estimates and subject to debate. This study exploits a unique sample of insulation interventions, combined with detailed information on actual gas consumption before and after these interventions, and information on the socio-economic characteristics of occupants. We find that home insulation reduces gas use by about 20% on average, both for owner-occupied and for rental homes, for which the treatment is plausibly exogenous. We find no evidence of a rebound effect over time: the reduction in gas consumption persists up to 9 years after the intervention. At 2022 gas prices, and for the average home in our sample, the treatment effect translates into a €752 reduction in the annual gas bill and an average rate of return of 44%. We find the strongest effects for wall insulation, and only minor effects for floor insulation alone.

Breaking the barrier: the effectiveness of an extended school day program on primary school advices
Gijs Custers, Erasmus University Rotterdam
This study investigates the effectiveness of an extended school day program on primary school advices in the context of the National Program Rotterdam South (NPRZ). The program has been implemented in approximately 30 primary schools in Rotterdam South. Microdata from Statistics Netherlands (CBS) are used to examine to what extent school advices have increased between 2010 and 2019; the program started in 2013. A comparison group of schools is identified using matching methods. For the analysis, a comparative interrupted time series (CITS) analysis is performed to model the effect of the intervention. The study discusses the theoretical background of the paper, earlier findings, and the limitations of the analysis.
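One common CITS specification, shown here in stylised form rather than as the study’s exact model, is

    Y_{st} = \beta_0 + \beta_1 t + \beta_2 P_t + \beta_3 (t \cdot P_t) + \beta_4 G_s + \beta_5 (G_s \cdot t) + \beta_6 (G_s \cdot P_t) + \beta_7 (G_s \cdot t \cdot P_t) + \varepsilon_{st}

where t is time, P_t indicates the post-intervention period, and G_s indicates program schools; \beta_6 and \beta_7 capture the intervention’s effect on the level and trend of school advices relative to the matched comparison schools.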

Can we use data on hospitalizations and causes of death to explore gender and socioeconomic differentials in attempted and completed suicides in the Netherlands?
Katya Ivanova, Tilburg University
The goal of this project is to explore gender and socioeconomic inequalities in suicidal behaviors during the period of economic downturn in the Netherlands (i.e., the economic recession of the 2010s). Though theoretically well-informed by key works in the fields of sociology, psychology, and economics, one of the leading challenges in the study of suicidal acts has unquestionably been methodological, namely the availability of high-quality data at the individual level (Wray, Colen, & Pescosolido, 2011). Much of our current understanding of self-harm acts is based on the study of aggregate-level differences in suicide rates across distinct groups, possibly as a function of other macro-level measures of economic or social circumstances (e.g., Case & Deaton, 2015, 2017; Graeff & Mehlkop, 2007; Light & Ulmer, 2016; Reeves & Stuckler, 2016; Shiels et al., 2019; Stockard & O’Brien, 2002). An alternative approach has been the use of individual surveys (at times based on convenience samples), which include self-reports of suicide attempts or suicide ideation (Abrutyn & Mueller, 2014; Nock, Borges, Bromet, Alonso, et al., 2008; Twenge, Cooper, Joiner, Duffy, & Binau, 2019; for an overview, see Nock, Borges, Bromet, Cha, et al., 2008).
In this contribution, we utilize microdata assembled by Statistics Netherlands (CBS), more specifically data from the National Medical Registration, which provides information on reasons for medical assistance (including “intentional self-harm”), and the Dutch registry of causes of death. These data are then combined with individual-level information on socioeconomic status in order to explore how the period of economic downturn in the Netherlands affected different strata within society with respect to both attempted and completed suicidal acts.

  • Chair: Nora Buenemann, PDI-SSH
  • Worry, hope, resignation, and now what? A three-wave study on emotional coping in the pandemic
    Isabelle van der Vegt, Utrecht University; Maximilian Mozes, University College London; Bennett Kleinberg, Tilburg University & University College London
  • Theoretical implications of the corona pandemic for gendered divisions of childcare in the Netherlands
    Stéfanie André, Radboud University Nijmegen; Mara Yerkes, Utrecht University; Chantal Remery, Utrecht University
  • Conspiracy theories on Twitter: Emerging motifs and temporal dynamics during the COVID-19 pandemic
    Veronika Batzdorfer, GESIS — Leibniz Institute for the Social Sciences

Worry, hope, resignation, and now what? A three-wave study on emotional coping in the pandemic
Isabelle van der Vegt, Utrecht University; Maximilian Mozes, University College London; Bennett Kleinberg, Tilburg University, University College London
In March 2020, we collected the first wave of the “Real World Worry Dataset”, aimed at harnessing natural language processing methods to make inferences about emotional responses to the COVID-19 pandemic. We collected data in a survey design from 2,500 participants who indicated their emotions and were subsequently asked to write a text that expresses how they feel about COVID-19. The resulting dataset consists of 2,500 short, Tweet-sized texts and 2,500 longer texts. In 2021 and 2022, we collected a second and third wave of data from the same participants, enabling us to study how individuals (fail to) cope with the pandemic. Using the available text and corresponding emotion data, we showed that emotional responses after the first year of the pandemic follow a heterogeneous pattern, with a well-coping and a resigning subgroup of participants. In this talk, we will present new findings on the recently collected third wave of the same dataset from 2022. Specifically, we i) examine how text responses from 2020 and 2021 can predict coping and emotions in 2022; ii) assess how the emotional responses developed over time and how participants “migrated” between clusters of emotional coping; and iii) use explanatory modeling to understand how self-reported life events during the pandemic (e.g., birth of a child, suicide in family) affected emotional coping styles. We close with a call for more ground truth data in text-based computational social science research and highlight implications of our work for public mental health efforts.

Theoretical implications of the corona pandemic for gendered divisions of childcare in the Netherlands
Stéfanie André, Radboud University Nijmegen; Mara Yerkes, Utrecht University; Chantal Remery, Utrecht University
At the beginning of the corona pandemic, many scholars wondered what the effect of the lockdown would be on gendered divisions of childcare tasks between working parents. Now, a good one and a half years into the pandemic, it is time to take stock. When the lockdown started in mid-March 2020, the general expectation was that working from home, and designating some workers as ‘essential workers’ who were still able to enter the workplace, might act as a pivotal natural experiment in which the ingrained gender inequality in work hours, care for children and household tasks would change. The first lockdown entailed a double expectation for us researchers (e.g. Yerkes et al., 2020). First, we expected gender equality to improve: working from home would give men and fathers the opportunity to take a larger share of the tasks at home on their shoulders, especially since women and mothers were overrepresented in essential occupations. Second, we also imagined that the corona crisis would enlarge existing inequalities between men and women, because the extra care tasks would naturally fall on the shoulders of women and would thus lead to an extra burden on women.
There are many single-country studies finding, for various countries, that inequalities remained, although some fathers did take on more care. A drawback is that most studies are cross-sectional and focused on a first empirical impression; few studies say anything about the applicability of often-used theories to explain these divisions of care work. To improve upon these earlier studies, we reflect upon the usability of often-used theories of gendered divisions of care and household tasks (relative resources theory, time availability theory, and gender role attitudes) and test these theories using the Dutch LISS panel, a probability-based longitudinal sample with four measurement points during the pandemic in the Netherlands.

Conspiracy theories on Twitter: Emerging motifs and temporal dynamics during the COVID-19 pandemic
Veronika Batzdorfer, GESIS — Leibniz Institute for the Social Sciences
The COVID-19 pandemic resulted in an upsurge in the spread of diverse conspiracy theories (CTs). In the present study, we leverage Twitter data across 11 months in 2020 from the timelines of 109 CT posters and a comparison group (non-CT group) of equal size. Our study first provides a proof of concept for differentiating CT language and characterising it via linguistic similarity indicators and psychological needs derived from word embeddings. Second, we assess how time series methods can enrich a theory-rooted view of dynamic user engagement. We applied time series analyses to investigate whether CT posters and non-CT posters differ in their non-CT tweets, as well as the temporal dynamics of CT tweets. In this regard, we describe the aggregate and individual series, conduct an STL decomposition into trend, seasonal, and error components as well as an autocorrelation analysis, and apply generalised additive mixed models to analyse nonlinear trends and their differences across users. The narrative motifs, characterised by word embeddings, address pandemic-specific motifs alongside broader motifs and can be related to several psychological needs (epistemic, existential, or social). Overall, the comparison of the CT group and non-CT group showed a substantially higher level of overall COVID-19-related tweets in the non-CT group and a higher level of random fluctuations. Focusing on conspiracy tweets, we found a slight positive trend but, more importantly, an increase in users in 2020. Moreover, the aggregate series of CT content revealed two breaks in 2020 and a significant albeit weak positive trend since June. On the individual level, the series showed strong differences in temporal dynamics and a high degree of randomness and day-specific sensitivity. The results stress the importance of this type of semantic and temporal approach, which can add valuable information to theoretical assumptions on feelings of anxiety and lack of control.
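The decomposition step described above can be sketched in a few lines: STL splits a daily tweet-count series into trend, seasonal, and remainder components (the series below is simulated; the study uses counts from user timelines).

    import numpy as np
    import pandas as pd
    from statsmodels.tsa.seasonal import STL

    rng = np.random.default_rng(0)
    days = pd.date_range("2020-01-01", periods=366, freq="D")
    counts = (5.0 + 0.01 * np.arange(366)          # slight upward trend
              + 2.0 * (days.dayofweek < 5)         # weekday/weekend seasonality
              + rng.poisson(2, 366))               # random fluctuations
    series = pd.Series(np.asarray(counts, dtype=float), index=days)

    result = STL(series, period=7).fit()           # weekly seasonal component
    trend, seasonal, resid = result.trend, result.seasonal, result.resid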

  • Chair: Anja Smit, DANS
  • Textwash – automated open-source text anonymisation
    Bennett Kleinberg, Tilburg University & University College London; Maximilian Mozes, University College London
  • Navigating Stories in Times of Transition
    Erik Tjong Kim Sang, Netherlands eScience Center; Kody Moodley, Netherlands eScience Center
  • Towards a Scientific Holocaust Database for the Netherlands
    Arnoud-Jan Bijsterveld, Tilburg University; Peter Tammes, University of Bristol; Kees Mandemakers, International Institute of Social History & Erasmus University Rotterdam

Textwash – automated open-source text anonymisation
Bennett Kleinberg, Tilburg University & University College London; Maximilian Mozes, University College London
The increased use of text data in social science research has benefited from easy-to-access data (e.g. Twitter). That trend comes at the cost of research that would require sensitive data that cannot easily be shared (e.g. interview data, police reports, electronic health records). We introduce a solution to that stalemate with the text anonymisation software Textwash. Textwash is fully open-source, requires no internet connection or cloud-based services that risk data interception, maintains the usefulness of text data for downstream text analysis, and is empirically validated. The software relies on machine learning to identify and replace potentially sensitive information. This talk presents the tool, its development phases and an in-depth empirical evaluation of Textwash. The evaluation is performed along the TILD criteria: a technical evaluation (how accurate is the tool?), an information loss evaluation (how much information is lost in the anonymisation process?) and a de-anonymisation test (can humans identify individuals from anonymised text data?). The findings suggest that Textwash performs similarly to state-of-the-art entity recognition models and introduces a negligible information loss of 0.80%. For the de-anonymisation test, we first crowdsourced 1200 person descriptions of very famous, semi-famous and non-existing individuals. These sequences were then anonymised, and a new group of participants was tasked to identify the described persons. The de-anonymisation rate ranged from 1.01% to 2.01% for realistic use cases of the tool. We replicated the findings in a second study and concluded that Textwash succeeds in removing sensitive information that renders detailed person descriptions practically anonymous. We finish the talk with an outlook on upcoming features and provide a Python demo that makes the tool immediately accessible to others.
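Not Textwash itself, but the underlying idea, detect potentially identifying entities and replace them with category placeholders, can be sketched with an off-the-shelf NER model:

    import spacy

    nlp = spacy.load("en_core_web_sm")  # assumes this spaCy model is installed

    def anonymise(text: str) -> str:
        doc = nlp(text)
        out = text
        # Replace from the end so earlier character offsets stay valid
        for ent in reversed(doc.ents):
            if ent.label_ in {"PERSON", "ORG", "GPE", "DATE"}:
                out = out[:ent.start_char] + f"[{ent.label_}]" + out[ent.end_char:]
        return out

    print(anonymise("Maria Jansen moved to Utrecht in March 2020 to work for CBS."))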

Navigating Stories in Times of Transition
Erik Tjong Kim Sang, Netherlands eScience Center
Digital Story Grammar is a method for semi-automatic narrative analysis recently developed for English [Andrade and Andersen, 2020]. We want to adapt this technique for Dutch and make it available to a large community of social science researchers by embedding it in a digital tool. For this purpose, we started the project Navigating Stories in Times of Transition in 2021, a cooperation between the University of Twente and the Netherlands eScience Center.
Narrative research has used digital tools for qualitative data analysis for more than twenty years. Examples of widely used tools are NVivo, Atlas.ti and MAXQDA. We chose Orange, a modular graphical data mining tool, as an initial target for our Dutch Digital Story Grammar software. Creating this software involved adding standard natural language modules to Orange, like part-of-speech tagging, dependency parsing and coreference resolution. On top of that we added rules for extracting and visualizing parts of the language analysis that are interesting for narrative research.
To find out how natural language processing and data visualization can support narrative research, we interviewed several expert researchers [Pijpers, 2022]. These interviews resulted in five user personas, with different computational skill levels and different demands on supportive computational tools. We will present the tool developed in our project to researchers and students in hands-on workshops. The feedback we hope to collect will be valuable for adapting the software and making it useful for a larger user community.
In this presentation, we will present the current stage of our project. We will demonstrate what the new software can currently do and how it can support narrative research. We will also discuss some of the feedback on the tool from potential users.

Towards a Scientific Holocaust Database for the Netherlands
Arnoud-Jan Bijsterveld, Tilburg University; Peter Tammes, University of Bristol; Kees Mandemakers, International Institute of Social History & Erasmus University Rotterdam
About 73 percent of the 140,000 individuals persecuted as Jews by the Nazis in the Netherlands were killed. In the past decades, the digital accessibility of administrative sources has increased, resulting in a growing number of physical and digital monuments providing names, dates, and places of birth and death of these victims.
The relative share of those who survived and those who did not varies significantly across the Netherlands and between socio-demographic groups. These differences in survival rates raise relevant questions as to who was most at risk and why. Yet answering these questions requires a different database than one primarily aimed at commemoration. Our aim is to create a scientific Holocaust Database for the Netherlands containing historic life-course data for both victims and survivors.
We propose three steps to create such a database:
1. A full listing of Jews living in the Netherlands in 1940-1945, building upon earlier research.
2. Including information on items such as exemptions from deportation, residential addresses, survival by hiding or escape, and deportation trajectories, as registered in the Jewish Council card index and in records in the Arolsen Archives, for example. The Jewish Museum and the Memorial Centre Camp Westerbork have shown commitment to building a scientific database and to linking data on survivors.
3. Structuring and formatting individual data, building on expertise developed during the construction of the Digital Jewish Monument at the International Institute of Social History and other datasets within the HSNDB environment.
This initiative connects research in the Humanities with Computational Social Science Research and will attract researchers from various disciplines. Providing access to a rich scientific Holocaust Database would allow researchers to run advanced queries and analytical models. This would result in improving our knowledge about the Holocaust through more elaborate research, supporting education, and enhancing commemorative activities.

15.15-16.15 – Parallel Session 4

  • Chair: Erik-Jan van Kesteren, Utrecht University
  • Empirical calibration of full scale agent-based models of school choice: computational challenges and model validation
    Eric Dignum, Mike Lees, University of Amsterdam
  • Capturing the social fabric: Population-scale socio-economic segregation patterns
    Yuliia Kazmina, University of Amsterdam
  • Using microsimulation, administrative data, and supercomputers to realistically model fertility behaviour: the case of fertility preferences and childlessness
    Gert Stulp, University of Groningen

Empirical calibration of full scale agent-based models of school choice: computational challenges and model validation
Eric Dignum, Mike Lees, University of Amsterdam
School segregation is widely associated with existing inequalities and their reproduction. Although it has been studied for decades using various methods and techniques, it remains a persistent problem in society. Currently employed methodologies often treat households at the micro level as utility-maximising individuals who decide in isolation, or analyse macroscopic trends and correlations. However, these methodologies might miss important interactions within and between these levels. For example, parents rely on their social networks, observe current school compositions and school profiles, live in segregated neighbourhoods, and are subject to institutional rules; hence they interact with each other and with their environment. Simulation-based techniques, such as Agent-Based Models (ABM), provide a way to model these features explicitly and have shown promising results in other fields of science.
Existing ABMs of school choice are mostly based on theoretical rules and smaller scales, and hence do not take the full scale of a city into account. On the other hand, currently used fully data-driven methods, such as discrete choice analysis, do not consider the potentially complex interactions. We therefore present results of one of the first empirically calibrated ABMs of school choice at the Amsterdam scale. Key challenges for ABMs, however, are the empirical calibration and validation of their simulated (household) behaviour. These are important for confidence in the model and could inform potential policy, but they require a lot of data and computation. Multiple runs are needed to grasp how sensitive the model is to its input parameters, to quantify uncertainty, and to analyse the impact of specific scenarios. We show some of the benefits of our modelling approach, as well as some of the computational difficulties and open challenges we encountered while modelling school segregation at the Amsterdam scale.
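A deliberately tiny sketch of the modelling idea (not the calibrated Amsterdam model) has households sequentially pick the school with the highest utility, trading off distance against the school’s current group composition; all parameters are invented.

    import random

    random.seed(0)
    schools = [{"x": random.random(), "n": 0, "same": 0} for _ in range(5)]
    households = [{"x": random.random(), "group": random.random() < 0.4}
                  for _ in range(200)]

    def utility(h, s, alpha=0.6):
        """Trade off own-group share at the school against distance."""
        if s["n"] == 0:
            own_share = 0.5                      # no composition information yet
        else:
            share = s["same"] / s["n"]
            own_share = share if h["group"] else 1 - share
        return alpha * own_share - (1 - alpha) * abs(h["x"] - s["x"])

    for h in random.sample(households, len(households)):  # sequential arrival
        best = max(schools, key=lambda s: utility(h, s))
        best["n"] += 1
        best["same"] += h["group"]

    # Minority shares per school: segregation emerges from local choices
    print([round(s["same"] / max(s["n"], 1), 2) for s in schools])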

Capturing the social fabric: Population-scale socio-economic segregation patterns
Yuliia Kazmina, University of Amsterdam
Segregation is a widely studied issue, traditionally explored from the point of view of the spatial distribution of different groups as defined by an individual attribute such as race, religion, or social class. Nevertheless, we argue that the issues of persistent segregation, specifically socio-economic segregation, are networked phenomena and should be studied as such. In this paper, we make a methodological contribution that allows scholars and policymakers to move away from a traditional spatial understanding of segregation that ignores interactions beyond neighborhoods, and to shift the focus of segregation measurement to the social network aspect, applied to a diverse set of previously unexplored, distinct social contexts.
The study is based on the Dutch population register data, sourced from multiple existing sub-registers that contain information on formal ties and affiliations of ~17 million legal residents in multiple social contexts such as kinship, household, neighborhood, school, and work. With the multiplex network of geospatially embedded formal ties in hand, we aim to observe to what extent areas of social segregation are clustered in geospatially embedded social networks, and how each network layer contributes to the issue. More specifically, we measure to what extent Dutch residents in different municipalities are exposed to individuals of different socio-economic status in diverse social contexts, which social contexts provide diverse social contact opportunities with respect to socio-economic status, and which, on the contrary, act as socio-economic bubbles. Our findings suggest great heterogeneity in socio-economic assortativity between different social contexts (the layers of the analysed network) as well as between municipalities.
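On a toy version of a single network layer, the exposure logic can be sketched as follows (node IDs and SES labels invented): assortativity above zero means ties concentrate within SES groups, i.e. a “bubble”, while values below zero indicate mixing.

    import networkx as nx

    G = nx.Graph(name="work layer")
    G.add_edges_from([(1, 2), (2, 3), (3, 4), (4, 5), (1, 3), (5, 6)])
    ses = {1: "low", 2: "low", 3: "high", 4: "high", 5: "low", 6: "low"}
    nx.set_node_attributes(G, ses, "ses")

    # > 0: ties run within SES groups; < 0: SES groups mix across ties
    print(nx.attribute_assortativity_coefficient(G, "ses"))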

Using microsimulation, administrative data, and supercomputers to realistically model fertility behaviour: the case of fertility preferences and childlessness
Gert Stulp, University of Groningen
Education is a strong driver of whether and when women become mothers. Many different and contradictory mechanisms have been proposed to explain why highly educated women are more likely to remain childless and become mothers at higher ages than less educated women. The demands on data to disentangle these mechanisms are extraordinary, and no dataset exists that allows for this. Microsimulation models can help in this situation by explicitly modelling the mechanisms and comparing the outcomes of the models to real-world outcomes. The simulation models presented here simulate fertility outcomes over the life courses of agents based on behavioural factors, such as preferences and partnership trajectories, and on biological factors that determine the ability to have children, such as the age at sterility, fecundability, and intrauterine mortality. To parametrise the models, we use administrative data from Social Statistics Netherlands and survey data from the LISS panel for the behavioural factors, and findings from reproductive medicine for the biological parameters. To estimate unknown parameters in the models (for which no data are available), we use Approximate Bayesian Computation. This is computationally rather demanding, which is why we used supercomputers. Our simulations reproduce the pattern of unintended childlessness varying strongly across women with different educational levels. Despite higher-educated women preferring to have children at a later age, our simulations showed that these preferences hardly played a role in explaining childlessness. The higher age at cohabitation was the main explanation for the higher unintended childlessness among highly educated women. We discuss how these models can be used to explain the surprising reversal in the gradient between education and fertility that is observed in Scandinavian countries. We end by discussing the advantages and drawbacks of our simulation approach and how it can contribute to family sociology.
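The estimation step can be sketched as rejection-style Approximate Bayesian Computation on a toy stand-in for the microsimulation (all numbers invented): draw candidate parameters from a prior, simulate, and keep draws whose simulated childlessness rate lands close to the observed one.

    import numpy as np

    rng = np.random.default_rng(0)
    observed_childless_rate = 0.18

    def simulate(mean_age_at_cohabitation):
        # Toy stand-in: later cohabitation -> more unintended childlessness
        ages = rng.normal(mean_age_at_cohabitation, 3, size=5000)
        return np.mean(ages + rng.normal(0, 2, size=5000) > 32)

    accepted = []
    for _ in range(10000):
        theta = rng.uniform(22, 34)                 # prior over the unknown parameter
        if abs(simulate(theta) - observed_childless_rate) < 0.01:  # tolerance
            accepted.append(theta)

    # Posterior summary for the unknown parameter
    print(np.mean(accepted), np.percentile(accepted, [2.5, 97.5]))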

  • Chair: Katya Ivanova, Tilburg University
  • Less for more? Cuts to child allowances and long-run child outcomes in larger families: Evidence from a Dutch reform
    Gabriele Mari, Erasmus University Rotterdam
  • Systematic Income Risk
    Giuseppe Floccari, Tilburg University
  • Money, childbearing, gender: explaining within-couple inequality after parenthood
    Weverthon Machado, Utrecht University

Less for more? Cuts to child allowances and long-run child outcomes in larger families: Evidence from a Dutch reform
Gabriele Mari, Erasmus University Rotterdam
Public policies that provide extra income to families with children are widespread. Children in larger families fall more often below the poverty line and typically receive more generous benefit amounts. Yet larger families have been one of the main targets of cutbacks too. Evaluations of how such cutbacks affect children in the long term are few, and much debate remains over whether and how policies aimed at children should target family income at all.
I study a reform of universal child allowances in the Netherlands. The reform curtailed income support from birth to the 18th birthday for second-born or higher-order children born after 1 January 1995. I use register data from Statistics Netherlands (N ≈ 51,000) and a state-of-the-art regression discontinuity (RD) design. The latter combines local linear regression with computational methods to estimate the causal effect of a given treatment around a pre-determined cutoff in the variable that governs treatment assignment, i.e. children’s birthdate. I examine the educational and mental health outcomes of children born around the reform cutoff (1 January 1995), from adolescence to young adulthood. While mental healthcare utilisation seems unaffected, preliminary results suggest that children exposed to the reform may have been less likely to enrol in the secondary-school tracks that give access to higher education in the Dutch system.
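Schematically, and on simulated data rather than the register data used here, the RD step amounts to a local linear regression within a bandwidth around the cutoff, with the treatment coefficient giving the effect at the cutoff:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    days_from_cutoff = rng.integers(-365, 366, size=5000)   # birthdate minus 1 Jan 1995
    treated = (days_from_cutoff >= 0).astype(float)         # reduced allowance
    enrol = (0.5 - 0.03 * treated + 0.0001 * days_from_cutoff
             + rng.normal(0, 0.2, 5000))                    # simulated outcome

    bw = 180  # bandwidth in days; real applications choose this data-drivenly
    inside = np.abs(days_from_cutoff) <= bw
    X = np.column_stack([treated, days_from_cutoff,
                         treated * days_from_cutoff])[inside]
    fit = sm.OLS(enrol[inside], sm.add_constant(X)).fit()
    print(fit.params[1])  # RD estimate of the effect at the cutoff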
Evidence of such negative spillovers across generations may call into question both the efficiency and the equity of cutbacks targeted at larger families, and bolster the case for (extra) income support. Next steps include an analysis of heterogeneous effects, both by birth order and by household socioeconomic status at baseline.
This project is possible thanks to an ODISSEI MAG Grant awarded in 2021.

Systematic Income Risk
Giuseppe Floccari, Tilburg University
We quantify the importance of systematic income risk, defined as the exposure of a worker’s wage to the business cycle, for individuals’ portfolio decisions. Using a novel methodology and matched employer-employee panel data from the Dutch CBS, we document substantial cross-sectional heterogeneity in systematic income risk, conditional on workers’ characteristics such as age, gender and education, as well as on employers’ industry and size. We document that employers pass a large part of business-cycle shocks on to wages, while providing workers with substantial insurance against firm-specific shocks. Consistent with portfolio theory, workers subject to higher systematic risk are less likely to acquire risky assets and choose safer portfolios. To overcome endogeneity issues resulting from individuals jointly choosing their jobs and their portfolios, we develop an instrumental variable approach based on corporate takeovers. Our instrument is based on the idea that corporate events like mergers and acquisitions are largely exogenous to workers and are associated with an increase in income risk. Linking our administrative data to survey data on income expectations, we show that workers are aware of the degree of systematic income risk they face. This evidence also supports the causal interpretation of our instrumental variable results.
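In stylised form, and not necessarily the paper’s exact specification, systematic income risk can be read as the worker-specific loading \beta_i in

    \Delta \log w_{it} = \alpha_i + \beta_i \, \Delta \log Y_t + \varepsilon_{it}

where w_{it} is worker i’s wage and Y_t an aggregate business-cycle indicator such as GDP: the larger \beta_i, the more strongly the wage co-moves with the cycle.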

Money, childbearing, gender: explaining within-couple inequality after parenthood
Weverthon Machado, Utrecht University
Using population register data for the Netherlands, we analyze the child penalty for new parents in three groups of couples: different-sex and lesbian couples with a biological child, and different-sex couples with an adopted child. With a longitudinal design, we follow parents’ earnings from 2 years before to 8 years after the arrival of the child and use event study models to estimate the effects of the transition to parenthood on earnings trajectories. Comparing different groups of couples allows us to test hypotheses related to three types of difference that are nearly impossible to disentangle when studying only heterosexual biological parents: relative earnings, childbearing and gender. Our results offer strong support for gender as the main driver of divergent child penalties: for mothers, the gender of their partner is more consequential for their earnings trajectory than going through pregnancy or being the secondary earner before parenthood.
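A common event-study form for such designs, shown schematically rather than as the authors’ exact model, is

    Y_{it} = \sum_{j \neq -1} \beta_j \, \mathbf{1}[t - E_i = j] + \gamma' X_{it} + \varepsilon_{it}

where E_i is the year of the child’s arrival, the \beta_j trace earnings relative to the year before parenthood, and X_{it} collects controls; comparing the long-run \beta_j of mothers and their partners across the three couple types is what identifies the role of gender.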

  • Chair: Christopher Barrie, University of Edinburgh
  • Computational Text Analysis for Sociology – Opportunities and Challenges
    Ana Macanovic, Utrecht University
  • The complementarity between ICT and cognitive skills in the third industrial revolution: an empirical assessment
    Marie Labussiere, University of Amsterdam
  • WordGraph2Vec: using language constructs to create sentence embeddings
    Marc Ponsen, Statistics Netherlands (CBS)

Computational Text Analysis for Sociology – Opportunities and Challenges
Ana Macanovic, Utrecht University
The emergence of big data and computational tools has introduced new possibilities for using large-scale textual sources in sociological research. Recent work in sociology of culture, science, and economic sociology has shown how computational text analysis can be used in theory building and testing. This paper reviews these advances and outlines, using recent sociological work, how five families of computational methods can assist social scientists with different tasks when analysing text.
Dictionary methods can help researchers seek and quantify concepts of interest in text – including, for instance, affective language or hate speech. Semantic and network analysis tools aid the identification of social actors and actions, facilitating narrative analysis and exploration of social action. Language models can help capture complex relationships – between, for instance, cultural categories – across millions of texts. Clustering methods facilitate exploratory analysis of textual data, assisting researchers in inductively capturing media frames or evaluating ideological leaning in texts. Finally, supervised machine learning classification methods can help researchers easily expand manual coding to an unprecedented number of texts.
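The first of these families can be sketched in a few lines (toy lexicon and texts): a dictionary method simply counts the share of lexicon terms in each text.

    affect_lexicon = {"angry", "happy", "afraid", "love", "hate"}  # toy dictionary

    texts = [
        "I love this neighbourhood but I hate the noise",
        "The committee approved the annual budget",
    ]
    for text in texts:
        tokens = text.lower().split()
        hits = sum(token in affect_lexicon for token in tokens)
        print(hits / len(tokens))  # share of affective words per text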
After exploring these text mining methods, we provide an overview of the challenges social scientists face when using computational text analysis in their work. While new textual sources help overcome the shortcomings of conventional sociological data collection methods when it comes to size and depth, their analysis can be rather challenging (e.g., sparse matrices, large numbers of words, many observations, difficult parameter choices, lack of computational power). Further, we discuss how the often inductive nature of many computational text analysis methods can clash with the more deductive tradition of research in sociology, but can also inspire sociological imagination. This paper closes by emphasizing the importance of theory testing and causal inference in sociological research, even in times of data abundance and computational power.

The complementarity between ICT and cognitive skills in the third industrial revolution: an empirical assessment
Marie Labussiere, University of Amsterdam
Over recent decades, accelerating technological change and the increasing digitalization of the economy have altered the work of many employees, raising questions about which skills are most relevant in changing work environments. Previous literature often argues that technological change has spurred employer demand for high-skilled workers, based on the idea that high-skilled jobs consist of cognitive tasks that are complementary with ICT tools and devices. Although intuitive, this hypothesis of increasing complementarity between cognitive and ICT skills has not been operationalized or directly tested. Previous studies typically analyse ICT skills and cognitive skills separately, without considering how these two sets of skills relate to each other at the job level. Are employers increasingly looking for hybrid skill profiles combining ICT and cognitive skills? This paper uses a unique dataset from Burning Glass Technologies of 60 million job postings in Great Britain from 2012 to 2019 to analyse the joint evolution of ICT and cognitive skills in the skill requirements of job postings. First, we distinguish and extract different types of ICT and cognitive skills using supervised machine learning techniques. This allows for more nuanced typologies of ICT and cognitive skills, in contrast to previous studies that tended to treat these skills as fixed and homogeneous categories. Second, we map the joint evolution of different sets of ICT and cognitive skills over time using dynamic network analysis of co-occurrence graphs. This empirical strategy provides a nuanced account of how skills are reshuffled at the job level, allowing new skill profiles and possible complementarities to be identified.
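The co-occurrence step can be sketched on toy postings (skill labels invented): skills appearing together in the same posting are linked, with edge weights counting how often they co-occur, and the resulting graphs can then be compared over time.

    from itertools import combinations
    import networkx as nx

    postings = [
        {"python", "statistics", "communication"},
        {"python", "machine learning", "statistics"},
        {"communication", "planning"},
    ]
    G = nx.Graph()
    for skills in postings:
        for a, b in combinations(sorted(skills), 2):
            if G.has_edge(a, b):
                G[a][b]["weight"] += 1   # skills co-occur in another posting
            else:
                G.add_edge(a, b, weight=1)

    print(list(G.edges(data=True)))  # e.g. ('python', 'statistics', {'weight': 2})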

WordGraph2Vec: using language constructs to create sentence embeddings
Marc Ponsen, Statistics Netherlands (CBS)
It is estimated that the bulk of today's generated data is unstructured text. This text data may contain information relevant to statistical institutes such as Statistics Netherlands (SN). Deriving information from texts is challenging: the data do not follow a pre-defined data model and may be noisy. Traditional statistical techniques, typically used at statistical institutes, are therefore ill-suited to analysing such data. In recent years, great strides have been made in dealing with this type of data, most notably in the field of natural language processing (NLP). We propose a novel algorithm, WordGraph2Vec (WG2Vec), for analysing text data that combines two aspects of NLP: language models and vector embedding models.
As a first step, WG2Vec uses language models to dissect and understand text on a grammatical level. So-called word graphs are extracted from sentences based on the grammatical properties of (related) words. Word graphs should represent the important phrases in the (larger) text, and the type of word graph can be tailored to the particular problem domain. As a next step, the semantics of the word graphs are obtained using vector embedding models, such as Word2Vec or the Universal Sentence Encoder. WG2Vec then converts sentences into vectors of numbers, i.e., sentence embeddings; sentences whose vectors lie close together should share the same semantic meaning. We applied WG2Vec to the task of finding synonyms in vacancy texts, given a set of skills denoted in a labor market expert system.
In conclusion, WG2Vec analyses texts on both a grammatical and a semantic level, whereas standard language models or word embedding models do only one or the other. We will present preliminary evidence that WG2Vec can indeed efficiently find semantically similar labor market skills in vacancy data.
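The general pipeline can be approximated with off-the-shelf tools. The sketch below is not the authors' WG2Vec implementation: it substitutes spaCy noun chunks for word graphs and averaged word vectors (the en_core_web_md model) for a dedicated sentence encoder, but it shows the grammatical-extraction-then-embedding pattern the abstract describes:

```python
# A rough sketch of the WG2Vec idea, not the authors' implementation:
# extract grammatically coherent phrases with a dependency parser, then
# compare their embeddings to find near-synonymous skill mentions.
# Assumes spaCy with the en_core_web_md model (which ships word vectors).
import spacy

nlp = spacy.load("en_core_web_md")

# A vacancy sentence and a skill from a (hypothetical) expert system.
vacancy = nlp("We seek a data analyst with strong statistical programming skills.")
skill = nlp("quantitative coding ability")

# Noun chunks serve here as a crude stand-in for the paper's word graphs;
# similarity is the cosine similarity between averaged word vectors.
for chunk in vacancy.noun_chunks:
    print(f"{chunk.text!r}: {chunk.similarity(skill):.2f}")
```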

  • Chair: Pearl Dykstra, Erasmus University Rotterdam
  • A Rapid Prediction and Response System (RPARS) to Facilitate Guided Self-Regulation During Pandemics
    Jeffrey Sweeney, Erasmus University Rotterdam
  • Disease avoidance may come at the cost of social cohesion: Insights from a large-scale social networking experiment
    Hendrik Nunner, Vincent Buskens, Rense Corten, Casper Kaandorp, Mirjam Kretzschmar, Utrecht University
  • Understanding the spread of COVID-19 using administrative data from Statistics Netherlands
    Javier Garcia-Bernardo, Utrecht University

A Rapid Prediction and Response System (RPARS) to Facilitate Guided Self-Regulation During Pandemics
Jeffrey Sweeney, Erasmus University Rotterdam
It has become imperative to disrupt viral transmission by proactively guiding individual behavior through a targeted approach. While broad population-oriented technologies such as digital contact-tracing and vaccine passport apps have been tried, advanced privacy-preserving approaches are needed that ingest user information in real time and adapt better to changing epidemiological circumstances. In response, this paper: 1) briefly discusses guided self-regulation as a possible means of disrupting viral transmission through digital technology, 2) outlines a prototype Rapid Prediction and Response System designed to facilitate guided self-regulation in an advanced privacy-preserving way, 3) describes the significance of the prototype system for research and practice, and 4) presents a high-level evaluation plan.

Disease avoidance may come at the cost of social cohesion: Insights from a large-scale social networking experiment
Hendrik Nunner, Utrecht University
Avoidance behavior is a typical behavioral adaptation in times of increased health risks; among other things, it includes avoiding public places or transport during epidemics and avoiding others who show signs of infection. Although it is known that the resulting disruption of our social networks may fundamentally change the course of an epidemic, less is known about the extent to which people actually adapt their behavior and the extent to which this changes the dynamics of disease spread. To study the feedback loop between avoidance behavior, social networks, and the spread of infectious diseases, we propose the “Networking during Infectious Diseases Task” (NIDM), part of an incentivized game-theoretic experiment that rewards maintaining social relations and structures and penalizes acquiring infections. Data from a large-scale (n = 2,879) online experiment reveal that disease avoidance dominates networking decisions, despite a comparatively low penalty for getting infected. As a result, we observe low numbers of infections, but also drastic changes in network structure, such as the deterioration of densely connected clusters. These results imply that focusing on the more obvious signal (disease avoidance) may lead to unwanted and harder-to-foresee side effects (loss of social cohesion). Furthermore, this may reinforce the loss of well-being through social isolation observed during the COVID-19 pandemic, and should therefore be considered in the communication of non-pharmaceutical interventions and in the design of measures to maintain social cohesion during epidemics.
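A stylized calculation (our own toy numbers, not the experiment's actual incentive scheme) illustrates why even a modest infection penalty can dominate networking decisions once a neighbor is likely to be infectious:

```python
# Stylized payoff for a single networking decision in an NIDM-like setting
# (illustrative numbers, not the experiment's actual incentive scheme).
def keep_tie_payoff(benefit, p_neighbor_infectious, p_transmission, penalty):
    # Expected payoff of keeping one tie: its benefit minus the expected
    # infection penalty incurred through that tie.
    return benefit - p_neighbor_infectious * p_transmission * penalty

# A tie is worth keeping only while the expected penalty stays below its benefit.
for p_sick in (0.0, 0.2, 0.5, 0.9):
    payoff = keep_tie_payoff(benefit=1.0, p_neighbor_infectious=p_sick,
                             p_transmission=0.5, penalty=6.0)
    print(f"P(neighbor infectious)={p_sick}: payoff of keeping = {payoff:+.2f}")
```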

Understanding the spread of COVID-19 using administrative data from Statistics Netherlands
Javier Garcia-Bernardo, Utrecht University
Statistics Netherlands collects data on the “affiliation” of Dutch residents, such as the school they attended. These affiliations can be used to create networks in which individuals are connected by different types of relationships: attending the same school, working for the same employer, belonging to the same family, or living in the same household. Here, we analyze the relationship between such affiliation networks and COVID-19 test results from September 2021 to September 2022.
Our analysis yields two results. First, we show the extent to which administrative data represent close-contact interactions. We studied the probability that network neighbors (e.g., classmates) are co-infected, that is, test positive for COVID-19 within the same three-day period, compared with random pairs of individuals in the Netherlands. We found immediate family networks to be the most strongly correlated with co-infection: co-parents, parent-child pairs, and siblings are 10-68 times more likely to be co-infected than random pairs. Immediate family networks are followed by primary school networks (4-11 times more likely), extended family relations (1.5-9), workplaces (3-6), secondary schools (2-6), and post-secondary education (1.1-1.7). Second, we investigated the relative influence of primary schools on the spread of COVID-19. We matched pairs of students who attended the last year of primary school together in 2020/21 into three groups: (i) pairs who attended secondary school together, (ii) pairs who attended different secondary schools in the same area, and (iii) pairs who attended different secondary schools in different areas. We find limited community transmission between students: compared with pairs of students who attended neither primary nor secondary school together but live in the same area, only group (i) showed a marked increase in co-infection, with group (ii) showing a slight increase. Our results offer insights into the potential of networks created from administrative data and into the importance of school closures as a response to outbreaks of airborne diseases.
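For readers unfamiliar with the co-infection measure, the sketch below shows how such a ratio can be computed; the test dates are invented, as the real analysis runs on Statistics Netherlands microdata:

```python
# Minimal sketch of the co-infection measure: the share of pairs testing
# positive within the same three-day window, compared between connected
# pairs and random pairs. All test dates below are invented.
def coinfected(day_a, day_b, window=3):
    # A pair is co-infected if both test positive within `window` days.
    return day_a is not None and day_b is not None and abs(day_a - day_b) <= window

def rate(pairs):
    return sum(coinfected(a, b) for a, b in pairs) / len(pairs)

# Each tuple holds the day of a positive test for two people (None = never).
sibling_pairs = [(10, 11), (40, 200), (None, 5), (87, 88)]
random_pairs = [(10, 300), (None, 5), (50, 51), (None, None)]

print(f"siblings are {rate(sibling_pairs) / rate(random_pairs):.1f}x as likely "
      "to be co-infected as random pairs")
```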

  • Chair: Michel Dumontier, Maastricht University
  • Panl, the Dutch participant recruitment platform for SSH research
    Martin Tanis, VU Amsterdam
  • SANE — Analysing sensitive data in a secure analysis environment
    Lucas van der Meer, Erasmus University Rotterdam
  • Accelerating Social Science Knowledge Production with an Open-Source Model: What Have We Learned from the Comparative Panel File
    Konrad Turek, NIDI (Netherlands Interdisciplinary Demographic Institute)

Panl, the Dutch participant recruitment platform for SSH research
Martin Tanis, VU Amsterdam
In 2019 we were awarded a grant from PDI-SSH (Platform Digital Infrastructure Social Science and Humanities) to develop panl, an online recruitment platform for research participants. With panl, we offer researchers the possibility to get in touch with research participants. Much research in the social sciences and humanities relies on input that can only be obtained from people, including experiments and surveys, the development of tests and measurement tools, and annotations of texts, images, or other artefacts. Although international online platforms exist to recruit participants (e.g. MTurk, Prolific, Figure8), these are unsuited for research that is bound to the Dutch linguistic and/or cultural context. In addition, GDPR regulations forbid Dutch researchers from storing data outside EU borders. All in all, these were sufficient reasons to develop panl and to make the platform available to all researchers in the social sciences and humanities in the Netherlands. In this presentation we will demonstrate panl and pay special attention to three topics: (1) how to achieve and maintain good data quality, (2) how to balance participant privacy against the effective use of eligibility criteria, and (3) how to create robust relationships with participants.

SANE — Analysing sensitive data in a secure analysis environment
Lucas van der Meer, Erasmus University Rotterdam
Privacy, copyright, and competition barriers limit the sharing of sensitive data for scientific purposes. We propose the Secure Analysis Environment (SANE): a virtual container in which researchers can analyse sensitive data while the data provider remains in complete control. By following the Five Safes principles, SANE will enable researchers to conduct research on data that have so far been hardly available to them. SANE comes in two variants. Tinker SANE allows the researcher to see, manipulate and play with the data. In Blind SANE, the researcher submits an algorithm without being able to see the data, and the data provider approves both the algorithm and its output.
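A conceptual sketch of the Blind SANE workflow (our own illustration; SANE's actual interface may differ) shows the two approval gates: the provider vets the submitted analysis and then vets its output before anything leaves the environment:

```python
# Conceptual sketch of a Blind SANE-style workflow (our illustration, not
# SANE's actual API): the researcher never sees the data; the provider
# reviews the submitted analysis and its output before release.
def blind_run(analysis, data, approve_code, approve_output):
    if not approve_code(analysis):            # provider vets the algorithm
        raise PermissionError("analysis rejected by data provider")
    result = analysis(data)                   # runs inside the secure environment
    if not approve_output(result):            # provider vets the output
        raise PermissionError("output rejected by data provider")
    return result                             # only approved results leave

# Toy example: mean income on sensitive microdata the researcher cannot inspect.
incomes = [31000, 45000, 28000, 52000]        # stays with the data provider
mean_income = blind_run(
    analysis=lambda rows: sum(rows) / len(rows),
    data=incomes,
    approve_code=lambda f: True,              # stand-in for manual review
    approve_output=lambda r: r > 0,           # stand-in for a disclosure check
)
print(mean_income)
```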

Accelerating Social Science Knowledge Production with an Open-Source Model: What Have We Learned from the Comparative Panel File
Konrad Turek, NIDI (Netherlands Interdisciplinary Demographic Institute)
The growing complexity of innovation and knowledge production requires the social sciences to develop more open cooperation and management schemes. The explanatory power and applicability of social science increasingly depend on the ability to utilize diverse resources, build broad collaborations, and coordinate shared work. A particular problem is the field's slow responsiveness to societal challenges, such as pandemics and rising inequalities. Conventional knowledge infrastructures tend to be self-limiting in these respects, as they structurally constrain flexibility and open development.
I will argue that the open-source model is a promising and intriguing organizational scheme that can help open up the knowledge infrastructure of the social sciences. First, it offers flexibility, decentralized control, and community-based development, making it possible to break path dependencies in data management and opening possibilities for new research questions and applications. Second, it can allow social science to identify and respond to societal challenges more effectively, e.g., by improving access to high-quality data and the speed with which they are processed. Third, the open-source model embodies the fundamental idea of science as a communal process, in which researchers benefit from sharing their efforts and contribute to faster and more ambitious scientific progress.
Although crowd-based collaboration is attracting growing attention from the academic community, it rarely appears in the social sciences. I will consider potential areas for its successful implementation, e.g., survey harmonization, secondary-data analysis, computational social science, and computational simulations. As a case study, I will refer to the first open-source survey harmonization project, the Comparative Panel File, and share my two-year experience of leading this experimental initiative. Finally, I will discuss the role, potential, and limitations of the open-source model in opening up and accelerating social science knowledge production.

Matthew Salganik is Professor of Sociology at Princeton University, and Director of the Center for Information Technology Policy. He is also affiliated with several of Princeton’s interdisciplinary research centers including the Office of Population Research and the Center for Statistics and Machine Learning. His research interests include social networks and computational social science. He is the author of Bit by Bit: Social Research in the Digital Age (Princeton University Press, 2019).

Chair: Suze Zijlstra (ODISSEI)

Room: Progress

Abstract

The Unpredictability of Life Outcomes
Matthew Salganik, Princeton University
Researchers have long theorized about the processes through which family background and childhood experiences shape life outcomes. However, statistical models that use data on family background and childhood experiences to predict life outcomes often have poor predictive performance. In this talk, we present results from three interrelated studies of the predictability of life outcomes: a scientific mass collaboration involving hundreds of participants, a high-throughput study using hundreds of machine learning pipelines to predict hundreds of life outcomes, and a qualitative study involving in-depth interviews with 40 families. Collectively these studies help to assess and understand the limits to predictability of life outcomes, which has implications for social science theory and for algorithmic decision-making in high-stakes settings.
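For intuition, here is a hedged sketch of such a high-throughput predictability check, using synthetic data and off-the-shelf scikit-learn pipelines rather than the mass collaboration's actual setup: when the signal is weak, even flexible models reach only modest held-out R².

```python
# Hedged sketch of a high-throughput predictability check: fit several
# off-the-shelf pipelines on synthetic family-background-style features
# and report held-out R^2. Illustrative only.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 50))            # background covariates
y = X[:, 0] * 0.3 + rng.normal(size=2000)  # mostly noise, weak signal

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
for model in (Ridge(), RandomForestRegressor(n_estimators=100, random_state=0)):
    score = r2_score(y_te, model.fit(X_tr, y_tr).predict(X_te))
    print(type(model).__name__, round(score, 3))
```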

Pearl Dykstra is ODISSEI’s Scientific Director and Professor of Empirical Sociology at the Erasmus School of Social and Behavioural Sciences.

Room: Progress


Photo by Jaarbeurs Media Plaza