Benchmarking
Benchmarking means setting up a system in which different approaches to the same task can be compared, so that the strengths and weaknesses of various modelling approaches can be analysed and new insights gained. For computer scientists, benchmarks provide a standardized way to assess predictive models, one that has proven able to detect major breakthroughs such as deep learning. In the social sciences, benchmarking can help establish which methods and models are most suitable for answering specific research questions.
In a benchmarking challenge, teams of participants are set the task of predicting a particular outcome. At the end of the challenge, the teams’ performances are evaluated against a predefined set of metrics and evaluation criteria. Participants are provided with a training dataset that includes an outcome variable (dependent variable) and a range of predictor variables (independent variables). They then train a model on this dataset that can predict the outcome variable from the values of the predictors. The resulting model is evaluated by predicting values in a holdout dataset: a dataset, typically containing around 20% of the observations, to which the participants have not had access. The winner of the challenge is the group that most effectively predicts the target variable in the holdout data.
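The train-and-holdout procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are synthetic, the column names `education` and `precarious` are hypothetical, the "model" is a toy lookup of the most common outcome per predictor value, and accuracy stands in for whatever metrics a real challenge predefines. In an actual challenge the organizers keep the holdout data; here the split is done in one script purely to show the mechanics.

```python
import random


def split_holdout(rows, holdout_fraction=0.2, seed=0):
    """Shuffle the rows and set aside a fraction as the holdout set."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_holdout = int(round(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]


def train_majority_by_group(train_rows, predictor, outcome):
    """Toy model: predict the most common outcome for each predictor value."""
    per_value = {}
    overall = {}
    for row in train_rows:
        counts = per_value.setdefault(row[predictor], {})
        counts[row[outcome]] = counts.get(row[outcome], 0) + 1
        overall[row[outcome]] = overall.get(row[outcome], 0) + 1
    # Fall back to the overall majority for predictor values unseen in training.
    default = max(overall, key=overall.get)
    table = {value: max(c, key=c.get) for value, c in per_value.items()}
    return lambda row: table.get(row[predictor], default)


def accuracy(model, rows, outcome):
    """Share of rows for which the model predicts the outcome correctly."""
    correct = sum(1 for row in rows if model(row) == row[outcome])
    return correct / len(rows)


# Synthetic data (an assumption for illustration, not real survey or register data).
data_rng = random.Random(42)
risk = {"low": 0.7, "middle": 0.4, "high": 0.2}
data = []
for _ in range(1000):
    education = data_rng.choice(list(risk))
    data.append({
        "education": education,
        "precarious": data_rng.random() < risk[education],
    })

train, holdout = split_holdout(data, holdout_fraction=0.2)
model = train_majority_by_group(train, "education", "precarious")
holdout_accuracy = accuracy(model, holdout, "precarious")
print(f"train={len(train)} holdout={len(holdout)} accuracy={holdout_accuracy:.2f}")
```

The key design point is that the model never sees the holdout rows during training, so the holdout accuracy estimates how well the model predicts for people it has not encountered.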
While benchmarking challenges are popular in data science, machine learning and other fields, there have been very few in the social sciences. Therefore, during the 2022 ODISSEI-SICSS Summer School, we organized a social science benchmark with approximately 20 participants, who were divided into six groups. The aim of the challenge was to predict precarious employment (defined using a combination of income level and contract type) in 2020, based on predictors from 2010 or earlier. Each team was provided with a simple baseline training dataset that included the outcome variable and several basic demographic indicators. The participants could also search for additional datasets and link them at the individual level to the training data.
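Linking an additional dataset to the training data at the individual level amounts to a join on a person identifier. The sketch below is a minimal illustration only: the identifier `person_id`, the column names, and the records are all hypothetical and do not reflect the actual data structure used in the challenge.

```python
def link_by_person(baseline_rows, extra_rows, key="person_id"):
    """Left-join extra_rows onto baseline_rows using a shared person identifier.

    Every baseline row is kept; people without a match in extra_rows
    get None for the added columns.
    """
    lookup = {row[key]: row for row in extra_rows}
    extra_columns = sorted({col for row in extra_rows for col in row if col != key})
    linked = []
    for row in baseline_rows:
        match = lookup.get(row[key], {})
        merged = dict(row)
        for column in extra_columns:
            merged[column] = match.get(column)
        linked.append(merged)
    return linked


# Hypothetical baseline training data: outcome plus a basic demographic indicator.
baseline = [
    {"person_id": 1, "birth_year": 1980, "precarious_2020": True},
    {"person_id": 2, "birth_year": 1975, "precarious_2020": False},
    {"person_id": 3, "birth_year": 1990, "precarious_2020": False},
]

# Hypothetical additional dataset with a predictor measured in 2010.
education_2010 = [
    {"person_id": 1, "education_2010": "low"},
    {"person_id": 3, "education_2010": "high"},
]

linked = link_by_person(baseline, education_2010)
for row in linked:
    print(row)
```

A left join is used here so that no person from the training data is dropped when the additional dataset lacks a record for them; how to handle the resulting missing values is a separate modelling choice.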
The teams' submissions were evaluated quantitatively, based on the accuracy with which they predicted precarious employment in the holdout dataset (to which the teams had no access when training their models). The submissions were also evaluated qualitatively by external experts, based on a short narrative in which each team explained its method and how that method relates to existing theories. The experts ranked these narratives on two criteria: embeddedness in existing research and literature, and innovativeness. As this was the first time a social science benchmarking challenge was set up using the rich Dutch administrative data, the experience and results of this challenge provide crucial insight into how benchmarking for the social sciences can be further improved. The ODISSEI benchmarking team expects to build on this with a follow-up benchmarking challenge.
Project team Benchmarking: Paulina Pankowska (Utrecht University – Task Leader), Adriënne Mendrik (Eyra), and Daniel Oberski (Utrecht University).
Questions regarding Benchmarking? Contact Tom Emery (ODISSEI Coordination Team).