By Chang Sun – Maastricht University
The main goal of the project is to develop a synthetic data generation framework using artificial intelligence, while exploring ethical and legal perspectives on the trade-off between data privacy and the potential use of synthetic representations. We will study 1) the quality of synthetically generated data relative to real-world data as a function of privacy cost, 2) how well multi-attribute relations are preserved in the face of increased individual variation, and 3) the utility of synthetic data in certain kinds of social science research.
Together with our collaborators from the Inspectorate of Education, the Netherlands Initiative for Education Research (NRO), and CBS, we will conduct several regression analyses on both real and synthetic data to study whether and how students with Special Emotional Needs (SEN) affect non-SEN students. In this social science use case, we will test how privacy preservation affects the utility of synthetic data. We will also collaborate on synthetic data generation with Erik-Jan van Kesteren (ODISSEI Social Data Science (SoDa) Team), with whom we share research interests and a common direction in balancing data privacy and synthetic data utility.
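One simple way to make this utility comparison concrete is to fit the same regression to real and synthetic data and compare the estimated coefficients. The sketch below is illustrative only. It uses simulated toy data rather than CBS data, and a plain linear model rather than the project's actual specification. The helper names `ols_coefficients` and `coefficient_gap` are hypothetical, and the "synthetic" sample is drawn from the same toy model with extra noise, not produced by a GAN.

```python
import numpy as np

def ols_coefficients(X, y):
    """Least-squares coefficients (intercept first) for y ~ 1 + X."""
    design = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta

def coefficient_gap(real_X, real_y, synth_X, synth_y):
    """Absolute per-coefficient difference between real and synthetic fits."""
    return np.abs(ols_coefficients(real_X, real_y)
                  - ols_coefficients(synth_X, synth_y))

# Toy "real" data (assumption: simulated, not actual project data).
rng = np.random.default_rng(0)
n = 5000
real_X = rng.normal(size=(n, 2))
real_y = 1.0 + 0.5 * real_X[:, 0] - 0.3 * real_X[:, 1] \
    + rng.normal(scale=0.1, size=n)

# Toy "synthetic" data (assumption: same model, noisier outcome,
# standing in for the output of a privacy-preserving generator).
synth_X = rng.normal(size=(n, 2))
synth_y = 1.0 + 0.5 * synth_X[:, 0] - 0.3 * synth_X[:, 1] \
    + rng.normal(scale=0.5, size=n)

gap = coefficient_gap(real_X, real_y, synth_X, synth_y)
print("coefficient gap (intercept, x1, x2):", gap)
```

A small gap suggests the synthetic data supports similar regression conclusions. Repeating the comparison at different privacy levels would show how utility changes as privacy protection increases.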
We will train on a large volume of social science data (millions of records) and generate synthetic data using a fully data-driven method, Generative Adversarial Networks (GANs). To run such computationally expensive experiments, we will use a GPU node of the ODISSEI Secure Supercomputer (OSSC, on the Snellius system), facilitated by SURF. OSSC provides a secure environment that uses an end-to-end VPN to connect to the CBS Remote Access Environment, where we can run our generator model on a GPU directly on the CBS data. This makes it possible to run deep learning-based generators efficiently on large, sensitive datasets.
The code and publications from this project are shared on GitHub as open source and open access.
Photo by Towfiqu barbhuiya on Unsplash