Privacy Preserving Techniques

Data of relevance to the social sciences are increasingly becoming available for analysis. These data offer an unparalleled resource for challenging long-standing paradigms of the human experience, including ethical and philosophical concerns. Such data are often (sensitive) personal data and require special attention and protection, which also makes them difficult to access for a large part of the research community. Sharing personal data typically requires legal agreements among stakeholders, which may take years to put in place, if they succeed at all.

In this ODISSEI task, we aim to provide a viable alternative to procuring real-world data by developing a novel method to generate synthetic social science data. These synthetic data would share the characteristics of real-world data but would not contain records derived from any particular individual. Instead, our approach learns a model of the real-world data so that it can generate records that are structurally and statistically similar to them.

Existing synthetic data generators are mostly built on the joint distribution of the data (e.g., Synthpop) or on conditional probability distributions of the data (e.g., Bayesian network-based models). These generative models usually require a comparatively large amount of effort and prior knowledge to build and use. They also have difficulty handling complex datasets (such as multivariate categorical data or imbalanced data) and representing higher-order dependencies in the data. These can be crucial drawbacks for generating high-quality synthetic data for social science research, because social science data are typically heterogeneous and contain important and complex dependencies among variables.

Our data-driven approach therefore shows promise, with greater flexibility in modelling distributions, better capture of dependencies among variables, and more realistic simulation of individual samples than conventional methods. Additionally, we use differential privacy techniques to add noise to the generation process, thereby greatly reducing the risk of reconstructing the original data. However, the more strongly differential privacy is applied, the less the synthetic data resemble the original data, and the less useful they become as a direct proxy for those data. The development of synthetic social science data also raises key questions about their legal status and broader acceptance in society.
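The trade-off between privacy strength and usefulness mentioned above can be illustrated with a minimal sketch. It does not reproduce the generator developed in this task. It applies the textbook Laplace mechanism to a small table of counts and assumes that neighbouring datasets differ by adding or removing one record, so the sensitivity of the counts is 1. The category names, the counts, and the epsilon values are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two independent exponentials with mean `scale`
    # follows a Laplace distribution centred at 0 with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_histogram(counts, epsilon, rng):
    # Laplace mechanism for counts. Assumption: one record changes the
    # histogram by at most 1 in total, so the noise scale is 1 / epsilon.
    scale = 1.0 / epsilon
    return {key: value + laplace_noise(scale, rng) for key, value in counts.items()}


def mean_abs_error(counts, epsilon, trials, rng):
    # Average absolute deviation per category over repeated noisy releases.
    total = 0.0
    for _ in range(trials):
        noisy = dp_histogram(counts, epsilon, rng)
        total += sum(abs(noisy[k] - counts[k]) for k in counts) / len(counts)
    return total / trials


# Hypothetical counts, invented for illustration only.
true_counts = {"primary": 420, "secondary": 310, "tertiary": 270}

rng = random.Random(0)
errors = {eps: mean_abs_error(true_counts, eps, 2000, rng) for eps in (0.1, 1.0, 10.0)}

for eps, err in errors.items():
    print(f"epsilon={eps}: mean absolute error per category = {err:.2f}")
```

A smaller epsilon means a stronger privacy guarantee and larger noise, so the released counts drift further from the true ones. A deep generative model would typically inject noise during training rather than into released counts, but the direction of the trade-off is the same.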

The main goal of the project is to develop a synthetic data generator framework using artificial intelligence technologies while concurrently exploring ethical-legal perspectives on the trade-off between data privacy and the potential use of synthetic representations. We will study 1) the quality of synthetically generated data relative to real-world data as a function of privacy cost, 2) how well multi-attribute relations are preserved in the face of increased individual variation, and 3) the utility of synthetic data in certain kinds of social science research.

Together with our collaborators from the Inspectorate of Education, the Netherlands Initiative for Education Research (NRO), and CBS, we will conduct several regression analyses on both real and synthetic data to study if and how students with Special Emotional Needs (SEN) affect non-SEN students. We will test the effects of privacy preservation on the utility of synthetic data in this social science use case. We will also collaborate on synthetic data generation with Erik-Jan van Kesteren (ODISSEI Social Data Science (SoDa) Team), with whom we share research interests in balancing data privacy and synthetic data utility.
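The idea of running the same regression on real and synthetic data and comparing the results can be sketched as follows. This is an illustration under strong assumptions, not the analysis planned for the SEN use case. The "real" data are simulated, `toy_synthesizer` is a hypothetical stand-in that samples from a fitted linear-Gaussian model, and `noise_sd` only loosely mimics a privacy mechanism by perturbing the released predictor. It carries no formal differential privacy guarantee.

```python
import random
import statistics


def ols(xs, ys):
    # Simple least-squares fit of ys on xs; returns (intercept, slope).
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def toy_synthesizer(xs, ys, n, noise_sd, rng):
    # Hypothetical stand-in generator: fit a linear-Gaussian model to the
    # real data, sample n new records from it, then perturb the predictor.
    intercept, slope = ols(xs, ys)
    resid_sd = statistics.pstdev([y - (intercept + slope * x) for x, y in zip(xs, ys)])
    mx, sx = statistics.fmean(xs), statistics.pstdev(xs)
    syn_x, syn_y = [], []
    for _ in range(n):
        x = rng.gauss(mx, sx)
        y = intercept + slope * x + rng.gauss(0.0, resid_sd)
        syn_x.append(x + rng.gauss(0.0, noise_sd))  # noise on the released predictor
        syn_y.append(y)
    return syn_x, syn_y


# Simulated "real" data with a true slope of 0.5 (invented for illustration).
rng = random.Random(42)
real_x = [rng.gauss(0.0, 1.0) for _ in range(2000)]
real_y = [1.0 + 0.5 * x + rng.gauss(0.0, 1.0) for x in real_x]
_, real_slope = ols(real_x, real_y)

# Utility check: how far does the synthetic slope drift from the real one?
slope_gaps = {}
for noise_sd in (0.0, 0.5, 2.0):
    syn_x, syn_y = toy_synthesizer(real_x, real_y, 2000, noise_sd, rng)
    _, syn_slope = ols(syn_x, syn_y)
    slope_gaps[noise_sd] = abs(syn_slope - real_slope)
    print(f"noise_sd={noise_sd}: real slope={real_slope:.3f}, synthetic slope={syn_slope:.3f}")
```

In this toy setting, noise on the predictor pulls the synthetic slope towards zero, so the gap between the real and synthetic estimates is one simple measure of lost utility. A real evaluation would also compare standard errors and the substantive conclusions drawn from each model.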

From an ethical-legal perspective, this task will focus on an in-depth legal analysis of EU law, regulation and policy (particularly, but not exclusively, the General Data Protection Regulation (‘the GDPR’)) and of Dutch law, regulation and policy pertaining to the generation and use of synthetic data from personal data. The goal of this work package is to create a legal framework that allows synthetic data to be used as an alternative to real-world data while meeting the standards set by EU law (in particular the GDPR), regulation and policy.

We publish our code as open source and our publications as open access [GitHub].

Project team (Maastricht University): Michel Dumontier (Task leader), Chang Sun, Birgit Wouters, Carlos Utrilla Guerrero

Questions regarding Privacy preserving techniques? Contact Tom Emery (ODISSEI Coordination Team).