Data Facility

The ODISSEI Data Facility is a cluster of systems to strengthen the findability, accessibility, interoperability, and reusability of research data in the Netherlands.

It is a work stream in the ODISSEI Roadmap project. The other work streams are the Observatory, Laboratory and Hub.

The Data Facility consists of three closely interrelated tasks:

  1. Widening CBS Microdata access
  2. ODISSEI Secure Supercomputer
  3. ODISSEI Portal

1.1 Widening CBS Microdata access

The use of the Statistics Netherlands secure Microdata Services Facility has been growing rapidly and this is anticipated to continue over the coming years within the context of ODISSEI. Statistics Netherlands’ Microdata Services Facility was started more than fifteen years ago as a small-scale service to selected researchers. It has now become a standard facility for an extended research community. For that reason, the facility has been included in the Dutch national landscape of research infrastructures. The Microdata Services Facility has grown such that it requires the direction and investment of the research community to fulfil its potential. For example, requirements for regular (i.e. non-supercomputer) computation and associated services are increasing to a level that can no longer be supplied independently by Statistics Netherlands. The research community requires services and infrastructure beyond the remit of Statistics Netherlands. To address this, ODISSEI will create a more comprehensive and sustainable basis for linkage of administrative microdata, at the levels of IT infrastructure, tooling, and data stewardship.

ODISSEI provides additional data stewardship via experts at Statistics Netherlands. The data stewardship will be provided to projects selected through open calls via the Microdata Access Program which will be administered by the Coordination Team at EUR, for both the Statistics Netherlands Microdata Services Facility and ODISSEI Secure Supercomputer. Due to the growing complexity of both research data and the various analytical tools, leaving it to researchers to find their way through the administrative data catalogue and range of available compute options is becoming increasingly inefficient. Support from substantive data experts, data analysts, and technical consultants is required. Data stewards will be involved in the support of individual research projects but will also be available as a help desk.

In addition, ODISSEI integrates Statistics Netherlands tools, both existing and to be further developed, into the ODISSEI Portal. Linking large microdata sets correctly, efficiently, and quickly is not easy for every researcher, so this phase of the research cycle would benefit strongly from flexible, easy-to-use tools. Internally at Statistics Netherlands, the SSD (System of Social Statistical Datasets) includes sets of tools that employees use for the flexible selection and linkage of populations and variables. The project will explore and test whether this facility can be offered as is to external users of the Microdata Services Facility.

Status: CBS Microdata and the related ODISSEI grants are available for use. Other elements are under development and will be delivered on a continuous basis.

Project team Widening CBS Microdata Access: Ruurd Schoonhoven (CBS – Task Leader), John Kartopawiro (CBS).

Questions? Contact Kasia Karpinska (ODISSEI Coordination Team).

1.3 ODISSEI Secure Supercomputer

The ODISSEI Secure Supercomputer (OSSC) has shown its massive potential in large-scale analysis and processing of sensitive data on high-performance computing (HPC) facilities. The OSSC has also been used as a Trusted Third Party (TTP) platform for secure and trusted linking of research data with Statistics Netherlands data. Currently, user identification and access to the OSSC platform go through the Remote Access environment of Statistics Netherlands. To support data providers other than Statistics Netherlands, it is imperative that ODISSEI builds on this success by extending the OSSC into a scalable and generic secure high-performance computing platform for processing and analysing sensitive data for research purposes.

As part of ODISSEI, SURF will create a scalable, secure data transfer environment for moving large and privacy-sensitive data to a supercomputer storage cluster. The connection will enable the analysis of data held by ODISSEI member organisations and other data sources, whilst allowing the data controllers at these organisations to remain in full control of the data throughout. Transferring data to the cluster will be supported by DANS through the ODISSEI Data Node. SURF will act as a Trusted Third Party by combining multiple sensitive datasets in a secure manner (see pilot project 2). The researcher can then perform the analysis on the ODISSEI Secure Supercomputer. This environment will also scale easily to multiple use cases and will be able to handle the transfer of large amounts of data in a timely manner. After any necessary disclosure check by the data controller, the output data are released to the researcher.
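
The text does not specify how the Trusted Third Party links records. As a purely illustrative sketch, one common approach is for each data provider to replace direct identifiers with keyed pseudonyms before transfer, so that the linking party only ever joins on pseudonyms. The function names, the shared key, and the sample records below are hypothetical and do not describe SURF's actual implementation.

```python
import hashlib
import hmac


def pseudonymise(identifier: str, key: bytes) -> str:
    """Return a keyed pseudonym (HMAC-SHA256) for a direct identifier."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


def pseudonymise_records(records: dict, key: bytes) -> dict:
    """Run by each data provider: replace identifiers with pseudonyms.

    `records` maps a direct identifier to a dict of attributes.
    """
    return {pseudonymise(ident, key): attrs for ident, attrs in records.items()}


def link_on_pseudonym(left: dict, right: dict) -> list:
    """Run by the linking party: join two pseudonymised datasets.

    Only records whose pseudonym occurs in both datasets are returned,
    and the pseudonym itself is left out of the linked output.
    """
    shared = sorted(left.keys() & right.keys())
    return [{**left[p], **right[p]} for p in shared]


# Hypothetical example data; the key is assumed to be agreed between the
# providers and never shared with the researcher.
SHARED_KEY = b"example-key-not-for-real-use"
registry = pseudonymise_records({"id-001": {"income_band": "B"}}, SHARED_KEY)
survey = pseudonymise_records(
    {"id-001": {"wellbeing": 7}, "id-002": {"wellbeing": 5}}, SHARED_KEY
)
linked = link_on_pseudonym(registry, survey)
```

In this sketch the linked records contain attributes only; whether and how output is released would still depend on the disclosure check by the data controller described above.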

Figure: OSSC Architecture. Architecture of the ODISSEI Secure Supercomputer that has been improved for generic use.

In the pilot phase, the emphasis was on supporting typical high-performance workloads. Users, however, tend to have more diverse needs: some require a ‘classic’ supercomputer cluster for batch-like workloads, while others simply require a ‘bigger’ workstation for interactive work. SURF will consult the ODISSEI community and, in an iterative fashion, diversify and increase the accessibility of the compute facilities available via the ODISSEI Secure Supercomputer through a gradually expanding set of open calls for new and increasingly diverse projects. SURF will offer access to different infrastructures, more cloud-based data-analytics tools such as RStudio and Jupyter Notebooks, and an intuitive interface to Python, R, and Stata for accessing and processing data. SURF will also ensure the further integration of storage systems, so that the storage on the supercomputer cluster and HPC Cloud VMs at SURF are unified from the user perspective.

Status: The ODISSEI Secure Supercomputer was opened up to the ODISSEI community on 1 October 2020. Improvements will be delivered on a continuous basis.

Project team ODISSEI Supercomputer: Narges Zarrabi (SURF – Task Leader), Annette Langedijk (SURF), Maxime Mogé (SURF), Michel Scheerman (SURF), John Kartopawiro (CBS), Ruurd Schoonhoven (CBS).

Questions regarding the ODISSEI Secure Supercomputer? Contact Lucas van der Meer (ODISSEI Coordination Team).

1.4 ODISSEI Portal

The ODISSEI Portal combines metadata from a wide variety of research data repositories into a single interface, allows advanced semantic queries to support findability, and facilitates data access.

The Portal, the Data Node, and the Secure Supercomputer, together with the Microdata Facilities, form the ODISSEI Data Facility.

Add data

The ODISSEI Portal’s metadata catalogue will extend the coverage of available data by including key research datasets that are currently not findable via NARCIS (the main national portal for information about researchers and their work). The project will extend the current catalogue with the metadata of (a) all datasets of Statistics Netherlands, including the metadata of the microdata catalogue, (b) all datasets developed in the ODISSEI Laboratory (LISS), and (c) all datasets developed in the ODISSEI Observatory (EVS, GGP, SHARE, ESS, NTR, HSN).

This dataset extension task is a joint effort of trained data stewards at DANS, the ODISSEI team of Data Scouts at the Observatory, and the aforementioned data repositories. In collaboration with these partners and experts at VU Amsterdam, this task will also develop a metadata ingestion pipeline to make sure that the Portal can be maintained and kept up to date. This pipeline will be used by the hosting party to add new datasets during and after the end of the project. All new datasets that will become available via the ODISSEI Portal will also be added to the national NARCIS research dataset catalogue.

Search functionality

Existing tools for findability in the social sciences are limited in that they only identify specific terms or synonyms (e.g. the United Kingdom question bank or the Question Variable Database). The ODISSEI Portal will extend and improve search functionality by using semantic queries, which will enable broader probabilistic matching and link functions over an enriched knowledge graph representation of the FAIR metadata catalogue. This incorporates the context of specific terms, which is crucial in social research. By using rich and extensible data structures developed within the linked data community, ODISSEI will evolve the relatively flat metadata catalogues in use today into the highly interlinked, graph-based structures needed to conduct advanced semantic searches. The richness of the metadata catalogue does not rely solely on the high manual documentation and curation standards, such as DDI, that already exist across ODISSEI-associated data. The social sciences are fortunate to have advanced automated metadata capture that documents the data collection process, principally through survey software. Besides the manual documentation and curation standards, the Portal takes advantage of automatic and semi-automatic metadata enrichment and entity linking to enrich the available curated information. Where relevant, there will be alignment with standards used in CESSDA (CESSDA Metadata Model) at the European level and the Open Science Framework.
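
To illustrate why a graph-based catalogue can find more than term matching, the following minimal sketch expands a query concept along "narrower" links before matching datasets. The concept hierarchy and dataset names are invented for illustration, and the deterministic traversal merely stands in for the richer probabilistic matching described above; it is not the Portal's implementation.

```python
from collections import deque

# Hypothetical concept hierarchy: concept -> list of narrower concepts.
NARROWER = {
    "employment": ["unemployment", "self-employment"],
    "unemployment": ["youth unemployment"],
}

# Hypothetical catalogue: dataset name -> concepts it is annotated with.
DATASETS = {
    "survey-a": {"youth unemployment"},
    "survey-b": {"self-employment"},
    "survey-c": {"health"},
}


def expand(concept: str) -> set:
    """Return the concept plus all concepts reachable via 'narrower' links."""
    seen = {concept}
    queue = deque([concept])
    while queue:
        current = queue.popleft()
        for child in NARROWER.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def keyword_search(concept: str) -> list:
    """Baseline: match only datasets annotated with the exact concept."""
    return sorted(name for name, tags in DATASETS.items() if concept in tags)


def semantic_search(concept: str) -> list:
    """Match datasets annotated with the concept or any narrower concept."""
    concepts = expand(concept)
    return sorted(name for name, tags in DATASETS.items() if tags & concepts)
```

With this toy data, `keyword_search("employment")` finds nothing because no dataset carries that exact annotation, whereas `semantic_search("employment")` returns the two datasets annotated with narrower concepts.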

Access management

The Portal also facilitates automatic and semi-automatic data access policy management between the producers and users of research datasets. Unclear data licensing or access policies are currently an obstacle to open science and the application of the FAIR principles, even for research datasets that are available as open data. ODISSEI will enrich its research data catalogue with explicit information on licensing and access policies that is as detailed as possible, preferably in a machine-readable format. The owners of each dataset will be able to provide the Portal with metadata describing what the policy for obtaining access entails. The access process varies between data providers: Statistics Netherlands requires that the user is affiliated with an authorised research institute, and using its data involves formalities and costs, whereas other research data are often freely available for download to anyone around the globe. For datasets with machine-readable access policy metadata, the ODISSEI Data Node, an automated system that is closely connected to the Portal, will be able to assist the researcher, for example by sending a data access request to the data owner, by initiating a federated authentication session, or by redirecting researchers to the landing page of the open dataset. If a dataset does not yet have fully machine-readable access policy metadata, the ODISSEI Data Steward based at EUR will help the data owner and researcher with the access process.

Once the data owner reaches an agreement with the researchers, the owner allows the ODISSEI Data Node to transfer the data to the designated analysis environment, typically the ODISSEI Secure Supercomputer (in case of large, complex or sensitive data) or the computer of the researcher (in case of small and/or open data).
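
The access routes described above can be sketched as a small dispatch on machine-readable policy metadata. This is a minimal illustration under stated assumptions: the `AccessPolicy` fields, the `AccessMode` values, and the example URL are hypothetical and do not reflect an actual ODISSEI metadata schema or Data Node API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class AccessMode(Enum):
    """Hypothetical access modes a data owner could declare."""
    OPEN = "open"            # freely downloadable
    FEDERATED = "federated"  # access after federated authentication
    REQUEST = "request"      # the data owner must approve a request


@dataclass(frozen=True)
class AccessPolicy:
    dataset_id: str
    mode: Optional[AccessMode]  # None: no machine-readable policy yet
    landing_page: str = ""


def next_step(policy: AccessPolicy) -> str:
    """Decide which action to take for a researcher asking for access."""
    if policy.mode is None:
        # Without machine-readable metadata, a person handles the request.
        return "refer to data steward"
    if policy.mode is AccessMode.OPEN:
        return f"redirect to {policy.landing_page}"
    if policy.mode is AccessMode.FEDERATED:
        return "start federated authentication session"
    return "send access request to data owner"


# Hypothetical catalogue entries.
open_data = AccessPolicy("ds-1", AccessMode.OPEN, "https://example.org/ds-1")
restricted = AccessPolicy("ds-2", AccessMode.REQUEST)
undocumented = AccessPolicy("ds-3", None)
```

Keeping the fallback to the data steward as the first branch mirrors the text: automation applies only where the policy metadata is fully machine-readable.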

Status: A first prototype of the ODISSEI Portal will be delivered in 2021.

Project team ODISSEI Portal: Jacco van Ossenbruggen (VU – Task Leader), Albert Mereño (VU), Ricarda Braukmann (DANS), Herbert van de Sompel (DANS), Narges Zarrabi (SURF), Freek Dijkstra (SURF), Mike Kotsur (SURF).

Questions regarding the ODISSEI Portal? Contact Lucas van der Meer (ODISSEI Coordination Team).