Increasing the visibility and FAIRness of child development data

11 February 2025

Written by Dorien Huijser, Otto Lange, Pascal Pas, Chantal Kemner.

Longitudinal cohort studies can provide a wealth of information on development during infancy, childhood and beyond. However, finding these data can be challenging. To address this, the leaders of the cohorts involved in the Consortium on Individual Development (CID) submitted a proposal to the Platform Digital Infrastructure Social Sciences and Humanities (PDI-SSH), which was funded. The aim of the project was to make the data of the six child cohort studies more FAIR (Findable, Accessible, Interoperable and Reusable). The project resulted in a metadata catalogue that allows anyone to find the measures collected within these studies and, where possible, to request them for reuse. In this post, we describe the development process and the considerations we had to weigh along the way.

The challenge

The project, called “Connecting Data in Child Development (CD2)”, began in 2020. The project team included scientific representatives of each study, a project manager, a metadata expert from the Utrecht University Library, and two developers. Their main challenge was to figure out how to create an environment that allows researchers from multiple disciplines to find the data they are looking for. They needed to develop a way to describe the data consistently while respecting each study’s data management practices. Key to the success was the development of a metadata model that provided standardized descriptions of different measurements and enabled the identification of scientific overlap.

Metadata model

One of the first questions that needed answering was what counts as a dataset: the study as a whole, a wave of measurements, or data from a single experiment? Secondly, what information would researchers need to find the specific datasets they are interested in? And thirdly, can existing metadata standards be used? The metadata model that resulted from answering these questions was designed to closely resemble elements of the DDI Lifecycle standard and to be easily transferable to other infrastructures.

The model includes two CID-specific vocabularies: one for specifying biomedical collection modes, such as “MRI”, “EEG” and “biomedical samples”, and another for categorizing the measurements by construct, such as “mental health”, “cognition” and “parenting”.

An important difference from existing standards is the recognition that waves (repeated measurements on the same participants) play an essential role in these cohort studies. In the final metadata model, waves are included explicitly, with a description, timing information, the age (range) of the participants, and who the respondent was. For example, the Adult Self Report (ASR) questionnaire in the TRAILS study was administered in multiple waves of the cohort study, and in most cases the child reported on themselves. In NTR, by contrast, the ASR questionnaire was administered in only one of Young NTR’s projects, and the parents (not the child) reported on themselves.
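To make the role of waves concrete, here is a minimal sketch in Python of how a measure with explicit waves could be represented. It is not the actual CID metadata model: the class names, field names and example values are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Wave:
    """One round of repeated measurement on the same participants."""
    name: str
    description: str
    min_age_years: float   # lower bound of participants' age in this wave
    max_age_years: float   # upper bound of participants' age in this wave
    respondent: str        # who reported, e.g. "child" or "parent"


@dataclass
class Measure:
    """A single measure in a cohort study, collected in one or more waves."""
    title: str
    study: str
    collection_mode: str                       # e.g. "questionnaire", "MRI"
    constructs: List[str] = field(default_factory=list)
    waves: List[Wave] = field(default_factory=list)

    def age_range(self) -> Optional[Tuple[float, float]]:
        """Overall (min, max) participant age across all waves, if any."""
        if not self.waves:
            return None
        return (min(w.min_age_years for w in self.waves),
                max(w.max_age_years for w in self.waves))

    def respondents(self) -> List[str]:
        """Distinct respondents across all waves, sorted alphabetically."""
        return sorted({w.respondent for w in self.waves})


# Hypothetical example data; not taken from any CID cohort.
example = Measure(
    title="Example questionnaire",
    study="Example Cohort",
    collection_mode="questionnaire",
    constructs=["mental health"],
    waves=[
        Wave("Wave 1", "First round", 10.0, 12.0, "parent"),
        Wave("Wave 2", "Second round", 15.0, 17.0, "child"),
    ],
)
```

Keeping the respondent and age range on the wave, rather than on the measure, is what lets the same questionnaire be described differently per wave and per study.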

Collecting the metadata

Much of the time of the cohort representatives went into figuring out what data were actually collected, and how much overlap there was across cohorts. One complicating factor was that the maturity of research data management differed widely between cohort studies. For example, one cohort study merely had PDF files of scanned questionnaires, while others used codebooks to list the available data and items, and yet others used Excel overviews. One study even had a database with all the individual items listed for the entire study. 

All these different sources were eventually united in one overview of available measures per cohort study. These were then compared to identify overlapping measures and standardize their descriptions. This harmonization step was crucial for enabling comparison between cohorts, and potentially combining datasets from different cohorts.
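Once descriptions are standardized, spotting overlap becomes a simple grouping exercise. The sketch below illustrates that idea in Python under strong simplifying assumptions: the function names and the cohort and measure names are hypothetical, and matching on a normalized title stands in for the harmonization work the cohort representatives did by hand.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def normalise(title: str) -> str:
    """Collapse case and whitespace so trivially different spellings match."""
    return " ".join(title.lower().split())


def find_overlap(measures: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Map each standardized title to the cohorts that collected it.

    `measures` is an iterable of (cohort, title) pairs. Only titles found
    in more than one cohort are returned.
    """
    cohorts_by_title = defaultdict(set)
    for cohort, title in measures:
        cohorts_by_title[normalise(title)].add(cohort)
    return {
        title: sorted(cohorts)
        for title, cohorts in cohorts_by_title.items()
        if len(cohorts) > 1
    }


# Hypothetical overview of measures per cohort.
overview = [
    ("Cohort A", "Example Questionnaire"),
    ("Cohort B", "example  questionnaire"),
    ("Cohort B", "Example MRI scan"),
]
overlap = find_overlap(overview)
```

In this example only the questionnaire is reported as overlapping, because the MRI scan appears in a single cohort.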

The catalogue infrastructure

After an in-depth comparison, CKAN was chosen as the underlying software for the catalogue. The choice was made mostly due to its open character (open source) and available in-house experience with the software. Besides CKAN, additional modules were developed to style and configure the catalogue, import the collected metadata into CKAN, request DOIs for individual measures at DataCite and make the metadata harvestable. All software components are available on GitHub under an open license.
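As an illustration of what importing a measure into CKAN involves, the sketch below maps a measure record onto a dictionary using CKAN's standard dataset fields (`name`, `title`, `notes`, `owner_org`, `tags`, `extras`). It does not reproduce the project's own import module: the helper names, the input record layout and the `collection_mode` extra are assumptions, and the sketch only builds the dictionary without contacting a CKAN instance.

```python
import re
from typing import Any, Dict


def to_ckan_name(title: str) -> str:
    """Build a CKAN-style dataset name.

    CKAN dataset names are lowercase and limited to alphanumerics, '-'
    and '_' (at most 100 characters), so runs of any other characters
    are replaced by a single hyphen.
    """
    slug = re.sub(r"[^a-z0-9_-]+", "-", title.lower()).strip("-")
    return slug[:100]


def to_ckan_dataset(measure: Dict[str, Any], owner_org: str) -> Dict[str, Any]:
    """Map a (hypothetical) measure record onto CKAN dataset fields."""
    return {
        "name": to_ckan_name(f"{owner_org}-{measure['title']}"),
        "title": measure["title"],
        "notes": measure.get("description", ""),
        "owner_org": owner_org,
        # Constructs become tags so that they can be used for filtering.
        "tags": [{"name": c} for c in measure.get("constructs", [])],
        # Fields without a standard CKAN counterpart go into extras.
        "extras": [
            {"key": "collection_mode",
             "value": measure.get("collection_mode", "")},
        ],
    }


# Hypothetical measure record; not taken from the CID catalogue.
dataset = to_ckan_dataset(
    {
        "title": "Example Questionnaire (v2)",
        "description": "A placeholder description.",
        "constructs": ["mental health"],
        "collection_mode": "questionnaire",
    },
    owner_org="example-cohort",
)
```

A dictionary like this is the kind of payload that CKAN's `package_create` action accepts, although a real import would also need authentication and error handling.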

The outcome

The CID metadata catalogue is now available at https://data.individualdevelopment.nl/. Users can find 1010 measures from the CID studies using free text search, advanced search and filtering functionalities. Every measure is treated as a research object in its own right, with its own Digital Object Identifier (DOI) and basic descriptive information. Additionally, every measure record includes information about the study and waves in which the measure was collected, and about the participants’ ages. Measures that overlap between cohort studies can be identified under “Similar measures”, so that data from multiple cohorts may be combined or compared. Finally, users interested in using the data are directed to the cohort’s data request page.

Next steps

Currently, steps are being taken to embed the metadata catalogue in local and national research infrastructures. An initial step in this regard has been to include the CID metadata in the ODISSEI portal. This is part of a crucial next phase in the project: long-term sustainability, so that we can ensure FAIR data access in the future as well.

Acknowledgements

The CID metadata catalogue was made possible in large part by the hard work of all cohorts’ representatives and data managers, and with funding from the Platform Digital Infrastructure – Social Sciences and Humanities (PDI-SSH).

We thank Angelica Manieri and Evgeniia Krichever for their feedback on the draft version of this blog post.