Balancing privacy and reusability: The why, what, and how of de-identifying research data

18 September 2024

Written by Deborah Thorpe (DANS, FAIR Expertise Hub) and Ricarda Braukmann (DANS, FAIR Expertise Hub)

Introduction

Data shapes our lives. Our personal and sometimes sensitive data is provided to and stored by a variety of different organisations and companies on a day-to-day basis. We are increasingly aware that our data are being harvested, and are growing more wary about how they are being used and shared.

As researchers, we may be collecting personal data from other people as part of our research, and we need to think carefully about how we will protect the privacy of these participants during, and after, a research project. On the other hand, the importance of sharing data is increasingly emphasised, and making data reusable for others, where possible, is an important task within the research process. So, how can we prepare data to be useful for future reuse, whilst protecting the privacy of our human participants?

In this blog post, we look at the need for de-identifying data, the different ways of doing this, and some of the issues around balancing the competing interests of anonymity and reusability of data. Finally, we leave you with a number of tools to explore, which can remove identifying information quickly, without compromising too much of the utility of the data.

Why would you need to de-identify data?

Where personal information is being collected from research participants, your planning process should involve deciding what you can do to protect the privacy of your participants and to mitigate the risks of taking part. The development of a Data Management Plan, combined with the process of applying for ethical approval, will help you to determine the sensitivity of the data that you are collecting, and thus how and whether it should be collected and, where possible, shared.

You may be gathering personal information, such as names, contact details, or other demographic details. In addition, you may be researching a sensitive topic and/or working with vulnerable people; in such cases, the importance of applying protective measures becomes even more acute.

One of the measures to consider is de-identification of the data, which addresses a range of different concerns in relation to personal data: the fact that the General Data Protection Regulation (GDPR) asks for data minimisation; your obligation to protect your research subjects (especially where they are vulnerable and/or the topics are sensitive); and cases where the subjects’ identifying information is not relevant to the research questions.

When de-identifying data, there are two different approaches that can be taken: anonymisation and pseudonymisation. Both processes involve removing or editing direct and indirect personal identifiers, but they each have their own respective roles in data privacy.

What is ‘anonymisation’ and how does it compare with ‘pseudonymisation’?

There are two different approaches to the de-identification of data, which are neatly outlined in the Utrecht University Data Privacy Handbook, together with links to the relevant parts of the GDPR:

Anonymisation: ‘is a de-identification process that results in data that are “rendered anonymous in such a manner that the data subject is not or no longer identifiable” (rec. 26), neither directly nor indirectly, and by no one, including the researcher who collected the data. When data are anonymised, they are no longer personal data, and thus no longer subject to the GDPR.’

However, the Utrecht handbook emphasises that:

  • Everything that you do before data are anonymised is subject to GDPR
  • It is very difficult to accomplish anonymisation in practice. 

This video helps to explain why it is so difficult.

Pseudonymisation, in contrast, ‘is a safeguard that reduces the linkability of your data to your data subjects (rec. 28). It means that you de-identify the data in such a way that they can no longer lead to identification without additional information (art. 4(5)). In theory, removing this additional information should lead to anonymised data.’

However, pseudonymisation provides much more limited protection for your participants, who may be re-identified – so extra care needs to be taken, for example, in keeping the data separate from the key that links the pseudonyms to the direct identifiers. Thus, good data management planning and practices become even more important. It is also important to note that pseudonymous data ‘are still personal data and thus subject to the GDPR. This is because the de-identification is reversible: identifying data subjects is still possible, just more difficult’.
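To make the idea concrete, here is a minimal sketch in Python of pseudonymising a small dataset while writing the key to a separate file. The records, field names, and file names are hypothetical, purely for illustration; in a real project the key would live in a different, more secure location than the data.

```python
import csv
import secrets

# Hypothetical participant records containing a direct identifier (name)
records = [
    {"name": "Alice Jones", "age": 34, "response": "agree"},
    {"name": "Bob Smith", "age": 51, "response": "disagree"},
]

key = {}            # maps pseudonym -> real name; store this SEPARATELY from the data
pseudonymised = []  # the de-identified records
for record in records:
    pseudonym = "P" + secrets.token_hex(4)  # random, non-guessable ID
    key[pseudonym] = record["name"]
    pseudonymised.append(
        {"id": pseudonym, "age": record["age"], "response": record["response"]}
    )

# Write the de-identified data and the key to separate files
with open("data_pseudonymised.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "age", "response"])
    writer.writeheader()
    writer.writerows(pseudonymised)

with open("key_store_elsewhere.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["pseudonym", "name"])
    writer.writeheader()
    for pseudonym, name in key.items():
        writer.writerow({"pseudonym": pseudonym, "name": name})
```

Because the key file still exists, the data remain reversible – and therefore personal data under the GDPR. Only destroying the key (and any other linking information) would move the data towards anonymisation.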

A useful overview has been published, with more information than is presented here, by Health RI, and a really useful reference card has been published by the National Coordination Point Research Data. In all cases, it is advisable to speak with the local Privacy Officer and/or the data management supporters in your university/library as early as possible when you are working with personal and/or sensitive data.

The competing interests of privacy and reusability

When de-identifying research data, such as the insights provided during interviews, it is important to de-identify in a way that also maintains the value of this data for your research. 

Removing or replacing too little information is the more obvious risk in qualitative research involving participants, since those individuals may be identified by the context and through aspects of the data that are hard to predict beforehand. However, removing or diluting too much of the indirectly identifying information collected during a qualitative research project is also a problem, because it risks stripping the data ‘of its unique value, eliminating potential types of analysis and uses’. The Guide to Anonymising Qualitative Data from the UK Data Service gives an example of over-anonymised data, which is no longer of use to the researcher because the unique data points have been removed throughout:

Original: So my first workplace was Arronal, which was about 20 minutes from my home in Norwich. My best colleagues from day one were Andy, Julie and Louise and in fact, I am still very good friends with Julie to this day. She lives in the same parish still with her husband Owen and their son Ryan.

Example A, ‘over’ anonymisation: So my first workplace was X, which was about X minutes from my home in X. My best colleagues from day one were X, X and X and in fact, I am still very good friends with X to this day. X lives in the same parish still with her husband X and their X X.

The aforementioned UK Data Service (UKDS) guide contains a wealth of approaches and best practices for de-identifying data in a way that balances the need to protect your participants with the importance of preserving the data’s (re)usability. In particular, instead of replacing all identifying information with “X”, identifiers can be replaced with meaningful labels which retain some information (e.g. “My best colleagues from day one were [Name1], [Name2] and [Name3] and in fact, I am still very good friends with [Name2] to this day.”).
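The labelling approach can be sketched in a few lines of Python. This is a simplified illustration, not a substitute for the dedicated tools discussed below: it assumes you have already compiled a list of the names to redact, and it assigns each one a stable label so that repeat mentions stay linked.

```python
import re

# Hypothetical list of direct identifiers found in a transcript
names = ["Andy", "Julie", "Louise", "Owen", "Ryan"]

text = (
    "My best colleagues from day one were Andy, Julie and Louise and "
    "in fact, I am still very good friends with Julie to this day. "
    "She lives in the same parish still with her husband Owen and their son Ryan."
)

# Assign each name a stable label so repeat mentions get the same token
labels = {name: f"[Name{i}]" for i, name in enumerate(names, start=1)}

def replace_names(text, labels):
    # \b word boundaries avoid replacing substrings inside other words
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, labels)) + r")\b")
    return pattern.sub(lambda m: labels[m.group(1)], text)

redacted = replace_names(text, labels)
print(redacted)
```

Running this replaces “Julie” with “[Name2]” in both places, so a reader of the de-identified transcript can still see that the same colleague is meant – exactly the information that blanket “X” replacement destroys.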

De-identifying quantitative data also involves balancing privacy with data reusability. For instance, a researcher may aggregate categories (for example, religious denomination) to avoid re-identification, especially in combination with other information (e.g. sex, region, occupation), though in doing so some of the nuance of the original data is lost. The UKDS also provides a useful guide to Anonymising Quantitative Data.
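As a small illustration of category aggregation, the sketch below collapses hypothetical fine-grained denomination values into broader groups. The categories and records are invented for the example; a real recoding scheme would be designed around the actual distribution of the data and the re-identification risk.

```python
# Hypothetical fine-grained categories collapsed into broader ones to
# reduce re-identification risk (at the cost of analytical nuance)
aggregation = {
    "Roman Catholic": "Christian",
    "Protestant": "Christian",
    "Orthodox": "Christian",
    "Sunni": "Muslim",
    "Shia": "Muslim",
}

respondents = [
    {"id": 1, "denomination": "Protestant", "region": "North"},
    {"id": 2, "denomination": "Shia", "region": "South"},
]

for r in respondents:
    # Fall back to a generic category for values not in the mapping
    r["denomination"] = aggregation.get(r["denomination"], "Other")

print(respondents)
```

After recoding, a rare combination such as “Shia, South, teacher” becomes the much less distinctive “Muslim, South, teacher” – harder to link to one individual, but no longer usable for analyses that needed the finer categories.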

Where anonymisation may not be necessary, or desirable…

Depending on the type of research with human participants, it may not be necessary or desirable to anonymise the data that is being collected and shared. Examples include oral history projects, projects where there is a low risk to participants in having their identity revealed in the data, and projects where there are rewards in their identity being made known.

Consider, for example, an interview with an artist or author. The publication of this interview, complete with their name and descriptions of their work, may be a reward of taking part in the research. In addition, where this kind of research involves individuals within small, localised, and/or very specialised communities, removing all identifying information is a very high standard of information protection that may be difficult to offer – and which might compromise the utility of the research data. In this case, it is important to specify in your information and consent form that you will be sharing identifiable personal data.

However, it should be noted that all researchers working with human participants should consider data minimisation – i.e. the researcher should be collecting only the data that is necessary for the research. In addition, de-identification plans and sharing of personal data need to be addressed in the informed consent forms. 

A focus on tools for de-identification

We have seen that it is important but difficult to de-identify data to an extent that both protects research participants and preserves the value of the information that has been collected. These processes are also time-consuming, which is a problem when there are competing demands on the time of researchers.

We would like to close this blog by highlighting just some of the tools that can be used to redact names and remove other identifying information quickly, without compromising too much of the utility of the data.

anonymoUUs is a Python package created at Utrecht University to pseudonymise folders and files in your documentation. The goal of anonymoUUs is to substitute multiple identifying strings with pseudo-IDs, avoiding traceable relationships between data batches.

Textwash is an automated text anonymisation tool written in Python. We provided an example above of data that had been over-anonymised: too much identifying information had been removed and replaced by ‘X’, rendering it meaningless. This tool, in contrast, anonymises texts in a ‘semantically-meaningful manner’ by replacing identifiable information with ‘meaning-preserving tokens that ensure that the data remains useable for subsequent analyses’.

To achieve this, Textwash identifies and extracts personally identifiable information (e.g., names, dates) from text and replaces the identified entities with a generic identifier. For example, Joe would be replaced with [firstname1], Biden with [lastname1], president with [occupation1], and United States with [location1]. More information, the tool itself, and a Quick Start guide can be found on GitHub.

The UKDS has a text anonymisation helper tool that highlights terms in your text that may be disclosive, so you can decide to alter them accordingly. The tool does not anonymise or make changes to data, but uses MS Word macros to find and highlight numbers and words starting with capital letters in text.

QAMyData, also developed by the UKDS, is an open-source tool that can be used to automatically assess and report on elements of quality, such as missingness, labelling, duplication, formats, and outliers, as well as direct identifiers.

ARX is comprehensive open-source software for anonymising sensitive personal data.

For more information about de-identifying personal data in qualitative research, plus an overview of the range of other challenges involved in making qualitative data reusable, see the DANS guide Making Qualitative Data Reusable – A Short Guidebook For Researchers And Data Stewards Working With Qualitative Data.

Conclusion

There is a delicate balance to be achieved between the need and obligation to protect research participants and their personal data, and the importance of maintaining the utility and reusability of your data. We have touched on the observation that it may not always be necessary or desirable to anonymise data, where participants choose for their identity to be known.

Where de-identification is necessary, there are a range of tools that can help to streamline this process, performing the processes accurately whilst reducing the burden on the researcher. 

We hope that this blog has been useful as a source of information and resources on de-identifying personal information. 

If you have any feedback or suggestions for future blog post topics, please contact us at fairsupport@odissei-data.nl

Photo by Dimitri Karastelev on Unsplash