Written by Angelica Maineri, ODISSEI Data Manager
The why and what of code sharing
For most social scientists nowadays, the research process involves some sort of data processing and analysis with the aid of software: think Stata, R or Python for quantitative data, but also Atlas.ti for qualitative data. Compared to manual processes, software allows us to process larger amounts of data and produce results more quickly. Another advantage is that it allows us to create digital logs of all the steps required to go from the raw data to the published results. This is what we refer to as code. Examples include Stata do-files (.do) and R scripts (.R).
Sharing code, which is also increasingly requested by universities and journals, is very important for various reasons, of which we highlight two in particular:
- it increases the reproducibility of research and thereby its trustworthiness as others are able to inspect the process and reproduce the results;
- it accelerates scientific discovery since the same code can be reused and adapted for future research.
At ODISSEI we support innovative and groundbreaking research, and we want to ensure it is well-documented and reusable. For this reason, the ODISSEI User policy sets out the following requirements:
- Code used to produce, enhance or adapt the User Data should be made available alongside the User Data where possible.
- Efforts should be made to enhance the FAIRness of the code, e.g. by following the recommendations for FAIR-software (www.fair-software.nl)
However, code sharing is not yet standard practice in the social sciences and humanities (SSH). Krähmer, Schächtele and Schneck (2023) [1] ran a study to evaluate to what extent researchers working with data from the European Social Survey (ESS) between 2015 and 2020 were willing to share their code. Out of over 1,000 cases, only 37.5% of the authors agreed to share code. That is around 5 percentage points higher than the average prevalence of data and code sharing found in previous literature, but still far from a widespread practice. Among the reasons not to share code, researchers often mention lack of time or resources and fear of mistakes being found, but also poor data management and coding practices (inability to locate the code after the project is concluded, poor documentation, etc.). While the first two concerns require a broader cultural change, the last one can be addressed with good planning and a clear folder structure. Some precautions can be taken during the research process, others after a study is completed.
In this short article, we review some tips and best practices for sharing code, as a complement to the ODISSEI lecture on the ODISSEI Code Library (see slides). The article is meant to help ODISSEI users comply with the User Policy, but it also provides general tips and resources that can be used by social scientists outside of the ODISSEI infrastructure.
During the research process: Clear folder structure
As explained in more detail in A short practical guide for preparing and sharing your analysis code by Erik-Jan van Kesteren and the ODISSEI SoDa team, preparing the folder structure is a crucial first step. Your folder should have a logical and understandable structure (e.g. a subfolder for data, another one for scripts, etc.). The folder structure and its content should be explained in a README, often a simple text file that helps others (but also your future self!) understand where to find what in the folder. It is also advisable to make sure from the start that the code includes no sensitive information: be careful with comments that point to specific observations, keep sensitive data in a separate folder, and avoid hard-coded file paths (e.g. C:\Users\YOURNAME\Documents\GitHub) in your code, since they can reveal personal details and will not work on someone else's computer.
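To illustrate the advice on hard-coded file paths, here is a minimal Python sketch that derives all locations from the position of the script itself. The folder names (scripts, data, output) and the function name project_paths are assumptions made for this example, not a layout prescribed by the guide.

```python
from pathlib import Path


def project_paths(script_file: str) -> dict[str, Path]:
    """Derive project folders from a script assumed to live in <project>/scripts/."""
    # Two levels up from <project>/scripts/analysis.py is the project root.
    root = Path(script_file).resolve().parent.parent
    return {
        "root": root,
        "data": root / "data",      # hypothetical subfolder for input data
        "output": root / "output",  # hypothetical subfolder for results
    }
```

Inside a script you could call project_paths(__file__) and then read from paths["data"] instead of typing out a location that only exists on your own computer.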
!Tip: If you work with R, using RStudio Projects can be an easy and effective way to make your code easy to run for others.
During the research process: Ensure code quality
Before making your code publicly available, you may want to ensure its quality. What does quality mean in this context? That may depend on the field of practice, but for the purpose of this short article quality means three things. First, the code runs without interruptions for those who have access to the underlying data (make sure all requirements and/or packages are specified at the start of the syntax). Second, no sensitive information is accidentally disclosed via the code. Third, and very important, the code is well documented with a README file, a licence and comments. Tools like Quarto, R Markdown or Jupyter notebooks are a great way to combine code and narrative text in one document, which makes it much easier to clearly document the steps that were taken to process and analyse the data. Keep in mind that good documentation is helpful for others who want to reuse your code, but also for your future self if you need to pick up the project again after a break.
How do you check for quality? There are online resources to check code/software against formal requirements (see https://fair-software.eu/recommendations/checklist). In most cases, it may be enough to ask a peer to double-check the code and, if the data can be shared, perhaps even try to run it. ReproHacks are hackathon-like events to check the reproducibility of code, and CODECHECK provides guidance to assess research code. Through CODECHECK you can get your code externally reviewed and even obtain a “certificate of executable computation”.
!Tip: At this link, you can find a draft of a code quality checklist based on the tips and tricks listed in this short article. You can comment and contribute to the list using the issues button.
!Tip: Using free and open-source software (FOSS) allows everyone to rerun the code without the need for a commercial licence. To maximise the reuse potential of your code, consider using FOSS (e.g. R instead of Stata/SPSS).
During the research process: Use a version control system
A version control system keeps track of all the changes in a given folder, allowing you to revert changes when something goes wrong. This is particularly important when working in teams, to keep a log of each member’s individual contribution. Moreover, for reproducibility purposes, it is usually possible to refer to the exact commit that produced a certain output. This means that in a study you can report the exact version of your code (and sometimes data) that produced the results. Git is the most widespread free and open-source distributed version control system. Several platforms can host and share Git repositories, such as GitHub, GitLab and Bitbucket. It is important to note, however, that if you work in closed environments such as the CBS Remote Access environment or SANE, it may not always be possible to implement version control.
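As a sketch of how the exact commit could be recorded alongside your results, the Python function below asks Git for the hash of the currently checked-out commit. It assumes that the git command line tool is installed and that the script runs inside a Git repository; the function name current_commit is our own choice, and the function falls back to "unknown" when either assumption does not hold.

```python
import subprocess


def current_commit(repo_dir: str = ".") -> str:
    """Return the hash of the checked-out commit, or "unknown" if it cannot be determined."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],  # prints the full hash of the current commit
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # git is not installed, the folder does not exist, or it is not a Git repository
        return "unknown"
    return result.stdout.strip()
```

Writing the returned value into a log file or a table note makes it possible to report later which version of the code produced a given output.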
!Tip: did you know that you can use the Terminal in RStudio to work with Git? Check out https://jennybc.github.io/2014-05-12-ubc/ubc-r/session03_git.html.
During or after the research process: make your repository public
When you feel ready, you can make your Git folder (called repository, or just repo) public. Make sure you are not disclosing sensitive or private information, or redistributing data you are not supposed to publish. There are a few building blocks that can help you document your work and/or enable collaborations:
- Include a README file, which could be the same as the one mentioned above, where you describe the structure and content of your repo, perhaps include a list of contributors, and add any other information useful to those finding and reusing your code. This will be the main landing page of your repository. If your code runs on data that is not available in the repo, indicate it here.
- Include a LICENCE. On GitHub, there are many templates available. Remember that if you don’t attach a licence, no one has permission to use, modify, or share your work. If you need help choosing a licence for your code, check out https://choosealicense.com/, and check out this blog post in the ODISSEI FAIR series to know more about data licences [2].
!Tip: If you want to learn more on how to use GitHub, check out this GitHub tutorial by Malvika Sharan ([3]). If you prefer working from the command line, then check out the GitHub cheat sheet.
After the research process: deposit code in a registry
In order to ensure findability (i.e. other users know where to locate your code) and long-term preservation, you should deposit your code in a registry after the completion of a research project. Please note that Git platforms (e.g. GitHub) are not registries. There are several options to choose from, depending on your needs and preferences, but also on common practices in your field. For instance, publishing your code in a registry that assigns a persistent identifier (PID), such as a DOI, makes it easier for those who reuse your code to cite it; if a journal requires you to share research materials during the peer-review process, you may opt for a registry that allows anonymous links. Moreover, some of these registries make it possible to connect directly to your GitHub repository, reducing the risk of duplication and mismatches. We list some of the registries and their features in Table 1.
Table 1. Overview of registries to publish code and their features
Registry | Domain | Assigns PIDs | Handles versioning | Connection to GitHub | Anonymised packages for blind review |
Open Science Framework (OSF) | Generic | ✅ | ✅ | ✅ | ✅ |
Zenodo | Generic | ✅ | ✅ | ✅ | ❌ |
FigShare | Generic | ✅ | ✅ | ❌ | ✅ |
4TU dataset and software registry | Natural and Engineering Sciences | ✅ | ✅ | ✅ | ✅ |
CLARIAH Tools | Humanities | ❌ | ✅ | ✅ | ❌ |
Research Software Directory | Generic | ❌ | ✅ | ✅ | ❌ |
Research Box | Generic | ❌ | ❌ | ❌ | ✅ |
See more on https://github.com/NLeSC/awesome-research-software-registries
Once your code is made available on a registry, it will be easier for other users to find and reuse it, and also for search engines and aggregators to point to it. For instance, if your code deals with CBS microdata or LISS panel data, it will be listed on the ODISSEI Code Library.
!Tip: do you want to enable citation for your software? Check out https://citation-file-format.github.io/
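The Citation File Format works by adding a small YAML file called CITATION.cff to your repository. The Python sketch below builds a minimal version of such a file. The field names follow version 1.2.0 of the format as we understand it, the function name citation_cff is our own, and all values in the usage note are placeholders, so please validate the result against the official documentation linked above.

```python
import json


def citation_cff(title: str, authors: list[tuple[str, str]],
                 version: str, date_released: str) -> str:
    """Build minimal CITATION.cff text; authors are (family name, given name) pairs."""
    def quote(text: str) -> str:
        # A JSON string is also a valid double-quoted YAML string.
        return json.dumps(text, ensure_ascii=False)

    lines = [
        "cff-version: 1.2.0",
        'message: "If you use this software, please cite it as below."',
        f"title: {quote(title)}",
        f"version: {quote(version)}",
        f"date-released: {date_released}",  # expected as YYYY-MM-DD
        "authors:",
    ]
    for family, given in authors:
        lines.append(f"  - family-names: {quote(family)}")
        lines.append(f"    given-names: {quote(given)}")
    return "\n".join(lines) + "\n"
```

For example, citation_cff("My analysis code", [("Doe", "Jane")], "1.0.0", "2024-01-01") returns text that you can save as CITATION.cff in the top-level folder of your repository.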
Final tips
If you are unsure about how to publish your code, there are many ways to seek help:
- Ask your local data steward, who often knows a lot about this;
- Join the ODISSEI SoDa team Data Drop-in, held every third Thursday of the month;
- If you are based in the Netherlands, for more complex software questions you may also reach out to the Netherlands eScience Center;
- Join a community of people valuing open and reproducible science, such as the Open Science community at your institution or the international community The Turing Way.
Acknowledgments
I am very grateful to Eduardo Klapwijk and Deborah Thorpe for their suggestions on the text and resources to add.
References
[1] Krähmer, D., Schächtele, L., & Schneck, A. (2023). Care to share? Experimental evidence on code sharing behavior in the social sciences. PLoS ONE, 18(8), e0289380. https://doi.org/10.1371/journal.pone.0289380
[2] Thorpe, D., & Braukmann, R. (2024). How can I use this data? The importance of licences to facilitate reuse. Zenodo. https://doi.org/10.5281/zenodo.10986072
[3] Sharan, M. (2020). Developing Collaborative Document on GitHub – Tutorial for new learners (v1.0). Zenodo. https://doi.org/10.5281/zenodo.3835657
Photo by Mohammad Rahmani on Unsplash