LibGuides: Open Science & Research Services: Open Code

What is Open Code?

Open Code refers to custom, author-generated code used in a scientific research study, often during data collection, interpretation or analysis, and subsequently made publicly available under an Open Access license via a linked repository or as Supporting Information.

Reference: Public Library of Science (PLOS). Open Code.

If you wish to find out more about or participate in Open Source, please see here.

Open Code Checklist

You may like to refer to this checklist when making your code open:

If you have written code, have you documented your test protocol?
Have you scripted your analysis, including data cleaning and wrangling, where applicable?
Is your code well annotated and documented for ease of understanding and reuse?
Have all software versions and computing environments been documented where applicable?
Are you able to openly share your code?
If you have shared your code, have you shared it under an open license?
Have you adopted the FAIR principles for research software?

Tips for reproducible code

Same code

The code should be well documented and should actually run. This is influenced by:

Dependencies: How do you manage third-party packages? Are they actively maintained, and are the versions pinned? Do you have robust control over system-level dependencies?
Environment: What language version did you build your product in? Will the application environment use the same version?
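
As a minimal sketch of how the dependency and environment questions above could be answered in practice, the Python snippet below records the interpreter version, the operating system and the installed versions of a list of packages. The function name describe_environment and the package names "numpy" and "pandas" are illustrative assumptions, not part of any cited guideline; substitute the dependencies your own project uses.

```python
import json
import platform
from importlib import metadata


def describe_environment(packages):
    """Return the interpreter, OS, and installed versions of the named packages."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "packages": versions,
    }


if __name__ == "__main__":
    # "numpy" and "pandas" are placeholder names; list your own dependencies here.
    print(json.dumps(describe_environment(["numpy", "pandas"]), indent=2))
```

Saving this output next to your results gives readers a record of what the code was run with. It complements, rather than replaces, a pinned dependency file.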

Same data

Tell the story of your data by versioning it across the stages of transformation, i.e. raw, interim and processed. This allows stakeholders to validate that the logic is sound and that the data can be trusted. The analysis can also be extended or reverted as necessary.
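
One lightweight way to approximate data versioning, sketched here under the assumption that your data lives in folders named raw, interim and processed, is to record a checksum for every file at each stage. The function names file_sha256 and build_manifest are hypothetical. Dedicated data versioning tools offer more, but the idea is the same: a changed file produces a changed fingerprint.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir, stages=("raw", "interim", "processed")):
    """Map each file under the stage folders to its checksum.

    The stage folder names are an assumed layout; missing folders are skipped.
    """
    root = Path(data_dir)
    manifest = {}
    for stage in stages:
        stage_dir = root / stage
        if not stage_dir.is_dir():
            continue
        for path in sorted(stage_dir.rglob("*")):
            if path.is_file():
                manifest[path.relative_to(root).as_posix()] = file_sha256(path)
    return manifest
```

Committing the resulting manifest alongside the code lets a reader check that the data they hold matches the data the analysis was run on.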

Same random numbers

Random numbers are part of most machine learning workflows, for example in train/test splits, cross-validation and optimization. You can control them with seeds. The "seed" is the starting point of the sequence: starting the same generator from the same seed gives the same sequence of numbers. Random seeds make troubleshooting quicker as the pipeline is built out, because they make your model outputs reproducible. This is especially important when you use a learning algorithm with random components, such as a neural network or a random forest. Without a seed, you cannot tell whether a change in model outputs, standard errors, etc. is due to random effects or to a change in the hyperparameters. Setting a random seed keeps this randomness consistent while you build out your product and removes random variation between runs of your machine learning pipeline.
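
The effect of a seed can be illustrated with a small train/test split written with Python's standard library only. This is a sketch, not a replacement for the splitting utilities of a machine learning library; the function name train_test_split, the default seed of 42 and the 25% test fraction are arbitrary choices made for illustration.

```python
import random


def train_test_split(items, test_fraction=0.25, seed=42):
    """Shuffle items with a seeded generator and split them into train and test lists."""
    rng = random.Random(seed)  # local generator, leaves the global random state untouched
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


if __name__ == "__main__":
    first = train_test_split(range(20), seed=7)
    second = train_test_split(range(20), seed=7)
    print(first == second)  # the same seed reproduces the same split
```

Note that libraries with their own random number generators need their own seeds set as well; seeding Python's random module does not affect them.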

Reference: Carlos Brown (2020). Reproducibility in Data Science. Medium.

Open Code Best Practices

FAIR Principles for Research Software (FAIR4RS Principles)

To improve the sharing and reuse of research software, the FAIR for Research Software (FAIR4RS) Working Group has applied the FAIR Guiding Principles for scientific data management and stewardship to research software.

Adoption and implementation of the FAIR for Research Software principles will create significant benefits for many stakeholders, including increased research reproducibility for research organizations, better practices and more software usage for its developers, clarity for funders around their own policies and requirements for software investments, and guidelines for publishers on sharing requirements.

Anonymous code sharing as part of the peer review editorial process:

Anonymous GitHub: Allows you to anonymize your GitHub repository. Several anonymization options are available to ensure that you do not break double-anonymized review, such as removing links, images or specific terms. You keep control of your repository and can define an expiration date that makes the anonymized repository unavailable after the review.

Open Code Resources

Health Data Research UK: Open Science Open Code

We have brought together over 150 repositories of open standards, data and source code, tackling some of the most important challenges in wrangling multi-modal data and generating replicable insights.
