
Documentation checklist for scientific code repositories #2

Open
krassowski opened this issue Oct 4, 2021 · 3 comments

@krassowski
Collaborator

I wonder if anyone has come across a checklist describing how to prepare a code repository before sharing it in a paper? I know of the Ten Simple Rules for Documenting Scientific Software list, which touches on some good practices that could reduce the problem of being unable to work out what is happening in others' code, but it is oriented towards re-usable software, while many of the worst examples of repositories accompany papers where the author does not expect their code to be re-used (i.e. it is there only to document that they performed an analysis, ran a simulation, etc.).

Certainly https://the-turing-way.netlify.app/ has made a lot of effort to make research reproducible and to encourage minimal reasonable practices, such as file naming, linting and, importantly, repository organization.

Do you know of other resources targeted at researchers sharing their small software/analysis code which would encourage best practices such as:

  • providing explicit annotations of the places in the codebase relevant to the published research (e.g. "for the code used to generate Figure 3 please see file X.ipynb"; "for the script performing simulation Y please see file Z.R")
  • not agglomerating multiple projects into a single repository (which makes it hard to find the relevant pieces)
  • not compressing the code in a zip or another archive
  • always adding a README file
  • describing where to find the data if data is required to run the code
  • linking documentation if present.
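To illustrate the annotation, data and documentation points above, a README could map the published artefacts to files roughly like this (all file and directory names below are made up for the example):

```markdown
# Code for "Title of the paper"

- `analysis/figure_3.ipynb` - notebook generating Figure 3
- `simulation/run_simulation.R` - script performing simulation Y
- Data: download `input_data.csv` from the archive linked in the paper
  and place it in `data/` before running the code
- Full documentation: see `docs/` (or the project website, if any)
```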
@mstimberg

The codecheck project has a quite formalized process with a so-called manifest file: https://codecheck.org.uk/guide/community-workflow#requirements

Here's a checklist for machine learning papers: https://medium.com/paperswithcode/ml-code-completeness-checklist-e9127b168501

And here's a guide specific to Python: https://docs.python-guide.org/writing/structure/

@krassowski
Collaborator Author

Thank you! The Papers with Code checklist and template are amazing; they really make it easy to find the relevant files and re-run the code. The codecheck project reminded me of three software journals/collections:

  • JOSS has some good conditions in their submission guidelines, e.g. enforcing that the repository is browsable without the need to log in or download anything, and it also links a script to auto-generate minimal paper-related metadata by extracting info from git history - this one can be useful!
  • ROpenSci has an extensive dev guide which serves as the basis for their review standards and includes a releasing section mentioning the use of markdown files over plain text and including the R world's version of CHANGELOG.md; it also links to a very relevant resource:
  • "happy git with R" has an entire chapter on workflows browsability which is super awesome - they recommend committing derived outputs (a thing controversial in software dev itself, but I believe a very good practice in scientific code sharing), using specific file formats which can be displayed in browser/via GitHub (e.g. .tsv, .png/.svg) over proprietary ones which cannot/are more difficult (.xls, .pdf)
    • a total bummer in the R world are repositories using a single directory of .S files "for compatibility with S" (an ironic expression used as self-critique by the R community for deficiencies of the language stemming from its support of a, in places very poorly designed, legacy predecessor from 45 years ago; here it is meant literally). Example: https://github.com/harrelfe/rms/blob/master/R. These files are not navigable on GitHub (no syntax highlighting/code jumping), and while the flat structure is nice in some cases, the semi-flat and unintuitive file names in some other packages can be really off-putting
  • pyOpenSci provides a cookiecutter for projects and a packaging guide, but it does not go much beyond mentioning README.md and what documentation should contain at a minimum
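As a rough illustration of the kind of metadata extraction the JOSS-linked script performs (this is not that script, just a minimal sketch assuming `git` is available on the PATH; the function and variable names are my own):

```python
import subprocess

def parse_authors(log_output):
    """Deduplicate 'Name|email' lines as produced by `git log --format=%an|%ae`."""
    seen, authors = set(), []
    for line in log_output.splitlines():
        name, _, email = line.partition("|")
        if email and email not in seen:
            seen.add(email)
            authors.append((name, email))
    return authors

def contributors_from_git(repo_path="."):
    """List unique commit authors of a repository by reading its git history."""
    log = subprocess.run(
        ["git", "log", "--format=%an|%ae"],
        cwd=repo_path, capture_output=True, text=True, check=True,
    ).stdout
    return parse_authors(log)
```

The actual script linked from the JOSS guidelines extracts more (e.g. dates and commit counts); the point is only that a few lines over `git log` already recover minimal authorship metadata for a paper.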

@krassowski
Collaborator Author

The important thing is that the repositories shared via ROpenSci/pyOpenSci/JOSS are likely a bit better built (aimed at re-use by others, as this is how they get citations) and are usually used by researchers with sufficient background to get by and find their way around a weirdly/inconveniently structured repo (e.g. where GitHub search fails, they can clone and grep easily).

The bigger problem is repositories accompanying analysis papers, where the target audience is a PhD student/postdoc who often knows only one (statistical) programming language, and maybe only to a degree that allows them to do their own analysis, but not necessarily to understand someone else's code or to wrap their head around current software development practices/tooling (I know a lot of excellent scientists who are like that).

There is also a group of repositories which lies in between re-usable code and analysis code. I call it MatLab, but really it covers other languages too; these are specialised languages unlikely to be re-used by researchers outside of departments with the relevant licence and expertise, yet the authors seem to think that many researchers will re-use them (but oftentimes do not document them sufficiently). I believe there is a small group of methodologists who will in fact run that (MatLab or other) code, and a larger group who just want to use it as a basis for re-implementing/understanding the algorithm by analysing the code (for which MatLab is often a good choice!).
