---
title: "outline"
---
1. Software reproducibility
- what is it?
- what are the goals?
- what needs to be tracked?
- varies based on project needs. A completely reproducible system would run the same versions of software on the same version of the operating system on the same hardware models. Matching hardware is difficult enough, but software also comes in many versions. Some version changes are minor and won't affect output in any noticeable way; others are breaking, so that code simply no longer runs or, worse, still runs but produces different results than expected.
- what are the challenges?
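
As a rough illustration of what could be tracked, the base R sketch below records the R version, the operating system, and the attached packages for the current session. The variable names are illustrative only, and this covers software, not hardware models.

```r
# Record the software side of the environment using base R only.
r_version <- R.version.string                               # R version as one string
os_info <- Sys.info()[c("sysname", "release", "machine")]   # OS name, release, architecture
info <- sessionInfo()                                       # R, platform, and attached package versions

print(r_version)
print(os_info)
print(info)
```

Saving this output alongside results is an informal record at best; it documents the environment but does nothing to recreate it.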
2. Dependency Management
- what is it? the process/steps for actually controlling and documenting the software and specific versions used in a project.
- these can be managed informally with e.g. a written list of the software used, with versions if needed.
- pros: there's little technical knowledge needed here. anyone can type these out.
- cons: inexact, error-prone, still requires another user to download each package/library by hand, and (big issue) the dependencies of packages are not explicitly declared.
- what does that last point mean? Many packages or libraries that one downloads depend on other software packages. When a package is downloaded, its required dependencies are generally declared and should be downloaded too. However, not all packages/libraries declare specific versions of their dependencies, and a well-developed project will track these too.
- Once package dependencies are included, a project that appears to use only 5-10 libraries can actually be using a few hundred. (This works recursively: each dependency can have dependencies of its own.)
- Given the complexity in all of this, automated solutions for dependency tracking are recommended.
- Python: `requirements.txt`, `Pipfile`, and `pyproject.toml` for pip, pipenv, and poetry respectively can, generally speaking, be used for library management. `Pipfile` and `pyproject.toml` can also declare the required Python version.
- this tutorial focuses on R and {renv}, so no further comparison of these will be provided here
- R: {renv}, {packrat}, and {groundhog} are the major options in the R ecosystem
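
To make the recursive point about dependencies concrete, here is a minimal base R sketch over a made-up dependency graph. The package names (`pkgA` to `pkgF`) and the helper `all_dependencies()` are hypothetical and are not part of any real tool.

```r
# Hypothetical dependency graph: each name maps to its direct dependencies.
deps <- list(
  pkgA = c("pkgB", "pkgC"),
  pkgB = c("pkgD"),
  pkgC = c("pkgD", "pkgE"),
  pkgD = character(0),
  pkgE = c("pkgF"),
  pkgF = character(0)
)

# Collect every package reachable from `pkg`, following dependencies of
# dependencies until no new names turn up.
all_dependencies <- function(pkg, graph) {
  seen <- character(0)
  queue <- graph[[pkg]]
  while (length(queue) > 0) {
    current <- queue[1]
    queue <- queue[-1]
    if (!(current %in% seen)) {
      seen <- c(seen, current)
      queue <- c(queue, graph[[current]])
    }
  }
  sort(seen)
}

direct <- deps[["pkgA"]]                 # what the project appears to use
full <- all_dependencies("pkgA", deps)   # what it actually relies on

length(direct)  # 2
length(full)    # 5
```

For real packages, base R's `tools::package_dependencies()` with `recursive = TRUE` performs a comparable lookup against a package database.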
3. Why {renv}?
- {packrat} was for many years the most popular environment/dependency manager package for R. It is maintained by RStudio/Posit and still receives bug fixes. However, it has been soft-deprecated by Posit in favor of the {renv} package since at least [April 2020](https://github.com/rstudio/packrat/commit/90520e2247f65b2b9fb21bfa804444bd5ab2b78c). You should not start new projects with {packrat} and, if possible, should migrate existing {packrat} projects over to {renv}. That said, it is worth knowing that this package exists because old tutorials, projects, and repos may still be using it.
- [{groundhog}](https://groundhogr.com/): this is an open-source project for dependency management that appears to have some popularity within psychological research and possibly other fields. It claims to address [some shortcomings](https://groundhogr.com/p3m/) of Posit Package Manager (PPM, discussed later), but my opinion is that the comparisons are misleading; {groundhog} must be compared against {renv} combined with PPM, not just against PPM. That said, if your collaborators are using {groundhog}, then I would recommend learning it for those projects, but the benefits of {renv} make it the overwhelmingly favorable option.
- {renv}: this is another of Posit's packages and is now in its 1.X.X releases, which suggests it can be considered stable, with breaking changes unlikely in the near future. (Under semantic versioning conventions, a breaking change would generally be signaled by a major release, e.g. 2.X.X, whereas a minor release, e.g. 1.X+1.X, adds features without breaking existing code.)
- pros:
- developed by professional software engineers at Posit, so its maintenance is backed by paid developers. Development, however, is also done in the open, so people can independently propose bug fixes or raise issues
- likely the most popular option already. Learning it is a skill that transfers between academia and industry.
- integration with other Posit packages and development pipelines, in particular with CI/CD pipelines on GitHub repos. Another tutorial will discuss CI/CD pipelines, but, in short, using {renv} means you can more easily automate deploying websites through GitHub Pages, checking the development status of R packages you work on, deploying Shiny apps to servers, and running other automated processes that require specific package versions *because* Posit has developed extensions that make this easier.
- it works at a project level (i.e. it should be paired with the use of an R Project/`.Rproj` file). {groundhog} notes this as a disadvantage, so I suppose this requirement is subjective. However, I argue this is a pro because using an R Project structure helps manage your working environment
- Niche: {renv} also has integrations allowing it to be used with Python, which can be useful when you have a multi-language project. This would likely only be encountered in complex, multi-team analysis projects or possibly in industry settings.
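
The versioning note on {renv} above can be sketched in base R. This assumes semantic versioning conventions, where a breaking change comes with a major release; the version strings and the helper `is_major_bump()` are hypothetical and used only for illustration.

```r
# Hypothetical version strings, not real {renv} releases.
old <- package_version("1.0.3")
new_minor <- package_version("1.1.0")
new_major <- package_version("2.0.0")

# TRUE when moving from `from` to `to` crosses a major version boundary,
# which is where breaking changes are expected under semantic versioning.
is_major_bump <- function(from, to) {
  to$major > from$major
}

is_major_bump(old, new_minor)  # FALSE
is_major_bump(old, new_major)  # TRUE
```

For an installed package, `packageVersion()` returns the same kind of version object, so the same comparison could be applied to real versions.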