
bdverse's development and QA frameworks

Sunny Dhoke edited this page Apr 17, 2020 · 9 revisions

Background

The bdverse is a family of R packages that allow users to conveniently employ R for biodiversity data exploration, quality assessment (QA), data cleaning, and standardization. It comprises several packages arranged in a hierarchical structure that represents different functionality levels and tools. The bdverse was designed to support different user needs and programming capabilities. To ensure its longevity, development and QA best practices need to be established early and followed consistently.

Related work

Recently, the rOpenSci organization published a detailed guide for R package development and maintenance. We plan to submit the bdverse packages to rOpenSci software peer review as soon as they meet all core requirements. Beyond those requirements, we plan to develop robust QA foundations tailored to the bdverse's specific vulnerabilities. The bdverse package system is envisioned as a comprehensive, long-term solution to issues related to biodiversity data. As such, its frameworks must also address the unique aspects of biodiversity data, such as:

  • The Darwin Core (DwC) data standard.
  • Various taxonomic, spatial, and temporal complexities.
  • Integrating and developing new concepts and methodologies for biodiversity data fitness for use.
  • Unique and demanding architecture (i.e., intense connectivity between bdverse packages and Shiny modules).
  • The continuous development and maintenance of a backend (R functionality packages) and a frontend (Shiny app packages).

Details of your coding project

We have identified seven components that need to be developed and integrated within the bdverse package system:

  • Defensive Programming strategy: defensive programming is a practice where you add supporting code to detect, isolate, and in some cases recover from anticipated failures in your code. Such failures can occur due to malformed inputs or bugs within the code. Defensive programming is about letting the user know when an error has occurred and providing some context about what caused it. One of the main tools is the assertion, which states that a given condition must be true at a given point in the code. A common example is checking the input parameters of a function. In R, this is often implemented using the stopifnot() function, though packages such as assertthat also exist. A coherent and practical defensive programming strategy will be developed for our R functionality packages and our Shiny app packages.

  • Automated Tests: the best way to ensure all functions work as expected is unit testing: writing test functions, specifying inputs, and checking outputs. In R, the most widely used testing framework is testthat, which comes with excellent documentation and is integrated with RStudio and many other downstream testing frameworks. Testing a Shiny app is challenging in different ways, as testing for GUI consistency, reactivity, app logic, and app performance is not trivial. The shinytest package has been developed to facilitate these types of tests. Several of the bdverse packages include unit tests and Shiny tests; however, the number one challenge is to strategically build and order them so that the first test to fail in the chain singles out the problematic area. We named this principle the ‘Bonanza of First Fail’ - our BFF principle indeed. Different types of tests woven together in this manner require (i) superb planning; (ii) a maintainable array of tests; and (iii) strict adoption of the proposed principle. Well-designed and well-implemented tests are the foundation of a fruitful CI scheme. To facilitate package development, test development, benchmarking, and pre-release checklists, we have started to develop bdtests, an in-house package that is a sort of devtools for the bdverse package system. Hopefully, as part of this project, bdtests will take better form; for now, it is just a basic skeleton and a place to store ideas.

  • Continuous Integration: CI acts as a continuous alert system. All commits, pull requests, and new branches are run through R CMD check, which executes a fixed set of core checks and the automated tests (if developed). The result indicates whether the package fails any of the checks or tests, thereby flagging a potential issue or bug. It is also recommended to use a code coverage service, which assesses how many lines of code are covered by the developed tests. CI services differ in which operating systems they can build on; some are easier to set up than others, and some are more stable (i.e., produce fewer false failures). To identify the CI service, or combination of services, best suited to the bdverse's unique needs (e.g., intense package dependencies, CI connectivity between upstream packages, complex monitoring), a comparative analysis of services will be designed and performed. We envision that the selected CI architecture will be accompanied by a monitoring app (control-center like), probably built with flexdashboard, a package for interactive R Markdown dashboards that are easy to set up and maintain. Right now, we only have a name for it: bdmonitor, the bdverse baby monitoring app.

  • Dependency Management: dependencies on other packages have been described as “an invitation for other people to break your code.” Although this risk can be minimized by implementing CI with high code coverage, reducing dependencies is especially important for packages that are difficult to install, even more so if they provide functionality that is hard to justify. We plan on developing a package-dependency risk management analysis to be stored in bdtests.

  • Docker (a computer inside a computer): a Docker container makes it easier to create, deploy, and run an application or analysis by controlling its whole environment (OS, R version, RStudio version, libraries, and R packages). In the case of the bdverse, it enables us to build stable and controlled environments for testing the packages before CRAN release (e.g., R-hub), deploying the Shiny apps, and speeding up binder image builds. Fully and cleverly utilizing Docker in R will enhance bdverse reproducibility, accessibility, and scalability. In this part of the project, we will revisit existing images, brainstorm new ones, and implement automatic builds.

  • GitHub Actions for R: a GitHub action allows you to trigger automated steps after almost any type of GitHub interaction (push, pull, pull request, issue, etc.). All bdverse repositories are stored and managed via GitHub (https://github.com/bd-R). Recently, many useful actions for R were developed (complete CI, site builds, file rendering, and many more); most are extremely easy to set up. The least we can do is make sure we are taking full advantage of them, and perhaps even develop a few of our own, tailored to our unique needs.

  • Git strategy: synchronizing the perpetual development and releases of many R packages is challenging, even more so when they have a strict hierarchical dependency architecture. Fully embracing Git best practices is crucial to our sanity (i.e., project longevity). Having a clear, robust, consistent, and well-documented branching strategy is most likely the key. We intend to adopt and fine-tune an existing Git workflow (e.g., How to use Git efficiently).
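
To make the defensive programming component above more concrete, here is a minimal sketch in base R of the kind of input checking it describes. The function name check_coordinates() is hypothetical and is not taken from any bdverse package; the column names are the Darwin Core terms decimalLatitude and decimalLongitude, and the range check is an assumption chosen purely for illustration.

```r
# Hypothetical helper (not part of any bdverse package): validate a data
# frame of occurrence records before it is passed on to other functions.
check_coordinates <- function(records) {
  # Assertion on the argument type: fail fast on malformed input.
  stopifnot(is.data.frame(records))

  # Tell the caller exactly which columns are missing, to give context.
  required <- c("decimalLatitude", "decimalLongitude")
  missing_cols <- setdiff(required, names(records))
  if (length(missing_cols) > 0) {
    stop("Missing required column(s): ",
         paste(missing_cols, collapse = ", "), call. = FALSE)
  }

  lat <- records[["decimalLatitude"]]
  lon <- records[["decimalLongitude"]]
  stopifnot(is.numeric(lat), is.numeric(lon))

  # Out-of-range values are reported with a warning rather than an error,
  # so the caller can decide how to handle them. NA values are ignored here.
  bad <- which(abs(lat) > 90 | abs(lon) > 180)
  if (length(bad) > 0) {
    warning(length(bad), " record(s) have out-of-range coordinates",
            call. = FALSE)
  }

  # Return the input invisibly so the check can be used inside a pipeline.
  invisible(records)
}

# Usage: a valid input passes silently; a malformed one fails with context.
ok <- data.frame(decimalLatitude = c(32.1, -45), decimalLongitude = c(34.8, 170))
check_coordinates(ok)
msg <- tryCatch(check_coordinates(data.frame(x = 1)),
                error = function(e) conditionMessage(e))
```

Whether a given problem should raise an error or only a warning is a design decision that the proposed strategy would need to settle per package; the split used here is only one possible choice.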

Skills Required

R package development, Shiny, Docker, Linux, QA best practices (e.g., testthat, shinytest, CI services, R-hub).
Advantage: experience in working with biodiversity data.

Expected impact

Developing and implementing a robust QA infrastructure, as challenging as it may be, is not a luxury but a necessity. The resulting framework will ensure that bdverse features are held to a high standard, as these tools play an essential role in scientific workflows; after all, what is at stake is the quality of data quality tools. Investing in software development best practices will facilitate contributions from the community (e.g., R developers, Shiny developers, R users, biodiversity data users, domain experts, biodiversity informatics experts), and the bdverse cannot be sustained without them. The bdverse team's responsibility is to design and build good foundations; by doing so, we hope that recruiting the community will be easier.

Mentors

Students, please contact mentors below after completing at least one of the tests below.

  • Tomer Gueta [email protected] is the founding director of the bdverse project. He is a postdoctoral fellow at the Faculty of Civil and Environmental Engineering at the Technion, working with Prof. Yohay Carmel. His research deals with developing tools and methodologies for data-intensive biodiversity research. During the last three years, Tomer served as a GSoC mentor with the R project organization.

  • Thiloshon Nagarajah [email protected] is the Shiny lead of the bdverse development team. He is a past GSoC and GCI student for the Fedora Project, the Sahana Foundation, and the R Language. Thiloshon joined bdverse as a Google Summer of Code student developer in 2017 and has been a student, contributor, mentor, and now a core member of the bdverse team. All things Shiny in the bdverse are Thiloshon's magic.

  • Vijay Barve [email protected] is the author and maintainer of bdvis and a key member in the bdverse development team. Vijay is a biodiversity data scientist and has been a GSoC student and mentor since 2012 with the R project organization. Vijay has contributed to several packages on CRAN.

Tests

Students, please do one or more of the following tests before contacting the mentors above. We designed these tests to be incorporated into your proposal rather quickly.

  • Easy: fork the bdutilities package, develop two unit tests, and submit a PR.

  • Medium: fork the bdutilities.app package and build a simple Shiny app by incorporating the three modules in this repository (look at other bdverse Shiny apps to see how this can be done easily); develop two Shiny tests, and submit a PR with a feature branch.

  • Hard: study the bdDwC package and write an R Markdown document describing its ideal testing strategy; the more detailed, the merrier.

  • Hard: pick four CI services and describe their pros and cons; present this comparison in an app built with flexdashboard.
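
For orientation on the unit-testing tasks above, here is a minimal sketch of what two testthat unit tests can look like. It assumes the testthat package is installed. The function under test, trim_scientific_name(), is hypothetical and invented for this illustration; it is not taken from bdutilities, whose actual functions should be tested in a real submission.

```r
library(testthat)

# Hypothetical function under test; it is NOT part of bdutilities.
trim_scientific_name <- function(x) {
  stopifnot(is.character(x))
  # Collapse runs of whitespace, then strip leading and trailing spaces.
  trimws(gsub("\\s+", " ", x))
}

# Test 1: expected behaviour on well-formed input.
test_that("whitespace is normalised", {
  expect_equal(trim_scientific_name("  Panthera   leo "), "Panthera leo")
})

# Test 2: malformed input is rejected instead of silently passing through.
test_that("non-character input is rejected", {
  expect_error(trim_scientific_name(42))
})
```

In a package, tests like these would normally live in files under tests/testthat/ and be run as part of R CMD check rather than being sourced directly.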

Solutions of tests

Students, please post here a GitHub link to your test solutions in the format:

  1. Name - Email - University - Link to solutions
  2. Sunny Dhoke - sunnydhoke22[at]gmail[dot]com - Indian Institute of Information Technology, Nagpur, India - Solution(Easy Test) - Solution(Medium Test) - Solution(Hard Test/CI/DOC) - Solution(Hard Test/CI/DashBoard)
  3. Aaradhya Saxena - [email protected] - Indian Institute of Technology, Roorkee, India - Solution_Test1 - Solution_Test2