rco: The R Code Optimizer

Rahul Saxena edited this page Mar 29, 2020 · 10 revisions

Background

A brief search on the web suffices to notice that R is slow compared to other popular programming languages. “The R interpreter is not fast and execution of large amounts of R code can be unacceptably slow” [1]. The main reason is that “R was purposely designed to make data analysis and statistics easier for you to do. It was not designed to make life easier for your computer” [2]. Currently the most widely used R interpreter is GNU-R. Although several alternative R interpreters attempt to improve execution speed [3–8], “switching interpreters is something to consider carefully” [9].

“Beyond performance limitations due to design and implementation, it has to be said that a lot of R code is slow simply because it’s poorly written. Few R users have any formal training in programming or software development … This means that it’s relatively easy to make most R code much faster” [2].

“It is important to pursue efficiency issues, and in particular, speed” [10]. “A good deal of work is going into making R more efficient. Much of this work consists of reimplementing interpreted R code” [1].

Thanks to the GSoC program, last year the student Juan Cruz Rodriguez developed the rco R package. This package analyzes R code and automatically applies different optimization strategies, returning R code that runs faster. Although seven automatic code optimization strategies were developed in rco during this past GSoC project, dozens of others remain to be developed. That is why, in this new edition of GSoC, it is proposed to continue working on the rco package. The main tasks will be to develop new optimization strategies, improve usability, and fix bugs.

Related work

To the best of our knowledge, apart from rco, the only existing tool to automatically optimize R code is the compiler package. Its high impact is demonstrated by its inclusion in GNU-R since version 2.13.0. Although the compiler package manages, in certain cases, to improve the execution time of R code, its main objective is to compile expressions into byte code. Since optimization is not its main goal, the package leaves aside several optimization strategies commonly known by the community [11]. In addition, because the result of applying the functions of the compiler package is byte code, it does not allow the user to easily understand which modifications make their code more efficient.
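As a small illustration of the last point, the sketch below byte-compiles a function with the compiler package and inspects the result. The function sum_squares is a hypothetical example, not taken from rco or from the compiler documentation.

```r
library(compiler)

# Hypothetical example function: sums squares with an explicit loop.
sum_squares <- function(n) {
  total <- 0
  for (i in seq_len(n)) {
    total <- total + i^2
  }
  total
}

# cmpfun() returns a byte-compiled copy of the function.
sum_squares_bc <- cmpfun(sum_squares)

# Both versions compute the same value.
stopifnot(identical(sum_squares(100), sum_squares_bc(100)))

# disassemble() prints the byte-code instruction listing. It is not R source,
# so it is hard to see from it which source-level changes would help.
disassemble(sum_squares_bc)
```

The compiled function keeps its original source, so the user gains no rewritten R code to read or learn from, which is the gap rco aims to fill.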

Other types of related work include blog posts, web pages, and books that provide tips and guides to follow in order to optimize R code [2,9,12–16]. Although intuitive and easy-to-apply strategies are found in these texts, none of them provides an automatic way of optimizing the code.

Automatic code optimization strategies were first implemented for compiled languages, the best-known example being the GNU Compiler Collection (gcc; formerly called GNU C Compiler). This compiler was initially developed more than 30 years ago and implements more than 100 different code optimization techniques. Although R is interpreted, and therefore certain optimization techniques for compiled code cannot be implemented, many of these ideas can be applied to interpreted languages. Precedents of code optimization tools for interpreted languages include PMD for Java, and Vulture and PyCC for Python.
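To make the idea concrete, here is a minimal sketch of one classic compiler optimization, constant folding, applied to R code as a source-to-source transformation. It uses only base R and is not rco's implementation. The function name fold_constants is hypothetical. The sketch assumes the arithmetic operators have their base meaning, and it does not handle calls with empty arguments such as x[1, ].

```r
# Toy constant folder: replaces arithmetic on numeric constants by its value.
fold_constants <- function(expr) {
  # Symbols and constants are returned unchanged.
  if (!is.call(expr)) {
    return(expr)
  }
  # Fold the arguments first, so folding proceeds bottom-up.
  args <- lapply(as.list(expr)[-1], fold_constants)
  expr <- as.call(c(list(expr[[1]]), args))

  fn <- expr[[1]]
  if (!is.symbol(fn)) {
    return(expr)
  }
  fn_name <- as.character(fn)
  all_numeric <- length(args) > 0 &&
    all(vapply(args, is.numeric, logical(1)))

  # Drop parentheses that wrap a single constant, e.g. (3600) becomes 3600.
  if (fn_name == "(" && all_numeric) {
    return(args[[1]])
  }
  # Evaluate arithmetic on constants once, at optimization time.
  if (fn_name %in% c("+", "-", "*", "/", "^") && all_numeric) {
    return(eval(expr, baseenv()))
  }
  expr
}

original <- quote(seconds <- hours * (60 * 60))
optimized <- fold_constants(original)
print(optimized)
```

Because the output is again an R expression, the user can read the optimized code directly, unlike the byte code produced by a compiler.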

Details of your coding project

The tasks to be carried out during the present summer of code project will be:

Development of new optimization strategies

  • Study several code optimization strategies. Evaluate the complexity of implementing them in R, and their efficiency gains (mainly speed).
  • Rank the optimization strategies based on efficiency gain against complexity.
  • Develop (code, tests, documentation, etc.) at least five of these optimization strategies.

Examples of such optimization strategies are:

Common optimization strategies:

R-specific optimization strategies:

Improve usability and fix bugs

As can be seen on the GitHub page of rco, there are plenty of issues and pull requests to work on in order to improve the package’s usability. The student will have to review all of them and decide which ones deserve work, a decision that will be discussed with the mentors.

Expected impact

  • Since the output of the package functions will be R code, it is expected to be used to teach/learn efficient coding practices.
  • The most ambitious impact of this project would be to replicate the success of the compiler package. Moreover, a pipeline of R Code Optimizer %>% compiler would generate great results. While this expectation sounds ambitious, it could become a reality provided that the correctness of each optimization strategy’s implementation is checked.

Mentors

Students, please contact mentors below after completing at least one of the tests below.

  1. EVALUATING MENTOR: Dr. Juan Cruz Rodriguez - created the rco package during the GSoC 2019 project. He has extensive knowledge of high-performance computing, optimizing compilers, low-level programming, etc. He has been working with the R language for more than six years and has developed several R packages, some of them published on Bioconductor and CRAN. He was a Google Code-in 2019 mentor.

  2. Mr. Mauricio "Pachá" Vargas Sepúlveda - has industry-class knowledge of R, Shiny and relational databases. He has been working with the R language for more than five years and has developed several R packages, some of them published on rOpenSci <3 and CRAN, and some with R Consortium funding.

Tests

Students, please do one or more of the following tests before contacting the mentors above.

Solutions must be submitted via a pull request to the rco GitHub repository.

  • Easy: write a chunk of code that would be automatically optimized by applying the rco::optimize_files function, and compare its execution time (with microbenchmark::microbenchmark) against the non-optimized code. This should be submitted as a vignette named “rco-optimization-example-by-STUDENT.Rmd”.
  • Medium: develop the code of one optimization strategy; it can be one of those mentioned in the previous section.
  • Hard: fully develop (code, tests & vignette) one optimization strategy; it can be one of those mentioned in the previous section.
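For the Easy test, the timing comparison could look roughly like the sketch below. Both functions are hypothetical and hand-written: optimized is a stand-in for what an optimizer might produce by moving a loop-invariant computation out of the loop, and it is not verified here that rco::optimize_files performs this exact transformation. The sketch falls back to base R's system.time when microbenchmark is not installed.

```r
# Hypothetical non-optimized code: the multiplier is recomputed on every
# iteration even though it does not depend on i.
original <- function(x, scale) {
  out <- numeric(length(x))
  for (i in seq_along(x)) {
    multiplier <- sqrt(scale) * 2
    out[i] <- x[i] * multiplier
  }
  out
}

# Hand-optimized stand-in: the invariant computation is done once.
optimized <- function(x, scale) {
  out <- numeric(length(x))
  multiplier <- sqrt(scale) * 2
  for (i in seq_along(x)) {
    out[i] <- x[i] * multiplier
  }
  out
}

x <- runif(1e5)

# An optimization must not change the result.
stopifnot(identical(original(x, 9), optimized(x, 9)))

if (requireNamespace("microbenchmark", quietly = TRUE)) {
  print(microbenchmark::microbenchmark(
    original = original(x, 9),
    optimized = optimized(x, 9),
    times = 20
  ))
} else {
  # Fallback: coarse timings with base R only.
  print(system.time(for (k in 1:20) original(x, 9)))
  print(system.time(for (k in 1:20) optimized(x, 9)))
}
```

Checking that both versions return identical results before timing them keeps the comparison meaningful; the measured speedup will vary by machine and R version.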

Solutions of tests

Students, please post a link to your test results here.

References

[1] R. Ihaka, R: Lessons learned, directions for the future, in: Joint Statistical Meetings, The Authors, 2010. https://www.stat.auckland.ac.nz/~ihaka/downloads/JSM-2010.pdf.

[2] H. Wickham, Advanced R, Chapman & Hall/CRC, 2014. http://adv-r.had.co.nz/.

[3] Microsoft R Open, 2018. https://mran.microsoft.com/open.

[4] PqR - a pretty quick version of R, 2018. http://www.pqr-project.org/.

[5] Renjin, 2018. http://www.renjin.org/.

[7] Riposte, a fast interpreter and JIT for R, 2015. https://github.com/jtalbot/riposte/tree/library.

[9] C. Gillespie, R. Lovelace, Efficient R programming, O’Reilly Media, Incorporated, 2016. https://csgillespie.github.io/efficientR/.

[10] R. Ihaka, R: Past and future history, Computing Science and Statistics. 392–396 (1998). https://www.stat.auckland.ac.nz/~ihaka/downloads/Interface98.pdf.

[11] K. Cooper, L. Torczon, Engineering a compiler, Elsevier, 2011. https://www.elsevier.com/books/engineering-a-compiler/cooper/978-0-12-088478-0.

[12] P. Burns, The R inferno, 2011. https://www.burns-stat.com/pages/Tutor/R_inferno.pdf.

[14] Strategies to speedup R code, 2016. https://datascienceplus.com/strategies-to-speedup-r-code/.

[15] FasteR! HigheR! StrongeR! - a guide to speeding up R code for busy people, 2013. http://www.noamross.net/blog/2013/4/25/faster-talk.html.

[16] Making R code faster : A case study, 2017. https://robinsones.github.io/Making-R-Code-Faster-A-Case-Study/.
