rco: The R Code Optimizer
A brief search on the web suffices to notice that R is slow compared to other popular programming languages. “The R interpreter is not fast and execution of large amounts of R code can be unacceptably slow” [1]. The main reason is that “R was purposely designed to make data analysis and statistics easier for you to do. It was not designed to make life easier for your computer” [2]. Currently the most widely used R interpreter is GNU-R. Although several alternative implementations of the R interpreter attempt to improve execution speed [3–8], “switching interpreters is something to consider carefully” [9].
“Beyond performance limitations due to design and implementation, it has to be said that a lot of R code is slow simply because it’s poorly written. Few R users have any formal training in programming or software development … This means that it’s relatively easy to make most R code much faster” [2].
“It is important to pursue efficiency issues, and in particular, speed” [10]. “A good deal of work is going into making R more efficient. Much of this work consists of reimplementing interpreted R code” [1].
Thanks to the GSoC program, last year the student Juan Cruz Rodriguez developed the rco R package. This package analyzes R code and automatically applies different optimization strategies, returning R code that runs faster. Although seven automatic code optimization strategies were developed in rco during this past GSoC project, dozens of others remain to be developed. That is why, in this new edition of GSoC, it is proposed to continue working on the rco package. The main tasks will be to develop new optimization strategies, improve the package's usability, and fix bugs.
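To make the idea of a source-to-source optimization concrete, the following is a minimal sketch of one classic strategy, constant folding, written in base R. The helper `fold_constants` is hypothetical and is not the rco implementation; it assumes expressions without empty arguments (such as `x[1, ]`) and only folds basic arithmetic on numeric literals.

```r
# Hypothetical helper (not part of rco): replace arithmetic on numeric
# literals with its result, leaving everything else untouched.
fold_constants <- function(expr) {
  # Symbols and literals have nothing to fold.
  if (!is.call(expr)) {
    return(expr)
  }
  # Fold the arguments first (bottom-up traversal of the syntax tree).
  args <- lapply(as.list(expr)[-1], fold_constants)
  folded <- as.call(c(list(expr[[1]]), args))

  op <- expr[[1]]
  is_arith <- is.symbol(op) &&
    as.character(op) %in% c("+", "-", "*", "/", "^", "(")
  all_const <- length(args) > 0 &&
    all(vapply(args, is.numeric, logical(1)))

  # Evaluate only when every argument is already a numeric literal.
  if (is_arith && all_const) eval(folded, baseenv()) else folded
}

print(fold_constants(quote(x * (2 + 3))))        # prints: x * 5
print(fold_constants(quote(y <- 60 * 60 * 24)))  # prints: y <- 86400
```

The result is still ordinary R code, so the user can read exactly what was changed.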
To the best of our knowledge, apart from rco, the only existing tool to automatically optimize R code is the compiler library. The high impact of this library is demonstrated by its inclusion in GNU-R since version 2.13.0. Although the compiler library manages, in certain cases, to improve the execution time of R code, its main objective is to compile expressions into byte code. Since optimization is not the main goal of the compiler package, it leaves aside several optimization strategies commonly known by the community [11]. In addition, because the result of applying the functions of the compiler library is byte code, the user cannot easily understand which modifications make their code more efficient.
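The point about byte code can be illustrated with a short sketch using `compiler::cmpfun`. The function `scaled_sum` is a hypothetical example, not taken from the rco or compiler documentation.

```r
library(compiler)

# Hypothetical example function.
scaled_sum <- function(n) {
  total <- 0
  for (i in seq_len(n)) {
    total <- total + i * 2
  }
  total
}

# cmpfun() returns a byte-compiled copy of the closure.
scaled_sum_bc <- cmpfun(scaled_sum)

# Both versions compute the same value.
stopifnot(identical(scaled_sum(10), scaled_sum_bc(10)))

# body() still shows the source as written: whatever the compiler
# changed lives in the byte code, not in readable R code.
print(body(scaled_sum_bc))

# The byte code can be inspected, but it is not R source:
# disassemble(scaled_sum_bc)
```

Whatever the compiler improves is therefore not visible as a modified R source that the user could learn from.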
Other types of related work include blog posts, web pages, and books that provide tips and guides to follow in order to optimize R code [2,9,12–16]. Although these texts contain intuitive and easy-to-apply strategies, none of them provides an automatic way of optimizing the code.
Automatic code optimization strategies were first implemented for compiled languages, the best-known example being the GNU Compiler Collection (gcc; formerly called GNU C Compiler). This C code compiler was initially developed more than 30 years ago and implements more than 100 different code optimization techniques. Although R is interpreted, and therefore certain optimization techniques for compiled code cannot be implemented, many of these ideas can be applied to interpreted languages. Precedents of code optimization tools for interpreted languages include PMD for Java, and Vulture and PyCC for Python.
The tasks to be carried out during the present summer of code project will be:
- Study several code optimization strategies. Evaluate the complexity of implementing them in R, and their efficiency gains (mainly speed).
- Rank the optimization strategies based on efficiency gain against complexity.
- Fully develop (code, tests, documentation, etc.) at least five of these optimization strategies.
Common optimization strategies:
R-specific optimization strategies:
As can be seen on the GitHub page of rco, there are plenty of issues and pull requests to work on in order to improve the package’s usability. The student will have to review all of them and decide, in discussion with the mentors, which ones deserve working on.
- Since the output of the package functions will be R code, the package is expected to be useful for teaching and learning efficient coding practices.
- The most ambitious impact of this project would be to replicate the success of the compiler package. Moreover, a pipeline of R Code Optimizer %>% compiler would generate great results. While this expectation sounds ambitious, it becomes achievable if the correctness of the implementation of each optimization strategy is checked.
Students, please contact mentors below after completing at least one of the tests below.
- EVALUATING MENTOR: Dr. Juan Cruz Rodriguez created the rco package during the GSoC 2019 project. He has extensive knowledge of high-performance computing, optimizing compilers, low-level programming, etc. He has been working with the R language for more than six years and has developed several R packages, some of them published on Bioconductor and CRAN. He was a Google Code-in 2019 mentor.
- Mr. Mauricio "Pachá" Vargas Sepúlveda has industry-class knowledge of R, Shiny and relational databases. He has been working with the R language for more than five years and has developed several R packages, some of them published on rOpenSci <3 and CRAN, and some with R Consortium funding.
Students, please do one or more of the following tests before contacting the mentors above.
Solutions must be submitted via a pull request to the rco GitHub repository.
- Easy: write a chunk of code that is automatically optimized by applying the rco::optimize_files function, and compare its execution time (with microbenchmark::microbenchmark) against the non-optimized code. This should be submitted as a vignette named “rco-optimization-example-by-STUDENT.Rmd”.
- Medium: develop the code of one optimization strategy; it can be one of those listed in the previous section.
- Hard: fully develop (code, tests & vignette) one optimization strategy; it can be one of those listed in the previous section.
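As a rough idea of what the comparison in the Easy test could look like, here is a minimal sketch. The functions and data are hypothetical, and the "optimized" version is folded by hand as a stand-in: what rco::optimize_files actually emits may differ. The timing step runs only if the microbenchmark package is installed.

```r
# Hypothetical original code: the constant product (60 * 60 * 24) is
# written so that it is evaluated on every iteration.
days_to_seconds <- function(days) {
  out <- numeric(length(days))
  for (i in seq_along(days)) {
    out[i] <- days[i] * (60 * 60 * 24)
  }
  out
}

# Hand-folded stand-in for optimizer output (assumption: not produced
# by rco; in the actual test the code would be saved to a file and
# passed to rco::optimize_files).
days_to_seconds_opt <- function(days) {
  out <- numeric(length(days))
  for (i in seq_along(days)) {
    out[i] <- days[i] * 86400
  }
  out
}

days <- as.numeric(1:10000)

# An optimization must never change the result.
stopifnot(identical(days_to_seconds(days), days_to_seconds_opt(days)))

# Compare execution times when microbenchmark is available.
if (requireNamespace("microbenchmark", quietly = TRUE)) {
  print(microbenchmark::microbenchmark(
    original = days_to_seconds(days),
    optimized = days_to_seconds_opt(days),
    times = 50
  ))
}
```

The measured difference for such a small example may be negligible, since GNU-R's byte-code compiler can already fold some constants. A vignette would ideally pick code where the optimized version shows a clear gain.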
Students, please post a link to your test results here.
- Rahul Saxena: Github_Profile, Solutions, PR_1 for Column extraction OLD, PR_1 for Column extraction NEW, PR_2 for Value Extraction
[1] R. Ihaka, R: Lessons learned, directions for the future, in: Joint Statistical Meetings, The Authors, 2010. https://www.stat.auckland.ac.nz/~ihaka/downloads/JSM-2010.pdf.
[2] H. Wickham, Advanced R, Chapman & Hall/CRC, 2014. http://adv-r.had.co.nz/.
[3] Microsoft R Open, 2018. https://mran.microsoft.com/open.
[4] pqR - a pretty quick version of R, 2018. http://www.pqr-project.org/.
[5] Renjin, 2018. http://www.renjin.org/.
[6] FastR, 2018. https://github.com/oracle/fastr.
[7] Riposte, a fast interpreter and JIT for R, 2015. https://github.com/jtalbot/riposte/tree/library.
[8] Rho, 2017. https://github.com/rho-devel/rho.
[9] C. Gillespie, R. Lovelace, Efficient R programming, O’Reilly Media, Incorporated, 2016. https://csgillespie.github.io/efficientR/.
[10] R. Ihaka, R: Past and future history, Computing Science and Statistics. 392–396 (1998). https://www.stat.auckland.ac.nz/~ihaka/downloads/Interface98.pdf.
[11] K. Cooper, L. Torczon, Engineering a compiler, Elsevier, 2011. https://www.elsevier.com/books/engineering-a-compiler/cooper/978-0-12-088478-0.
[12] P. Burns, The R inferno, 2011. https://www.burns-stat.com/pages/Tutor/R_inferno.pdf.
[13] Fast R code, n.d. http://www.dartistics.com/fast-r-code.html.
[14] Strategies to speedup R code, 2016. https://datascienceplus.com/strategies-to-speedup-r-code/.
[15] FasteR! HigheR! StrongeR! - a guide to speeding up R code for busy people, 2013. http://www.noamross.net/blog/2013/4/25/faster-talk.html.
[16] Making R code faster: A case study, 2017. https://robinsones.github.io/Making-R-Code-Faster-A-Case-Study/.