xendai66 edited this page Mar 6, 2023 · 12 revisions

Background

polars is a new, very fast data manipulation library written in Rust on top of Apache Arrow storage. For larger data pipelines in particular, polars brings to R:

  • Lazy file scanners (parquet, csv, ipc, ...)
  • Lazy interaction with SQL databases
  • Query optimization across mixed data sources
  • Larger than memory data manipulation
  • Seamless multi-threading
  • Easy and powerful scalability to hundreds of CPUs without cluster computing or much configuration
  • A type-rich environment
  • Immutable (copy-on-write) data structures that are very true to the spirit of R's functional paradigms
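
To make the last point concrete, here is a minimal sketch of immutable data with structural sharing, the idea behind copy-on-write. It is a toy written in plain Python, not polars code: the `Frame` class and its `with_column` and `column` methods are hypothetical names invented for this illustration.

```python
class Frame:
    """Toy immutable table: a mapping from column name to a tuple of values."""

    def __init__(self, columns):
        # Copy only the name -> column mapping; the column tuples are shared.
        self._columns = dict(columns)

    def column(self, name):
        return self._columns[name]

    def with_column(self, name, values):
        # Return a NEW Frame. Untouched columns are shared, not copied;
        # only the replaced column is allocated.
        new_columns = dict(self._columns)
        new_columns[name] = tuple(values)
        return Frame(new_columns)


original = Frame({"a": (1, 2, 3), "b": (4, 5, 6)})
updated = original.with_column("b", [v * 10 for v in original.column("b")])

print(original.column("b"))  # the original is unchanged
print(updated.column("b"))   # the new frame sees the new column
print(updated.column("a") is original.column("a"))  # column "a" is shared
```

The original object is never modified, which matches how R users expect values to behave, while the unchanged column is reused rather than duplicated.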

Related work (opinionated, feel free to disagree)

  • data.table package: C instead of Rust. Not arrow storage. No query optimization. No lazy evaluation (*footnote). Still pretty awesome.
  • arrow package: Arrow storage + dplyr. No optimization, no extensive multithreading. A very popular syntax.
  • sparkR: polars copied the syntax. Great for Big Data. Cumbersome to set up, especially in a CI/CD machine-learning environment. Not very efficient (computation/resources). Long boot-up times. Only reasonably fast when using large clusters for long periods.

{{*footnote To be fair, the term 'lazy' has several meanings. Here is a list of ways one may think of data.table as having a 'lazy' syntax that saves allocations or computation:

  • dtplyr has a lazy syntax that is translated into data.table syntax. But the query engine is still the same.
  • data.table uses views to avoid copying.
  • data.table has e.g. the .EACHI keyword to perform a non-allocating join followed immediately by an aggregation, which can be a huge speed-up. In that way data.table does have some special operations that would require several operations to emulate in e.g. dplyr or base R.

data.table will, though, evaluate every expression immediately and not, like polars, combine expressions into a logical plan of operations that can optionally be optimized before execution.}}
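
The difference between immediate evaluation and a logical plan can be sketched with a toy example. This is plain Python, not the polars engine: `LazyTable` and its methods are hypothetical names, and the only "optimization" shown is moving row filters in front of column selections so that fewer rows are projected.

```python
class LazyTable:
    """Toy lazy table: records steps in a plan instead of running them."""

    def __init__(self, rows, plan=()):
        self._rows = list(rows)
        self.plan = tuple(plan)

    def select(self, *columns):
        # Nothing is computed here; the step is only appended to the plan.
        return LazyTable(self._rows, self.plan + (("select", columns),))

    def filter(self, column, predicate):
        return LazyTable(self._rows, self.plan + (("filter", (column, predicate)),))

    def optimize(self):
        # Run all filters first. This is safe in this toy because selects only
        # drop columns, so a column filtered after a select also exists before it.
        filters = [step for step in self.plan if step[0] == "filter"]
        selects = [step for step in self.plan if step[0] == "select"]
        return LazyTable(self._rows, filters + selects)

    def collect(self):
        # Only now is the recorded plan executed, step by step.
        rows = self._rows
        for kind, arg in self.plan:
            if kind == "filter":
                column, predicate = arg
                rows = [r for r in rows if predicate(r[column])]
            else:
                rows = [{c: r[c] for c in arg} for r in rows]
        return rows


rows = [
    {"id": 1, "x": 10, "y": 5},
    {"id": 2, "x": 20, "y": 6},
    {"id": 3, "x": 30, "y": 7},
]
query = LazyTable(rows).select("id", "x").filter("x", lambda v: v > 15)

plan_kinds = [kind for kind, _ in query.plan]
optimized_kinds = [kind for kind, _ in query.optimize().plan]
result = query.optimize().collect()

print(plan_kinds)       # steps in the order they were written
print(optimized_kinds)  # steps after reordering
print(result)
```

An eager library would have run the select as soon as it was called. Because the toy only records steps, the whole query is visible before anything runs, which is what makes reordering possible.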

Details of project: Bring awesome polars to R now!

Polars has 800+ 'functions/methods' to implement, and new ones are added every week. The most needed contributions are binding features of the Rust API and writing the corresponding R function, docs and tests. Many tasks are not that difficult, but there is a lot of work to be done! There are also R-only tasks: improving the R syntax with best practices and writing vignettes and tutorials. If you are a Rust and C whiz, there are still interesting tasks on performance improvements to be sorted out.

r-polars is now hosted under the pola-rs umbrella.

Early proof of concept including the R ALTREP vector:

Early requests on the extendr and polars issue trackers:

Also important:

  • extendr: invaluable groundwork to fuse R and Rust. Template.
  • py-polars: How polars was implemented in Python.
  • nodejs-polars: How polars was implemented in Node.js.
  • The book for starting Rust: It's an amazing journey. Afterwards, you will see programming differently.

Expected impact

If R is to stay relevant as a production language, polars is a great stepping stone. polars should be considered for any tabular data workload where computational resources are the limiting factor.

Mentors

Contributors, please contact mentors below after completing at least one of the tests below.

  • Soren H. Welling [email protected], author of r-polars. New to R-GSOC. Independent consultant tackling data science problems with R, C++ and Python. On a deep dive into Rust since last year. PhD in ML and computational chemistry.

  • Toby Hocking [email protected] has 10+ years experience in R-GSOC, and can co-mentor.

Tests

Contributors, please do one or more of the following tests before contacting the mentors above.

MENTORS: write several tests that potential contributors can do to demonstrate their capabilities for this particular project. Ask some hard questions that will give you insight about how the contributors write code to solve problems. You'll see that the harder the questions that you ask, the easier it will be for you to choose between the contributors that apply for your project! Please modify the suggestions below to make them specific for your project.

  • Easy: Install r-polars directly from the precompiled binary (see the GitHub README.md). Write a lazy query that lazily reads two csv files, joins them, filters rows and manipulates columns via expressions. Build the query with at least 15 of the already translated expression functions. Use apply and/or map to execute an R user function within a lazy polars query.

  • Medium: Use rustup to install Rust nightly. Clone r-polars. Restore the renv environment. Build/compile the r-polars package locally. Implement a function Expr_sum_add2 that behaves like Expr_sum but also adds 2 to the result. The add-2 implementation should be on the Rust side (see Expr::add and the eager/lazy cookbooks in the polars Rust API docs). Add documentation and show that the function can be used in the package.

  • Hard: Make a pull request implementing scan_ipc (py-polars example), similar to scan_csv (r-polars example). Far from all types are auto-translated by extendr, so you will likely have to write or use some wrapper types.

Solutions of tests

Contributors, please post a link to your test results here.

Contributor Name | GitHub Profile | Test Results
Himanshu Kumar Singh | https://github.com/xendai66 | https://github.com/xendai66/polars-in-R-solutions
  • EXAMPLE CONTRIBUTOR 1 NAME, LINK TO GITHUB PROFILE, LINK TO TEST RESULTS.