polars in R
polars is one of the fastest new data manipulation libraries, written in Rust and using Apache Arrow storage. For larger data pipelines in particular, polars brings to R:
- Lazy file scanners (parquet, csv, ipc, ...)
- Lazy interaction with SQL databases
- Query optimization across mixed data sources
- Larger than memory data manipulation
- Seamless multi-threading
- Easy and powerful scalability to hundreds of CPUs without cluster computing or much configuration
- A type rich environment
- Immutable (copy-on-write) data structures that are very true to the spirit of R's functional paradigm
- data.table package: C instead of Rust. No Arrow storage. No query optimization. No lazy evaluation (*footnote). Still pretty awesome.
- arrow package: Arrow storage + dplyr. No optimization, no extensive multithreading. A very popular syntax.
- sparkR: polars copied the syntax. Great for Big Data. Cumbersome to set up, especially in a CI/CD machine-learning environment. Not very efficient (computation/resources). Long boot-up times. Only reasonably fast when using large clusters for long periods.
{{*footnote To be fair, the term 'lazy' has several meanings. Here is a list of ways one may think of data.table as a 'lazy' syntax that saves allocations or computation:
- dtplyr has a lazy syntax that is translated into data.table syntax, but the query engine is still the same.
- data.table uses views to avoid copying.
- data.table has e.g. the .EACHI keyword to perform a non-allocating join followed by an immediate aggregation, which can be a huge speed-up. In that way data.table does have some special operations that would require several operations to emulate in e.g. dplyr or base R.
data.table will, however, evaluate every expression immediately, unlike polars, which combines expressions into a logical plan of operations that can optionally be optimized before execution.}}
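The footnote's distinction between immediate evaluation and a logical plan can be illustrated with a small toy sketch. It is not the polars API: the `Plan` type and the `optimize` and `execute` functions are hypothetical names invented for this illustration, and the single rewrite rule (moving a filter below a sort) only stands in for the kind of optimization a real query engine may apply.

```rust
// Toy model of lazy evaluation: operations are first recorded as a plan,
// optionally rewritten, and only then executed.
// Hypothetical names throughout; this is NOT the polars API.

#[derive(Debug, Clone, PartialEq)]
enum Plan {
    /// Source rows (stand-in for a lazy file scan).
    Scan(Vec<i64>),
    /// Keep rows strictly greater than the threshold.
    Filter { input: Box<Plan>, greater_than: i64 },
    /// Sort rows in ascending order.
    Sort { input: Box<Plan> },
}

/// Rewrite `Filter(Sort(x))` into `Sort(Filter(x))` so that fewer rows are sorted.
fn optimize(plan: Plan) -> Plan {
    match plan {
        Plan::Filter { input, greater_than } => match optimize(*input) {
            // The filter sits on top of a sort: push it underneath.
            Plan::Sort { input: sorted_input } => Plan::Sort {
                input: Box::new(optimize(Plan::Filter {
                    input: sorted_input,
                    greater_than,
                })),
            },
            // Nothing to push past: keep the filter where it is.
            other => Plan::Filter { input: Box::new(other), greater_than },
        },
        Plan::Sort { input } => Plan::Sort { input: Box::new(optimize(*input)) },
        Plan::Scan(rows) => Plan::Scan(rows),
    }
}

/// Run a plan. Nothing is computed until this function is called.
fn execute(plan: &Plan) -> Vec<i64> {
    match plan {
        Plan::Scan(rows) => rows.clone(),
        Plan::Filter { input, greater_than } => execute(input)
            .into_iter()
            .filter(|v| *v > *greater_than)
            .collect(),
        Plan::Sort { input } => {
            let mut rows = execute(input);
            rows.sort();
            rows
        }
    }
}
```

Building a `Filter` over a `Sort` over a `Scan` and passing it through `optimize` yields a `Sort` over a `Filter` over the same `Scan`. Both plans produce the same rows when passed to `execute`, but the optimized plan sorts only the rows that survive the filter. An eager engine, by contrast, would have sorted everything the moment the sort was requested.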
Polars has 800+ 'functions/methods' to implement, and new ones are added every week. Much-needed contributions are: binding features of the Rust API, then writing the R function, docs, and tests. Many tasks are not that difficult, but there is a lot of work to be done! There are also R-only tasks: improving the R syntax with best practices and writing vignettes and tutorials. If you are a Rust and C whiz, there are still interesting performance-improvement tasks to be sorted out.
- extendr: invaluable groundwork to fuse R and Rust. Template.
- py-polars: How polars was implemented in Python.
- nodejs-polars: How polars was implemented in Node.js.
- The book for starting Rust: It's an amazing journey. Afterwards, you will see programming differently.
If R is to stay relevant as a production language, polars is a great stepping stone. For tabular data where computational resources are a limiting factor, polars should be considered.
Contributors, please contact mentors below after completing at least one of the tests below.
- Soren H. Welling [email protected], author of r-polars. New to R-GSOC. Independent consultant tackling data science problems with R, C++ and Python. On a deep dive into Rust since last year. PhD in some ML + computational chemistry.
- Toby Hocking [email protected] has 10+ years of experience in R-GSOC and can co-mentor.
Contributors, please do one or more of the following tests before contacting the mentors above.
- Easy: Install r-polars directly from the precompiled binary; see the GitHub Readme.md. Write a lazy query which lazily reads two csv files, joins them, filters, and manipulates columns via expressions. Build the query with at least 15 of the already translated expression functions. Use apply and/or map to execute an R user function within a lazy polars query.
- Medium: Use rustup to install Rust nightly. Clone r-polars. Restore the renv environment. Build/compile the r-polars package locally. Implement a function `Expr_sum_add2` which should behave like `Expr_sum` but also add `2` to the result. The add-2 implementation should be on the Rust side (see Expr::add and the eager/lazy cookbooks in the polars Rust API docs). Add documentation and show that the function can be used in the package.
- Hard: Make a pull request implementing `scan_ipc` (py-polars example) similar to `scan_csv` (r-polars example). Far from all types are auto-translated by extendr, so you will likely have to write/use some wrapper types.
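To make the intent of the Medium test concrete, here is a minimal, self-contained sketch of "sum, then add 2" as a composition of expressions. It deliberately does not use the polars crate or extendr: the `Expr` enum and its methods are hypothetical stand-ins written for this illustration. In the real polars Rust API the analogous building blocks would presumably be `Expr::sum` and a literal created with `lit`, combined through `Expr::add`, and the actual r-polars solution would additionally need the extendr binding and the R wrapper function.

```rust
// Toy expression tree showing how `sum_add2` composes two existing
// expressions. Hypothetical names; NOT the polars or r-polars API.

#[derive(Debug, Clone, PartialEq)]
enum Expr {
    /// Values of a column (stand-in for a column reference).
    Column(Vec<f64>),
    /// A literal scalar.
    Lit(f64),
    /// Sum of the input, reduced to a single value.
    Sum(Box<Expr>),
    /// Element-wise addition; a length-1 side is broadcast.
    Add(Box<Expr>, Box<Expr>),
}

impl Expr {
    fn sum(self) -> Expr {
        Expr::Sum(Box::new(self))
    }

    fn add(self, rhs: Expr) -> Expr {
        Expr::Add(Box::new(self), Box::new(rhs))
    }

    /// Behaves like `sum`, then adds 2 to the result.
    fn sum_add2(self) -> Expr {
        self.sum().add(Expr::Lit(2.0))
    }

    /// Evaluate the expression tree eagerly.
    fn eval(&self) -> Vec<f64> {
        match self {
            Expr::Column(values) => values.clone(),
            Expr::Lit(x) => vec![*x],
            Expr::Sum(input) => {
                let total: f64 = input.eval().iter().sum();
                vec![total]
            }
            Expr::Add(lhs, rhs) => {
                let (l, r) = (lhs.eval(), rhs.eval());
                match (l.len(), r.len()) {
                    // Broadcast a scalar on either side.
                    (_, 1) => l.iter().map(|x| x + r[0]).collect(),
                    (1, _) => r.iter().map(|y| l[0] + y).collect(),
                    // Otherwise add pairwise (truncates to the shorter side).
                    _ => l.iter().zip(r.iter()).map(|(x, y)| x + y).collect(),
                }
            }
        }
    }
}
```

The point of the design is that `sum_add2` adds no new evaluation logic: it only builds a larger expression out of `sum` and `add`. The same structure is why the test asks for the addition on the Rust side rather than in R.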
Contributors, please post a link to your test results here.
Contributor Name | GitHub Profile | Test Results |
---|---|---|
Himanshu Kumar Singh | https://github.com/xendai66 | https://github.com/xendai66/polars-in-R-solutions |