Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: Move hive partitioning outside of readers #20203

Draft
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

coastalwhite
Copy link
Collaborator

@coastalwhite coastalwhite commented Dec 6, 2024

This is a very early stage PR exploring whether it is possible to take the hive partitioning out of the readers. This will also make the Hive partitioning complete file type agnostic.

Currently, this PR does not do anything unless POLARS_NEW_HIVE is set to 1.

Here is a rough todo list.

  • Row Indexes
  • Include File Paths
  • Projection Pushdown
  • Slicing
    • Positive
    • Negative
  • Hive Predicates
  • Non hive predicates
  • Lazy Loading of Non-Hive data
  • Allow Missing Columns
  • Other file types besides Parquet
  • New streaming engine sink

ping @ritchie46, @nameexhaustion

This is a very early stage PR exploring whether it is possible to take the hive
partitioning out of the readers. This will also make the Hive partitioning
complete file type agnostic.

Currently, this PR does not do anything unless `POLARS_NEW_HIVE` is set to `1`.

Here is a rough todo list.

- [x] Row Indexes
- [x] Include File Paths
- [x] Projection Pushdown
- [ ] Slicing
- [ ] Hive Predicates
- [ ] Lazy Loading of Non-Hive data
- [ ] Allow Missing Columns
- [ ] Other file types besides Parquet
- [ ] New streaming engine sink
@github-actions github-actions bot added internal An internal refactor or improvement python Related to Python Polars rust Related to Rust Polars labels Dec 6, 2024
Copy link

codecov bot commented Dec 6, 2024

Codecov Report

Attention: Patch coverage is 11.11111% with 168 lines in your changes missing coverage. Please review.

Project coverage is 79.56%. Comparing base (9e32651) to head (99d6a93).
Report is 12 commits behind head on main.

Files with missing lines Patch % Lines
...rates/polars-mem-engine/src/executors/hive_scan.rs 0.00% 155 Missing ⚠️
crates/polars-mem-engine/src/planner/lp.rs 60.86% 9 Missing ⚠️
...es/polars-mem-engine/src/executors/scan/parquet.rs 63.63% 4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #20203      +/-   ##
==========================================
- Coverage   79.63%   79.56%   -0.08%     
==========================================
  Files        1564     1566       +2     
  Lines      217804   218364     +560     
  Branches     2477     2475       -2     
==========================================
+ Hits       173439   173731     +292     
- Misses      43796    44066     +270     
+ Partials      569      567       -2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

}

pub struct HiveExec {
sources: ScanSources,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can accept a generic here. One that take the filtered out paths and creates a Box<dyn Executror>. Then this node doesn't have to know anything about the actual implementations of parquet, csv, etc. The planner deals with that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
internal An internal refactor or improvement python Related to Python Polars rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants