Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

adding RowsReader and writer #14149

Open
wants to merge 5 commits into
base: main
Choose a base branch
from
Open

Conversation

Lordworms
Copy link
Contributor

@Lordworms Lordworms commented Jan 16, 2025

Which issue does this PR close?

part of #7053
Adding Rowformat reader and writer for spill

Closes #.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the physical-expr Physical Expressions label Jan 16, 2025
@Lordworms
Copy link
Contributor Author

I got two following PR for implement SortPreservingMergeStream in Row format and change the logics in SortExec

Copy link
Contributor

@2010YOUY01 2010YOUY01 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this feature is important to external sorting's performance, thank you. Left some suggestions.


I got two following PR for implement SortPreservingMergeStream in Row format and change the logics in SortExec

Perhaps first let SPM's input and output both support Rows format? This seems easier to do because only one operator needs to be changed. And larger sort query includes two levels of of SPM, we can get some performance improvement from it

datafusion/physical-plan/src/sorts/row_serde.rs Outdated Show resolved Hide resolved
let mut current_offset = 0u32;
let mut row_data = Vec::new();

for i in 0..rows.num_rows() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can directly copy all rows' data at once, instead of one by one.

But perhaps we can leave those performance-related improvements to a follow on PR, and set up a benchmark first. I suspect we can also miss some other unnecessary mem copies 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's what I want to do in the begining, but the implementation does not allow us to get data of Rows.

use std::sync::Arc;
use tempfile::NamedTempFile;

use crate::sorts::row_serde::{RowReader, RowWriter};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can enumerate more tests for edge cases:

  • call write_rows() multiple times
  • Write only one row
  • Write one row for multiple times
  • Include variable length field like string

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add more test coverage

row_offsets.len() * 4 // row offsets
}

pub fn finish(mut self) -> Result<(), DataFusionError> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need some mechanism to prevent writing again into a finished RowsWriter

writer.write_rows(batch1);
writer.write_rows(batch2);
writer.finsh();
writer.write_rows(batch3); // should fail

@Lordworms
Copy link
Contributor Author

Lordworms commented Jan 16, 2025

I believe this feature is important to external sorting's performance, thank you. Left some suggestions.

I got two following PR for implement SortPreservingMergeStream in Row format and change the logics in SortExec

Perhaps first let SPM's input and output both support Rows format? This seems easier to do because only one operator needs to be changed. And larger sort query includes two levels of of SPM, we can get some performance improvement from it

I believe this feature is important to external sorting's performance, thank you. Left some suggestions.

I got two following PR for implement SortPreservingMergeStream in Row format and change the logics in SortExec

Perhaps first let SPM's input and output both support Rows format? This seems easier to do because only one operator needs to be changed. And larger sort query includes two levels of of SPM, we can get some performance

I think we have to both change GroupHashExec and SortExec as well since these two Executions are using column format right now.

improvement from it

Also since we keep column format for single column sort, I'm not sure whether change SortPreservingMergeStream should be a good choice over adding RowformatMergeStream. Kind of hard to choose here

@alamb alamb changed the title adding RowrsReader and writer adding RowsReader and writer Jan 17, 2025
@alamb
Copy link
Contributor

alamb commented Jan 18, 2025

@2010YOUY01 when you think this is ready for me to review please let me know. @Lordworms 👋 thank you 🙏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
physical-expr Physical Expressions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants