[Bug]: Amoro optimization can result in the input files and the merge… #3856
base: master
Please help review whether this fix is feasible. Fix results:
This ensures that multiple small files can be merged even when their total size is greater than or equal to 128MB, as long as the average file size is less than 128MB.
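A minimal sketch of the decision described above may help reviewers. The class and method names are hypothetical, not Amoro's API. It assumes, as the review comments on this PR suggest, that the new average-size branch also returns `Long.MAX_VALUE`, and it assumes a fallback to `targetSize` in every other case; the actual patch may differ on both points.

```java
// Hypothetical sketch; names are illustrative and not taken from Amoro.
final class SplitSizeSketch {
  static final long MB = 1024L * 1024L;

  /** Returns the split size used to group input files for one merge task. */
  static long splitSize(long inputSize, int fileCount, long targetSize) {
    if (fileCount <= 0) {
      throw new IllegalArgumentException("fileCount must be positive");
    }
    // Total input is smaller than one target file: merge everything together.
    if (inputSize < targetSize) {
      return Long.MAX_VALUE;
    }
    // Total size >= targetSize, but the files are small on average.
    // ASSUMPTION: this branch also returns Long.MAX_VALUE.
    long averageFileSize = inputSize / fileCount;
    if (averageFileSize < targetSize) {
      return Long.MAX_VALUE;
    }
    // ASSUMPTION: otherwise the split size falls back to the target size.
    return targetSize;
  }

  public static void main(String[] args) {
    long target = 128 * MB;
    // Three 100MB files: total 300MB >= target, average 100MB < target.
    System.out.println(splitSize(300 * MB, 3, target) == Long.MAX_VALUE);
    // Two 256MB files: average is above target, so the target size is used.
    System.out.println(splitSize(512 * MB, 2, target) == target);
  }
}
```

Under these assumptions, an unbounded split size is returned no matter how large the total input is, provided the average file is small.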
…d output files having the same number of files, and this can cause the merge to fail and keep triggering the merge task. apache#3855
…lave mode is enabled. apache#3845
    if (inputSize < targetSize) {
      return Long.MAX_VALUE;
    }
    // Even if total size >= targetSize, if average file size is small (less than targetSize),
Will this produce a single output file that is too large?
@wardlican
I have the same concern: could simply setting it to the maximum value produce a single, excessively large file? Could we refer to Spark's estimation logic instead?
Refer to https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/actions/SizeBasedFileRewritePlanner.java#L199-L206
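For discussion, here is a rough sketch of what a size-based estimate in that spirit could look like. It is loosely modeled on the idea of estimating the number of output files first and deriving the split size from it. It is not a copy of the linked Iceberg code: the names, the `minFileSize` parameter, and the 10% tolerance are assumptions for illustration.

```java
// Hypothetical sketch of a size-based estimate; not the Iceberg implementation.
final class OutputFileEstimateSketch {
  static final long MB = 1024L * 1024L;

  /** Estimates how many output files a group of inputSize bytes should produce. */
  static long numOutputFiles(long inputSize, long targetSize, long minFileSize) {
    if (inputSize < targetSize) {
      return 1;
    }
    long floorCount = inputSize / targetSize;
    long remainder = inputSize % targetSize;
    if (remainder == 0) {
      return floorCount;
    }
    long ceilCount = floorCount + 1;
    // The leftover is big enough to be a valid file on its own: keep it.
    if (remainder > minFileSize) {
      return ceilCount;
    }
    // ASSUMPTION: spreading the leftover is acceptable if files grow < 10%.
    long avgWithoutRemainder = inputSize / floorCount;
    if (avgWithoutRemainder < 1.1 * targetSize) {
      return floorCount;
    }
    return ceilCount;
  }

  /** Derives a bounded split size from the estimated output file count. */
  static long splitSize(long inputSize, long targetSize, long minFileSize) {
    return inputSize / numOutputFiles(inputSize, targetSize, minFileSize);
  }

  public static void main(String[] args) {
    long target = 128 * MB;
    long min = 96 * MB; // ASSUMPTION: 75% of the target size
    System.out.println(numOutputFiles(300 * MB, target, min));
    System.out.println(splitSize(300 * MB, target, min) / MB);
  }
}
```

Unlike returning `Long.MAX_VALUE`, this keeps the split size close to the target, so a large group of small files would be merged into several files of reasonable size rather than one.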
Why are the changes needed?
Amoro optimization can produce the same number of merged output files as input files. This can cause the merge to fail and the merge task to be triggered repeatedly.
Close #3855
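To illustrate how the input and output file counts can end up equal, here is a small simulation. It uses a hypothetical sequential packer, not Amoro's actual planner: files are placed into a group until the next file would exceed the split size, and each group is assumed to produce one output file.

```java
// Hypothetical sequential packer; an illustration only, not Amoro's planner.
final class PackingSketch {
  static final long MB = 1024L * 1024L;

  /** Counts the groups (assumed: one output file each) for a given split size. */
  static int countGroups(long[] fileSizes, long splitSize) {
    int groups = 0;
    long current = 0;
    boolean open = false;
    for (long size : fileSizes) {
      // Written as a subtraction so that Long.MAX_VALUE does not overflow.
      if (!open || size > splitSize - current) {
        groups++;
        current = size;
        open = true;
      } else {
        current += size;
      }
    }
    return groups;
  }

  public static void main(String[] args) {
    long[] files = {100 * MB, 100 * MB, 100 * MB};
    // With a 128MB split, no two 100MB files fit in the same group.
    System.out.println(countGroups(files, 128 * MB));
    // With an unbounded split, every file lands in one group.
    System.out.println(countGroups(files, Long.MAX_VALUE));
  }
}
```

In this model, three 100MB files with a 128MB split size yield three groups, so the rewrite produces as many files as it consumed and the table still looks like it needs optimizing.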
Brief change log
How was this patch tested?
Add some test cases that check the changes thoroughly including negative and positive cases if possible
Add screenshots for manual tests if appropriate
Run test locally before making a pull request
Documentation