fix: fixed integer overflow in ExpUnrolledLinkedList for large datasets #2735
Conversation
@mdashti That means you have gigantic segments. The recommended max size for a segment is 10 million docs. Do you have a strong reason for this segment size?
@fulmicoton That's right. But this is our users' data, and it can have any shape. We've advised them to improve their usage, but it's best if we don't panic on this issue.
Sorry, I think I misunderstood the problem. I thought it was caused by a super high number of rows. I'd like to merge this, but it comes with a performance regression.
Can you share more detail about the problem? These ids are local to a single posting list. They rapidly saturate to a size of 32K, so it should support posting lists of around 2 GB. How do you end up with a single posting list being this long?
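As a back-of-the-envelope check on the 2 GB figure, here is a minimal sketch assuming blocks saturate at 32 KiB (the size mentioned above) and that the block counter was previously a u16; the constants here are assumptions for illustration, not tantivy's actual internals:

```rust
// Back-of-the-envelope capacity check: assumes blocks saturate at 32 KiB
// and compares a u16 vs. a u32 block counter.
const SATURATED_BLOCK_SIZE: u64 = 32 * 1024; // 32 KiB per block once saturated

fn main() {
    let old_capacity = u16::MAX as u64 * SATURATED_BLOCK_SIZE;
    let new_capacity = u32::MAX as u64 * SATURATED_BLOCK_SIZE;
    // 65,535 blocks * 32 KiB ≈ 2 GiB -- the ceiling hit by the reported dataset.
    println!("u16 counter: {:.2} GiB", old_capacity as f64 / (1u64 << 30) as f64);
    // ~4.29 billion blocks * 32 KiB ≈ 128 TiB after widening to u32.
    println!("u32 counter: {:.2} TiB", new_capacity as f64 / (1u64 << 40) as f64);
}
```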
The performance regression was just caused by lack of inlining. |
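For context, a minimal, hypothetical sketch of the kind of inlining hint being discussed; the struct and methods below are illustrative stand-ins, not tantivy's actual code:

```rust
// Illustrative only -- not tantivy's actual code. An explicit #[inline] hint
// on a tiny, per-document hot-path method lets the compiler keep inlining it,
// which is the kind of change "adding inlining" refers to here.
pub struct BlockCounter {
    block_num: u32,
}

impl BlockCounter {
    #[inline]
    pub fn block_num(&self) -> u32 {
        self.block_num
    }

    #[inline]
    pub fn increment(&mut self) {
        self.block_num += 1;
    }
}
```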
fulmicoton
left a comment
We need to add inlining.
I cannot push to your remote branch.
Here's the original report on ParadeDB:
with this index definition:
and here's the query:
mdashti
left a comment
@fulmicoton Thanks for the comments.
What
Changes ExpUnrolledLinkedList::block_num from u16 to u32 to prevent integer overflow when indexing large datasets. The structure now supports up to ~4 billion blocks (128 TB) instead of just 65,535 blocks (2.1 GB).
Why
Users were experiencing index creation failures with the error "mid > len" when creating BM25 indexes on tables with large integer arrays (100k rows × 6,700 elements ≈ 670M operations). This required ~103,000 blocks, exceeding the u16::MAX limit of 65,535, causing:
- "mid > len" errors
- "attempt to add with overflow" panics
How
- Changed the block_num type: u16 → u32 (supports 65,536× more blocks)
- Added checked_add() in increment_num_blocks()
- Added assert!() in read_to_end()
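A minimal sketch of what the three changes above could look like together. The field and method names follow this list, but the struct body, the len field, and BLOCK_CAPACITY are simplified assumptions for illustration, not the actual tantivy source:

```rust
// Simplified stand-in for the real ExpUnrolledLinkedList: only the pieces
// named in the list above are sketched; `len` and BLOCK_CAPACITY are invented
// here for context.
const BLOCK_CAPACITY: u64 = 32 * 1024; // illustrative saturated block size

pub struct ExpUnrolledLinkedList {
    block_num: u32, // was u16; now supports ~4 billion blocks instead of 65,535
    len: u64,       // illustrative: total bytes written across all blocks
}

impl ExpUnrolledLinkedList {
    fn increment_num_blocks(&mut self) {
        // checked_add() turns a silent wrap-around into an explicit,
        // descriptive panic if the (much larger) u32 limit is ever reached.
        self.block_num = self
            .block_num
            .checked_add(1)
            .expect("ExpUnrolledLinkedList: block_num overflowed u32");
    }

    fn read_to_end(&self, output: &mut Vec<u8>) {
        // assert!() checks the invariant up front, so inconsistent state fails
        // loudly here instead of surfacing later as an opaque "mid > len" error.
        assert!(self.len <= self.block_num as u64 * BLOCK_CAPACITY);
        // ...actual block traversal elided; nothing is read in this sketch...
        let _ = output;
    }
}
```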
Tests
Added 8 tests to verify the fix.