Conversation

@mdashti mdashti commented Nov 18, 2025

What

Changes ExpUnrolledLinkedList::block_num from u16 to u32 to prevent integer overflow when indexing large datasets. The structure now supports up to ~4 billion blocks (128 TB) instead of just 65,535 blocks (2.1 GB).
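A quick back-of-the-envelope check of those capacity figures (assuming the 32 KiB per-block cap described in this PR; the constant name is illustrative):

```rust
// Sanity check of the old vs. new capacity limits.
// Assumes blocks cap at 32 KiB, as stated in the PR description.
const BLOCK_CAP_BYTES: u64 = 32 * 1024;

fn main() {
    // Old limit: u16 counter -> at most 65,535 blocks.
    let old_limit = u16::MAX as u64 * BLOCK_CAP_BYTES;
    // New limit: u32 counter -> ~4.29 billion blocks.
    let new_limit = (u32::MAX as u64 + 1) * BLOCK_CAP_BYTES;

    assert_eq!(old_limit, 2_147_450_880); // ~2.1 GB
    assert_eq!(new_limit >> 40, 128); // 2^32 * 2^15 = 2^47 bytes = 128 TiB
    println!(
        "old: ~{:.1} GB, new: {} TiB",
        old_limit as f64 / 1e9,
        new_limit >> 40
    );
}
```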

Why

Users were experiencing index-creation failures with the error "mid > len" when creating BM25 indexes on tables with large integer arrays (100k rows × 6,700 elements ≈ 670M operations). This required ~103,000 blocks, exceeding the u16::MAX limit of 65,535, causing:

  • Integer overflow in release builds → memory corruption → "mid > len" errors
  • Direct overflow panic in debug builds → "attempt to add with overflow"

How

  1. Changed block_num type: u16 → u32 (supports 65,536× more blocks)
  2. Added safety measures:
    • Overflow protection with checked_add() in increment_num_blocks()
    • Metadata corruption detection with assert!() in read_to_end()
  3. Maintained compatibility: Block sizes still cap at 32 KB; only the count limit increased
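The checked increment described above might look roughly like this (a minimal sketch: the struct, field, and method names follow the PR description, but the body is illustrative, not the actual tantivy-stacker patch):

```rust
// Minimal sketch of the widened counter plus overflow guard.
// Not the real tantivy code: names follow the PR description,
// everything else is illustrative.
struct ExpUnrolledLinkedList {
    block_num: u32, // was u16, which capped a list at 65,535 blocks
}

impl ExpUnrolledLinkedList {
    fn increment_num_blocks(&mut self) {
        // checked_add returns None on overflow instead of silently
        // wrapping in release builds, turning memory corruption into
        // an explicit, diagnosable panic.
        self.block_num = self
            .block_num
            .checked_add(1)
            .expect("block_num overflowed u32");
    }
}

fn main() {
    let mut list = ExpUnrolledLinkedList { block_num: u16::MAX as u32 };
    // This increment would have wrapped the old u16 counter to 0.
    list.increment_num_blocks();
    assert_eq!(list.block_num, 65_536);
    println!("{}", list.block_num);
}
```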

Tests

Added 8 tests to verify the fix.

fulmicoton commented Dec 1, 2025

@mdashti Which means you have gigantic segments. The recommended max size for a segment is 10 million docs. Do you have a strong reason for this segment size?

@fulmicoton fulmicoton requested a review from PSeitz December 1, 2025 11:24
mdashti commented Dec 1, 2025

@fulmicoton That's right. But this is our users' data, and it can take any shape. We've advised them to improve their usage, but it's best if we don't panic on this issue.

fulmicoton commented Dec 2, 2025

But this is our users' data, and it can take any shape.

Sorry I think I misunderstood the problem. I thought it was caused by a super high number of rows.

I'd like to merge this but it comes with a performance regression.

~/git/tantivy (paradedb-paradedb/fix-overflow-issue*) » cargo bench --bench index-bench                   paul.masurel@COMP-QMLQQJH2R1
   Compiling tantivy-stacker v0.6.0 (/Users/paul.masurel/git/tantivy/stacker)
    Building [=======================> ] 244/248: tantivy-stacker
   Compiling tantivy-columnar v0.6.0 (/Users/paul.masurel/git/tantivy/columnar)
   Compiling tantivy v0.26.0 (/Users/paul.masurel/git/tantivy)
    Finished `bench` profile [optimized + debuginfo] target(s) in 17.77s
     Running benches/index-bench.rs (target/release/deps/index_bench-fa60c5a819b439e8)
index-hdfs/only-indexed-no-commit
                        time:   [232.87 ms 236.64 ms 240.73 ms]
                        thrpt:  [88.768 MiB/s 90.300 MiB/s 91.763 MiB/s]
                 change:
                        time:   [+14.745% +17.293% +19.919%] (p = 0.00 < 0.05)
                        thrpt:  [-16.611% -14.744% -12.851%]
                        Performance has regressed.
Benchmarking index-hdfs/only-indexed-with-commit: Warming up for 3.0000 s
Warning: Unable to complete 20 samples in 5.0s. You may wish to increase target time to 6.9s, or reduce sample count to 10.
index-hdfs/only-indexed-with-commit
                        time:   [324.49 ms 327.59 ms 331.02 ms]
                        thrpt:  [64.555 MiB/s 65.231 MiB/s 65.854 MiB/s]
                 change:
                        time:   [+11.808% +14.003% +15.959%] (p = 0.00 < 0.05)
                        thrpt:  [-13.762% -12.283% -10.561%]
                        Performance has regressed.
Found 2 outliers among 20 measurements (10.00%)

@fulmicoton

Can you share more detail about the problem?

These ids are local to a single posting list. They rapidly saturate to a size of 32 KB, so it should support posting lists of around 2 GB. How do you end up with a single posting list being this long?
Is this because you encode positions, and the token is very frequent?

@fulmicoton

The performance regression was just caused by lack of inlining.


@fulmicoton fulmicoton left a comment

We need to add inlining.

I cannot push to your remote branch.
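For context on the inlining remark: in Rust, a small non-generic function is normally not inlined across crate boundaries unless it carries `#[inline]` (or LTO is enabled), and tantivy-stacker is a separate crate from tantivy. A hypothetical sketch of the kind of fix being suggested (the `bump` helper is illustrative, not the actual patch):

```rust
// Illustration of cross-crate inlining. Without `#[inline]`, this
// one-line hot-path helper compiled in another crate becomes a real
// function call, which can cost double-digit percent on a tight
// indexing loop like the benchmarks above.
#[inline] // makes the body available to callers' codegen units
pub fn bump(counter: &mut u32) {
    *counter = counter.checked_add(1).expect("counter overflow");
}

fn main() {
    let mut n: u32 = 0;
    bump(&mut n);
    bump(&mut n);
    println!("{n}"); // prints 2
}
```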

@fulmicoton

@mdashti I do not have permission to push changes to your upstream repo.
You have been invited to this repo. Next time, can you push to a dev branch in tantivy or give me write permissions to push extra commits to your repo?

(@stuhood same situation)

mdashti commented Dec 3, 2025

Can you share more detail about the problem?

These ids are local to a single posting list. They rapidly saturate to a size of 32 KB, so it should support posting lists of around 2 GB. How do you end up with a single posting list being this long?
Is this because you encode positions, and the token is very frequent?

Here's the original report on ParadeDB:

running into weird indexing problems — we've tried to flatten some of our lists into a list_id column in our main cmptbl_full table. We've tried this on our DEVELOP environment and the parade index built fine, but when trying it on PROD, where we have much more list data, we kept encountering one of two errors, either

FATAL: server conn crashed?
SSL connection has been closed unexpectedly

or

ERROR: mid > len
CONTEXT: parallel worker

Making an index without list_id worked fine, and making an index with only list_id still proc'd the above two errors. Is this something that you've run into before?

with this index definition:

CREATE INDEX cmptbl_full_new_idx ON public.cmptbl_full_new USING bm25 (contact_id, ent_domain, ent_industry, ent_sector, ent_sub_sectors, ent_name, ent_shorthand_name, contact_business_email, contact_canonical_shorthand_name, contact_first_name, contact_full_name, contact_job_title, contact_last_name, contact_mobile_phone, ent_id, employee_rank, revenue_rank, ent_emp_rev_details, ent_locations_details, contact_job_details, contact_locations_details, contact_confirmed_connect_date, list_id) WITH (key_field=contact_id, text_fields='{
"ent_domain": {"fast": true, "tokenizer": {"lowercase": true, "remove_long": 255, "type": "raw"}, "normalizer": "lowercase"},
"ent_industry": {"fast": true, "tokenizer": {"lowercase": true, "remove_long": 255, "type": "raw"}},
"ent_sector": {"fast": true, "tokenizer": {"lowercase": true, "remove_long": 255, "type": "raw"}},
"ent_name": {"fast": true, "tokenizer": {"ascii_folding": true, "lowercase": true, "remove_long": 255, "type": "raw"}, "normalizer": "lowercase"},
"ent_shorthand_name": {"fast": true, "tokenizer": {"ascii_folding": true, "lowercase": true, "remove_long": 255, "type": "raw"}, "normalizer": "lowercase"},
"contact_business_email": {"fast": true, "tokenizer": {"lowercase": true, "remove_long": 255, "type": "raw"}, "normalizer": "lowercase"},
"contact_canonical_shorthand_name": {"fast": true, "tokenizer": {"lowercase": true, "remove_long": 255, "type": "raw"}, "normalizer": "lowercase"},
"contact_first_name": {"fast": true, "tokenizer": {"ascii_folding": true, "lowercase": true, "remove_long": 255, "type": "raw"}, "normalizer": "lowercase"},
"contact_full_name": {"fast": true, "tokenizer": {"ascii_folding": true, "lowercase": true, "remove_long": 255, "type": "raw"}, "normalizer": "lowercase"},
"contact_job_title": {"fast": true, "tokenizer": {"ascii_folding": true, "lowercase": true, "remove_long": 255, "type": "raw"}, "normalizer": "lowercase"},
"contact_last_name": {"fast": true, "tokenizer": {"ascii_folding": true, "lowercase": true, "remove_long": 255, "type": "raw"}, "normalizer": "lowercase"},
"contact_mobile_phone": {"fast": true, "tokenizer": {"lowercase": true, "remove_long": 255, "type": "raw"}, "normalizer": "lowercase"}
}', numeric_fields='{
"ent_id": {"indexed": true},
"employee_rank": {"indexed": true},
"revenue_rank": {"indexed": true},
"list_id": {"indexed": true}
}', boolean_fields='{}', json_fields='{
"ent_emp_rev_details": {"fast": true, "indexed": true, "tokenizer": {"lowercase": true, "remove_long": 255, "type": "lowercase"}},
"ent_locations_details": {"fast": true, "indexed": true, "tokenizer": {"ascii_folding": true, "lowercase": true, "remove_long": 255, "type": "lowercase"}},
"ent_sub_sectors": {"fast": true, "indexed": true, "tokenizer": {"lowercase": true, "remove_long": 255, "type": "lowercase"}},
"contact_job_details": {"fast": true, "indexed": true, "tokenizer": {"ascii_folding": true, "lowercase": true, "remove_long": 255, "type": "lowercase"}},
"contact_locations_details": { "fast": true, "indexed": true, "tokenizer": {"ascii_folding": true, "lowercase": true, "remove_long": 255, "type": "lowercase"}}
}', range_fields='{}', datetime_fields='{
"contact_confirmed_connect_date": {"indexed": true}
}');

and here's the query:

CREATE INDEX cmptbl_full_old_list_id_idx ON cmptbl_full_old
USING bm25 (contact_id, list_id)
WITH (
key_field=contact_id,
numeric_fields='{"list_id": {"indexed": true}}'
);
ERROR: XX000: mid > len
CONTEXT: parallel worker
LOCATION: column_operation.rs:100

SELECT
array_length(list_id, 1) as array_len,
pg_column_size(list_id) as size_bytes,
list_id[1:10] as first_10_elements
FROM cmptbl_full_old
WHERE list_id IS NOT NULL
ORDER BY array_length(list_id, 1) DESC
LIMIT 10;
array_len | size_bytes | first_10_elements
-----------+------------+-------------------------------------------------------
6700 | 26820 | {407,434,638,769,1184,1202,1781,2012,2108,2270}
6640 | 26580 | {935,1259,1383,1411,1418,1624,1881,2113,2728,2965}
6639 | 26576 | {407,434,638,769,1184,1202,1683,1781,2012,2270}
6635 | 26560 | {1086,1901,2105,4181,4213,4894,5168,5221,68213,68222}
6635 | 26560 | {871,1067,1683,2204,2345,2805,2572,3356,4311,4557}
6628 | 26532 | {1901,2845,2848,3836,4213,5221,68213,7274,7462,8076}
6627 | 26528 | {935,1259,1383,1418,1881,2345,2728,2965,3164,3172}
6623 | 26512 | {1086,2845,2848,3836,4213,5221,68213,7274,7462,7557}
6606 | 26444 | {564,1634,1683,1856,2022,2405,2805,2572,2988,3730}
6604 | 26436 | {407,434,638,769,1184,1202,1781,2012,2270,2718}

@mdashti mdashti left a comment

@fulmicoton Thanks for the comments.

@mdashti mdashti requested a review from fulmicoton December 3, 2025 00:28