Conversation

@mdashti mdashti commented Nov 18, 2025

What

Changes ExpUnrolledLinkedList::block_num from u16 to u32 to prevent integer overflow when indexing large datasets. The structure now supports up to ~4 billion blocks (128 TB) instead of just 65,535 blocks (2.1 GB).
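A quick back-of-the-envelope check of those capacity figures (assuming the 32 KiB per-block cap described in this PR; the constant name is illustrative):

```rust
// Sanity check of the old vs. new capacity limits.
// Assumes blocks cap at 32 KiB, as stated in the PR description.
const BLOCK_CAP_BYTES: u64 = 32 * 1024;

fn main() {
    // Old limit: u16 counter -> at most 65,535 blocks.
    let old_limit = u16::MAX as u64 * BLOCK_CAP_BYTES;
    // New limit: u32 counter -> ~4.29 billion blocks.
    let new_limit = (u32::MAX as u64 + 1) * BLOCK_CAP_BYTES;

    assert_eq!(old_limit, 2_147_450_880); // ~2.1 GB
    assert_eq!(new_limit >> 40, 128); // 2^32 * 2^15 = 2^47 bytes = 128 TiB
    println!(
        "old: ~{:.1} GB, new: {} TiB",
        old_limit as f64 / 1e9,
        new_limit >> 40
    );
}
```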

Why

Users were experiencing index-creation failures with the error "mid > len" when creating BM25 indexes on tables with large integer arrays (100k rows × 6,700 elements ≈ 670M operations). This required ~103,000 blocks, exceeding the u16::MAX limit of 65,535, causing:

  • Integer overflow in release builds → memory corruption → "mid > len" errors
  • Direct overflow panic in debug builds → "attempt to add with overflow"

How

  1. Changed block_num type: u16 → u32 (supports 65,536× more blocks)
  2. Added safety measures:
    • Overflow protection with checked_add() in increment_num_blocks()
    • Metadata corruption detection with assert!() in read_to_end()
  3. Maintained compatibility: Block sizes still cap at 32 KB; only the count limit increased
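The checked increment described above might look roughly like this (a minimal sketch: the struct, field, and method names follow the PR description, but the body is illustrative, not the actual tantivy-stacker patch):

```rust
// Minimal sketch of the widened counter plus overflow guard.
// Not the real tantivy code: names follow the PR description,
// everything else is illustrative.
struct ExpUnrolledLinkedList {
    block_num: u32, // was u16, which capped a list at 65,535 blocks
}

impl ExpUnrolledLinkedList {
    fn increment_num_blocks(&mut self) {
        // checked_add returns None on overflow instead of silently
        // wrapping in release builds, turning memory corruption into
        // an explicit, diagnosable panic.
        self.block_num = self
            .block_num
            .checked_add(1)
            .expect("block_num overflowed u32");
    }
}

fn main() {
    let mut list = ExpUnrolledLinkedList { block_num: u16::MAX as u32 };
    // This increment would have wrapped the old u16 counter to 0.
    list.increment_num_blocks();
    assert_eq!(list.block_num, 65_536);
    println!("{}", list.block_num);
}
```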

Tests

Added 8 tests to verify the fix.

fulmicoton commented Dec 1, 2025

@mdashti Which means you have gigantic segments. The recommended max size for a segment is 10 million docs. Do you have a strong reason for this segment size?

@fulmicoton fulmicoton requested a review from PSeitz December 1, 2025 11:24
mdashti commented Dec 1, 2025

@fulmicoton That's right. But this is our users' data, and it can take any shape. We've advised them to improve their usage, but it's best if we don't panic on this issue.

fulmicoton commented Dec 2, 2025

But this is our users' data, and it can take any shape.

Sorry I think I misunderstood the problem. I thought it was caused by a super high number of rows.

I'd like to merge this but it comes with a performance regression.

~/git/tantivy (paradedb-paradedb/fix-overflow-issue*) » cargo bench --bench index-bench                   paul.masurel@COMP-QMLQQJH2R1
   Compiling tantivy-stacker v0.6.0 (/Users/paul.masurel/git/tantivy/stacker)
    Building [=======================> ] 244/248: tantivy-stacker
   Compiling tantivy-columnar v0.6.0 (/Users/paul.masurel/git/tantivy/columnar)
   Compiling tantivy v0.26.0 (/Users/paul.masurel/git/tantivy)
    Finished `bench` profile [optimized + debuginfo] target(s) in 17.77s
     Running benches/index-bench.rs (target/release/deps/index_bench-fa60c5a819b439e8)
index-hdfs/only-indexed-no-commit
                        time:   [232.87 ms 236.64 ms 240.73 ms]
                        thrpt:  [88.768 MiB/s 90.300 MiB/s 91.763 MiB/s]
                 change:
                        time:   [+14.745% +17.293% +19.919%] (p = 0.00 < 0.05)
                        thrpt:  [-16.611% -14.744% -12.851%]
                        Performance has regressed.
Benchmarking index-hdfs/only-indexed-with-commit: Warming up for 3.0000 s
Warning: Unable to complete 20 samples in 5.0s. You may wish to increase target time to 6.9s, or reduce sample count to 10.
index-hdfs/only-indexed-with-commit
                        time:   [324.49 ms 327.59 ms 331.02 ms]
                        thrpt:  [64.555 MiB/s 65.231 MiB/s 65.854 MiB/s]
                 change:
                        time:   [+11.808% +14.003% +15.959%] (p = 0.00 < 0.05)
                        thrpt:  [-13.762% -12.283% -10.561%]
                        Performance has regressed.
Found 2 outliers among 20 measurements (10.00%)

@fulmicoton

Can you share more detail about the problem?

These ids are local to a single posting list. They rapidly saturate to a size of 32 KB, so it should support posting lists of around 2 GB. How do you end up with a single posting list being this long?
Is this because you encode positions, and the token is very frequent?

@fulmicoton

The performance regression was just caused by lack of inlining.


@fulmicoton fulmicoton left a comment

We need to add inlining.

I cannot push to your remote branch.
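For context on the inlining remark: in Rust, a small non-generic function is normally not inlined across crate boundaries unless it carries `#[inline]` (or LTO is enabled), and tantivy-stacker is a separate crate from tantivy. A hypothetical sketch of the kind of fix being suggested (the `bump` helper is illustrative, not the actual patch):

```rust
// Illustration of cross-crate inlining. Without `#[inline]`, this
// one-line hot-path helper compiled in another crate becomes a real
// function call, which can cost double-digit percent on a tight
// indexing loop like the benchmarks above.
#[inline] // makes the body available to callers' codegen units
pub fn bump(counter: &mut u32) {
    *counter = counter.checked_add(1).expect("counter overflow");
}

fn main() {
    let mut n: u32 = 0;
    bump(&mut n);
    bump(&mut n);
    println!("{n}"); // prints 2
}
```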

@fulmicoton

@mdashti I do not have permission to push changes to your upstream repo.
You have been invited to this repo. Next time, can you push to a dev branch in tantivy or give me write permissions to push extra commits to your repo?

(@stuhood same situation)

mdashti commented Dec 3, 2025

Can you share more detail about the problem?

These ids are local to a single posting list. They rapidly saturate to a size of 32 KB, so it should support posting lists of around 2 GB. How do you end up with a single posting list being this long?
Is this because you encode positions, and the token is very frequent?

Here's the original report on ParadeDB:

running into weird indexing problems — we've tried to flatten some of our lists into a list_id column in our main cmptbl_full table. We've tried this on our DEVELOP environment and the parade index built fine, but when trying it on PROD, where we have much more list data, we kept encountering one of two errors, either

FATAL: server conn crashed?
SSL connection has been closed unexpectedly

or

ERROR: mid > len
CONTEXT: parallel worker

Making an index without list_id worked fine, and making an index with only list_id still proc'd the above two errors. Is this something that you've run into before?

with this index definition:

CREATE INDEX cmptbl_full_new_idx ON public.cmptbl_full_new USING bm25 (contact_id, ent_domain, ent_industry, ent_sector, ent_sub_sectors, ent_name, ent_shorthand_name, contact_business_email, contact_canonical_shorthand_name, contact_first_name, contact_full_name, contact_job_title, contact_last_name, contact_mobile_phone, ent_id, employee_rank, revenue_rank, ent_emp_rev_details, ent_locations_details, contact_job_details, contact_locations_details, contact_confirmed_connect_date, list_id) WITH (key_field=contact_id, text_fields='{
"ent_domain": {"fast": true, "tokenizer": {"lowercase": true, "remove_long": 255, "type": "raw"}, "normalizer": "lowercase"},
"ent_industry": {"fast": true, "tokenizer": {"lowercase": true, "remove_long": 255, "type": "raw"}},
"ent_sector": {"fast": true, "tokenizer": {"lowercase": true, "remove_long": 255, "type": "raw"}},
"ent_name": {"fast": true, "tokenizer": {"ascii_folding": true, "lowercase": true, "remove_long": 255, "type": "raw"}, "normalizer": "lowercase"},
"ent_shorthand_name": {"fast": true, "tokenizer": {"ascii_folding": true, "lowercase": true, "remove_long": 255, "type": "raw"}, "normalizer": "lowercase"},
"contact_business_email": {"fast": true, "tokenizer": {"lowercase": true, "remove_long": 255, "type": "raw"}, "normalizer": "lowercase"},
"contact_canonical_shorthand_name": {"fast": true, "tokenizer": {"lowercase": true, "remove_long": 255, "type": "raw"}, "normalizer": "lowercase"},
"contact_first_name": {"fast": true, "tokenizer": {"ascii_folding": true, "lowercase": true, "remove_long": 255, "type": "raw"}, "normalizer": "lowercase"},
"contact_full_name": {"fast": true, "tokenizer": {"ascii_folding": true, "lowercase": true, "remove_long": 255, "type": "raw"}, "normalizer": "lowercase"},
"contact_job_title": {"fast": true, "tokenizer": {"ascii_folding": true, "lowercase": true, "remove_long": 255, "type": "raw"}, "normalizer": "lowercase"},
"contact_last_name": {"fast": true, "tokenizer": {"ascii_folding": true, "lowercase": true, "remove_long": 255, "type": "raw"}, "normalizer": "lowercase"},
"contact_mobile_phone": {"fast": true, "tokenizer": {"lowercase": true, "remove_long": 255, "type": "raw"}, "normalizer": "lowercase"}
}', numeric_fields='{
"ent_id": {"indexed": true},
"employee_rank": {"indexed": true},
"revenue_rank": {"indexed": true},
"list_id": {"indexed": true}
}', boolean_fields='{}', json_fields='{
"ent_emp_rev_details": {"fast": true, "indexed": true, "tokenizer": {"lowercase": true, "remove_long": 255, "type": "lowercase"}},
"ent_locations_details": {"fast": true, "indexed": true, "tokenizer": {"ascii_folding": true, "lowercase": true, "remove_long": 255, "type": "lowercase"}},
"ent_sub_sectors": {"fast": true, "indexed": true, "tokenizer": {"lowercase": true, "remove_long": 255, "type": "lowercase"}},
"contact_job_details": {"fast": true, "indexed": true, "tokenizer": {"ascii_folding": true, "lowercase": true, "remove_long": 255, "type": "lowercase"}},
"contact_locations_details": { "fast": true, "indexed": true, "tokenizer": {"ascii_folding": true, "lowercase": true, "remove_long": 255, "type": "lowercase"}}
}', range_fields='{}', datetime_fields='{
"contact_confirmed_connect_date": {"indexed": true}
}');

and here's the query:

CREATE INDEX cmptbl_full_old_list_id_idx ON cmptbl_full_old
USING bm25 (contact_id, list_id)
WITH (
key_field=contact_id,
numeric_fields='{"list_id": {"indexed": true}}'
);
ERROR: XX000: mid > len
CONTEXT: parallel worker
LOCATION: column_operation.rs:100

SELECT
array_length(list_id, 1) as array_len,
pg_column_size(list_id) as size_bytes,
list_id[1:10] as first_10_elements
FROM cmptbl_full_old
WHERE list_id IS NOT NULL
ORDER BY array_length(list_id, 1) DESC
LIMIT 10;
array_len | size_bytes | first_10_elements
-----------+------------+-------------------------------------------------------
6700 | 26820 | {407,434,638,769,1184,1202,1781,2012,2108,2270}
6640 | 26580 | {935,1259,1383,1411,1418,1624,1881,2113,2728,2965}
6639 | 26576 | {407,434,638,769,1184,1202,1683,1781,2012,2270}
6635 | 26560 | {1086,1901,2105,4181,4213,4894,5168,5221,68213,68222}
6635 | 26560 | {871,1067,1683,2204,2345,2805,2572,3356,4311,4557}
6628 | 26532 | {1901,2845,2848,3836,4213,5221,68213,7274,7462,8076}
6627 | 26528 | {935,1259,1383,1418,1881,2345,2728,2965,3164,3172}
6623 | 26512 | {1086,2845,2848,3836,4213,5221,68213,7274,7462,7557}
6606 | 26444 | {564,1634,1683,1856,2022,2405,2805,2572,2988,3730}
6604 | 26436 | {407,434,638,769,1184,1202,1781,2012,2270,2718}

@mdashti mdashti left a comment

@fulmicoton Thanks for the comments.

@mdashti mdashti requested a review from fulmicoton December 3, 2025 00:28