Skip to content

MINOR: [C++][CSV] avoid int32 overflow in block parser value counts#50074

Open
metsw24-max wants to merge 3 commits into
apache:mainfrom
metsw24-max:csv-block-parser-int32-overflow
Open

MINOR: [C++][CSV] avoid int32 overflow in block parser value counts#50074
metsw24-max wants to merge 3 commits into
apache:mainfrom
metsw24-max:csv-block-parser-int32-overflow

Conversation

@metsw24-max

Copy link
Copy Markdown
Contributor

Rationale for this change

The CSV block parser sizes its per-chunk value array from num_cols, the column count inferred from the first line of the input, times the rows-in-chunk count. PresizedValueDescWriter computes 2 + num_rows * num_cols, and ParseSpecialized computes num_cols_ * (num_rows_ - start) * 10, both in int32_t. A CSV whose first line carries a few million fields pushes these products past INT32_MAX, which is signed-integer-overflow UB (UBSan flags both expressions).

What changes are included in this PR?

Widen both multiplications to int64_t, matching their int64_t destinations.

Are these changes tested?

Existing CSV parser tests pass. The overflow was confirmed with a standalone UBSan build of the two expressions, clean after widening.

Are there any user-facing changes?

No.

@metsw24-max metsw24-max requested a review from pitrou as a code owner June 23, 2026 11:42
@metsw24-max

Copy link
Copy Markdown
Contributor Author

Pushed a clang-format fix for the C++ Format lint failure. The macOS Python job looks like an unrelated teardown issue, not from this change.

@pitrou

pitrou commented Jun 23, 2026

Copy link
Copy Markdown
Member

ParsedValueDesc stores a 32-bit offset, so we'll be having a problem anyway, right?
How about we simply error out when the product of rows and columns is greater than 2**31?

@metsw24-max

Copy link
Copy Markdown
Contributor Author

Good point, the offset is only 31 bits so widening the products doesn't actually save us once the value count passes int32. I've switched to erroring out before presizing when rows x cols would exceed INT32_MAX, and added a test that drives num_cols high enough to trip it. Kept the two multiplications in int64 as well, since the capacity's +2 and the bulk-filter threshold's * 10 can still overflow within that limit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants