Skip to content

GH-47662: [C++][Parquet] Reject metadata with null_count on required column#49909

Open
ArnavBalyan wants to merge 1 commit intoapache:mainfrom
ArnavBalyan:arnavb/fix-upstream-malformed-data
Open

GH-47662: [C++][Parquet] Reject metadata with null_count on required column#49909
ArnavBalyan wants to merge 1 commit intoapache:mainfrom
ArnavBalyan:arnavb/fix-upstream-malformed-data

Conversation

@ArnavBalyan
Copy link
Copy Markdown
Member

@ArnavBalyan ArnavBalyan commented May 2, 2026

Rationale for this change

  • Parquet spec requires that REQUIRED columns can't encode nulls (max_definition_level = 0 always has null_count = 0).
  • When this is violated by a writer (eg. parquet-mr with disabled validating record consumer), the reader fails with "Unexpected end of stream" during decode.
  • Ensure that this issue can be detected eagerly by the reader, and show correct error. We should consider mandating the validating consumer in upstream to ensure it cannot accidentally write such cases.
  • Closes [Parquet][C++] pyArrow fails to read a file from parquet-testing #47662.

What changes are included in this PR?

  • Eagerly detect corrupt file and surface relevant error.

Are these changes tested?

  • Yes

Are there any user-facing changes?

  • Yes

@ArnavBalyan ArnavBalyan requested a review from wgtmac as a code owner May 2, 2026 15:35
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 2, 2026

⚠️ GitHub issue #47662 has been automatically assigned in GitHub to PR creator.

@ArnavBalyan
Copy link
Copy Markdown
Member Author

cc @pitrou could you ptal thanks!

Comment thread cpp/src/parquet/metadata_test.cc Outdated
column_metadata.__isset.statistics = true;

EXPECT_THROW(ColumnChunkMetaData::Make(&column_chunk, schema_descr.Column(0)),
ParquetException);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we please test something about the exception message? You can use the EXPECT_THROW_THAT macro.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

Comment thread cpp/src/parquet/metadata.cc Outdated

// Per the Parquet spec, a column with max_definition_level == 0 cannot
// have nulls, so null_count must be 0. Reject inconsistent metadata
// from writers that skip ValidatingRecordConsumer checks for missing
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Readers of the Parquet C++ code will not know anything about ValidatingRecordConsumer, can we make this comment more generic?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

@pitrou
Copy link
Copy Markdown
Member

pitrou commented May 5, 2026

@wgtmac Could you please take a look?

@github-actions github-actions Bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels May 5, 2026
@ArnavBalyan ArnavBalyan force-pushed the arnavb/fix-upstream-malformed-data branch from d0043f4 to 76111a6 Compare May 5, 2026 10:34
@ArnavBalyan
Copy link
Copy Markdown
Member Author

updated thanks

@ArnavBalyan ArnavBalyan force-pushed the arnavb/fix-upstream-malformed-data branch from 76111a6 to 018c631 Compare May 5, 2026 10:42
Comment thread cpp/src/parquet/metadata.cc Outdated
ss << "Malformed Parquet file: column '" << descr_->path()->ToDotString()
<< "' has max_definition_level == 0 but statistics report null_count="
<< column_metadata_->statistics.null_count;
throw ParquetException(ss.str());
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of throwing an exception here, is it better to unset the null_count field when building the statistics?

Just FYI that I recently reviewed apache/parquet-java#3458 which has dealt with a similar case. From downstream users' standpoint, it is a bad experience if the Parquet file cannot be read simply because of malformed statistics.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, is it possible to get such a malformed file using the regular Parquet writing APIs?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like some legacy versions can get it through with disabled validating record consumer

Copy link
Copy Markdown
Member Author

@ArnavBalyan ArnavBalyan May 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unset the null_count field when building the statistics

Yeah agreed, thanks for sharing, updated to maintain parity with java

@ArnavBalyan ArnavBalyan force-pushed the arnavb/fix-upstream-malformed-data branch from 018c631 to 6cd1b89 Compare May 5, 2026 14:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Parquet][C++] pyArrow fails to read a file from parquet-testing

3 participants