@aihuaxu aihuaxu commented Oct 16, 2025

Rationale for this change

According to the Variant specification, the specification_version field must be set to 1 to indicate Variant encoding version 1. Currently, this field defaults to 0, which violates the specification. Parquet readers that strictly enforce specification version validation will fail to read files containing Variant types.

What changes are included in this PR?

This PR changes the default specification version from 0 to 1.
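
As a rough illustration of what defaulting to 1 means, here is a minimal sketch. `VariantAnnotationSketch` and `kSpecVersion` are hypothetical names invented for this example; this is not the actual `VariantLogicalType` code touched by the PR.

```cpp
#include <cstdint>

// Hypothetical stand-in for a Variant logical type annotation.
// This is NOT the real Arrow/Parquet class, only an illustration.
struct VariantAnnotationSketch {
  // Version 1 is currently the only Variant encoding version.
  static constexpr std::int8_t kSpecVersion = 1;

  // Defaulting to 1 instead of 0 means that a caller who never sets
  // the field still ends up with a VARIANT(1) annotation.
  std::int8_t specification_version = kSpecVersion;
};
```

With a default like this, a default-constructed `VariantAnnotationSketch` already reports version 1, so a writer does not need to remember to set it.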

Are these changes tested?

The change is covered by a unit test.

Are there any user-facing changes?

Parquet files produced with this change carry the Variant logical type annotation VARIANT(1).

Schema:
message schema {
  optional group V (VARIANT(1)) = 1 {
    required binary metadata;
    required binary value;
  }
}

@aihuaxu aihuaxu requested a review from wgtmac as a code owner October 16, 2025 17:52

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

@aihuaxu
Author

aihuaxu commented Oct 16, 2025

@wgtmac Can you take a look?

@aihuaxu aihuaxu changed the title Set Variant specification version to 1 to align with the variant spec GH-47838: [C++] Set Variant specification version to 1 to align with the variant spec Oct 16, 2025

⚠️ GitHub issue #47838 has been automatically assigned in GitHub to PR creator.

@github-actions github-actions bot added the awaiting committer review label and removed the awaiting review label Oct 16, 2025
@aihuaxu aihuaxu changed the title GH-47838: [C++] Set Variant specification version to 1 to align with the variant spec GH-47838: [C++][Parquet] Set Variant specification version to 1 to align with the variant spec Oct 17, 2025
@aihuaxu aihuaxu requested a review from wgtmac October 17, 2025 04:45
@aihuaxu aihuaxu force-pushed the aixu-update-spec-version branch from a7c4d0d to 3450f1c Compare October 17, 2025 05:00
Member

@wgtmac wgtmac left a comment

+1

I'm fine with the current change for now. Eventually we may need to add a setter and getter for specification_version to VariantLogicalType and validate its value.

cc @pitrou @raulcd
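
The getter and validation suggested in this review could look roughly like the sketch below. It assumes a factory function is used instead of a plain setter so that an invalid version can never be stored. `VariantTypeSketch`, `Make`, and `kSupportedSpecVersion` are hypothetical names for illustration only, not the Arrow API.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical sketch; names and signatures are illustrative, not the Arrow API.
class VariantTypeSketch {
 public:
  static constexpr std::int8_t kSupportedSpecVersion = 1;

  // Factory that rejects any version other than the supported one.
  static VariantTypeSketch Make(std::int8_t spec_version = kSupportedSpecVersion) {
    if (spec_version != kSupportedSpecVersion) {
      throw std::invalid_argument("unsupported Variant specification version: " +
                                  std::to_string(spec_version));
    }
    return VariantTypeSketch(spec_version);
  }

  // Getter for the stored specification version.
  std::int8_t specification_version() const { return spec_version_; }

  // Renders the annotation as it appears in a printed schema, e.g. "VARIANT(1)".
  std::string ToString() const {
    return "VARIANT(" + std::to_string(spec_version_) + ")";
  }

 private:
  explicit VariantTypeSketch(std::int8_t v) : spec_version_(v) {}
  std::int8_t spec_version_;
};
```

Under these assumptions, `VariantTypeSketch::Make()` yields an annotation that prints as `VARIANT(1)`, while `VariantTypeSketch::Make(0)` throws `std::invalid_argument`.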

@pitrou
Member

pitrou commented Oct 17, 2025

> Parquet readers that strictly enforce specification version validation will fail to read files containing Variant types.

Why is it not the case for our own reader?

@aihuaxu
Author

aihuaxu commented Oct 17, 2025

> > Parquet readers that strictly enforce specification version validation will fail to read files containing Variant types.
>
> Why is it not the case for our own reader?

You mean the Parquet C++ reader? The Parquet C++ reader hasn't implemented Variant read/write yet. The only thing added so far is support for the Variant logical type, and we are not doing the check. The Variant spec is currently at version 1 (the only version), and other readers may or may not add the check.
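
To make the difference concrete, here is a minimal sketch of how a strict reader and a lenient reader might treat the version field. `VersionPolicy` and `AcceptsVariantVersion` are hypothetical names made up for this comment; they do not correspond to any existing reader API.

```cpp
#include <cstdint>

// Hypothetical reader-side policy; not part of any real Parquet reader API.
enum class VersionPolicy { kStrict, kLenient };

// The only Variant specification version that exists today.
constexpr std::int8_t kKnownVariantVersion = 1;

// Returns true if a reader using `policy` would accept a file whose
// Variant annotation carries `spec_version`.
inline bool AcceptsVariantVersion(std::int8_t spec_version, VersionPolicy policy) {
  // A lenient reader ignores the field; a strict reader requires version 1.
  return policy == VersionPolicy::kLenient || spec_version == kKnownVariantVersion;
}
```

In this sketch a file annotated with version 0 passes a lenient reader but is rejected by a strict one, which is the failure mode described in the PR description.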

@aihuaxu aihuaxu force-pushed the aixu-update-spec-version branch from 4b65ada to 38a919d Compare October 17, 2025 18:37
@aihuaxu
Author

aihuaxu commented Oct 17, 2025

@wgtmac and @pitrou I added specification_version to VariantLogicalType. Please take another look. Thanks.

@raulcd
Member

raulcd commented Oct 17, 2025

> You mean the Parquet C++ reader? The Parquet C++ reader hasn't implemented Variant read/write yet. The only thing added so far is support for the Variant logical type, and we are not doing the check. The Variant spec is currently at version 1 (the only version), and other readers may or may not add the check.

Sorry, I might be missing something obvious, as I am not too familiar with this part of the codebase, but if we haven't implemented Parquet C++ Variant write yet, I am not sure I understand how a user would be able to create Variant files with an incorrect logical type annotation using Parquet C++ if we release without this fix.

@aihuaxu
Author

aihuaxu commented Oct 17, 2025

> > You mean the Parquet C++ reader? The Parquet C++ reader hasn't implemented Variant read/write yet. The only thing added so far is support for the Variant logical type, and we are not doing the check. The Variant spec is currently at version 1 (the only version), and other readers may or may not add the check.
>
> Sorry, I might be missing something obvious, as I am not too familiar with this part of the codebase, but if we haven't implemented Parquet C++ Variant write yet, I am not sure I understand how a user would be able to create Variant files with an incorrect logical type annotation using Parquet C++ if we release without this fix.

The engines will implement the reader/writer parts themselves but will use the Variant type defined in Arrow Parquet. That would cause the engines to write an incorrect annotation. That's what I'm seeing internally.
