GH-48254: [Python][Parquet] Support extension types in read_schema #48255

Open
Kuinox wants to merge 1 commit into apache:main from Kuinox:schema_uuid_fix

Conversation

@Kuinox Kuinox commented Nov 25, 2025

Rationale for this change

pq.read_schema drops extension types (UUID comes back as fixed_size_binary[16]), while ParquetFile.schema_arrow and read_table preserve them. Schema inspection via metadata should match table/extension behavior.
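The discrepancy described above can be inspected with a small script. This is a minimal sketch, not the PR's test: the helper names `schema_types_by_path` and `demo` are hypothetical, the byte values are placeholders, and `pa.uuid()` is assumed to be available (it only exists in recent pyarrow releases). Which of the three paths reports `extension<arrow.uuid>` depends on the installed pyarrow version and on whether this change is present, so no expected output is shown.

```python
import os
import tempfile

try:
    import pyarrow as pa
    import pyarrow.parquet as pq
except ImportError:  # pyarrow is optional for this sketch
    pa = pq = None


def schema_types_by_path(path, column="id"):
    """Return the Arrow type of `column` as seen through three read paths."""
    with pq.ParquetFile(path) as parquet_file:
        file_type = parquet_file.schema_arrow.field(column).type
    return {
        "read_schema": pq.read_schema(path).field(column).type,
        "ParquetFile.schema_arrow": file_type,
        "read_table": pq.read_table(path).schema.field(column).type,
    }


def demo():
    # Placeholder 16-byte values wrapped in the canonical UUID extension type.
    storage = pa.array([b"\x00" * 16, b"\x11" * 16], type=pa.binary(16))
    ids = pa.ExtensionArray.from_storage(pa.uuid(), storage)
    table = pa.table({"id": ids})
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "uuid.parquet")
        # store_schema=False mirrors the call quoted in the review thread.
        pq.write_table(table, path, store_schema=False)
        return schema_types_by_path(path)


# Only run when pyarrow is installed and recent enough to provide pa.uuid().
if pa is not None and hasattr(pa, "uuid"):
    for name, arrow_type in demo().items():
        print(f"{name}: {arrow_type}")
```

If the three printed types differ, that is the inconsistency this PR targets.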

What changes are included in this PR?

  • Plumb arrow_extensions_enabled into read_schema and return schema_arrow when enabled so extension types are preserved.
  • Add regression test ensuring UUID extension types are retained by read_schema and downgraded to binary(16) when extensions are disabled.

Are these changes tested?

  • Yes: added unit test test_read_schema_uuid_extension_type

Are there any user-facing changes?

  • Behavior improvement: read_schema now preserves extension types (e.g., UUID) when extensions are enabled; no API break

Notes:

  • I don't know whether returning column types as extension<arrow.uuid> instead of fixed_size_binary[16] is considered a breaking change.
  • This patch was AI-generated, but I personally reviewed it; the scope is small, and it looks fine to me.

@github-actions

⚠️ GitHub issue #48254 has been automatically assigned in GitHub to PR creator.

Kuinox force-pushed the schema_uuid_fix branch 2 times, most recently from 2fcb4b7 to 820ae83, on December 17, 2025 17:34
Member

@AlenkaF AlenkaF left a comment

Thanks for the contribution @Kuinox!
You can see my comments below.

Comment thread python/pyarrow/parquet/core.py
Comment thread python/pyarrow/parquet/core.py Outdated
Comment thread python/pyarrow/tests/parquet/test_data_types.py Outdated
Kuinox force-pushed the schema_uuid_fix branch 2 times, most recently from e16f96f to a144bc4, on February 4, 2026 12:03
Author

Kuinox commented Mar 3, 2026

I had issues running the tests on my machines (they were reported as green). I now have a non-Windows machine, so I'll try on it.

Kuinox force-pushed the schema_uuid_fix branch from a144bc4 to 966df38 on March 3, 2026 22:13
Member

AlenkaF commented May 6, 2026

@github-actions crossbow submit -g python


github-actions Bot commented May 6, 2026

Revision: 966df38

Submitted crossbow builds: ursacomputing/crossbow @ actions-e7fd264d23

Task Status
example-python-minimal-build-fedora-conda GitHub Actions
example-python-minimal-build-ubuntu-venv GitHub Actions
test-conda-python-3.10 GitHub Actions
test-conda-python-3.10-hdfs-2.9.2 GitHub Actions
test-conda-python-3.10-hdfs-3.2.1 GitHub Actions
test-conda-python-3.10-pandas-1.3.4-numpy-1.21.2 GitHub Actions
test-conda-python-3.11 GitHub Actions
test-conda-python-3.11-dask-latest GitHub Actions
test-conda-python-3.11-dask-upstream_devel GitHub Actions
test-conda-python-3.11-hypothesis GitHub Actions
test-conda-python-3.11-pandas-latest-numpy-latest GitHub Actions
test-conda-python-3.11-spark-master GitHub Actions
test-conda-python-3.12 GitHub Actions
test-conda-python-3.12-cpython-debug GitHub Actions
test-conda-python-3.12-pandas-latest-numpy-1.26 GitHub Actions
test-conda-python-3.12-pandas-latest-numpy-latest GitHub Actions
test-conda-python-3.13 GitHub Actions
test-conda-python-3.13-pandas-nightly-numpy-nightly GitHub Actions
test-conda-python-3.13-pandas-upstream_devel-numpy-nightly GitHub Actions
test-conda-python-3.14 GitHub Actions
test-conda-python-emscripten GitHub Actions
test-cuda-python-ubuntu-22.04-cuda-11.7.1 GitHub Actions
test-debian-12-python-3-amd64 GitHub Actions
test-debian-12-python-3-i386 GitHub Actions
test-fedora-42-python-3 GitHub Actions
test-ubuntu-22.04-python-3 GitHub Actions
test-ubuntu-22.04-python-313-freethreading GitHub Actions
test-ubuntu-24.04-python-3 GitHub Actions

Author

Kuinox commented May 6, 2026

Are the errors expected? The build errors don't seem related to my change.

Member

AlenkaF commented May 6, 2026

Some of them are, but not that many. Could you first try to rebase again please?

Kuinox force-pushed the schema_uuid_fix branch from 966df38 to 808df3d on May 6, 2026 16:21
Member

AlenkaF commented May 7, 2026

@github-actions crossbow submit -g python


github-actions Bot commented May 7, 2026

Revision: 808df3d

Submitted crossbow builds: ursacomputing/crossbow @ actions-8aeeb0e39d

Task Status
example-python-minimal-build-fedora-conda GitHub Actions
example-python-minimal-build-ubuntu-venv GitHub Actions
test-conda-python-3.10 GitHub Actions
test-conda-python-3.10-hdfs-2.9.2 GitHub Actions
test-conda-python-3.10-hdfs-3.2.1 GitHub Actions
test-conda-python-3.10-pandas-1.3.4-numpy-1.21.2 GitHub Actions
test-conda-python-3.11 GitHub Actions
test-conda-python-3.11-dask-latest GitHub Actions
test-conda-python-3.11-dask-upstream_devel GitHub Actions
test-conda-python-3.11-hypothesis GitHub Actions
test-conda-python-3.11-pandas-latest-numpy-latest GitHub Actions
test-conda-python-3.11-spark-master GitHub Actions
test-conda-python-3.12 GitHub Actions
test-conda-python-3.12-cpython-debug GitHub Actions
test-conda-python-3.12-pandas-latest-numpy-1.26 GitHub Actions
test-conda-python-3.12-pandas-latest-numpy-latest GitHub Actions
test-conda-python-3.13 GitHub Actions
test-conda-python-3.13-pandas-nightly-numpy-nightly GitHub Actions
test-conda-python-3.13-pandas-upstream_devel-numpy-nightly GitHub Actions
test-conda-python-3.14 GitHub Actions
test-conda-python-emscripten GitHub Actions
test-debian-13-python-3-amd64 GitHub Actions
test-debian-13-python-3-i386 GitHub Actions
test-fedora-42-python-3 GitHub Actions
test-ubuntu-22.04-python-3 GitHub Actions
test-ubuntu-22.04-python-313-freethreading GitHub Actions
test-ubuntu-24.04-python-3 GitHub Actions

Member

@AlenkaF AlenkaF left a comment

LGTM, thanks!
The failures that are left are expected.

@raulcd mind giving one extra look before I merge?

github-actions Bot added the "awaiting committer review" label and removed the "awaiting review" label on May 7, 2026
Member

@raulcd raulcd left a comment

Just a minor nit and a question, but approving as it LGTM.
Thanks @Kuinox for the PR.

Comment on lines +820 to +822
data = [
b'\xe4`\xf9p\x83QGN\xac\x7f\xa4g>K\xa8\xcb',
b'\x1et\x14\x95\xee\xd5C\xea\x9b\xd7s\xdc\x91BK\xaf',
Member

nit: can we add a comment on what this is / how it was generated?
If we ever want to change it or fix a bug in the future, it could be useful.

Author

Those are two UUIDs; I'll add comments.
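As an illustration of the reply above, the two 16-byte values quoted in the review can be decoded with Python's standard `uuid` module. This is a sketch using only the standard library; the names `RAW_VALUES` and `decode_uuid` are hypothetical, and the last line shows one way such fixtures could be produced, not necessarily how the author generated them.

```python
import uuid

# The two 16-byte values quoted from the test data in this review thread.
RAW_VALUES = [
    b'\xe4`\xf9p\x83QGN\xac\x7f\xa4g>K\xa8\xcb',
    b'\x1et\x14\x95\xee\xd5C\xea\x9b\xd7s\xdc\x91BK\xaf',
]


def decode_uuid(raw):
    """Interpret 16 raw bytes as a UUID (big-endian, as in uuid.UUID.bytes)."""
    if len(raw) != 16:
        raise ValueError("a UUID is exactly 16 bytes")
    return uuid.UUID(bytes=raw)


decoded = [decode_uuid(raw) for raw in RAW_VALUES]
for value in decoded:
    print(value, "version", value.version)

# Assumption: a new fixture could be generated the same way in reverse.
new_fixture = uuid.uuid4().bytes
```

Both values decode as version 4 (random) UUIDs, so a short comment in the test stating this would answer the reviewer's question.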


file_path = tmp_path / "uuid.parquet"
file_path_str = str(file_path)
pq.write_table(table, file_path_str, store_schema=False)
Member

just curious, is store_schema=False relevant?

Author

@Kuinox Kuinox May 8, 2026

It was 6 months ago, so I'm only guessing now:
I remember that the behavior differed depending on whether Arrow loaded its stored schema or not.
I don't remember if it was needed here, but store_schema=False makes sure that a UUID logical type is detected as such without Arrow getting the information from its own stored schema.

I can confirm it if you want.
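What `store_schema` controls can be checked directly: by default, `pq.write_table` embeds the serialized Arrow schema in the Parquet key-value metadata under the `ARROW:schema` key, and `store_schema=False` omits it. The sketch below only demonstrates that difference. It uses a plain fixed-size binary column rather than a UUID column, and the helper names `has_embedded_arrow_schema` and `demo` are hypothetical.

```python
import os
import tempfile

try:
    import pyarrow as pa
    import pyarrow.parquet as pq
except ImportError:  # pyarrow is optional for this sketch
    pa = pq = None


def has_embedded_arrow_schema(path):
    """True if the Parquet footer carries a serialized Arrow schema."""
    key_value_metadata = pq.read_metadata(path).metadata or {}
    return b"ARROW:schema" in key_value_metadata


def demo():
    table = pa.table({"id": pa.array([b"\x00" * 16], type=pa.binary(16))})
    with tempfile.TemporaryDirectory() as tmp:
        with_schema = os.path.join(tmp, "with_schema.parquet")
        without_schema = os.path.join(tmp, "without_schema.parquet")
        pq.write_table(table, with_schema)  # store_schema defaults to True
        pq.write_table(table, without_schema, store_schema=False)
        return (
            has_embedded_arrow_schema(with_schema),
            has_embedded_arrow_schema(without_schema),
        )


if pa is not None:
    print(demo())
```

Without the embedded schema, a reader has to infer Arrow types from the Parquet logical types alone, which is the situation the test in this thread appears to target.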

github-actions Bot added the "awaiting merge" label and removed the "awaiting committer review" label on May 8, 2026