
feat(python): Python binding for iceberg-rust Schema #3

Draft
abnobdoss wants to merge 8 commits into main from schema-binding-poc

@abnobdoss abnobdoss commented May 24, 2026

Status

Blocked for now while the Predicate binding stack settles. This draft is runtime-only; Python typing stubs are deferred to a separate package-wide follow-up PR.

Summary

Adds a Python binding for iceberg::spec::Schema as an opaque pyiceberg_core.schema.Schema handle constructed from Iceberg schema JSON.

  • Schema.from_json(s) parses V1 or V2 schema JSON and lets iceberg-rust serde enforce schema validity
  • schema_id(), highest_field_id(), column_names(), identifier_field_ids() expose cheap schema metadata; identifier field IDs are returned in ascending order
  • find_field_by_name(name) does case-sensitive dotted-path lookup and returns {id, name, type, required} or None
  • field_by_id(id) returns the same field dict shape and raises KeyError when absent
  • to_json() emits parseable schema JSON for semantic round trips
  • to_arrow_schema() exports a pyarrow.Schema; field IDs are preserved in PARQUET:field_id metadata
  • __arrow_c_schema__() implements the Arrow PyCapsule Interface with capsule name "arrow_schema"
  • _capsule() returns a PyCapsule named "iceberg_core_schema" wrapping Arc<Schema> for future sibling modules in this binding crate
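The lookup and round-trip contracts in the list above can be sketched without the compiled extension. The block below is a pure-Python stand-in, not the binding itself: `SchemaModel`, the sample schema, and the choice to return the short field name in the dict are all assumptions made for illustration, and only struct nesting is indexed. It mirrors the described behavior: dotted-path lookup is case-sensitive and returns `None` on a miss, `field_by_id` raises `KeyError`, identifier field IDs come back ascending, and `to_json()` output is compared semantically rather than byte for byte.

```python
import json

# Sample Iceberg schema JSON (illustrative data, not taken from the PR).
SCHEMA_JSON = """
{
  "type": "struct",
  "schema-id": 1,
  "identifier-field-ids": [2, 1],
  "fields": [
    {"id": 1, "name": "id", "required": true, "type": "long"},
    {"id": 2, "name": "region", "required": true, "type": "string"},
    {"id": 3, "name": "location", "required": false, "type": {
      "type": "struct",
      "fields": [
        {"id": 4, "name": "lat", "required": true, "type": "double"},
        {"id": 5, "name": "lon", "required": true, "type": "double"}
      ]
    }}
  ]
}
"""


class SchemaModel:
    """Hypothetical pure-Python model of the documented lookup contract."""

    def __init__(self, doc):
        self._doc = doc
        self._by_name = {}
        self._by_id = {}
        self._index(doc["fields"], prefix="")

    @classmethod
    def from_json(cls, s):
        return cls(json.loads(s))

    def _index(self, fields, prefix):
        # Walk struct fields and register each one under its dotted path.
        for f in fields:
            path = prefix + f["name"]
            entry = {
                "id": f["id"],
                "name": f["name"],
                "type": f["type"],
                "required": f["required"],
            }
            self._by_name[path] = entry
            self._by_id[f["id"]] = entry
            t = f["type"]
            if isinstance(t, dict) and t.get("type") == "struct":
                self._index(t["fields"], prefix=path + ".")

    def identifier_field_ids(self):
        # Returned in ascending order regardless of the order in the JSON.
        return sorted(self._doc.get("identifier-field-ids", []))

    def find_field_by_name(self, name):
        # Case-sensitive; a miss yields None rather than an exception.
        return self._by_name.get(name)

    def field_by_id(self, field_id):
        # A miss raises KeyError, matching the described behavior.
        return self._by_id[field_id]

    def to_json(self):
        return json.dumps(self._doc)


schema = SchemaModel.from_json(SCHEMA_JSON)
lat = schema.find_field_by_name("location.lat")
miss = schema.find_field_by_name("Location.lat")
# Semantic round trip: compare parsed documents, not raw strings.
round_trips = json.loads(schema.to_json()) == json.loads(SCHEMA_JSON)
```

Comparing parsed JSON is the point of the last line: key order and whitespace may differ between the input and the emitted string, so string equality would be too strict a check for a semantic round trip.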

The field dict keeps type as the Iceberg spec JSON representation rather than exposing a parallel Python type tree. That keeps this PR focused on an opaque schema handle while preserving enough information for callers that need to inspect a field type.
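As an illustration of what a caller can still do with a spec-JSON `type` value, here is a minimal sketch that walks a parsed type and collects the field IDs declared inside it. It assumes the value has already been parsed into Python objects (the PR does not say whether `type` arrives as a string or a parsed structure), and `nested_field_ids` is a hypothetical helper, not part of the binding. The key names (`fields`, `element-id`, `key-id`, `value-id`) follow the Iceberg spec's JSON serialization.

```python
def nested_field_ids(type_json):
    """Collect field IDs declared inside a parsed Iceberg type JSON value."""
    # Primitive types serialize as plain strings such as "long" or
    # "decimal(10,2)" and declare no nested field IDs.
    if isinstance(type_json, str):
        return []
    kind = type_json["type"]
    if kind == "struct":
        ids = []
        for field in type_json["fields"]:
            ids.append(field["id"])
            ids.extend(nested_field_ids(field["type"]))
        return ids
    if kind == "list":
        return [type_json["element-id"]] + nested_field_ids(type_json["element"])
    if kind == "map":
        return (
            [type_json["key-id"], type_json["value-id"]]
            + nested_field_ids(type_json["key"])
            + nested_field_ids(type_json["value"])
        )
    raise ValueError(f"unrecognized nested type: {kind!r}")


# Illustrative type values, not taken from the PR.
list_type = {
    "type": "list",
    "element-id": 3,
    "element-required": True,
    "element": "string",
}
map_type = {
    "type": "map",
    "key-id": 4,
    "key": "string",
    "value-id": 5,
    "value-required": False,
    "value": {
        "type": "struct",
        "fields": [{"id": 6, "name": "x", "required": True, "type": "int"}],
    },
}
```

Because the type stays in spec form, a helper like this needs no parallel Python type tree; it only relies on the JSON layout the spec already defines.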

Files changed

  • bindings/python/src/schema.rs - Schema binding implementation
  • bindings/python/src/lib.rs - registers the schema submodule
  • bindings/python/Cargo.toml - adds the explicit serde_json dependency used by from_json() and to_json()
  • bindings/python/tests/test_schema.py - schema binding tests

Verification

  • maturin build --release --out dist - clean, zero warnings
  • pytest bindings/python/tests/test_schema.py - 32 passed
  • cargo test -p iceberg --lib - 1294 passed

Design notes

Arc<Schema> makes clones cheap and gives _capsule() a clean ownership story: each capsule owns its own Arc clone, so the capsule remains valid after the Python Schema object is dropped.
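On the consuming side, a capsule is normally checked by name before its pointer is used. The sketch below shows only that name check, from Python via `ctypes` on CPython; it does not model the `Arc` ownership described above, and the stand-in capsule it builds wraps a dummy integer rather than a real schema. `is_schema_capsule` and the demo payload are hypothetical names introduced for illustration.

```python
import ctypes

# CPython C-API entry points for capsules, reached through ctypes.
_api = ctypes.pythonapi
_api.PyCapsule_New.restype = ctypes.py_object
_api.PyCapsule_New.argtypes = [ctypes.c_void_p, ctypes.c_char_p, ctypes.c_void_p]
_api.PyCapsule_IsValid.restype = ctypes.c_int
_api.PyCapsule_IsValid.argtypes = [ctypes.py_object, ctypes.c_char_p]

# Capsule name quoted in the PR description. The capsule stores a pointer
# to this buffer, so the bytes object must outlive the capsule.
SCHEMA_CAPSULE_NAME = b"iceberg_core_schema"


def is_schema_capsule(obj):
    """Return True if obj is a PyCapsule carrying the expected name."""
    return bool(_api.PyCapsule_IsValid(obj, SCHEMA_CAPSULE_NAME))


# Stand-in capsule for demonstration only: it wraps a dummy C int, not a
# schema. No destructor is registered (third argument is NULL).
_demo_payload = ctypes.c_int(0)
demo_capsule = _api.PyCapsule_New(
    ctypes.addressof(_demo_payload), SCHEMA_CAPSULE_NAME, None
)

# A capsule with a different name must be rejected by the check.
_OTHER_NAME = b"arrow_schema"
other_capsule = _api.PyCapsule_New(
    ctypes.addressof(_demo_payload), _OTHER_NAME, None
)
```

The same check rejects both differently named capsules and objects that are not capsules at all, which is why distinct names for `_capsule()` and `__arrow_c_schema__()` keep the two handoff paths from being confused.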

#[pyclass(..., from_py_object)] is included so follow-up methods in this binding crate can accept Schema as a typed Python argument under PyO3 0.28 without relying on deprecated implicit extraction behavior.

Typing stubs and py.typed are intentionally deferred to a package-wide typing PR so this feature PR only changes runtime behavior.

Abanoub Doss added 8 commits May 24, 2026 16:45
Add serde_json dep (needed for from_json/to_json in schema.rs) and
register the schema submodule in lib.rs alongside the existing modules.

Parse once via Schema.from_json(); Arc<Schema> shared across callers.
Exposes schema_id, highest_field_id, column_names, identifier_field_ids,
find_field_by_name, field_by_id, to_json, to_arrow_schema,
__arrow_c_schema__ (Arrow PyCapsule Interface), and _capsule (Rust→Rust
handoff via PyCapsule named "iceberg_core_schema").

Covers all public methods with full docstrings, including the
__arrow_c_schema__ PyCapsule dunder and the _capsule() Rust handoff.

Covers construction (V1/V2 JSON, error cases), all getter methods,
case-sensitive field lookup, to_json round-trips, PyCapsule lifecycle,
__arrow_c_schema__ PyCapsule Interface, and to_arrow_schema with
PARQUET:field_id preservation.