feat(python): Python binding for iceberg-rust FileIO#2

Draft
abnobdoss wants to merge 7 commits into main from fileio-binding-poc

abnobdoss commented May 24, 2026

Status

Blocked for now while the Predicate binding stack settles. This draft is runtime-only; Python typing stubs are deferred to a separate package-wide follow-up PR.

What is this

Exposes iceberg::io::FileIO to Python as three pyclasses in pyiceberg_core.file_io:

  • FileIO — reusable handle constructed via FileIO.from_props(props: dict[str, str]); backed by OpenDalResolvingStorageFactory so it resolves supported storage schemes from the path
  • InputFile — returned by FileIO.new_input(path), exposes location(), exists(), read() -> bytes, and size() -> int
  • OutputFile — returned by FileIO.new_output(path), exposes location() and write(bytes)

FileIO also exposes path-based exists(path) and delete(path).
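A minimal sketch of how the surface described above might be exercised. `round_trip` is a hypothetical helper that relies only on the method names listed in this description. The `InMemory*` classes are hypothetical stand-ins (not part of the binding) so the snippet runs without a wheel built from this branch; their error type and in-memory storage are assumptions.

```python
from __future__ import annotations


class InMemoryInputFile:
    """Hypothetical stand-in mirroring the InputFile methods listed above."""

    def __init__(self, store: dict[str, bytes], path: str) -> None:
        self._store, self._path = store, path

    def location(self) -> str:
        return self._path

    def exists(self) -> bool:
        return self._path in self._store

    def read(self) -> bytes:
        if self._path not in self._store:
            # The real binding's error type is not stated here; this is an assumption.
            raise FileNotFoundError(self._path)
        return self._store[self._path]

    def size(self) -> int:
        return len(self.read())


class InMemoryOutputFile:
    """Hypothetical stand-in mirroring the OutputFile methods listed above."""

    def __init__(self, store: dict[str, bytes], path: str) -> None:
        self._store, self._path = store, path

    def location(self) -> str:
        return self._path

    def write(self, data: bytes) -> None:
        # One-shot write: replaces any existing content at this path.
        self._store[self._path] = bytes(data)


class InMemoryFileIO:
    """Hypothetical stand-in: one reusable handle shared by many file opens."""

    def __init__(self, props: dict[str, str]) -> None:
        self._props = dict(props)
        self._store: dict[str, bytes] = {}

    @classmethod
    def from_props(cls, props: dict[str, str]) -> "InMemoryFileIO":
        return cls(props)

    def new_input(self, path: str) -> InMemoryInputFile:
        return InMemoryInputFile(self._store, path)

    def new_output(self, path: str) -> InMemoryOutputFile:
        return InMemoryOutputFile(self._store, path)

    def exists(self, path: str) -> bool:
        return path in self._store

    def delete(self, path: str) -> None:
        self._store.pop(path, None)


def round_trip(io, path: str, payload: bytes) -> bytes:
    """Write, inspect, read back, and delete one file through a single handle."""
    io.new_output(path).write(payload)
    src = io.new_input(path)
    assert src.exists() and src.size() == len(payload)
    data = src.read()
    io.delete(path)
    return data
```

With a wheel built from this branch, a handle from `FileIO.from_props({})` in `pyiceberg_core.file_io` and a local `file://` path could presumably be passed to `round_trip` in place of the stand-in.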

Motivation

This gives Python callers direct access to iceberg-rust's existing FileIO abstraction, including reuse of the same handle across many file opens. The binding mirrors the shape of the Rust API and spares callers from writing one-off, dict-based helper functions for every path.

The blocking I/O calls release the Python GIL via PyO3's Python::detach, so Python threads are not serialized while the Rust runtime waits on storage operations.
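The effect of releasing the GIL can be illustrated without the binding. In the sketch below, `time.sleep` (which releases the GIL while it waits) stands in for a blocking storage call; `blocking_read` and `timed_parallel` are hypothetical names. If the waits overlap, the total wall time stays close to one delay rather than the sum of all delays.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def blocking_read(delay: float) -> float:
    # Stand-in for a blocking storage call: time.sleep releases the GIL
    # while waiting, so other Python threads can make progress.
    time.sleep(delay)
    return delay


def timed_parallel(n_threads: int, delay: float) -> float:
    """Run n_threads blocking calls concurrently and return the elapsed wall time."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        list(pool.map(blocking_read, [delay] * n_threads))
    return time.perf_counter() - start
```

A call that held the GIL for its whole duration would instead take roughly `n_threads * delay`, which is the serialization the binding is designed to avoid.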

This PR is one small building block for Rust-backed PyIceberg reads; it does not claim to complete the broader Python integration surface.

Files changed

| Path | Change |
| --- | --- |
| bindings/python/src/file_io.rs | New PyO3 binding |
| bindings/python/src/lib.rs | Wires the file_io module |
| bindings/python/tests/test_file_io.py | Local filesystem behavior tests |
| bindings/python/Cargo.lock | Refreshes the Python binding lockfile so cargo update --workspace --locked passes with the existing opendal-all dependency set |

No new direct Cargo dependency is added.

Test coverage

21 pytest cases pass against the built wheel. All tests use local file:// URIs via tmp_path; no network or cloud credentials are required.
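For reference, one way to turn a temporary path into a `file://` URI with only the standard library is sketched below. `file_uri` is a hypothetical helper; whether the test suite builds its URIs this way is an assumption.

```python
import tempfile
from pathlib import Path


def file_uri(path: Path) -> str:
    # as_uri() requires an absolute path, so resolve first.
    return path.resolve().as_uri()


# Example: a URI for a file inside the system temp directory.
example_uri = file_uri(Path(tempfile.gettempdir()) / "data.bin")
```

In a pytest case, the same helper could be applied to `tmp_path / "data.bin"`.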

Covered behavior includes construction, reusable handles across many opens, credential redaction in FileIO.__repr__, path-based exists / delete, write create / overwrite / boundary payload sizes, read, size, missing-input errors, and handle locations / reprs.
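The redaction rule can be sketched in a few lines of Python. The substrings are the ones named in this PR's commit message for `__repr__`; the case-insensitive match, the `***` mask, and the names `SENSITIVE_MARKERS` and `redact_props` are assumptions for illustration, not the binding's actual Rust implementation.

```python
from __future__ import annotations

# Substrings taken from the commit message; matching on them is the documented rule.
SENSITIVE_MARKERS = ("secret", "key", "token", "password", "credential", "passphrase")


def redact_props(props: dict[str, str], mask: str = "***") -> dict[str, str]:
    """Return a copy of props with values masked for keys that look sensitive."""
    return {
        key: mask if any(marker in key.lower() for marker in SENSITIVE_MARKERS) else value
        for key, value in props.items()
    }
```

Matching on substrings of the key rather than on an exact allow-list errs toward hiding too much, for example any property whose name merely contains `key`, which seems a reasonable trade-off for a `__repr__` that may end up in logs.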

What this PR is NOT

  • Not PyIceberg-side wiring
  • Not a consumer of FileScanTask or ArrowReader
  • Not a replacement for PyArrowFileIO or fsspec in PyIceberg
  • Not a cloud-provider integration test suite
  • Not a typing/stub PR; .pyi files and py.typed are deferred to a package-wide follow-up

Abanoub Doss added 7 commits May 24, 2026 16:24
…ore_rust

Add `bytes = "1"` to the Python binding's Cargo.toml (needed for
explicit byte-slice conversion in file_io.rs) and register
file_io::register_module in lib.rs, placing it alongside the existing
transform/manifest registrations.
…-rust FileIO

Exposes iceberg-rust's `FileIO` to Python via three pyclasses:

- `FileIO.from_props(dict)` — primary constructor matching the same
  OpenDalResolvingStorageFactory plumbing already used by
  IcebergDataFusionTable, now returning a reusable handle instead of
  discarding after construction. Callers amortize setup across thousands
  of file opens in a single query.
- `FileIO.exists(path)` / `FileIO.delete(path)` — async ops via the
  shared Tokio runtime handle.
- `FileIO.new_input(path)` / `FileIO.new_output(path)` — sync
  (InputFile/OutputFile hold the storage Arc internally).
- `InputFile.read()` → `bytes`, `InputFile.exists()`, `InputFile.metadata()` → dict.
- `OutputFile.write(bytes)` — one-shot write.
- `__repr__` on FileIO redacts any key containing secret/key/token/password/credential/passphrase.

Bare signatures for FileIO, InputFile, and OutputFile with a module-level
docstring explaining from_props(dict) as the primary constructor and the
credential-redaction behaviour of __repr__.

30 tests covering:
- from_props construction (empty dict, partial props, handle independence)
- __repr__ credential redaction for 7 sensitive key patterns
- exists/delete via FileIO
- OutputFile.write (create, overwrite, empty bytes)
- InputFile.exists, read, metadata
- round-trip write→read
- repr format for InputFile and OutputFile

All tests use tmp_path for filesystem isolation; no network deps.