
ENH: Allow third-party packages to register IO engines #61642


Open · wants to merge 12 commits into main

Conversation

datapythonista
Member

Added the new system to the Iceberg connector only, to keep this smaller. The idea is to add the decorator to all other connectors; happy to do it here or in a follow-up PR.

@datapythonista requested a review from noatamir as a code owner June 12, 2025 21:58
@datapythonista added the IO and API Design labels Jun 12, 2025
method on it with the arguments provided by the user (except the ``engine`` parameter).

To avoid conflicts in the names of engines, we keep an "IO engines" section in our
[Ecosystem page](https://pandas.pydata.org/community/ecosystem.html#io-engines).
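For context, a minimal sketch of what a registered engine class might look like, reusing the ``EmptyDataEngine`` name from the docs example later in this PR (the exact methods and signatures pandas expects are defined by the PR itself, so treat this purely as an illustration):

```python
import pandas as pd


class EmptyDataEngine:
    # pandas discovers this class through the "pandas.io_engine" entry point
    # and calls the method matching the reader/writer the user invoked,
    # forwarding every user argument except ``engine`` itself.
    @classmethod
    def read_iceberg(cls, table_identifier, *args, **kwargs) -> pd.DataFrame:
        # Trivial stand-in: a real engine would open the table and return
        # its contents as a DataFrame.
        return pd.DataFrame()

    @classmethod
    def to_iceberg(cls, df: pd.DataFrame, table_identifier, *args, **kwargs) -> None:
        # A real engine would write ``df`` to the given table.
        pass
```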
Member

This will need different formatting since rst hyperlink syntax is different from md

Member Author

True, thanks for the heads up. I updated it.

@@ -52,6 +56,10 @@ def read_iceberg(
scan_properties : dict of {str: obj}, optional
Additional Table properties as a dictionary of string key value pairs to use
for this scan.
engine : str, optional
Member

Should the read_* and to_* signatures also have an engine_kwargs: dict[str, Any] | None argument to allow specific engine arguments to be passed per implementation?

Member Author

Very good point. In read_parquet we already have **kwargs for engine-specific arguments. In map, apply, etc. it's a regular engine_kwargs, since **kwargs is used in some cases for the UDF keyword arguments. I think for IO readers/writers, **kwargs as read_parquet does it is fine.

I didn't want to add the engine to all connectors in this PR, to keep it simpler, but I'm planning to follow up with another PR that adds it, and adds **kwargs for connectors where it's not there already. Happy to add both things here if you prefer; I just thought keeping the implementation separate from all the parameter changes would make reviewing simpler.
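A hypothetical sketch of the dispatch shape being discussed here (``_ENGINES`` and the default engine name are made up; in the PR the lookup goes through the entry-point registry):

```python
import pandas as pd

# Hypothetical registry mapping engine names to engine classes; in the PR
# this is populated from the "pandas.io_engine" entry-point group.
_ENGINES: dict[str, type] = {}


def read_iceberg(table_identifier: str, *, engine: str = "pyiceberg", **kwargs) -> pd.DataFrame:
    # Everything except ``engine`` is forwarded to the selected engine, so
    # engine-specific keyword arguments can be passed without expanding the
    # pandas signature, much like read_parquet does with **kwargs today.
    return _ENGINES[engine].read_iceberg(table_identifier, **kwargs)
```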

Member

If engine-specific kwargs are needed, isn't that a good reason to use engine.read_whatever(path, **kwargs) instead of pd.read_[...]?

Member Author

This is a good point. For readers we don't particularly care about, I think what you propose is the best choice, and this PR doesn't really prevent that from happening anyway. But for readers we cared enough about to include in pandas, I think this new interface offers an advantage. For example, there was some discussion on whether we should move the fastparquet engine out of pandas; Patrick suggested it. This interface allows moving the fastparquet engine to the fastparquet package: users with fastparquet installed will still have it available in the same way as now, and we can forget about it.

Of course discussions about moving readers out of pandas will have to happen later. But this interface seems quite useful and it's very simple, so in my opinion it's a good deal.

@datapythonista
Member Author

/preview

Contributor

Website preview of this PR available at: https://pandas.pydata.org/preview/pandas-dev/pandas/61642/

@datapythonista
Member Author

@mroeschke I addressed your comments and I think this should be ready when you've got time. Thanks!

Member

@mroeschke left a comment


Overall I think this is a good idea. Might be good to get another opinion to vet the idea.

I think still having a **kwargs-like argument so users can pass engine-specific arguments without manually expanding the pandas signatures would be good. But since this is for the new iceberg IO method, it's not as critical now.

@datapythonista
Member Author

Fully agree, I just didn't want to make this PR too big by adding the engine parameter and **kwargs everywhere here. I'll do it in a follow-up.

@pandas-dev/pandas-core any opinion or comment before merging this?

@WillAyd
Member

WillAyd commented Jun 26, 2025

Thanks for the ping. I haven't been involved enough to really block, but I'm curious what the advantage of this is over leveraging the Arrow PyCapsule interface for data exchange; I feel like the latter would be a better thing to build against, given it is a rather well adopted standard in the ecosystem

@datapythonista
Member Author

Good point, thanks for the feedback. I think this is a higher-level interface that still allows using the Arrow PyCapsule internally. Surely an option would be to get rid of I/O in pandas and have an ecosystem of readers that can be used via a single pd.read(engine) or something similar. But I think it's better to use the current API, allow third-party engines as implemented here, and let the engines decide how the data is exchanged. Or what is your idea?

@WillAyd
Member

WillAyd commented Jun 26, 2025

It would depend on a resolution to #59631, but for demo purposes let's assume we decide to implement a new DataFrame.from_pycapsule class method.

So instead of an I/O method like:

df = pd.read_iceberg(..., kwargs)

You could construct a dataframe like:

df = pd.DataFrame.from_pycapsule(
    # whatever the iceberg API is here to create a table
    pyiceberg.table(..., kwargs)
)

The main downside is verbosity, but the upsides are:

  1. Generic approach to read any datasource
  2. pandas itself has to do very little development (third parties are responsible for integration)
  3. polars, cudf, etc... all benefit in kind

So yea this could be extended to say even Delta if they decided to implement the PyCapsule interface (whether they do currently or not, I don't know):

df = pd.DataFrame.from_pycapsule(
    # whatever the delta API is here to create a table
    DeltaLake.table(..., kwargs)
)

and if polars decided on the same API you could create that dataframe as well:

df = pl.DataFrame.from_pycapsule(
    # whatever the delta API is here to create a table
    DeltaLake.table(..., kwargs)
)

@WillAyd
Member

WillAyd commented Jun 26, 2025

As I said though, I haven't been involved enough to really block, so if this PR has some support happy to roll with it and can clean up later if it comes to it. Thanks for giving it consideration @datapythonista

@jbrockmendel
Member

I don't know anything about pycapsules. Is that effectively a pd.from_arrow_table method that isn't pyarrow-specific?

@WillAyd
Member

WillAyd commented Jun 26, 2025

Yea kind of. In a generic sense, a PyCapsule is just a Python construct that can wrap generic memory. The Arrow PyCapsule interface further defines lifetime semantics for passing Arrow data across a PyCapsule:

https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html

Somewhat separately there is the question of how you would read and write the data in a capsule. For pandas that is pyarrow, but other libraries may choose a tool like nanoarrow for a smaller dependency
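As a rough illustration of that consumption path, here is a sketch assuming a recent pyarrow, which can build a table from any producer exposing ``__arrow_c_stream__``/``__arrow_c_array__``; the ``from_arrow_capsule`` name is made up:

```python
import pandas as pd
import pyarrow as pa


def from_arrow_capsule(obj) -> pd.DataFrame:
    # ``obj`` is any object implementing the Arrow PyCapsule interface,
    # e.g. a pyiceberg or DeltaLake table object; pyarrow materializes it
    # and the result is converted to pandas.
    return pa.table(obj).to_pandas()
```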

@datapythonista
Member Author

@WillAyd what you propose seems reasonable, but I guess we aren't planning to remove all pandas IO anytime soon. And if we keep our readers and writers with multi-engine support, I think this interface is going to be useful, even if long term we move to the PyArrow capsule reader you propose. Also, since pandas is not Arrow-based yet, this PR could be used to move the xarray connectors to the xarray package, while using the PyCapsule wouldn't be ideal for pandas/xarray interchange, as they are numpy-based.

Comment on lines +531 to +536
it. This is done in ``pyproject.toml``:

```toml
[project.entry-points."pandas.io_engine"]
empty = empty_data:EmptyDataEngine
```
Contributor

I'm not 100% sure if this can happen, but what if the project isn't using pyproject.toml for some reason? Is there another way to do the configuration, or is using pyproject.toml required?

Member Author

Entry points existed before pyproject.toml, and can also be added in setup.py. It makes no difference how the package defines them: pip or conda will add the entry point to the environment's metadata, and pandas will be able to find them regardless of how the project created them.
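For reference, a sketch of the setuptools equivalent, mirroring the hypothetical ``empty_data``/``EmptyDataEngine`` example from the docs in this PR:

```python
# setup.py
from setuptools import setup

setup(
    name="empty-data",
    entry_points={
        # Same entry-point group and engine name as the pyproject.toml example.
        "pandas.io_engine": [
            "empty = empty_data:EmptyDataEngine",
        ],
    },
)
```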

Contributor

The language here suggests that the only way to add the entry point is via pyproject.toml. If this is the recommended way, we can say that. Or if other ways are supported, we should show that too.

@@ -90,6 +90,7 @@ Other enhancements
- Support passing a :class:`Iterable[Hashable]` input to :meth:`DataFrame.drop_duplicates` (:issue:`59237`)
- Support reading Stata 102-format (Stata 1) dta files (:issue:`58978`)
- Support reading Stata 110-format (Stata 7) dta files (:issue:`47176`)
- Third-party packages can now register engines that can be used in pandas I/O operations :func:`read_iceberg` and :meth:`DataFrame.to_iceberg` (:issue:`61584`)
Contributor

This sentence makes it seem that it only applies to read_iceberg. But doesn't the engine comment apply to ANY of the current IO routines that have an engine argument?

Member Author

Good point. This PR creates the new system for third-party engines in a generic way, and the idea is to use it everywhere, but the PR only applies it to Iceberg for now. The reason is to make reviewing easier, as adding the engine keyword to many connectors would make the PR significantly bigger.

My idea is to add the whatsnew note for what's delivered in this PR, and in the follow up PR update it to what you suggest.

Contributor

Actually, since read_iceberg() and to_iceberg() are new for 3.0 anyway, I don't think you need a whatsnew item for this PR, because, as you say, it is just functionality that applies to iceberg right now. Then, if it is accepted, you can add a whatsnew to point to all the readers/writers that support it, if you add support for those readers/writers.

Comment on lines +1342 to +1349
package_name = entry_point.dist.metadata["Name"]
else:
package_name = None
if entry_point.name in _io_engines:
_io_engines[entry_point.name]._packages.append(package_name)
else:
_io_engines[entry_point.name] = entry_point.load()
_io_engines[entry_point.name]._packages = [package_name]
Contributor

I have to wonder if it is better to just get the entry points here but NOT load them, and then load them on demand. So the dict would just have EntryPoint objects, or a tuple of EntryPoint and package name.

Member Author

Sounds like a good idea, I didn't think about it before. I think it'll make the code slightly more complex, but not loading the code of unused connectors would be nice, in case a package takes a long time to import.

I won't be updating this PR, as I don't think it's likely that it'll be merged, so not worth the effort. But I'd be happy to implement it in a follow up.
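A minimal sketch of the lazy-loading variant suggested above (the function names are hypothetical; it assumes the ``pandas.io_engine`` group from this PR and Python 3.10+ ``importlib.metadata``):

```python
from functools import cache
from importlib.metadata import entry_points


@cache
def _io_engine_entry_points():
    # Collect the registered entry points without importing the packages
    # that provide them.
    return {ep.name: ep for ep in entry_points(group="pandas.io_engine")}


def get_io_engine(name: str):
    # Only import (load) the engine's code when it is actually requested.
    try:
        return _io_engine_entry_points()[name].load()
    except KeyError:
        raise ValueError(f"'{name}' is not a registered IO engine") from None
```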

@jbrockmendel
Member

And if we keep our readers and writers with multiengine support

ATM that's just csv and parquet? And the parquet one plausibly is not needed?

AFAICT this adds a bunch of new tests/code/docs, complicates the import with entrypoints, and lets 3rd parties hijack our namespace. All when there's a perfectly good option of using their own namespace.

Also if we ever did change defaults like #61618 or PDEP16, that would Break The World for any 3rd party readers that we are implicitly committed to supporting.

-1.

@WillAyd
Member

WillAyd commented Jun 26, 2025

Also, since pandas is not arrow based yet, this PR could be used to move the xarray connectors to the xarray package, while using pycapsule wouldn't be ideal for pandas/xarray interchange, as they are numpy based.

Definitely not an expert, but I want to point out that DLPack also offers a PyCapsule for data exchange. See https://dmlc.github.io/dlpack/latest/python_spec.html

So depending on how generic we want things to be, PyCapsule support doesn't just mean consuming Arrow data, but could mean dlpack data as well (which I assume xarray can do, if it doesn't already)

@datapythonista
Member Author

ATM that's just csv and parquet? And the parquet one plausibly is not needed?

Any reader could have an engine option, this would allow having xarray, sas... as other packages.

AFAICT this adds a bunch of new tests/code/docs, complicates the import with entrypoints, and lets 3rd parties hijack our namespace. All when there's a perfectly good option of using their own namespace.

This adds minimal tests, code, and docs; any other PR adds as many as this. We already allow entry points in pandas, and third parties cannot use this to change the namespace directly, just to allow engine="foo".

Also if we ever did change defaults like #61618 or PDEP16, that would Break The World for any 3rd party readers that we are implicitly committed to supporting.

We could expose the value of the flag that sets the default types as part of the interface, so third parties can transition in the same way we do.

-1.

Being honest, I think you'll block anything that is Bodo related. I think it would have been better to be -1 on accepting their money when they offered it. In any case, I'll close this, and I'll open a separate issue to discuss returning the remaining funds, as I think that can make sense at this point.

@jbrockmendel
Member

jbrockmendel commented Jun 26, 2025

I'm not blocking, just voting against. If everyone else disagrees, I'll make my peace with it.

I'm not against everything bodo-related, but I do default pretty hard against letting 3rd parties piggyback on our namespace.

@twoertwein
Member

It might be nice to sequester all third-party integrations into an external (or thirdparty) namespace, like pd.external.read_iceberg (and something like pd.DataFrame.external.to_iceberg).

@datapythonista
Member Author

I'm not blocking, just voting against. If everyone else disagrees, I'll make my peace with it.

I apologize @jbrockmendel as I think I overreacted. While I don't understand or share your concerns, I think it's good to have your feedback.

The problem is that I think the new pandas governance, instead of fixing any of our decision-making problems, made them even worse. While your -1 vote is fair and just a point of view that I appreciate having, in practice it means that this PR is dead unless I try the PDEP path and wait out the three-month window, which I won't do.

I'll leave this open for a bit longer, just in case there is enough support to make you OK with it. Otherwise I'll close it later when it becomes stale.

@jbrockmendel
Member

While I don't understand or share your concerns

Not sharing them is fine, but if you don't understand them, that means I've done a poor job explaining, so I'll try again:

Zero users have asked for this. There are zero use cases in which bodo.read_iceberg is not a fully-functional (and clearer!) alternative. To my mind, that means there are zero upsides.

On the other hand, this complicates our API, which users frequently do complain about. Sure, it doesn't complicate it much, but these things add up. Similarly, more docs and more code and more tests increase burdens, and those add up too. These are very real downsides.

But the part that I actually think will come to bite us is that it muddles responsibility. Users won't know what we do and don't maintain, and will complain to us. If we want to change something that breaks a 3rd party reader, people will argue that needs a deprecation process. There will be crappy poorly-implemented engines out there with our name associated with them.

All downsides, no upsides.

instead of fixing any of our decision-making problems, made them even worse

No argument there. FWIW I'm pretty comfortable being the "that doesn't need a full PDEP" guy.

@Dr-Irv
Contributor

Dr-Irv commented Jun 28, 2025

FWIW I'm pretty comfortable being the "that doesn't need a full PDEP" guy.

As the person usually asking for a PDEP, I don't think one is needed here.

But @jbrockmendel brings up some valid concerns, and I leave it to others to determine whether those concerns mean we don't accept this addition to pandas.

@datapythonista
Member Author

Thanks @jbrockmendel, this is helpful, I understand your concerns better now, and I actually agree with two of your points.

While I think this is actually a very small change, it wouldn't be worth it if the goal was just to allow Bodo (or others) to write a reader. We got rid of read_gbq long ago, because maintaining even just the wrapper was annoying, and also because it's not an open source format but paid infrastructure. They implemented pandas_gbq.read_gbq, and no big deal. That was a whole format, not an engine, but we could do the same with anything.

I just find this has some advantages:

  • For writers, what you propose doesn't support method chaining
  • With your proposal, if I want to get rid of the xarray IO code in pandas, it's a breaking change. With this proposal I can move the wrapper to xarray, add the entry point, and users with xarray installed won't see a difference. Users without xarray may see a different error message when using xarray. Our CI wouldn't need to test for xarray, and more importantly, our environment wouldn't have xarray. If we can do this with a few engines, I probably won't have to keep spending countless hours fighting the conda solver.
  • If our CSV readers (or Excel readers) were moved out to other packages, with your proposal their signatures would become independent, and changing from one engine to another would require changing the module, possibly the function name, and the parameters. With this PR it'd be just the engine name, and the parameters only if engine-specific parameters exist. For context, I'd like to have pandas CSV engines for Polars and DuckDB, and I'd like that code to live in those projects. This PR (and the needed follow-up to support all formats) would allow it from the pandas side.

The second point I agree with is that users may not immediately understand that the engines don't live in pandas. I don't have a solution, but here are two things to consider:

  • For most connectors the code already lives in an external package; we just maintain a thin wrapper. So, in a way, we already take responsibility for other packages' code.
  • The docs won't mention external engines, but the API docs and the user guide can mention that external engines are supported. For new engines, I think users will have to learn about them from the other packages' docs, or from our ecosystem page. So, except for users reading code written by someone else, I think most users can become familiar with the idea that engines are not part of pandas.

@jbrockmendel
Member

For writers, what you propose doesn't support method chaining

A user who is heroin-level obsessed with method chaining can use .pipe(bodo.to_iceberg)

if I want to get rid of the xarray IO code in pandas

If that is the real goal here, please just make an issue for that. In that scenario, I would also say that the relevant reader/writer belongs in its own namespace.
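For what it's worth, the .pipe() pattern mentioned above looks roughly like this; ``to_iceberg`` here is a stand-in stub for a third-party writer (such as the hypothetical ``bodo.to_iceberg`` above), whose real signature is up to that package:

```python
import pandas as pd


def to_iceberg(df: pd.DataFrame, table_identifier: str) -> None:
    # Stand-in for a third-party writer; a real one would write the frame.
    print(f"writing {len(df)} rows to {table_identifier}")


(
    pd.DataFrame({"a": [1, 2, 3]})
    .assign(b=lambda df: df["a"] * 2)
    .pipe(to_iceberg, "catalog.db.table")
)
```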

@datapythonista
Member Author

@twoertwein we discussed supporting what you mention in PDEP-9. And if people aren't convinced to have this for engines only, I don't think there can be consensus for supporting arbitrary formats in the pandas namespace.

@datapythonista
Member Author

If that is the real goal here, please just make an issue for that.

I wouldn't say it's the real goal, but surely one of the main reasons. I wrote PDEP-9 to go into the details on why I think pandas IO should work as Python modules.

I think the main reason people use Python is that code is readable, and it's batteries included via pip/conda. I'd say people use pandas because it's a Swiss Army knife, also with everything included. If a Python user is asked to implement code in a C extension, it's like telling them it's not possible, because they are used to pip install + import. In a similar way (in my opinion), telling users to import another module to read, and to pipe to write, is telling them the reader is not supported by pandas. Surely the difficulty is not comparable to writing a C extension, but the feeling that it's not supported and that a hack is needed is probably the same.

The real goal here is to reduce the gap between a pandas core IO connector and an external IO connector, to make it the same as the gap between a standard library package and a cheeseshop (PyPI) package. And one of the main motivations is that moving IO connectors into and out of pandas would become trivial, both technically and in terms of backward compatibility. PDEP-9 tried that fully; this is just for engines of pandas-supported formats. Same idea, except this PR is trivial, both in code and conceptually, while PDEP-9 came with problems of naming conflicts and pollution of the pandas namespace.

But I personally don't think discussing PDEP-9 again is needed. I think it's mostly a question of whether the advantages here are worth the added complexity. To me that's an absolute yes. I guess you don't see the advantages as significant, since you think it's fine to just use other packages' modules and pipe. I disagree with that, but it's surely a valid point of view. In practice we won't move any IO connector out of pandas with this PR, but it's not clear that was going to happen anyway, so it's not an immediate advantage.

@datapythonista
Member Author

@Dr-Irv you are still blocking this PR. Is it that you want it to be blocked, or did you forget to remove the "request changes" flag? From your last comment I can't tell which of Brock's concerns you share, and whether they are a blocker. But if you just forgot to remove the flag, I don't think it's very nice to block someone's work without being clear about what change is expected, or why this shouldn't be merged in any form.
