Arrow adoption #2

TNieuwdorp · 2023-03-28T10:06:02Z

TNieuwdorp
Mar 28, 2023

Recent versions of dataframe libraries such as pandas and polars in Python support Arrow as the underlying memory backend. Arrow is quickly becoming the de facto standard for columnar storage. It is an in-memory columnar data format that can improve the performance of data processing and analysis.
By using Arrow, we can avoid the cost of serialization and deserialization when passing data between different parts of our system, and take advantage of optimizations that are specific to columnar data. Many other data processing and analysis tools already support Arrow, so adding Arrow integration to PGM would make it easier for users to combine this library with other tools. Arrow memory can even be shared between languages; since Alliander still uses R a lot, that could make PGM easier to adopt in R too.

I'd love to hear your thoughts on this. Do you think it makes sense to add Arrow integration to PGM? Are there any potential downsides or challenges that I'm not aware of? Please let me know your opinions and feedback.

TNieuwdorp · 2023-03-29T07:06:13Z

TNieuwdorp
Mar 29, 2023
Author

I did a quick exploration of some transitions between different data formats.

The first cell creates a polars dataframe with floats. Converting these to numpy is a zero-copy operation. However, Arrow memory is immutable and the numpy array is a view of this Arrow data, so the numpy data is read-only (as can be seen in the last cell). Creating a pandas frame from this copies the data. This behaviour duplicates data in memory, which is undesirable for larger dataframes.

I'm curious if PGM can handle read-only numpy arrays?

3 replies

SteffanWullems Mar 30, 2023

Pandas 2.0 can optionally use pyarrow, mainly to remove numpy issues such as missing values and strings. Pandas with pyarrow still has some issues. PyArrow would be nice to have.
On the read-only issue with Polars: it's because the Rust environment does not give ownership to the Python interop. If you copy/clone the numpy array, the copy should be writable.
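
A small sketch of the copy approach, using only numpy. The helper name `as_writable` is hypothetical, and the read-only array is simulated with `setflags` as a stand-in for a view backed by immutable memory.

```python
import numpy as np


def as_writable(array: np.ndarray) -> np.ndarray:
    """Return the array itself if it is writable, otherwise a writable copy."""
    if array.flags.writeable:
        return array
    return array.copy()


# Stand-in for a read-only view handed over by another library.
read_only = np.array([1.0, 2.0, 3.0])
read_only.setflags(write=False)

writable = as_writable(read_only)
writable[0] = 42.0  # works on the copy; the original stays untouched
```

The copy only happens when it is needed, but for large read-only inputs it still costs the extra memory mentioned above.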

mgovers Jul 13, 2023
Maintainer

Apache provides C and C++ libraries, so we don't have to do the data conversions ourselves if we want to pass this across the C API boundary.

Otherwise, if we only want to support it in Python, I guess the only thing we should theoretically have to do to provide interop is to ensure that we can deal with read-only data, but that is not going to work when writing the output.

As a workaround, we could use the Apache Arrow Python library to convert our numpy arrays to the PyArrow data format to ensure the user gets the same data format back, but then the advantage would only be on the data provisioning end.

mgovers Jul 24, 2023
Maintainer

As discussed in the Teams meeting, providing documentation in the power-grid-model-io project is a good enough first step. Issue PowerGridModel/power-grid-model-io#190 has been created and will be picked up next sprint.

mgovers · 2023-07-24T08:54:55Z

mgovers
Jul 24, 2023
Maintainer

As discussed in the Teams meeting, providing documentation in the power-grid-model-io project is a good enough first step. Issue PowerGridModel/power-grid-model-io#190 has been created and will be picked up next sprint.

1 reply

mgovers Jul 31, 2023
Maintainer

@TNieuwdorp there is an open pull request with an illustrative example as documentation: PowerGridModel/power-grid-model-io#191.

A preview of the documentation can be found here: https://power-grid-model-io--191.org.readthedocs.build/en/191/examples/arrow_example.html

If you like, you can provide feedback on the pull request PowerGridModel/power-grid-model-io#191.

mgovers · 2024-10-11T06:52:53Z

mgovers
Oct 11, 2024
Maintainer

Support for columnar data formats (PowerGridModel/power-grid-model#548) is in preview now. That should also provide a better fit to data in Arrow format.

We intend to do a minor version bump (to the v1.10.x release series) in the coming weeks to signify general availability of columnar data support when documentation and examples are updated. The intention is to also update the Arrow example after that.

Official support for the PyArrow data format would be a new feature in the power-grid-model-io library. If such integration is desired, please upvote and leave a comment.

0 replies

nitbharambe · 2024-10-29T12:54:10Z

nitbharambe
Oct 29, 2024
Maintainer

We are close to the release for fully supporting the columnar data format in PGM.

Meanwhile, an update:
The Arrow example has been updated to demonstrate one way the user can work with PGM's columnar data format.

0 replies

figueroa1395 · 2025-01-23T14:47:59Z

figueroa1395
Jan 23, 2025
Collaborator

The columnar data format is now fully supported in PGM and was released in v1.10.

In addition, there is also the updated Arrow example.

0 replies
