Paratext <-> Apache Arrow bridge #55

@wesm

@deads at some point in the next 6 months, I would like to use the paratext codebase to emit native Arrow C++ array objects (including native categoricals, i.e. arrow::DictionaryArray). Eventually we could deprecate the existing CSV reader in pandas and make the paratext+Arrow-powered reader the next-gen CSV reader for pandas. I've already spent a lot of time optimizing the Arrow->pandas code path, and in pandas 2.0 the overhead should drop to 0.

The simplest thing would be to fork the codebase into a libarrow_csv shared library that lives in the Arrow codebase, since the code might diverge (and there will be overlapping concerns where code sharing might help, like on-the-fly dictionary encoding). Another option is to add a libparatext_arrow library within this repo and make it a dependency of the pyarrow library, similar to how we've already built libparquet_arrow inside parquet-cpp (https://github.com/apache/parquet-cpp/tree/master/src/parquet/arrow). Thoughts?
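
To make the "on-the-fly dictionary encoding" point concrete, here is a minimal pure-Python sketch of the idea. It is only an illustration, not paratext or Arrow code: the function name `dictionary_encode` and the sample data are hypothetical, and a real implementation would live in C++ and operate on parsed column chunks.

```python
import csv
import io


def dictionary_encode(values):
    """Encode values as (indices, dictionary) in a single pass.

    Hypothetical helper for illustration only.
    """
    dictionary = []   # distinct values, in order of first appearance
    positions = {}    # value -> its index in `dictionary`
    indices = []      # one integer code per input value
    for value in values:
        code = positions.get(value)
        if code is None:
            # First time this value is seen: assign the next code.
            code = len(dictionary)
            positions[value] = code
            dictionary.append(value)
        indices.append(code)
    return indices, dictionary


# Made-up single-column CSV used only to exercise the helper.
text = "city\nNYC\nSF\nNYC\nLA\nSF\n"
rows = csv.reader(io.StringIO(text))
next(rows)  # skip the header row
indices, dictionary = dictionary_encode(row[0] for row in rows)

# Decoding through the dictionary recovers the original column.
decoded = [dictionary[i] for i in indices]
```

The two outputs correspond to the two parts of a dictionary-encoded (categorical) column: an array of integer indices plus the dictionary of distinct values they point into.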
