Port more Python catalog and persistence logic to Rust #2062

Open
twitu opened this issue Nov 17, 2024 · 0 comments
Labels: enhancement, improvement, RFC, rust

twitu (Collaborator) commented Nov 17, 2024

The persistence Python module exposes a number of useful ways to interact with data. The key modules are:

  • schema.py - Defines the pyarrow schemas for many Nautilus classes. For some of the classes - the data model - the schema is defined by the Rust implementation.
  • writer.py - Writes a stream of all serializable Nautilus objects to feather files partitioned by data type. This is particularly useful for recording live data and replaying events for debugging or analysis. For the Rust data model objects, Rust methods convert lists of objects into Arrow record batches (see the sketch after this list).
  • loader.py - Loads some data model objects from CSV and Parquet files into pandas DataFrames. It is used to read test data in TestDataProvider and is completely independent of the schemas and Rust logic.
  • wrangler_v2.py - Builds Nautilus data model objects from pandas, Arrow, and JSON. Pandas data is converted to Arrow, and Arrow record batches are decoded into lists of objects using the Rust decode_batch implementation.
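
As a rough sketch of what that Rust-side encoding looks like - the fields, schema, and function names below are illustrative, not the actual nautilus_model API:

```rust
use std::sync::Arc;

use arrow::array::{Float64Array, UInt64Array};
use arrow::datatypes::{DataType, Field, Schema, SchemaRef};
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

// Illustrative stand-in for a Nautilus data model type.
struct QuoteTick {
    bid: f64,
    ask: f64,
    ts_event: u64,
}

// The schema lives next to the type, giving a single source of truth.
fn quote_tick_schema() -> SchemaRef {
    Arc::new(Schema::new(vec![
        Field::new("bid", DataType::Float64, false),
        Field::new("ask", DataType::Float64, false),
        Field::new("ts_event", DataType::UInt64, false),
    ]))
}

// Convert a list of objects into a single Arrow record batch.
fn encode_batch(ticks: &[QuoteTick]) -> Result<RecordBatch, ArrowError> {
    RecordBatch::try_new(
        quote_tick_schema(),
        vec![
            Arc::new(Float64Array::from_iter_values(ticks.iter().map(|t| t.bid))),
            Arc::new(Float64Array::from_iter_values(ticks.iter().map(|t| t.ask))),
            Arc::new(UInt64Array::from_iter_values(ticks.iter().map(|t| t.ts_event))),
        ],
    )
}
```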

A major chunk of the functionality is the Parquet catalog, which is used to read and write data model objects to and from Parquet files:

  • Writes a mixed data stream into separate Parquet files partitioned by data type. The data is encoded to bytes using Rust logic but written to an abstract fsspec file system using pyarrow. The catalog is also updated with a mapping from data type to directory and Parquet file path.
  • Reads a data stream using either a DataFusion backend session or pyarrow; pyarrow is mostly used for loading Cython classes (see the query sketch after this list).
  • Creates a catalog from streams of data written to feather files by the writer.py logic.
  • Provides helper methods and getters for various kinds of information.
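
For the read path, a DataFusion-backed session would look roughly like the following - the catalog layout, column name, and timestamp filter are assumptions for illustration:

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();

    // Each data type is partitioned into its own directory of Parquet files.
    let df = ctx
        .read_parquet("catalog/data/quote_tick/", ParquetReadOptions::default())
        .await?;

    // Filters are pushed down to the scan instead of decoding everything.
    let batches = df
        .filter(col("ts_event").gt_eq(lit(1_700_000_000_000_000_000u64)))?
        .collect()
        .await?;

    println!("read {} record batches", batches.len());
    Ok(())
}
```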

parquet.py is a major chunk that will benefit from being ported to Rust. The new implementation will only support PyO3-style classes, and the existing catalog implementation will be retained for as long as Cython backward compatibility is needed. Another consideration is that an alternative to fsspec will have to be found, which may not support as many filesystems (see the sketch below).
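
One candidate for that replacement is the object_store crate, which covers local files and the major cloud stores but not the long tail of fsspec backends. A minimal sketch, assuming a local store and an illustrative catalog layout (the exact put signature varies slightly across object_store versions):

```rust
use bytes::Bytes;
use object_store::{local::LocalFileSystem, path::Path, ObjectStore};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Swapping in an S3/GCS/Azure store changes only this constructor.
    let store = LocalFileSystem::new_with_prefix("catalog")?;

    let location = Path::from("data/quote_tick/part-0.parquet");
    let encoded = Bytes::from_static(b"...bytes produced by the Rust encoder...");

    store.put(&location, encoded.into()).await?;
    let read_back = store.get(&location).await?.bytes().await?;
    println!("read back {} bytes", read_back.len());
    Ok(())
}
```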

If a no-copy PyO3/FFI interface can be found for pandas DataFrames, other logic in wrangler_v2.py and loader.py can also be ported.
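
One plausible direction is the Arrow C Data Interface: with the arrow crate's pyarrow feature, a record batch crosses the PyO3 boundary without copying buffers, leaving only the pandas-to-Arrow conversion on the Python side. A sketch, where the function and module names are hypothetical:

```rust
use arrow::pyarrow::PyArrowType;
use arrow::record_batch::RecordBatch;
use pyo3::prelude::*;

// Receives a pyarrow.RecordBatch zero-copy via the Arrow C Data Interface.
// Python side (hypothetical): decode_batch(pa.RecordBatch.from_pandas(df))
#[pyfunction]
fn decode_batch(batch: PyArrowType<RecordBatch>) -> PyResult<usize> {
    let batch: RecordBatch = batch.0;
    // A real port would build data model objects column by column;
    // here we only report the row count.
    Ok(batch.num_rows())
}

#[pymodule]
fn persistence(m: &Bound<'_, PyModule>) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(decode_batch, m)?)?;
    Ok(())
}
```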


The current implementation works well; however, the recent DataFusion upgrade shows that the schemas are fragile and duplicated. By pushing more common logic to Rust, the schemas can be centralized and some duplication reduced by making use of Rust generics (sketched below). However, we should strike a balance, since pyarrow and Python allow a level of flexibility and extensibility - particularly when dealing with custom data - that is very hard to match in Rust.
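
As an illustration of the generics point, a single schema trait can serve every data type through one generic writer, instead of per-type write paths duplicated across both languages - the trait and function names here are hypothetical:

```rust
use std::io::Write;

use arrow::datatypes::SchemaRef;
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;
use parquet::errors::ParquetError;

// Single source of truth for a type's Arrow representation.
trait ArrowRepr: Sized {
    fn schema() -> SchemaRef;
    fn encode(rows: &[Self]) -> Result<RecordBatch, ArrowError>;
}

// One generic writer replaces per-type write logic.
fn write_parquet<T: ArrowRepr, W: Write + Send>(rows: &[T], sink: W) -> Result<(), ParquetError> {
    let mut writer = ArrowWriter::try_new(sink, T::schema(), None)?;
    writer.write(&T::encode(rows)?)?;
    writer.close()?;
    Ok(())
}
```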
