Port more Python catalog and persistence logic to Rust #2062
Labels
- `enhancement`: New feature or request
- `improvement`: Improvement to existing functionality
- `RFC`: A request for comment
- `rust`: Relating to the Rust core
The `persistence` Python module exposes a number of useful ways to interact with data. The key modules are:
- `schema.py`: defines the pyarrow schema for many Nautilus classes. For some of the classes (the data model), the schema is defined by the Rust implementation.
- `writer.py`: writes a stream of all serializable Nautilus objects to feather files partitioned by data type. This is particularly useful for recording live data and replaying events for debugging or analysis. The Rust data model objects use Rust methods to convert lists of objects into Arrow record batches.
- `loader.py`: loads some data model objects from CSV and parquet files into a pandas dataframe. It is used to read test data in `TestDataProvider`, and it is completely independent of schemas and Rust logic.
- `wrangler_v2.py`: builds Nautilus data model objects from pandas, Arrow, and JSON. Pandas is converted to Arrow, and Arrow is converted to a list of objects using the Rust `decode_batch` implementation (see the sketch after this list).
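Since `wrangler_v2.py` is a porting candidate, a minimal sketch of that pandas → Arrow → objects flow may help. The schema fields and the `dataframe_to_objects` helper below are illustrative assumptions, not the actual API; `decode_batch` stands in for the Rust-provided decoder named above.

```python
import pandas as pd
import pyarrow as pa

# An illustrative pyarrow schema of the kind schema.py defines per class;
# these field names are assumptions, not the real data model schema.
EXAMPLE_SCHEMA = pa.schema(
    [
        pa.field("bid_price", pa.int64()),
        pa.field("ask_price", pa.int64()),
        pa.field("ts_event", pa.uint64()),
        pa.field("ts_init", pa.uint64()),
    ]
)

def dataframe_to_objects(df: pd.DataFrame, schema: pa.Schema, decode_batch) -> list:
    # pandas -> Arrow: build a record batch conforming to the expected schema.
    batch = pa.RecordBatch.from_pandas(df, schema=schema, preserve_index=False)
    # Arrow -> objects: delegate to the Rust `decode_batch` implementation,
    # which returns a list of data model objects.
    return decode_batch(batch)
```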
A major chunk of the functionality is the Parquet catalog. It is used to read and write data model objects to and from parquet files:
- It writes data model objects to an `fsspec` file system using `pyarrow`. The catalog is also updated with a mapping from data type to directory and parquet file path.
- It uses `pyarrow` to read the data. Pyarrow is mostly used for loading Cython classes.
- It reuses `writer.py` logic.
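For the catalog itself, here is a hedged sketch of the write/read pattern described above, using generic `fsspec` and `pyarrow.parquet` calls. The function names and flat path layout are assumptions for illustration, not the actual `parquet.py` API.

```python
import fsspec
import pyarrow as pa
import pyarrow.parquet as pq

# Any fsspec-supported backend works here, e.g. "file", "s3", "gcs".
fs = fsspec.filesystem("file")

def write_objects(table: pa.Table, path: str) -> None:
    # Write one data type's data as a parquet file; the catalog also
    # records the mapping from data type to directory and file path.
    with fs.open(path, "wb") as f:
        pq.write_table(table, f)

def read_objects(path: str) -> pa.Table:
    # Read the parquet data back with pyarrow; in the current catalog the
    # resulting table is mostly used for loading Cython classes.
    with fs.open(path, "rb") as f:
        return pq.read_table(f)
```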
`parquet.py` is a major chunk that will benefit from being ported to Rust. The new implementation will only support PyO3-style classes. The existing catalog implementation will be retained for as long as Cython backward compatibility is needed. Another consideration is that an alternative to `fsspec` will have to be found, which may not support as many filesystems.

If a no-copy PyO3/FFI interface can be found for pandas dataframes, other logic in `wrangler_v2.py` and `loader.py`
can also be ported.

The current implementation works well; however, the recent datafusion upgrade shows that the schemas are fragile and there is duplication. By pushing more common logic to Rust, the schemas can be centralized and some duplication reduced by making use of Rust generics. However, we should strike a balance, since `pyarrow` and Python allow a level of flexibility and extensibility, particularly when dealing with custom data, that is very hard to match in Rust.
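As one concrete example of that flexibility, a pyarrow schema for custom data can be declared and used entirely at runtime. The data type below is purely hypothetical.

```python
import pyarrow as pa

# A hypothetical user-defined data type gets a schema declared on the fly...
custom_schema = pa.schema(
    [
        pa.field("instrument_id", pa.string()),
        pa.field("funding_rate", pa.float64()),
        pa.field("ts_event", pa.uint64()),
    ]
)

# ...and record batches built against it immediately, with no Rust changes
# or recompilation required.
batch = pa.record_batch(
    [
        pa.array(["BTCUSDT-PERP.BINANCE"]),
        pa.array([0.0001], type=pa.float64()),
        pa.array([1_700_000_000_000_000_000], type=pa.uint64()),
    ],
    schema=custom_schema,
)
```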