Port more Python catalog and persistence logic to Rust #2062

Open
twitu opened this issue Nov 17, 2024 · 0 comments
Labels: enhancement, improvement, RFC, rust

twitu (Collaborator) commented Nov 17, 2024

The persistence Python module exposes a number of useful ways to interact with data. The key modules are:

  • schema.py - Defines the pyarrow schemas for many Nautilus classes. For some of the classes - the data model - the schema is defined by the Rust implementation.
  • writer.py - Writes a stream of all serializable Nautilus objects to feather files partitioned by data type. This is particularly useful for recording live data and replaying events for debugging or analysis. For the Rust data model objects, Rust methods convert lists of objects into Arrow record batches (see the sketch after this list).
  • loader.py - Loads some data model objects from CSV and Parquet files into pandas DataFrames. It is used to read test data in TestDataProvider and is completely independent of the schemas and Rust logic.
  • wrangler_v2.py - Builds Nautilus data model objects from pandas, Arrow, and JSON. Pandas data is converted to Arrow, and Arrow record batches are decoded into lists of objects using the Rust decode_batch implementation.
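
As a rough sketch of what that Rust-side encoding looks like - the fields, schema, and function names below are illustrative, not the actual nautilus_model API:

```rust
use std::sync::Arc;

use arrow::array::{Float64Array, UInt64Array};
use arrow::datatypes::{DataType, Field, Schema, SchemaRef};
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

// Illustrative stand-in for a Nautilus data model type.
struct QuoteTick {
    bid: f64,
    ask: f64,
    ts_event: u64,
}

// The schema lives next to the type, giving a single source of truth.
fn quote_tick_schema() -> SchemaRef {
    Arc::new(Schema::new(vec![
        Field::new("bid", DataType::Float64, false),
        Field::new("ask", DataType::Float64, false),
        Field::new("ts_event", DataType::UInt64, false),
    ]))
}

// Convert a list of objects into a single Arrow record batch.
fn encode_batch(ticks: &[QuoteTick]) -> Result<RecordBatch, ArrowError> {
    RecordBatch::try_new(
        quote_tick_schema(),
        vec![
            Arc::new(Float64Array::from_iter_values(ticks.iter().map(|t| t.bid))),
            Arc::new(Float64Array::from_iter_values(ticks.iter().map(|t| t.ask))),
            Arc::new(UInt64Array::from_iter_values(ticks.iter().map(|t| t.ts_event))),
        ],
    )
}
```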

A major chunk of the functionality is the Parquet catalog, which is used to read and write data model objects to and from Parquet files:

  • Writes a mixed data stream into separate Parquet files partitioned by data type. The data is encoded to bytes using Rust logic but written to an abstract fsspec file system using pyarrow. The catalog is also updated with a mapping from data type to directory and Parquet file path.
  • Reads a data stream using either a DataFusion backend session or pyarrow; pyarrow is mostly used for loading Cython classes (see the query sketch after this list).
  • Creates a catalog from streams of data written to feather files by the writer.py logic.
  • Provides helper methods and getters for various kinds of information.
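
For the read path, a DataFusion-backed session would look roughly like the following - the catalog layout, column name, and timestamp filter are assumptions for illustration:

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();

    // Each data type is partitioned into its own directory of Parquet files.
    let df = ctx
        .read_parquet("catalog/data/quote_tick/", ParquetReadOptions::default())
        .await?;

    // Filters are pushed down to the scan instead of decoding everything.
    let batches = df
        .filter(col("ts_event").gt_eq(lit(1_700_000_000_000_000_000u64)))?
        .collect()
        .await?;

    println!("read {} record batches", batches.len());
    Ok(())
}
```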

parquet.py is a major chunk that will benefit from being ported to Rust. The new implementation will only support PyO3-style classes, and the existing catalog implementation will be retained for as long as Cython backward compatibility is needed. Another consideration is that an alternative to fsspec will have to be found, which may not support as many filesystems (see the sketch below).
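
One candidate for that replacement is the object_store crate, which covers local files and the major cloud stores but not the long tail of fsspec backends. A minimal sketch, assuming a local store and an illustrative catalog layout (the exact put signature varies slightly across object_store versions):

```rust
use bytes::Bytes;
use object_store::{local::LocalFileSystem, path::Path, ObjectStore};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Swapping in an S3/GCS/Azure store changes only this constructor.
    let store = LocalFileSystem::new_with_prefix("catalog")?;

    let location = Path::from("data/quote_tick/part-0.parquet");
    let encoded = Bytes::from_static(b"...bytes produced by the Rust encoder...");

    store.put(&location, encoded.into()).await?;
    let read_back = store.get(&location).await?.bytes().await?;
    println!("read back {} bytes", read_back.len());
    Ok(())
}
```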

If a no-copy PyO3/FFI interface can be found for pandas DataFrames, other logic in wrangler_v2.py and loader.py can also be ported.
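
One plausible direction is the Arrow C Data Interface: with the arrow crate's pyarrow feature, a record batch crosses the PyO3 boundary without copying buffers, leaving only the pandas-to-Arrow conversion on the Python side. A sketch, where the function and module names are hypothetical:

```rust
use arrow::pyarrow::PyArrowType;
use arrow::record_batch::RecordBatch;
use pyo3::prelude::*;

// Receives a pyarrow.RecordBatch zero-copy via the Arrow C Data Interface.
// Python side (hypothetical): decode_batch(pa.RecordBatch.from_pandas(df))
#[pyfunction]
fn decode_batch(batch: PyArrowType<RecordBatch>) -> PyResult<usize> {
    let batch: RecordBatch = batch.0;
    // A real port would build data model objects column by column;
    // here we only report the row count.
    Ok(batch.num_rows())
}

#[pymodule]
fn persistence(m: &Bound<'_, PyModule>) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(decode_batch, m)?)?;
    Ok(())
}
```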


The current implementation works well; however, the recent DataFusion upgrade shows that the schemas are fragile and duplicated. By pushing more common logic to Rust, the schemas can be centralized and some duplication reduced by making use of Rust generics (sketched below). However, we should strike a balance, since pyarrow and Python allow a level of flexibility and extensibility - particularly when dealing with custom data - that is very hard to match in Rust.
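
As an illustration of the generics point, a single schema trait can serve every data type through one generic writer, instead of per-type write paths duplicated across both languages - the trait and function names here are hypothetical:

```rust
use std::io::Write;

use arrow::datatypes::SchemaRef;
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;
use parquet::errors::ParquetError;

// Single source of truth for a type's Arrow representation.
trait ArrowRepr: Sized {
    fn schema() -> SchemaRef;
    fn encode(rows: &[Self]) -> Result<RecordBatch, ArrowError>;
}

// One generic writer replaces per-type write logic.
fn write_parquet<T: ArrowRepr, W: Write + Send>(rows: &[T], sink: W) -> Result<(), ParquetError> {
    let mut writer = ArrowWriter::try_new(sink, T::schema(), None)?;
    writer.write(&T::encode(rows)?)?;
    writer.close()?;
    Ok(())
}
```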
