[Parquet][Python]: Reading subset by feeding data asynchronously to parquet parser #45352

Open
MarkusSintonen opened this issue Jan 26, 2025 · 1 comment
Labels: Component: Python; Type: usage (Issue is a user question)

MarkusSintonen commented Jan 26, 2025

There doesn't seem to be any way in the IO interfaces to feed data into the Parquet parser from async code. However, there is a hacky workaround that appears to work: allocate an anonymous mmap and feed the parser through it:

import mmap
from typing import Protocol

from pyarrow import Table
from pyarrow.dataset import Expression, ParquetFileFormat
from pyarrow.parquet import ColumnChunkMetaData


class MyAsyncReader(Protocol):
    async def parquet_size(self) -> int: ...  # Size stored separately when writing elsewhere
    async def parquet_meta(self) -> bytes: ...  # Metadata stored separately when writing elsewhere
    async def parquet_data(self, start_offset: int, end_offset: int) -> bytes: ...


async def query(reader: MyAsyncReader, filter: Expression) -> Table:
    size = await reader.parquet_size()
    meta = await reader.parquet_meta()

    # Zero-filled buffer as large as the whole file (these flags are Unix-only)
    anon_mmap = mmap.mmap(-1, size, flags=mmap.MAP_ANONYMOUS | mmap.MAP_PRIVATE)
    try:
        anon_mmap.seek(size - len(meta))  # Meta to tail
        anon_mmap.write(meta)

        # Only the metadata is needed to prune row groups with the filter
        frag = ParquetFileFormat().make_fragment(anon_mmap).subset(filter)

        # Assumes the filter matches at least one row group, and that all selected
        # column chunks lie between the first chunk of the first row group and the
        # last chunk of the last row group
        first_row_col = frag.row_groups[0].metadata.column(0)
        last_row_col = frag.row_groups[-1].metadata.column(frag.metadata.num_columns - 1)
        start_offset = offset(first_row_col)
        end_offset = offset(last_row_col) + last_row_col.total_compressed_size

        anon_mmap.seek(start_offset)
        anon_mmap.write(await reader.parquet_data(start_offset, end_offset))  # Feed needed data for parser

        return frag.to_table()  # Parse the subset of row groups
    finally:
        anon_mmap.close()


def offset(meta: ColumnChunkMetaData) -> int:
    # Start of the column chunk: the dictionary page if there is one, else the first data page
    return (
        min(meta.dictionary_page_offset, meta.data_page_offset)  # Is there a better way to get this?
        if meta.dictionary_page_offset is not None
        else meta.data_page_offset
    )
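
The buffer-filling part of the workaround can be shown in isolation, without pyarrow. The following is a minimal sketch using only the standard library: `FakeRemoteFile` and `fill_sparse_buffer` are hypothetical names made up for illustration, and the byte ranges are arbitrary. It only demonstrates copying asynchronously fetched ranges into a zero-filled anonymous mapping at their original offsets, which is what the parser then reads from.

```python
import asyncio
import mmap


class FakeRemoteFile:
    """Hypothetical stand-in for an async byte-range reader (not a pyarrow API)."""

    def __init__(self, payload: bytes) -> None:
        self._payload = payload

    async def read_range(self, start: int, end: int) -> bytes:
        await asyncio.sleep(0)  # Pretend to wait on the network
        return self._payload[start:end]


async def fill_sparse_buffer(
    remote: FakeRemoteFile, size: int, ranges: list[tuple[int, int]]
) -> bytes:
    # Anonymous mapping: zero-filled and as long as the remote file
    buf = mmap.mmap(-1, size)
    try:
        # Fetch all ranges concurrently, then copy each to its original offset
        chunks = await asyncio.gather(*(remote.read_range(s, e) for s, e in ranges))
        for (start, _end), chunk in zip(ranges, chunks):
            buf[start:start + len(chunk)] = chunk
        return buf[:]  # Copy the contents out before closing the mapping
    finally:
        buf.close()


payload = bytes(range(64))
result = asyncio.run(
    fill_sparse_buffer(FakeRemoteFile(payload), len(payload), [(8, 16), (56, 64)])
)
```

Bytes outside the fetched ranges stay zero, so this only works if the consumer never reads them, which is the same assumption the workaround above relies on.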

Is there any other way to feed data into the file parser externally? Using an anonymous mmap for this feels hacky. There are the IO interfaces, but none of them are suitable for async code. Also, is there a better way to get the file offsets for a filter expression than the approach above?
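
On the offset question, one possible variation is to take the minimum start and maximum end over every selected column chunk instead of assuming the first and last chunks bound the range. This is only a sketch: `ChunkInfo` is a hypothetical stand-in that mirrors three attribute names of `ColumnChunkMetaData` so the example runs without pyarrow, and `covering_range` is an illustrative helper, not a library function.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class ChunkInfo:
    """Hypothetical stand-in mirroring three ColumnChunkMetaData attributes."""

    data_page_offset: int
    total_compressed_size: int
    dictionary_page_offset: Optional[int] = None


def chunk_start(chunk: ChunkInfo) -> int:
    # Same rule as offset() above: use the dictionary page offset when present
    if chunk.dictionary_page_offset is None:
        return chunk.data_page_offset
    return min(chunk.dictionary_page_offset, chunk.data_page_offset)


def covering_range(chunks: Iterable[ChunkInfo]) -> tuple[int, int]:
    """Smallest [start, end) byte range containing every chunk, in any order."""
    spans = [(chunk_start(c), chunk_start(c) + c.total_compressed_size) for c in chunks]
    if not spans:
        raise ValueError("no column chunks selected")
    return min(s for s, _ in spans), max(e for _, e in spans)


chunks = [
    ChunkInfo(data_page_offset=120, total_compressed_size=80, dictionary_page_offset=100),
    ChunkInfo(data_page_offset=4, total_compressed_size=96),
]
byte_range = covering_range(chunks)
```

With real metadata, the chunks would come from iterating the columns of each row group in `frag.row_groups`. The empty case is raised explicitly because a filter may match no row groups at all.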

We cannot rely on ThreadPoolExecutor (or ProcessPoolExecutor) for the blocking IO. The processing is heavily IO bound with a very high level of concurrency, and most of the time goes into waiting for IO, so we cannot tie up threads just to wait. Pure async IO code can handle a much higher level of concurrency, because the work is not bound by the Parquet parsing.
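
To illustrate the concurrency argument, here is a small standard-library sketch in which thousands of simulated IO waits are in flight on a single event loop without any extra threads. `fake_fetch` and `run_many` are hypothetical names, and `asyncio.sleep` stands in for a real network read.

```python
import asyncio
import threading


async def fake_fetch(i: int) -> int:
    await asyncio.sleep(0.01)  # Simulated IO wait; no thread is blocked here
    return i


async def run_many(n: int) -> tuple[int, int]:
    # All n waits overlap on the one event-loop thread
    results = await asyncio.gather(*(fake_fetch(i) for i in range(n)))
    return len(results), threading.active_count()


threads_before = threading.active_count()
completed, threads_during = asyncio.run(run_many(5000))
```

A thread pool would need one worker per outstanding wait to reach the same level of overlap, which is the cost the paragraph above is trying to avoid.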

Component(s)

Python, Parquet

MarkusSintonen added the "Type: usage" (Issue is a user question) label Jan 26, 2025
@MarkusSintonen
Author

ParquetFileFragment has a buffer property, but it seems to be None and cannot be changed to point to the pre-read data. :/
