[Parquet][Python]: Reading subset by feeding data asynchronously to parquet parser #45352
There doesn't seem to be any way in the IO interfaces to use async code to feed data into the Parquet parser. However, there is a hacky workaround that seems to work: create an anonymous mmap and feed the parser through it:
Is there any other way to feed data into the file parser externally? Using an anonymous mmap for this feels hacky. There are the IO interfaces, but none of them are suitable for async code. Also, is there a better way to get the file offsets based on the filter expression than the approach above?
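For readers unfamiliar with the idea, here is a minimal standard-library sketch of the anonymous mmap approach. It is not the snippet from this report. It assumes the needed byte ranges are already known, and `fetch_range` and `fill_sparse_buffer` are hypothetical names, with `fetch_range` standing in for a real async ranged read.

```python
import asyncio
import mmap


async def fetch_range(source: bytes, offset: int, length: int) -> bytes:
    """Hypothetical stand-in for an async ranged read (e.g. an HTTP Range request)."""
    await asyncio.sleep(0)  # yield to the event loop, as real network IO would
    return source[offset:offset + length]


async def fill_sparse_buffer(
    source: bytes, total_size: int, ranges: list[tuple[int, int]]
) -> mmap.mmap:
    """Fetch the (offset, length) ranges concurrently and copy them into an anonymous mmap."""
    buf = mmap.mmap(-1, total_size)  # anonymous mapping, zero-filled, same size as the file
    chunks = await asyncio.gather(
        *(fetch_range(source, off, n) for off, n in ranges)
    )
    for (off, _), chunk in zip(ranges, chunks):
        buf[off:off + len(chunk)] = chunk  # place each chunk at its original file offset
    return buf


# Pretend `data` is a remote 1024-byte file and only two ranges are needed.
data = bytes(range(256)) * 4
buf = asyncio.run(fill_sparse_buffer(data, len(data), [(0, 4), (1000, 24)]))
```

For a real Parquet file the ranges would have to cover at least the footer and the column chunks of the selected row groups. The filled buffer could then presumably be wrapped for the parser, for example with `pyarrow.BufferReader(pyarrow.py_buffer(buf))`; that last step is an assumption and is not exercised in this sketch.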
We cannot rely on `ThreadPoolExecutor` (or `ProcessPoolExecutor`) for the blocking IO. The processing is heavily IO bound with a very high level of concurrency, and most of the time goes into waiting for IO, so we cannot afford to tie up threads just to wait. With pure async IO code we can handle a much higher level of concurrency, because the work is not bound by the Parquet parsing.

Component(s)
Python, Parquet