Write timestamp-sorted metadata to parquet and provide
external sort information to DataFusion. This way,
SortExec can be avoided in the execution plan for most
queries that use ORDER BY p_timestamp.
For example, for the query
"explain select p_timestamp from {{stream_name}}
order by p_timestamp asc"
the physical plan shows that SortExec is eliminated, because
output_ordering is pushed down to the ParquetExec node:
"plan": "SortPreservingMergeExec: [p_timestamp@0 ASC NULLS LAST]
ParquetExec: file_groups={4 groups: [.....]}, projection=[p_timestamp],
output_ordering=[p_timestamp@0 ASC NULLS LAST]",
Note that this is still not the most optimised version of this query, as
SortPreservingMergeExec is not really needed here. The issue
is that DataFusion is not aware that the partitions / files are
non-overlapping with respect to timestamp.
Also, if the target partition limit is exceeded, DataFusion
adds SortExec back to the physical plan.
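
To illustrate the point about non-overlapping files, here is a small conceptual sketch in plain Rust (standard library only). It is not DataFusion code and does not reflect how SortPreservingMergeExec is implemented; the function names `k_way_merge` and `concat_non_overlapping` and the use of `i64` values as timestamps are assumptions made for this example. It shows that when each partition is already sorted and the timestamp ranges do not overlap, ordering the partitions by their first timestamp and concatenating them gives the same result as a full merge.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// General case: k-way merge of individually sorted partitions.
/// Conceptually, this is the work a sort-preserving merge has to do
/// when it knows nothing about how the partitions relate to each other.
fn k_way_merge(partitions: &[Vec<i64>]) -> Vec<i64> {
    // Min-heap of (value, partition index, position within partition).
    let mut heap = BinaryHeap::new();
    for (idx, part) in partitions.iter().enumerate() {
        if let Some(&first) = part.first() {
            heap.push(Reverse((first, idx, 0usize)));
        }
    }
    let mut out = Vec::new();
    while let Some(Reverse((value, idx, pos))) = heap.pop() {
        out.push(value);
        if let Some(&next) = partitions[idx].get(pos + 1) {
            heap.push(Reverse((next, idx, pos + 1)));
        }
    }
    out
}

/// Special case: every partition is sorted AND the timestamp ranges of the
/// partitions do not overlap. Ordering partitions by their first timestamp
/// and concatenating is then enough; no per-row comparison is needed.
fn concat_non_overlapping(partitions: &[Vec<i64>]) -> Vec<i64> {
    let mut ordered: Vec<&Vec<i64>> =
        partitions.iter().filter(|p| !p.is_empty()).collect();
    ordered.sort_by_key(|p| p[0]);
    ordered.into_iter().flatten().copied().collect()
}
```

For partitions such as `[30, 40]`, `[10, 20]`, `[50, 60]` both functions return the same sorted sequence. With overlapping partitions such as `[1, 5]` and `[2, 6]` only the merge stays correct, which is why a planner that cannot prove non-overlap has to keep the merge step.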
Fixes #430
I've noticed queries run extremely slowly when I try to
ORDER BY
on the logs. You can test it in Postman.
Slow query:
Fast query: