Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slow performace when using ORDER BY #430

Closed
Tracked by #438
Ryiski opened this issue Jun 4, 2023 · 2 comments · Fixed by #443
Closed
Tracked by #438

Slow performace when using ORDER BY #430

Ryiski opened this issue Jun 4, 2023 · 2 comments · Fixed by #443
Assignees
Labels
bug Something isn't working
Milestone

Comments

@Ryiski
Copy link

Ryiski commented Jun 4, 2023

I've noticed queries are running extremely slow when I try and ORDER BY on the logs

You can give it a test on Postman

Slow query:

{
    "query": "SELECT * FROM frontend ORDER BY p_timestamp LIMIT 30 OFFSET 0",
    "startTime": "2022-03-01T00:00:00Z",
    "endTime": "2023-06-03T23:59:59.999Z"
}

Fast query:

{
    "query": "SELECT * FROM frontend LIMIT 30 OFFSET 0",
    "startTime": "2022-03-01T00:00:00Z",
    "endTime": "2023-06-03T23:59:59.999Z"
}
@nitisht
Copy link
Member

nitisht commented Jun 4, 2023

Thanks @Ryiski we’ll take a look.

@nitisht nitisht added the bug Something isn't working label Jun 4, 2023
@nitisht nitisht added this to the Release 0.5.1 milestone Jun 4, 2023
@nitisht nitisht mentioned this issue Jun 20, 2023
3 tasks
@trueleo trueleo mentioned this issue Jun 25, 2023
3 tasks
@nitisht
Copy link
Member

nitisht commented Jun 25, 2023

Will be fixed by #443

nitisht pushed a commit that referenced this issue Jun 25, 2023
Write timestamp sorted metadata to parquet and provide 
external sort information to datafusion. This way the 
SortExec can be avoided in execution plan with most 
queries which use order by p_timestamp.

For example, the query
"explain select p_timestamp from {{stream_name}} 
order by p_timestamp asc"

In physical plan it is visible that SortExec is eliminated as 
output_ordering is pushed to ParquetExec node

"plan": "SortPreservingMergeExec: [p_timestamp@0 ASC NULLS LAST]
  ParquetExec: file_groups={4 groups: [.....]}, projection=[p_timestamp], 
output_ordering=[p_timestamp@0 ASC NULLS LAST]",

Note that this is still not the most optimised version of this query as 
SortPreservingExec is not really needed here. The issue here 
is that the datafusion is not aware that the partitions / files are 
non overlapping when considering timestamp

Also if the target partition limit is crossed then datafusion again 
adds SortExec to physical plan.

Fixes #430
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants