bf: search with aggregate instead of find #1213

Open
wants to merge 1 commit into
base: development/8.1

vrancurel
Contributor

The search was using find().sort(), which disrupted user-defined
search queries and custom indexes. The sort() is needed to implement
a stateless paging system. The combination of user-defined query and sort is
now implemented with a two-stage aggregate on the server side.
We always limit the execution time (maxTimeMs) to 5 minutes (tunable by an
environment variable).

Note that the number of concurrent search queries is not limited for now (and it should be). We know that the aggregate will put an additional burden on the primary, so micro-sharding shall be implemented to divide and conquer. But this is out of scope for this PR so far.
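
For readers following along, here is a minimal sketch of how the two-stage aggregate and its time limit could be assembled. It is not the code from this PR: the function name `buildSearchAggregate`, the environment variable name `SEARCH_MAX_TIME_MS`, and the shape of the paging query are assumptions made for illustration. Only the two `$match` stages and the `allowDiskUse` option come from the diff under review.

```javascript
// Sketch only: buildSearchAggregate and SEARCH_MAX_TIME_MS are hypothetical
// names, not taken from the PR.
const DEFAULT_MAX_TIME_MS = 5 * 60 * 1000; // 5 minutes

function buildSearchAggregate(searchOptions, pagingQuery, env = process.env) {
    // Fall back to the default when the variable is unset or not a
    // positive integer.
    const parsed = Number.parseInt(env.SEARCH_MAX_TIME_MS, 10);
    const maxTimeMS = Number.isInteger(parsed) && parsed > 0
        ? parsed : DEFAULT_MAX_TIME_MS;
    return {
        pipeline: [
            { $match: searchOptions }, // user query
            { $match: pagingQuery }, // for paging
        ],
        options: {
            maxTimeMS, // server-side execution time limit
            allowDiskUse: true, // stage large queries on disk
        },
    };
}
```

With the Node.js driver, the result would be passed along as `collection.aggregate(pipeline, options)`.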

@bert-e
Contributor

bert-e commented Jul 15, 2020

Hello vrancurel,

My role is to assist you with the merge of this
pull request. Please type @bert-e help to get information
on this process, or consult the user documentation.

Status report is not available.

@bert-e
Contributor

bert-e commented Jul 15, 2020

Waiting for approval

The following approvals are needed before I can proceed with the merge:

  • the author

  • 2 peers

Peer approvals must include a mandatory approval from @jonathan-gramain.

@vrancurel vrancurel force-pushed the bugfix/ZENKO-2633-search-with-aggregates branch from f6de7b0 to cecd214 Compare July 15, 2020 19:22
Collaborator

@rahulreddy rahulreddy left a comment

LGTM
Were you able to measure the impact (CPU, elapsed_ms, index usage, etc.) of using this aggregate?

Comment on lines 64 to 65
{ $match: searchOptions }, // user query
{ $match: query }, // for paging
Contributor

@alexanderchan-scality alexanderchan-scality Jul 15, 2020

Was wondering if the pipeline would benefit from first filtering by the indexed _id, followed by the { $match: searchOptions }, which can possibly involve a full collection scan.
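
To make the suggestion concrete, here is a small sketch of the alternative stage order, plus an equivalent single-stage form. Both helper names are hypothetical. As far as I know, MongoDB's pipeline optimizer can coalesce adjacent `$match` stages into one `$match` with `$and`, so the order of two back-to-back `$match` stages may matter less than it looks; that is worth confirming with `explain` rather than taking on faith.

```javascript
// Hypothetical helpers illustrating the suggested ordering; not PR code.

// Paging filter on the indexed _id first, then the user query.
function buildIdFirstPipeline(pagingQuery, searchOptions) {
    return [
        { $match: pagingQuery }, // indexed _id filter
        { $match: searchOptions }, // user query on the narrowed set
    ];
}

// Single-stage form combining both conditions with $and.
function buildCoalescedMatch(pagingQuery, searchOptions) {
    return [{ $match: { $and: [pagingQuery, searchOptions] } }];
}
```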

Contributor Author

What we want is to first run the user-defined query (searchOptions), then save the result on disk (the result set is potentially larger than 100MB), then sort this result set.

Contributor Author

But I think we need to add a $out stage that writes to a temporary collection, and obtain a cursor on this temp collection instead.
Later on, a sweeper shall delete temp collections.
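
A rough sketch of that idea, assuming the user query is staged with `$out` and pages are then read from the temporary collection by `_id`. The helper names, the `marker` and `maxKeys` parameters, and the paging scheme are assumptions for illustration, not the PR's implementation.

```javascript
// Hypothetical helpers; names and paging scheme are assumptions.

// Stage the user query into a temporary collection.
function buildStagedSearchPipeline(searchOptions, tempCollectionName) {
    return [
        { $match: searchOptions }, // user query
        { $out: tempCollectionName }, // $out must be the last stage
    ];
}

// Filter and options for reading one page from the temporary collection.
function buildPageRequest(marker, maxKeys) {
    return {
        // Resume strictly after the last key of the previous page.
        filter: marker ? { _id: { $gt: marker } } : {},
        options: { sort: { _id: 1 }, limit: maxKeys },
    };
}
```

The page request would be used as `tempCollection.find(filter, options)`. Because the state lives in the marker, the client-facing paging stays stateless.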

@vrancurel vrancurel force-pushed the bugfix/ZENKO-2633-search-with-aggregates branch 2 times, most recently from 6401dc7 to c5ede0a Compare July 16, 2020 02:55
@vrancurel
Contributor Author

I did not find a way to $out in a separate namespace. So we will need to modify the oplog to filter out __search collections.

@vrancurel vrancurel force-pushed the bugfix/ZENKO-2633-search-with-aggregates branch from c5ede0a to 47c18d6 Compare July 16, 2020 15:59
@vrancurel
Contributor Author

We should also include the date of the query when computing the cache hash.
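
One possible shape for such a cache hash, as a sketch. The function names and the payload layout are hypothetical; only the `__search` prefix comes from this thread. Note that `JSON.stringify` is sensitive to key order, so two logically identical queries written with different key orders would hash differently unless the query is normalized first.

```javascript
const crypto = require('crypto');

// Hypothetical helpers; the payload layout is an assumption.

// Hash the bucket, the user query and the query date into a cache key.
function searchCacheKey(bucketName, searchOptions, queryDate) {
    const payload = JSON.stringify({
        bucket: bucketName,
        query: searchOptions,
        date: queryDate.toISOString(),
    });
    return crypto.createHash('sha256').update(payload).digest('hex');
}

// Name of the temporary collection holding the staged results.
function searchCollectionName(bucketName, searchOptions, queryDate) {
    const key = searchCacheKey(bucketName, searchOptions, queryDate);
    return `__search_${key}`;
}
```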

allowDiskUse: true, // stage large queries on disk
},
null);
_cursor.toArray(err => {
Contributor

@alexanderchan-scality alexanderchan-scality Jul 16, 2020

This can possibly lead to running out of memory if the aggregate result becomes too large.

Contributor Author

No, the result is empty because there is a $out.

Contributor Author

But one problem with this approach is that it returns nothing until the query is done. This is weird from the API client's perspective. Any suggestions?

Collaborator

@rahulreddy rahulreddy Jul 16, 2020

If it's streaming the output, maybe use a combined approach: stream 1000 documents back to the client and start a background async job that writes to the temporary collection.
https://mongodb.github.io/node-mongodb-native/3.6/reference/cursors/#stream-api
It sounds complex and dirty, ideally I would delegate the task to some other worker but I can't see how.

Contributor Author

Yes, we could basically run the user query in an aggregate AND in a find() at the same time. But that would execute the query twice... just to avoid sessions, that's a bit too much...

Contributor Author

Actually the $sort here is also killing the query plan, and the good news is ... no need to do the $sort because in the $out collection the keys will already be sorted...

@vrancurel vrancurel force-pushed the bugfix/ZENKO-2633-search-with-aggregates branch from 47c18d6 to 5132cdb Compare July 18, 2020 02:31
// fallthrough
// eslint-disable-next-line
params.searchOptions = searchOptions;
return this.internalListObject(
Contributor Author

For the first page it reverts to regular search (as before).

Contributor Author

Need to also limit it.

@vrancurel vrancurel force-pushed the bugfix/ZENKO-2633-search-with-aggregates branch from 5132cdb to 48ef858 Compare July 18, 2020 03:08
The search was using find().sort(), which disrupted user-defined
search queries and custom indexes. The sort() is needed to implement
a stateless paging system. The combination of user-defined query and sort is
now implemented with a two-stage aggregate on the server side.
We always limit the execution time (maxTimeMs) to 5 minutes (tunable by an
environment variable).
The result is staged in a temporary bucket and cached for paging.
We rely on an external job to clean up the searches (e.g. daily).
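
As an illustration of what such an external cleanup job might look like, here is a sketch that assumes the staged results live in collections whose names start with `__search`, as mentioned earlier in this thread. The function name and the caller-supplied `isExpired` predicate are hypothetical; how expiry is actually decided is not specified in this PR. The `db` argument is expected to behave like a Node.js driver `Db` object (`listCollections` and `dropCollection`).

```javascript
// Sketch of a sweeper; sweepSearchCollections and isExpired are
// hypothetical, and the expiry policy is left to the caller.
const SEARCH_PREFIX = '__search';

async function sweepSearchCollections(db, isExpired) {
    // Only names are needed, so skip the full collection info.
    const infos = await db.listCollections({}, { nameOnly: true }).toArray();
    const dropped = [];
    for (const { name } of infos) {
        // Never touch regular buckets; only staged search results.
        if (name.startsWith(SEARCH_PREFIX) && await isExpired(name)) {
            await db.dropCollection(name);
            dropped.push(name);
        }
    }
    return dropped;
}
```

Such a job could run daily from a scheduler outside the request path, so that search requests never pay for the cleanup.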
@vrancurel vrancurel force-pushed the bugfix/ZENKO-2633-search-with-aggregates branch from 48ef858 to d8e51a6 Compare July 20, 2020 21:38
4 participants