bf: search with aggregate instead of find #1213
base: development/8.1
Conversation
Hello vrancurel, my role is to assist you with the merge of this pull request. Status report is not available.
Waiting for approval: the following approvals are needed before I can proceed with the merge.
Peer approvals must include a mandatory approval from @jonathan-gramain.
Force-pushed from f6de7b0 to cecd214
LGTM
Were you able to measure the impact (cpu, elapsed_ms, index usage etc) of using this aggregate?
{ $match: searchOptions }, // user query
{ $match: query }, // for paging
I was wondering if the pipeline would benefit from first filtering by the indexed _id, followed by the { $match: searchOptions }, which can possibly involve a full collection scan.
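As a sketch of this suggestion (function and variable names here are hypothetical, not from the PR), the cheap indexed paging filter would come first so the planner can use the _id index, narrowing the set the user query has to scan:

```javascript
// Hypothetical sketch: run the indexed paging filter before the
// user-defined search filter, so only the narrowed key range is scanned.
function buildSearchPipeline(searchOptions, pagingQuery) {
    return [
        { $match: pagingQuery },    // paging filter on indexed _id (fast)
        { $match: searchOptions },  // user query, applied to the narrowed set
    ];
}
```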
What we want is to first run the user-defined query (searchOptions), then save the result on disk (the result set can be larger than 100MB), then sort this result set.
But I think we need to add a $out stage into a temporary collection and obtain a cursor on this temp collection instead.
Later on, a sweeper shall delete the temp collections.
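A sketch of what that pipeline could look like, with a hypothetical temp-collection naming scheme (the `__search_` prefix and `searchId` parameter are assumptions for illustration, not the PR's actual names):

```javascript
// Hypothetical sketch: materialize the (possibly >100MB) result set into a
// temporary collection via $out instead of returning it inline; a periodic
// sweeper can later drop collections carrying this prefix.
const TMP_PREFIX = '__search_';

function buildOutPipeline(searchOptions, searchId) {
    return [
        { $match: searchOptions },            // user-defined query
        { $sort: { _id: 1 } },                // stateless paging order
        { $out: `${TMP_PREFIX}${searchId}` }, // stage the results on disk
    ];
}
```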
Force-pushed from 6401dc7 to c5ede0a
I did not find a way to $out in a separate namespace. So we will need to modify the oplog to filter out __search collections.
Force-pushed from c5ede0a to 47c18d6
We should also include the date of the query when computing the cache hash.
allowDiskUse: true, // stage large queries on disk
},
null);
_cursor.toArray(err => {
This can possibly lead to out-of-memory if the aggregate result becomes too large.
No, the result is empty because there is a $out.
But one problem with this approach is that it returns nothing until the query is done, which is weird from the API client's perspective. Any suggestions?
If it's streaming the output, maybe use a combined approach of streaming 1000 documents back to the client while starting a background async job that writes to the temporary collection:
https://mongodb.github.io/node-mongodb-native/3.6/reference/cursors/#stream-api
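This combined approach could be sketched roughly as follows (the function and its parameters are hypothetical; the writer callback stands in for the background job that would persist overflow documents to the temporary collection):

```javascript
// Hypothetical sketch: forward the first `firstPageSize` documents to the
// API client immediately, and hand the remainder of the stream to a
// background writer targeting the temporary collection.
async function streamWithSpill(cursor, firstPageSize, writeToTemp) {
    const firstPage = [];
    for await (const doc of cursor) {
        if (firstPage.length < firstPageSize) {
            firstPage.push(doc);      // returned to the client right away
        } else {
            await writeToTemp(doc);   // spilled for later pages
        }
    }
    return firstPage;
}
```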
It sounds complex and dirty; ideally I would delegate the task to some other worker, but I can't see how.
Yes, we could basically run the user query in an aggregate AND in a find() at the same time. But that would execute the query twice... just to avoid sessions, that's a bit too much...
Actually the $sort here is also killing the query plan, and the good news is... there is no need for the $sort, because in the $out collection the keys will already be sorted...
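Following that observation, the pipeline could drop the explicit $sort and rely on reading the $out collection back in key order through its default _id index. A sketch, with hypothetical names:

```javascript
// Hypothetical sketch: no $sort stage in the aggregate (it defeats the
// query plan); pages are instead read from the temp collection sorted by
// the default _id index.
function buildUnsortedOutPipeline(searchOptions, tmpName) {
    return [
        { $match: searchOptions }, // user-defined query only
        { $out: tmpName },         // staged result, indexed on _id
    ];
}

function pageQuery(afterKey) {
    // read a page from the temp collection in _id order
    return {
        filter: afterKey ? { _id: { $gt: afterKey } } : {},
        sort: { _id: 1 },
    };
}
```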
Force-pushed from 47c18d6 to 5132cdb
// fallthrough
// eslint-disable-next-line
params.searchOptions = searchOptions;
return this.internalListObject(
For the first page it reverts to regular search (as before).
We also need to limit it.
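A sketch of what limiting the first (regular find) page could look like; the helper name and the maxKeys + 1 truncation-detection convention are assumptions for illustration, not the PR's actual code:

```javascript
// Hypothetical sketch: cap the first page at maxKeys + 1 documents, the
// extra key telling the listing whether a next page exists and paging via
// the cached aggregate results should kick in.
function firstPageFindParams(searchOptions, maxKeys) {
    return {
        filter: searchOptions,
        sort: { _id: 1 },    // stateless paging order
        limit: maxKeys + 1,  // one extra key to detect truncation
    };
}
```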
Force-pushed from 5132cdb to 48ef858
The search was using find().sort(), which was disrupting user-defined search queries and custom indexes. The sort() is needed to implement a stateless paging system. The combination of user-defined query and sort is now implemented with a 2-stage aggregate on the server side. We always limit the execution time (maxTimeMs) to 5 minutes (tunable by an environment variable). The result is staged in a temporary bucket and cached for paging. We rely on an external job to clean up the searches (e.g. daily).
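The tunable time limit described above could be read roughly as follows (the environment variable name and helper are assumptions for illustration; only the 5-minute default comes from the description):

```javascript
// Hypothetical sketch: cap aggregate execution time at 5 minutes by
// default, tunable through an environment variable (name assumed here).
const DEFAULT_MAX_TIME_MS = 5 * 60 * 1000; // 5 minutes

function getMaxTimeMs(env) {
    const v = parseInt(env.SEARCH_MAX_TIME_MS, 10);
    return Number.isNaN(v) ? DEFAULT_MAX_TIME_MS : v;
}
```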
Force-pushed from 48ef858 to d8e51a6
Note that the number of concurrent search queries is for now not limited (and it should be). We know that the aggregate will put an additional burden on the primary, so micro-sharding shall be implemented to divide and conquer. But that is out of scope for this PR.