-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Speed up top-k retrieval on filtered conjunctions. #13994
base: main
Are you sure you want to change the base?
Conversation
A while back we added an optimized bulk scorer that implements block-max AND, this yielded a good speedup on nightly benchmarks, see annotation `FP` at https://benchmarks.mikemccandless.com/AndHighHigh.html. With this PR, filtered conjunctions now also run through this optimized bulk scorer by doing two things: - It flattens inner conjunctions. This makes queries initially written as something like `+(+term1 +term2) #filter` rewritten to `+term1 +term2 #filter`. - It evaluates queries that have a mix of MUST and FILTER clauses evaluated through `BlockMaxConjunctionBulkScorer` by treating FILTER clauses as scoring clauses that produce a score of 0.
I ran luceneutil on wikibigall and the new filtered tasks from wikinightly.tasks (https://github.com/mikemccand/luceneutil/blob/main/tasks/wikinightly.tasks#L355-L390):
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The inlining and optimization makes sense to me. Minor thoughts around naming things differently so that the flattening logic is easier to follow.
for (BooleanClause clause : clauses) { | ||
if (clause.isRequired() && clause.query() instanceof BooleanQuery innerQuery) { | ||
if (innerQuery.getMinimumNumberShouldMatch() == 0 | ||
&& innerQuery.getClauses(Occur.SHOULD).isEmpty()) { | ||
actuallyRewritten = true; | ||
for (BooleanClause innerClause : innerQuery) { | ||
Occur occur = innerClause.occur(); | ||
if (occur == Occur.FILTER | ||
|| occur == Occur.MUST_NOT | ||
|| clause.occur() == Occur.MUST) { | ||
builder.add(innerClause); | ||
} else { | ||
assert clause.occur() == Occur.FILTER && occur == Occur.MUST; | ||
// In this case we need to change the occur of the inner query from MUST to FILTER. | ||
builder.add(innerClause.query(), Occur.FILTER); | ||
} | ||
} | ||
} else { | ||
builder.add(clause); | ||
} | ||
} else { | ||
builder.add(clause); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Holy smokes I had to read this like 5 times to fully grok what it is doing.
Could you name occur
to innerClauseOccur
and change clause
to outerClause
?
@jpountz don't forget CHANGES ;) but this lgtm |
A while back we added an optimized bulk scorer that implements block-max AND, this yielded a good speedup on nightly benchmarks, see annotation
FP
at https://benchmarks.mikemccandless.com/AndHighHigh.html. With this PR, filtered conjunctions now also run through this optimized bulk scorer by doing two things:+(+term1 +term2) #filter
rewritten to+term1 +term2 #filter
.BlockMaxConjunctionBulkScorer
by treating FILTER clauses as scoring clauses that produce a score of 0.