Update lastDoc in ScoreCachingWrappingScorer #13987

msfroh · 2024-11-12T00:13:49Z

Description

I noticed that ScoreCachingWrappingScorer never updates lastDoc, so it's always -1. Technically, it's probably fine, since it still ends up returning the same score for multiple score() calls between collect calls, but I think this change better reflects the intended logic. (In particular, if the same doc was somehow collected multiple times, then the score would get recalculated.)


A couple of unit tests seemed to rely on scores not being cached for subsequent collect calls with the same doc id.

mikemccand · 2024-11-12T11:52:15Z

Wow, good catch @msfroh. Could we maybe add a new test case that explicitly confirms that the wrapped Scorable's score method is indeed only called once even if the outer user calls .score() multiple times on the same docID? This seems to be the primary purpose of this class, and it was failing this, for a very long time!

mikemccand

Phew, this was a great catch of a long-standing bug ... I wonder how much net performance was lost across all Lucene applications that were using this class thinking it actually saved CPU by caching the score like it claims it does. Perhaps this class used to work and then a refactoring (maybe the addition of Scorable?) accidentally introduced the bug ...

The change looks great to me, and it's sweet you also then found two separate pre-existing test bugs. I'm hoping we can also add a dedicated test case confirming this class is actually doing its one job.

mikemccand · 2024-11-12T11:50:46Z

lucene/core/src/test/org/apache/lucene/search/TestPositiveScoresOnlyCollector.java

-      ac.collect(0);
+    int docId;
+    while ((docId = s.iterator().nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
+      ac.collect(docId);


Wow, good catch! So this was a latent pre-existing test bug, and with your above bug fix this test is now (correctly!) failing, and with your fix to this separate test bug, the test now passes?

mikemccand · 2024-11-12T11:51:24Z

lucene/core/src/test/org/apache/lucene/search/TestTopFieldCollector.java

@@ -359,7 +359,7 @@ public void testTotalHitsWithScore() throws Exception {
      leafCollector.collect(1);

      scorer.score = 4;
-      leafCollector.collect(1);
+      leafCollector.collect(2);


Another pre-existing test bug fixed?

With the change, the call to leafCollector.collect(1) would get a score of 3 (since it's cached for document 1), instead of the 4 that the test expected.

jpountz

Thanks for catching this! This suggests that the caching of scores did not actually work. Can you add a test that fails when a score is computed twice for the same doc even though ScoreCachingWrappingScorer is used?

msfroh · 2024-11-12T19:00:33Z

Luckily, this is a Lucene 10-only bug (from when docId() was removed from Scorable).

I came across it when updating OpenSearch to support Lucene 10 and needed to refactor some code that used ScoreCachingWrappingScorer.

I'll add the extra unit test to cover this.

msfroh · 2024-11-13T00:47:19Z

I think the caching worked fine, albeit in a funny way.

When you called collect() with any valid doc ID, it invalidated the cache (currDoc became that doc ID while lastDoc stayed at -1), so the next score() call computed the score. After the score was computed, currDoc was reset to -1. Repeated calls to score() (until the next collect()) then reused the cached value, because currDoc and lastDoc were both -1.

As long as you don't call collect() on the same document multiple times (which you probably shouldn't be doing anyway -- those existing unit tests were a little weird), it should be okay.
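To make the behaviour described above concrete, here is a minimal sketch that models the two variants of the assignment side by side. It is a simplified model, not the Lucene source: the class names CountingSource, CachingModel and CachingDemo, the field name curDoc, and the `fixed` flag are all hypothetical and exist only for illustration.

```java
// Hypothetical stand-in for the wrapped scorer; it counts how often it is asked for a score.
class CountingSource {
  int calls;

  float score() {
    calls++;
    return 1.0f;
  }
}

// Simplified model of the caching logic discussed in this thread (an assumption, not Lucene code).
class CachingModel {
  private final CountingSource in;
  private final boolean fixed; // true: lastDoc = curDoc; false: curDoc = lastDoc
  private int lastDoc = -1;
  private int curDoc = -1;
  private float curScore;

  CachingModel(CountingSource in, boolean fixed) {
    this.in = in;
    this.fixed = fixed;
  }

  // Models what the wrapping leaf collector does when a doc is collected.
  void collect(int doc) {
    curDoc = doc;
  }

  float score() {
    if (lastDoc != curDoc) {
      curScore = in.score();
      if (fixed) {
        lastDoc = curDoc; // remember which doc the cached score belongs to
      } else {
        curDoc = lastDoc; // resets curDoc to -1; lastDoc never changes
      }
    }
    return curScore;
  }
}

class CachingDemo {
  // Collects each doc and calls score() three times per collect; returns the delegate call count.
  static int delegateCalls(boolean fixed, int... docs) {
    CountingSource source = new CountingSource();
    CachingModel model = new CachingModel(source, fixed);
    for (int doc : docs) {
      model.collect(doc);
      model.score();
      model.score();
      model.score();
    }
    return source.calls;
  }

  public static void main(String[] args) {
    System.out.println(delegateCalls(false, 0, 1, 2));    // one collect per doc, old logic
    System.out.println(delegateCalls(true, 0, 1, 2));     // one collect per doc, new logic
    System.out.println(delegateCalls(false, 0, 0, 1, 1)); // repeated collect, old logic
    System.out.println(delegateCalls(true, 0, 0, 1, 1));  // repeated collect, new logic
  }
}
```

In this model both variants call the delegate once per document when each document is collected once. They only differ when the same document is collected repeatedly, where the old variant recomputes the score on every collect.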

msfroh · 2024-11-13T17:04:13Z

lucene/core/src/test/org/apache/lucene/search/TestScoreCachingWrappingScorer.java

+    for (int i = 0; i < scores.length; i++) {
+      assertEquals(scores[i], scc.mscores[i * 2], 0f);
+      assertEquals(scores[i], scc.mscores[i * 2 + 1], 0f);
+    }
+
+    assertEquals(scores.length, countingScorable.count);


This is similar to the other test method (testGetScores), but we assert that each score is collected twice (since we collect each document twice), but the score() method is only called once per document.


jpountz

I'm curious about the questions that @mikemccand raised, otherwise the change looks good to me.

jpountz · 2024-11-13T17:39:38Z

lucene/core/src/test/org/apache/lucene/search/TestScoreCachingWrappingScorer.java

+    int doc;
+    while ((doc = s.iterator().nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
+      lc.collect(doc);
+      lc.collect(doc);


It's illegal to call collect() multiple times on the same doc. This shouldn't be necessary, as collect() already calls Scorable#score() multiple times?

Yeah -- that's why I'd hesitate to really call this a "bug fix".

I'm pretty sure you meant to write lastDoc = currDoc in #12407, but functionally currDoc = lastDoc works fine as long as you don't call collect() multiple times on the same doc.

I only noticed this because I was copy/pasting the logic to adapt it for LeafBucketCollector in OpenSearch and IntelliJ warned me that lastDoc could be final.

mikemccand

Thanks @msfroh -- does the new test fail when you revert the bug fix?

mikemccand · 2024-11-13T17:42:46Z

lucene/core/src/test/org/apache/lucene/search/TestScoreCachingWrappingScorer.java

@@ -157,4 +157,40 @@ public void testGetScores() throws Exception {
    ir.close();
    directory.close();
  }
+
+  private static class CountingScorable extends FilterScorable {
+    int count = 0;


We don't need the = 0 since it's Java's default already?
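For reference, a small sketch of the Java rule this comment relies on: instance fields of type int are initialized to 0 when no initializer is given. The class DefaultsDemo and its field names are hypothetical and are not part of the PR.

```java
class DefaultsDemo {
  int implicitCount;     // no initializer: an int instance field defaults to 0
  int explicitCount = 0; // same starting value, written out explicitly

  public static void main(String[] args) {
    DefaultsDemo d = new DefaultsDemo();
    System.out.println(d.implicitCount == d.explicitCount);
  }
}
```

This applies to fields only. A local variable has no default and must be assigned before it is read.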

mikemccand · 2024-11-13T17:44:57Z

lucene/core/src/test/org/apache/lucene/search/TestScoreCachingWrappingScorer.java

+    int doc;
+    while ((doc = s.iterator().nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
+      lc.collect(doc);
+      lc.collect(doc);


Hmm why are we collecting the same doc twice? (I thought this is smelly -- we fixed the other two tests to stop doing that?).

Oh I see, is it so that score() happens more than once while on the same doc? Actually, since ScoreCachingCollector's LeafCollector is calling .score() three times for each .collect(), you should be able to do only one lc.collect(doc) here?

Does this new test case fail if you revert the bug fix?

The new test does fail if I revert the fix.

As I mentioned in #13987 (comment), the condition for this being an actual bug is probably not really valid (i.e. you need to call collect() on the ScoreCachingWrappingLeafCollector multiple times for the same document).

The caching does work if the wrapped LeafCollector calls score() multiple times, which can happen in MultiCollector (for example). The layering is roughly:

ScoreCachingWrappingLeafCollector wraps
  Some other LeafCollector, whose scorer is
    ScoreCachingWrappingScorer, which wraps
      The real Scorable

The multiple calls to score() from Some other LeafCollector work just fine. Again, it's the call to collect() on ScoreCachingWrappingLeafCollector that was invalidating the cache.

Those other two tests broke because my "fix" causes the caching to work (but they didn't expect it).
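The layering described above can be sketched with plain Java stand-ins. This is an illustration under stated assumptions, not the Lucene API: SimpleScorable, SimpleLeafCollector, RealScorable, CachingScorable, GreedyCollector, CachingLeafCollector and LayeringDemo are hypothetical names, and the real Lucene types have more methods and throw IOException.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified interfaces standing in for the Lucene types named in the thread.
interface SimpleScorable {
  float score();
}

interface SimpleLeafCollector {
  void setScorer(SimpleScorable scorer);

  void collect(int doc);
}

// Innermost layer: "the real Scorable", counting how often it is actually evaluated.
class RealScorable implements SimpleScorable {
  int calls;

  public float score() {
    calls++;
    return 2.5f;
  }
}

// Caching layer: keeps one score per current doc (uses the lastDoc = curDoc assignment).
class CachingScorable implements SimpleScorable {
  private final SimpleScorable in;
  int curDoc = -1;
  private int lastDoc = -1;
  private float curScore;

  CachingScorable(SimpleScorable in) {
    this.in = in;
  }

  public float score() {
    if (lastDoc != curDoc) {
      curScore = in.score();
      lastDoc = curDoc;
    }
    return curScore;
  }
}

// "Some other LeafCollector" that asks for the score several times per collected doc.
class GreedyCollector implements SimpleLeafCollector {
  private SimpleScorable scorer;
  final List<Float> seen = new ArrayList<>();

  public void setScorer(SimpleScorable scorer) {
    this.scorer = scorer;
  }

  public void collect(int doc) {
    seen.add(scorer.score());
    seen.add(scorer.score());
    seen.add(scorer.score());
  }
}

// Outermost layer: wraps the inner collector and hands it the caching scorer.
class CachingLeafCollector implements SimpleLeafCollector {
  private final SimpleLeafCollector in;
  private CachingScorable cache;

  CachingLeafCollector(SimpleLeafCollector in) {
    this.in = in;
  }

  public void setScorer(SimpleScorable scorer) {
    cache = new CachingScorable(scorer);
    in.setScorer(cache);
  }

  public void collect(int doc) {
    cache.curDoc = doc; // tell the cache which doc is current before delegating
    in.collect(doc);
  }
}

class LayeringDemo {
  // Returns {score() calls seen by the inner collector, score() calls reaching the real scorable}.
  static int[] run(int numDocs) {
    RealScorable real = new RealScorable();
    GreedyCollector inner = new GreedyCollector();
    CachingLeafCollector outer = new CachingLeafCollector(inner);
    outer.setScorer(real);
    for (int doc = 0; doc < numDocs; doc++) {
      outer.collect(doc);
    }
    return new int[] {inner.seen.size(), real.calls};
  }

  public static void main(String[] args) {
    int[] counts = run(4);
    System.out.println(counts[0] + " score() calls from the inner collector");
    System.out.println(counts[1] + " score() calls on the real scorable");
  }
}
```

With four documents, the inner collector in this sketch makes twelve score() calls while the real scorable is evaluated four times, once per document.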

mikemccand · 2024-11-13T17:45:09Z

lucene/core/src/test/org/apache/lucene/search/TestScoreCachingWrappingScorer.java

+  public void testRepeatedCollectReusesScore() throws Exception {
+    Scorer s = new SimpleScorer();
+    CountingScorable countingScorable = new CountingScorable(s);
+    ScoreCachingCollector scc = new ScoreCachingCollector(scores.length * 2);


Maybe add a comment about why we need * 2 here?

mikemccand · 2024-11-13T17:50:08Z

lucene/core/src/test/org/apache/lucene/search/TestScoreCachingWrappingScorer.java

+    for (int i = 0; i < scores.length; i++) {
+      assertEquals(scores[i], scc.mscores[i * 2], 0f);
+      assertEquals(scores[i], scc.mscores[i * 2 + 1], 0f);
+    }
+
+    assertEquals(scores.length, countingScorable.count);


Ahh I see -- you want to assert that indeed score() was called (up on top) multiple times per hit, but then down below in the "true" scorer, it was dedup'd to just once per hit. So that's why you .collect(doc) twice, OK ...

msfroh added 2 commits November 11, 2024 16:03

Fix unit tests

65f1079

A couple of unit tests seemed to rely on scores not being cached for subsequent collect calls with the same doc id.

mikemccand approved these changes Nov 12, 2024

jpountz reviewed Nov 12, 2024

Add unit test for repeated collection of the same doc ID

ebfc425

msfroh commented Nov 13, 2024

jpountz approved these changes Nov 13, 2024

mikemccand reviewed Nov 13, 2024

Address feedback from @mikemccand

003dcac
