
Conversation

sgup432 (Contributor) commented Aug 26, 2025

Description

Related issue - #15097

Instead of using the default of 1024 bytes for query size, we use RamUsageEstimator to calculate an approximate size. As seen in the issue above, using 1024 as a default can cause problems due to underestimation (like heap exhaustion), and it can also overestimate, leading to fewer queries getting cached.

RamUsageEstimator.sizeOf() is usually pretty fast. I made a sample BooleanQuery with 15 clauses and a size of around 3 KB; it took around 14959 ns to calculate its size via RamUsageEstimator.sizeOf().
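To make the change concrete, here is a minimal, self-contained sketch of the sizing fallback this PR touches. The `Accountable` interface, the 1024-byte default, and the per-entry hashtable overhead mirror the shape of Lucene's LRUQueryCache, but the types, the 80-byte overhead constant, and the estimator below are stand-ins of my own, not the real Lucene classes:

```java
import java.util.function.ToLongFunction;

public class QuerySizingSketch {
    // Stand-in for org.apache.lucene.util.Accountable.
    interface Accountable {
        long ramBytesUsed();
    }

    static final long QUERY_DEFAULT_RAM_BYTES_USED = 1024;       // old fixed default
    static final long LINKED_HASHTABLE_RAM_BYTES_PER_ENTRY = 80; // illustrative overhead

    // Before: every non-accountable query is charged a flat 1024 bytes.
    static long ramBytesUsedBefore(Object query) {
        long size = (query instanceof Accountable a)
                ? a.ramBytesUsed()
                : QUERY_DEFAULT_RAM_BYTES_USED;
        return LINKED_HASHTABLE_RAM_BYTES_PER_ENTRY + size;
    }

    // After: non-accountable queries fall back to a real estimate
    // (the estimator parameter stands in for RamUsageEstimator.sizeOf).
    static long ramBytesUsedAfter(Object query, ToLongFunction<Object> estimator) {
        long size = (query instanceof Accountable a)
                ? a.ramBytesUsed()
                : estimator.applyAsLong(query);
        return LINKED_HASHTABLE_RAM_BYTES_PER_ENTRY + size;
    }

    public static void main(String[] args) {
        Object query = new Object(); // pretend this is a ~3 KB BooleanQuery
        // Old path charges 80 + 1024 bytes no matter the real footprint.
        System.out.println(ramBytesUsedBefore(query));           // 1104
        // New path charges 80 + the estimate (here pretending 3072 bytes).
        System.out.println(ramBytesUsedAfter(query, q -> 3072)); // 3152
    }
}
```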


This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.

@sgup432 sgup432 changed the title Use RamUsageEstimator to estimate query size instead of defaulting to 1024 bytes for non-accountable queries. [QueryCache] Use RamUsageEstimator instead of defaulting to 1024 bytes for non-accountable query size Aug 26, 2025
sgup432 (Contributor, Author) commented Aug 26, 2025

@msfroh Would you mind taking a look at this?


Comment on lines +428 to +429
long queryRamBytesUsed = RamUsageEstimator.sizeOf(query, 0);
return LINKED_HASHTABLE_RAM_BYTES_PER_ENTRY + queryRamBytesUsed;
Member

I realize you have a small benchmark that you did, but real-world performance is still a concern of mine here.

I am not sure how to benchmark this, but my concern is that non-accountable queries just got more expensive: we calculate the memory used for a query on both caching and eviction.

sgup432 (Contributor, Author) commented Aug 27, 2025

Yeah, I just timed one 3 KB boolean query.

I am not sure how to benchmark this, but my concern is that non-accountable queries just got more expensive

The only way I see is to micro-benchmark RamUsageEstimator with different query types and sizes, which might give more insight. I had tried to do the same, but only with term and boolean queries.

but my concern is that non-accountable queries just got more expensive

I am not sure whether it would be that expensive (at least in absolute terms). For TermQuery and similarly cheap queries, RamUsageEstimator should be pretty fast, considering we already cache ramBytesUsed for the underlying terms etc. In my opinion, complex BooleanQueries with many nested clauses might be more expensive, as we need to visit all the underlying children for the size calculation. To handle that, we already have a limit of 16 clauses beyond which we do not cache, and it can be refined further if needed.

I don't see any other good way of calculating query size until we implement Accountable for all queries, but that would be too intrusive for just the caching use case.
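For reference, a per-query Accountable implementation would look roughly like this. This is a hypothetical sketch, not Lucene's actual TermQuery; the nested Accountable interface, the 32-byte shallow size, and the 2-bytes-per-char string cost are all assumptions of mine:

```java
public class AccountableQuerySketch {
    // Stand-in for org.apache.lucene.util.Accountable.
    interface Accountable {
        long ramBytesUsed();
    }

    // A hypothetical TermQuery-like class that reports its own size, which
    // would let the cache skip reflective estimation entirely.
    static final class SketchTermQuery implements Accountable {
        private static final long SHALLOW_SIZE = 32; // assumed header + field slots
        private final String field;
        private final String text;

        SketchTermQuery(String field, String text) {
            this.field = field;
            this.text = text;
        }

        @Override
        public long ramBytesUsed() {
            // Shallow size plus ~2 bytes per UTF-16 char for the two strings.
            return SHALLOW_SIZE + 2L * (field.length() + text.length());
        }
    }

    public static void main(String[] args) {
        SketchTermQuery q = new SketchTermQuery("foo", "bar");
        System.out.println(q.ramBytesUsed()); // 32 + 2*(3+3) = 44
    }
}
```

The obvious downside, as noted above, is that every cacheable query class would need such an override, which is a lot of surface area for one caching use case.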

sgup432 (Contributor, Author) commented

@benwtrent So I wrote some simple code to micro-benchmark the putIfAbsent method, which is the one that calls getRamBytesUsed(Query query) during caching and eviction.
I created a cache with MAX_SIZE = 10000 and MAX_SIZE_IN_BYTES = 1048576, created N sample boolean queries, and the logic looks something like this:

for (int i = 0; i < MAX_SIZE; i++) {
  TermQuery must = new TermQuery(new Term("foo", "bar" + i));
  TermQuery filter = new TermQuery(new Term("foo", "quux" + i));
  TermQuery mustNot = new TermQuery(new Term("foo", "foo" + i));
  BooleanQuery.Builder bq = new BooleanQuery.Builder();
  bq.add(must, BooleanClause.Occur.FILTER);
  bq.add(filter, BooleanClause.Occur.FILTER);
  bq.add(mustNot, BooleanClause.Occur.MUST_NOT);
  queries[i] = bq.build();
}

JMH method to test a 100% write workload with 10 threads:

@Benchmark
@Group("concurrentPutOnly")
@GroupThreads(10)
public void testConcurrentPuts() {
  int random = ThreadLocalRandom.current().nextInt(MAX_SIZE);
  queryCache.putIfAbsent(
      queries[random], this.sampleCacheAndCount, cacheHelpers[random & (SEGMENTS - 1)]);
}

Baseline numbers

Benchmark                               Mode  Cnt        Score       Error  Units
QueryCacheBenchmark.concurrentPutOnly  thrpt   15  4102080.220 ± 80816.546  ops/s

My changes

Benchmark                               Mode  Cnt       Score       Error  Units
QueryCacheBenchmark.concurrentPutOnly  thrpt   15  901925.345 ± 38059.230  ops/s

So it became ~4.5x slower. There were a lot of evictions as well. This is probably one of the worst-case scenarios, though, with a write-only workload and boolean queries; for a mixed read/write workload or simple filter queries, the overhead might be much lower.
And this is the method profile (taken from JFR). Kind of expected.

[Screenshot: JFR method profile, 2025-09-17]

I don't know if there is a way out or we can further improve query visitor logic itself.

sgup432 (Contributor, Author) commented Sep 18, 2025

It seems the problem was that I was calling

long queryRamBytesUsed = RamUsageEstimator.sizeOf(query, 0);

The signature is long sizeOf(Query q, long defSize), where defSize represents the shallow size of the query; if 0 is passed, the shallow size is calculated at runtime, which was doing a lot of object allocations (adding to the higher CPU%) and therefore adding to latency.

I changed it to RamUsageEstimator.sizeOf(query, 32);, where 32 is a rough shallow size in bytes for a term/boolean query, which I calculated manually by calling that method.

After running with the above change, performance improved, but it is still ~2.5x slower than the baseline.

Benchmark                               Mode  Cnt        Score        Error  Units
QueryCacheBenchmark.concurrentPutOnly  thrpt   15  1608680.127 ± 229218.026  ops/s
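One further option along these lines (my own speculation, not something this PR does) would be to compute the shallow size once per concrete query class and reuse it, rather than hard-coding 32. A minimal sketch using ClassValue; shallowSizeOf here is a crude stand-in for a reflective shallow-size computation like RamUsageEstimator.shallowSizeOfInstance, and its 16-byte header / 8-bytes-per-field costs are assumptions:

```java
public class ShallowSizeCacheSketch {
    // Compute-once-per-class cache; ClassValue handles concurrency and
    // class unloading for us.
    private static final ClassValue<Long> SHALLOW_SIZES = new ClassValue<>() {
        @Override
        protected Long computeValue(Class<?> type) {
            return shallowSizeOf(type);
        }
    };

    // Crude stand-in: a 16-byte object header plus 8 bytes per declared
    // field. A real implementation would account for field types and
    // alignment, but that cost is now paid once per class, not per call.
    static long shallowSizeOf(Class<?> type) {
        return 16L + 8L * type.getDeclaredFields().length;
    }

    static long cachedShallowSize(Object o) {
        return SHALLOW_SIZES.get(o.getClass());
    }

    // Hypothetical query shape with two reference fields, like field/term.
    static final class FakeTermQuery {
        Object field;
        Object term;
    }

    public static void main(String[] args) {
        // First call computes the shallow size; later calls hit the cache.
        System.out.println(cachedShallowSize(new FakeTermQuery())); // 16 + 2*8 = 32
    }
}
```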


This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

@github-actions github-actions bot added Stale and removed Stale labels Sep 11, 2025