RamakrishnaChilaka
Contributor

This PR optimizes the expand8 routine by leveraging the JDK Vector API.
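For readers unfamiliar with the routine, here is a minimal scalar sketch of what an expand8-style step does. It is not the code from this PR. It assumes a 256-value block (matching the benchmarked block size) in which four 8-bit values are packed into each `int`, and the names `Expand8Sketch`, `PACKED_INTS`, and `collapse8` are hypothetical. Each of the four output stripes is the same shift-and-mask applied to contiguous elements, which is the kind of loop that maps naturally onto SIMD lanes.

```java
import java.util.Arrays;

final class Expand8Sketch {
  // Assumption: 256 values per block, four 8-bit values packed per int,
  // so the first 64 ints of the array hold the packed input.
  static final int BLOCK_SIZE = 256;
  static final int PACKED_INTS = BLOCK_SIZE / 4;

  /** Scalar expansion: unpack the four bytes of each packed int into four stripes. */
  static void expand8(int[] arr) {
    for (int i = 0; i < PACKED_INTS; ++i) {
      int packed = arr[i];
      arr[i] = (packed >>> 24) & 0xFF;
      arr[PACKED_INTS + i] = (packed >>> 16) & 0xFF;
      arr[2 * PACKED_INTS + i] = (packed >>> 8) & 0xFF;
      arr[3 * PACKED_INTS + i] = packed & 0xFF;
    }
  }

  /** Hypothetical inverse, used here only to build packed input for the demo. */
  static void collapse8(int[] arr) {
    for (int i = 0; i < PACKED_INTS; ++i) {
      arr[i] = (arr[i] << 24)
          | (arr[PACKED_INTS + i] << 16)
          | (arr[2 * PACKED_INTS + i] << 8)
          | arr[3 * PACKED_INTS + i];
    }
  }

  public static void main(String[] args) {
    int[] expected = new int[BLOCK_SIZE];
    for (int i = 0; i < BLOCK_SIZE; ++i) {
      expected[i] = (i * 31) & 0xFF; // arbitrary 8-bit test values
    }
    int[] work = expected.clone();
    collapse8(work);
    Arrays.fill(work, PACKED_INTS, BLOCK_SIZE, 0); // keep only the packed part
    expand8(work);
    System.out.println("round trip ok: " + Arrays.equals(expected, work));
  }
}
```

The round trip in `main` only checks that the sketch is self-consistent. It says nothing about the actual packing order used by the Lucene codec.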

Benchmarks

I validated performance with a standalone benchmark (see postings_expand_benchmark) at a block size of 256. The key takeaways are listed below the table.

Benchmark          Mode   Cnt    Score   Error   Units
expand16 (Scalar)  thrpt    5  112.842 ± 0.221  ops/us
expand16 (Vector)  thrpt    5  105.594 ± 1.307  ops/us
expand8 (Scalar)   thrpt    5   66.726 ± 0.452  ops/us
expand8 (Vector)   thrpt    5  105.821 ± 0.272  ops/us
  • expand8: Vectorized version is ~59% faster than scalar (66.7 → 105.8 ops/us).
  • expand16: Scalar outperforms vector by roughly 7% (112.8 vs 105.6 ops/us).

Lucene Microbenchmarks


baseline
Benchmark                                (bpv)   Mode  Cnt   Score   Error   Units
PostingIndexInputBenchmark.decode            2  thrpt   15  35.409 ± 0.120  ops/us
PostingIndexInputBenchmark.decode            3  thrpt   15  29.128 ± 0.017  ops/us
PostingIndexInputBenchmark.decode            4  thrpt   15  41.492 ± 0.305  ops/us
PostingIndexInputBenchmark.decode            5  thrpt   15  32.205 ± 0.350  ops/us
PostingIndexInputBenchmark.decode            6  thrpt   15  31.237 ± 0.245  ops/us
PostingIndexInputBenchmark.decode            7  thrpt   15  29.984 ± 0.582  ops/us
PostingIndexInputBenchmark.decode            8  thrpt   15  56.366 ± 0.134  ops/us
PostingIndexInputBenchmark.decode            9  thrpt   15  22.802 ± 0.077  ops/us
PostingIndexInputBenchmark.decode           10  thrpt   15  23.502 ± 0.037  ops/us
PostingIndexInputBenchmark.decodeVector      2  thrpt   15  53.151 ± 0.070  ops/us
PostingIndexInputBenchmark.decodeVector      3  thrpt   15  48.863 ± 1.455  ops/us
PostingIndexInputBenchmark.decodeVector      4  thrpt   15  54.284 ± 2.195  ops/us
PostingIndexInputBenchmark.decodeVector      5  thrpt   15  39.302 ± 0.659  ops/us
PostingIndexInputBenchmark.decodeVector      6  thrpt   15  38.414 ± 0.830  ops/us
PostingIndexInputBenchmark.decodeVector      7  thrpt   15  39.609 ± 0.551  ops/us
PostingIndexInputBenchmark.decodeVector      8  thrpt   15  56.373 ± 0.118  ops/us
PostingIndexInputBenchmark.decodeVector      9  thrpt   15  27.295 ± 0.351  ops/us
PostingIndexInputBenchmark.decodeVector     10  thrpt   15  30.058 ± 0.172  ops/us


contender
Benchmark                                (bpv)   Mode  Cnt   Score   Error   Units
PostingIndexInputBenchmark.decode            2  thrpt   15  35.238 ± 0.209  ops/us
PostingIndexInputBenchmark.decode            3  thrpt   15  29.214 ± 0.098  ops/us
PostingIndexInputBenchmark.decode            4  thrpt   15  41.559 ± 0.580  ops/us
PostingIndexInputBenchmark.decode            5  thrpt   15  32.543 ± 0.175  ops/us
PostingIndexInputBenchmark.decode            6  thrpt   15  31.323 ± 0.061  ops/us
PostingIndexInputBenchmark.decode            7  thrpt   15  29.525 ± 0.315  ops/us
PostingIndexInputBenchmark.decode            8  thrpt   15  52.348 ± 0.079  ops/us
PostingIndexInputBenchmark.decode            9  thrpt   15  24.919 ± 0.056  ops/us
PostingIndexInputBenchmark.decode           10  thrpt   15  26.581 ± 0.049  ops/us
PostingIndexInputBenchmark.decodeVector      2  thrpt   15  71.223 ± 6.921  ops/us
PostingIndexInputBenchmark.decodeVector      3  thrpt   15  53.237 ± 1.962  ops/us
PostingIndexInputBenchmark.decodeVector      4  thrpt   15  73.437 ± 0.284  ops/us
PostingIndexInputBenchmark.decodeVector      5  thrpt   15  41.201 ± 2.067  ops/us
PostingIndexInputBenchmark.decodeVector      6  thrpt   15  46.622 ± 0.289  ops/us
PostingIndexInputBenchmark.decodeVector      7  thrpt   15  45.505 ± 1.044  ops/us
PostingIndexInputBenchmark.decodeVector      8  thrpt   15  58.368 ± 0.977  ops/us
PostingIndexInputBenchmark.decodeVector      9  thrpt   15  27.243 ± 0.358  ops/us
PostingIndexInputBenchmark.decodeVector     10  thrpt   15  30.059 ± 0.105  ops/us

Summary

bpv 9 and 10 use a primitive size of 16, so they are unaffected by the expand8 change and show no difference in performance.

bpv  baseline vector (ops/μs)  contender vector (ops/μs)  Δ
  2                      53.2                       71.2  +33.8 %
  3                      48.9                       53.2   +8.8 %
  4                      54.3                       73.4  +35.2 %
  5                      39.3                       41.2   +4.8 %
  6                      38.4                       46.6  +21.4 %
  7                      39.6                       45.5  +14.9 %
  8                      56.4                       58.4   +3.5 %
  9                      27.3                       27.2   -0.4 %
 10                      30.1                       30.1    0.0 %
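The Δ column is the relative change in throughput from baseline to contender. A small sketch of that arithmetic follows; the class and method names are hypothetical. Because it works from the unrounded JMH scores, its result can differ from the table by about a tenth of a percentage point where the table was computed from rounded values.

```java
import java.util.Locale;

final class ThroughputDelta {
  /** Relative change from baseline to contender, in percent. */
  static double percentChange(double baseline, double contender) {
    return (contender - baseline) / baseline * 100.0;
  }

  public static void main(String[] args) {
    // decodeVector scores for bpv 4, taken from the baseline and contender runs.
    double baseline = 54.284;
    double contender = 73.437;
    System.out.printf(Locale.ROOT, "bpv 4: %+.1f %%%n", percentChange(baseline, contender));
  }
}
```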

Contributor

This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.

github-actions bot added this to the 10.4.0 milestone on Sep 17, 2025