Implement off-heap quantized scoring #14863

Open · wants to merge 1 commit into base: main

Conversation

@kaivalnp (Contributor)

Description

Off-heap scoring for quantized vectors! Related to #13515

This scorer is in line with Lucene99MemorySegmentFlatVectorsScorer, and will automatically be used with PanamaVectorizationProvider (i.e. when jdk.incubator.vector is added). Note that the computations are already vectorized; what we avoid here is the unnecessary copy to heap.

I added off-heap dot product functions for two compressed 4-bit vectors (i.e. no need to "decompress" them first). I can try to come up with similar ones for Euclidean if this approach seems fine.

github-actions (bot)

This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.

@kaivalnp (Contributor, Author)

I ran some benchmarks on Cohere vectors (768d) for 7-bit and 4-bit (compressed) quantization.

main without jdk.incubator.vector:

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.860        2.815   2.806        0.997  100000   100      50       64        250     7 bits     44.07       2269.17           46.79             1          373.72       366.592       73.624       HNSW
 0.545        3.193   3.185        0.997  100000   100      50       64        250     4 bits     47.26       2115.95           50.04             1          338.13       329.971       37.003       HNSW

main with jdk.incubator.vector:

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.863        1.904   1.886        0.991  100000   100      50       64        250     7 bits     28.65       3490.65           29.66             1          373.69       366.592       73.624       HNSW
 0.545        1.313   1.305        0.994  100000   100      50       64        250     4 bits     22.86       4373.88           17.84             1          338.13       329.971       37.003       HNSW

This PR without jdk.incubator.vector:

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.861        2.774   2.765        0.997  100000   100      50       64        250     7 bits     44.60       2242.00           46.71             1          373.73       366.592       73.624       HNSW
 0.545        3.147   3.139        0.997  100000   100      50       64        250     4 bits     47.93       2086.51           50.20             1          338.11       329.971       37.003       HNSW

This PR with jdk.incubator.vector:

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.861        1.612   1.603        0.994  100000   100      50       64        250     7 bits     22.99       4349.53           24.78             1          373.70       366.592       73.624       HNSW
 0.545        1.277   1.269        0.994  100000   100      50       64        250     4 bits     21.60       4630.49           17.41             1          338.11       329.971       37.003       HNSW

I did see slight fluctuation across runs, but search time was ~10% faster for 7-bit and very slightly faster for 4-bit (compressed). Indexing and force merge times improved by ~15%.

@kaivalnp (Contributor, Author)

FYI, I observed a strange phenomenon: if the query vector is allocated on heap, like:

this.query = MemorySegment.ofArray(targetBytes);

instead of the current off-heap implementation in this PR:

this.query = Arena.ofAuto().allocateFrom(JAVA_BYTE, targetBytes);

...then we see a performance regression:

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.862        3.043   3.034        0.997  100000   100      50       64        250     7 bits     23.25       4301.82           25.29             1          373.70       366.592       73.624       HNSW
 0.545        2.060   2.049        0.995  100000   100      50       64        250     4 bits     22.19       4506.33           17.99             1          338.17       329.971       37.003       HNSW

Maybe I'm missing something obvious, but I haven't found the root cause yet.

@ChrisHegarty (Contributor)

> ...then we see a performance regression:
> ...
> Maybe I'm missing something obvious, but I haven't found the root cause yet.

Yeah, I've seen similar before. You might be hitting a problem with the loop bound not being hoisted. I will try to take a look.

@kaivalnp (Contributor, Author)

Thanks @ChrisHegarty! I saw that we use a heap-backed MemorySegment while scoring byte vectors, so I opened #14874 to investigate whether we can improve performance by moving to an off-heap query.

github-actions (bot)

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

github-actions bot added the Stale label (Jul 15, 2025)
@kaivalnp (Contributor, Author)

After this conversation, I re-ran some benchmarks with -XX:CompileCommand=inline,*PanamaVectorUtilSupport.* to force inlining of the dot product functions.

main:

recall  latency(ms)  index time(s)  force merge(s)  quantization  index
 0.530        1.675          75.65           59.97         4 bit  fresh
 0.860        1.844          81.75           71.79         7 bit  fresh
 0.530        1.582              -               -         4 bit  no reindex
 0.860        1.859              -               -         7 bit  no reindex
 0.529        1.682          79.32           62.62         4 bit  reindex
 0.859        1.821         103.78           48.36         7 bit  reindex

This PR:

recall  latency(ms)  index time(s)  force merge(s)  quantization  index       query type
 0.529        2.132         131.86           85.86         4 bit  fresh       MemorySegment.ofArray
 0.858        2.797         133.71           80.81         7 bit  fresh       MemorySegment.ofArray
 0.529        2.081              -               -         4 bit  no reindex  MemorySegment.ofArray
 0.858        1.670              -               -         7 bit  no reindex  MemorySegment.ofArray
 0.529        2.140         130.82           85.08         4 bit  reindex     MemorySegment.ofArray
 0.858        2.883         132.66           83.05         7 bit  reindex     MemorySegment.ofArray
 0.529        1.511         164.22          110.18         4 bit  fresh       Arena.ofAuto().allocateFrom
 0.859        1.728         132.41           81.35         7 bit  fresh       Arena.ofAuto().allocateFrom
 0.529        1.551              -               -         4 bit  no reindex  Arena.ofAuto().allocateFrom
 0.859        1.704              -               -         7 bit  no reindex  Arena.ofAuto().allocateFrom
 0.529        1.574         164.04          112.60         4 bit  reindex     Arena.ofAuto().allocateFrom
 0.859        1.774         135.20           83.72         7 bit  reindex     Arena.ofAuto().allocateFrom

Looks like the changes in this PR have a small benefit on the search side, but slow down indexing by a lot.

github-actions bot removed the Stale label (Jul 16, 2025)