[Draft] Support Multi-Vector HNSW Search via Flat Vector Storage #14173
Conversation
Ran some early benchmarks to compare this flat-storage-based multi-vector approach with the existing parent-join approach. I would appreciate any feedback on the approach, the benchmark setup, or any mistakes you spot along the way. Observations:
…

**ParentJoin v/s MultiVector (on multi-vector branch)**

multivector:

| recall | latency (ms) | nVectors | nDoc | topK | fanout | maxConn | beamWidth | quantized | index s | index docs/s | num segments | index size (MB) | vec disk (MB) | vec RAM (MB) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.673 | 3.548 | 10000 | 103 | 100 | 50 | 32 | 100 | no | 1.62 | 63.58 | 1 | 29.40 | 29.297 | 29.297 |
| 0.431 | 4.857 | 100000 | 1323 | 100 | 50 | 32 | 100 | no | 11.42 | 115.84 | 3 | 294.08 | 292.969 | 292.969 |
| 0.461 | 8.034 | 200000 | 2939 | 100 | 50 | 32 | 100 | no | 22.62 | 129.92 | 6 | 588.27 | 585.938 | 585.938 |
| 0.496 | 16.040 | 500000 | 8773 | 100 | 50 | 32 | 100 | no | 53.50 | 163.98 | 14 | 1470.72 | 1464.844 | 1464.844 |
parentJoin on multi-vector branch (merges disabled; creates and loads multi-vector config):

| recall | latency (ms) | nVectors | nDoc | topK | fanout | maxConn | beamWidth | quantized | index s | index docs/s | num segments | index size (MB) | vec disk (MB) | vec RAM (MB) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.610 | 4.644 | 10000 | 10103 | 100 | 50 | 32 | 100 | no | 1.70 | 5946.44 | 1 | 29.51 | 29.297 | 29.297 |
| 0.242 | 5.189 | 100000 | 101323 | 100 | 50 | 32 | 100 | no | 11.34 | 8935.80 | 3 | 295.17 | 292.969 | 292.969 |
| 0.275 | 8.988 | 200000 | 202939 | 100 | 50 | 32 | 100 | no | 22.54 | 9005.50 | 6 | 590.51 | 585.938 | 585.938 |
| 0.290 | 16.605 | 500000 | 508773 | 100 | 50 | 32 | 100 | no | 52.70 | 9654.32 | 14 | 1476.26 | 1464.844 | 1464.844 |

**ParentJoin (on main) v/s MultiVector (on multivector branch)**

parentJoin (on main):
| recall | latency (ms) | nDoc | topK | fanout | maxConn | beamWidth | quantized | index s | index docs/s | force merge s | num segments | index size (MB) | vec disk (MB) | vec RAM (MB) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.958 | 1.160 | 10000 | 100 | 50 | 32 | 100 | no | 1.49 | 6706.91 | 1.85 | 1 | 29.67 | 29.297 | 29.297 |
| 0.925 | 2.392 | 100000 | 100 | 50 | 32 | 100 | no | 34.98 | 2858.86 | 7.86 | 1 | 297.91 | 292.969 | 292.969 |
| 0.914 | 2.972 | 200000 | 100 | 50 | 32 | 100 | no | 63.80 | 3134.94 | 43.48 | 1 | 596.14 | 585.938 | 585.938 |
| 0.904 | 4.292 | 500000 | 100 | 50 | 32 | 100 | no | 151.49 | 3300.57 | 147.08 | 1 | 1491.81 | 1464.844 | 1464.844 |

multivector:

| recall | latency (ms) | nVectors | nDoc | topK | fanout | maxConn | beamWidth | quantized | index s | index docs/s | num segments | index size (MB) | vec disk (MB) | vec RAM (MB) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.673 | 3.793 | 10000 | 103 | 100 | 50 | 32 | 100 | no | 1.59 | 64.78 | 1 | 29.40 | 29.297 | 29.297 |
| 0.431 | 4.572 | 100000 | 1323 | 100 | 50 | 32 | 100 | no | 11.22 | 117.87 | 3 | 294.08 | 292.969 | 292.969 |
| 0.461 | 7.681 | 200000 | 2939 | 100 | 50 | 32 | 100 | no | 22.38 | 131.32 | 6 | 588.27 | 585.938 | 585.938 |
| 0.496 | 16.292 | 500000 | 8773 | 100 | 50 | 32 | 100 | no | 54.10 | 162.17 | 14 | 1470.72 | 1464.844 | 1464.844 |

**ParentJoin with merges v/s ParentJoin with merges disabled (both on main)**

parentJoin (on main):
| recall | latency (ms) | nDoc | topK | fanout | maxConn | beamWidth | quantized | index s | index docs/s | force merge s | num segments | index size (MB) | vec disk (MB) | vec RAM (MB) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.958 | 1.160 | 10000 | 100 | 50 | 32 | 100 | no | 1.49 | 6706.91 | 1.85 | 1 | 29.67 | 29.297 | 29.297 |
| 0.925 | 2.392 | 100000 | 100 | 50 | 32 | 100 | no | 34.98 | 2858.86 | 7.86 | 1 | 297.91 | 292.969 | 292.969 |
| 0.914 | 2.972 | 200000 | 100 | 50 | 32 | 100 | no | 63.80 | 3134.94 | 43.48 | 1 | 596.14 | 585.938 | 585.938 |
| 0.904 | 4.292 | 500000 | 100 | 50 | 32 | 100 | no | 151.49 | 3300.57 | 147.08 | 1 | 1491.81 | 1464.844 | 1464.844 |

parentJoin on main (merges disabled):

| recall | latency (ms) | nDoc | topK | fanout | maxConn | beamWidth | quantized | index s | index docs/s | force merge s | num segments | index size (MB) | vec disk (MB) | vec RAM (MB) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.440 | 1.297 | 10000 | 100 | 50 | 32 | 100 | no | 1.76 | 5694.76 | 2.03 | 1 | 29.67 | 29.297 | 29.297 |
| 0.692 | 2.596 | 100000 | 100 | 50 | 32 | 100 | no | 11.35 | 8807.47 | 29.76 | 1 | 297.86 | 292.969 | 292.969 |
| 0.530 | 3.173 | 200000 | 100 | 50 | 32 | 100 | no | 22.03 | 9077.71 | 67.91 | 1 | 596.24 | 585.938 | 585.938 |
| 0.598 | 4.368 | 500000 | 100 | 50 | 32 | 100 | no | 53.20 | 9398.50 | 204.29 | 1 | 1493.26 | 1464.844 | 1464.844 |
I wonder if you are comparing vector IDs correctly? Looking at the "merges enabled/disabled" numbers, it doesn't make sense to me that one would be better than the other, since both are force-merged into a single segment in the end. I also don't understand the recall change between parentJoin on main vs. parentJoin in your branch. That is a significant difference, which suggests there is a bug in the test runner or in the code itself. These numbers indeed confuse me. Maybe there is a bug in sparse vector index handling?
I like where this PR is going.
I think this PR is still doing globally unique int ordinals for vectors? Models like ColPali (and ColBERT) will index 100s or as many as 1k vectors per document. This would effectively restrict the number of documents per Lucene segment to roughly 2-20M, much lower than it is now.
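(Roughly: signed int ordinals allow at most ~2.1B vectors per segment, so at 1,000 vectors per document that works out to about 2.1M documents per segment, and at 100 vectors per document about 21M.)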
The parentJoin run on my branch has merges disabled, and loads the extra metadata created for multi-vectors (I haven't added the single-vector optimization yet). Merging should not have had an impact on recall, though. I think there is some bug in how recall is calculated in the parentJoin benchmark today; created mikemccand/luceneutil#333 to track it.
That is correct. I did play around with "long" nodeIDs in an alternate implementation but ran into a whole bunch of challenges:
At one point, I had the idea of representing each vector with an ordinal plus a subOrdinal, packing the two values into a single long. However, in addition to the "long" nodeId challenges above, this approach broke latent assumptions like "vector ordinals are densely packed". For example, we could no longer assume that maxOrdinal is the graph size. Hence I pivoted back to int ordinals and started maintaining the additional ordinal-to-doc metadata instead.
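A minimal sketch of that packing idea, purely for illustration (the class name and bit layout here are assumptions, not code from this PR):

```java
// Illustrative only: pack a vector's (ordinal, subOrdinal) pair into one long,
// with the per-doc ordinal in the upper 32 bits and the subOrdinal in the lower 32 bits.
final class PackedVectorId {
  private PackedVectorId() {}

  static long pack(int ordinal, int subOrdinal) {
    return (((long) ordinal) << 32) | (subOrdinal & 0xFFFFFFFFL);
  }

  static int ordinal(long packed) {
    return (int) (packed >>> 32);
  }

  static int subOrdinal(long packed) {
    return (int) packed;
  }
}
```

Even with such packing, graph code that assumes densely packed int ordinals (e.g. maxOrd equals graph size) breaks, which is the latent-assumption problem described above.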
Yeah, I don't like that we have this right now for
Correct, though filtering is always on docs. I understand the complexity of keeping track of visitation to prevent re-visiting a node. However, I would argue that any graph-based or modern vector index needs to keep track of the vectors it has visited, both to avoid recalculating scores and to avoid further exploration down a previously explored path.
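As a rough illustration of the kind of visited tracking being argued for (this sketch uses Lucene's FixedBitSet and is not code from the PR):

```java
import org.apache.lucene.util.FixedBitSet;

// Illustrative only: remember which vector ordinals a graph search has already
// touched, so scores are not recomputed and explored paths are not re-entered.
final class VisitedOrds {
  private final FixedBitSet visited;

  VisitedOrds(int maxOrd) {
    this.visited = new FixedBitSet(maxOrd);
  }

  /** Returns true the first time an ordinal is seen; false on any revisit. */
  boolean checkAndMark(int ord) {
    if (visited.get(ord)) {
      return false;
    }
    visited.set(ord);
    return true;
  }
}
```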
We can easily write out the graph size and define it as the count of vectors (i.e., the number of non-deleted vectors). True, it gets complex with deleted docs; we would likely need to iterate and count during merges, though I would expect that to be a minor cost on top of all the other work we currently do during merging. I don't think this is a particular problem.
I don't understand how DiskANN would solve any of the previously expressed problems.
No, it wouldn't solve any of these problems. I meant that since we'd be writing new implementations for buildGraph, merging, etc., it might be easier to account for long nodeIds from the get-go. (Not specific to DiskANN; it could be any new implementation.) Overall, I think we should do a separate PR that just shifts from int to long nodeIds.
Ah, I understand and I agree :). I misunderstood your initial comment.
I think if we are to support many vectors for a single doc, this seems like the only way to really do it :(. I realize it will be complex (and not just from an API perspective).
Catching up on this and trying to understand how far we are now from my original idea and implementation. Obviously, my code is completely outdated, but reading across this PR and #13525, it seems we are converging again toward what I originally proposed. I'll work on this for the next couple of weeks, so I should be able to add some comments and additional opinions. The main concern still seems to be moving ordinals to long, as far as I can see :) @benwtrent can you confirm that, to your knowledge, all relevant and active work toward multi-valued vectors in Lucene is effectively aggregated here?
@vigyasharma, from a first superficial pass, I see that this PR touches points similar to my original (now outdated) one. I'll proceed with a deeper review, but if you already know of pain points from my original PR that were not worth porting to 2025, I would be glad to hear about them!
For example, what are the benefits of this compared to the changes I proposed to lucene/core/src/java/org/apache/lucene/util/LongHeap.java in https://github.com/apache/lucene/pull/12314/files?
I'd like to keep the logic that updates scores for already-ingested docs encapsulated within the heap. By returning the array index within the heap (the LongHeap changes in #12314), we shift this responsibility to consumers, like the NeighborQueue changes, which can be trappy and lead to repeated code.
@alessandrobenedetti I think so. This is the latest stab at it.
Indeed, I just don't see how Lucene can actually support multi-valued vectors without switching to long ordinals for the vectors. Otherwise, we enforce some limitation on the number of vectors per segment, or some limitation on the number of vectors per doc (e.g. every doc can only have 256/65535 vectors). Making HNSW indexing & merging roughly 2x more expensive in heap usage (given other constants, it might not be exactly 2x, maybe a little less) is a pretty steep cost, especially for something I am not sure how many folks will actually use.
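(The ~2x figure follows from the node ids held in the graph's neighbor lists and search queues: widening them from 4-byte ints to 8-byte longs roughly doubles that part of the heap footprint, while other per-node data stays the same, which is why it may come out somewhat under 2x.)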
Thanks for looking into this PR @alessandrobenedetti, this is the latest iteration on multi-vector support. It builds on the same central idea of assigning a unique ordinal to each vector and mapping multiple ordinals to a single doc. I tried a few other approaches, but this one seemed cleanest. I think the key difference over #12314 is the set of changes that store metadata letting us map multiple ordinals to a single doc, implemented in the flat vector storage. I also added an
Re: using long for graph node ids, I can see how int ordinals limit the number of vectors we can index per segment. However, adapting to long node ids is also a non-trivial change because of how our current code is structured; I described it in more detail here. I think we should do both (multi-vectors and long node ids) but decouple the two changes. If we can support multi-vectors without impacting performance for existing single-vector use cases, we should add that support. Separately, we should work on moving to long nodeIds, which will give us the full power of multi-vector models.
I agree, I don't think it makes sense to degrade single-valued performance at all (I didn't investigate that, but I trust your judgement on the int->long ordinal impact; let me know in case you want me to double-check). Another option I was pondering is adding a new field type dedicated to multi-valued vectors. P.S. That is aside from the fact that the current nested-vectors approach still suffers from the 2B int ordinal limitation.
I tried this in my first stab at this issue (#13525). IIRC, one concern with a separate field was that it prevents users from later converting their previously single-valued fields to multi-valued vectors if they need to. And since single-valued is a base case of multi-valued, why would anyone even use the single-valued fields? FWIW, that PR (#13525) has pieces that use the separate field and shows the extent of duplication across classes (it's not very much). I had only added support for ColBERT-style dependent multi-vectors, but that can be extended with the independent-vector pieces in this PR.
Agreed. The next step for this PR is to benchmark against parent-join runs and show an improvement, especially in cases where we need query-time scoring on top of all the vector values.
Another take at #12313
The following PR adds support for independent multi-vectors, i.e. scenarios where a single document is represented by multiple independent vector values. The most common example is the passage-vector search use case, where we create a vector for every paragraph chunk in a document.
Currently, Lucene only supports a single vector per document. Users are required to create parent-child relationships between a document and its chunked vectors, and run a `ParentBlockJoin` query for passage vector search. This change allows indexing multiple vectors within the same document.

Each vector is still assigned a unique int ordinal, but multiple ordinals can now map to the same document. We use additional metadata to maintain the many-to-one ordToDoc mapping, and to quickly find the first indexed vector ordinal for a document (called the baseOrdinal, or `baseOrd`). This gives us new APIs that fetch all vectors for a document, which can be used for faster scoring (as opposed to the child-doc query in the ParentJoin approach).
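A simplified, in-memory sketch of what this metadata provides (illustrative only; the actual change stores the mapping in the vector storage format rather than as a plain array, and the class here is hypothetical):

```java
// Illustrative only: many-to-one mapping from vector ordinals to docs.
// ordToDoc[ord] is the doc that owns vector `ord`; a doc's ordinals are
// contiguous, so its first (base) ordinal can be found with a binary search.
final class OrdToDocMap {
  private final int[] ordToDoc; // non-decreasing, one entry per vector ordinal

  OrdToDocMap(int[] ordToDoc) {
    this.ordToDoc = ordToDoc;
  }

  int docForOrd(int ord) {
    return ordToDoc[ord];
  }

  /** Returns the first vector ordinal of {@code docId}, or -1 if the doc has no vectors. */
  int baseOrd(int docId) {
    int lo = 0, hi = ordToDoc.length - 1, first = -1;
    while (lo <= hi) {
      int mid = (lo + hi) >>> 1;
      if (ordToDoc[mid] >= docId) {
        if (ordToDoc[mid] == docId) {
          first = mid;
        }
        hi = mid - 1;
      } else {
        lo = mid + 1;
      }
    }
    return first;
  }
}
```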
**Interface**

The interface to use multi-vector values is quite simple now:
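As a rough sketch of the kind of usage being described (the field name and similarity function are arbitrary here, and this is not the PR's exact API), indexing could look something like this:

```java
import java.io.IOException;
import java.util.List;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.KnnFloatVectorField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.VectorSimilarityFunction;

// Illustrative only: with multi-vector support, one document simply carries
// several values of the same vector field (one per passage chunk), instead of
// being split into parent/child documents for a ParentBlockJoin query.
class MultiVectorIndexingSketch {
  void indexPassages(IndexWriter writer, List<float[]> passageVectors) throws IOException {
    Document doc = new Document();
    for (float[] vector : passageVectors) {
      doc.add(new KnnFloatVectorField("passage_vector", vector, VectorSimilarityFunction.COSINE));
    }
    writer.addDocument(doc); // all passage vectors map back to this one doc
  }
}
```

On the read side, the new ordinal metadata is what lets a scorer ask for all vectors belonging to a matched document and combine their scores, rather than issuing a child-doc query.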
I was able to add a multi-vector benchmark to luceneutil to run this setup end to end. Will link results and a luceneutil PR in comments.
Pending Tasks:
This is an early draft to get some feedback; I have TODOs across the code for future improvements. Here are some big items pending:
- `ScoreMode.Avg` support
Note: This change does not include dependent multi-valued vectors like ColBERT, where the multiple vectors must be used together to compute similarity. It does, however, lay essential groundwork that can subsequently be extended to support them.