[Draft] Support Multi-Vector HNSW Search via Flat Vector Storage #14173
Conversation
Ran some early benchmarks to compare this flat-storage-based multi-vector approach with the existing parent-join approach. I would appreciate any feedback on the approach, the benchmark setup, or any mistakes you spot along the way. Observations:
…

**ParentJoin v/s MultiVector (on multi-vector branch)**

multivector:

| recall | latency (ms) | nVectors | nDoc | topK | fanout | maxConn | beamWidth | quantized | index s | index docs/s | num segments | index size (MB) | vec disk (MB) | vec RAM (MB) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.673 | 3.548 | 10000 | 103 | 100 | 50 | 32 | 100 | no | 1.62 | 63.58 | 1 | 29.40 | 29.297 | 29.297 |
| 0.431 | 4.857 | 100000 | 1323 | 100 | 50 | 32 | 100 | no | 11.42 | 115.84 | 3 | 294.08 | 292.969 | 292.969 |
| 0.461 | 8.034 | 200000 | 2939 | 100 | 50 | 32 | 100 | no | 22.62 | 129.92 | 6 | 588.27 | 585.938 | 585.938 |
| 0.496 | 16.040 | 500000 | 8773 | 100 | 50 | 32 | 100 | no | 53.50 | 163.98 | 14 | 1470.72 | 1464.844 | 1464.844 |
parentJoin on multi-vector branch (merges disabled; creates and loads multi-vector config):

| recall | latency (ms) | nVectors | nDoc | topK | fanout | maxConn | beamWidth | quantized | index s | index docs/s | num segments | index size (MB) | vec disk (MB) | vec RAM (MB) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.610 | 4.644 | 10000 | 10103 | 100 | 50 | 32 | 100 | no | 1.70 | 5946.44 | 1 | 29.51 | 29.297 | 29.297 |
| 0.242 | 5.189 | 100000 | 101323 | 100 | 50 | 32 | 100 | no | 11.34 | 8935.80 | 3 | 295.17 | 292.969 | 292.969 |
| 0.275 | 8.988 | 200000 | 202939 | 100 | 50 | 32 | 100 | no | 22.54 | 9005.50 | 6 | 590.51 | 585.938 | 585.938 |
| 0.290 | 16.605 | 500000 | 508773 | 100 | 50 | 32 | 100 | no | 52.70 | 9654.32 | 14 | 1476.26 | 1464.844 | 1464.844 |

**ParentJoin (on main) v/s MultiVector (on multivector branch)**

parentJoin (on main):
| recall | latency (ms) | nDoc | topK | fanout | maxConn | beamWidth | quantized | index s | index docs/s | force merge s | num segments | index size (MB) | vec disk (MB) | vec RAM (MB) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.958 | 1.160 | 10000 | 100 | 50 | 32 | 100 | no | 1.49 | 6706.91 | 1.85 | 1 | 29.67 | 29.297 | 29.297 |
| 0.925 | 2.392 | 100000 | 100 | 50 | 32 | 100 | no | 34.98 | 2858.86 | 7.86 | 1 | 297.91 | 292.969 | 292.969 |
| 0.914 | 2.972 | 200000 | 100 | 50 | 32 | 100 | no | 63.80 | 3134.94 | 43.48 | 1 | 596.14 | 585.938 | 585.938 |
| 0.904 | 4.292 | 500000 | 100 | 50 | 32 | 100 | no | 151.49 | 3300.57 | 147.08 | 1 | 1491.81 | 1464.844 | 1464.844 |

multivector:

| recall | latency (ms) | nVectors | nDoc | topK | fanout | maxConn | beamWidth | quantized | index s | index docs/s | num segments | index size (MB) | vec disk (MB) | vec RAM (MB) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.673 | 3.793 | 10000 | 103 | 100 | 50 | 32 | 100 | no | 1.59 | 64.78 | 1 | 29.40 | 29.297 | 29.297 |
| 0.431 | 4.572 | 100000 | 1323 | 100 | 50 | 32 | 100 | no | 11.22 | 117.87 | 3 | 294.08 | 292.969 | 292.969 |
| 0.461 | 7.681 | 200000 | 2939 | 100 | 50 | 32 | 100 | no | 22.38 | 131.32 | 6 | 588.27 | 585.938 | 585.938 |
| 0.496 | 16.292 | 500000 | 8773 | 100 | 50 | 32 | 100 | no | 54.10 | 162.17 | 14 | 1470.72 | 1464.844 | 1464.844 |

**ParentJoin with merges v/s ParentJoin with merges disabled (both on main)**

parentJoin (on main):
| recall | latency (ms) | nDoc | topK | fanout | maxConn | beamWidth | quantized | index s | index docs/s | force merge s | num segments | index size (MB) | vec disk (MB) | vec RAM (MB) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.958 | 1.160 | 10000 | 100 | 50 | 32 | 100 | no | 1.49 | 6706.91 | 1.85 | 1 | 29.67 | 29.297 | 29.297 |
| 0.925 | 2.392 | 100000 | 100 | 50 | 32 | 100 | no | 34.98 | 2858.86 | 7.86 | 1 | 297.91 | 292.969 | 292.969 |
| 0.914 | 2.972 | 200000 | 100 | 50 | 32 | 100 | no | 63.80 | 3134.94 | 43.48 | 1 | 596.14 | 585.938 | 585.938 |
| 0.904 | 4.292 | 500000 | 100 | 50 | 32 | 100 | no | 151.49 | 3300.57 | 147.08 | 1 | 1491.81 | 1464.844 | 1464.844 |

parentJoin on main (merges disabled):

| recall | latency (ms) | nDoc | topK | fanout | maxConn | beamWidth | quantized | index s | index docs/s | force merge s | num segments | index size (MB) | vec disk (MB) | vec RAM (MB) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.440 | 1.297 | 10000 | 100 | 50 | 32 | 100 | no | 1.76 | 5694.76 | 2.03 | 1 | 29.67 | 29.297 | 29.297 |
| 0.692 | 2.596 | 100000 | 100 | 50 | 32 | 100 | no | 11.35 | 8807.47 | 29.76 | 1 | 297.86 | 292.969 | 292.969 |
| 0.530 | 3.173 | 200000 | 100 | 50 | 32 | 100 | no | 22.03 | 9077.71 | 67.91 | 1 | 596.24 | 585.938 | 585.938 |
| 0.598 | 4.368 | 500000 | 100 | 50 | 32 | 100 | no | 53.20 | 9398.50 | 204.29 | 1 | 1493.26 | 1464.844 | 1464.844 |
I wonder if you are comparing vector IDs correctly? Looking at the "merges enabled/disabled" numbers, it doesn't make sense to me that one would be better than the other, since both are force-merged into a single segment in the end. I also don't understand the recall change between parentJoin on main vs. parentJoin in your branch. That is a significant difference, which suggests there is a bug in the test runner or in the code itself. These numbers indeed confuse me. Maybe there is a bug in sparse vector index handling?
I like where this PR is going.
I think this PR is still doing globally unique int ordinals for vectors? Models like ColPali (and ColBERT) will index 100s or as many as 1k vectors per document. This would effectively restrict the number of documents per Lucene segment to roughly 2-20M, much lower than it is now.
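(Roughly: signed int ordinals allow at most ~2.1B vectors per segment, so at 1,000 vectors per document that works out to about 2.1M documents per segment, and at 100 vectors per document about 21M.)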
The parentJoin run on my branch has merges disabled, and loads the extra metadata created for multi-vectors (I haven't added the single-vector optimization yet). Merging should not have had an impact on recall, though. I think there is some bug in how recall is calculated in the parentJoin benchmark today; created mikemccand/luceneutil#333 to track it.
That is correct. I did play around with "long" nodeIDs in an alternate implementation but ran into a whole bunch of challenges:
At one point, I had the idea of representing each vector with an ordinal plus a subOrdinal, packing the two values into a single long. However, in addition to the "long" nodeId challenges above, this approach broke latent assumptions like "vector ordinals are densely packed". For example, we could no longer assume that maxOrdinal is the graph size. Hence I pivoted back to int ordinals and started maintaining the additional ordinal-to-doc metadata instead.
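A minimal sketch of that packing idea, purely for illustration (the class name and bit layout here are assumptions, not code from this PR):

```java
// Illustrative only: pack a vector's (ordinal, subOrdinal) pair into one long,
// with the per-doc ordinal in the upper 32 bits and the subOrdinal in the lower 32 bits.
final class PackedVectorId {
  private PackedVectorId() {}

  static long pack(int ordinal, int subOrdinal) {
    return (((long) ordinal) << 32) | (subOrdinal & 0xFFFFFFFFL);
  }

  static int ordinal(long packed) {
    return (int) (packed >>> 32);
  }

  static int subOrdinal(long packed) {
    return (int) packed;
  }
}
```

Even with such packing, graph code that assumes densely packed int ordinals (e.g. maxOrd equals graph size) breaks, which is the latent-assumption problem described above.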
Yeah, I don't like that we have this right now for
Correct, though filtering is always on docs. I understand the complexity of keeping track of visitation to prevent re-visiting a node. However, I would argue that any graph-based or modern vector index needs to keep track of the vectors it has visited, both to avoid recalculating scores and to avoid further exploration down a previously explored path.
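As a rough illustration of the kind of visited tracking being argued for (this sketch uses Lucene's FixedBitSet and is not code from the PR):

```java
import org.apache.lucene.util.FixedBitSet;

// Illustrative only: remember which vector ordinals a graph search has already
// touched, so scores are not recomputed and explored paths are not re-entered.
final class VisitedOrds {
  private final FixedBitSet visited;

  VisitedOrds(int maxOrd) {
    this.visited = new FixedBitSet(maxOrd);
  }

  /** Returns true the first time an ordinal is seen; false on any revisit. */
  boolean checkAndMark(int ord) {
    if (visited.get(ord)) {
      return false;
    }
    visited.set(ord);
    return true;
  }
}
```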
We can easily write out the graph size and define it as the count of vectors (i.e., the number of non-deleted vectors). True, it gets complex with deleted docs; we would likely need to iterate and count during merges, though I would expect that to be a minor cost on top of all the other work we currently do during merging. I don't think this is a particular problem.
I don't understand how DiskANN would solve any of the previously expressed problems.
No, it wouldn't solve any of these problems. I meant that since we'd be writing new implementations for buildGraph, merging, etc., it might be easier to account for long nodeIds from the get-go. (Not specific to DiskANN; it could be any new implementation.) Overall, I think we should do a separate PR that just shifts from int to long nodeIds.
Ah, I understand and I agree :). I misunderstood your initial comment.
I think if we are to support many vectors for a single doc, this seems like the only way to really do it :(. I realize it will be complex (and not just from an API perspective).
Catching up on this and trying to understand how far we are now from my original idea and implementation. Obviously, my code is completely outdated, but reading across this PR and #13525, it seems we are converging again toward what I originally proposed. I'll work on this for the next couple of weeks, so I should be able to add some comments and additional opinions. The main concern still seems to be moving ordinals to long, as far as I can see :) @benwtrent can you confirm that, to your knowledge, all relevant and active work toward multi-valued vectors in Lucene is effectively aggregated here?
@vigyasharma, from a first superficial pass, I see that this PR touches points similar to my original (now outdated) one. I'll proceed with a deeper review, but if you already know of pain points from my original PR that were not worth porting to 2025, I would be glad to hear about them!
For example, what are the benefits of this compared to the changes I proposed to lucene/core/src/java/org/apache/lucene/util/LongHeap.java in https://github.com/apache/lucene/pull/12314/files?
I'd like to keep the logic that updates scores for already-ingested docs encapsulated within the heap. By returning the array index within the heap (the LongHeap changes in #12314), we shift this responsibility to consumers, like the NeighborQueue changes, which can be trappy and lead to repeated code.
@alessandrobenedetti I think so. This is the latest stab at it.
Indeed, I just don't see how Lucene can actually support multi-valued vectors without switching to long ordinals for the vectors. Otherwise, we enforce some limitation on the number of vectors per segment, or some limitation on the number of vectors per doc (e.g. every doc can only have 256/65535 vectors). Making HNSW indexing & merging roughly 2x more expensive in heap usage (given other constants, it might not be exactly 2x, maybe a little less) is a pretty steep cost, especially for something I am not sure how many folks will actually use.
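(The ~2x figure follows from the node ids held in the graph's neighbor lists and search queues: widening them from 4-byte ints to 8-byte longs roughly doubles that part of the heap footprint, while other per-node data stays the same, which is why it may come out somewhat under 2x.)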
Thanks for looking into this PR @alessandrobenedetti, this is the latest iteration on multi-vector support. It builds on the same central idea of assigning a unique ordinal to each vector and mapping multiple ordinals to a single doc. I tried a few other approaches, but this one seemed cleanest. I think the key difference over #12314 is the set of changes that store metadata letting us map multiple ordinals to a single doc, implemented in the flat vector storage. I also added an
Re: using long for graph node ids, I can see how int ordinals limit the number of vectors we can index per segment. However, adapting to long node ids is also a non-trivial change because of how our current code is structured; I described it in more detail here. I think we should do both (multi-vectors and long node ids) but decouple the two changes. If we can support multi-vectors without impacting performance for existing single-vector use cases, we should add that support. Separately, we should work on moving to long nodeIds, which will give us the full power of multi-vector models.
I agree, I don't think it makes sense to degrade single-valued performance at all (I didn't investigate that, but I trust your judgement on the int->long ordinal impact; let me know in case you want me to double-check). Another option I was pondering is adding a new field type dedicated to multi-valued vectors. P.S. That is aside from the fact that the current nested-vectors approach still suffers from the 2B int ordinal limitation.
I tried this in my first stab at this issue (#13525). IIRC, one concern with a separate field was that it prevents users from later converting their previously single-valued fields to multi-valued vectors if they need to. And since single-valued is a base case of multi-valued, why would anyone even use the single-valued fields? FWIW, that PR (#13525) has pieces that use the separate field and shows the extent of duplication across classes (it's not very much). I had only added support for ColBERT-style dependent multi-vectors, but that can be extended with the independent-vector pieces in this PR.
Agreed. The next step for this PR is to benchmark against parent-join runs and show an improvement, especially in cases where we need query-time scoring on top of all the vector values.
Another take at #12313
The following PR adds support for independent multi-vectors, i.e. scenarios where a single document is represented by multiple independent vector values. The most common example is the passage-vector search use case, where we create a vector for every paragraph chunk in a document.
Currently, Lucene only supports a single vector per document. Users are required to create parent-child relationships between a document and its chunked vectors, and run a `ParentBlockJoin` query for passage vector search. This change allows indexing multiple vectors within the same document.

Each vector is still assigned a unique int ordinal, but multiple ordinals can now map to the same document. We use additional metadata to maintain the many-to-one ordToDoc mapping, and to quickly find the first indexed vector ordinal for a document (called the baseOrdinal, or `baseOrd`). This gives us new APIs that fetch all vectors for a document, which can be used for faster scoring (as opposed to the child-doc query in the ParentJoin approach).
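A simplified, in-memory sketch of what this metadata provides (illustrative only; the actual change stores the mapping in the vector storage format rather than as a plain array, and the class here is hypothetical):

```java
// Illustrative only: many-to-one mapping from vector ordinals to docs.
// ordToDoc[ord] is the doc that owns vector `ord`; a doc's ordinals are
// contiguous, so its first (base) ordinal can be found with a binary search.
final class OrdToDocMap {
  private final int[] ordToDoc; // non-decreasing, one entry per vector ordinal

  OrdToDocMap(int[] ordToDoc) {
    this.ordToDoc = ordToDoc;
  }

  int docForOrd(int ord) {
    return ordToDoc[ord];
  }

  /** Returns the first vector ordinal of {@code docId}, or -1 if the doc has no vectors. */
  int baseOrd(int docId) {
    int lo = 0, hi = ordToDoc.length - 1, first = -1;
    while (lo <= hi) {
      int mid = (lo + hi) >>> 1;
      if (ordToDoc[mid] >= docId) {
        if (ordToDoc[mid] == docId) {
          first = mid;
        }
        hi = mid - 1;
      } else {
        lo = mid + 1;
      }
    }
    return first;
  }
}
```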
**Interface**

The interface to use multi-vector values is quite simple now:
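As a rough sketch of the kind of usage being described (the field name and similarity function are arbitrary here, and this is not the PR's exact API), indexing could look something like this:

```java
import java.io.IOException;
import java.util.List;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.KnnFloatVectorField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.VectorSimilarityFunction;

// Illustrative only: with multi-vector support, one document simply carries
// several values of the same vector field (one per passage chunk), instead of
// being split into parent/child documents for a ParentBlockJoin query.
class MultiVectorIndexingSketch {
  void indexPassages(IndexWriter writer, List<float[]> passageVectors) throws IOException {
    Document doc = new Document();
    for (float[] vector : passageVectors) {
      doc.add(new KnnFloatVectorField("passage_vector", vector, VectorSimilarityFunction.COSINE));
    }
    writer.addDocument(doc); // all passage vectors map back to this one doc
  }
}
```

On the read side, the new ordinal metadata is what lets a scorer ask for all vectors belonging to a matched document and combine their scores, rather than issuing a child-doc query.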
I was able to add a multi-vector benchmark to luceneutil to run this setup end to end. Will link results and a luceneutil PR in comments.
Pending Tasks:
This is an early draft to get some feedback; I have TODOs across the code for future improvements. Here are some big items pending:
- `ScoreMode.Avg` support
Note: This change does not include dependent multi-valued vectors like ColBERT, where the multiple vectors must be used together to compute similarity. It does, however, lay essential groundwork that can subsequently be extended to support them.