-
Notifications
You must be signed in to change notification settings - Fork 1.3k
Make flatVectorsFormat
injectable in Lucene99HnswVectorsFormat
to allow custom format and scorers
#15090
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Make flatVectorsFormat
injectable in Lucene99HnswVectorsFormat
to allow custom format and scorers
#15090
Conversation
Signed-off-by: ManasviGoyal <[email protected]>
This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR. |
@ChrisHegarty @benwtrent Please have a look. Thanks! |
All lucene formats are named based. Meaning, no arguments can actually be applied to the reader. How does this actually work in practice? I would expect a new format class is required for any different flat format that is used for HNSW because of names format loader |
Yes it is correct that each distinct flat format still needs its own named This PR just lets |
This isn't how it works: you should create an outer format for each variation. We can't support backwards compatibility for custom formats. |
More context: Currently we do have outer HNSW format wrappers for each variation of flat vectors format. But in order to do so we are creating multiple duplicates of Can you please elaborate on your point regarding backward compatibility concerns? |
I understand the desire to get new changes out of the box. However, all formats are named based. If there was a substantial change to the HNSW format that required a new name, you would need a new inherited class that provides a NEW name for your format that utilizes a new flat format. The SPI loader cannot provide any parameters. When constructing the reader, its done with the default ctor (e.g. This API change with how things are now just will not work. Further complexity in the named format loader to handle recursively named things just seems way too complicated to justify the 20-30 lines of code saved. |
I can understand the reasoning behind this request - I encountered a somewhat similar situation in the past and had considered making a similar change, but didn't (for the same reasons as given by @benwtrent and @rmuir). That said, I'm not sure this PR is addressing the core issue. If there's a meaningful, reusable piece here, it might make sense to refactor it out of the format so that custom formats can take advantage of it - but it's unclear to me what that reusable part would be. |
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution! |
@rmuir @ChrisHegarty @msokolov I have been thinking about this. It would be useful to have a true injectable format so that Lucene itself doesn't have multiple variations of HNSW formats. What if we updated the HNSW format to write the inner format name into its metadata? Then it could use the loader when constructing its reader given the name written metadata. This way we still contain the customization to the outer format. A concern that I have is that this may require too much knowledge for users (e.g. to use quantized vectors, be sure to use HNSW with a quantized inner format, etc.). |
It reminds me of something I learned about at ApacheCoC where @kaivalnp presented the Faiss vectors format -- FAISS expects its users to provide a complex string describing the format it supports including the algorithm name, the parameters specific to that algorithm, pre-steps like dimensionality reduction and quantization, and even ganging together multiple steps in case there is a re-scoring phase. All of this encoded in a compact incomprehensible string, that of course can have syntax errors and impossible combinations. In theory we could support something like that, but I think (?) we'd prefer something more structured and less error-prone. I hope we can find a middle ground between our current situation, which is totally safe but extremely rigid, and something totally loose that can only do runtime checking like FAISS has. I guess people working in Java are here at least in part for the type-safety and compile-time checking, but perhpas we could create a format - builder? Java developers love builders, right :)? |
I mean at the end of the day we need to have a format name that corresponds to a class name so that SPI can load the format, but the format itself can have some parameterization that is saved to the index, as @benwtrent was suggesting. The only change I'm suggesting here is we can provide guidance on how to construct this parameterization using a builder class that has defaults, clearly-named and documented methods, checks the validity of the combinations of parameters it's given and so on. We want to have a similar such thing for FAISS too... |
Right LOL
Yeah, I definitely don't want to do that. I think something pretty simple, like simply writing "inner" names to the metadata and using the named loader works pretty well. Then things that provide more "options" can provide those options via some nicer API (whatever that is, possibly just different ctors). This seems pretty nice to me (vs. what we have now, fully fqdn stuff):
Then the writers keep track of their inner values, which recursively keep track in their own metadata. Of course, we can make the format declarations nicer (e.g. a builder). But that is just fancy My concern is that: Are we cool with writing that stuff to metadata and loading? If so, I can make a chnage this week so we can see it in action and then go through the process of adding a new HNSW format that satisfies this. |
|
I think it's fine, thanks for volunteering. This should make extensibility, configuration, and so on easier. As this gets more complex I think we need it. It's not easy to find a one-size-fits-all setting; we know for example that different quantization settings are appropriate with different use cases and vector data varies a lot in its compressibility. Unless we're willing to fully embed hyper-parameter optimization into Lucene (and we shouldn't, it's too much magic) I think we have to expose the complexity to users so they can choose. |
This PR makes
Lucene99HnswVectorsFormat
accept an injectedFlatVectorsFormat
to allow custom flatvector format to be used with the existing codec.For testing and BWC, the original package-private 5-arg constructor is retained. No on-disk format or runtime behavior changes occur unless a custom format/scorer is provided.