Skip to content

Conversation

ManasviGoyal
Copy link

@ManasviGoyal ManasviGoyal commented Aug 19, 2025

This PR makes Lucene99HnswVectorsFormat accept an injected FlatVectorsFormat to allow custom flatvector format to be used with the existing codec.

For testing and BWC, the original package-private 5-arg constructor is retained. No on-disk format or runtime behavior changes occur unless a custom format/scorer is provided.

@ManasviGoyal ManasviGoyal marked this pull request as draft August 19, 2025 04:24
Copy link
Contributor

This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.

@ManasviGoyal
Copy link
Author

@ChrisHegarty @benwtrent Please have a look. Thanks!

@benwtrent
Copy link
Member

All lucene formats are named based. Meaning, no arguments can actually be applied to the reader.

How does this actually work in practice?

I would expect a new format class is required for any different flat format that is used for HNSW because of names format loader

@ManasviGoyal
Copy link
Author

ManasviGoyal commented Aug 20, 2025

All lucene formats are named based. Meaning, no arguments can actually be applied to the reader.

How does this actually work in practice?

I would expect a new format class is required for any different flat format that is used for HNSW because of names format loader

Yes it is correct that each distinct flat format still needs its own named FlatVectorsFormat

This PR just lets Lucene99HnswVectorsFormat delegate to whichever named flat format is provided at write time, so we don’t need a new HNSW outer format class for each variation. HNSW stays the same and delegates to the named flat format chosen at write time.

@rmuir
Copy link
Member

rmuir commented Aug 20, 2025

This isn't how it works: you should create an outer format for each variation.

We can't support backwards compatibility for custom formats.

@ManasviGoyal
Copy link
Author

This isn't how it works: you should create an outer format for each variation.

We can't support backwards compatibility for custom formats.

More context: Currently we do have outer HNSW format wrappers for each variation of flat vectors format. But in order to do so we are creating multiple duplicates of Lucene99HnswVectorsFormat with the only change being the flat vectors format. This means losing out on any future updates Lucene99HnswVectorsFormat and manually getting the changes for each variation.

Can you please elaborate on your point regarding backward compatibility concerns?

@benwtrent
Copy link
Member

Currently we do have outer HNSW format wrappers for each variation of flat vectors format. But in order to do so we are creating multiple duplicates of Lucene99HnswVectorsFormat with the only change being the flat vectors format. This means losing out on any future updates Lucene99HnswVectorsFormat and manually getting the changes for each variation.

I understand the desire to get new changes out of the box. However, all formats are named based. If there was a substantial change to the HNSW format that required a new name, you would need a new inherited class that provides a NEW name for your format that utilizes a new flat format.

The SPI loader cannot provide any parameters. When constructing the reader, its done with the default ctor (e.g. new Lucene99HnswVectorsFormat()). Without a custom name, you lose your custom flat format.

This API change with how things are now just will not work. Further complexity in the named format loader to handle recursively named things just seems way too complicated to justify the 20-30 lines of code saved.

@ChrisHegarty
Copy link
Contributor

ChrisHegarty commented Aug 28, 2025

I can understand the reasoning behind this request - I encountered a somewhat similar situation in the past and had considered making a similar change, but didn't (for the same reasons as given by @benwtrent and @rmuir).

That said, I'm not sure this PR is addressing the core issue. If there's a meaningful, reusable piece here, it might make sense to refactor it out of the format so that custom formats can take advantage of it - but it's unclear to me what that reusable part would be.

Copy link
Contributor

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Sep 12, 2025
@benwtrent
Copy link
Member

@rmuir @ChrisHegarty @msokolov

I have been thinking about this. It would be useful to have a true injectable format so that Lucene itself doesn't have multiple variations of HNSW formats.

What if we updated the HNSW format to write the inner format name into its metadata? Then it could use the loader when constructing its reader given the name written metadata.

This way we still contain the customization to the outer format.

A concern that I have is that this may require too much knowledge for users (e.g. to use quantized vectors, be sure to use HNSW with a quantized inner format, etc.).

@github-actions github-actions bot removed the Stale label Sep 14, 2025
@msokolov
Copy link
Contributor

It reminds me of something I learned about at ApacheCoC where @kaivalnp presented the Faiss vectors format -- FAISS expects its users to provide a complex string describing the format it supports including the algorithm name, the parameters specific to that algorithm, pre-steps like dimensionality reduction and quantization, and even ganging together multiple steps in case there is a re-scoring phase. All of this encoded in a compact incomprehensible string, that of course can have syntax errors and impossible combinations. In theory we could support something like that, but I think (?) we'd prefer something more structured and less error-prone. I hope we can find a middle ground between our current situation, which is totally safe but extremely rigid, and something totally loose that can only do runtime checking like FAISS has. I guess people working in Java are here at least in part for the type-safety and compile-time checking, but perhpas we could create a format - builder? Java developers love builders, right :)?

@msokolov
Copy link
Contributor

I mean at the end of the day we need to have a format name that corresponds to a class name so that SPI can load the format, but the format itself can have some parameterization that is saved to the index, as @benwtrent was suggesting. The only change I'm suggesting here is we can provide guidance on how to construct this parameterization using a builder class that has defaults, clearly-named and documented methods, checks the validity of the combinations of parameters it's given and so on. We want to have a similar such thing for FAISS too...

@benwtrent
Copy link
Member

I guess people working in Java are here at least in part for the type-safety and compile-time checking, but perhpas we could create a format - builder? Java developers love builders, right :)?

Right LOL

FAISS expects its users to provide a complex string describing the format it supports including the algorithm name, the parameters specific to that algorithm, pre-steps like dimensionality reduction and quantization, and even ganging together multiple steps in case there is a re-scoring phase....

Yeah, I definitely don't want to do that.

I think something pretty simple, like simply writing "inner" names to the metadata and using the named loader works pretty well. Then things that provide more "options" can provide those options via some nicer API (whatever that is, possibly just different ctors).

This seems pretty nice to me (vs. what we have now, fully fqdn stuff):

new Lucene99Hnsw(int M, ..., new Lucene104Quantized(int bits, ..., new Lucene99FlatRaw(...)));

Then the writers keep track of their inner values, which recursively keep track in their own metadata.

Of course, we can make the format declarations nicer (e.g. a builder). But that is just fancy ctors.

My concern is that:

Are we cool with writing that stuff to metadata and loading? If so, I can make a chnage this week so we can see it in action and then go through the process of adding a new HNSW format that satisfies this.

@msokolov
Copy link
Contributor

Are we cool with writing that stuff to metadata and loading? If so, I can make a chnage this week so we can see it in action and then go through the process of adding a new HNSW format that satisfies this.

@msokolov
Copy link
Contributor

Are we cool with writing that stuff to metadata and loading? If so, I can make a chnage this week so we can see it in action and then go through the process of adding a new HNSW format that satisfies this.

I think it's fine, thanks for volunteering. This should make extensibility, configuration, and so on easier. As this gets more complex I think we need it. It's not easy to find a one-size-fits-all setting; we know for example that different quantization settings are appropriate with different use cases and vector data varies a lot in its compressibility. Unless we're willing to fully embed hyper-parameter optimization into Lucene (and we shouldn't, it's too much magic) I think we have to expose the complexity to users so they can choose.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants