Make `flatVectorsFormat` injectable in `Lucene99HnswVectorsFormat` to allow custom format and scorers #15090

ManasviGoyal · 2025-08-19T04:24:04Z

This PR makes Lucene99HnswVectorsFormat accept an injected FlatVectorsFormat to allow custom flatvector format to be used with the existing codec.

For testing and BWC, the original package-private 5-arg constructor is retained. No on-disk format or runtime behavior changes occur unless a custom format/scorer is provided.

Signed-off-by: ManasviGoyal <[email protected]>

github-actions · 2025-08-19T04:24:56Z

This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.

ManasviGoyal · 2025-08-19T17:41:01Z

@ChrisHegarty @benwtrent Please have a look. Thanks!

benwtrent · 2025-08-20T16:39:11Z

All lucene formats are named based. Meaning, no arguments can actually be applied to the reader.

How does this actually work in practice?

I would expect a new format class is required for any different flat format that is used for HNSW because of names format loader

ManasviGoyal · 2025-08-20T20:26:07Z

All lucene formats are named based. Meaning, no arguments can actually be applied to the reader.

How does this actually work in practice?

I would expect a new format class is required for any different flat format that is used for HNSW because of names format loader

Yes it is correct that each distinct flat format still needs its own named FlatVectorsFormat

This PR just lets Lucene99HnswVectorsFormat delegate to whichever named flat format is provided at write time, so we don’t need a new HNSW outer format class for each variation. HNSW stays the same and delegates to the named flat format chosen at write time.

rmuir · 2025-08-20T21:18:42Z

This isn't how it works: you should create an outer format for each variation.

We can't support backwards compatibility for custom formats.

ManasviGoyal · 2025-08-20T22:42:22Z

This isn't how it works: you should create an outer format for each variation.

We can't support backwards compatibility for custom formats.

More context: Currently we do have outer HNSW format wrappers for each variation of flat vectors format. But in order to do so we are creating multiple duplicates of Lucene99HnswVectorsFormat with the only change being the flat vectors format. This means losing out on any future updates Lucene99HnswVectorsFormat and manually getting the changes for each variation.

Can you please elaborate on your point regarding backward compatibility concerns?

benwtrent · 2025-08-25T13:28:15Z

Currently we do have outer HNSW format wrappers for each variation of flat vectors format. But in order to do so we are creating multiple duplicates of Lucene99HnswVectorsFormat with the only change being the flat vectors format. This means losing out on any future updates Lucene99HnswVectorsFormat and manually getting the changes for each variation.

I understand the desire to get new changes out of the box. However, all formats are named based. If there was a substantial change to the HNSW format that required a new name, you would need a new inherited class that provides a NEW name for your format that utilizes a new flat format.

The SPI loader cannot provide any parameters. When constructing the reader, its done with the default ctor (e.g. new Lucene99HnswVectorsFormat()). Without a custom name, you lose your custom flat format.

This API change with how things are now just will not work. Further complexity in the named format loader to handle recursively named things just seems way too complicated to justify the 20-30 lines of code saved.

ChrisHegarty · 2025-08-28T11:09:23Z

I can understand the reasoning behind this request - I encountered a somewhat similar situation in the past and had considered making a similar change, but didn't (for the same reasons as given by @benwtrent and @rmuir).

That said, I'm not sure this PR is addressing the core issue. If there's a meaningful, reusable piece here, it might make sense to refactor it out of the format so that custom formats can take advantage of it - but it's unclear to me what that reusable part would be.

github-actions · 2025-09-12T00:24:26Z

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

benwtrent · 2025-09-13T14:49:38Z

@rmuir @ChrisHegarty @msokolov

I have been thinking about this. It would be useful to have a true injectable format so that Lucene itself doesn't have multiple variations of HNSW formats.

What if we updated the HNSW format to write the inner format name into its metadata? Then it could use the loader when constructing its reader given the name written metadata.

This way we still contain the customization to the outer format.

A concern that I have is that this may require too much knowledge for users (e.g. to use quantized vectors, be sure to use HNSW with a quantized inner format, etc.).

msokolov · 2025-09-15T13:31:13Z

It reminds me of something I learned about at ApacheCoC where @kaivalnp presented the Faiss vectors format -- FAISS expects its users to provide a complex string describing the format it supports including the algorithm name, the parameters specific to that algorithm, pre-steps like dimensionality reduction and quantization, and even ganging together multiple steps in case there is a re-scoring phase. All of this encoded in a compact incomprehensible string, that of course can have syntax errors and impossible combinations. In theory we could support something like that, but I think (?) we'd prefer something more structured and less error-prone. I hope we can find a middle ground between our current situation, which is totally safe but extremely rigid, and something totally loose that can only do runtime checking like FAISS has. I guess people working in Java are here at least in part for the type-safety and compile-time checking, but perhpas we could create a format - builder? Java developers love builders, right :)?

msokolov · 2025-09-15T13:34:59Z

I mean at the end of the day we need to have a format name that corresponds to a class name so that SPI can load the format, but the format itself can have some parameterization that is saved to the index, as @benwtrent was suggesting. The only change I'm suggesting here is we can provide guidance on how to construct this parameterization using a builder class that has defaults, clearly-named and documented methods, checks the validity of the combinations of parameters it's given and so on. We want to have a similar such thing for FAISS too...

benwtrent · 2025-09-15T13:37:20Z

I guess people working in Java are here at least in part for the type-safety and compile-time checking, but perhpas we could create a format - builder? Java developers love builders, right :)?

Right LOL

FAISS expects its users to provide a complex string describing the format it supports including the algorithm name, the parameters specific to that algorithm, pre-steps like dimensionality reduction and quantization, and even ganging together multiple steps in case there is a re-scoring phase....

Yeah, I definitely don't want to do that.

I think something pretty simple, like simply writing "inner" names to the metadata and using the named loader works pretty well. Then things that provide more "options" can provide those options via some nicer API (whatever that is, possibly just different ctors).

This seems pretty nice to me (vs. what we have now, fully fqdn stuff):

new Lucene99Hnsw(int M, ..., new Lucene104Quantized(int bits, ..., new Lucene99FlatRaw(...)));

Then the writers keep track of their inner values, which recursively keep track in their own metadata.

Of course, we can make the format declarations nicer (e.g. a builder). But that is just fancy ctors.

My concern is that:

Are we cool with writing that stuff to metadata and loading? If so, I can make a chnage this week so we can see it in action and then go through the process of adding a new HNSW format that satisfies this.

msokolov · 2025-09-15T13:47:59Z

Are we cool with writing that stuff to metadata and loading? If so, I can make a chnage this week so we can see it in action and then go through the process of adding a new HNSW format that satisfies this.

msokolov · 2025-09-15T13:50:50Z

Are we cool with writing that stuff to metadata and loading? If so, I can make a chnage this week so we can see it in action and then go through the process of adding a new HNSW format that satisfies this.

I think it's fine, thanks for volunteering. This should make extensibility, configuration, and so on easier. As this gets more complex I think we need it. It's not easy to find a one-size-fits-all setting; we know for example that different quantization settings are appropriate with different use cases and vector data varies a lot in its compressibility. Unless we're willing to fully embed hyper-parameter optimization into Lucene (and we shouldn't, it's too much magic) I think we have to expose the complexity to users so they can choose.

Make flatVectorsFormat injectable for custom format and scorers

977b7ea

Signed-off-by: ManasviGoyal <[email protected]>

github-actions bot added the module:core/codecs label Aug 19, 2025

ManasviGoyal marked this pull request as draft August 19, 2025 04:24

ManasviGoyal mentioned this pull request Aug 19, 2025

Enable FP16 support for Exact Search and Lucene ANN opensearch-project/k-NN#2850

Merged

5 tasks

ManasviGoyal marked this pull request as ready for review August 19, 2025 16:11

shatejas approved these changes Aug 19, 2025

View reviewed changes

ManasviGoyal mentioned this pull request Aug 20, 2025

[FEATURE] [RFC] FP16 Support for Exact Search and Lucene ANN opensearch-project/k-NN#2811

Open

github-actions bot added the Stale label Sep 12, 2025

github-actions bot removed the Stale label Sep 14, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Make `flatVectorsFormat` injectable in `Lucene99HnswVectorsFormat` to allow custom format and scorers #15090

Make `flatVectorsFormat` injectable in `Lucene99HnswVectorsFormat` to allow custom format and scorers #15090

Uh oh!

ManasviGoyal commented Aug 19, 2025 •

edited

Loading

Uh oh!

github-actions bot commented Aug 19, 2025

Uh oh!

ManasviGoyal commented Aug 19, 2025

Uh oh!

benwtrent commented Aug 20, 2025

Uh oh!

ManasviGoyal commented Aug 20, 2025 •

edited

Loading

Uh oh!

rmuir commented Aug 20, 2025

Uh oh!

ManasviGoyal commented Aug 20, 2025

Uh oh!

benwtrent commented Aug 25, 2025

Uh oh!

ChrisHegarty commented Aug 28, 2025 •

edited

Loading

Uh oh!

github-actions bot commented Sep 12, 2025

Uh oh!

benwtrent commented Sep 13, 2025

Uh oh!

msokolov commented Sep 15, 2025

Uh oh!

msokolov commented Sep 15, 2025

Uh oh!

benwtrent commented Sep 15, 2025

Uh oh!

msokolov commented Sep 15, 2025

Uh oh!

msokolov commented Sep 15, 2025

Uh oh!

Uh oh!

Make flatVectorsFormat injectable in Lucene99HnswVectorsFormat to allow custom format and scorers #15090

Are you sure you want to change the base?

Make flatVectorsFormat injectable in Lucene99HnswVectorsFormat to allow custom format and scorers #15090

Uh oh!

Conversation

ManasviGoyal commented Aug 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions bot commented Aug 19, 2025

Uh oh!

ManasviGoyal commented Aug 19, 2025

Uh oh!

benwtrent commented Aug 20, 2025

Uh oh!

ManasviGoyal commented Aug 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rmuir commented Aug 20, 2025

Uh oh!

ManasviGoyal commented Aug 20, 2025

Uh oh!

benwtrent commented Aug 25, 2025

Uh oh!

ChrisHegarty commented Aug 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions bot commented Sep 12, 2025

Uh oh!

benwtrent commented Sep 13, 2025

Uh oh!

msokolov commented Sep 15, 2025

Uh oh!

msokolov commented Sep 15, 2025

Uh oh!

benwtrent commented Sep 15, 2025

Uh oh!

msokolov commented Sep 15, 2025

Uh oh!

msokolov commented Sep 15, 2025

Uh oh!

Uh oh!

Make `flatVectorsFormat` injectable in `Lucene99HnswVectorsFormat` to allow custom format and scorers #15090

Make `flatVectorsFormat` injectable in `Lucene99HnswVectorsFormat` to allow custom format and scorers #15090

ManasviGoyal commented Aug 19, 2025 •

edited

Loading

ManasviGoyal commented Aug 20, 2025 •

edited

Loading

ChrisHegarty commented Aug 28, 2025 •

edited

Loading