Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add some basic HNSW graph checks to CheckIndex #13984

Open
wants to merge 11 commits into
base: main
Choose a base branch
from

Conversation

benchaplin
Copy link
Contributor

Description

This is a first pass on #13724 - I've added the following checks:

  • verify neighbor list is in order
  • verify no duplicate neighbors

The code was adapted from KnnGraphTester in luceneutil.

Here's some test results from an index containing the cohere-wikipedia-docs-768d dataset:

    test: open reader.........OK [took 0.010 sec]
    test: check integrity.....OK [took 2.207 sec]
    test: check live docs.....OK [took 0.000 sec]
    test: field infos.........OK [2 fields] [took 0.000 sec]
    test: field norms.........OK [0 fields] [took 0.000 sec]
    test: terms, freq, prox...    test: stored fields.......OK [1500000 total field count; avg 1.0 fields per doc] [took 0.398 sec]
    test: term vectors........OK [0 total term vector count; avg 0.0 term/freq vector fields per doc] [took 0.000 sec]
    test: docvalues...........OK [0 docvalues fields; 0 BINARY; 0 NUMERIC; 0 SORTED; 0 SORTED_NUMERIC; 0 SORTED_SET; 0 SKIPPING INDEX] [took 0.000 sec]
    test: points..............OK [0 fields, 0 points] [took 0.000 sec]
    test: vectors.............OK [1 fields, 1500000 vectors] [took 0.497 sec]
    test: hnsw graph..........OK [4 levels, 1547684 nodes (over all levels)] [took 0.485 sec]

No problems were detected with this index.

I manually corrupted some neighbors lists in order to check the failure cases:

    test: open reader.........OK [took 0.012 sec]
    test: check integrity.....OK [took 3.633 sec]
    test: check live docs.....OK [took 0.000 sec]
    test: field infos.........OK [2 fields] [took 0.000 sec]
    test: field norms.........OK [0 fields] [took 0.000 sec]
    test: terms, freq, prox...    test: stored fields.......OK [1500000 total field count; avg 1.0 fields per doc] [took 0.393 sec]
    test: term vectors........OK [0 total term vector count; avg 0.0 term/freq vector fields per doc] [took 0.000 sec]
    test: docvalues...........OK [0 docvalues fields; 0 BINARY; 0 NUMERIC; 0 SORTED; 0 SORTED_NUMERIC; 0 SORTED_SET; 0 SKIPPING INDEX] [took 0.000 sec]
    test: points..............OK [0 fields, 0 points] [took 0.000 sec]
    test: vectors.............OK [1 fields, 1500000 vectors] [took 0.504 sec]
    test: hnsw graph..........ERROR: org.apache.lucene.index.CheckIndex$CheckIndexException: Neighbors out of order for node 31142: 839445<1326517 1st=33190
org.apache.lucene.index.CheckIndex$CheckIndexException: Neighbors out of order for node 31142: 839445<1326517 1st=33190
	at [email protected]/org.apache.lucene.index.CheckIndex.testHnswGraph(CheckIndex.java:2798)
	at [email protected]/org.apache.lucene.index.CheckIndex.testSegment(CheckIndex.java:1109)
	at [email protected]/org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:806)
	at [email protected]/org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:576)
	at [email protected]/org.apache.lucene.index.CheckIndex.doCheck(CheckIndex.java:4479)
	at [email protected]/org.apache.lucene.index.CheckIndex.doMain(CheckIndex.java:4316)
	at [email protected]/org.apache.lucene.index.CheckIndex.main(CheckIndex.java:4248)

I intend to dig into the HNSW creation and think about more checks that would be useful here, but I wanted to start small with these neighbor checks only.

@tteofili
Copy link
Contributor

this might also check for graph connectivity, see #12627

Copy link
Member

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for tackling this @benchaplin! This is a blind spot for CheckIndex today, which is scary. I left some minor comments...

private static HnswGraph getHnswGraph(CodecReader reader) throws IOException {
KnnVectorsReader vectorsReader = reader.getVectorReader();
if (vectorsReader instanceof PerFieldKnnVectorsFormat.FieldsReader) {
vectorsReader = ((PerFieldKnnVectorsFormat.FieldsReader) vectorsReader).getFieldReader("knn");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm the field can be anything (not necessarily knn like luceneutil uses)? Can we change this so we iterate FieldInfos and for any field that has KNN vectors enabled (FieldInfo.hasVectorValues, hmm but also has the HNSW graph (since I think a field can be only "flat" vectors?)) we do this check?

Copy link
Contributor Author

@benchaplin benchaplin Nov 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I didn't quite understand fields when I wrote this - I think I get it now. Alright, I've done what you suggested (as is also done in testVectors) and iterated over FieldInfos, performing the check only when it applies.

Because we might now parse several HNSW graphs, I've restructured the status object to support per-graph data. Successful output will now look like:

    test: open reader.........OK [took 0.010 sec]
    test: check integrity.....OK [took 2.216 sec]
    test: check live docs.....OK [took 0.000 sec]
    test: field infos.........OK [2 fields] [took 0.000 sec]
    test: field norms.........OK [0 fields] [took 0.000 sec]
    test: terms, freq, prox...    test: stored fields.......OK [1500000 total field count; avg 1.0 fields per doc] [took 0.390 sec]
    test: term vectors........OK [0 total term vector count; avg 0.0 term/freq vector fields per doc] [took 0.000 sec]
    test: docvalues...........OK [0 docvalues fields; 0 BINARY; 0 NUMERIC; 0 SORTED; 0 SORTED_NUMERIC; 0 SORTED_SET; 0 SKIPPING INDEX] [took 0.000 sec]
    test: points..............OK [0 fields, 0 points] [took 0.000 sec]
    test: vectors.............OK [1 fields, 1500000 vectors] [took 0.496 sec]
    test: hnsw graphs.........OK [2 fields: (field name: knn1, levels: 4, total nodes: 1547684), (field name: knn2, levels: 4, total nodes: 1547684)] [took 0.979 sec]

testVectors doesn't do this, it just sums vectors over all fields. I could do that too, but this felt most complete.

}
status.hnswGraphSize++;
}
status.hsnwGraphNumLevels++;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of incrementing here, you could just set it to numLevels from above?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could, but wanted to respect the object as a "status" - so if a check fails, numLevels will indicate on which level it failed. However we don't print the status if an exception is thrown. What do you think?

@mikemccand
Copy link
Member

this might also check for graph connectivity, see #12627

+1 -- this is a tricky thing about these HNSW graphs (that they are not necessarily born connected but rather must be stitched up somehow afterwards).

But maybe do this separately? I'd love to just get this initial coverage in first...

@mikemccand
Copy link
Member

Actually, CheckIndex does have some coverage for vectors and KNN graph (it confirms it can enumerate all vectors, and also runs some searches on it if it's not just flat vectors (checkByte/FloatVectorValues) , so it's not completely blind :)

hnswGraph.seek(level, node);
int nbr, lastNeighbor = -1, firstNeighbor = -1;
while ((nbr = hnswGraph.nextNeighbor()) != NO_MORE_DOCS) {
if (firstNeighbor == -1) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps at some point we could also check for uniqueness of the neighbors, and also check that the neighbors are in range [0, numGraphNodes], and finally on levels > 0, we would want to assert that the neighbors are on this level. But this can all be done separately; maybe we could add a comment here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, I'll add a TODO (and add those checks in a followup PR). A couple questions first:

  • Doesn't (neighbors are in order) and (neighbors are not repeated) verify uniqueness?
  • Also, why wouldn't we assert neighbors are on this level for level 0?

@msokolov
Copy link
Contributor

> Doesn't (neighbors are in order) and (neighbors are not repeated) verify uniqueness?

duh, yes :)

> Also, why wouldn't we assert neighbors are on this level for level 0?

yes, we should, it's just that for level 0 this is the same as node in [0, size) wheras other levels are sparse - they have the same max = size-1, but not every node is included

@benchaplin
Copy link
Contributor Author

Gotcha, thanks for the comments @msokolov!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants