BP Reordering Codec #15430
Use BpVectorReorderer to reorder when merging
OK, after finally having ironed out the bugs, I have some results. The situation is a little complicated, as the change here really doesn't help much with the typical "dense" index where every document has a vector. I think the reason is that any gains are masked by the additional cost of having a node->doc mapping that must be traversed. On the other hand, in the "sparse" case where some documents have no vectors, we already have such a mapping, so we can see the impact of this change more clearly.

Net/net we see improvements in search latency, increasing with index size. On indexes of 1-2MM I see a 5% improvement; on 10MM, a 10% improvement. As expected, merge times increase. It may be possible to reduce the merge times by tweaking the parameters of the BP execution to make it recurse less. I'll see if I can do that while retaining the latency improvements. Then it might be best to enable this only for sparse indexes.
luceneutil results from one run with 10MM Cohere 768-dim wiki documents, with a 50% selectivity filter (I made some changes to luceneutil to be able to create an index with 50% "empty" documents):

It's a little surprising we see recall changes, since the graphs should be the same, merely reordered. They seem to be sometimes worse, sometimes better, and never more than about 0.002 different.
Could it be accumulated floating point errors?
If a graph is built after seeing the vectors in a different order (which can happen in concurrent merging), I would expect slight recall deviations at lower ef_search settings.
I've done many runs, and for the most part I'm seeing identical recall, so I think we can safely ignore these occasional variations. It is probably down to order differences from concurrent indexing, as @benwtrent says. I can also report that the merge time difference is greatly reduced if I enable concurrent reordering by passing in the mergestate intra-merge executor. E.g.:

and then by reducing the reorder MAX_ITERS from 20 to 10 we can regain some more:

(Note: recall is not shown in this table because it's identical. I made a local change to the script to hide columns that don't vary.)
This is a stab at implementing the previously-published BpVectorReorderer as part of the codec. The version we have now can be used as a merge policy to reorder docids by doing BP over a vector field. This version allows the ordering to be done at the vector ordinal level for each field independently, while docids may be sorted by some other IndexSort.
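To make the ordinal-level idea concrete, here is a minimal sketch in plain Java of what reordering vector ordinals without touching docids involves. It is not the code in this PR and uses no Lucene classes: the class `OrdinalReorderSketch`, its methods, and the `newToOld` permutation (standing in for whatever order BP produces) are all hypothetical names for illustration. The point it shows is that once ordinals are permuted, an explicit ord->doc table is needed even if the original mapping was the identity, and the graph's neighbor lists must be rewritten into the new ordinal space.

```java
import java.util.Arrays;

/** Hypothetical illustration only; none of these names come from Lucene. */
public class OrdinalReorderSketch {

  /** Inverts a permutation: if newToOld[n] == o, then the result maps o to n. */
  static int[] invert(int[] newToOld) {
    int[] oldToNew = new int[newToOld.length];
    for (int n = 0; n < newToOld.length; n++) {
      oldToNew[newToOld[n]] = n;
    }
    return oldToNew;
  }

  /** Builds the ord->doc table for the reordered field; docids themselves are unchanged. */
  static int[] remapOrdToDoc(int[] oldOrdToDoc, int[] newToOld) {
    int[] newOrdToDoc = new int[newToOld.length];
    for (int n = 0; n < newToOld.length; n++) {
      newOrdToDoc[n] = oldOrdToDoc[newToOld[n]];
    }
    return newOrdToDoc;
  }

  /** Rewrites adjacency lists so that node ids and neighbor ids are new ordinals. */
  static int[][] remapGraph(int[][] oldNeighbors, int[] newToOld) {
    int[] oldToNew = invert(newToOld);
    int[][] result = new int[newToOld.length][];
    for (int n = 0; n < newToOld.length; n++) {
      int[] old = oldNeighbors[newToOld[n]];
      int[] mapped = new int[old.length];
      for (int i = 0; i < old.length; i++) {
        mapped[i] = oldToNew[old[i]];
      }
      Arrays.sort(mapped); // keep neighbor lists in ascending ordinal order
      result[n] = mapped;
    }
    return result;
  }

  public static void main(String[] args) {
    // Sparse field: 4 vectors living in docs 1, 3, 4 and 7.
    int[] ordToDoc = {1, 3, 4, 7};
    // Assumed output of a reordering pass: new ordinal n holds old ordinal newToOld[n].
    int[] newToOld = {2, 0, 3, 1};
    int[][] neighbors = {{1, 2}, {0}, {0, 3}, {2}};

    System.out.println(Arrays.toString(remapOrdToDoc(ordToDoc, newToOld)));
    System.out.println(Arrays.deepToString(remapGraph(neighbors, newToOld)));
  }
}
```

In a dense field the starting `ordToDoc` would simply be `{0, 1, 2, 3}`, which normally needs no table at all; after the permutation it no longer is the identity, which is one way to read the extra mapping cost discussed above for dense indexes.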
The code is fully working as far as I can tell, and I think the design is reasonable, but some TODOs and nocommits remain: