-
Notifications
You must be signed in to change notification settings - Fork 116
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Sparse Jaccard via python interface #109
Comments
Sorry for the late response. |
Thanks for the reply. My concern was with the input data format for the Jaccard benchmark in ann-benchmarks. It is a (74962 x 27983) sparse matrix. This can be formatted as a boolean numpy array of shape (74962, 27983) which I believe can be fed to NGT's ngtpy jaccard. Given the very high dimensionality this will be slower than it could be. I will try this none the less and get back to you with preliminary results. It seemed that the SparseJaccard comparator can act directly on the lists of indices of non-zero elements per row -- essentially directly on the sparse data format -- which will be much faster (other algorithms that support the Jaccard benchmark dataset in ann-benchmarks essentially do this). If possible I would prefer to be able to use this approach, as it would give a much fairer comparison in ann-benchmarks. |
Although I surely implemented Jaccard into NGT, I had almost forgotten that. There are two types of Jaccard for dense and sparse on NGT. However, as you mentioned, it seems that sparse Jaccard is not available on ngtpy. I will try to implement sparse jaccard into ngtpy if it is easy. |
Thanks. My initial efforts with the dense matrix are mostly just timing out on the runs, so I fear it is too expensive. Hopefully the sparse jaccard version is not too challenging to make available in ngtpy. It would certainly be useful for many people, but it is definitely a less used feature, so if it is too difficult to implement, please don't invest time in it. Thanks again for all your work. |
I have been endeavoring to get more ANN libraries working with sparse Jaccard data in ann-benchmarks. There are surprisingly few libraries that support this. I was pleased to see that NGT does have support for sparse Jaccard, however (as far as I can tell) this is only accessible via the C++ API, and not via python (which ann-benchmarks would require).
Is it possible to add sparse Jaccard support to the python interface? If it is not too challenging I could attempt to provide a PR myself, but that would likely take some time, as I have little familiarity with the code.
The text was updated successfully, but these errors were encountered: