Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sparse Jaccard via python interface #109

Open
lmcinnes opened this issue Jan 20, 2022 · 4 comments
Open

Sparse Jaccard via python interface #109

lmcinnes opened this issue Jan 20, 2022 · 4 comments

Comments

@lmcinnes
Copy link

I have been endeavoring to get more ANN libraries working with sparse Jaccard data in ann-benchmarks. There are surprisingly few libraries that support this. I was pleased to see that NGT does have support for sparse Jaccard, however (as far as I can tell) this is only accessible via the C++ API, and not via python (which ann-benchmarks would require).

Is it possible to add sparse Jaccard support to the python interface? If it is not too challenging I could attempt to provide a PR myself, but that would likely take some time, as I have little familiarity with the code.

@masajiro
Copy link
Member

masajiro commented Jan 24, 2022

Sorry for the late response.
Since NGT python bindings (ngtpy) support Jaccard, I think that you only have to update the ann-benchmarks ngt file. If the data format ann-benchmarks required is different from that of ngtpy, please let me know. I will help you.

@lmcinnes
Copy link
Author

Thanks for the reply. My concern was with the input data format for the Jaccard benchmark in ann-benchmarks. It is a (74962 x 27983) sparse matrix. This can be formatted as a boolean numpy array of shape (74962, 27983) which I believe can be fed to NGT's ngtpy jaccard. Given the very high dimensionality this will be slower than it could be. I will try this none the less and get back to you with preliminary results.

It seemed that the SparseJaccard comparator can act directly on the lists of indices of non-zero elements per row -- essentially directly on the sparse data format -- which will be much faster (other algorithms that support the Jaccard benchmark dataset in ann-benchmarks essentially do this). If possible I would prefer to be able to use this approach, as it would give a much fairer comparison in ann-benchmarks.

@masajiro
Copy link
Member

Although I surely implemented Jaccard into NGT, I had almost forgotten that. There are two types of Jaccard for dense and sparse on NGT. However, as you mentioned, it seems that sparse Jaccard is not available on ngtpy. I will try to implement sparse jaccard into ngtpy if it is easy.

@lmcinnes
Copy link
Author

Thanks. My initial efforts with the dense matrix are mostly just timing out on the runs, so I fear it is too expensive. Hopefully the sparse jaccard version is not too challenging to make available in ngtpy. It would certainly be useful for many people, but it is definitely a less used feature, so if it is too difficult to implement, please don't invest time in it. Thanks again for all your work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants