-
Notifications
You must be signed in to change notification settings - Fork 347
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
use fast immutable bitset for better bloom filter #716
Comments
I had tried out this implementation from cats and have the code in this branch It is indeed much faster. The benchmark results are here. I did this a few months back, and I would need to check some of the finer details again.
|
I'm looking a bit into this ... I'll make a draft PR so we can talk about it and I'll post some numbers as well. @johnynek you want to keep the old impl in |
@regadas Ideally, I really hate to leave people high-and-dry.... So, the most compatibility we can retain the better. That said, the above numbers seem ambiguous: like, using the immutable bitset was only a win for some cases. It would be nice if we could tweak things so that we always have a win. We can copy the implementation into this repo, I think, especially if we might want to try some optimizations to improve performance. One thing that we didn't do with old implementations is try to get the size information at the type-level. I think that is a really good idea since it makes sure the bitsizes are aligned and serializations don't need to write out the size. So, that is something worth exploring. The two ways to do this are to make a module:
or alternatively
|
Sorry for the late reply @johnynek!
Totally agree, I moved the current impl into a
In the numbers below I think i managed to get it on par with the I had to cheat in on case ... I believe there's some other minor things we can optimize where and there I'll need to play around a little bit more. I spotted yours and @non work on improving some of them and there's also some new ones like
|
@johnynek I was looking at this and I didn't actually commented the sized bloom filter;
These are very good points! it's easy to make a mistake and yeah improving on serialization is always a big plus. I guess we should really explore the idea, but before that I just wanted to see how far we could take the new BitSet and see if we could actually replace the existing one. |
Closed via #840 |
see: https://github.com/typelevel/cats-collections/blob/master/core/src/main/scala/cats/collections/BitSet.scala
@non and I implemented that to see if we could make a fast, sparse, immutable bitset. We could! In fact, for some operations it is faster than mutable, and for the use case of a Monoid where we have to copy inputs, it is much faster.
We could copy that implementation into Algebird and remove use of the bitset implementation we have now (java ewah), and it would also remove one of the dependencies of core taking us down to 1 (typelevel algebra): https://github.com/twitter/algebird/blob/develop/build.sbt#L258
this would be binary incompatible. We could move the old BloomFilter into
algebird.legacy
and make the javaewah dep provided so people who don't use it don't pick it up in their build.If we do this, we should add benchmarks to compare, but my guess is this will actually be faster.
cc @anish749 @regadas @nevillelyh
The text was updated successfully, but these errors were encountered: