-
Notifications
You must be signed in to change notification settings - Fork 3
Incremental topK with fractional index #72
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…plementation details.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great stuff! Looking forward to seeing the benchmarks
@@ -50,6 +50,7 @@ | |||
}, | |||
"dependencies": { | |||
"fractional-indexing": "^3.2.0", | |||
"murmurhash-js": "^1.0.0" | |||
"murmurhash-js": "^1.0.0", | |||
"sorted-btree": "^1.8.1" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
5.8kb gzipped https://bundlephobia.com/package/[email protected]
This is enough extra code weight (~24% increase to tanstack/db) that depending on where the crossover point ends up being, this could be an opt-in thing. I.e. only use if you have 50k+ items in a a collection.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's the idea. We want to do some initial benchmarking to see when the turnover point is between using the array version or the tree version. We could automatically switch between them based on the size of the collection.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok perfect, yeah that'd be easy with an async import 🚀
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @kevin-dp, all looks really good!
My one suggestion is we split the BTree version into a serrate operator, in a separate file.
So the array version is topKWithFractionIndex
and orderByWithFractionIndex
, and then we have a separate topKWithFractionIndexBTree
and orderByWithFractionIndexBTree
. That way when the Btree isn't used it won't be bundled - at the moment the condition on which implementation to use will cause the Btree to be pulled in all the time. It should be possible to do this without duplication is you subclass TopKWithFractionalIndexOperator
as TopKWithFractionalIndexBtreeOperator
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One note, and it needs a changeset, but other than that
I'd like to benchmark it before we ship it, to make sure the two versions perform as expected. |
This PR introduces changes to the topK operator because the previous implementation was not incremental. This PR provides 2 implementations: an array and a B+ tree implementation. The array implementation internally keeps a sorted array of elements to efficiently find the position where to insert/delete but the actual insertion/deletion is still in linear time. This is fine for small to medium collections. For big collections, we want to use the B+ tree implementation such that insertions and deletions are in logarithmic time.
TODO: