Incremental topK with fractional index #72

Open: wants to merge 16 commits into base: main

kevin-dp

This PR changes the topK operator because the previous implementation was not incremental. It provides two implementations: an array implementation and a B+ tree implementation. The array implementation keeps a sorted array of elements so that it can efficiently find the position at which to insert or delete, but the insertion or deletion itself still takes linear time. This is fine for small to medium collections. For big collections, we want to use the B+ tree implementation so that insertions and deletions take logarithmic time.
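To illustrate the cost profile described above, here is a minimal sketch of a sorted-array store. It is not the code from this PR: the helper names (`lowerBound`, `insertSorted`, `deleteSorted`) are hypothetical, and using binary search to find the position is an assumption about how the position is located efficiently.

```typescript
// Hypothetical helpers for illustration only; not the PR's implementation.
type Comparator<T> = (a: T, b: T) => number

// Binary search: first index whose element is not less than `value` (O(log n)).
function lowerBound<T>(sorted: T[], value: T, compare: Comparator<T>): number {
  let lo = 0
  let hi = sorted.length
  while (lo < hi) {
    const mid = (lo + hi) >>> 1
    if (compare(sorted[mid], value) < 0) lo = mid + 1
    else hi = mid
  }
  return lo
}

// Finding the slot is logarithmic, but splice shifts the tail, so the insert is O(n).
function insertSorted<T>(sorted: T[], value: T, compare: Comparator<T>): number {
  const index = lowerBound(sorted, value, compare)
  sorted.splice(index, 0, value)
  return index
}

// Same profile for deletion: O(log n) lookup, O(n) removal. Returns -1 if absent.
function deleteSorted<T>(sorted: T[], value: T, compare: Comparator<T>): number {
  const index = lowerBound(sorted, value, compare)
  if (index < sorted.length && compare(sorted[index], value) === 0) {
    sorted.splice(index, 1)
    return index
  }
  return -1
}

const byNumber: Comparator<number> = (a, b) => a - b
const items: number[] = []
for (const n of [5, 1, 4, 2, 3]) insertSorted(items, n, byNumber)
deleteSorted(items, 4, byNumber)
const top3 = items.slice(0, 3) // the top K is a prefix of the sorted array
```

A tree-based store would replace the two `splice` calls with logarithmic node updates, which is where the two variants are expected to diverge on large collections.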

TODO:

  • Benchmark these 2 implementations to confirm their theoretical time complexity holds in practice

Contributor

@KyleAMathews KyleAMathews left a comment

great stuff! Looking forward to seeing the benchmarks

@@ -50,6 +50,7 @@
   },
   "dependencies": {
     "fractional-indexing": "^3.2.0",
-    "murmurhash-js": "^1.0.0"
+    "murmurhash-js": "^1.0.0",
+    "sorted-btree": "^1.8.1"
Contributor

5.8kb gzipped https://bundlephobia.com/package/sorted-btree@1.8.1

This is enough extra code weight (~24% increase to tanstack/db) that, depending on where the crossover point ends up being, this could be an opt-in thing, i.e. only used if you have 50k+ items in a collection.

Author

Yes, that's the idea. We want to do some initial benchmarking to see where the crossover point lies between the array version and the tree version. We could automatically switch between them based on the size of the collection.

Contributor

Ok perfect, yeah that'd be easy with an async import 🚀
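The size-based switch with a lazily loaded tree version discussed in this thread could look roughly like the sketch below. Every name in it is hypothetical, the threshold is a placeholder borrowed from the "50k+ items" remark rather than a measured crossover point, and the loaders are in-memory stand-ins for dynamic `import()` calls so the snippet runs without any packages installed.

```typescript
// Hypothetical names throughout; nothing here is taken from the PR's code.
type Implementation = "array" | "btree"

// Assumption: placeholder value, to be replaced once benchmarks locate the crossover.
const BTREE_THRESHOLD = 50000

function pickImplementation(
  size: number,
  threshold: number = BTREE_THRESHOLD
): Implementation {
  return size >= threshold ? "btree" : "array"
}

// Shape of whatever a lazily loaded module would expose (hypothetical).
interface TopKModule {
  name: Implementation
}

// In a real build each loader would be something like `() => import("./some-module")`,
// so the tree code is only fetched when a large collection actually needs it.
type Loader = () => Promise<TopKModule>

async function loadImplementation(
  size: number,
  loaders: Record<Implementation, Loader>
): Promise<TopKModule> {
  return loaders[pickImplementation(size)]()
}

// Stand-in loaders that resolve immediately.
const loaders: Record<Implementation, Loader> = {
  array: async (): Promise<TopKModule> => ({ name: "array" }),
  btree: async (): Promise<TopKModule> => ({ name: "btree" }),
}

loadImplementation(120000, loaders).then((mod) => {
  console.log(`using the ${mod.name} implementation`)
})
```

Bundlers commonly emit a dynamically imported module as its own chunk, which is what would keep the extra dependency out of the default bundle under this approach.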

Contributor

@samwillis samwillis left a comment

Thanks @kevin-dp, all looks really good!

My one suggestion is that we split the BTree version into a separate operator, in a separate file.

So the array version is topKWithFractionIndex and orderByWithFractionIndex, and then we have a separate topKWithFractionIndexBTree and orderByWithFractionIndexBTree. That way, when the BTree isn't used it won't be bundled - at the moment, the condition that picks the implementation will cause the BTree to be pulled in all the time. It should be possible to do this without duplication if you subclass TopKWithFractionalIndexOperator as TopKWithFractionalIndexBtreeOperator.
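A rough sketch of that subclassing idea follows. Only the two operator class names come from the comment above; the `SortedStore` interface, the `createStore` hook, and both store classes are hypothetical, and `TreeStoreStandIn` merely wraps an array so the snippet runs without sorted-btree installed. In the proposed split, the subclass would live in its own file and be the only place that imports the tree library.

```typescript
type Compare<T> = (a: T, b: T) => number

// Hypothetical minimal interface shared by both storage variants.
interface SortedStore<T> {
  readonly kind: string
  insert(value: T): void
  first(k: number): T[]
}

class ArrayStore<T> implements SortedStore<T> {
  readonly kind: string = "array"
  private items: T[] = []
  private compare: Compare<T>
  constructor(compare: Compare<T>) {
    this.compare = compare
  }
  insert(value: T): void {
    // Linear scan for brevity; keeps the array sorted.
    let i = 0
    while (i < this.items.length && this.compare(this.items[i], value) <= 0) i++
    this.items.splice(i, 0, value)
  }
  first(k: number): T[] {
    return this.items.slice(0, k)
  }
}

// Stand-in for a store backed by a real B+ tree (assumption: same interface).
class TreeStoreStandIn<T> implements SortedStore<T> {
  readonly kind: string = "btree"
  private inner: ArrayStore<T>
  constructor(compare: Compare<T>) {
    this.inner = new ArrayStore(compare)
  }
  insert(value: T): void {
    this.inner.insert(value)
  }
  first(k: number): T[] {
    return this.inner.first(k)
  }
}

// Base operator: all shared logic lives here and depends only on the interface.
class TopKWithFractionalIndexOperator<T> {
  protected compare: Compare<T>
  protected k: number
  private store: SortedStore<T> | undefined
  constructor(compare: Compare<T>, k: number) {
    this.compare = compare
    this.k = k
  }
  // The single hook a subclass overrides to swap the storage.
  protected createStore(): SortedStore<T> {
    return new ArrayStore(this.compare)
  }
  private getStore(): SortedStore<T> {
    if (!this.store) this.store = this.createStore()
    return this.store
  }
  add(value: T): void {
    this.getStore().insert(value)
  }
  topK(): T[] {
    return this.getStore().first(this.k)
  }
  storeKind(): string {
    return this.getStore().kind
  }
}

// Would sit in a separate file so the tree dependency is bundled only when imported.
class TopKWithFractionalIndexBtreeOperator<T> extends TopKWithFractionalIndexOperator<T> {
  protected override createStore(): SortedStore<T> {
    return new TreeStoreStandIn(this.compare)
  }
}

const treeOp = new TopKWithFractionalIndexBtreeOperator<number>((a, b) => a - b, 2)
for (const n of [3, 1, 2]) treeOp.add(n)
console.log(treeOp.storeKind(), treeOp.topK())
```

The store is created lazily rather than in the base constructor so that the overridden hook is never called before the subclass has finished initializing.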

@kevin-dp kevin-dp requested a review from samwillis June 24, 2025 08:45
Contributor

@samwillis samwillis left a comment

One note, and it needs a changeset, but other than that :shipit:

@kevin-dp
Author

> One note, and it needs a changeset, but other than that :shipit:

I'd like to benchmark it before we ship it, to make sure the two versions perform as expected.

@kevin-dp kevin-dp closed this Jun 24, 2025
@kevin-dp kevin-dp reopened this Jun 24, 2025