Skip to content

Switch to FXHash #1752

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: main
Choose a base branch
from
Open

Conversation

MeetThePatel
Copy link

Split PR #1733 into two separate PRs, as per @McPatate 's suggestion. This PR contains the changes for adding FXHash.

The benchmarks (from cargo bench) are in cargo_bench.txt. Some performance gains, but not as large as the short-string optimization.

Note: The changes in this PR are slightly different than the ones from the other PR. I refactored some things to make this change mostly invisible to the user (this only applies to the Rust library; the Python and Node users shouldn't feel any difference). The only exposure these changes have to a library user are the return types for the "getters". For the setters (and other functionality), you can provide a std::collections::HashMap and it will convert to a rustc_hash::FxHashMap (and the same for HashSet) internally.

I'll work on the PR for the short-string optimization after this PR has been reviewed. I think that would be easier, as changes here would impact the other.

Copy link
Member

@McPatate McPatate left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the split, much easier to reason about the change!

QQ: have you considered importing FxHashMap like so: use rustc_hash::FxHashMap as HashMap;?
I assume this won't work because APIs seem a bit too different, but wondering if feasible. Would also make sure you haven't missed a HashMap somewhere in the codebase (same goes for HashSet & FxHashSet).

@@ -61,8 +64,8 @@ impl WordLevelBuilder {

/// Set the vocab (token -> ID) mapping.
#[must_use]
pub fn vocab(mut self, vocab: HashMap<String, u32>) -> Self {
self.config.vocab = vocab;
pub fn vocab<S: BuildHasher>(mut self, vocab: HashMap<String, u32, S>) -> Self {
Copy link
Member

@McPatate McPatate Mar 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why does this (<S: BuildHasher>) need to be specified?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pattern of specifying a trait bound on S is to prevent the user from having to deal with the change in hashmap. For example, if the function signature is: pub fn vocab(mut self, vocab: FxHashMap<String, u32>) -> Self, the user of the library would need to convert to a FxHashMap on their end. This not only causes them to take another direct dependency, it would also cause breaking changes. I ran into this (I believe it is still in my other PR) where I would need to modify the tests to convert into a FxHashMap. Since FxHashMap is just a type-specified version of std::collections::HashMap using FxHash, this type signature lets the user interact with the library without any extra effort. I tried my best to make the least impact on the API, using the tests as an indicator.

This also explains my reasoning for not using the use rustc_hash::FxHashMap as HashMap, as in some files, such as this one, I need to specify using std::collections::HashMap. The reason for not using the import alias in the rest of the codebase is just for consistency. It would be harder to navigate the codebase if HashMap was defined as separate things in different files.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the trait BuildHasher a requirement for FxHashMap?

You could also alias like so: use std::collections::HashMap as StdHashMap which shouldn't change anything API-wise. But I understand the rationale, sgtm.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, FxHashMap<K, V> := std::collections::HashMap<K, V, rustc_hash::FxBuildHasher>, and rustc_hash::FxBuildHasher impls BuildHasher.

This way, the function accepts rustc_hash::FxBuildHasher as well as std::hash::DefaultHasher (which would be the default std::collections::HashMap.

@mjbommar
Copy link

mjbommar commented Mar 22, 2025

@MeetThePatel , did you see the timing with dary_heap as well? It seems like you're implementing the same kinds of changes we had researched in #1618, so you might find that useful for timing reference.

Edit: I checked your timings in cargo_bench.txt and I think there are a few more % to squeeze out if you check our PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants