-
Notifications
You must be signed in to change notification settings - Fork 899
Switch to FXHash #1752
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Switch to FXHash #1752
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the split, much easier to reason about the change!
QQ: have you considered importing FxHashMap
like so: use rustc_hash::FxHashMap as HashMap;
?
I assume this won't work because APIs seem a bit too different, but wondering if feasible. Would also make sure you haven't missed a HashMap somewhere in the codebase (same goes for HashSet
& FxHashSet
).
@@ -61,8 +64,8 @@ impl WordLevelBuilder { | |||
|
|||
/// Set the vocab (token -> ID) mapping. | |||
#[must_use] | |||
pub fn vocab(mut self, vocab: HashMap<String, u32>) -> Self { | |||
self.config.vocab = vocab; | |||
pub fn vocab<S: BuildHasher>(mut self, vocab: HashMap<String, u32, S>) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why does this (<S: BuildHasher>
) need to be specified?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This pattern of specifying a trait bound on S
is to prevent the user from having to deal with the change in hashmap. For example, if the function signature is: pub fn vocab(mut self, vocab: FxHashMap<String, u32>) -> Self
, the user of the library would need to convert to a FxHashMap on their end. This not only causes them to take another direct dependency, it would also cause breaking changes. I ran into this (I believe it is still in my other PR) where I would need to modify the tests to convert into a FxHashMap. Since FxHashMap is just a type-specified version of std::collections::HashMap using FxHash, this type signature lets the user interact with the library without any extra effort. I tried my best to make the least impact on the API, using the tests as an indicator.
This also explains my reasoning for not using the use rustc_hash::FxHashMap as HashMap
, as in some files, such as this one, I need to specify using std::collections::HashMap
. The reason for not using the import alias in the rest of the codebase is just for consistency. It would be harder to navigate the codebase if HashMap was defined as separate things in different files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the trait BuildHasher
a requirement for FxHashMap
?
You could also alias like so: use std::collections::HashMap as StdHashMap
which shouldn't change anything API-wise. But I understand the rationale, sgtm.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, FxHashMap<K, V> := std::collections::HashMap<K, V, rustc_hash::FxBuildHasher>
, and rustc_hash::FxBuildHasher
impls BuildHasher
.
This way, the function accepts rustc_hash::FxBuildHasher
as well as std::hash::DefaultHasher
(which would be the default std::collections::HashMap
.
@MeetThePatel , did you see the timing with Edit: I checked your timings in cargo_bench.txt and I think there are a few more % to squeeze out if you check our PR. |
Split PR #1733 into two separate PRs, as per @McPatate 's suggestion. This PR contains the changes for adding FXHash.
The benchmarks (from
cargo bench
) are in cargo_bench.txt. Some performance gains, but not as large as the short-string optimization.Note: The changes in this PR are slightly different than the ones from the other PR. I refactored some things to make this change mostly invisible to the user (this only applies to the Rust library; the Python and Node users shouldn't feel any difference). The only exposure these changes have to a library user are the return types for the "getters". For the setters (and other functionality), you can provide a
std::collections::HashMap
and it will convert to arustc_hash::FxHashMap
(and the same for HashSet) internally.I'll work on the PR for the short-string optimization after this PR has been reviewed. I think that would be easier, as changes here would impact the other.