-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Algorithmically quite slow #28
Comments
Maybe so. Another question is if it matters? |
@Restioson oh, that's super interesting; that's quite a reduction! I'd be super keen to see more on how this was achieved. You're right in that so far we've mostly focused on porting JS to Rust, but it'd be really cool to have a path forward to improve performance! |
@Restioson oh by the way; I'm seeing lots of |
@yoshuawuyts I believe godbolt does it on release by default. I'll try create another godbolt demo to put the link in. Performance wise, subbing out the flat-tree fns for my own made my time go from ~15us -> 400ns about. My algorithm stores a tree like 1 as [1; 2; 3]. To find the left child, we multiply the (1-indexed) index by two. To find the right child, we multiply the (again 1-indexed) index by two and add one. To find the parent, we divide the index by 2 and floor it. This is worse for cache locality though, but in general faster. The very slow function, btw, was |
Sounds about right that it's only depth that "needs" to be optimized. |
I assume depth is calculating log2? If it is, http://huonw.github.io/fast-math/fast_math/fn.log_raw2.html could be used. |
@Restioson If I understand your proposal, that would change the numbering in a way which makes it unfit for streaming data. Let me explain. This library is used in https://github.com/datrs/merkle-tree-stream to help with producing the merkle tree stream. The idea is that data blocks are coming in and the hash of these blocks (the leaves) are generated continuously. When there are enough leaves (or parents) to calculate the cumulative hash of a subtree, this hash is generated. There is no info about how many data blocks we have, so no info about the depth either. This is why the numbering starts with the leftmost leaf so that any number of nodes can be generated incrementally. Take a look at https://github.com/datrs/print-flat-tree to get some idea about this layout/numbering. Would it be possible to retain the same ordering with your scheme? If so, then indeed it could be used. Regarding the pub fn depth(i: usize) -> usize {
(!i).trailing_zeros() as usize
} (it could be beneficial to make it return |
@khernyo I wasn't specifically asking to replace the algorithm, just to improve the current one (perhaps using an alternate algorithm). What does depth do anyway, btw? |
Depth is a node's distance from a child leaf. So, e.g. in a tree with four leaves:
the leaves (nodes 0, 2, 4 and 6) have a depth of 0, the root (node 3) has a depth of 2. |
Fair enough. I'll try hack around with this and see how much of a difference it makes subbing in the new depth function. |
Just rewrote the depth function in #38, so now it's just: #[inline]
pub fn depth(i: usize) -> usize {
(!i).trailing_zeros() as usize
} In turn this is now in assembly: xor al, -1
movzx edi, al
call core::num::<impl u8>::trailing_zeros
mov dword ptr [rsp + 4], eax
mov eax, dword ptr [rsp + 4]
mov ecx, eax
mov eax, ecx Which is about as fast as we can get. The only possible speedup I could see here is by not casting to a usize, but that would be a major change for the API. Hope this should bring us a bit closer to acceptable performance. Also if anyone would like to take a stab at building a perf suite using criterion that'd be amazing! |
I wonder if |
@Restioson edit: yeah oops, I forgot to mention: if you add the right optimization flags to depth:
not rdi
tzcnt rax, rdi
ret I wrote an RFC yesterday to introduce |
Just wanted to mention that it appears that .trailing_ones() has been inlined as of January 27, 2020 as per this pull request: rust-lang/rust#68165 |
I decided to look at the generated assembly for the left_child function (inlining everything it calls):
By rewriting it to store each level in its entirety consecutively (level 1, then 2, then 3, etc), I got this (since then changed, will update later, but wasn't much longer):
I understand that the algorithm is taken from the JS project, but I still think that this algorithm is quite slow...
The text was updated successfully, but these errors were encountered: