
Speed-up multiplication by small integers, and improve lookup part of compute_quotient_poly() #1153

Closed
wants to merge 10 commits into from

Conversation

Nashtare
Collaborator

@Nashtare Nashtare commented Jul 31, 2023

This PR introduces a mul_u32() method on Field, to speed up multiplication by small values, though, funnily enough, the places where I originally planned to use it ended up leveraging F::from_noncanonical_u96() instead. The gain in compute_quotient_poly() when lookups are involved is significant.
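The underlying idea can be sketched as follows for the Goldilocks field. This is an illustrative reimplementation, not the PR's actual code (the names `mul_small` and `reduce_u96` are mine): a u64 field element times a u32 fits in 96 bits, and for p = 2^64 - 2^32 + 1 the identity 2^64 ≡ 2^32 - 1 (mod p) gives a reduction that is cheaper than the general 128-bit one.

```rust
// Illustrative sketch, not the PR's actual API. Goldilocks prime:
const P: u64 = 0xFFFF_FFFF_0000_0001; // 2^64 - 2^32 + 1

// Reduce lo + hi * 2^64 (with hi < 2^32), using 2^64 ≡ 2^32 - 1 (mod P).
fn reduce_u96(lo: u64, hi: u32) -> u64 {
    // hi * (2^32 - 1) <= (2^32 - 1)^2 < 2^64, so this cannot overflow.
    let t = (hi as u64) * 0xFFFF_FFFF;
    let (mut res, carry) = lo.overflowing_add(t);
    if carry {
        // We wrapped past 2^64; fold the lost 2^64 back in as 2^32 - 1.
        res = res.wrapping_add(0xFFFF_FFFF);
    }
    // res < 2 * P at this point, so one conditional subtraction suffices.
    if res >= P {
        res -= P;
    }
    res
}

// Multiply a field element by a small integer: the product is < 2^96,
// so the cheaper 96-bit reduction applies instead of a full 128-bit one.
fn mul_small(a: u64, b: u32) -> u64 {
    let wide = (a as u128) * (b as u128);
    reduce_u96(wide as u64, (wide >> 64) as u32)
}

fn main() {
    // (P - 1) * 7 ≡ -7 ≡ P - 7 (mod P)
    assert_eq!(mul_small(P - 1, 7), P - 7);
    assert_eq!(mul_small(1, u32::MAX), u32::MAX as u64);
    println!("ok");
}
```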

Additionally, it speeds up places where the multiplicative generator is involved, as that generator is always small for prime fields. Namely, I've added a SmallPowers struct (mimicking Powers) and used it wherever possible. Note that the FFT speed-ups can't be integrated directly into the fft module, as it is defined over Field, which we cannot convert to primitive types.
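A SmallPowers-style iterator might look like the sketch below (my reconstruction from the description above, not the PR's code): it yields base^0, base^1, base^2, ..., so each step is a multiplication by a u32 rather than a full field-by-field multiplication. The reduction here is done naively via u128 for clarity.

```rust
// Goldilocks prime, used only for the modular arithmetic in this sketch.
const P: u64 = 0xFFFF_FFFF_0000_0001; // 2^64 - 2^32 + 1

// Sketch of a SmallPowers iterator: yields base^0, base^1, base^2, ...
// Each step multiplies the running value by a u32, which admits a cheaper
// reduction than a general field multiplication.
struct SmallPowers {
    base: u32,
    current: u64,
}

impl SmallPowers {
    fn new(base: u32) -> Self {
        SmallPowers { base, current: 1 }
    }
}

impl Iterator for SmallPowers {
    type Item = u64;
    fn next(&mut self) -> Option<u64> {
        let out = self.current;
        // Naive reduction via u128; a real implementation would use the
        // specialized small-multiplier path.
        self.current = ((self.current as u128 * self.base as u128) % P as u128) as u64;
        Some(out)
    }
}

fn main() {
    // Powers of 7: 1, 7, 49, 343
    let v: Vec<u64> = SmallPowers::new(7).take(4).collect();
    assert_eq!(v, vec![1, 7, 49, 343]);
    println!("ok");
}
```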

@nbgl I think there may be some improvements to get by having a similar method for PackedField, however I am not really familiar with AVX instructions. I added a blanket impl for now, with TODOs for Goldilocks specialization.

closes #869

@Nashtare Nashtare changed the title Add mul_32() method in Field and speed-up compute_quotient_poly() with lookups Speed-up multiplication by small integers, and improve lookup part of compute_quotient_poly() Jul 31, 2023
@unzvfu unzvfu self-requested a review July 31, 2023 16:14
@unzvfu
Contributor

unzvfu commented Jul 31, 2023

Thanks for this @Nashtare, I'll have a look at it in detail tomorrow. So that I can compare when I test it myself, could you please post the concrete performance improvement you observed and how you measured it?

@Nashtare
Collaborator Author

@unzvfu The main gains are in compute_quotient_poly() when lookups are involved. For this I modified the bench_recursion example so that the third case builds a circuit of $2^N$ gates with $2^{N-1}$ lookups (it's currently fixed to 514, not sure why actually). For the compute quotient polys step of the first proof (the one containing lookups), I'm observing an average improvement ranging from 8% to 45%.

As for the rest, the improvements are minimal, as those places are not part of a prover bottleneck. For instance, I added a local logger on the EVM side for the step computing lagrange_first and lagrange_last in compute_quotient_polys(), and I'm getting on average a 30% speed-up for the Memory table with simple_transfer (though the concrete gain is only 0.06s).

@nbgl
Contributor

nbgl commented Jul 31, 2023

I'd also be curious to see how much of a difference mul_u32 makes. I have definitely seen the compiler optimize reduce_u128 into reduce_u96 when the former is called with a small input. That would make the mul_u32 method redundant (unless the optimization is being missed, in which case we should look into why).
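To illustrate the point (the function shape below is my assumption, not plonky2's actual internals): a full 128-bit Goldilocks reduction uses 2^64 ≡ 2^32 - 1 and 2^96 ≡ -1 (mod p). When the input is a u64 × u32 product, it is provably below 2^96, so the hi term is a constant zero and a compiler that tracks value ranges can collapse the code to the 96-bit path.

```rust
const P: u64 = 0xFFFF_FFFF_0000_0001; // Goldilocks prime, 2^64 - 2^32 + 1

// Generic 128-bit reduction: x = lo + mid * 2^64 + hi * 2^96, with
// 2^64 ≡ 2^32 - 1 and 2^96 ≡ -1 (mod P). The final `%` keeps the sketch
// simple; real implementations use conditional subtractions instead.
fn reduce_u128(x: u128) -> u64 {
    let lo = x as u64;
    let mid = ((x >> 64) as u64) & 0xFFFF_FFFF;
    let hi = (x >> 96) as u64;
    // x ≡ lo + mid * (2^32 - 1) - hi (mod P); add P before subtracting hi
    // to avoid underflow (hi < 2^32 < P).
    let acc = lo as u128 + mid as u128 * 0xFFFF_FFFF + P as u128 - hi as u128;
    (acc % P as u128) as u64
}

fn main() {
    // A u64 * u32 product is < 2^96, so hi == 0 above: the subtraction
    // term is dead code for such call sites.
    let prod = (P - 1) as u128 * u32::MAX as u128;
    assert_eq!(prod >> 96, 0);
    assert_eq!(reduce_u128(prod), (prod % P as u128) as u64);
    // Full-width sanity check: (P - 1)^2 ≡ 1 (mod P).
    assert_eq!(reduce_u128(((P - 1) as u128).pow(2)), 1);
    println!("ok");
}
```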

@unzvfu
Contributor

unzvfu commented Aug 1, 2023

@Nashtare: Is there a way you can share your benchmark code with me (is it available as a branch in your repo)? I agree with Jacqui that I would have expected the compiler to take care of many of the instances where mul_u32() seems to apply. I'm a little suspicious that some of the changes in this PR aren't contributing to the speedup you're seeing, so I'd like to investigate a little further.

Could you please also refactor all the code that was copy-pasted from lde_onto_coset()/coset_fft_with_options()? I think the call sites will be much clearer if you just pass the precomputed powers to lde_onto_coset() (make lde_onto_coset() inline if you think the function call overhead is playing a role). Thanks!

@Nashtare
Collaborator Author

Nashtare commented Aug 1, 2023

Ah, interesting, in which case indeed most of the changes in this PR are useless 😅 I wouldn't have expected the compiler to optimize full reduction that well with smaller inputs.
The lookup pairs ops probably remain the only thing worth changing, then.

@unzvfu I've added a mul_u32_bench branch that updates the bench_recursion case for numerous lookups (--lookup-type 2), and a local timing measurement for the Lagrange polynomials computation on the EVM side. You should be able to cherry-pick the last commit on top of main for comparison.

[Review comment on evm/src/prover.rs — outdated, resolved]
@unzvfu
Contributor

unzvfu commented Aug 2, 2023

> Ah, interesting, in which case indeed most of the changes in this PR are useless 😅 I wouldn't have expected the compiler to optimize full reduction that well with smaller inputs. The lookup pairs ops probably remain the only thing worth changing, then.
>
> @unzvfu I've added a mul_u32_bench branch that updates the bench_recursion case for numerous lookups (--lookup-type 2), and a local timing measurement for the Lagrange polynomials computation on the EVM side. You should be able to cherry-pick the last commit on top of main for comparison.

Ok, so I've finally got what I believe is a fair comparison for "compute quotient polys", but unfortunately I don't see any performance difference. Could you please verify that I'm checking the right thing though? Here are two example runs; first plonky2/main:

hlaw@zaphod:~/src/plonky2$ git remote show origin | head -3
* remote origin
  Fetch URL: [email protected]:mir-protocol/plonky2.git
  Push  URL: [email protected]:mir-protocol/plonky2.git
hlaw@zaphod:~/src/plonky2$ git status
On branch main
Your branch is up to date with 'origin/main'.

Changes not staged for commit:
	modified:   plonky2/examples/bench_recursion.rs

Untracked files:
[...snip...]

hlaw@zaphod:~/src/plonky2$ git diff
diff --git a/plonky2/examples/bench_recursion.rs b/plonky2/examples/bench_recursion.rs
index d0c273d1..cf3ff536 100644
--- a/plonky2/examples/bench_recursion.rs
+++ b/plonky2/examples/bench_recursion.rs
@@ -160,7 +160,7 @@ fn dummy_many_rows_proof<
     let initial_a = builder.add_virtual_target();
 
     let output = builder.add_lookup_from_index(initial_a, tip5_idx);
-    for _ in 0..514 {
+    for _ in 0..(1 << log2_size) - 1 {
         builder.add_lookup_from_index(output, 0);
     }
 
hlaw@zaphod:~/src/plonky2$ RUSTFLAGS='-Ctarget-cpu=native' cargo run --release --example bench_recursion -- -vv --lookup-type 2 2>&1 | grep 'compute quotient polys'
[DEBUG plonky2::util::timing] | 0.1363s to compute quotient polys
[DEBUG plonky2::util::timing] | 0.0709s to compute quotient polys
[DEBUG plonky2::util::timing] | 0.0349s to compute quotient polys

and then plonky2@toposware/mul_u32_bench:

hlaw@zaphod:~/src/plonky2-toposware$ git remote show origin | head -3
* remote origin
  Fetch URL: [email protected]:topos-protocol/plonky2.git
  Push  URL: [email protected]:topos-protocol/plonky2.git
hlaw@zaphod:~/src/plonky2-toposware$ git status
On branch mul_u32_bench
Your branch is up to date with 'origin/mul_u32_bench'.

Changes not staged for commit:
[...snip... just the fix from #1163]

hlaw@zaphod:~/src/plonky2-toposware$ RUSTFLAGS='-Ctarget-cpu=native' cargo run --release --example bench_recursion -- -vv --lookup-type 2 2>&1 | grep 'compute quotient polys'
[DEBUG plonky2::util::timing] | 0.1462s to compute quotient polys
[DEBUG plonky2::util::timing] | 0.0760s to compute quotient polys
[DEBUG plonky2::util::timing] | 0.0385s to compute quotient polys

Could you confirm that you are compiling with -Ctarget-cpu=native and could you also let us know which AVX extensions your CPU has?

@Nashtare
Collaborator Author

Nashtare commented Aug 2, 2023

Hmm, that's weird; I definitely got variable performance improvements (I also tried larger tables). I was compiling with -Ctarget-cpu=native, with AVX2 enabled (I don't have AVX512 support).

On the other hand, from playing with godbolt, it seems Jacqueline was right: the compiler can properly infer small enough inputs when performing modular reduction, making the mul_u32() method somewhat redundant. Given that, and the fact that the quotient poly improvement doesn't seem consistent, should we just close this?

@unzvfu
Contributor

unzvfu commented Aug 2, 2023

> Hmm, that's weird; I definitely got variable performance improvements (I also tried larger tables). I was compiling with -Ctarget-cpu=native, with AVX2 enabled (I don't have AVX512 support).
>
> On the other hand, from playing with godbolt, it seems Jacqueline was right: the compiler can properly infer small enough inputs when performing modular reduction, making the mul_u32() method somewhat redundant. Given that, and the fact that the quotient poly improvement doesn't seem consistent, should we just close this?

If you're happy to close, that's okay with me. Also happy to revisit the idea later if you have an idea of how it might be made to (more definitively) bear fruit.

@Nashtare
Collaborator Author

Nashtare commented Aug 2, 2023

Yeah, let's close it for now. The quotient_poly part wasn't the biggest overhead with lookups anyway. We're working internally on making lookups more efficient to prove; we should come up with something relatively soon!

@Nashtare Nashtare closed this Aug 2, 2023
@Nashtare Nashtare deleted the mul_u32 branch September 8, 2023 20:43
Successfully merging this pull request may close these issues.

Faster multiplication by generator