
Speed-up multiplication by small integers, and improve lookup part of compute_quotient_poly() #1153

Closed
wants to merge 10 commits into from

Conversation

Nashtare
Collaborator

@Nashtare Nashtare commented Jul 31, 2023

This PR introduces a mul_u32() method on Field, to speed up multiplication by small values, though, funnily enough, the places where I originally planned to use it ended up leveraging F::from_noncanonical_u96() instead. The gain in compute_quotient_poly() when lookups are involved is significant.
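The underlying idea can be sketched as follows for the Goldilocks field. This is an illustrative reimplementation, not the PR's actual code (the names `mul_small` and `reduce_u96` are mine): a u64 field element times a u32 fits in 96 bits, and for p = 2^64 - 2^32 + 1 the identity 2^64 ≡ 2^32 - 1 (mod p) gives a reduction that is cheaper than the general 128-bit one.

```rust
// Illustrative sketch, not the PR's actual API. Goldilocks prime:
const P: u64 = 0xFFFF_FFFF_0000_0001; // 2^64 - 2^32 + 1

// Reduce lo + hi * 2^64 (with hi < 2^32), using 2^64 ≡ 2^32 - 1 (mod P).
fn reduce_u96(lo: u64, hi: u32) -> u64 {
    // hi * (2^32 - 1) <= (2^32 - 1)^2 < 2^64, so this cannot overflow.
    let t = (hi as u64) * 0xFFFF_FFFF;
    let (mut res, carry) = lo.overflowing_add(t);
    if carry {
        // We wrapped past 2^64; fold the lost 2^64 back in as 2^32 - 1.
        res = res.wrapping_add(0xFFFF_FFFF);
    }
    // res < 2 * P at this point, so one conditional subtraction suffices.
    if res >= P {
        res -= P;
    }
    res
}

// Multiply a field element by a small integer: the product is < 2^96,
// so the cheaper 96-bit reduction applies instead of a full 128-bit one.
fn mul_small(a: u64, b: u32) -> u64 {
    let wide = (a as u128) * (b as u128);
    reduce_u96(wide as u64, (wide >> 64) as u32)
}

fn main() {
    // (P - 1) * 7 ≡ -7 ≡ P - 7 (mod P)
    assert_eq!(mul_small(P - 1, 7), P - 7);
    assert_eq!(mul_small(1, u32::MAX), u32::MAX as u64);
    println!("ok");
}
```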

Additionally, it speeds up places where the multiplicative generator is involved, as that generator is always small for prime fields. Namely, I've added a SmallPowers struct (mimicking Powers) and used it wherever possible. Note that the FFT speed-ups can't be integrated directly into the fft module, as it is defined over Field, which we cannot convert to primitive types.
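A SmallPowers-style iterator might look like the sketch below (my reconstruction from the description above, not the PR's code): it yields base^0, base^1, base^2, ..., so each step is a multiplication by a u32 rather than a full field-by-field multiplication. The reduction here is done naively via u128 for clarity.

```rust
// Goldilocks prime, used only for the modular arithmetic in this sketch.
const P: u64 = 0xFFFF_FFFF_0000_0001; // 2^64 - 2^32 + 1

// Sketch of a SmallPowers iterator: yields base^0, base^1, base^2, ...
// Each step multiplies the running value by a u32, which admits a cheaper
// reduction than a general field multiplication.
struct SmallPowers {
    base: u32,
    current: u64,
}

impl SmallPowers {
    fn new(base: u32) -> Self {
        SmallPowers { base, current: 1 }
    }
}

impl Iterator for SmallPowers {
    type Item = u64;
    fn next(&mut self) -> Option<u64> {
        let out = self.current;
        // Naive reduction via u128; a real implementation would use the
        // specialized small-multiplier path.
        self.current = ((self.current as u128 * self.base as u128) % P as u128) as u64;
        Some(out)
    }
}

fn main() {
    // Powers of 7: 1, 7, 49, 343
    let v: Vec<u64> = SmallPowers::new(7).take(4).collect();
    assert_eq!(v, vec![1, 7, 49, 343]);
    println!("ok");
}
```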

@nbgl I think there may be some improvements to get by having a similar method for PackedField, however I am not really familiar with AVX instructions. I added a blanket impl for now, with TODOs for Goldilocks specialization.

closes #869

@Nashtare Nashtare changed the title Add mul_32() method in Field and speed-up compute_quotient_poly() with lookups Speed-up multiplication by small integers, and improve lookup part of compute_quotient_poly() Jul 31, 2023
@unzvfu unzvfu self-requested a review July 31, 2023 16:14
@unzvfu
Contributor

unzvfu commented Jul 31, 2023

Thanks for this @Nashtare, I'll have a look at it in detail tomorrow. So that I can compare when I test it myself, could you please post the concrete performance improvement you observed and how you measured it?

@Nashtare
Collaborator Author

@unzvfu The main gains are in compute_quotient_poly() when lookups are involved. For this I modified the bench_recursion example so that the third case builds a circuit of $2^N$ gates with $2^{N-1}$ lookups (it's currently fixed to 514, not sure why actually). For the compute quotient polys step of the first proof (the one containing lookups), I'm observing an average improvement ranging from 8% to 45%.

As for the rest, the improvements are minimal, as those places are not part of a prover bottleneck. For instance, I added a local logger on the EVM side for the step computing lagrange_first and lagrange_last in compute_quotient_polys(), and I'm getting on average a 30% speed-up for the Memory table with simple_transfer (though the concrete gain is only 0.06s).

@nbgl
Contributor

nbgl commented Jul 31, 2023

I'd also be curious to see how much of a difference mul_u32 makes. I have definitely seen the compiler optimize reduce_u128 into reduce_u96 when the former is called with a small input. That would make the mul_u32 method redundant (unless the optimization is being missed, in which case we should look into why).
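To illustrate the point (the function shape below is my assumption, not plonky2's actual internals): a full 128-bit Goldilocks reduction uses 2^64 ≡ 2^32 - 1 and 2^96 ≡ -1 (mod p). When the input is a u64 × u32 product, it is provably below 2^96, so the hi term is a constant zero and a compiler that tracks value ranges can collapse the code to the 96-bit path.

```rust
const P: u64 = 0xFFFF_FFFF_0000_0001; // Goldilocks prime, 2^64 - 2^32 + 1

// Generic 128-bit reduction: x = lo + mid * 2^64 + hi * 2^96, with
// 2^64 ≡ 2^32 - 1 and 2^96 ≡ -1 (mod P). The final `%` keeps the sketch
// simple; real implementations use conditional subtractions instead.
fn reduce_u128(x: u128) -> u64 {
    let lo = x as u64;
    let mid = ((x >> 64) as u64) & 0xFFFF_FFFF;
    let hi = (x >> 96) as u64;
    // x ≡ lo + mid * (2^32 - 1) - hi (mod P); add P before subtracting hi
    // to avoid underflow (hi < 2^32 < P).
    let acc = lo as u128 + mid as u128 * 0xFFFF_FFFF + P as u128 - hi as u128;
    (acc % P as u128) as u64
}

fn main() {
    // A u64 * u32 product is < 2^96, so hi == 0 above: the subtraction
    // term is dead code for such call sites.
    let prod = (P - 1) as u128 * u32::MAX as u128;
    assert_eq!(prod >> 96, 0);
    assert_eq!(reduce_u128(prod), (prod % P as u128) as u64);
    // Full-width sanity check: (P - 1)^2 ≡ 1 (mod P).
    assert_eq!(reduce_u128(((P - 1) as u128).pow(2)), 1);
    println!("ok");
}
```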

@unzvfu
Contributor

unzvfu commented Aug 1, 2023

@Nashtare: Is there a way you can share your benchmark code with me (is it available as a branch in your repo)? I agree with Jacqui that I would have expected the compiler to take care of many of the instances where mul_u32() seems to apply. I'm a little suspicious that some of the changes in this PR aren't contributing to the speedup you're seeing, so I'd like to investigate a little further.

Could you please also refactor all the code that was copy-pasted from lde_onto_coset()/coset_fft_with_options()? I think the call sites will be much clearer if you just pass the precomputed powers to lde_onto_coset() (make lde_onto_coset() inline if you think the function call overhead is playing a role). Thanks!

@Nashtare
Collaborator Author

Nashtare commented Aug 1, 2023

Ah, interesting, in which case indeed most of the changes in this PR are useless 😅 I wouldn't have expected the compiler to optimize full reduction that well with smaller inputs.
The lookup pairs ops probably remain the only thing worth changing, then.

@unzvfu I've added a mul_u32_bench branch that updates the bench_recursion case for numerous lookups (--lookup-type 2), and a local timing measurement for the Lagrange polynomials computation on the EVM side. You should be able to cherry-pick the last commit on top of main for comparison.

[Review comment on evm/src/prover.rs — outdated, resolved]
@unzvfu
Contributor

unzvfu commented Aug 2, 2023

> Ah, interesting, in which case indeed most of the changes in this PR are useless 😅 I wouldn't have expected the compiler to optimize full reduction that well with smaller inputs. The lookup pairs ops probably remain the only thing worth changing, then.
>
> @unzvfu I've added a mul_u32_bench branch that updates the bench_recursion case for numerous lookups (--lookup-type 2), and a local timing measurement for the Lagrange polynomials computation on the EVM side. You should be able to cherry-pick the last commit on top of main for comparison.

Ok, so I've finally got what I believe is a fair comparison for "compute quotient polys", but unfortunately I don't see any performance difference. Could you please verify that I'm checking the right thing though? Here are two example runs; first plonky2/main:

hlaw@zaphod:~/src/plonky2$ git remote show origin | head -3
* remote origin
  Fetch URL: [email protected]:mir-protocol/plonky2.git
  Push  URL: [email protected]:mir-protocol/plonky2.git
hlaw@zaphod:~/src/plonky2$ git status
On branch main
Your branch is up to date with 'origin/main'.

Changes not staged for commit:
	modified:   plonky2/examples/bench_recursion.rs

Untracked files:
[...snip...]

hlaw@zaphod:~/src/plonky2$ git diff
diff --git a/plonky2/examples/bench_recursion.rs b/plonky2/examples/bench_recursion.rs
index d0c273d1..cf3ff536 100644
--- a/plonky2/examples/bench_recursion.rs
+++ b/plonky2/examples/bench_recursion.rs
@@ -160,7 +160,7 @@ fn dummy_many_rows_proof<
     let initial_a = builder.add_virtual_target();
 
     let output = builder.add_lookup_from_index(initial_a, tip5_idx);
-    for _ in 0..514 {
+    for _ in 0..(1 << log2_size) - 1 {
         builder.add_lookup_from_index(output, 0);
     }
 
hlaw@zaphod:~/src/plonky2$ RUSTFLAGS='-Ctarget-cpu=native' cargo run --release --example bench_recursion -- -vv --lookup-type 2 2>&1 | grep 'compute quotient polys'
[DEBUG plonky2::util::timing] | 0.1363s to compute quotient polys
[DEBUG plonky2::util::timing] | 0.0709s to compute quotient polys
[DEBUG plonky2::util::timing] | 0.0349s to compute quotient polys

and then plonky2@toposware/mul_u32_bench:

hlaw@zaphod:~/src/plonky2-toposware$ git remote show origin | head -3
* remote origin
  Fetch URL: [email protected]:topos-protocol/plonky2.git
  Push  URL: [email protected]:topos-protocol/plonky2.git
hlaw@zaphod:~/src/plonky2-toposware$ git status
On branch mul_u32_bench
Your branch is up to date with 'origin/mul_u32_bench'.

Changes not staged for commit:
[...snip... just the fix from #1163]

hlaw@zaphod:~/src/plonky2-toposware$ RUSTFLAGS='-Ctarget-cpu=native' cargo run --release --example bench_recursion -- -vv --lookup-type 2 2>&1 | grep 'compute quotient polys'
[DEBUG plonky2::util::timing] | 0.1462s to compute quotient polys
[DEBUG plonky2::util::timing] | 0.0760s to compute quotient polys
[DEBUG plonky2::util::timing] | 0.0385s to compute quotient polys

Could you confirm that you are compiling with -Ctarget-cpu=native and could you also let us know which AVX extensions your CPU has?

@Nashtare
Collaborator Author

Nashtare commented Aug 2, 2023

Hmm, that's weird; I definitely got variable performance improvements (I also tried larger tables). I was compiling with -Ctarget-cpu=native, with AVX2 enabled (I don't have AVX512 support).

On the other hand, from playing with godbolt, it seems Jacqueline was right: the compiler can properly infer small enough inputs when performing modular reduction, making the mul_u32() method somewhat redundant. Given that, and the fact that the quotient poly improvement doesn't seem consistent, should we just close this?

@unzvfu
Contributor

unzvfu commented Aug 2, 2023

> Hmm, that's weird; I definitely got variable performance improvements (I also tried larger tables). I was compiling with -Ctarget-cpu=native, with AVX2 enabled (I don't have AVX512 support).
>
> On the other hand, from playing with godbolt, it seems Jacqueline was right: the compiler can properly infer small enough inputs when performing modular reduction, making the mul_u32() method somewhat redundant. Given that, and the fact that the quotient poly improvement doesn't seem consistent, should we just close this?

If you're happy to close, that's okay with me. Also happy to revisit the idea later if you have an idea of how it might be made to (more definitively) bear fruit.

@Nashtare
Collaborator Author

Nashtare commented Aug 2, 2023

Yeah, let's close it for now. The quotient_poly part wasn't the biggest overhead with lookups anyway. We're working internally on making lookups more efficient to prove; we should come up with something relatively soon!

@Nashtare Nashtare closed this Aug 2, 2023
@Nashtare Nashtare deleted the mul_u32 branch September 8, 2023 20:43
Successfully merging this pull request may close these issues.

Faster multiplication by generator