parallelize inline_all_lcs() API #335
Ah, it is a little bit hard to rewrite this function in a parallel way (it is somewhat inherently sequential). One potential temporary solution is to avoid adding many FpVars one by one: instead, take the sum as a witness variable and enforce a constraint over it. Currently, adding two FpVars creates a new symbolic LC. (By the way, for matrix multiplication you may want to take a look at Virgo: https://github.com/sunblaze-ucb/Virgo)
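A minimal sketch of that suggestion against the `ark-relations` API (the helper name and signature here are mine, not the library's): compute the dot product natively, witness the result once, and enforce a single constraint over one hand-built LC, rather than folding FpVars together term by term.

```rust
use ark_ff::PrimeField;
use ark_relations::{
    lc,
    r1cs::{ConstraintSystemRef, SynthesisError, Variable},
};

/// Hypothetical helper: enforce `sum = Σ coeffs[i] · vars[i]` with one
/// constraint and one hand-built linear combination, instead of creating
/// a fresh symbolic LC for every addition.
fn enforce_dot_product<F: PrimeField>(
    cs: ConstraintSystemRef<F>,
    coeffs: &[F],            // the public constant vector
    vars: &[(Variable, F)],  // witness variables paired with their values
) -> Result<Variable, SynthesisError> {
    // Compute the sum natively and allocate it as a single witness.
    let sum_val: F = coeffs.iter().zip(vars).map(|(c, (_, v))| *c * v).sum();
    let sum = cs.new_witness_variable(|| Ok(sum_val))?;

    // Build one LC covering the whole dot product.
    let mut dot = lc!();
    for (c, (var, _)) in coeffs.iter().zip(vars) {
        dot = dot + (*c, *var);
    }

    // Enforce (Σ cᵢ·xᵢ) · 1 = sum as a single R1CS constraint.
    cs.enforce_constraint(dot, lc!() + Variable::One, lc!() + sum)?;
    Ok(sum)
}
```

This allocates a constant number of symbolic LCs for the whole dot product instead of a couple per element.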
AFAIU, the correctness guarantees of that function as written require it to be executed sequentially. There are a couple of things I'd like to understand:
One parallelism-specific approach: could we build/finalize multiple "sub-constraint-systems" in parallel and then run some combining procedure, with different handling for shared and non-shared variables during the merge?
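A toy sketch of that combining idea, with simplified stand-in types rather than the arkworks API: each sub-system is built independently, and the merge re-indexes non-shared variables while shared variables keep a common global index.

```rust
/// One independently built sub-system. Simplified stand-in types: i64
/// stands in for field coefficients, and a constraint is an (a, b, c)
/// triple of LCs over local variable indices.
struct SubSystem {
    num_local_vars: usize,
    constraints: Vec<[Vec<(usize, i64)>; 3]>,
    /// (local index, global index) pairs for variables shared across subs.
    shared: Vec<(usize, usize)>,
}

/// Combine sub-systems built (possibly in parallel) into one system:
/// shared variables keep their common global index, non-shared variables
/// are re-indexed past the shared block.
fn combine(
    subs: Vec<SubSystem>,
    num_shared: usize,
) -> Vec<[Vec<(usize, i64)>; 3]> {
    let mut merged = Vec::new();
    let mut next_free = num_shared; // globals occupy indices [0, num_shared)
    for sub in subs {
        // Map each local index to its shared global index, or a fresh one.
        let mut map = vec![usize::MAX; sub.num_local_vars];
        for &(local, global) in &sub.shared {
            map[local] = global;
        }
        for m in map.iter_mut().filter(|m| **m == usize::MAX) {
            *m = next_free;
            next_free += 1;
        }
        // Rewrite every constraint of this sub-system into global indices.
        for [a, b, c] in sub.constraints {
            let remap = |lc: Vec<(usize, i64)>| -> Vec<(usize, i64)> {
                lc.into_iter().map(|(v, coeff)| (map[v], coeff)).collect()
            };
            merged.push([remap(a), remap(b), remap(c)]);
        }
    }
    merged
}
```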
For example, take a dot product: vector A · vector B = res. In the SNARK I need to create an FpVar for each element of A and B, and if the vectors are very long, inlining those linear combinations takes a lot of time.
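For reference, a sketch of that naive pattern using `ark-r1cs-std` (the names are illustrative): every `constant * FpVar` product and every `+=` allocates a fresh symbolic LC, so a length-n dot product leaves O(n) of them for `inline_all_lcs()` to flatten.

```rust
use ark_ff::PrimeField;
use ark_r1cs_std::{
    eq::EqGadget,
    fields::{fp::FpVar, FieldVar},
};
use ark_relations::r1cs::SynthesisError;

/// Naive in-circuit dot product: each `*` and `+=` below allocates a new
/// symbolic LC, all of which must later be flattened by inline_all_lcs().
fn naive_dot_product<F: PrimeField>(
    a_consts: &[F],       // public constant vector
    b_vars: &[FpVar<F>],  // witness vector
    res: &FpVar<F>,       // claimed result
) -> Result<(), SynthesisError> {
    let mut acc = FpVar::<F>::zero();
    for (a, b) in a_consts.iter().zip(b_vars) {
        acc += b * *a; // constant * FpVar, then +=: two new symbolic LCs
    }
    acc.enforce_equal(res)
}
```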
Yes. The situation is that the current system generates many symbolic LCs for it. When an FpVar is multiplied by a constant, a new LC is created, and the same goes for every addition of two FpVars.
Yeah. And the performance suffers a lot from such a huge number of symbolic LCs. I thought constant * FpVar was almost free, but it turns out not to be negligible. :( For this specific case, do you have any suggestions to reduce the number of LCs? I think it would be incorrect to accumulate the dot product using u64 and only turn the accumulated result into an FpVar.
Some microbenchmarking results: for a constant vector * witness vector dot product of length 10000, CRS generation takes ~60 seconds and proving takes ~60 seconds, which is far from "almost free". During inline_all_lcs(), ~90% of the execution time is spent at snark/relations/src/r1cs/constraint_system.rs, line 343 (commit 8d9055d).
Hey @brucechin, could you provide a link to the benchmark? Those numbers seem terrible.
This is the microbenchmark I wrote: https://github.com/brucechin/vector-dot-product-microbench/tree/master |
Hello, recently I have been working on a project with a huge number of constraints and linear combinations. Before this issue I asked about improving memory consumption in #324. However, the setup phase that generates the CRS is very slow, and I found that most of the time is spent in the inline_all_lcs() function. Previously, removing unnecessary LCs greatly reduced memory consumption (by ~90%). However, within this function, self.lc_map is still iterated over on a single thread (snark/relations/src/r1cs/constraint_system.rs, line 288, commit 54a0b1a).

May I ask whether the developer team has any plans to parallelize this function? It could improve performance a lot just by carefully inserting and removing elements into/from the transformed_lc_map atomically with a thread pool. This feature should be easy to implement for someone familiar with concurrent programming in Rust. Thanks!
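This is not the library's implementation, only a hedged sketch of how such a parallelization could look with rayon, assuming the invariant the sequential pass relies on (an LC only ever references lower-indexed LCs): group LCs into dependency levels and flatten each level in parallel.

```rust
use std::collections::BTreeMap;

use rayon::prelude::*;

type LcIndex = usize;
type Coeff = i64; // stand-in for the field element type

/// A term of a symbolic LC: either a concrete variable or a reference
/// to another (earlier) symbolic LC.
#[derive(Clone)]
enum Term {
    Var(usize, Coeff),
    SymbolicLc(LcIndex, Coeff),
}

/// Flatten every LC into plain variable terms, processing dependency
/// levels one at a time and all LCs within a level in parallel.
fn parallel_inline(
    lc_map: &BTreeMap<LcIndex, Vec<Term>>,
) -> BTreeMap<LcIndex, Vec<(usize, Coeff)>> {
    // 1. Dependency depth of each LC: longest chain of LC references below it.
    let mut depth: BTreeMap<LcIndex, usize> = BTreeMap::new();
    for (&idx, terms) in lc_map {
        let d = terms
            .iter()
            .filter_map(|t| match t {
                Term::SymbolicLc(j, _) => Some(depth[j] + 1),
                Term::Var(..) => None,
            })
            .max()
            .unwrap_or(0);
        depth.insert(idx, d);
    }

    // 2. Bucket LCs by depth; everything in one level depends only on
    //    LCs from strictly earlier levels.
    let max_depth = depth.values().copied().max().unwrap_or(0);
    let mut levels: Vec<Vec<LcIndex>> = vec![Vec::new(); max_depth + 1];
    for (&idx, &d) in &depth {
        levels[d].push(idx);
    }

    // 3. Flatten level by level; within a level, rayon runs LCs in
    //    parallel, and reads only touch `flat` entries of finished levels.
    let mut flat: BTreeMap<LcIndex, Vec<(usize, Coeff)>> = BTreeMap::new();
    for level in levels {
        let done: Vec<_> = level
            .par_iter()
            .map(|&idx| {
                let mut out = Vec::new();
                for t in &lc_map[&idx] {
                    match t {
                        Term::Var(v, c) => out.push((*v, *c)),
                        Term::SymbolicLc(j, c) => {
                            // Substitute the already-flattened LC `j`.
                            for (v, cj) in &flat[j] {
                                out.push((*v, c * cj));
                            }
                        }
                    }
                }
                (idx, out)
            })
            .collect();
        flat.extend(done);
    }
    flat
}
```

The trade-off is an extra pass to compute the levels plus a barrier between levels, so whether this beats the sequential pass depends on how deep the LC dependency chains are in practice.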