
parallelize inline_all_lcs() API #335

Open · brucechin opened this issue Jan 3, 2021 · 9 comments
@brucechin commented Jan 3, 2021

Hello, I have recently been working on a project that involves a huge number of constraints and linear combinations. Before this issue, I asked about improving memory consumption in #324. However, the setup phase that generates the CRS is very slow, and I found that most of the time is spent in the inline_all_lcs() function.

Removing unnecessary LCs previously reduced memory consumption a great deal (by roughly 90%). However, this function still iterates over self.lc_map on a single thread:

```rust
for (&index, lc) in &self.lc_map {
```

May I ask whether the developer team has any plans to parallelize this function? It could improve performance a lot, just by carefully inserting and removing elements of transformed_lc_map atomically from a thread pool. This feature should be easy to implement for someone familiar with concurrent programming in Rust. Thanks!
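To make this concrete, here is a toy sketch of the shape I have in mind, using rayon over a simplified map type. The resolve() function is a hypothetical stand-in for inlining a single LC, and the sketch assumes each entry can be resolved independently of the others, which is the part that would need care:

```rust
use rayon::prelude::*;
use std::collections::BTreeMap;

// Simplified LC: a list of (coefficient, variable index) pairs.
type Lc = Vec<(u64, usize)>;

// Hypothetical stand-in for inlining one LC.
fn resolve(lc: &Lc) -> Lc {
    lc.clone()
}

// Resolve every entry in parallel, then collect the results back
// into an ordered map.
fn inline_all_parallel(lc_map: &BTreeMap<usize, Lc>) -> BTreeMap<usize, Lc> {
    lc_map
        .par_iter()
        .map(|(&index, lc)| (index, resolve(lc)))
        .collect()
}
```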

@weikengchen (Member) commented Jan 3, 2021

Ah, it is a little bit hard to rewrite this function in a parallel way (it is inherently somewhat sequential).

One potential temporary solution is to avoid adding many FpVars one by one and instead take the sum as a witness variable and enforce a constraint over it. Currently, adding two FpVars creates a new symbolic LC; a minimal illustration is below.
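This sketch only illustrates where the symbolic LCs come from; crate names here are the current ark-relations / ark-r1cs-std ones (older releases used r1cs-core / r1cs-std):

```rust
use ark_ff::PrimeField;
use ark_r1cs_std::{alloc::AllocVar, fields::fp::FpVar};
use ark_relations::r1cs::{ConstraintSystemRef, SynthesisError};

// Adding two FpVar witnesses introduces no constraint, but it records a
// fresh symbolic LC in the constraint system, and those LCs are what
// inline_all_lcs() later has to chase down.
fn add_creates_symbolic_lc<F: PrimeField>(
    cs: ConstraintSystemRef<F>,
) -> Result<FpVar<F>, SynthesisError> {
    let a = FpVar::new_witness(cs.clone(), || Ok(F::from(2u8)))?;
    let b = FpVar::new_witness(cs, || Ok(F::from(3u8)))?;
    Ok(&a + &b) // no new constraint, but a new symbolic LC
}
```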

(By the way, for matrix multiplication, you may want to take a look at Virgo: https://github.com/sunblaze-ucb/Virgo)

@ValarDragon (Member) commented Jan 3, 2021

AFAIU, the correctness guarantees of that function as written require it to be executed sequentially.

There are a couple of things I'd like to understand:

  1. How many constraints are there versus how many symbolic LCs in this constraint system?
  2. (Presumably) how are we getting so many symbolic LCs, and are there ways to avoid creating so many? (If it is all coming from one gadget or usage pattern, can we make that gadget introduce fewer symbolic LCs for now, in the absence of a general solution?)
  3. What is taking all the time in inline_all_lcs? Is it mallocs? If so, maybe we can reduce the number of distinct malloc calls (pooling, or allocating more at once); we can definitely reduce the amount of data (see #122, "Consider storing a pointer to the value of a coefficient in the final R1CS object"). Is it BTree search calls? If so, how would that compare against alternate data structures? (A rough timing sketch follows this list.)
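For (3), a rough first step could be timing the pass in isolation, along the lines of the sketch below (paths assume the current ark-relations crate; inline_all_lcs() is reached here through borrow_mut()):

```rust
use ark_ff::Field;
use ark_relations::r1cs::ConstraintSystemRef;
use std::time::Instant;

// Time the inlining pass by itself on a fully synthesized constraint system.
fn time_inlining<F: Field>(cs: ConstraintSystemRef<F>) {
    let num_constraints = cs.num_constraints();
    if let Some(mut inner) = cs.borrow_mut() {
        let start = Instant::now();
        inner.inline_all_lcs();
        println!(
            "inline_all_lcs over {} constraints took {:?}",
            num_constraints,
            start.elapsed()
        );
    }
}
```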

An approach specific to improving this with parallelism: could we build/finalize multiple "sub-constraint-systems" in parallel and then have some combining procedure? (With different handling for shared and non-shared variables in the combining step.)

@brucechin (Author) commented Jan 3, 2021

> Ah, it is a little bit hard to rewrite this function in a parallel way (it is inherently somewhat sequential).
>
> One potential temporary solution is to avoid adding many FpVars one by one and instead take the sum as a witness variable and enforce a constraint over it. Currently, adding two FpVars creates a new symbolic LC.
>
> (By the way, for matrix multiplication, you may want to take a look at Virgo: https://github.com/sunblaze-ucb/Virgo)

I thought, for example: vector A dot product vector B = res. In the SNARK I need to create an FpVar for each element of A and B. But if the dot product length is very long, inlining those linear combinations takes a long time.

@brucechin (Author)

> AFAIU, the correctness guarantees of that function as written require it to be executed sequentially.
>
> There are a couple of things I'd like to understand:
>
>   1. How many constraints are there versus how many symbolic LCs in this constraint system?
>   2. (Presumably) how are we getting so many symbolic LCs, and are there ways to avoid creating so many? (If it is all coming from one gadget or usage pattern, can we make that gadget introduce fewer symbolic LCs for now, in the absence of a general solution?)
>   3. What is taking all the time in inline_all_lcs? Is it mallocs? If so, maybe we can reduce the number of distinct malloc calls (pooling, or allocating more at once); we can definitely reduce the amount of data (see #122, "Consider storing a pointer to the value of a coefficient in the final R1CS object"). Is it BTree search calls? If so, how would that compare against alternate data structures?
>
> An approach specific to improving this with parallelism: could we build/finalize multiple "sub-constraint-systems" in parallel and then have some combining procedure? (With different handling for shared and non-shared variables in the combining step.)

  1. Because I use constant wire * witness in the vector dot product, the number of constraints is not large. But the number of linear combinations can be thousands of times larger.
  2. Because the matrix is so large, I have to create many constant/witness wires for the multiplication.
  3. I have not done detailed profiling yet, so I cannot provide a concrete profiling report for the inline_all_lcs() function here.

@weikengchen (Member)

> > Ah, it is a little bit hard to rewrite this function in a parallel way (it is inherently somewhat sequential).
> > One potential temporary solution is to avoid adding many FpVars one by one and instead take the sum as a witness variable and enforce a constraint over it. Currently, adding two FpVars creates a new symbolic LC.
> > (By the way, for matrix multiplication, you may want to take a look at Virgo: https://github.com/sunblaze-ucb/Virgo)
>
> I thought, for example: vector A dot product vector B = res. In the SNARK I need to create an FpVar for each element of A and B.

> ```rust
> fn constant_mul_cs_helper_u8(cs: ConstraintSystemRef<Fq>, a: u8, constant: u8) -> FqVar {
>     let aa: Fq = a.into();
>     let cc: Fq = constant.into();
>     // witness * constant adds no constraint, but each product still
>     // creates a new symbolic LC
>     let a_var = FpVar::<Fq>::new_witness(r1cs_core::ns!(cs, "a gadget"), || Ok(aa)).unwrap();
>     let c_var = FpVar::Constant(cc);
>     a_var.mul(c_var)
> }
>
> fn scala_cs_helper_u8(cs: ConstraintSystemRef<Fq>, input: &[u8], weight: &[u8]) -> FqVar {
>     // zero-valued accumulator; every `+=` below creates another symbolic LC
>     let mut tmp1 =
>         FpVar::<Fq>::new_witness(r1cs_core::ns!(cs, "q1*q2 gadget"), || Ok(Fq::zero())).unwrap();
>     for i in 0..input.len() {
>         tmp1 += constant_mul_cs_helper_u8(cs.clone(), input[i], weight[i]);
>     }
>     tmp1
> }
> ```

> Although the above code does not generate many constraints (they are all constant wires multiplying witnesses), if the dot product length is very long, inlining those linear combinations takes a long time.

Yes. The situation is that the current system generates many symbolic LCs for this: when an FpVar is multiplied by a constant, a new LC is created, and every += creates another symbolic LC. In fact, the sum is merely a linear combination of the inputs, so it should require only one constraint and one LC.
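Roughly the following shape (a minimal sketch against the low-level constraint-system API, written with the current ark-relations names; the helper itself is illustrative, not an existing gadget):

```rust
use ark_ff::Field;
use ark_relations::{
    lc,
    r1cs::{ConstraintSystemRef, LinearCombination, SynthesisError, Variable},
};

// Enforce <input, weight> = sum with one constraint and one LC,
// instead of one symbolic LC per `+=`.
fn dot_product_one_constraint<F: Field>(
    cs: ConstraintSystemRef<F>,
    input: &[F],  // witness values
    weight: &[F], // public constants
) -> Result<Variable, SynthesisError> {
    // Allocate each input element as a witness variable.
    let input_vars = input
        .iter()
        .map(|v| cs.new_witness_variable(|| Ok(*v)))
        .collect::<Result<Vec<_>, _>>()?;

    // Compute the dot product natively and allocate the result as one witness.
    let sum_value = input
        .iter()
        .zip(weight)
        .fold(F::zero(), |acc, (a, w)| acc + *a * *w);
    let sum = cs.new_witness_variable(|| Ok(sum_value))?;

    // One constraint: (sum_i weight_i * input_i) * 1 = sum.
    let mut lhs: LinearCombination<F> = lc!();
    for (var, w) in input_vars.iter().zip(weight) {
        lhs = lhs + (*w, *var);
    }
    cs.enforce_constraint(lhs, lc!() + Variable::One, lc!() + sum)?;
    Ok(sum)
}
```

That keeps the LC count independent of the vector length; only the allocation of the input witnesses scales with it.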

@brucechin (Author)

Yeah. And the performance suffers a lot from such a huge number of symbolic LCs. I thought constant * FpVar was almost free, but it turns out not to be negligible :( For this specific case, do you have any suggestions to reduce the number of LCs? I think it would be incorrect to accumulate the dot product in a uint64 and only convert the accumulated result into an FpVar.

@brucechin (Author) commented Jan 5, 2021

Some microbenchmarking results:

- constant vector * witness vector, length 10000: CRS generation ~60 seconds, proving ~60 seconds, which is far from "almost free".
- witness vector * witness vector, length 10000: CRS generation ~80 seconds, proving ~70 seconds.

During the inline_all_lcs() function, ~90% of the execution time is spent on:

```rust
transformed_lc.extend((lc * coeff).0.into_iter());
```

@Pratyush (Member) commented Jan 5, 2021

Hey @brucechin, could you provide a link to the benchmark? Those numbers seem terrible.

@brucechin (Author) commented Jan 6, 2021

This is the microbenchmark I wrote: https://github.com/brucechin/vector-dot-product-microbench/tree/master
