parallelize inline_all_lcs() API #335
Ah, it is a little bit hard to rewrite this function in a parallel way (it is somewhat inherently sequential). One potential temporary solution is to avoid adding many FpVars one by one: instead, take the sum as a witness variable and enforce a constraint over it. Currently, adding two FpVars creates a new symbolic LC. (By the way, for matrix multiplication you may want to take a look at Virgo: https://github.com/sunblaze-ucb/Virgo)
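A minimal sketch of that suggestion against the `ark-relations` API (the helper name and signature here are mine, not the library's): compute the dot product natively, witness the result once, and enforce a single constraint over one hand-built LC, rather than folding FpVars together term by term.

```rust
use ark_ff::PrimeField;
use ark_relations::{
    lc,
    r1cs::{ConstraintSystemRef, SynthesisError, Variable},
};

/// Hypothetical helper: enforce `sum = Σ coeffs[i] · vars[i]` with one
/// constraint and one hand-built linear combination, instead of creating
/// a fresh symbolic LC for every addition.
fn enforce_dot_product<F: PrimeField>(
    cs: ConstraintSystemRef<F>,
    coeffs: &[F],            // the public constant vector
    vars: &[(Variable, F)],  // witness variables paired with their values
) -> Result<Variable, SynthesisError> {
    // Compute the sum natively and allocate it as a single witness.
    let sum_val: F = coeffs.iter().zip(vars).map(|(c, (_, v))| *c * v).sum();
    let sum = cs.new_witness_variable(|| Ok(sum_val))?;

    // Build one LC covering the whole dot product.
    let mut dot = lc!();
    for (c, (var, _)) in coeffs.iter().zip(vars) {
        dot = dot + (*c, *var);
    }

    // Enforce (Σ cᵢ·xᵢ) · 1 = sum as a single R1CS constraint.
    cs.enforce_constraint(dot, lc!() + Variable::One, lc!() + sum)?;
    Ok(sum)
}
```

This allocates a constant number of symbolic LCs for the whole dot product instead of a couple per element.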
AFAIU, the correctness guarantees of that function as written require it to be executed sequentially. There are a couple of things I'd like to understand:
One parallelism-specific approach: could we build/finalize multiple "sub-constraint-systems" in parallel and then run some combining procedure, with different handling for shared and non-shared variables during the merge?
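A toy sketch of that combining idea, with simplified stand-in types rather than the arkworks API: each sub-system is built independently, and the merge re-indexes non-shared variables while shared variables keep a common global index.

```rust
/// One independently built sub-system. Simplified stand-in types: i64
/// stands in for field coefficients, and a constraint is an (a, b, c)
/// triple of LCs over local variable indices.
struct SubSystem {
    num_local_vars: usize,
    constraints: Vec<[Vec<(usize, i64)>; 3]>,
    /// (local index, global index) pairs for variables shared across subs.
    shared: Vec<(usize, usize)>,
}

/// Combine sub-systems built (possibly in parallel) into one system:
/// shared variables keep their common global index, non-shared variables
/// are re-indexed past the shared block.
fn combine(
    subs: Vec<SubSystem>,
    num_shared: usize,
) -> Vec<[Vec<(usize, i64)>; 3]> {
    let mut merged = Vec::new();
    let mut next_free = num_shared; // globals occupy indices [0, num_shared)
    for sub in subs {
        // Map each local index to its shared global index, or a fresh one.
        let mut map = vec![usize::MAX; sub.num_local_vars];
        for &(local, global) in &sub.shared {
            map[local] = global;
        }
        for m in map.iter_mut().filter(|m| **m == usize::MAX) {
            *m = next_free;
            next_free += 1;
        }
        // Rewrite every constraint of this sub-system into global indices.
        for [a, b, c] in sub.constraints {
            let remap = |lc: Vec<(usize, i64)>| -> Vec<(usize, i64)> {
                lc.into_iter().map(|(v, coeff)| (map[v], coeff)).collect()
            };
            merged.push([remap(a), remap(b), remap(c)]);
        }
    }
    merged
}
```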
For example, take a dot product: vector A · vector B = res. In the SNARK I need to create an FpVar for each element of A and B, and if the vectors are very long, inlining those linear combinations takes a lot of time.
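For reference, a sketch of that naive pattern using `ark-r1cs-std` (the names are illustrative): every `constant * FpVar` product and every `+=` allocates a fresh symbolic LC, so a length-n dot product leaves O(n) of them for `inline_all_lcs()` to flatten.

```rust
use ark_ff::PrimeField;
use ark_r1cs_std::{
    eq::EqGadget,
    fields::{fp::FpVar, FieldVar},
};
use ark_relations::r1cs::SynthesisError;

/// Naive in-circuit dot product: each `*` and `+=` below allocates a new
/// symbolic LC, all of which must later be flattened by inline_all_lcs().
fn naive_dot_product<F: PrimeField>(
    a_consts: &[F],       // public constant vector
    b_vars: &[FpVar<F>],  // witness vector
    res: &FpVar<F>,       // claimed result
) -> Result<(), SynthesisError> {
    let mut acc = FpVar::<F>::zero();
    for (a, b) in a_consts.iter().zip(b_vars) {
        acc += b * *a; // constant * FpVar, then +=: two new symbolic LCs
    }
    acc.enforce_equal(res)
}
```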
Yes. The situation is that the current system generates many symbolic LCs for it. When an FpVar is multiplied by a constant, a new LC is created, and the same goes for every addition of two FpVars.
Yeah. And the performance suffers a lot from such a huge number of symbolic LCs. I thought constant * FpVar was almost free, but it turns out not to be negligible. :( For this specific case, do you have any suggestions to reduce the number of LCs? I think it would be incorrect to accumulate the dot product using u64 and only turn the accumulated result into an FpVar.
Some microbenchmarking results: for a constant vector * witness vector dot product of length 10000, CRS generation takes ~60 seconds and proving takes ~60 seconds, which is far from "almost free". During inline_all_lcs(), ~90% of the execution time is spent at snark/relations/src/r1cs/constraint_system.rs, line 343 (commit 8d9055d).
Hey @brucechin, could you provide a link to the benchmark? Those numbers seem terrible.
This is the microbenchmark I wrote: https://github.com/brucechin/vector-dot-product-microbench/tree/master |
Hello, recently I have been working on a project with a huge number of constraints and linear combinations. Before this issue I asked about improving memory consumption in #324. However, the setup phase that generates the CRS is very slow, and I found that most of the time is spent in the inline_all_lcs() function. Previously, removing unnecessary LCs greatly reduced memory consumption (by ~90%). However, within this function, self.lc_map is still iterated over on a single thread (snark/relations/src/r1cs/constraint_system.rs, line 288, commit 54a0b1a).

May I ask whether the developer team has any plans to parallelize this function? It could improve performance a lot just by carefully inserting and removing elements into/from the transformed_lc_map atomically with a thread pool. This feature should be easy to implement for someone familiar with concurrent programming in Rust. Thanks!
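This is not the library's implementation, only a hedged sketch of how such a parallelization could look with rayon, assuming the invariant the sequential pass relies on (an LC only ever references lower-indexed LCs): group LCs into dependency levels and flatten each level in parallel.

```rust
use std::collections::BTreeMap;

use rayon::prelude::*;

type LcIndex = usize;
type Coeff = i64; // stand-in for the field element type

/// A term of a symbolic LC: either a concrete variable or a reference
/// to another (earlier) symbolic LC.
#[derive(Clone)]
enum Term {
    Var(usize, Coeff),
    SymbolicLc(LcIndex, Coeff),
}

/// Flatten every LC into plain variable terms, processing dependency
/// levels one at a time and all LCs within a level in parallel.
fn parallel_inline(
    lc_map: &BTreeMap<LcIndex, Vec<Term>>,
) -> BTreeMap<LcIndex, Vec<(usize, Coeff)>> {
    // 1. Dependency depth of each LC: longest chain of LC references below it.
    let mut depth: BTreeMap<LcIndex, usize> = BTreeMap::new();
    for (&idx, terms) in lc_map {
        let d = terms
            .iter()
            .filter_map(|t| match t {
                Term::SymbolicLc(j, _) => Some(depth[j] + 1),
                Term::Var(..) => None,
            })
            .max()
            .unwrap_or(0);
        depth.insert(idx, d);
    }

    // 2. Bucket LCs by depth; everything in one level depends only on
    //    LCs from strictly earlier levels.
    let max_depth = depth.values().copied().max().unwrap_or(0);
    let mut levels: Vec<Vec<LcIndex>> = vec![Vec::new(); max_depth + 1];
    for (&idx, &d) in &depth {
        levels[d].push(idx);
    }

    // 3. Flatten level by level; within a level, rayon runs LCs in
    //    parallel, and reads only touch `flat` entries of finished levels.
    let mut flat: BTreeMap<LcIndex, Vec<(usize, Coeff)>> = BTreeMap::new();
    for level in levels {
        let done: Vec<_> = level
            .par_iter()
            .map(|&idx| {
                let mut out = Vec::new();
                for t in &lc_map[&idx] {
                    match t {
                        Term::Var(v, c) => out.push((*v, *c)),
                        Term::SymbolicLc(j, c) => {
                            // Substitute the already-flattened LC `j`.
                            for (v, cj) in &flat[j] {
                                out.push((*v, c * cj));
                            }
                        }
                    }
                }
                (idx, out)
            })
            .collect();
        flat.extend(done);
    }
    flat
}
```

The trade-off is an extra pass to compute the levels plus a barrier between levels, so whether this beats the sequential pass depends on how deep the LC dependency chains are in practice.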