-
Notifications
You must be signed in to change notification settings - Fork 118
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: ICICLE msm integration #498
Conversation
Current benchmark doesn't look right, the difference is only 5x, we should expect at least >10x speedup. In initial exploration, we observed 200x differences! Again could due to setup, or warm up (i believe criterion only warm up CPU, not GPU) The number below is run on computing the same MSM (committing to the same polynomial) multiple times, and taking the average runtime. (instead of running multiple instances of MSM at the same time, which GPU should be better at?)
|
end_timer!(conv_time); | ||
|
||
// load them on host first | ||
let bases = HostOrDeviceSlice::Host(bases); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to convert and load SRS bases everytime during committing? Can we just load it once and reuse it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another point: in VID, the polynomial degree is much lower. Thus we also won't need to upload too many powers_of_g
elements to GPUs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to convert and load SRS bases everytime during committing? Can we just load it once and reuse it?
this is exactly what I meant on being unsure about the API boundary
ultimately we probably would not use this function in a standalone way, instead, we will pick it apart, and flesh out the full steps inside vid's function, to have fine-grained control on when are data being loaded, and maximize reuse.
imo, we would modify our struct Advz
and store Option<&HostOrDeviceSlice<T>>
that stores cudamem ref to the srs loaded in the previous run.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. I feel it might be better to split it at the PCS level rather than in the VID code. Because the PCS commit function itself shouldn't upload SRS everytime. Is it possible that we add another API, load_srs_to_gpu()
, and the commit_in_gpu
function can take &HostOrDevicesSlice<T>
and no longer need to upload srs, and it will return an error if SRS hasn't been uploaded?
In the VID code, it will call load_srs_to_gpu
and commit_in_gpu
when needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
your suggestion makes more sense! I'll implement that!
to go even further, I would also separate out load_poly_coeffs_on_gpu()
, and during commit_on_gpu()
only takes in pointers, this is because we could be reusing the coeffs from on-gpu FFT in the future. This would account for that flexibility.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a slight annoyance is lifetime due to HostOrDeviceSlice<'a, T>
as a return parameter, I've been fighting this today. I don't want to assign 'static
for it, as we shouldn't expect the reference to live that long, we only want it to be as long as the cuda pointer being active.
I'll figure sth out, but just to share some engineering journey.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done in 514d479
I'm so annoyed that I can't get my nix shell to work. outside it, i can compile and run icicle code, but inside I can only successfully compile, but any invocation of cuda FFI code would failed.
runtime error trace
I first suspected that I might need to dig deeper into what actually happens during the FFI calling phase, which object files are being used, and which dynamic library were used to built these executables etc. to pinpoint the source. p.s. since icicle's build script explicitly use baseShell = with pkgs;
clang15Stdenv.mkDerivation {
name = "clang15-nix-shell";
buildInputs = [
argbash
openssl
pkg-config
git
nixpkgs-fmt
cargo-with-nightly
stableToolchain
nightlyToolchain
cargo-sort
clang-tools_15
clangStdenv
llvm_15
] ++ lib.optionals stdenv.isDarwin
[ darwin.apple_sdk.frameworks.Security ];
CARGO_TARGET_DIR = "target/nix_rustc";
shellHook = ''
export RUST_BACKTRACE=full
export PATH="$PATH:$(pwd)/target/debug:$(pwd)/target/release"
# Prevent cargo aliases from using programs in `~/.cargo` to avoid conflicts with local rustup installations.
export CARGO_HOME=$HOME/.cargo-nix
# Ensure `cargo fmt` uses `rustfmt` from nightly.
export RUSTFMT="${nightlyToolchain}/bin/rustfmt"
export C_INCLUDE_PATH="${llvmPackages_15.libclang.lib}/lib/clang/${llvmPackages_15.libclang.version}/include"
export CC="${clang-tools_15.clang}/bin/clang"
export CXX="${clang-tools_15.clang}/bin/clang++"
export AR="${llvm_15}/bin/llvm-ar"
export CFLAGS="-mcpu=generic"
# ensure clang-sys got the correct version
export LLVM_CONFIG_PATH="${llvmPackages_15.llvm.dev}/bin/llvm-config"
export LIBCLANG_PATH=${llvmPackages_15.libclang.lib}/lib
export CLANG_PATH=${clang-tools_15.clang}/bin/clang
# by default choose u64_backend
export RUSTFLAGS='--cfg curve25519_dalek_backend="u64"'
''
# install pre-commit hooks
+ self.check.${system}.pre-commit-check.shellHook;
};
//....
devShells = {
# enter with `nix develop .#cudaShell`
cudaShell = baseShell.overrideAttrs (oldAttrs: {
# for GPU/CUDA env (e.g. to run ICICLE code)
name = "cuda-env-shell";
buildInputs = oldAttrs.buildInputs ++ [ cmake util-linux gcc11 ];
# CXX is overridden to use gcc11 as icicle-curves's build scripts need them, but gcc12 is not supported
shellHook = oldAttrs.shellHook + ''
export CUDA_PATH=/usr/local/cuda
export PATH="${pkgs.gcc11}/bin:$CUDA_PATH/bin:$CUDA_PATH/nvvm/bin:$PATH"
export LD_LIBRARY_PATH="$CUDA_PATH/lib64:$LIBCLANG_PATH"
'';
});
}; 🤦 need more time on this |
The code is ready for review, interestingly I have some naive "warmup" function that makes the benchmark more accurate, as the warmup takes a constant of ~200ms, which you won't see now:
There are some remaining tasks:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Description
closes: #490
Unit tests already test the correctness of the MSM results.
Benchmark can be run via
cargo bench --bench msm --features "test-srs icicle"
CudaErrorInsufficientDriver
err)Benchmark
TL;DR: it's about 50x speedup compared to arkworks! 🎉
Current
criterion
benchmark:Individual GPU-accelerated breakdown
MSM(2^20)
note: the "GPU-accelerated MSM" is somewhat misleading, because we use non-blocking async MSM on GPU, the computation actually wasn't finished before our CPU moved on, thus part of "Load MSM result GPU->CPU" is "synchronizing the result on the cuda stream" which means wait for the work to finish. Since we know the loading a single projective group should be instant, the more accurate MSM computation time is
20.49 + 26 = 47 sec
which aligns with thecriterion
output above.Before we can merge this PR, please make sure that all the following items have been
checked off. If any of the checklist items are not applicable, please leave them but
write a little note why.
Pending
section inCHANGELOG.md
Files changed
in the GitHub PR explorer