Skip to content

Add PrunedStore and use it in Wasmi's executor #1449

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 49 commits into from
Apr 3, 2025
Merged

Conversation

Robbepop
Copy link
Member

@Robbepop Robbepop commented Mar 30, 2025

Closes #1448
Unblocks #1433

This adds a PrunedStore that can be created from a Store<T>. It can safely be restored back to a Store<T> and use certain Store<T> related methods. However, this conversion is not entirely free. PrunedStore provides an efficient way to access the StoreInner parts of a Store<T>.

Note: the code to convert between Store<T> and PrunedStore uses unsafe Rust code. I execute the PrunedStore conversion tests with miri and it did not find any unsoundness while running the tests. The entire Wasmi testsuite passes using the PrunedStore.

Downsides

  • The T in Store<T> is now required to be 'static. Not a very big deal in common user code but not great either, especially since Wasmtime's Store<T> does not have this requirement. This is probably because Wasmtime simly does not perform the TypeId check at all.
  • Performance tests indicate that Wasmi execution is heavily affected by the changes introduced by this PR. Some test cases perform similar to before (e.g. tiny_keccak) others perform way worse (e.g. counter). This indicates that Wasmi performs the same overall but performance of different op-codes changed significantly.

Copy link

codecov bot commented Mar 30, 2025

Codecov Report

Attention: Patch coverage is 89.47368% with 12 lines in your changes missing coverage. Please review.

Project coverage is 71.50%. Comparing base (53aa916) to head (601776d).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
crates/wasmi/src/store.rs 86.27% 7 Missing ⚠️
crates/wasmi/src/engine/executor/instrs.rs 85.71% 2 Missing ⚠️
crates/wasmi/src/engine/executor/instrs/call.rs 94.87% 2 Missing ⚠️
crates/wasmi/src/engine/executor/mod.rs 83.33% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1449      +/-   ##
==========================================
+ Coverage   71.47%   71.50%   +0.02%     
==========================================
  Files         161      161              
  Lines       16352    16364      +12     
==========================================
+ Hits        11688    11701      +13     
+ Misses       4664     4663       -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Robbepop added 20 commits March 30, 2025 13:42
This way PrunedStore is pointer sized.
This is an API can can more easily be wrapped for an indirect call via a PrunedStore.
This is a better name since this type manages both, function parameters and results.
This allows to call host functions via a PrunedStore.
This way they cannot be used with foo.bar() syntax to disambiguate them with their Deref[Mut] impls.
It no longer holds a reference to a Store<Prune> but simply is the shape of a Store<Prune> and now has #[repr(transparent)] to make transmutation sound.
This is just for demo purposes (it works) and not final code.
Call hooks from Wasm -> Host are now handled at the TypedStore.
Also make RestorePrunedWrapper::restore infallible.
@Robbepop
Copy link
Member Author

Robbepop commented Mar 30, 2025

I just benchmarked the Wasmi executor that is now using the new PrunedStore type and is therefore fully non-generic.

Performance is affected a lot. It seems that Wasmi can still achieve the same performance, but op-codes are affected very differently. For example, counter dropped significantly whereas tiny_keccak remained stable and global_bump even improved ever so slightly.
This usually means that we can fix these performance rollercoaster with some inspection and #[inline(never)], #[inline(always)] or normal #[inline] annotations. However, this is super flaky and might change with every Rust version. This would be solved if Rust had explicit-tail-calls.

image

@Robbepop Robbepop changed the title experiment: Add PrunedStore Add PrunedStore and use it in Wasmi's executor Mar 31, 2025
@Robbepop
Copy link
Member Author

Robbepop commented Apr 2, 2025

Things I have already tried out to fix the performance regressions of this PR:

  • Putting #[inline], #[inline(never)] and #[inline(always)] and combinations in various selected places in the Wasmi executor. While this might seem arbitrary it helped a lot with past performance regressions. Unfortunately, so far I only yielded the same performance or regressed performance further.
  • I tried to shift Wasmi bytecode instructions around, thus changing the structure of the jump table. This helped slightly for certain benchmarks (like tiny_keccak) but resulted in a regression overall.
  • I tried to PGO compile Wasmi. This helped when used with a very limited set of benchmarks. For example, when running PGO with the tiny_keccak test case it yielded a performance improvement of ~15%. However, as soon as the benchmark set was expanded there were no performance gains anymore.
  • I have tried out different LLVM flags via rustc -Cllvm-flag to have an impact on LLVM's optimization heuristics. Quite a few LLVM flags have been tried out, none of which actually succeeded.
  • I have tried out cargo flamegraph to get a better understanding. However, flamegraphs are not very helpful for Wasmi performance profiling because one has to compile with maximum optimization flags in order to profile but when doing so Wasmi collapses all the executor instructions into a single execute_instrs function and then the entire flamegraph is just flat form there on and no useful information is gained. Having a tail-call based instruction dispatch would really help here.
  • I have tried Xcode's Instruments profiler. It worked but yielded very similar results to cargo flamegraph and thus was not very helpful for me.
  • I have tried to make Wasmi's state machine more explicit to Rust and LLVM in order to help heuristics as I have read that actually LLVM should generate way better dispatch logic than it actually does for Wasmi: it currently generates a single jump table based dispatched and it ideally should decentralize those jumps into each op-code handler block to help the branch predictor, especially on ARM hardware, but it doesn't.
  • I have tried to put store into the Executor struct to see what performance implications this has. It yielded no differences in performance.
  • I tried removing the safety checks in PrunedStore::restore to see whether store safety checks have any (major) impact on Wasmi's performance for weird reasons. But as expected it did not yield any performance changes.
  • My suspicion is that in the generic Wasmi executor the store parameter was not put into a register and the non-generic Wasmi executor can put it in one. So I tried to use black_box and other techniques like wrapping store into yet another struct to prevent Rust from doing so but with no success.
  • I have tried out various different sets of optimization flags but, as expected, this yielded no performance changes to what was known before.
  • I have tried enabling the simd crate feature to see whether enabling even more Wasmi op-codes would yield crazy different performance metrics. But, as expected, this just yielded the known performance regressions on top by roughly 10-15%.
  • I tried applying #[cold] attributes onto variaous sets of op-code handlers that make use of the store parameter.
  • I have used cargo asm to have a look at the generated LLVM and aarch64 assembly for Wasmi's interpreter loop. This revealed that LLVM does a very poor job at optimizing Wasmi's jump table as it uses a central jump table dispatch but does not put branches into each op-code handler which would significantly boost performance especially on ARM hardware. However, the same is true for main so not a performance issue specific to this PR.
  • Currently trying to figure out the differences on the generated LLVM IR and aarch64 assembly between main and the PR:
  • I even tried a walk in the park but with no success.

@Robbepop
Copy link
Member Author

Robbepop commented Apr 3, 2025

After the most recent commits, performance is somewhat on par with main again.

Benchmarks

Same or Better

Real Use Cases

image

Loads and Stores

image

Global Writes

image

Regressions

image

image

@Robbepop
Copy link
Member Author

Robbepop commented Apr 3, 2025

Despite the remaining performance issues (e.g. counter) I think this is good to go. The performance improvements balance out the regressions kind of. We can have follow-up PRs to fix the remaining performance issues.

@Robbepop Robbepop merged commit 7b3e45c into main Apr 3, 2025
19 checks passed
@Robbepop Robbepop deleted the rf-add-pruned-store branch April 3, 2025 12:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Experiment with PrunedStore for Wasmi's executor
1 participant