-
Notifications
You must be signed in to change notification settings - Fork 667
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add RISC-V vector spec v1.0 support #279
base: master
Are you sure you want to change the base?
Conversation
Maybe you could first submit to https://github.com/rdolbeau/fftw3/tree/riscv-v where the code was originally developed with the EPI intrinsics. The RVV implementation (and the SVE implementation upon which it is based in https://github.com/rdolbeau/fftw3/tree/arm-sve-clean) adds a lot of codelets by using a 'fixed-width' model, which may not be ideal for code-size and planning time. But a "scalable" implementation (full-width and/or dynamically masked) would require an adapted infrastructure, as currently the SIMD solvers require a known width at compile-time. |
@rdolbeau The code is indeed a prototype and need further development. I'd love to submit it to your repo first, and we can make some optimizations on it later. |
@sh-zheng Indeed I need to bring the branch up-to-date (or create a new, clean one), I'll try to do that ASAP. |
#include <riscv_vector.h> | ||
|
||
#if HAVE_RVV | ||
/* don't know how to autodetect RVV; assume it is present */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Detect RVV at run-time:
#include <sys/auxv.h>
if (getauxval(AT_HWCAP) & HWCAP_ISA_V)
{
}
It's not detect zve*
right, but that's the best we can did for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you so much. I will try this detection.
@rdolbeau Thank you so much. A new, clean branch may be better, since the official rvv intrinsics are stable enough for development. |
@sh-zheng Sorry for the delay, been quite busy; riscv-v-clean should be up-to-date with master and just a couple of consolidated commits (rdcycle & the V support). |
@rdolbeau Thank you very much for the new branch. I will submit my codes into it as soon as possible. |
What is the plan? For @sh-zheng to PR their changes to @rdolbeau's fork, and then PR that here?
My suggestion is to contribute the fixed-width RVV support first, and then propose optimizations in follow-up patches. A fixed SIMD width seems to be a fundamental assumption of FFTW's code-gen, and breaking this assumption may require invasive changes. Additionally, both RVV and SVE could benefit from supporting scalable vectors, and it might be preferable to consider both use-cases at the same time, to ensure the changes don't preclude one ISA or the other. |
@nick-knight I agree. Maybe we can submit the fixed-width version first. The submit of scalable version is still in progress and may not be perfect soon. |
@rdolbeau @sh-zheng r5v.c:28:12: warning: implicit declaration of function '__builtin_epi_vsetvl'; did you mean '__builtin_iceil'? [-Wimplicit-function-declaration] 28 | return __builtin_epi_vsetvl(rs/64, __epi_e64, __epi_m1) >= (rs/64); | ^~~~~~~~~~~~~~~~~~~~ | __builtin_iceilr5v.c:28:12: warning: nested extern declaration of '__builtin_epi_vsetvl' [-Wnested-externs]r5v.c:28:40: error: '__epi_e64' undeclared (first use in this function) 28 | return __builtin_epi_vsetvl(rs/64, __epi_e64, __epi_m1) >= (rs/64); | ^~~~~~~~~r5v.c:28:40: note: each undeclared identifier is reported only once for each function it appears inr5v.c:28:51: error: '__epi_m1' undeclared (first use in this function) 28 | return __builtin_epi_vsetvl(rs/64, __epi_e64, __epi_m1) >= (rs/64); | ^~~~~~~~r5v.c:29:3: warning: control reaches end of non-void function [-Wreturn-type] 29 | } 2,I remove --enable-r5v, it can build pass, but it can't work in C906, if I call any fftw API(for example, fftw_plan_dft_1d), the program will get stuck. No any error been print. These two problems make me so consufed. Could you help me? |
@JustToEnjoy Unfortunately on my branch the code still uses EPI syntax (the European Processor Initiative project, which developed some early support for a set V intrinsics internally). It needs updating to 'standard' intrinsics. @sh-zheng had a prototype but I'm not sure targeting which set of intrinsics, or which version(s) of which compilers (llvm, gcc) would have support. |
SiFive has successfully built FFTW from this PR branch: it passes tests, and we are pointing our customers at it. We used an in-house LLVM toolchain, but I don't think it differs significantly from upstream LLVM w.r.t. the necessary intrinsics. |
@nick-knight My suggestion to @sh-zheng was to merge on my branch (riscv-v-clean) so we keep the history, then merge to upstream (here). Unfortunately it seems our calendar didn't align it seems :-( |
@rdolbeau I have done my best to acknowledge your contribution in customer engagements. If @sh-zheng follows through on your suggestion, then I assume they will close this PR and you will open a new one. When that happens, I will start pointing customers at the new one. Or, even better, the maintainer will accept the PR and we can just use |
So sorry for the delay since I'm in some troubles, and may not continue to move forward with the project :-( . I'd appreciate it if someone could take the project forward. The project is now in such a state, that: My suggestion is that, since the main-line of the fftw has not supported the flexible vector-length, we could submit this version, as a preliminary implementation of fixed vector length, and at least make it usable. The version of flexible vector-length could be developed later, and be submitted in another PR. @nick-knight @rdolbeau |
@sh-zheng @nick-knight For unrelated reason I recently opened a PR for SVE (#315). I'll try to take a look at @sh-zeng modifications and see if I can merge them on my RVV branch to follow-up, as they are somewhat similar in terms of behavior and issues. |
Great! |
1. Extend the support of VLEN to 65536. The data type in struct tw_instr of kernel/ifftw.h should be extended synchronously, to support longer integer. 2. Include vtw.h for VTW1, VTW2, and VTWS, to be compatible with the coding style of ARM SVE. Although the vtw.h is auto-generated in the version of ARM SVE, for the integrity and correctness, the vtw.h will be submitted in this version.
Hello @sh-zheng, think you might want to add dft/simd/rvv*/*.c and rdft/simd/rvv*/*.c to .gitignore. |
@OMaghiarIMG Thank you for the review. A new version has been updated. |
At last I've been able to move my own branch from the old EPI intrinsics to the current set of RISC-V V1.0 intrinsics. I've made a release package to ease testing (no need for ocaml and maintainer mode!). It is able to compile and pass checks using either clang-18 (from Debian sid) or gcc-14 (recompiled from FSF source), running on a Banana Pi F3 SBC which features RISC-V V1.0 using 256-bits registers. It's not very fast and only has 4 GiB of RAM, but it's a lot faster than Qemu :-) @sh-zheng branch also pass the same set of tests on the same hardware. Interestingly, the performance results I get are highly dependent on the compiler: with clang-18 my branch appear to be a bit faster, while with gcc-14 @sh-zheng branch is a bit faster. There's probably quite a bit of analytics needed to figure out what's the best way (or a reasonable compromise...) of doing single-vector/interleaved format using RISC-V V. The many different micro-architectures that are appearing will make trade-offs in the source a lot more complex than for x86-64 or aarch64... Also, an alternate implementation would be to update and test the 'split' code, which uses two vectors for complex (one for real, one for imaginary). However this requires the V macro to hold both scalable vectors in a structure or array, which I'm not sure is supported in current compilers (the EPI compiler from the BSC does [based on clang], or a least did at some point). @rofirrim, do you know in which compiler(s) (upstream clang, upstream gcc, BSC's clang) there would be support for a struct or an array of scalable types? |
Hi @rdolbeau,
I seem to recall you would be able to do that if it is acceptable to you to specify a minimum vector size (similar to what SVE does). Check: https://clang.llvm.org/docs/AttributeReference.html#riscv-rvv-vector-bits I was wondering also if you could use a segmented tuple load/store and then extract the different vectors? Check https://www.godbolt.org/z/5h1Wj7dEf as an example of a naive complex multiplication in case it is useful. Hope this helps. |
Unfortunately, that would require compiling each file with a different set of options. Not impossible, but certainly a burden to add to the build system... OTOH, my memory is failing me; while I did use an array or struct, for RISC-V V with the EPI compilers, it was a tuple type: https://github.com/rdolbeau/fftw3/blob/46763868ee386f7a94cb5863afa70519a8123d24/simd-support/simd-r5v-split.h#L73 . It should be possible to do using the current vfloat64m2_t tuple type as V, and not require any special support of feature... I need to investigate this. Only half as many registers are available, but for smaller transform the smaller overhead my be a benefit. ... or, as I just saw in the godbolt link you added, the tuple type vfloat32m1x2_t, which I didn't know existed. Plenty of options actually :-)
Segmented load was the goal, it did work in simulation back then (https://github.com/rdolbeau/fftw3/blob/46763868ee386f7a94cb5863afa70519a8123d24/simd-support/simd-r5v-split.h#L216). IIRC they were once an extension, do I read the specifications right and they are actually in the base V extension now? |
Yes, they are now part of base V. |
New release package from my branch here. Edit: updated from 003 to 004 |
Enable rvv 1.0 with --enable-rvv.
Correctness tests for FP32 and FP64 are done with bench --verify on qemu.