SVE Implementation for Level-1 BLAS Routines #4959
Updated KERNEL.ARMV8SVE for the Level-1 SVE kernels (swap, rot and scal).
Hiya, have you tested the impact on Graviton 3/4?
Thank you very much for this revised PR, I'm looking forward to the HPEC2024 proceedings becoming available.
kernel/arm64/rot_kernel_sve.c
static int rot_kernel_sve(BLASLONG n, FLOAT *x, FLOAT *y, FLOAT c, FLOAT s)
{
    for (int i = 0; i < n; i += SVE_WIDTH)
Can you make i a BLASLONG here, please, and adjust the casts in the SVE_WHILELT to uint64_t accordingly?
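For readers following along, here is a minimal sketch of what this request could look like. It is not the code from the PR: `rot_sketch` is a hypothetical name, the `BLASLONG` typedef is a local stand-in for the OpenBLAS type, and the macros `SVE_WIDTH` / `SVE_WHILELT` are replaced by the single-precision intrinsics they presumably expand to. The SVE path is only compiled when `__ARM_FEATURE_SVE` is defined; elsewhere a scalar loop computes the same result.

```c
#include <stdint.h>

#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

/* Assumption: local stand-in for OpenBLAS's BLASLONG (a 64-bit signed type on LP64). */
typedef long BLASLONG;

/* Hypothetical sketch of a plane rotation: x' = c*x + s*y, y' = c*y - s*x. */
static void rot_sketch(BLASLONG n, float *x, float *y, float c, float s)
{
#if defined(__ARM_FEATURE_SVE)
    /* i has the same width as n, so it cannot wrap before reaching n. */
    for (BLASLONG i = 0; i < n; i += (BLASLONG)svcntw()) {
        /* Explicit uint64_t casts select the 64-bit form of the predicate. */
        svbool_t pg = svwhilelt_b32_u64((uint64_t)i, (uint64_t)n);
        svfloat32_t xv = svld1_f32(pg, &x[i]);
        svfloat32_t yv = svld1_f32(pg, &y[i]);
        svfloat32_t xn = svadd_f32_x(pg, svmul_n_f32_x(pg, xv, c),
                                         svmul_n_f32_x(pg, yv, s));
        svfloat32_t yn = svsub_f32_x(pg, svmul_n_f32_x(pg, yv, c),
                                         svmul_n_f32_x(pg, xv, s));
        svst1_f32(pg, &x[i], xn);
        svst1_f32(pg, &y[i], yn);
    }
#else
    /* Scalar fallback so the sketch also builds on non-SVE hosts. */
    for (BLASLONG i = 0; i < n; i++) {
        float xi = x[i], yi = y[i];
        x[i] = c * xi + s * yi;
        y[i] = c * yi - s * xi;
    }
#endif
}
```

With an `int` index and a `BLASLONG` length, the loop counter could overflow for vectors longer than `INT_MAX` elements; keeping both at the same width avoids that.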
kernel/arm64/scal_kernel_sve.c
#define SVE_WIDTH svcntw()
#endif

static int scal_kernel_sve(int n, FLOAT *x, FLOAT da)
BLASLONG n?
kernel/arm64/scal_kernel_sve.c
static int scal_kernel_sve(int n, FLOAT *x, FLOAT da)
{
    for (int i = 0; i < n; i += SVE_WIDTH)
Make i a BLASLONG here too, please.
kernel/arm64/scal_kernel_sve.c
{
    for (int i = 0; i < n; i += SVE_WIDTH)
    {
        svbool_t pg = SVE_WHILELT(i, n);
Add uint64_t casts for i and n here, please.
kernel/arm64/scal_kernel_c.c
Please see kernel/arm/scal.c for how "dummy2" decides whether to propagate NaN and Inf values. There is probably a more elegant solution than what I put there; otherwise just copy that file.
kernel/arm64/scal_kernel_sve.c
{
    svbool_t pg = SVE_WHILELT(i, n);
    SVE_TYPE x_vec = svld1(pg, &x[i]);
    SVE_TYPE result = svmul_z(pg, x_vec, da);
I'm actually unsure here whether svmul_z will "do the right thing" for NaN or Inf arguments in x_vec or da.
Thank you, @martin-frbg, for the suggestions. We will make the required modifications.
Thank you for the review. We have implemented the changes as per your suggestions. The updated files, which include the modifications for the swap and rotate routines using SVE (Scalable Vector Extension), have been uploaded. Please verify and let us know.
Thank you very much. There is one SVE_WHILE in swap_kernel_sve.c that is missing the silly uint64_t casts for AppleClang, which is currently killing all the Mac CI jobs. (I know there is currently no Mac that does non-streaming SVE, but there may be a reason to support it in the future: the SVE kernels get pulled in through DYNAMIC_ARCH builds.)
I have updated swap_kernel_sve.c with the proper uint64_t casts.
Great, thank you. I assume it makes sense to merge this without the SCAL kernel you had originally planned (?), then we can assess its performance impact on other SVE targets (like the Neoverse-based CPUs @Mousius mentioned above) in more detail.
Yes.
We have optimized Level-1 BLAS routines (scal, swap, and rot) utilizing ARM SVE, resulting in significant performance enhancements in OpenBLAS on two variants of the A64FX—FUJITSU PRIMEHPC FX700 and the FUGAKU supercomputer. These optimizations achieved performance improvements ranging from 1.80x to 4x through effective code vectorization. This research has been accepted as a full paper and presented at the 28th Annual IEEE High Performance Extreme Computing (HPEC) Conference in September 2024, under the title "Optimization Strategies to Accelerate BLAS Operations with ARM SVE."