Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SVE Implementation for Level-1 BLAS Routines #4959

Merged
merged 22 commits into from
Dec 13, 2024

Conversation

CDAC-SSDG
Copy link
Contributor

We have optimized Level-1 BLAS routines (scal, swap, and rot) utilizing ARM SVE, resulting in significant performance enhancements in OpenBLAS on two variants of the A64FX—FUJITSU PRIMEHPC FX700 and the FUGAKU supercomputer. These optimizations achieved performance improvements ranging from 1.80x to 4x through effective code vectorization. This research has been accepted as a full paper and presented at the 28th Annual IEEE High Performance Extreme Computing (HPEC) Conference in September 2024, under the title "Optimization Strategies to Accelerate BLAS Operations with ARM SVE."

@Mousius
Copy link
Contributor

Mousius commented Oct 30, 2024

Hiya, have you tested the impact on Graviton 3/4?

@martin-frbg
Copy link
Collaborator

Thank you very much for this revised PR, I'm looking forward to the HPEC2024 proceedings becoming available.
The CI results so far suggest that
(1)Apple Clang is once again being silly about ambiguous SVE intrinsics, probably requiring a few type casts for the arguments like in #4140
and
(2) the new SCAL kernels may need to handle the dummy2 argument that has recently been (ab)used to signal whether to propagate INF and NAN (not wanted for internal uses of SCAL, but now expected when SCAL gets called from user code - this is probably the cause of the failures in openblas_utest and openblas_utest_ext)


static int rot_kernel_sve(BLASLONG n, FLOAT *x, FLOAT *y, FLOAT c, FLOAT s)
{
for (int i = 0; i < n; i += SVE_WIDTH)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you make i a BLASLONG here please, and adjust the casts in the SVE_WHILELT to uint64_t accordingly ?

#define SVE_WIDTH svcntw()
#endif

static int scal_kernel_sve(int n, FLOAT *x, FLOAT da)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BLASLONG n ?


static int scal_kernel_sve(int n, FLOAT *x, FLOAT da)
{
for (int i = 0; i < n; i += SVE_WIDTH)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make i a BLASLONG here too, please

{
for (int i = 0; i < n; i += SVE_WIDTH)
{
svbool_t pg = SVE_WHILELT(i, n);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add uint64_t casts for i and n here please

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please see kernel/arm/scal.c for letting "dummy2" decide whether to propagate NaN and Inf values - probably there is a more elegant solution than what I put there, otherwise just copy that file

{
svbool_t pg = SVE_WHILELT(i, n);
SVE_TYPE x_vec = svld1(pg, &x[i]);
SVE_TYPE result = svmul_z(pg, x_vec, da);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm actually unsure here if svmul_z will "do the right thing" concerning NaN or Inf arguments in x_vec or da

@garadeaniket
Copy link

Thank You @martin-frbg for suggestions. we will do the required modifications.

@CDAC-SSDG
Copy link
Contributor Author

Thank you for the review. We have reviewed and implemented the changes as per your suggestions. The updated files, which include the modifications for the swap and rotate routines using SVE (Scalable Vector Extension), have been uploaded. please verify and revert.

@martin-frbg
Copy link
Collaborator

Thank you very much. There is one SVE_WHILE in swap_kernel_sve.c that is missing the silly uint64_t casts for AppleClang which is currently killing all the Mac CI jobs (I know there is no Mac currently that does non-streaming SVE, but there may be a reason to support it in the future - the SVE kernels get pulled in through DYNAMIC_ARCH builds).
Could you please fix that ?

@SushilPratap04
Copy link
Contributor

Have updated the swap_kernel_sve.c with proper cast of uint64_t.
Thank You.

@martin-frbg
Copy link
Collaborator

Great, thank you. I assume it makes sense to merge this without the SCAL kernel you had originally planned (?), then we can assess its performance impact on other SVE targets (like the Neoverse-based cpus @Mousius mentioned above) in more detail

@SushilPratap04
Copy link
Contributor

Yes.

@martin-frbg martin-frbg added this to the 0.3.29 milestone Dec 13, 2024
@martin-frbg martin-frbg merged commit 229d8a0 into OpenMathLib:develop Dec 13, 2024
83 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants