SVE Implementation for Level-1 BLAS Routines #4959

CDAC-SSDG · 2024-10-30T08:56:23Z

We have optimized Level-1 BLAS routines (scal, swap, and rot) utilizing ARM SVE, resulting in significant performance enhancements in OpenBLAS on two variants of the A64FX—FUJITSU PRIMEHPC FX700 and the FUGAKU supercomputer. These optimizations achieved performance improvements ranging from 1.80x to 4x through effective code vectorization. This research has been accepted as a full paper and presented at the 28th Annual IEEE High Performance Extreme Computing (HPEC) Conference in September 2024, under the title "Optimization Strategies to Accelerate BLAS Operations with ARM SVE."

Updated KERNEL.ARMV8SVE for the Level-1 SVE kernels (swap, rot, and scal).

Mousius · 2024-10-30T10:51:27Z

Hiya, have you tested the impact on Graviton 3/4?

martin-frbg · 2024-10-30T10:52:48Z

Thank you very much for this revised PR, I'm looking forward to the HPEC2024 proceedings becoming available.
The CI results so far suggest that
(1) Apple Clang is once again being silly about ambiguous SVE intrinsics, probably requiring a few type casts for the arguments, as in #4140, and
(2) the new SCAL kernels may need to handle the dummy2 argument, which has recently been (ab)used to signal whether to propagate INF and NAN (not wanted for internal uses of SCAL, but now expected when SCAL gets called from user code). This is probably the cause of the failures in openblas_utest and openblas_utest_ext.

martin-frbg · 2024-10-31T14:59:22Z

kernel/arm64/rot_kernel_sve.c

+
+static int rot_kernel_sve(BLASLONG n, FLOAT *x, FLOAT *y, FLOAT c, FLOAT s)
+{
+       for (int i = 0; i < n; i += SVE_WIDTH)


Can you make i a BLASLONG here, please, and adjust the casts in the SVE_WHILELT to uint64_t accordingly?
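For illustration, the requested change could look roughly like the sketch below. It is not the code from this PR: `rot_sketch` and `blaslong_t` are hypothetical names (the latter stands in for OpenBLAS's BLASLONG), the ACLE intrinsics are written out instead of going through the SVE_WHILELT and SVE_TYPE macros, and a scalar fallback is added as an assumption so the snippet also builds on machines without SVE.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

typedef long blaslong_t; /* hypothetical stand-in for OpenBLAS's BLASLONG */

/* Plane rotation on n floats: x' = c*x + s*y, y' = c*y - s*x. */
void rot_sketch(blaslong_t n, float *x, float *y, float c, float s)
{
    assert(n >= 0); /* the sketch assumes a non-negative length */
#if defined(__ARM_FEATURE_SVE)
    /* 64-bit induction variable, so i cannot overflow for n > INT_MAX. */
    for (blaslong_t i = 0; i < n; i += (blaslong_t)svcntw()) {
        /* Explicit uint64_t casts select one overload of the
           type-generic svwhilelt_b32 without ambiguity. */
        svbool_t pg = svwhilelt_b32((uint64_t)i, (uint64_t)n);
        svfloat32_t xv = svld1_f32(pg, &x[i]);
        svfloat32_t yv = svld1_f32(pg, &y[i]);
        /* c*x + y*s and c*y - x*s on the active lanes. */
        svfloat32_t xn = svmla_n_f32_x(pg, svmul_n_f32_x(pg, xv, c), yv, s);
        svfloat32_t yn = svmls_n_f32_x(pg, svmul_n_f32_x(pg, yv, c), xv, s);
        svst1_f32(pg, &x[i], xn);
        svst1_f32(pg, &y[i], yn);
    }
#else
    /* Portable fallback with the same 64-bit index type. */
    for (blaslong_t i = 0; i < n; i++) {
        float xi = x[i], yi = y[i];
        x[i] = c * xi + s * yi;
        y[i] = c * yi - s * xi;
    }
#endif
}
```

The predicate produced by the while-less-than intrinsic masks off the tail, so no separate remainder loop is needed when n is not a multiple of the vector length.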

martin-frbg · 2024-10-31T15:01:11Z

kernel/arm64/scal_kernel_sve.c

+#define SVE_WIDTH svcntw()
+#endif
+
+static int scal_kernel_sve(int n, FLOAT *x, FLOAT da)


BLASLONG n?

martin-frbg · 2024-10-31T15:01:43Z

kernel/arm64/scal_kernel_sve.c

+
+static int scal_kernel_sve(int n, FLOAT *x, FLOAT da)
+{
+  for (int i = 0; i < n; i += SVE_WIDTH)


Make i a BLASLONG here too, please.

martin-frbg · 2024-10-31T15:05:37Z

kernel/arm64/scal_kernel_sve.c

+{
+  for (int i = 0; i < n; i += SVE_WIDTH)
+  {
+    svbool_t pg = SVE_WHILELT(i, n);


Add uint64_t casts for i and n here, please.

martin-frbg · 2024-10-31T15:09:13Z

kernel/arm64/scal_kernel_c.c

Please see kernel/arm/scal.c for how to let "dummy2" decide whether to propagate NaN and Inf values. There is probably a more elegant solution than what I put there; otherwise just copy that file.
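As a rough picture of what such a flag can control, here is a minimal sketch in plain C. It is not a copy of kernel/arm/scal.c: `scal_sketch`, `blaslong_t`, and `propagate_nonfinite` are hypothetical names, and the sketch assumes that a nonzero flag means "propagate NaN and Inf" while zero means "a zero scale factor always yields zero".

```c
#include <assert.h>
#include <math.h>

typedef long blaslong_t; /* hypothetical stand-in for OpenBLAS's BLASLONG */

/* Scale x[0..n) by da. propagate_nonfinite plays the role that the
   thread assigns to dummy2 (the meaning of the values is assumed). */
void scal_sketch(blaslong_t n, float da, float *x, int propagate_nonfinite)
{
    assert(n >= 0); /* the sketch assumes a non-negative length */
    for (blaslong_t i = 0; i < n; i++) {
        if (da == 0.0f && !propagate_nonfinite) {
            /* Internal use: force zero, even where x[i] is NaN or Inf. */
            x[i] = 0.0f;
        } else {
            /* Plain IEEE 754 multiply: 0*Inf and 0*NaN both give NaN. */
            x[i] = da * x[i];
        }
    }
}
```

With da equal to zero, the two modes only differ on elements that are NaN or infinite; for finite inputs both branches store zero.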

martin-frbg · 2024-10-31T15:18:06Z

kernel/arm64/scal_kernel_sve.c

+  {
+    svbool_t pg = SVE_WHILELT(i, n);
+    SVE_TYPE x_vec = svld1(pg, &x[i]);
+    SVE_TYPE result = svmul_z(pg, x_vec, da);


I'm actually unsure here whether svmul_z will "do the right thing" for NaN or Inf arguments in x_vec or da.
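One way to reason about this question is a scalar model of a zeroing predicated multiply, sketched below. It rests on two assumptions that should be checked against the Arm documentation: that the `_z` form sets inactive lanes to zero, and that active lanes perform an ordinary IEEE 754 multiply under default floating-point control settings. `mul_z_model` and its parameters are hypothetical names; this is a model for discussion, not a measurement of real hardware.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Scalar model of a zeroing predicated multiply over `lanes` elements.
   Active lanes receive x[l] * da, inactive lanes receive 0.
   Returns how many outputs are NaN or infinite. */
size_t mul_z_model(size_t lanes, const int *active,
                   const float *x, float da, float *out)
{
    assert(active && x && out);
    size_t nonfinite = 0;
    for (size_t l = 0; l < lanes; l++) {
        out[l] = active[l] ? x[l] * da : 0.0f;
        if (!isfinite(out[l]))
            nonfinite++;
    }
    return nonfinite;
}
```

Under this model, a NaN or Inf in an active lane survives a multiplication by zero as NaN, which matches the propagating behaviour but not the internal use where a zero scale factor is expected to produce zero. Zeroed inactive lanes never reach memory when the store uses the same predicate.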

garadeaniket · 2024-11-04T03:58:24Z

Thank you, @martin-frbg, for the suggestions. We will make the required modifications.

CDAC-SSDG · 2024-12-13T05:56:43Z

Thank you for the review. We have implemented the changes as per your suggestions. The updated files, which include the modifications to the swap and rotate routines using SVE (Scalable Vector Extension), have been uploaded. Please verify and let us know.

martin-frbg · 2024-12-13T11:06:22Z

Thank you very much. There is one SVE_WHILE in swap_kernel_sve.c that is missing the silly uint64_t casts for AppleClang, and this is currently killing all the Mac CI jobs (I know there is currently no Mac that does non-streaming SVE, but there may be a reason to support it in the future - the SVE kernels get pulled in through DYNAMIC_ARCH builds).
Could you please fix that?

SushilPratap04 · 2024-12-13T11:21:24Z

I have updated swap_kernel_sve.c with the proper uint64_t casts.
Thank you.

martin-frbg · 2024-12-13T11:34:52Z

Great, thank you. I assume it makes sense to merge this without the SCAL kernel you had originally planned (?); then we can assess its performance impact on other SVE targets (like the Neoverse-based CPUs @Mousius mentioned above) in more detail.

SushilPratap04 · 2024-12-13T11:36:50Z

Yes.

CDAC-SSDG and others added 9 commits October 30, 2024 13:57

Update CONTRIBUTORS.md

2718b37

Added optimized scal routine files

0667cf6

Added sve optimized kernels for swap routine

b8bc2a7

Added sve kernels for rot routine.

7822ae9

Update KERNEL.ARMV8SVE

fa880ab

updated KERNEL.ARMV8SVE for level 1 sve (swap, rot and scal) kernels.

Delete kernel/arm64/rot.c

668e28a

Delete kernel/arm64/rot_kernel_c.c

d90ee00

Delete kernel/arm64/rot_kernel_sve.c

012fe4d

Add files via upload

3b2421c

martin-frbg reviewed Oct 31, 2024

CDAC-SSDG added 12 commits December 13, 2024 10:58

Delete kernel/arm64/rot.c

b9f51a5

Delete kernel/arm64/rot_kernel_c.c

10857c9

Delete kernel/arm64/rot_kernel_sve.c

f62519c

Delete kernel/arm64/scal.c

5540f21

Delete kernel/arm64/scal_kernel_c.c

95a9701

Delete kernel/arm64/scal_kernel_sve.c

3b7b746

Delete kernel/arm64/swap.c

f6416c0

Delete kernel/arm64/swap_kernel_c.c

c17c19f

Delete kernel/arm64/swap_kernel_sve.c

7658501

Update CONTRIBUTORS.md

41912f9

Update KERNEL.ARMV8SVE

06ffd41

Added Updated swap and rot sve kernels.

dd71e42

LaxmikantBotkewar approved these changes Dec 13, 2024

Update swap_kernel_sve.c

3368a4e

martin-frbg added this to the 0.3.29 milestone Dec 13, 2024

martin-frbg merged commit 229d8a0 into OpenMathLib:develop Dec 13, 2024
83 checks passed
