LoongArch64: Update symv #5061

Merged (2 commits) on Jan 10, 2025

@XiWeiGu
Contributor

XiWeiGu commented Jan 10, 2025

Improve the performance of the {s/d}symv interface with LASX optimization when INCX=1 and INCY=1.
[Image: symv]
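For readers unfamiliar with the routine, here is a minimal reference sketch of what {s/d}symv computes in the unit-stride case (INCX=1, INCY=1) with the upper triangle stored column-major. It is a plain-Python illustration of the standard BLAS semantics, not the LASX kernel from this PR; the function name `symv_upper` is made up for this example.

```python
def symv_upper(n, alpha, a, lda, x, beta, y):
    """Return alpha*A*x + beta*y for a symmetric n x n matrix A.

    A is stored column-major with leading dimension lda, and only its
    upper triangle (i <= j) is read. x and y use unit stride.
    """
    # BLAS convention: when beta is zero, y is overwritten, not scaled.
    out = [0.0] * n if beta == 0 else [beta * y[i] for i in range(n)]
    for j in range(n):
        temp1 = alpha * x[j]
        temp2 = 0.0
        for i in range(j):
            aij = a[i + j * lda]      # element A[i][j] above the diagonal
            out[i] += temp1 * aij     # contribution of column j to row i
            temp2 += aij * x[i]       # mirrored contribution A[j][i] * x[i]
        out[j] += temp1 * a[j + j * lda] + alpha * temp2
    return out


# A = [[2, 1], [1, 3]]; the lower-triangle slot holds junk and is never read.
a = [2.0, 99.0, 1.0, 3.0]
result = symv_upper(2, 1.0, a, 2, [1.0, 2.0], 0.0, [0.0, 0.0])
print(result)
```

A naive loop like this is handy as a known-good baseline when checking an optimized kernel on small inputs.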

@XiWeiGu changed the title from "La64 update symv" to "LoongArch64: Update symv" on Jan 10, 2025
@martin-frbg added this to the 0.3.29 milestone on Jan 10, 2025
@martin-frbg
Collaborator

The SSYMV behaviour looks a bit curious here, both before and after your change. (My switchover point for multithreading in interface/symv.c, at 200x200, is probably not optimal either.)

@martin-frbg merged commit c31f148 into OpenMathLib:develop on Jan 10, 2025
80 of 84 checks passed
@azuresky01

Hi, @XiWeiGu

With the just-released OpenBLAS 0.3.29, the gmake build fails at the cblas_ssymv test (see below).

System: Debian sid loong64, Kernel 6.12.9-loong64 (and also on AOSC OS (12.0.3) loongarch64 with kernel 6.11.10)
Host: Loongson-3A6000-7A2000-1w-V0.1-EVB
compiler: gcc/gfortran 14.2.0
compiling command: "CC=gcc FC=gfortran make"

...
  cblas_sgbmv  PASSED THE ROW-MAJOR    COMPUTATIONAL TESTS ( 17284 CALLS)

 cblas_ssymv  PASSED THE TESTS OF ERROR-EXITS

 ******* FATAL ERROR - COMPUTED RESULT IS LESS THAN HALF ACCURATE *******
           EXPECTED RESULT   COMPUTED RESULT
       1      0.667230          0.667230    
       2      -1.16658          -1.16658    
       3      0.210144          0.210145    
       4      0.543602          0.543602    
       5     -0.675359         -0.675359    
       6       1.33709           1.33709    
       7      0.668852          0.668852    
       8       1.70144           1.70144    
       9     -0.375508         -0.375508    
      10       1.21933           1.21932    
      11      -1.99305          -1.99305    
      12      0.306078          0.306078    
      13       1.00700           1.02162    
      14       1.29749           1.29749    
      15      0.503632          0.503632    
      16      0.624914          0.639533    
      17      -1.05779          -1.05779    
      18       3.71865           3.71865    
      19      -1.45335          -1.45335    
      20       2.15959           2.15959    
      21      -1.31750          -1.31750    
      22     -0.331946         -0.331946    
      23     -0.397639         -0.397639    
      24     -0.898512         -0.898512    
      25     -0.143124         -0.143124    
      26      0.763038          0.763038    
      27     -0.184042         -0.184042    
      28      -1.33639          -1.33639    
      29      -1.81842          -1.81842    
      30     -0.639056         -0.639056    
      31      -1.73327          -1.73327    
      32     -0.603902         -0.603902    
      33     -0.602854         -0.602854    
      34     -0.638333         -0.638333    
      35     -0.985314         -0.985313    
      36     -0.386861         -0.386861    
      37      -2.50400          -2.50400    
      38     -0.100883         -0.100883    
      39     -0.482794         -0.482794    
      40      0.279730          0.279730    
      41     -0.869296         -0.869295    
      42      0.548803E-01      0.548803E-01
      43     -0.304101         -0.304101    
      44      -1.02007          -1.02007    
      45       1.02721           1.02721    
      46     -0.275883         -0.275883    
      47      0.605179          0.605179    
      48     -0.218911E-01     -0.218913E-01
      49      0.751897E-01      0.751896E-01
      50     -0.489880         -0.489880    
      51     -0.759315         -0.759316    
      52      0.139411          0.139411    
      53     -0.798163         -0.798163    
      54     -0.303303         -0.303303    
      55     -0.172739         -0.172739    
      56       1.07911           1.07911    
      57      0.393114E-01      0.393114E-01
      58       1.35710           1.35710    
      59     -0.688482         -0.688482    
      60      0.847893          0.847893    
      61      0.881102          0.881102    
      62       5.50897           5.50897    
      63     -0.112485         -0.112485    
 ******* cblas_ssymv  FAILED ON CALL NUMBER:
   1445: cblas_ssymv (    CblasUpper, 63, 1.0, A, 64, X, 1, 0.0, Y, 1) .
 ******* cblas_ssymv  FAILED ON CALL NUMBER:
      2: cblas_ssymv (    CblasUpper,  1, 0.0, A,  2, X, 1, 0.0, Y, 1) .

 ******* FATAL ERROR - TESTS ABANDONED *******
ERROR STOP 

Error termination. Backtrace:
#0  0x7ffff1e585c7 in ???
#1  0x7ffff1e5999b in ???
#2  0x7ffff1e5b8e3 in ???
#3  0x5555580c15c3 in ???
#4  0x5555580b6d6b in ???
#5  0x7ffff1bf0f8f in __libc_start_call_main
	at ../sysdeps/nptl/libc_start_call_main.h:58
#6  0x7ffff1bf108f in __libc_start_main_impl
	at ../csu/libc-start.c:360
#7  0x5555580b6e2f in ???
#8  0xffffffffffffffff in ???
make[1]: *** [Makefile:145:all2] Error 1
...
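To see which rows of the log above are actually off, a small script can compare the two columns. This is only a sketch: the 1e-4 relative tolerance is an arbitrary illustrative value, not the criterion used by the test harness, and only a handful of rows from the log are copied in.

```python
# (row, expected, computed) triples copied from the log above.
rows = [
    (3, 0.210144, 0.210145),
    (10, 1.21933, 1.21932),
    (13, 1.00700, 1.02162),
    (16, 0.624914, 0.639533),
    (62, 5.50897, 5.50897),
]

REL_TOL = 1e-4  # illustrative threshold, NOT the harness's own check


def flag_rows(triples, rel_tol=REL_TOL):
    """Return row numbers whose relative error exceeds rel_tol."""
    flagged = []
    for row, expected, computed in triples:
        rel_err = abs(computed - expected) / abs(expected)
        if rel_err > rel_tol:
            flagged.append(row)
    return flagged


flagged = flag_rows(rows)
print(flagged)
```

With these inputs only rows 13 and 16 are flagged; the last-digit differences in rows such as 3 and 10 are ordinary single-precision rounding. The absolute errors of rows 13 and 16 are nearly identical (about 0.0146).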

@martin-frbg
Collaborator

I cannot reproduce this problem on the only Loongarch64 hardware I have access to, a 3C5000L-LL in the GCC Compile Farm running Debian Trixie (and gcc 14.2 built from source), binutils 2.43 from Debian, LA464 target was autodetected. (Not sure if that helps, but at least this is using the new kernels from this PR)

@XiWeiGu
Contributor Author

XiWeiGu commented Jan 13, 2025

The SSYMV behaviour looks a bit curious here, both before and after your change. (Probably my switchover point for multithreading in interface/symv.c - at 200x200 - is not optimal too)

Thank you for your reply. I will test to see if it is related to the threshold for enabling multithreading

@XiWeiGu
Contributor Author

XiWeiGu commented Jan 13, 2025

It seems to be a precision issue.

My testing environment is as follows, and all tests can pass.

OS: Deepin beige 23 loongarch64
Host: Loongson-3A5000-7A1000-1w-ML5A
Kernel: Linux 6.9.6-loong64-desktop-rolling
GCC: gcc version 12.3.0 (Deepin 12.3.0-17deepin8)

I will later verify it on an AOSC OS (12.0.3) system

@azuresky01

I got the same results on AOSC OS (12.0.3) with gcc/gfortran 14, but let us see whether you can reproduce it.

In my tests, both cblas_ssymv and cblas_dsymv seem to have the precision issue (to verify, enter the ctest subdirectory and run "./xscblat2 < sin2" or "./xdcblat2 < din2"). On the other hand, a cmake build completes without any problem.

@martin-frbg
Collaborator

cmake uses O3 by design in Release mode, and no optimization option if no mode was specified (IIRC). Maybe this is enough to cause the difference in the loongson backend; there were a few cases in the past where I had to disable optimization with a pragma in one of the ctest sources to get rid of spurious test failures.
Is it only ctest where you see the error, or test/sblat3 too?

@azuresky01

cmake uses O3 by design in Release mode, and no optimization option if no mode was specified (IIRC). Maybe this is enough to cause the difference in the loongson backend, there were a few cases in the past where I had to disable optimization by a pragma in one of the ctest sources to get rid of spurious test failures. Is it only ctest where you see the error, or test/sblat3 too ?

Only ctest. test/sblat3 is OK, though it reports floating-point exceptions as follows:

Note: The following floating-point exceptions are signalling: IEEE_INVALID_FLAG IEEE_UNDERFLOW_FLAG

@XiWeiGu
Contributor Author

XiWeiGu commented Jan 13, 2025

The SSYMV behaviour looks a bit curious here, both before and after your change. (Probably my switchover point for multithreading in interface/symv.c - at 200x200 - is not optimal too)

Thank you for your reply. I will test to see if it is related to the threshold for enabling multithreading

I conducted some tests, and it should not be related to the matrix size threshold you set for enabling multithreading. I tried adjusting the matrix size for multithreading (300x300, 400x400), and at these sizes, there is a sharp performance drop compared to single-threading. However, the performance gradually recovers as the matrix size increases. This issue only occurs on LoongArch64, and I did not observe similar behavior on x86-64.
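The kind of size-based switchover being discussed can be sketched as follows. This is a hypothetical illustration of the idea only: the function name, its parameters, and the exact comparison are assumptions and do not reproduce the code in interface/symv.c.

```python
def choose_threads(n, available_threads, threshold=200):
    """Hypothetical switchover: stay single-threaded below the threshold.

    n is the matrix order; threshold=200 mirrors the 200x200 figure
    mentioned in the thread, but the real condition may differ.
    """
    if n < threshold or available_threads < 2:
        return 1
    return available_threads


# Raising the threshold, as in the 300x300 and 400x400 experiments,
# simply moves the point at which the multithreaded path kicks in.
for order in (63, 300, 400):
    print(order, choose_threads(order, 8), choose_threads(order, 8, threshold=400))
```

A threshold like this only picks where the multithreaded path starts; it cannot by itself explain a performance dip that persists just above whichever threshold is chosen, which matches the observation above.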

@XiWeiGu
Contributor Author

XiWeiGu commented Jan 13, 2025

I encountered the same issue in the following environment:

OS: AOSC OS 11.6.1 loongarch64
Host: Loongson-3A6000-HV-7A2000-XA61200
Kernel: Linux 6.11.8-aosc-main
GCC: gcc version 14.2.0 20240801 (AOSC OS, Core) (GCC)
Command: CC=gcc FC=gfortran make

Error message:

 cblas_ssymv  PASSED THE TESTS OF ERROR-EXITS

 ******* FATAL ERROR - COMPUTED RESULT IS LESS THAN HALF ACCURATE *******
           EXPECTED RESULT   COMPUTED RESULT
       1      0.667230          0.667230
       2      -1.16658          -1.16658
       3      0.210144          0.210145
       4      0.543602          0.543602
       5     -0.675359         -0.675359
       6       1.33709           1.33709
       7      0.668852          0.668852
       8       1.70144           1.70144
       9     -0.375508         -0.375508
      10       1.21933           1.21932
      11      -1.99305          -1.99305
      12      0.306078          0.306078
      13       1.00700           1.00700
      14       1.29749           1.29749
      15      0.503632          0.430227
      16      0.624914          0.576943
      17      -1.05779          -1.05779
      18       3.71865           3.71865
      19      -1.45335          -1.45335
      20       2.15959           2.15959
      21      -1.31750          -1.07568
      22     -0.331946         -0.331946
      23     -0.397639         -0.397639
      24     -0.898512         -0.886479
      25     -0.143124         -0.143124
      26      0.763038          0.763038
      27     -0.184042         -0.184042
      28      -1.33639          -1.33639
      29      -1.81842          -1.81842
      30     -0.639056         -0.639056
      31      -1.73327          -1.73327
      32     -0.603902         -0.603901
      33     -0.602854         -0.602854
      34     -0.638333         -0.638333
      35     -0.985314         -0.985313
      36     -0.386861         -0.386861
      37      -2.50400          -2.50400
      38     -0.100883         -0.100883
      39     -0.482794         -0.482794
      40      0.279730          0.279730
      41     -0.869296         -0.869295
      42      0.548803E-01      0.548803E-01
      43     -0.304101         -0.304101
      44      -1.02007          -1.02007
      45       1.02721           1.02721
      46     -0.275883         -0.275883
      47      0.605179          0.605179
      48     -0.218911E-01     -0.218913E-01
      49      0.751897E-01      0.751896E-01
      50     -0.489880         -0.489880
      51     -0.759315         -0.759316
      52      0.139411          0.139411
      53     -0.798163         -0.798163
      54     -0.303303         -0.303303
      55     -0.172739         -0.172739
      56       1.07911           1.07911
      57      0.393114E-01      0.393114E-01
      58       1.35710           1.35710
      59     -0.688482         -0.688482
      60      0.847893          0.847893
      61      0.881102          0.881102
      62       5.50897           5.50897
      63     -0.112485         -0.112485
 ******* cblas_ssymv  FAILED ON CALL NUMBER:
   1445: cblas_ssymv (    CblasUpper, 63, 1.0, A, 64, X, 1, 0.0, Y, 1) .
 ******* cblas_ssymv  FAILED ON CALL NUMBER:
      2: cblas_ssymv (    CblasUpper,  1, 0.0, A,  2, X, 1, 0.0, Y, 1) .

 ******* FATAL ERROR - TESTS ABANDONED *******

I currently have no idea why it is causing the failure; further analysis is necessary.

@azuresky01

cmake uses O3 by design in Release mode, and no optimization option if no mode was specified (IIRC). Maybe this is enough to cause the difference in the loongson backend, there were a few cases in the past where I had to disable optimization by a pragma in one of the ctest sources to get rid of spurious test failures. Is it only ctest where you see the error, or test/sblat3 too ?

I asked the maintainers of AOSC OS for help with this issue since they have more Loongson machines. They confirmed the compilation error and say it happens only on the 6000 series (la664) architecture. They are now investigating it as well:

https://bbs.aosc.io/t/topic/302/3

@martin-frbg
Collaborator

I wonder if the error goes away if we apply #4667 unconditionally

@azuresky01

I wonder if the error goes away if we apply #4667 unconditionally

No, removing those three lines does not solve the problem.

@martin-frbg
Collaborator

Uh, what exactly did you remove when you write "those three lines"? Just to be sure, you should have ended up with
override FFLAGS = $(filter_out(-O2 -O3,$(FFLAGS)) -O0
right before the line that does another override to add the no-tree-vectorize.

@azuresky01

Uh, what exactly did you remove if you write of "those three lines" ? Just to be sure, you should have ended up with override FFLAGS = $(filter_out(-O2 -O3,$(FFLAGS)) -O0 right before the line that does another override to add the no-tree-vectorize.

Sorry, I removed the line override FFLAGS = $(filter_out(-O2 -O3,$(FFLAGS)) -O0 in my previous test. Let me retry with this line in place in about two hours. (I am using the computer for other things at the moment.)

@MingcongBai

@azuresky01 Can you please try #5070?

@azuresky01

Uh, what exactly did you remove if you write of "those three lines" ? Just to be sure, you should have ended up with override FFLAGS = $(filter_out(-O2 -O3,$(FFLAGS)) -O0 right before the line that does another override to add the no-tree-vectorize.

I just finished a new test. With the two consecutive override lines in place, the compilation error still appears.

@azuresky01

@azuresky01 Can you please try #5070?

Yes. I confirm that #5070 does fix the problem. The compilation error disappears.
