vulkan: scale caching for k quants + misc fixes #11081
base: master
Conversation
RTX 4070 results. Keep in mind there's a lot of variability in the results, but at first glance it seems like an improvement for Q3_K but worse for the others:
For multiple values of n I'm seeing clear improvements with Q3_K and Q6_K, but Q2_K is much less consistent and is in some cases slower than master. PR:
Master:
I tried calculating the A * scale multiplication ahead of time for Q2_K, but it didn't do much. It should also reduce the number of shared memory reads, since the products are kept in registers. A * scale multiplication cached in registers:
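To make the idea concrete, here is a minimal toy compute shader sketching what "cache the A * scale products in registers" could look like. The buffer layouts, names and block sizes are invented for illustration; this is not the PR's actual Q2_K shader.

```glsl
#version 450
#extension GL_EXT_control_flow_attributes : require

// Toy sketch, not the llama.cpp shader: the A value is multiplied by its
// scale once, the product stays in a register, and it is reused for every
// B column instead of re-reading the scale per column.

layout(local_size_x = 32) in;

layout(std430, binding = 0) readonly  buffer QuantA  { uint  qa[];     };  // 8 x 4-bit quants per uint
layout(std430, binding = 1) readonly  buffer ScalesA { float scales[]; };  // one scale per packed uint
layout(std430, binding = 2) readonly  buffer MatB    { float b[];      };  // column-major, NUM_COLS columns
layout(std430, binding = 3) writeonly buffer Dst     { float dst[];    };

layout(push_constant) uniform PC { uint blocks_per_row; } pc;

const uint NUM_COLS = 8u;

void main() {
    const uint row = gl_GlobalInvocationID.x;   // toy layout: one A row per invocation
    const uint k   = pc.blocks_per_row * 8u;    // unpacked row length

    float sums[NUM_COLS];
    [[unroll]] for (uint j = 0u; j < NUM_COLS; ++j) { sums[j] = 0.0; }

    for (uint i = 0u; i < pc.blocks_per_row; ++i) {
        const uint  packed = qa[row * pc.blocks_per_row + i];
        const float scale  = scales[row * pc.blocks_per_row + i];  // read once per block

        [[unroll]] for (uint v = 0u; v < 8u; ++v) {
            // Pre-multiply the dequantized value by its scale once; the
            // product lives in a register...
            const float a_val = scale * float((packed >> (4u * v)) & 0xFu);
            // ...and is reused across all B columns without touching the scale again.
            [[unroll]] for (uint j = 0u; j < NUM_COLS; ++j) {
                sums[j] += a_val * b[j * k + i * 8u + v];
            }
        }
    }

    [[unroll]] for (uint j = 0u; j < NUM_COLS; ++j) {
        dst[j * (gl_NumWorkGroups.x * 32u) + row] = sums[j];
    }
}
```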
I'll post benchmarks at a later point, but this reduces performance on RTX 3090 for q2_k and q6_k. I see small improvements on Radeon Pro VII. Intel still crashes, but only in
IMO the crash is still very likely related to the barriers in nonuniform control flow. It really needs to be fixed if we're going to use shared memory here. If the additional branches are causing too many problems then maybe we could change how the work is spread across a workgroup so that the number of iterations is uniform, but that could also affect perf (likely making it worse, I'd guess).
To get rid of the branches we could just have the main i loop run with no checks as long as we have enough blocks remaining to use all threads, and then switch to a separate code path for the final multiplications. There's no need to redo the algorithm.
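A rough sketch of that loop structure, on a deliberately trivial workload (names and the toy "sum one row per workgroup" job are made up, not the PR's code): the main loop runs check-free while a full set of blocks remains, so the barriers around the shared-memory cache sit in uniform control flow, and a separate tail path handles the leftover blocks.

```glsl
#version 450
#extension GL_EXT_control_flow_attributes : require

layout(local_size_x = 64) in;

layout(std430, binding = 0) readonly  buffer Src { float a[];   };
layout(std430, binding = 1) writeonly buffer Dst { float dst[]; };

layout(push_constant) uniform PC { uint blocks_per_row; } pc;

shared float cache[64];

void main() {
    const uint tid  = gl_LocalInvocationID.x;
    const uint base = gl_WorkGroupID.x * pc.blocks_per_row;   // one row per workgroup

    float sum = 0.0;
    uint  i   = 0u;

    // Main path: one block per thread, no per-iteration bounds checks.
    while (i + 64u <= pc.blocks_per_row) {
        cache[tid] = a[base + i + tid];     // cooperative load into shared memory
        barrier();                          // reached by the whole workgroup
        [[unroll]] for (uint t = 0u; t < 64u; ++t) {
            sum += cache[t];
        }
        barrier();                          // done reading before the next overwrite
        i += 64u;
    }

    // Tail path: fewer blocks than threads remain. Every thread still reaches
    // the barriers; only out-of-range threads skip the load and the extra reads.
    if (i < pc.blocks_per_row) {
        if (i + tid < pc.blocks_per_row) {
            cache[tid] = a[base + i + tid];
        }
        barrier();
        for (uint t = 0u; t < pc.blocks_per_row - i; ++t) {
            sum += cache[t];
        }
        barrier();
    }

    if (tid == 0u) {
        dst[gl_WorkGroupID.x] = sum;
    }
}
```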
Okay I've fixed up Q6_K to handle the early return case, and it's now running at 23.3 t/s with a few extra tweaks. @0cc4m can you try this on Intel to see if it prevents the crash?
I tested the latest Q6_K changes on RTX 4070. For llama-bench with llama-2-7b.Q6_K, the perf is basically unchanged, which is not surprising since it's just memory bandwidth-limited. The directed perf results are more interesting:
So there's a nice boost for larger n, but it just falls off a cliff for n=8. I looked into this, and what's happening is the barriers are causing all the loads of the B matrix to be bunched together, and it's using too many registers. I tried moving all the B loads to the start of the function and saving them in local arrays, and that seems to resolve the issue:
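Something along these lines, shown here on a made-up dense mat-mul rather than the actual k-quant shader: the B values a block needs are loaded into a local array up front, and the multiply-accumulates run afterwards, instead of the loads being interleaved with the FMAs and piling up around barriers.

```glsl
#version 450
#extension GL_EXT_control_flow_attributes : require

layout(local_size_x = 32) in;

layout(std430, binding = 0) readonly  buffer MatA { float a[];   };
layout(std430, binding = 1) readonly  buffer MatB { float b[];   };  // column-major, NUM_COLS columns
layout(std430, binding = 2) writeonly buffer Dst  { float dst[]; };

layout(push_constant) uniform PC { uint k; } pc;   // shared dimension

const uint NUM_COLS   = 8u;   // B columns handled together
const uint BLOCK_SIZE = 4u;   // A values processed per iteration

void main() {
    const uint row = gl_GlobalInvocationID.x;   // toy layout: one A row per invocation

    float sums[NUM_COLS];
    [[unroll]] for (uint j = 0u; j < NUM_COLS; ++j) { sums[j] = 0.0; }

    for (uint i = 0u; i < pc.k; i += BLOCK_SIZE) {
        // B loads hoisted to the top: fetch everything this block needs first
        // and keep it in a local (register-resident) array.
        float b_vals[NUM_COLS][BLOCK_SIZE];
        [[unroll]] for (uint j = 0u; j < NUM_COLS; ++j) {
            [[unroll]] for (uint t = 0u; t < BLOCK_SIZE; ++t) {
                b_vals[j][t] = b[j * pc.k + i + t];
            }
        }

        // The FMAs then run against the already-loaded values.
        [[unroll]] for (uint t = 0u; t < BLOCK_SIZE; ++t) {
            const float a_val = a[row * pc.k + i + t];
            [[unroll]] for (uint j = 0u; j < NUM_COLS; ++j) {
                sums[j] += a_val * b_vals[j][t];
            }
        }
    }

    [[unroll]] for (uint j = 0u; j < NUM_COLS; ++j) {
        dst[j * (gl_NumWorkGroups.x * 32u) + row] = sums[j];
    }
}
```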
Hmm, this looks like an Nvidia-only issue, and I didn't see it on my end when I was testing my changes. AMD's tools report that 82/256 vector registers are used in the 64 block size, 4 rows, 8 columns case.
I checked the assembly and at least for me the compiler is interleaving the B loads and the sum FMAs rather than doing them all at once. Also if I do a quick estimation:
That's 72 vector registers, and I guess we can go up to 100-ish once we include the indexes and so forth. If we assume the compiler is loading all the B columns together, that's 16*8 = 128 registers, which would bring the total over 200. However, the compiler doesn't need to load all the B values into registers in one go here, and it should be smart enough not to spill. BTW I can definitely make this change to fix the n=8 performance, and I'll do these tweaks in one go once I get confirmation that Intel is working. It's just weird that the compiler is running out of registers in this case, which hurts performance more than smaller loads would.
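For reference, the back-of-the-envelope register count written out, using only the numbers quoted above:

$$
\underbrace{\sim 100}_{72\ \text{for the sums, plus indexing}} \;+\; \underbrace{16 \times 8 = 128}_{\text{all B columns held at once}} \;\approx\; 228 \ \text{vector registers}
$$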
This reverts commit 65110b8.
…n use, plus some more optimizations
I tested it on my A770: Q6_K now passes and it crashes on a later Q2_K test, so it was correct.
New numbers after fixing the early returns and making some more changes:
@@ -45,55 +43,57 @@ void compute_outputs(const uint32_t first_row, const uint32_t num_rows) {
     [[unroll]] for (uint n = 0; n < num_rows; ++n) {
         const uint ib0 = a_offset / QUANT_K + (first_row+n)*num_blocks_per_row;
-        f16vec2 d = data_a[ib0 + i].d;
+        const f16vec2 d = data_a[ib0 + i].d;
It would be nice to apply the calc_superblock changes to q4 and q5 as well, to keep all the k-quants structured the same way.
We can make inference run a bit faster by extracting the scales in parallel and saving them to shared memory, where they'll be used by all the threads working on the superblock. This came out of the experiments in #10999.
This was not done for Q4_K and Q5_K, as their scales are packed in a more complicated way that makes this method slower for them.
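For readers unfamiliar with these shaders, here is a minimal sketch of the scale-caching pattern on an invented "superblock" format (16 groups of 16 4-bit values with one packed 8-bit scale per group). This is not the real k-quant layout or the PR's shader; it only shows the idea of extracting the scales cooperatively into shared memory and having every thread read them from there.

```glsl
#version 450
#extension GL_EXT_control_flow_attributes : require

layout(local_size_x = 16) in;

layout(std430, binding = 0) readonly  buffer Quants { uint  q[];        };  // 8 x 4-bit values per uint
layout(std430, binding = 1) readonly  buffer Scales { uint  packed_s[]; };  // 4 x 8-bit scales per uint
layout(std430, binding = 2) readonly  buffer VecB   { float b[];        };
layout(std430, binding = 3) writeonly buffer Dst    { float dst[];      };

layout(push_constant) uniform PC { uint superblocks_per_row; float d; } pc;

shared float s_scales[16];
shared float s_partial[16];

void main() {
    const uint tid = gl_LocalInvocationID.x;
    const uint row = gl_WorkGroupID.x;          // one matrix row per workgroup
    float sum = 0.0;

    for (uint sb = 0u; sb < pc.superblocks_per_row; ++sb) {
        const uint sb_idx = row * pc.superblocks_per_row + sb;

        // Parallel scale extraction: thread `tid` unpacks the scale of group `tid`.
        const uint word = packed_s[sb_idx * 4u + (tid >> 2u)];
        s_scales[tid] = pc.d * float((word >> (8u * (tid & 3u))) & 0xFFu);
        barrier();

        // Every thread then uses all 16 cached scales: thread `tid` handles
        // value position `tid` of each group in the superblock.
        [[unroll]] for (uint g = 0u; g < 16u; ++g) {
            const uint  packed = q[sb_idx * 32u + g * 2u + (tid >> 3u)];
            const float a_val  = s_scales[g] * float((packed >> (4u * (tid & 7u))) & 0xFu);
            sum += a_val * b[sb * 256u + g * 16u + tid];
        }
        barrier();   // everyone done reading before the cache is overwritten
    }

    // Trivial reduction for the toy: thread 0 adds up the per-thread sums.
    s_partial[tid] = sum;
    barrier();
    if (tid == 0u) {
        float total = 0.0;
        [[unroll]] for (uint t = 0u; t < 16u; ++t) { total += s_partial[t]; }
        dst[row] = total;
    }
}
```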
PR:
Master: