ggml-cuda : add TQ2_0 kernels, for ternary inference on GPU #11183
base: master

Conversation
This also removes the custom `TQ2_0` mmq dp4a kernel, because re-using the one from `Q8_0` avoids repeatedly unpacking the 2-bit values to 8-bit; the unpacking is instead done only once per tile.
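As a rough illustration (not the PR's actual code), this is the kind of 2-bit-to-8-bit expansion that now only needs to happen once per tile before the `Q8_0` dp4a path takes over. The helper name is made up, and the packing is assumed to follow the `TQ2_0` layout from #8151 (each value stored as q+1 in a 2-bit field):

```c
#include <stdint.h>

// Hypothetical helper: expand one packed TQ2_0 byte (four 2-bit fields)
// into four signed 8-bit values in {-1, 0, +1}.
// Assumes the stored 2-bit value is q+1, so {0,1,2} maps back to {-1,0,+1}.
static inline void unpack_tq2_0_byte(uint8_t packed, int8_t out[4]) {
    for (int l = 0; l < 4; ++l) {
        out[l] = (int8_t) ((packed >> (2*l)) & 3) - 1;
    }
}
```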
    // GGML_TYPE_TQ1_0, GGML_TYPE_TQ2_0, // TODO: implement for all backends
    // GGML_TYPE_TQ1_0,
    GGML_TYPE_TQ2_0,
An unintended side effect of un-commenting `TQ2_0` here is that the Metal tests fail, as in https://github.com/ggerganov/llama.cpp/actions/runs/12716518343/job/35451025034?pr=11183#step:5:13921, because operations on that type are not yet implemented there and the `ggml_metal_supports_op` function isn't representative of the types actually supported by the Metal backend.
Some solutions are:
- Implement all relevant `TQ2_0` support for Metal
  - Will happen eventually; a starting point already floats around somewhere in a branch linked in ggml-quants : ternary packing for TriLMs and BitNet b1.58 #8151 (comment).
- Make `ggml_metal_supports_op` correctly return `false` when it should
  - Should be done for correctness.
  - An "easy" way to temporarily do this would be similar to what was done for BF16: simply return `false` when a `TQ2_0` tensor is encountered (see the sketch after this list). The same should be done for the other not-yet-supported types like `TQ1_0`.
- Avoid testing `TQ2_0` to hide the error
  - This doesn't fix the problem.
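For illustration, a minimal sketch of the second option, assuming the current shape of `ggml_metal_supports_op` in `ggml-metal.m` (the exact signature and the surrounding per-op checks may differ); the early return mirrors the approach described above for BF16:

```c
// Sketch only: reject ops that touch ternary types with no Metal kernels yet.
// GGML_MAX_SRC, GGML_TYPE_TQ1_0 and GGML_TYPE_TQ2_0 are real ggml symbols;
// the function body around this check is heavily simplified.
static bool ggml_metal_supports_op(const struct ggml_backend_metal_device_context * ctx_dev,
                                   const struct ggml_tensor * op) {
    for (int i = 0; i < GGML_MAX_SRC; i++) {
        if (op->src[i] == NULL) {
            continue;
        }
        const enum ggml_type type = op->src[i]->type;
        if (type == GGML_TYPE_TQ1_0 || type == GGML_TYPE_TQ2_0) {
            return false; // not yet implemented in the Metal backend
        }
    }
    // ... the existing per-op checks continue here ...
    return true;
}
```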
Most of these solutions (apart from hiding the problem) are out of scope of this PR, which focuses on the CUDA implementation of `TQ2_0`. But I don't want this to make the Metal CI fail.
As a follow-up to #8151, which added the ternary types (although CPU-only at first), this implements CUDA kernels for `TQ2_0` (mmvq, tile loading for mmq and mma, and dequant-based cuBLAS).

(Although there was a similar effort in ikawrakow/ik_llama.cpp#13 by @ikawrakow, mmq wasn't handled there, but here it is.)
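For context, the `TQ2_0` block layout these kernels operate on (as introduced in #8151) looks roughly like the following; the names mirror ggml's `block_tq2_0`, but this is a simplified sketch rather than the exact definition:

```c
#include <stdint.h>

#define QK_K 256 // super-block size shared by the k-quants and the ternary types

// Rough sketch of ggml's block_tq2_0: 256 ternary weights packed 4 per byte,
// plus one fp16 scale, i.e. 66 bytes per block (~2.0625 bits per weight).
typedef struct {
    uint8_t  qs[QK_K/4]; // 2-bit fields storing q+1, so each field is in {0,1,2}
    uint16_t d;          // block scale as raw fp16 bits (ggml_half)
} block_tq2_0;
```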
Perplexity
Note that generation quality may differ slightly from CPU inference because the CUDA `TQ2_0` kernels use `Q8_1` (32 int8 values per scale) as the activation type, while on CPU, `Q8_K` is used (256 int8 values per scale).

The perplexities below were calculated with TriLM-3.9B on `wiki.test.raw` from `wikitext-2-raw`, using the CUDA backend.

(Perplexity table: `F16`, `TQ2_0` with `F16` and `F16`, and `TQ2_0` with `Q4_K` and `Q6_K`.)
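To make the difference concrete, here are simplified sketches of the two activation block layouts mentioned above; the fields approximate ggml's `block_q8_1` and `block_q8_K`, but these are illustrative, not the exact definitions:

```c
#include <stdint.h>

// CUDA path: one scale per 32 quantized activations.
typedef struct {
    uint16_t d;        // fp16 scale (raw bits)
    uint16_t s;        // fp16 sum term used by some dot products
    int8_t   qs[32];   // 32 int8 values sharing the scale above
} block_q8_1_sketch;

// CPU path: one scale per 256 quantized activations.
typedef struct {
    float    d;          // scale for the whole super-block
    int8_t   qs[256];    // 256 int8 values sharing a single scale
    int16_t  bsums[16];  // sums of groups of 16, used by the k-quant dot products
} block_q8_K_sketch;
```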
Performance
It's fast. But there is still room for improvement. The implementation is relatively naïve.
Commands used for the benchmarks below:

For `tg128`:

    $ ./bin/llama-bench -m ../models/trilm/TriLM_3.9B_Unpacked-TQ2_0.gguf -n 128 -p 0 -r 20

For `pp2048`:

    $ ./bin/llama-bench -m ../models/trilm/TriLM_3.9B_Unpacked-TQ2_0.gguf -b 4,8,16,32,64,128,256,512,1024,2048 -ub 2048 -n 0 -p 2048 -r 10

And again for each tested quant type.
Tokens per second for TriLM-3.9B comparing `TQ2_0` and various quant types on a NVIDIA GeForce RTX 3090 (using the CUDA backend); best of each row is in bold.

(Benchmark table: t/s per `n_batch` and `n_ubatch` setting, for `TQ2_0`, `Q2_K`, `Q4_0`, `Q4_K_M`, `Q8_0`, and `F16`.)
The same tests, with the same 3.9B ternary model, using a NVIDIA GeForce RTX 4090:

(Benchmark table: t/s per `n_batch` and `n_ubatch` setting, for `TQ2_0`, `Q2_K`, `Q4_0`, `Q4_K_M`, `Q8_0`, and `F16`.)
There is a noticeable relative speedup compared to larger types at low batch sizes (e.g. when doing single-user text generation like in `tg128`). Of course, there is still room for improvement.

(`TQ1_0` is out of scope of this PR, but GPU support for it will also come eventually.)