ggml-cuda : add TQ2_0 kernels, for ternary inference on GPU #11183
base: master

Conversation
This also removes the custom `TQ2_0` mmq dp4a kernel, because re-using the one from `Q8_0` avoids repeatedly unpacking the 2-bit values to 8-bit; the unpacking is instead done only once per tile.
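As a rough illustration (not the PR's actual code), this is the kind of 2-bit-to-8-bit expansion that now only needs to happen once per tile before the `Q8_0` dp4a path takes over. The helper name is made up, and the packing is assumed to follow the `TQ2_0` layout from #8151 (each value stored as q+1 in a 2-bit field):

```c
#include <stdint.h>

// Hypothetical helper: expand one packed TQ2_0 byte (four 2-bit fields)
// into four signed 8-bit values in {-1, 0, +1}.
// Assumes the stored 2-bit value is q+1, so {0,1,2} maps back to {-1,0,+1}.
static inline void unpack_tq2_0_byte(uint8_t packed, int8_t out[4]) {
    for (int l = 0; l < 4; ++l) {
        out[l] = (int8_t) ((packed >> (2*l)) & 3) - 1;
    }
}
```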
    // GGML_TYPE_TQ1_0, GGML_TYPE_TQ2_0, // TODO: implement for all backends
    // GGML_TYPE_TQ1_0,
    GGML_TYPE_TQ2_0,
An unintended side effect of un-commenting `TQ2_0` here is that the Metal tests fail, as in https://github.com/ggerganov/llama.cpp/actions/runs/12716518343/job/35451025034?pr=11183#step:5:13921, because operations on that type are not yet implemented there and the `ggml_metal_supports_op` function isn't representative of the types actually supported by the Metal backend.
Some solutions are:
- Implement all relevant `TQ2_0` support for Metal
  - Will happen eventually; a starting point already floats around somewhere in a branch linked in ggml-quants : ternary packing for TriLMs and BitNet b1.58 #8151 (comment).
- Make `ggml_metal_supports_op` correctly return `false` when it should
  - Should be done for correctness.
  - An "easy" way to temporarily do this would be similar to what was done for BF16: simply return `false` when a `TQ2_0` tensor is encountered (see the sketch after this list). The same should be done for the other not-yet-supported types like `TQ1_0`.
- Avoid testing `TQ2_0` to hide the error
  - This doesn't fix the problem.
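For illustration, a minimal sketch of the second option, assuming the current shape of `ggml_metal_supports_op` in `ggml-metal.m` (the exact signature and the surrounding per-op checks may differ); the early return mirrors the approach described above for BF16:

```c
// Sketch only: reject ops that touch ternary types with no Metal kernels yet.
// GGML_MAX_SRC, GGML_TYPE_TQ1_0 and GGML_TYPE_TQ2_0 are real ggml symbols;
// the function body around this check is heavily simplified.
static bool ggml_metal_supports_op(const struct ggml_backend_metal_device_context * ctx_dev,
                                   const struct ggml_tensor * op) {
    for (int i = 0; i < GGML_MAX_SRC; i++) {
        if (op->src[i] == NULL) {
            continue;
        }
        const enum ggml_type type = op->src[i]->type;
        if (type == GGML_TYPE_TQ1_0 || type == GGML_TYPE_TQ2_0) {
            return false; // not yet implemented in the Metal backend
        }
    }
    // ... the existing per-op checks continue here ...
    return true;
}
```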
Most of these solutions (apart from hiding the problem) are out of scope of this PR, which focuses on the CUDA implementation of `TQ2_0`. But I don't want this to make the Metal CI fail.
As a follow-up to #8151, which added the ternary types (although CPU-only at first), this implements CUDA kernels for `TQ2_0` (mmvq, tile loading for mmq and mma, and dequant-based cuBLAS).

(Although there was a similar effort in ikawrakow/ik_llama.cpp#13 by @ikawrakow, mmq wasn't handled there, but here it is.)
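For context, the `TQ2_0` block layout these kernels operate on (as introduced in #8151) looks roughly like the following; the names mirror ggml's `block_tq2_0`, but this is a simplified sketch rather than the exact definition:

```c
#include <stdint.h>

#define QK_K 256 // super-block size shared by the k-quants and the ternary types

// Rough sketch of ggml's block_tq2_0: 256 ternary weights packed 4 per byte,
// plus one fp16 scale, i.e. 66 bytes per block (~2.0625 bits per weight).
typedef struct {
    uint8_t  qs[QK_K/4]; // 2-bit fields storing q+1, so each field is in {0,1,2}
    uint16_t d;          // block scale as raw fp16 bits (ggml_half)
} block_tq2_0;
```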
Perplexity
Note that generation quality may differ slightly from CPU inference because the CUDA `TQ2_0` kernels use `Q8_1` (32 int8 values per scale) as the activation type, while on CPU, `Q8_K` is used (256 int8 values per scale).

The perplexities below were calculated with TriLM-3.9B on `wiki.test.raw` from `wikitext-2-raw`, using the CUDA backend.

(Perplexity table: `F16`, `TQ2_0` with `F16` and `F16`, and `TQ2_0` with `Q4_K` and `Q6_K`.)
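To make the difference concrete, here are simplified sketches of the two activation block layouts mentioned above; the fields approximate ggml's `block_q8_1` and `block_q8_K`, but these are illustrative, not the exact definitions:

```c
#include <stdint.h>

// CUDA path: one scale per 32 quantized activations.
typedef struct {
    uint16_t d;        // fp16 scale (raw bits)
    uint16_t s;        // fp16 sum term used by some dot products
    int8_t   qs[32];   // 32 int8 values sharing the scale above
} block_q8_1_sketch;

// CPU path: one scale per 256 quantized activations.
typedef struct {
    float    d;          // scale for the whole super-block
    int8_t   qs[256];    // 256 int8 values sharing a single scale
    int16_t  bsums[16];  // sums of groups of 16, used by the k-quant dot products
} block_q8_K_sketch;
```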
Performance
It's fast. But there is still room for improvement. The implementation is relatively naïve.
Commands used for the benchmarks below:

For `tg128`:

    $ ./bin/llama-bench -m ../models/trilm/TriLM_3.9B_Unpacked-TQ2_0.gguf -n 128 -p 0 -r 20

For `pp2048`:

    $ ./bin/llama-bench -m ../models/trilm/TriLM_3.9B_Unpacked-TQ2_0.gguf -b 4,8,16,32,64,128,256,512,1024,2048 -ub 2048 -n 0 -p 2048 -r 10

And again for each tested quant type.
Tokens per second for TriLM-3.9B comparing `TQ2_0` and various quant types on a NVIDIA GeForce RTX 3090 (using the CUDA backend); best of each row is in bold.

(Benchmark table: t/s per `n_batch` and `n_ubatch` setting, for `TQ2_0`, `Q2_K`, `Q4_0`, `Q4_K_M`, `Q8_0`, and `F16`.)
The same tests, with the same 3.9B ternary model, using a NVIDIA GeForce RTX 4090:

(Benchmark table: t/s per `n_batch` and `n_ubatch` setting, for `TQ2_0`, `Q2_K`, `Q4_0`, `Q4_K_M`, `Q8_0`, and `F16`.)
There is a noticeable relative speedup compared to larger types at low batch sizes (e.g. when doing single-user text generation like in `tg128`). Of course, there is still room for improvement.

(`TQ1_0` is out of scope of this PR, but GPU support for it will also come eventually.)