vulkan: scale caching for k quants + misc fixes #11081
base: master
Conversation
RTX 4070 results. Keep in mind there's a lot of variability in the results, but at first glance it seems like an improvement for Q3_K but worse for the others:
For multiple values of n I'm seeing clear improvements with Q3_K and Q6_K, but Q2_K is much less consistent and is in some cases slower than master. PR:
Master:
I tried calculating the A * scale multiplication ahead of time for Q2_K, but it didn't do much. It should also reduce the number of shared memory reads, since the products are kept in registers. A * scale multiplication cached in registers:
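To make the idea concrete, here is a minimal toy compute shader sketching what "cache the A * scale products in registers" could look like. The buffer layouts, names and block sizes are invented for illustration; this is not the PR's actual Q2_K shader.

```glsl
#version 450
#extension GL_EXT_control_flow_attributes : require

// Toy sketch, not the llama.cpp shader: the A value is multiplied by its
// scale once, the product stays in a register, and it is reused for every
// B column instead of re-reading the scale per column.

layout(local_size_x = 32) in;

layout(std430, binding = 0) readonly  buffer QuantA  { uint  qa[];     };  // 8 x 4-bit quants per uint
layout(std430, binding = 1) readonly  buffer ScalesA { float scales[]; };  // one scale per packed uint
layout(std430, binding = 2) readonly  buffer MatB    { float b[];      };  // column-major, NUM_COLS columns
layout(std430, binding = 3) writeonly buffer Dst     { float dst[];    };

layout(push_constant) uniform PC { uint blocks_per_row; } pc;

const uint NUM_COLS = 8u;

void main() {
    const uint row = gl_GlobalInvocationID.x;   // toy layout: one A row per invocation
    const uint k   = pc.blocks_per_row * 8u;    // unpacked row length

    float sums[NUM_COLS];
    [[unroll]] for (uint j = 0u; j < NUM_COLS; ++j) { sums[j] = 0.0; }

    for (uint i = 0u; i < pc.blocks_per_row; ++i) {
        const uint  packed = qa[row * pc.blocks_per_row + i];
        const float scale  = scales[row * pc.blocks_per_row + i];  // read once per block

        [[unroll]] for (uint v = 0u; v < 8u; ++v) {
            // Pre-multiply the dequantized value by its scale once; the
            // product lives in a register...
            const float a_val = scale * float((packed >> (4u * v)) & 0xFu);
            // ...and is reused across all B columns without touching the scale again.
            [[unroll]] for (uint j = 0u; j < NUM_COLS; ++j) {
                sums[j] += a_val * b[j * k + i * 8u + v];
            }
        }
    }

    [[unroll]] for (uint j = 0u; j < NUM_COLS; ++j) {
        dst[j * (gl_NumWorkGroups.x * 32u) + row] = sums[j];
    }
}
```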
I'll post benchmarks at a later point, but this reduces performance on RTX 3090 for q2_k and q6_k. I see small improvements on Radeon Pro VII. Intel still crashes, but only in
IMO the crash is still very likely related to the barriers in nonuniform control flow. It really needs to be fixed if we're going to use shared memory here. If the additional branches are causing too many problems then maybe we could change how the work is spread across a workgroup so that the number of iterations is uniform, but that could also affect perf (likely making it worse, I'd guess).
To get rid of the branches we could just have the main i loop run with no checks as long as we have enough blocks remaining to use all threads, and then switch to a separate code path for the final multiplications. There's no need to redo the algorithm.
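A rough sketch of that loop structure, on a deliberately trivial workload (names and the toy "sum one row per workgroup" job are made up, not the PR's code): the main loop runs check-free while a full set of blocks remains, so the barriers around the shared-memory cache sit in uniform control flow, and a separate tail path handles the leftover blocks.

```glsl
#version 450
#extension GL_EXT_control_flow_attributes : require

layout(local_size_x = 64) in;

layout(std430, binding = 0) readonly  buffer Src { float a[];   };
layout(std430, binding = 1) writeonly buffer Dst { float dst[]; };

layout(push_constant) uniform PC { uint blocks_per_row; } pc;

shared float cache[64];

void main() {
    const uint tid  = gl_LocalInvocationID.x;
    const uint base = gl_WorkGroupID.x * pc.blocks_per_row;   // one row per workgroup

    float sum = 0.0;
    uint  i   = 0u;

    // Main path: one block per thread, no per-iteration bounds checks.
    while (i + 64u <= pc.blocks_per_row) {
        cache[tid] = a[base + i + tid];     // cooperative load into shared memory
        barrier();                          // reached by the whole workgroup
        [[unroll]] for (uint t = 0u; t < 64u; ++t) {
            sum += cache[t];
        }
        barrier();                          // done reading before the next overwrite
        i += 64u;
    }

    // Tail path: fewer blocks than threads remain. Every thread still reaches
    // the barriers; only out-of-range threads skip the load and the extra reads.
    if (i < pc.blocks_per_row) {
        if (i + tid < pc.blocks_per_row) {
            cache[tid] = a[base + i + tid];
        }
        barrier();
        for (uint t = 0u; t < pc.blocks_per_row - i; ++t) {
            sum += cache[t];
        }
        barrier();
    }

    if (tid == 0u) {
        dst[gl_WorkGroupID.x] = sum;
    }
}
```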
Okay I've fixed up Q6_K to handle the early return case, and it's now running at 23.3 t/s with a few extra tweaks. @0cc4m can you try this on Intel to see if it prevents the crash?
I tested the latest Q6_K changes on RTX 4070. For llama-bench with llama-2-7b.Q6_K, the perf is basically unchanged, which is not surprising since it's just memory bandwidth-limited. The directed perf results are more interesting:
So there's a nice boost for larger n, but it just falls off a cliff for n=8. I looked into this, and what's happening is the barriers are causing all the loads of the B matrix to be bunched together, and it's using too many registers. I tried moving all the B loads to the start of the function and saving them in local arrays, and that seems to resolve the issue:
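Something along these lines, shown here on a made-up dense mat-mul rather than the actual k-quant shader: the B values a block needs are loaded into a local array up front, and the multiply-accumulates run afterwards, instead of the loads being interleaved with the FMAs and piling up around barriers.

```glsl
#version 450
#extension GL_EXT_control_flow_attributes : require

layout(local_size_x = 32) in;

layout(std430, binding = 0) readonly  buffer MatA { float a[];   };
layout(std430, binding = 1) readonly  buffer MatB { float b[];   };  // column-major, NUM_COLS columns
layout(std430, binding = 2) writeonly buffer Dst  { float dst[]; };

layout(push_constant) uniform PC { uint k; } pc;   // shared dimension

const uint NUM_COLS   = 8u;   // B columns handled together
const uint BLOCK_SIZE = 4u;   // A values processed per iteration

void main() {
    const uint row = gl_GlobalInvocationID.x;   // toy layout: one A row per invocation

    float sums[NUM_COLS];
    [[unroll]] for (uint j = 0u; j < NUM_COLS; ++j) { sums[j] = 0.0; }

    for (uint i = 0u; i < pc.k; i += BLOCK_SIZE) {
        // B loads hoisted to the top: fetch everything this block needs first
        // and keep it in a local (register-resident) array.
        float b_vals[NUM_COLS][BLOCK_SIZE];
        [[unroll]] for (uint j = 0u; j < NUM_COLS; ++j) {
            [[unroll]] for (uint t = 0u; t < BLOCK_SIZE; ++t) {
                b_vals[j][t] = b[j * pc.k + i + t];
            }
        }

        // The FMAs then run against the already-loaded values.
        [[unroll]] for (uint t = 0u; t < BLOCK_SIZE; ++t) {
            const float a_val = a[row * pc.k + i + t];
            [[unroll]] for (uint j = 0u; j < NUM_COLS; ++j) {
                sums[j] += a_val * b_vals[j][t];
            }
        }
    }

    [[unroll]] for (uint j = 0u; j < NUM_COLS; ++j) {
        dst[j * (gl_NumWorkGroups.x * 32u) + row] = sums[j];
    }
}
```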
Hmm, this looks like an Nvidia-only issue, and I didn't see it on my end when I was testing my changes. AMD's tools report that 82/256 vector registers are used in the 64 block size, 4 rows, 8 columns case.
I checked the assembly and at least for me the compiler is interleaving the B loads and the sum FMAs rather than doing them all at once. Also if I do a quick estimation:
That's 72 vector registers, and I guess we can go up to 100-ish once we include the indexes and so forth. If we assume the compiler is loading all the B columns together, that's 16*8 = 128 registers, which would bring the total over 200. However, the compiler doesn't need to load all the B values into registers in one go here, and it should be smart enough not to spill. BTW I can definitely make this change to fix the n=8 performance, and I'll do these tweaks in one go once I get confirmation that Intel is working. It's just weird that the compiler is running out of registers in this case, which hurts performance more than smaller loads would.
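For reference, the back-of-the-envelope register count written out, using only the numbers quoted above:

$$
\underbrace{\sim 100}_{72\ \text{for the sums, plus indexing}} \;+\; \underbrace{16 \times 8 = 128}_{\text{all B columns held at once}} \;\approx\; 228 \ \text{vector registers}
$$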
This reverts commit 65110b8.
…n use, plus some more optimizations
I tested it on my A770: Q6_K now passes and it crashes on a later Q2_K test, so it was correct.
New numbers after fixing the early returns and making some more changes:
@@ -45,55 +43,57 @@ void compute_outputs(const uint32_t first_row, const uint32_t num_rows) {
     [[unroll]] for (uint n = 0; n < num_rows; ++n) {
         const uint ib0 = a_offset / QUANT_K + (first_row+n)*num_blocks_per_row;
-        f16vec2 d = data_a[ib0 + i].d;
+        const f16vec2 d = data_a[ib0 + i].d;
It would be nice to apply the calc_superblock changes to q4 and q5 as well, to keep all the k-quants structured the same way.
We can make inference run a bit faster by extracting the scales in parallel and saving them to shared memory, where they'll be used by all the threads working on the superblock. This came out of the experiments in #10999.
This was not done for Q4_K and Q5_K, as their scales are packed in a more complicated way that makes this method slower for them.
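For readers unfamiliar with these shaders, here is a minimal sketch of the scale-caching pattern on an invented "superblock" format (16 groups of 16 4-bit values with one packed 8-bit scale per group). This is not the real k-quant layout or the PR's shader; it only shows the idea of extracting the scales cooperatively into shared memory and having every thread read them from there.

```glsl
#version 450
#extension GL_EXT_control_flow_attributes : require

layout(local_size_x = 16) in;

layout(std430, binding = 0) readonly  buffer Quants { uint  q[];        };  // 8 x 4-bit values per uint
layout(std430, binding = 1) readonly  buffer Scales { uint  packed_s[]; };  // 4 x 8-bit scales per uint
layout(std430, binding = 2) readonly  buffer VecB   { float b[];        };
layout(std430, binding = 3) writeonly buffer Dst    { float dst[];      };

layout(push_constant) uniform PC { uint superblocks_per_row; float d; } pc;

shared float s_scales[16];
shared float s_partial[16];

void main() {
    const uint tid = gl_LocalInvocationID.x;
    const uint row = gl_WorkGroupID.x;          // one matrix row per workgroup
    float sum = 0.0;

    for (uint sb = 0u; sb < pc.superblocks_per_row; ++sb) {
        const uint sb_idx = row * pc.superblocks_per_row + sb;

        // Parallel scale extraction: thread `tid` unpacks the scale of group `tid`.
        const uint word = packed_s[sb_idx * 4u + (tid >> 2u)];
        s_scales[tid] = pc.d * float((word >> (8u * (tid & 3u))) & 0xFFu);
        barrier();

        // Every thread then uses all 16 cached scales: thread `tid` handles
        // value position `tid` of each group in the superblock.
        [[unroll]] for (uint g = 0u; g < 16u; ++g) {
            const uint  packed = q[sb_idx * 32u + g * 2u + (tid >> 3u)];
            const float a_val  = s_scales[g] * float((packed >> (4u * (tid & 7u))) & 0xFu);
            sum += a_val * b[sb * 256u + g * 16u + tid];
        }
        barrier();   // everyone done reading before the cache is overwritten
    }

    // Trivial reduction for the toy: thread 0 adds up the per-thread sums.
    s_partial[tid] = sum;
    barrier();
    if (tid == 0u) {
        float total = 0.0;
        [[unroll]] for (uint t = 0u; t < 16u; ++t) { total += s_partial[t]; }
        dst[row] = total;
    }
}
```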
PR:
Master: