GGML has risen in significance for CPU-centric approaches as the backbone of whisper.cpp and llama.cpp.
I wonder if it could finally provide a counter-balance to, and even compete with, GPU approaches.
Off the top of my head, I can think of this paper:
SLIDE: In Defense of Smart Algorithms over Hardware Acceleration for Large-Scale Deep Learning Systems by Beidi Chen (@keroro824). This work benefits both large-scale neural network training and inference.
Slides 23-29 from their NeurIPS presentation cover the approach (timestamp: minutes 6-8 in the associated video).
The SLIDE code is also available from this repo, I believe.
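For anyone skimming past the paper: SLIDE's core trick is to index each layer's neurons in locality-sensitive hash tables, hash the incoming activation vector, and then compute only the neurons that collide with it, turning a dense matrix multiply into a sparse lookup-and-accumulate. Below is a minimal, self-contained C++ sketch of that idea using signed random projections; the names, sizes, and single-table setup are my own illustrative assumptions, not code from the SLIDE repo.

```cpp
// Sketch of SLIDE's neuron-selection idea: signed-random-projection LSH
// picks, per input, a small subset of neurons whose weights likely have a
// large inner product with the input; only those activations are computed.
#include <cstdio>
#include <cstdint>
#include <random>
#include <unordered_map>
#include <vector>

constexpr int DIM = 64;       // input dimension (illustrative)
constexpr int NEURONS = 4096; // neurons in the layer (illustrative)
constexpr int KBITS = 10;     // hyperplanes per table -> 2^10 buckets

struct SrpTable {
    std::vector<std::vector<float>> planes;                 // KBITS hyperplanes
    std::unordered_map<uint32_t, std::vector<int>> buckets; // hash -> neuron ids

    // One bit per hyperplane: which side of the plane the vector falls on.
    uint32_t hash(const std::vector<float>& v) const {
        uint32_t h = 0;
        for (int k = 0; k < KBITS; ++k) {
            float dot = 0.f;
            for (int d = 0; d < DIM; ++d) dot += planes[k][d] * v[d];
            h = (h << 1) | (dot >= 0.f ? 1u : 0u);
        }
        return h;
    }
};

int main() {
    std::mt19937 rng(42);
    std::normal_distribution<float> gauss(0.f, 1.f);

    // Random layer weights: one DIM-sized weight vector per neuron.
    std::vector<std::vector<float>> W(NEURONS, std::vector<float>(DIM));
    for (auto& w : W) for (auto& x : w) x = gauss(rng);

    // Build one hash table over the neurons (SLIDE uses several for recall).
    SrpTable table;
    table.planes.assign(KBITS, std::vector<float>(DIM));
    for (auto& p : table.planes) for (auto& x : p) x = gauss(rng);
    for (int n = 0; n < NEURONS; ++n)
        table.buckets[table.hash(W[n])].push_back(n);

    // Query: hash the input and compute activations only for the colliding
    // neurons, instead of all NEURONS dot products.
    std::vector<float> input(DIM);
    for (auto& x : input) x = gauss(rng);
    const auto& active = table.buckets[table.hash(input)];
    printf("computing %zu of %d neurons\n", active.size(), NEURONS);
    for (int n : active) {
        float act = 0.f;
        for (int d = 0; d < DIM; ++d) act += W[n][d] * input[d];
        printf("neuron %d -> %.3f\n", n, act);
    }
    return 0;
}
```

In the real system, several tables are used to improve recall and the buckets are periodically rebuilt as weights change during training; my understanding is that this bookkeeping is where most of the engineering effort in their C++ codebase goes.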
It seems like the project did not become widely popular because of inertia within the deep-learning ecosystem in favour of GPU-driven approaches.
A notable reference is from this paper, where it states that:
"For this reason, (Chen et al., 2020) build their SLIDE system in C++ from scratch on CPU. Even though their implementation achieved remarkable speed up, their impact is limited as they implemented their system from scratch in C++, making it difficult for the community to adopt SLIDE in practice."
A lot of their work on CPU-centric machine learning is probably relevant and could be popularized via GGML.
Other work in this space is also of interest.
This is probably of relevance to the discussions at ggerganov/llama.cpp#638 and ggerganov/llama.cpp#521.
There is also an interesting article from Tim Dettmers about the tendency towards sparsity in attention layers as language models scale: https://timdettmers.com/2022/08/17/llm-int8-and-emergent-features/
So, squinting at all of this, one could hazard a guess that it may be possible to squeeze Llama 65B @ 128GB (38.5GB @ 4-bit) down to around 13B (4GB @ 4-bit); that is, to scale its efficiency to the current equivalent of Llama 7B.
That would open the door to running GPT-3-scale 175B-parameter models on CPU at roughly the same cost as the current Llama 30B implementation.
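For concreteness, the raw weight-memory arithmetic behind those figures is just bytes = parameters × bits ÷ 8. The short sketch below tabulates it; note it deliberately ignores the per-block scale factors in real llama.cpp q4_0 files, which is why 65B ships at ~38.5GB rather than the ~30GiB the plain formula gives.

```cpp
// Back-of-envelope weight-memory math for the paragraph above
// (pure arithmetic, no llama.cpp specifics): bytes = params * bits / 8.
#include <cstdio>

int main() {
    const double GiB = 1024.0 * 1024.0 * 1024.0;
    const double params[] = {7e9, 13e9, 30e9, 65e9, 175e9};
    for (double p : params) {
        double fp16 = p * 16.0 / 8.0 / GiB; // 2 bytes per weight
        double q4   = p * 4.0  / 8.0 / GiB; // 0.5 bytes per weight
        printf("%4.0fB params: fp16 %6.1f GiB, 4-bit %5.1f GiB\n",
               p / 1e9, fp16, q4);
    }
    return 0;
}
```

Even at a plain 4 bits, 175B still needs ~80GiB of weights, so hitting "30B cost" for a 175B model would imply roughly a further 3-4x saving from sparsity/activity on top of quantization, which is exactly the regime SLIDE-style neuron selection is aiming at.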
Thoughts?