I have installed autoawq 0.2.6 and quantized several common LLMs supported by AWQ on an A100-SXM4-80GB, such as gpt_bigcode-santacoder, Qwen2-7B-Instruct, Yi-9B, and Meta-Llama-3-8B-Instruct. When I compare inference efficiency with Hugging Face (HF), I find no improvement for a considerable number of models. The inference-time benchmark results are as follows:
| LLM | HF (s) | AWQ (s) |
| --- | --- | --- |
| gpt_bigcode-santacoder | 1.46 | 2.62 |
| Qwen2-7B-Instruct | 6.96 | 6.59 |
| Yi-9B | 4.90 | 3.93 |
| Meta-Llama-3-8B-Instruct | 7.39 | 8.52 |
As the benchmark above shows, only Yi-9B and Qwen2-7B-Instruct see a speedup from AWQ quantization, whereas the other two LLMs are actually slower.
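For reference, a minimal version of the kind of timing harness used for such HF-vs-AWQ comparisons is sketched below; the prompt, token budget, dtype, and device placement are illustrative assumptions, not the exact benchmark script behind the table.

```python
# Minimal timing sketch for an HF-vs-AWQ generation comparison.
# NOTE: prompt, max_new_tokens, dtype, and device are illustrative assumptions.
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def time_generation(model_path, prompt="Write a short story about a robot.", max_new_tokens=256):
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_path, torch_dtype=torch.float16, device_map="cuda:0", trust_remote_code=True
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Warm-up call so one-time kernel/initialization cost is excluded from the timing
    model.generate(**inputs, max_new_tokens=8)

    torch.cuda.synchronize()
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    torch.cuda.synchronize()
    return time.perf_counter() - start
```

Doing a warm-up call and synchronizing CUDA before and after `generate()` keeps one-off launch overhead out of the measured number, which matters when comparing runs this close together.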
The quantization code is:

```python
import torch
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer, AwqConfig


def awq_quant(model_path, quant_path):
    # model_path = '/mgData4/loulianzhang/model/huggingface/Qwen2-7B-Instruct'
    quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

    # Load model
    model = AutoAWQForCausalLM.from_pretrained(
        model_path,
        **{"low_cpu_mem_usage": True, "use_cache": False},
        trust_remote_code=True,
        safetensors=False,
    )
    model.to("cuda:5")
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

    # Quantize with a local copy of the pile-val-backup calibration set
    calib_data = "/mgData4/loulianzhang/data/huggingface/pile-val-backup"
    # model.quantize(tokenizer, quant_config=quant_config, calib_data=calib_data)
    model.quantize(tokenizer, quant_config=quant_config, calib_data=calib_data, split="validation")

    # Modify the config so that it is compatible with the transformers integration
    quantization_config = AwqConfig(
        bits=quant_config["w_bit"],
        group_size=quant_config["q_group_size"],
        zero_point=quant_config["zero_point"],
        version=quant_config["version"].lower(),
    ).to_dict()

    # The pretrained transformers model is stored in the model attribute, and we need to pass a dict
    model.model.config.quantization_config = quantization_config
    # A second solution would be to use AutoConfig and push to hub (what llm-awq does)

    # Save quantized model
    model.save_quantized(quant_path)
    tokenizer.save_pretrained(quant_path)
    print(f'Model is quantized and saved at "{quant_path}"')
```
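For the AWQ timings, the quantized checkpoint then has to be loaded back for generation. A rough sketch using autoawq's `from_quantized` follows; the `quant_path` value, the `fuse_layers` setting, and the generation parameters are assumptions for illustration, not necessarily what produced the numbers in the table.

```python
# Loading the quantized checkpoint for the AWQ side of the comparison.
# NOTE: quant_path, fuse_layers, and generation settings are illustrative assumptions.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

quant_path = "Qwen2-7B-Instruct-awq"  # hypothetical output path from awq_quant()

tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)
model = AutoAWQForCausalLM.from_quantized(
    quant_path,
    fuse_layers=True,  # fused modules vs. plain quantized GEMM modules
    trust_remote_code=True,
)

inputs = tokenizer("Write a short story about a robot.", return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Whether fused modules are enabled (`fuse_layers=True`) can change the measured AWQ latency noticeably, so the benchmark should state which setting the AWQ column corresponds to.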
Can anyone explain this phenomenon?
Environments
autoawq 0.2.6
autoawq_kernels 0.0.7
torch 2.4.0
torchaudio 2.4.0
torchvision 0.19.0