
AWQ quantization doesn't improve inference efficiency for many open-source LLMs #243

Open · loulianzhang opened this issue on Dec 10, 2024 · 0 comments
I have installed autoawq 0.2.6 and quantized several common LLMs supported by AWQ on an A100-SXM4-80GB, including gpt_bigcode-santacoder, Qwen2-7B-Instruct, Yi-9B, and Meta-Llama-3-8B-Instruct. When I compare inference efficiency against plain Hugging Face (HF) inference, I see no improvement for a considerable number of models. The inference-time benchmark is as follows:

LLM                        HF (s)   AWQ (s)
gpt_bigcode-santacoder     1.46     2.62
Qwen2-7B-Instruct          6.96     6.59
Yi-9B                      4.90     3.93
Meta-Llama-3-8B-Instruct   7.39     8.52

As the benchmark above shows, only Yi-9B and Qwen2-7B-Instruct are faster after AWQ quantization, whereas the other two LLMs are actually slower.
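
The exact benchmark script is not shown here; the timings were taken as wall-clock time around a single generate() call, roughly like the following sketch (the prompt, device, and generation settings are assumptions, not my original script):

    import time
    import torch

    def time_generate(model, tokenizer, prompt, device="cuda:0", max_new_tokens=256):
        # Tokenize the prompt and move it to the same device as the model
        inputs = tokenizer(prompt, return_tensors="pt").to(device)
        # Synchronize so the timer only measures the generate() call itself
        torch.cuda.synchronize()
        start = time.perf_counter()
        model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
        torch.cuda.synchronize()
        return time.perf_counter() - start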

Environments

autoawq 0.2.6
autoawq_kernels 0.0.7
torch 2.4.0
torchaudio 2.4.0
torchvision 0.19.0

Quantization Code

import torch
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer, AwqConfig

def awq_quant(model_path, quant_path):
    quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

    # Load the FP16 model and tokenizer
    model = AutoAWQForCausalLM.from_pretrained(
        model_path,
        low_cpu_mem_usage=True,
        use_cache=False,
        trust_remote_code=True,
        safetensors=False,
    )
    model.to("cuda:5")
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

    # Quantize with a local copy of the pile-val calibration set
    calib_data = "/mgData4/loulianzhang/data/huggingface/pile-val-backup"
    model.quantize(tokenizer, quant_config=quant_config, calib_data=calib_data, split="validation")

    # Modify the config so that it is compatible with the transformers integration
    quantization_config = AwqConfig(
        bits=quant_config["w_bit"],
        group_size=quant_config["q_group_size"],
        zero_point=quant_config["zero_point"],
        version=quant_config["version"].lower(),
    ).to_dict()

    # The pretrained transformers model is stored in the model attribute, and we need to pass a dict
    # (an alternative would be to use AutoConfig and push to the hub, as llm-awq does)
    model.model.config.quantization_config = quantization_config

    # Save the quantized model and tokenizer
    model.save_quantized(quant_path)
    tokenizer.save_pretrained(quant_path)
    print(f'Model is quantized and saved at "{quant_path}"')
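
For the AWQ side of the comparison, the saved checkpoint was then loaded back for inference; a minimal loading sketch (the exact inference script is not shown here, so the path, prompt, and generation settings below are placeholders) looks like this:

    # Hypothetical inference sketch: load the AWQ checkpoint saved above with
    # autoawq's fused modules enabled and run a single generation.
    import torch
    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    quant_path = "Qwen2-7B-Instruct-awq"  # placeholder path to the saved checkpoint
    model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=True)
    tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)

    inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda:0")
    output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    print(tokenizer.decode(output[0], skip_special_tokens=True))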

Can anyone explain this phenomenon?
