
AWQ quantization doesn't improve inference efficiency for many open-source LLMs #243

Open · loulianzhang opened this issue on Dec 10, 2024 · 0 comments
I have installed autoawq 0.2.6 and quantized several common LLMs supported by AWQ on an A100-SXM4-80GB, including gpt_bigcode-santacoder, Qwen2-7B-Instruct, Yi-9B, and Meta-Llama-3-8B-Instruct. When I compare inference efficiency against plain Hugging Face (HF) inference, I see no improvement for a considerable number of models. The inference-time benchmark is as follows:

LLM                        HF (s)   AWQ (s)
gpt_bigcode-santacoder     1.46     2.62
Qwen2-7B-Instruct          6.96     6.59
Yi-9B                      4.90     3.93
Meta-Llama-3-8B-Instruct   7.39     8.52

As the benchmark above shows, only Yi-9B and Qwen2-7B-Instruct are faster after AWQ quantization, whereas the other two LLMs are actually slower.
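
The exact benchmark script is not shown here; the timings were taken as wall-clock time around a single generate() call, roughly like the following sketch (the prompt, device, and generation settings are assumptions, not my original script):

    import time
    import torch

    def time_generate(model, tokenizer, prompt, device="cuda:0", max_new_tokens=256):
        # Tokenize the prompt and move it to the same device as the model
        inputs = tokenizer(prompt, return_tensors="pt").to(device)
        # Synchronize so the timer only measures the generate() call itself
        torch.cuda.synchronize()
        start = time.perf_counter()
        model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
        torch.cuda.synchronize()
        return time.perf_counter() - start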

Environments

autoawq 0.2.6
autoawq_kernels 0.0.7
torch 2.4.0
torchaudio 2.4.0
torchvision 0.19.0

Quantization Code

import torch
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer, AwqConfig

def awq_quant(model_path, quant_path):
    quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

    # Load the FP16 model and tokenizer
    model = AutoAWQForCausalLM.from_pretrained(
        model_path,
        low_cpu_mem_usage=True,
        use_cache=False,
        trust_remote_code=True,
        safetensors=False,
    )
    model.to("cuda:5")
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

    # Quantize with a local copy of the pile-val calibration set
    calib_data = "/mgData4/loulianzhang/data/huggingface/pile-val-backup"
    model.quantize(tokenizer, quant_config=quant_config, calib_data=calib_data, split="validation")

    # Modify the config so that it is compatible with the transformers integration
    quantization_config = AwqConfig(
        bits=quant_config["w_bit"],
        group_size=quant_config["q_group_size"],
        zero_point=quant_config["zero_point"],
        version=quant_config["version"].lower(),
    ).to_dict()

    # The pretrained transformers model is stored in the model attribute, and we need to pass a dict
    # (an alternative would be to use AutoConfig and push to the hub, as llm-awq does)
    model.model.config.quantization_config = quantization_config

    # Save the quantized model and tokenizer
    model.save_quantized(quant_path)
    tokenizer.save_pretrained(quant_path)
    print(f'Model is quantized and saved at "{quant_path}"')
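
For the AWQ side of the comparison, the saved checkpoint was then loaded back for inference; a minimal loading sketch (the exact inference script is not shown here, so the path, prompt, and generation settings below are placeholders) looks like this:

    # Hypothetical inference sketch: load the AWQ checkpoint saved above with
    # autoawq's fused modules enabled and run a single generation.
    import torch
    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    quant_path = "Qwen2-7B-Instruct-awq"  # placeholder path to the saved checkpoint
    model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=True)
    tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)

    inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda:0")
    output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    print(tokenizer.decode(output[0], skip_special_tokens=True))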

Can anyone explain this phenomenon?
