A novel post-training quantization framework that enhances GPTQ by integrating KL divergence for better accuracy preservation when deploying Large Language Models on edge devices.
KLAWQ extends GPTQ by adding a KL divergence term to align quantized model outputs with the original model's distribution:
L(Q) = L_MSE(Q) + β * L_KL(Q)
The algorithm modifies the Hessian computation as H_tot = H + βA, where A is the KL Hessian matrix.
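The following is a minimal PyTorch sketch of this combination step: the standard GPTQ (MSE) Hessian H is blended with a KL-derived Hessian A weighted by β. How A itself is estimated is handled by the KLAWQ modules and is not shown here; the tensors, shapes, and the `total_hessian` helper are illustrative assumptions.

```python
import torch

def total_hessian(H: torch.Tensor, A: torch.Tensor, beta: float) -> torch.Tensor:
    """Combine the MSE Hessian H with the KL Hessian A: H_tot = H + beta * A."""
    return H + beta * A

# Illustrative shapes only: d is the input dimension of the layer being quantized.
d = 768
X = torch.randn(d, 128)   # stand-in calibration activations (features x samples)
H = 2.0 * X @ X.T         # GPTQ-style MSE Hessian accumulated from layer inputs
A = torch.eye(d)          # placeholder for the KL Hessian computed by KLAWQ
H_tot = total_hessian(H, A, beta=0.1)
```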
- Configuration: Hyperparameters (β, τ) in `KLAWQ/gptqmodel/quantization/config.py` (a hypothetical sketch of these hyperparameters follows this list)
- Core Algorithm: KL Hessian computation in `KLAWQ/kl-aware-quant/quantization/gptq.py`
- Quantization Engine: Low-level operations in `KLAWQ/kl-aware-quant/quantization/quantizer.py`
- Analysis Notebooks: Experimental validation in the `kl-hessian-gptq-*.ipynb` files
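Below is a hypothetical illustration of how the two hyperparameters could be grouped in a configuration object; the actual fields and defaults live in `KLAWQ/gptqmodel/quantization/config.py` and may differ. The `KLAWQConfig` name, the default values, and the reading of τ as a softmax temperature are assumptions.

```python
from dataclasses import dataclass

@dataclass
class KLAWQConfig:
    """Hypothetical grouping of the KLAWQ quantization hyperparameters."""
    bits: int = 8      # target weight bit-width
    beta: float = 0.1  # weight of the KL-divergence term in L(Q)
    tau: float = 1.0   # assumed: softmax temperature used when computing the KL term

cfg = KLAWQConfig(bits=8, beta=0.1, tau=2.0)
```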
- Clone and Setup: `git clone https://github.com/ha405/Compression-Framework-for-EdgeAI` and `cd Compression-Framework-for-EdgeAI`
- Install Dependencies: Install PyTorch, transformers, and the other requirements from `requirements.txt`
- Run Quantization: Use the Jupyter notebooks for experimentation or integrate the KLAWQ modules directly (see the sketch after this list)
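A minimal end-to-end sketch of the third step, assuming a hypothetical `quantize_model` entry point exposed by the KLAWQ quantization modules; the real import path, function name, and signature may differ.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# A few sentences stand in for a real calibration set.
calib_texts = [
    "Quantization reduces model size and memory traffic.",
    "Edge devices impose tight compute and memory budgets.",
]
calib_batches = [tokenizer(t, return_tensors="pt") for t in calib_texts]

# Hypothetical KLAWQ call: 8-bit weights, KL weight beta, temperature tau.
# from klawq.quantization import quantize_model  # assumed import path
# quantized_model = quantize_model(model, calib_batches, bits=8, beta=0.1, tau=2.0)
```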
Experiments on GPT-2 at 8-bit precision demonstrate improved perplexity compared to vanilla GPTQ while retaining the efficiency of post-training quantization.
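For reference, perplexity (the metric behind this comparison) can be measured with a few lines of transformers code; this sketch evaluates a stock FP32 GPT-2 on a single sentence, and a quantized model would simply be swapped in. The example text is arbitrary.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

text = "Post-training quantization compresses a model after training is complete."
enc = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels equal to input_ids yields the mean token cross-entropy.
    loss = model(**enc, labels=enc["input_ids"]).loss

print(f"perplexity: {math.exp(loss.item()):.2f}")
```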
The framework builds on PyTorch >=2.4.1 and transformers >=4.51.2, with FastAPI used for model serving. The project follows a modular design with separate components for adapter functionality, model definitions, and processing loops, though the core KLAWQ innovation is concentrated in the quantization modules.
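As an illustration of the serving side, the following is a minimal FastAPI sketch that exposes a text-generation endpoint; the endpoint name, request schema, and use of the transformers `pipeline` are assumptions and not taken from the repository's serving code.

```python
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
generator = pipeline("text-generation", model="gpt2")  # swap in the quantized model

class GenerateRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 32

@app.post("/generate")
def generate(req: GenerateRequest):
    out = generator(req.prompt, max_new_tokens=req.max_new_tokens)
    return {"completion": out[0]["generated_text"]}

# Run with: uvicorn serve:app --reload  (assuming this file is saved as serve.py)
```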