From ece8e5793a1dd053bff1909678061945974e7bd8 Mon Sep 17 00:00:00 2001
From: Aisuko
Date: Tue, 5 Mar 2024 17:59:59 +1100
Subject: [PATCH] Kaggle Notebook | Quantization Techniques | Version 11

---
 quantization/quantization-techniques.ipynb | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/quantization/quantization-techniques.ipynb b/quantization/quantization-techniques.ipynb
index d26c6d3..8a5fbe5 100644
--- a/quantization/quantization-techniques.ipynb
+++ b/quantization/quantization-techniques.ipynb
@@ -1 +1 @@
-{"metadata":{"kernelspec":{"language":"python","display_name":"Python 3","name":"python3"},"language_info":{"pygments_lexer":"ipython3","nbconvert_exporter":"python","version":"3.6.4","file_extension":".py","codemirror_mode":{"name":"ipython","version":3},"name":"python","mimetype":"text/x-python"},"kaggle":{"accelerator":"none","dataSources":[],"dockerImageVersionId":30626,"isInternetEnabled":true,"language":"python","sourceType":"notebook","isGpuEnabled":false}},"nbformat_minor":4,"nbformat":4,"cells":[{"source":"\"Kaggle\"","metadata":{},"cell_type":"markdown"},{"cell_type":"markdown","source":"# Overview\n\nAs bigger as the size of LLM, there are various technologies have been developed that try to shrink the model size, like quantization techniques. Quantization techniques focus on representing data with less information while also trying to not loss too much accuracy. This often means converting a data type to represent the same information with fewer bits. For example, if the model weights are stored as 32-bit floating points and they'are quantized to 16-bit floating points, this halves the model size which makes it easier to store and reduces memeory-usage. Lower precision can also speedup inference because it takes less time to perform calculations with fewer bits.\n\nIn this notebook, let's review the quantization techniques have been used by us until now.","metadata":{"_uuid":"8f2839f25d086af736a60e9eeb907d3b93b6e0e5","_cell_guid":"b1076dfc-b9ad-4769-8c92-a6c4dae69d19"}},{"cell_type":"markdown","source":"# Activation-Aware Weight Quantization(AWQ)\n\nIt observe the activation rather than weights. See example in [Quantization Methods](https://www.kaggle.com/code/aisuko/quantization-methods?scriptVersionId=160183672)","metadata":{}},{"cell_type":"markdown","source":"# Weight of Quantization\n\nIt has two main families\n\n\n## Post-Training Quantization(PTQ)\n\nWe have discussed Native INT8 quantization, and it has two methods, see [Introduction to weight quantization](https://www.kaggle.com/code/aisuko/introduction-to-weight-quantization):\n\n* Absolute maximum(absmax)\n* Zero-Point quantization\n\nFurthermore, we use LLM.INT8(). It relies on a vector-wise(absmax) quantization scheme and introduces mixed-precision quantization. And it has three useful features:\n\n* Offloading\n* Outlier threshold\n* Skip module conversion\n\n\n## Generative Pre-trained Transformers(GPTQ)\n\n* Arbirary Order Insight\n* Lazy Batch-Updates\n* Cholesky Reformulation\n\nSee [Quantization with GPTQ](https://www.kaggle.com/code/aisuko/quantization-with-gptq)\n\n\n# Quantization-Aware Training(QAT)\n\nWe have been not use any methods of this one.","metadata":{}},{"cell_type":"markdown","source":"# Fine-tuning approchs\n\nSome of fine-tuning approch like \"adapter-tuning\" technique. It will quantize model first before the training start.\n\n\n## QLoRA\n\nIt will freeze the model weight to LoRA before training. See [Fine-tuning Llama2 with QLoRA](https://www.kaggle.com/code/aisuko/fine-tuning-llama2-with-qlora)\n\n* NF4\n* Double quantization\n* Paged Optimizers","metadata":{}},{"cell_type":"markdown","source":"# Credit\n\n* https://maartengrootendorst.substack.com/p/which-quantization-method-is-right?utm_source=profile&utm_medium=reader2\n* https://huggingface.co/docs/transformers/v4.37.0/quantization\n* https://medium.com/towards-data-science/introduction-to-weight-quantization-2494701b9c0c\n* https://towardsdatascience.com/4-bit-quantization-with-gptq-36b0f4f02c34","metadata":{}}]}
\ No newline at end of file
+{"cells":[{"source":"\"Kaggle\"","metadata":{},"cell_type":"markdown"},{"cell_type":"markdown","id":"68342cf8","metadata":{"papermill":{"duration":0.003662,"end_time":"2024-03-05T06:59:46.573853","exception":false,"start_time":"2024-03-05T06:59:46.570191","status":"completed"},"tags":[]},"source":["# Overview\n","\n","As LLMs grow larger, various techniques have been developed to shrink the model size, and quantization is one of them. Quantization techniques focus on representing data with less information while trying not to lose too much accuracy. This often means converting a data type so that the same information is represented with fewer bits. For example, if the model weights are stored as 32-bit floating points and are quantized to 16-bit floating points, the model size is halved, which makes it easier to store and reduces memory usage. Lower precision can also speed up inference because calculations with fewer bits take less time.\n","\n","In this notebook, let's review the quantization techniques we have used so far.\n","\n","\n","# Activation-Aware Weight Quantization(AWQ)\n","\n","It decides which weights to protect by observing the activations rather than the weights themselves. See the example in [Quantization Methods](https://www.kaggle.com/code/aisuko/quantization-methods)\n","\n","\n","# Weight Quantization\n","\n","**It has two main families: post-training quantization (PTQ) and quantization-aware training (QAT).**\n","\n","\n","## Post-Training Quantization(PTQ)\n","\n","We have discussed native INT8 quantization, which has two methods; see [Introduction to weight quantization](https://www.kaggle.com/code/aisuko/introduction-to-weight-quantization):\n","\n","* Absolute maximum (absmax) quantization\n","* Zero-point quantization\n","\n","Furthermore, we use LLM.int8(). It relies on a vector-wise (absmax) quantization scheme and introduces mixed-precision quantization. It has three useful features:\n","\n","* Offloading\n","* Outlier threshold\n","* Skip module conversion\n","\n","\n","### Generative Pre-trained Transformers(GPTQ)\n","\n","* Arbitrary Order Insight\n","* Lazy Batch-Updates\n","* Cholesky Reformulation\n","\n","See [Quantization with GPTQ](https://www.kaggle.com/code/aisuko/quantization-with-gptq)\n","\n","\n","## Quantization-Aware Training(QAT)\n","\n","We have not used any methods from this family yet."]},{"cell_type":"markdown","id":"d823c86d","metadata":{"papermill":{"duration":0.002208,"end_time":"2024-03-05T06:59:46.578968","exception":false,"start_time":"2024-03-05T06:59:46.57676","status":"completed"},"tags":[]},"source":["# NF4(QLoRA)\n","\n","QLoRA quantizes the model weights, freezes them, and trains LoRA adapters on top. See [Fine-tuning Llama2 with QLoRA](https://www.kaggle.com/code/aisuko/fine-tuning-llama2-with-qlora)\n","\n","* NF4\n","* Double quantization\n","* Paged Optimizers"]},{"cell_type":"markdown","id":"dd97fb0b","metadata":{"papermill":{"duration":0.002663,"end_time":"2024-03-05T06:59:46.584026","exception":false,"start_time":"2024-03-05T06:59:46.581363","status":"completed"},"tags":[]},"source":["# Credit\n","\n","* https://maartengrootendorst.substack.com/p/which-quantization-method-is-right?utm_source=profile&utm_medium=reader2\n","* https://huggingface.co/docs/transformers/v4.37.0/quantization\n","* https://medium.com/towards-data-science/introduction-to-weight-quantization-2494701b9c0c\n","* https://towardsdatascience.com/4-bit-quantization-with-gptq-36b0f4f02c34"]}],"metadata":{"kaggle":{"accelerator":"none","dataSources":[],"dockerImageVersionId":30626,"isGpuEnabled":false,"isInternetEnabled":true,"language":"python","sourceType":"notebook"},"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.10.12"},"papermill":{"default_parameters":{},"duration":5.21622,"end_time":"2024-03-05T06:59:47.008268","environment_variables":{},"exception":null,"input_path":"__notebook__.ipynb","output_path":"__notebook__.ipynb","parameters":{},"start_time":"2024-03-05T06:59:41.792048","version":"2.4.0"}},"nbformat":4,"nbformat_minor":5}
\ No newline at end of file
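As a quick illustration of the two PTQ methods listed in the patched notebook, absmax and zero-point quantization can be sketched in a few lines of PyTorch. This is a minimal sketch rather than code from the linked notebooks; the weight tensor `W` and the function names are placeholders for the example.

```python
import torch

def absmax_quantize(W: torch.Tensor):
    # Map values onto [-127, 127] using the absolute maximum as the scale
    scale = 127 / torch.max(torch.abs(W))
    W_q = (scale * W).round().to(torch.int8)
    # Dequantize to inspect the rounding error introduced by quantization
    W_dq = W_q.to(torch.float32) / scale
    return W_q, W_dq

def zeropoint_quantize(W: torch.Tensor):
    # Use the full value range and shift it with a zero-point into [-128, 127]
    value_range = (W.max() - W.min()).clamp(min=1e-8)
    scale = 255 / value_range
    zeropoint = (-scale * W.min() - 128).round()
    W_q = torch.clamp((W * scale + zeropoint).round(), -128, 127).to(torch.int8)
    W_dq = (W_q.to(torch.float32) - zeropoint) / scale
    return W_q, W_dq

W = torch.randn(4, 4)          # placeholder weight matrix
print(absmax_quantize(W)[0])
print(zeropoint_quantize(W)[0])
```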
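The three LLM.int8() features named in the PTQ section (offloading, outlier threshold, and skipping module conversion) correspond to arguments of `BitsAndBytesConfig` in Hugging Face Transformers. A rough sketch, assuming `bitsandbytes` is installed and a GPU is available; the model id is only a placeholder.

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,                 # outlier threshold for mixed-precision decomposition
    llm_int8_enable_fp32_cpu_offload=True,  # allow offloading parts of the model to the CPU in fp32
    llm_int8_skip_modules=["lm_head"],      # modules that are not converted to 8-bit
)

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",                    # placeholder model id
    quantization_config=quant_config,
    device_map="auto",
)
```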
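GPTQ post-training quantization can likewise be driven through `GPTQConfig` in the same library. A sketch, assuming the `optimum` and `auto-gptq` backends are installed; the model id and calibration dataset are placeholders.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"              # placeholder model id
tokenizer = AutoTokenizer.from_pretrained(model_id)

gptq_config = GPTQConfig(
    bits=4,                                 # target 4-bit weights
    dataset="c4",                           # calibration data used while quantizing
    tokenizer=tokenizer,
)

# Quantization runs while the model is loaded; it needs a GPU
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=gptq_config,
    device_map="auto",
)
```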
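The NF4, double quantization, and paged optimizer pieces of QLoRA show up as configuration options when loading the base model in 4-bit and attaching LoRA adapters. A minimal sketch, assuming `bitsandbytes` and `peft` are installed; the model id and LoRA hyperparameters are placeholders, not the settings used in the linked notebook.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 weight format
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants too
    bnb_4bit_compute_dtype=torch.bfloat16,  # computation still runs in bf16
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",             # placeholder model id
    quantization_config=bnb_config,
    device_map="auto",
)

# The quantized base weights stay frozen; only the LoRA adapters are trained.
# Paged optimizers are chosen at training time, e.g. optim="paged_adamw_8bit" in TrainingArguments.
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
```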