Kaggle Notebook | Quantization Techniques | Version 11
Showing 1 changed file with 1 addition and 1 deletion.
@@ -1 +1 @@
{"metadata":{"kernelspec":{"language":"python","display_name":"Python 3","name":"python3"},"language_info":{"pygments_lexer":"ipython3","nbconvert_exporter":"python","version":"3.6.4","file_extension":".py","codemirror_mode":{"name":"ipython","version":3},"name":"python","mimetype":"text/x-python"},"kaggle":{"accelerator":"none","dataSources":[],"dockerImageVersionId":30626,"isInternetEnabled":true,"language":"python","sourceType":"notebook","isGpuEnabled":false}},"nbformat_minor":4,"nbformat":4,"cells":[{"source":"<a href=\"https://www.kaggle.com/code/aisuko/quantization-techniques?scriptVersionId=165063839\" target=\"_blank\"><img align=\"left\" alt=\"Kaggle\" title=\"Open in Kaggle\" src=\"https://kaggle.com/static/images/open-in-kaggle.svg\"></a>","metadata":{},"cell_type":"markdown"},{"cell_type":"markdown","source":"# Overview\n\nAs LLMs grow in size, various techniques have been developed to shrink the model size, such as quantization. Quantization techniques focus on representing data with less information while trying not to lose too much accuracy. This often means converting a data type to represent the same information with fewer bits. For example, if the model weights are stored as 32-bit floating points and are quantized to 16-bit floating points, this halves the model size, which makes it easier to store and reduces memory usage. Lower precision can also speed up inference because it takes less time to perform calculations with fewer bits.\n\nIn this notebook, let's review the quantization techniques we have used so far.","metadata":{"_uuid":"8f2839f25d086af736a60e9eeb907d3b93b6e0e5","_cell_guid":"b1076dfc-b9ad-4769-8c92-a6c4dae69d19"}},{"cell_type":"markdown","source":"# Activation-Aware Weight Quantization (AWQ)\n\nIt observes the activations rather than the weights. See an example in [Quantization Methods](https://www.kaggle.com/code/aisuko/quantization-methods?scriptVersionId=160183672)","metadata":{}},{"cell_type":"markdown","source":"# Weight Quantization\n\nIt has two main families.\n\n\n## Post-Training Quantization (PTQ)\n\nWe have discussed native INT8 quantization, which has two methods; see [Introduction to weight quantization](https://www.kaggle.com/code/aisuko/introduction-to-weight-quantization):\n\n* Absolute maximum (absmax)\n* Zero-point quantization\n\nFurthermore, we use LLM.int8(). It relies on a vector-wise (absmax) quantization scheme and introduces mixed-precision quantization. It has three useful features:\n\n* Offloading\n* Outlier threshold\n* Skip module conversion\n\n\n## Generative Pre-trained Transformers (GPTQ)\n\n* Arbitrary order insight\n* Lazy batch-updates\n* Cholesky reformulation\n\nSee [Quantization with GPTQ](https://www.kaggle.com/code/aisuko/quantization-with-gptq)\n\n\n# Quantization-Aware Training (QAT)\n\nWe have not used any methods from this family yet.","metadata":{}},{"cell_type":"markdown","source":"# Fine-tuning approaches\n\nSome fine-tuning approaches, such as the \"adapter-tuning\" technique, quantize the model before training starts.\n\n\n## QLoRA\n\nIt quantizes the base model, freezes its weights, and trains LoRA adapters on top. See [Fine-tuning Llama2 with QLoRA](https://www.kaggle.com/code/aisuko/fine-tuning-llama2-with-qlora)\n\n* NF4\n* Double quantization\n* Paged Optimizers","metadata":{}},{"cell_type":"markdown","source":"# Credit\n\n* https://maartengrootendorst.substack.com/p/which-quantization-method-is-right?utm_source=profile&utm_medium=reader2\n* https://huggingface.co/docs/transformers/v4.37.0/quantization\n* https://medium.com/towards-data-science/introduction-to-weight-quantization-2494701b9c0c\n* https://towardsdatascience.com/4-bit-quantization-with-gptq-36b0f4f02c34","metadata":{}}]}
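The absmax and zero-point PTQ methods listed in the notebook above can be sketched in a few lines of NumPy. This is a toy illustration of the two mappings, not the bitsandbytes implementation, and the `weights` array is invented for the example:

```python
import numpy as np

def absmax_quantize(x):
    # Scale by the absolute maximum so values map symmetrically into [-127, 127]
    scale = 127 / np.max(np.abs(x))
    q = np.round(scale * x).astype(np.int8)
    return q, scale

def zeropoint_quantize(x):
    # Map the full [min, max] range onto [-128, 127] using a scale and an
    # offset (the zero-point), so asymmetric distributions use the whole range
    value_range = np.max(x) - np.min(x)
    value_range = value_range if value_range != 0 else 1
    scale = 255 / value_range
    zeropoint = np.round(-scale * np.min(x) - 128)
    q = np.clip(np.round(scale * x + zeropoint), -128, 127).astype(np.int8)
    return q, scale, zeropoint

weights = np.array([0.5, -1.2, 0.0, 2.3, -0.7])

q_abs, s_abs = absmax_quantize(weights)
deq_abs = q_abs / s_abs  # dequantize back to float

q_zp, s_zp, zp = zeropoint_quantize(weights)
deq_zp = (q_zp.astype(np.float32) - zp) / s_zp

print("absmax max error:    ", np.abs(weights - deq_abs).max())
print("zero-point max error:", np.abs(weights - deq_zp).max())
```

Both schemes lose a little precision to rounding; the dequantized values only approximate the originals, which is the accuracy trade-off the overview describes.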
{"cells":[{"source":"<a href=\"https://www.kaggle.com/code/aisuko/quantization-techniques?scriptVersionId=165506354\" target=\"_blank\"><img align=\"left\" alt=\"Kaggle\" title=\"Open in Kaggle\" src=\"https://kaggle.com/static/images/open-in-kaggle.svg\"></a>","metadata":{},"cell_type":"markdown"},{"cell_type":"markdown","id":"68342cf8","metadata":{"papermill":{"duration":0.003662,"end_time":"2024-03-05T06:59:46.573853","exception":false,"start_time":"2024-03-05T06:59:46.570191","status":"completed"},"tags":[]},"source":["# Overview\n","\n","As LLMs grow in size, various techniques have been developed to shrink the model size, such as quantization. Quantization techniques focus on representing data with less information while trying not to lose too much accuracy. This often means converting a data type to represent the same information with fewer bits. For example, if the model weights are stored as 32-bit floating points and are quantized to 16-bit floating points, this halves the model size, which makes it easier to store and reduces memory usage. Lower precision can also speed up inference because it takes less time to perform calculations with fewer bits.\n","\n","In this notebook, let's review the quantization techniques we have used so far.\n","\n","\n","# Activation-Aware Weight Quantization (AWQ)\n","\n","It observes the activations rather than the weights. See an example in [Quantization Methods](https://www.kaggle.com/code/aisuko/quantization-methods)\n","\n","\n","# Weight Quantization\n","\n","**It has two main families.**\n","\n","\n","## Post-Training Quantization (PTQ)\n","\n","We have discussed native INT8 quantization, which has two methods; see [Introduction to weight quantization](https://www.kaggle.com/code/aisuko/introduction-to-weight-quantization):\n","\n","* Absolute maximum (absmax)\n","* Zero-point quantization\n","\n","Furthermore, we use LLM.int8(). It relies on a vector-wise (absmax) quantization scheme and introduces mixed-precision quantization. It has three useful features:\n","\n","* Offloading\n","* Outlier threshold\n","* Skip module conversion\n","\n","\n","## Generative Pre-trained Transformers (GPTQ)\n","\n","* Arbitrary order insight\n","* Lazy batch-updates\n","* Cholesky reformulation\n","\n","See [Quantization with GPTQ](https://www.kaggle.com/code/aisuko/quantization-with-gptq)\n","\n","\n","# Quantization-Aware Training (QAT)\n","\n","We have not used any methods from this family yet."]},{"cell_type":"markdown","id":"d823c86d","metadata":{"papermill":{"duration":0.002208,"end_time":"2024-03-05T06:59:46.578968","exception":false,"start_time":"2024-03-05T06:59:46.57676","status":"completed"},"tags":[]},"source":["# NF4 (QLoRA)\n","\n","It quantizes the base model, freezes its weights, and trains LoRA adapters on top. See [Fine-tuning Llama2 with QLoRA](https://www.kaggle.com/code/aisuko/fine-tuning-llama2-with-qlora)\n","\n","* NF4\n","* Double quantization\n","* Paged Optimizers"]},{"cell_type":"markdown","id":"dd97fb0b","metadata":{"papermill":{"duration":0.002663,"end_time":"2024-03-05T06:59:46.584026","exception":false,"start_time":"2024-03-05T06:59:46.581363","status":"completed"},"tags":[]},"source":["# Credit\n","\n","* https://maartengrootendorst.substack.com/p/which-quantization-method-is-right?utm_source=profile&utm_medium=reader2\n","* https://huggingface.co/docs/transformers/v4.37.0/quantization\n","* https://medium.com/towards-data-science/introduction-to-weight-quantization-2494701b9c0c\n","* https://towardsdatascience.com/4-bit-quantization-with-gptq-36b0f4f02c34"]}],"metadata":{"kaggle":{"accelerator":"none","dataSources":[],"dockerImageVersionId":30626,"isGpuEnabled":false,"isInternetEnabled":true,"language":"python","sourceType":"notebook"},"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.10.12"},"papermill":{"default_parameters":{},"duration":5.21622,"end_time":"2024-03-05T06:59:47.008268","environment_variables":{},"exception":null,"input_path":"__notebook__.ipynb","output_path":"__notebook__.ipynb","parameters":{},"start_time":"2024-03-05T06:59:41.792048","version":"2.4.0"}},"nbformat":4,"nbformat_minor":5}
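The three LLM.int8() features (offloading, outlier threshold, skip module conversion) and the QLoRA bullets (NF4, double quantization) map onto flags of `BitsAndBytesConfig` in the `transformers` library. The sketch below is a config fragment only; `"lm_head"` is an illustrative choice of module to skip, and either config would be passed as `quantization_config=` to `AutoModelForCausalLM.from_pretrained`:

```python
import torch
from transformers import BitsAndBytesConfig

# LLM.int8(): 8-bit loading with the three features from the notebook
int8_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,                 # outlier threshold for mixed precision
    llm_int8_skip_modules=["lm_head"],      # skip module conversion (illustrative)
    llm_int8_enable_fp32_cpu_offload=True,  # offload fp32 parts to CPU
)

# QLoRA-style 4-bit: NF4 data type with double quantization
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
```

Paged optimizers, the third QLoRA bullet, are configured on the training side (e.g. the `optim` argument of `TrainingArguments`) rather than in this quantization config.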