Kaggle Notebook | Quantization Techniques | Version 11
Showing 1 changed file with 1 addition and 1 deletion.
@@ -1 +1 @@
{"metadata":{"kernelspec":{"language":"python","display_name":"Python 3","name":"python3"},"language_info":{"pygments_lexer":"ipython3","nbconvert_exporter":"python","version":"3.6.4","file_extension":".py","codemirror_mode":{"name":"ipython","version":3},"name":"python","mimetype":"text/x-python"},"kaggle":{"accelerator":"none","dataSources":[],"dockerImageVersionId":30626,"isInternetEnabled":true,"language":"python","sourceType":"notebook","isGpuEnabled":false}},"nbformat_minor":4,"nbformat":4,"cells":[{"source":"<a href=\"https://www.kaggle.com/code/aisuko/quantization-techniques?scriptVersionId=165063839\" target=\"_blank\"><img align=\"left\" alt=\"Kaggle\" title=\"Open in Kaggle\" src=\"https://kaggle.com/static/images/open-in-kaggle.svg\"></a>","metadata":{},"cell_type":"markdown"},{"cell_type":"markdown","source":"# Overview\n\nAs LLMs grow in size, various techniques have been developed to shrink the model size, such as quantization. Quantization techniques focus on representing data with less information while trying not to lose too much accuracy. This often means converting a data type to represent the same information with fewer bits. For example, if the model weights are stored as 32-bit floating points and are quantized to 16-bit floating points, this halves the model size, which makes it easier to store and reduces memory usage. Lower precision can also speed up inference because it takes less time to perform calculations with fewer bits.\n\nIn this notebook, let's review the quantization techniques we have used so far.","metadata":{"_uuid":"8f2839f25d086af736a60e9eeb907d3b93b6e0e5","_cell_guid":"b1076dfc-b9ad-4769-8c92-a6c4dae69d19"}},{"cell_type":"markdown","source":"# Activation-Aware Weight Quantization (AWQ)\n\nIt observes the activations rather than the weights. See an example in [Quantization Methods](https://www.kaggle.com/code/aisuko/quantization-methods?scriptVersionId=160183672)","metadata":{}},{"cell_type":"markdown","source":"# Weight Quantization\n\nIt has two main families.\n\n\n## Post-Training Quantization (PTQ)\n\nWe have discussed native INT8 quantization, which has two methods; see [Introduction to weight quantization](https://www.kaggle.com/code/aisuko/introduction-to-weight-quantization):\n\n* Absolute maximum (absmax)\n* Zero-point quantization\n\nFurthermore, we use LLM.int8(). It relies on a vector-wise (absmax) quantization scheme and introduces mixed-precision quantization. It has three useful features:\n\n* Offloading\n* Outlier threshold\n* Skip module conversion\n\n\n## Generative Pre-trained Transformers (GPTQ)\n\n* Arbitrary order insight\n* Lazy batch-updates\n* Cholesky reformulation\n\nSee [Quantization with GPTQ](https://www.kaggle.com/code/aisuko/quantization-with-gptq)\n\n\n# Quantization-Aware Training (QAT)\n\nWe have not used any methods from this family yet.","metadata":{}},{"cell_type":"markdown","source":"# Fine-tuning approaches\n\nSome fine-tuning approaches, such as the \"adapter-tuning\" technique, quantize the model before training starts.\n\n\n## QLoRA\n\nIt quantizes the base model, freezes its weights, and trains LoRA adapters on top. See [Fine-tuning Llama2 with QLoRA](https://www.kaggle.com/code/aisuko/fine-tuning-llama2-with-qlora)\n\n* NF4\n* Double quantization\n* Paged Optimizers","metadata":{}},{"cell_type":"markdown","source":"# Credit\n\n* https://maartengrootendorst.substack.com/p/which-quantization-method-is-right?utm_source=profile&utm_medium=reader2\n* https://huggingface.co/docs/transformers/v4.37.0/quantization\n* https://medium.com/towards-data-science/introduction-to-weight-quantization-2494701b9c0c\n* https://towardsdatascience.com/4-bit-quantization-with-gptq-36b0f4f02c34","metadata":{}}]}
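The absmax and zero-point PTQ methods listed in the notebook above can be sketched in a few lines of NumPy. This is a toy illustration of the two mappings, not the bitsandbytes implementation, and the `weights` array is invented for the example:

```python
import numpy as np

def absmax_quantize(x):
    # Scale by the absolute maximum so values map symmetrically into [-127, 127]
    scale = 127 / np.max(np.abs(x))
    q = np.round(scale * x).astype(np.int8)
    return q, scale

def zeropoint_quantize(x):
    # Map the full [min, max] range onto [-128, 127] using a scale and an
    # offset (the zero-point), so asymmetric distributions use the whole range
    value_range = np.max(x) - np.min(x)
    value_range = value_range if value_range != 0 else 1
    scale = 255 / value_range
    zeropoint = np.round(-scale * np.min(x) - 128)
    q = np.clip(np.round(scale * x + zeropoint), -128, 127).astype(np.int8)
    return q, scale, zeropoint

weights = np.array([0.5, -1.2, 0.0, 2.3, -0.7])

q_abs, s_abs = absmax_quantize(weights)
deq_abs = q_abs / s_abs  # dequantize back to float

q_zp, s_zp, zp = zeropoint_quantize(weights)
deq_zp = (q_zp.astype(np.float32) - zp) / s_zp

print("absmax max error:    ", np.abs(weights - deq_abs).max())
print("zero-point max error:", np.abs(weights - deq_zp).max())
```

Both schemes lose a little precision to rounding; the dequantized values only approximate the originals, which is the accuracy trade-off the overview describes.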
{"cells":[{"source":"<a href=\"https://www.kaggle.com/code/aisuko/quantization-techniques?scriptVersionId=165506354\" target=\"_blank\"><img align=\"left\" alt=\"Kaggle\" title=\"Open in Kaggle\" src=\"https://kaggle.com/static/images/open-in-kaggle.svg\"></a>","metadata":{},"cell_type":"markdown"},{"cell_type":"markdown","id":"68342cf8","metadata":{"papermill":{"duration":0.003662,"end_time":"2024-03-05T06:59:46.573853","exception":false,"start_time":"2024-03-05T06:59:46.570191","status":"completed"},"tags":[]},"source":["# Overview\n","\n","As LLMs grow in size, various techniques have been developed to shrink the model size, such as quantization. Quantization techniques focus on representing data with less information while trying not to lose too much accuracy. This often means converting a data type to represent the same information with fewer bits. For example, if the model weights are stored as 32-bit floating points and are quantized to 16-bit floating points, this halves the model size, which makes it easier to store and reduces memory usage. Lower precision can also speed up inference because it takes less time to perform calculations with fewer bits.\n","\n","In this notebook, let's review the quantization techniques we have used so far.\n","\n","\n","# Activation-Aware Weight Quantization (AWQ)\n","\n","It observes the activations rather than the weights. See an example in [Quantization Methods](https://www.kaggle.com/code/aisuko/quantization-methods)\n","\n","\n","# Weight Quantization\n","\n","**It has two main families.**\n","\n","\n","## Post-Training Quantization (PTQ)\n","\n","We have discussed native INT8 quantization, which has two methods; see [Introduction to weight quantization](https://www.kaggle.com/code/aisuko/introduction-to-weight-quantization):\n","\n","* Absolute maximum (absmax)\n","* Zero-point quantization\n","\n","Furthermore, we use LLM.int8(). It relies on a vector-wise (absmax) quantization scheme and introduces mixed-precision quantization. It has three useful features:\n","\n","* Offloading\n","* Outlier threshold\n","* Skip module conversion\n","\n","\n","## Generative Pre-trained Transformers (GPTQ)\n","\n","* Arbitrary order insight\n","* Lazy batch-updates\n","* Cholesky reformulation\n","\n","See [Quantization with GPTQ](https://www.kaggle.com/code/aisuko/quantization-with-gptq)\n","\n","\n","# Quantization-Aware Training (QAT)\n","\n","We have not used any methods from this family yet."]},{"cell_type":"markdown","id":"d823c86d","metadata":{"papermill":{"duration":0.002208,"end_time":"2024-03-05T06:59:46.578968","exception":false,"start_time":"2024-03-05T06:59:46.57676","status":"completed"},"tags":[]},"source":["# NF4 (QLoRA)\n","\n","It quantizes the base model, freezes its weights, and trains LoRA adapters on top. See [Fine-tuning Llama2 with QLoRA](https://www.kaggle.com/code/aisuko/fine-tuning-llama2-with-qlora)\n","\n","* NF4\n","* Double quantization\n","* Paged Optimizers"]},{"cell_type":"markdown","id":"dd97fb0b","metadata":{"papermill":{"duration":0.002663,"end_time":"2024-03-05T06:59:46.584026","exception":false,"start_time":"2024-03-05T06:59:46.581363","status":"completed"},"tags":[]},"source":["# Credit\n","\n","* https://maartengrootendorst.substack.com/p/which-quantization-method-is-right?utm_source=profile&utm_medium=reader2\n","* https://huggingface.co/docs/transformers/v4.37.0/quantization\n","* https://medium.com/towards-data-science/introduction-to-weight-quantization-2494701b9c0c\n","* https://towardsdatascience.com/4-bit-quantization-with-gptq-36b0f4f02c34"]}],"metadata":{"kaggle":{"accelerator":"none","dataSources":[],"dockerImageVersionId":30626,"isGpuEnabled":false,"isInternetEnabled":true,"language":"python","sourceType":"notebook"},"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.10.12"},"papermill":{"default_parameters":{},"duration":5.21622,"end_time":"2024-03-05T06:59:47.008268","environment_variables":{},"exception":null,"input_path":"__notebook__.ipynb","output_path":"__notebook__.ipynb","parameters":{},"start_time":"2024-03-05T06:59:41.792048","version":"2.4.0"}},"nbformat":4,"nbformat_minor":5}
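The three LLM.int8() features (offloading, outlier threshold, skip module conversion) and the QLoRA bullets (NF4, double quantization) map onto flags of `BitsAndBytesConfig` in the `transformers` library. The sketch below is a config fragment only; `"lm_head"` is an illustrative choice of module to skip, and either config would be passed as `quantization_config=` to `AutoModelForCausalLM.from_pretrained`:

```python
import torch
from transformers import BitsAndBytesConfig

# LLM.int8(): 8-bit loading with the three features from the notebook
int8_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,                 # outlier threshold for mixed precision
    llm_int8_skip_modules=["lm_head"],      # skip module conversion (illustrative)
    llm_int8_enable_fp32_cpu_offload=True,  # offload fp32 parts to CPU
)

# QLoRA-style 4-bit: NF4 data type with double quantization
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
```

Paged optimizers, the third QLoRA bullet, are configured on the training side (e.g. the `optim` argument of `TrainingArguments`) rather than in this quantization config.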