6 changes: 6 additions & 0 deletions docs/source/en/quantization/awq.md
@@ -127,6 +127,7 @@ The [TheBloke/Mistral-7B-OpenOrca-AWQ](https://huggingface.co/TheBloke/Mistral-7

<figcaption class="text-center text-gray-500 text-lg">Fused module</figcaption>


| Batch Size | Prefill Length | Decode Length | Prefill tokens/s | Decode tokens/s | Memory (VRAM) |
|-------------:|-----------------:|----------------:|-------------------:|------------------:|:----------------|
| 1 | 32 | 32 | 81.4899 | 80.2569 | 4.00 GB (5.05%) |
@@ -180,6 +181,7 @@ model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quant

The parameter `modules_to_fuse` should include the following keys (a configuration sketch follows the list):


- `"attention"`: The names of the attention layers to fuse in the following order: query, key, value and output projection layer. If you don't want to fuse these layers, pass an empty list.
- `"layernorm"`: The names of all the LayerNorm layers you want to replace with a custom fused LayerNorm. If you don't want to fuse these layers, pass an empty list.
- `"mlp"`: The names of the MLP layers you want to fuse into a single MLP layer in the order: (gate (dense, layer, post-attention) / up / down layers).
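For concreteness, here is a partial sketch of such a mapping. The layer names and model dimensions below are assumptions for a Mistral-7B-style checkpoint, not values from this PR; adjust them to match the model you are fusing, and fill in the remaining keys described in this list in the same way.

```python
from transformers import AwqConfig

# Partial sketch: projection/norm names and model dimensions are assumptions
# for a Mistral-7B-style model; adapt them to your checkpoint.
quantization_config = AwqConfig(
    bits=4,
    fuse_max_seq_len=512,
    modules_to_fuse={
        "attention": ["q_proj", "k_proj", "v_proj", "o_proj"],  # query, key, value, output projections
        "layernorm": ["input_layernorm", "post_attention_layernorm", "norm"],  # LayerNorms to replace with a fused version
        "mlp": ["gate_proj", "up_proj", "down_proj"],  # gate / up / down projections
        "use_alibi": False,
        "num_attention_heads": 32,
        "num_key_value_heads": 8,
        "hidden_size": 4096,
    },
)
```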
@@ -231,7 +233,11 @@ Note this feature is supported on AMD GPUs.

</Tip>

<Tip>

**Important:** The minimum required Python version for using `autoawq` is now 3.9. Ensure your environment meets this requirement to avoid compatibility issues.

</Tip>
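For example, a quick pre-install check could look like this (a minimal sketch; it is not part of `autoawq` itself):

```python
import sys

# Fail fast if the interpreter is older than the 3.9 minimum required by autoawq.
if sys.version_info < (3, 9):
    raise RuntimeError(f"autoawq requires Python >= 3.9, found {sys.version.split()[0]}")
```
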
## CPU support

Recent versions of `autoawq` support CPU with ipex op optimizations. To get started, first install the latest version of `autoawq` by running:
4 changes: 3 additions & 1 deletion docs/source/en/quantization/contribute.md
@@ -32,7 +32,7 @@ Before integrating a new quantization method into Transformers, ensure the metho
class Linear4bit(nn.Module):
    def __init__(self, ...):
        ...

    def forward(self, x):
        return my_4bit_kernel(x, self.weight, self.bias)
```
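To make that pattern concrete, here is a minimal, self-contained sketch under stated assumptions: `FakeLinear4bit` and `replace_linear` are illustrative names (not a real kernel or Transformers API), the quantization is a naive per-tensor symmetric scheme, and dequantization happens on the fly instead of inside a fused kernel.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FakeLinear4bit(nn.Module):
    """Hypothetical stand-in for a 4-bit linear layer: weights are stored as
    signed 4-bit integers (kept in int8 for simplicity) plus a scale."""

    def __init__(self, in_features, out_features, bias=True):
        super().__init__()
        weight = torch.randn(out_features, in_features)
        # Per-tensor symmetric quantization to the signed 4-bit range [-8, 7].
        self.register_buffer("scale", weight.abs().max() / 7.0)
        self.register_buffer("qweight", torch.clamp((weight / self.scale).round(), -8, 7).to(torch.int8))
        self.bias = nn.Parameter(torch.zeros(out_features)) if bias else None

    def forward(self, x):
        # A real 4-bit kernel would fuse dequantization and the matmul.
        return F.linear(x, self.qweight.float() * self.scale, self.bias)


def replace_linear(module):
    # Recursively swap nn.Linear instances for the quantized stand-in,
    # mirroring the "replace some instances of nn.Linear" integration pattern.
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            setattr(module, name, FakeLinear4bit(child.in_features, child.out_features, child.bias is not None))
        else:
            replace_linear(child)
    return module


model = replace_linear(nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4)))
print(model(torch.randn(2, 16)).shape)  # torch.Size([2, 4])
```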
@@ -44,6 +44,7 @@ This way, Transformers models can be easily quantized by replacing some instance

Some quantization methods may require "pre-quantizing" the models through data calibration (e.g., AWQ). In this case, we prefer to only support inference in Transformers and let the third-party library maintained by the ML community handle the model quantization itself.

- Ensure that the environment meets the minimum Python version requirement of 3.9.

## Build a new HFQuantizer class

1. Create a new quantization config class inside [src/transformers/utils/quantization_config.py](https://github.com/huggingface/transformers/blob/abbffc4525566a48a9733639797c812301218b83/src/transformers/utils/quantization_config.py) and make sure to expose the new quantization config inside Transformers main `init` by adding it to the [`_import_structure`](https://github.com/huggingface/transformers/blob/abbffc4525566a48a9733639797c812301218b83/src/transformers/__init__.py#L1088) object of [src/transformers/__init__.py](https://github.com/huggingface/transformers/blob/abbffc4525566a48a9733639797c812301218b83/src/transformers/__init__.py).
@@ -64,6 +65,7 @@ For some quantization methods, they may require "pre-quantizing" the models thro

6. Write the `_process_model_after_weight_loading` method. This method enables implementing additional features that require manipulating the model after loading the weights.


7. Document everything! Make sure your quantization method is documented by adding a new file under `docs/source/en/quantization` and adding a new row in the table in `docs/source/en/quantization/overview.md`.

8. Add tests! You should add tests by first adding the package in our nightly Dockerfile inside `docker/transformers-quantization-latest-gpu` and then adding a new test file in `tests/quantization/xxx`. Feel free to check out how it is implemented for other quantization methods.
7 changes: 4 additions & 3 deletions docs/source/ja/main_classes/quantization.md
@@ -30,6 +30,8 @@ rendered properly in your Markdown viewer.

To run the code below, the following requirements must be installed:

- Python 3.9 or later is required.

- Install the latest `AutoGPTQ` library.
Install it by running `pip install auto-gptq`.

@@ -43,7 +45,6 @@ rendered properly in your Markdown viewer.
Run `pip install --upgrade accelerate`.

Note that the GPTQ integration currently supports only text models, so you may encounter unexpected behavior with vision, speech, or multimodal models.

### Load and quantize a model

GPTQ is a quantization method that requires weight calibration before the quantized model is used. If you quantize a Transformers model from scratch, producing the quantized model can take some time (about 5 minutes for the `facebook/opt-350m` model on Google Colab).
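As a brief illustration (a minimal sketch: the 4-bit setting and the "c4" calibration dataset are placeholder choices, not a recommendation), calibration-based quantization is triggered by passing a `GPTQConfig` when loading the model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-350m"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Illustrative settings: 4-bit weights calibrated on samples from the "c4" dataset.
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

# Calibration runs during loading, which is why quantizing from scratch takes a few minutes.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", quantization_config=gptq_config)
```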
@@ -193,7 +194,7 @@ model_4bit = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", load_in_4
torch.float32
```

### FP4 quantization
### FP4 quantization

#### Requirements

@@ -442,6 +443,6 @@ Thanks to the official support for adapters in the Hugging Face ecosystem

[[autodoc]] BitsAndBytesConfig

## Quantization with 🤗 `optimum`
## Quantization with 🤗 `optimum`

For more details on the quantization methods supported by `optimum`, see the [Optimum documentation](https://huggingface.co/docs/optimum/index) and check whether they apply to your use case.
5 changes: 3 additions & 2 deletions docs/source/zh/main_classes/quantization.md
@@ -139,8 +139,9 @@ model = AutoModelForCausalLM.from_pretrained("TheBloke/zephyr-7B-alpha-AWQ", att
- Install the latest version of the `accelerate` library:
`pip install --upgrade accelerate`

Note that the GPTQ integration currently supports only text models; you may encounter unexpected results with vision, speech, or multimodal models.
- Python 3.9 or later is required

Note that the GPTQ integration currently supports only text models; you may encounter unexpected results with vision, speech, or multimodal models.

### Load and quantize a model

GPTQ is a quantization method that requires weight calibration before the quantized model is used. If you want to quantize a Transformers model from scratch, producing the quantized model can take some time (about 5 minutes for the `facebook/opt-350m` model on Google Colab).
@@ -307,7 +308,7 @@ torch.float32
```


### FP4 quantization
### FP4 quantization

#### Requirements
