
Commit d9b706d

ajrasane authored and soodoshll committed
[OMNIML-2244] Add E2E example for mixed precision quantization and ONNX export (NVIDIA#656)
## What does this PR do?

**Type of change:** New Feature

**Overview:**

- Enable ONNX export for auto quantized models
- Update documentation and changelog

## Usage

```bash
python torch_quant_to_onnx.py --quantize_mode=auto \
    --onnx_save_path=./vit_base_patch16_224.nvfp4_fp8.onnx \
    --calibration_data_size 64 \
    --auto_quantization_formats NVFP4_AWQ_LITE_CFG FP8_DEFAULT_CFG \
    --batch_size 128
```

## Testing

```bash
python evaluate.py --onnx_path=vit_base_patch16_224.nvfp4_fp8.onnx \
    --model_name=vit_base_patch16_224 \
    --results_path=./results.txt \
    --batch_size 128
```

Accuracy results:

```
The top1 accuracy of the model is 84.15%
The top5 accuracy of the model is 97.396%
```

Reference accuracy for FP16:

```
The top1 accuracy of the model is 85.102%
The top5 accuracy of the model is 97.526%
```

## Before your PR is "*Ready for review*"

- **Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/CONTRIBUTING.md)** and your commits are signed.
- **Is this change backward compatible?**: Yes
- **Did you write any new necessary tests?**: No
- **Did you add or update any necessary documentation?**: Yes
- **Did you update [Changelog](https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/CHANGELOG.rst)?**: Yes

---------

Signed-off-by: ajrasane <[email protected]>
1 parent 2508db8 commit d9b706d

14 files changed: +316 -78 lines changed

.github/CODEOWNERS

Lines changed: 1 addition & 0 deletions
@@ -50,6 +50,7 @@ modelopt/torch/utils @NVIDIA/modelopt-torch-utils-codeowners
 /examples/pruning @NVIDIA/modelopt-torch-nas-prune-codeowners
 /examples/specdec_bench @NVIDIA/modelopt-torch-speculative-codeowners
 /examples/speculative_decoding @NVIDIA/modelopt-torch-speculative-codeowners
+/examples/torch_onnx @NVIDIA/modelopt-onnx-codeowners
 /examples/vlm_ptq @NVIDIA/modelopt-examples-vlm-codeowners
 /examples/vllm_serve @NVIDIA/modelopt-examples-llm_ptq-codeowners
 /examples/windows @NVIDIA/modelopt-windows-codeowners

.github/workflows/example_tests.yml

Lines changed: 2 additions & 2 deletions
@@ -123,7 +123,7 @@ jobs:
     strategy:
       fail-fast: false
       matrix:
-        example: [diffusers, onnx_ptq]
+        example: [diffusers, torch_onnx]
     uses: ./.github/workflows/_example_tests_runner.yml
     secrets: inherit
     with:
@@ -137,7 +137,7 @@ jobs:
     strategy:
       fail-fast: false
       matrix:
-        example: [diffusers, onnx_ptq]
+        example: [diffusers, torch_onnx]
     uses: ./.github/workflows/_example_tests_runner.yml
     secrets: inherit
     with:

README.md

Lines changed: 1 addition & 1 deletion
@@ -119,7 +119,7 @@ more fine-grained control on installed dependencies or for alternative docker im
 | LLM Quantization | [View Support Matrix](./examples/llm_ptq/README.md#support-matrix) |
 | Diffusers Quantization | [View Support Matrix](./examples/diffusers/README.md#support-matrix) |
 | VLM Quantization | [View Support Matrix](./examples/vlm_ptq/README.md#support-matrix) |
-| ONNX Quantization | [View Support Matrix](./examples/onnx_ptq/README.md#onnx-export-supported-llm-models) |
+| ONNX Quantization | [View Support Matrix](./examples/torch_onnx/README.md#onnx-export-supported-llm-models) |
 | Windows Quantization | [View Support Matrix](./examples/windows/README.md#support-matrix) |
 | Quantization Aware Training | [View Support Matrix](./examples/llm_qat/README.md#support-matrix) |
 | Pruning | [View Support Matrix](./examples/pruning/README.md#support-matrix) |

docs/source/index.rst

Lines changed: 1 addition & 0 deletions
@@ -9,6 +9,7 @@ Welcome to Model Optimizer (ModelOpt) documentation!
    getting_started/[0-9]*
    Quick Start: PTQ - PyTorch <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/llm_ptq>
    Quick Start: PTQ - ONNX <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/onnx_ptq>
+   Quick Start: PTQ - PyTorch to ONNX <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/torch_onnx>
    Quick Start: PTQ - Windows <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/windows>
    Quick Start: QAT <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/llm_qat>
    Quick Start: Pruning <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/pruning>

examples/onnx_ptq/README.md

Lines changed: 2 additions & 60 deletions
@@ -12,10 +12,8 @@ Model Optimizer enables highly performant quantization formats including NVFP4,
 | :------------: | :------------: | :------------: | :------------: |
 | Pre-Requisites | Required & optional packages to use this technique | [Link](#pre-requisites) | |
 | Getting Started | Learn how to optimize your models using PTQ to reduce precision and improve inference efficiency | [Link](#getting-started) | [docs](https://nvidia.github.io/Model-Optimizer/guides/_onnx_quantization.html) |
-| Support Matrix | View the ONNX export supported LLM models | [Link](#onnx-export-supported-llm-models) | |
-| PyTorch to ONNX | Example scripts demonstrating how to quantize with PyTorch and then convert to ONNX | [Link](#torch-quantization-to-onnx-export-example) | |
+| PyTorch to ONNX | Example scripts demonstrating how to quantize with PyTorch and then convert to ONNX | [Link](../torch_onnx/) | |
 | Advanced Features | Examples demonstrating use advanced ONNX quantization features | [Link](#advanced-features) | |
-| Pre-Quantized Checkpoints | Ready to deploy Hugging Face pre-quantized checkpoints | [Link](#pre-quantized-checkpoints) | |
 | Resources | Extra links to relevant resources | [Link](#resources) | |
 
 </div>
@@ -80,7 +78,7 @@ python image_prep.py \
 
 The model can be quantized as an FP8, INT8 or INT4 model using either the CLI or Python API. For FP8 and INT8 quantization, you have a choice between `max` and `entropy` calibration algorithms. For INT4 quantization, [awq_clip](https://arxiv.org/abs/2306.00978) or [rtn_dq](https://ar5iv.labs.arxiv.org/html/2301.12017) algorithms can be chosen.
 
-> *For NVFP4 and MXFP8 ONNX, see the [PyTorch to ONNX section](#torch-quantization-to-onnx-export-example).*
+> *For NVFP4 and MXFP8 ONNX, see the [PyTorch to ONNX example](../torch_onnx/).*
 
 > *Minimum opset requirements: int8 (13+), fp8 (21+), int4 (21+). ModelOpt will automatically upgrade lower opset versions to meet these requirements.*
 
@@ -129,58 +127,6 @@ The top5 accuracy of the model is <accuracy score between 0-100%>
 Inference latency of the model is <X> ms
 ```
 
-## Torch quantization to ONNX export example
-
-This example demonstrates how to quantize a [timm](https://github.com/huggingface/pytorch-image-models) vision model for various precision formats followed by export to ONNX. The script leverages the ModelOpt toolkit for both quantization and ONNX export.
-
-> *Opset 20 is used to export the torch models to ONNX.*
-
-### What it does
-
-- Loads a pretrained timm torch model (default: ViT-Base).
-- Quantizes the torch model to MXFP8, INT4 or NVFP4 using ModelOpt.
-- Exports the quantized model to ONNX.
-- Postprocesses the ONNX model to be compatible with TensorRT.
-- Saves the final ONNX model.
-
-### Usage
-
-```bash
-python torch_quant_to_onnx.py \
-    --timm_model_name=vit_base_patch16_224 \
-    --quantize_mode=<fp8|mxfp8|int8|nvfp4|int4_awq> \
-    --onnx_save_path=<path to save the exported ONNX model>
-```
-
-### Evaluation
-
-If the input model is of type image classification, use the following script to evaluate it. The script automatically downloads and uses the [ILSVRC/imagenet-1k](https://huggingface.co/datasets/ILSVRC/imagenet-1k) dataset from Hugging Face. This gated repository requires authentication via Hugging Face access token. See <https://huggingface.co/docs/hub/en/security-tokens> for details.
-
-> *Note: TensorRT 10.11 or later is required to evaluate the MXFP8 or NVFP4 ONNX models.*
-
-```bash
-python evaluate.py \
-    --onnx_path=<path to the exported ONNX model> \
-    --imagenet_path=<HF dataset card or local path to the ImageNet dataset> \
-    --engine_precision=stronglyTyped \
-    --model_name=vit_base_patch16_224
-```
-
-### ONNX Export Supported LLM Models
-
-| Model | FP16 | INT4 | FP8 | NVFP4 |
-| :---: | :---: | :---: | :---: | :---: |
-| [Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) |||||
-| [Llama3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B) |||||
-| [Llama3.2-3B](https://huggingface.co/meta-llama/Llama-3.2-3B) |||||
-| [Qwen2-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) |||||
-| [Qwen2-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2-1.5B-Instruct) |||||
-| [Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct) |||||
-| [Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) |||||
-| [Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) |||||
-| [Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct) |||||
-| [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) |||||
-
 ## Advanced Features
 
 ### Per node calibration of ONNX models
@@ -273,10 +219,6 @@ trtexec --onnx=/path/to/identity_neural_network.quant.onnx \
     --staticPlugins=/path/to/libidentity_conv_iplugin_v2_io_ext.so
 ```
 
-## Pre-Quantized Checkpoints
-
-- Ready-to-deploy checkpoints that can be exported to ONNX format (if supported as per the [Support Matrix](#onnx-export-supported-llm-models)) \[[🤗 Hugging Face - Nvidia Model Optimizer Collection](https://huggingface.co/collections/nvidia/inference-optimized-checkpoints-with-model-optimizer)\]
-
 ## Resources
 
 - 📅 [Roadmap](https://github.com/NVIDIA/Model-Optimizer/issues/146)

examples/torch_onnx/README.md

Lines changed: 215 additions & 0 deletions
# Torch Quantization to ONNX Export

This example demonstrates how to quantize PyTorch models (vision and LLM) followed by export to ONNX format. The scripts leverage the ModelOpt toolkit for both quantization and ONNX export.

<div align="center">

| **Section** | **Description** | **Link** |
| :------------: | :------------: | :------------: |
| Pre-Requisites | Required packages to use this example | [Link](#pre-requisites) |
| Vision Models | Quantize timm models and export to ONNX | [Link](#vision-models) |
| LLM Export | Export LLMs to quantized ONNX | [Link](#llm-export) |
| Mixed Precision | Auto mode for optimal per-layer quantization | [Link](#mixed-precision-quantization-auto-mode) |
| Support Matrix | View the ONNX export supported LLM models | [Link](#onnx-export-supported-llm-models) |
| Resources | Extra links to relevant resources | [Link](#resources) |

</div>

## Pre-Requisites

### Docker

Please use the TensorRT docker image (e.g., `nvcr.io/nvidia/tensorrt:25.08-py3`) or visit our [installation docs](https://nvidia.github.io/Model-Optimizer/getting_started/2_installation.html) for more information.

Set the following environment variables inside the TensorRT docker:

```bash
export CUDNN_LIB_DIR=/usr/lib/x86_64-linux-gnu/
export LD_LIBRARY_PATH="${CUDNN_LIB_DIR}:${LD_LIBRARY_PATH}"
```

### Local Installation

Install Model Optimizer with `onnx` dependencies using `pip` from [PyPI](https://pypi.org/project/nvidia-modelopt/) and install the requirements for the example:

```bash
pip install -U "nvidia-modelopt[onnx]"
pip install -r requirements.txt
```

For TensorRT Compiler framework workloads, install the latest [TensorRT](https://developer.nvidia.com/tensorrt) from [here](https://developer.nvidia.com/tensorrt/download).

## Vision Models

The `torch_quant_to_onnx.py` script quantizes [timm](https://github.com/huggingface/pytorch-image-models) vision models and exports them to ONNX.

### What it does

- Loads a pretrained timm torch model (default: ViT-Base).
- Quantizes the torch model to FP8, MXFP8, INT8, NVFP4, or INT4_AWQ using ModelOpt.
- Exports the quantized model to ONNX.
- Postprocesses the ONNX model to be compatible with TensorRT.
- Saves the final ONNX model.

> *Opset 20 is used to export the torch models to ONNX.*
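
For orientation, this flow corresponds roughly to the minimal sketch below. It is not one of the example's files: it assumes ModelOpt's `mtq.quantize` with `FP8_DEFAULT_CFG`, a single dummy calibration batch, and plain `torch.onnx.export`, whereas the actual script adds per-format configuration and TensorRT-oriented post-processing.

```python
# Minimal sketch, not the example script: assumes ModelOpt's mtq.quantize API,
# a single dummy calibration batch, and plain torch.onnx.export at opset 20.
import timm
import torch
import modelopt.torch.quantization as mtq

model = timm.create_model("vit_base_patch16_224", pretrained=True).cuda().eval()
dummy_input = torch.randn(1, 3, 224, 224, device="cuda")

def forward_loop(m):
    # Calibration pass; the real script iterates over ImageNet calibration batches.
    with torch.no_grad():
        m(dummy_input)

# FP8 shown here; other formats use the corresponding mtq config the same way.
quantized = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)

# The example exports at opset 20 and then post-processes the graph for TensorRT.
torch.onnx.export(quantized, dummy_input, "vit_base_patch16_224.fp8.onnx", opset_version=20)
```
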
### Usage

```bash
python torch_quant_to_onnx.py \
    --timm_model_name=vit_base_patch16_224 \
    --quantize_mode=<fp8|mxfp8|int8|nvfp4|int4_awq> \
    --onnx_save_path=<path to save the exported ONNX model>
```

### Evaluation

If the input model is an image classification model, use the following script to evaluate it. The script automatically downloads and uses the [ILSVRC/imagenet-1k](https://huggingface.co/datasets/ILSVRC/imagenet-1k) dataset from Hugging Face. This gated repository requires authentication via a Hugging Face access token; see <https://huggingface.co/docs/hub/en/security-tokens> for details.

> *Note: TensorRT 10.11 or later is required to evaluate the MXFP8 or NVFP4 ONNX models.*

```bash
python ../onnx_ptq/evaluate.py \
    --onnx_path=<path to the exported ONNX model> \
    --imagenet_path=<HF dataset card or local path to the ImageNet dataset> \
    --engine_precision=stronglyTyped \
    --model_name=vit_base_patch16_224
```

## LLM Export

The `llm_export.py` script exports LLM models to ONNX with optional quantization.

### What it does

- Loads a HuggingFace LLM model (local path or model name).
- Optionally quantizes the model to FP8, INT4_AWQ, or NVFP4.
- Exports the model to ONNX format.
- Post-processes the ONNX graph for TensorRT compatibility.
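
Under the hood, the load-and-quantize portion of this flow looks roughly like the sketch below. It is not the `llm_export.py` implementation: it assumes Hugging Face `transformers` loading plus ModelOpt's `mtq.quantize` with `INT4_AWQ_CFG`, uses a toy calibration set, and omits the ONNX export and TensorRT post-processing steps.

```python
# Sketch of the load-and-quantize portion only, under the assumptions stated above.
import torch
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).cuda().eval()

# Toy calibration set; a real run would use a few hundred representative samples.
calib_texts = ["Hello, world!", "Quantization trades a little accuracy for a smaller model."]

def forward_loop(m):
    for text in calib_texts:
        inputs = tokenizer(text, return_tensors="pt").to(m.device)
        with torch.no_grad():
            m(**inputs)

# INT4 AWQ weight-only quantization; FP8 and NVFP4 configs follow the same pattern.
model = mtq.quantize(model, mtq.INT4_AWQ_CFG, forward_loop)
```
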
### Usage

```bash
python llm_export.py \
    --hf_model_path=<HuggingFace model name or local path> \
    --dtype=<fp16|fp8|int4_awq|nvfp4> \
    --output_dir=<directory to save ONNX model>
```

### Examples

Export Qwen2 to FP16 ONNX:

```bash
python llm_export.py \
    --hf_model_path=Qwen/Qwen2-0.5B-Instruct \
    --dtype=fp16 \
    --output_dir=./qwen2_fp16
```

Export Qwen2 to FP8 ONNX with quantization:

```bash
python llm_export.py \
    --hf_model_path=Qwen/Qwen2-0.5B-Instruct \
    --dtype=fp8 \
    --output_dir=./qwen2_fp8
```

Export to NVFP4 with custom calibration:

```bash
python llm_export.py \
    --hf_model_path=Qwen/Qwen3-0.6B \
    --dtype=nvfp4 \
    --calib_size=512 \
    --output_dir=./qwen3_nvfp4
```

### Key Parameters

| Parameter | Description |
| :--- | :--- |
| `--hf_model_path` | HuggingFace model name (e.g., `Qwen/Qwen2-0.5B-Instruct`) or local model path |
| `--dtype` | Export precision: `fp16`, `fp8`, `int4_awq`, or `nvfp4` |
| `--output_dir` | Directory to save the exported ONNX model |
| `--calib_size` | Number of calibration samples for quantization (default: 512) |
| `--lm_head` | Precision of lm_head layer (default: `fp16`) |
| `--save_original` | Save the raw ONNX before post-processing |
| `--trust_remote_code` | Trust remote code when loading from HuggingFace Hub |

## Mixed Precision Quantization (Auto Mode)

The `auto` mode enables mixed precision quantization by searching for the optimal quantization format per layer. This approach balances model accuracy and compression by assigning different precision formats (e.g., NVFP4, FP8) to different layers based on their sensitivity.

### How it works

1. **Sensitivity Analysis**: Computes per-layer sensitivity scores using gradient-based analysis
2. **Format Search**: Searches across specified quantization formats for each layer
3. **Constraint Optimization**: Finds the optimal format assignment that satisfies the effective bits constraint while minimizing accuracy loss
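
The sketch below is a purely illustrative, self-contained model of this kind of constrained search: layer names, weight counts, and sensitivity scores are made up, the strategy is a simplified greedy loop, and bit costs are flat 4/8 bits per weight. ModelOpt's actual auto mode uses its own gradient-based scoring and search implementation.

```python
# Illustrative-only model of the per-layer format search described above.
# Layer names, weight counts, and sensitivity scores are made up; ModelOpt's
# auto mode uses gradient-based scores and its own search, not this greedy loop.
FORMAT_BITS = {"NVFP4": 4.0, "FP8": 8.0}  # simplified bits/weight, ignoring scale overhead

def assign_formats(num_weights, sensitivity, effective_bits_budget):
    """Start every layer at NVFP4, then upgrade the most sensitive layers to FP8
    as long as the weighted-average bits/weight stays within the budget."""
    assignment = dict.fromkeys(num_weights, "NVFP4")
    total = sum(num_weights.values())

    def effective_bits():
        return sum(FORMAT_BITS[assignment[n]] * num_weights[n] for n in num_weights) / total

    for name in sorted(sensitivity, key=sensitivity.get, reverse=True):
        assignment[name] = "FP8"
        if effective_bits() > effective_bits_budget:
            assignment[name] = "NVFP4"  # upgrading this layer would exceed the budget
    return assignment

num_weights = {"attn_qkv": 1_000_000, "mlp_fc1": 4_000_000, "mlp_fc2": 5_000_000}
sensitivity = {"attn_qkv": 0.9, "mlp_fc1": 0.5, "mlp_fc2": 0.1}
print(assign_formats(num_weights, sensitivity, effective_bits_budget=4.8))
# -> {'attn_qkv': 'FP8', 'mlp_fc1': 'NVFP4', 'mlp_fc2': 'NVFP4'} (effective bits = 4.4)
```
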
### Key Parameters

| Parameter | Default | Description |
| :--- | :---: | :--- |
| `--effective_bits` | 4.8 | Target average bits per weight across the model. Lower values = more compression but potentially lower accuracy. The search algorithm finds the optimal per-layer format assignment that meets this constraint while minimizing accuracy loss. For example, 4.8 means an average of 4.8 bits per weight (mix of FP4 and FP8 layers). |
| `--num_score_steps` | 128 | Number of forward/backward passes used to compute per-layer sensitivity scores via gradient-based analysis. Higher values provide more accurate sensitivity estimates but increase search time. Recommended range: 64-256. |
| `--calibration_data_size` | 512 | Number of calibration samples used for both sensitivity scoring and calibration. For auto mode, labels are required for loss computation. |
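
As a quick sanity check on how an `--effective_bits` target of 4.8 can be met (illustrative arithmetic only, using flat 4- and 8-bit weight costs and ignoring per-block scale overhead):

```python
# If the search keeps 80% of the weights in NVFP4 (~4 bits) and 20% in FP8 (8 bits):
nvfp4_bits, fp8_bits = 4, 8
frac_nvfp4 = 0.80
print(frac_nvfp4 * nvfp4_bits + (1 - frac_nvfp4) * fp8_bits)  # 4.8 -> meets --effective_bits=4.8
```
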
### Usage

```bash
python torch_quant_to_onnx.py \
    --timm_model_name=vit_base_patch16_224 \
    --quantize_mode=auto \
    --auto_quantization_formats NVFP4_AWQ_LITE_CFG FP8_DEFAULT_CFG \
    --effective_bits=4.8 \
    --num_score_steps=128 \
    --calibration_data_size=512 \
    --evaluate \
    --onnx_save_path=vit_base_patch16_224.auto_quant.onnx
```

### Results (ViT-Base)

| | Top-1 accuracy (torch) | Top-5 accuracy (torch) |
| :--- | :---: | :---: |
| Torch autocast (FP16) | 85.11% | 97.53% |
| NVFP4 Quantized | 84.558% | 97.36% |
| Auto Quantized (FP8 + NVFP4, 4.78 effective bits) | 84.726% | 97.434% |

## ONNX Export Supported LLM Models

| Model | FP16 | INT4 | FP8 | NVFP4 |
| :---: | :---: | :---: | :---: | :---: |
| [Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) |||||
| [Llama3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B) |||||
| [Llama3.2-3B](https://huggingface.co/meta-llama/Llama-3.2-3B) |||||
| [Qwen2-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) |||||
| [Qwen2-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2-1.5B-Instruct) |||||
| [Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct) |||||
| [Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) |||||
| [Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) |||||
| [Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct) |||||
| [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) |||||

## Resources

- 📅 [Roadmap](https://github.com/NVIDIA/Model-Optimizer/issues/146)
- 📖 [Documentation](https://nvidia.github.io/Model-Optimizer)
- 🎯 [Benchmarks](../benchmark.md)
- 💡 [Release Notes](https://nvidia.github.io/Model-Optimizer/reference/0_changelog.html)
- 🐛 [File a bug](https://github.com/NVIDIA/Model-Optimizer/issues/new?template=1_bug_report.md)
- [File a Feature Request](https://github.com/NVIDIA/Model-Optimizer/issues/new?template=2_feature_request.md)

### Technical Resources

There are many quantization schemes supported in the example scripts:

1. The [FP8 format](https://developer.nvidia.com/blog/nvidia-arm-and-intel-publish-fp8-specification-for-standardization-as-an-interchange-format-for-ai/) is available on Hopper and Ada GPUs with [CUDA compute capability](https://developer.nvidia.com/cuda-gpus) greater than or equal to 8.9.

1. [INT4 AWQ](https://arxiv.org/abs/2306.00978) is an INT4 weight-only quantization and calibration method. INT4 AWQ is particularly effective for low-batch inference, where latency is dominated by weight-loading time rather than the computation itself. For low-batch inference, INT4 AWQ can give lower latency than FP8/INT8 and lower accuracy degradation than INT8.

1. [NVFP4](https://blogs.nvidia.com/blog/generative-ai-studio-ces-geforce-rtx-50-series/) is one of the new FP4 formats supported by NVIDIA Blackwell GPUs and demonstrates good accuracy compared with other 4-bit alternatives. NVFP4 can be applied to both model weights and activations, providing the potential for a significant increase in math throughput and reductions in memory footprint and memory bandwidth usage compared to the FP8 data format on Blackwell.
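
Since FP8 support depends on the GPU generation (compute capability 8.9 or higher, as noted above), a quick check such as the following generic PyTorch snippet (not part of the example scripts) can help confirm the target GPU before picking a quantization format:

```python
# Generic PyTorch capability check; 8.9+ corresponds to the FP8 requirement above.
import torch

major, minor = torch.cuda.get_device_capability()
print(f"GPU: {torch.cuda.get_device_name()} (compute capability {major}.{minor})")
print("FP8-capable (>= 8.9):", (major, minor) >= (8, 9))
```
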

examples/torch_onnx/requirements.txt

Lines changed: 4 additions & 0 deletions

datasets>=2.14.4
timm
torchvision
transformers

examples/onnx_ptq/torch_quant_to_onnx.py renamed to examples/torch_onnx/torch_quant_to_onnx.py

Lines changed: 5 additions & 6 deletions
@@ -15,6 +15,11 @@
 
 import argparse
 import re
+import sys
+from pathlib import Path
+
+# Add onnx_ptq to path for shared modules
+sys.path.insert(0, str(Path(__file__).parent.parent / "onnx_ptq"))
 
 import timm
 import torch
@@ -323,12 +328,6 @@ def main():
     )
     print(f"Quantized Model - Top-1 Accuracy: {top1:.2f}%, Top-5 Accuracy: {top5:.2f}%")
 
-    if args.quantize_mode in ["auto"]:
-        print(
-            f"The selected quantization mode {args.quantize_mode} is not supported for ONNX export yet."
-        )
-        return
-
     # Export to ONNX
     export_to_onnx(
         quantized_model,
