[Intel GPU] Docs of XPUInductorQuantizer #3293
base: main
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/tutorials/3293
Note: Links to docs will display an error until the docs builds have been completed. This comment was automatically generated by Dr. CI and updates every 15 minutes.
::

quantizer = XPUInductorQuantizer()
quantizer.set_global(get_xpu_inductor_symm_quantization_config())
The code format has not taken effect.
Thanks for reminding, added the fix.
@@ -96,6 +96,13 @@ Prototype features are not available as part of binary distributions like PyPI o
   :link: ../prototype/pt2e_quant_x86_inductor.html
   :tags: Quantization

.. customcarditem::
   :header: PyTorch 2 Export Quantization with Intel GPU Backend through Inductor
Intel XPU
At the earlier stage, when we uploaded the RFCs, we recommended using GPU instead of XPU for user readability. Has this naming decision changed?
@@ -0,0 +1,234 @@
PyTorch 2 Export Quantization with Intel GPU Backend through Inductor
Intel XPU
ditto
Suggested change:
- PyTorch 2 Export Quantization with Intel GPU Backend through Inductor
+ Export Quantization with Intel GPU Backend through Inductor
utilizes PyTorch 2 Export Quantization flow and lowers the quantized model into the inductor.

The pytorch 2 export quantization flow uses the torch.export to capture the model into a graph and perform quantization transformations on top of the ATen graph.
This approach is expected to have significantly higher model coverage, better programmability, and a simplified UX.
This approach is expected to have significantly higher model coverage with better programmability and a simplified user experience.
Thanks for the suggestions, modified.
The quantization flow mainly includes three steps:

- Step 1: Capture the FX Graph from the eager Model based on the `torch export mechanism <https://pytorch.org/docs/main/export.html>`_.
- Step 2: Apply the Quantization flow based on the captured FX Graph, including defining the backend-specific quantizer, generating the prepared model with observers,
Apply the quantization flow based on the captured FX Graph, including defining the backend-specific quantizer, generating the prepared model with observers,
Thanks for the suggestions; I have changed the description here.
performing the prepared model's calibration, and converting the prepared model into the quantized model.
- Step 3: Lower the quantized model into inductor with the API ``torch.compile``.

During Step 3, the inductor would decide which kernels are dispatched into. There are two kinds of kernels the Intel GPU would obtain benefits, oneDNN kernels and triton kernels. `Intel oneAPI Deep Neural Network Library (oneDNN) <https://github.com/uxlfoundation/oneDNN>`_ contains
If this is end-user documentation, I think we should focus on PyTorch itself and remove this explanation.
Thanks for the suggestion. I removed the lengthy description of oneDNN and Triton and instead added a brief mention in Step 3 above.
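For readers following the thread, here is a minimal sketch of the three-step flow quoted above. The import path for ``XPUInductorQuantizer`` and the use of ``torch.export.export_for_training`` are assumptions based on recent PyTorch releases; the model and inputs are illustrative, and an Intel GPU with an XPU-enabled PyTorch build is assumed.

    import torch
    import torchvision.models as models
    from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
    from torch.ao.quantization.quantizer.xpu_inductor_quantizer import (
        XPUInductorQuantizer,
        get_xpu_inductor_symm_quantization_config,
    )

    # Step 0: eager model and example inputs on the XPU device (illustrative)
    model = models.resnet18().eval().to("xpu")
    example_inputs = (torch.randn(1, 3, 224, 224, device="xpu"),)

    # Step 1: capture the FX Graph via the torch export mechanism
    exported_model = torch.export.export_for_training(model, example_inputs).module()

    # Step 2: define the backend-specific quantizer, insert observers,
    # calibrate with representative data, and convert to the quantized model
    quantizer = XPUInductorQuantizer()
    quantizer.set_global(get_xpu_inductor_symm_quantization_config())
    prepared_model = prepare_pt2e(exported_model, quantizer)
    prepared_model(*example_inputs)          # calibration pass
    converted_model = convert_pt2e(prepared_model)

    # Step 3: lower the quantized model into Inductor with torch.compile
    with torch.no_grad():
        optimized_model = torch.compile(converted_model)
        optimized_model(*example_inputs)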
Post Training Quantization
----------------------------

Static quantization is the only method we support currently. QAT and dynamic quantization will be available in later versions.
Remove the forward-looking context from the current introduction: "QAT and dynamic quantization will be available in later versions."
Thanks for the suggestion, removed.
::

pip install torchvision pytorch-triton-xpu --index-url https://download.pytorch.org/whl/nightly/xpu
Let's use the standard "pip install torch torchvision torchaudio" command, not a separate command that highlights internal dependencies.
We may need to keep using our own channels: torchvision is customized for XPU, and we need users to be able to run the example in this doc successfully; the standard channel would produce a runtime error. After syncing with @jingxu10, I changed it to use ``pip3 install torch torchvision torchaudio pytorch-triton-xpu --index-url https://download.pytorch.org/whl/xpu``, instead of the nightly wheel.
The high-level architecture of this flow could look like this:

.. image:: ../_static/img/pt2e_quant_xpu_inductor.png
Please note that Float Model, Example Input, and XPUInductorQuantizer are invisible in dark mode.
Thanks for reminding; the picture is modified.
PyTorch 2 Export Quantization with Intel GPU Backend through Inductor
==================================================================

**Author**: `Yan Zhiwei <https://github.com/ZhiweiYan-96>`_, `Wang Eikan <https://github.com/EikanWang>`_, `Zhang, Liangang <https://github.com/liangan1>`_, `Liu River <https://github.com/riverliuintel>`_, `Cui Yifeng <https://github.com/CuiYifeng>`_
Please unify the style of names.
Thanks, modified.
quant_min=-128,
quant_max=127,
qscheme=torch.per_tensor_symmetric,
Please consider whether we need more detailed annotations here to explain the meaning of these key parameters to users.
Thanks, the explanation has been added.
dtype=torch.int8,
quant_min=-128,
quant_max=127,
qscheme=torch.per_channel_symmetric,
Ditto.
Thanks, the explanation has been added.
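As a hedged illustration of what such parameter annotations could cover, the sketch below uses the generic PT2E ``QuantizationSpec`` API; the observer choices and field values here are illustrative assumptions, not necessarily what ``get_xpu_inductor_symm_quantization_config`` returns.

    import torch
    from torch.ao.quantization.observer import HistogramObserver, PerChannelMinMaxObserver
    from torch.ao.quantization.quantizer import QuantizationSpec

    # Activations: one scale/zero-point for the whole tensor (per-tensor symmetric),
    # clamped to the signed int8 range [-128, 127].
    act_spec = QuantizationSpec(
        dtype=torch.int8,                     # quantized storage type
        quant_min=-128,                       # lower bound of the int8 range
        quant_max=127,                        # upper bound of the int8 range
        qscheme=torch.per_tensor_symmetric,   # symmetric around zero, single scale
        is_dynamic=False,
        observer_or_fake_quant_ctr=HistogramObserver,  # illustrative observer choice
    )

    # Weights: a separate scale per output channel (per-channel symmetric),
    # which usually preserves accuracy better for Conv/Linear weights.
    weight_spec = QuantizationSpec(
        dtype=torch.int8,
        quant_min=-128,
        quant_max=127,
        qscheme=torch.per_channel_symmetric,  # one scale per channel along ch_axis
        ch_axis=0,                            # output-channel dimension
        is_dynamic=False,
        observer_or_fake_quant_ctr=PerChannelMinMaxObserver,  # illustrative observer choice
    )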
--------------

This tutorial introduces XPUInductorQuantizer aiming for serving the quantized model inference on Intel GPUs. The tutorial will cover how it
utilizes PyTorch 2 Export Quantization flow and lowers the quantized model into the inductor.
What are you trying to say in this phrase: "lowers the quantized model into the inductor"?
It's the terminology used in torch.compile: "lowering" here means translating the captured, quantized graph into the kernels Inductor generates.
optimized_model(*example_inputs)

Put all these codes together, we will have the toy example code.
No need to have this sentence, just delete it.
Thanks for the advice, removed.
Put all these codes together, we will have the toy example code.
Please note that since the Inductor ``freeze`` feature does not turn on by default yet, run your example code with ``TORCHINDUCTOR_FREEZING=1``.
Perhaps it might be better to put this near the top so developers are aware they need to run with this environment variable set.
Thanks for the suggestions; I moved it to the top of the example code section.
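A small sketch of how the flag can be supplied (the script name is hypothetical; setting the variable in the shell before launching Python is the simplest route):

    # Option A (shell): TORCHINDUCTOR_FREEZING=1 python pt2e_xpu_example.py
    # Option B (Python): set the variable before torch is imported, so Inductor
    # picks it up when its config module is loaded.
    import os
    os.environ["TORCHINDUCTOR_FREEZING"] = "1"

    import torch  # imported after the flag is set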
Reviewed for grammar and sentence flow. General feedback.
Co-authored-by: alexsin368 <[email protected]>
optimized_model(*example_inputs)

In a more advanced scenario, int8-mixed-bf16 quantization comes into play. In this instance,
a convolution or GEMM operator produces the output in BFloat16 instead of Float32 in the absence
Suggested change:
- a convolution or GEMM operator produces the output in BFloat16 instead of Float32 in the absence
+ a Convolution or GEMM operator produces the output in BFloat16 instead of Float32 in the absence
or
- a convolution or GEMM operator produces the output in BFloat16 instead of Float32 in the absence
+ a Conv or GEMM operator produces the output in BFloat16 instead of Float32 in the absence
Thanks for the suggestion. We may keep it as is, since it is a common noun here.
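For readers of this thread, a minimal sketch of the int8-mixed-bf16 path under BFloat16 autocast, assuming the ``converted_model`` and ``example_inputs`` from the earlier steps and XPU autocast support:

    import torch

    with torch.no_grad(), torch.amp.autocast("xpu", dtype=torch.bfloat16):
        # Under autocast, a Conv/GEMM output that is not immediately re-quantized
        # can stay in BFloat16 instead of Float32, saving memory bandwidth.
        optimized_model = torch.compile(converted_model)
        optimized_model(*example_inputs)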
--------------

This tutorial introduces XPUInductorQuantizer, which aims to serve quantized models for inference on Intel GPUs.
It utilizes the PyTorch 2 Export Quantization flow and lowers the quantized model into the inductor.
Can we standardize the capitalization of Inductor?
Thanks for reminding; the style is now aligned.
Hi @svekars @AlannaBurke, could you please help review our documentation? This PR serves as a tutorial for PT2E int8 on the Intel GPU backend. We appreciate your feedback and suggestions.
A few editorial suggestions.
---------------

- `PyTorch 2 Export Post Training Quantization <https://pytorch.org/tutorials/prototype/pt2e_quant_ptq.html>`_
- `TorchInductor and torch.compile concepts in PyTorch <https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html>`_
Suggested change:
  - `TorchInductor and torch.compile concepts in PyTorch <https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html>`_
+ - PyTorch 2.7 or later
@@ -0,0 +1,234 @@
PyTorch 2 Export Quantization with Intel GPU Backend through Inductor
Suggested change:
- PyTorch 2 Export Quantization with Intel GPU Backend through Inductor
+ Export Quantization with Intel GPU Backend through Inductor
Introduction
--------------

This tutorial introduces XPUInductorQuantizer, which aims to serve quantized models for inference on Intel GPUs.
Suggested change:
- This tutorial introduces XPUInductorQuantizer, which aims to serve quantized models for inference on Intel GPUs.
+ This tutorial introduces ``XPUInductorQuantizer``, which aims to serve quantized models for inference on Intel GPUs.
--------------

This tutorial introduces XPUInductorQuantizer, which aims to serve quantized models for inference on Intel GPUs.
It utilizes the PyTorch 2 Export Quantization flow and lowers the quantized model into the inductor.
Suggested change:
- It utilizes the PyTorch 2 Export Quantization flow and lowers the quantized model into the inductor.
+ ``XPUInductorQuantizer`` uses the PyTorch Export Quantization flow and lowers the quantized model into the inductor.
This tutorial introduces XPUInductorQuantizer, which aims to serve quantized models for inference on Intel GPUs.
It utilizes the PyTorch 2 Export Quantization flow and lowers the quantized model into the inductor.

The Pytorch 2 Export Quantization flow uses `torch.export` to capture the model into a graph and perform quantization transformations on top of the ATen graph.
Do we need to call it "PyTorch 2 Export Quantization flow" or can it be just "Export Quantization flow"?

Suggested change:
- The Pytorch 2 Export Quantization flow uses `torch.export` to capture the model into a graph and perform quantization transformations on top of the ATen graph.
+ The PyTorch 2 Export Quantization flow uses ``torch.export`` to capture the model into a graph and perform quantization transformations on top of the ATen graph.
The Pytorch 2 Export Quantization flow uses `torch.export` to capture the model into a graph and perform quantization transformations on top of the ATen graph.
This approach is expected to have significantly higher model coverage with better programmability and a simplified user experience.
TorchInductor is the compiler backend that compiles the FX Graphs generated by TorchDynamo into optimized C++/Triton kernels.
Suggested change:
- TorchInductor is the compiler backend that compiles the FX Graphs generated by TorchDynamo into optimized C++/Triton kernels.
+ TorchInductor is a compiler backend that transforms FX Graphs generated by ``TorchDynamo`` into optimized C++/Triton kernels.
- Step 1: Capture the FX Graph from the eager model based on the `torch export mechanism <https://pytorch.org/docs/main/export.html>`_.
- Step 2: Apply the quantization flow based on the captured FX Graph, including defining the backend-specific quantizer, generating the prepared model with observers,
performing the prepared model's calibration, and converting the prepared model into the quantized model.
- Step 3: Lower the quantized model into inductor with the API ``torch.compile``, which would call triton kernels or oneDNN GEMM/Convolution kernels.
Suggested change:
- - Step 3: Lower the quantized model into inductor with the API ``torch.compile``, which would call triton kernels or oneDNN GEMM/Convolution kernels.
+ - Step 3: Lower the quantized model into inductor with the API ``torch.compile``, which would call Triton kernels or oneDNN GEMM/Convolution kernels.
pip3 install torch torchvision torchaudio pytorch-triton-xpu --index-url https://download.pytorch.org/whl/xpu

Please note that since the inductor ``freeze`` feature does not turn on by default yet, run your example code with ``TORCHINDUCTOR_FREEZING=1``.
Suggested change:
- Please note that since the inductor ``freeze`` feature does not turn on by default yet, run your example code with ``TORCHINDUCTOR_FREEZING=1``.
+ Please note that since the inductor ``freeze`` feature does not turn on by default yet, you must run your example code with ``TORCHINDUCTOR_FREEZING=1``.
quantizer.set_global(get_xpu_inductor_symm_quantization_config())

After the backend-specific quantizer is imported, prepare the model for post-training quantization.
``prepare_pt2e`` folds BatchNorm operators into preceding Conv2d operators, and inserts observers into appropriate places in the model.
Suggested change:
- ``prepare_pt2e`` folds BatchNorm operators into preceding Conv2d operators, and inserts observers into appropriate places in the model.
+ ``prepare_pt2e`` folds ``BatchNorm`` operators into preceding Conv2d operators, and inserts observers into appropriate places in the model.
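A brief sketch of the prepare/calibrate/convert sequence described in the quoted text; ``exported_model``, ``quantizer``, and ``calibration_batches`` are stand-ins for objects created in the earlier steps.

    from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e

    # prepare_pt2e folds BatchNorm into the preceding Conv2d and inserts observers.
    prepared_model = prepare_pt2e(exported_model, quantizer)

    # Calibration: run a few representative batches so the observers collect
    # activation statistics.
    for calib_inputs in calibration_batches:   # placeholder iterable of input tuples
        prepared_model(*calib_inputs)

    # Convert the observed model into the actual quantized model.
    converted_model = convert_pt2e(prepared_model)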
optimized_model = torch.compile(converted_model)

# Running some benchmark
optimized_model(*example_inputs)
Please write a conclusion for this tutorial. For example:
Suggested change:
  optimized_model(*example_inputs)
+
+ Conclusion
+ -----------
+ In this tutorial, we have learned how to utilize the ``XPUInductorQuantizer`` to perform post-training quantization on models for inference on Intel GPUs, leveraging PyTorch 2's Export Quantization flow. We covered the step-by-step process of capturing an FX Graph, applying quantization, and lowering the quantized model into the inductor backend using ``torch.compile``. Additionally, we explored the benefits of using int8-mixed-bf16 quantization for improved memory efficiency and potential performance gains, especially when using ``BFloat16`` autocast.
Description
Add tutorials for XPUInductorQuantizer, which serves as the INT8 quantization backend for Intel GPU inside PT2E.