Static quant support for SmoothQuant #3089
base: main
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/3089
Note: Links to docs will display an error until the docs builds have been completed.
❌ 4 New Failures as of commit abfce41 with merge base 4013764.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@Xia-Weiwen I think it's better to wait until the Int8Tensor migration is done
Thanks for the info
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = (
-    AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16)
+    AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
nit: torch_dtype is deprecated; please check #2982 for more info
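For reference, a minimal sketch of the non-deprecated call, assuming a recent transformers release where dtype replaces torch_dtype (the model id below is only a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM

model_id = "facebook/opt-125m"  # placeholder model id for illustration

# `dtype=` is the current keyword; `torch_dtype=` still works in recent
# transformers releases but emits a deprecation warning.
model = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16)
```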
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = (
-    AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16)
+    AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
same here
torch.manual_seed(34)
w8a8_model = (
-    AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16)
+    AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
and here :)
model_save_path: str,
model_save_hf_hub_path: str,
static_quant_act: bool,
compile: bool,
Could you share results for torch.compile with static quant? I'm not sure of the reason, but it decreased tokens/sec with dynamic quant, and we discussed removing it at #2728 (comment).
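For context, a minimal sketch of the toggle being discussed; only the compile flag comes from the diff above, and the helper name here is hypothetical:

```python
import torch

def maybe_compile(model: torch.nn.Module, compile_model: bool) -> torch.nn.Module:
    # Gate torch.compile behind the benchmark's `compile` flag so tokens/sec
    # can be compared with and without compilation.
    return torch.compile(model) if compile_model else model
```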
Summary
This PR adds static quant support for SmoothQuant by adding a new Int8StaticActivationInt8WeightConfig configuration. Static quantization generally has better latency and throughput than dynamic quant because it avoids the overhead of selecting qparams at runtime.

In the implementation:
- SmoothQuantObserver returns the activation scale along with the smoothing factor.
- Int8StaticActivationInt8WeightConfig is used for the transformation of each linear layer.

Note that Int8StaticActivationInt8WeightConfig is not suitable for general static quantization (although it works); users should use PT2E in that case. This is because the activation scale in the config is global rather than per-linear-layer, which is the same behavior as Float8StaticActivationFloat8WeightConfig.
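A rough usage sketch of the flow described above. The config and observer names come from this PR's summary, but the exact import paths, the prepare/convert steps, and the calibration code are assumptions and may not match the final API:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from torchao.quantization import quantize_

# Assumed import path; the prototype SmoothQuant module layout may differ.
from torchao.prototype.smoothquant import (
    Int8StaticActivationInt8WeightConfig,
    SmoothQuantConfig,
)

model_id = "facebook/opt-125m"  # placeholder model for illustration
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16)

base_config = Int8StaticActivationInt8WeightConfig()

# 1. Prepare: insert SmoothQuant observers so calibration can record the
#    smoothing factor and the activation scale.
quantize_(model, SmoothQuantConfig(base_config=base_config, step="prepare"))

# Calibrate on a few representative prompts (placeholder data).
with torch.no_grad():
    for prompt in ["Hello, my name is", "The capital of France is"]:
        model(**tokenizer(prompt, return_tensors="pt"))

# 2. Convert: transform each linear layer to static int8 activation /
#    int8 weight, reusing the calibrated (global) activation scale instead
#    of selecting qparams at runtime.
quantize_(model, SmoothQuantConfig(base_config=base_config, step="convert"))
```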
Test plan
This PR also updates the test cases for SmoothQuant: