When the process is killed by the Linux OOM killer, the whole terminal is killed, so there is no trace of which sample #ID led to the OOM. If the run was left unattended, I have to guess, because no log is available anymore.
After fixing the CPU OOM, I hit a GPU OOM and had to redo 40 min of CPU computation (on an overclocked Ryzen 9 9950X). It would be far more user-friendly to do a GPU test run at the very beginning, ensuring ahead of time that all hardware requirements are met, instead of surprising people after they have already committed hours.
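A preflight check along these lines could be a minimal sketch of the idea. Everything here is an assumption, not an existing llm-compressor API: the function names and the memory model (parameters plus a flat activation budget) are hypothetical, and with PyTorch the free-memory figure could come from `torch.cuda.mem_get_info()`.

```python
# Hypothetical preflight estimate (assumed names, not llm-compressor API):
# compare a worst-case GPU memory need against free memory *before* starting
# hours of CPU-side calibration, so hardware shortfalls surface immediately.

def estimated_gpu_bytes(num_params: int, bytes_per_param: int,
                        activation_bytes: int) -> int:
    """Rough upper bound on GPU memory a quantization pass will need."""
    return num_params * bytes_per_param + activation_bytes

def fits_on_gpu(num_params: int, bytes_per_param: int,
                activation_bytes: int, free_bytes: int,
                safety_margin: float = 0.9) -> bool:
    """True if the estimate fits within free_bytes, with a safety margin.

    With PyTorch, free_bytes could be obtained from torch.cuda.mem_get_info().
    """
    need = estimated_gpu_bytes(num_params, bytes_per_param, activation_bytes)
    return need <= free_bytes * safety_margin
```

Running this check (or simply allocating the worst-case buffer once and freeing it) at startup would have caught the GPU OOM before the 40 min of CPU work.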
GPTQ quantization can fail due to numerical instability, e.g. when no solution is found for the Hessian.
Those failures should definitely be logged to disk.
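An append-only log would cover both the Hessian failures above and the OOM traceability problem: since each record is flushed immediately, the file survives even if the OOM killer takes down the process. This is a sketch of the request, not an existing llm-compressor feature; `log_event` and its fields are hypothetical names.

```python
# Hypothetical append-only failure log (assumed helper, not llm-compressor
# API): write one JSON line per event and flush immediately, so an unattended
# run killed by the OOM killer still leaves a trace of where it died.
import json
import time

def log_event(path: str, **fields) -> None:
    """Append one JSON record; flushed so it survives an abrupt kill."""
    record = {"ts": time.time(), **fields}
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
        f.flush()
```

Logging an event before processing each sample means the last line of the file identifies the sample that triggered the OOM.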
It would save a great deal of time if GPTQ could be restarted from the layers that already converged, allowing more samples, longer sequence lengths, or a different dampening fraction for the follow-up layers that failed.
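The resume behavior requested above could be sketched as a per-layer checkpoint index. This is an assumption about how resumable GPTQ might look, not an existing API: the directory layout and function names are hypothetical, and real code would also persist each layer's quantized weights alongside the marker.

```python
# Hypothetical per-layer checkpoint index (assumed design, not llm-compressor
# API): after each layer converges, record it in a checkpoint directory; on
# restart, only unrecorded layers are re-quantized, possibly with more samples
# or a different dampening fraction.
import os

def mark_layer_done(ckpt_dir: str, layer_idx: int) -> None:
    """Record that a layer converged (real code would also save its weights)."""
    os.makedirs(ckpt_dir, exist_ok=True)
    open(os.path.join(ckpt_dir, f"layer_{layer_idx}.done"), "w").close()

def layers_to_run(ckpt_dir: str, num_layers: int) -> list[int]:
    """Layer indices that still need quantization after a crash."""
    if not os.path.isdir(ckpt_dir):
        return list(range(num_layers))
    done = {
        int(name.split("_")[1].split(".")[0])
        for name in os.listdir(ckpt_dir)
        if name.startswith("layer_") and name.endswith(".done")
    }
    return [i for i in range(num_layers) if i not in done]
```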
Hi @mratsim , thank you for the report; this helps us understand the pain points of using llm-compressor.
Regarding your OOM errors, #1426 and neuralmagic/compressed-tensors#301 should go a long way to resolving these. The latter has some nice memory usage charts to show we shouldn't hit OOM when saving.
Regarding GPTQ, I think the concern with offloading hessians is the large storage requirement, but maybe a few could be logged, cc @kylesayrs
Regarding checkpointing, that could be a good feature to add to the sequential pipelines, but we'd have to think about how to expose it. Hopefully these latest enhancements will remove the need for a lot of this.
SUMMARY:
- Add QuantizationMixin to AWQModifier so we don't have redundant inputs
(num_bits, symmetric, group_size)
- Move AWQModifier to sequential pipelines, to avoid huge memory
requirements of caching all activations at once.
Regression test results are acceptable: all results are roughly the same and within stderr; see the test plan below.
Resolves #1409
Resolves #1369
Related to #1383
Related to #1406
Related to #1368
Related to #1410
More improvements split into #1435
TEST PLAN:
- [x] Rerun tests to validate
No regression in tests, comparing against those reported in [original
AWQ
PR](#1177 (comment)).
All gsm8k results are within stderr:
| Type | gsm8k | wikitext |
| ------ | ------ | ----- |
| Old AWQ+QuantModifier Sym | .1054, .1069 | 9.1931 |
| New AWQ+QuantMixin Sym | .1077, .1084 | 9.1841 |
| Old AWQ+QuantModifier Asym | .1274, .1281 | 9.0281 |
| New AWQ+QuantMixin Asym | .1312, .1350 | 9.0288 |
---------
Signed-off-by: Brian Dellabetta <[email protected]>
Co-authored-by: Kyle Sayers <[email protected]>
Inspired by trying to push through #1409 and GPTQ experiments.
It's incredibly frustrating to be fixing quantization workflows only to have new failures force you to redo hours of quantization.
Some stories: