[Docs] Add GPTQModel #14056
base: main
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
# GPTQModel

To create a new [2, 3, 4, 8]-bit GPTQ quantized model, you can leverage [GPTQModel](https://github.com/ModelCloud/GPTQModel) from ModelCloud.AI.
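For context, a minimal quantization sketch is shown below. It assumes GPTQModel's `QuantizeConfig` / `GPTQModel.load` / `quantize` / `save` interface, and the model ID, output path, and calibration dataset are placeholders; check the GPTQModel README for the exact current API.

```python
# Sketch only: assumed GPTQModel API and placeholder names.
from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig

model_id = "meta-llama/Llama-3.2-1B-Instruct"          # placeholder source model
quant_path = "Llama-3.2-1B-Instruct-gptqmodel-4bit"    # placeholder output dir

# A small calibration set is enough for GPTQ-style quantization.
calibration_dataset = load_dataset(
    "allenai/c4",
    data_files="en/c4-train.00001-of-01024.json.gz",
    split="train",
).select(range(1024))["text"]

quant_config = QuantizeConfig(bits=4, group_size=128)  # 2/3/8-bit configs also accepted

model = GPTQModel.load(model_id, quant_config)
model.quantize(calibration_dataset, batch_size=2)      # raise batch_size to match your VRAM
model.save(quant_path)
```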
I was curious about this earlier and didn't get to ask: how are the 2-bit and 3-bit layers running in vLLM? IIRC we don't have GPTQ kernels for these bit widths.
Done. Frankly, we never tested 2/3-bit with vLLM, so we made the bad assumption that there would be a fallback kernel compatible with 2/3-bit. We only started testing 2/3-bit in GPTQModel recently as well, due to the DeepSeek needs. Before DeepSeek, no one gambled with 2/3-bit. =)
@mgoin Actually, the exllama vLLM kernel's internal code has 2/3-bit support, but it was never tested and validated. I plan to adapt this kernel for GPTQModel as well so that HF Transformers has access to it. I will do some accuracy comparisons vs. the reference Torch and Marlin kernels, verify 2-8 bits, and get back to you. If the results are good, we can advertise 2/3-bit.
Make sure to add a link in docs/source/features/quantization/index.md
done
To run a GPTQModel quantized model with vLLM, you can use [DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2](https://huggingface.co/ModelCloud/DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2) with the following command:

```console
python examples/offline_inference/llm_engine_example.py --model DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2 --quantization gptq
```
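A GPTQModel checkpoint can also be served through vLLM's OpenAI-compatible server. A minimal sketch, assuming the standard `vllm serve` entry point and the full Hugging Face repo path from the link above, and letting vLLM auto-detect the GPTQ config:

```console
vllm serve ModelCloud/DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2
```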
Please remove `--quantization gptq`, as this will prevent Marlin or Machete from being used.
done
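For reference, a sketch of the revised command with the flag dropped (assuming no other changes), so vLLM can pick the Marlin/Machete kernels automatically:

```console
python examples/offline_inference/llm_engine_example.py --model DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2
```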
```python
sampling_params = SamplingParams(temperature=0.6, top_p=0.9)

# Create an LLM.
llm = LLM(model="DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2", quantization="gptq")
```
ditto
done
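A self-contained sketch of what the corrected offline example plausibly looks like after dropping `quantization="gptq"`; the prompt and the `llm.generate` loop are assumed, mirroring vLLM's standard offline-inference examples:

```python
from vllm import LLM, SamplingParams

prompts = ["The capital of France is"]  # assumed example prompt
sampling_params = SamplingParams(temperature=0.6, top_p=0.9)

# Create an LLM; vLLM auto-detects the GPTQ config and can use Marlin/Machete kernels.
llm = LLM(model="ModelCloud/DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2")

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(f"Prompt: {output.prompt!r}, Generated: {output.outputs[0].text!r}")
```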
```diff
@@ -12,6 +12,7 @@ supported_hardware
 auto_awq
 bnb
 gguf
+gptqmodel
```
@mgoin This list appears to be in a-z order, but fp8 is near the bottom. I would actually re-order this by performance and level of vLLM kernel support, not quant name. bnb and gguf should be near the bottom here, but that's not my decision to make.
```diff
@@ -3,7 +3,7 @@
 # AutoAWQ

 To create a new 4-bit quantized model, you can leverage [AutoAWQ](https://github.com/casper-hansen/AutoAWQ).
-Quantizing reduces the model's precision from FP16 to INT4 which effectively reduces the file size by ~70%.
+Quantization reduces the model's precision from BF16/FP16 to INT4 which effectively reduces the total model memory footprint.
```
@mgoin The GPTQModel doc is based on the AWQ doc template, but I need to fix this sentence. BF16 to INT4 does not yield a 70% model size reduction; that is the absolute best-case scenario, and overall model memory does not drop by 70%. Depending on the params, BPW can approach 5 bits, just like GPTQ.
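To make the bits-per-weight point concrete, here is a rough back-of-the-envelope sketch; the group size and FP16 scale/zero-point storage are assumptions, and the exact overhead depends on the checkpoint format:

```python
# Rough effective bits-per-weight for group-wise INT4 quantization (illustrative assumptions).
weight_bits = 4
group_size = 128    # weights sharing one scale/zero-point (assumed)
scale_bits = 16     # assume an FP16 scale per group
zero_bits = 16      # assume an FP16 (or packed) zero-point per group

per_weight_overhead = (scale_bits + zero_bits) / group_size
effective_bpw = weight_bits + per_weight_overhead
print(f"~{effective_bpw:.2f} bits/weight before counting unquantized layers")
# Embeddings, lm_head, and norms usually stay in BF16/FP16, pushing the whole-model
# average higher still, so a ~70% size reduction is a best-case figure, not typical.
```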