
[Docs] Add GPTQModel #14056

Open · wants to merge 6 commits into base: main
Conversation

@Qubitium (Contributor) commented Feb 28, 2025

PR Changes:

  • Add a doc for GPTQModel, another user option for GPTQ model quantization, fully backed by ModelCloud.AI

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to quickly catch errors. You can run the other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the documentation label (Improvements or additions to documentation) Feb 28, 2025
@Qubitium marked this pull request as ready for review February 28, 2025 16:47

# GPTQModel

To create a new [2, 3, 4, 8]-bit GPTQ quantized model, you can leverage [GPTQModel](https://github.com/ModelCloud/GPTQModel) from ModelCloud.AI.
@mgoin (Member):

I was curious about this earlier and didn't get to ask: how are the 2-bit and 3-bit layers running in vLLM? IIRC we don't have GPTQ kernels for these bit widths.

@Qubitium (Contributor, Author):

Done. Frankly, we never tested 2/3-bit with vLLM, so we made the bad assumption that there would be a fallback kernel compatible with 2/3-bit. We only started testing 2/3-bit in GPTQModel recently as well, due to DeepSeek needs. Before DeepSeek, no one gambled with 2/3 bits. =)

@Qubitium (Contributor, Author):

@mgoin Actually, the exllama vLLM kernel's internal code has 2/3-bit support, but it was never tested and validated. I plan to adapt this kernel for GPTQModel as well so HF Transformers has access to it. I will do some accuracy comparisons vs. reference Torch and Marlin, verify 2-8 bits, and get back to you. If the results are good, we can advertise 2/3-bit.

@mgoin (Member):

Make sure to add a link in docs/source/features/quantization/index.md

@Qubitium (Contributor, Author):

done
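
For context on the "To create a new [2, 3, 4, 8]-bit GPTQ quantized model" line above, the sketch below shows a minimal GPTQModel quantization flow. It follows the `QuantizeConfig` / `GPTQModel.load` / `quantize` / `save` entry points as described in GPTQModel's README; the model ID, output path, calibration slice, and batch size are illustrative, so double-check the exact API against the current GPTQModel release.

```python
from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig

model_id = "meta-llama/Llama-3.2-1B-Instruct"          # illustrative base model
quant_path = "Llama-3.2-1B-Instruct-gptqmodel-4bit"    # illustrative output path

# Small calibration set; real runs typically use a few hundred to ~1024 samples.
calibration_dataset = load_dataset(
    "allenai/c4",
    data_files="en/c4-train.00001-of-01024.json.gz",
    split="train",
).select(range(1024))["text"]

# 4-bit, group size 128 is the common GPTQ configuration; 2/3/8-bit are also accepted.
quant_config = QuantizeConfig(bits=4, group_size=128)

model = GPTQModel.load(model_id, quant_config)
model.quantize(calibration_dataset, batch_size=2)  # raise batch_size if VRAM allows
model.save(quant_path)
```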

To run a GPTQModel quantized model with vLLM, you can use [DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2](https://huggingface.co/ModelCloud/DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2) with the following command:

```console
python examples/offline_inference/llm_engine_example.py --model DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2 --quantization gptq
```
@mgoin (Member):

Please remove --quantization gptq as this will prevent marlin or machete from being used

@Qubitium (Contributor, Author):

done

```python
sampling_params = SamplingParams(temperature=0.6, top_p=0.9)

# Create an LLM.
llm = LLM(model="DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2", quantization="gptq")
```
@mgoin (Member):

ditto

@Qubitium (Contributor, Author):

done
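
Putting the reviewer's suggestion into practice, a complete offline-inference sketch would look roughly like this: no quantization="gptq" argument, so vLLM can auto-detect the checkpoint's quantization and pick an optimized kernel (e.g. Marlin) where supported. The prompt is illustrative, and the full Hub ID is taken from the model page linked above.

```python
from vllm import LLM, SamplingParams

prompts = ["The capital of France is"]  # illustrative prompt
sampling_params = SamplingParams(temperature=0.6, top_p=0.9)

# Create an LLM. Omitting quantization="gptq" lets vLLM detect the GPTQ
# checkpoint on its own and choose the best available kernel.
llm = LLM(model="ModelCloud/DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2")

# Generate completions and print them.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(f"Prompt: {output.prompt!r}, Generated: {output.outputs[0].text!r}")
```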

@Qubitium changed the title from "Add GPTQModel doc" to "[Docs] Add GPTQModel" Mar 1, 2025
```diff
@@ -12,6 +12,7 @@ supported_hardware
 auto_awq
 bnb
 gguf
+gptqmodel
```
@Qubitium (Contributor, Author) commented Mar 1, 2025:

@mgoin This list appears to be in a-z order, but fp8 is near the bottom. I would actually re-order this by performance and level of vLLM kernel support, not by quant name. bnb and gguf should be near the bottom here, but it's not my decision to make.

```diff
@@ -3,7 +3,7 @@
 # AutoAWQ

 To create a new 4-bit quantized model, you can leverage [AutoAWQ](https://github.com/casper-hansen/AutoAWQ).
-Quantizing reduces the model's precision from FP16 to INT4 which effectively reduces the file size by ~70%.
+Quantization reduces the model's precision from BF16/FP16 to INT4 which effectively reduces the total model memory footprint.
```
@Qubitium (Contributor, Author):

@mgoin The GPTQModel doc is based off the AWQ doc template, but I need to fix this sentence.

BF16 to INT4 does not yield a 70% model reduction. That is the absolute best-case scenario, and overall model memory does not drop by 70%. Depending on params, BPW can approach 5 bits, just like GPTQ.
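
As a rough back-of-the-envelope illustration of the bits-per-weight point above (illustrative assumptions only: GPTQ-style group quantization with group_size=128, one 16-bit scale and one packed 4-bit zero-point per group, and a small fraction of parameters such as embeddings kept in 16-bit):

```python
# Effective bits-per-weight (BPW) for nominal 4-bit GPTQ packing, group_size=128.
bits, group_size = 4, 128
overhead_per_weight = (16 + 4) / group_size        # scale + zero-point amortized per weight
bpw_quantized = bits + overhead_per_weight          # ~4.16 BPW for quantized layers

# Assume ~5% of parameters (embeddings, lm_head, norms) stay in 16-bit.
frac_fp16 = 0.05
avg_bpw = frac_fp16 * 16 + (1 - frac_fp16) * bpw_quantized

print(f"quantized-layer BPW ~ {bpw_quantized:.2f}, model-wide BPW ~ {avg_bpw:.2f}")
# The model-wide average lands well above the nominal 4 bits, so the weight-memory
# cut is smaller than the idealized 75%, and total serving memory (activations,
# KV cache) shrinks by even less.
```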
