Conversation

@shihaobai (Collaborator)

No description provided.

@gemini-code-assist

Summary of Changes

Hello @shihaobai, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the LightLLM framework by integrating comprehensive support for AWQ and AWQ Marlin quantization. This enables the system to load and efficiently process models quantized using these methods, which can lead to reduced memory footprint and faster inference times. The changes involve extending the core weight management system to handle the specific requirements of AWQ, such as zero points and scales, and incorporating optimized kernel operations from vllm for performance.

Highlights

  • AWQ Quantization Support: Implemented comprehensive support for AWQ (Activation-aware Weight Quantization) and AWQ Marlin quantization methods, enabling more efficient inference with quantized models.
  • Enhanced Weight Handling: Introduced new weight classes (e.g., AWQMMWeightTpl, AWQCOLMMWeight, AWQROWMMWeight) to specifically manage and process AWQ quantized weights, scales, and zero points, including tensor parallelism slicing.
  • VLLM Integration: Leveraged vllm operations for optimized AWQ dequantization, GEMM, and Marlin-specific computations, ensuring high performance for quantized models.
  • Dynamic Quantization Method Selection: Improved the quantization configuration system to dynamically identify and select between AWQ, AWQ Marlin, and other quantization types based on the provided Hugging Face quantization configuration (see the sketch after this list).
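As a concrete illustration of that selection logic, here is a minimal sketch of picking a backend from a Hugging Face quantization_config. The function name and the exact capability check are assumptions for illustration, not lightllm's actual code.

import torch

def choose_quant_method(hf_quant_config: dict) -> str:
    # Hypothetical helper: map the HF "quant_method" field to a backend name.
    method = hf_quant_config.get("quant_method")
    if method == "awq":
        # Marlin kernels require an NVIDIA GPU with compute capability >= 8.0
        # (Ampere or newer); fall back to plain AWQ otherwise.
        if torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 8:
            return "awq_marlin"
        return "awq"
    return method  # e.g. "fp8", "gptq", or None for an unquantized model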

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces AWQ (Activation-aware Weight Quantization) support to the lightllm project. It adds new weight classes for AWQ and AWQ Marlin (AWQCOLMMWeight, AWQMARLINCOLMMWeight, AWQROWMMWeight, and AWQMARLINMultiROWMMWeight), which manage the loading, slicing, and processing of quantized weights, scales, and zero points. It also modifies existing weight classes to accommodate quantization parameters and updates the Quantcfg class to parse AWQ quantization configurations and select the appropriate quantization method.
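For readers unfamiliar with the AWQ tensor layout, the following is a hypothetical sketch of the row-parallel slicing such a weight class performs. The shapes follow the standard AWQ format, but the function and variable names are illustrative, not the PR's actual implementation.

PACK_FACTOR = 8  # eight 4-bit values packed into each int32

def slice_awq_row_parallel(qweight, scales, qzeros, tp_rank, tp_size):
    # Standard AWQ layout (assumed here):
    #   qweight: [in_features, out_features // PACK_FACTOR], int32
    #   scales:  [in_features // group_size, out_features], fp16
    #   qzeros:  [in_features // group_size, out_features // PACK_FACTOR], int32
    # Row parallelism splits the input dimension, so qweight is sliced along
    # dim 0 and the per-group scales/zeros are sliced along their group dim.
    shard = qweight.shape[0] // tp_size
    g_shard = scales.shape[0] // tp_size
    return (
        qweight[tp_rank * shard : (tp_rank + 1) * shard, :],
        scales[tp_rank * g_shard : (tp_rank + 1) * g_shard, :],
        qzeros[tp_rank * g_shard : (tp_rank + 1) * g_shard, :],
    )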

Comment on lines 16 to +19

  if quant_method is None or not quantized_weight:
      return UnquantizedCOLMMWeight
- else:
-     return W8A8B128COLMMWeight
+ return COLBMM_WEIGHT_CLS_MAP[quant_method.get_name()]


high

Consider adding a default case or an error message if the quant_method.get_name() doesn't match any key in COLBMM_WEIGHT_CLS_MAP. This can prevent unexpected behavior if a new quantization method is added but not included in the map.

Suggested change

- if quant_method is None or not quantized_weight:
-     return UnquantizedCOLMMWeight
- else:
-     return W8A8B128COLMMWeight
- return COLBMM_WEIGHT_CLS_MAP[quant_method.get_name()]
+ if quant_method is None or not quantized_weight:
+     return UnquantizedCOLMMWeight
+ return COLBMM_WEIGHT_CLS_MAP.get(quant_method.get_name(), None)  # None or raise error
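Returning None would still fail later with an opaque error; a variant that fails loudly at lookup time might look like the following sketch (illustrative, not part of the PR):

def get_colmm_weight_cls(quant_method, quantized_weight):
    if quant_method is None or not quantized_weight:
        return UnquantizedCOLMMWeight
    name = quant_method.get_name()
    if name not in COLBMM_WEIGHT_CLS_MAP:
        # Fail fast with a descriptive message instead of a bare KeyError.
        raise ValueError(
            f"No COLMM weight class registered for quant method {name!r}; "
            f"known methods: {sorted(COLBMM_WEIGHT_CLS_MAP)}"
        )
    return COLBMM_WEIGHT_CLS_MAP[name]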

Comment on lines +22 to +23

return ROWBMM_WEIGHT_CLS_MAP[quant_method.get_name()]


high

Similar to the colmm_weight.py file, consider adding a default case or an error message if the quant_method.get_name() doesn't match any key in ROWBMM_WEIGHT_CLS_MAP. This can prevent unexpected behavior if a new quantization method is added but not included in the map.

Suggested change

- return ROWBMM_WEIGHT_CLS_MAP[quant_method.get_name()]
+ if quant_method is None or not quantized_weight:
+     return UnquantizedROWMMWeight
+ return ROWBMM_WEIGHT_CLS_MAP.get(quant_method.get_name(), None)

Comment on lines +31 to +32

return MULTI_ROWBMM_WEIGHT_CLS_MAP[quant_method.get_name()]


high

Similar to the colmm_weight.py file, consider adding a default case or an error message if the quant_method.get_name() doesn't match any key in MULTI_ROWBMM_WEIGHT_CLS_MAP. This can prevent unexpected behavior if a new quantization method is added but not included in the map.

Suggested change

- return MULTI_ROWBMM_WEIGHT_CLS_MAP[quant_method.get_name()]
+ if quant_method is None or not quantized_weight:
+     return UnquantizedMultiROWMMWeight
+ return MULTI_ROWBMM_WEIGHT_CLS_MAP.get(quant_method.get_name(), None)

def apply(self, input_tensor, weights, bias=None, out=None, workspace=None, use_custom_tensor_mananger=True):
    qweight, weight_scale, qzeros = weights

    NEED_DEQUANT_WEIGHT = input_tensor.shape[:-1].numel() >= 256


medium

The threshold 256 in input_tensor.shape[:-1].numel() >= 256 is a magic number. It would be beneficial to extract this value into a constant with a descriptive name to improve readability and maintainability. What does 256 represent?

Suggested change

- NEED_DEQUANT_WEIGHT = input_tensor.shape[:-1].numel() >= 256
+ NEED_DEQUANT_WEIGHT = input_tensor.shape[:-1].numel() >= 256  # replace 256 with a constant
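One way to resolve this, sketched below with an assumed constant name and rationale (vllm's AWQ path uses a similar threshold to switch from the fused AWQ GEMM to dequantize-then-matmul for larger batches):

# Assumed name and comment; not the PR's actual code. For token counts at or
# above this threshold, dequantizing the weight and running a dense fp16 GEMM
# is presumed faster than the fused AWQ GEMM kernel.
DEQUANT_GEMM_TOKEN_THRESHOLD = 256

def apply(self, input_tensor, weights, bias=None, out=None, workspace=None, use_custom_tensor_mananger=True):
    qweight, weight_scale, qzeros = weights
    NEED_DEQUANT_WEIGHT = input_tensor.shape[:-1].numel() >= DEQUANT_GEMM_TOKEN_THRESHOLD
    ...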

Comment on lines +169 to +170

if not torch.cuda.is_available():
    return False


medium

It's good to check for CUDA availability, but consider logging a more informative message about why Marlin is not compatible if CUDA is not available. This can help users troubleshoot issues.

Suggested change

- if not torch.cuda.is_available():
-     return False
+ if not torch.cuda.is_available():
+     logger.warning("CUDA is not available, AWQ Marlin is not compatible.")
+     return False
