Conversation

@shihaobai (Collaborator)

No description provided.

@gemini-code-assist

Summary of Changes

Hello @shihaobai, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the LightLLM framework by integrating comprehensive support for AWQ and AWQ Marlin quantization. This enables the system to load and efficiently process models quantized using these methods, which can lead to reduced memory footprint and faster inference times. The changes involve extending the core weight management system to handle the specific requirements of AWQ, such as zero points and scales, and incorporating optimized kernel operations from vllm for performance.

Highlights

  • AWQ Quantization Support: Implemented comprehensive support for AWQ (Activation-aware Weight Quantization) and AWQ Marlin quantization methods, enabling more efficient inference with quantized models.
  • Enhanced Weight Handling: Introduced new weight classes (e.g., AWQMMWeightTpl, AWQCOLMMWeight, AWQROWMMWeight) to specifically manage and process AWQ quantized weights, scales, and zero points, including tensor parallelism slicing.
  • VLLM Integration: Leveraged vllm operations for optimized AWQ dequantization, GEMM, and Marlin-specific computations, ensuring high performance for quantized models.
  • Dynamic Quantization Method Selection: Improved the quantization configuration system to dynamically identify and select between AWQ, AWQ Marlin, and other quantization types based on the provided Hugging Face quantization configuration (see the sketch after this list).
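As a concrete illustration of that selection logic, here is a minimal sketch of picking a backend from a Hugging Face quantization_config. The function name and the exact capability check are assumptions for illustration, not lightllm's actual code.

import torch

def choose_quant_method(hf_quant_config: dict) -> str:
    # Hypothetical helper: map the HF "quant_method" field to a backend name.
    method = hf_quant_config.get("quant_method")
    if method == "awq":
        # Marlin kernels require an NVIDIA GPU with compute capability >= 8.0
        # (Ampere or newer); fall back to plain AWQ otherwise.
        if torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 8:
            return "awq_marlin"
        return "awq"
    return method  # e.g. "fp8", "gptq", or None for an unquantized model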

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces AWQ (Activation-aware Weight Quantization) support to the lightllm project. It adds new weight classes for AWQ and AWQ Marlin (AWQCOLMMWeight, AWQMARLINCOLMMWeight, AWQROWMMWeight, and AWQMARLINMultiROWMMWeight), which manage the loading, slicing, and processing of quantized weights, scales, and zero points. It also modifies existing weight classes to accommodate quantization parameters and updates the Quantcfg class to parse AWQ quantization configurations and select the appropriate quantization method.
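For readers unfamiliar with the AWQ tensor layout, the following is a hypothetical sketch of the row-parallel slicing such a weight class performs. The shapes follow the standard AWQ format, but the function and variable names are illustrative, not the PR's actual implementation.

PACK_FACTOR = 8  # eight 4-bit values packed into each int32

def slice_awq_row_parallel(qweight, scales, qzeros, tp_rank, tp_size):
    # Standard AWQ layout (assumed here):
    #   qweight: [in_features, out_features // PACK_FACTOR], int32
    #   scales:  [in_features // group_size, out_features], fp16
    #   qzeros:  [in_features // group_size, out_features // PACK_FACTOR], int32
    # Row parallelism splits the input dimension, so qweight is sliced along
    # dim 0 and the per-group scales/zeros are sliced along their group dim.
    shard = qweight.shape[0] // tp_size
    g_shard = scales.shape[0] // tp_size
    return (
        qweight[tp_rank * shard : (tp_rank + 1) * shard, :],
        scales[tp_rank * g_shard : (tp_rank + 1) * g_shard, :],
        qzeros[tp_rank * g_shard : (tp_rank + 1) * g_shard, :],
    )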

Comment on lines 16 to +19

  if quant_method is None or not quantized_weight:
      return UnquantizedCOLMMWeight
- else:
-     return W8A8B128COLMMWeight
+ return COLBMM_WEIGHT_CLS_MAP[quant_method.get_name()]


high

Consider adding a default case or an error message if the quant_method.get_name() doesn't match any key in COLBMM_WEIGHT_CLS_MAP. This can prevent unexpected behavior if a new quantization method is added but not included in the map.

Suggested change

- if quant_method is None or not quantized_weight:
-     return UnquantizedCOLMMWeight
- else:
-     return W8A8B128COLMMWeight
- return COLBMM_WEIGHT_CLS_MAP[quant_method.get_name()]
+ if quant_method is None or not quantized_weight:
+     return UnquantizedCOLMMWeight
+ return COLBMM_WEIGHT_CLS_MAP.get(quant_method.get_name(), None)  # None or raise error
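Returning None would still fail later with an opaque error; a variant that fails loudly at lookup time might look like the following sketch (illustrative, not part of the PR):

def get_colmm_weight_cls(quant_method, quantized_weight):
    if quant_method is None or not quantized_weight:
        return UnquantizedCOLMMWeight
    name = quant_method.get_name()
    if name not in COLBMM_WEIGHT_CLS_MAP:
        # Fail fast with a descriptive message instead of a bare KeyError.
        raise ValueError(
            f"No COLMM weight class registered for quant method {name!r}; "
            f"known methods: {sorted(COLBMM_WEIGHT_CLS_MAP)}"
        )
    return COLBMM_WEIGHT_CLS_MAP[name]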

Comment on lines +22 to +23

return ROWBMM_WEIGHT_CLS_MAP[quant_method.get_name()]


high

Similar to the colmm_weight.py file, consider adding a default case or an error message if the quant_method.get_name() doesn't match any key in ROWBMM_WEIGHT_CLS_MAP. This can prevent unexpected behavior if a new quantization method is added but not included in the map.

Suggested change

- return ROWBMM_WEIGHT_CLS_MAP[quant_method.get_name()]
+ if quant_method is None or not quantized_weight:
+     return UnquantizedROWMMWeight
+ return ROWBMM_WEIGHT_CLS_MAP.get(quant_method.get_name(), None)

Comment on lines +31 to +32

return MULTI_ROWBMM_WEIGHT_CLS_MAP[quant_method.get_name()]


high

Similar to the colmm_weight.py file, consider adding a default case or an error message if the quant_method.get_name() doesn't match any key in MULTI_ROWBMM_WEIGHT_CLS_MAP. This can prevent unexpected behavior if a new quantization method is added but not included in the map.

Suggested change

- return MULTI_ROWBMM_WEIGHT_CLS_MAP[quant_method.get_name()]
+ if quant_method is None or not quantized_weight:
+     return UnquantizedMultiROWMMWeight
+ return MULTI_ROWBMM_WEIGHT_CLS_MAP.get(quant_method.get_name(), None)

def apply(self, input_tensor, weights, bias=None, out=None, workspace=None, use_custom_tensor_mananger=True):
    qweight, weight_scale, qzeros = weights

    NEED_DEQUANT_WEIGHT = input_tensor.shape[:-1].numel() >= 256


medium

The threshold 256 in input_tensor.shape[:-1].numel() >= 256 is a magic number. It would be beneficial to extract this value into a constant with a descriptive name to improve readability and maintainability. What does 256 represent?

Suggested change

- NEED_DEQUANT_WEIGHT = input_tensor.shape[:-1].numel() >= 256
+ NEED_DEQUANT_WEIGHT = input_tensor.shape[:-1].numel() >= 256  # replace 256 with a constant
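One way to resolve this, sketched below with an assumed constant name and rationale (vllm's AWQ path uses a similar threshold to switch from the fused AWQ GEMM to dequantize-then-matmul for larger batches):

# Assumed name and comment; not the PR's actual code. For token counts at or
# above this threshold, dequantizing the weight and running a dense fp16 GEMM
# is presumed faster than the fused AWQ GEMM kernel.
DEQUANT_GEMM_TOKEN_THRESHOLD = 256

def apply(self, input_tensor, weights, bias=None, out=None, workspace=None, use_custom_tensor_mananger=True):
    qweight, weight_scale, qzeros = weights
    NEED_DEQUANT_WEIGHT = input_tensor.shape[:-1].numel() >= DEQUANT_GEMM_TOKEN_THRESHOLD
    ...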

Comment on lines +169 to +170

if not torch.cuda.is_available():
    return False


medium

It's good to check for CUDA availability, but consider logging a more informative message about why Marlin is not compatible if CUDA is not available. This can help users troubleshoot issues.

Suggested change

- if not torch.cuda.is_available():
-     return False
+ if not torch.cuda.is_available():
+     logger.warning("CUDA is not available, AWQ Marlin is not compatible.")
+     return False
