Awq support #1084
Conversation
Summary of Changes

Hello @shihaobai, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request enhances the LightLLM framework by integrating support for AWQ and AWQ Marlin quantization. It enables the system to load and efficiently serve models quantized with these methods, which can reduce memory footprint and speed up inference. The changes extend the core weight management system to handle AWQ-specific requirements such as zero points and scales, and incorporate optimized kernel operations.
Code Review
This pull request introduces AWQ (Activation-Aware Weight Quantization) support to the lightllm project. The changes include adding new weight classes for AWQ and AWQ Marlin, modifying existing weight classes to accommodate quantization parameters, and integrating AWQ into the quantization configuration. The code introduces new classes for handling AWQ quantized weights, including AWQCOLMMWeight, AWQMARLINCOLMMWeight, AWQROWMMWeight, and AWQMARLINMultiROWMMWeight. These classes manage the loading, slicing, and processing of quantized weights, scales, and zero points. The changes also modify the Quantcfg class to handle AWQ quantization configurations and select the appropriate quantization method.
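To ground the review, here is a minimal sketch of the group-wise dequantization that AWQ-style weight classes manage. The function name, the group size, and the assumption that `qweight` is already unpacked from its int4 packing are all illustrative, not taken from this PR's kernels.

```python
# Illustrative sketch of AWQ-style group-wise dequantization: each group of
# `group_size` input rows shares one zero point and one scale per output
# column. Names and layout are assumptions, not the PR's actual kernels,
# and qweight is taken to be already unpacked from its int4 packing.
def dequantize_group(qweight, zeros, scales, group_size=2):
    dequant = []
    for i, row in enumerate(qweight):
        g = i // group_size  # quantization group this input row belongs to
        dequant.append([(q - z) * s for q, z, s in zip(row, zeros[g], scales[g])])
    return dequant
```

This is the per-element form `w = (q - zero) * scale`; the real kernels fuse the unpacking and this arithmetic on the GPU.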
```diff
-if quant_method is None or not quantized_weight:
-    return UnquantizedCOLMMWeight
-else:
-    return W8A8B128COLMMWeight
+return COLBMM_WEIGHT_CLS_MAP[quant_method.get_name()]
```
Consider adding a default case or an error message if the quant_method.get_name() doesn't match any key in COLBMM_WEIGHT_CLS_MAP. This can prevent unexpected behavior if a new quantization method is added but not included in the map.
Suggested change:
```suggestion
if quant_method is None or not quantized_weight:
    return UnquantizedCOLMMWeight
return COLBMM_WEIGHT_CLS_MAP.get(quant_method.get_name(), None)  # None or raise error
```
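The "None or raise error" note in the suggestion can go either way; as one possibility, a fail-fast lookup with a descriptive error might look like the following sketch. The map contents and the error message are placeholders, not the PR's code.

```python
# Placeholder map: in the PR these values are weight classes such as
# AWQCOLMMWeight; strings stand in for them here.
COLBMM_WEIGHT_CLS_MAP = {
    "awq": "AWQCOLMMWeight",
    "awq_marlin": "AWQMARLINCOLMMWeight",
}

def resolve_colmm_weight_cls(name):
    # Fail fast with the list of known methods instead of a bare KeyError
    # (or a silent None) when a new quant method is not yet registered.
    try:
        return COLBMM_WEIGHT_CLS_MAP[name]
    except KeyError:
        raise ValueError(
            f"unsupported quantization method {name!r}; "
            f"known methods: {sorted(COLBMM_WEIGHT_CLS_MAP)}"
        ) from None
```

Returning `None` defers the failure to the call site; raising here surfaces the misconfiguration at model-load time.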
```python
return ROWBMM_WEIGHT_CLS_MAP[quant_method.get_name()]
```
Similar to the colmm_weight.py file, consider adding a default case or an error message if the quant_method.get_name() doesn't match any key in ROWBMM_WEIGHT_CLS_MAP. This can prevent unexpected behavior if a new quantization method is added but not included in the map.
Suggested change:
```suggestion
if quant_method is None or not quantized_weight:
    return UnquantizedROWMMWeight
return ROWBMM_WEIGHT_CLS_MAP.get(quant_method.get_name(), None)
```
```python
return MULTI_ROWBMM_WEIGHT_CLS_MAP[quant_method.get_name()]
```
Similar to the colmm_weight.py file, consider adding a default case or an error message if the quant_method.get_name() doesn't match any key in MULTI_ROWBMM_WEIGHT_CLS_MAP. This can prevent unexpected behavior if a new quantization method is added but not included in the map.
Suggested change:
```suggestion
if quant_method is None or not quantized_weight:
    return UnquantizedMultiROWMMWeight
return MULTI_ROWBMM_WEIGHT_CLS_MAP.get(quant_method.get_name(), None)
```
```python
def apply(self, input_tensor, weights, bias=None, out=None, workspace=None, use_custom_tensor_mananger=True):
    qweight, weight_scale, qzeros = weights
```
```python
NEED_DEQUANT_WEIGHT = input_tensor.shape[:-1].numel() >= 256
```
The condition input_tensor.shape[:-1].numel() >= 256 seems like a magic number. It would be beneficial to extract this value into a constant with a descriptive name to improve readability and maintainability. What does 256 represent?
Suggested change:
```suggestion
NEED_DEQUANT_WEIGHT = input_tensor.shape[:-1].numel() >= 256  # replace 256 with a constant
```
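Following the reviewer's point, the threshold could be promoted to a named module-level constant. The name and the stated rationale below are assumptions about what 256 means (a token-count crossover point); only the value itself comes from the PR.

```python
import math

# Hypothetical name and rationale: assumed to be the number of input token
# rows above which dequantizing the whole weight (then running a dense GEMM)
# beats the fused quantized kernel. The value 256 is from the PR; the
# interpretation is a guess.
DEQUANT_WEIGHT_MIN_TOKENS = 256

def need_dequant_weight(input_shape):
    # input_shape[:-1] are the batch/sequence dims; their product is the
    # number of token rows, mirroring input_tensor.shape[:-1].numel().
    return math.prod(input_shape[:-1]) >= DEQUANT_WEIGHT_MIN_TOKENS
```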
```python
if not torch.cuda.is_available():
    return False
```
It's good to check for CUDA availability, but consider logging a more informative message about why Marlin is not compatible if CUDA is not available. This can help users troubleshoot issues.
Suggested change:
```suggestion
if not torch.cuda.is_available():
    logger.warning("CUDA is not available, AWQ Marlin is not compatible.")
    return False
```
No description provided.