[GenAI] Use BitsAndBytes for 4bit quantization. #7406
base: main
Conversation
PR Overview
This PR introduces support for 4‑bit quantization using the BitsAndBytes library by renaming and replacing the old Int4 API with a new Quantize4Bit approach. Key changes include updating the IQuantizeModule interface and its configuration record, propagating these changes across module extension methods and model loading routines, and adjusting documentation and sample code accordingly.
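For readers skimming the diff, here is a minimal before/after sketch of the rename; the generic helper, the namespace, and the exact placement of the `quantizedDType` parameter are assumptions based on the excerpts later on this page, not verbatim PR code:

```csharp
using Microsoft.ML.GenAI.Core.Extension; // namespace is an assumption

static void QuantizeTo4Bit<T>(T model) where T : class
{
    // Old API, removed by this PR:
    // model.ToInt4QuantizeModule();

    // New API introduced by this PR. Per the diff excerpt further down,
    // the quantized data type can be "fp4" or "nf4".
    model.ToQuantize4BitModule(quantizedDType: "nf4");
}
```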
Reviewed Changes
| File | Description |
|---|---|
| src/Microsoft.ML.GenAI.Core/Module/IQuantizeModule.cs | Added Quantize4Bit method and Quantize4BitConfig record with updated XML comments. |
| src/Microsoft.ML.GenAI.Core/Extension/ModuleExtension.cs | Replaced ToInt4QuantizeModule calls with the new ToQuantize4BitModule API. |
| src/Microsoft.ML.GenAI.[Phi\|LLaMA] | Propagated the new Quantize4Bit API through model loading routines. |
| Test Files | Removed tests for the deprecated Int4 quantize functionality. |
| Docs/Samples | Updated sample code to reflect the new API usage. |
Copilot reviewed 16 out of 16 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
test/Microsoft.ML.GenAI.Phi.Tests/Phi3Tests.cs:51
- Consider adding new tests for the 4-bit quantization functionality to ensure the new Quantize4BitModule behaves as expected, since tests for the old Int4 quantization were removed. (The suppressed snippet shows the removal of the `[Fact]` test `Phi3Mini4KInt4QuantizeShapeTest`.)
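A hedged sketch of the kind of replacement test the review suggests; the test name, model setup, and assertion strategy are assumptions for illustration, not code from this PR:

```csharp
using Xunit;

public class Phi3QuantizeTests
{
    [Fact]
    public void Phi3Mini4KQuantize4BitShapeTest()
    {
        // Model construction is assumed to mirror the removed Int4 shape test
        // (the constructor and config name here are hypothetical).
        var model = new Phi3ForCausalLM(Phi3Config.Phi3Mini4kInstruct);

        // Quantize with the new API introduced in this PR.
        model.ToQuantize4BitModule(quantizedDType: "nf4");

        // Assert that linear-layer weights now report 4-bit-packed shapes;
        // the exact public surface to assert against depends on the
        // Quantize4Bit module's final design.
    }
}
```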
/azp run

Azure Pipelines successfully started running 2 pipeline(s).

/azp run

Azure Pipelines successfully started running 2 pipeline(s).

/azp run

Azure Pipelines successfully started running 2 pipeline(s).
<PackageReference Include="Microsoft.SemanticKernel" Version="$(SemanticKernelVersion)" /> | ||
<PackageReference Include="AutoGen.SourceGenerator" Version="$(AutoGenVersion)" /> | ||
<PackageReference Include="Microsoft.Extensions.Logging.Console" Version="8.0.0" /> | ||
<PackageReference Include="LittleLittleCloud.TorchSharp.BitsAndBytes" Version="0.0.4" /> |
Is this not something we can directly get into torchsharp itself?
Probably not, `bitsandbytes` is not part of libtorch...
Is the code in a repo owned by you? Or by Microsoft?
And we are going to want the version in a central location since you use it in more than one place.
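A minimal sketch of what centralizing the version could look like, following the `$(SemanticKernelVersion)`-style MSBuild property pattern already visible in the project file above; the props file path and property name are assumptions:

```xml
<!-- In a shared props file, e.g. eng/Versions.props (path is an assumption): -->
<PropertyGroup>
  <TorchSharpBitsAndBytesVersion>0.0.4</TorchSharpBitsAndBytesVersion>
</PropertyGroup>

<!-- Each project then references the shared property instead of a literal version: -->
<ItemGroup>
  <PackageReference Include="LittleLittleCloud.TorchSharp.BitsAndBytes" Version="$(TorchSharpBitsAndBytesVersion)" />
</ItemGroup>
```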
> Is the code in a repo owned by you? Or by Microsoft?

The wrapping code is in a repo owned by me. The CUDA code is owned by Hugging Face, I believe. Both are under the MIT license.
```diff
@@ -90,13 +90,18 @@ public static void ToInt8QuantizeModule<T>(
 /// </summary>
 /// <typeparam name="T"></typeparam>
 /// <param name="model"></param>
-public static void ToInt4QuantizeModule<T>(
-    this T model)
+/// <param name="quantizedDType">Quantized data type, can be "fp4" or "nf4".</param>
```
Maybe either here or in the summary add a note as to the difference between "fp4" and "nf4".
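For context, a sketch of the kind of note the reviewer is asking for; the wording below is ours, not from the PR, though the distinction matches how bitsandbytes describes these types:

```csharp
/// <param name="quantizedDType">
/// Quantized data type, can be "fp4" or "nf4".
/// "fp4" packs weights into a plain 4-bit floating-point format, while "nf4"
/// (NormalFloat4, introduced in the QLoRA paper) uses quantization levels tuned
/// for normally distributed weights and typically preserves accuracy better.
/// </param>
```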
This PR uses the 4-bit quantization method from the `bitsandbytes` library to quantize linear layers into 4 bits.

**What's bitsandbytes?**

`bitsandbytes` is a library used by the Hugging Face transformers library to provide support for 4-bit and 8-bit quantization and the corresponding operations. `bitsandbytes` is written in CUDA, and we provide a C# binding library, `LittleLittleCloud.TorchSharp.BitsAndBytes`, to make it easy to use with the TorchSharp library.
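For illustration, a minimal end-to-end sketch of the intended usage described above; the model type, its construction, and the parameter name are assumptions based on this page's excerpts, not verbatim sample code from the PR:

```csharp
using Microsoft.ML.GenAI.Core.Extension; // namespace is an assumption

// Load a GenAI model (type and construction are hypothetical placeholders).
var model = new Phi3ForCausalLM(Phi3Config.Phi3Mini4kInstruct);

// Quantize every linear layer to 4 bits via the BitsAndBytes binding.
// "nf4" generally preserves quality better than "fp4" for LLM weights.
model.ToQuantize4BitModule(quantizedDType: "nf4");
```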