
[GenAI] Use BitsAndBytes for 4bit quantization. #7406

Open · wants to merge 8 commits into main

Conversation

@LittleLittleCloud (Contributor)

We are excited to review your PR.

So we can do the best job, please check:

  • There's a descriptive title that will make sense to other developers some time from now.
  • There are associated issues. All PRs should have issue(s) associated, unless the change is trivial and self-evident, such as fixing a typo. You can use the format Fixes #nnnn in your description so GitHub automatically closes the issue(s) when your PR is merged.
  • Your change description explains what the change does, why you chose your approach, and anything else that reviewers should know.
  • You have included any necessary tests in the same PR.

This PR uses the 4-bit quantization method from the bitsandbytes library to quantize linear layers to 4 bits.

What is bitsandbytes?

bitsandbytes is the library used by Hugging Face Transformers to provide 4-bit and 8-bit quantization and the corresponding quantized operations.

bitsandbytes is written in CUDA, and we provide a C# binding library, LittleLittleCloud.TorchSharp.BitsAndBytes, so it can be used easily together with TorchSharp.
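As a rough sketch of how the new API is meant to be called (the model type below is a placeholder; only ToQuantize4BitModule and the "fp4"/"nf4" dtype values come from this PR's description and diff):

```csharp
// Illustrative usage sketch, not code from this PR; the model type is a
// placeholder, while ToQuantize4BitModule and the "fp4"/"nf4" options
// come from this PR's description and diff.
static void QuantizeExample(Phi3ForCausalLM model)
{
    // Replace the model's linear layers with 4-bit quantized equivalents.
    // quantizedDType selects the bitsandbytes 4-bit format: "fp4" or "nf4".
    model.ToQuantize4BitModule(quantizedDType: "nf4");
}
```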

Copilot review requested due to automatic review settings · March 2, 2025 01:06

Copilot AI left a comment

PR Overview

This PR introduces support for 4‑bit quantization using the BitsAndBytes library by renaming and replacing the old Int4 API with a new Quantize4Bit approach. Key changes include updating the IQuantizeModule interface and its configuration record, propagating these changes across module extension methods and model loading routines, and adjusting documentation and sample code accordingly.

Reviewed Changes

| File | Description |
| --- | --- |
| src/Microsoft.ML.GenAI.Core/Module/IQuantizeModule.cs | Added Quantize4Bit method and Quantize4BitConfig record with updated XML comments. |
| src/Microsoft.ML.GenAI.Core/Extension/ModuleExtension.cs | Replaced ToInt4QuantizeModule calls with the new ToQuantize4BitModule API. |
| src/Microsoft.ML.GenAI.[Phi\|LLaMA] | Propagated the new 4-bit quantization API through the model loading routines. |
| Test files | Removed tests for the deprecated Int4 quantize functionality. |
| Docs/Samples | Updated sample code to reflect the new API usage. |
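For reference, the shape of the new surface could look roughly like the sketch below; only the names IQuantizeModule, Quantize4Bit, Quantize4BitConfig, and quantizedDType appear in this PR, the rest is illustrative:

```csharp
// Sketch inferred from the reviewed-changes table, not the PR's literal code.
public record Quantize4BitConfig(string QuantizedDType = "nf4"); // "fp4" or "nf4"

public interface IQuantizeModule
{
    // Quantize the module's linear layers to 4 bits via bitsandbytes.
    void Quantize4Bit(Quantize4BitConfig config);
}
```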

Copilot reviewed 16 out of 16 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

test/Microsoft.ML.GenAI.Phi.Tests/Phi3Tests.cs:51

  • Consider adding new tests for the 4-bit quantization functionality to ensure the new Quantize4BitModule behaves as expected, since tests for the old Int4 quantization were removed.
(diff context: removal of the [Fact] test Phi3Mini4KInt4QuantizeShapeTest)

@LittleLittleCloud (Contributor Author)

/azp run

Azure Pipelines successfully started running 2 pipeline(s).

@LittleLittleCloud (Contributor Author)

/azp run

Azure Pipelines successfully started running 2 pipeline(s).

@LittleLittleCloud (Contributor Author)

/azp run

Azure Pipelines successfully started running 2 pipeline(s).

<PackageReference Include="Microsoft.SemanticKernel" Version="$(SemanticKernelVersion)" />
<PackageReference Include="AutoGen.SourceGenerator" Version="$(AutoGenVersion)" />
<PackageReference Include="Microsoft.Extensions.Logging.Console" Version="8.0.0" />
<PackageReference Include="LittleLittleCloud.TorchSharp.BitsAndBytes" Version="0.0.4" />
Member

Is this not something we can directly get into torchsharp itself?

Contributor Author

Probably not, bitsandbytes is not part of libtorch...

Member

Is the code in a repo owned by you? Or by Microsoft?

Member

And we are going to want the version in a central location since you use it in more than one place.
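Concretely, and matching the $(SemanticKernelVersion)/$(AutoGenVersion) pattern in the surrounding PackageReference lines, this could mean a single version property in a shared props file; the file location and property name below are assumptions:

```xml
<!-- In a shared props file, e.g. eng/Versions.props (location assumed);
     the property name here is illustrative, not from this PR. -->
<PropertyGroup>
  <TorchSharpBitsAndBytesVersion>0.0.4</TorchSharpBitsAndBytesVersion>
</PropertyGroup>

<!-- Each project then references it the same way as the other packages: -->
<PackageReference Include="LittleLittleCloud.TorchSharp.BitsAndBytes"
                  Version="$(TorchSharpBitsAndBytesVersion)" />
```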

Contributor Author

> Is the code in a repo owned by you? Or by Microsoft?

The wrapping code is in a repo owned by me. The CUDA code is owned by Hugging Face, I believe. Both are under the MIT license.

@@ -90,13 +90,18 @@ public static void ToInt8QuantizeModule<T>(
 /// </summary>
 /// <typeparam name="T"></typeparam>
 /// <param name="model"></param>
-public static void ToInt4QuantizeModule<T>(
-    this T model)
+/// <param name="quantizedDType">Quantized data type, can be "fp4" or "nf4".</param>
Member

Maybe add a note, either here or in the summary, on the difference between "fp4" and "nf4".
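For context: in bitsandbytes, "fp4" is a plain 4-bit floating-point format, while "nf4" (NormalFloat4, introduced in the QLoRA paper) uses quantization levels tuned for normally distributed weights and is the usual default. A doc note along those lines might read (an editorial sketch, not text from this PR):

```csharp
/// <param name="quantizedDType">
/// Quantized data type: "fp4" (a plain 4-bit float format) or "nf4"
/// (NormalFloat4, with quantization levels tuned for normally
/// distributed weights; typically the better default).
/// </param>
```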
