convert : handle pre-quantized models #14810
Conversation
Perfect!
In case you feel like it, add support for MXFP4 as well. I will be upstreaming a ggml implementation soon and it would be nice to have HF conversion support. You can use some of the smaller models here https://huggingface.co/models?sort=created&search=mxfp4 (any of them without Hadamard matrices should work).
In case you were wondering, the workaround was about this line: conversion would fail because that tensor doesn't exist. Edit: I might have used the […]
MLX support could be useful, they have

Edit:

    "quantization": {
        "group_size": 32,
        "bits": 5
    },
    "quantization_config": {
        "group_size": 32,
        "bits": 5
    },
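For context, these keys live in the model's config.json. A hedged sketch of how a conversion script could detect them (this is not code from this PR; the helper name is made up):

```python
import json
from pathlib import Path


def detect_prequant(model_dir: str) -> dict | None:
    """Return the quantization settings from config.json, if any.

    MLX exports may use "quantization" and/or "quantization_config";
    other exporters (GPTQ, FP8, ...) typically use "quantization_config".
    """
    config = json.loads((Path(model_dir) / "config.json").read_text())
    quant = config.get("quantization_config") or config.get("quantization")
    return quant  # None means the model is not pre-quantized


# Example usage:
# settings = detect_prequant("path/to/mlx-model")
# if settings is not None:
#     print(settings.get("bits"), settings.get("group_size"))
```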
Is it ready to be merged? We need this :)
It kind of is, but I wanted to make it more general, and also repack instead of going through F32 and requantizing. I've described this in #15111 (comment), but the approach here isn't compatible with how MXFP4 is handled for gpt-oss. I mean, whenever the […]

I guess I can change this in a follow-up PR, since I have started working on a deferred repacking system to transparently handle transformations of pre-quantized tensors. (It's a bit more involved than I thought, so I guess it makes sense to leave it for a follow-up pull request.)

So this can be considered ready to be merged (assuming this doesn't break remote conversion), and a follow-up pull request will handle repacking (instead of requantizing).
It sounds good :)
How can I convert llama3.2_1b after GPTQ to GGUF?
@compilade Preferably #14737 should be merged first, then you can rebase this.
Should fix #14762, and also address #3353.
This roughly implements the idea in #14762 (comment) to allow converting from pre-quantized models, by splitting `ModelBase.get_tensors` into an intermediate `ModelBase.index_tensors`, which returns a `dict[str, Callable[[], Tensor]]` that can be modified before `get_tensors` is called. `get_tensors` still keeps the same signature (it still returns an `Iterator[tuple[str, Tensor]]`).
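As a rough illustration of that split (a minimal sketch with made-up tensor names and a toy dequantization step, not the actual convert_hf_to_gguf.py code):

```python
from typing import Callable, Iterator

import torch
from torch import Tensor


class ModelBase:
    def __init__(self):
        # Build the index once; entries can be rewritten before get_tensors() runs,
        # e.g. to wrap a packed weight with a callable that dequantizes it lazily.
        self.model_tensors: dict[str, Callable[[], Tensor]] = self.index_tensors()
        old = self.model_tensors["model.layers.0.mlp.up_proj.weight"]
        self.model_tensors["model.layers.0.mlp.up_proj.weight"] = lambda: old().float()

    def index_tensors(self) -> dict[str, Callable[[], Tensor]]:
        # Map each tensor name to a zero-argument callable that loads it on demand.
        # (A dummy tensor stands in for reading from safetensors shards here.)
        return {
            "model.layers.0.mlp.up_proj.weight": lambda: torch.zeros(32, 32, dtype=torch.float16),
        }

    def get_tensors(self) -> Iterator[tuple[str, Tensor]]:
        # Unchanged signature: tensors are only materialized when yielded.
        for name, get_data in self.model_tensors.items():
            yield name, get_data()
```

The point being that dequantization can wrap the original callables, so nothing is loaded or converted until the writer actually consumes the tensor.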
For now, support for these pre-quantizations has been implemented:

- `bitnet`
- `fp8`
- `gptq` (only 2, 4 and 8 bits, not 3)

The 3-bit variant of GPTQ is more complicated, and so was omitted for now (see the sketch below for why the other widths are straightforward).
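For 2, 4 and 8 bits, 32 is an exact multiple of the bit width, so each packed int32 word holds a whole number of values and unpacking is a plain shift-and-mask; with 3 bits, values straddle word boundaries. A hedged sketch of the regular cases (the exact packing axis, zero-point offset and group mapping vary by GPTQ exporter, so this is illustrative rather than the code in this PR):

```python
import numpy as np


def unpack_gptq_rows(qweight: np.ndarray, bits: int) -> np.ndarray:
    """Unpack bits-wide integers from packed int32 words along the first axis.

    Each int32 holds 32 // bits values, least-significant bits first
    (the layout commonly used by GPTQ-style checkpoints for 2/4/8 bits).
    """
    assert bits in (2, 4, 8), "3-bit packing straddles word boundaries; not handled here"
    vals_per_word = 32 // bits
    mask = np.uint32((1 << bits) - 1)
    q = qweight.astype(np.uint32)
    shifts = np.arange(vals_per_word, dtype=np.uint32) * bits
    # (rows, cols) -> (rows, vals_per_word, cols) -> (rows * vals_per_word, cols)
    unpacked = (q[:, None, :] >> shifts[None, :, None]) & mask
    return unpacked.reshape(-1, qweight.shape[1])


# Example: with bits=4, a (in_features // 8, out_features) int32 qweight unpacks
# to an (in_features, out_features) array of 4-bit codes, which can then be
# mapped to floats using the checkpoint's scales and zero points.
```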
Notes

- This removes `ModelBase.tensor_names` in favor of `self.model_tensors`, which also allows getting the tensor data (because it's a `dict[str, Callable[[], Tensor]]`).
- There was logic that removed the `lm_head.weight` tensor from `self.tensor_names`, but I don't see why it's necessary. I've removed it because `self.tensor_names` was also removed.
- I've tested a `--dry-run` of a `--remote` conversion of https://huggingface.co/WisdomShell/CodeShell-7B, which does run the tensor index completeness check. No problem either.

TODO

- Test remote conversion (`--remote`)