
convert : handle pre-quantized models #14810


Open · compilade wants to merge 1 commit into master

Conversation

compilade (Collaborator)

Should fix #14762, and also address #3353.

This roughly implements the idea in #14762 (comment) to allow converting from pre-quantized models, by splitting ModelBase.get_tensors into an intermediate ModelBase.index_tensors which returns a dict[str, Callable[[], Tensor]] that can be modified before get_tensors is called. get_tensors keeps the same signature (it still returns an Iterator[tuple[str, Tensor]]).
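Roughly, the split looks like this (a minimal sketch: the constructor argument and the way the raw readers are built are placeholders for illustration, not the actual converter code):

    from typing import Callable, Iterator

    from torch import Tensor


    class ModelBase:
        def __init__(self, raw_tensors: dict[str, Callable[[], Tensor]]):
            # raw_tensors stands in for whatever safetensors/pytorch readers
            # the real converter builds; it is a placeholder for this sketch.
            self._raw_tensors = raw_tensors
            # Built once; callers may rewrite entries (e.g. swap a model's
            # qweight/scales/qzeros loaders for a single dequantizing closure)
            # before get_tensors() runs.
            self.tensor_index: dict[str, Callable[[], Tensor]] = self.index_tensors()

        def index_tensors(self) -> dict[str, Callable[[], Tensor]]:
            # Return name -> lazy loader without materializing any tensor yet.
            return dict(self._raw_tensors)

        def get_tensors(self) -> Iterator[tuple[str, Tensor]]:
            # Same signature as before: yields (name, tensor) pairs.
            for name, load in self.tensor_index.items():
                yield name, load()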

For now, support has been implemented for a few pre-quantization schemes. The 3-bit variant of GPTQ is more complicated, and so was omitted for now.
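For reference, this is roughly what dequantizing a 4-bit GPTQ tensor to F32 involves (a sketch assuming the common AutoGPTQ layout with qweight/qzeros/scales/g_idx and the v1 zero-point offset; the exact packing and offset vary between GPTQ format versions, so treat this as an illustration rather than the PR's code):

    import numpy as np

    def dequant_gptq_4bit(qweight, qzeros, scales, g_idx):
        # qweight: int32 [in_features // 8, out_features], 8 packed 4-bit values per int32
        # qzeros:  int32 [n_groups, out_features // 8], packed the same way
        # scales:  f16/f32 [n_groups, out_features]
        # g_idx:   int32 [in_features], maps each input row to its group
        shifts = np.arange(0, 32, 4, dtype=np.uint32)

        q = (qweight.astype(np.uint32)[:, None, :] >> shifts[None, :, None]) & 0xF
        q = q.reshape(-1, qweight.shape[1])                  # [in_features, out_features]

        z = (qzeros.astype(np.uint32)[:, :, None] >> shifts[None, None, :]) & 0xF
        z = z.reshape(qzeros.shape[0], -1) + 1               # [n_groups, out_features], v1 offset

        s = scales.astype(np.float32)
        # Result stays in the [in_features, out_features] layout GPTQ stores.
        return (q.astype(np.float32) - z[g_idx].astype(np.float32)) * s[g_idx]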

Notes

TODO

  • Test if this causes memory usage regressions
    • Lazy or not, safetensors or not
    • So far it seems good.
  • Test remote conversion (with --remote)


@compilade compilade added the enhancement (New feature or request) and python (python script changes) labels Jul 22, 2025
@ggerganov ggerganov (Member) left a comment


Perfect!

In case you feel like it, add support for MXFP4 as well. I will be upstreaming a ggml implementation soon and it would be nice to have HF conversion support. You can use some of the smaller models here: https://huggingface.co/models?sort=created&search=mxfp4 (any one without Hadamard matrices should work).
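The dequantization side of MXFP4 is small: blocks of 32 FP4 (E2M1) values share one E8M0 scale. A rough numpy sketch (the nibble order and the exact tensor layout of the HF checkpoints are assumptions here):

    import numpy as np

    # The 8 non-negative FP4 (E2M1) magnitudes; the high bit of each nibble is the sign.
    FP4_VALUES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)

    def dequant_mxfp4(packed: np.ndarray, scales_e8m0: np.ndarray) -> np.ndarray:
        # packed:      uint8 [n_rows, n_cols // 2], two FP4 values per byte
        #              (low nibble first is an assumption about the layout)
        # scales_e8m0: uint8 [n_rows, n_cols // 32], one shared exponent per block of 32
        n_rows = packed.shape[0]
        nibbles = np.stack([packed & 0xF, packed >> 4], axis=-1).reshape(n_rows, -1)
        signs = np.where(nibbles & 0x8, -1.0, 1.0).astype(np.float32)
        vals = signs * FP4_VALUES[nibbles & 0x7]
        scales = np.exp2(scales_e8m0.astype(np.float32) - 127.0)  # E8M0 scale: 2**(x - 127)
        return (vals.reshape(n_rows, -1, 32) * scales[:, :, None]).reshape(n_rows, -1)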

@CISC (Collaborator) commented Jul 22, 2025

In case you were wondering, the workaround was about this line:
https://huggingface.co/WisdomShell/CodeShell-7B-Chat/blob/main/pytorch_model.bin.index.json#L555

Conversion would fail because that tensor doesn't exist.

Edit: I might have used the safetensors version; it could be that it actually works with the PyTorch bins.

@CISC (Collaborator) commented Aug 8, 2025

MLX support could be useful, they have .scales and .biases tensors:
https://ml-explore.github.io/mlx/build/html/python/_autosummary/mlx.core.dequantize.html

Edit: group_size and bits are stored in config.json (not sure why there are two entries):

    "quantization": {
        "group_size": 32,
        "bits": 5
    },
    "quantization_config": {
        "group_size": 32,
        "bits": 5
    },
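A rough numpy sketch of undoing that affine quantization (per the linked docs, each group of group_size values is reconstructed as scales * q + biases; the packing order and the restriction to bit widths that divide 32 are assumptions, so 5-bit would need different unpacking):

    import numpy as np

    def dequant_mlx(wq: np.ndarray, scales: np.ndarray, biases: np.ndarray,
                    group_size: int = 32, bits: int = 4) -> np.ndarray:
        # wq:             uint32 [rows, cols * bits // 32], packed along the last axis
        # scales, biases: [rows, cols // group_size]
        per_word = 32 // bits                              # only exact for bits in {2, 4, 8}
        shifts = np.arange(per_word, dtype=np.uint32) * bits
        q = (wq[..., None] >> shifts) & ((1 << bits) - 1)  # unpack to small integers
        q = q.reshape(wq.shape[0], -1).astype(np.float32)  # [rows, cols]
        # Affine reconstruction per group: w ~= scales * q + biases
        q = q.reshape(wq.shape[0], -1, group_size)
        return (q * scales[..., None] + biases[..., None]).reshape(wq.shape[0], -1)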

@MaoJianwei

Is it ready to be merged? We need this :)

#15173

@compilade (Collaborator, Author) commented Aug 9, 2025

Is it ready to be merged? We need this :)

It kind of is, but I wanted to make it more general, and also to repack instead of going through F32 and requantizing. I've described this in #15111 (comment), but the approach here isn't compatible with how MXFP4 is handled for gpt-oss: once the mxfp4 quant method is handled, it will break the repacking that the gpt-oss conversion relies on.

I guess I can change this in a follow-up PR, since mxfp4 isn't handled here yet.

I have started working on a deferred repacking system to transparently handle transformations of pre-quantized tensors (it's a bit more involved than I thought, so I guess it makes sense to leave it for a follow-up pull request).
It's going to be a bit like LoraTorchTensor from convert_lora_to_gguf.py, but for pre-quantized tensors, and with a repacking API for the types gguf-py/gguf/quants.py can currently quantize. It should allow avoiding the round-trip through F32 when possible, which means mxfp4 and also ternary models can be handled more cleanly (pre-quantized tensors will also keep a compatible type when using --outtype auto in what I'm planning).
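As a loose illustration of the idea (not the actual design), such a wrapper could carry the packed payload plus its source quant type, and only decide between repacking and dequantizing once the target GGUF type is known:

    from dataclasses import dataclass
    from typing import Callable

    import numpy as np


    @dataclass
    class PreQuantizedTensor:
        # Hypothetical deferred-repack wrapper: the packed payload stays untouched
        # until the target type is decided.
        quant_method: str                                        # e.g. "gptq" (source format)
        load_packed: Callable[[], dict[str, np.ndarray]]         # lazily loads the raw parts
        dequantize: Callable[[dict[str, np.ndarray]], np.ndarray]

        def to_gguf(self, target_type: str,
                    repackers: dict[tuple[str, str], Callable[[dict[str, np.ndarray]], np.ndarray]]):
            packed = self.load_packed()
            repack = repackers.get((self.quant_method, target_type))
            if repack is not None:
                # Direct repack: reshuffle bits into the GGUF block layout
                # without ever materializing F32.
                return repack(packed)
            # Fallback: dequantize to F32 and let gguf-py requantize as usual.
            return self.dequantize(packed)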

So this can be considered ready to be merged (assuming this doesn't break remote conversion), and a follow-up pull-request will handle repacking (instead of requantizing).

@compilade compilade marked this pull request as ready for review August 9, 2025 04:19
@MaoJianwei

So this can be considered ready to be merged (assuming this doesn't break remote conversion), and a follow-up pull-request will handle repacking (instead of requantizing).

It sounds good :)

@BugBusterMax

How can I convert llama3.2_1b to GGUF after GPTQ quantization?

@CISC (Collaborator) commented Aug 10, 2025

@compilade Preferably #14737 should be merged first, then you can rebase this.
