convert : handle pre-quantized models #14810
Conversation
Perfect!
In case you feel like it, add support for MXFP4 as well. I will be upstreaming a ggml implementation soon and it would be nice to have HF conversion support. You can use some of the smaller models here https://huggingface.co/models?sort=created&search=mxfp4 (any of them without Hadamard matrices should work).
In case you were wondering, the workaround was about this line: conversion would fail because that tensor doesn't exist. Edit: I might have used the […]
MLX support could be useful, they have

Edit:

    "quantization": {
        "group_size": 32,
        "bits": 5
    },
    "quantization_config": {
        "group_size": 32,
        "bits": 5
    },
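For context, these keys live in the model's config.json. A hedged sketch of how a conversion script could detect them (this is not code from this PR; the helper name is made up):

```python
import json
from pathlib import Path


def detect_prequant(model_dir: str) -> dict | None:
    """Return the quantization settings from config.json, if any.

    MLX exports may use "quantization" and/or "quantization_config";
    other exporters (GPTQ, FP8, ...) typically use "quantization_config".
    """
    config = json.loads((Path(model_dir) / "config.json").read_text())
    quant = config.get("quantization_config") or config.get("quantization")
    return quant  # None means the model is not pre-quantized


# Example usage:
# settings = detect_prequant("path/to/mlx-model")
# if settings is not None:
#     print(settings.get("bits"), settings.get("group_size"))
```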
Is it ready to be merged? We need this :)
It kind of is, but I wanted to make it more general, and also repack instead of going through F32 and requantizing. I've described this in #15111 (comment), but the approach here isn't compatible with how MXFP4 is handled for gpt-oss. I mean, whenever the […]

I guess I can change this in a follow-up PR, since I have started working on a deferred repacking system to transparently handle transformations of pre-quantized tensors. (It's a bit more involved than I thought, so I guess it makes sense to leave it for a follow-up pull request.)

So this can be considered ready to be merged (assuming this doesn't break remote conversion), and a follow-up pull request will handle repacking (instead of requantizing).
It sounds good :)
How can I convert llama3.2_1b after GPTQ to GGUF?
@compilade Preferably #14737 should be merged first, then you can rebase this.
Should fix #14762, and also address #3353.
This roughly implements the idea in #14762 (comment) to allow converting from pre-quantized models, by splitting `ModelBase.get_tensors` into an intermediate `ModelBase.index_tensors`, which returns a `dict[str, Callable[[], Tensor]]` that can be modified before `get_tensors` is called. `get_tensors` still keeps the same signature (it still returns an `Iterator[tuple[str, Tensor]]`).
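As a rough illustration of that split (a minimal sketch with made-up tensor names and a toy dequantization step, not the actual convert_hf_to_gguf.py code):

```python
from typing import Callable, Iterator

import torch
from torch import Tensor


class ModelBase:
    def __init__(self):
        # Build the index once; entries can be rewritten before get_tensors() runs,
        # e.g. to wrap a packed weight with a callable that dequantizes it lazily.
        self.model_tensors: dict[str, Callable[[], Tensor]] = self.index_tensors()
        old = self.model_tensors["model.layers.0.mlp.up_proj.weight"]
        self.model_tensors["model.layers.0.mlp.up_proj.weight"] = lambda: old().float()

    def index_tensors(self) -> dict[str, Callable[[], Tensor]]:
        # Map each tensor name to a zero-argument callable that loads it on demand.
        # (A dummy tensor stands in for reading from safetensors shards here.)
        return {
            "model.layers.0.mlp.up_proj.weight": lambda: torch.zeros(32, 32, dtype=torch.float16),
        }

    def get_tensors(self) -> Iterator[tuple[str, Tensor]]:
        # Unchanged signature: tensors are only materialized when yielded.
        for name, get_data in self.model_tensors.items():
            yield name, get_data()
```

The point being that dequantization can wrap the original callables, so nothing is loaded or converted until the writer actually consumes the tensor.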
For now, support for these pre-quantizations has been implemented:

- `bitnet`
- `fp8`
- `gptq` (only 2, 4 and 8 bits, not 3)

The 3-bit variant of GPTQ is more complicated, and so was omitted for now (see the sketch below for why the other widths are straightforward).
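For 2, 4 and 8 bits, 32 is an exact multiple of the bit width, so each packed int32 word holds a whole number of values and unpacking is a plain shift-and-mask; with 3 bits, values straddle word boundaries. A hedged sketch of the regular cases (the exact packing axis, zero-point offset and group mapping vary by GPTQ exporter, so this is illustrative rather than the code in this PR):

```python
import numpy as np


def unpack_gptq_rows(qweight: np.ndarray, bits: int) -> np.ndarray:
    """Unpack bits-wide integers from packed int32 words along the first axis.

    Each int32 holds 32 // bits values, least-significant bits first
    (the layout commonly used by GPTQ-style checkpoints for 2/4/8 bits).
    """
    assert bits in (2, 4, 8), "3-bit packing straddles word boundaries; not handled here"
    vals_per_word = 32 // bits
    mask = np.uint32((1 << bits) - 1)
    q = qweight.astype(np.uint32)
    shifts = np.arange(vals_per_word, dtype=np.uint32) * bits
    # (rows, cols) -> (rows, vals_per_word, cols) -> (rows * vals_per_word, cols)
    unpacked = (q[:, None, :] >> shifts[None, :, None]) & mask
    return unpacked.reshape(-1, qweight.shape[1])


# Example: with bits=4, a (in_features // 8, out_features) int32 qweight unpacks
# to an (in_features, out_features) array of 4-bit codes, which can then be
# mapped to floats using the checkpoint's scales and zero points.
```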
Notes

- This removes `ModelBase.tensor_names` in favor of `self.model_tensors`, which also allows getting the tensor data (because it's a `dict[str, Callable[[], Tensor]]`).
- There was logic that removed the `lm_head.weight` tensor from `self.tensor_names`, but I don't see why it's necessary. I've removed it because `self.tensor_names` was also removed.
- I've tested a `--dry-run` of a `--remote` conversion of https://huggingface.co/WisdomShell/CodeShell-7B, which does run the tensor index completeness check. No problem either.

TODO

- Test remote conversion (`--remote`)