@EAddario commented Aug 24, 2025

This PR introduces a new --target-bpw option implementing an optimised quant type selection algorithm that automatically determines per-tensor quantisation types to achieve a target bits-per-weight (bpw) with minimal estimated quality loss.

The selection algorithm:

  • builds a candidate set of quant types (K or IQ types)
  • for each layer/tensor, simulates quantise→dequantise for every candidate type and estimates the error with a weighted MSE function. If the imatrix includes activations, a bias penalty term is added to better reflect forward-pass impact, making the error estimate, and therefore the quant type selection, more accurate
  • filters the candidates to the Pareto frontier (lowest error for a given size), then starts from the smallest-bpw mix and upgrades tensors to larger formats, picking the best error reduction per added bit each time, until the global bpw budget is reached (see the sketch below)
  • returns a map of tensor name → ggml_type overrides, which the main quantisation pass uses. If the minimum achievable bpw already exceeds the target, that minimum is returned
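
For illustration only, here is a minimal, self-contained sketch of that greedy Pareto-frontier allocation. All structures, names, and numbers below are hypothetical; this is not the PR's actual implementation:

```cpp
// Sketch: filter each tensor's candidates to the Pareto frontier, start every tensor
// at its smallest candidate, then greedily apply the upgrade with the best error
// reduction per added bit until the global bpw budget is spent.
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <limits>
#include <map>
#include <string>
#include <utility>
#include <vector>

struct candidate {
    std::string type;   // stand-in for ggml_type
    double      bpw;    // bits per weight of this quant type
    double      error;  // estimated weighted-MSE error (plus bias penalty when activations exist)
};

struct tensor_info {
    std::string            name;
    int64_t                n_weights;
    std::vector<candidate> candidates;
    size_t                 choice = 0; // index of the currently selected candidate
};

// Keep only non-dominated candidates: lowest error for a given size, sorted by bpw.
static void pareto_filter(std::vector<candidate> & c) {
    std::sort(c.begin(), c.end(), [](const candidate & a, const candidate & b) { return a.bpw < b.bpw; });
    std::vector<candidate> front;
    double best_err = std::numeric_limits<double>::infinity();
    for (const auto & x : c) {
        if (x.error < best_err) { front.push_back(x); best_err = x.error; }
    }
    c = std::move(front);
}

// Greedily upgrade tensors until the model-wide bpw reaches the target.
static void allocate(std::vector<tensor_info> & tensors, double target_bpw) {
    int64_t total_weights = 0;
    for (auto & t : tensors) { pareto_filter(t.candidates); total_weights += t.n_weights; }

    auto total_bits = [&tensors]() {
        double bits = 0.0;
        for (const auto & t : tensors) bits += t.candidates[t.choice].bpw * (double) t.n_weights;
        return bits;
    };

    const double budget_bits = target_bpw * (double) total_weights;

    while (true) {
        int    best_t    = -1;
        double best_gain = 0.0;
        for (int i = 0; i < (int) tensors.size(); ++i) {
            const auto & t = tensors[i];
            if (t.choice + 1 >= t.candidates.size()) continue; // already at the largest candidate
            const candidate & cur = t.candidates[t.choice];
            const candidate & nxt = t.candidates[t.choice + 1];
            const double added_bits = (nxt.bpw - cur.bpw) * (double) t.n_weights;
            if (total_bits() + added_bits > budget_bits) continue; // upgrade would exceed the budget
            const double gain = (cur.error - nxt.error) / added_bits;
            if (gain > best_gain) { best_gain = gain; best_t = i; }
        }
        if (best_t < 0) break; // no affordable upgrade left
        tensors[best_t].choice++;
    }
}

int main() {
    // Toy input: two tensors with made-up candidate types and error estimates.
    std::vector<tensor_info> tensors = {
        { "blk.0.ffn_down.weight", 1000000,
          { { "Q3_K", 3.4375, 0.080 }, { "Q4_K", 4.5, 0.030 }, { "Q6_K", 6.5625, 0.010 } } },
        { "blk.0.attn_v.weight", 500000,
          { { "Q3_K", 3.4375, 0.120 }, { "Q5_K", 5.5, 0.025 }, { "Q8_0", 8.5, 0.005 } } },
    };

    allocate(tensors, /*target_bpw=*/5.0);

    // The result is a map of tensor name -> chosen type, analogous to the override map
    // consumed by the main quantisation pass.
    std::map<std::string, std::string> overrides;
    for (const auto & t : tensors) overrides[t.name] = t.candidates[t.choice].type;
    for (const auto & kv : overrides) std::printf("%s -> %s\n", kv.first.c_str(), kv.second.c_str());
    return 0;
}
```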

The target_bpw_type() function will consider all quantisable tensors (embeddings, output, etc.) unless the --output-tensor-type, --token-embedding-type, and/or --tensor-type options are also used, in which case those take precedence.

--prune-layers can also be used in the same run, in which case target_bpw_type() will skip the pruned layers and only consider the remaining tensors against the total bpw budget.
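
For example, a combined run might look roughly like this (model and imatrix file names are placeholders, and the override/prune arguments follow the existing llama-quantize syntax): tensors matched by --output-tensor-type or --tensor-type keep the explicitly requested type, layers 20 and 21 are pruned, and everything else is selected to hit the target bpw:

```
llama-quantize --imatrix imatrix-with-activations.gguf \
    --target-bpw 4.50 \
    --output-tensor-type q6_k \
    --tensor-type attn_v=q6_k \
    --prune-layers 20,21 \
    LLM-Model-F16.gguf BPW-Quantized.gguf q4_k_m
```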

Important note:

An imatrix that includes activations is required for the algorithm to work. At the time of writing, this is only available by generating the file using #14891 with the --output-format gguf option.

Typical usage:

```
llama-quantize --imatrix imatrix-with-activations.gguf --target-bpw 5.18 LLM-Model-F16.gguf BPW-Quantized-Q4_K_M.gguf q4_k_m
```
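
For completeness, generating the required imatrix might look roughly like the sketch below. The calibration file name is a placeholder and, apart from --output-format gguf (the option referenced in #14891), the flags shown are the standard llama-imatrix options and may differ on that branch:

```
# generate an imatrix that includes activations (requires #14891)
llama-imatrix -m LLM-Model-F16.gguf -f calibration-data.txt --output-format gguf -o imatrix-with-activations.gguf
```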

Special thanks to @ddh0 and @compilade for their contributions during the development of this PR.

PR created in draft until testing is completed
