
imatrix: add option to display importance score statistics for a given imatrix file #12718

Open · EAddario wants to merge 7 commits into master

Conversation

@EAddario commented Apr 2, 2025

A new --show-statistics option generates a report highlighting which tensors/layers contribute the most to a model, sorted from highest influence to lowest. The process computes the average importance score per tensor/layer, calculates each one's % contribution to the total, and exits immediately after printing the report.

This PR can be used along with quantize: Handle user-defined quantization levels for additional tensors to do layer-wise quantization similar to, but not quite the same as, the process described in Layer-Wise Quantization: A Pragmatic and Effective Method for Quantizing LLMs Beyond Integer Bit-Levels.

Output example:

llama-imatrix --in-file imatrix-DeepSeek-R1-Distill-Llama-8B-small.dat --show-statistics

Computing statistics for imatrix-DeepSeek-R1-Distill-Llama-8B-small.dat (225 tensors)

 Layer          Tensor    μ(Importance Scores)    Contribution
================================================================================
     -          output                 5523.92       13.9226 %
    27          attn_v                  356.58        0.8987 %
    27          attn_k                  356.58        0.8987 %
    27          attn_q                  356.58        0.8987 %
    24          attn_k                  347.19        0.8751 %
    24          attn_q                  347.19        0.8751 %
    24          attn_v                  347.19        0.8751 %
    25          attn_q                  346.77        0.8740 %
    25          attn_k                  346.77        0.8740 %
    25          attn_v                  346.77        0.8740 %
    29          attn_v                  344.46        0.8682 %
...
     0          ffn_down                  0.09        0.0002 %
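The aggregation behind the report can be sketched as follows. This is an illustrative Python model of the logic described above, not the actual C++ code in llama-imatrix; the tensor names and score values are made up:

```python
# Sketch of the --show-statistics aggregation: average the importance
# scores per tensor/layer, then express each average as a percentage of
# the grand total, sorted from highest influence to lowest.

def contributions(scores_per_tensor):
    """scores_per_tensor: dict mapping tensor name -> list of importance scores."""
    means = {name: sum(s) / len(s) for name, s in scores_per_tensor.items()}
    total = sum(means.values())
    # Sort from the highest contribution to the lowest, as the report does.
    return sorted(
        ((name, mu, 100.0 * mu / total) for name, mu in means.items()),
        key=lambda t: t[2],
        reverse=True,
    )
```

For example, `contributions({"output": [5000.0, 6047.84], "blk.27.attn_v": [300.0, 413.16]})` returns the per-tensor mean and its share of the summed means, ready to print as in the table above.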

@ngxson (Collaborator) commented Apr 2, 2025

Nice idea, seems like something we discussed last time? @bartowski1182

Btw, is it possible to show importance scores from an existing imatrix file, @EAddario?

@EAddario (Author) commented Apr 2, 2025

Thank you @ngxson. Yes, it will process any imatrix file produced by llama-imatrix, but it is restricted to a single file (it does not handle multiple --in-file arguments).

@jukofyork (Contributor) commented:
Isn't this just related to the hidden state norms getting larger as you move through the different layers? If so, then it won't really account for the accumulation of errors caused by an early layer on the final output?

@EAddario
Copy link
Author

EAddario commented Apr 6, 2025

Not sure if I'm understanding the comment correctly @jukofyork, but the logic I'm using to identify the most influential tensors/layers is to simply average the importance scores (IS) for each, add those averages together, and then compute their individual contributions from the total.

The logic llama-imatrix uses to calculate the IS is to square the value of the corresponding weight during inference, keep a running total of how many times that particular value has been updated, and then save the average when inference has finished.

This only applies to 2d or larger tensors, so it will ignore norms (1d), but since errors influence which weights get updated (and how frequently), the IS does account for errors, albeit indirectly.

Make sense?
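The bookkeeping described above can be modelled with a short sketch. This is only an illustration of the accumulation (running totals of squared values plus an update count, averaged at the end); the real implementation is the C++ code in llama-imatrix, and the class name here is invented:

```python
# Illustrative model of the importance-score accumulation: keep a running
# sum of squared values and a count of updates per channel, then store
# the average once inference has finished.

class RunningScore:
    def __init__(self, n_channels):
        self.sums = [0.0] * n_channels  # running sums of squared values
        self.count = 0                  # number of updates seen so far

    def update(self, values):
        # Called once per evaluation with the values seen by this tensor.
        for i, v in enumerate(values):
            self.sums[i] += v * v
        self.count += 1

    def averages(self):
        # The per-channel mean of squared values, i.e. the importance score.
        return [s / self.count for s in self.sums]
```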

@compilade (Collaborator) commented:

> Not sure if I'm understanding the comment correctly @jukofyork, but the logic I'm using to identify the most influential tensors/layers is to simply average the importance scores (IS) for each, add those averages together, and then compute their individual contributions from the total.

@EAddario

I think the mean squared activations (which would be their variance, assuming a mean of 0) cannot really be compared across tensors without some kind of normalization, because the values of the model weights can also affect the relative importance of the activations. (llama-imatrix calculates the sum of squared activations and their count; it doesn't directly take the model weights into account. They are only considered when quantizing, and even then it depends on the type.)

The goal here is to find which layers need more precision, right?

I'm not sure if the mean squared activations really are what you're looking for.

There might be other measures like skewness and kurtosis which may be useful. But I'm not sure if taking only the activations into account is the right way to get the insights you seek.


What I'd like to try eventually would be to use a simultaneous quantization algorithm to try multiple bit-widths at once in a reasonable amount of time so that the errors can be compared per tensor to help with the choice of quantization type.

This would be possible for x[i] ≈ q[i] * s types using a cumulative search similar to #12557, but I don't know how to do that with x[i] ≈ q[i] * s - m types yet.


I still think it can be useful to have some way to visualize what is in imatrix files and/or the distribution of the activations. But not all the necessary information is kept in imatrix files, only the per-channel sum of squared activations, which is a bit limiting for this purpose. Adding more measures (like the mean, skewness and kurtosis, either per-tensor or per-channel) in the file would be easier after #9400.
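The extra per-channel measures mentioned above could be computed as below. This is a hypothetical sketch only: none of these statistics are stored in current imatrix files (which keep just the per-channel sum of squared activations), and the function name is invented:

```python
# Hypothetical per-channel measures: mean, skewness, and excess kurtosis.
# Population (biased) definitions are used for simplicity.

def moments(xs):
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    std = var ** 0.5
    if std == 0.0:
        return mean, 0.0, 0.0
    skew = sum(((x - mean) / std) ** 3 for x in xs) / n
    kurt = sum(((x - mean) / std) ** 4 for x in xs) / n - 3.0  # excess kurtosis
    return mean, skew, kurt
```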

In the paper you link (https://arxiv.org/pdf/2406.17415), the closest thing to what you propose would be the LIM (layer input modification) score, which is calculated as follows (in Section 3.1), where $L_i$ is the i-th layer, and $L_i^I$ are the input activations and $L_i^O$ the corresponding output activations:

$$ LIM(L_i) = -\frac{L_i^I \cdot L_i^O}{\left|L_i^I\right| \left|L_i^O\right|} $$

llama-imatrix technically has access to both the input and output activations of a layer, but only uses its input.
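The quoted LIM formula is just the negative cosine similarity between a layer's input and output activations, which can be sketched as follows (vectors are flattened activations; this is an illustration, not code from llama.cpp):

```python
# LIM(L_i) = -(I . O) / (|I| |O|): negative cosine similarity between a
# layer's input activations I and its output activations O.

def lim_score(inp, out):
    dot = sum(a * b for a, b in zip(inp, out))
    norm_i = sum(a * a for a in inp) ** 0.5
    norm_o = sum(b * b for b in out) ** 0.5
    return -dot / (norm_i * norm_o)
```

A layer whose output points away from its input (large modification) scores close to 1, while a layer that mostly passes its input through scores close to -1.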

@EAddario (Author) commented Apr 7, 2025

Very clear now, thanks @compilade. You're correct: I'm using the averaged mean squared activations to identify which tensors/layers produce large-magnitude activations, and whilst I agree it isn't as accurate as, say, correlation / covariance / LIM, I think it's still a reasonable proxy, especially considering how the importance scores are actually used during quantization (quant_weights in ggml-quants.c).

I had a quick look at your PRs. I definitely like the idea of storing imatrix data in GGUF format and can appreciate how it would improve the generation of these types of stats. #12557 is quite intriguing, but truth be told I haven't had a chance to digest it fully (there's a lot going on!). I would love to see it merged, especially if it improves ternary quants.
