
llama : refactor llama_context, llama_kv_cache, llm_build_context (v2) #12181

Merged (16 commits) on Mar 13, 2025

Conversation

@ggerganov (Member) commented Mar 4, 2025

Alternative to #11213

Overview

The implementation in #11213 became too complicated because it tried to make a lot of changes at once. This is an alternative implementation that does not involve abstracting llama_context. The PR introduces some new abstractions, improves the graph build handling, and is an initial step toward the changes listed in the "Next" section below.

  • Rework the old llm_build_context into the new llm_graph_context, implemented in llama-graph.h/.cpp
  • Introduce llm_graph_input_... classes for handling graph inputs in a safer and cleaner way
  • Introduce llm_graph_result for extracting important tensors such as embeddings and logits, instead of searching for them by tensor name
  • Introduce the llm_memory_i concept that will abstract different cache/memory mechanisms (a conceptual sketch follows this list). For now we have only llama_kv_cache as a type of memory
  • Rework session saving/loading to use the new llama_io_write_i and llama_io_read_i interfaces
  • Remove the "worst case" concept from the graph building logic
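To make the llm_memory_i idea more concrete, here is a conceptual sketch. Only the interface name and the fact that llama_kv_cache is one type of memory come from this PR; the methods and signatures below are illustrative assumptions, not the actual llama.cpp definitions.

#include <cstdint>

// Conceptual sketch: different cache/memory mechanisms implement one common
// interface, and the unified KV cache is just one implementation of it.
// Method names and signatures here are assumptions for illustration only.
struct llm_memory_i {
    virtual ~llm_memory_i() = default;

    virtual void clear() = 0;                                         // drop all stored state
    virtual bool seq_rm(int32_t seq_id, int32_t p0, int32_t p1) = 0;  // remove part of a sequence
};

struct llama_kv_cache : public llm_memory_i {
    void clear() override { /* reset the KV cells */ }
    bool seq_rm(int32_t seq_id, int32_t p0, int32_t p1) override { /* ... */ return true; }
};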

API changes

The current changes are only needed to make the API more consistent with the naming convention. To migrate, simply replace the old API calls with the new ones, as in the example after this list.

  • Deprecate llama_kv_cache_... API
  • Add llama_kv_self_... API
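For example, a caller that clears the KV cache or removes part of a sequence would migrate roughly like this (a hedged sketch: these two functions are shown as representative cases, and llama.h has the authoritative list of deprecated and new names):

#include "llama.h"

// Representative migration example; check llama.h for the exact set of renamed functions.
void reset_sequence(llama_context * ctx, llama_seq_id seq_id, llama_pos p0, llama_pos p1) {
    // Before (deprecated):
    //   llama_kv_cache_clear(ctx);
    //   llama_kv_cache_seq_rm(ctx, seq_id, p0, p1);

    // After (new naming):
    llama_kv_self_clear(ctx);
    llama_kv_self_seq_rm(ctx, seq_id, p0, p1);
}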

Next

  • Introduce a new model arch interface and have the different models implement it
  • Add a new class llama_kv_cache_recurrent and remove all recurrent logic from the existing llama_kv_cache_unified class. Simplify llama_kv_cache_unified.

@github-actions bot added the android (Issues specific to Android), examples, python (python script changes), and server labels on Mar 4, 2025
@ggerganov force-pushed the gg/llama-kv-cache-v2 branch 7 times, most recently from 766edbf to 62ba774, on March 7, 2025 11:20
@ggerganov marked this pull request as ready for review on March 7, 2025 11:26
@ggerganov requested a review from ngxson as a code owner on March 7, 2025 11:26
@ggerganov (Member Author)

Planning to merge this tomorrow unless there are any suggestions for improvements.

@ggerganov force-pushed the gg/llama-kv-cache-v2 branch from 62ba774 to a170669 on March 11, 2025 11:53
@ggerganov (Member Author)

From what I understand, the correct mask also includes the tokens that come before the image. The non-causal mask in llama.cpp only masks the image itself and not the text before it (please correct me if I'm wrong).

Yes, this is my concern too. I think it should be easy to fix for the current use case, where we evaluate the text and the vision tokens separately. But we should also figure out how to fix it when, in the future, we support mixed-modality batches that contain both text and vision tokens/embeddings at the same time. Off the top of my head, I think we can simply split such batches into multiple single-modality batches, internally in libllama.
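A minimal sketch of that splitting idea, using made-up types (this is not libllama code, just an illustration of turning a mixed batch into consecutive single-modality runs that can be evaluated one after another):

#include <vector>

// Made-up item type for illustration: each batch entry is either a text token or a
// vision embedding.
struct batch_item {
    bool is_vision;
    // token id or embedding data would live here
};

// Split a mixed-modality batch into consecutive single-modality sub-batches, each of
// which can then be evaluated with the appropriate attention mask.
static std::vector<std::vector<batch_item>> split_by_modality(const std::vector<batch_item> & batch) {
    std::vector<std::vector<batch_item>> out;
    for (const auto & item : batch) {
        if (out.empty() || out.back().back().is_vision != item.is_vision) {
            out.emplace_back();
        }
        out.back().push_back(item);
    }
    return out;
}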

@giladgd (Contributor) commented Mar 13, 2025

I'm getting this error when trying to use any model, for example DeepSeek-R1-Distill-Qwen-14B-IQ2_M:

llama.cpp/src/llama-context.cpp:2032: GGML_ASSERT(n_outputs <= n_outputs_max) failed

After debugging a bit I found that what triggered it is a call to llama_state_get_size (which I call right after creating the context via llama_init_from_model).

@ggerganov (Member Author)

Can you provide a repro command? The state save/load logic might need some refinement after these changes.

@giladgd (Contributor) commented Mar 14, 2025

I didn't encounter this issue via a specific command, but rather in my binding code.
Here's a simple reproduction:

#include <cstdio>
#include <string>

#include "llama.h"

void repro() {
    llama_backend_init();

    auto model_params = llama_model_default_params();
    model_params.n_gpu_layers = 33;

    auto model_path = "/home/user/models/DeepSeek-R1-Distill-Qwen-14B-IQ2_M.gguf";
    auto model = llama_load_model_from_file(model_path, model_params);
    fputs("model loaded\n", stdout);
    fflush(stdout);

    auto context_params = llama_context_default_params();
    auto ctx = llama_init_from_model(model, context_params);
    fputs("context created\n", stdout);
    fflush(stdout);

    auto state_size = llama_state_get_size(ctx);
    fputs(("State size: " + std::to_string(state_size) + "\n").c_str(), stdout);
    fflush(stdout);

    llama_free(ctx);
    llama_free_model(model);

    llama_backend_free();
}

jpohhhh pushed a commit to Telosnex/llama.cpp that referenced this pull request Mar 14, 2025

llama : refactor llama_context, llama_kv_cache, llm_build_context (ggml-org#12181)

* llama : refactor llama_context, llama_kv_cache, llm_build_context
* graph : don't mutate the KV cache during defrag
* context : reduce virtuals + remove test function
* context : move interface implementation to source file + factory
* graph : move KV cache build functions to llama_context impl
* graph : remove model reference from build_pooling
* graph : remove llama_model reference
* kv_cache : provide rope factors
* graph : rework inputs to use only unique_ptr, remove attn input abstraction
* context : remove llama_context_i abstraction
* context : clean-up
* graph : clean-up
* llama : remove redundant keywords (struct, enum)
* model : adapt gemma3
* graph : restore same attention ops as on master
* llama : remove TODO + fix indent
@ggerganov (Member Author)

@giladgd #12397 should fix this.

@fairydreaming (Collaborator)

@ggerganov I noticed that T5 models no longer work correctly after merging this PR, so I investigated possible causes.

I see that you removed the is_encoding flag that previously controlled KQ mask creation during the encoding phase, so the T5 encoder currently uses a causal attention mask, which is wrong. Another problem is that in the T5 decoder implementation, build_attn() with an llm_graph_input_attn_kv_unified input expects a "2D" V tensor, as indicated by this assert:

assert(v_cur->ne[0] == n_embd_v_gqa && v_cur->ne[1] == n_tokens);

but you pass a "3D" V tensor here:

Vcur = ggml_reshape_3d(ctx0, Vcur, n_embd_head, n_head_kv, n_tokens);

This results in ggml_transpose() transposing the wrong dimensions in non-debug builds and an assertion failure in debug builds.

I found that removing this single line fixed the problem. I'd correct it myself, but I'm not sure how you intend to handle the is_encoding problem, so I'm leaving it to you.
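To make the mismatch concrete, here is a standalone illustration with made-up sizes (not values from a real model): the assert checks the "2D" layout { n_embd_v_gqa, n_tokens }, while the extra reshape produces the "3D" layout { n_embd_head, n_head_kv, n_tokens }, so transposing the first two dimensions swaps the wrong pair.

#include <cstdint>
#include <cstdio>

int main() {
    // example sizes only
    const int64_t n_embd_head  = 64;
    const int64_t n_head_kv    = 8;
    const int64_t n_tokens     = 32;
    const int64_t n_embd_v_gqa = n_embd_head * n_head_kv;

    const int64_t v_2d[2] = { n_embd_v_gqa, n_tokens };             // layout expected by build_attn()
    const int64_t v_3d[3] = { n_embd_head, n_head_kv, n_tokens };   // layout after the extra reshape

    // Transposing the first two dimensions swaps (n_embd_v_gqa, n_tokens) in the 2D
    // case, as intended, but (n_embd_head, n_head_kv) in the 3D case.
    std::printf("2D: swap (%lld, %lld)\n", (long long) v_2d[0], (long long) v_2d[1]);
    std::printf("3D: swap (%lld, %lld)\n", (long long) v_3d[0], (long long) v_3d[1]);
    return 0;
}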

@ggerganov (Member Author)

Another problem is that in the T5 decoder implementation, build_attn() with an llm_graph_input_attn_kv_unified input expects a "2D" V tensor, as indicated by this assert:

Thanks for catching that. I broke this in commit 70ef653. The reason is that I wanted the PR to produce the same graphs as on master, and this extra reshape was causing some small differences. I think it is best to restore the reshape so that all three Q, K, V tensors are passed as 3D tensors, for consistency.

I see that you removed the is_encoding flag that previously controlled KQ mask creation during the encoding phase, so the T5 encoder currently uses a causal attention mask, which is wrong.

Maybe the user code should explicitly set the attention type? Btw, this probably explains the differences that I referred to in this #12181 (comment).

@fairydreaming (Collaborator)

I see that you removed the is_encoding flag that previously controlled KQ mask creation during the encoding phase, so the T5 encoder currently uses a causal attention mask, which is wrong.

Maybe the user code should explicitly set the attention type? Btw, this probably explains the differences that I referred to in this #12181 (comment).

Do you mean something like this?

if (llama_model_has_encoder(model)) {
   llama_set_causal_attn(lctx, false);
   llama_encode(...);
   llama_set_causal_attn(lctx, true);
}

I just tested it and it works fine. Maybe an extra assert in encode() that prints some info when causal_attn is set to true would be good too; otherwise, existing code will silently stop working correctly for no apparent reason.

@ggerganov (Member Author)

Yes, that's what I have in mind, but it is too cumbersome and error prone. Maybe, temporarily, we should set causal_attn = false internally for all encode calls and restore the value it had before the call.

Ideally, we need to have separate contexts for the encoder and the decoder of such models so that we can configure them independently, but this is not ready yet.
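A minimal sketch of that temporary workaround, using stand-in types (in llama.cpp the flag lives in the context's cparams; encode_impl() here is a placeholder for the real work, not an actual function):

// Force non-causal attention for the duration of an encode call and restore the
// previous setting afterwards. The types are minimal stand-ins for illustration.
struct cparams_sketch { bool causal_attn = true; };
struct context_sketch { cparams_sketch cparams; };

static int encode_impl(context_sketch & /*ctx*/) { return 0; }  // placeholder for the actual encode work

int encode(context_sketch & ctx) {
    const bool causal_attn_prev = ctx.cparams.causal_attn;

    ctx.cparams.causal_attn = false;             // encoders use a non-causal mask
    const int ret = encode_impl(ctx);
    ctx.cparams.causal_attn = causal_attn_prev;  // restore the caller's setting

    return ret;
}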

@fairydreaming (Collaborator)

Yes, that's what I have in mind, but it is too cumbersome and error prone. Maybe, temporarily, we should set causal_attn = false internally for all encode calls and restore the value it had before the call.

Ideally, we need to have separate contexts for the encoder and the decoder of such models so that we can configure them independently, but this is not ready yet.

@ggerganov I guess the "cleanest" solution would be to add llm_graph_input_attn_no_cache_enc and build_attn_inp_no_cache_enc(), which would be used only by the encoder and would create the KQ mask for the encoder. I see that you already do a similar thing with inp->pos_bucket: there are separate build_inp_pos_bucket_enc() and build_inp_pos_bucket_dec() methods in llm_graph_context for the encoder and the decoder.

It could always create a non-causal mask, since I don't know of any models that use causal attention in the encoder. If one appears, handling it would be a matter of adding a new causal_attn_enc flag to hparams and cparams and creating the KQ mask for the encoder based on its value.

@ggerganov (Member Author)

It's hard to decide how to do it exactly. For now, here is a simple patch that should work:

#12447

@fairydreaming (Collaborator)

@ggerganov There seems to be another problem with the refactor, which manifests when using the CUDA backend with T5 models. From what I understand, the problem is that you copy the encoder output here:

memcpy(cross.v_embd.data(), embd, ggml_nbytes(t_embd));

without making sure the encoder graph has finished computing. When I added a ggml_backend_synchronize() call earlier in the flow, it started working correctly:

diff --git a/src/llama-context.cpp b/src/llama-context.cpp
index 42332acf..8d441b0c 100644
--- a/src/llama-context.cpp
+++ b/src/llama-context.cpp
@@ -1100,7 +1100,8 @@ int llama_context::encode(llama_batch & inp_batch) {
                 {
                     // extract token embeddings
                     GGML_ASSERT(n_tokens*n_embd <= (int64_t) embd_size);
-                    ggml_backend_tensor_get_async(backend_embd, t_embd, embd, 0, n_tokens*n_embd*sizeof(float));
+                    ggml_backend_synchronize(backend_embd);
+                    ggml_backend_tensor_get(t_embd, embd, 0, n_tokens*n_embd*sizeof(float));
                 } break;
             case LLAMA_POOLING_TYPE_MEAN:
             case LLAMA_POOLING_TYPE_CLS:

arthw pushed a commit to arthw/llama.cpp that referenced this pull request Mar 19, 2025
@giladgd (Contributor) commented Mar 22, 2025

I'm getting a segmentation fault when using llama_adapter_lora_init with the latest master, and I think it might be related to this PR since I haven't encountered it before.
It only happens when not offloading layers to the GPU.

Here's a simple reproduction code:

#include <cstdio>

#include "llama.h"

void repro() {
    llama_backend_init();

    auto model_params = llama_model_default_params();
    model_params.n_gpu_layers = 0;

    // https://huggingface.co/bartowski/Meta-Llama-3-8B-Instruct-GGUF/blob/main/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf
    auto model_path = "/home/user/models/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf";
    auto model = llama_model_load_from_file(model_path, model_params);
    fputs("model loaded\n", stdout);
    fflush(stdout);

    // https://huggingface.co/ngxson/test_gguf_lora_adapter/blob/main/lora-Llama-3-Instruct-abliteration-LoRA-8B-f16.gguf
    auto lora_path = "/home/user/models/lora-Llama-3-Instruct-abliteration-LoRA-8B-f16.gguf";
    auto lora = llama_adapter_lora_init(model, lora_path);
    fputs("lora created\n", stdout);
    fflush(stdout);

    llama_adapter_lora_free(lora);
    llama_model_free(model);

    llama_backend_free();
}

Here's a stack trace from gdb on an Ubuntu 22.04 machine when compiled with no GPU support:

Stack trace
#0  0x00007fffceb0d360 in ggml_backend_cpu_aarch64_buffer_set_tensor (buffer=<optimized out>, tensor=0x5faeb80, data=0x609d090, offset=<optimized out>, size=262144) at /home/user/llama.cpp/ggml/src/ggml-cpu/ggml-cpu-aarch64.cpp:5632
        tensor_traits = 0x0
        OK = <optimized out>
#1  0x00007fffceee8a75 in operator() (dev=0x5c3a3a0, orig=<optimized out>, __closure=<synthetic pointer>) at /home/user/llama.cpp/src/llama-adapter.cpp:316
        offs = 100330736
        size = 262144
        ctx_gguf = <optimized out>
        read_buf = <optimized out>
        gguf_file = <optimized out>
        ctx_gguf = <optimized out>
        read_buf = <optimized out>
        gguf_file = <optimized out>
        offs = <optimized out>
        size = <optimized out>
#2  llama_adapter_lora_init_impl (model=..., path_lora=0x1c26f6e6022 <error: Cannot access memory at address 0x1c26f6e6022>, adapter=...) at /home/user/llama.cpp/src/llama-adapter.cpp:321
        orig = {a = <optimized out>, b = 0x5f8b910}
        dev = {a = 0x5c3a3a0, b = 0x5d688f0}
        it = {first = "blk.9.ffn_up.weight", second = {a = 0x5faeb80, b = 0x5faecf0}}
        __for_range = std::unordered_map with 0 elements = {[""] = {a = 0x0, b = 0xc3a4c491c3b8c290}<error reading variable: Cannot access memory at address 0xc2b8c2a0c3a3c4b8>...}
        __for_begin = <optimized out>
        __for_end = <optimized out>
        gguf_file = {pimpl = std::unique_ptr<llama_file::impl> = {get() = 0x5b417c0}}
        read_buf = std::vector of length 262144, capacity 262144 = {0 '\000', 60 '<', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 
          0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 
          0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 
          0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 
          0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 
          0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 
          0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000'...}
        set_tensor = <optimized out>
        __func__ = "llama_adapter_lora_init_impl"
        ctx_init = 0x5c3a3a0
        meta_gguf_params = <optimized out>
        ctx_gguf = std::unique_ptr<gguf_context> = {get() = 0x5faeb80}
        ctx = std::unique_ptr<ggml_context> = {get() = 0x7fffffff7bb8}
        n_tensors = <optimized out>
        ctx_map = std::map with 2 elements = {[0x7fffceb41720 <ggml_backend_cpu_aarch64_buffer_type()::ggml_backend_cpu_buffer_type_aarch64>] = 0x5a19490, [0x7ffff408bd40 <ggml_backend_cpu_buffer_from_ptr_type()::ggml_backend_cpu_buffer_type>] = 0x5a1b920}
        ctx_for_buft = <optimized out>
        ab_map = std::map with 225 elements = {["blk.0.attn_k.weight"] = {a = 0x5f64aa0, b = 0x5f64c10}, ["blk.0.attn_output.weight"] = {a = 0x5f64d80, b = 0x5f64ef0}, ["blk.0.attn_q.weight"] = {a = 0x5f65060, b = 0x5f651d0}, ["blk.0.attn_v.weight"] = {a = 0x5f65340, b = 0x5f654b0}, ["blk.0.ffn_down.weight"] = {a = 0x5f64200, 
            b = 0x5f64370}, ["blk.0.ffn_gate.weight"] = {a = 0x5f644e0, b = 0x5f64650}, ["blk.0.ffn_up.weight"] = {a = 0x5f647c0, b = 0x5f64930}, ["blk.1.attn_k.weight"] = {a = 0x5f65ec0, b = 0x5f66030}, ["blk.1.attn_output.weight"] = {a = 0x5f661a0, b = 0x5f66310}, ["blk.1.attn_q.weight"] = {a = 0x5f66480, b = 0x5f665f0}, 
          ["blk.1.attn_v.weight"] = {a = 0x5f66760, b = 0x5f668d0}, ["blk.1.ffn_down.weight"] = {a = 0x5f65620, b = 0x5f65790}, ["blk.1.ffn_gate.weight"] = {a = 0x5f65900, b = 0x5f65a70}, ["blk.1.ffn_up.weight"] = {a = 0x5f65be0, b = 0x5f65d50}, ["blk.10.attn_k.weight"] = {a = 0x5f672e0, b = 0x5f67450}, 
          ["blk.10.attn_output.weight"] = {a = 0x5f675c0, b = 0x5f67730}, ["blk.10.attn_q.weight"] = {a = 0x5f678a0, b = 0x5f67a10}, ["blk.10.attn_v.weight"] = {a = 0x5f67b80, b = 0x5f67cf0}, ["blk.10.ffn_down.weight"] = {a = 0x5f66a40, b = 0x5f66bb0}, ["blk.10.ffn_gate.weight"] = {a = 0x5f66d20, b = 0x5f66e90}, 
          ["blk.10.ffn_up.weight"] = {a = 0x5f67000, b = 0x5f67170}, ["blk.11.attn_k.weight"] = {a = 0x5f68700, b = 0x5f68870}, ["blk.11.attn_output.weight"] = {a = 0x5f689e0, b = 0x5f68b50}, ["blk.11.attn_q.weight"] = {a = 0x5f68cc0, b = 0x5f68e30}, ["blk.11.attn_v.weight"] = {a = 0x5f68fa0, b = 0x5f69110}, 
          ["blk.11.ffn_down.weight"] = {a = 0x5f67e60, b = 0x5f67fd0}, ["blk.11.ffn_gate.weight"] = {a = 0x5f68140, b = 0x5f682b0}, ["blk.11.ffn_up.weight"] = {a = 0x5f68420, b = 0x5f68590}, ["blk.12.attn_k.weight"] = {a = 0x5f69b20, b = 0x5f69c90}, ["blk.12.attn_output.weight"] = {a = 0x5f69e00, b = 0x5f69f70}, 
          ["blk.12.attn_q.weight"] = {a = 0x5f6a0e0, b = 0x5f6a250}, ["blk.12.attn_v.weight"] = {a = 0x5f6a3c0, b = 0x5f6a530}, ["blk.12.ffn_down.weight"] = {a = 0x5f69280, b = 0x5f693f0}, ["blk.12.ffn_gate.weight"] = {a = 0x5f69560, b = 0x5f696d0}, ["blk.12.ffn_up.weight"] = {a = 0x5f69840, b = 0x5f699b0}, 
          ["blk.13.attn_k.weight"] = {a = 0x5f6af40, b = 0x5f6b0b0}, ["blk.13.attn_output.weight"] = {a = 0x5f6b220, b = 0x5f6b390}, ["blk.13.attn_q.weight"] = {a = 0x5f6b500, b = 0x5f6b670}, ["blk.13.attn_v.weight"] = {a = 0x5f6b7e0, b = 0x5f6b950}, ["blk.13.ffn_down.weight"] = {a = 0x5f6a6a0, b = 0x5f6a810}, 
          ["blk.13.ffn_gate.weight"] = {a = 0x5f6a980, b = 0x5f6aaf0}, ["blk.13.ffn_up.weight"] = {a = 0x5f6ac60, b = 0x5f6add0}, ["blk.14.attn_k.weight"] = {a = 0x5f6c360, b = 0x5f6c4d0}, ["blk.14.attn_output.weight"] = {a = 0x5f6c640, b = 0x5f6c7b0}, ["blk.14.attn_q.weight"] = {a = 0x5f6c920, b = 0x5f6ca90}, 
          ["blk.14.attn_v.weight"] = {a = 0x5f6cc00, b = 0x5f6cd70}, ["blk.14.ffn_down.weight"] = {a = 0x5f6bac0, b = 0x5f6bc30}, ["blk.14.ffn_gate.weight"] = {a = 0x5f6bda0, b = 0x5f6bf10}, ["blk.14.ffn_up.weight"] = {a = 0x5f6c080, b = 0x5f6c1f0}, ["blk.15.attn_k.weight"] = {a = 0x5f6d780, b = 0x5f6d8f0}, 
          ["blk.15.attn_output.weight"] = {a = 0x5f6da60, b = 0x5f6dbd0}, ["blk.15.attn_q.weight"] = {a = 0x5f6dd40, b = 0x5f6deb0}, ["blk.15.attn_v.weight"] = {a = 0x5f6e020, b = 0x5f6e190}, ["blk.15.ffn_down.weight"] = {a = 0x5f6cee0, b = 0x5f6d050}, ["blk.15.ffn_gate.weight"] = {a = 0x5f6d1c0, b = 0x5f6d330}, 
          ["blk.15.ffn_up.weight"] = {a = 0x5f6d4a0, b = 0x5f6d610}, ["blk.16.attn_k.weight"] = {a = 0x5f6eba0, b = 0x5f6ed10}, ["blk.16.attn_output.weight"] = {a = 0x5f6ee80, b = 0x5f6eff0}, ["blk.16.attn_q.weight"] = {a = 0x5f6f160, b = 0x5f6f2d0}, ["blk.16.attn_v.weight"] = {a = 0x5f6f440, b = 0x5f6f5b0}, 
          ["blk.16.ffn_down.weight"] = {a = 0x5f6e300, b = 0x5f6e470}, ["blk.16.ffn_gate.weight"] = {a = 0x5f6e5e0, b = 0x5f6e750}, ["blk.16.ffn_up.weight"] = {a = 0x5f6e8c0, b = 0x5f6ea30}, ["blk.17.attn_k.weight"] = {a = 0x5f6ffc0, b = 0x5f70130}, ["blk.17.attn_output.weight"] = {a = 0x5f702a0, b = 0x5f70410}, 
          ["blk.17.attn_q.weight"] = {a = 0x5f70580, b = 0x5f706f0}, ["blk.17.attn_v.weight"] = {a = 0x5f70860, b = 0x5f709d0}, ["blk.17.ffn_down.weight"] = {a = 0x5f6f720, b = 0x5f6f890}, ["blk.17.ffn_gate.weight"] = {a = 0x5f6fa00, b = 0x5f6fb70}, ["blk.17.ffn_up.weight"] = {a = 0x5f6fce0, b = 0x5f6fe50}, 
          ["blk.18.attn_k.weight"] = {a = 0x5f713e0, b = 0x5f71550}, ["blk.18.attn_output.weight"] = {a = 0x5f716c0, b = 0x5f71830}, ["blk.18.attn_q.weight"] = {a = 0x5f719a0, b = 0x5f71b10}, ["blk.18.attn_v.weight"] = {a = 0x5f71c80, b = 0x5f71df0}, ["blk.18.ffn_down.weight"] = {a = 0x5f70b40, b = 0x5f70cb0}, 
          ["blk.18.ffn_gate.weight"] = {a = 0x5f70e20, b = 0x5f70f90}, ["blk.18.ffn_up.weight"] = {a = 0x5f71100, b = 0x5f71270}, ["blk.19.attn_k.weight"] = {a = 0x5f72800, b = 0x5f72970}, ["blk.19.attn_output.weight"] = {a = 0x5f72ae0, b = 0x5f72c50}, ["blk.19.attn_q.weight"] = {a = 0x5f72dc0, b = 0x5f72f30}, 
          ["blk.19.attn_v.weight"] = {a = 0x5f730a0, b = 0x5f73210}, ["blk.19.ffn_down.weight"] = {a = 0x5f71f60, b = 0x5f720d0}, ["blk.19.ffn_gate.weight"] = {a = 0x5f72240, b = 0x5f723b0}, ["blk.19.ffn_up.weight"] = {a = 0x5f72520, b = 0x5f72690}, ["blk.2.attn_k.weight"] = {a = 0x5f73c20, b = 0x5f73d90}, 
          ["blk.2.attn_output.weight"] = {a = 0x5f73f00, b = 0x5f74070}, ["blk.2.attn_q.weight"] = {a = 0x5f741e0, b = 0x5f74350}, ["blk.2.attn_v.weight"] = {a = 0x5f744c0, b = 0x5f74630}, ["blk.2.ffn_down.weight"] = {a = 0x5f73380, b = 0x5f734f0}, ["blk.2.ffn_gate.weight"] = {a = 0x5f73660, b = 0x5f737d0}, 
          ["blk.2.ffn_up.weight"] = {a = 0x5f73940, b = 0x5f73ab0}, ["blk.20.attn_k.weight"] = {a = 0x5f75040, b = 0x5f751b0}, ["blk.20.attn_output.weight"] = {a = 0x5f75320, b = 0x5f75490}, ["blk.20.attn_q.weight"] = {a = 0x5f75600, b = 0x5f75770}, ["blk.20.attn_v.weight"] = {a = 0x5f758e0, b = 0x5f75a50}, 
          ["blk.20.ffn_down.weight"] = {a = 0x5f747a0, b = 0x5f74910}, ["blk.20.ffn_gate.weight"] = {a = 0x5f74a80, b = 0x5f74bf0}, ["blk.20.ffn_up.weight"] = {a = 0x5f74d60, b = 0x5f74ed0}, ["blk.21.attn_k.weight"] = {a = 0x5f76460, b = 0x5f765d0}, ["blk.21.attn_output.weight"] = {a = 0x5f76740, b = 0x5f768b0}...}
        str_endswith = <optimized out>
#3  0x00007fffceee90b2 in llama_adapter_lora_init (model=0x59ad9e0, path_lora=0x7ffff41505d0 "/home/user/models/lora-Llama-3-Instruct-abliteration-LoRA-8B-f16.gguf") at /home/user/llama.cpp/src/llama-adapter.cpp:333
        adapter = 0x5d69b40
        __func__ = "llama_adapter_lora_init"
#4  0x00007ffff412563b in repro () at /home/user/repro/repro.cpp:236
        model_params = {devices = 0x0, n_gpu_layers = 0, split_mode = LLAMA_SPLIT_MODE_LAYER, main_gpu = 0, tensor_split = 0x0, progress_callback = 0x0, progress_callback_user_data = 0x0, kv_overrides = 0x0, vocab_only = false, use_mmap = true, use_mlock = false, check_tensors = false}
        model_path = 0x7ffff4150568 "/home/user/models/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf"
        model = 0x59ad9e0
        lora_path = 0x7ffff41505d0 "/home/user/models/lora-Llama-3-Instruct-abliteration-LoRA-8B-f16.gguf"
        lora = <optimized out>

@ggerganov (Member Author)

@giladgd Could you confirm that the cause is actually #12332? I think weights repacking is currently not compatible with using LoRA adapters.

@giladgd (Contributor) commented Mar 25, 2025

@ggerganov I've run some tests and you're right, #12332 is indeed the cause. Sorry for the confusion.
