llama : refactor llama_context, llama_kv_cache, llm_build_context (v2) #12181
Conversation
Planning to merge this tomorrow unless there are any suggestions for improvements.
Yes, this is my concern too. I think it should be easy to fix for the current use case where we evaluate the text and the vision tokens separately. But I think we should also figure out a way to fix it when, in the future, we have support for mixed-modality batches that contain both text and vision tokens/embeddings at the same time. Off the top of my head, I think we can simply split such batches into multiple single-modality batches internally.
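Purely as an illustration of that splitting idea (the mixed_item type and split_by_modality helper below are hypothetical and not part of the llama.cpp API):

```cpp
#include <vector>

// Hypothetical item of a mixed-modality batch: either a text token or a vision embedding.
struct mixed_item {
    bool is_vision; // true -> vision embedding, false -> text token
    int  payload;   // stand-in for a token id or an embedding index
};

// Partition a mixed batch into contiguous single-modality sub-batches that could then be
// evaluated one after another with the existing text/vision paths.
static std::vector<std::vector<mixed_item>> split_by_modality(const std::vector<mixed_item> & batch) {
    std::vector<std::vector<mixed_item>> out;
    for (const auto & item : batch) {
        if (out.empty() || out.back().back().is_vision != item.is_vision) {
            out.emplace_back(); // modality changed -> start a new sub-batch
        }
        out.back().push_back(item);
    }
    return out;
}
```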
I'm getting this error when trying to use any model, for example:

After debugging a bit, I found that what triggers it is the call to llama_state_get_size.
Can you provide a repro command? The state save/load logic might need some refinement after these changes.
I haven't encountered this issue using a command, but rather in my binding code:

#include <cstdio>
#include <string>

#include "llama.h"

void repro() {
    llama_backend_init();

    auto model_params = llama_model_default_params();
    model_params.n_gpu_layers = 33;

    auto model_path = "/home/user/models/DeepSeek-R1-Distill-Qwen-14B-IQ2_M.gguf";
    auto model = llama_load_model_from_file(model_path, model_params);
    fputs("model loaded\n", stdout);
    fflush(stdout);

    auto context_params = llama_context_default_params();
    auto ctx = llama_init_from_model(model, context_params);
    fputs("context created\n", stdout);
    fflush(stdout);

    auto state_size = llama_state_get_size(ctx);
    fputs(("State size: " + std::to_string(state_size) + "\n").c_str(), stdout);
    fflush(stdout);

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
}
llama : refactor llama_context, llama_kv_cache, llm_build_context (ggml-org#12181)

* llama : refactor llama_context, llama_kv_cache, llm_build_context
* graph : don't mutate the KV cache during defrag
* context : reduce virtuals + remove test function
* context : move interface implementation to source file + factory
* graph : move KV cache build functions to llama_context impl
* graph : remove model reference from build_pooling
* graph : remove llama_model reference
* kv_cache : provide rope factors
* graph : rework inputs to use only unique_ptr, remove attn input abstraction
* context : remove llama_context_i abstraction
* context : clean-up
* graph : clean-up
* llama : remove redundant keywords (struct, enum)
* model : adapt gemma3
* graph : restore same attention ops as on master
* llama : remove TODO + fix indent
@ggerganov I noticed that T5 models no longer work correctly after merging this PR, so I investigated possible causes. I see that you removed this:

Line 1381 in 01e8f21

but you pass a "3D" V tensor here:

Line 9566 in 01e8f21

This is what breaks T5 inference; I found that removing this single line fixed the problem. I'd correct it myself, but I'm not sure how you intend to handle this.
Thanks for catching that. I broke this in commit 70ef653. The reason was that I wanted the PR to produce the same graphs as on master.

Maybe the user code should explicitly set the attention type? Btw, this probably explains the differences that I referred to in this #12181 (comment).
Do you mean something like this?
I just tested it and it works fine. Maybe an extra assert in encode() that prints some info if causal_attn is set to true would be good too; otherwise, existing code will silently stop working correctly for no apparent reason.
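For reference, a minimal sketch of what explicitly setting the attention type from user code could look like, assuming the attention_type field of llama_context_params from llama.h (this is not the exact snippet that was tested above):

```cpp
#include "llama.h"

// Create a context that always uses non-causal attention, e.g. for a T5 encoder.
llama_context * make_encoder_context(llama_model * model) {
    llama_context_params cparams = llama_context_default_params();
    cparams.attention_type = LLAMA_ATTENTION_TYPE_NON_CAUSAL; // instead of relying on the default
    return llama_init_from_model(model, cparams);
}
```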
Yes, that's what I have in mind, but it is too cumbersome and error-prone. Maybe we should set this internally as a temporary measure. Ideally, we need to have separate contexts for the encoder and the decoder of such models so that we can configure them independently, but this is not ready yet.
@ggerganov I guess the "cleanest" solution would be to add a dedicated graph input for the encoder attention. It could always create a non-causal mask, since I don't know of any models that use causal attention in the encoder; if any appears, handling it would be a matter of adding a new one.
It's hard to decide how to do it exactly. For now, here is a simple patch that should work:
@ggerganov There seems to be another problem with the refactor that manifests when using the CUDA backend with T5 models. From what I understand, the problem is that you copy the encoder output here:

llama.cpp/src/llama-context.cpp Line 1149 in c6af216

without making sure the encoder graph has finished computing. When I added a ggml_synchronize() call earlier, it started working correctly.
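As a hedged sketch of that idea (not the commenter's actual patch; the scheduler and tensor arguments are placeholders for what llama_context holds internally), the point is to synchronize the backend scheduler before reading the encoder output back to host memory:

```cpp
#include "ggml-backend.h"

#include <vector>

// Copy an encoder output tensor to host memory, making sure the graph that produced it
// has finished executing first. 'sched' and 't_embd' stand in for the scheduler and the
// output tensor used inside llama_context.
void copy_encoder_output(ggml_backend_sched_t sched, const ggml_tensor * t_embd, std::vector<float> & out) {
    ggml_backend_sched_synchronize(sched); // wait for pending (possibly asynchronous) backend work
    out.resize(ggml_nelements(t_embd));
    ggml_backend_tensor_get(t_embd, out.data(), 0, ggml_nbytes(t_embd));
}
```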
I'm getting a segmentation fault when using the LoRA adapter API. Here's a simple reproduction:

#include <cstdio>

#include "llama.h"

void repro() {
    llama_backend_init();

    auto model_params = llama_model_default_params();
    model_params.n_gpu_layers = 0;

    // https://huggingface.co/bartowski/Meta-Llama-3-8B-Instruct-GGUF/blob/main/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf
    auto model_path = "/home/user/models/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf";
    auto model = llama_model_load_from_file(model_path, model_params);
    fputs("model loaded\n", stdout);
    fflush(stdout);

    // https://huggingface.co/ngxson/test_gguf_lora_adapter/blob/main/lora-Llama-3-Instruct-abliteration-LoRA-8B-f16.gguf
    auto lora_path = "/home/user/models/lora-Llama-3-Instruct-abliteration-LoRA-8B-f16.gguf";
    auto lora = llama_adapter_lora_init(model, lora_path);
    fputs("lora created\n", stdout);
    fflush(stdout);

    llama_adapter_lora_free(lora);
    llama_model_free(model);
    llama_backend_free();
}

Here's the stack trace:
@ggerganov I've run some tests and you're right, #12332 is indeed the cause. Sorry for the confusion.
alt #11213
Overview
The implementation in #11213 became too complicated, trying to make a lot of changes at once. This is an alternative implementation, which does not involve the abstraction of the llama_context. The PR introduces some new abstractions, improves the graph build handling and is an initial step for the next changes listed in section "Next" below.

- Rework llm_build_context into the new llm_graph_context, implemented in llama-graph.h/.cpp
- Add llm_graph_input_... classes for handling graph inputs in a safer and cleaner way (a rough sketch follows below)
- Add llm_graph_result for extracting important tensors such as embeddings and logits, instead of searching for them by tensor name
- Add the llm_memory_i concept that will abstract different cache/memory mechanisms; for now we have only llama_kv_cache as a type of memory
- Add the llama_io_write_i and llama_io_read_i interfaces
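A rough sketch of the shape of these graph abstractions (illustrative only; the actual definitions live in src/llama-graph.h and may differ in detail):

```cpp
// Forward declarations standing in for the real types.
struct ggml_tensor;   // from ggml
struct llama_ubatch;  // internal micro-batch type

// Graph inputs become objects that know how to populate their own tensors,
// instead of the context poking at tensors looked up by name.
struct llm_graph_input_i {
    virtual ~llm_graph_input_i() = default;
    virtual void set_input(const llama_ubatch * ubatch) = 0;
};

// The build result exposes the important output tensors directly.
struct llm_graph_result_i {
    virtual ~llm_graph_result_i() = default;
    virtual ggml_tensor * get_logits() = 0;
    virtual ggml_tensor * get_embd()   = 0;
};
```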
API changes

The current changes are only necessary to make the API more consistent in following the naming convention. To migrate, simply replace the old API calls with the new ones (see the example below).

- Deprecate the llama_kv_cache_... API
- Add the llama_kv_self_... API
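For example, a migration sketch (assuming the renamed functions mirror the old ones one-to-one, e.g. llama_kv_cache_clear becoming llama_kv_self_clear):

```cpp
#include "llama.h"

// Clear the KV cache and remove a sequence range using the renamed API.
// Old calls: llama_kv_cache_clear(ctx); llama_kv_cache_seq_rm(ctx, seq_id, p0, p1);
void clear_and_trim(llama_context * ctx, llama_seq_id seq_id, llama_pos p0, llama_pos p1) {
    llama_kv_self_clear(ctx);                  // was: llama_kv_cache_clear(ctx)
    llama_kv_self_seq_rm(ctx, seq_id, p0, p1); // was: llama_kv_cache_seq_rm(ctx, seq_id, p0, p1)
}
```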
Next

- Add class llama_kv_cache_recurrent and remove all recurrent logic from the existing class llama_kv_cache_unified
- Simplify llama_kv_cache_unified