
Eval bug: Segfault at the end of the cache (cache defragmentation?) #12259

Closed as duplicate of #12354
@cpg314

Description


Name and Version

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
version: 4850 (ea002810)
built with cc (GCC) 14.2.1 20250128 for x86_64-pc-linux-gnu

Operating systems

Linux

GGML backends

CUDA, CPU

Hardware

NVIDIA RTX 3060 12 GB VRAM and an AMD Ryzen 9 7900

Models

Mistral Small 24B 2501 with Q4_K_M quantization and q4_0 KV cache quantization

Problem description & steps to reproduce

  • A long message is passed to llama-server (started with -c 2048), producing a long response.
  • The cache reaches the size of the context.
  • A context shift occurs.
  • A defragmentation is triggered.
  • The server segfaults (see logs below).

The segfault does NOT happen:

  • with -c 1024 or with -c 2056
  • when I disable defragmentation with -dt 0
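
For reference, a minimal reproduction sketch based on the description above. The model path is a placeholder, and passing -fa alongside the q4_0 V-cache quantization is an assumption (V-cache quantization typically requires flash attention in llama-server); the other flags match the report:

```shell
# Hypothetical reproduction command; the model path is a placeholder.
# -c 2048 is the context size that triggers the crash.
# -ctk/-ctv q4_0 enable q4_0 KV cache quantization (-fa assumed to be
# needed for the quantized V cache).
./build/bin/llama-server \
  -m models/Mistral-Small-24B-Instruct-2501-Q4_K_M.gguf \
  -c 2048 \
  -fa \
  -ctk q4_0 -ctv q4_0

# Per the observations above, the crash is avoided with:
#   -c 1024 or -c 2056   (different context sizes)
#   -dt 0                (defragmentation disabled)
```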

First Bad Commit

No response

Relevant log output

slot update_slots: id  0 | task 0 | slot decode token, n_ctx = 2048, n_past = 2047, n_cache_tokens = 2047, truncated = 1
srv  update_slots: decoding batch, n_tokens = 1
slot process_toke: id  0 | task 0 | n_decoded = 363, n_remaining = -1, next token:  4440 ' document'
srv  update_slots: run slots completed
que    start_loop: waiting for new tasks
que    start_loop: processing new tasks
que    start_loop: processing task, id = 366
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 367, front = 0
slot update_slots: id  0 | task 0 | slot context shift, n_keep = 1, n_left = 2046, n_discard = 1023
slot update_slots: id  0 | task 0 | slot decode token, n_ctx = 2048, n_past = 1025, n_cache_tokens = 1025, truncated = 1
srv  update_slots: decoding batch, n_tokens = 1
llama_decode_impl: fragmentation: 0.37 - requesting defrag
slot process_toke: id  0 | task 0 | n_decoded = 364, n_remaining = -1, next token:  1395 ' is'
srv  update_slots: run slots completed
que    start_loop: waiting for new tasks
que    start_loop: processing new tasks
que    start_loop: processing task, id = 367
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 368, front = 0
slot update_slots: id  0 | task 0 | slot decode token, n_ctx = 2048, n_past = 1026, n_cache_tokens = 1026, truncated = 1
srv  update_slots: decoding batch, n_tokens = 1
fish: Job 1, './build/bin/llama-server -m /ho…' terminated by signal SIGSEGV (Address boundary error)
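
The context-shift arithmetic in the log can be checked with a small sketch. The halving rule (n_discard = n_left // 2) is inferred from the logged values, not taken from the llama-server source:

```python
# Reconstruct the context-shift numbers from the log above.
# Rule n_discard = n_left // 2 is an assumption inferred from the log.
n_ctx = 2048
n_keep = 1
n_past = 2047                      # cache full, as in the first log line

n_left = n_past - n_keep           # expected 2046, matching the log
n_discard = n_left // 2            # expected 1023, matching the log
n_past_after = n_past - n_discard  # 1024 tokens kept after the shift

# Decoding one more token brings n_past to 1025, as logged.
print(n_left, n_discard, n_past_after + 1)
```

This matches the logged `n_left = 2046, n_discard = 1023` and the subsequent `n_past = 1025`, so the crash happens just after a shift that leaves the cache roughly half full, right when defragmentation is requested at fragmentation 0.37.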
