Eval bug: Segfault at the end of the cache (cache defragmentation?)

### Name and Version

```
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
version: 4850 (ea002810)
built with cc (GCC) 14.2.1 20250128 for x86_64-pc-linux-gnu
```

### Operating systems

Linux

### GGML backends

CUDA, CPU

### Hardware

NVIDIA RTX 3060 12 GB VRAM and an AMD Ryzen 9 7900

### Models

Mistral Small 24B 2501 with Q4_K_M quantization and q4_0 KV cache quantization

### Problem description & steps to reproduce

- A long message is sent to `llama-server` (started with `-c 2048`), producing a long response.
- The cache fills up to the context size.
- A context shift occurs.
- A defrag is requested.
- The server segfaults.

See the log output below.

The segfault does NOT happen:
- with `-c 1024` or with `-c 2056`
- when I disable defragmentation with `-dt 0`

### First Bad Commit

_No response_

### Relevant log output

```shell
slot update_slots: id  0 | task 0 | slot decode token, n_ctx = 2048, n_past = 2047, n_cache_tokens = 2047, truncated = 1
srv  update_slots: decoding batch, n_tokens = 1
slot process_toke: id  0 | task 0 | n_decoded = 363, n_remaining = -1, next token:  4440 ' document'
srv  update_slots: run slots completed
que    start_loop: waiting for new tasks
que    start_loop: processing new tasks
que    start_loop: processing task, id = 366
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 367, front = 0
slot update_slots: id  0 | task 0 | slot context shift, n_keep = 1, n_left = 2046, n_discard = 1023
slot update_slots: id  0 | task 0 | slot decode token, n_ctx = 2048, n_past = 1025, n_cache_tokens = 1025, truncated = 1
srv  update_slots: decoding batch, n_tokens = 1
llama_decode_impl: fragmentation: 0.37 - requesting defrag
slot process_toke: id  0 | task 0 | n_decoded = 364, n_remaining = -1, next token:  1395 ' is'
srv  update_slots: run slots completed
que    start_loop: waiting for new tasks
que    start_loop: processing new tasks
que    start_loop: processing task, id = 367
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 368, front = 0
slot update_slots: id  0 | task 0 | slot decode token, n_ctx = 2048, n_past = 1026, n_cache_tokens = 1026, truncated = 1
srv  update_slots: decoding batch, n_tokens = 1
fish: Job 1, './build/bin/llama-server -m /ho…' terminated by signal SIGSEGV (Address boundary error)
```
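For reference, the context-shift numbers in the log can be reproduced with a little arithmetic. The sketch below is only a reconstruction from the logged values, not the server's actual code: the function name `context_shift`, the trigger condition `n_past + 1 >= n_ctx`, and the "discard half of what is left" rule are assumptions inferred from `n_keep = 1, n_left = 2046, n_discard = 1023`.

```python
def context_shift(n_past: int, n_ctx: int, n_keep: int) -> dict:
    """Hypothetical reconstruction of the shift seen in the log."""
    # Assumption: a shift is triggered when the next token would fill the context.
    if n_past + 1 < n_ctx:
        return {"shifted": False, "n_past": n_past}
    # Tokens eligible for discarding: everything after the kept prefix.
    n_left = n_past - n_keep
    # Assumption: half of the eligible tokens are discarded.
    n_discard = n_left // 2
    return {
        "shifted": True,
        "n_left": n_left,
        "n_discard": n_discard,
        "n_past": n_past - n_discard,
    }


# Values taken from the last "slot decode token" line before the shift.
result = context_shift(n_past=2047, n_ctx=2048, n_keep=1)
print(result)
```

With these inputs the sketch gives `n_left = 2046` and `n_discard = 1023`, and `n_past` drops to 1024. Decoding one more token brings it to 1025, which matches the `n_past = 1025` reported right after the shift. The defrag request and the segfault follow within the next two decode steps.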

Eval bug: Segfault at the end of the cache (cache defragmentation?) #12259
