Feature Request: Is it possible to bring back this CUDA-backend-only feature from Pull Request #13529? #15407

@InfernalDread

Description

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

Hello,

I was very excited about this pull request: #13529

Even if it is only available for the CUDA backend, maintaining this feature without major issues would be a massive quality-of-life improvement for those who do not have powerful hardware.

Thank you for your time and consideration.

Motivation

This would allow those without high-end hardware to use longer context lengths and get better performance from MoE models, since the reduced KV-cache memory requirement makes it possible to offload more layers onto the GPU.

Possible Implementation

No response

Metadata

    Labels: enhancement (New feature or request)
