Prompt Caching & System Prompts #141
nonnull-ca started this conversation in General
Replies: 1 comment
-
Was a solution for this problem ever implemented?
-
One relatively common approach for handling arbitrary-length inputs is, essentially, to have a system prompt or other global info at the start of the context, with the rest of the context filled with as much of the tail end of the input as will fit.
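In token terms the scheme looks roughly like the sketch below (a plain Python sketch with my own function and parameter names, not anything the project actually exposes):

```python
# Rough sketch of the scheme described above: pin the system prompt at the
# front, then fill whatever room is left with the tail end of the history.
# (build_context and its parameters are hypothetical, purely for illustration.)
def build_context(system: list, history: list, n_ctx: int) -> list:
    room = n_ctx - len(system)       # slots left after the system prompt
    if room <= 0:
        return system[:n_ctx]        # degenerate case: system prompt alone fills the window
    return system + history[-room:]  # system prompt + as much of the tail as fits
```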
So e.g. if I had a context length of 8, and my system prompt was `1`, and my current input was `aBcDeFgH`, the resulting context would be `1BcDeFgH`.

Unfortunately, this currently interacts extremely poorly with prompt caching. Suppose the user responds with `i`, and hence the input now becomes `aBcDeFgHi`. Without a system prompt, this becomes a context of `BcDeFgHi`, which the current caching system handles reasonably well. But with a system prompt of `1`, this becomes a context of `1cDeFgHi`... which the cache cannot handle, and as a result the entire input beyond the system prompt needs to be re-ingested.

In chat-like scenarios (model outputs a sentence or two, then user inputs a sentence or two, rinse, repeat), this results in a performance cliff (or rather, a drastic increase in response latency) once the history grows beyond the context size. This is because the number of tokens that must be ingested before output generation can start suddenly jumps from roughly the length of the latest exchange to <context length - system prompt length>. This can be a significant increase, especially now that `-c8` means that longer contexts are viable.

Is there a way around this that allows for somewhat more graceful behavior here?
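To make the cliff concrete before getting into possible workarounds, here's a toy calculation assuming a purely prefix-based cache, i.e. only the longest shared prefix with the previous context is reused (my sketch of the failure mode, not the actual cache code):

```python
# Toy model of prefix-based prompt caching: only the longest common prefix
# of the old and new contexts is reused; everything after it is re-ingested.
def tokens_to_ingest(prev_ctx, new_ctx) -> int:
    reused = 0
    for old_tok, new_tok in zip(prev_ctx, new_ctx):
        if old_tok != new_tok:
            break
        reused += 1
    return len(new_ctx) - reused

# Single characters stand in for tokens, as in the example above.
print(tokens_to_ingest("1aBcDeF",  "1aBcDeFg"))  # 1 -> history still fits; only the new token
print(tokens_to_ingest("1BcDeFgH", "1cDeFgHi"))  # 7 -> everything past the system prompt
```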
One obvious approach is to artificially over-truncate the context. So e.g. go from `1BcDeFgH` to `1gHi`. This would then mean that a few response cycles could happen before having to do a full re-ingestion. Unfortunately, this still means occasional full stalls, just fewer of them, and it costs context length in the process.
Another approach would be to re-add the system prompt at the end any time it drops out of the context window - so `cDeFgHi1` in this example. Unfortunately, attention only allows attending to prior tokens, and as a result models don't do well with input data followed by input instructions. Maybe they could with explicit training, who knows, but not existing models at least.