Prompt Caching & System Prompts #141
nonnull-ca started this conversation in General
Replies: 1 comment
-
Was a solution for this problem ever implemented?
-
One relatively common approach for handling arbitrary-length inputs is, essentially, to have a system prompt or other global info at the start of the context, with the rest of the context filled with as much of the tail end of the input as will fit.
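In token terms the scheme looks roughly like the sketch below (a plain Python sketch with my own function and parameter names, not anything the project actually exposes):

```python
# Rough sketch of the scheme described above: pin the system prompt at the
# front, then fill whatever room is left with the tail end of the history.
# (build_context and its parameters are hypothetical, purely for illustration.)
def build_context(system: list, history: list, n_ctx: int) -> list:
    room = n_ctx - len(system)       # slots left after the system prompt
    if room <= 0:
        return system[:n_ctx]        # degenerate case: system prompt alone fills the window
    return system + history[-room:]  # system prompt + as much of the tail as fits
```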
So e.g. if I had a context length of 8, and my system prompt was `1`, and my current input was `aBcDeFgH`, the resulting context would be `1BcDeFgH`.

Unfortunately, this currently interacts extremely poorly with prompt caching. Suppose the user responds with `i`, and hence the input now becomes `aBcDeFgHi`. Without a system prompt, this becomes a context of `BcDeFgHi`, which the current caching system handles reasonably well. But with a system prompt of `1`, this becomes a context of `1cDeFgHi`... which the cache cannot handle, and as a result the entire input beyond the system prompt needs to be re-ingested.

In chat-like scenarios (model outputs a sentence or two, then user inputs a sentence or two, rinse, repeat), this results in a performance cliff (or rather, a drastic increase in response latency) once the history grows beyond the context size. This is because the number of tokens that must be ingested before output generation can start suddenly jumps from roughly the length of the latest exchange to <context length - system prompt length>. This can be a significant increase, especially now that `-c8` means that longer contexts are viable.

Is there a way around this that allows for somewhat more graceful behavior here?
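To make the cliff concrete before getting into possible workarounds, here's a toy calculation assuming a purely prefix-based cache, i.e. only the longest shared prefix with the previous context is reused (my sketch of the failure mode, not the actual cache code):

```python
# Toy model of prefix-based prompt caching: only the longest common prefix
# of the old and new contexts is reused; everything after it is re-ingested.
def tokens_to_ingest(prev_ctx, new_ctx) -> int:
    reused = 0
    for old_tok, new_tok in zip(prev_ctx, new_ctx):
        if old_tok != new_tok:
            break
        reused += 1
    return len(new_ctx) - reused

# Single characters stand in for tokens, as in the example above.
print(tokens_to_ingest("1aBcDeF",  "1aBcDeFg"))  # 1 -> history still fits; only the new token
print(tokens_to_ingest("1BcDeFgH", "1cDeFgHi"))  # 7 -> everything past the system prompt
```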
One obvious approach is to artificially over-truncate the context. So e.g. go from `1BcDeFgH` to `1gHi`. This would then mean that a few response cycles could happen before having to do a full re-ingestion. Unfortunately, this still means occasional full stalls, just fewer of them, and it costs context length in the process.
Another approach would be to re-add the system prompt at the end any time it drops out of the context window - so `cDeFgHi1` in this example. Unfortunately, attention only allows attending to prior tokens, and as a result models don't do well with input data followed by input instructions. Maybe they could with explicit training, who knows, but not existing models at least.