
Prefilling assistant message in openai compatible API #13174


Merged: 7 commits merged into ggml-org:master on Apr 29, 2025

Conversation

matteoserva (Contributor) commented Apr 29, 2025

This adds support for prefilling the assistant response (or its thought process) using the OpenAI-compatible API.

This feature is offered, for example, by the Claude API.

It can be tested using open-webui or with the following curl command:

curl http://localhost:8080/apply-template \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer no-key" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [
      {
        "role": "system",
        "content": "SYSTEM"
      },
      {
        "role": "user",
        "content": "USERMESSAGE"
      },
      {
        "role": "assistant",
        "content": "ASSISTANT"
      }
    ]
  }'

Example advanced scenario: a time limit for the thinking process (a sketch follows the list):

  • launch a reasoning model and stop its thought early
  • append </think> to its partial response
  • prefill the response and let it continue generating tokens
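
A rough sketch of that flow with the openai Python client (it assumes a llama-server at localhost:8080 serving a reasoning model that wraps its thoughts in <think>...</think>; the question text and the 512-token budget are just placeholders):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="no-key")
messages = [{"role": "user", "content": "Prove that sqrt(2) is irrational."}]

# 1. Let the model think, but cut it off after a fixed token budget.
first = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=messages,
    max_tokens=512,
)
partial = first.choices[0].message.content or ""

# 2. Close the thinking block ourselves if the cut-off landed inside it.
if "</think>" not in partial:
    partial += "\n</think>\n"

# 3. Prefill the truncated response; the trailing assistant message is
#    then continued instead of a new one being started.
second = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=messages + [{"role": "assistant", "content": partial}],
)
print(partial + (second.choices[0].message.content or ""))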

@ngxson ngxson merged commit e2e1ddb into ggml-org:master Apr 29, 2025
47 of 48 checks passed
isaac-mcfadyen (Contributor) commented Apr 30, 2025

Just a heads-up that this is potentially a very breaking change, especially because this is an OpenAI-compatible API and this is not OpenAI's behavior.

The main situation I can think of is someone wanting to generate a new assistant message after the last one, i.e. for ChatML they want <|im_end|><|im_start|>assistant added between the last message and the new one, rather than the last message simply being continued.
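
Roughly (the exact tokens depend on the model's chat template), a trailing assistant message "ASSISTANT" would render differently in the two cases.

With prefill, the last message is left open and generation continues it:

  ...<|im_start|>assistant
  ASSISTANT

With OpenAI's behavior, the last message is closed and a fresh one is started:

  ...<|im_start|>assistant
  ASSISTANT<|im_end|>
  <|im_start|>assistant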

I'd suggest we add this to #9291 at a minimum.

99991 (Contributor) commented May 9, 2025

> Just a heads-up that this is potentially a very breaking change, especially because this is an OpenAI-compatible API and this is not OpenAI's behavior.

A better alternative would be to use an additional "prefix": True key in the message dict, as in the Mistral API.
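
A sketch of what such a request body could look like (the "prefix" key here follows the Mistral convention and is not something this PR implements):

{
  "model": "gpt-3.5-turbo",
  "messages": [
    {"role": "user", "content": "USERMESSAGE"},
    {"role": "assistant", "content": "ASSISTANT", "prefix": true}
  ]
}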

There is also this issue about a prefix API. I think there is an issue with token healing.

matteoserva (Contributor, Author) commented:

The feature is aligned with the Claude API and the open-webui client.

Using "prefix": True would break most clients that expect the current api.

99991 (Contributor) commented May 9, 2025

> The feature is aligned with the Claude API and the open-webui client.
>
> Using "prefix": True would break most clients that expect the current API.

That is because the Claude API is strictly worse than the Mistral API. You can't even tell whether the Claude-style behavior was applied without inspecting the output, and you can't turn it off if you don't want that behavior.

isaac-mcfadyen (Contributor) commented:

> The feature is aligned with the Claude API and the open-webui client.

I believe llama-server is meant to be OpenAI-compatible (and the OpenAI API does not have this behavior), not Claude-compatible.

Using "prefix": True would break most clients that expect the current api.

I believe those clients would still allow adding custom metadata, correct? In that case, using prefix: True as suggested would work, and those clients would still work with the official Claude API, because the extra field would simply be ignored.

matteoserva (Contributor, Author) commented:

> I believe those clients would still allow adding custom metadata, correct? In that case, using prefix: True as suggested would work, and those clients would still work with the official Claude API, because the extra field would simply be ignored.

I am not aware of clients that support prefix: True in the message item, but my knowledge is very limited.

An alternative implementation is continue_final_message in the request body, as used by vLLM.
Another alternative: add a command-line option to disable the prefill feature.

For reference, here is example code that shows how both options could be used:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="test")

completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
        # Mistral-style: mark the last message as a prefix to be continued
        {"role": "assistant", "content": "Hello!", "prefix": True},
    ],
    # vLLM-style: request-level flag to continue the final message
    extra_body={"continue_final_message": True},
)

isaac-mcfadyen (Contributor) commented May 9, 2025

That sounds good. I'd very much vote for this being changed to a field in the request body rather than the default behavior. 😄
