Conversation

ServeurpersoCom (Collaborator) commented Nov 2, 2025

Description

MiniMax-M2 models require the complete <think>...</think> block including tags to be present in the context for proper reasoning. This PR adds a minimal reasoning format override that injects a synthetic opening <think> tag while keeping all reasoning content inline, ensuring compatibility with existing clients without modifying the current parsing behavior.
This approach is equivalent to reasoning_format=none but with synthetic prefix injection. When set via --reasoning-format minimax-m2 at server startup, it overrides client API requests that specify reasoning_format=auto, allowing the model to receive the full reasoning block it needs while remaining compatible with all OpenAI-compatible clients.
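
For illustration only, a minimal sketch of that override rule, using a stand-in for the real common_reasoning_format enum; the helper and parameter names here are hypothetical and the actual request-parsing code in server.cpp is structured differently:

```cpp
// Minimal stand-in for the enum in common.h; only MINIMAX_M2 is added by this PR.
enum common_reasoning_format {
    COMMON_REASONING_FORMAT_NONE,
    COMMON_REASONING_FORMAT_AUTO,
    COMMON_REASONING_FORMAT_DEEPSEEK,
    COMMON_REASONING_FORMAT_DEEPSEEK_LEGACY,
    COMMON_REASONING_FORMAT_MINIMAX_M2,
};

// Hypothetical helper: pick the effective format for one request.
static common_reasoning_format resolve_reasoning_format(
        common_reasoning_format server_cli_format,      // from --reasoning-format at startup
        common_reasoning_format client_request_format)  // from the API request
{
    // A client sending reasoning_format=auto must not override the
    // minimax-m2 setting chosen at server startup.
    if (server_cli_format == COMMON_REASONING_FORMAT_MINIMAX_M2 &&
        client_request_format == COMMON_REASONING_FORMAT_AUTO) {
        return server_cli_format;
    }
    return client_request_format;
}
```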

Changes

  • Add COMMON_REASONING_FORMAT_MINIMAX_M2 enum value to common_reasoning_format
  • Implement minimax-m2 format parsing that bypasses reasoning extraction
  • Inject synthetic <think>\n chunk before first generated token when minimax-m2 is active (see the sketch after this list)
  • Track injection state with minimax_reasoning_prefix_injected and minimax_reasoning_prefix_streamed slot flags
  • Prepend <think>\n to generated_text for final response and chat parsing
  • Prevent client reasoning_format=auto from overriding server CLI setting
  • Add minimax-m2 to CLI help, README.md, and code documentation
  • Handle LLAMA_TOKEN_NULL in send_partial_response to skip token recording
  • Update process_token to preserve delta_to_send for streaming correctness
  • Defer synthetic prefix injection until first generated token for better UX
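
As referenced in the list above, here is a rough sketch of the injection flow, with heavily simplified stand-ins for the server slot and streaming callback; only the two flag names, the LLAMA_TOKEN_NULL semantics, and the <think>\n prefix come from this PR, the rest is illustrative:

```cpp
#include <functional>
#include <string>

// Stand-in for the sentinel used to mark a chunk that carries no recordable token.
constexpr int LLAMA_TOKEN_NULL = -1;

struct slot_state {
    bool        minimax_reasoning_prefix_injected = false; // prefix accounted for in generated_text
    bool        minimax_reasoning_prefix_streamed = false; // prefix already sent to the client
    std::string generated_text;
};

using send_chunk_fn = std::function<void(const std::string & text, int token)>;

// Hypothetical hook run when the first real token of the reply arrives:
// emit a synthetic "<think>\n" chunk first (with LLAMA_TOKEN_NULL so no token
// is recorded), then pass the real delta through unchanged.
static void on_first_generated_token(slot_state & slot,
                                     const std::string & delta_to_send,
                                     int token,
                                     const send_chunk_fn & send_partial_response) {
    if (!slot.minimax_reasoning_prefix_injected) {
        slot.minimax_reasoning_prefix_injected = true;
        slot.generated_text.insert(0, "<think>\n"); // final response / chat parsing sees the full block
        if (!slot.minimax_reasoning_prefix_streamed) {
            slot.minimax_reasoning_prefix_streamed = true;
            send_partial_response("<think>\n", LLAMA_TOKEN_NULL);
        }
    }
    send_partial_response(delta_to_send, token);
}
```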

Testing

Tested with the MiniMax-M2-230B model using the --reasoning-format minimax-m2 flag on the stock Svelte UI.


ngxson (Collaborator) left a comment


I still don't understand why this is needed. Can you give a concrete example?

Also, I feel like this could be a patch to chat.cpp instead of extending server.cpp. The server.cpp code is already very complex; we should not add too much code for non-inference functionality, including chat template and formatting logic. These functionalities should be confined to a dedicated module.

"- none: leaves thoughts unparsed in `message.content`\n"
"- deepseek: puts thoughts in `message.reasoning_content`\n"
"- deepseek-legacy: keeps `<think>` tags in `message.content` while also populating `message.reasoning_content`\n"
"- minimax-m2: streams a synthetic opening `<think>` and keeps `</think>` tags in `message.content`\n"
A collaborator asked:

Should we name this something more generic, like `synthetic`?

ServeurpersoCom (Collaborator Author) replied:

@ngxson I've moved as much as possible to chat.cpp. For parameter naming, I kept consistency with existing formats, treating the first model (DeepSeek) as the "parent" behavior reference.

However, we could prepare a more modular refactor by renaming the parameters to better reflect their actual behavior:

  • none -> disables the backend parser (name already good)
  • deepseek -> remove or document it's an "auto" alias (most used, backend reasoning parser, writes reasoning inside reasoning_content chunks: the OpenAI-compatible target)
  • deepseek-legacy -> rename to clone or something clearer? (inline <think> tags + duplicate inside reasoning_content = Legacy+OAI-Compat mirroring, I don't have a use case for this)
  • minimax-m2 (this PR) -> inline reasoning tags + adds a missing <think> opening tag

To make this truly generic, we'd need an additional parameter to define the prepended string instead of hardcoding <think>. Use case: anyone dealing with Jinja templates that pre-open reasoning tags, causing the model to not regenerate them, making subsequent parsing difficult?

Would you prefer I open a follow-up issue to discuss a more generic synthetic-prefix approach with configurable strings?
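
For discussion purposes, a minimal sketch of what such a generic option might look like; the reasoning_prefix settings and helper below are purely hypothetical and do not exist in the codebase:

```cpp
#include <string>

// Hypothetical generic settings: the synthesized prefix becomes configurable
// instead of being hardcoded to "<think>" for one model family.
struct reasoning_prefix_options {
    bool        inject = false;        // e.g. a future --reasoning-prefix CLI flag
    std::string prefix = "<think>\n";  // whatever the Jinja template pre-opened for the model
};

// Prepend the configured prefix to the raw model output before any parsing,
// so downstream reasoning extraction always sees a complete block.
static std::string apply_reasoning_prefix(const reasoning_prefix_options & opt,
                                          const std::string & raw_output) {
    return opt.inject ? opt.prefix + raw_output : raw_output;
}
```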


ServeurpersoCom (Collaborator Author) commented Nov 2, 2025

Thanks for the feedback.
That's exactly why I marked this PR as draft.
This part of the server is already complex, and I wanted other eyes on it before deciding how far to modularize or relocate the logic.

The behavior I'm implementing is indeed a special case, which probably deserves a more generic approach, as you mentioned. Ideally, the refactor would rename the existing formats to make them clearer and more consistent, for example:

  • none -> disables the backend parser (name already good)
  • deepseek -> remove, or just document that it's already an "auto" alias (the most used one; the backend reasoning parser writes reasoning into reasoning_content chunks: the OpenAI-compatible target)
  • deepseek-legacy -> "clone" as a better name? (inline <think> tags + a duplicate inside reasoning_content; I don't have a use case for this)
  • This one from the PR -> inline reasoning tags + adds the missing <think> opening tag

https://huggingface.co/MiniMaxAI/MiniMax-M2
They state:
"IMPORTANT: MiniMax-M2 is an interleaved thinking model. Therefore, when using it, it is important to retain the thinking content from the assistant's turns within the historical messages. In the model's output content, we use the <think>...</think> format to wrap the assistant's thinking content. When using the model, you must ensure that the historical content is passed back in its original format. Do not remove the <think>...</think> part, otherwise, the model's performance will be negatively affected."

So by exposing this prefix injection and parsing behavior as modular options, we could easily handle other models with similar reasoning requirements without changing the core server logic.
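
To make that requirement concrete, here is a small sketch of a chat history where the assistant turn keeps its reasoning inline in content, which is what gets replayed to the model under this mode (nlohmann::json is used only for brevity; the payload shape follows the OpenAI-style chat API):

```cpp
#include <nlohmann/json.hpp>

int main() {
    using nlohmann::json;
    // The assistant turn keeps the full <think>...</think> block in content,
    // so the next request replays it to the model unchanged.
    json messages = json::array({
        { {"role", "user"},      {"content", "What is 17 * 24?"} },
        { {"role", "assistant"}, {"content",
            "<think>\n17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408\n</think>\nThe answer is 408."} },
        { {"role", "user"},      {"content", "And divided by 2?"} },
    });
    (void) messages; // in a real client, this array is the next /v1/chat/completions payload
    return 0;
}
```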

In the best of all possible worlds, the refactor would also introduce an output template, analogous to the Jinja template that currently serves as the model's input template, to remove or at least reduce the kilometers of hardcoded parsing logic.


pwilkin (Collaborator) commented Nov 2, 2025

I've got a PR for Minimax standard chat format in the works (without interleave), but ran into some weird corruption problems (but it might be related to #16935). Gonna upload today probably.

Commit: Move minimax-m2 prefix injection logic from server.cpp to chat.cpp via common_chat_stream_state
ServeurpersoCom (Collaborator Author) replied:

> I've got a PR for Minimax standard chat format in the works (without interleave), but ran into some weird corruption problems (but it might be related to #16935). Gonna upload today probably.

The "interleave" part is simply keeping the reasoning in context by sending it in delta.content along with the rest of the conversation, which is what reasoning_format=none does, and what I've implemented here to avoid touching the rest of the codebase.

This model consumes context heavily during reasoning. I'm hoping improvements to the calculations will help it behave better, because right now it needs the full 128K token context to be useful.

Looking forward to your PR for the standard chat format!


ServeurpersoCom (Collaborator Author) commented Nov 2, 2025

(Screenshot: the web UI showing <think>...</think> inline in delta.content.)

Here you can see <think>...</think> present in the regular delta.content, which gets sent back to the model in subsequent messages, forming the interleaving that respects this model's specific training. Displaying a spoiler (collapsible block) on the frontend remains possible if desired, but it would be a different kind of spoiler, since this content gets sent back to the context, which we don't do with OAI-compatible models.

Alternatively, we could skip this PR entirely and use the standard OAI reasoning_content with a proper backend parser, then add a dedicated frontend checkbox to optionally send that reasoning content back to the context. Feasibility needs study, but a real parser would be better.

More broadly, a new templating engine specifically for streaming, not hardcoded in C++, would be revolutionary in the LLM world: something like output templates to complement the existing input Jinja templates, reducing the kilometers of hardcoded parsing logic we currently maintain.


hksdpc255 commented Nov 3, 2025

> MiniMax-M2 models require the complete <think>...</think> block including tags to be present in the context for proper reasoning.

It seems that handling of the <think>...</think> block, including the tags required for proper reasoning, is already implemented in my PR #16932?

I admit that the current implementation of try_parse_reasoning is buggy for this situation, so I’m handling the reasoning content without relying on it for now.


aldehir (Collaborator) commented Nov 3, 2025

I don't think this is necessary. If you look at the MiniMax-M2 template, the reasoning is only kept for assistant messages that follow the last user message. This happens during a tool call loop, where the client message has role tool and not user. Preserving reasoning content is not required for basic conversations.

Since you need a backend chat parser to handle tool calling, the extraction of reasoning content is straightforward. From there, clients can pass along the reasoning_content from assistant messages and the chat template will render it as needed.
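
For contrast with the inline approach, here is a sketch of the same idea using a separate reasoning_content field on the assistant message, which the backend parser extracts and the chat template re-renders where the model expects it; this is only an illustration of the flow described above, not code from any PR:

```cpp
#include <nlohmann/json.hpp>

int main() {
    using nlohmann::json;
    // Reasoning lives in reasoning_content, user-visible text in content; the
    // chat template decides where (and for which turns) to render it back to
    // the model, e.g. only for assistant turns after the last user message.
    json messages = json::array({
        { {"role", "user"},      {"content", "What is 17 * 24?"} },
        { {"role", "assistant"},
          {"reasoning_content", "17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408"},
          {"content", "The answer is 408."} },
        { {"role", "user"},      {"content", "And divided by 2?"} },
    });
    (void) messages;
    return 0;
}
```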

ServeurpersoCom (Collaborator Author) replied:

Yes, it’s better to start with PR #16932 and, if necessary, figure out how to feed the content back into the context to preserve the model’s training behavior during longer conversations.
