server: add minimax-m2 reasoning format override for MiniMax-M2 compatibility #16933
Conversation
MiniMax-M2 models require the complete `<think>...</think>` block, including tags, to be present in the context for proper reasoning. This mode injects a synthetic opening `<think>` tag in the stream while keeping all reasoning tags inline in `message.content`, ensuring the model receives the full reasoning block it needs.

Changes:
- Add `COMMON_REASONING_FORMAT_MINIMAX_M2` enum value to `common_reasoning_format`
- Implement minimax-m2 format parsing that bypasses reasoning extraction
- Inject a synthetic `<think>\n` chunk at slot start when minimax-m2 is active
- Track injection state with a `minimax_reasoning_prefix_injected` slot flag
- Prepend `<think>\n` to `generated_text` for the final response and chat parsing
- Prevent client `reasoning_format=auto` from overriding the server CLI setting
- Add minimax-m2 to the CLI help, README.md, and code documentation
- Handle `LLAMA_TOKEN_NULL` in `send_partial_response` to skip token recording
- Update `process_token` to preserve `delta_to_send` for streaming correctness
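For reference, here is a minimal, self-contained sketch of the injection mechanism described above. The enum value `COMMON_REASONING_FORMAT_MINIMAX_M2` and the `minimax_reasoning_prefix_injected` flag are the names listed in the changes; `slot_state`, `emit_chunk`, and `process_piece` are simplified stand-ins invented for the example, not the actual server code:

```cpp
#include <iostream>
#include <string>
#include <vector>

// Simplified stand-in for the reasoning-format enum; only the value added by
// this PR matters for the sketch.
enum common_reasoning_format {
    COMMON_REASONING_FORMAT_NONE,
    COMMON_REASONING_FORMAT_MINIMAX_M2, // keep tags inline, inject an opening <think>
};

// Simplified stand-in for a server slot, reduced to the fields the sketch needs.
struct slot_state {
    common_reasoning_format reasoning_format = COMMON_REASONING_FORMAT_MINIMAX_M2;
    bool minimax_reasoning_prefix_injected = false; // one-time injection flag
    std::string generated_text;
};

// Stand-in for streaming a content delta to the client.
static void emit_chunk(const std::string & delta) {
    std::cout << delta;
}

// Called per generated piece: inject "<think>\n" once at slot start, then pass
// the piece through unchanged so the reasoning tags stay in message.content.
static void process_piece(slot_state & slot, const std::string & piece) {
    if (slot.reasoning_format == COMMON_REASONING_FORMAT_MINIMAX_M2 &&
        !slot.minimax_reasoning_prefix_injected) {
        slot.minimax_reasoning_prefix_injected = true;
        slot.generated_text += "<think>\n"; // keep the final response consistent with the stream
        emit_chunk("<think>\n");
    }
    slot.generated_text += piece;
    emit_chunk(piece);
}

int main() {
    slot_state slot;
    // The model only emits the reasoning body and the closing tag; the opening
    // tag is synthesised by the server.
    const std::vector<std::string> pieces = { "Let me check.", "</think>", "Answer: 42" };
    for (const auto & piece : pieces) {
        process_piece(slot, piece);
    }
    std::cout << "\n--- final generated_text ---\n" << slot.generated_text << "\n";
}
```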
ngxson left a comment:
I still don't understand why this is needed; can you give a concrete example?
Also, I feel like this can be a patch to chat.cpp instead of extending the server.cpp code. The server.cpp code is already very complex. We should not add too much code for non-inference functionality, including chat template and formatting logic; these functionalities should be confined to a dedicated module.
| "- none: leaves thoughts unparsed in `message.content`\n" | ||
| "- deepseek: puts thoughts in `message.reasoning_content`\n" | ||
| "- deepseek-legacy: keeps `<think>` tags in `message.content` while also populating `message.reasoning_content`\n" | ||
| "- minimax-m2: streams a synthetic opening `<think>` and keeps `</think>` tags in `message.content`\n" |
Should we name this something more generic, like `synthetic`?
@ngxson I've moved as much as possible to chat.cpp. For parameter naming, I kept consistency with existing formats, treating the first model (DeepSeek) as the "parent" behavior reference.
However, we could prepare a more modular refactor by renaming the parameters to better reflect their actual behavior:
- `none` -> disables the backend parser (name already good)
- `deepseek` -> remove, or document that it's an "auto" alias (the most used: backend reasoning parser that writes reasoning inside `reasoning_content` chunks, the OpenAI-compatible target)
- `deepseek-legacy` -> rename to `clone` or something clearer? (inline `<think>` tags + a duplicate inside `reasoning_content` = legacy + OAI-compat mirroring; I don't have a use case for this)
- `minimax-m2` (this PR) -> inline reasoning tags + adds the missing `<think>` opening tag
To make this truly generic, we'd need an additional parameter defining the prepended string instead of hardcoding `<think>`. Use case: anyone dealing with Jinja templates that pre-open the reasoning tag, so the model doesn't regenerate it and subsequent parsing becomes difficult.
Would you prefer I open a follow-up issue to discuss a more generic synthetic-prefix approach with configurable strings?
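To make the follow-up idea concrete, here is one possible shape for such a generic option. Everything below is hypothetical: neither the struct nor the helpers exist in the codebase; it only illustrates replacing the hardcoded `<think>` with a configurable prefix:

```cpp
#include <iostream>
#include <string>

// Hypothetical generalisation discussed above: instead of hardcoding "<think>",
// the synthetic prefix becomes a configurable string and minimax-m2 is just one
// preset of it. None of these names exist in the codebase.
struct reasoning_prefix_options {
    std::string prefix;           // e.g. "<think>\n"; empty disables injection
    bool        keep_tags_inline; // leave <think>...</think> in message.content
};

// Preset equivalent to what this PR hardcodes for MiniMax-M2.
static reasoning_prefix_options minimax_m2_preset() {
    return { "<think>\n", true };
}

// Inject the prefix into the first streamed delta only.
static std::string maybe_inject(const reasoning_prefix_options & opts,
                                bool & injected, const std::string & delta) {
    if (!injected && !opts.prefix.empty()) {
        injected = true;
        return opts.prefix + delta;
    }
    return delta;
}

int main() {
    const reasoning_prefix_options opts = minimax_m2_preset();
    bool injected = false;
    std::cout << maybe_inject(opts, injected, "first delta")  << "\n";
    std::cout << maybe_inject(opts, injected, "second delta") << "\n";
}
```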
Thanks for the feedback. The behavior I'm implementing is indeed a special case, which probably deserves a more generic approach, as you mentioned. Ideally, the refactor would rename the existing formats to make them clearer and more consistent, for example `none` -> disables the backend parser (name already good). For reference on the model's requirements: https://huggingface.co/MiniMaxAI/MiniMax-M2. By exposing this prefix injection and parsing behavior as modular options, we could easily handle other models with similar reasoning requirements without changing the core server logic. In the best of all possible worlds, the refactor would also add an output template, just as the Jinja template is currently the model's input template, to remove or at least reduce the kilometers of hardcoded parsing logic.
I've got a PR for the MiniMax standard chat format in the works (without interleave), but ran into some weird corruption problems (they might be related to #16935). Gonna upload it today, probably.
Move minimax-m2 prefix injection logic from server.cpp to chat.cpp via common_chat_stream_state
The "interleave" part is simply keeping the reasoning in context by sending it in This model consumes context heavily during reasoning. I'm hoping improvements to the calculations will help it behave better, because right now it needs the full 128K token context to be useful. Looking forward to your PR for the standard chat format! |
It seems that the […] I admit that the current implementation of […]
I don't think this is necessary. If you look at the MiniMax-M2 template, the reasoning is only kept for assistant messages that follow the last user message. This happens during a tool call loop, where the client message has role `tool`. Since you need a backend chat parser to handle tool calling, the extraction of reasoning content is straightforward. From there, clients can pass along the `reasoning_content`.
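To make the template rule concrete, here is a tiny sketch of the behavior described above (the message struct and helper are invented for the example; the real logic lives in the model's Jinja template, not in C++):

```cpp
#include <iostream>
#include <string>
#include <vector>

// Minimal message representation for the example; not llama.cpp's actual type.
struct chat_msg {
    std::string role;               // "user", "assistant", "tool", ...
    std::string content;
    std::string reasoning_content;  // kept only where the template preserves it
};

// Mirror of the rule described above: reasoning is preserved only for assistant
// messages that come after the last user message (i.e. inside a tool-call loop).
static void drop_stale_reasoning(std::vector<chat_msg> & msgs) {
    size_t last_user = 0;
    for (size_t i = 0; i < msgs.size(); ++i) {
        if (msgs[i].role == "user") {
            last_user = i;
        }
    }
    for (size_t i = 0; i < last_user; ++i) {
        if (msgs[i].role == "assistant") {
            msgs[i].reasoning_content.clear();
        }
    }
}

int main() {
    std::vector<chat_msg> msgs = {
        { "user",      "question 1",  ""                   },
        { "assistant", "answer 1",    "old reasoning"      },
        { "user",      "question 2",  ""                   },
        { "assistant", "tool call",   "current reasoning"  },
        { "tool",      "tool result", ""                   },
    };
    drop_stale_reasoning(msgs);
    for (const auto & m : msgs) {
        std::cout << m.role << ": reasoning=\"" << m.reasoning_content << "\"\n";
    }
}
```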
Yes, it’s better to start with PR #16932 and, if necessary, figure out how to feed the content back into the context to preserve the model’s training behavior during longer conversations.

Description
MiniMax-M2 models require the complete `<think>...</think>` block, including tags, to be present in the context for proper reasoning. This PR adds a minimal reasoning format override that injects a synthetic opening `<think>` tag while keeping all reasoning content inline, ensuring compatibility with existing clients without modifying the current parsing behavior.
This approach is equivalent to `reasoning_format=none` but with synthetic prefix injection. When set via `--reasoning-format minimax-m2` at server startup, it overrides client API requests that specify `reasoning_format=auto`, allowing the model to receive the full reasoning block it needs while remaining compatible with all OpenAI-compatible clients.
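As a rough sketch of that precedence rule (the simplified enum and helper below are illustrative fragments, not the server's actual code):

```cpp
// Simplified mirror of the precedence rule described above: a client request of
// "auto" does not replace a minimax-m2 format chosen with --reasoning-format on
// the server. The helper name and signature are illustrative only.
enum common_reasoning_format {
    COMMON_REASONING_FORMAT_AUTO,
    COMMON_REASONING_FORMAT_MINIMAX_M2,
};

static common_reasoning_format resolve_reasoning_format(
        common_reasoning_format server_cli,        // from --reasoning-format
        common_reasoning_format client_requested)  // from the API request body
{
    if (client_requested == COMMON_REASONING_FORMAT_AUTO &&
        server_cli == COMMON_REASONING_FORMAT_MINIMAX_M2) {
        return server_cli; // keep the CLI setting instead of falling back to auto
    }
    return client_requested;
}
```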
Changes
Testing
Tested with the MiniMax-M2-230B model using the `--reasoning-format minimax-m2` flag and the stock Svelte UI.